NetTrack: Tracking Highly Dynamic Objects with a Net
CVPR 2024


  • Guangze Zheng
  • Shijie Lin
  • Haobo Zuo
  • Changhong Fu
  • Jia Pan*



Abstract

The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT), often manifested as severe deformations, fast motion, and occlusions. Most methods that solely depend on coarse-grained object cues, such as boxes and the overall appearance of the object, are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem, this work proposes NetTrack, an efficient, generic, and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically, NetTrack constructs a dynamicity-aware association with a fine-grained Net, leveraging point-level visual cues. Correspondingly, a fine-grained sampler and matching method have been incorporated. Furthermore, NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity, and thorough transfer experiments on challenging open-world benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong generalization ability of NetTrack even without finetuning.


a The visualization of the proposed NetTrack is similar to a Net. Object dynamicity distorts the internal relationships of the object, presenting challenges for traditional coarse-grained tracking methods that rely solely on bounding boxes. While NetTrack introduces fine-grained Nets that are robust to dynamicity. b Qualitative results of NetTrack tracking highly dynamic objects under open-world tracking and referring expression comprehension settings. Dynamicity like deformation and fast motion results in drastic changes in the coarse-grained representation, while the fine-grained Nets can contract robustly. The dashed boxes represent the object position from the previous time step. c We propose a challenging benchmark named BFT, dedicated to evaluating highly dynamic object tracking with abundant scenarios shown in the external circular and diverse species shown in the central word cloud.

Video


Approach

The high dynamicity of open-world objects, manifested as severe deformation, fast motion, and frequent occlusion, poses challenges for existing methods in two major aspects:
1) Association For most methods relying solely on coarse-grained visual representations, the high dynamicity renders the temporal continuity fragile in terms of association.
2) Localization SoTA methods typically learn the coarse-grained correspondence between the entire image and text in pre-training, where the dynamic objects are hard to localize.

In this work, we propose NetTrack, introducing fine-grained learning to address the above two aspects. Regarding association, NetTrack utilizes physical points on the object's appearance that are less susceptible to object dynamicity and form fine-grained visual cues. For localization, grounded pre-training is utilized to learn fine-grained correspondences between objects and text.


Object-text correspondence for fine-grained localization This work adopts phrase grounding as pre-training method. Compared to CLIP-based tracking methods that utilize coarse-grained image-text correspondence, NetTrack can more effectively distinguish dynamic objects. NetTrack can also learn contextual information from LLM to achieve practical real-world applications for efficient dynamic object tracking.
Fine-grained Net for dynamicity-aware association NetTrack tracks the object with a fine-grained Net, which leverages points of interest (POIs) on the surface of object appearance. We design a fine-grained sampler to discover potential POIs and utilize fine-grained visual cues of these points, along with the emerging physical point tracking methods for robust tracking. Subsequently, a simple yet effective fine-grained similarity calculation method is proposed to determine the containment relationship between the tracked POIs and candidate objects.

Data

a Bird flocks are among the most dynamic objects to track in the open world and thus are considered ideal subjects for this work. We construct BFT dataset, which incorporates 22 bird species and 14 common natural and cultural scenes, covering six continents. After careful data curation and annotation, the BFT dataset contains 106 videos, split into 35/25/36 for train/val/test. b The severe aspect ratio change of objects in BFT dataset is shown, which demonstrates the high dynamicity. c The large object motion also shows the high dynamicity of the BFT dataset.

Results

The experimental results mainly demonstrate that: 1) Even in the zero-shot open-world tracking setting, NetTrack achieves superior performance compared to SoTA finetuned closed-set trackers. 2) In comparison to the results after fine-tuning (lines 9-12), closed-set trackers exhibit sub-optimal zero-shot generalization ability (lines 13,14,17,18) in highly dynamic open-world scenarios.
For more details, please refer to our paper.

Citation

AخA
 
@inproceedings{nettrack2024cvpr,
    title={NetTrack: Tracking Dynamic Objects with a Net},
    author={Guangze Zheng and Shijie Lin and Haobo Zuo and Changhong Fu and Jia Pan},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024},
    pages={1-8},
}

Open Source

The NetTrack code and BFT dataset are under Apache License.

Acknowledgements

This project is supported by the Innovation and Technology Commission of the HKSAR Government under the InnoHK initiative, ITF GHP/126/21GD and HKU's CRF seed grant. The authors would like to thank Dr. Ming-Shan Wang for his advice.

The website design is inspired by RT-1, whose template was borrowed from Jon Barron.