Dataset and Protocol

The dataset used in this contest was acquired with three XIMEA snapshot cameras, VIS, NIR and RedNIR, which cover 16, 25 and 15 bands respectively after calibration. The videos were captured at 25 frames per second (FPS). Each frame was originally captured in 2D, with the bands arranged in a mosaic pattern. Each frame is then converted to a 3D cube in which the first two dimensions index the location of each pixel and the third dimension indexes the band number (code provided). For the RedNIR data, please drop the last band, which contains only zero values. False-color videos generated from the hyperspectral videos are also provided. The first frames of the training dataset are as follows:
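The mosaic-to-cube conversion described above can be sketched as follows. This is a minimal illustration, not the provided conversion code: the function name `mosaic_to_cube`, the row-major band ordering inside each tile, and the tile sizes (4 for the 16-band VIS camera, 5 for the 25-band NIR camera, 4 for RedNIR with the all-zero last band still present) are assumptions. The code shipped with the dataset remains the reference.

```python
import numpy as np

def mosaic_to_cube(frame: np.ndarray, pattern: int) -> np.ndarray:
    """Rearrange a 2D mosaic frame into an (H/p, W/p, p*p) cube.

    Assumes each p x p tile holds one sample per band, ordered row by row.
    """
    h, w = frame.shape
    # Crop so that the frame contains a whole number of tiles.
    h, w = h - h % pattern, w - w % pattern
    tiles = frame[:h, :w].reshape(h // pattern, pattern, w // pattern, pattern)
    # Axes are (tile row, offset in tile, tile column, offset in tile);
    # move both offsets to the end and flatten them into the band axis.
    cube = tiles.transpose(0, 2, 1, 3)
    return cube.reshape(h // pattern, w // pattern, pattern * pattern)

# Toy 8x8 frame with a 4x4 mosaic: two tiles in each direction, 16 bands.
frame = np.arange(64).reshape(8, 8)
cube = mosaic_to_cube(frame, 4)

# For RedNIR data, drop the last (all-zero) band as instructed above.
rednir_cube = cube[..., :-1]
```

Under these assumptions, band `k` of the cube collects the sample at row `k // p`, column `k % p` of every tile.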

Preprocessing
Camera Calibration

The camera calibration process involves two steps: dark calibration and spectral correction. Dark calibration removes the influence of noise produced by the camera sensor. It is done by subtracting a dark frame, captured with the lens covered by a cap, from the captured image. Spectral correction reduces the distortion of the spectral responses. It is done by applying a sensor-specific spectral correction matrix to the measurement at each pixel. The 16 band images of the corrected hyperspectral data cube are saved as 2D frames, with the 16 bands again arranged in a mosaic pattern.
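The two calibration steps can be illustrated with a short sketch. It assumes that the frame and the dark frame have already been rearranged into cubes of shape (H, W, B) and that the correction matrix has shape (B_out, B). The function name `calibrate` and these array layouts are hypothetical, and the real correction matrices are sensor-specific and not reproduced here.

```python
import numpy as np

def calibrate(cube: np.ndarray, dark_cube: np.ndarray,
              correction: np.ndarray) -> np.ndarray:
    """Dark calibration followed by spectral correction.

    cube, dark_cube: (H, W, B) raw and dark measurements.
    correction: (B_out, B) sensor-specific spectral correction matrix.
    """
    # Cast to float first so that unsigned integers cannot wrap around.
    signal = cube.astype(np.float64) - dark_cube.astype(np.float64)
    # Apply the correction matrix to the spectrum measured at each pixel.
    return signal @ correction.T

# Toy data: an identity correction leaves the dark-subtracted signal unchanged.
raw = np.full((2, 3, 4), 10, dtype=np.uint16)
dark = np.full((2, 3, 4), 12, dtype=np.uint16)
corrected = calibrate(raw, dark, np.eye(4))
```

Subtracting in floating point matters here: with unsigned integer frames, a dark value larger than the measurement would otherwise wrap around to a large positive number.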

Image Conversion

To ensure fair comparison, the hyperspectral videos were converted to false-color videos using CIE color matching functions. This produces strictly spatially aligned hyperspectral and false-color videos.
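As a rough illustration of this conversion, the sketch below projects each pixel spectrum onto three channels and rescales the result to 8 bits. It assumes a (B, 3) array `cmf` of color matching function values sampled at each band's centre wavelength. That array, the function name `to_false_color`, and the min-max display scaling are assumptions made for this example, and the actual pipeline in the cited paper may differ.

```python
import numpy as np

def to_false_color(cube: np.ndarray, cmf: np.ndarray) -> np.ndarray:
    """Project an (H, W, B) cube onto 3 channels with a (B, 3) weight matrix."""
    img = cube.astype(np.float64) @ cmf
    # Assumed display scaling: stretch the result to the 0..255 range.
    img -= img.min()
    peak = img.max()
    if peak > 0:
        img /= peak
    return (img * 255).round().astype(np.uint8)

# Toy example with 3 bands and an identity weight matrix (placeholder values,
# not real CIE data).
toy_cube = np.zeros((2, 2, 3))
toy_cube[0, 0] = [1.0, 0.0, 0.0]
toy_rgb = to_false_color(toy_cube, np.eye(3))
```

Because every output pixel is computed from the hyperspectral pixel at the same location, the two videos stay spatially aligned by construction.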

It is important to note that the spectral reflectance recorded for an object may not match its actual physical reflectance, because certain calibration procedures, such as white calibration, were not performed. Despite this limitation, the spectral differences between objects still play a significant role in improving recognition performance.

For details on the above steps, please refer to: F. Xiong, J. Zhou, and Y. Qian. "Material based object tracking in hyperspectral videos", IEEE Trans. Image Process., vol. 29, no. 1, pp. 3719-3733, 2020.

Annotation

A single upright bounding box is provided for the location of the target object in each frame. The bounding box is represented by the centre location and its height and width. The labels for hyperspectral and color videos were generated independently. The labels for the hyperspectral videos can be used directly on the false-color videos.

Attributes

The whole dataset contains 95 sets of videos for training and 77 sets of videos for validation. Every video is labelled with its associated challenging factors, drawn from eleven attributes: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutters, and low resolution.

Protocols:

  • The use of the training set is optional.
  • Tracking starts from the first frame of the sequence. The bounding box in the first frame is used to initialize the location of the target. Single object tracking is expected.
  • The same model hyper-parameters shall be used for all the sequences.
  • The tracking results for each sequence consist of one bounding box per frame.
  • It is suggested to use a single model to handle HSVs with different numbers of bands.

Evaluation Metrics

Precision plots, success plots and the area under the curve (AUC) are used to measure the performance of all trackers. The precision plot records the fraction of frames whose estimated location is within a given distance threshold of the ground truth. The average distance precision rate is reported at a threshold of 20 pixels. The success plot shows the percentage of successful frames, i.e., frames whose overlap ratio between the predicted bounding box and the ground truth is larger than a threshold that varies from 0 to 1. The AUC is calculated for each success plot. All results are reported under one-pass evaluation (OPE), i.e., a tracker is run through a test sequence once, initialized from the ground-truth position in the first frame. The related code is provided in the source code package.
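The metrics above can be sketched in a few lines. This is an illustration only, not the evaluation code from the source code package. It assumes boxes stored as rows of (cx, cy, w, h), following the annotation description, and it approximates the AUC by the mean success rate over 21 evenly spaced thresholds. The function names, the column order and the number of thresholds are all assumptions.

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between predicted and ground-truth centres."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1)

def overlaps(pred, gt):
    """Intersection over union for boxes given as (cx, cy, w, h) rows."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    p_lo, p_hi = pred[:, :2] - pred[:, 2:] / 2, pred[:, :2] + pred[:, 2:] / 2
    g_lo, g_hi = gt[:, :2] - gt[:, 2:] / 2, gt[:, :2] + gt[:, 2:] / 2
    inter_wh = np.clip(np.minimum(p_hi, g_hi) - np.maximum(p_lo, g_lo), 0, None)
    inter = inter_wh.prod(axis=1)
    union = pred[:, 2:].prod(axis=1) + gt[:, 2:].prod(axis=1) - inter
    # Frames with an empty union get an overlap of 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose centre error is within the threshold (pixels)."""
    return float(np.mean(center_errors(pred, gt) <= threshold))

def success_auc(pred, gt, num_thresholds=21):
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    iou = overlaps(pred, gt)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([np.mean(iou > t) for t in thresholds]))

# Two frames: one perfect prediction and one complete miss.
gt_boxes = [[50, 50, 20, 20], [50, 50, 20, 20]]
pred_boxes = [[50, 50, 20, 20], [100, 100, 20, 20]]
precision = precision_at(pred_boxes, gt_boxes)
auc = success_auc(pred_boxes, gt_boxes)
```

In this toy case the precision at 20 pixels is 0.5, since one of the two frames lies within the threshold.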

Problems and Updates (2023.07.12)

Training Set:

Video name: training/HSI-NIR/car22; training/HSI-NIR-FalseColor/car22 Problem: No object appears in the last 13 frames. Suggestion: Delete the last 13 frames, as well as the corresponding GT values in groundtruth_rect.txt.

Video name: training/HSI-NIR/car25; training/HSI-NIR-FalseColor/car25 Problem: The last frame has an inaccurate label. Suggestion: Delete the last frame, as well as the corresponding GT value in groundtruth_rect.txt.

Video name: training/HSI-NIR/car39; training/HSI-NIR-FalseColor/car39 Problem: The objects are not labeled in the first 22 frames (which were labeled with -1). Suggestion: Delete the first 22 frames, as well as the corresponding GT values in groundtruth_rect.txt, which means that this video starts from frame 23.

Video name: training/HSI-NIR/car40; training/HSI-NIR-FalseColor/car40 Problem: The objects are not labeled in the first 42 frames (which were labeled with -1). Suggestion: Delete the first 42 frames, as well as the corresponding GT values in groundtruth_rect.txt, which means that this video starts from frame 43.

Video name: training/HSI-VIS/automobile11; training/HSI-VIS-FalseColor/automobile11 Problem: Some objects are wrongly labeled. Suggestion: Download the new annotation results, which are available at [1].

Video name: training/HSI-VIS/automobile13; training/HSI-VIS-FalseColor/automobile13 Problem: The annotation results drifted away from the object; the first 23 frames of the video are not accurate. Suggestion: Delete the first 23 frames and download the new annotation results, which are available at [1].

[1] Baidu YunPan: https://pan.baidu.com/s/10W73JdvJkXu6RXrv8SMORA Access code: 1234

Validation Set:

Video name: validation/HSI-NIR/car62; validation/HSI-NIR-FalseColor/car62 Problem: The objects are not labeled in the first 16 frames (which were labeled with -1). Suggestion: Delete the first 16 frames, as well as the corresponding GT values in groundtruth_rect.txt, which means that this video starts from frame 17. [Must solve the problem in Validation Set!!!]
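The trimming suggested in the entries above can be sketched with a small helper. It assumes that groundtruth_rect.txt holds one annotation row per frame, in frame order. The helper name `trim_ground_truth` is hypothetical, and the corresponding frame files still have to be removed separately.

```python
def trim_ground_truth(rows, drop_first=0, drop_last=0):
    """Return the annotation rows with leading and trailing frames removed."""
    if drop_first < 0 or drop_last < 0:
        raise ValueError("drop counts must be non-negative")
    if drop_first + drop_last > len(rows):
        raise ValueError("cannot drop more rows than the sequence contains")
    return rows[drop_first:len(rows) - drop_last]

# Toy sequence of 20 rows standing in for the lines of groundtruth_rect.txt.
rows = [f"row {i}" for i in range(1, 21)]

# Dropping the first 16 rows (as suggested for car62) makes the sequence
# start at the original frame 17.
trimmed_head = trim_ground_truth(rows, drop_first=16)

# Dropping the last 13 rows (as suggested for car22).
trimmed_tail = trim_ground_truth(rows, drop_last=13)
```

Applying the same counts to the hyperspectral and false-color versions of a video keeps the two sequences aligned frame by frame.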

Technical Support:

Fengchao Xiong

School of Computer Science and Engineering, Nanjing University of Science and Technology

Email: fcxiong@njust.edu.cn