ACT: an ACTNet for visual tracking
IET Image Processing, Volume 13, Issue 5, April 2019, pp. 722-728. The Institution of Engineering and Technology.
DOI: 10.1049/iet-ipr.2018.5807
ISSN 1751-9667
Language: English
Authors: Ning Li, Qingge Ji, Tianjun Ma
Topic(s): Advanced Vision and Imaging
Ning Li, Qingge Ji (corresponding author, issjqg@mail.sysu.edu.cn) and Tianjun Ma
School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510006, People's Republic of China
Guangdong Province Key Laboratory of Big Data Analysis and Processing, Sun Yat-sen University, Guangzhou, 510006, People's Republic of China
First published: 18 March 2019

Abstract
Owing to the success of convolutional neural network (CNN) models in various fields of computer vision, the authors propose an advanced convolutional network (ACTNet) to enhance the accuracy of visual tracking. Different from prior methods, they regard a CNN not only as a semantic feature-map extractor but also as a position predictor. The rectified linear unit (ReLU) and sigmoid activations are used in ACTNet for feature extraction and position determination, respectively. To avoid overfitting during online fine-tuning, they introduce Erlang noise to create more training samples and to improve the robustness of each base learner. Experiments on widely used evaluation datasets demonstrate that the proposed ACT method outperforms state-of-the-art methods.

1 Introduction
As a fundamental task of computer vision, visual tracking has attracted increasing attention in recent years.
It commonly refers to estimating the trajectory of a moving object in a given video. In the general setting, we have to predict the position of the object sequentially using only the information provided in the first frame; the tracking target and its position in the first frame of the sequence are given by a hand-annotated bounding box. Owing to complicating factors such as deformation (DEF), partial occlusion (OCC), fast motion (FM) and scale variations (SVs), visual tracking remains a challenging problem. Existing methods can be divided mainly into generative models and discriminative models [1]. Both kinds of model rely on handcrafted features to capture information about targets. Although handcrafted features can successfully separate foreground from background, they obtain only low-level representations and ignore detailed appearance. Extensive experiments show that such features are not robust enough, especially in complicated environments and under significant appearance changes. Recently, deep convolutional neural networks (CNNs) have achieved great success in many fields of computer vision, e.g. object detection [2], image classification [3] and cross-view retrieval [4]. Deep CNNs are trained to extract high-level features from images with their strong capability for learning semantic representations [5, 6], demonstrating their ability to distinguish objects in various situations. The success of CNNs across most areas of computer vision has prompted research into CNN-based visual tracking, but it also brings challenges. Since deep CNNs consist of multiple layers and millions of parameters, supervised training requires a huge number of labelled samples. Most papers focus on one-pass, model-free, single-object tracking [7], in which the only training instance is the first frame with its provided position. Previous methods [8, 9] propose pre-training CNNs on a large-scale dataset and fine-tuning the parameters using the first frame of the video sequence. Zhang and Suganthan [10] propose a very simple CNN model to extract features and classify objects against the background. Fan et al. [11] proposed another CNN-based tracking method that learns spatial and temporal features jointly from image pairs of adjacent frames. Owing to CNNs' strong capability for feature representation, these methods have reached state-of-the-art performance. However, the lack of training samples, which are crucial throughout the entire process, impedes further improvement in tracking precision. Some prior works show that applying CNNs to locate targets' positions and scales in video sequences cannot by itself assure state-of-the-art precision and success rates. Moreover, excessive iterations do not guarantee globally asymptotic stabilisation and easily lead to overfitting (as shown in Fig. 1).

Fig. 1: Owing to the lack of online training samples, normal iterative network methods for visual tracking [12] use excessive iterations to update their models and cannot guarantee globally asymptotic stabilisation.

Other works [13] attempted to use CNNs only as feature extractors and to add a separate discriminator for classification. All these factors restrict the development of CNNs in visual tracking. To address the above issues, we propose a novel deep-structured CNN for visual tracking.
The contributions of this paper are threefold: (i) to obtain robust representations of tracking targets, we propose a deep-structured CNN architecture for feature extraction and positioning, which adopts small convolution kernels and removes fully connected layers to retain semantic features; (ii) we develop an online training method that adds Erlang noise during online fine-tuning, further improving tracking accuracy; (iii) we evaluate our method on an open benchmark [14, 15], showing superior performance.

2 Related work

2.1 Visual tracking
Visual tracking is a fundamental problem in computer vision and is widely used in areas such as intelligent transportation systems and sports analysis. A tracking model can be broken down into two primary components: an appearance model that describes the target and a strategy that predicts the target's motion. For appearance models, generative models learn a joint (or class-conditional) distribution from which the probability of a specific class can be obtained, e.g. sparse coding [16, 17] and principal component analysis [18]; discriminative models learn the conditional distribution directly, training classifiers to distinguish the foreground from the surrounding environment, e.g. multiple instance learning [19], support vector machines (SVMs) [20] and structured SVMs [21]; this is also known as tracking-by-detection. Discriminative methods have advantages in isolating foreground information and behave more robustly, so they have gradually become more practicable in visual tracking; most current deep-learning-based tracking methods use discriminative models as well. Henriques et al. [22] derived a kernelised method with multiple channels for simplified computation and real-time tracking, and correlation filters have since become one of the focal areas of research, e.g. [22]. To deal with the lack of training data, Wang and Yeung [23] acquire general feature representations by pre-training on widely used datasets, e.g. ImageNet, SVHN and CIFAR-10, and then fine-tune the model before tracking on video sequences for more robust classification. This is proven to be effective but still performs worse than some traditional methods; furthermore, iterative training of an oversimplified NN leads to overfitting. Conscious of the difference between classification and tracking, Yun et al. [24] pre-train a CNN to obtain domain-independent information and generic target representations, and then detect the target with online-updated binary classification layers. Recurrently target-attending tracking [25] improves correlation filters and proposes a multi-directional recurrent NN to search for reliable parts. In summary, deep learning has gradually taken over the state of the art, yet we believe its potential has not been fully exploited.

2.2 Deep learning and CNNs
Research on NNs has been ongoing for several decades; models with multilayer NNs date back to the 1960s. Imitating the biological NNs that constitute animal brains, artificial NNs learn without task-specific programming. A standard NN consists of hundreds of connected processors (called neurones) that produce a sequence of real-valued activations [26]. The task of deep learning is to assign credit to every neurone across long chains of computational stages; this procedure is commonly conducted on large-scale datasets and belongs to supervised learning. CNNs are inspired by the organisation of the animal visual cortex.
They evolved from multilayer perceptrons, adding convolutional layers on top of fully connected layers and achieving better generalisation on vision problems with lower memory requirements for running the whole network [27]. GoogLeNet and VGGNets [28] are commonly used in typical recognition tasks and have demonstrated the relationship between good performance and deep-structured networks. VGGNets stack consecutive convolutional layers for extracting feature maps; with small 3 × 3 convolution kernels, they improve visual recognition performance while decreasing the number of parameters.

3 Offline CNN training
Feature extraction and analysis are necessary to understand the mechanism of deep learning. Before presenting our tracking method, we first present the architecture of our locating CNN and the procedure of offline training. Details of the advanced convolutional network (ACTNet) are given in the next section.

3.1 Principles of feature extraction
Owing to the good performance of VGGNets in visual classification and detection, we design the ACTNet following the principles of VGGNets. Fig. 2 shows the architecture of the proposed ACTNet. The ACTNet consists of 13 convolutional layers, all of which use 3 × 3 convolution kernels. Because it targets tracking, there is no fully connected layer in ACTNet. The network can be divided into two parts: a feature extraction part and a locating part. The feature extraction part learns to produce feature maps, and the locating part outputs a location map that indicates the target's position. We adopt different activation functions in the two parts: ReLU in feature extraction and sigmoid in locating. For semantic representations, the feature extraction part is deeply structured: conv1 and conv2 each have two convolution layers, whereas conv3 and conv4 each have three. We keep the spatial size of the outputs within each portion unchanged by setting the convolution padding to 1, and control the output size with pooling layers. Our experiments confirm that this feature extraction part obtains good high-level representations.

Fig. 2: Architecture of the proposed ACTNet.

The locating part contains three identical convolution layers that transform the feature maps into a location map. It outputs a binary location map describing the location of the tracking target, and this design allows the locating part to carry enough parameters. For a 400 × 400 input image, ACTNet outputs a 50 × 50 probability map in which each pixel represents an 8 × 8 block. The map is initialised with all zeroes; if the target lies in the corresponding block, the value is changed to a non-zero value representing the probability of a foreground object. By analysing this map, we obtain the predicted target position.
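To make the layer layout above concrete, the following is a minimal, illustrative PyTorch sketch of a network with the stated structure: ten 3 × 3 convolution layers (with padding 1) in four feature-extraction blocks, plus three in the locating part. The paper's implementation is in MATLAB/Caffe, so this is not the authors' code; the class name, channel widths and pooling placement are assumptions chosen only so that a 400 × 400 input yields a 50 × 50 probability map.

```python
import torch
import torch.nn as nn

class ACTNetSketch(nn.Module):
    """Illustrative sketch of the described ACTNet layout (not the authors' code)."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, n_convs, pool):
            # n_convs 3x3 conv + ReLU layers, optionally followed by 2x2 max pooling
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.ReLU(inplace=True)]
            if pool:
                layers.append(nn.MaxPool2d(2, 2))
            return nn.Sequential(*layers)
        # Feature extraction: conv1/conv2 have two layers, conv3/conv4 have three (ReLU)
        self.features = nn.Sequential(
            block(3, 64, 2, pool=True),     # conv1: 400 -> 200
            block(64, 128, 2, pool=True),   # conv2: 200 -> 100
            block(128, 256, 3, pool=True),  # conv3: 100 -> 50
            block(256, 512, 3, pool=False), # conv4: spatial size kept at 50
        )
        # Locating part: three 3x3 conv layers with sigmoid activations,
        # ending in a single-channel probability map
        self.locating = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.Sigmoid(),
            nn.Conv2d(256, 256, 3, padding=1), nn.Sigmoid(),
            nn.Conv2d(256, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.locating(self.features(x))  # (N, 1, 50, 50) for a 400x400 input

if __name__ == "__main__":
    probs = ACTNetSketch()(torch.zeros(1, 3, 400, 400))
    print(probs.shape)  # torch.Size([1, 1, 50, 50])
```

With three 2 × 2 poolings, each output cell of the 50 × 50 map corresponds to an 8 × 8 block of the input, matching the description above.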
3.2 Pre-training with offline datasets
We pre-train ACTNet offline with the ILSVRC2014 (ImageNet Large-Scale Visual Recognition Challenge) dataset provided for the object detection competition. More than 450,000 images annotated into 200 basic-level categories are used for training and validation, all hand-labelled with bounding boxes indicating the locations of objects. ILSVRC2014 is considered an excellent training dataset for teaching a CNN to discriminate objects in complicated circumstances. Before training ACTNet, we perform other necessary procedures, namely processing the labels and determining the loss function.

Since ILSVRC2014 labels objects with bounding-box information, we create new binary labels from the labels in the original dataset. A binary label has the same scale as the probability map, and it is set according to a Gaussian distribution whose expectation lies at the centre of the bounding box (with a corresponding standard deviation). As for the loss function, we adopt maximum-likelihood estimation [29] to measure the forecast results: the loss in (1) accumulates the negative log-likelihood over all positions of the probability map, where $y_{ij}$ denotes the labelled value at $(i, j)$, $\hat{y}_{ij}$ represents the prediction obtained from the probability map, and $m \times n$ is the size of the probability map. Letting $B$ denote the set of coordinate points inside the bounding box, (2) defines the label value at each position in terms of $B$: positions inside the bounding box receive the Gaussian-weighted values described above and positions outside receive zero. Backpropagation uses standard stochastic gradient descent (SGD) to decrease the loss of the CNN. Trained in this way, the convolution kernels are forced to produce a probability map that indicates the location of detected objects. The training process consumes significant time, so we train the model step by step with different classes of images and save the parameters for online tracking.

4 Proposed algorithm

4.1 Online training
Online training aims to fine-tune what the model learnt in pre-training. Because pre-training has no target-specific pertinence, the pre-trained model does not ensure good precision on a specific object or tracking sequence; an online model adaption makes the pre-trained CNN fire on the indicated tracking target. In this stage, a specific video sequence is supplied to fine-tune our model, with the tracking object's annotation given in the first frame. The annotation contains the coordinates of the upper-left corner and the scale of the bounding box. To make ACTNet sensitive to a particular object, a large number of iterations is needed in online training; however, as noted above, naive iteration easily leads to overfitting. We therefore propose adding Erlang noise to change the appearance of the original images. The probability density of the Erlang noise is given by (3)

$p(z) = \dfrac{a^{b} z^{b-1}}{(b-1)!}\, e^{-a z}$ for $z \ge 0$, and $p(z) = 0$ for $z < 0$,

where the parameters $a$ and $b$ determine the expectation and standard deviation of the Erlang noise. The implementation is the synthesis of several exponential variates, as in (4)

$E = \sum_{j=1}^{b} \left( -\dfrac{1}{a} \ln R_{j} \right)$,

where $E$ denotes the realisation of the Erlang noise and $R_{j}$ denotes the uniform noise; each uniform random value produces one exponential variate with parameter $a$. The processed samples are fed into ACTNet for 180 iterations. Besides avoiding overfitting, these iterations improve accuracy and robustness under region noise. Our experiments confirm that the online training step increases the tracker's accuracy.
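As a concrete illustration of this augmentation step, the sketch below generates Erlang-distributed noise as a sum of exponential variates drawn from uniform noise via inverse-transform sampling and adds it to an image. The helper names, parameter values and clipping to the [0, 255] range are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def erlang_noise(shape, a=0.1, b=2, rng=None):
    """Erlang(a, b) noise as the sum of b exponential variates, each obtained
    from uniform noise R via -ln(R)/a (inverse-transform sampling).
    The mean is b/a and the variance is b/a**2."""
    rng = rng or np.random.default_rng()
    # draw strictly positive uniforms to avoid log(0)
    R = rng.uniform(np.finfo(float).tiny, 1.0, size=(b,) + tuple(shape))
    return (-np.log(R) / a).sum(axis=0)

def augment_with_erlang(image, a=0.1, b=2):
    """Create an extra online training sample by perturbing an image with Erlang noise."""
    noisy = image.astype(np.float64) + erlang_noise(image.shape, a, b)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example: perturb a dummy grey image before feeding it to the network
sample = augment_with_erlang(np.full((400, 400, 3), 128, dtype=np.uint8))
```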
4.2 Target localisation and scale determination
Real-time tracking starts with target localisation. For the tth frame, a rectangular region of interest (ROI) is determined from the bounding box of the previous frame and centred at the last predicted location. In contrast to traditional motion models, we crop the ROI on the basis of the result in frame t − 1, since patches are commonly chosen around that predicted position; this increases the success rate of tracking, especially under FM. To predict the most probable location in the tth frame, we define a confidence that stands for this probability: a shifting window whose size (w, h) equals the width and height of the bounding box is swept over the tth probability map, and (5) takes the maximum of the aggregated map values over all window positions, where the aggregated values are those returned from the tth probability map. The window position attaining this maximum corresponds to the centre of the new bounding box (Fig. 3).

Fig. 3: Pipeline of our tracking algorithm.

For scale determination, we train fully connected layers to predict the current scale from the feature maps extracted by the feature extraction part. We establish a predefined scale set S containing different scales. Denoting the extracted feature maps by F and the ground-truth scale by s*, the training process in (6) minimises a loss between a prediction function of F and s*; trained by decreasing this loss, the scale-prediction layers establish a correlation between F and s*. To reduce computational complexity, we replace the prediction function with an evaluation operation (7) that outputs a score for each candidate scale, and the loss in (8) is defined as a hinge loss. The fully connected layers for scale determination do not need to be trained on offline datasets; they are initialised before online training and update their coefficients using the first frame of the tracking sequence.

4.3 Online updating
For more accurate tracking results, our tracker needs to be updated during online tracking; this adapts the model to changes in target appearance and strengthens its discriminative ability. To decrease the influence of contaminated tracking results, we update the model only after evaluating the returned training samples. Our method divides the samples into negative and positive ones by a predefined threshold associated with the confidence in (5); only samples whose confidence passes this check are used to update the model, otherwise the network is left unchanged. Meanwhile, we update the scale-determination layers with another threshold, and the set S updates its elements using a predefined scale factor.
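Before turning to the experiments, the localisation step of Section 4.2 can be pictured with the small sketch below, which sums the probability map over every window of the bounding-box size (using an integral image) and returns the maximum as the confidence together with the window centre. The use of a plain sum as the aggregation in (5), the function name, and the conversion back to image coordinates via the 8 × 8 block size are assumptions made for illustration only.

```python
import numpy as np

def localise(prob_map, w, h, block=8):
    """Shifting-window confidence over a (50 x 50) probability map.
    w, h: bounding-box width/height expressed in map cells.
    Returns (confidence, (cy, cx)) with the centre in image pixels."""
    H, W = prob_map.shape
    integral = np.zeros((H + 1, W + 1))
    integral[1:, 1:] = prob_map.cumsum(axis=0).cumsum(axis=1)
    # sum of every h-by-w window, computed in O(1) per position
    sums = (integral[h:, w:] - integral[:-h, w:]
            - integral[h:, :-w] + integral[:-h, :-w])
    y, x = np.unravel_index(np.argmax(sums), sums.shape)    # top-left cell of best window
    confidence = sums[y, x]
    centre = ((y + h / 2.0) * block, (x + w / 2.0) * block)  # back to image coordinates
    return confidence, centre

# Example: a peaked map and a 5x5-cell window
pm = np.zeros((50, 50))
pm[20:25, 30:35] = 1.0
print(localise(pm, 5, 5))  # confidence 25.0, centre (180.0, 260.0)
```

The returned confidence is the quantity that would be compared against the update threshold in Section 4.3.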
5 Experiments
Our proposed ACT tracker is implemented in MATLAB using the Caffe framework. We run the tracker and all experiments on a computer with 24 1.2 GHz central processing units and a GeForce GTX TITAN graphics processing unit. In pre-training, ACTNet is trained with SGD in the backpropagation and the learning rate is 8 × 10⁻⁷. For the fully connected layers used in scale determination, we also use SGD, with a learning rate of 5 × 10⁻¹⁰. To avoid overfitting, we apply a drop-out ratio of 0.4 in both networks. The scale factor used to update the scale set S is set to 1.083. In the following, we introduce the Object Tracking Benchmark (OTB) and report our experimental results on it.

5.1 Evaluation setting and metrics
The OTB visual tracking benchmark [12, 14] provides commonly used datasets for evaluating trackers' performance. It contains 100 sequences with hand-made annotations; each row of the ground truth gives the position of the target's bounding box in a specific frame. For quantitative analysis, we use precision plots and success plots as evaluation metrics: precision plots measure the centre location error, i.e. the distance between the predicted result and the ground truth, while success plots measure the bounding-box overlap. The robustness evaluation covers three protocols: one-pass evaluation (OPE), temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE). Given the ground truth in the first frame, OPE simply reports precision and success plots over the sequences, while TRE and SRE perturb the initialisation of the start frames. In TRE, each sequence is divided into 20 segments and our tracker is run from different initial frames. Considering that tracking results are strongly affected by the initialisation in the first frame, SRE shifts the starting bounding box spatially to test the tracker's robustness: 12 different initial bounding boxes are generated, with 8 spatial shifts and 4 SVs, and all shifts are 10% of the original size. We use these metrics as the main reference when comparing with other state-of-the-art trackers. The source code for evaluation is publicly available on the project website [http://cvlab.hanyang.ac.kr/tracker_benchmark/benchmark_v10.html].

5.2 Quantitative results
Fig. 4 shows tracking results on a subset of challenging sequences. In our experiments, our tracker is compared with 12 top trackers: C-COT [30], HDT [31], DeepSRDCF [32], DLSSVM and scale_DLSSVM [21], CF2 [33], LCT [34], Staple [35], MEEM [36], SRDCF [37], SRDCFdecon [38] and DSST [39]. All of these tracking results are available from a GitHub project for visual tracker benchmark results. Additionally, Fig. 5 shows the average precision plots and success plots of OPE, TRE and SRE over all OTB sequences. Our tracker achieves the highest performance overall, although it is second best in the precision plots for TRE, indicating that ACT's accuracy on segmented sequences is not quite as good as C-COT's. Compared with the other trackers, the ACT method leads in both metrics by a sizeable margin, which shows that it is more robust and effective in complicated circumstances and that its performance is less affected by the initialisation of the tracking process.

Fig. 4: Tracking results on a subset of challenging sequences: Biker, BlurFace, Bolt, Box, CarDark, CarScale, ClifBar, Crowds, David, Football, Human4, Liquor, Skiing, Surfer, Walking and Woman.

Fig. 5: Average precision plots and success plots of OPE, TRE and SRE over all OTB sequences. The first row shows the precision and success plots of OPE; the second and third rows show the quantitative results of TRE and SRE.

To facilitate more detailed analysis, we report average precision scores for different attributes: one-pass scores in Table 1 and robustness scores in Table 2, where the robustness scores synthesise the metrics of both TRE and SRE. The OTB dataset is classified into 11 attributes, and these scores test whether the proposed ACT method can handle various challenging factors. Both tables show that our tracker achieves high effectiveness on most factors, especially FM and motion blur (MB). However, the proposed ACT is not satisfactory for out-of-view (OV) and illumination variation (IV) sequences; the lower success rates indicate that our tracker may be more easily affected by OV and IV.
Table 1. Average precision scores for different attributes in the OPE verification: low resolution (LR), background clutter (BC), OV, in-plane rotation (IPR), FM, MB, DEF, OCC, SV, out-of-plane rotation (OPR), IV. In the published table, the top three results are marked in bold, italic and underline, respectively.

| Attribute | ACT | C-COT | CF2 | DeepSRDCF | DLSSVM | DSST | HDT | LCT | MEEM | Scale_DLSSVM | SRDCF | SRDCFdecon | Staple |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR | 0.581 | 0.562 | 0.557 | 0.352 | 0.430 | 0.408 | 0.551 | 0.286 | 0.360 | 0.418 | 0.426 | 0.436 | 0.438 |
| BC | 0.635 | 0.601 | 0.623 | 0.591 | 0.592 | 0.517 | 0.610 | 0.587 | 0.569 | 0.583 | 0.587 | 0.619 | 0.576 |
| OV | 0.709 | 0.726 | 0.575 | 0.619 | 0.581 | 0.462 | 0.569 | 0.594 | 0.606 | 0.512 | 0.555 | 0.647 | 0.547 |
| IPR | 0.629 | 0.625 | 0.582 | 0.596 | 0.556 | 0.563 | 0.580 | 0.592 | 0.535 | 0.580 | 0.566 | 0.598 | 0.580 |
| FM | 0.698 | 0.652 | 0.578 | 0.608 | 0.553 | 0.428 | 0.574 | 0.534 | 0.553 | 0.533 | 0.569 | 0.603 | 0.508 |
| MB | 0.701 | 0.666 | 0.616 | 0.625 | 0.578 | 0.455 | 0.614 | 0.524 | 0.541 | 0.545 | 0.601 | 0.606 | 0.541 |
| DEF | 0.690 | 0.668 | 0.626 | 0.617 | 0.632 | 0.506 | 0.627 | 0.668 | 0.560 | 0.655 | 0.635 | 0.626 | 0.618 |
| OCC | 0.724 | 0.698 | 0.606 | 0.628 | 0.589 | 0.532 | 0.603 | 0.627 | 0.552 | 0.592 | 0.627 | 0.642 | 0.593 |
| SV | 0.647 | 0.658 | 0.531 | 0.628 | 0.494 | 0.546 | 0.523 | 0.553 | 0.498 | 0.531 | 0.587 | 0.634 | 0.551 |
| OPR | 0.631 | 0.659 | 0.587 | 0.630 | 0.582 | 0.536 | 0.584 | 0.624 | 0.558 | 0.591 | 0.599 | 0.633 | 0.575 |
| IV | 0.627 | 0.637 | 0.560 | 0.589 | 0.540 | 0.561 | 0.557 | 0.588 | 0.533 | 0.564 | 0.576 | 0.624 | 0.568 |
| overall | 0.675 | 0.633 | 0.565 | 0.595 | 0.545 | 0.513 | 0.553 | 0.551 | 0.532 | 0.565 | 0.589 | 0.587 | 0.566 |

Table 2. Average precision scores for different attributes in the robustness verification: LR, BC, OV, IPR, FM, MB, DEF, OCC, SV, OPR, IV. In the published table, the top three results are marked in bold, italic and underline, respectively.

| Attribute | ACT | C-COT | CF2 | DeepSRDCF | DLSSVM | DSST | HDT | LCT | MEEM | Scale_DLSSVM | SRDCF | SRDCFdecon | Staple |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR | 0.571 | 0.562 | 0.557 | 0.352 | 0.430 | 0.408 | 0.551 | 0.286 | 0.360 | 0.418 | 0.426 | 0.426 | 0.438 |
| BC | 0.621 | 0.601 | 0.623 | 0.591 | 0.592 | 0.517 | 0.610 | 0.587 | 0.569 | 0.583 | 0.587 | 0.619 | 0.576 |
| OV | 0.693 | 0.726 | 0.575 | 0.619 | 0.581 | 0.462 | 0.569 | 0.594 | 0.606 | 0.512 | 0.555 | 0.647 | 0.547 |
| IPR | 0.644 | 0.625 | 0.582 | 0.596 | 0.556 | 0.563 | 0.580 | 0.592 | 0.535 | 0.580 | 0.566 | 0.598 | 0.580 |
| FM | 0.756 | 0.652 | 0.578 | 0.608 | 0.553 | 0.428 | 0.574 | 0.534 | 0.553 | 0.533 | 0.569 | 0.603 | 0.508 |
| MB | 0.714 | 0.666 | 0.616 | 0.625 | 0.578 | 0.455 | 0.614 | 0.524 | 0.541 | 0.545 | 0.601 | 0.606 | 0.541 |
| DEF | 0.692 | 0.656 | 0.626 | 0.617 | 0.632 | 0.506 | 0.627 | 0.668 | 0.560 | 0.655 | 0.635 | 0.626 | 0.618 |
| OCC | 0.728 | 0.698 | 0.606 | 0.628 | 0.589 | 0.532 | 0.603 | 0.627 | 0.552 | 0.592 | 0.627 | 0.642 | 0.593 |
| SV | 0.647 | 0.658 | 0.531 | 0.628 | 0.494 | 0.546 | 0.523 | 0.553 | 0.498 | 0.531 | 0.587 | 0.634 | 0.551 |
| OPR | 0.669 | 0.659 | 0.587 | 0.630 | 0.582 | 0.536 | 0.584 | 0.624 | 0.558 | 0.591 | 0.599 | 0.633 | 0.575 |
| IV | 0.630 | 0.637 | 0.560 | 0.589 | 0.540 | 0.561 | 0.557 | 0.588 | 0.533 | 0.564 | 0.576 | 0.624 | 0.568 |
| overall | 0.726 | 0.672 | 0.605 | 0.641 | 0.589 | 0.554 | 0.603 | 0.628 | 0.566 | 0.608 | 0.626 | 0.653 | 0.600 |

Real-time performance is another important aspect for validating the computational efficiency of our tracking method. For a more objective validation, we compared ACT with other trackers including C-COT, Staple, SRDCF, DeepSRDCF, GOTURN and DSST; the mean frames per second (FPS) of the test results are shown in Table 3. These trackers can be sorted into classes according to whether or not they are deep-learning methods. The results show that although deep-learning methods obtain more accurate predictions, their mean FPS is considerably lower than that of traditional methods, and we believe substantial improvements are still needed to meet the demands of real-time tracking. In addition, we perform a comparative experiment with and without Erlang noise in Table 4, which shows that Erlang noise increases each precision metric by about 10%.
Table 3. Mean FPS on the testing sequences. Results for C-COT, Staple and DeepSRDCF are taken from the VOT2016 challenge.

| Tracker | Deep learning | Mean FPS |
|---|---|---|
| ACT | Y | 23.65 |
| C-COT | Y | 0.51 |
| Staple | N | 44.77 |
| SRDCF | N | 1.99 |
| DeepSRDCF | Y | 0.38 |
| DSST | N | 53.60 |

Table 4. Comparative experiment with and without Erlang noise in online training.

| Metric | With Erlang noise | Without Erlang noise |
|---|---|---|
| precision of OPE | 0.824 | 0.775 |
| success rate of OPE | 0.647 | 0.640 |
| precision of TRE | 0.794 | 0.693 |
| success rate of TRE | 0.618 | 0.539 |
| precision of SRE | 0.778 | 0.647 |
| success rate of SRE | 0.635 | 0.602 |

6 Conclusion
In this paper, we present an ACTNet for visual tracking. Different from prior methods, this advanced tracking network follows the principles of VGGNets to obtain high-level semantic representations of target features, and different activation functions are used to yield accurate regression of the predicted positions. During training, we add Erlang noise to create more useful training samples and improve tracking performance. Experiments on the OTB evaluation datasets demonstrate that the proposed ACT method outperforms state-of-the-art methods.

7 Acknowledgments
This research was supported by the Natural Science Foundation of Guangdong Province, China (No. 2016A030313288). We thank the anonymous reviewers for their suggestions and comments.

8 References
1 Ng A.Y., and Jordan M.I.: 'On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes'. Advances in Neural Information Processing Systems, Vancouver, Canada, 2002, pp. 841-848
2 Ren S., He K., and Girshick R. et al: 'Faster R-CNN: towards real-time object detection with region proposal networks'. Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91-99
3 Xiao T., Xu Y., and Yang K. et al: 'The application of two-level attention models in deep convolutional neural network for fine-grained image classification'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 842-850
4 Feng F., Wang X., and Li R.: 'Cross-modal retrieval with correspondence autoencoder'. Proc. 22nd ACM Int. Conf. Multimedia, Orlando, USA, 2014, pp. 7-16
5 Girshick R., Donahue J., and Darrell T. et al: 'Rich feature hierarchies for accurate object detection and semantic segmentation'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, USA, 2014, pp. 580-587
6 Wang N., Li S., and Gupta A. et al: 'Transferring rich feature hierarchies for robust visual tracking', arXiv preprint arXiv:1501.04587, 2015
7 Wang N., Shi J., and Yeung D.Y. et al: 'Understanding and diagnosing visual tracking systems'. Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 3101-3109
8 Oquab M., Bottou L., and Laptev I. et al: 'Learning and transferring mid-level image representations using convolutional neural networks'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, USA, 2014, pp. 1717-1724
9 Tompson J., Goroshin R., and Jain A. et al: 'Efficient object localization using convolutional networks'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 648-656
10 Zhang L., and Suganthan P.N.: 'Visual tracking with convolutional neural network'. 2015 IEEE Int. Conf. Systems, Man, and Cybernetics (SMC), Hong Kong, 2015
11 Fan J., Xu W., and Wu Y. et al: 'Human tracking using convolutional neural networks', IEEE Trans. Neural Netw., 2010, pp. 1610-1623
12 Wang L., Ouyang W., and Wang X. et al: 'STCT: sequentially training convolutional networks for visual tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1373-1381
13 Nam H., and Han B.: 'Learning multi-domain convolutional neural networks for visual tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 4293-4302
14 Wu Y., Lim J., and Yang M.H.: 'Object tracking benchmark', IEEE Trans. Pattern Anal. Mach. Intell., 2015, 37, (9), pp. 1834-1848
15 Wu Y., Lim J., and Yang M.H.: 'Online object tracking: a benchmark'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington DC, USA, 2013, pp. 2411-2418
16 Zhang S., Yao H., and Sun X. et al: 'Sparse coding based visual tracking: review and experimental comparison', Adv. Pattern Recognit., 2013, 46, (7), pp. 1772-1788
17 Li J., and Wang J.: 'Adaptive object tracking algorithm based on eigen basis space and compressive sampling', IET Image Process., 2012, 6, (8), pp. 1170-1180
18 Bro R., and Smilde A.K.: 'Principal component analysis', Anal. Methods, 2014, 6, (9), pp. 2812-2831
19 Zhang K., and Song H.: 'Real-time visual tracking via online weighted multiple instance learning', Adv. Pattern Recognit., 2013, 46, (1), pp. 397-411
20 Sun L., Liu G., and Liu Y.: 'Multiple pedestrians tracking algorithm by incorporating histogram of oriented gradient detections', IET Image Process., 2013, 7, (7), pp. 653-659
21 Ning J., Yang J., and Jiang S. et al: 'Object tracking via dual linear structured SVM and explicit feature map'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 4266-4274
22 Henriques J.F., Caseiro R., and Martins P. et al: 'High-speed tracking with kernelized correlation filters', IEEE Trans. Pattern Anal. Mach. Intell., 2015, 37, (3), pp. 583-596
23 Wang N., and Yeung D.-Y.: 'Learning a deep compact image representation for visual tracking'. Advances in Neural Information Processing Systems, Nevada, USA, 2013, pp. 809-817
24 Yun S., Choi J., and Yoo Y. et al: 'Action-decision networks for visual tracking with deep reinforcement learning'. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 2711-2720
25 Cui Z., Xiao S., and Feng J. et al: 'Recurrently target-attending tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1449-1458
26 LeCun Y., Bengio Y., and Hinton G.: 'Deep learning', Nature, 2015, 521, (7553), pp. 436-444
27 Donahue J., Jia Y., and Vinyals O. et al: 'Decaf: a deep convolutional activation feature for generic visual recognition'. Proc. Int. Conf. Machine Learning, Madrid, Spain, 2014, pp. 647-655
28 Simonyan K., and Zisserman A.: 'Very deep convolutional networks for large-scale image recognition', arXiv preprint arXiv:1409.1556, 2014
29 Brox T.: 'Maximum likelihood estimation'. Advances in Computer Vision, USA, 2014, pp. 481-482
30 Danelljan M., Robinson A., and Khan F.S. et al: 'Beyond correlation filters: learning continuous convolution operators for visual tracking'. European Conf. Computer Vision, Amsterdam, Netherlands, 2016, pp. 472-488
31 Qi Y., Zhang S., and Qin L. et al: 'Hedged deep tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 4303-4311
32 Danelljan M., Hager G., and Shahbaz Khan F. et al: 'Convolutional features for correlation filter based visual tracking'. Proc. IEEE Int. Conf. Computer Vision Workshops, Santiago, Chile, 2015, pp. 58-66
33 Ma C., Huang J.B., and Yang X. et al: 'Hierarchical convolutional features for visual tracking'. Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 3074-3082
34 Ma C., Yang X., and Zhang C. et al: 'Long-term correlation tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 5388-5396
35 Bertinetto L., Valmadre J., and Golodetz S. et al: 'Staple: complementary learners for real-time tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1401-1409
36 Zhang J., Ma S., and Sclaroff S.: 'MEEM: robust tracking via multiple experts using entropy minimization'. European Conf. Computer Vision, Cham, 2014, pp. 188-203
37 Danelljan M., Hager G., and Shahbaz Khan F. et al: 'Learning spatially regularized correlation filters for visual tracking'. Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 4310-4318
38 Danelljan M., Hager G., and Shahbaz Khan F. et al: 'Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1430-1438
39 Danelljan M., Häger G., and Khan F. et al: 'Accurate scale estimation for robust visual tracking'. Proc. British Machine Vision Conf., Nottingham, September 2014, pp. 1-5