Article | Open access | Peer reviewed

IAUFD: A 100k images dataset for automatic football image/video analysis

2022; Institution of Engineering and Technology; Volume: 16; Issue: 12; Language: English

10.1049/ipr2.12543

ISSN

1751-9667

Authors

Amirhosein Zanganeh, Mahdi Jampour, Kamran Layeghi

Topic(s)

Anomaly Detection Techniques and Applications

Abstract

IET Image Processing, Volume 16, Issue 12, pp. 3133-3142. ORIGINAL RESEARCH, Open Access. First published: 30 May 2022. https://doi.org/10.1049/ipr2.12543

Amirhosein Zanganeh (Department of Computer Engineering, North Tehran Branch, Islamic Azad University, Tehran, Iran); Mahdi Jampour (corresponding author, Quchan University of Technology, Quchan, Iran; orcid.org/0000-0002-1559-1865; email: [email protected]); Kamran Layeghi (Department of Computer Engineering, North Tehran Branch, Islamic Azad University, Tehran, Iran)

Dataset: Islamic Azad University Football Dataset (IAUFD), http://sites.google.com/view/image-and-video-analysis

Abstract

Analyzing football videos using computer vision techniques has attracted increasing attention. Significant event detection, football video summarization, football result prediction, and match statistics are exciting applications in this area. Deep learning approaches are highly successful for image and video analysis, but they require large amounts of data. Nevertheless, to the best of our knowledge, the publicly available datasets in this area are small or narrowly scoped, which is insufficient for such deep learning-based approaches. To fill this gap, we collected, annotated, and prepared a public dataset, IAUFD. The IAUFD contains 100,000 real-world images extracted from 33 football videos totaling 2,508 min, annotated with 10 event categories: the goal, center of the field, celebration, red card, yellow card, the ball, stadium, the referee, penalty-kick, and free-kick. We believe these moments form a useful basis for high-level action and event exploration. To improve generalization, the dataset covers various weather conditions (e.g., sunny, rainy, cloudy), seasons, times of day, and locations. We also evaluated the dataset with two deep neural networks (VggNet-13 and ResNet-18) to provide baselines for future studies and comparison.
1 INTRODUCTION

Rapid growth in technology has driven the expansion of video recording equipment and storage media. Nowadays, about 500 h of video content are uploaded to websites like YouTube every minute, and about five billion videos are watched every day by users around the world. These huge volumes of video, produced and watched by many users, indicate the importance of automated systems for storing, processing, and managing video content [1]. Consequently, media users all over the world encounter a wide range of video types daily. Produced videos can broadly be classified into two areas: surveillance and entertainment. Video surveillance systems in organizations, offices, factories, and other working environments enable accurate monitoring and control of the environment, reduce violations, increase system efficiency, and so on. The second category comprises entertainment-related videos: according to their tendencies and interests, users can personally produce video programs or follow and watch videos accessible on the internet.

Among these, sports videos, and particularly football match videos, are very popular because football is one of the most attractive sports in the world. Its popularity has turned it into a large and intensely active industry; for example, the European football market's annual income is approximately $28.7 billion, which illustrates the significance of this sport in the economy [2]. The attraction and popularity of football games have drawn fans' attention, and these videos are now among the most-viewed videos. One characteristic of a football game is its long duration, which can be an advantage in some cases and a disadvantage in others. Because of this long duration, many people do not have the opportunity to watch a 90-min football game; on the other hand, they still want to watch at least the most important and exciting moments. Therefore, extensive research has recently been conducted on football video analysis through image/frame or video processing [3]. This research includes recognizing important occurrences and significant events [4-6], analyzing match statistics [7], predicting football game results [8], summarizing football game videos [9], and so on.

Recent research on football video analysis is promising. Automatic analysis of football game videos is instrumental in event detection, statistical reporting, football object tracking, video summarization, and more. Machine learning approaches, particularly supervised learning methods, can perform such automated analysis effectively given suitable training data. In these techniques, data is critical, and preparing appropriate data is always challenging. Considering the lack of a comprehensive, suitable, and publicly available dataset of football game images, we provide the football dataset presented in this paper. Several example images from this dataset, covering objects, events, and scenes, are shown in Figure 1. To guide the dataset toward football game image understanding and the recognition of significant events, we conducted a survey: a questionnaire listing five common football game events, in which participants were asked to rate the most important events.
The survey form includes the goal, penalty-kick, free-kick, ball hitting the goalpost, and red/yellow card events. The questionnaire was distributed among 200 people of different ages, who were asked to score each event from one (lowest) to five (highest). The survey outcomes are shown in Table 1. As expected, according to the average opinion of the participants, the most significant event in a football game is the goal, followed by the penalty kick. The free-kick, the ball hitting the goalpost, and the penalty cards (red and yellow) are the next most significant events, respectively.

Based on the survey outcomes, and with the aim of recognizing important football game events, we collected 100k football images covering objects, events, and scenes in 10 groups: the goal, beginning the game at the center of the field (start/restart), celebration, the red card event, the yellow card event, the ball, the stadium, the referee, the penalty kick, and the free-kick. The purpose of these classes is to support the exploration of significant events for various high-level applications such as statistical analysis, football object tracking, video summarization, and so on. Our dataset can thus be considered a comprehensive set for various football game analyses. Although there are various approaches using video processing (e.g., LSTM, optical flow, motion, etc.), as surveyed by Apostolidis et al. [1], our focus is on providing a dataset for image or video-frame processing.

TABLE 1. Survey results on the most important football game events

Event type                Average score
Goal                      4.32
Penalty-kick              4.00
Free-kick                 3.32
Ball hits the goalpost    3.20
Red and yellow card       2.96

FIGURE 1. Some sample images from our proposed dataset with different observations (objects, events, or scenes)

In addition to introducing a comprehensive dataset of football game images that is freely accessible to the public, this paper proposes a baseline model, with several evaluation protocols, to recognize game objects. Our model relies on two well-known deep neural network architectures, VggNet-13 and ResNet-18, for object recognition and event detection.

2 RELATED DATASETS

There are many datasets containing sports videos for analysis. For example, Ramanathan et al. [12] provided a dataset of 257 basketball videos in which the video frames are classified into ten categories, suitable for event detection. Karpathy et al. [13] presented a dataset with one million videos of various sports, classified into 487 classes. Their dataset has great variety and a very large number of videos; however, it has no labeled/classified images for football game analysis. Giancola et al. [2] developed a video dataset, but with limited categorization: all videos are prepared in only 3 categories. Wang et al. [10] introduced a football dataset with both videos and images, partially resolving the limitation of the dataset of Karpathy et al.; however, the images are limited to only three classes: the players, the ball, and the goal. Deliege et al. [11] provided a dataset of 500 football matches whose videos are classified into 17 different classes.
However, that dataset covers only European football matches; including videos from other continents could make it more comprehensive [11]. To the best of our knowledge, all available datasets have one or more gaps: dataset size, label information, data type (videos only), public availability, and so on, which make research in this area difficult. In addition, a lack of variety in recording conditions reduces the comprehensiveness of the existing datasets [6]. They are also inadequate for object recognition in football game events due to the low number of available classes. We therefore provide a dataset of images from football game videos under various conditions to remove these limitations. The existing datasets and ours are briefly compared in Table 2. Given this gap in the research topic of this paper, we introduce a dataset with 100,000 images, described in more detail in the following section.

TABLE 2. Comparison of the available datasets for analyzing football game videos

Dataset name        Data type     Image no.   Class no.
SoccerNet [2]       Video         -           3
SoccerDB [10]       Video/Image   55,290      3
SoccerNet-v2 [11]   Video         -           17
IAUFD (ours)        Image         100,000     10

3 DATASET OVERVIEW

In this section, we describe how the dataset was prepared, along with statistical information and more details about its classes. In IAUFD (Islamic Azad University Football Dataset), 100,000 images were extracted from 33 football matches, including national, international, and club competitions. Although some datasets publish both videos and images, as described in Section 2, we focused on providing a comprehensive dataset of football images. The football matches in our dataset span different hours, locations, fields, seasons, and so on. This variation helps avoid over-fitting in machine learning models and keeps the extracted data close to real-world conditions.

Our image sampling rate is about one frame per second for all football videos, to avoid near-identical consecutive frames. Since match durations vary, the total number of frames extracted from each video differs. Furthermore, we automatically discarded frames with a similarity of 95% or greater to the previously extracted frame, where similarity is derived from the well-known normalized sum of squared differences between two frames (see the sketch below).

The characteristics of the presented dataset are shown in Table 3. As can be seen, images carry multiple labels (from one to five labels per image), meaning that several objects, events, or scenes are labeled in some images. This multiple labeling is helpful for applications that look for simultaneous events. In addition, the characteristics of the teams participating in the matches are shown in Table 4. The competition videos were selected comprehensively so that they cover all the continents of the world.
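The following is a minimal sketch of the sampling-and-deduplication step described above. The paper does not specify the exact normalization beyond "normalized sum of squared differences"; this sketch assumes OpenCV's TM_SQDIFF_NORMED score (0 = identical frames) and treats 1 - score as the similarity between consecutive sampled frames.

```python
import cv2

def sample_frames(video_path, sim_threshold=0.95):
    """Sample roughly one frame per second and drop frames that are >= 95%
    similar to the previously kept frame (hypothetical helper; the paper's
    exact normalization is an assumption)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
    step = max(int(round(fps)), 1)            # ~1 sampled frame per second
    kept, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            if prev is None:
                kept.append(frame)
                prev = frame
            else:
                # TM_SQDIFF_NORMED on equal-size images yields one score in [0, 1],
                # where 0 means identical; 1 - score then acts as a similarity.
                nssd = cv2.matchTemplate(frame, prev, cv2.TM_SQDIFF_NORMED)[0, 0]
                if 1.0 - nssd < sim_threshold:  # keep only sufficiently novel frames
                    kept.append(frame)
                    prev = frame
        idx += 1
    cap.release()
    return kept
```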
TABLE 3. Characteristics of our proposed dataset

Feature                           Info
Number of competitions            33 matches
Number of participating teams     36 teams
Types of competitions             Club, national, friendly
Profile of participating teams    From 5 world continents
Total duration of matches         2,508 min
Competition times                 All 12 months
Total number of images            100,000 images
Images with one label             23,104
Images with two labels            31,843
Images with three labels          7,739
Images with four labels           1,549
Images with five labels           134
Number of image categories        10 classes

TABLE 4. Teams appearing in the collected match videos

Club competitions (England): Chelsea, Everton, Man City, Crystal Palace, Swansea, Tottenham, Manchester United
International matches:
  Africa: Cape Verde, Congo, Ghana, Guinea, Ivory Coast, Tunisia, Zambia
  Asia: Iran, Iraq, South Korea, UAE
  America: Argentina, Brazil, Chile, Colombia, Peru, Paraguay
  Australia: Australia
  Europe: Italy, England, Iceland, Ireland, Slovakia, Germany, Belgium, France, Hungary, Wales, Russia

Statistically, the images are gathered such that all football game objects, such as the stadium view, the spectators, the players, the referees, and the ball, as well as various game scenes like the yellow card and the red card, are included. Since the reference match videos come from different sources, the extracted images have various sizes, with no restrictions imposed: image widths and heights are diverse, with the smallest image measuring 640 × 288 and the largest 1920 × 1080. We retain color images because color carries necessary information for some events, such as the yellow and red cards.

The recognized objects in each image are listed in Table 5. At present, the existence or nonexistence of 10 features (the goal, center of the field, celebration, the yellow and red cards, the ball, the stadium, the referee, the penalty-kick, and the free-kick) has been determined: if a feature exists in the image, 1 is inserted in the corresponding column, otherwise 0. Note that multiple events may be observed in one image, so several columns of an image's row may equal 1. Labeling was performed by two students who were fully familiar with the game of football; to avoid errors, the first half of the data was labeled by the first student and double-checked by the second, and the second half was labeled by the second student and checked by the first.

TABLE 5. Object labels available for each image

Image no.  Goal  Center of field  Celebration  Yellow card  Red card  Ball  Stadium  Referee  Penalty-kick  Free-kick
000001     0     1                0            0            0         1     1        0        0             0
000002     0     0                0            0            0         0     1        1        0             0
...
000019     1     0                0            0            0         1     0        0        0             0
...
060016     1     0                0            0            0         1     0        1        0             1
...
100000     0     0                0            0            0         0     0        1        0             0
Images per class: 24,012 | 569 | 1,375 | 591 | 71 | 47,691 | 4,116 | 35,360 | 540 | 25,562

The feature table of the dataset is designed so that new users and researchers can add new features. For instance, a corner kick appears in some images; if required, researchers can add such a feature to the table and thereby enable other interested researchers to work on it in the future. As represented in Table 5 for image 000019, for example, the goal and the ball are visible in the image; therefore, 1 is inserted in the goal and ball columns, while 0 is entered in the other columns. This labeling scheme is also illustrated in Figure 2.
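Because each image row stores a 0/1 flag per observation, selections by the conjunction or disjunction of features are straightforward, and new feature columns (such as the corner kick mentioned above) can be added. A minimal sketch follows, assuming a hypothetical CSV export of Table 5; the file name, column names, and the added column are illustrative only, since the dataset's actual distribution format is not described in this excerpt.

```python
import pandas as pd

# Hypothetical layout: one row per image, one 0/1 column per observation,
# mirroring Table 5 (the file name is an assumption).
labels = pd.read_csv("iaufd_labels.csv", dtype={"Image": str})

# Conjunction: frames showing the goal AND the ball (cf. image 000019).
goal_and_ball = labels[(labels["Goal"] == 1) & (labels["Ball"] == 1)]

# Disjunction: frames showing a yellow OR a red card.
any_card = labels[(labels["Yellow card"] == 1) | (labels["Red card"] == 1)]

# Extending the table with a new feature, as the paper suggests
# (e.g., a corner-kick column initialized to 0 and annotated later).
labels["Corner-kick"] = 0
```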
Another advantage of the presented dataset, illustrated by the sketch above, is the possibility of extracting images based on the conjunction or disjunction of features: if required, researchers can build the training sets they need by combining various dataset features.

Hence, while there are only a few datasets for football video analysis, a significant advantage of our dataset is that it is more extensive than previous datasets, containing 100,000 labeled images. Beyond its size, our dataset contains images with multiple labels, which is helpful for more complicated purposes and high-level understanding; for instance, it allows learning and retrieving images with multiple occurrences (e.g., frames that contain the goal, the ball, and a celebration together). Moreover, to the best of our knowledge, and as described in Section 2, the previous datasets have only a few classes (e.g., three), whereas ours contains ten classes of objects, events, or scenes, posing greater challenges for future studies. Another advantage of our dataset is its attention to multiple types of data (objects, events, and scenes), which enables more sophisticated models with a higher chance of success; for instance, objects such as the ball and the players in a scene near the goal gate strengthen the evidence for a goal event.

FIGURE 2. Labeling of significant football observations (objects, events, or scenes) in our dataset

4 BASELINE AND DATASET EVALUATION

To evaluate the proposed dataset, and to introduce a standard protocol, we recognize various objects, events, or scenes: the goal, the center of the field (start/restart of the game), celebration, the yellow and red cards, the ball, the stadium, the referee, the penalty-kick, and the free-kick. Recognizing these targets is essential since they underpin significant event detection, statistical analysis, match summarization, and so on. Convolutional neural networks (CNNs) are among the most accurate machine learning methods and are studied and used for a wide range of problems [14, 15]. Deep learning has also benefited from technological advances in graphics processing units (GPUs), which have enabled its widespread use. Compared to traditional methods, deep learning techniques have obtained better results on many significant problems such as event detection [16], particularly in football videos [17]. Accordingly, we recognize the images containing these objects/events using two standard deep neural network architectures, VggNet-13 and ResNet-18. The purpose of evaluating with these two networks is to establish baseline results on this dataset, although the dataset can also serve other applications.

4.1 The VggNet-13

This paper uses the standard VggNet-13 deep neural network to recognize objects, events, or scenes in the proposed dataset. VggNet is a well-designed convolutional neural network architecture presented by Simonyan and Zisserman [18]. The architecture balances the desired network depth against the number of network parameters by using 3 × 3 convolution filters with a stride of 1 and 2 × 2 max-pooling throughout all layers. The activation function is the rectified linear unit (ReLU), with a sigmoid function in the last layer. In this work, we used the standard VggNet-13 for the various evaluations on the proposed dataset; a rough sketch of such a setup follows.
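The paper does not publish its training code, so the snippet below is only a sketch of a plausible setup: it assembles a torchvision VGG-13 trunk and replaces the classifier head with a single sigmoid output for the two-class protocols. The single-logit head and the use of torchvision are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical reconstruction of the VggNet-13 baseline: standard VGG-13
# convolutional trunk (3x3 convs, stride 1, 2x2 max-pooling, ReLU throughout),
# with a sigmoid applied to the final layer as described in the paper.
model = models.vgg13(weights=None)          # train from scratch
model.classifier[-1] = nn.Linear(4096, 1)   # one logit for a two-class problem

x = torch.randn(8, 3, 200, 200)             # dummy batch at the paper's 200x200 size
prob = torch.sigmoid(model(x))              # probability of the positive class
```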
4.2 The ResNet-18

The residual network (ResNet) [19] is another efficient deep model utilized and examined to evaluate the proposed dataset. The ResNet architecture used in this paper contains 18 layers, comprising four residual blocks with the same structure. In the presented model, the batch normalization layer and the ReLU activation function are applied before the convolutional layers. Additionally, each residual block uses the same number of filters, except that the number of filters in the last convolutional layer is doubled to preserve each layer's computational complexity. Therefore, as the network gets deeper, the numbers of filters in the residual blocks are 64, 128, 256, and 512, respectively.

4.3 Multiple observations evaluation

In addition to the previous standard models, we propose to recognize observations (objects, events, or scenes) in our dataset simultaneously. When classifying football images, some objects appear together in an image, and consequently the image should receive more than one label according to the recognized objects. To solve this problem, instead of using a multi-path architecture, we use a one-path VggNet-13 architecture trained on images that simultaneously contain the desired objects/events.

5 EXPERIMENTAL RESULTS

This section evaluates the two baseline methods as well as the multiple-occurrences and multiple-labels experiments for recognizing the intended observations on our dataset. In the training process, we used all the data, except for the goal, ball, referee, and free-kick classes, which contain more than 10k images each; for these classes we used the first 10,000 images. We follow a single strategy across all experiments and protocols: when we select the images of interest (e.g., the goal or ball class), we divide them into training, validation, and test subsets with ratios of 50%, 25%, and 25%, respectively. If a subset size is not a whole number, we take the floor; for instance, if a selected class has 150 images, the subset sizes are 75, 37, and 37, which means the last image is ignored. In all experiments, we train with a batch size of 64 and up to 100 epochs, unless the training process finishes earlier. Moreover, we resize all input images to 200 × 200 pixels and apply a data augmentation mechanism that generates five images from each image using zooming, shifting, and horizontal flipping; a minimal sketch of this data preparation appears below.
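A minimal sketch of the split-and-augment recipe, under our own assumptions: the shuffle seed is our addition, the zoom/shift magnitudes are not given in the paper and are guessed here, and the file layout is hypothetical.

```python
import random
from torchvision import transforms

def split_class(image_paths, seed=0):
    """Floor-based 50/25/25 train/val/test split, as described in the paper.
    Shuffling with a fixed seed is our addition for reproducibility."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train, n_val = len(paths) // 2, len(paths) // 4
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:n_train + 2 * n_val]  # floor; extras dropped
    return train, val, test

# Augmentation assumed to match "zooming, shift, and horizontal flip";
# applying this pipeline five times per image yields the five variants.
augment = transforms.Compose([
    transforms.Resize((200, 200)),                  # the paper's input size
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),   # shift (assumed magnitude)
                            scale=(0.9, 1.1)),      # zoom (assumed magnitude)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```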
5.1 Metrics and protocols

The evaluations aim to find the frames of a football game video in which significant objects such as the goal, the yellow and red cards, the ball, and so on can be observed. We are motivated to recognize the frames with important observations: these observations are part of important game segments and can ultimately be considered high-level features of football game videos. Recognizing important game portions could be defined even more intelligently, which we consider part of future work.

To evaluate the performance of the proposed method, four evaluation metrics are used: recall (1), precision (2), F-measure (3), and accuracy (4), defined as follows:

\begin{equation} \text{Recall} = \frac{TP}{TP+FN}, \tag{1} \end{equation}

\begin{equation} \text{Precision} = \frac{TP}{TP+FP}, \tag{2} \end{equation}

where TP (true positives) is the number of positive samples correctly recognized as positive, TN (true negatives) is the number of negative samples correctly recognized as negative, FP (false positives) is the number of false-positive recognitions, and FN (false negatives) is the number of false-negative recognitions. The F-measure and accuracy values are then defined as:

\begin{equation} \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{3} \end{equation}

\begin{equation} \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}. \tag{4} \end{equation}

We provide four protocols using the two baseline deep models for evaluating the proposed dataset:

1) Two-class evaluation protocol (Section 5.2): the input image belongs to one of only two classes, and the label with the highest probability is assigned to the image.

2) Multi-class evaluation protocol (Section 5.3): the input images come from several different classes. To evaluate the dataset, a seven-class evaluation protocol is introduced; a softmax function selects the label with the highest probability among the seven classes.

3) Multiple-occurrences evaluation protocol (Section 5.4): more than one observation (objects, events, or scenes) may occur together in an image. To recognize and classify such co-occurring observations, we use a one-path architecture trained on images that simultaneously contain the desired events (e.g., goal and ball).

4) Multiple-labels evaluation protocol (Section 5.5): for images with more than one simultaneous observation, we use a multi-path architecture in which each path recognizes one observation; the input image is then classified with multiple labels.

5.2 Two-class evaluation protocol

In this section, recognizing each observation (e.g., ball, cards, free-kick, goal, penalty-kick, referee, stadium view, center of the field, or celebration) versus the other observations is treated as a two-class problem. In two-class classification, the input images are drawn from only two classes during training, and in the testing phase each sample image is assigned to one of them. Therefore, in this experiment we select images with only one label of interest (e.g., goal versus other). For evaluation, we use both the standard VggNet-13 and ResNet-18 on the presented dataset. The results of VggNet-13, with the standard measures TP, TN, FP, FN, recall, precision, f-measure, and accuracy for the nine observations, are shown in detail in Table 6; note that the red and yellow cards are merged into a single cards class in our experiments. The corresponding results for ResNet-18 are reported in Table 7. A sketch of computing the four metrics follows.
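A minimal sketch of Equations (1)-(4) as used in Tables 6 and 7, assuming the ground-truth labels and binary predictions are available as 0/1 arrays (our assumption about representation):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Recall, precision, F-measure, and accuracy from Equations (1)-(4).
    Assumes at least one positive prediction and one positive label."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    recall = tp / (tp + fn)                                     # Eq. (1)
    precision = tp / (tp + fp)                                  # Eq. (2)
    f_measure = 2 * precision * recall / (precision + recall)   # Eq. (3)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (4)
    return recall, precision, f_measure, accuracy
```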
TABLE 6. Evaluation measures for nine observations in our proposed dataset using VggNet-13

Measure     Ball     Cards    Free-kick  Goal     Penalty-kick  Referee  Stadium  Center field  Celebration
TP          73.96%   86.66%   86.00%     92.00%   82.22%        72.93%   87.97%   87.68%        82.00%
TN          70.02%   83.03%   83.95%     88.00%   77.77%        70.93%   84.96%   86.23%        79.94%
FP          26.04%   13.34%   14.00%     8.00%    17.78%        27.07%   12.03%   12.32%        18.00%
FN          29.98%   16.97%   16.05%     12.00%   22.23%        29.07%   15.04%   13.77%        20.06%
Recall      71.15%   83.62%   84.27%     88.46%   78.71%        71.50%   85.39%   86.42%        80.34%
Precision   73.96%   86.66%   86.00%     92.00%   82.22%        72.93%   87.97%   87.68%        82.00%
f-measure   72.53%   85.11%   85.12%     90.19%   80.43%        72.20%   86.66%   87.04%        81.16%
Accuracy    71.99%   84.84%   84.97%     90.00%   79.99%        71.93%   86.46%   86.95%        80.97%

TABLE 7. Evaluation measures for nine observations in our proposed dataset using ResNet-18

Measure     Ball     Cards    Free-kick  Goal     Penalty-kick  Referee  Stadium  Center field  Celebration
TP          71.96%   75.75%   67.91%     84.98%   85.92%        56.93%   89.97%   81.88%        76.10%
TN          70.02%   73.93%   63.99%     80.98%   82.22%        52.93%   87.97%   79.71%        69.91%
FP          28.04%   24.25%   32.09%     15.02%   14.08%        43.07%   10.03%   18.12%        23.90%
FN          29.98%   26.07%   36.01%     19.02%   17.78%        47.07%   12.03%   20.29%        30.09%
Recall      70.59%   74.39%   65.34%     81.71%   82.85%        54.74%   88.20%   80.14%        71.66%
Precision   71.96%   75.75%   67.91%     84.98%   85.92%        56.93%   89.97%   81.88%        76.10%
f-measure   71.26%   75.06%   66.60%     83.31%   84.35%        55.81%   89.07%   81.00%        73.81%
Accuracy    70.95%   74.82%   65.95%     82.98%   84.07%        54.93%   88.97%   80.79%        73.00%

5.3 Multi-class evaluation protocol

In another experiment, recognizing seven selected observations is treated as a seven-class problem, and the results of the introduced deep models on the proposed dataset are presented. In multi-class problems, images with various labels are used in the training process. In the testing process, the label with the highest probability, as computed by the softmax function, is chosen as the final label of the input image. We note that the selected images carry only one label, so each image belongs unambiguously to a single class.
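As a rough sketch of the multi-class protocol, the snippet below attaches a seven-way head to a VGG-13 trunk and picks the label with the highest softmax probability; the head size follows the seven-class protocol, while everything else (torchvision, training from scratch) is our assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

# Seven-way classifier for the multi-class protocol: the softmax turns the
# seven logits into probabilities, and argmax picks the final label.
model = models.vgg13(weights=None)
model.classifier[-1] = nn.Linear(4096, 7)

model.eval()
with torch.no_grad():
    x = torch.randn(1, 3, 200, 200)       # one 200 x 200 input image
    probs = torch.softmax(model(x), dim=1)
    label = probs.argmax(dim=1)           # label with the highest probability
```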
