Article, Open Access, Peer-reviewed

Improving automatic bug assignment using time-metadata in term-weighting

2014; Institution of Engineering and Technology; IET Software; Volume: 8; Issue: 6; Pages: 269-278; First published: 01 December 2014; Language: English

10.1049/iet-sen.2013.0150

ISSN

1751-8814

Authors

Ramin Shokripour (corresponding author), Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
John Anvik, Department of Computer Science, Central Washington University, Ellensburg, Washington, USA
Zarinah M. Kasirun, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
Sima Zamani, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

Topic(s)

Web Application Security Vulnerabilities

Abstract

Assigning newly reported bugs to project developers is a time-consuming and tedious task for triagers using the traditional manual bug triage process. Previous efforts to create automatic bug assignment systems use machine learning and information-retrieval techniques. These approaches commonly use tf-idf, a statistical technique for weighting terms based on term frequency. However, tf-idf does not consider metadata, such as the time frame in which a term was used, when calculating term weights. This study proposes an alternative term-weighting technique to improve the accuracy of automatic bug assignment approaches that use term weighting. The technique uses metadata, in addition to the statistical computation, to calculate the term weights. Moreover, it restricts the set of terms used to nouns only. It was found that when using only nouns and the proposed term-weighting technique, the accuracy of an automatic bug assignment approach improves by 12 to 49% over tf-idf for three open-source projects.

1 Introduction

The increasing complexity of software, shortening development cycles and higher customer expectations of quality have placed a major responsibility on software development steps such as debugging.
Many researchers have focused on improving the various stages of the debugging process through automation, such as analysing crash reports [1], automatically categorising defects [2] and predicting bug severity [3]. Bug assignment is one of the debugging processes; it aims to recommend an appropriate developer to fix a bug newly reported to a project [4, 5]. The traditional bug assignment process is time-consuming and tedious, and imposes additional costs on the project [4]. Moreover, traditional developer-driven bug triage, which uses the developers' activity history to recommend the proper developer to fix a new bug, significantly decreases the speed of the bug-fixing process [6]. As a result, researchers looking to improve the debugging process have focused on creating automatic bug assignment systems to assist with the bug-fixing process [4, 5, 7-11]. These systems mine project artefacts, which are mostly in textual format, for the information required to recommend the most appropriate developer(s) for fixing a new bug. The textual nature of these artefacts has led researchers to use text analysis methods for extracting information from these resources. Most automatic bug assignment approaches use either machine learning (ML) [4, 11, 7, 12] or information-retrieval (IR) [8, 10] methods to extract and analyse the information in these artefacts. Both kinds of method rely on statistical computations, such as computing the frequency of terms in documents, for weighting the terms extracted from the information resource(s). A developer recommendation is then made based on the similarity between the terms extracted from the new bug report and those extracted from the existing information resources. A common means of determining term similarity is based on the weighting of the terms in the documents and corpus.

The common term-weighting techniques, which originate from a natural language context, do not consider additional information, such as metadata [13]. The metadata may include items such as developer identifiers, time stamps and commit comments, which associate the answers to who, when and why with a term in the artefact. One important piece of metadata is the 'when', the time at which the term was created or modified. Such metadata can play a significant role in establishing the relationship and similarity between a new bug and the activities of the project's developers. For example, the time at which a developer uses certain terms may be an important factor for determining the activities of the developer at various periods of the project's life. Take, for example, a developer who six months ago worked on a bug containing the term 'switch', compared to another developer who worked on a bug containing the same term two years ago. The first developer is likely to be the more appropriate recommendation for fixing a new bug that contains the term 'switch'. Therefore, using such metadata when calculating the weight of terms can improve the accuracy of automatic bug assignment.

This paper presents a new time-aware noun-based bug assignment (TNBA) approach that improves the accuracy of developer recommendation by weighting terms based on the time at which the developers used them. The data extraction and preparation steps of the approach from [14] were modified to use metadata in term-weighting, and the source of text data was restricted to bug reports only.
Moreover, in this paper it is shown that using only noun terms not only improves data set quality and approach accuracy, but also has a positive effect on other existing methods, such as the vector space model (VSM) and Naïve Bayes (NB).

The rest of this paper is organised as follows. First, the proposed approach and term-weighting technique are presented in Section 2. Next, an evaluation of the approach is given in Sections 3 and 4. We conclude the paper by discussing some threats to validity and related work.

2 Proposed approach

In this section, a noun-based approach that uses a time-aware term-weighting technique for automatic bug assignment is presented. As shown in Fig. 1, the proposed approach consists of a set of components: activity history extraction, entity extraction, term-weighting and developer selection. The rest of this section provides details for each of these components.

Fig. 1 Overview of the proposed approach

2.1 Activity history extraction

Previously fixed bug reports (hereafter called 'fixed bugs'), which are recorded in the issue tracking system (ITS), are an important information resource for bug assignment. A key step in this process is determining the developers who fixed the bugs. Two techniques are used to link the bug reports with their associated developers. First, any patches attached to the bug report are examined and the name of the developer who provided each patch is extracted. If the bug report does not have any attached patches, the log messages in the project's version control system (VCS) are mined to determine the link [15]. It is not uncommon for developers to put the ID of the bug report in the commit message when submitting changes to the project's VCS [15]. A variety of techniques can be used to form this link, ranging from an ITS that automatically creates the linking commit message when a bug report is resolved, to project standards requiring developers to provide this information for any source code commits. For projects where this link does not exist, other techniques have been proposed [16, 17]. As will be discussed in Section 3.1, only projects that contained ITS-VCS links were chosen as subject systems, to eliminate any confounding influence from these other linking techniques.

2.2 Entity extraction

An investigation of the impact of noun usage in bug assignment in [14] indicated that using only the noun terms not only provides enough information for making decisions, but also leads to an approach that is independent of dimensionality-reduction methods. In this step, the expertise of the developers is determined by examining the nouns found in the summary and description of the bugs each developer has fixed. Finally, an index of the nouns is created for each developer, containing triples of the noun, the developer who used the noun and the reporting date of the fixed bug in which the noun was used. A sketch of this step is given below.
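As an illustration, the noun-index construction of Section 2.2 might look as follows. The original implementation used the ANNIE plugin of GATE for noun extraction and the Stanford CoreNLP API for lemmatisation; this sketch substitutes spaCy as a stand-in, and the bug-record field names are assumptions.

```python
# A minimal sketch of the entity-extraction step (Section 2.2).
# spaCy stands in for the ANNIE/CoreNLP tooling used in the paper;
# the bug-record fields ('summary', 'description', 'fixer', 'reported')
# are illustrative assumptions.
from collections import defaultdict
from datetime import date

import spacy

nlp = spacy.load("en_core_web_sm")  # small English model: POS tags, lemmas

def extract_nouns(text):
    """Return lemmatised nouns, approximating the paper's filters:
    drop tokens shorter than three characters and tokens that are
    not purely alphabetic (symbols, digits)."""
    return [tok.lemma_.lower() for tok in nlp(text)
            if tok.pos_ in ("NOUN", "PROPN")
            and len(tok.lemma_) >= 3 and tok.lemma_.isalpha()]

def build_noun_index(fixed_bugs):
    """Map each developer to (noun, report date) pairs extracted from
    the summary and description of the bugs they fixed."""
    index = defaultdict(list)
    for bug in fixed_bugs:
        text = bug["summary"] + " " + bug["description"]
        for noun in extract_nouns(text):
            index[bug["fixer"]].append((noun, bug["reported"]))
    return index

# Illustrative usage with a made-up fixed-bug record:
bugs = [{"summary": "Startup fails to load workspace model",
         "description": "The project model is broken on startup.",
         "fixer": "jeromel", "reported": date(2004, 6, 14)}]
print(build_noun_index(bugs)["jeromel"])
```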
2.3 Term-weighting

To weight the nouns, two pieces of metadata are used: the date when the noun was used, and how often the noun is repeated by a developer relative to how often it is repeated across the entire project. Although the second piece of metadata is similar to the tf-idf technique, the distinction is that instead of categorising by document (i.e. bug report), the new technique categorises by developer. The term-weighting approach calculates the term weights as follows.

The weight of each noun (N) relative to a new bug report (B) is the combination of how frequently the developer has used the term (Freq(N, D)) and how recently the developer has used the term in a bug report (Recency(N, D, B)). As shown in (1), the frequency of use for the noun is found as a combination of how often the noun has been used in a fixed bug by the developer (Freq_bug), how often the noun has been used across all previous fixed bugs by the developer (Freq_dev) and how often the noun appears across all the fixed bug reports of the project (Freq_proj)

\[ \mathrm{Freq}(N, D) = \frac{\mathrm{Freq}_{\mathrm{bug}}(N, D) \times \mathrm{Freq}_{\mathrm{dev}}(N, D)}{\mathrm{Freq}_{\mathrm{proj}}(N)} \tag{1} \]

The date of the noun is the reporting date of the bug in which the noun appeared (Date_noun). Therefore, the recency of use (2) for each noun is determined as the inverse of the difference, in days, between the date of the new bug (Date_B) and the date of the fixed bug by the developer (D) in which the noun was used (Date_noun)

\[ \mathrm{Recency}(N, D, B) = \frac{1}{\mathrm{Date}_B - \mathrm{Date}_{\mathrm{noun}}} \tag{2} \]

Equation (3) combines the two measures into the weight of a noun that has been used in the development activities

\[ \mathrm{Weight}(N, D, B) = \mathrm{Freq}(N, D) \times \mathrm{Recency}(N, D, B) \tag{3} \]

2.4 Developer selection

Using this new weighting technique means that the weight of each extracted term is calculated relative to a new bug report. In other words, the same term appearing in two different new bug reports may have different weights when a developer recommendation is made. This is because the developer who fixed the more recent bug report will have a higher probability of being active on the relevant part of the project. The nouns extracted from the summary and description of the previously fixed bug reports are used to determine the expertise of the developers. Having determined the weights of the nouns, the expertise of each developer for fixing a new bug report is calculated. The expertise (Expertise_D) of each developer (D) is the sum of the weights of all nouns in common between the nouns (N) appearing in the new bug report and the nouns that appeared in the bugs previously fixed by the developer (4)

\[ \mathrm{Expertise}_D(B) = \sum_{N \,\in\, \mathrm{Nouns}(B) \,\cap\, \mathrm{Nouns}(D)} \mathrm{Weight}(N, D, B) \tag{4} \]

Finally, the developers are ranked based on their calculated expertise, and the top n entries form the list of recommendations.
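The weighting and ranking can be sketched in a few lines. This is a minimal reconstruction of (1)-(4), under the reading that the frequency factors combine multiplicatively and that recency is taken against the developer's most recent use of each noun; the function and field names are illustrative. With this reading, the sketch reproduces the TNBA expertise of developer 'jeromel' in Table 1 below.

```python
# A minimal sketch of the TNBA term-weighting, eqs. (1)-(4).
# Field names are assumptions; each record describes one noun shared
# between the new bug report and a developer's fixed bugs.
from datetime import date

def noun_weight(freq_bug, freq_dev, freq_proj, date_new_bug, date_noun):
    freq = freq_bug * freq_dev / freq_proj           # eq. (1)
    recency = 1.0 / (date_new_bug - date_noun).days  # eq. (2)
    return freq * recency                            # eq. (3)

def expertise(common_nouns, date_new_bug):
    # eq. (4): sum the weights over the shared nouns
    return sum(noun_weight(n["freq_bug"], n["freq_dev"], n["freq_proj"],
                           date_new_bug, n["date_noun"])
               for n in common_nouns)

# Developer 'jeromel' on bug #68148 (Eclipse), data from Table 1:
new_bug = date(2004, 6, 22)
jeromel = [
    {"freq_bug": 2, "freq_dev": 4, "freq_proj": 87,
     "date_noun": date(2004, 6, 3)},   # 'model'
    {"freq_bug": 2, "freq_dev": 8, "freq_proj": 942,
     "date_noun": date(2004, 6, 14)},  # 'project'
    {"freq_bug": 1, "freq_dev": 1, "freq_proj": 42,
     "date_noun": date(2004, 6, 14)},  # 'startup'
]
print(round(expertise(jeromel, new_bug), 4))  # 0.0099, as in Table 1
```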
To further clarify how developer expertise is determined using TNBA compared to previous methods (i.e. using both the 'Freq of use' and 'Recency of use' measures against using only the statistical computation, 'Freq of use'), bug #68148 of the Eclipse project, fixed by developer 'jeromel', is used. Table 1 presents a subset of the terms in common between the activities of three developers and the bug report. The table shows that using only the statistical information ('Freq of use') can result in choosing an incorrect developer: either the developer with the most common terms ('akiezun') or the developer with the highest term frequency ('kjohnson'). By also considering how recently the terms were used, the developer who most recently worked on the related subjects (see the Date_noun column) can be identified, resulting in a more accurate recommendation ('jeromel', the actual fixer). This real-world example shows how considering time positively affects the expertise-determination process and gives a higher value to the correct developer for fixing the bug.

Table 1. Sample of determining the expertise of developers for fixing bug #68148 of the Eclipse project

| Developer | Noun | Date_B | Date_noun | Date difference (days) | Recency of use | Freq_bug | Freq_dev | Freq_proj | Freq of use | TNBA |
|---|---|---|---|---|---|---|---|---|---|---|
| akiezun | container | 2004-06-22 | 2002-06-04 | 749 | 0.0013 | 2 | 4 | 210 | 0.0381 | 5.086E-05 |
| | persist | 2004-06-22 | 2002-05-03 | 781 | 0.0013 | 1 | 1 | 7 | 0.1429 | 1.829E-04 |
| | project | 2004-06-22 | 2003-08-08 | 319 | 0.0031 | 1 | 60 | 942 | 0.0637 | 1.997E-04 |
| | startup | 2004-06-22 | 2002-12-18 | 552 | 0.0018 | 1 | 1 | 42 | 0.0238 | 4.313E-05 |
| | total (expertise of developer) | | | | | | | | 0.2685 | 4.766E-04 |
| kjohnson | model | 2004-06-22 | 2004-02-20 | 123 | 0.0081 | 1 | 11 | 87 | 0.1264 | 0.00103 |
| | project | 2004-06-22 | 2004-05-12 | 41 | 0.0244 | 1 | 36 | 942 | 0.0382 | 9.32E-04 |
| | total (expertise of developer) | | | | | | | | 0.1647 | 0.0020 |
| jeromel | model | 2004-06-22 | 2004-06-03 | 19 | 0.0526 | 2 | 4 | 87 | 0.0920 | 0.0048 |
| | project | 2004-06-22 | 2004-06-14 | 8 | 0.1250 | 2 | 8 | 942 | 0.0170 | 0.0021 |
| | startup | 2004-06-22 | 2004-06-14 | 8 | 0.1250 | 1 | 1 | 42 | 0.0238 | 0.0030 |
| | total (expertise of developer) | | | | | | | | 0.1327 | 0.0099 |

3 Evaluation setup

In this section, the setup for empirically evaluating the proposed approach is presented.

3.1 Subject systems

The approach was evaluated using three open-source projects: the Eclipse JDT project, the ArgoUML project and the NetBeans project. The JDT project is a plug-in for the Eclipse framework that provides the tools to implement a Java integrated development environment (IDE). NetBeans is also an IDE for Java development; however, it also supports other languages, such as PHP and C/C++. ArgoUML is a popular open-source UML modelling tool. These projects were chosen for the following reasons. First, they have been used by other researchers for evaluating their bug assignment techniques [11, 18, 19]. Second, they have information resources of different scales, which allows the effect of the proposed approach to be investigated on projects of different sizes. Lastly, the projects were chosen such that they contained links between bug reports and version control commits. As previously mentioned, this choice was made to avoid any effects caused by the use of unreliable linking techniques. Some of the properties of these subject systems are presented in Table 2.

Table 2. Properties of the extracted data from the artefacts of the subject systems

| Property | Eclipse | NetBeans | ArgoUML |
|---|---|---|---|
| first commit | 2001-05-03 | 1999-02-04 | 1998-01-27 |
| last commit | 2011-12-15 | 2010-06-25 | 2012-03-14 |
| # of developers | 58 | 175 | 40 |
| # of Java files | 8308 | 1375 | 4371 |
| # of commits | 162 321 | 67 216 | 58 874 |
| avg. changes per day | 48.8 | 16.16 | 11.40 |
| # of reported bugs | 47 265 | 185 578 | 6413 |
| avg. bugs per day | 12 | 44.62 | 1.24 |
| # of fixed bugs | 21 466 | 69 651 | 2807 |
| avg. fixed bugs per day | 5.5 | 16.74 | 0.54 |

3.2 Bug samples

Each software project has different goals and requirements in different periods of its lifetime [20]. Also, requirement changes may result in the conditions of the project's development being changed [21]. Therefore, to avoid bias that may be caused by the conditions of a specific time period, 200 fixed bugs were randomly selected from all periods of the project's lifetime as the primary test set for each subject system. Although the subject systems are the same as those used in other research, the test sets are different. In this work, the test sets were selected from the bug reports that changed project Java files and whose fixing developers were successfully determined by the activity history extraction component. The selected bug reports do not conform to any other specific constraint.

In addition to this test set, two other test sets were used, containing bug reports from two different periods of the project's lifetime. The first contains the first 100 bugs reported to the project (early test set), and the second contains the last 100 bugs reported to the project (late test set) within the time frame of the extracted data (Table 2). These two test sets were used to investigate the effect of the time-metadata on the accuracy of the automatic bug assignment approach under two different conditions: the early test set allows the proposed technique to be investigated when there is little historical data, and the late test set covers the case where there is a lot of historical data. Together, these three test sets demonstrate the performance of the proposed approach, in comparison to other approaches, under various conditions of the subject systems. The IDs of the bugs used in the test sets of the projects are available online [https://www.drive.google.com/file/d/0B0sa-hXpOgiJeklmSUIyc1B2X1E/edit?usp=sharing].
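As a sketch of how these three test sets could be constructed, assuming fixed-bug records with a 'reported' date field (the record shape and helper name are assumptions):

```python
# A minimal sketch of the test-set construction in Section 3.2.
import random

def build_test_sets(fixed_bugs, seed=0):
    """fixed_bugs: records of bugs whose fixers were identified and
    that changed project Java files (see Section 3.2)."""
    by_date = sorted(fixed_bugs, key=lambda b: b["reported"])
    return {
        "random": random.Random(seed).sample(fixed_bugs, 200),
        "early": by_date[:100],   # first 100 reported bugs
        "late": by_date[-100:],   # last 100 reported bugs
    }
```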
3.3 Comparison systems

The proposed approach was evaluated in terms of both the use of time-metadata and the use of only noun terms. First, to assess the impact of the time-metadata, the performance of the term-weighting technique was evaluated with (TNBA) and without (TNBA (no-time)) the time-metadata. TNBA is also compared to tf-idf, the most common term-weighting technique, which weights terms based only on term frequency. In this case, the developers are ranked by summing the tf-idf weights of the terms for each developer, without using VSM. In all of these evaluations only noun terms are used.

To assess noun usage, two different sets of input data were used: first only the noun terms (TNBA), and then all the terms (TNBA (all-terms)). To assess the power of TNBA for bug assignment, the bug assignment literature was surveyed to select the most popular and best-performing methods. Naïve Bayes (NB) [4, 11] and the vector space model (VSM) [8, 10], the most popular techniques in ML and IR approaches, respectively, were chosen. NB and VSM were also evaluated using noun terms and all terms as input data sets, to assess the impact of noun usage in existing bug assignment methods. Furthermore, the results of TNBA are compared to VSM (time). VSM (time) uses VSM to identify the most relevant previously fixed bugs and then re-orders the obtained ranked list based on the dates when the bugs were submitted. The developers who fixed the top-ranked bugs in the re-ordered list are then recommended as fixers for the new bug. In the following subsections, brief overviews of tf-idf, VSM and NB are provided.

Tf-idf: Tf-idf determines the importance of terms based on the frequency of term appearance in the documents and the corpus. The weight of a term (t) in a document (d) is calculated using (5), where TermFreq(t, d) is the number of appearances of the term in the document. The number of documents in the corpus (N) and the number of documents in the corpus containing the term (DocFreq(t)) form the second factor of the tf-idf formula [22]

\[ w(t, d) = \mathrm{TermFreq}(t, d) \times \log \frac{N}{\mathrm{DocFreq}(t)} \tag{5} \]
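A minimal sketch of (5), assuming each document is given as a list of terms:

```python
# tf-idf as in eq. (5): w(t, d) = TermFreq(t, d) * log(N / DocFreq(t)).
import math
from collections import Counter

def tfidf_weights(doc_terms, corpus):
    """doc_terms: the terms of one document; corpus: a list of
    term-lists, one per document."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for terms in corpus:
        doc_freq.update(set(terms))  # count documents, not occurrences
    tf = Counter(doc_terms)
    return {t: tf[t] * math.log(n_docs / doc_freq[t])
            for t in tf if doc_freq[t] > 0}

corpus = [["model", "startup"], ["model", "project"], ["project"]]
print(tfidf_weights(["model", "startup", "startup"], corpus))
```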
VSM: VSM is an algebraic method that measures the similarity of the training set with a new document in three steps. First, a vector of the terms in each document is created. Next, the weights of the terms are determined using a term-weighting technique such as tf-idf [23]. Finally, the cosine similarity of each document vector against the vector of the query is calculated, and the documents are ranked by their similarity to the given query.

NB: NB is a probabilistic technique that determines the probability of an instance belonging to a specific class using Bayes' conditional probability rules. In this technique, terms that appear more frequently in one class than in the others increase the probability of associating a new instance that contains the term with that class. In NB, the probability that an instance belongs to a class is determined by multiplying the probabilities of its features, and the probability of each feature is calculated based on its frequency in the class.

3.4 Metric

For a new bug report, if the top N ranked results contain the actual developer who fixed the bug, it is counted as a correct answer (6). This metric is used to evaluate the accuracy of a bug assignment approach [18]

\[ \text{Accuracy} = \frac{\#\,\text{bug reports whose fixer appears in the top } N \text{ recommendations}}{\#\,\text{bug reports in the test set}} \tag{6} \]

3.5 Data collection and procedure

The data for evaluating the proposed approach were collected from the ITS and VCS of the subject systems. From the ITS, the fixed bugs were collected in XML format. Next, the fixed bugs were linked to the associated developers using two techniques: first, examining any patches attached to the bug report to extract the name of the developer, and second, examining the commit messages to find the bug's ID. The detection of bug IDs in commit messages was done using general patterns observed to be used by developers. Specifically, a rule-based named entity recognition (NER) method was applied using the NE transducer component of the ANNIE [http://www.aktors.org/technologies/annie/] plugin of GATE [http://www.gate.ac.uk/]. To improve the accuracy of the results obtained from NER, the values extracted from the commit messages were compared with the list of bug IDs collected from the ITS. Next, the commit date was compared to the creation and resolution dates of the bug report. If the date of the commit is after the creation date of the bug and before its resolution date, the developer IDs from the commit and the bug report are compared. If these two match, the commit is linked to the bug report and the committer is taken as the fixer of the bug. As individuals may have different user names in the VCS and the ITS, a map was created by hand to associate user names between the two systems. After linking the bugs to their developers, the summary and description fields of the previously fixed bugs were input to the ANNIE plugin to extract the nouns. The nouns were further refined by removing those that are shorter than three characters, contain a symbol or start with a digit. The remaining nouns were then lemmatised using the Stanford CoreNLP API [http://www.nlp.stanford.edu/software/corenlp.shtml]. The collection of lemmatised nouns is used as the data set in the evaluation process. Because the bugs in the test sets have different reporting dates, a dynamic corpus generator was needed that creates, for each bug, a specific corpus containing only the data recorded in the artefacts before that bug's reporting date. Using this corpus generator improves the effectiveness of the evaluation process and makes the results more realistic for all of the methods.
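The evaluation procedure, the top-N accuracy of (6) computed over a per-bug, time-filtered corpus as described in Section 3.5, can be sketched as follows. Here `recommend` stands for any of the compared approaches, and the field names are assumptions:

```python
# A minimal sketch of the evaluation loop: eq. (6) combined with the
# dynamic corpus generator of Section 3.5.
def top_n_accuracy(test_bugs, fixed_bugs, recommend, n=5):
    correct = 0
    for bug in test_bugs:
        # dynamic corpus: train only on bugs reported before this one,
        # so no information from the future leaks into the ranking
        history = [b for b in fixed_bugs
                   if b["reported"] < bug["reported"]]
        ranked = recommend(bug, history)  # ranked list of developer IDs
        if bug["fixer"] in ranked[:n]:    # eq. (6): a correct answer
            correct += 1
    return correct / len(test_bugs)
```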
4 Evaluation results and analysis

This section presents the results and analysis of the evaluation of the proposed approach. Specifically, the proposed approach is assessed along the dimensions of the use of time-metadata and of noun-only terms. TNBA is further evaluated in comparison to two other bug assignment methods: VSM and NB.

4.1 Impact of time-metadata

To assess the impact of the time-metadata, the performance of TNBA was compared to TNBA (no-time), tf-idf and VSM (time). Since TNBA uses the noun terms for term-weighting, the noun terms are given to all the methods in this section to make a fair comparison of the time consideration. Table 3 and Fig. 2 present the accuracy results of TNBA, TNBA (no-time), tf-idf and VSM (time) on the subject systems. TNBA outperforms TNBA (no-time), tf-idf and VSM (time) by as much as 14, 12 and 48%, respectively, on the subject systems. These results show that considering how recently a term was used, in addition to its frequency of use, improves the accuracy of term-weighting for bug assignment. Unlike TNBA, which considers the time distance between the reporting time of the new bug and the times of all the relevant activities of a developer, VSM (time) takes into account only the time at which one of the bug-fixing activities occurred. Therefore, although time is considered in VSM (time), it was found to perform the worst of the examined approaches for developer recommendation.

Fig. 2 Accuracy of VSM (time), tf-idf, TNBA (no-time) and TNBA

Table 3. Comparison of TNBA, TNBA (no-time), tf-idf and VSM (time) based on the random test sets

| Project | # Recommendations | TNBA, % | TNBA (no-time), % | tf-idf, % | VSM (time), % |
|---|---|---|---|---|---|
| Eclipse | top1 | 27 | 18.5 | 21 | 13.5 |
| | top5 | 67.5 | 53.5 | 60 | 30.5 |
| NetBeans | top1 | 51 | 30.5 | 30.5 | 17.5 |
| | top5 | 89 | 75.5 | 77.5 | 43 |
| ArgoUML | top1 | 43 | 31.5 | 32 | 13.5 |
| | top5 | 79 | 73 | 72.5 | 31.5 |

4.2 Impact of noun-only terms

To evaluate the impact of using only nouns, two different data sets were used: one containing only the nouns (TNBA) and one containing all the terms (TNBA (all-terms)) from the bug report summary and description. Table 4 and Fig. 3 present the accuracy results of TNBA and TNBA (all-terms) on all the subject systems. These results indicate that, despite the reduced data set size for each subject system, using only nouns performs as well as, or better than, using all of the terms. In other words, using nouns not only provides enough data to identify the developer correctly, but can also improve the bug assignment accuracy by up to 9%.

Fig. 3 Accuracy of TNBA and TNBA (all-terms) using the random test set

Table 4. Comparison of the approach using noun-only and all-terms data sets based on the random test set

| Project | # Recommendations | TNBA, % | TNBA (all-terms), % |
|---|---|---|---|
| Eclipse | top1 | 27 | 22 |
| | top5 | 67.5 | 66 |
| NetBeans | top1 | 51 | 47 |
| | top5 | 89 | 80 |
| ArgoUML | top1 | 43 | 40.5 |
| | top5 | 79 | 77 |

Furthermore, as shown in Table 5, the data set sizes for only nouns are 51, 45 and 43% smaller for the Eclipse, NetBeans and ArgoUML projects, respectively. Therefore, as one would expect, using only nouns significantly reduces the size of the data set that needs to be analysed, without sacrificing accuracy.

Table 5. Data set sizes for evaluating the approaches in the all-terms and noun-based cases

| Project | Case | Data set size |
|---|---|---|
| NetBeans | noun-based | 15 551 |
| | all-terms | 28 631 |
| Eclipse | noun-based | 158 241 |
| | all-terms | 325 744 |
| ArgoUML | noun-based | 7911 |
| | all-terms | 13 971 |

To investigate the impact of using only nouns in other bug assignment methods, both the VSM and NB approaches were evaluated using either only nouns or all the terms.
As Table 6 shows, the results lead to the same conclusion as for TNBA: using only nouns does as well as, if not better than, using all the terms, while significantly reducing the data set size.

Table 6. Comparison of VSM and NB using noun-only and all-terms data sets based on the random test set

| Project | # Recommendations | VSM, % | VSM (noun), % | Naïve Bayes, % | Naïve Bayes (noun), % |
|---|---|---|---|---|---|
| Eclipse | top1 | 5.5 | 8.5 | 13 | 17 |
| | top5 | 26.5 | 32.5 | 40.5 | 46.5 |
| NetBeans | top1 | 17.5 | 29.5 | 9 | 12 |
| | top5 | 62.5 | 78 | 44.5 | 39 |
| ArgoUML | top1 | 22.5 | 21 | 20 | 17.5 |
| | top5 | 69 | 67 | 56.5 | 55 |

The results for ArgoUML show that, for these approaches, using all of the terms results in slightly better performance. This may be a consequence of the project's development process. Possibly, ArgoUML developers use more unique terms from the other parts of speech (e.g. verbs and adjectives). This would result in higher IDF values for the non-noun terms and a more accurate developer recommendation approach than one using only the noun terms.

4.3 Comparison to other methods

In this section, the performance of TNBA is compared to NB and VSM, the most popular bug assignment approaches among ML and IR techniques, respectively. Recall from Section 3.2 that three different test sets are used for evaluation: early, late and random. These test sets are used to evaluate the TNBA, VSM and NB approaches to ensure an unbiased comparison. Figs. 4-6 show the accuracy results of TNBA, VSM and NB for the early, late and random test sets, respectively. As shown in these figures, and also in Table 7, TNBA has higher accuracy than VSM and NB in nearly all cases.

Fig. 4 Accuracy of Naïve Bayes, VSM and TNBA using the early test set

Fig. 5 Accuracy of Naïve Bayes, VSM and TNBA using the late test set

Fig. 6 Accuracy of Naïve Bayes, VSM and TNBA using the random test set

Table 7. Comparison of TNBA, NB and VSM using the early, random and late test sets

| Project | Test set | # Recommendation | TNBA | NB | VSM |
|---|---|---|---|---|---|
| Eclipse | early | top1 | 0% | 0% | 0% |
