Peer-reviewed article

CPA: Accurate Cross-Platform Binary Authorship Characterization Using LDA

2020; Institute of Electrical and Electronics Engineers; Volume: 15; Language: English

10.1109/tifs.2020.2980190

ISSN

1556-6021

Authors

Saed Alrabaee, Mourad Debbabi, Lingyu Wang

Topic(s)

Topic Modeling

Abstract

Binary authorship characterization refers to the process of identifying stylistic characteristics that are related to the author of an anonymous binary code. The aim is to automate the laborious and error-prone reverse engineering task of discovering information related to the author(s) of binary code. This paper presents CPA, a novel approach for characterizing the authors of program binaries. Instead of using generic features such as n-grams, CPA proposes a set of new features based on collections of various aspects of author style, including author code traits, code structure characteristics, and author expertise in solving coding tasks. It employs the Latent Dirichlet Allocation (LDA) algorithm to generate author style signatures, which help identify similar author style characteristics in other binaries. We evaluated CPA on large datasets extracted from selected open-source C/C++ projects on GitHub and from Google Code Jam events. It successfully attributed a large number of authors, with an F1 score of around 91% when the number of authors was 1,500. In addition, the false positive rate was low, around 1.5%. When the code was refactored, transformed, or compiled with different compilers or compilation settings, there was no significant drop in accuracy, demonstrating the robustness of our tool. Finally, in the case of code written by multiple authors, CPA was able to identify the authors with a high F1 score, around 89%.
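
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how an F1 score and a false positive rate are typically computed from confusion counts. The abstract does not say how the scores were aggregated across authors, so the one-vs-rest counts for a single author used here are an assumption, and the numbers are hypothetical values chosen for illustration, not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of attributed binaries that were attributed correctly
    recall = tp / (tp + fn)     # share of the author's binaries that were found
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of binaries by other authors that were wrongly attributed to this author."""
    total_negatives = fp + tn
    return fp / total_negatives if total_negatives else 0.0


# Hypothetical one-vs-rest counts for a single author (illustrative only).
tp, fp, fn, tn = 90, 10, 8, 892
print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"FPR = {false_positive_rate(fp, tn):.3f}")
```

Averaging such per-author values over all authors (macro-averaging) is one common way to report a single figure for a multi-author attribution task; whether CPA was evaluated this way is not stated in the abstract.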

Reference(s)