Artigo Produção Nacional Revisado por pares

Understanding uses and misuses of similarity hashing functions for malware detection and family clustering in actual scenarios

2021; Elsevier BV; Volume: 38; Linguagem: Inglês

10.1016/j.fsidi.2021.301220

ISSN

2666-2825

Autores

Marcus Botacin, Vitor Hugo Galhardo Moia, Fabrício Ceschin, Marco Aurélio Amaral Henriques, André Grégio,

Tópico(s)

Network Security and Intrusion Detection

Resumo

An everyday growing number of malware variants target end-users and organizations. To reduce the amount of individual malware handling, security analysts apply techniques for finding similarities to cluster samples. A popular clustering method relies on similarity hashing functions, which create short representations of files and compare them to produce a score related to the similarity level between them. Despite the popularity of those functions, the limits of their application to malware samples have not been extensively studied so-far. To help in bridging this gap, we performed a set of experiments to characterize the application of these functions on long-term, realistic malware analysis scenarios. To do so, we introduce SHAVE, an ideal model of similarity hashing-based antivirus engine. The evaluation of SHAVE consisted of applying two distinct hash functions (ssdeep and sdhash) to a dataset of 21 thousand actual malware samples collected over four years. We characterized this dataset based on the performed clustering, and discovered that: (i) smaller groups are prevalent than large ones; (ii) the threshold value chosen may significantly change the conclusions about the prevalence of similar samples in a given dataset; (iii) establishing a ground-truth for similarity hashing functions comparison has its issues, since the clusters originated from traditional AV labeling routines may result from a completely distinct approach; (iv) the application of similarity hashing functions improves traditional AVs’ detection rates by up to 40%; and finally (v) taking specific binary regions into account (e.g., instructions), leads to better classification results than hashing the entire binary file.

Referência(s)