Peer-reviewed article

Exsense: Extract sensitive information from unstructured data

2020; Elsevier BV; Volume: 102; Language: English

10.1016/j.cose.2020.102156

ISSN

1872-6208

Authors

Yongyan Guo, Jiayong Liu, Wenwu Tang, Cheng Huang

Topic(s)

Network Security and Intrusion Detection

Abstract

Large-scale leaks of sensitive information have been reported frequently in recent years, and a leak can have serious consequences. Sensitive information leakage has therefore long been a question of great interest in the field of cybersecurity. However, most sensitive information resides in unstructured data, so extracting it from voluminous unstructured data has become one of the greatest challenges. To address this challenge, we propose ExSense, a method for extracting sensitive information from unstructured data that combines content-based and context-based extraction mechanisms. On the one hand, the method uses regular expression matching to extract sensitive information with predictable patterns. On the other hand, we build a model named BERT-BiLSTM-Attention to extract sensitive information using natural language processing. This model uses BERT for word embedding and extracts sensitive information with a BiLSTM and an attention mechanism, achieving an F1 score of 99.15%. Experimental results on real-world datasets show that ExSense achieves a higher detection rate than either method used on its own (i.e., content analysis or context analysis alone). In addition, we analyze about a million texts from Pastebin, and the results show that ExSense can extract sensitive information from unstructured data effectively.
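The content-based half of the approach, matching sensitive items that follow predictable patterns, can be illustrated with a minimal Python sketch. The abstract does not list the rules ExSense actually uses, so the three patterns below (email address, IPv4 address, payment card number), the Luhn checksum filter, and the names `PATTERNS`, `luhn_valid`, and `extract_by_pattern` are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for illustration only; the rule set used by
# ExSense is not given in the abstract.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    # 13 to 19 digits, optionally separated by spaces or hyphens.
    "card_number": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def extract_by_pattern(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for every pattern hit in `text`."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(0).strip()
            # Digit runs are common in ordinary text, so keep only
            # candidates that also pass the checksum.
            if label == "card_number" and not luhn_valid(value):
                continue
            findings.append((label, value))
    return findings


if __name__ == "__main__":
    sample = "Contact admin@example.com from 192.168.1.10, card 4111 1111 1111 1111."
    for label, value in extract_by_pattern(sample):
        print(label, value)
```

Pattern matching of this kind only covers items with a fixed surface form. Sensitive content that is expressed in free text has no such form, which is the gap the context-based BERT-BiLSTM-Attention model described in the abstract is meant to fill.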
