Article, open access, peer-reviewed

Classifying the content of online notepad services using active learning

2024; Springer Science+Business Media; Language: English

10.1007/s10844-024-00902-8

ISSN

1573-7675

Authors

Mhd Wesam Al-Nabki, Eduardo Fidalgo, Enrique Alegre, Sarah Jane Delany, Francisco Jáñez-Martino

Topic(s)

Text and Document Classification Technologies

Abstract

Pastebin is an online notepad service for sharing text anonymously. However, it can be misused to propagate suspicious or even illegal activities, such as leaking sensitive information or sharing hyperlinks to child sexual abuse material. Because of the high volume of pastes uploaded every day, manual inspection of this material is not feasible. An automatic classifier, by contrast, could identify such activities with little or no human intervention. However, a supervised model may require a significant number of training samples and has to handle the distinct text typologies present in Pastebin. This paper presents a classification approach composed of three cascading supervised classifiers that use Active Learning to select and label the most informative samples from Pastebin. The modularity of the proposed design allows each classifier to adapt to a specific text typology. The first classifier determines whether the text is a code snippet, and the second identifies whether it is readable. The third classification level is twofold: (i) a binary classifier that decides whether the text is suspicious and (ii) a multiclass classifier with seven predefined categories of possibly illegal activities. The average class recall of the binary and multiclass classifiers is 95.24% and 80.33%, respectively. Additionally, this paper presents a dataset of 3.8 million Pastebin samples, called onlIne Notepad Services PastEbin aCtiviTies (INSPECT-3.8M), along with the labels assigned by our classification framework. Our classifier recognised that 7.54% of the collected samples are correlated with presumably criminal activities. Law enforcement agencies may benefit from the insights shared in our research when aiming to investigate or automate the monitoring of Pastebin or other Online Notepad Services. This would allow responsible authorities to block illegal content before it spreads to the public.
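
The cascade and the sample-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three toy scoring functions, the 0.5 decision threshold, and the choice of least-confidence uncertainty sampling as the Active Learning query strategy are all assumptions made for illustration, since the abstract does not specify them. Only the binary branch of the third level is shown; the seven-category multiclass classifier is omitted.

```python
from typing import Callable, List

# A stage maps a text to the probability of its positive class.
# In practice each stage would be a trained model; these are hypothetical stand-ins.
ProbFn = Callable[[str], float]


def toy_is_code(text: str) -> float:
    """Hypothetical stage 1: score rises with the count of code-like symbols."""
    symbols = sum(text.count(c) for c in "{}();=")
    return min(1.0, symbols / 4)


def toy_is_readable(text: str) -> float:
    """Hypothetical stage 2: fraction of characters that are letters or spaces."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch == " " for ch in text) / len(text)


def toy_is_suspicious(text: str) -> float:
    """Hypothetical stage 3 (binary branch): crude keyword check."""
    lowered = text.lower()
    return 0.9 if ("password" in lowered or "leak" in lowered) else 0.1


def cascade(text: str, is_code: ProbFn, is_readable: ProbFn,
            is_suspicious: ProbFn, threshold: float = 0.5) -> str:
    """Route a paste through the three stages, stopping at the first exit."""
    if is_code(text) >= threshold:
        return "code"
    if is_readable(text) < threshold:
        return "unreadable"
    return "suspicious" if is_suspicious(text) >= threshold else "benign"


def select_most_uncertain(pool: List[str], prob_fn: ProbFn, k: int) -> List[str]:
    """Uncertainty sampling: return the k texts whose score is closest to 0.5.

    These are the samples a human annotator would be asked to label next.
    """
    return sorted(pool, key=lambda t: abs(prob_fn(t) - 0.5))[:k]
```

Because each stage is passed in as a function, any one of them can be retrained or replaced without touching the others, which mirrors the modularity the abstract attributes to the design. In an Active Learning loop, `select_most_uncertain` would be called on the unlabelled pool after each retraining round, and the returned samples sent for annotation.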
