Granularity-Based Assessment of Similarity Between Short Text Strings

Book chapter, peer reviewed

2019; Springer Science+Business Media; Language: English

10.1007/978-981-13-7091-5_9

ISSN

1876-1119

Authors

Harpreet Kaur, Raman Maini

Topic(s)

Advanced Malware Detection Techniques

Abstract

The ability to measure the similarity between two text collections, or within a single collection, has many applications, including plagiarism detection and identifying reused text (strings) in a database so that duplicates can be removed. Past structure-metric methodologies have used either suffix trees or variants of longest common subsequence algorithms to recognize duplicate text. In this paper, several string distance metrics are investigated for detecting matching strings: Levenshtein Distance (L. Dist.), Cosine Similarity (C.S.), and Hamming Distance (H. Dist.), together with ASCII-based hashing applied to token sequences. Similarity index techniques vary in granularity: some work at the character level, some at the word level, and some at the corpus level. A benefit of the approaches evaluated is that they can handle multiple patterns for similarity at a time. The work has been carried out on strings. The simulation shows that ASCII-based hashing performs better than the other techniques in terms of both running time and accuracy. The other techniques share one problem: similarity search time grows linearly with database size. ASCII-based hashing handles this scalability issue efficiently.
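The four techniques named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the tokenization, the vector weighting used for cosine similarity, or the exact hashing scheme. The sketch assumes whitespace tokens, raw term counts, and a hypothetical hash that sums the character codes of each token; the function names (ascii_hash, hashed_token_overlap, and so on) are illustrative only.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity over raw term counts (assumed weighting)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ascii_hash(token: str) -> int:
    """Hypothetical ASCII-based hash: the sum of the token's character codes."""
    return sum(ord(ch) for ch in token)


def hashed_token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two strings' token-hash sets (assumed measure)."""
    ha = {ascii_hash(t) for t in a.lower().split()}
    hb = {ascii_hash(t) for t in b.lower().split()}
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(hamming("karolin", "kathrin"))
    print(cosine_similarity("the cat sat", "the cat ran"))
    print(hashed_token_overlap("the cat", "the dog"))
```

In this sketch the speed advantage of hashing comes from comparing integers through set lookups rather than running a pairwise character-level comparison against every stored string. A plain character-code sum is order-insensitive, so anagrams such as "cat" and "act" collide; a real system would need to account for that, and the chapter itself should be consulted for the scheme actually used.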
