Granularity-Based Assessment of Similarity Between Short Text Strings
2019; Springer Science+Business Media; Language: English
10.1007/978-981-13-7091-5_9
ISSN 1876-1119
Authors
Topic(s): Advanced Malware Detection Techniques
Abstract: The ability to measure the similarity between two bodies of text, or within a single body of text, has many uses, including plagiarism detection and the identification of reused text (strings) in a database so that duplicates can be removed. Past structure-metric approaches have used either suffix trees or variants of longest common subsequence algorithms to recognize duplicate text. In this paper, several string distance metrics are investigated: Levenshtein Distance (L. Dist.), Cosine Similarity (C.S.), and Hamming Distance (H. Dist.). Hashes (ASCII-based hashing) on token sequences are also used to detect matching strings. Similarity index techniques differ in granularity: some work at the character level, some at the word level, and some at the corpus level. The benefit of the approaches evaluated is that they can handle multiple patterns for similarity at a time. The work has been carried out on strings. The simulation shows that ASCII-based hashing performs better than the other techniques in terms of running time and accuracy. For the other techniques, similarity search time increases linearly with database size, whereas ASCII-based hashing handles this issue efficiently and therefore scales well.
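The measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the ASCII-based hash is computed, so `ascii_hash` below is a hypothetical polynomial hash over character codes, and `build_index` and `find_matches` are illustrative helper names. Cosine similarity is assumed to operate on word counts (word-level granularity), while Levenshtein and Hamming distance operate on characters.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity over term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def ascii_hash(token: str) -> int:
    """Hypothetical hash built from character codes (an assumption)."""
    h = 0
    for ch in token:
        h = (h * 31 + ord(ch)) % 1_000_003
    return h


def token_key(s: str) -> tuple:
    """Hash every token of a string; the tuple serves as a lookup key."""
    return tuple(ascii_hash(t) for t in s.lower().split())


def build_index(strings):
    """Map each token-hash key to the positions of strings that have it."""
    index = {}
    for pos, s in enumerate(strings):
        index.setdefault(token_key(s), []).append(pos)
    return index


def find_matches(index, query: str):
    """Dictionary lookup: cost does not grow with a per-string scan."""
    return index.get(token_key(query), [])


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))   # 3
    print(hamming("karolin", "kathrin"))      # 3
    print(cosine_similarity("the cat sat", "the cat ran"))
    idx = build_index(["hello world", "foo bar", "Hello World"])
    print(find_matches(idx, "hello world"))   # [0, 2]
```

The sketch also suggests why a hash index might scale differently from the pairwise metrics: comparing a query against every stored string with `levenshtein` or `cosine_similarity` costs one comparison per database entry, whereas `find_matches` is a single dictionary lookup once the index is built. Hash collisions are possible under this assumed scheme, so matches are best treated as candidates to be confirmed by direct comparison.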