
Sequence Labeling Algorithms for Punctuation Restoration in Brazilian Portuguese Texts
2022; Springer Science+Business Media; Language: English
DOI: 10.1007/978-3-031-21689-3_43
ISSN: 1611-3349
Authors: Tiago Barbosa de Lima, Péricles Miranda, Rafael Ferreira Mello, Moésio Wenceslau da Silva Filho, Ig Ibert Bittencourt, Thiago Cordeiro, Jário José
Topic(s): Multimodal Machine Learning Applications
Abstract: Punctuation Restoration is an essential post-processing task for text generation methods such as Speech-to-Text (STT) and Machine Translation (MT). The generation models employed in those tasks usually produce unpunctuated text, which is difficult for human readers and can degrade the performance of many downstream text processing tasks. Thus, many techniques exist to restore the text’s punctuation. For instance, approaches based on Conditional Random Fields (CRF) and pre-trained models, such as Bidirectional Encoder Representations from Transformers (BERT), have been widely applied. In the last few years, however, one approach has gained significant attention: casting Punctuation Restoration as a sequence labeling task. In sequence labeling, each punctuation symbol becomes a label (e.g., COMMA, QUESTION, and PERIOD) that sequence tagging models can predict. This approach has achieved competitive results against state-of-the-art punctuation restoration algorithms. However, most research focuses on English, with little discussion of other languages, such as Brazilian Portuguese. Therefore, this paper conducts an experimental analysis comparing a Bidirectional Long Short-Term Memory (BiLSTM) + CRF model and BERT for predicting punctuation in Brazilian Portuguese. We evaluate these approaches on the IWSLT 2012-03 and OBRAS datasets in terms of precision, recall, and F1-score. The results showed that BERT achieved competitive punctuation prediction performance, but it requires far more GPU resources for training than the BiLSTM + CRF algorithm.
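
The sequence labeling cast described in the abstract can be illustrated with a short sketch. The Python snippet below is a hypothetical illustration, not code from the paper: it converts punctuated text into (token, label) pairs in which each word carries the label of the punctuation mark that follows it (COMMA, PERIOD, QUESTION) or O when none follows, the kind of input a tagger such as BiLSTM + CRF or a fine-tuned BERT would be trained on.

    # Minimal sketch (assumed, not the authors' code): cast punctuated text into
    # (token, label) pairs for sequence labeling. Label names mirror the scheme
    # mentioned in the abstract (COMMA, QUESTION, PERIOD), plus O for "no punctuation".
    import re

    PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

    def to_labeled_sequence(text: str) -> list[tuple[str, str]]:
        """Return (word, label) pairs, labeling each word with the punctuation that follows it."""
        tokens = re.findall(r"\w+|[,.?]", text)
        pairs = []
        for i, tok in enumerate(tokens):
            if tok in PUNCT_TO_LABEL:  # punctuation marks become labels, not tokens
                continue
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            pairs.append((tok, PUNCT_TO_LABEL.get(nxt, "O")))
        return pairs

    # Example (Brazilian Portuguese):
    # to_labeled_sequence("Olá, tudo bem? Estou aqui.")
    # -> [("Olá", "COMMA"), ("tudo", "O"), ("bem", "QUESTION"), ("Estou", "O"), ("aqui", "PERIOD")]

At inference time the tagger predicts one of these labels per word, and the predicted punctuation marks are reinserted after the corresponding words to reconstruct the punctuated text.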