Peer-reviewed article

End-to-end neural network modeling for Japanese speech recognition

2016; Acoustical Society of America; Volume: 140; Issue: 4_Supplement; Language: English

10.1121/1.4969755

ISSN

1520-9024

Authors

Hitoshi Ito, Aiko Hagiwara, Manon Ichiki, Tomokazu Mishima, Shoei Sato, Akio Kobayashi,

Topic(s)

Speech and Audio Processing

Abstract

This study proposes end-to-end neural network modeling to adapt direct speech-to-text decoding to Japanese. End-to-end speech recognition systems using deep neural networks (DNNs) are currently being investigated. These systems do not need an intermediate phonetic representation. Instead, many of them utilize recurrent neural networks (RNNs) trained on far more data than ever before. The end-to-end approach makes acoustic models simpler to train. Previous works have typically dealt with phonogram labels such as alphabetic characters. Ideograms such as Kanji, however, make end-to-end speech recognition more complex. A single Kanji can have multiple readings, such as On-yomi (Chinese reading) and Kun-yomi (Japanese reading). In addition, whereas alphabetic systems have at most around 100 labels, Japanese has over 2000 labels to predict, including Kanji, Hiragana, Katakana, the Roman alphabet, digits, and punctuation marks. To resolve this problem, we develop end-to-end neural network modeling that allows speech recognition of Japanese without a phonetic representation. The method trains an RNN with the connectionist temporal classification (CTC) objective function. The proposed method was able to deal with a large number of character labels. We also analyzed the decoding results and examined ideas for improving recognition accuracy in terms of word error rate (WER).
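To make the described setup concrete, the following is a minimal sketch of CTC training for a character-level Japanese acoustic model, not the authors' implementation. It assumes PyTorch, filterbank input features, an illustrative vocabulary of roughly 2,000 character labels plus a CTC blank symbol, and arbitrary layer sizes; all names and hyperparameters here are hypothetical.

import torch
import torch.nn as nn

# Assumed label inventory: ~2,000 Japanese characters (Kanji, Hiragana, Katakana,
# Roman letters, digits, punctuation) plus one CTC blank symbol at index 0.
NUM_LABELS = 2000   # assumed vocabulary size, not taken from the paper
BLANK = 0           # CTC blank index
FEAT_DIM = 40       # assumed acoustic feature dimension (e.g., log filterbanks)
HIDDEN = 320        # assumed RNN width

class CharCTCModel(nn.Module):
    """Bidirectional LSTM acoustic model emitting per-frame character posteriors."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HIDDEN, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * HIDDEN, NUM_LABELS + 1)  # +1 for the blank label

    def forward(self, feats):                   # feats: (batch, frames, FEAT_DIM)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(-1)   # (batch, frames, NUM_LABELS + 1)

model = CharCTCModel()
ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# Toy batch: 2 utterances of 100 frames, character targets of length 8 and 5.
feats = torch.randn(2, 100, FEAT_DIM)
targets = torch.randint(1, NUM_LABELS + 1, (13,))   # concatenated character IDs
input_lengths = torch.tensor([100, 100])
target_lengths = torch.tensor([8, 5])

# nn.CTCLoss expects (frames, batch, labels), so transpose time and batch dims.
log_probs = model(feats).transpose(0, 1)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()

At decoding time, the per-frame label posteriors would be collapsed (e.g., by greedy or beam search with blank and repeat removal) into a character sequence, so no pronunciation lexicon or phonetic representation is required.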
