End-to-end neural network modeling for Japanese speech recognition
2016; Acoustical Society of America; Volume: 140; Issue: 4_Supplement; Language: English
DOI: 10.1121/1.4969755
ISSN: 1520-9024
Authors: Hitoshi Ito, Aiko Hagiwara, Manon Ichiki, Tomokazu Mishima, Shoei Sato, Akio Kobayashi,
Topic(s): Speech and Audio Processing
Abstract: This study proposes end-to-end neural network modeling to adapt direct speech-to-text decoding to Japanese. End-to-end speech recognition systems using deep neural networks (DNNs) are currently being investigated. These systems require no intermediate phonetic representation; instead, many of them use recurrent neural networks (RNNs) trained on far more data than before. The end-to-end approach makes acoustic models simpler to train. Previous work has typically dealt with phonogram labels such as alphabetic characters. Ideograms such as Kanji, however, make end-to-end speech recognition more complex: a single Kanji can have multiple readings, such as On-yomi (Chinese-derived reading) and Kun-yomi (native Japanese reading). In addition, whereas alphabetic scripts have at most about 100 labels, Japanese has over 2,000 labels to predict, including Kanji, Hiragana, Katakana, Roman letters, digits, and punctuation marks. To resolve this problem, we develop end-to-end neural network modeling that allows Japanese speech recognition without phonetic representation. The method trains an RNN with the connectionist temporal classification (CTC) objective function and is able to handle this large set of character labels. We also analyzed the decoding results and examined ideas for reducing the word error rate (WER).
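As a concrete illustration of the kind of model the abstract describes, the sketch below shows a character-level acoustic model trained with the CTC objective. It is not the authors' implementation: the acoustic feature dimensionality, the bidirectional-LSTM layer sizes, and the label inventory of roughly 2,000 Japanese characters plus a CTC blank are assumptions made for illustration only.

```python
# Minimal sketch (assumed architecture, not the paper's system): a
# bidirectional-LSTM acoustic model over frame-level features, projected
# onto ~2,000 Japanese character labels plus the CTC blank, trained with
# the connectionist temporal classification (CTC) loss.
import torch
import torch.nn as nn

class CharCTCModel(nn.Module):
    def __init__(self, n_feats=40, hidden=320, n_layers=4, n_labels=2000):
        super().__init__()
        # Bidirectional LSTM over acoustic feature frames (e.g., filterbanks).
        self.rnn = nn.LSTM(n_feats, hidden, num_layers=n_layers,
                           bidirectional=True, batch_first=True)
        # Project to label posteriors; index 0 is reserved for the CTC blank.
        self.proj = nn.Linear(2 * hidden, n_labels + 1)

    def forward(self, feats):                  # feats: (batch, time, n_feats)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(-1)  # (batch, time, n_labels + 1)

model = CharCTCModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: 2 utterances of 100 frames each, character targets of length 20.
feats = torch.randn(2, 100, 40)
targets = torch.randint(1, 2001, (2, 20))       # character indices (0 = blank)
log_probs = model(feats).transpose(0, 1)        # CTCLoss expects (time, batch, labels)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()
```

Because CTC predicts characters directly, the decoder needs no pronunciation lexicon mapping Kanji to readings; the large output layer is the price paid for skipping the phonetic representation.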