Learning Simple Statistics for Language Comprehension and Production: The CAPPUCCINO Model
2011; Wiley; Volume: 33; Issue: 33; Language: English
ISSN: 1551-6709
Authors: Stewart M. McCauley, Morten H. Christiansen
Topic(s): Language and cultural evolution
Stewart M. McCauley (smm424@cornell.edu), Morten H. Christiansen (christiansen@cornell.edu)
Department of Psychology, Cornell University, Ithaca, NY 14853 USA

Abstract

Whether the input available to children is sufficient to explain their ability to use language has been the subject of much theoretical debate in cognitive science. Here, we present a simple, developmentally motivated computational model that learns to comprehend and produce language when exposed to child-directed speech. The model uses backward transitional probabilities to create an inventory of 'chunks' consisting of one or more words. Language comprehension is approximated in terms of shallow parsing of adult speech, and production as the reconstruction of the child's actual utterances. The model functions in a fully incremental, on-line fashion, has broad cross-linguistic coverage, and is able to fit child data from Saffran's (2002) statistical learning study. Moreover, word-based distributional information is found to be more useful than statistics over word classes. Together, these results suggest that much of children's early linguistic behavior can be accounted for in a usage-based manner using distributional statistics.

Keywords: Language Learning; Computational Modeling; Corpora; Chunking; Shallow Parsing; Usage-Based Approach

Introduction

The ability to produce and understand a seemingly unbounded number of different utterances has long been hailed as a hallmark of human language acquisition. But how is such open-endedness possible, given the much more limited nature of other animal communication systems? And how can a child acquire such productivity, given input that is both noisy and necessarily finite in nature?
For nearly half a century, generativists have argued that human linguistic productivity can only be explained by positing a system of abstract grammatical rules working over word classes and scaffolded by considerable innate language-specific knowledge (e.g., Pinker, 1999). Recently, however, an alternative theoretical perspective on linguistic productivity has emerged in the form of usage-based approaches to language (e.g., Tomasello, 2003). This perspective is motivated by analyses of child-directed speech showing that there is considerably more information available in the input than previously assumed. For example, distributional and phonological information can provide reliable cues for learning about lexical categories and phrase structure (for a review, see Monaghan & Christiansen, 2008). Behavioral studies have shown that children can use such information in an item-based manner (Tomasello, 2003).

A key difference between generative and usage-based approaches pertains to the granularity of the linguistic units necessary to account for the productivity of human language. At the heart of usage-based theory lies the idea that grammatical knowledge develops gradually through abstraction over multi-word utterances (e.g., Tomasello, 2003), which are assumed to be stored as multi-word 'chunks.' Testing this latter assumption, Bannard and Matthews (2008) showed not only that non-idiomatic chunk storage takes place, but also that storing such units actively facilitates processing: young children repeated multi-word sequences faster, and with greater accuracy, when they formed a frequent chunk. Moreover, Arnon and Snider (2010) extended these results, demonstrating an adult processing advantage for frequent phrases. The existence of such chunks is problematic for generative approaches that have traditionally clung to a words-and-rules perspective, in which memory-based learning and processing are restricted to the level of individual words (e.g., Pinker, 1999).
One remaining challenge for usage-based approaches is to provide an explicit computational account of language comprehension and production based on multi-word chunks. Although Bayesian modeling has shown that chunk-based grammars are in principle sufficient for the acquisition of linguistic productivity (Bannard, Lieven, & Tomasello, 2009), no full-scale computational model has been forthcoming (though models of specific aspects of acquisition do exist, such as the optional infinitive stage; Freudenthal, Pine & Gobet, 2009). The scope of the computational challenge facing usage-based approaches becomes even more formidable when considering the success with which the generativist principles of words and rules have been applied in computational linguistics. In this paper, we take an initial step towards answering this challenge by presenting the ‘Comprehension And Production Performed Using Chunks Computed Incrementally, Non-categorically, and On-line’ (or CAPPUCCINO) model of language acquisition. The aim of the CAPPUCCINO model is to provide a test of the usage-based assumption that children’s language use may be explained in terms of stored chunks. To this end, the model gradually builds up an inventory of chunks consisting of one or more words—a ‘chunkatory’—used for both language comprehension and production. The model was further designed with several key psychological and computational properties in mind: a) incremental learning: at any given point in time, the model can only rely on the input seen so far (no batch learning); b) on-line processing: input is processed word-by-word as it is encountered; c) simple statistics: learning is based on computing backward transitional probabilities (which 8-month-olds can track; Pelucchi, Hay, & Saffran, 2009); d) comprehension: the
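The core mechanism described above can be illustrated in code: track backward transitional probabilities (BTPs) between adjacent words incrementally, and group words into chunks when the BTP is high, inserting a chunk boundary when it is low. The sketch below is a minimal illustration, not the paper's exact specification; in particular, the boundary criterion (comparing each BTP to the running average of all BTPs seen so far) and the class name `ChunkLearner` are assumptions introduced here for clarity.

```python
from collections import defaultdict

class ChunkLearner:
    """Minimal sketch of incremental, on-line chunking from
    backward transitional probabilities (BTPs). The running-average
    boundary criterion is an illustrative assumption."""

    def __init__(self):
        self.word_counts = defaultdict(int)   # count(w) over all input
        self.pair_counts = defaultdict(int)   # count(w1 w2) bigrams
        self.btp_sum = 0.0                    # running sum of BTPs seen
        self.btp_n = 0                        # number of BTPs seen
        self.chunkatory = defaultdict(int)    # chunk inventory -> frequency

    def btp(self, w1, w2):
        # Backward transitional probability P(w1 | w2)
        # = count(w1 w2) / count(w2); 8-month-olds can track such
        # statistics (Pelucchi, Hay, & Saffran, 2009).
        if self.word_counts[w2] == 0:
            return 0.0
        return self.pair_counts[(w1, w2)] / self.word_counts[w2]

    def process_utterance(self, words):
        # Incremental: relies only on counts accumulated so far.
        if not words:
            return
        for w in words:
            self.word_counts[w] += 1
        chunk = [words[0]]
        for w1, w2 in zip(words, words[1:]):
            self.pair_counts[(w1, w2)] += 1
            p = self.btp(w1, w2)
            mean = self.btp_sum / self.btp_n if self.btp_n else 0.0
            self.btp_sum += p
            self.btp_n += 1
            if p >= mean:
                chunk.append(w2)     # high BTP: extend current chunk
            else:
                self.chunkatory[tuple(chunk)] += 1
                chunk = [w2]         # low BTP: chunk boundary
        self.chunkatory[tuple(chunk)] += 1

learner = ChunkLearner()
learner.process_utterance(["the", "dog", "ran"])
learner.process_utterance(["a", "dog", "barked"])
```

After these two toy utterances, the whole first utterance is stored as a single chunk (all BTPs are at the running average), while the low BTP of "a dog" relative to the running average places a boundary after "a" in the second. Note the sketch omits the model's comprehension (shallow parsing) and production (utterance reconstruction) components, which build on this chunkatory.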