Solutions to the Binding Problem
Neuron (Cell Press), 1999, Volume 24, Issue 1
DOI: 10.1016/s0896-6273(00)80826-0
ISSN: 1097-4199
The binding problem, or constellation of problems, concerns our capacity to integrate information across time, space, attributes, and ideas. The goal of research in this area is to understand how we can respond to relations within relevant subsets of the world but not to relations between arbitrarily selected parts or properties. Language comprehension and thinking depend critically on correct binding of syntactic and semantic structures. Binding is required when we select an action to perform in a particular context. We must, for example, reach in the right direction, lift the glass with the correct muscle tension, and drink the water it contains rather than eat or inhale it. The mediating "event file" binds stimulus to response (Hommel, 1998; Shadlen and Movshon, 1999 [this issue of Neuron]). The most extensive discussion so far has focused on the problem of binding in visual perception. How does the brain segregate the correct sensory data to represent the objects that are actually present and not some illusory recombinations of their features? In considering perceptual binding, it is important to note that "seeing" an object is not the same as identifying it (Kahneman et al., 1992; Treisman, 1992a; Treisman and Kanwisher, 1998). To generate a perceptual experience and to be able to act on it, we need to specify the current details of how an object looks, where it is, how it is oriented, and many other often arbitrary details of its current instantiation.
Thus we must construct a temporary token ("object file") that binds together these current features with the more permanent identifying characteristics of its type. The present papers have dealt mostly with this perceptual version of the binding problem, although many ideas can apply more broadly to binding at all levels. Why should a binding problem arise in vision, both for the brain and for the scientist attempting to understand it? A number of factors could combine to create binding failures in vision. One is that various properties of objects appear to be separately analyzed by specialized subsystems. Thus, while information from the same location is implicitly bound by the cells that respond to it initially (allowing the selective adaptation to conjunctions of features described by McCollough, 1965; see also Wolfe and Cave, 1999 [this issue of Neuron]), at later stages the information from these cells appears to be routed to different neural populations, forming a distributed representation of an object's different properties. Another is that receptive fields at higher levels are large enough to generalize across a wide range of locations. Because visual scenes typically contain multiple objects, the question of which features belong to which objects could frequently arise. To aggravate the problem, different parts of a single object occupy different locations, and there may be occluding objects that break their continuity. Coarse coding of different stimulus dimensions also creates representations that may depend on ratios of activity in neurons with different but overlapping tuning. Whenever the perceptual representations of simultaneously present objects depend on distributed patterns of firing in populations of cells, the risk of superposition ambiguities within the same neural network will arise, creating a need to identify and to signal which units belong to the same representation.
For any case of binding, the binding problem can actually be dissected into three separable problems. Different theories have focused primarily on one of the three. (1) Parsing. How are the elements to be bound as a single entity selected and segregated from those belonging to other objects, ideas, or events? (2) Encoding. How is the binding encoded so that it can be signaled to other brain systems and used? (3) Structural description. How are the correct relations specified between the bound elements within a single object? The second and third operations are not necessarily sequential, and in fact some models combine all three as part of the same process. I will discuss each of these three aspects of the binding problem in turn. There are a number of different ways in which the initial parsing of objects might be accomplished. In particular, the selection of parts and of properties may depend on different mechanisms. Different attributes of the same object, such as its color, orientation, and direction of motion, must occupy the same location. However, different parts, like the arms and legs of a child or the two colors of her shirt and pants, occupy different locations and, if she is partly occluded, may not even be spatially linked in the retinal image. Possible ways of parsing objects and backgrounds include the selection of sensory data that share the same temporal parameters (onset, offset, flicker rate), or that match a prestored template (e.g., Chelazzi et al., 1993), or that occupy the same or adjacent locations, or finally that share one or more Gestalt properties, such as their color or texture, common fate, collinearity, good continuation, symmetry, and convexity.
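As a concrete (and deliberately toy) illustration of parsing by grouping, the sketch below merges elements that share a feature and lie close together into candidate objects. The distance threshold and the choice of color as the grouped feature are arbitrary assumptions for illustration, not claims about cortical mechanism.

```python
from itertools import combinations

# Toy parsing-by-grouping sketch: elements that share a feature (here, color)
# and lie close together are merged into one group, loosely mimicking Gestalt
# grouping by similarity and proximity. Uses a simple union-find structure.
def parse_groups(elements, max_dist=2.0):
    # elements: list of (x, y, color) tuples
    parent = list(range(len(elements)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for (i, (xi, yi, ci)), (j, (xj, yj, cj)) in combinations(enumerate(elements), 2):
        same_color = ci == cj
        near = (xi - xj) ** 2 + (yi - yj) ** 2 <= max_dist ** 2
        if same_color and near:
            union(i, j)

    groups = {}
    for i in range(len(elements)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

elems = [(0, 0, "red"), (1, 0, "red"), (2, 0, "red"),  # one red contour
         (5, 5, "blue"), (6, 5, "blue"),               # a blue pair
         (9, 9, "red")]                                # isolated red element
print(parse_groups(elems))  # [[0, 1, 2], [3, 4], [5]]
```

Real parsing would of course weigh many cues at once (common fate, collinearity, closure); the point here is only that grouping partitions the elements before any identification takes place.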
In cases where objects have different temporal onset or offset times, the externally imposed synchrony of the initial neural response to one object may solve the parsing problem as well as the encoding problem (see below). The evidence suggests that temporal modulations can be used to separate figure and ground (Leonards et al., 1996). However, when pitted against spatial cues, temporal modulations do not disrupt selection by location (Fahle and Koch, 1995; Kiper et al., 1996), so they cannot always be the dominant cue. When one of the objects is known or expected, the selection may be mediated by a match to this familiar or cued object. The best-known example is the Dalmatian dog to be extracted from a background of black and white patches (Figure 1). One of two superimposed movies of real-life scenes (like a ball game and a hand game) can be selected so efficiently that a salient event in the unattended movie (a lady with an umbrella walking across the field) is completely missed (Neisser and Becklen, 1975). This type of segregation no doubt uses other grouping cues such as common fate, collinearity, and shared colors, but top-down predictions are also likely to play a role. Zhang (1999) showed recently that suppression of irrelevant stimuli can also be helped by precueing the template of the unwanted stimulus, even when it occupies the same location as the target, showing that suppression is mediated not by location but by a representation of the object itself. Most of the research on binding has been devoted to the last two forms of selection—by location and by shared features. Feature Integration Theory (FIT; e.g., Treisman and Gelade, 1980; Treisman, 1993, 1998) uses both. It was developed to account for a number of empirical findings: (1) that search for targets that need binding to distinguish them from the nontargets often requires attention; (2) that when attention is directed elsewhere, illusory conjunctions wrongly recombining features of different objects are frequently seen; (3) that precueing the relevant location helps much more when a conjunction must be reported than when the targets are defined as a disjunction of separate features; and (4) that grouping by single features occurs in parallel across the field, whereas grouping by conjunctions is much less salient and also seems to require attention. In early papers, we proposed that binding is achieved by directing spatial attention serially to the locations of different objects (or homogeneous groups of objects). Features of objects in unattended locations are thereby excluded and cannot form illusory conjunctions with the features of the attended object.
The relevant locations are selected in a "master map" of locations by an externally directed "window of attention" serially focused on single filled locations or contiguous clusters that might correspond to "objects." The attention window gives access to the features in the corresponding locations in the different feature maps and allows the information from those locations to be assembled in a single object file for further analysis and identification. The separation of explicit access to features and to locations may correspond to the separation of ventral "what" and dorsal "where" pathways, although there must be implicit links between the two as well as implicit location information in the feature maps. The fact that patients with bilateral parietal lesions (Balint's syndrome) have major problems with binding is consistent with the idea that the master map of locations is associated with parietal function (Robertson et al., 1997). If our model of the deficit is correct, the simultanagnosia that these patients also suffer is evidence for the importance of binding to normal object perception. If only one object can be bound, it seems that only one object is seen. Treisman (1988) and Treisman and Sato (1990) added a feature-based selection process to the original version of FIT to account for cases where highly discriminable features appear to group and allow rapid or parallel conjunction search.
The idea was that the binding process could be bypassed if connections from the separate feature maps responding to the target features were used to signal the corresponding locations within the master map and to suppress all others, thus excluding all nontargets from further processing. For example, if the target of a search is known to be red and vertical among green vertical and red horizontal lines, all the active locations in a color map for red and in an orientation map for vertical could select the corresponding locations in the master map through the implicit links connecting them, and temporarily inhibit all other locations. Wolfe et al. (1989) proposed a similar model of Guided Search, and Wolfe and Cave (1999 [this issue of Neuron]) report evidence supporting the feature-based account. This form of selection may also be responsible for grouping the parts of partially occluded objects. In a different form of grouping by Gestalt properties, edges or elements that are continuous or collinear appear to be linked by horizontal connections in area V1 (Kapadia et al., 1995), which may increase the salience of object boundaries (Yen and Finkel, 1998).
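The feature-based selection just described can be caricatured in a few lines: each feature map lists the locations where its feature is present, and intersecting the target's feature maps in a common location map leaves only the conjunction target, with no serial scan of the display. The display and feature values below are invented for illustration; this is a sketch of the intersect-and-inhibit idea, not the published Guided Search implementation.

```python
# Hypothetical display: location -> (color, orientation)
display = {
    (0, 0): ("green", "vertical"),
    (1, 0): ("red", "horizontal"),
    (2, 1): ("red", "vertical"),     # the red-vertical conjunction target
    (3, 1): ("green", "horizontal"),
}

# Each feature map records the locations at which its feature is active.
red_map = {loc for loc, (color, _) in display.items() if color == "red"}
vertical_map = {loc for loc, (_, orient) in display.items() if orient == "vertical"}

# Top-down guidance: only master-map locations active in BOTH target feature
# maps survive; all other locations are (temporarily) inhibited.
candidates = red_map & vertical_map
print(candidates)  # {(2, 1)}
```

A single surviving candidate is the case where conjunction search looks fast and parallel; with many shared-feature distractors, more locations survive the intersection and attention must still visit them serially.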
Such horizontal connections in V1 might also be used to define the locations to be selected or suppressed in the master map of locations. Constraints on the shape of the attention window could also bias the selection in favor of simple, symmetrical, convex objects. Several papers in this issue develop similar ideas to explain selection by location or feature grouping. Wolfe and Cave present their Guided Search model (Wolfe et al., 1989; Cave and Wolfe, 1990; Wolfe and Bennett, 1997), which extended FIT. They recently modified the earliest stages to include an initial loose "bundling" by location, as opposed to the "tight binding" achieved through attention. Shadlen and Movshon favor the related computational model proposed by Olshausen et al. (1993), using shifter circuits to achieve the same goal of serial processing by locations. Reynolds and Desimone adopt the idea that spatial attention is the main mechanism for binding, combining it with their biased competition account and linking it to evidence from single-unit recordings (see also Luck et al., 1997). Reynolds and Desimone (1999 [this issue of Neuron]) also propose salience and grouping as supplementary binding mechanisms that could, by the increased activation they cause, select a winner in a competitive interaction with other objects. One concern with this hypothesis is that, without additional specification of the selection process, it seems to require the binding problem to be at least partly solved in order to contribute to its solution. The attribute on which the stimuli are salient (e.g., luminance or color contrast), or on which they are grouped (e.g., color, orientation, shared motion), must be bound to the other attributes of the same stimuli before their greater activation can be transmitted to those other dimensions. This catch-22 might be resolved in cases where salience and grouping are determined at the early levels of processing, before the stimulus attributes are segregated into separate specialized visual areas. At these earliest stages, high contrast may ensure high firing rates for holistic stimuli rather than for particular attributes, so that subsequently each attribute wins out in the competition within its own specialized area. The mechanism is less clear for cognitively determined or learned salience and grouping (such as the salience of one's own name in a list of other names). The identity of the name might not be available at stages that precede the specialized processing of separate attributes. In order to make progress in understanding, it is useful to sharpen the disagreements so that clear distinctions in empirical data can decide between them. As I understand it, Reynolds and Desimone's biased competition model differs from FIT primarily in the source of suppression of unwanted stimuli, ideas, or responses.
(The main difference claimed by Reynolds and Desimone was that illusory conjunctions, in their model but not mine, arise from spatial uncertainty within receptive fields, but this was in fact the model proposed by Treisman and Gormican, 1988 [see pp. 45–46]. I also proposed additional mechanisms besides spatial attention that can be used to select stimuli and deal with the binding problem, one being grouping and one being object tracking by reentrant connections [Treisman, 1995].) Both accounts assume an external source of control by selective attention, presumably directed by prefrontal and parietal areas. But Reynolds and Desimone restrict this control to biasing the competition that would occur between objects anyway. They assume that selection of one stimulus or response from many is directly determined by suppressive links between their neural substrates, with the more active winning over the less active. Attention in their model biases the competition by adding top-down activation to one of the competing sets of cells. FIT, on the other hand, attributes an inhibitory as well as an activating role to the external control system that we label attention. The evidence Reynolds and Desimone cite is the observation that when a second stimulus is introduced within the receptive field of a visual neuron, its response is a weighted average rather than the sum of the two separate responses (see also Miller et al., 1993). This is consistent with a direct suppressive interaction between the two afferent inputs to the cell, but it could also result from external selection of one and inhibition of the other to reduce cross-talk and binding errors.
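The weighted-average observation can be captured by a simple normalization rule of the kind used in biased competition accounts; the firing rates and the attentional gain below are illustrative assumptions, not fitted values.

```python
# Simplified weighted-average rule: the response to a pair of stimuli lies
# between the responses to each stimulus alone, and attending to one stimulus
# raises its weight, pulling the pair response toward that stimulus alone.
def pair_response(r1, r2, w1=1.0, w2=1.0):
    return (w1 * r1 + w2 * r2) / (w1 + w2)

r_pref, r_poor = 80.0, 20.0  # spikes/s to each stimulus presented alone

unattended = pair_response(r_pref, r_poor)           # 50.0, between the two
attend_pref = pair_response(r_pref, r_poor, w1=4.0)  # 68.0, toward preferred
print(unattended, attend_pref)
```

Note that the averaging itself is neutral between the two interpretations in the text: the weights could reflect intrinsic mutual suppression between the inputs or an external source that selects one input and inhibits the other.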
Intrinsic competition may be difficult to distinguish from the effects of dividing or focusing an extrinsic source of attention. One piece of evidence that favors the biased competition model is the finding by Reynolds et al. (1999) that when two unattended stimuli are presented in the same receptive field, the response is still lower than the response to the more effective of the two presented alone, even though they are not competing with each other for attention. Of course, this does not preclude the idea that attention also inhibits unwanted stimuli when binding errors might otherwise occur. If the idea of intrinsic competition is correct, it raises many interesting questions for research. How do the cells "know" that they are being activated by different objects rather than by one complex object? There may be some feedback from higher object recognition areas, but when both objects are unattended, this is likely to be limited. Does similarity play a role in determining the degree of competition? What other factors affect the competition and its outcome? Several of the present papers deal primarily with the second problem: how the feature bundles, once selected, are encoded to be used for thinking, deciding, and acting. Essentially, this is a special case of the central question, "What, in neural terms, corresponds to the final representation of what we see?" Is it the activation of particular labeled cells, or particular cell assemblies, or particular temporal patterns of activity within or across cells, independent of which cells implement the pattern? Or is it a combination of place and temporal pattern? We are far from having an answer, which is quite a handicap in devising models of object perception, but the question has not been discussed much.
One constraint, of course, is that the codes should remain distinct when several objects are present at once (the binding problem). The focal hypothesis debated in the present issue is the proposal by Milner (1974) and von der Malsburg (1981) that the neurons coding elements that belong to the same object are distinguished from others by firing in synchrony. Oscillations in the range of 30–60 Hz are thought to assist and perpetuate the synchronization, especially for widely separated neurons. While the synchrony lasts, the cells that share it are treated as representing the same object, event, or proposition. In this issue, von der Malsburg (1999) gives the theoretical reasoning behind models of binding by synchronized firing. Gray (1999) and Singer (1999) discuss the physiological evidence and implementation of these ideas. In contrast, Shadlen and Movshon (1999) and Ghose and Maunsell (1999) point out some problems for the synchrony account, including the supposedly limited precision of temporal and spatial coding by neurons, the need to use timing relations to represent real temporal differences (in such discriminations as the perceived direction of sounds or the syllables of speech), and the failure of several studies to observe the predicted oscillations in visual areas (e.g., Tovee and Rolls, 1992; Young et al., 1992). The binding-by-synchrony hypothesis has created considerable interest and excitement, since it provides a means of disambiguating superimposed distributed codes in neural networks, thus greatly increasing their flexibility. It also provides a plausible reason for the attentional limit of around four objects that is widely observed in the perception of brief displays and in studies of visual working memory: the different firing rates that can be easily discriminated on a background of inherent noise and accidental synchronies may set a low limit to the number of objects that can be simultaneously bound.
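The disambiguation of superimposed distributed codes can be made concrete with a toy example (all channel values are invented): two broadly tuned hue channels code color by their activity ratio, so red and blue presented together sum to the same population response as a single purple; confining each object's activity to its own time slot recovers the two separate hues.

```python
# Toy demonstration of superposition ambiguity in a coarse code, and of a
# temporal tag that resolves it. Hue is coded by the activity ratio of two
# broadly tuned channels.
RED = (1.0, 0.1)       # (red-channel, blue-channel) activity for pure red
BLUE = (0.1, 1.0)
PURPLE = (0.55, 0.55)  # intermediate hue: balanced ratio

def superpose(a, b):
    # simultaneous objects simply sum their population activity
    return (a[0] + b[0], a[1] + b[1])

def ratio(resp):
    # decoded hue: share of total activity carried by the red channel
    return resp[0] / (resp[0] + resp[1])

# Without binding tags, red + blue decodes exactly like purple:
print(ratio(superpose(RED, BLUE)), ratio(PURPLE))  # 0.5 0.5

# With synchrony, each time slot carries one object's activity,
# and the two hues decode correctly:
slots = {"t1": RED, "t2": BLUE}
decoded = {t: ratio(resp) for t, resp in slots.items()}
print(decoded)  # t1 decodes red (~0.91), t2 decodes blue (~0.09)
```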
von der Malsburg (1999; see also Singer, 1999) points out another important advantage of temporal binding that is not often discussed: it allows coarse coding within dimensions. Coding intermediate values on perceptual dimensions by ratios of activity in differently tuned but overlapping populations of cells can maximize both neural economy and discriminability. However, if different values are signaled by particular combinations of cells, the binding problem reemerges as soon as more than one coarsely coded feature is present. While exploring the effects of coarse coding and of similarity on binding, we found that the predicted illusory conjunctions can indeed arise within dimensions (e.g., illusory purple with brief presentations of red and blue), and that attention seems to play the same role for within-dimension binding as for between-dimension binding (Treisman, 1991, 1992b). The main alternative hypothesis for signaling the outputs of the binding process is a place code, which represents different objects or parts of objects by the firing of different labeled conjunction-coding or "cardinal" cells at the top of a hierarchical perceptual system (Barlow, 1972, 1985).
The cardinal cells could be replaced by cell assemblies, provided that the coding is sufficiently sparse for overlap between cells taking part in different assemblies not to be a problem. The place or "labeled line" hypothesis is developed in this issue by Shadlen and Movshon, by Ghose and Maunsell, and by Riesenhuber and Poggio, who created a model to demonstrate its feasibility. There is considerable evidence for coding of specific percepts by specialized cells. Single units in monkeys respond to faces better than to other stimuli (e.g., Perrett et al., 1985). The behavioral discriminations of motion made by a monkey can be predicted from the activity of individual cells in area MT (Newsome et al., 1989; Britten et al., 1996)—quite strong evidence for cardinal cells in the case of directions of motion. However, faces may be a special case, highly significant from the evolutionary point of view, and objects are typically more complex than directions of motion. Ghose and Maunsell suggest that the number of objects we can actually identify is only in the tens of thousands, but we can also see thousands of differences between individual tokens and views of these objects. The cardinal cell hypothesis does run into combinatorial explosion problems if all discriminable instantiations of all objects must have unique cells to signal their presence.
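The combinatorial worry can be made concrete with back-of-envelope arithmetic (all counts are hypothetical; 30,000 stands in for the "tens of thousands" of identifiable objects): dedicating one cell to every conjunction of a few attribute values multiplies the counts, whereas a distributed code needs only their sum.

```python
# Back-of-envelope arithmetic for the combinatorial explosion: one dedicated
# cell per discriminable conjunction multiplies the attribute counts; a
# distributed code needs one unit per feature value, i.e., their sum.
attribute_values = {
    "object identity": 30_000,
    "size": 10,
    "orientation": 12,
    "position": 100,
}

dedicated_cells = 1
for n in attribute_values.values():
    dedicated_cells *= n          # one cell per full conjunction

distributed_units = sum(attribute_values.values())  # one unit per feature value
print(f"{dedicated_cells:,} conjunction cells vs. {distributed_units:,} feature units")
```

The distributed scheme wins by four orders of magnitude even for this small attribute set, but it is precisely the scheme that requires a binding mechanism; the conjunction scheme has no binding problem but an implausible cell budget.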
A related difficulty is that this account allows no distinction between identifying and "seeing," or between types and tokens (Kahneman et al., 1992; Treisman, 1992a). Using cardinal cells, feature binding can be coded only by identifying prestored conjunctions. It is not clear, therefore, how a new and unexpected conjunction would be bound and perceived. Consider seeing a three-legged camel with wings, or a triangular book with a hole through it, or a new object like an electron microscope picture of a cell that we have never seen before. All these would be instantly visible and bound, even when we have no idea what they are. The plasticity of the nervous system may allow new cardinal cells to be created through learning and experience—for example, to signal familiar configurations like letters, digits, and grandmothers. But novel objects can clearly be "seen" under good conditions the first time they appear. Another deficiency is that cardinal cells have no way to represent the hierarchical structure of the object, the different momentary constellations of articulated parts in a given token as seen on a particular occasion. These could perhaps be read from the complete pathway through the hierarchy that culminates in the cardinal cell.