In an extensive study, Yarbus (1967)
showed that the perception of a complex scene involves a complicated
pattern of fixations, where the eye is held (fairly) still, and
saccades, where the eye moves to foveate a new part of the scene
(cf.
).
Usually, we are not conscious of this pattern; when perceiving a scene, the
generation of this eye-gaze pattern is felt as an integral part of
"perceiving." Looking closer at the individual fixation, the
question then arises: how does the visual system decide where to fixate
next, that is, how does it select a new location to direct our attention to? It is
speculated that
The fixation time is also spent making small initial saccadic adjustments. It seems likely that a decision is made during this interval as to where to locate future fixations. One possibility is that the initial stage of a fixation is occupied by extracting information from the present fixation locus and the last part of the interval is devoted to acquiring information about the next (or maybe the next but one) place to fixate. Alternatively both operations could be carried out at once, or they could be interleaved (first a bit of one job, then a bit of the other, and so on). The plain fact is that there is little direct evidence that can be called on. (Barber & Legge 1976, p. 58)
Most research literature now agrees that the attention selection mechanism
consists of two functionally independent, hierarchical stages: An early, pre-attentive stage that operates without capacity limitation and in parallel across the entire visual field, followed by a later, attentive limited-capacity stage that can deal with only one item (or at best a few items) at a time. When items pass from the first to the second stage of processing, these items are considered to be selected. (Theeuwes 1993, p. 97f, original italics)
It is important to note that attention is shifted to the new location before the saccade that moves the gaze to a new part of the scene is initiated, and thus
the movements of the eyes should not be considered as the selection process itself, but merely as the outcome of attentional selection processes preceding actual eye-shifts. (Theeuwes 1993, p. 96)
Treisman (cf. Theeuwes 1993, p. 98) suggested a feature integration theory (FIT), which builds on the two-stage model above and views the perception of objects as a process in which the pre-attentive stage detects primitive aspects of the scene: basic `features' such as edges, orientation, width, size, colour, brightness and movement direction. For these basic features to be perceived as objects in the world, they have to be `integrated' in the attentive stage. For example, finding a large red X among large and small, red and green Xs and Os would proceed by parallel detection of red objects, large objects and round objects, followed by a serial integration and checking of each object to decide whether it is a large red X. Theeuwes (1993) notes that the pre-attentive stage has three basic properties:
Thus it would seem that objects are selected purely by bottom-up processes. Yet there seems to be one dimension along which selection can be strategically controlled, i.e. governed by top-down processes: spatial location. Various studies indicate that it is possible to strategically focus attention on smaller areas of the visual field (Eysenck & Keane (1990, p. 106) cite LaBerge (1983), and Theeuwes (1993, p. 115) cites Eriksen & Yeh (1985) along with Eriksen & James (1986)), and this has given rise to the spotlight metaphor: it is suggested that attention can be moved like a spotlight across the visual field, and that the spotlight "enhances the efficiency of detection of events within its beam" (Humphreys & Bruce (1989, p. 145) cite Posner et al. (1980, p. 172)).
A very similar, but perhaps even better, metaphor is the notion of a zoom lens (Theeuwes (1993, p. 115) cites Eriksen & Yeh (1985)), where the photographer can only select objects for immortalization from that part of the scene the zoom lens is focused on. In contrast to the spotlight, which has a "fixed aperture" (Humphreys & Bruce (1989, p. 145) cite Eriksen & Eriksen (1974) as estimating its size at about 1°), the zoom of the lens can be varied over time. There is also an implicit trade-off in this metaphor: the scene can be attended to at large (wide angle), but then only with a poor resolution of detail; alternatively, attention can "zoom in" on a part of the scene, thus improving the resolution. Even so, "within the beam of attention, top-down control is lost and pre-attentive processing occurs unintentionally; allowing the item with the highest bottom-up activation to enter the second stage of attentive processing" (Theeuwes 1993, p. 117). This metaphor should not be interpreted too literally, though; Theeuwes (1993, p. 132) cites Kwak et al. (1991) as reporting that the attentional beam can be shifted to an entirely different region instantly, regardless of how far away this new region is. A spotlight or zoom lens metaphor would, on the other hand, imply that the time taken to shift attention depends on this distance. Also, Eysenck & Keane (1990, p. 109f) cite Egly and Homa for a study in which the stimulus could occur in three concentric rings (inner, middle and outer). The subjects' attention was directed to the middle ring, and according to the spotlight metaphor, any target object displayed in the inner ring, which would lie within the zoom lens beam covering the middle ring, should be detected more easily than one in the outer ring.
This was not what was found in their study (detection of stimuli in the inner and outer ring was equally poor), and we must conclude that "visual attention may be rather more complex than that [spotlight] comparison would suggest" (Eysenck & Keane 1990, p. 109); attention can take non-circular shapes, and does not move entirely like a spotlight.
Figure 6: A model of visual selective attention. The spatially parallel process computes feature difference maps that are added together and subsequently used for selection of objects for further, serial integration of features. All this is done before attention and eyes are directed to the new, target location.
To conclude, we can picture the visual attention selection mechanism as suggested by Figure 6, where a bottom-up, spatially parallel process of unlimited capacity produces `feature difference maps' along several dimensions of primitive features. These feature maps are then added together, and a zoom lens (spotlight) that can be strategically directed by a top-down process delineates the area within which the objects that have the highest `difference sum' will be selected first.
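The selection step of this model can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the function names, the difference measure (absolute deviation from the mean of each map) and the rectangular attention window are all assumptions made only for illustration.

```python
# Hypothetical sketch of the model in Figure 6. The difference measure and
# the rectangular "zoom lens" window are illustrative assumptions.

def difference_map(feature_map):
    """Bottom-up step: how much each location deviates from the map's mean."""
    cells = [v for row in feature_map for v in row]
    mean = sum(cells) / len(cells)
    return [[abs(v - mean) for v in row] for row in feature_map]

def summed_map(feature_maps):
    """Add the difference maps of all feature dimensions together."""
    diffs = [difference_map(m) for m in feature_maps]
    rows, cols = len(diffs[0]), len(diffs[0][0])
    return [[sum(d[r][c] for d in diffs) for c in range(cols)]
            for r in range(rows)]

def selection_order(total, window):
    """Top-down step: only locations inside the window compete.

    window = (row_min, row_max, col_min, col_max), bounds inclusive.
    Locations are returned with the highest difference sum first.
    """
    r0, r1, c0, c1 = window
    locations = [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
    return sorted(locations, key=lambda rc: total[rc[0]][rc[1]], reverse=True)

# Two toy feature dimensions on a 3x3 field (1 marks a deviating item).
colour = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
size = [[0, 0, 0], [0, 1, 0], [0, 0, 1]]
total = summed_map([colour, size])

whole_field = selection_order(total, (0, 2, 0, 2))
bottom_row = selection_order(total, (2, 2, 0, 2))
```

With the whole field attended, the centre location, which deviates in both colour and size, comes first in `whole_field`. If the window is narrowed to the bottom row, the centre no longer competes and the item that deviates in size only is selected first.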
After the selection of objects by the pre-attentive process, attention acts to "glue" together the features of the different dimensions, according to the feature integration theory proposed by Treisman (cf. Theeuwes 1993, p. 98). If the target object is defined by one primitive feature only (e.g. its colour), this attentive process can be "short-circuited," resulting in the "pop-out" effect; but if subjects are to search for target objects defined by a conjunction of features (say, a large green X), attention must combine the features in a serial process. Evidence in support of this theory comes from studies where reaction times for detecting targets defined by a single feature were unaffected by the number of objects displayed, the so-called display size. When, on the other hand, the target object was defined by a conjunction of features, reaction times increased linearly with display size, indicating that subjects had to perform a serial, attentive operation of combining features and checking for a target match. There is also evidence that this serial process is self-terminating, i.e. it stops as soon as it finds a target object (Humphreys & Bruce 1989, p. 175f.). Another characteristic of this "attentional gluing" is that it seems open to influence from top-down processes. Stored knowledge of objects affects the combination of the primitive features: "A carrot is likely to be combined with the colour orange, a tomato with the colour red" (Humphreys & Bruce 1989). This can also result in incorrect feature combinations, the so-called "illusory conjunctions" that have been reported (Humphreys & Bruce (1989, p. 177) cite Treisman & Schmidt (1982)).
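The link between a serial, self-terminating process and a linear effect of display size can be made concrete with a small sketch. It assumes, purely for illustration, that reaction time is proportional to the number of items inspected and that the target is equally likely to appear at any position; the function names and the item encoding are hypothetical and not drawn from the cited studies.

```python
# Hypothetical sketch: counts inspected items as a stand-in for reaction time.

TARGET = ("large", "green", "X")       # conjunction target from the text
DISTRACTOR = ("small", "red", "O")     # illustrative distractor

def serial_search_checks(display, target):
    """Inspect items one at a time and stop at the first match."""
    checks = 0
    for item in display:
        checks += 1
        if item == target:
            break                      # self-terminating
    return checks

def mean_checks_target_present(display_size):
    """Average number of checks over all possible target positions."""
    total = 0
    for position in range(display_size):
        display = [DISTRACTOR] * display_size
        display[position] = TARGET
        total += serial_search_checks(display, TARGET)
    return total / display_size

present = {n: mean_checks_target_present(n) for n in (4, 8, 16)}
absent = {n: serial_search_checks([DISTRACTOR] * n, TARGET) for n in (4, 8, 16)}
```

Under these assumptions the mean number of checks on target-present displays is (n + 1) / 2 for a display of n items, and n on target-absent displays, so both grow linearly with display size. A single-feature "pop-out" search would, by contrast, correspond to a constant cost regardless of n.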