The retina of the human eye is not homogeneous; to allow for diurnal vision, it is divided into a large outer ring of highly light-sensitive but colour-insensitive rods, and a comparatively small central region of less light-sensitive but colour-sensitive cones called the fovea. The outer ring only provides peripheral vision, so all detailed observations of the surrounding world are made with the fovea, which must therefore constantly be directed at different parts of the viewed scene by successive fixations.
It would seem that during a fixation, information about the current location is processed, and the decision of where to move the fovea for the next fixation (the so-called selection) is also made during this fixation. Most research agrees that this is done in two phases: a spatially parallel and unlimited pre-attentive stage, where basic features of each location in the scene are subjected to local mismatch detection, followed by an attentive, serial and limited-capacity stage that, according to the Feature Integration Theory put forward by Treisman, combines and integrates the different basic features to produce "whole" percepts of the attended items (cf. figure 6).
Whereas the parallel stage is purely bottom-up driven, the serial stage can to some extent be strategically controlled. According to the zoom lens metaphor, top-down processes can reduce and move the area to which the serial stage directs attention. This metaphor also implies a trade-off: if the visual scene is attended to at large, it is done with relatively low resolution of detail, and conversely, if only a small part of the scene is attended to, it is done with relatively high resolution of detail. Some research shows, however, that this zoom lens metaphor should not be interpreted literally.
The human eyes are capable of making many different movements. Some are involuntary, like rolling, nystagmus, drift and microsaccades, together with physiological nystagmus, which serves to constantly shift the retinal image so as to call fresh receptors into use. Some require an external stimulus to be initiated, like convergence and pursuit motion. Finally, the very important saccades can be induced voluntarily, although they are ballistic, that is, their trajectory and destination cannot be altered once they have been initiated. All tracking data of human eye movements will thus consist of several superimposed movements, and must therefore somehow be "filtered" to extract the eye-gaze data that is related to the attentional processes of the viewer.
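As an illustration of what such filtering could look like, the following is a minimal sketch of a dispersion-threshold fixation detector, a commonly used technique and not necessarily the method adopted in this work. It assumes that the tracker delivers time-stamped gaze samples as (time in milliseconds, x, y) tuples; the function names and the two threshold values are hypothetical and would have to be tuned to the tracker and the viewing geometry. Consecutive samples that stay within a small spatial spread for a minimum time are merged into one fixation, which suppresses small movements such as drift and microsaccades and discards the samples recorded during saccades.

```python
from typing import List, Tuple

# Assumed data layout (not from the source text):
# a gaze sample is (time in ms, x, y); a fixation is
# (start time in ms, end time in ms, centroid x, centroid y).
Sample = Tuple[float, float, float]
Fixation = Tuple[float, float, float, float]


def dispersion(window: List[Sample]) -> float:
    """Spatial spread of a window: (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(
    samples: List[Sample],
    max_dispersion: float = 1.0,   # hypothetical threshold, in gaze units
    min_duration_ms: float = 100,  # hypothetical minimum fixation length
) -> List[Fixation]:
    """Group consecutive low-dispersion samples into fixations."""
    fixations: List[Fixation] = []
    n = len(samples)
    start = 0
    while start < n:
        # Grow the initial window until it spans the minimum duration.
        end = start
        while end < n and samples[end][0] - samples[start][0] < min_duration_ms:
            end += 1
        if end >= n:
            break  # the remaining samples are too short to form a fixation
        if dispersion(samples[start:end + 1]) <= max_dispersion:
            # Extend the window for as long as the spread stays small.
            while end + 1 < n and dispersion(samples[start:end + 2]) <= max_dispersion:
                end += 1
            window = samples[start:end + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append((window[0][0], window[-1][0], cx, cy))
            start = end + 1
        else:
            # No fixation begins here (e.g. mid-saccade); drop the first sample.
            start += 1
    return fixations
```

The end time minus the start time of each returned tuple gives the fixation duration, and the centroid gives the estimated point of regard. Recomputing the dispersion on every extension keeps the sketch short at the cost of efficiency; a running minimum and maximum would be preferable for long recordings.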
Eye movements have been classified, according to the situations in which they occur, as spontaneous, task-relevant or orientation-of-thought looking, and we propose a new type: intentional manipulatory looking, which is the act of using the direction of one's eyes to manipulate objects in the surrounding environment. Generally, the eyes are not attracted by the physical qualities of the items in the scene, but rather by how important the viewer would rate them to be. Thus, when viewing faces, the eyes of the viewer will be attracted mostly by the eyes, lips and nose. It is reasonable to assume that during spontaneous or task-relevant looking (and for intentional manipulatory looking this is trivially true), the direction of gaze is indicative of what the observer is interested in, although the observer might not always attend to what she is looking at.
The eye-gaze pattern is determined partly by the composition of the scene, and partly by the observer's thoughts and stored knowledge of the items in the scene. The working memory proposed by Baddeley (1981) is where these two determinants meet, and research suggests that the actual control of the eye movements is performed by some process-monitoring system.
When a scene is observed, it is initially scanned for important elements, and this scanpath is then more or less repeated in successive cycles; the observer does not to any great extent attend to the remaining, less scanned part of the scene.
Finally, some studies have shown that not only are the reaction times for detecting target objects indicative of the difficulty of the processing task, but the duration of each fixation also reflects the time it takes to register and process the fixated information.