Clarke 1, Micha Elsner 2* and Hannah Rohde 3

1 School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK.
2 Department of Linguistics, The Ohio State University, Columbus, OH, USA.
3 Linguistics and English Language, University of Edinburgh, Edinburgh, Scotland, UK.

Referring expression generation (REG) presents the converse problem to visual search: given a scene and a specified target, how does one generate a description which would allow somebody else to quickly and accurately locate the target? Previous work in psycholinguistics and natural language processing has failed to find an important and integrated role for vision in this task. That previous work, which relies largely on simple scenes, tends to treat vision as a pre-process for extracting feature categories that are relevant to disambiguation. However, the visual search literature suggests that some descriptions are better than others at enabling listeners to search efficiently within complex stimuli. This paper presents a study testing whether participants are sensitive to visual features that allow them to compose such “good” descriptions. Our results show that visual properties (salience, clutter, area, and distance) influence REG for targets embedded in images from the Where's Wally? books. Referring expressions for large targets are shorter than those for smaller targets, and expressions about targets in highly cluttered scenes use more words. We also find that participants are more likely to mention non-target landmarks that are large, salient, and in close proximity to the target. These findings identify a key role for visual salience in language production decisions and highlight the importance of scene complexity for REG.

Cognitive science research in the domains of vision and language faces similar challenges for modeling the way people use and integrate information. For modeling people's interpretation of visual scenes and for accounting for their linguistic descriptions of such scenes, both fields must address the ways that local cues are integrated with larger contextual cues and the ways that different tasks guide people's strategies. Despite these seemingly interlinked problem domains, vision and language have largely been studied as separate fields. Where intersections do occur, there is evidence that the way viewers make sense of a visual scene does indeed guide the language they use to describe it – visual information influences which objects speakers identify as important enough to mention and how they characterize the relationships between those objects (Coco and Keller, 2012; Clarke et al., submitted). Likewise, language itself acts as a strong gaze cue – listeners' eye movements in psycholinguistic eye-tracking experiments reflect their real-time language comprehension (Tanenhaus et al., 1995).
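As a purely illustrative sketch (not the analysis reported in the paper), the kind of relationship described in the abstract above could be quantified by regressing description length on the four visual properties (salience, clutter, area, distance). All feature values and word counts below are invented placeholders, and ordinary least squares is used only as the simplest stand-in for whatever statistical model one might actually fit:

```python
# Illustrative only: relate per-target visual properties to the number of
# words in the referring expression produced for that target.
import numpy as np

# Hypothetical per-trial data: [salience, clutter, area, distance] per target.
features = np.array([
    [0.8, 0.2, 120.0, 15.0],
    [0.3, 0.7,  40.0, 80.0],
    [0.5, 0.5,  60.0, 40.0],
    [0.9, 0.1, 200.0, 10.0],
    [0.2, 0.9,  30.0, 95.0],
])
# Hypothetical description lengths (words) for the same five trials.
word_counts = np.array([6, 14, 10, 5, 16])

# Ordinary least squares: word_count ≈ intercept + weights · features.
X = np.column_stack([np.ones(len(features)), features])
coef, *_ = np.linalg.lstsq(X, word_counts, rcond=None)

for name, w in zip(["intercept", "salience", "clutter", "area", "distance"], coef):
    print(f"{name:>9}: {w:+.3f}")
```

On data with the patterns the abstract reports, one would expect a negative weight on target area (larger targets get shorter descriptions) and a positive weight on clutter (more cluttered scenes elicit longer descriptions).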