Natural language interfaces (NLIs) for data visualization are becoming increasingly popular both in academic research and in commercial software. Yet, there is a lack of empirical understanding of how people specify visualizations through natural language. To bridge this gap, we conducted an online study with 102 participants. We showed participants a series of ten visualizations for a given dataset and asked them to provide utterances they would pose to generate the displayed charts. The curated list of utterances generated from the study is provided below. This corpus of utterances can be used to evaluate existing NLIs for data visualization as well as to create new systems and models that generate visualizations from natural language utterances.

For more details about the study, corpus, and potential applications, please refer to the accompanying research paper.

Team
Arjun Srinivasan, Georgia Tech
Nikhila Nyapathy, Georgia Tech
Bongshin Lee, Microsoft Research
Steven M. Drucker, Microsoft Research
John Stasko, Georgia Tech


Corpus (Link to Google Sheet)

The table above displays 814 utterance sets curated from the online study. Here, we define an utterance set as a collection of one or more utterances that collectively map to a specific visualization. Utterance sets can be singletons (i.e., containing a single utterance) or sequential (i.e., composed of two or more utterances).
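For readers who want to work with the corpus programmatically, the sketch below shows one possible way to load a CSV export of the sheet and separate singleton from sequential utterance sets. The file name and the "sequential" flag column used here are hypothetical placeholders; substitute the actual column names described below.

```python
# Minimal sketch, assuming the corpus has been exported from the Google Sheet
# as a CSV file. The file name ("nlv_corpus.csv") and the "sequential" column
# are hypothetical placeholders, not the corpus's documented schema.
import pandas as pd


def load_utterance_sets(path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split corpus rows into singleton and sequential utterance sets."""
    corpus = pd.read_csv(path)
    # Hypothetical flag column marking sets that contain two or more utterances.
    is_sequential = (
        corpus["sequential"].astype(str).str.lower().isin(["yes", "true", "1"])
    )
    return corpus[~is_sequential], corpus[is_sequential]


if __name__ == "__main__":
    singletons, sequential = load_utterance_sets("nlv_corpus.csv")
    print(f"{len(singletons)} singleton sets, {len(sequential)} sequential sets")
```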

Column descriptions: