Natural language interfaces (NLIs) for data visualization are becoming increasingly popular in both academic research and commercial software. Yet, there is a lack of empirical understanding of how people specify visualizations through natural language. To bridge this gap, we conducted an online study with 102 participants. We showed participants a series of ten visualizations for a given dataset and asked them to provide utterances they would pose to generate the displayed charts. The curated list of utterances collected through the study is provided below. This corpus of utterances can be used to evaluate existing NLIs for data visualization, as well as to create new systems and models that generate visualizations from natural language utterances.
For more details about the study, corpus, and potential applications, please refer to the accompanying research paper.
Corpus (Link to Google Sheet)
The above table contains 814 utterance sets curated from the online study. Here, we define an utterance set as a collection of one or more utterances that collectively map to a specific visualization. An utterance set is either a singleton (i.e., it contains a single utterance) or sequential (i.e., it is composed of two or more utterances).
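As a minimal sketch of this structure in Python (the class and field names below are our own, chosen to mirror the column descriptions that follow; they are not part of the corpus itself):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UtteranceSet:
    """One row of the corpus: one or more utterances mapping to a visualization."""
    utterances: List[str]  # a singleton set has one entry; a sequential set has two or more
    vis_id: str            # which of the ten study visualizations the set maps to
    dataset: str           # which of the three study datasets the set was issued against

    @property
    def is_sequential(self) -> bool:
        return len(self.utterances) > 1
```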
Column descriptions:
- sequential: Indicates whether an utterance set is singleton ('n') or sequential ('y'). Individual utterances in sequential utterance sets are separated by a pipe symbol (|); one way to split them is shown in the loading sketch after this list.
- visId: The type of visualization an utterance set maps to. The study presented ten visualizations during each session. Vega-Lite specifications corresponding to different visIds can be found in the vlSpecs.json file on the GitHub repo. All visualizations used during the study can also be viewed on a single page here.
- dataset: The dataset an utterance set was issued in the context of. The three datasets used during the study can be found on the GitHub repo.
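A minimal loading sketch, assuming the corpus has been exported from the linked Google Sheet as a CSV. The filename nlv_corpus.csv and the name of the utterance-text column ("utterance") are placeholders; check the sheet header for the actual names. The vlSpecs.json filename comes from the description above.

```python
import json

import pandas as pd

# Placeholder filename: export the linked Google Sheet as CSV first.
corpus = pd.read_csv("nlv_corpus.csv")

# Split sequential utterance sets on the pipe separator into lists of utterances.
# The "utterance" column name is an assumption about the sheet's header.
corpus["utterances"] = corpus["utterance"].str.split("|", regex=False).map(
    lambda parts: [p.strip() for p in parts]
)

# Singleton vs. sequential counts per dataset.
print(corpus.groupby(["dataset", "sequential"]).size())

# Vega-Lite specs keyed by visId, per the repo description of vlSpecs.json.
with open("vlSpecs.json") as f:
    vl_specs = json.load(f)
```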