Recently, many researchers have been trying to build automatic helpdesk systems. However, there are very few methods for evaluating such systems. In STC-3, we aim to explore methods for automatically evaluating task-oriented, multi-round, textual dialogue systems. This dataset has the following features:
In this competition, we treat the annotations as ground truth, and participants are required to predict the nugget type of each turn (Nugget Detection, or ND) and the quality of each dialogue (Dialogue Quality, or DQ).
The Chinese dataset contains 4,090 customer-helpdesk dialogues (3,700 for training + 390 for testing) crawled from Weibo. All of these dialogues are annotated by 19 annotators.
The English dataset contains 2,062 dialogues (1,672 for training + 390 for testing) that were manually translated from a subset of the Chinese dataset. The English dataset shares the same annotations as the Chinese dataset.
We hired 19 Chinese students from the Department of Computer Science, Waseda University, to annotate this dataset.
Each file is in JSON format with UTF-8 encoding.
The top-level fields are as follows:
Each element of the turns field contains the following fields:
Each element of annotations contains the following fields:
nugget: The list of nugget types for each turn (see details below).
quality: A dictionary of the subjective dialogue quality scores: A-score, S-score, and E-score (see details below).
A-score: Task Accomplishment (Has the problem been solved? To what extent?)
S-score: Customer Satisfaction of the dialogue (not of the product/service or the company)
E-score: Dialogue Effectiveness (Do the utterers interact effectively to solve the problem efficiently?)
Scale: [2, 1, 0, -1, -2]
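As a minimal sketch of how the quality annotations might be read in Python: the record layout below is an assumption inferred from the field descriptions above (a list under annotations, each entry carrying a quality dictionary keyed by A-score, S-score, and E-score). The sample values and the helper name score_counts are made up for illustration, and a real dialogue has 19 annotations rather than three.

```python
from collections import Counter

# The five-point scale described above.
SCALE = [2, 1, 0, -1, -2]

# Hypothetical dialogue record; only the fields needed here are shown.
dialogue = {
    "annotations": [
        {"quality": {"A-score": 2, "S-score": 1, "E-score": 1}},
        {"quality": {"A-score": 1, "S-score": 1, "E-score": 0}},
        {"quality": {"A-score": 2, "S-score": -1, "E-score": 1}},
    ]
}


def score_counts(dialogue, score_name):
    """Count how many annotators chose each scale value for one score type."""
    counts = Counter(a["quality"][score_name] for a in dialogue["annotations"])
    # Include scale values nobody chose, so every dialogue has the same keys.
    return {value: counts.get(value, 0) for value in SCALE}


a_counts = score_counts(dialogue, "A-score")
print(a_counts)
```

Keeping a zero entry for unused scale values makes it straightforward to turn the counts into a distribution over the full scale later.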
During data annotation, we noticed that the annotators’ assessments of the dialogues are highly subjective and hard to consolidate into a single gold label. Thus, we chose to preserve the diverse views in the annotations “as is” and leverage them when calculating the evaluation measures.
Instead of judging whether the estimated label is equal to the gold label, we compare the estimated distributions with the gold distributions (calculated from the 19 annotators’ annotations). Specifically, we employ these metrics for the quality sub-task and the nugget sub-task:
For details about the metrics, please visit:
For obvious reasons, we do not release the annotations of the test set. Instead, we require you to submit your prediction file to our server for evaluation. We also provide an evaluation script to help you calculate these metrics on the training set locally. For details, please visit the Evaluation Page.
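To make the idea of comparing distributions concrete, the sketch below builds a gold distribution from 19 annotator scores and measures how far a predicted distribution is from it. This is an illustration only: the distance used here (the summed absolute difference of cumulative distributions, scaled to [0, 1]) is a stand-in chosen for simplicity and is not claimed to be one of the official metrics, and the scores, the prediction, and the function names are all made up. Use the provided evaluation script for actual results.

```python
# Scale values ordered from lowest to highest, so cumulative sums are meaningful.
SCALE = [-2, -1, 0, 1, 2]


def to_distribution(scores):
    """Turn a list of annotator scores into a probability distribution over SCALE."""
    total = len(scores)
    return [scores.count(value) / total for value in SCALE]


def cumulative_distance(pred, gold):
    """Illustrative distance: summed absolute difference of the two
    cumulative distributions, divided by (number of bins - 1)."""
    diff = 0.0
    cum_pred = 0.0
    cum_gold = 0.0
    for p, g in zip(pred, gold):
        cum_pred += p
        cum_gold += g
        diff += abs(cum_pred - cum_gold)
    return diff / (len(SCALE) - 1)


# Made-up scores from 19 hypothetical annotators for one dialogue.
gold_scores = [2] * 8 + [1] * 6 + [0] * 3 + [-1] * 2
gold = to_distribution(gold_scores)

# Hypothetical system output: a probability for each value in SCALE.
pred = [0.0, 0.1, 0.2, 0.3, 0.4]

print(cumulative_distance(pred, gold))
print(cumulative_distance(gold, gold))  # identical distributions
```

Because the distance works on cumulative sums, predicting a neighbouring score is penalised less than predicting one at the opposite end of the scale, which suits ordinal labels such as these.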
Please contact: email@example.com