
Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.
Label quality. In Phase 2, our experiments aim to establish the impact of presenting annotators with different types of context during crowdsourcing. Instead of the conventional dialogue context, we provide annotators with either a summary of the dialogue (C0-sum) or the user's information need in the dialogue (C0-heu and C0-llm). We also aim to uncover whether we can improve the quality of the crowdsourced labels in C0 to match those in C7. We calculate Cohen's Kappa as in Section 3.2; see Table 2.
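For concreteness, the sketch below shows one way such agreement statistics can be computed with off-the-shelf implementations from scikit-learn and SciPy; the annotator labels are hypothetical placeholders, not the study's data or code.

```python
# Minimal sketch of inter-annotator agreement (Cohen's kappa, Kendall's tau).
# The labels below are illustrative only.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Hypothetical binary relevance labels from two annotators for the same responses.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
tau, p_value = kendalltau(annotator_a, annotator_b)

print(f"Cohen's kappa: {kappa:.2f}")
print(f"Kendall's tau: {tau:.2f} (p = {p_value:.3f})")
```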
The heuristic approach (C0-heu) yields the highest agreement (Kappa and Tau), indicating a noteworthy degree of consensus in relevance assessments. The LLM-generated context (C0-llm and C0-sum) results in moderate to substantial agreement on the relevance of the system response. We observe similar results for usefulness: the heuristic approach (C0-heu) again leads with the highest level of agreement (0.71 and 0.59), C0-sum follows with a Kappa score of 0.63, and C0-llm has a Kappa score of 0.53. The high level of agreement (Kappa) for both aspects attests to the quality of the labels; the additional context, generated either heuristically or with LLMs, is effective in conveying relevant information to annotators, leading to more consistent assessments.
For both relevance and usefulness, C0-heu consistently improves agreement among annotators, while the LLM-generated context (C0-llm and C0-sum) has substantially lower agreement than C7. This difference reflects the limitations of LLMs in capturing context and generating factual summaries. While they generate coherent text, LLMs sometimes fail to correctly represent the sequential order of the dialogue and users' language patterns.
Label consistency across conditions. In Figure 4a we report the agreement between the setups in Phase 2 and compare them to C7 (relevance) and C3 (usefulness) due to their high inter-annotator agreement (IAA) and label consistency. For the relevance annotations, varying levels of agreement emerge. There is substantial agreement between C0-heu and C0-llm (59.36%), showing a significant overlap in the labels assigned using both methods, although there are instances where annotators differ in their assessments of relevance. C0-sum exhibits moderate label agreement with C0-llm (62.74%) and C0-heu (65.67%), pointing to relatively similar label assignments across the setups.
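For concreteness, the pairwise label-agreement percentages reported here can be computed as in the sketch below; the condition names and labels are illustrative placeholders, not the study's data.

```python
# Minimal sketch of cross-condition label agreement (as a percentage),
# assuming one majority label per item under each condition.
def label_agreement(labels_setup_a, labels_setup_b):
    """Percentage of items that receive the same label under both setups."""
    assert len(labels_setup_a) == len(labels_setup_b)
    matches = sum(a == b for a, b in zip(labels_setup_a, labels_setup_b))
    return 100.0 * matches / len(labels_setup_a)

# Hypothetical majority labels per item for two conditions.
c0_heu = [1, 1, 0, 1, 0, 1]
c0_llm = [1, 0, 0, 1, 1, 1]
print(f"Agreement: {label_agreement(c0_heu, c0_llm):.2f}%")
```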
We observe similar results for usefulness in Figure 4b. While the heuristic approach achieves high IAA, the C0-sum method demonstrates greater consistency with all other setups in terms of usefulness. This suggests that while annotators using the C0-heu approach often agreed on a single label, the chosen label may not always have been the most accurate. We note slightly low levels of agreement on the same label across the three setups, consistent with the results in Phase 1. Unlike relevance, which used a binary scale, usefulness was rated on a 1–3 scale. This finer-grained scale may explain the lower agreement compared to relevance, as different types of contextual information can influence usefulness scores.
Regarding RQ2, we show that we can improve the consistency of the labels assigned by crowdworkers in the C0 condition by augmenting the current turn with automatically generated supplementary dialogue context. The heuristic approach demonstrates higher IAA and label consistency for both relevance and usefulness compared to C0 and C7. Providing annotators with the user's initial utterance expressing their preference, particularly in scenarios lacking context, can significantly enhance the quality and consistency of crowdsourced labels. This approach can yield performance comparable to a setup involving the entire dialogue (C7), without imposing the cognitive load of reading an entire conversation on annotators. This streamlines the annotation process while maintaining high-quality results, offering a practical strategy for obtaining reliable labels for dialogue evaluation.
This paper is available on arXiv under a CC BY 4.0 DEED license.