
Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.
Label quality. To gauge the quality of the crowdsourced labels, we rely on inter-annotator agreement (Boguslav and Cohen, 2017; Carletta, 1996). To understand how the amount of dialogue context influences the quality of annotators' ratings, we calculate the agreement between annotators for both relevance and usefulness across the three variations; see Table 1. To address potential randomness in the relevance ratings, given their binary scale, we randomly drop one rating from each dialogue and compute the agreement; we repeat this process for each annotator and report the average Cohen's Kappa score. For usefulness, we compute Kappa for each pair of annotators and then average over pairs. We assess the significance of the agreement using a Chi-squared test. All Kappa scores are statistically significant (p ≤ 0.05).
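As a concrete sketch of the pairwise-Kappa aggregation used for usefulness (not the authors' actual code; the annotator ratings below are hypothetical), the average over all annotator pairs can be computed with scikit-learn's `cohen_kappa_score`:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score


def mean_pairwise_kappa(ratings):
    """Average Cohen's Kappa over all annotator pairs.

    `ratings` is an (n_annotators, n_items) array of labels,
    e.g. usefulness scores assigned to the same set of dialogues.
    """
    ratings = np.asarray(ratings)
    pairs = combinations(range(ratings.shape[0]), 2)
    scores = [cohen_kappa_score(ratings[i], ratings[j]) for i, j in pairs]
    return float(np.mean(scores))


# Hypothetical usefulness ratings from three annotators on five dialogues.
usefulness = [
    [3, 2, 1, 3, 2],
    [3, 1, 1, 3, 2],
    [2, 2, 1, 3, 1],
]
print(f"Average pairwise Kappa: {mean_pairwise_kappa(usefulness):.3f}")
```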
We observe an increase in the Kappa and Tau scores as the dialogue context increases from C0 to C7. Despite the lack of context in C0, there is a moderate level of agreement on the relevance of the current turn. With the introduction of more context in C3 and C7, agreement on the relevance of the current turn increases (see Table 1). Providing additional dialogue context thus seems to lead to higher consensus among annotators. This is likely due to a dataset characteristic: users tend to express their preferences early in the dialogue rather than in subsequent exchanges. Hence, in C0, which only includes the current turn, when the user's utterance is incomplete and lacks an explicit expression of their preference, annotators rate more dialogues as relevant than in C3 and C7. Overall, we conclude that when annotators have insufficient information to form a judgment, they tend to judge the system positively, introducing a positivity bias (Park et al., 2018).
We see in Table 1 (row 3) that despite the lack of context in C0, there is substantial agreement regarding the usefulness of the current turn. This is due to the availability of the user’s next utterance, which serves as direct feedback on the system’s response, resulting in higher agreement than for relevance assessment. As more context is provided, there is an even higher level of agreement among annotators regarding the usefulness of the current turn. Access to a short conversation history significantly improves agreement on usefulness.
Surprisingly, despite having access to the entire conversation history in C7, agreement is slightly lower than in C3. The complete dialogue context may introduce additional complexity or ambiguity in determining the usefulness of the current turn, particularly when the user's next utterance conflicts with the earlier dialogue context. For example, when the system repeats a recommendation the user has already watched or mentioned before, and the user nonetheless expresses an intent to watch the movie in the next utterance, annotators assign divergent labels. A similar trend is observed for the Tau correlations, although the values are lower than the Kappa scores.
Label consistency across conditions. We examine the impact of varying amounts of dialogue context on the consistency of crowdsourced labels across the three variations for relevance and usefulness and report the percentage of agreement in Figure 3. We observe moderate agreement (58.54%) between annotations of C0 and C3, suggesting that annotators demonstrate a degree of consistency in their assessments when provided with different amounts of context. This trend continues with C0 and C7, where the agreement increases slightly to 60.98%. The most notable increase is between C3 and C7 (68.29%). As annotators were exposed to progressively broader contextual information, their assessments became more consistent.
Usefulness behaves differently. We observe moderate agreement (41.71%) between C0 and C3, indicating a degree of consistency in annotator assessments within this range of context. A notable decrease in agreement is evident when comparing C3 and C7, down to 28.3%. The most substantial drop is observed between C0 and C7, yielding a mere 14.63% agreement. These findings emphasize the significant impact of context on the consistency of usefulness annotations. For usefulness assessment, providing annotators with a more focused context improves their agreement.
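The percentage of agreement reported here can be read as simple label overlap between two context conditions. A minimal sketch, assuming per-dialogue labels collected under each condition (the values below are hypothetical, not the study's data):

```python
import numpy as np


def percent_agreement(labels_a, labels_b):
    """Share of dialogues receiving the same label under two
    context conditions (e.g., C0 vs. C3), as a percentage."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    return 100.0 * np.mean(labels_a == labels_b)


# Hypothetical relevance labels for the same dialogues under C0 and C3.
c0 = [1, 1, 0, 1, 0, 1]
c3 = [1, 0, 0, 1, 0, 1]
print(f"C0 vs. C3: {percent_agreement(c0, c3):.2f}% agreement")
```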
With respect to RQ1, we note considerable differences in the labels assigned by annotators as we vary the amount of dialogue context. As the context expands, annotators incorporate more information into their assessments, resulting in context-specific labels. Annotator judgments are shaped not only by response quality but also by the broader conversation. This highlights the complexity of the task and the need for a carefully designed annotation methodology that considers contextual variations. These findings emphasize the significance of dialogue context in annotator decision-making.
This paper is available on arXiv under a CC BY 4.0 DEED license.