
Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.
2 Methodology and 2.1 Experimental data and tasks
2.2 Automatic generation of diverse dialogue contexts
3 Results and Analysis and 3.1 Data statistics
3.2 RQ1: Effect of varying amount of dialogue context
3.3 RQ2: Effect of automatically generated dialogue context
6 Conclusion, Limitations, and Ethical Considerations
7 Acknowledgements and References
In this work, we investigated the impact of varying the size and type of dialogue context on crowdsourced evaluation labels. In particular, we crowdsourced evaluation labels for two aspects: relevance and usefulness. Our findings reveal that the optimal context depends on the aspect under evaluation. For relevance, annotators tend to agree more on a label when they have access to the whole dialogue context. This does not hold for usefulness, however, where we observe high annotator agreement when only partial context is available. We show that a simple approach, providing an automatically generated user need through heuristics without revealing the entire dialogue, consistently increases annotator agreement across the two aspects. This implies that we can rely on automatic methods, such as LLMs, to improve the productivity of crowdworkers by reducing the amount of dialogue they have to read before evaluating the current response.
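To make this concrete, below is a minimal sketch of one way an automatically generated user need could be derived from the preceding user turns; the keyword cues, function names, and summary template are illustrative assumptions rather than the exact heuristic used in this work.

```python
# Illustrative heuristic for condensing a dialogue into a short "user need"
# statement shown to annotators instead of the full dialogue history.
# The cue list and output template are assumptions, not the paper's exact rules.

PREFERENCE_CUES = ("like", "love", "enjoy", "prefer", "looking for", "want")

def extract_user_need(dialogue):
    """dialogue: list of (speaker, utterance) tuples, e.g. ("USER", "I love sci-fi")."""
    cues = []
    for speaker, utterance in dialogue:
        if speaker != "USER":
            continue
        lowered = utterance.lower()
        if any(cue in lowered for cue in PREFERENCE_CUES):
            cues.append(utterance.strip())
    if not cues:
        return "User need (auto-generated): could not be inferred from earlier turns."
    # Present a compact need statement in place of the full dialogue context.
    return "User need (auto-generated): " + " ".join(cues)

if __name__ == "__main__":
    dialogue = [
        ("USER", "Hi, I'm looking for a movie for tonight."),
        ("SYSTEM", "Sure! Any genre in mind?"),
        ("USER", "I love sci-fi, something like Interstellar."),
        ("SYSTEM", "You might enjoy Arrival."),
    ]
    print(extract_user_need(dialogue))
```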
This study contributes towards understanding how LLMs can be integrated into the annotation process to ensure quality labels from crowdworkers. In this work we used the GPT-4 API, which is not open source. In future work, we will explore the use of open-source LLMs, such as Llama-chat (Touvron et al., 2023), to facilitate a more transparent and reproducible experimental framework.
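As an illustration of such a swap, the sketch below generates the same kind of user-need summary with an open-source chat model through the Hugging Face transformers pipeline; the checkpoint name, prompt wording, and generation settings are assumptions for illustration (the Llama-2 chat weights are gated and require access approval), not the configuration used in our experiments.

```python
# Sketch of replacing the closed GPT-4 API with an open-source chat model via
# the Hugging Face transformers pipeline. Checkpoint, prompt, and generation
# settings are illustrative assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed checkpoint; gated on the Hub
)

def summarize_user_need(dialogue_text: str) -> str:
    prompt = (
        "Summarize the user's information need in one sentence, "
        "based only on the dialogue below.\n\n"
        + dialogue_text
        + "\n\nUser need:"
    )
    output = generator(prompt, max_new_tokens=60, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the latter.
    return output[0]["generated_text"][len(prompt):].strip()
```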
In this work, we examined the effect of task design on crowdsourced evaluation labels, specifically the amount and type of context available to annotators. Nonetheless, our study faces some limitations: the absence of actual user ratings prevents us from claiming an optimal strategy for presenting previous dialogue history. Despite this limitation, we highlight the noteworthy observation of high label consistency in C7 for the relevance aspect and in C3 for the usefulness aspect, which served as our basis for comparison. It is important to note that our study is exploratory in nature and may therefore be data- or task-specific. To ensure the applicability and generalizability of our findings, further investigation is needed to ascertain the extent to which they can be extrapolated across different tasks and datasets.
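For reference, label consistency per context condition can be quantified with a standard agreement measure; the sketch below computes mean pairwise Cohen's kappa over toy labels, an illustrative choice that does not reproduce our reported analysis or scores.

```python
# Illustrative inter-annotator agreement for one context condition (e.g., C7).
# Pairwise Cohen's kappa and the toy labels are assumptions for demonstration only.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(labels_by_annotator):
    """labels_by_annotator: dict mapping annotator id -> list of labels per item."""
    scores = [
        cohen_kappa_score(labels_by_annotator[a], labels_by_annotator[b])
        for a, b in combinations(labels_by_annotator, 2)
    ]
    return sum(scores) / len(scores)

# Toy relevance labels (0/1/2) from three annotators on five responses.
toy = {
    "annotator_1": [2, 1, 0, 2, 1],
    "annotator_2": [2, 1, 0, 2, 2],
    "annotator_3": [2, 1, 1, 2, 1],
}
print(f"Mean pairwise Cohen's kappa: {mean_pairwise_kappa(toy):.2f}")
```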
Annotator diversity
All participants in this research were master workers recruited exclusively from the United States through Amazon Mechanical Turk (MTurk). While this selection ensured a level of language proficiency and familiarity with the context, the findings of this study may not generalize universally due to the specific demographic representation. The restriction to U.S.-based annotators limits cultural diversity and global perspectives, which may affect the external validity of the study.
Annotator bias
Despite the provision of detailed instructions and examples to annotators, potential biases may still arise during the evaluation process due to the annotators’ diverse backgrounds. Cultural biases may be more pronounced if annotators from different cultural backgrounds interpret movie preferences, relevance, or usefulness in divergent ways. Subjective biases may also stem from differing interpretations of the guidelines, as individuals from different backgrounds may hold distinct views on dimensions like “relevance” or “usefulness.”
To mitigate these potential biases, continuous monitoring and feedback mechanisms were incorporated into the study design. Additionally, the study refrained from disclosing the specific research angle to annotators to prevent potential biases related to the research objectives.
This research was supported by the Dreams Lab, a collaboration between Huawei Finland, the University of Amsterdam, and the Vrije Universiteit Amsterdam, by the Hybrid Intelligence Center, a 10-year program funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, https://hybrid-intelligence-centre.nl, by project LESSEN with project number NWA.1389.20.183 of the research program NWA ORC 2020/21, which is (partly) financed by the Dutch Research Council (NWO), and by the FINDHR (Fairness and Intersectional Non-Discrimination in Human Recommendation) project that received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070212.
All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.
Azzah Al-Maskari, Mark Sanderson, and Paul Clough. 2007. The relationship between IR effectiveness measures and user satisfaction. In Proceedings of the 30th Annual International Association for Computing Machinery SIGIR Conference on Research and Development in Information Retrieval, page 773–774, New York, NY, USA. Association for Computing Machinery.
Omar Alonso, Daniel E. Rose, and Benjamin Stewart. 2008. Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2):9–15.
Amazon Mechanical Turk. 2023. https://www.mturk.com.
Mayla Boguslav and Kevin Bretonnel Cohen. 2017. Inter-annotator agreement and the upper limit on machine performance: Evidence from biomedical natural language processing. In MEDINFO 2017: Precision Healthcare through Informatics - Proceedings of the 16th World Congress on Medical and Health Informatics, Hangzhou, China, 21-25 August 2017, volume 245 of Studies in Health Technology and Informatics, pages 298–302. IOS Press.
Adam Bouyamourn. 2023. Why LLMs hallucinate, and how to get (evidential) closure: Perceptual, intensional, and extensional learning for faithful natural language generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 3181–3193. Association for Computational Linguistics.
Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s GPT-2 - How can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22, Hong Kong. Association for Computational Linguistics.
Jean Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2023. A survey on evaluation of large language models. CoRR, abs/2307.03109.
Jan Deriu, Álvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2021. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev., 54(1):755–810.
Djellel Eddine Difallah, Elena Filatova, and Panos Ipeirotis. 2018. Demographics and dynamics of mechanical turk workers. In Proceedings of the Eleventh Association for Computing Machinery International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 135–143. Association for Computing Machinery.
Carsten Eickhoff. 2018. Cognitive biases in crowdsourcing. In Proceedings of the Eleventh Association for Computing Machinery International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 162–170. Association for Computing Machinery.
Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on large language models for relevance judgment. In Proceedings of the 2023 Association for Computing Machinery SIGIR International Conference on Theory of Information Retrieval, ICTIR 2023, Taipei, Taiwan, 23 July 2023, pages 39–50. Association for Computing Machinery.
Xiachong Feng, Xiaocheng Feng, and Bing Qin. 2022. A survey on dialogue summarization: Recent advances and new frontiers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5453–5460. ijcai.org.
Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89, Minneapolis, Minnesota. Association for Computational Linguistics.
Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. GRADE: Automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9230–9240, Online. Association for Computational Linguistics.
Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju. 2019. Understanding and mitigating worker biases in the crowdsourced collection of subjective judgments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, page 1–12, New York, NY, USA. Association for Computing Machinery.
Pritam Kadasi and Mayank Singh. 2023. Unveiling the multi-annotation process: Examining the influence of annotation quantity and instance difficulty on model performance. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 1371–1388. Association for Computational Linguistics.
Gabriella Kazai. 2011. In search of quality in crowdsourcing for search engine evaluation. In Advances in Information Retrieval - 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings, volume 6611 of Lecture Notes in Computer Science, pages 165–176. Springer.
Gabriella Kazai, Jaap Kamps, Marijn Koolen, and Natasa Milic-Frayling. 2011a. Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In Proceeding of the 34th International Association for Computing Machinery SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, pages 205–214. Association for Computing Machinery.
Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. 2011b. Worker types and personality traits in crowdsourcing relevance labels. In Proceedings of the 20th Association for Computing Machinery International Conference on Information and Knowledge Management, CIKM ’11, page 1941–1944, New York, NY, USA. Association for Computing Machinery.
Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. 2012. The face of quality in crowdsourcing relevance labels: Demographics, personality and labeling accuracy. In Proceedings of the 21st Association for Computing Machinery International Conference on Information and Knowledge Management, CIKM ’12, page 2583–2586, New York, NY, USA. Association for Computing Machinery.
Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. 2013. An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Information Retrieval, 16(2):138–178.
Christoph Kern, Stephanie Eckman, Jacob Beck, Rob Chew, Bolei Ma, and Frauke Kreuter. 2023. Annotation sensitivity: Training data collection methods affect model performance. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 14874–14886. Association for Computational Linguistics.
Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 Association for Computing Machinery on Conference on Human Information Interaction and Retrieval, CHIIR ’16, page 121–130, New York, NY, USA. Association for Computing Machinery.
Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: improved dialogue evaluation with optimized questions and multi-turn comparisons. CoRR, abs/1909.03087.
Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F. Chen, Zhengyuan Liu, and Diyi Yang. 2023. Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1487–1505. Association for Computational Linguistics.
Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 9748–9758.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Jiaxin Mao, Yiqun Liu, Ke Zhou, Jian-Yun Nie, Jingtao Song, Min Zhang, Shaoping Ma, Jiashen Sun, and Hengliang Luo. 2016. When does relevance mean usefulness and user satisfaction in web search? In Proceedings of the 39th International Association for Computing Machinery SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016, pages 463–472. Association for Computing Machinery.
Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707, Online. Association for Computational Linguistics.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana. Association for Computational Linguistics.
OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics.
Kunwoo Park, Meeyoung Cha, and Eunhee Rhim. 2018. Positivity bias in customer satisfaction ratings. In Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon, France, April 23-27, 2018, pages 631–638. Association for Computing Machinery.
Mihir Parmar, Swaroop Mishra, Mor Geva, and Chitta Baral. 2023. Don’t blame the annotator: Bias already starts in the annotation instructions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 1771–1781. Association for Computational Linguistics.
Weiping Pei, Zhiju Yang, Monchu Chen, and Chuan Yue. 2021. Quality control in crowdsourcing based on fine-grained behavioral features. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2):442:1–442:28.
Kevin Roitero, Eddy Maddalena, Stefano Mizzaro, and Falk Scholer. 2021. On the effect of relevance scales in crowdsourcing relevance assessments for information retrieval evaluation. Inf. Process. Manag., 58(6):102688.
Kevin Roitero, Michael Soprano, Shaoyang Fan, Damiano Spina, Stefano Mizzaro, and Gianluca Demartini. 2020. Can the crowd identify misinformation objectively?: The effects of judgment scale and assessor’s background. In Proceedings of the 43rd International Association for Computing Machinery SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, pages 439–448. Association for Computing Machinery.
Sashank Santhanam, Alireza Karduni, and Samira Shaikh. 2020. Studying the effects of cognitive biases in evaluation of conversational agents. In CHI ’20: CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, April 25-30, 2020, pages 1–13. Association for Computing Machinery.
Alexander Schmitt and Stefan Ultes. 2015. Interaction quality: Assessing the quality of ongoing spoken dialog interaction by experts - and how it relates to user satisfaction. Speech Commun., 74:12–36.
Clemencia Siro, Mohammad Aliannejadi, and Maarten de Rijke. 2022. Understanding user satisfaction with task-oriented dialogue systems. In Proceedings of the 45th International Association for Computing Machinery SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 2018–2023, New York, NY, USA. Association for Computing Machinery.
Clemencia Siro, Mohammad Aliannejadi, and Maarten de Rijke. 2023. Understanding and predicting user satisfaction with conversational recommender systems. ACM Transactions on Information Systems, 42(2):Article 55.
Weiwei Sun, Shuo Zhang, Krisztian Balog, Zhaochun Ren, Pengjie Ren, Zhumin Chen, and Maarten de Rijke. 2021. Simulating user satisfaction for the evaluation of task-oriented dialogue systems. In Proceedings of the 44th International Association for Computing Machinery SIGIR Conference on Research and Development in Information Retrieval, page 2499–2506, New York, NY, USA. Association for Computing Machinery.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Chien-Sheng Wu, Steven C. H. Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 917–929. Association for Computational Linguistics.
This paper is available on arXiv under the CC BY 4.0 DEED license.