
Why LLMs Are More Accurate and Cost-Effective Than Human Fact-Checkers

by Language Models (dot tech) · 5 min read · April 8th, 2025

Too Long; Didn't Read

SAFE, an LLM-based fact-checking tool, outperforms crowdsourced human annotators, with 76% accuracy on disagreement cases, while being more than 20× cheaper. It shows how LLM agents can replace human raters in evaluating long-form factuality.


Abstract and 1 Introduction

2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality

3 SAFE: LLM agents as factuality autoraters

4 LLM agents can be better factuality annotators than humans

5 F1@k: Extending F1 with recall from human-preferred length

6 Larger LLMs are more factual

7 Related Work

8 Limitations

9 Conclusion, Acknowledgments, Author Contribution, and References

Appendix

A. Frequently asked questions

B. LongFact details

C. SAFE details

D. Metric details

E. Further analysis

4 LLM AGENTS CAN BE BETTER FACTUALITY ANNOTATORS THAN HUMANS

LLMs have achieved superhuman performance on reasoning benchmarks (Bubeck et al., 2023; Gemini Team, 2023; Li et al., 2022) and higher factuality than humans on summarization tasks (Pu et al., 2023). The use of human annotators, however, is still prevalent in language-model research, and it often creates a bottleneck because human annotators are expensive and vary widely in both evaluation speed and quality. For these reasons, we investigate how SAFE compares to human annotations and whether it can replace human raters in evaluating long-form factuality.


Figure 4: On a set of 16,011 individual facts, SAFE annotations agree with 72.0% of human annotations.


To quantitatively evaluate the quality of annotations obtained using SAFE, we leverage crowdsourced human annotations released from Min et al. (2023). These data contain 496 prompt–response pairs where responses have been manually split into individual facts (for a total of 16,011 individual facts), and each individual fact has been manually labeled as either supported, irrelevant, or not supported. For each individual fact, we obtain an annotation from SAFE by running the SAFE pipeline, starting from revising the fact to be self-contained (i.e., steps 2–4 in Figure 1).[6] We then compare SAFE annotations with human annotations at the individual-fact level.
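As a concrete illustration, the agreement number in Figure 4 reduces to a simple per-fact comparison. The sketch below is not the authors' evaluation code; it assumes the SAFE and human labels are available as hypothetical parallel lists, one label per individual fact.

```python
# Minimal sketch (not the authors' code): per-fact agreement between SAFE
# labels and the crowdsourced labels from Min et al. (2023). Each list holds
# one label per individual fact: "supported", "irrelevant", or "not supported".
def agreement_rate(safe_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of individual facts on which SAFE and humans give the same label."""
    assert len(safe_labels) == len(human_labels)
    matches = sum(s == h for s, h in zip(safe_labels, human_labels))
    return matches / len(safe_labels)

# 72.0% agreement on 16,011 facts corresponds to roughly 11,500 matching labels.
```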


Figure 5: SAFE outperforms human annotators while being more than 20× cheaper. Left: on a randomly-sampled subset of 100 individual facts where the annotation from SAFE and humans disagreed, SAFE annotations were correct 76% of the time (we use “researcher + full internet access” labels as the ground truth). Right: rating a single model response costs $4.00 using human annotations (Min et al., 2023), compared to $0.19 using SAFE configured with GPT-3.5-Turbo and Serper API.


First, we directly compare each individual fact’s SAFE annotation and human annotation, finding that SAFE agrees with humans on 72.0% of individual facts (see Figure 4). This indicates that SAFE achieves human-level performance on a majority of individual facts. We then examine a randomly-sampled subset of 100 individual facts for which the annotation from SAFE disagreed with the annotation from human raters. We manually re-annotate each of these facts (allowing access to Google Search rather than only Wikipedia for more-comprehensive annotations) and use these labels as the ground truth. We found that on these disagreement cases, SAFE annotations were correct 76% of the time, whereas human annotations were correct only 19% of the time (see Figure 5), a 4-to-1 win ratio for SAFE. Examples of correct and incorrect SAFE ratings are shown in Appendix C.2 and Appendix C.3, respectively. Appendix A.3 analyzes failure causes for SAFE, and Appendix A.4 analyzes presumed failure causes for human annotators.
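The disagreement analysis can be sketched the same way. Here the relabel callable is a hypothetical stand-in for the manual re-annotation with full internet access described above, and the label lists are the same assumed structures as in the previous sketch.

```python
import random

# Sketch of the disagreement analysis (hypothetical `relabel` stands in for the
# manual re-annotation with Google Search access used as ground truth).
def win_rates_on_disagreements(safe, human, relabel, sample_size=100, seed=0):
    """Return (SAFE accuracy, human accuracy) on a random sample of facts
    where the SAFE label and the human label disagree."""
    disagreements = [i for i, (s, h) in enumerate(zip(safe, human)) if s != h]
    sampled = random.Random(seed).sample(disagreements, sample_size)
    truth = {i: relabel(i) for i in sampled}          # ground-truth labels
    safe_acc = sum(safe[i] == truth[i] for i in sampled) / sample_size
    human_acc = sum(human[i] == truth[i] for i in sampled) / sample_size
    return safe_acc, human_acc                        # paper reports 0.76 vs. 0.19
```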


To rate all individual facts from the 496 prompt–response pairs, SAFE issued GPT-3.5-Turbo API calls costing $64.57 and Serper API calls costing $31.74, for a total of $96.31, or $0.19 per model response. For comparison, Min et al. (2023) reported a cost of $4.00 per model response to obtain crowdsourced human annotations for the same dataset. This means that in addition to outperforming human raters, SAFE is also more than 20× cheaper than crowdsourced human annotation. Overall, the superior performance of SAFE demonstrates that language models can be used as scalable autoraters with the potential to outperform crowdsourced human annotators.
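As a quick sanity check on the arithmetic, the reported per-response cost and the savings factor follow directly from the totals above:

```python
# Back-of-the-envelope check of the reported costs (all figures from the text).
gpt_cost, serper_cost, responses = 64.57, 31.74, 496

total = gpt_cost + serper_cost        # 96.31
per_response = total / responses      # ~0.194 -> reported as $0.19
savings = 4.00 / per_response         # ~20.6  -> "more than 20x cheaper"
print(f"${total:.2f} total, ${per_response:.2f}/response, {savings:.1f}x cheaper")
```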

5 F1@K: EXTENDING F1 WITH RECALL FROM HUMAN-PREFERRED LENGTH





Figure 6: Long-form factuality performance of thirteen large language models across four model families (Gemini, GPT, Claude, and PaLM-2), sorted by F1@K in descending order. F1@K is averaged over SAFE (Section 3) evaluation results on model responses to the same set of 250 random prompts from LongFact-Objects, and is calculated at K = 64 (the median number of facts among all model responses) and K = 178 (the maximum number of facts among all model responses). The ranking of models remains relatively stable at sufficiently large K (more discussion in Appendix E.3). Raw metrics are shown in Table 2.
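The two K values used in Figure 6 are simply the median and the maximum of the per-response fact counts; a minimal sketch, assuming those counts are available as a plain list:

```python
import statistics

# Sketch: the two K values in Figure 6 come from the distribution of fact
# counts across all model responses (`fact_counts` is a hypothetical list
# with one entry per response).
def k_values(fact_counts: list[int]) -> tuple[int, int]:
    """Median and maximum number of facts, i.e., K = 64 and K = 178 in Figure 6."""
    return int(statistics.median(fact_counts)), max(fact_counts)
```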



Finally, we combine factual precision and recall using standard F1, as shown in Equation (1). The proposed metric is bounded between [0, 1], where a higher number indicates a more-factual response. For a response to receive a full score of 1, it must not contain any not-supported facts and must have at least K supported facts. F1@K also follows expectations for many clear hypothetical cases, as shown in Appendix D.2, and we contend that it allows for standardized and quantifiable comparisons between language models and between methods of obtaining responses.

$$\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)}, \qquad R_K(y) = \min\!\left(\frac{S(y)}{K},\, 1\right)$$

$$F_1@K(y) = \begin{cases} \dfrac{2 \cdot \mathrm{Prec}(y) \cdot R_K(y)}{\mathrm{Prec}(y) + R_K(y)}, & \text{if } S(y) > 0, \\ 0, & \text{if } S(y) = 0. \end{cases}$$

Equation (1): F1@K measures the long-form factuality of a model response y given the number of supported facts S(y) and the number of not-supported facts N(y) that are in y. The hyperparameter K indicates the number of supported facts required for a response to achieve full recall. We discuss how K should be selected in Appendix D.3.
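Putting the pieces together, the metric can be sketched in a few lines of Python. This follows Equation (1) as written above; it is not the authors' released implementation, and the example numbers below are hypothetical.

```python
# Minimal sketch of Equation (1); not the authors' released implementation.
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K for a response with S(y) = supported and N(y) = not_supported facts;
    K is the number of supported facts needed for full recall."""
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)   # Prec(y)
    recall = min(supported / k, 1.0)                       # R_K(y)
    return 2 * precision * recall / (precision + recall)
```

For instance, a hypothetical response with 50 supported and 10 not-supported facts scores f1_at_k(50, 10, 64) ≈ 0.81 but only ≈ 0.42 at K = 178, which is why a sufficiently large K rewards longer, fact-dense responses.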


This paper is available on arXiv under a CC BY 4.0 license.


[6] We show performance of step 1 of SAFE (splitting a response into individual facts) in Appendix C.4.


Authors:

(1) Jerry Wei, Google DeepMind (lead contributor);

(2) Chengrun Yang, Google DeepMind (lead contributor);

(3) Xinying Song, Google DeepMind (lead contributor);

(4) Yifeng Lu, Google DeepMind (lead contributor);

(5) Nathan Hu, Google DeepMind and Stanford University;

(6) Jie Huang, Google DeepMind and University of Illinois at Urbana-Champaign;

(7) Dustin Tran, Google DeepMind;

(8) Daiyi Peng, Google DeepMind;

(9) Ruibo Liu, Google DeepMind;

(10) Da Huang, Google DeepMind;

(11) Cosmo Du, Google DeepMind;

(12) Quoc V. Le, Google DeepMind.