
Challenges in Using Google Search for Factuality Verification

by Language Models (dot tech), April 9th, 2025

Too Long; Didn't Read

This section discusses limitations in both LongFact and SAFE, including the reliance on LLM capabilities, the potential issues with Google Search as a knowledge source, and challenges with repeated facts in the F1@K metric. Future work aims to improve the efficiency and reliability of these systems.


Abstract and 1 Introduction

2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality

3 SAFE: LLM agents as factuality autoraters

4 LLM agents can be better factuality annotators than humans

5 F1@K: Extending F1 with recall from human-preferred length

6 Larger LLMs are more factual

7 Related Work

8 Limitations

9 Conclusion, Acknowledgments, Author Contribution, and References

Appendix

A. Frequently asked questions

B. LongFact details

C. SAFE details

D. Metric details

E. Further analysis

8 LIMITATIONS

Both LongFact and SAFE rely on LLMs to function, so the capabilities of the underlying model (especially instruction following and reasoning, but also abilities such as creativity) directly affect the quality of the generated LongFact prompts and of SAFE's evaluations. With respect to generating LongFact, a language model that cannot follow instructions may generate prompts that are not within the given topic or that do not solicit long-form responses. Because the model outputs needed to generate LongFact are relatively short, however, we use GPT-4 to maximize the quality of the generated prompts. In SAFE, a weak language model may (a) decompose a long-form response into individual facts that contain more than one factual statement, (b) exhibit poor reasoning when determining the relevance of an individual fact to the prompt, (c) propose unrelated search queries to verify a fact, or (d) fail to reason about whether a fact is supported by search results. For this reason, our configuration of SAFE uses GPT-3.5-Turbo, which provides strong performance in most cases at low cost, though it still exhibits some failures due to model capability (further examined in Appendix A.3). Future work may examine whether finetuning a cost-effective language model to perform the steps in SAFE can reduce failures while maintaining a low inference-time cost.
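To make failure modes (a)-(d) concrete, the following is a minimal Python sketch of a SAFE-style rating loop based on the description above. The `call_llm` and `google_search` helpers are hypothetical placeholders rather than the paper's actual implementation, and the sketch issues a single search query per fact for brevity.

```python
# Minimal sketch of a SAFE-style verification loop (not the released code).
# `call_llm` and `google_search` are hypothetical placeholders standing in
# for the rater LLM (e.g., GPT-3.5-Turbo) and a search backend.
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to the rater LLM and return its reply."""
    raise NotImplementedError


def google_search(query: str) -> List[str]:
    """Placeholder: return search-result snippets for a query."""
    raise NotImplementedError


def rate_response(prompt: str, response: str) -> List[dict]:
    ratings = []
    # (a) Decompose the long-form response into individual, self-contained facts.
    facts = call_llm(
        f"Split the following response into self-contained facts, one per line:\n{response}"
    ).splitlines()
    for fact in facts:
        # (b) Check whether the fact is relevant to the original prompt.
        relevant = call_llm(
            f"Prompt: {prompt}\nFact: {fact}\nIs this fact relevant? (yes/no)"
        ).strip().lower().startswith("yes")
        if not relevant:
            ratings.append({"fact": fact, "label": "irrelevant"})
            continue
        # (c) Propose a search query and gather evidence.
        query = call_llm(f"Write a Google Search query to verify: {fact}")
        snippets = google_search(query)
        # (d) Reason about whether the evidence supports the fact.
        verdict = call_llm(
            f"Fact: {fact}\nEvidence: {snippets}\n"
            f"Is the fact supported? (supported/not supported)"
        ).strip().lower()
        ratings.append({"fact": fact, "label": verdict})
    return ratings
```

A weak rater model can fail at any of the four commented steps, which is why the choice of underlying LLM matters.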


SAFE also has an inherent weakness: it relies on Google Search as a knowledge source for obtaining ground truths, which may not suffice in corner cases. For example, Google Search may not easily surface information about some facts, and it may lack depth in expert-level domains such as law and medicine. At the same time, however, Google Search offers access to the entire internet and is thus arguably the most comprehensive knowledge source for open-domain factuality tasks that is also widely accessible. We mitigate this concern by reporting our labels as “supported” or “not supported” by Google Search results, rather than attempting to label facts as globally factual or non-factual (for discussion on label selection, see Appendix A.2), though future work may investigate whether there are better proxies for global factuality than verification by Google Search. We discuss other limitations of, and avenues of exploration for, SAFE in Appendix C.6.
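As a small illustration of this reporting choice, the sketch below tallies per-fact labels that are explicitly relative to Google Search results rather than to global factuality. The `Label` enum and `summarize` helper are illustrative names (building on the hypothetical sketch above), not part of the released code.

```python
from enum import Enum


class Label(Enum):
    # Labels are relative to retrieved Google Search results,
    # not claims of global factuality or non-factuality.
    SUPPORTED = "supported"          # backed by retrieved search results
    NOT_SUPPORTED = "not supported"  # not backed by retrieved search results
    IRRELEVANT = "irrelevant"        # not pertinent to the original prompt


def summarize(ratings: list) -> dict:
    """Count facts per label, e.g., to feed a precision/recall computation."""
    counts = {label: 0 for label in Label}
    for rating in ratings:
        counts[Label(rating["label"])] += 1
    return counts
```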


Furthermore, while F1@K is a robust aggregation of factual precision and recall, it operates under the assumption that a response does not contain repeated facts. While this is a reasonable assumption in practice, it also opens the metric to being gamed by repeating a supported fact. As stated in Appendix D.1, however, we contend that repetition of facts is better measured by other metrics such as the fluency or usefulness of model responses. If desired, a deduplication step could also be added to SAFE, though we leave this for future research due to the added complexity required to account for this case.
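For concreteness, here is a minimal sketch of F1@K with an optional deduplication step of the kind suggested above. It assumes Section 5's definitions (precision over supported versus not-supported facts, recall capped once K supported facts are reached) and uses naive exact-string matching as a stand-in for a real duplicate-detection step.

```python
def f1_at_k(facts: list, k: int, deduplicate: bool = False) -> float:
    """Sketch of F1@K over per-fact labels produced by a SAFE-style rater.

    Assumes: precision = supported / (supported + not supported),
    recall = min(supported / K, 1); irrelevant facts are ignored.
    The optional deduplication is illustrative only.
    """
    if deduplicate:
        seen, unique = set(), []
        for fact in facts:
            key = fact["fact"].strip().lower()  # naive duplicate detection
            if key not in seen:
                seen.add(key)
                unique.append(fact)
        facts = unique

    supported = sum(1 for f in facts if f["label"] == "supported")
    not_supported = sum(1 for f in facts if f["label"] == "not supported")
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)
```

Without deduplication, repeating a supported fact inflates both counts and recall toward K, which is exactly the gaming behavior discussed above.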


This paper is available on arxiv under a CC BY 4.0 DEED license.


Authors:

(1) Jerry Wei, Google DeepMind and a lead contributor;

(2) Chengrun Yang, Google DeepMind and a lead contributor;

(3) Xinying Song, Google DeepMind and a lead contributor;

(4) Yifeng Lu, Google DeepMind and a lead contributor;

(5) Nathan Hu, Google DeepMind and Stanford University;

(6) Jie Huang, Google DeepMind and University of Illinois at Urbana-Champaign;

(7) Dustin Tran, Google DeepMind;

(8) Daiyi Peng, Google DeepMind;

(9) Ruibo Liu, Google DeepMind;

(10) Da Huang, Google DeepMind;

(11) Cosmo Du, Google DeepMind;

(12) Quoc V. Le, Google DeepMind.