2 minute read

Today’s paper “What Evidence Do Language Models Find Convincing?” explores what types of evidence and argumentation techniques language models find convincing when presented with ambiguous, open-domain questions that have conflicting answers online. For example, for the question “does aspartame cause cancer?” there are arguments and evidence supporting both yes and no answers. The goal is to study why models might be convinced by certain real-world evidence paragraphs over others.

Method Overview

The authors construct a high-quality dataset called ConflictingQA for studying model perceptions of convincing arguments when presented with ambiguous questions and conflicting real-world evidence. For over 100 different categories spanning science, technology, philosophy and more, they use GPT-4 to generate controversial yes/no questions around debated topics. Duplicates are removed manually.
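To make this construction step concrete, here is a minimal sketch of what such a generation loop could look like. The category list, prompt wording, and the `ask_gpt4` helper are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
# Illustrative sketch of the question-generation step (not the paper's exact prompts).
# `ask_gpt4` is an assumed helper that sends a prompt to GPT-4 and returns its text reply.

QUESTION_PROMPT = (
    "List 10 controversial yes/no questions about {category} that people "
    "actively debate online. Return one question per line."
)

def generate_questions(ask_gpt4, categories):
    questions = set()  # drops exact duplicates; the authors also deduplicate manually
    for category in categories:
        reply = ask_gpt4(QUESTION_PROMPT.format(category=category))
        questions.update(line.strip() for line in reply.splitlines() if line.strip())
    return sorted(questions)
```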

For each question, the authors use search engine retrieval to collect a diverse set of top web results containing arguments and evidence for both “yes” and “no” answers. Using GPT-4, they rewrite each question as an affirmative and a negative statement; for example, “is aspartame safe?” is transformed into “aspartame is safe” and “aspartame is harmful”. They then extract paragraphs from each retrieved webpage and label each paragraph’s stance with an ensemble of GPT-4-1106-preview and claude-instant-v1, keeping only the paragraphs where the two models agree on the label.
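Below is a rough sketch of this stance-labeling ensemble under a few assumptions: `ask_model(model, prompt)` stands in for the actual API clients, and the prompt wording is illustrative rather than the paper's.

```python
# Illustrative sketch of the stance-labeling ensemble. `ask_model(model, prompt)`
# is an assumed helper wrapping the respective model APIs; prompts are illustrative.

STANCE_PROMPT = (
    "Question: {question}\n"
    "Paragraph: {paragraph}\n"
    "Does this paragraph support a YES or a NO answer to the question? "
    "Reply with a single word: YES or NO."
)

def label_stance(ask_model, model: str, question: str, paragraph: str) -> str:
    reply = ask_model(model, STANCE_PROMPT.format(question=question, paragraph=paragraph))
    return reply.strip().upper()

def filter_by_agreement(ask_model, question: str, paragraphs: list[str]) -> list[dict]:
    kept = []
    for paragraph in paragraphs:
        label_a = label_stance(ask_model, "gpt-4-1106-preview", question, paragraph)
        label_b = label_stance(ask_model, "claude-instant-v1", question, paragraph)
        # Keep only paragraphs where both models agree on the stance label.
        if label_a == label_b and label_a in {"YES", "NO"}:
            kept.append({"question": question, "paragraph": paragraph, "stance": label_a})
    return kept
```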


In this way, they obtain the ConflictingQA dataset, which contains a set of controversial questions, each paired with supporting evidence for both the positive and the negative answer.

They study convincingness using the notion of a paragraph win rate: when a model is shown two conflicting paragraphs, the win rate measures how often it predicts the answer that aligns with a given paragraph’s stance. This metric lets them rigorously compare different evidence paragraphs.
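As a concrete illustration, here is one way the win rate could be computed from pairwise judgments. The field names (`yes_id`, `no_id`, `answer`) are assumptions for the sketch, not the paper's data format.

```python
from collections import defaultdict

def paragraph_win_rates(pairings: list[dict]) -> dict[str, float]:
    """Each pairing records the ids of a 'yes' and a 'no' paragraph shown together,
    plus the model's final answer ('YES' or 'NO'). A paragraph wins a pairing when
    the model's answer matches that paragraph's stance."""
    wins, shown = defaultdict(int), defaultdict(int)
    for pairing in pairings:
        for para_id, stance in ((pairing["yes_id"], "YES"), (pairing["no_id"], "NO")):
            shown[para_id] += 1
            wins[para_id] += pairing["answer"] == stance
    return {para_id: wins[para_id] / shown[para_id] for para_id in shown}
```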

Results

The authors first compute the paragraph win rates and then try to explain why models pick a particular paragraph over others. To do so, they check how various automatic text metrics correlate with the win rate. The results are shown below:

[Figure: correlation between automatic text metrics and paragraph win rate]

This analysis reveals that the Flesch-Kincaid readability score and the number of unique tokens do not correlate with convincingness. In contrast, question-paragraph embedding similarity correlates strongly with the win rate for all models except GPT-4, while n-gram overlap shows only a weak correlation with the win rate.
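This kind of correlation check is straightforward to reproduce. The sketch below uses Spearman rank correlation as an example; the paper's exact statistic and metric implementations may differ.

```python
from scipy.stats import spearmanr

def metric_win_rate_correlation(win_rates: dict[str, float],
                                metric_values: dict[str, float]) -> float:
    """Rank correlation between a surface metric (e.g. question-paragraph embedding
    similarity or unique-token count) and the per-paragraph win rate."""
    ids = sorted(win_rates)
    rho, _ = spearmanr([metric_values[i] for i in ids],
                       [win_rates[i] for i in ids])
    return rho
```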

Furthermore, the authors make a series of changes to the paragraphs to see how this affects the win rate. Specifically, they ask an LLM to rewrite each paragraph with a particular goal in mind: stylistic changes such as adding more information, adding scientific references, or making the text sound more objective, and relevance changes such as rewriting the text to better match the question or adding more keywords.

Here’s an example:

[Figure: an example paragraph and its LLM-rewritten version]
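To illustrate this perturbation step, here is a minimal sketch of how a single rewrite could be requested. The goal descriptions are paraphrases of the categories mentioned above, and the `ask_model` helper and prompt wording are assumptions rather than the paper's actual rewrite prompts.

```python
# Illustrative rewrite-perturbation step; `ask_model(model, prompt)` is an assumed helper
# and the goal descriptions are paraphrases, not the paper's exact prompts.

REWRITE_GOALS = {
    "add_references": "Add citations to scientific-sounding references.",
    "more_objective": "Make the tone sound more neutral and objective.",
    "add_keywords": "Include more keywords from the question.",
}

def rewrite_paragraph(ask_model, model: str, paragraph: str, question: str, goal: str) -> str:
    prompt = (
        f"Rewrite the following paragraph. {REWRITE_GOALS[goal]} "
        f"Keep its stance on the question '{question}' unchanged.\n\n{paragraph}"
    )
    return ask_model(model, prompt)
```

The rewritten paragraphs are then re-scored with the same win-rate procedure as before, so any change in win rate can be attributed to the edit.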

The results are shown below:

[Figure: effect of each type of paragraph change on the win rate]

As we can see above, stylistic changes tend to have a neutral to negative effect, while relevance-based changes significantly improve the win rate.

Conclusion

This paper introduces ConflictingQA, a dataset of controversial questions paired with both positive and negative supporting evidence. The dataset is used to gain insight into what LLMs find convincing and which information they prefer when answering a question. More details are in the full paper.

Congrats to the authors for their work!

Code available at: https://github.com/AlexWan0/rag-convincingness

Wan, Alexander, Eric Wallace, and Dan Klein. “What Evidence Do Language Models Find Convincing?.” arXiv preprint arXiv:2402.11782 (2024).
