1 minute read

In this new paper from Google DeepMind, the authors study the significance of premise ordering in different reasoning tasks when using large language models (LLMs). Despite the logical axiom that the sequence of premises should not affect the validity of a conclusion, this study uncovers a sensitivity of LLMs to the arrangement of these premises. Let’s consider the following example to make things more clear:

  1. If A then B.

  2. If B then C.

  3. A is True.

We can derive that C is True, independent of the order. The paper sets two benchmarks, one for logical reasoning and one for mathematical reasoning.

Logical Reasoning

For testing the effect of premise ordering on logical reasoning, the authors start from SimpleLogic (Zhang et al., 2022) and synthetically generate variants with different premise orders. There are three types of cases: forward order which denotes that the order conforms with the ground truth proof with forward chaining, reverse order which is the reverse of forward and random order.

Logical reasoning example

R-GSM for Mathematical Reasoning

A new benchmark, R-GSM, built upon the GSM8K dataset, is introduced to examine the impact of premise ordering on LLM performance in mathematical reasoning. The findings reveal that LLMs, including state-of-the-art models, display a pronounced drop in accuracy when the premises are not presented in the “natural” order of reasoning steps. This sensitivity poses significant implications for how we understand and deploy LLMs in reasoning tasks.

Reordering the premises has a direct influence on the performance


For logical reasoning, LLMs achieve optimal performance when premises align with the order found in the ground-truth proof (forward order). Any deviation from this sequence results in a notable degradation of the model’s reasoning capability, with performance dips of over 30% in some cases.

Logical reasoning results

The same drop in performance is observed for mathematical reasoning:

Mathematical reasoning results


Premise order significantly affects the performance of LLMs on reasoning tasks even though this order does not change the underlying task. More details in the paper.

Congrats to authors for their work!

Chen, Xinyun et al. “Premise Order Matters in Reasoning with Large Language Models.” (2024).