
Understanding context is one of the key requirements for a large language model. Although recent LLMs demonstrate remarkable capabilities, evaluation has increasingly emphasized their problem-solving quality rather than their ability to comprehend context. A new study from Georgetown University and Apple introduces a context understanding benchmark consisting of four tasks and nine datasets:

Tasks and datasets in the context understanding benchmark

Coreference Resolution

The Coreference Resolution task involves identifying all expressions in a text that refer to the same entity, which is essential for building a coherent understanding of the overall message conveyed in the text. Two datasets are used in this benchmark: WSC273 and OntoNotes 5.0. Here’s an example of what the model receives as input to measure performance on the coreference resolution task:

Coreference Example from OntoNotes 5.0 dataset
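As a rough illustration of how a zero-shot prompt like the one above could be assembled, here is a minimal Python sketch. The template, field names, and candidate format are assumptions for illustration, not the paper's exact prompt:

```python
# Hedged sketch of a zero-shot coreference prompt, loosely modeled on the
# OntoNotes-style example above. The exact template used in the paper is
# not reproduced here; this wording is an assumption.

def build_coref_prompt(document: str, pronoun: str, candidates: list[str]) -> str:
    """Ask the model which candidate mention the pronoun refers to."""
    options = "\n".join(f"- {c}" for c in candidates)
    return (
        "Read the passage and answer the question.\n\n"
        f"Passage: {document}\n\n"
        f"Question: In the passage, who or what does \"{pronoun}\" refer to?\n"
        f"Candidate mentions:\n{options}\n"
        "Answer with one of the candidate mentions."
    )

print(build_coref_prompt(
    document="The city council denied the demonstrators a permit because they feared violence.",
    pronoun="they",
    candidates=["The city council", "the demonstrators"],
))
```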

Looking at the results, GPT-3.5-Turbo usually performs best, with LLaMA 30B not far behind. LLMs can generally handle simple coreference relationships (WSC273), but they struggle on the more complex documents in OntoNotes. For all tasks, FT denotes the result of finetuning on the task and serves as a reference point.

Coreference Resolution Results

Implicit Discourse Relation Classification

This task assesses a model’s ability to identify the ways in which different segments of text are connected to one another and how they are structured to convey a coherent and meaningful message.

Implicit Discourse Relation Classification Example
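To make the task concrete, here is a hedged sketch of a classification-style prompt. The label set shown is the commonly used top-level PDTB sense inventory; the paper's exact prompt wording and label granularity may differ:

```python
# Hedged sketch of an implicit discourse relation classification prompt.
# The four labels below are the standard top-level PDTB senses; whether
# the paper uses exactly this label set and wording is an assumption.

LABELS = ["Comparison", "Contingency", "Expansion", "Temporal"]

def build_discourse_prompt(arg1: str, arg2: str) -> str:
    """Ask the model to pick the relation implicitly holding between two spans."""
    return (
        "Two adjacent text segments are given. Choose the discourse relation "
        f"that implicitly holds between them: {', '.join(LABELS)}.\n\n"
        f"Segment 1: {arg1}\n"
        f"Segment 2: {arg2}\n"
        "Relation:"
    )

print(build_discourse_prompt(
    arg1="The company missed its earnings target.",
    arg2="Its stock fell sharply the next morning.",
))
```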

For this task, there is a notable jump in score for models larger than 7B, but even GPT-3.5-Turbo falls significantly short of finetuned models:

Implicit Discourse Relation Classification Results

Dialogue State Tracking

The goal of Dialogue State Tracking is to track key information provided by the user as the conversation progresses. The dataset used to evaluate this task is MultiWOZ, and an example prompt is shown below:

Dialogue State Tracking example
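For context on how such predictions are typically scored, below is a minimal sketch of joint goal accuracy, the standard MultiWOZ metric: a turn counts as correct only if the entire predicted dialogue state matches the gold state. Whether this matches the paper's exact scoring script is an assumption:

```python
# Minimal sketch of joint goal accuracy (JGA) for dialogue state tracking.
# A turn is correct only if every predicted slot-value pair matches gold.

def joint_goal_accuracy(predicted: list[dict], gold: list[dict]) -> float:
    """Fraction of turns whose full predicted state matches the gold state."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0

pred = [{"hotel-area": "centre"}, {"hotel-area": "centre", "hotel-stars": "4"}]
gold = [{"hotel-area": "centre"}, {"hotel-area": "centre", "hotel-stars": "5"}]
print(joint_goal_accuracy(pred, gold))  # 0.5 -- only the first turn matches
```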

The results follow the same trend as Coreference Resolution, with GPT-3.5-Turbo performing the best:

Dialogue State Tracking results

Query Rewriting

The Query Rewriting task is defined as rewriting the last user utterance in a conversation into a standalone query that can be interpreted without any additional context. Testing is conducted on five datasets.

Query Rewriting example
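A rough sketch of how such a prompt could be built from the conversation history and the final utterance is shown below; the template is illustrative, not the paper's exact one:

```python
# Hedged sketch of a query rewriting prompt: the conversation history plus
# the final user utterance are given, and the model must rewrite that
# utterance so it stands on its own. The wording here is an assumption.

def build_rewrite_prompt(history: list[str], last_utterance: str) -> str:
    """Ask the model to make the last user utterance self-contained."""
    context = "\n".join(history)
    return (
        "Rewrite the final user utterance so that it can be understood "
        "without the rest of the conversation.\n\n"
        f"Conversation:\n{context}\n"
        f"Final utterance: {last_utterance}\n"
        "Rewritten utterance:"
    )

print(build_rewrite_prompt(
    history=["User: Who directed Inception?", "Assistant: Christopher Nolan."],
    last_utterance="User: What else has he directed?",
))
```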

For this task, there is a significant gap between small and large models, with smaller models performing far worse; GPT-3.5-Turbo again performs the best.

Query Rewriting results

Conclusion

This paper introduces a context understanding benchmark designed to thoroughly assess the performance of LLMs. For more information, please consult the full paper: https://huggingface.co/papers/2402.00858. Kudos to the authors for their work!
