
In the ever-evolving landscape of Large Language Models (LLMs), the need for scalable and accurate evaluation metrics is more pressing than ever. In this post, we look at SemScore, a new evaluation metric that uses Semantic Textual Similarity (STS) to compare model outputs directly against gold-standard responses. This approach promises to streamline the assessment of instruction-tuned LLMs and aligns closely with human judgment, marking a significant step forward in automated LLM evaluation.

The SemScore Advantage

SemScore aims to address the limitations of existing evaluation metrics such as BLEU and ROUGE, which often struggle with the nuanced outputs of instruction-tuned models. For instance, in the example below, a human evaluator rates the LLM response as very high quality, yet its BLEU and ROUGE scores are low because the generation shares few n-grams with the expected target response.

Limitations of current evaluation metrics such as BLEU and ROUGE
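To make the n-gram overlap problem concrete, here is a minimal sketch of a clipped n-gram precision, the kind of counting that BLEU-style metrics build on. It is a simplification for illustration and not the full BLEU or ROUGE computation; the function names and the two example sentences are made up for this post.

```python
from collections import Counter


def ngrams(text, n):
    """Count the word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped to the reference count, as in BLEU-style precision.
    This is a simplified illustration, not the full BLEU or ROUGE metric.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical example pair: same meaning, almost no shared wording.
reference = "the cat sat on the mat"
paraphrase = "a feline was resting upon the rug"

print(ngram_precision(paraphrase, reference, 1))  # unigram overlap
print(ngram_precision(paraphrase, reference, 2))  # bigram overlap
```

The paraphrase shares only the word "the" with the reference and no bigrams at all, so any metric built purely on such counts scores it poorly even though the meaning is preserved.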

The idea behind SemScore is simple yet effective: embed both the model's generation and the target response, then compute the cosine similarity between the two embeddings.
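The computation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `semscore_like` and `toy_embed` are hypothetical names, and `toy_embed` is a crude letter-count placeholder so the example runs with the standard library alone. In practice you would pass in a real sentence embedding model in its place.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def semscore_like(generation, target, embed):
    """Embed both texts with `embed` and return their cosine similarity."""
    return cosine_similarity(embed(generation), embed(target))


def toy_embed(text):
    """Placeholder embedding (assumption): counts of the letters a-z.

    It only stands in for a real sentence embedding model and does not
    capture meaning.
    """
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


score = semscore_like("Paris is the capital of France.",
                      "The capital of France is Paris.",
                      toy_embed)
print(score)
```

Because the embedding function is passed in as an argument, swapping the placeholder for a real model only changes the `embed` callable, not the scoring logic.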

A Comparative Study

To assess how effective the proposed method is, the authors first perform a human study that ranks popular LLMs by their performance.

Human evaluation of LLMs

From this human evaluation, the authors derive a human ranking of the LLMs, which serves as the basis for comparison with the rankings produced by other evaluation metrics, including SemScore:

Model ranking according to different metrics

Based on these results, SemScore and G-Eval-4 correlate most strongly with human judgment. G-Eval-4 uses GPT-4 as a judge, so it may be more costly to run. The correlation between each metric and the human scores is shown below:

Kendall τ & Pearson r correlation between metrics and human scores
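For readers unfamiliar with the two statistics, the sketch below shows how each can be computed from a list of human scores and a list of metric scores. It is an illustration only: the numbers are placeholder values invented for this example and are not results from the paper, and the Kendall variant shown is the simple tau-a without tie correction, which is an assumption rather than a statement of what the authors used.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Placeholder values for illustration only, NOT numbers from the paper.
human_scores = [1, 2, 3, 4]
metric_scores = [1, 3, 2, 4]

print(kendall_tau(human_scores, metric_scores))
print(pearson_r(human_scores, metric_scores))
```

Kendall τ looks only at whether pairs of models are ordered the same way by both scores, while Pearson r also takes the size of the score differences into account, which is why the two are usually reported together.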


SemScore represents a very interesting advancement in the evaluation of instruction-tuned LLMs, offering a scalable, efficient, and accurate metric that closely aligns with human judgment. For more details, please consult the full paper: https://arxiv.org/pdf/2401.17072.pdf.

Kudos to the authors for their great work!