The rapid evolution of large language models, and their demonstrated competence as foundation models and as a basis for building general-purpose conversational agents, has sharpened the interest of individuals and organizations in applying them to a wide range of problems. Language models have proven particularly attractive for problems whose solutions are, by nature, grounded in natural language processing.
However, it is precisely when one wishes to place a language model at the core of an information system meant to meet a specific need that one of the greatest challenges in the field arises: how to certify the quality of the model. Consistently evaluating the competence of a language model is still an open problem, and the task becomes even harder when models are coupled with external resources such as elaborate prompt engineering procedures or retrieved contexts used to support response generation.
Quantitative evaluation, on its own, is not very expressive. When the competence of a language model is reduced to a small set of numbers, we can draw relative conclusions among models subjected to the same evaluation method, but it is difficult to understand in which aspects one model is better than another, or what needs improvement. Qualitative evaluation, on the other hand, raises issues of cost, lack of systematization and reproducibility, and the difficulty of minimizing subjectivity. Moreover, regarding the non-functional aspects of systems built around language models, there are ethical, legal, information security, and model safety concerns to consider.
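To make the limitation of aggregate scores concrete, the minimal sketch below uses entirely hypothetical per-item results and category names: two models can reach the same overall accuracy on a benchmark while failing in different aspects, so the single aggregate number hides which competencies each model actually has.

```python
from collections import defaultdict

# Hypothetical per-item results: (category, model_a_correct, model_b_correct).
# Categories and outcomes are illustrative only, not real benchmark data.
results = [
    ("legal_reasoning", True,  False),
    ("legal_reasoning", True,  False),
    ("arithmetic",      False, True),
    ("arithmetic",      False, True),
    ("summarization",   True,  True),
    ("summarization",   False, False),
]

def accuracy(flags):
    return sum(flags) / len(flags)

# Aggregate accuracy: both models score 0.5 and look interchangeable.
print("overall A:", accuracy([a for _, a, _ in results]))
print("overall B:", accuracy([b for _, _, b in results]))

# Per-category accuracy: the models succeed and fail on different aspects.
per_category = defaultdict(lambda: ([], []))
for category, a, b in results:
    per_category[category][0].append(a)
    per_category[category][1].append(b)

for category, (a_flags, b_flags) in per_category.items():
    print(category, "A:", accuracy(a_flags), "B:", accuracy(b_flags))
```

In this toy setting the overall scores are identical, yet the per-category breakdown shows one model handling legal reasoning and the other handling arithmetic, which is exactly the kind of distinction a purely aggregate, quantitative evaluation obscures.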
Against this backdrop, one of the KEML team's focus areas takes shape: the study of methods for evaluating language models. The team's efforts are directed along two fronts:
- Development of an evaluation framework, named HarpIA, which supports quantitative and qualitative evaluations based on different strategies and is backed by systems that enable systematization, reproducibility, standardization, and transparency of the evaluation process.
- Proposal of task-oriented and domain-oriented evaluation environments for large language models, in which model performance can be explored objectively and in a controlled manner. In this line of work, the group offers the Cocoruta and Blabinha environments.
Learn a bit more about ….
- Espírito Santo, F. O.; Peres, S. M.; Gramacho, G. S.; Brandão, A. A. F.; Cozman, F. G. Legal Document-Based, Domain-Driven Q&A System: LLMs in Perspective. In: International Joint Conference on Neural Networks (IJCNN 2024), Yokohama, Japan, 2024. p. 1-9. ISBN: 978-8-3503-5931-2.
- Other publications here!