HarpIA Lab

11/05/2025

The HarpIA Lab module was designed to facilitate the automated evaluation of large language models (LLMs). The module focuses on the evaluation process itself, even though the target of that evaluation is the performance of an LLM on a given task. In other words, the evaluation process is completely decoupled from the processes of building and optimizing the LLM. Accordingly, the evaluation process receives as input a JSON file that organizes the outputs produced by the LLM under evaluation, the metrics to be calculated, and any other data needed to perform the evaluation (a minimal sketch of such a file appears after the list below). In brief, the evaluation workflow using HarpIA Lab can be described by the following steps:

  • preparing the data that will serve as the input for the evaluation process;
  • choosing the metrics and other procedures of interest to be performed by HarpIA Lab;
  • analyzing the results using tools offered by HarpIA Lab;
  • exporting the raw or analytical results produced by the module for analysis using tools preferred by the researcher.
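
As an illustration, the Python sketch below shows what such an input file might look like. The structure and field names (model, task, metrics, instances, and so on) are hypothetical and do not necessarily reflect the actual schema expected by HarpIA Lab.

    # Hypothetical sketch of a HarpIA Lab input file; field names are illustrative only.
    import json

    evaluation_input = {
        "model": "my-llm-v1",                        # identifier of the LLM under evaluation
        "task": "abstractive-summarization",         # task whose outputs are being assessed
        "metrics": ["BLEU", "METEOR", "BERTScore"],  # metrics to be computed
        "instances": [
            {
                "id": 1,
                "input": "Original document text...",
                "reference": "Human-written reference summary.",
                "output": "Summary produced by the LLM.",
            },
            # ... one entry per test instance
        ],
    }

    with open("evaluation_input.json", "w", encoding="utf-8") as fp:
        json.dump(evaluation_input, fp, ensure_ascii=False, indent=2)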

The HarpIA Lab can be used in two ways: (a) from the command line or (b) through a web interface. In both cases, the idea is that the researcher can set up an evaluation and document it using JSON files for later processing, whether to audit the results obtained or to replicate and reproduce the evaluation study. It is also worth highlighting some other characteristics that guide the design of HarpIA Lab:

  • agnosticism regarding the LLM being evaluated: the module can be used to evaluate any LLM, whether a commercially available model or an open-source model running on local infrastructure. The evaluation process, which aims to assess the quality of the models, runs on input files containing data that express the behavior of the LLM being evaluated. In situations where the module needs to interact with the LLM, as in the case of “attack” evaluations, communication will be implemented via a “gateway”, so that model execution remains decoupled from the HarpIA Lab module (see the sketch after this list).
  • reproducibility and auditing: the results of an evaluation performed by the HarpIA Lab module are documented in JSON files. These files contain the data that served as input for the evaluation, the choices made by the researcher regarding the metrics to be considered, and the results obtained (aggregated or per test instance). The researcher is encouraged to share these files as a way of making the data included in their publications transparent. In this way, the data produced in the evaluation can safely be used as a benchmark, since third parties will be able to audit or reproduce the results obtained in the evaluation.
  • savings in time and resources: the HarpIA Lab module saves the researcher the time and resources required for code development and reduces the occurrence of errors common in ad hoc implementations.
  • comparative analyses: the HarpIA Lab aims to make it easy to compare different models based on the results of their evaluations. This comparison will be supported by visualizations and by statistical procedures used in the specialized literature, as indicated in the release plan described below.
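
As a rough illustration of the gateway idea only (this interface is not published here, so every class and method name below is hypothetical), the Python sketch shows how evaluation code could depend on a minimal generate() abstraction while the concrete gateway hides whether the model is a commercial API or a local open-source model.

    # Hypothetical sketch of an LLM "gateway"; HarpIA Lab's actual interface may differ,
    # and all names below are illustrative only.
    from abc import ABC, abstractmethod


    class LLMGateway(ABC):
        """Minimal abstraction separating the evaluation logic from model execution."""

        @abstractmethod
        def generate(self, prompt: str) -> str:
            """Send a prompt to the underlying LLM and return its textual response."""


    class LocalModelGateway(LLMGateway):
        """Example gateway wrapping a model running on local infrastructure."""

        def __init__(self, model):
            self.model = model  # any object exposing a text-generation call

        def generate(self, prompt: str) -> str:
            return self.model(prompt)  # delegate to the local model


    def run_attack(gateway: LLMGateway, attack_prompts: list[str]) -> list[str]:
        """An 'attack' evaluation only needs the gateway, never the model itself."""
        return [gateway.generate(p) for p in attack_prompts]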

Development stages

Release 1:

In this release, the HarpIA Lab will provide:

  • a web interface that facilitates the workflow within the module.
  • a set of quantitative metrics focused on evaluating natural language processing capabilities: BERTScore, MoverScore, BLEU, and METEOR.
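
As a quick illustration of what three of these metrics compute, the sketch below scores a single candidate/reference pair using publicly available libraries (nltk and bert-score); HarpIA Lab's own implementation and configuration may differ.

    # Illustrative computation of BLEU, METEOR and BERTScore for one candidate/reference
    # pair, using public libraries; HarpIA Lab's own implementation may differ.
    import nltk
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score
    from bert_score import score as bertscore  # pip install bert-score

    nltk.download("wordnet", quiet=True)   # WordNet data is required by METEOR
    nltk.download("omw-1.4", quiet=True)

    reference = "the cat sat on the mat".split()
    candidate = "a cat was sitting on the mat".split()

    bleu = sentence_bleu([reference], candidate,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([reference], candidate)
    precision, recall, f1 = bertscore([" ".join(candidate)], [" ".join(reference)],
                                      lang="en")

    print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}  BERTScore F1: {f1.item():.3f}")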

The workflow in release 1 is illustrated in the figure below: the researcher prepares the dataset that expresses the LLM's behavior on a given task, submits the data to the HarpIA Lab module in the format expected by the platform, and starts the evaluation. The evaluation is executed by the platform and, when it completes, the researcher can download the results and analyze them with tools of their choice.

[Figure: Release 1 evaluation workflow, from data preparation and submission to execution and download of the results.]
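
Once downloaded, the results can be explored with any general-purpose tooling. As an example, and assuming a hypothetical output file with per-instance scores (the actual output schema is not described here), a researcher could summarize the results like this:

    # Hypothetical sketch: loading per-instance results exported by the platform and
    # summarizing them; the file layout and field names are illustrative only.
    import json
    from statistics import mean

    with open("evaluation_results.json", encoding="utf-8") as fp:
        results = json.load(fp)

    # Assume each instance carries a dict mapping metric name -> score.
    per_metric = {}
    for instance in results["instances"]:
        for metric, value in instance["scores"].items():
            per_metric.setdefault(metric, []).append(value)

    for metric, values in sorted(per_metric.items()):
        print(f"{metric}: mean={mean(values):.3f} over {len(values)} instances")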

Release 2:

In Release 2, the HarpIA Lab will offer new quantitative metrics, as well as metrics focused on the evaluation of retrieval-augmented generation (RAG). Automated attack procedures (in the style of red-team exercises) will also be offered. Finally, this version will also include chart-generation functionality to facilitate exploratory data analysis (EDA).
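
As a hint of the kind of chart such a feature could produce (the sketch below is generic matplotlib code with made-up scores, not HarpIA Lab functionality), per-instance score distributions of two models can be compared visually:

    # Generic EDA sketch (not HarpIA Lab code): histogram of per-instance scores for
    # two models, useful as a first visual comparison. Scores below are hypothetical.
    import matplotlib.pyplot as plt

    scores_model_a = [0.41, 0.52, 0.38, 0.61, 0.47, 0.55, 0.49, 0.58]
    scores_model_b = [0.36, 0.48, 0.33, 0.57, 0.42, 0.51, 0.44, 0.53]

    plt.hist(scores_model_a, bins=5, alpha=0.6, label="model A")
    plt.hist(scores_model_b, bins=5, alpha=0.6, label="model B")
    plt.xlabel("per-instance METEOR score")
    plt.ylabel("number of instances")
    plt.title("Score distribution per model")
    plt.legend()
    plt.show()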

Release 3:

In Release 3, statistical validation methods will be offered to support robust comparisons between two or more LLMs, along with new chart-generation capabilities.
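
As an illustration of what such statistical validation might look like (the specific tests to be adopted are not stated here, so the use of a Wilcoxon signed-rank test below is an assumption), paired per-instance scores of two models can be compared as follows:

    # Illustrative statistical comparison of two models on paired per-instance scores;
    # HarpIA Lab's actual choice of tests may differ. Scores below are hypothetical.
    from scipy.stats import wilcoxon

    scores_model_a = [0.41, 0.52, 0.38, 0.61, 0.47, 0.55, 0.49, 0.58]
    scores_model_b = [0.36, 0.48, 0.33, 0.57, 0.42, 0.51, 0.44, 0.53]

    statistic, p_value = wilcoxon(scores_model_a, scores_model_b)
    print(f"Wilcoxon signed-rank: statistic={statistic:.2f}, p={p_value:.4f}")
    if p_value < 0.05:
        print("The paired difference between the two models is statistically significant.")
    else:
        print("No statistically significant difference was detected.")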