The HarpIA Survey module supports evaluations performed by humans. Given a large language model (LLM) and a group of human evaluators, this module helps the researcher carry out tasks commonly required to conduct an evaluation study involving human subjects, such as:
- the creation of an evaluation team, composed of one or more human evaluators;
- the specification of evaluation tasks, including the prompt engineering strategy that determines the overall behavior of the LLM;
- the collection of interaction prompts prepared by the evaluator with the aim of evaluating the LLM’s performance according to the task specified by the researcher;
- the coordination of the submission of interaction prompts to the LLM being evaluated, as well as the presentation of the model-generated responses to the human evaluator;
- the collection of the evaluator’s responses to the questionnaire specified by the researcher, whose questions (study variables) guide the evaluator in assessing the responses generated by the model.
This module is built on the Moodle platform, customized with plugins created by the HarpIA project. Moodle was adopted as the basis for this module because of its native functionality: user authentication, robust data persistence, and easy customization of the web pages used in the evaluation tasks specified by the researcher. These features reduce the time required to prepare the infrastructure needed to conduct user studies and help secure the data collected from participants. Learn more about this module by analyzing its architecture.
It is also worth highlighting some other characteristics that guided the design of the HarpIA Survey:
- familiarity: Moodle is an eLearning platform with a large worldwide community of users. Researchers familiar with the platform will find the HarpIA Survey module easy to use; less familiar researchers may face a moderate learning curve, which can be eased by the many free online training resources;
- customization: the web pages that will be presented to the human evaluator can be extensively customized by the researcher using native tools from the Moodle platform, which allows the usability of the website to be adjusted to the particular needs of each evaluation;
- persistence: all data are saved in the Moodle platform database and can be exported for analysis with statistical tools, copied to replicate the study with another population of evaluators, or backed up;
- low internal coupling: the HarpIA Survey module is composed of three components (two plugins for the Moodle platform and a gateway that coordinates communication with LLMs). These components are relatively independent and communicate through fairly flexible APIs;
- agnosticism regarding the evaluated LLM: this module can evaluate any LLM, whether a commercially offered model or an open-source model running on local infrastructure. Furthermore, new models can be easily incorporated, since communication with the models occurs through APIs invoked from Python scripts.
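The model-agnostic design described above can be sketched as an adapter pattern: the gateway depends only on a small uniform interface, and each LLM backend implements it in its own Python script. The class and method names below are illustrative assumptions, not the actual HarpIA API.

```python
# Minimal sketch of a gateway-side adapter interface (hypothetical names).
from abc import ABC, abstractmethod


class ModelAdapter(ABC):
    """Uniform interface the gateway could use for any LLM backend."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Send a prompt to the model and return its response."""


class LocalEchoAdapter(ModelAdapter):
    """Stand-in for a locally hosted model; a real adapter would call
    the provider's API (e.g. an HTTP endpoint) instead of echoing."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


def submit_prompt(adapter: ModelAdapter, prompt: str) -> str:
    # The gateway depends only on the ModelAdapter interface, so a new
    # model is incorporated by writing one new adapter class.
    return adapter.generate(prompt)
```

Under this sketch, supporting a commercial or local model amounts to implementing one more `ModelAdapter` subclass, without touching the Moodle plugins.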
Development stage
Release 1 enables two types of evaluation:
- simple prompt-based evaluation (Q&A) – assessment of an LLM’s performance in answering isolated questions (each question constitutes, in itself, the prompt for interaction with the model);
- chat-based evaluation (Chat) – evaluation of the performance of an LLM when interacting with the user (each user interaction prompt is combined with the history of previous interaction prompts).
Essentially, the workflow implemented in Release 1 of HarpIA Moodle is illustrated in the figure below: the researcher specifies the evaluation task in the HarpIA platform, and the recruited evaluators carry out the evaluation of the LLM. Both interact with the platform remotely through any modern web browser, such as Chrome, Firefox, Safari, or Edge.
Once the HarpIA Survey module is installed, the researcher can specify the evaluation task and enroll the evaluators; after the evaluators have participated in the study, the collected data can be exported for analysis with statistical tools of the researcher's preference. Because no email server is installed in this first release of the HarpIA Survey distribution, the access credentials to the platform must be forwarded manually to each invited evaluator.
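As one hypothetical example of downstream analysis, the exported data could be loaded as CSV and summarized with standard tools. The column names below are assumptions for illustration only; the actual export schema is determined by the questionnaire the researcher specifies.

```python
# Sketch of analyzing an exported study dataset (hypothetical columns).
import csv
import io
import statistics

# Stand-in for a CSV file exported from the Moodle database.
exported = io.StringIO(
    "evaluator,prompt_id,rating\n"
    "e1,p1,4\n"
    "e1,p2,5\n"
    "e2,p1,3\n"
)

rows = list(csv.DictReader(exported))
mean_rating = statistics.mean(int(row["rating"]) for row in rows)
```

The same exported file could equally be loaded into R, SPSS, or any other statistical package the researcher prefers.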
Release 2:
In Release 2, the chat-based evaluation will integrate with APIs for evaluating LLMs that power conversational systems running independently of the infrastructure that supports the HarpIA Survey module.
Release 3:
In Release 3, two new types of evaluation will be offered. They will enable interaction with two active LLMs simultaneously, allowing comparative evaluation in both interaction modes described above: comparative simple prompt-based evaluation (comparative Q&A) and comparative chat-based evaluation (comparative Chat).