HarpIA Survey – chat-based evaluation

11/05/2025

The chat-based evaluation requires the human evaluator to perform a sequence of chat-based tasks. Each task consists of an interaction between the human evaluator and the language model, followed by a set of questions about how the evaluator perceives certain qualities of the response generated by the model. Note that, when generating a response, the model may include new questions as a means to improve the fluidity of the dialogue, and the human evaluator may react to these questions, taking part in the construction of the dialogue. Conversely, the human evaluator may steer the course of the dialogue according to his or her own intentions. Chat-based tasks rest on the following premises:

  • In each interaction between the human evaluator and the language model, also called a dialogue turn, there must be a speech (or utterance) by the human evaluator and one by the model. Each utterance may vary in complexity, consisting of one or several sentences, interrogative or not. A set of interactions between the human evaluator and the language model, delimited by an initial interaction and a final interaction, constitutes a dialogue instance;
  • Each turn of a dialogue, except the first, must take into account the turns that precede it. In other words, drawing an analogy with statistical data analysis, the interactions within a dialogue are not independently distributed. The context and intentions embedded in previous interactions must be considered when understanding and formulating the utterances of the current turn, both by the model and by the human evaluator;
  • Because dialogue turns are inherently related, their order significantly affects how they are understood. In particular, later turns tend to depend more heavily on preceding turns than the initial ones do. Consequently, the sequence (or history) of turns in a dialogue should influence the behavior of both the language model and the human evaluator. A minimal sketch of this structure is given after this list.
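
The following Python snippet is one way to make these premises concrete: it models a dialogue instance as an ordered sequence of turns, each pairing an evaluator utterance with a model utterance, and exposes the history that every later turn must take into account. The class and field names (Turn, Dialogue, evaluator_utterance, and so on) are illustrative assumptions, not part of HarpIA Survey.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    """One dialogue turn: an evaluator utterance paired with a model utterance."""
    evaluator_utterance: str  # one or several sentences, interrogative or not
    model_utterance: str      # may itself pose questions back to the evaluator

@dataclass
class Dialogue:
    """A dialogue instance: an ordered sequence of turns."""
    turns: List[Turn] = field(default_factory=list)

    def history(self, turn_index: int) -> List[Turn]:
        """All turns preceding `turn_index`. Turns are not independent:
        every turn after the first must be interpreted against this history."""
        return self.turns[:turn_index]
```

Keeping the history explicit reflects the third premise: the later the turn, the larger the context that conditions both the model's and the evaluator's utterances.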

When the HarpIA Survey is set up to perform a chat-based evaluation, results like the ones illustrated below are produced. In the example, the dialogue consists of four turns and eight speeches. In addition, in each turn the human evaluator is asked to answer questions designed by the researcher to capture his or her perception of certain qualities of the response generated by the language model. In Turn 3, the text in blue shows a case where the language model successfully took the history preceding the current turn into account; in contrast, the text in red shows a case where the model failed to do so.
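
As a rough illustration of the kind of per-turn record such an evaluation produces, the snippet below pairs the two utterances of a turn with the evaluator's answers. The question texts, rating scale, and field names are hypothetical; they are not HarpIA Survey's actual output format.

```python
# Hypothetical per-turn evaluation record (not HarpIA Survey's actual schema).
turn_result = {
    "turn": 3,
    "evaluator_utterance": "...",  # placeholder for the evaluator's speech
    "model_utterance": "...",      # placeholder for the model's response
    "answers": [
        # Questions are designed by the researcher; a 1-5 scale is assumed here.
        {"question": "Does the response take the dialogue history into account?",
         "rating": 2},
        {"question": "Is the response fluent and coherent?",
         "rating": 4},
    ],
}
```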

Please consider watching the following videos to see in practice how the researcher sets up the HarpIA Survey module to support a chat-based evaluation, and how the human evaluator interacts with the module while participating in an evaluation study. In the first video, a researcher sets up a chat-based evaluation designed as a red-team attack on the LLM. The second video shows how a team member interacts with the module.

  • Setting up a chat-based evaluation
  • Human evaluator performing a chat-based task