HarpIA Survey – simple prompt-based evaluation

11/05/2025

The simple prompt-based evaluation requires the human evaluator to perform a sequence of Q&A tasks. Each task consists of an interaction between the human evaluator and the language model, followed by the human evaluator answering a set of questions about how he or she perceives certain qualities of the response generated by the model. The task assumes the availability of an active large language model (LLM) whose behavior is adjusted by a system prompt to meet the researcher’s expectations. The input to the model is a single prompt (expressed in natural language) submitted by the human evaluator, and the output is a single response generated by the model (also expressed in natural language). In this context, the following statements are taken as premises:

  • Each prompt presented as input to the language model constitutes an instance that is processed independently of all other instances. By analogy with statistical data analysis, the prompts submitted by the human evaluator to the language model should be mutually independent. In other words, the context and intent inherent to a prompt are unrelated to the context and intent inherent to any prompt previously submitted to the language model;
  • As a consequence of this independence, the order in which prompts are presented to the language model does not influence how responses are generated and, therefore, does not affect the results of the language model evaluation, as the sketch after this list illustrates.
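As a rough illustration of the independence premise, the sketch below sends every prompt with a fresh context, so no conversation history is carried from one instance to the next and shuffling the prompt order cannot change any response. The `client.chat(...)` call and the function name are hypothetical stand-ins for whatever LLM backend the researcher has configured; none of these names belong to the HarpIA Survey codebase.

```python
# Minimal sketch (hypothetical names): each prompt is an isolated instance.
def evaluate_prompts(client, system_prompt: str, prompts: list[str]) -> list[str]:
    responses = []
    for prompt in prompts:
        # A fresh message list per prompt: no memory of previous instances.
        messages = [
            {"role": "system", "content": system_prompt},  # researcher-defined behavior
            {"role": "user", "content": prompt},           # single prompt from the evaluator
        ]
        responses.append(client.chat(messages))            # single response from the model
    return responses
```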

Some examples of tasks typically performed by LLMs that can be modeled as Q&A tasks are text completion, question answering (Q&A), translation, summarization, and paraphrasing. When the HarpIA Survey is set up to perform a simple prompt-based evaluation, results such as those illustrated below are produced.
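To make the shape of a single task concrete, the sketch below shows one possible data and control flow: one prompt in, one response out, followed by the evaluator’s answers to the quality questions. All names (`run_qa_task`, `QATaskRecord`, the placeholder questions, and the `generate` and `ask_evaluator` callables) are illustrative assumptions and are not part of HarpIA Survey’s actual interface.

```python
from dataclasses import dataclass, field

# Placeholder quality questions; a real study defines its own questionnaire.
QUALITY_QUESTIONS = [
    "How fluent is the response?",
    "How relevant is the response to the prompt?",
]

@dataclass
class QATaskRecord:
    prompt: str                                   # single prompt from the human evaluator
    response: str                                 # single response generated by the model
    ratings: dict = field(default_factory=dict)   # evaluator's answer per quality question

def run_qa_task(generate, ask_evaluator, system_prompt: str, prompt: str) -> QATaskRecord:
    """One simple prompt-based task: the interaction first, then the quality questions."""
    response = generate(system_prompt, prompt)    # hypothetical wrapper around the LLM
    record = QATaskRecord(prompt, response)
    for question in QUALITY_QUESTIONS:
        # `ask_evaluator` stands in for the survey form shown to the human evaluator.
        record.ratings[question] = ask_evaluator(question, response)
    return record
```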


Please consider watching the following videos to see in practice how the researcher sets up the HarpIA Survey module to support a simple prompt-based evaluation, and how the human evaluator interacts with the module during his or her participation in an evaluation study. In the first video, a researcher sets up a simple prompt-based evaluation structured as a red-team attack on the LLM. The second video shows how the human evaluator (the red-team member) interacts with the module.

  • Setting up a simple prompt-based evaluation

  • Human evaluator performing a simple prompt-based task