Models

21/07/2024

AI-generated image (Copilot)

Since its inception in 2021, the KEML team has created a series of models for various purposes: text2SQL translators, language models predating the advent of large language models, architectures for implementing conversational agents, and more. Here you will find a list of GitHub and Hugging Face repositories that illustrate the work already carried out by the group.

More information about these models can be obtained on this website, under the Resources menu, or by contacting the team.


  • Cocoruta 1.0: Cocoruta is a specialized large language model fine-tuned for legal document-based Question Answering (Q&A), developed to address legal queries related to the “Blue Amazon”, a term used to describe Brazil’s extensive maritime territory. Cocoruta 1.0 is based on the LLaMa 2-7B model, fine-tuned on a corpus of 68,991 legal documents totaling 28.4 million tokens. Despite having fewer parameters than larger general-purpose models, Cocoruta demonstrates competitive performance in domain-specific legal discourse.

The 7-billion-parameter model (LLaMa 2-7B) was trained for 15 epochs on the 28.4-million-token corpus of 68,991 legal documents, to ensure comprehensive learning from the data. The model’s effectiveness in generating accurate and relevant legal content was assessed with automatic evaluation metrics: it achieved a BLEU score of 61.2, a ROUGE-N score of 79.2, a BERTScore of 91.2, and a MoverScore of 76.5, highlighting its strong performance in producing high-quality legal text.
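For illustration only, the sketch below shows how corpus-level metrics of this kind can be computed with the Hugging Face evaluate library; the prediction and reference strings are placeholders, and this is not the evaluation pipeline actually used for Cocoruta 1.0.

```python
# Illustrative sketch: computing BLEU, ROUGE and BERTScore for generated answers.
# Assumes the Hugging Face `evaluate` package; texts below are placeholders only.
import evaluate

predictions = ["A pesca industrial é regulada por legislação federal."]                     # model answers (placeholder)
references  = ["A pesca industrial na costa brasileira é regulada por legislação federal."]  # gold answers (placeholder)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
# BERTScore needs a language (or an explicit model); Portuguese is shown here.
print(bertscore.compute(predictions=predictions, references=references, lang="pt"))
```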

Qualitative evaluation showed the utility of fine-tuning: answers aligned with legal discourse were more frequent for Cocoruta than for larger models. The larger models exhibited higher proficiency, delivering well-structured answers; however, for questions not directly related to the legal context, their responses did not maintain legal discourse (adherence to legal discourse: 74%; correct answers: 68%; inappropriate discourse: 51%).

Disclaimer: Cocoruta may reproduce biases and prejudices inherent in the legal documents used for its training, which include older legislation. Users should exercise caution when interpreting the model’s outputs, especially in contexts requiring up-to-date legal perspectives or involving underrepresented groups. We observed that the Cocoruta model, while less proficient in handling utterances than larger models, tends to impart a legal bias to interactions.

Access the model here.
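For readers who want to try the model, the sketch below shows the usual way of loading a LLaMa 2-7B-based fine-tune with the Hugging Face transformers library. The repository ID "KEML/cocoruta-1.0" and the prompt are placeholders (assumptions), so please check the model card for the actual repository name and the expected prompt format.

```python
# Illustrative sketch of loading a LLaMa 2-7B-based fine-tune with Hugging Face transformers.
# The repository ID and the prompt below are placeholders; consult the Cocoruta 1.0 model card
# for the real repository name and the expected prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KEML/cocoruta-1.0"  # placeholder repository ID (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # 7B parameters; half precision keeps memory use manageable
    device_map="auto",
)

# Example legal question about the Blue Amazon (placeholder prompt, in Portuguese).
question = "Quais normas regulam a exploração de recursos na Amazônia Azul?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```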

Scientific paper related to Cocoruta 1.0 (please cite this paper if you use the Cocoruta 1.0 model):

  • Espírito Santo, F. O.; Peres, S. M.; Gramacho, G. S.; Brandão, A. A. F.; Cozman, F. G. Legal Document-Based, Domain-Driven Q&A System: LLMs in Perspective. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2024), Japan, 2024.

* “Cocoruta” is the name given to a bird species endemic to the Fernando de Noronha archipelago (Brazil), currently threatened with extinction. The resource’s name was chosen as a tribute to biodiversity and to help promote the conservation of the Blue Amazon (Brazilian coast).