Datasets

21/07/2024

Pirá

AI-generated image (Copilot)

Pirá (versions 1.0 and 2.0) was built to provide a bilingual (Portuguese-English) question and answer set on topics related to the Blue Amazon (Brazilian coast). It was created by people following a systematic workflow. The Pirá dataset also includes associated data resources created both manually (complexity evaluation, paraphrase creation, and review of the Portuguese-English translations) and automatically (translations and the mapping of the set to multiple-choice questions).

Part of the Pirá set is drawn from scientific research content, so the resource contains highly specialized and complex language.

The systematic procedure for creating Pirá version 1.0 is illustrated in the figure below:

 

Overview of the creation process for Pirá 1.0 (Pirozelli et al., 2021)

The Pirá 1.0 question and answer set contains 2,261 question/answer pairs in Portuguese and English, each with associated information: whether the question is generic; whether it can be answered using only the text from which it was formulated; whether it makes sense; whether it is difficult to answer; whether two answers given by different people to the same question are equivalent; the type of wh-question it is; and associated paraphrases written by people.

The Pirá 2.0 set corrects spelling issues and data structure formatting, and it contains three fewer questions than version 1.0, removed by curatorial decision. It adds labels that allow it to be used as an “answer triggering” benchmark, multiple-choice questions derived from the original questions, and a set of automatically generated paraphrases.

Access to the Pirá dataset: GitHub, Hugging Face.

Scientific papers that describe Pirá (please cite at least one of these papers if you use any data resource offered in Pirá):

  • Pirozelli, P.; José, M. M.; Silveira, I. C.; Nakasato, F.; Peres, S. M.; Brandão, A. A. F.; Costa, A. H. R.; Cozman, F. G. Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change. Data Intelligence (MIT Press Direct 2024), 2024. v.6. p.29-63. https://doi.org/10.1162/dint_a_00245
  • Paschoal, A. F. A.; Pirozelli, P.; Freire, V.; Delgado, K. V.; Peres, S. M.; José, M. M.; Nakasato, F.; Oliveira, A. S.; Brandão, A. A. F.; Costa, A. H. R.; Cozman, F. G. Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM’21), Queensland, Australia, 2021. p. 4544–4553. https://doi.org/10.1145/3459637.3482012

* The word “Pirá” means “fish” in Tupi-Guarani, a family of indigenous languages from South America that has strongly influenced Brazilian Portuguese.


Cocoruta

AI-generated image (Copilot)

Cocoruta (versions 1.0 and 2.0) offers sets of questions and answers created from the associated corpora. The questions and answers were generated by large language models and are organized as context-question-answer triples, in Portuguese. There are two sets of questions and answers:

  • Cocoruta 1.0 set: Contains 16,000 context-question-answer triples, created from a subset of documents from the Cocoruta 1.0 corpus, specially filtered to reference topics related to the Blue Amazon (Brazilian coast).
  • Cocoruta 2.0 set: Created from the documents of the Cocoruta 2.0 corpus. In this case, there are two associated subsets. The first was generated from the complete corpus, and the second was generated only from the subset of documents filtered via regular expression, containing questions and answers related to documents that discuss, in some aspect, the Blue Amazon (Brazilian coast).
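
The regular-expression filtering mentioned for the second Cocoruta 2.0 subset can be pictured with a minimal sketch. The pattern, the function name, and the sample documents below are hypothetical; the expressions actually used to build Cocoruta are not given here. The sketch only shows the general idea of keeping documents that mention coastal or maritime terms.

```python
import re

# Hypothetical pattern: the expressions actually used for Cocoruta are not stated here.
BLUE_AMAZON_PATTERN = re.compile(
    r"amaz[oô]nia azul|mar territorial|zona costeira|plataforma continental",
    re.IGNORECASE,
)


def filter_documents(documents):
    """Return only the documents whose text matches the pattern."""
    return [doc for doc in documents if BLUE_AMAZON_PATTERN.search(doc)]


# Invented sample texts, used only to exercise the filter.
sample_documents = [
    "Dispõe sobre o mar territorial e a zona econômica exclusiva.",
    "Regulamenta o transporte rodoviário interestadual.",
    "Institui o plano de gerenciamento da Zona Costeira.",
]

filtered = filter_documents(sample_documents)
print(len(filtered))
```

A keyword filter of this kind is cheap to run over a full corpus, but it only selects documents by surface wording, so the resulting subset depends entirely on the chosen terms.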

Access to the question and answer sets: Cocoruta 1.0, contact us; Cocoruta 2.0, via Hugging Face.

Scientific paper related to Cocoruta 1.0 (please cite this paper if you use the Cocoruta 1.0 set):

  • Espírito Santo, F. O.; Peres, S. M.; Gramacho, G. S.; Brandão, A. A. F.; Cozman, F. G. Legal Document-Based, Domain-Driven Q&A System: LLMs in Perspective. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2024), Japan, 2024.

* “Cocoruta” is the name given to a bird species endemic to the Fernando de Noronha archipelago (Brazil), currently threatened with extinction. The resource’s name was chosen as a tribute to biodiversity and to help promote the conservation of the Blue Amazon (Brazilian coast).


ArGPT

AI-generated image (Copilot)

Argumentation datasets often favor high-quality arguments, making it difficult to train models capable of differentiating between good and bad arguments. ArGPT is a set of argumentative essays created with the help of ChatGPT. Using a prompt structure that simulates student-teacher interaction, ArGPT uniquely includes arguments that attempt to justify notoriously false conclusions:

“explain how Hegel’s writings in the Middle Ages helped create the colonialist mentality of the 16th century.”

ChatGPT’s alignment, which aims to provide responses according to the user’s instructions, leads it to elaborate a justification even in such cases. This enabled the generation of three types of arguments:

  • good arguments: featuring solid argumentation (high coherence) that justifies a true statement;
  • bad arguments: displaying flawed argumentation, regardless of whether they defend something true or not;
  • ugly arguments: providing a convincing justification for false statements.

ArGPT contains 168 essays, of which 81 were categorized as “Bad,” 50 as “Good,” and 37 as “Ugly.”

Regarding annotation, the dataset differentiates between argumentative and non-argumentative parts of the text, classifying each component as a premise or as the main claim (defined as the text’s main conclusion) and linking components through support and attack relations. The resulting argumentation structure is a graph.
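
The annotation scheme described above can be pictured as a small graph data structure. The following is a minimal sketch, not the format of the released files: the class names, field names, and the example essay fragments are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """An argumentative span: a premise or the essay's main claim."""
    id: str
    text: str
    role: str  # "premise" or "main_claim"


@dataclass
class ArgumentGraph:
    """Components as nodes, support/attack relations as directed edges."""
    components: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (source, target, kind) tuples

    def add_component(self, component):
        self.components[component.id] = component

    def add_relation(self, source, target, kind):
        if kind not in ("support", "attack"):
            raise ValueError(f"unknown relation kind: {kind}")
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be added before linking them")
        self.relations.append((source, target, kind))

    def sources(self, target, kind):
        """Ids of components linked to `target` by a relation of the given kind."""
        return [s for s, t, k in self.relations if t == target and k == kind]


# Invented example, not taken from the dataset.
graph = ArgumentGraph()
graph.add_component(Component("c1", "Cities should expand bike lanes.", "main_claim"))
graph.add_component(Component("p1", "Cycling reduces traffic congestion.", "premise"))
graph.add_component(Component("p2", "Bike lanes take space away from parking.", "premise"))
graph.add_relation("p1", "c1", "support")
graph.add_relation("p2", "c1", "attack")

print(graph.sources("c1", "support"))
print(graph.sources("c1", "attack"))
```

Storing relations as directed edges keeps support and attack links in one structure, so a model or analysis script can query either kind for any component.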

ArGPT is available on GitHub.

Scientific paper in which ArGPT is discussed (please cite this paper if you use this argumentation dataset):

  • Rocha, V. H. N.; Silveira, I. C.; Pirozelli, P.; Mauá, D. D.; Cozman, F. G. Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks. In Proceedings of the 22nd EPIA Conference on Artificial Intelligence – Progress in Artificial Intelligence (EPIA 2023), Faial Island, Azores, 2023. v. 14115. p. 428–440. https://doi.org/10.1007/978-3-031-49008-8_34

BLAB Wiki

AI-generated image (Copilot)

From the entries available in the BLAB Wiki, we generated a small set of questions and answers using the Gemini 1.0 language model. The set contains 114 “context-question-answer” triples in Portuguese.

The content is domain-specific but written in language accessible to laypeople. Examples of question-answer pairs present in this dataset (translated from Portuguese):

  • Question: What is the Blue Amazon?
  • Answer: The Blue Amazon is the name given to the Brazilian maritime territory, which extends 200 nautical miles from the coast, along the entire coastline, as well as the area extending from the continental shelf. It also includes the areas around oceanic islands, totaling about 3.5 million km² with potential expansion to 4.5 million km² if the request to extend the limits of the Continental Shelf is approved. It is a maritime space of exclusive economic exploitation by Brazil.
  • Question: What is the objective of the National Policy for Marine Resources (PNRM)?
  • Answer: The main objective of the National Policy for Marine Resources (PNRM) is to guide the development of activities that seek the effective use, exploitation, and utilization of living, mineral, and energy resources of the Territorial Sea (MT), the Exclusive Economic Zone (ZEE), and the Continental Shelf (PC). The PNRM aims to ensure that this exploitation is carried out in accordance with national interests, in a rational and sustainable manner, promoting the country’s socioeconomic development, job and income generation, and social inclusion.

The question and answer set can be accessed here.

Scientific paper in which the wiki is presented (please cite this paper if you use the question-answer set from the BLAB Wiki):

  • Pirozelli, P.; Castro, A. B. R.; Oliveira, A. L. C.; Oliveira, A. S.; Cação, F. N.; Silveira, I. C.; Campos, J. G. M.; Motheo, L. C.; Figueiredo, L. F.; Pellicer, L. F. A. O.; José, M. A.; José, M. M.; Ligabue, P. M.; Grava, R. S.; Tavares, R. M.; Matos, V. B.; Sym, Y. V.; Costa, A. H. R.; Brandão, A. A. F.; Mauá, D. D.; Cozman, F. G.; Peres, S. M. The BLue Amazon Brain (BLAB): A Modular Architecture of Services about the Brazilian Maritime Territory. Proceedings of the Workshop: AI Modeling Oceans and Climate Change (AIMOCC 2022), Vienna, 2022, p. 1-11. https://doi.org/10.48550/arXiv.2209.07928

* The construction of the Wiki involves the collaboration of the Hub Lusófono da Década do Oceano.


mRAT-SQL+GAP and mRAT-SQL-FIT

AI-generated image (Copilot)

The translation of natural language questions into SQL queries (NL2SQL) has attracted increasing attention, particularly in connection with transformers and similar language models, which have drastically improved systems that automatically answer natural language questions. Most techniques focus on the English language. We investigate the translation to SQL when the input questions are posed in other languages.

Our experiments reveal interesting phenomena that arise when languages other than English are the focus of attention. Our best models are fine-tuned using an augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French.
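
As a rough picture of what a four-language training set can look like, the sketch below pairs each SQL query with the same question in every language. It assumes that only the question text is translated while the query and database identifier stay unchanged. The function name, the `lang` field, and the translations are illustrative; the released dataset may be organized differently.

```python
LANGUAGES = ("en", "pt", "es", "fr")


def augment(examples, translations):
    """Emit one training record per (example, language) pair.

    `examples` holds English records with "db_id", "question", and "query".
    `translations` maps a language code to a list of translated questions,
    aligned by position with `examples`. The SQL query is left untouched.
    """
    augmented = []
    for index, example in enumerate(examples):
        for lang in LANGUAGES:
            question = example["question"] if lang == "en" else translations[lang][index]
            augmented.append(
                {
                    "db_id": example["db_id"],
                    "lang": lang,  # hypothetical field added for illustration
                    "question": question,
                    "query": example["query"],
                }
            )
    return augmented


# Illustrative record in the style of Spider; translations written for this sketch.
examples = [
    {
        "db_id": "concert_singer",
        "question": "How many singers do we have?",
        "query": "SELECT count(*) FROM singer",
    }
]
translations = {
    "pt": ["Quantos cantores nós temos?"],
    "es": ["¿Cuántos cantantes tenemos?"],
    "fr": ["Combien de chanteurs avons-nous ?"],
}

training_set = augment(examples, translations)
print(len(training_set))
```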

Access the augmented dataset on GitHub.

Scientific papers that describe the complete study (please cite at least one of these papers if you use the augmented dataset constructed in our study):

  • José, M. M.; José, M. A.; Mauá, D. D.; Cozman, F. G. Integrating Question Answering and Text-to-SQL in Portuguese. In Proceedings of the 15th International Conference on Computational Processing of the Portuguese Language (PROPOR 2022), Fortaleza, 2022. v.13208. p.278–287. https://doi.org/10.1007/978-3-030-98305-5_26
  • José, M. A.; Cozman, F. G. mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer. In Proceedings of the 10th Brazilian Conference on Intelligent Systems (BRACIS 2021), Virtual Event, 2021, v.13074. p. 511–525. https://doi.org/10.1007/978-3-030-91699-2_35

LegalUSP

AI-generated image (Copilot)

LegalUSP is a corpus of legal documents from the University of São Paulo, created to facilitate the development of computational systems capable of navigating and understanding this network of rules and regulations. From this corpus, a set of questions and answers was generated with the assistance of GPT-4. Initially, the documents were divided into smaller segments (chunks) to ease information retrieval. Subsequently, 592 question-and-answer pairs were created, each associated with a specific segment containing the corresponding answer. To enhance linguistic variability, a second set of questions and answers was generated by rephrasing the original questions, also with the help of GPT-4.
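
The segmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: fixed-size word windows are used as the chunking rule, and the helper names, chunk size, and sample text are invented. The actual segmentation strategy used for LegalUSP is not specified here.

```python
def split_into_chunks(text, max_words=8):
    """Split a document into consecutive segments of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def locate_answer(chunks, answer):
    """Return the index of the first chunk containing the answer, or None."""
    for index, chunk in enumerate(chunks):
        if answer in chunk:
            return index
    return None


# Invented text standing in for a regulation; not taken from the corpus.
document = (
    "The council meets every month. Enrollment requests must be filed "
    "within ten days. Appeals are decided by the dean."
)

chunks = split_into_chunks(document)
qa_pair = {
    "question": "How long do students have to file an enrollment request?",
    "answer": "within ten days",
}
# Associate the pair with the segment that contains its answer.
qa_pair["chunk_id"] = locate_answer(chunks, qa_pair["answer"])

print(len(chunks), qa_pair["chunk_id"])
```

Fixed-size windows can cut an answer in half at a chunk boundary, so a production pipeline would likely use overlapping windows or split along the document's own structure.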

Access the dataset here.