Corpora

21/07/2024

Pirá

Pirá, versions 1.0 and 2.0, is a resource that includes corpora and a set of questions and answers. Both versions share the same pair of corpora, organized as follows:

Corpus 1: 3,891 abstracts of scientific articles reporting research on topics related to the Brazilian coast, retrieved from the Scopus database. The abstracts were selected with a regular expression search filter containing keywords related to the Brazilian coast. Available in English and Portuguese.

Corpus 2: 189 short excerpts from two reports on the global ocean organized by the United Nations (World Ocean Assessment I and World Ocean Assessment II). The excerpts were selected manually. Available in English and Portuguese.
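
A keyword-based regular expression filter of the kind described for Corpus 1 can be sketched as follows. This is only an illustration: the keyword list, the `filter_abstracts` helper, and the sample abstracts are assumptions, since the actual expression used to build Pirá is not given here.

```python
import re

# Hypothetical keywords; the real Pirá filter expression is not listed on this page.
KEYWORDS = ["Brazilian coast", "South Atlantic", "mangrove", "pre-salt"]

# A single case-insensitive alternation, with word boundaries around each keyword.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def filter_abstracts(abstracts):
    """Return only the abstracts that mention at least one keyword."""
    return [text for text in abstracts if PATTERN.search(text)]


# Invented sample abstracts, used only to exercise the filter.
sample = [
    "Mangrove dynamics along the Brazilian coast.",
    "A survey of alpine lake sediments.",
]
selected = filter_abstracts(sample)
```

With these assumed keywords, only the first sample abstract is kept. A manual selection step, as used for Corpus 2, has no equivalent in code.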

Access to the Pirá corpora: GitHub, Hugging Face.

Scientific papers that describe Pirá (please cite at least one of these papers if you use the corpora associated with Pirá):

  • Pirozelli, P.; José, M. M.; Silveira, I. C.; Nakasato, F.; Peres, S. M.; Brandão, A. A. F.; Costa, A. H. R.; Cozman, F. G. Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change. Data Intelligence (MIT Press Direct 2024), 2024. v.6. p.29-63. https://doi.org/10.1162/dint_a_00245
  • Paschoal, A. F. A.; Pirozelli, P.; Freire, V.; Delgado, K. V.; Peres, S. M.; José, M. M.; Nakasato, F.; Oliveira, A. S.; Brandão, A. A. F.; Costa, A. H. R.; Cozman, F. G. Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM’21), Queensland, Australia, 2021. p. 4544–4553. https://doi.org/10.1145/3459637.3482012

* The word “Pirá” means “fish” in Tupi-Guarani, a family of indigenous languages from South America that has strongly influenced Brazilian Portuguese.


Cocoruta

Cocoruta, versions 1.0 and 2.0, is a resource that includes corpora, question and answer sets, and optimized models based on these sets. The corpora are organized as follows:

  • Corpus Cocoruta 1.0: Laws, provisional measures, decrees, ordinances, and other legal documents addressing Brazilian national governance issues. The full corpus consists of 172,408 documents containing 67.2 million tokens. To focus it on the Blue Amazon domain (Brazilian coast), the corpus was filtered with a regular expression containing keywords associated with the ocean theme. The filtered corpus contains 68,991 documents, totaling 28.4 million tokens. Available in Portuguese.
  • Corpus Cocoruta 2.0: Laws, provisional measures, decrees, ordinances, and other legal documents addressing national governance issues. The corpus consists of 200,000 documents, totaling 226 million tokens. After applying filtering via a regular expression, we created a specialized corpus on maritime issues with 53,000 documents. The corpus documents are in JSON format, described by the following fields (values in Portuguese): year (year of the document); status (revoked or not); type (law, decree, ordinance, etc.); title (e.g., Complementary Law No. 63, January 11, 1998); summary (document summary); html-string (content); URL (address of the original document). Available in Portuguese.
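
The JSON layout described for Cocoruta 2.0 can be read along the following lines. This is a minimal sketch under stated assumptions: the exact key spellings (`year`, `status`, `type`, `title`, `summary`, `html-string`, `url`), the status value `revogado`, and the sample record are all hypothetical, because the corpus is distributed on request and its schema is only described informally above.

```python
import json
from dataclasses import dataclass


@dataclass
class LegalDocument:
    year: int
    status: str       # revoked or not (value in Portuguese)
    type: str         # law, decree, ordinance, etc.
    title: str
    summary: str
    html_string: str  # document content
    url: str          # address of the original document


def parse_document(raw: str) -> LegalDocument:
    """Build a LegalDocument from one JSON record (key names are assumed)."""
    data = json.loads(raw)
    return LegalDocument(
        year=int(data["year"]),
        status=data["status"],
        type=data["type"],
        title=data["title"],
        summary=data["summary"],
        html_string=data["html-string"],
        url=data["url"],
    )


def in_force(docs, doc_type, revoked_label="revogado"):
    """Keep documents of one type whose status is not the assumed revoked label."""
    return [d for d in docs if d.type == doc_type and d.status != revoked_label]


# Invented record, for illustration only; it is not taken from the corpus.
SAMPLE = json.dumps({
    "year": 2001,
    "status": "vigente",
    "type": "decreto",
    "title": "Decreto de exemplo",
    "summary": "Resumo de exemplo.",
    "html-string": "<p>Texto de exemplo.</p>",
    "url": "https://example.org/decreto-exemplo",
})

doc = parse_document(SAMPLE)
decrees = in_force([doc], "decreto")
```

Mapping the hyphenated `html-string` key to a `html_string` attribute keeps the record usable as a normal Python object. The metadata fields are what make this kind of selection by type or status possible.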

Differences between Cocoruta 1.0 and 2.0: The Cocoruta 1.0 corpus can be considered a “basket” of legal documents. Unlike Cocoruta 2.0, it is not organized through structured metadata. In addition, the regular expression used to filter maritime documents was refined for the second version: it includes more terms, and those terms are more specific.

Access to the Cocoruta corpora: contact us.

Scientific paper related to Cocoruta 1.0 (please cite this paper if you use the Cocoruta 1.0 Corpus):

  • Espírito Santo, F. O.; Peres, S. M.; Gramacho, G. S.; Brandão, A. A. F.; Cozman, F. G. Legal Document-Based, Domain-Driven Q&A System: LLMs in Perspective. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2024), Japan, 2024.

* “Cocoruta” is the name given to a bird species endemic to the Fernando de Noronha archipelago (Brazil), currently threatened with extinction. The resource’s name was chosen as a tribute to biodiversity and to help promote the conservation of the Blue Amazon (Brazilian coast).


BLAB Wiki

The BLAB Wiki (Blue Amazon Brain Wiki) is a small set of entries that provide knowledge about the Brazilian coast (Blue Amazon). The wiki is intended as an initial base of texts, written by specialists, covering various topics related to the Blue Amazon. Available in Portuguese.

Currently, the wiki includes three sets of entries:

  • Biodiversity: Pelagic environment, species conservation, coastal ecosystems, deep sea, marine microbiology, primary production, and zoology (marine annelids, cnidarians, mollusks, poriferans).
  • Legislation and Governance: Federal constitution, definition of maritime spaces, coastal management, fishing legislation and mariculture, Brazilian Navy, water quality, marine resources, and conservation units.
  • Socioenvironmental: Oil activities, colonization of Brazil, environmental disasters in the coastal and marine environment, coastal erosion and sedimentation, maritime sports, energy generation, sea mining, fishing and aquaculture, marine pollution and contamination, ports, transportation and navigation, coastal tourism, and urbanization of Brazil.

Access to the wiki: Blue Amazon Brain Wiki

Scientific paper in which the wiki is presented (please cite this paper if you use entries from the BLAB Wiki):

  • Pirozelli, P.; Castro, A. B. R.; Oliveira, A. L. C.; Oliveira, A. S.; Cação, F. N.; Silveira, I. C.; Campos, J. G. M.; Motheo, L. C.; Figueiredo, L. F.; Pellicer, L. F. A. O.; José, M. A.; José, M. M.; Ligabue, P. M.; Grava, R. S.; Tavares, R. M.; Matos, V. B.; Sym, Y. V.; Costa, A. H. R.; Brandão, A. A. F.; Mauá, D. D.; Cozman, F. G.; Peres, S. M. The BLue Amazon Brain (BLAB): A Modular Architecture of Services about the Brazilian Maritime Territory. Proceedings of the Workshop: AI Modeling Oceans and Climate Change (AIMOCC 2022), Vienna, 2022, p. 1-11. https://doi.org/10.48550/arXiv.2209.07928

* The wiki is being built in collaboration with the Hub Lusófono da Década do Oceano (Lusophone Hub of the Ocean Decade).


Automated Essay Scoring (AES) ENEM

AES ENEM is a new benchmark for automatic essay scoring in Portuguese, consisting of entries associated with metadata and organized into pre-established training, validation, and test subsets. The collection comprises 3,604 essays and essay paraphrases, each annotated with an identifier, theme, title, body of text, set of grades, and year.

Access the collection here.

Scientific paper in which the collection is presented (please cite this paper if you use the essays from this collection):

  • Silveira, I. C.; Barbosa, B.; Mauá, D. D. A New Benchmark for Automatic Essay Scoring in Portuguese. In Proceedings of the 16th International Conference on Computational Processing of Portuguese, 2024. https://aclanthology.org/2024.propor-1.23

LegalUSP

The University of São Paulo (USP) is one of the largest and most important higher education institutions in Brazil. With an annual budget of R$ 8.6 billion for 2024, USP encompasses 42 teaching and research units distributed across 8 campuses located in 9 cities. This vast and diverse structure makes its set of rules and regulations formally complex and often difficult to understand.

LegalUSP is a corpus of legal documents from the University of São Paulo, created to facilitate the development of computational systems capable of navigating and understanding this network of rules and regulations. The dataset consists of 866 documents extracted from the university’s official website and converted into text files, covering the period from January 2023 to May 2024. These documents include a variety of regulations, such as Historical Norms, Statutes, General Regulations, Resolutions, Ordinances, Regulations of the Bodies, and other normative documents.

Access the corpus here.