Carolina V1.2: Schema
tei_carolina_schema_1_2.odd
TEI Corpus Carolina ODD - version 1.2 Ada
C4AI - Center for Artificial Intelligence
LAVIHD-USP/LAPELINC-UESB
TEI Consortium
Distributed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License
Copyright 2013 TEI Consortium.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that
the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the
following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and
the following disclaimer in the documentation and/or other materials provided with the
distribution.
The Open Corpus for Linguistics and Artificial Intelligence (Carolina) was compiled for academic purposes, namely linguistic and computational analysis. It is composed of texts assembled in various digital repositories, whose licenses are multiple and therefore should be observed when making use of the corpus. The Carolina headers are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
A TEI Customization - Corpus Carolina - 1.2 Ada
Origin region of source document
Author name
Part - If the document is part of a collection or series, identifies the part it corresponds to (e.g., a section in a newspaper)
Document extraction or download date
For the "download" type, the date is the end of the download
Translator description
Translator role
Extent of source document - pages, bytes or tokens
Unit can be pages, bytes or tokens
File of source document
File type of source document
Source document access URL
Credits for work with corpus file - Name of researchers who worked (before or after creation) with this corpus file.
Work description (before or after creation) with this corpus file.
Credits for work with corpus file - Name of researchers and work performed (before or after creation) with this corpus file.
Root element
Name of the file created in the corpus or title of source document or name of corpus
"main" for the title and "sub" for the version of the corpus
Written or oral text (transcribed)
Written or oral text (transcribed)
written
spoken
mixed
Constitution (integral, fragmented, etc.)
Constitution (integral, fragmented, etc.)
Domain of use
unknown - unknown level of preparedness;
spontaneous - spontaneous publication: content that does not reflect a revision structure;
mixed - includes spontaneous and monitored publications;
monitored - monitored publication: content that reflects a revision structure or a certain level of formality.
Authority responsible for the source document
Access conditions for the source document (public, under authorization, etc.)
Access conditions for the source document (public, under authorization, etc.)
Category description
Category name
Category Reference
ID of referenced taxonomy
ID of referenced category
Taxonomy declarations
Sizes
Attribute description
Linguistic variety (regional) indicated in the source document
License type of the source document (Public domain, Creative Commons, etc.)
URL of the license
Corpus Project Description
Collection - If the document is part of a collection or series, refers to the collection it belongs to
Source information
Sponsor (Institution creating or responsible for the publication of the source document)
Carolina typology
Origin region of source document
Extracted text from source document
Constraint that the @type attribute is mandatory in the Corpus header
The @type attribute must be present in this context.
tei_carolina_schema_1_2.rng
points to a <handNote> element describing the hand considered responsible for the content of the element concerned.
The @when attribute cannot be used with any other att.datable.w3c attributes.
The @from and @notBefore attributes cannot be used together.
The @to and @notAfter attributes cannot be used together.
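The three dating constraints above can be illustrated outside the schema. The following is a minimal Python sketch, not the schema's own Schematron rule: it assumes an element's attributes are available as a plain dictionary, and the function name `check_datable` is hypothetical. The attribute names are those of the TEI att.datable.w3c class.

```python
# Attributes of the TEI att.datable.w3c class.
W3C_DATABLE = {"when", "notBefore", "notAfter", "from", "to"}


def check_datable(attrs):
    """Return one message per violated constraint (empty list if none)."""
    present = W3C_DATABLE & set(attrs)
    problems = []
    # @when excludes every other att.datable.w3c attribute.
    if "when" in present and len(present) > 1:
        problems.append(
            "The @when attribute cannot be used with any other "
            "att.datable.w3c attributes."
        )
    # @from and @notBefore both describe the start of a period.
    if {"from", "notBefore"} <= present:
        problems.append("The @from and @notBefore attributes cannot be used together.")
    # @to and @notAfter both describe the end of a period.
    if {"to", "notAfter"} <= present:
        problems.append("The @to and @notAfter attributes cannot be used together.")
    return problems
```

For example, `check_datable({"when": "2021-03-15"})` returns an empty list, while combining `when` with `from` returns the first message.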
identifies one or more declarable elements within the header, which are understood to apply to the element bearing this attribute and its content.
specifies whether or not its parent element is fragmented in some way, typically by some other overlapping structure: for example a speech which is divided between two or more verse stanzas, a paragraph which is split across a page division, a verse line which is divided between two speakers.
Y
(yes) the element is fragmented in some (unspecified) respect
N
(no) the element is not fragmented, or no claim is made as to its completeness
I
(initial) this is the initial part of a fragmented element
M
(medial) this is a medial part of a fragmented element
F
(final) this is the final part of a fragmented element
(identifier) provides a unique identifier for the element bearing the attribute.
The @unit attribute may be unnecessary when @unitRef is present.
The element should not be categorized in detail with @subtype unless also categorized in general with @type
@targetLang should only be used if @target is specified.
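The attribute dependencies stated in the three messages above can be sketched in the same way. This is an illustrative Python check under the assumption that attributes arrive as a dictionary; the function name `check_dependent_attributes` is hypothetical and the schema itself expresses these rules as constraints, not as Python.

```python
def check_dependent_attributes(attrs):
    """Return advisory messages for attribute combinations the schema flags."""
    problems = []
    # @unit adds nothing once @unitRef points at a unit definition.
    if "unit" in attrs and "unitRef" in attrs:
        problems.append("The @unit attribute may be unnecessary when @unitRef is present.")
    # @subtype refines @type, so it needs @type alongside it.
    if "subtype" in attrs and "type" not in attrs:
        problems.append("@subtype should not be used without @type.")
    # @targetLang describes the language of the resource named by @target.
    if "targetLang" in attrs and "target" not in attrs:
        problems.append("@targetLang should only be used if @target is specified.")
    return problems
```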
(paragraph) marks paragraphs in prose. [3.1. Paragraphs 7.2.5. Speech Contents]
Abstract model violation: Paragraphs may not occur inside other paragraphs or ab elements.
Abstract model violation: Lines may not contain higher-level structural elements such as div, p, or ab.
(name, proper noun) Credits for work with corpus file - Name of researchers who worked (before or after creation) with this corpus file. [3.5.1. Referring Strings]
Origin region of source document [3.5.2. Addresses 2.2.4. Publication, Distribution, Licensing, etc. 3.11.2.4. Imprint, Size of a Document, and Reprint Information]
Extent of source document - pages, bytes or tokens [3.5.3. Numbers and Measures]
Unit can be pages, bytes or tokens
pages
bytes
tokens
specifies the number of the specified units that comprise the measurement
(\-?[\d]+/\-?[\d]+)
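The quantity pattern above describes fraction-like values such as `3/4`. The sketch below exercises it in Python, assuming whole-string matching as XML Schema patterns require. The helper name `is_fraction` is hypothetical, and the schema may also accept plain numeric values through other alternatives that this listing does not show.

```python
import re

# Pattern copied from the schema listing; fullmatch mirrors the implicit
# anchoring of XML Schema patterns.
FRACTION = re.compile(r"(\-?[\d]+/\-?[\d]+)")


def is_fraction(value):
    """True if value is an optionally signed integer over another."""
    return FRACTION.fullmatch(value) is not None
```

Under this pattern `is_fraction("3/4")` and `is_fraction("-1/2")` hold, while `is_fraction("3")` and `is_fraction("3.5/2")` do not.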
Document extraction or download date [3.5.4. Dates and Times 2.2.4. Publication, Distribution, Licensing, etc. 2.6. The Revision Description 3.11.2.4. Imprint, Size of a Document, and Reprint Information 15.2.3. The Setting Description 13.3.7. Dates and Times]
(19|2\d)\d\d(-(0[1-9]|1[012])(-(0[1-9]|[12][0-9]|3[01]))?)?(\s-\s(19|2\d)\d\d(-(0[1-9]|1[012])(-(0[1-9]|[12][0-9]|3[01]))?)?)?
For the "download" type, the date is the end of the download
extraction
download
Extraction
Download
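The date pattern given for the extraction or download date accepts a year, a year and month, or a full date, optionally followed by a second date after " - " to form a range. A minimal Python sketch for trying values against it follows. The helper name `is_carolina_date` is hypothetical, and the pattern only checks the shape of the string: it does not reject impossible dates such as 2021-02-31.

```python
import re

# Pattern copied from the schema listing; years must start with 19 or 2x.
DATE = re.compile(
    r"(19|2\d)\d\d(-(0[1-9]|1[012])(-(0[1-9]|[12][0-9]|3[01]))?)?"
    r"(\s-\s(19|2\d)\d\d(-(0[1-9]|1[012])(-(0[1-9]|[12][0-9]|3[01]))?)?)?"
)


def is_carolina_date(value):
    """True if value matches the date (or date range) pattern in full."""
    return DATE.fullmatch(value) is not None
```

Values such as `2021`, `2021-03`, `2021-03-15`, and `2021-03-15 - 2021-04-01` match. Values such as `1899-01-01`, `2021-13-01`, and `2021-3-15` do not.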
File of source document [3.9. Graphics and Other Non-textual Components]
(uniform resource locator) Source document access URL
File type of source document
text/xml
text/html
text/plain
application/pdf
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/csv
Author name [3.11.2.2. Titles, Authors, and Editors 2.2.1. The Title Statement]
Translator description [3.11.2.2. Titles, Authors, and Editors]
Translator role
[^\p{C}\p{Z}]+
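The pattern above relies on the Unicode category escapes `\p{C}` (control and other) and `\p{Z}` (separators), which Python's built-in `re` module does not support. An equivalent check can be sketched with `unicodedata`. The helper name `is_single_token` is hypothetical, and this is an illustration of what the pattern accepts, not the schema's implementation.

```python
import unicodedata


def is_single_token(value):
    """True if value is non-empty and has no character in category C* or Z*."""
    return len(value) > 0 and all(
        # category() returns codes such as "Lu", "Zs", "Cc"; the first
        # letter is the major class tested by \p{C} and \p{Z}.
        unicodedata.category(ch)[0] not in "CZ"
        for ch in value
    )
```

A value such as `translator` passes, while a value containing a space or a tab, or an empty value, does not.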
(statement of responsibility) Credits for work with corpus file - Name of researchers and work performed (before or after creation) with this corpus file. [3.11.2.2. Titles, Authors, and Editors 2.2.1. The Title Statement 2.2.2. The Edition Statement 2.2.5. The Series Statement]
(responsibility) Work description (before or after creation) with this corpus file. [3.11.2.2. Titles, Authors, and Editors 2.2.1. The Title Statement 2.2.2. The Edition Statement 2.2.5. The Series Statement]
Name of the file created in the corpus or title of source document or name of corpus [3.11.2.2. Titles, Authors, and Editors 2.2.1. The Title Statement 2.2.5. The Series Statement]
"main" for the title and "sub" for the version of the corpus
main
sub
provides the name of the organization responsible for the publication or distribution of a bibliographic item. [3.11.2.4. Imprint, Size of a Document, and Reprint Information 2.2.4. Publication, Distribution, Licensing, etc.]
(scope of bibliographic reference) Part - If the document is part of a collection or series, identifies the part it corresponds to (e.g., a section in a newspaper) [3.11.2.5. Scopes and Ranges in Bibliographic Citations]
Root element [4. Default Text Structure 15.1. Varieties of Composite Text]
specifies the version number of the TEI Guidelines against which this document is valid.
[\d]+(\.[\d]+){0,2}
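The version pattern above allows one to three dot-separated numeric components. A short Python sketch for checking candidate values, with the hypothetical helper name `is_tei_version`:

```python
import re

# Pattern copied from the schema listing: digits, then up to two ".digits" groups.
VERSION = re.compile(r"[\d]+(\.[\d]+){0,2}")


def is_tei_version(value):
    """True if value has the form N, N.N, or N.N.N."""
    return VERSION.fullmatch(value) is not None
```

Values such as `4`, `4.3`, and `4.3.0` match. Values such as `4.3.0.1`, `v4`, and `4.` do not.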
(TEI header) supplies descriptive and declarative metadata associated with a digital resource or set of resources. [2.1.1. The TEI Header and Its Components 15.1. Varieties of Composite Text]
(file description) contains a full bibliographic description of an electronic file. [2.2. The File Description 2.1.1. The TEI Header and Its Components]
(title statement) groups information about the title of a work and those responsible for its content. [2.2.1. The Title Statement 2.2. The File Description]
Sponsor (Institution creating or responsible for the publication of the source document) [2.2.1. The Title Statement]
Sizes [2.2.3. Type and Extent of File 2.2. The File Description 3.11.2.4. Imprint, Size of a Document, and Reprint Information 10.7.1. Object Description]
(publication statement) groups information concerning the publication or distribution of an electronic or other text. [2.2.4. Publication, Distribution, Licensing, etc. 2.2. The File Description]
(release authority) Authority responsible for the source document [2.2.4. Publication, Distribution, Licensing, etc.]
Access conditions for the source document (public, under authorization, etc.) [2.2.4. Publication, Distribution, Licensing, etc.]
Access conditions for the source document (public, under authorization, etc.)
restricted
free
License type of the source document (Public domain, Creative Commons, etc.) [2.2.4. Publication, Distribution, Licensing, etc.]
URL of the license
(series statement) Collection - If the document is part of a collection or series, refers to the collection it belongs to [2.2.5. The Series Statement 2.2. The File Description]
(source description) Source information [2.2.7. The Source Description]
(fully-structured bibliographic citation) contains a fully-structured bibliographic citation, in which all components of the TEI file description are present. [3.11.1. Methods of Encoding Bibliographic References and Lists of References 2.2. The File Description 2.2.7. The Source Description 15.3.2. Declarable Elements]
(encoding description) documents the relationship between an electronic text and the source or sources from which it was derived. [2.3. The Encoding Description 2.1.1. The TEI Header and Its Components]
(project description) Corpus Project Description [2.3.1. The Project Description 2.3. The Encoding Description 15.3.2. Declarable Elements]
(classification declarations) Taxonomy declarations [2.3.7. The Classification Declaration 2.3. The Encoding Description]
defines a typology either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy. [2.3.7. The Classification Declaration]
Category name [2.3.7. The Classification Declaration]
(category description) Category description [2.3.7. The Classification Declaration]
en
pt
(text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting. [2.4. The Profile Description 2.1.1. The TEI Header and Its Components]
(language usage) Linguistic variety (regional) indicated in the source document [2.4.2. Language Usage 2.4. The Profile Description 15.3.2. Declarable Elements]
characterizes a single language or sublanguage used within a text. [2.4.2. Language Usage]
(identifier) Attribute description
pt-BR
pt
(text classification) Carolina typology [2.4.3. The Text Classification]
(category reference) Category Reference [2.4.3. The Text Classification]
ID of referenced category
ID of referenced taxonomy
(TEI document) contains a single TEI-conformant document, combining a single TEI header with one or more members of the model.resource class. Multiple <TEI> elements may be combined within a <TEI> (or <teiCorpus>) element. [4. Default Text Structure 15.1. Varieties of Composite Text]
Extracted text from source document [4. Default Text Structure 15.1. Varieties of Composite Text]
(text body) contains the whole body of a single unitary text, excluding any front or back matter. [4. Default Text Structure]
(text description) provides a description of a text in terms of its situational parameters. [15.2.1. The Text Description]
(primary channel) Written or oral text (transcribed) [15.2.1. The Text Description]
Written or oral text (transcribed)
w
written
s
spoken
m
mixed
Constitution (integral, fragmented, etc.) [15.2.1. The Text Description]
Constitution (integral, fragmented, etc.)
[^\p{C}\p{Z}]+
describes the nature and extent of originality of this text. [15.2.1. The Text Description]
(domain of use) Domain of use [15.2.1. The Text Description]
describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world. [15.2.1. The Text Description]
describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary, etc. [15.2.1. The Text Description]
describes the extent to which a text may be regarded as prepared or spontaneous. [15.2.1. The Text Description]
unknown - unknown level of preparedness;
spontaneous - spontaneous publication: content that does not reflect a revision structure;
mixed - includes spontaneous and monitored publications;
monitored - monitored publication: content that reflects a revision structure or a certain level of formality.
[^\p{C}\p{Z}]+
characterizes a single purpose or communicative function of the text. [15.2.1. The Text Description]
Origin region of source document [13.2.3. Place Names]
The @type attribute must be present in this context.