TEI P5: Guidelines for Electronic Text Encoding and Interchange

by the TEI Consortium

Originally edited by C.M. Sperberg-McQueen and Lou Burnard for the ACH-ALLC-ACL Text Encoding Initiative

Now entirely revised and expanded under the supervision of the Technical Council of the TEI Consortium

edited by Lou Burnard and Syd Bauman

1.3.0. Last updated on February 1st 2009.

Oxford -- Providence -- Charlottesville -- Nancy
2008

Copyright 2009 TEI Consortium.

This is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This material is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

A copy of the GNU General Public License is stored on the TEI web site along with this file; you can also contact the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA, for a copy.

For information about the TEI, including contact details, consult the TEI web site at http://www.tei-c.org/.

Contents

i Releases of the TEI Guidelines
ii Dedication
iii Preface and Acknowledgments
iv About These Guidelines
iv.1 Structure and Notational Conventions of this Document
iv.1.1 Design Principles
iv.1.2 Intended Use
iv.2 Historical Background
iv.3 Future Developments
v A Gentle Introduction to XML
v.1 What's special about XML?
v.1.1 Descriptive markup
v.1.2 Types of document
v.1.3 Data independence
v.2 Textual structures
v.3 XML structures
v.3.1 Elements
v.3.2 Content models: an example
v.3.3 Validating a document's structure
v.3.4 An example schema
v.4 Complicating the issue
v.5 Attributes
v.5.1 Declaring attributes
v.5.2 Identifiers and indicators
v.6 Other components of an XML document
v.6.1 Character References
v.6.2 Processing instructions
v.6.3 Namespaces
v.7 Putting it all together
v.7.1 Associating entity definitions with a document instance
v.7.2 Associating a document instance with its schema
v.7.3 Assembling multiple resources into a single document
v.7.4 Stylesheet association and processing
vi Languages and Character Sets
vi.1 Language identification
vi.2 Characters and Character Sets
vi.2.1 Historical considerations
vi.2.2 Terminology and key concepts
vi.2.3 Abstract characters, glyphs and encoding scheme design
vi.2.4 Entry of characters
vi.2.5 Output of characters
vi.2.6 Unicode and XML
vi.2.7 Special aspects of Unicode character definitions
vi.2.8 Character entities in non-validated documents
vi.2.9 Issues arising from the internal representations of Unicode
1 The TEI Infrastructure
1.1 TEI Modules
1.2 Defining a TEI Schema
1.2.1 A Simple Customization
1.2.2 A Larger Customization
1.3 The TEI Class System
1.3.1 Attribute Classes
1.3.2 Model Classes
1.4 Macros
1.4.1 Standard Content Models
1.4.2 Datatype Macros
1.5 The TEI Infrastructure Module
2 The TEI Header
2.1 Organization of the TEI Header
2.1.1 The TEI Header and its Components
2.1.2 Types of Content in the TEI Header
2.1.3 Model Classes in the TEI Header
2.2 The File Description
2.2.1 The Title Statement
2.2.2 The Edition Statement
2.2.3 Type and Extent of File
2.2.4 Publication, Distribution, etc.
2.2.5 The Series Statement
2.2.6 The Notes Statement
2.2.7 The Source Description
2.2.8 Computer Files Derived from Other Computer Files
2.3 The Encoding Description
2.3.1 The Project Description
2.3.2 The Sampling Declaration
2.3.3 The Editorial Practices Declaration
2.3.4 The Tagging Declaration
2.3.5 The Reference System Declaration
2.3.6 The Classification Declaration
2.3.7 The Application Information Element
2.3.8 Module-Specific Declarations
2.4 The Profile Description
2.4.1 Creation
2.4.2 Language Usage
2.4.3 The Text Classification
2.5 The Revision Description
2.6 Minimal and Recommended Headers
2.7 Note for Library Cataloguers
2.8 The TEI Header Module
3 Elements Available in All TEI Documents
3.1 Paragraphs
3.2 Treatment of Punctuation
3.3 Highlighting and Quotation
3.3.1 What Is Highlighting?
3.3.2 Emphasis, Foreign Words, and Unusual Language
3.3.3 Quotation
3.3.4 Terms, Glosses, Equivalents, and Descriptions
3.3.5 Some Further Examples
3.4 Simple Editorial Changes
3.4.1 Apparent Errors
3.4.2 Regularization and Normalization
3.4.3 Additions, Deletions, and Omissions
3.5 Names, Numbers, Dates, Abbreviations, and Addresses
3.5.1 Referring Strings
3.5.2 Addresses
3.5.3 Numbers and Measures
3.5.4 Dates and Times
3.5.5 Abbreviations and Their Expansions
3.6 Simple Links and Cross-References
3.7 Lists
3.8 Notes, Annotation, and Indexing
3.8.1 Notes and Simple Annotation
3.8.2 Index Entries
3.9 Graphics and other non-textual components
3.10 Reference Systems
3.10.1 Using the xml:id and n Attributes
3.10.2 Creating New Reference Systems
3.10.3 Milestone Elements
3.10.4 Declaring Reference Systems
3.11 Bibliographic Citations and References
3.11.1 Elements of Bibliographic References
3.11.2 Components of Bibliographic References
3.11.3 Bibliographic Pointers
3.11.4 Relationship to Other Bibliographic Schemes
3.12 Passages of Verse or Drama
3.12.1 Core Tags for Verse
3.12.2 Core Tags for Drama
3.13 Overview of the Core Module
4 Default Text Structure
4.1 Divisions of the Body
4.1.1 Un-numbered Divisions
4.1.2 Numbered Divisions
4.1.3 Numbered or Un-numbered?
4.1.4 Partial and Composite Divisions
4.2 Elements Common to All Divisions
4.2.1 Headings and Trailers
4.2.2 Openers and Closers
4.2.3 Arguments, Epigraphs, and Postscripts
4.2.4 Content of Textual Divisions
4.3 Grouped and Floating Texts
4.3.1 Grouped Texts
4.3.2 Floating Texts
4.4 Virtual Divisions
4.5 Front Matter
4.6 Title Pages
4.7 Back Matter
4.8 Module for Default Text Structure
5 Representation of Non-standard Characters and Glyphs
5.1 Is Your Journey Really Necessary?
5.2 Markup Constructs for Representation of Characters and Glyphs
5.2.1 Character Properties
5.3 Annotating Characters
5.4 Adding New Characters
5.5 How to Use Code Points from the Private Use Area
5.6 Module Character and Glyph Documentation
6 Verse
6.1 Structural Divisions of Verse Texts
6.2 Components of the Verse Line
6.3 Rhyme and Metrical Analysis
6.3.1 Sample Metrical Analyses
6.3.2 Segment-Level versus Line-level Tagging
6.3.3 Metrical Analysis of Stanzaic Verse
6.4 Rhyme
6.5 Metrical Notation Declaration
6.6 Encoding Procedures for Other Verse Features
6.7 Module for Verse
7 Performance Texts
7.1 Front and Back Matter
7.1.1 The Set Element
7.1.2 Prologues and Epilogues
7.1.3 Records of Performances
7.1.4 Cast Lists
7.2 The Body of a Performance Text
7.2.1 Major Structural Divisions
7.2.2 Speeches and Speakers
7.2.3 Stage Directions
7.2.4 Speech Contents
7.2.5 Embedded Structures
7.2.6 Simultaneous Action
7.3 Other Types of Performance Text
7.3.1 Technical Information
7.4 Module for Performance Texts
8 Transcriptions of Speech
8.1 General Considerations and Overview
8.2 Documenting the Source of Transcribed Speech
8.3 Elements Unique to Spoken Texts
8.3.1 Utterances
8.3.2 Pausing
8.3.3 Vocal, Kinesic, Incident
8.3.4 Writing
8.3.5 Temporal Information
8.3.6 Shifts
8.4 Elements Defined Elsewhere
8.4.1 Segmentation
8.4.2 Synchronization and Overlap
8.4.3 Regularization of Word Forms
8.4.4 Prosody
8.4.5 Speech Management
8.4.6 Analytic Coding
8.5 Module for Transcribed Speech
9 Dictionaries
9.1 Dictionary Body and Overall Structure
9.2 The Structure of Dictionary Entries
9.2.1 Hierarchical Levels
9.2.2 Groups and Constituents
9.3 Top-level Constituents of Entries
9.3.1 Information on Written and Spoken Forms
9.3.2 Grammatical Information
9.3.3 Sense Information
9.3.4 Etymological Information
9.3.5 Other Information
9.3.6 Related Entries
9.4 Headword and Pronunciation References
9.5 Typographic and Lexical Information in Dictionary Data
9.5.1 Editorial View
9.5.2 Lexical View
9.5.3 Retaining Both Views
9.6 Unstructured Entries
9.7 The Dictionary Module
10 Manuscript Description
10.1 Overview
10.2 The Manuscript Description Element
10.3 Phrase-level Elements
10.3.1 Origination
10.3.2 Material
10.3.3 Watermarks and Stamps
10.3.4 Dimensions
10.3.5 References to Locations within a Manuscript
10.3.6 Names of Persons, Places, and Organizations
10.3.7 Catchwords, Signatures, Secundo Folio
10.3.8 Heraldry
10.4 The Manuscript Identifier
10.5 The Manuscript Heading
10.6 Intellectual Content
10.6.1 The and Elements
10.6.2 Authors and Titles
10.6.3 Rubrics, Incipits, Explicits, and Other Quotations from the Text
10.6.4 Filiation
10.6.5 Text Classification
10.6.6 Languages and Writing Systems
10.7 Physical Description
10.7.1 Object Description
10.7.2 Writing, Decoration, and Other Notations
10.7.3 Bindings, Seals, and Additional Material
10.7.4 History
10.7.5 Additional information
10.7.6 Manuscript Parts
10.7.7 Module for Manuscript Description
11 Representation of Primary Sources
11.1 Digital Facsimiles
11.2 Scope of Transcriptions
11.3 Altered, Corrected, and Erroneous Texts
11.3.1 Core elements for Transcriptional Work
11.3.2 Abbreviation and Expansion
11.3.3 Correction and Conjecture
11.3.4 Additions and Deletions
11.3.5 Substitutions
11.3.6 Cancellation of Deletions and Other Markings
11.3.7 Text Omitted from or Supplied in the Transcription
11.4 Hands and Responsibility
11.4.1 Document Hands
11.4.2 Hand, Responsibility, and Certainty Attributes
11.5 Damage and Conjecture
11.5.1 Damage, Illegibility, and Supplied Text
11.5.2 Use of the , , , , and Elements in Combination
11.6 Aspects of Layout
11.6.1 Space
11.6.2 Lines
11.7 Headers, Footers, and Similar Matter
11.8 Other Primary Source Features not Covered in these Guidelines
11.9 Module for Transcription of Primary Sources
12 Critical Apparatus
12.1 The Apparatus Entry, Readings, and Witnesses
12.1.1 The Apparatus Entry
12.1.2 Readings
12.1.3 Indicating Subvariation in Apparatus Entries
12.1.4 Witness Information
12.1.5 Fragmentary Witnesses
12.2 Linking the Apparatus to the Text
12.2.1 The Location-referenced Method
12.2.2 The Double End-Point Attachment Method
12.2.3 The Parallel Segmentation Method
12.3 Using Apparatus Elements in Transcriptions
12.4 Module for Critical Apparatus
13 Names, Dates, People, and Places
13.1 Attribute Classes Defined by this Module
13.1.1 Linking Names and their Referents
13.1.2 Dating Attributes
13.2 Names
13.2.1 Personal Names
13.2.2 Organizational Names
13.2.3 Place Names
13.3 Biographical and Prosopographical Data
13.3.1 Basic Principles
13.3.2 The Person Element
13.3.3 Organizational Data
13.3.4 Places
13.3.5 Names and Nyms
13.3.6 Dates and Times
13.4 Module for Names and Dates
14 Tables, Formulæ, and Graphics
14.1 Tables
14.1.1 TEI Tables
14.1.2 Other Table Schemas
14.2 Formulæ and Mathematical Expressions
14.3 Specific Elements for Graphic Images
14.4 Overview of Basic Graphics Concepts
14.5 Graphic Image Formats
14.5.1 Vector Graphic Formats
14.5.2 Raster Graphic Formats
14.5.3 Photographic and Motion Video Formats
14.6 Module for Tables, Formulæ, and Graphics
15 Language Corpora
15.1 Varieties of Composite Text
15.2 Contextual Information
15.2.1 The Text Description
15.2.2 The Participant Description
15.2.3 The Setting Description
15.3 Associating Contextual Information with a Text
15.3.1 Combining Corpus and Text Headers
15.3.2 Declarable Elements
15.3.3 Summary
15.4 Linguistic Annotation of Corpora
15.4.1 Levels of Analysis
15.5 Recommendations for the Encoding of Large Corpora
15.6 Module for Language Corpora
16 Linking, Segmentation, and Alignment
16.1 Links
16.1.1 Pointers and Links
16.1.2 Using Pointers and Links
16.1.3 Groups of Links
16.1.4 Intermediate Pointers
16.2 Pointing Mechanisms
16.2.1 Pointing Elsewhere
16.2.2 Pointing Locally
16.2.3 W3C element() Scheme
16.2.4 TEI XPointer Schemes
16.2.5 Canonical References
16.3 Blocks, Segments, and Anchors
16.4 Correspondence and Alignment
16.4.1 Correspondence
16.4.2 Alignment of Parallel Texts
16.4.3 A Three-way Alignment
16.5 Synchronization
16.5.1 Aligning Synchronous Events
. . . . . . 504 16.5.2 Placing Synchronous Events in Time . . . . . . . . . . . . . . . . . . . . . . . . . 507 16.6 Identical Elements and Virtual Copies . . . . . . . . . . . . . . . . . . . . . . . . . . . 508 16.7 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 16.8 Alternation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515 16.9 Stand-off Markup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520 16.9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520 16.9.2 Overview of XInclude . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521 16.9.3 Doing Stand-off Markup in TEI . . . . . . . . . . . . . . . . . . . . . . . . . . . 522 16.9.4 Well-formedness and Validity of Stand-off Markup . . . . . . . . . . . . . . . . . 524 16.9.5 Including Text or XML Fragments . . . . . . . . . . . . . . . . . . . . . . . . . . 525 16.10 Connecting Analytic and Textual Markup . . . . . . . . . . . . . . . . . . . . . . . . . 526 16.11 Module for Linking, Segmentation, and Alignment . . . . . . . . . . . . . . . . . . . . 526 17 Simple Analytic Mechanisms 527 17.1 Linguistic Segment Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527 17.2 Global Attributes for Simple Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 533 17.3 Spans and Interpretations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533 17.4 Linguistic Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537 x edited by Lou Burnard and Syd Bauman 17.5 Module for Analysis and Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 541 18 Feature Structures 543 18.1 Organization of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543 18.2 Elementary Feature Structures and the Binary Feature Value . . . . . . . . . . . . . . . 
543 18.3 Other Atomic Feature Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 18.4 Feature and Feature-Value Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548 18.5 Feature Structures as Complex Feature Values . . . . . . . . . . . . . . . . . . . . . . 550 18.6 Re-entrant Feature Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552 18.7 Collections as Complex Feature Values . . . . . . . . . . . . . . . . . . . . . . . . . . 553 18.8 Feature Value Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555 18.8.1 Alternation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555 18.8.2 Negation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558 18.8.3 Collection of Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559 18.9 Default Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559 18.10 Linking Text and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560 18.11 Feature System Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563 18.11.1 Linking a TEI Text to Feature System Declarations . . . . . . . . . . . . . . . . . 564 18.11.2 e Overall Structure of a Feature System Declaration . . . . . . . . . . . . . . . 566 18.11.3 Feature Declarations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 18.11.4 Feature Structure Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571 18.11.5 A Complete Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573 18.12 Formal Definition and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 577 19 Graphs, Networks, and Trees 579 19.1 Graphs and Digraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579 19.1.1 Transition Networks . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 584 19.1.2 Family Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587 19.1.3 Historical Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590 19.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593 19.3 Another Tree Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598 19.4 Representing Textual Transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606 19.5 Module for Graphs, Networks, and Trees . . . . . . . . . . . . . . . . . . . . . . . . . 608 20 Non-hierarchical Structures 611 20.1 Multiple Encodings of the Same Information . . . . . . . . . . . . . . . . . . . . . . . 612 20.2 Boundary Marking with Empty Elements . . . . . . . . . . . . . . . . . . . . . . . . . 613 20.3 Fragmentation and Reconstitution of Virtual Elements . . . . . . . . . . . . . . . . . . 617 20.4 Stand-off Markup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 20.5 Non-XML-based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 21 Certainty and Responsibility 625 21.1 Levels of Certainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 21.1.1 Using Notes to Record Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . 626 21.1.2 Structured Indications of Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . 626 21.2 Attribution of Responsibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631 21.3 e Certainty Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631 xi e TEI Guidelines 22 Documentation Elements 633 22.1 Phrase Level Documentary Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 22.1.1 Phrase Level Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 22.1.2 Element and Attribute Descriptions . . . . . . . . . . . . . 
. . . . . . . . . . . . 635 22.2 Modules and Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636 22.3 Specification Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638 22.4 Common Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639 22.4.1 Description of Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639 22.4.2 Exemplification of Components . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 22.4.3 Classification of Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641 22.4.4 Element Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641 22.4.5 Attribute List Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642 22.4.6 Element Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645 22.4.7 Pattern Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648 22.5 Building a Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648 22.6 Combining TEI and Non-TEI Modules . . . . . . . . . . . . . . . . . . . . . . . . . . 650 22.7 Module for Documention Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651 23 Using the TEI 653 23.1 Obtaining the TEI Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 23.2 Personalization and Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 23.2.1 Kinds of Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 23.2.2 Modification and Namespaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662 23.2.3 Documenting the Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663 23.2.4 Examples of Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664 23.3 Conformance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 665 23.3.1 Well-formedness criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665 23.3.2 Validation Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666 23.3.3 Conformance to the TEI Abstract Model . . . . . . . . . . . . . . . . . . . . . . . 666 23.3.4 Use of the TEI Namespace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667 23.3.5 Documentation Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668 23.3.6 Varieties of TEI Conformance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668 23.4 Implementation of an ODD System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670 23.4.1 Making a Unified ODD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670 23.4.2 Generating Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674 23.4.3 Names and Documentation in Generated Schemas . . . . . . . . . . . . . . . . . 678 23.4.4 Making a RELAX NG Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679 23.4.5 Making a DTD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684 23.4.6 Generating Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685 23.4.7 Using TEI Parameterized Schema Fragments . . . . . . . . . . . . . . . . . . . . 685 A Model Classes 691 B Attribute Classes 711 C Elements 745 D Attributes 1241 xii edited by Lou Burnard and Syd Bauman E Datatypes and Other Macros 1247 F Bibliography 1267 Works cited in examples in the Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1267 Works cited elsewhere in the text of the Guidelines . . . . . . . . . . . . . . . . . . . . . . . 1279 Reading list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1282 eory of Markup and XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1283 TEI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
G Prefatory Notes
  Prefatory Note (March 2002)
  Introductory Note (November 2001)
  Introductory Note (June 2001)
  Introductory Note (May 1999)
  Typographic corrections made
  Specific changes in the DTD
  Outstanding errors
  Preface (April 1994)
  Acknowledgments
  TEI Working Committees (1990-1993)
  Advisory Board
  Steering Committee Membership
H Colophon

i Releases of the TEI Guidelines

P1 1990, C.M. Sperberg-McQueen and Lou Burnard
P2 1992, C.M. Sperberg-McQueen and Lou Burnard
P3 1994, C.M. Sperberg-McQueen and Lou Burnard
P4 2001, Lou Burnard, Syd Bauman, and Steven DeRose
P5 2007, Lou Burnard and Syd Bauman

ii Dedication

In memoriam
Donald E. Walker, 22 November 1928 – 26 November 1993
Antonio Zampolli, 1937 – 22 August 2003

iii Preface and Acknowledgments

This publication constitutes the fifth distinct version of the Guidelines for Electronic Text Encoding and Interchange, and the first complete revision since the appearance of P3 in 1994.
It includes substantial amounts of new material and a major revision of the underlying technical infrastructure. With this version, the Guidelines enter a new stage in their development as a community-maintained open source project. This edition is the first version to have benefitted from the close overview and oversight of an elected TEI Technical Council. The editors are therefore particularly pleased to acknowledge with gratitude the hard work and dedication put into this project by the Council over the last five years.

The Chair of the TEI Board sits on the Technical Council, and the Board also nominates one other member to the Council. The other Council members are all elected by the Consortium membership, and serve periods of up to two years at a time. The Board nominates the Chair of the Technical Council from among its members. The names and affiliations of all Council members who served during the production of this edition of the Guidelines are listed below.

Chair
* 2002-3: John Unsworth (University of Virginia)
* 2003-7: Christian Wittern (Kyoto University)

Board Members
* 2002-7: Sebastian Rahtz (University of Oxford)
* 2004-5: Julia Flanders (Brown University)
* 2006: Matthew Zimmerman (New York University)
* 2007: Daniel O'Donnell (University of Lethbridge)

Elected Members
* 2003-6: Alejandro Bia (University of Alicante)
* 2004-6; 2006-7: David Birnbaum (University of Pittsburgh)
* 2007: Tone Merete Bruvik (University of Bergen)
* 2007: Arianna Ciula (King's College London)
* 2005-7: James Cummings (University of Oxford)
* 2002-7: Matthew Driscoll (University of Copenhagen)
* 2002-4: David Durand (Ingenta plc)
* 2002-4: Tomas Erjavec (Jozef Stefan Institute, Ljubljana)
* 2002: Fotis Jannidis (University of Munich)
* 2006: Amit Kumar (University of Illinois at Urbana-Champaign)
* 2002: Martin Mueller (Northwestern University)
* 2006-7: Dorothy Porter (University of Kentucky)
* 2002-3: Merillee Proffitt (Research Libraries Group)
* 2002: Peter Robinson (De Montfort University)
* 2002: Geoffrey Rockwell (McMaster University)
* 2002-7: Laurent Romary (University of Nancy; Max Planck Digital Library)
* 2003-7: Susan Schreibman (University of Maryland)
* 2004-5: Natasha Smith (University of North Carolina at Chapel Hill)
* 2006-7: Conal Tuohy (Victoria University of Wellington)
* 2004-5: Edward Vanhoutte (Royal Academy of Dutch Language and Literature)
* 2005-7: John Walsh (Indiana University)
* 2002-5: Perry Willett (Indiana University)

The bulk of the Council's work has been carried out by email and by regular telephone conference. In addition, the Council has held six two-day face-to-face meetings. During production of P5, these meetings were generously hosted by the following institutions:

* King's College, London (2002)
* Oxford University Computing Services (2003)
* Royal Academy of Dutch Language and Literature, Ghent (2004)
* AFNOR: Association française de normalisation, Paris (2005)
* Institute for Research in Humanities, Kyoto University (2006)
* Berlin-Brandenburgische Akademie der Wissenschaften, Berlin (2007)

During the production of TEI P5, the Council chartered a number of smaller workgroups and similar activities, each of which made a significant contribution to the intellectual content of the work. Active members of these are listed below:

Character Set Workgroup
Active between July 2001 and January 2005, this group revised and developed the recommendations now forming chapters vi Languages and Character Sets and 5. Representation of Nonstandard Characters and Glyphs.
It was chaired by Christian Wittern, and its membership included: Deborah Anderson (Berkeley); Michael Beddow (independent scholar); David Birnbaum (Pittsburgh University); Martin Duerst (W3C/Keio University); Patrick Durusau (Society of Biblical Literature); Tomohiko Morioka (Kyoto University); and Espen Ore (National Library of Norway).

Meta Taskforce
Active between February 2003 and February 2005, this group developed the material now forming 22. Documentation Elements. It was chaired by Sebastian Rahtz, and its membership included: Alejandro Bia; David G. Durand; Laurent Romary; Norman Walsh (Sun Microsystems); and Christian Wittern.

Workgroup on Stand-Off Markup, XLink and XPointer
Active between February 2002 and January 2006, this group reviewed and expanded the material now largely forming part of 16. Linking, Segmentation, and Alignment. It was chaired by David G. Durand, and its membership included: Jean Carletta (Edinburgh University); Chris Caton (University of Oxford); Jessica P. Hekman (Ingenta plc); Nancy M. Ide (Vassar College); and Fabio Vitali (University of Bologna).

Manuscript Description Task Force
Active between February 2003 and December 2005, this group reviewed and finalised the material now forming 10. Manuscript Description. It was chaired by Matthew Driscoll and comprised David Birnbaum and Merrillee Proffitt, in addition to the TEI Editors.

Names and Places Activity
Active between January 2006 and May 2007, this group formulated the new material now forming part of 13. Names, Dates, People, and Places. It was chaired by Matthew Driscoll, and its membership included: Gabriel Bodard (King's College London); Arianna Ciula; James Cummings; Tom Elliott (University of North Carolina at Chapel Hill); Øyvind Eide (University of Oslo); Leif Isaksen (Oxford Archaeology plc); Richard Light (private consultant); Tadeusz Piotrowski (Opole University); Sebastian Rahtz; and Tatiana Timcenko (Vilnius University).
Joint TEI/ISO Activity on Feature Structures
Active between January 2003 and August 2007, this group reviewed the material now presented in 18. Feature Structures and revised it for inclusion in ISO Standard 24610. It was chaired by Kiyong Lee (Korea University), and its active membership included the following: Harry Bunt (Tilburg); Lionel Clément (INRIA); Eric de la Clergerie (INRIA); Thierry Declerck (Saarbrücken); Patrick Drouin (University of Montréal); Lee Gillam (Surrey University); and Kôiti Hasida (ICOT).

The TEI Editors, Lou Burnard (University of Oxford) and Syd Bauman (Brown University), serve ex officio on the Council and, as far as possible, on all Council workgroups.

The Council also oversees an Internationalization and Localization project, led by Sebastian Rahtz and with funding from the ALLC. This activity, ongoing since October 2005, is engaged in translating key parts of the P5 source into a variety of languages. Production of the translations currently included in P5 has been co-ordinated by the following:

* Chinese: Marcus Bingenheimer (Chung-hwa Institute of Buddhist Studies, Taipei) and Weining Hwang (Würzburg University)
* French: Pierre-Yves Duchemin (ENSSIB); Jean-Luc Benoit (ATILF); Anila Angjeli (BnF); Joëlle Bellec Martini (BnF); Marie-France Claerebout (Aldine); Magali Le Coënt (BIUSJ); Florence Clavaud (EnC); Cécile Pierre (BIUSJ)
* German: Werner Wegstein (Würzburg University)
* Japanese: Ohya Kazushi (Tsurumi University)
* Spanish: Carmen Arronis Llopis (University of Alicante) and Alejandro Bia (Miguel Hernández University)
* Italian: Marco Venuti (University of Venice) and Letizia Cirillo (University of Bologna)

iv About These Guidelines

These Guidelines have been developed and are maintained by the Text Encoding Initiative Consortium (TEI); see iv.2 Historical Background. They are addressed to anyone who works with any kind of textual resource in digital form.
They make recommendations about suitable ways of representing those features of textual resources which need to be identified explicitly in order to facilitate processing by computer programs. In particular, they specify a set of markers (or tags) which may be inserted in the electronic representation of the text, in order to mark the text structure and other features of interest. Many, or most, computer programs depend on the presence of such explicit markers for their functionality, since without them a digitized text appears to be nothing but a sequence of undifferentiated bits. The success of the World Wide Web, for example, is partly a consequence of its use of such markup to indicate such features as headings and lists on individual pages, and to indicate links between pages.

The process of inserting such explicit markers for implicit textual features is often called 'markup', or equivalently within this work 'encoding'; the term 'tagging' is also used informally. We use the term encoding scheme or markup language to denote the complete set of rules associated with the use of markup in a given context; we use the term markup vocabulary for the specific set of markers or named distinctions employed by a given encoding scheme. Thus, this work both describes the TEI encoding scheme, and documents the TEI markup vocabulary.

The TEI encoding scheme is of particular usefulness in facilitating the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software. Since they contain an inventory of the features most often deployed for computer-based text processing, the Guidelines are also useful as a starting point for those designing new systems and creating new materials, even where interchange of information is not a primary objective.

These Guidelines apply to texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content.
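The dependence of programs on explicit markers, described above, can be illustrated with a minimal sketch. The fragment below is invented for this illustration and, for brevity, omits the namespace declaration that a real TEI document would carry; the helper names headings and list_items are hypothetical. The sketch uses Python's standard xml.etree.ElementTree module to show how headings and list items become directly addressable once they are tagged.

```python
import xml.etree.ElementTree as ET

# An invented fragment using TEI-style element names (no namespace, for brevity).
SAMPLE = (
    "<div>"
    "<head>Chapter 1</head>"
    "<p>It was a dark and stormy night.</p>"
    "<list><item>rain</item><item>wind</item></list>"
    "</div>"
)


def headings(xml_text):
    """Return the text of every <head> element in the fragment."""
    root = ET.fromstring(xml_text)
    return [head.text for head in root.iter("head")]


def list_items(xml_text):
    """Return the text of every <item> element in the fragment."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.iter("item")]


if __name__ == "__main__":
    print(headings(SAMPLE))
    print(list_items(SAMPLE))
```

Without the tags, a program would have to guess which words form the heading and where each list item begins and ends; with them, both questions reduce to a simple lookup.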
The Guidelines treat both continuous materials ('running text') and discontinuous materials such as dictionaries and linguistic corpora. Though principally directed to the needs of the scholarly research community, the Guidelines are not restricted to esoteric academic applications. They are also useful for librarians maintaining and documenting electronic materials, and for publishers and others creating or distributing electronic texts. Although they focus on problems of representing in electronic form texts which already exist in traditional media, these Guidelines are also applicable to textual material which is 'born digital'. We believe them to be adequate to the widest variety of currently existing practices in using digital textual data, but by no means limited to them.

The rules and recommendations made in these Guidelines are expressed in terms of what is currently the most widely-used markup language for digital resources of all kinds: the Extensible Markup Language (XML), as defined by the World Wide Web Consortium's XML Recommendation. However, the TEI encoding scheme itself does not depend on this language; it was originally formulated in terms of a predecessor of XML (the ISO Standard Generalized Markup Language), and may in future years be re-expressed in other such frameworks as the field of markup develops and matures. For more information on markup languages see chapter v A Gentle Introduction to XML; for more information on the associated character encoding issues see chapter vi Languages and Character Sets.

This document provides the authoritative and complete statement of the requirements and usage of the TEI encoding scheme. As such, although it includes numerous small examples, it must be stressed that this work is intended to be a reference manual rather than a tutorial guide.

The remainder of this chapter comprises three sections.
The first gives an overview of the structure and notational conventions used throughout these Guidelines. The second enumerates the design principles underlying the TEI scheme and the application environments in which it may be found useful. Finally, the third section gives a brief account of the origins and development of the Text Encoding Initiative itself.

iv.1 Structure and Notational Conventions of this Document

The remaining two sections of the front matter to the Guidelines provide background tutorial material for those unfamiliar with basic markup technologies. Following the present introductory section, we present a detailed introduction to XML itself, intended to cover in a relatively painless manner as much as the novice user of the TEI scheme needs to know about markup languages in general and XML in particular. This is followed by a discussion of the general principles underlying current practice in the representation of different languages and writing systems in digital form. This chapter is largely intended for the user unfamiliar with the Unicode encoding systems, though the expert may also find its historical overview of interest.

The body of this edition of the Guidelines proper contains 23 chapters arranged in increasing order of specialist interest. The first five chapters discuss in depth matters likely to be of importance to anyone intending to apply the TEI scheme to virtually any kind of text. The next seven focus on particular kinds of text: verse, drama, spoken text, dictionaries, and manuscript materials. The next nine chapters deal with a wide range of topics, one or more of which are likely to be of interest in specialist applications of various kinds. The last two chapters deal with the XML encoding used to represent the TEI scheme itself, and provide technical information about its implementation. The last chapter also defines the notion of TEI conformance and its implications for interchange of materials produced according to these Guidelines.
As noted above, this is a reference work, and is not intended to be read through from beginning to end. However, the reader wishing to understand the full potential of the TEI scheme will need a thorough grasp of the material covered by the first four chapters and the last two. Beyond that, readers are advised to select chapters according to their specific interests: one of the strengths of the TEI architecture is its modular nature. As far as possible, extensive cross-referencing is provided wherever related topics are dealt with; these cross-references are particularly effective in the online version of the Guidelines. In addition, a series of technical appendixes provide detailed formal definitions for every element, every class, and every macro discussed in the body of the work; these are also cross-linked as appropriate. Finally, a detailed bibliography is provided, which identifies the source of many examples cited in the text as well as documenting works referred to, and listing other relevant publications.

As an aid to the reader, most chapters of these Guidelines follow the same basic organization. The chapter begins with an overview of the subjects treated within it, linked to the following subsections. Within each section where new elements are described, a summary table is first given, which provides their names and a brief description of their intended usage. This is then followed where appropriate by further discussion of each element, including wherever possible usage examples taken somewhat eclectically from a variety of real sources. These examples are not intended to be exhaustive, but rather to suggest typical ways in which the elements concerned may usefully be applied. Where appropriate, a link to a statement of the source for most examples is provided in the online version. Within the examples, use of whitespace such as newlines or indentation is simply intended to aid legibility, and is not prescriptive or normative.

Wherever TEI elements or classes are mentioned in the text, they are linked in the online version to the relevant reference specification for the element or class concerned. Element names are always given in the form <name>, where 'name' is the generic identifier of the element; empty elements such as <pb/> or <anchor/> include a closing slash to distinguish them wherever they are discussed. References to attributes take the form attname, where 'attname' is the name of the attribute. References to classes are also presented as links, for example model.divLike for a model class, and att.global for an attribute class.

iv.1.1 Design Principles

Because of its roots in the humanities research community, the TEI scheme is driven by its original goal of serving the needs of research, and is therefore committed to providing a maximum of comprehensibility, flexibility, and extensibility. More specific design goals of the TEI have been that the Guidelines should:

* provide a standard format for data interchange
* provide guidance for the encoding of texts in this format
* support the encoding of all kinds of features of all kinds of texts studied by researchers
* be application independent

This has led to a number of important design decisions, such as:

* the choice of XML and Unicode
* the provision of a large predefined tag set
* encodings for different views of text
* alternative encodings for the same textual features
* mechanisms for user-defined modification of the scheme

We discuss some of these goals in more detail below.

The goal of creating a common interchange format which is application independent requires the definition of a specific markup syntax as well as the definition of a large set of elements or concepts. The syntax of the recommendations made in this document conforms to the World Wide Web Consortium's XML Recommendation (Bray et al. (eds.)
(2006)) but their definition is as far as possible independent of any particular schema language.

The goal of providing guidance for text encoding suggests that recommendations be made as to what textual features should be recorded in various situations. However, when selecting certain features for encoding in preference to others, these Guidelines have tended to prefer generic solutions to specific ones, and to avoid areas where no consensus exists, while attempting to accommodate as many diverse views as feasible. Consequently, the TEI Guidelines make (with relatively rare exceptions) no suggestions or restrictions as to the relative importance of textual features. The philosophy of the Guidelines is 'if you want to encode this feature, do it this way', but very few features are mandatory. In the same spirit, while the Guidelines very rarely require you to encode any particular feature, they do require you to be honest about which features you have encoded, that is, to respect the meanings and usage rules they recommend for specific elements and attributes proposed.

The requirement to support all kinds of materials likely to be of interest in research has largely conditioned the development of the TEI into a very flexible and modular system. The development of other XML vocabularies or standards is typically motivated by the desire to create a single fully specified encoding scheme for use in a well-defined application domain. By contrast, the TEI is intended for use in a large number of rather ill-defined and often overlapping domains. It achieves its generality by means of the modular architecture described in 1. The TEI Infrastructure, which enables each user to create a schema appropriate to their needs without compromising the interoperability of their data.

The Guidelines have been written largely with a focus on text capture (i.e.
the representation in electronic form of an already existing copy text in another medium) rather than text creation (where no such copy text exists). Hence the frequent use of terms like 'transcription', 'original', 'copy text', etc. However, the Guidelines are equally applicable to text creation.

Concerning text capture, the TEI Guidelines do not specify a particular approach to the problem of fidelity to the source text and recoverability of the original; such a choice is the responsibility of the text encoder. The current version of these Guidelines, however, provides a more fully elaborated set of tags for markup of rhetorical, linguistic, and simple typographic characteristics of the text than for detailed markup of page layout or for fine distinctions among type fonts or manuscript hands. It should be noted also that, with the present version of the Guidelines, it is no longer necessarily the case that an unmediated version of the source text can be recovered from an encoded text simply by removing the markup.

In these Guidelines, no hard and fast distinction is drawn between 'objective' and 'subjective' information or between 'representation' and 'interpretation'. These distinctions, though widely made and often useful in narrow, well-defined contexts, are perhaps best interpreted as distinctions between issues on which there is a scholarly consensus and issues where no such consensus exists. Such consensus has been, and no doubt will be, subject to change. The TEI Guidelines do not make suggestions or restrictions as to which of these features should be encoded. The use of the terms descriptive and interpretive about different types of encoding in the Guidelines is not intended to support any particular view on these theoretical issues. Historically, it reflects a purely practical division of responsibility amongst the original working committees (see further iv.2 Historical Background).
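The earlier remark that the source text cannot always be recovered simply by removing the markup can be illustrated with a small sketch. It assumes an invented sentence in which the encoder has recorded both an apparent error and its correction using the TEI elements <choice>, <sic>, and <corr>; the helper names strip_markup and text_without are hypothetical, and the fragment again omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Invented example: the source reads "recieved"; the encoder also supplies a correction.
ENCODED = (
    "<p>The <choice><sic>recieved</sic><corr>received</corr></choice> "
    "text was copied twice.</p>"
)


def strip_markup(xml_text):
    """Naively remove every tag and keep all character data."""
    return "".join(ET.fromstring(xml_text).itertext())


def text_without(elem, skip):
    """Collect the text of elem, leaving out the content of any <skip> element."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(text_without(child, skip))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring(ENCODED)
    print(strip_markup(ENCODED))       # both readings run together
    print(text_without(root, "corr"))  # the reading of the source
    print(text_without(root, "sic"))   # the corrected reading
```

Naive tag removal concatenates the two alternative readings, so a processor has to interpret the markup, here by choosing one branch of <choice>, in order to produce either the source reading or the corrected one.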
In general, the accuracy and the reliability of the encoding and the appropriateness of the interpretation are for the individual user of the text to determine. The Guidelines provide a means of documenting the encoding in such a way that a user of the text can know the reasoning behind that encoding, and the general interpretive decisions on which it is based. The TEI header may be used to document and justify many such aspects of the encoding, but the choice of TEI elements for a particular feature is in itself a statement about the interpretation reached by the encoder.

In many situations more than one view of a text is needed since no absolute recommendation to embody one specific view of text can apply to all texts and all approaches to them. Within limits, the syntax of XML ensures that some encodings can be ignored for some purposes. To enable encoding multiple views, these Guidelines not only treat a variety of textual features, but sometimes provide several alternative encodings for what appear to be identical textual phenomena. These Guidelines offer the possibility of encoding many different views of the text, simultaneously if necessary. Where different views of the formal structure of a text are required, as opposed to different annotations on a single structural view, however, the formal syntax of XML (which requires a single hierarchical view of text structure) poses some problems; recommendations concerning ways of overcoming or circumventing that restriction are discussed in chapter 20. Non-hierarchical Structures.

In brief, the TEI Guidelines define a general-purpose encoding scheme which makes it possible to encode different views of text, possibly intended for different applications, serving the majority of scholarly purposes of text studies in the humanities.
Because no predefined encoding scheme can possibly serve all research purposes, the TEI scheme is designed to facilitate both selection from a wide range of predefined markup choices, and the addition of new (non-TEI) markup options. By providing a formally verifiable means of extending the TEI recommendations, the TEI makes it simple for such user-identified modifications to be incorporated into future releases of the Guidelines as they evolve. The underlying mechanisms which support these aspects of the scheme are introduced in chapter 1. The TEI Infrastructure, and detailed discussions of their use are provided in chapter 23. Using the TEI.

iv.1.2 Intended Use

We envisage three primary functions for these Guidelines:
* guidance for individual or local practice in text creation and data capture;
* support of data interchange;
* support of application-independent local processing.

These three functions are so thoroughly interwoven in practice that it is hardly possible to address any one without addressing the others. However, the distinction provides a useful framework for discussing the possible role of the Guidelines in work with electronic texts.

Use in Text Capture and Text Creation

The description of textual features found in the chapters which follow should provide a useful checklist from which scholars planning to create electronic texts should select the subset of features suitable for their project. Problems specific to text creation or text 'capture' have not been considered explicitly in this document. These Guidelines are not concerned with the process by which a digital text comes into being: it can be typed by hand, scanned from a printed book or typescript, read from a typesetter's tape, or acquired from another researcher who may have used another markup scheme (or no explicit markup at all).
We include here only some general points which are often raised about markup and the process of data capture. XML can appear distressingly verbose, particularly when (as in these Guidelines) the names of tags and attributes are chosen for clarity and not for brevity. Editor macros and keyboard shortcuts can allow a typist to enter frequently used tags with single keystrokes. It is often possible to transform word-processed or scanned text automatically. Markup-aware software can help with maintaining the hierarchical structure of the document, and display the document with visual formatting rather than raw tags. The techniques described in chapter 23.2. Personalization and Customization may be used to develop simpler data capture TEI-conformant schemas, for example with limited numbers of elements, or with shorter names for the tags being used most often. Documents created with such schemas may then be automatically converted to a more elaborated TEI form.

Use for Interchange

The TEI format may simply be used as an interchange format, permitting projects to share resources even when their local encoding schemes differ. If there are n different encoding formats, to provide mappings between each possible pair of formats requires n*(n-1) translations; with an interchange format, only 2n such mappings are needed. However, for such translations to be carried out without loss of information, the interchange format chosen must be as expressive (in a formal sense) as any of the target formats; this is a further reason for the TEI's provision of both highly abstract or generic encodings and highly specific ones.

To translate between any pair of encoding schemes implies:
1. identifying the sets of textual features distinguished by the two schemes;
2. determining where the two sets of features correspond;
3. creating a suitable set of mappings.

For example, to translate from encoding scheme X into the TEI scheme:
1. Make a list of all the textual features distinguished in X.
2.
Identify the corresponding feature in the TEI scheme. There are three possibilities for each feature:
(a) the feature exists in both X and the TEI scheme;
(b) X has a feature which is absent from the TEI scheme;
(c) X has a feature which corresponds with more than one feature in the TEI scheme.

The first case is a trivial renaming. The second will require an extension to the TEI scheme, as described in chapter 23.2. Personalization and Customization. The third is more problematic, but not impossible, provided that a consistent choice can be made (and documented) amongst the alternatives. The ease with which this translation can be defined will of course depend on the clarity with which scheme X represents the features it encodes. Translating from the TEI into scheme X follows the same pattern, except that if a TEI feature has no equivalent in X, and X cannot be extended, information must be lost in translation.

The rules defining conformance to the Guidelines are given in some detail in chapter 23.3. Conformance. The basic principles informing those rules may be summarized as follows:
1. The TEI abstract model (that is, the set of categorical distinctions which it defines) must be respected. The correspondence between a tag X and the semantic function assigned to it by these Guidelines may not be changed; such changes are known as tag abuse and are strongly deprecated.
2. A TEI document must be expressed as a valid XML-conformant document which uses the TEI namespace appropriately. If, for example, the document encodes features not provided by the Guidelines, such extensions may not be associated with the TEI namespace.
3. It must be possible to validate a TEI document against a schema derived from these Guidelines, possibly with extensions provided in the recommended manner.

Use for Local Processing

Machine-readable text can be manipulated in many ways; some users:
* edit texts (e.g.
word processors, syntax-directed editors)
* edit, display, and link texts in hypertext systems
* format and print texts using desktop publishing systems, or batch-oriented formatting programs
* load texts into free-text retrieval databases or conventional databases
* unload texts from databases as search results or for export to other software
* search texts for words or phrases
* perform content analysis on texts
* collate texts for critical editions
* scan texts for automatic indexing or similar purposes
* parse texts linguistically
* analyze texts stylistically
* scan verse texts metrically
* link text and images

These applications cover a wide range of likely uses but are by no means exhaustive. The aim has been to make the TEI Guidelines useful for encoding the same texts for different purposes. We have avoided anything which would restrict the use of the text for other applications. We have also tried not to omit anything essential to any single application. Because the TEI format is expressed using XML, almost any modern text processing system is able to process it, and new TEI-aware software systems are able to build on a solid base of existing software libraries.

iv.2 Historical Background

The Text Encoding Initiative grew out of a planning conference sponsored by the Association for Computers and the Humanities (ACH) and funded by the U.S. National Endowment for the Humanities (NEH), which was held at Vassar College in November 1987. At this conference some thirty representatives of text archives, scholarly societies, and research projects met to discuss the feasibility of a standard encoding scheme and to make recommendations for its scope, structure, content, and drafting. During the conference, the Association for Computational Linguistics and the Association for Literary and Linguistic Computing agreed to join ACH as sponsors of a project to develop the Guidelines.
The outcome of the conference was a set of principles (the 'Poughkeepsie Principles', Burnard (1988)), which determined the further course of the project.

The Text Encoding Initiative project began in June 1988 with funding from the NEH, soon followed by further funding from the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. Four working committees, composed of distinguished scholars and researchers from both Europe and North America, were named to deal with problems of text documentation, text representation, text analysis and interpretation, and metalanguage and syntax issues. Each committee was charged with the task of identifying 'significant particularities' in a range of texts, and two editors appointed to harmonise the resulting recommendations.

A first draft version (P1, with the 'P' here and subsequently standing for 'Proposal') of the Guidelines was distributed in July 1990 under the title Guidelines for the Encoding and Interchange of Machine-Readable Texts. Extensive public comment and further work on areas not covered in this version resulted in the drafting of a revised version, TEI P2, distribution of which began in April 1992. This version included substantial amounts of new material, resulting from work carried out by several specialist working groups, set up in 1990 and 1991 to propose extensions and revisions to the text of P1. The overall organization, both of the draft itself and of the scheme it describes, was entirely revised and reorganized in response to public comment on the first draft.

In June 1993 an Advisory Board met to review the current state of the TEI Guidelines, and recommended the formal publication of the work done to that time.
That version of the TEI Guidelines, TEI P3, consolidated the work published as parts of TEI P2, along with some additional new material and was finally published in May of 1994 without the label 'draft', thus marking the conclusion of the initial development work.

In February of 1998 the World Wide Web Consortium issued a final Recommendation for the Extensible Markup Language, XML.[1] Following the rapid take-up of this new standard metalanguage, it became evident that the TEI Guidelines (which had been published originally as an SGML application) needed to be reexpressed in this new formalism if they were to survive. The TEI editors, with abundant assistance from others who had developed and used TEI, developed an update plan, and made tentative decisions on relevant syntactic issues.

In January of 1999, the University of Virginia and the University of Bergen formally proposed the creation of an international membership organization, to be known as the TEI Consortium, which would maintain, develop, and promote the TEI. Shortly thereafter, two further institutions with longstanding ties to the TEI (Brown University and Oxford University) joined them in formulating an Agreement to Establish a Consortium for the Maintenance of the Text Encoding Initiative (An Agreement to Establish a Consortium for the Maintenance of the Text Encoding Initiative (March 1999)), on which basis the TEI Consortium was eventually established and incorporated as a not-for-profit legal entity at the end of the year 2000. The first members of the new TEI Board took office during January of 2001.

The TEI Consortium was established in order to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization. In addition, the TEI Consortium was intended to foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines (Burnard (2000)).
[1] XML was originally developed as a way of publishing on the World Wide Web richly encoded documents such as those for which the TEI was designed. Several TEI participants contributed heavily to the development of XML, most notably XML's senior co-editor C. M. Sperberg-McQueen, who served as the North American editor for the TEI Guidelines from their inception until 1999.

To oversee and manage the revision process in collaboration with the TEI Editors, the TEI Board formed a Technical Council, with a membership elected from the TEI user community. The Council met for the first time in January 2002 at King's College London. Its first task was to oversee production of an XML version of the TEI Guidelines, updating P3 to enable users to work with the emerging XML toolset. This, the P4 version of the Guidelines, was published in June 2002. It was essentially an XML version of P3, making no substantive changes to the constraints expressed in the schemas apart from those necessitated by the shift to XML, and changing only corrigible errors identified in the prose of the P3 Guidelines. However, given that P3 had by this time been in steady use since 1994, it was clear that a substantial revision of its content was necessary, and work began immediately on the P5 version of the Guidelines. This was planned as a thorough overhaul, involving a public call for features and new development in a number of important areas not previously addressed including character encoding, graphics, manuscript description, biographical and geographical data, and the encoding language in which the TEI Guidelines themselves are written.

The members of the TEI Council and its associated workgroups are listed in iii Preface and Acknowledgments.
In preparing this edition, they have been attentive to the requirements and practice of the widest possible range of TEI users, who are now to be found in many different research communities across the world, and have been largely instrumental in transforming the TEI from a grant-supported international research project into a self-sustaining community-based effort. One effect of the incorporation of the TEI has been the legal requirement to hold an annual meeting of the Consortium members; these meetings have emerged as an invaluable opportunity to sustain and reinforce that sense of community.

The present work is therefore the result of a sustained period of consultation, drafting, and revision, with input from many different experts. Whatever merits it may have are to be attributed to them; the Editors accept responsibility only for the errors remaining.

iv.3 Future Developments

The encoding recommended by this document may be used without fear that future versions of the TEI scheme will be inconsistent with it in fundamental ways. The TEI will be sensitive, in revising these Guidelines, to the possible problems which revision might pose for those who are already using this version of the Guidelines.

With TEI P5, a version numbering system is introduced: the version number has two parts, a major number and a minor, for example 1.0. The TEI undertakes that no change will be made to the formal expression of these Guidelines (that is, a TEI schema, as defined in 23.3. Conformance) such that documents conformant to a given major numbered release cease to be compatible with a subsequent release of the same major number. Moreover, as far as possible, new minor releases will be made only for the purpose of adding new compatible features, or of correcting errors in existing features.
The Guidelines are currently maintained as an open source (GNU General Public License) project, on the SourceForge site http://tei.sf.net/ from which released and development versions may be freely downloaded; notice of errors detected and enhancements requested may also be submitted at this site.

v A Gentle Introduction to XML

The encoding scheme defined by these Guidelines is formulated as an application of the Extensible Markup Language (XML) (Bray et al. (eds.) (2006)). XML is widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form. It is now also the interchange and communication format used by many applications on the World Wide Web. In the present chapter we informally introduce some of its basic concepts and attempt to explain to the reader encountering them for the first time how and why they are used in the TEI scheme. More detailed technical accounts of TEI practice in this respect are provided in chapters 23. Using the TEI, 1. The TEI Infrastructure, and 22. Documentation Elements of these Guidelines.

Strictly speaking, XML is a metalanguage, that is, a language used to describe other languages, in this case, markup languages. Historically, the word markup has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing.

Generalizing from that sense, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text.
Of course, all printed texts are implicitly encoded (or marked up) in this sense: punctuation marks, capitalization, disposition of letters around the page, even the spaces between words all might be regarded as a kind of markup, the purpose of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from scriptio continua[1]; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted.

By markup language we mean a set of markup conventions used together for encoding texts. A markup language must specify how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last.

The present chapter attempts to give an informal introduction to those parts of XML of which a proper understanding is necessary to make best use of these Guidelines. The interested reader should also consult one or more of the many excellent introductory textbooks and web sites now available on the subject.[2]

[1] In the 'continuous writing' characteristic of manuscripts from the early classical period, words are written continuously with no intervening spaces or punctuation.
[2] New textbooks about XML appear at regular intervals and to select any one of them would be invidious. A useful list of pointers to introductory web sites is available from http://www.xml.org/xml/resources_focus_beginnerguide.shtml; recommended online courses include http://www.w3schools.com/xml/default.asp and http://www.ibm.com/developerworks/edu/x-dw-xmlintro-i.html.
v.1 What's special about XML?

Three characteristics of XML distinguish it from other markup languages:
1. its emphasis on descriptive rather than procedural markup;
2. its notion of documents as instances of a document type;
3. its independence of any one hardware or software system.

These three aspects are discussed briefly below, and then in more depth in the remainder of this chapter.

XML is frequently compared with HTML, the language in which web pages have generally been written, which shares some of the above characteristics. Compared with HTML, however, XML has some other important features:
* XML is extensible: it does not consist of a fixed set of tags;
* XML documents must be well-formed according to a defined syntax;
* an XML document can be formally validated against a schema of some kind;
* XML is more interested in the meaning of data than in its presentation.

v.1.1 Descriptive markup

In a descriptive markup system, the markup codes used do little more than categorize parts of a document. Markup codes such as <para> or \end{list} simply identify a portion of a document and assert of it that 'the following item is a paragraph', or 'this is the end of the most recently begun list', etc. By contrast, a procedural markup system defines what processing is to be carried out at particular points in a document: 'call procedure PARA with parameters 42, b, and x here' or 'move the left margin 2 quads left, move the right margin 2 quads right, skip down one line, and go to the new left margin,' etc. In XML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the markup used to describe it.
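The separation between descriptive markup and processing instructions can be illustrated with a small sketch in Python, using only the standard library. It is not part of the TEI scheme: the element names (doc, para, note), the two dictionaries standing in for stylesheets, and the render function are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# A descriptively marked-up fragment: the tags say only what each part is.
# The element names are illustrative and are not TEI element names.
DOC = (
    "<doc>"
    "<para>First paragraph.</para>"
    "<note>An aside.</note>"
    "<para>Second paragraph.</para>"
    "</doc>"
)

# Two hypothetical "stylesheets", kept apart from the document itself.
# Each maps an element name to a function that renders that element's text.
PLAIN = {
    "para": lambda text: text + "\n",
    "note": lambda text: "",  # this view disregards notes entirely
}
ANNOTATED = {
    "para": lambda text: "    " + text + "\n",  # indented paragraphs
    "note": lambda text: "[Note: " + text + "]\n",
}


def render(xml_text, stylesheet):
    """Apply a stylesheet to each child of the root element, in document order."""
    root = ET.fromstring(xml_text)
    return "".join(stylesheet[child.tag](child.text or "") for child in root)


print(render(DOC, PLAIN))
print(render(DOC, ANNOTATED))
```

The document is never altered; only the mapping supplied to render changes, which is the sense in which processing is kept distinct from description.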
Usually, the markup or other information needed to process a document will be maintained separately from the document itself, typically in a distinct document called a stylesheet, though it may do much more than simply define the rendition or visual appearance of a document.[3]

When descriptive markup is used, the same document can readily be processed in many different ways, using only those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different kinds of processing can be carried out with the same part of a file. For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, but using a different stylesheet, might print names of persons and places in a distinctive typeface.

[3] We do not here discuss in any detail the ways that a stylesheet can be used or defined, nor do we discuss the popular W3C Stylesheet Languages XSLT and CSS. See further Berglund (ed.) (2006), Clark (ed.) (1999), and Lie and Bos (eds.) (1999).

v.1.2 Types of document

A second key aspect of XML is its notion of a document type: documents are regarded as having types, just as other objects processed by computers do. The type of a document is formally defined by its constituent parts and their structure. The definition of a 'report', for example, might be that it consisted of a 'title' and possibly an 'author', followed by an 'abstract' and a sequence of one or more 'paragraphs'. Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader.
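The 'report' definition above can be written down as a checkable rule. The following Python sketch is a hand-rolled illustration only, not a schema language and not how TEI schemas are expressed: it assumes the parts of a report are encoded as elements named report, title, author, abstract, and paragraph, and the names REPORT_MODEL and is_report are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The content model from the prose, written as a pattern over child element
# names: a title, optionally an author, an abstract, then one or more paragraphs.
REPORT_MODEL = re.compile(r"title (author )?abstract (paragraph )+")


def is_report(xml_text):
    """Return True if the document's root is a report whose children fit the model."""
    root = ET.fromstring(xml_text)
    if root.tag != "report":
        return False
    child_names = "".join(child.tag + " " for child in root)
    return REPORT_MODEL.fullmatch(child_names) is not None


GOOD = ("<report><title>T</title><abstract>A</abstract>"
        "<paragraph>P1</paragraph><paragraph>P2</paragraph></report>")
NO_TITLE = "<report><abstract>A</abstract><paragraph>P1</paragraph></report>"
WRONG_ORDER = ("<report><title>T</title><paragraph>P1</paragraph>"
               "<abstract>A</abstract></report>")

print(is_report(GOOD), is_report(NO_TITLE), is_report(WRONG_ORDER))
```

A document lacking a title, or one in which paragraphs precede the abstract, fails the check however report-like it may look to a human reader.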
If documents are of known types, a special-purpose program (called a parser), once provided with an unambiguous definition of a document type, can check that any document claiming to be of that type does in fact conform to the specification. A parser can check that all elements specified for a particular document type are present and no others, that they are combined in appropriate ways, correctly ordered, and so forth. More significantly, different documents of the same type can be processed in a uniform way. Programs can be written which take advantage of the knowledge encapsulated in the document type information, and which can thus behave in a more 'intelligent' fashion.

v.1.3 Data independence

A basic design goal of XML is to ensure that documents encoded according to its provisions can move from one hardware and software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of data characters that make up a document. All XML documents, whatever languages or writing systems they employ, use the same underlying character encoding (that is, the same method of representing as binary data those graphic forms making up a particular writing system).[4] This encoding is defined by an international standard,[5] which is implemented by a universal character set maintained by an industry group called the Unicode Consortium, and known as Unicode.[6] Unicode provides a standardised way of representing any of the many thousands of discrete symbols making up the world's writing systems, past and present. Most modern computing systems now support Unicode directly; for those which do not, XML provides a mechanism for the indirect representation of single characters by means of their character number, known as character references; see further v.6.1 Character References.
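As a brief illustration of character references, the Python sketch below parses the same word three times: once with the character typed directly, once with a decimal reference, and once with a hexadecimal reference. The word element is an assumption made for the example; the reference syntax (&#233; and &#xE9; for the character numbered 233) is standard XML.

```python
import xml.etree.ElementTree as ET

# The same character (number 233, LATIN SMALL LETTER E WITH ACUTE) written
# directly, as a decimal character reference, and as a hexadecimal one.
direct = ET.fromstring("<word>caf\u00e9</word>")
decimal = ET.fromstring("<word>caf&#233;</word>")
hexadecimal = ET.fromstring("<word>caf&#xE9;</word>")

# After parsing, all three yield the same string of Unicode characters.
print(direct.text == decimal.text == hexadecimal.text)
print(ord(decimal.text[-1]))
```

Once parsed, a character reference is indistinguishable from the character it stands for, so a document can be stored in plain ASCII without losing information.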
v.2 Textual structures

A text is not an undifferentiated sequence of words, much less of bytes. For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages.

Structural units of this kind are most often used to identify specific locations or refer to points within a text ('the third sentence of the second paragraph in chapter ten'; 'canto 10, line 1234'; 'page 412', etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes ('is the average sentence length of section 2 different from that of section 5?', 'how many paragraphs separate each occurrence of the word nature? how many pages?'). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text ('the 93rd speech by Horatio in Act 2') than for facilitating comparisons between the words used by one character and those of another, or those used by the same character at different points of the play.

In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth.
And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of whitespace and so forth.

These textual structures overlap with one another in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's Tristram Shandy for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and presentational ones (such as page divisions). For many types of research, the interplay among different levels of analysis is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.

[4] See Extensible Markup Language (XML) 1.0, available from http://www.w3.org/TR/REC-xml, Section 2.2 Characters.
[5] ISO/IEC 10646-1993 Information Technology -- Universal Multiple-Octet Coded Character Set (UCS)
[6] See http://www.unicode.org/

v.3 XML structures

This section describes the simple and consistent mechanism for the markup or identification of textual structure provided by XML. It also describes the methods XML provides for the expression of rules defining how units of textual structure can meaningfully be combined in a text.

v.3.1 Elements

The technical term used in XML for a textual unit, viewed as a structural component, is element. Different types of elements are given different names, but XML provides no way of expressing the meaning of a particular type of element, other than its relationship to other element types.
That is, all one can say about an element called (say) <blort> is that instances of it may (or may not) occur within elements of type <farble>, and that it may (or may not) be decomposed into elements of type <blortette>. It should be stressed that XML is entirely unconcerned with the semantics of textual elements, because these are considered to be application dependent. It is up to the creators of XML vocabularies (such as these Guidelines) to choose intelligible element names and to define their intended use in text markup. That is the chief purpose of documents such as the TEI Guidelines. From the need to choose element names indicative of function comes the technical term for the name of an element type, which is generic identifier, or GI.

Within a marked-up text (a document instance), each element must be explicitly marked or tagged in some way. This is done by inserting a tag at the beginning of the element (a start-tag) and another at its end (an end-tag). The start- and end-tag pair are used to bracket off element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, a quotation element in a text might be tagged as follows:

    ... Rosalind's remarks <quote>This is the silliest stuff that ere I heard of!</quote> clearly indicate ...

As this example shows, a start-tag takes the form <quote>, where the opening angle bracket indicates the start of the start-tag, 'quote' is the generic identifier of the element that is being delimited, and the closing angle bracket indicates the end of the start-tag. An end-tag takes an identical form, except that the opening angle bracket is followed by a solidus (slash) character, so that the corresponding end-tag is </quote>.[7] The material between the start-tag and the end-tag (the string of words 'This is the silliest stuff that ere I heard of' in the example above) is known as the content of the element.
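The relationship between tags, generic identifier, and content can be observed with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the enclosing p element is an assumption added only so that the fragment has a single outermost element and can be parsed on its own.

```python
import xml.etree.ElementTree as ET

# The quotation example, wrapped in a hypothetical <p> element so that the
# fragment can be parsed by itself.
FRAGMENT = (
    "<p>... Rosalind's remarks <quote>This is the silliest stuff "
    "that ere I heard of!</quote> clearly indicate ...</p>"
)

root = ET.fromstring(FRAGMENT)
quote = root.find("quote")

generic_identifier = quote.tag  # the name given in the start- and end-tags
content = quote.text            # the material between the two tags

print(generic_identifier)
print(content)
```

The running text before and after the element remains outside it: in this library it is available as root.text and quote.tail respectively.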
Sometimes there may be nothing between the start and the end-tag; in this case the two may optionally be merged together into a single composite tag with the solidus at the end, like this: <quote/>.

[7] Because the opening angle bracket has this special function in an XML document, special steps must be taken to use that character for other purposes (for example, as the mathematical less-than operator); see further section v.6.1 Character References.

v.3.2 Content models: an example

An element may be empty, that is, it may have no content at all, or it may contain just a sequence of characters with no other elements. Often, however, elements of one type will be embedded (contained entirely) within elements of a different type.

To illustrate this, we will consider a very simple structural model. Let us assume that we wish to identify within an anthology only poems, their headings, and the stanzas and lines of which they are composed. In XML terms, our document type is the anthology, and it consists of a series of poems. Each poem has embedded within it one element, a heading, and several occurrences of another, a stanza, each stanza having embedded within it a number of line elements. Fully marked up, a text conforming to this model might appear as follows:[8]

    <anthology>
     <poem>
      <heading>The SICK ROSE</heading>
      <stanza>
       <line>O Rose thou art sick.</line>
       <line>The invisible worm,</line>
       <line>That flies in the night</line>
       <line>In the howling storm:</line>
      </stanza>
      <stanza>
       <line>Has found out thy bed</line>
       <line>Of crimson joy:</line>
       <line>And his dark secret love</line>
       <line>Does thy life destroy.</line>
      </stanza>
     </poem>
     <!-- more poems go here -->
    </anthology>

It should be stressed that this example does not use the names proposed for corresponding elements elsewhere in these Guidelines: the above is thus not a valid TEI document.[9] It will, however, serve as an introduction to the basic notions of XML. Whitespace and line breaks have been added to the example for the sake of visual clarity only; they have no particular significance in the XML encoding itself. Also, the line <!-- more poems go here --> is an XML comment and is not treated as part of the text.
[8] The example is taken from William Blake's Songs of innocence and experience (1794).

[9] The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each, so that this example may be regarded as TEI conformable in the sense that this term is defined in 23.3. Conformance.

As it stands, the above example is what is known as a well-formed XML document because it obeys the following simple rules:

1. there is a single element enclosing the whole document: this is known as the root element (<anthology> in our case);
2. each element is completely contained by the root element, or by an element that is so contained; elements do not partially overlap one another;
3. a tag explicitly marks the start and end of each element.

A well-formed XML document can be processed in a number of useful ways. A simple indexing program could extract only the relevant text elements in order to make a list of headings, first lines, or words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could relate the use of punctuation marks to stanzaic and metrical divisions.[10] Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it, with no need for the sort of transformations and translations needed for files which have been saved in one or other of the proprietary formats preferred by most word-processing programs.
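The well-formedness rules, and the kind of `simple indexing program' just mentioned, can be sketched in a few lines of Python. This is only an illustration, assuming the standard-library xml.etree.ElementTree parser; the names ANTHOLOGY, OVERLAPPING, and is_well_formed are our own and form no part of XML or of these Guidelines.

```python
import xml.etree.ElementTree as ET

# The example anthology (one poem only), as marked up in this chapter.
ANTHOLOGY = """<anthology>
  <poem>
    <heading>The SICK ROSE</heading>
    <stanza>
      <line>O Rose thou art sick.</line>
      <line>The invisible worm,</line>
      <line>That flies in the night</line>
      <line>In the howling storm:</line>
    </stanza>
    <stanza>
      <line>Has found out thy bed</line>
      <line>Of crimson joy:</line>
      <line>And his dark secret love</line>
      <line>Does thy life destroy.</line>
    </stanza>
  </poem>
</anthology>"""

# Hypothetical counter-example: <stanza> and <line> partially overlap (rule 2).
OVERLAPPING = "<poem><stanza><line>O Rose</stanza></line></poem>"


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A very small "indexing program": list headings and first lines.
root = ET.fromstring(ANTHOLOGY)
headings = [h.text for h in root.iter("heading")]
first_lines = [poem.find("stanza/line").text for poem in root.iter("poem")]

print(is_well_formed(ANTHOLOGY), is_well_formed(OVERLAPPING))
print(headings, first_lines)
```

Note that the parser checks well-formedness only: it knows nothing about which element names are sensible, which is the job of the schema discussed later in this section.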
As we noted above, one of the attractions of XML is that it enables us to make up our own names for the elements rather than requiring us always to use names predefined by other agencies. Clearly, however, if we wish to exchange our poems with others, or to include poems others have marked up in our anthology, we will need to know a bit more about the names used for the tags. The means that XML provides for this is called a namespace. In our simple example, the tags just contain a simple name. As we shall see, it is also possible to use tags that include a qualified name, that is, a name with an optional prefix identifying the set of names to which it belongs. For example, we have defined an element <line> for the purpose of marking lines of verse. Another person might, however, define an element called <line> for the purpose of marking typographic lines, or drawn lines. Because of these different meanings, if we wish to share data it will be necessary to distinguish the two `line' components in our marked-up texts. This is achieved by including a namespace prefix within the markup, for example like this:

    <my:line>This is one of my lines</my:line>
    <your:line>This is one of your lines</your:line>

This feature is particularly important if we have different definitions of what a `line' is, of course, but there are many occasions when it is useful to distinguish groups of tags belonging to different `markup vocabularies'; we discuss this further below (v.6.3 Namespaces). One particularly useful namespace prefix is predefined for XML: it is xml and we will see examples of its use below.

Namespaces allow us to represent the fact that a name belongs to a group of names, but don't allow us to do much more by way of checking the integrity or accuracy of our tagging. Simple well-formedness alone is not enough for the full range of what might be useful in marking up a document.
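As a rough illustration of how a processor tells the two kinds of `line' apart, the following Python sketch parses a fragment using the my and your prefixes. The two namespace URIs and the enclosing <doc> element are invented for the purpose (a prefix must be bound to some URI before it can be used); the parser is the standard-library xml.etree.ElementTree, which reports each element name in the expanded form {uri}name.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: any two distinct URIs would serve.
MY_NS = "http://example.org/ns/my"
YOUR_NS = "http://example.org/ns/your"

DOC = f"""<doc xmlns:my="{MY_NS}" xmlns:your="{YOUR_NS}">
  <my:line>This is one of my lines</my:line>
  <your:line>This is one of your lines</your:line>
</doc>"""

root = ET.fromstring(DOC)

# The prefix is only a shorthand: the parser identifies each element
# by the namespace URI plus the local name.
my_lines = [e.text for e in root.findall(f"{{{MY_NS}}}line")]
your_lines = [e.text for e in root.findall(f"{{{YOUR_NS}}}line")]

print(my_lines)
print(your_lines)
```

Both elements have the local name `line', but a search for one never returns the other, because the namespace is part of the name.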
It might well be useful if, in the process of preparing our digital anthology, a computer system could check some basic rules about how stanzas, lines, and headings can sensibly co-occur in a document. It would be even more useful if the system could check that stanzas are always tagged <stanza> and not occasionally <canto> or <Stanza>. An XML document in which such rules have been checked is technically known as a valid document, and the ability to perform such validation is one of the key advantages of using XML. To carry this out, some way of formally stating the criteria for successful validation is necessary: in XML this formal statement is provided by an additional document known as a schema.[11]

[10] Note that this simple example has not addressed the problem of marking elements such as sentences explicitly; the implications of this are discussed in section v.4 Complicating the issue.

[11] The older terms Document Type Declaration and Document Type Definition, both abbreviated as DTD, may also be encountered. Throughout these Guidelines we use the term schema for any kind of formal document grammar.

v.3.3 Validating a document's structure

The design of a schema may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of following simple rules and the complexity of handling real texts. This is particularly the case when the rules being defined relate to texts that already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for entry into a textual database of some kind for example, the more precisely stated the rules, the better they can be enforced.
Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text -- if only as a means of testing the usefulness of that view or hypothesis. A schema designed for use by a small project or team is likely to take a different position on such issues than one intended for use by a large and possibly fragmented community.

It is important to remember that every schema results from an interpretation of a text. There is no single schema encompassing the absolute truth about any text, although it may be convenient to privilege some schemas above others for particular types of analysis.

XML is widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross-references should be properly resolved, and so forth. In such situations, documents are seen as raw material to match against predefined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of accurately tagging elements of less rigidly constrained texts. By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded.

v.3.4 An example schema

A schema can be expressed in a number of different ways; frequently-encountered methods include the Document Type Definition (DTD) language which XML inherited from SGML; the XML Schema language (http://www.w3.org/XML/Schema) defined by the W3C; and the RELAX NG language (http://relaxng.org/) originally developed within the OASIS Technical Committee and now an ISO standard.[12]
In this chapter, and throughout these Guidelines, we give examples using the `compact syntax' of RELAX NG, but the specifications within these Guidelines are expressed in a way that is largely independent of the specific language in which a schema generated from them is expressed.[13] Although we will use the RELAX NG compact syntax for illustration in what follows, the reader should bear in mind that analogous concepts are expressed differently in other schema languages. The following schema might be used to validate our example poem:

    anthology_p = element anthology { poem_p+ }
    poem_p = element poem { heading_p?, stanza_p+ }
    stanza_p = element stanza { line_p+ }
    heading_p = element heading { text }
    line_p = element line { text }
    start = anthology_p

Note that this is not the only way in which a RELAX NG schema might be written;[14] we have adopted this idiom, however, because it matches that used throughout the rest of the Guidelines.

A RELAX NG schema expresses rules about the possible structure of a document in terms of patterns; that is, it defines a number of named patterns, each of which acts as a kind of template against which an input document can be matched. The meaning of a pattern is expressed in a schema by reference to other patterns, or to a small number of built-in fundamental concepts, as we shall see. In the example above, the word to the left of the equals sign is the pattern's name, and the material following it declares a meaning for the pattern. Patterns may also be of particular types; the ones that interest us here are called element patterns and attribute patterns. In this example we see definitions for five element patterns. Note that we have used similar names for the pattern and the element which the pattern describes: so, for example, the line

    anthology_p = element anthology { poem_p+ }

defines an element pattern called anthology_p, the value of which defines an element called anthology. These naming conventions are arbitrary; we could use the same name for the pattern as for the element, since the two are syntactically quite distinct. The name, or generic identifier, of the element follows the word `element', and the content model for the element is given within the curly braces following that. Each of these parts is discussed further below.

The last line of the schema above tells a RELAX NG validator which element (or elements) in a document can be used as the root element: in our case only <anthology>. This enables the validator to detect whether a particular document is well-formed but incomplete; it also simplifies the processing task by providing an `entry point'.

[12] ISO/IEC FDIS 19757-2 Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG.

[13] See further 22. Documentation Elements and 23.4. Implementation of an ODD System. In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an element, which is expressed using the RELAX NG schema language for reasons of processing convenience. RELAX NG uses its own XML vocabulary to define content models, which is adopted by the TEI for the same purpose.

[14] For a good tutorial introduction to RELAX NG, see van der Vlist (2004).

Generic identifier

Following the word `element' each pattern declaration gives the generic identifier (often abbreviated to GI) of the element being defined, for example poem, heading, etc. A GI may contain letters, digits, hyphens, underscore characters, or full stops, but must begin with a letter.[15] Uppercase and lowercase letters are quite distinct: an element with the GI <foo> is not the same as an element with the GI <Foo>; the root element of a TEI-conformant document is <TEI>, not <tei>.
Content model

The second part of each declaration, enclosed in curly braces, is called the content model of the element being defined, because it specifies what may legitimately be contained within it. In RELAX NG, the content model is defined in terms of other patterns, either by embedding them, or (as in our examples above) by naming or referring to them. The RELAX NG compact syntax also uses a small number of reserved words to identify other possible contents for an element, of which by far the most commonly encountered is text, as in this example: it means that the element being defined may contain any valid character data, but no elements. If an XML document is thought of as a structure like a family tree, with a single ancestor at the top (in our case, this would be <anthology>), then almost always, following the branches of the tree downwards (for example, from <anthology> to <poem> to <stanza> to <line> and <heading>) will lead eventually to text. In our example, <heading> and <line> are so defined, since their content models say text only and name no embedded elements.

Occurrence indicators

The declaration for <stanza> in the example above states that a stanza consists of one or more lines. It uses an occurrence indicator (the plus sign) to indicate how many times something matching the pattern line_p may be repeated. There are three occurrence indicators: the plus sign, the question mark, and the asterisk or star. The plus sign means that the pattern can match one or more times; the question mark means that it may match at most once but is not mandatory; the star means that the pattern concerned is not mandatory, but may match more than once. Thus, if the content model for <stanza> were {line_p*}, stanzas with no lines would be possible as well as those with more than one line. If it were {line_p?}, again empty stanzas would be countenanced, but no stanza could have more than a single line.
The declaration for <poem> in the example above thus states that a <poem> cannot have more than one heading, but may have none, and that it must have at least one <stanza> and may have several.

Connectors

The content model {heading_p?, stanza_p+} contains more than one component, and thus needs additionally to specify the order in which these patterns (<heading> and <stanza>) may appear. This ordering is determined by the connector (the comma) used between its components. The comma connector indicates that the patterns concerned must appear in the sequence given. Another commonly encountered connector is the vertical bar, representing alternation. If the comma in this example were replaced by a vertical bar, then a <poem> would consist of either a heading or just stanzas -- but not both!

[15] In XML, a single colon may also appear in a GI, where it has a special significance related to the use of namespaces, as further discussed in section v.6.3 Namespaces. The characters defined by Unicode as combining characters and as extenders are also permitted, as are logograms such as Chinese characters.

Groups

In our example so far, the components of each content model have been either single patterns or text. It is quite permissible, however, to define content models in which the components are lists of patterns, combined by connectors. Such lists may also be modified by occurrence indicators and themselves combined by connectors. To demonstrate these facilities, let us expand our example to include non-stanzaic types of verse. For the sake of demonstration, we will categorize poems as one of the following: stanzaic, couplets, or blank (or stichic). A blank-verse poem consists simply of lines (we ignore the possibility of verse paragraphs for the moment),[16] so no additional elements need be defined for it. A couplet is defined as a <firstLine> followed by a <secondLine>:
    couplet_p = element couplet { firstLine_p, secondLine_p }

The patterns firstLine_p and secondLine_p define elements <firstLine> and <secondLine> (which are distinguished to enable studies of rhyme scheme, for example[17]); these will have exactly the same content model as the existing <line> element. We will therefore add the following two lines to our example schema:

    firstLine_p = element firstLine { text }
    secondLine_p = element secondLine { text }

Next, we can change the declaration for the <poem> element to include all three possibilities:

    poem_p = element poem { heading_p?, (stanza_p+ | couplet_p+ | line_p+) }

That is, a poem consists of an optional heading, followed by one or several stanzas, or one or several couplets, or one or several lines. Note the difference between this declaration and the following:

    poem_p = element poem { heading_p?, (stanza_p | couplet_p | line_p)+ }

The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow a single poem to contain a mixture of stanzas, couplets, and lines.

A group of this kind can contain text as well as named elements: this combination, known as mixed content, allows for elements in which the sub-components appear with intervening stretches of character data. For example, if we wished to mark place names wherever they appear inside our verse lines, then, assuming we have also added a pattern for the <name> element, we could change the definition for <line> to

    line_p = element line { (text | name_p)* }

Some XML schema languages place no constraints on the way that mixed content models may be defined, but in the XML DTD language, when text appears with other elements in a content model: it must always appear as the first option in an alternation; it may appear once only, and in the outermost model group; and if the group containing it is repeated, the star operator must be used. Although these constraints do not apply to (for example) schemas expressed in the RELAX NG language, all TEI content models currently obey them.

[16] It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see further section v.4 Complicating the issue.

[17] This is however a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position without the need to give them distinct names.

Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. As a further example, consider the case of stanzaic verse in which a refrain or chorus appears. Like a stanza, a refrain consists of repetitions of the line element. A refrain can appear at the start of a poem only, or as an optional addition following each stanza. This could be expressed by a pattern such as the following:

    refrain_p = element refrain { line_p+ }
    poem_p = element poem { heading_p?, ( line_p+ | (refrain_p?, (stanza_p, refrain_p?)+ )) }

That is, a poem consists of an optional heading, followed by either a sequence of lines or an unnamed group, which starts with an optional refrain and is followed by one or more occurrences of another group, each member of which is composed of a stanza followed by an optional refrain. A sequence such as refrain - stanza - stanza - refrain follows this pattern, as does the sequence stanza - refrain - stanza - refrain. The sequence refrain - refrain - stanza - stanza does not, however, and neither does the sequence stanza - refrain - refrain - stanza. Among other conditions made explicit by this content model are the requirements that at least one stanza must appear in a poem, if it is not composed simply of lines, and that if there is both a heading and a stanza they must appear in that order.
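Because a content model of this kind is in effect a regular expression over element names, the four sequences just discussed can be checked mechanically. The following Python sketch is not a RELAX NG validator: it simply transcribes the poem_p content model by hand into a regular expression over the names of a poem's child elements, each name being followed by a single space. The names POEM_MODEL and matches_poem_model are our own.

```python
import re

# Hand transcription of:
#   poem_p = element poem { heading_p?,
#       ( line_p+ | (refrain_p?, (stanza_p, refrain_p?)+ )) }
# Comma becomes concatenation, the vertical bar stays a vertical bar,
# and ?, + keep their meaning. Each child name ends with one space.
POEM_MODEL = re.compile(
    r"(heading )?"
    r"((line )+|(refrain )?(stanza (refrain )?)+)"
)


def matches_poem_model(children):
    """Check a list of child element names against the content model."""
    encoded = "".join(name + " " for name in children)
    return POEM_MODEL.fullmatch(encoded) is not None


for seq in (
    ["refrain", "stanza", "stanza", "refrain"],
    ["stanza", "refrain", "stanza", "refrain"],
    ["refrain", "refrain", "stanza", "stanza"],
    ["stanza", "refrain", "refrain", "stanza"],
):
    print(" - ".join(seq), "->", matches_poem_model(seq))
```

The first two sequences are accepted and the last two rejected, in agreement with the discussion above; a poem consisting of a heading followed by a refrain alone is also rejected, since at least one stanza is required.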
Note that the apparent complexity of this model derives from the constraints expressed informally above. A simpler model, such as

    poem_p = element poem { heading_p?, (line_p | refrain_p | stanza_p)+ }

would not enforce any of them, and would therefore permit such anomalies as a poem consisting only of refrains, or an arbitrary mixture of lines and refrains.

v.4 Complicating the issue

In the simple cases described so far, we have assumed that one can identify the immediate constituents of every element in a textual structure. A poem consists of stanzas, and an anthology consists of poems. Stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure like a family tree, with a single ancestor at one end and many children (mostly the elements containing simple text) at the other. For example, we could represent an anthology containing two poems, the first of which contains two four-line stanzas and the second a single stanza, by a tree structure like the following figure:

[Figure: tree diagram of an anthology containing two poems, with their stanzas and lines]

This graphic representation of the structure of an XML document is close to the abstract model implicit in most XML processing systems. Most such systems now use a standardized way of accessing parts of an XML document called XPath.[18] XPath gives us a non-graphical way of referring to any part of an XML document: for example, we might refer to the last line of Blake's poem as /anthology/poem[1]/stanza[2]/line[4]. The square brackets here indicate a numerical selection: we are talking about the fourth line in the second stanza of the first poem in the anthology. If we left out all the square-bracketed selections, the corresponding XPath expression would refer to all lines contained by stanzas contained by poems contained by anthologies.
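These two expressions can be tried out with a short Python sketch. It assumes the standard-library xml.etree.ElementTree module, which implements only a limited subset of XPath and expects paths relative to a starting element: since we start from the root <anthology> element, the leading /anthology of the expressions above is written here as a full stop.

```python
import xml.etree.ElementTree as ET

ANTHOLOGY = """<anthology>
  <poem>
    <heading>The SICK ROSE</heading>
    <stanza>
      <line>O Rose thou art sick.</line>
      <line>The invisible worm,</line>
      <line>That flies in the night</line>
      <line>In the howling storm:</line>
    </stanza>
    <stanza>
      <line>Has found out thy bed</line>
      <line>Of crimson joy:</line>
      <line>And his dark secret love</line>
      <line>Does thy life destroy.</line>
    </stanza>
  </poem>
</anthology>"""

root = ET.fromstring(ANTHOLOGY)  # root is the <anthology> element

# /anthology/poem[1]/stanza[2]/line[4], relative to the root element:
last_line = root.findall("./poem[1]/stanza[2]/line[4]")

# The same path without the square-bracketed selections:
all_lines = root.findall("./poem/stanza/line")

print([e.text for e in last_line])
print(len(all_lines))
```

The first search returns a single element, the last line of the poem; the second returns all eight lines, in document order.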
An XPath expression can refer to any collection of elements: for example, the expression /anthology/poem refers to all poems in an anthology and the expression /anthology/poem/heading refers to all their headings. The solidus within an XPath expression behaves in much the same way as the solidus or backslash in a filename specification: it indicates that the item to the left directly contains the item to the right of it. In XPath it is also possible to indicate that any number of other items may intervene by repeating the solidus. For example, the XPath expression /anthology/poem//line[1] will refer to every line that is the first <line> within its parent, at any depth inside a poem: that is, the first line of each stanza, and also the first line of any poem whose lines are not grouped into stanzas.

[18] The official specification is at Clark and DeRose (eds.) (1999); many introductory tutorials are available in the XML references cited above and elsewhere on the Web: good beginners' tutorials include http://www.w3schools.com/xpath/default.asp and http://www.zvon.org/xxl/XPathTutorial/, the latter being available in several languages.

Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since in our simple example no word crosses a line boundary. Surprisingly perhaps, this grossly simplified view of what text is (memorably termed an ordered hierarchy of content objects (OHCO) view of text by Renear et al.[19]) turns out to be very effective for a large number of purposes. It is not, however, adequate for the full complexity of real textual structures, for which more complex mechanisms need to be employed. There are many other trees that might be drawn which do not fit within the anthology model which we have presented so far. We might, for example, be interested in syntactic structures or other linguistic constructs, which rarely respect the formal boundaries of verse. Or, to take a simpler example, we might want to represent the pagination of different editions of the same text.

[19] See Renear et al. (1996).

In the OHCO model of text, representation of cases where different elements overlap so that several different trees may be identified in the same document is generally problematic. All the elements marked up in a document, no matter what namespace they belong to, must fit within a single hierarchy. To represent overlapping structures, therefore, a single hierarchy must be chosen, and the points at which other hierarchies intersect with it marked. For example, we might choose the verse structure as our primary hierarchy, and then mark the pagination by means of empty elements inserted at the boundary points between one page and the next. Or we could represent alternative hierarchies by means of the pointing and linking mechanisms described in chapter 16. Linking, Segmentation, and Alignment of the Guidelines. These mechanisms all depend on the use of attributes, which may be used both to identify particular elements within a document and to point to, link, or align them into arbitrary structures.

v.5 Attributes

In the XML context, the word attribute, like some other words, has a specific technical sense. It is used to describe information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. For example, you might wish to add a status attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an identifier attribute so that you could refer to particular element occurrences from elsewhere within a document. Attributes are useful in precisely such circumstances.
Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an attribute named n), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as attribute-value pairs inside the start-tag for the element occurrence. An end-tag cannot contain an attribute-value specification, since it would be redundant. The order in which attribute-value pairs are supplied inside a tag has no significance; they must, however, be separated by at least one whitespace (blank, newline, or tab) character. The value part must always be given inside matching quotation marks, either single or double.[20] For example:

    <poem xml:id="P1" status="draft"> ... </poem>

Here attribute values are being specified for two attributes previously declared for the <poem> element: xml:id and status. For the instance of a <poem> in this example, represented here by an ellipsis, the xml:id attribute has the value P1 and the status attribute has the value draft. An XML processor can use the values of the attributes in any way it chooses; for example, a <poem> in which the status attribute has the value draft might be formatted differently from one in which the same attribute has the value revised; another processor might use the same attribute to determine whether or not poem elements are to be processed at all. The xml:id attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross-reference purposes, as discussed further below (v.5.2 Identifiers and indicators).

v.5.1 Declaring attributes

Attributes are declared in a schema in the same way as elements. As well as specifying an attribute's name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute.
[20] In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, using the predefined character entities &apos; or &quot;.

In the compact syntax of RELAX NG, an attribute is defined by means of an attribute pattern, like the following:

    att.status = attribute status { "draft" | "revised" | "published" }

This defines a new pattern, called att.status, whose value is an attribute pattern defining an attribute named status. Attribute names are subject to the same restrictions as other names in XML; they need not be unique across the whole schema, however, but only within the list of attributes for a given element. A pattern defining the possible values for this attribute is given within the curly braces, in just the same way as a content model is given for an element pattern. In this case, the attribute's value must be one of the strings presented explicitly above. The attribute pattern definition must be included or referenced within the definition for every element to which the attribute is attached. We therefore modify the definition for the poem_p pattern given above as follows:

    poem_p = element poem { att.status?, heading_p?, stanza_p+ }

In RELAX NG, an element pattern simply includes any attribute patterns applicable to it along with its other constituents, as shown above. Attribute patterns can also be grouped and alternated in the same way as element patterns, though this particular feature is not widely used in the TEI scheme, since it is not available to the same extent in all schema languages. Because a question mark follows the reference to the att.status pattern in our example, a document in which the status attribute is not specified will still be valid; without this occurrence indicator the status attribute would be required.
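The effect of this attribute pattern can be imitated by hand in a few lines of Python. The sketch below is not a RELAX NG validator and checks nothing but the status attribute of <poem> elements; the names ALLOWED_STATUS and status_problems, and the sample document, are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The values enumerated by the att.status pattern.
ALLOWED_STATUS = {"draft", "revised", "published"}


def status_problems(xml_text):
    """Return the status values that the att.status pattern would reject."""
    root = ET.fromstring(xml_text)
    problems = []
    for poem in root.iter("poem"):
        status = poem.get("status")  # None if the attribute is absent
        # The question mark in "att.status?" makes the attribute optional,
        # so only a value that is present but not listed is a problem.
        if status is not None and status not in ALLOWED_STATUS:
            problems.append(status)
    return problems


# Hypothetical sample: one listed value, one omission, one unlisted value.
SAMPLE = """<anthology>
  <poem status="draft"/>
  <poem/>
  <poem status="awful"/>
</anthology>"""

print(status_problems(SAMPLE))
```

Only the third poem is reported: the first carries one of the listed values and the second omits the attribute altogether, which the occurrence indicator permits.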
Instead of supplying a list of explicit values, an attribute pattern can specify that the attribute must have a value of a particular type, for example a text string, a numeric value, a normalized date, etc. This is accomplished by supplying a pattern that refers to a datatype. In the example above, because a list of acceptable values is predefined, a parser can check that no <poem> is defined for which the status attribute does not have one of draft, revised, or published as its value. By contrast, with a definition such as

    att.status = attribute status { text }

a parser would accept almost any unbroken string of characters (status="awful", status="awe-ful", or status="12345678") as valid for this attribute. Sometimes, of course, the set of possible values cannot be predefined. Where it can, as in this case, it is generally better to do so.

Schema languages vary widely in the extent to which they support validation of attribute values. Some languages predefine a small set of possibilities. Others allow the schema designer to use values from a predefined `library' of possible datatypes, or to add their own definitions, possibly of great complexity. A `datatype' might be something fairly general (any positive integer), something very specific or idiosyncratic (any four-character string ending with "T"), or somewhere between the two. In the RELAX NG schemas used by the TEI, general patterns have been defined for about half a dozen datatypes (using the W3C Schema Datatype Library, http://www.w3.org/TR/xmlschema-2/, and discussed further in 1.4.2. Datatype Macros). In addition to the two possibilities already mentioned -- plain text or an explicit list of possible strings -- other datatypes likely to be encountered include the following:

    boolean   values must be either true or false
    numeric   values must represent a numeric quantity of some kind
    date      values must represent a possible date and time in some calendar

Two further datatypes of particular usefulness in managing XML documents are commonly known as ID -- for identifier -- and URI -- for Uniform Resource Identifier, or pointer for short. These are discussed in the next section.

v.5.2 Identifiers and indicators

It is often necessary to refer to an occurrence of one textual element from within another, an obvious example being phrases such as `see note 6' or `as discussed in chapter 5'. When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked-up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). XML therefore predefines an attribute that may be used to provide any element occurrence with a special identifier, a kind of label, which may be used to refer to it from anywhere else: since it is defined in the XML namespace, the name of this attribute is xml:id and it is used throughout the TEI schema. Because it is intended to act as an identifier, its values must be unique within a given document. The cross-reference itself will be supplied by an element bearing an attribute of a specific kind, which must also be declared in the schema.

Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is easily done using the xml:id attribute. Note that not every poem need carry an xml:id attribute and the parser may safely ignore the lack of one in those that do not.
Only poems to which we intend to refer need use this attribute; for each such poem we should now include in its start-tag some unique identifier, for example:

    <poem xml:id="Rose"> ... </poem>
    ...

Next we need to define a new element for the cross-reference itself. This will not have any content -- it is only a pointer -- but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by the following definition:

    poemRef_p = element poemRef { attribute target { anyURI }, empty }

The <poemRef> element has no content, but a single attribute called target. The value of this attribute must be a pointer or web reference of type anyURI;[21] furthermore, because there is no indication of optionality on the attribute pattern, it must be supplied on each occurrence -- a <poemRef> with no referent is an impossibility. With these declarations in force, we can now encode a reference to the poem whose xml:id attribute specifies that its identifier is Rose as follows:

    Blake's poem on the sick rose <poemRef target="#Rose"/> ...

[21] The word `anyURI' is a predefined name, used in schema languages to mean that any Uniform Resource Identifier (URI) may be supplied here. The accepted syntax for URIs is an Internet Standard, defined in http://tools.ietf.org/html/rfc3986. anyURI is one of the datatypes defined by the W3C Schema datatype library.

A processor may take any number of actions when it encounters a link encoded in this way: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to, for example by displaying it in a new window. Note, however, that the purpose of the XML markup is simply to indicate that a cross-reference exists: it does not necessarily determine what the processor is to do with it.
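One of the simpler actions a processor might take, quoting the heading of the poem referred to, can be sketched in Python as follows. The sketch assumes the standard-library xml.etree.ElementTree parser, which reports the xml:id attribute under its full namespace name; the <note> element, the second poem, and the function name resolve are invented for the illustration, and only references within the same document (those beginning with #) are handled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: the <note> wrapper is not part of the example schema.
DOC = """<anthology>
  <poem xml:id="Rose"><heading>The SICK ROSE</heading></poem>
  <poem><heading>(a poem with no identifier)</heading></poem>
  <note>Blake's poem on the sick rose <poemRef target="#Rose"/> ...</note>
</anthology>"""


def resolve(root, ref):
    """Return the element a same-document pointer refers to, or None."""
    target = ref.get("target", "")
    if not target.startswith("#"):
        return None  # a pointer to some other resource: not handled here
    wanted = target[1:]
    for element in root.iter():
        if element.get(XML_ID) == wanted:
            return element
    return None


root = ET.fromstring(DOC)
ref = root.find(".//poemRef")
poem = resolve(root, ref)
print(poem.find("heading").text)
```

Nothing in the markup obliges a processor to behave in this way; the same <poemRef> could equally well be turned into a hyperlink or a page reference.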
The target of a URI can be located anywhere: it may not necessarily be part of the same document, nor even located on the same computer system. Equally, it can be a resource of any kind, not necessarily an XML document or document fragment. It is thus a very convenient way of including references to non-XML data such as image files within a document. If, for example, we wished to include an illustration containing a reproduction of Blake's original in our anthology, the most appropriate method would probably be to define a new element called (for the sake of argument) graphic with a url attribute of datatype URI:

graphic_p = element graphic {att.url, empty}
att.url = attribute url {anyURI}

With these additions to the schema, we can now represent the location of the illustration within our text like this: By providing a location from which a reproduction of the required image can be downloaded, this encoding makes it possible for appropriate software to display the image as well as record its existence.

Attributes form part of the structure of an XML document in the same way as elements, and can therefore be accessed using XPath. For example, to refer to all the poems in our anthology whose status attribute has the value draft, we might use an XPath such as /anthology/poem[@status='draft']. To find the headings of all such poems, we would use the XPath /anthology/poem[@status='draft']/heading.

v.6 Other components of an XML document

In addition to the elements and attributes so far discussed, an XML document can contain a few other formally distinct things. An XML document may contain references to predefined strings of data that a validator must resolve before attempting to validate the document's structure; these are called entity references. They may be useful as a means of providing `boilerplate' text or representing character data which cannot easily be keyboarded.
An XML document may also contain arbitrary signals or flags for use when the document is processed in a particular way by some class of processor (a common example in document production is the need to force a formatter to start a new page at some specific point in a document); such flags are called processing instructions. And, as noted earlier, an XML document may also contain instances of elements taken from some other namespace. We discuss each of these three cases in the rest of this section.

v.6.1 Character References

As mentioned above, all XML documents use the same internal character encoding. Since not all computer systems currently support this encoding directly, a special syntax is defined that can be used to represent individual characters from the Unicode character set in a portable way by providing their numeric value, in decimal or hexadecimal notation.

For example, the character é is represented within an XML document as the Unicode character with hexadecimal value 00E9. If such a document is being prepared on (or exported to) a system using a different character set in which this character is not available, it may instead be represented by the character reference &#xE9; (the x indicating that what follows is a hexadecimal value) or &#233; (its decimal equivalent). References of this type do not need to be predefined, since the underlying character encoding for XML is always the same. To aid legibility, however, it is also possible to use a mnemonic name (such as eacute) for such character references, provided that each such name is mapped to the required Unicode value by means of a construct known as an entity declaration.

A reference to a named character entity always takes the form of an ampersand, followed by the name, followed by a semicolon. For example an XML document containing the string `T&C' might be encoded as T&amp;C.
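As a small illustration of the notations just described, the fragment below writes the same character in three equivalent ways and shows the ampersand written as an entity reference. The element names sample and p, and the word chosen, are hypothetical and serve only to carry the text.

```xml
<sample>
  <!-- The character itself, U+00E9 -->
  <p>café</p>
  <!-- Hexadecimal character reference: the x marks the value as hexadecimal -->
  <p>caf&#xE9;</p>
  <!-- Decimal character reference: 233 is the decimal equivalent of E9 -->
  <p>caf&#233;</p>
  <!-- The string T&C, with the ampersand given as the predefined entity amp -->
  <p>T&amp;C</p>
</sample>
```

An XML parser reports identical character data for the first three p elements, so the choice between them depends only on what the editing environment can conveniently represent.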
There is a small set of such character entity references that do not have to be declared because they form part of the definition of XML. These include the names used for characters such as the ampersand (amp) and the open angle bracket or less-than sign (lt), which could not easily otherwise be included in an XML document without ambiguity. Other predeclared entity names are those for quotation marks (quot and apos for double and single respectively), and for completeness the closing angle bracket or greater-than sign (gt).

For all other named character entities, a set of entity declarations must be provided to an XML processor before the document referring to them can be validated. The declaration itself uses a non-XML syntax inherited from SGML; for example, to define an entity named eacute with the replacement value é, the declaration could have any of the following forms: or, using hexadecimal notation: or, using decimal notation:

Entities of this kind are useful also for string substitution purposes, where the same text needs to be repeated uniformly throughout a text. For example, if a declaration such as is included with a document, then references such as &TEI; may be used within it, each of which will be expanded in the same way and replaced by the string `Text Encoding Initiative' before the text is validated.

v.6.2 Processing instructions

Although one of the aims of using XML is to remove any information specific to the processing of a document from the document itself, it is occasionally very convenient to be able to include such information, if only so that it can be clearly distinguished from the structure of the document. As suggested above, one common example is the need, when processing an XML document for printed output, to include a suggestion that the formatting processor might use to determine where to begin a new page of output.
Page-breaking decisions are usually best made by the formatting engine alone, but there will always be occasions when it may be necessary to override these. An XML processing instruction inserted into the document is one very simple and effective way of doing this without interfering with other aspects of the markup.

Here is an example XML processing instruction: It begins with <? and ends with ?>. In between are two space-separated strings: by convention, the first is the name of some processor (tex in the above example) and the second is some data intended for the use of that processor (in this case, the instruction to start a new page). The only constraint placed by XML on the strings is that the first one must be a valid XML name; the other can be any arbitrary sequence of characters, not including the closing character-sequence ?>.

A construct which looks like a processing instruction (but is not) is the XML declaration which can be supplied at the beginning of an XML document, for example: The XML declaration specifies the version number of the XML Recommendation applicable to the document it introduces (in this case, version 1.0), and optionally also the character encoding used to represent the Unicode characters within it. By default an XML document uses the character encoding UTF-8 or UTF-16; in this case, the 16-bit characters of Unicode have been mapped to the 8-bit character set known as ISO 8859-1; any characters present in the document but not available in the target character set will therefore need to be represented as character references (v.6.1 Character References). The XML declaration is purely documentary, but if it is wrong many XML-aware processors will be unable to process the associated text.

v.6.3 Namespaces

A valid XML document necessarily specifies the schema in which its constituent elements are defined. However, a well-formed XML document is not required to specify its schema (indeed, it may not even have a schema).
It would still be useful to indicate that the element names used in it have some defined provenance. Furthermore, it might be desirable to include in a document elements that are defined (possibly differently) in different schemas. A cabinet-maker's schema might well define an element called with very different characteristics from those of a documentalist's.

The concept of namespace was introduced into the XML language as a means of addressing these and related problems. If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. Just as a document can contain words taken from different languages, so a well-formed XML document can include elements taken from different namespaces. A namespace resembles a schema in that we may say that a given set of elements `belongs to' a given namespace, or are `defined by' a given schema. However, a schema is a set of element definitions, whereas a namespace is really only a property of a collection of elements: the only tangible form it takes in an XML document is its distinctive prefix and the identifying name associated with it.

Suppose for example that we wish to extend our anthology to include a complex diagram. We might start by considering whether or not to extend our simple schema to include XML markup for such features as arcs, polygons, and other graphical elements. XML can be used to represent any kind of structure, not simply text, and there are clear advantages to having our text and our diagrams all expressed in the same way.

Fortunately we do not need to invent a schema for the representation of graphical components such as diagrams; it already exists in the shape of the Scalable Vector Graphics (SVG) language defined by the W3C.[22] SVG is a widely used and rich XML vocabulary for representing all kinds of two-dimensional graphics; it is also well supported by existing software.
Using an SVG-aware drawing package, we can easily draw our diagram and save it in XML format for inclusion within our anthology. When we do so, we need to indicate that this part of the document contains elements taken from the SVG namespace, if only to ensure that processing software does not confuse our element with the SVG , which means something quite different.

An XML document need not specify any namespace: it is then said to use the `null' namespace. Alternatively, the root element of a document may supply a default namespace, understood to apply to all elements which have no namespace prefix. This is the function of the xmlns attribute which provides a unique name for the default namespace, in the form of a URI: In exactly the same way, on the root element for each part of our document which uses the SVG language, we might introduce the SVG namespace name:

Although a namespace name usually uses the URI (Uniform Resource Identifier) syntax, it is not treated as an online address and an XML processor regards it just as a string, providing a longer name for the namespace.

The xmlns attribute can also be used to associate a short prefix name with the namespace it defines. This is very useful if we want to mingle elements from different namespaces within the same document, since the prefix can be attached to any element, overriding the implicit namespace for itself (but not its children):

There is no limit on the number of namespaces that a document can use. Provided that each is uniquely identified, an XML processor can identify those that are relevant, and validate them appropriately. To extend our example further, we might decide to add a linguistic analysis to each of the poems, using a set of elements such as , , etc., derived from some pre-existing XML vocabulary for linguistic analysis.

[22] The W3C Recommendation is defined at http://www.w3.org/Graphics/SVG/.
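The namespace mechanisms just described might be combined as in the following sketch. The SVG namespace name is the one published by the W3C; the namespace name given to the anthology (an example.org address), the drawing itself, and the placement of the diagram inside a poem are assumptions made purely for illustration.

```xml
<!-- xmlns declares the default namespace; xmlns:svg binds the prefix "svg" -->
<anthology xmlns="http://www.example.org/ns/anthology"
           xmlns:svg="http://www.w3.org/2000/svg">
  <poem>
    <!-- Unprefixed, so this line belongs to the anthology's own namespace -->
    <line>O Rose thou art sick</line>
    <!-- Prefixed elements belong to SVG; each child carries the prefix too,
         because a prefix affects only the element it is attached to -->
    <svg:svg width="100" height="20">
      <svg:line x1="0" y1="10" x2="100" y2="10" stroke="black"/>
    </svg:svg>
  </poem>
</anthology>
```

Here the two elements named line cannot be confused by a processor, since each is identified by the combination of its namespace name and its local name.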
O Rose thou art sick

Marked Sections

We mentioned above that the syntax of XML requires the encoder to take special action if characters with a syntactic meaning in XML (such as the left angle bracket or ampersand) are to be used in a document to stand for themselves, rather than to signal the start of a tag or an entity reference respectively. The predefined entities &amp;, &lt;, and &gt; provide one method of dealing with this problem, if the number of occurrences of such things is small. Other methods may be considered when the number is large, as in an XML document like the present Guidelines, which contains hundreds of examples of XML markup. One is to label the XML examples as belonging to a different namespace from that of the document itself, which is the approach taken in the present Guidelines. Another and simpler approach is provided by one of the features inherited by XML from its parent SGML: the `marked section'.

A marked section is a block of text within an XML document introduced by the characters <![CDATA[ and terminated by the characters ]]>. Between these rather strange brackets, markup recognition is turned off, and any tags or entity references encountered are therefore treated as if they were plain text. For example, when we come to write the users' manual for our anthology, we may find ourselves often producing text like the following:

Here is an example of the use of the line element: ....]]>

v.7 Putting it all together

In this chapter we have discussed most of the components of an XML document and its associated schema. We have described informally how an XML document is represented, and also introduced one way of representing the rules a RELAX NG validator might use to validate it. In a working system, the following issues will also need to be addressed:

* how does a processor determine the schema (or schemas) that should be used to validate a given XML document instance?
* if a document contains entity references that must be processed before the document can be validated, where are those entities defined?
* an XML document instance may be stored in a number of different operating system files; how should they be assembled together?
* how does a processor determine which stylesheets it should use when processing an XML document, or how to interpret any processing instructions it contains?
* how does a processor enforce more exact validation than simple datatypes permit (for example of element content)?

Different schema languages and different XML processing systems take very different positions on all of these topics, since none of them is explicitly addressed in the XML specification itself. Consequently, the best answer is likely to be specific to a particular software environment and schema language. Since this chapter is concerned with XML considered independently of its processing environment, we only address them in summary detail here.

v.7.1 Associating entity definitions with a document instance

In v.6.1 Character References we introduced the syntax used for the definition of named character entities such as eacute, which XML inherited from SGML. Different schema languages vary in the ways they make a collection of such definitions available to an XML processor, but fortunately there is one method that all current schema languages support. As well as, and following, the XML declaration (v.6.2 Processing instructions), an XML document instance may be prefixed with a special DOCTYPE statement. This declarative statement has been inherited by XML from SGML; in its full form it provides a large number of facilities, but we are here concerned only with the small subset of those facilities recognized by all schema languages.
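As a minimal sketch of that subset, the fragment below prefixes a document instance with a DOCTYPE statement whose internal subset declares two named entities. The entity names eacute and legalese are taken from the surrounding discussion; the replacement string given for legalese, the note element, and the sentence using the entities are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE anthology [
  <!-- A named character entity, defined by a hexadecimal character reference -->
  <!ENTITY eacute "&#xE9;">
  <!-- A string entity used for boilerplate text (replacement text is invented) -->
  <!ENTITY legalese "This anthology may be freely copied.">
]>
<anthology>
  <!-- Both references are expanded before the document is validated -->
  <note>&legalese; Compiled at the Caf&eacute; Royal.</note>
</anthology>
```

The name following DOCTYPE matches the root element, and the declarations between the square brackets are read by the parser before it reaches the document instance itself.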
Here is an example DOCTYPE statement which we might consider prefixing to the final version of our anthology:

]>

Any XML processor encountering this statement will use it to add the two named entities it defines to those already predefined for XML. Before the document instance itself is validated, any references to these entities will be expanded to the character string given. Thus, wherever in the document instance the string &legalese; appears, it will be replaced by the formulation above. This makes life a little easier for those keyboarding our anthology.[23] The word anthology following the string DOCTYPE in this example is, of course, the name of the root element of the document to which this declaration is prefixed; however, only an XML DTD processor will take note of this fact.

[23] And, indeed, for those responsible for deciding the licensing conditions if they change their minds later.

v.7.2 Associating a document instance with its schema

Different schema languages adopt entirely different attitudes to this question. A document instance may be valid according to many different schemas, each appropriate to a different processing task. In RELAX NG therefore no facility for associating a particular schema with a particular instance exists: the task is regarded as a specific case of the more general issues addressed by the general architectural framework within which RELAX NG is defined: the ISO draft standard for Document Schema Definition Languages (DSDL).[24] In W3C Schema and in the DTD schema language inherited by XML from SGML, however, a document instance can point directly to the resource or resources that may be used to validate it. In W3C Schema Language, this is usually done by means of an attribute on the root element of the document instance; for XML DTDs the DOCTYPE statement introduced in v.7.1 Associating entity definitions with a document instance is used for this purpose.
Fortunately, any modern XML processing software tool will provide clear ways of carrying out this task appropriate to the particular language chosen. In the interests of maximizing portability of document instances, they should contain as little processing-specific information as possible.

[24] DSDL is a project of ISO/IEC JTC 1/SC 34 WG 1, the object of which is to `bring together different validation-related tasks and expressions to form a single extensible framework that allows technologies to work in series or in parallel to produce a single or a set of validation results. The extensibility of DSDL accommodates validation technologies not yet designed or specified.' (http://dsdl.org).

v.7.3 Assembling multiple resources into a single document

As we have already indicated, a single XML document may be made up of several different operating system files that need to be pulled together by a processor before the whole document can be validated. The XML DTD language defines a special kind of entity (a system entity) that can be used to embed references to whole files into a document for this purpose, in much the same way as the character or string entities discussed in v.6.1 Character References. Neither RELAX NG nor W3C Schema directly supports this mechanism, however, and we do not discuss it further here.

An alternative way of achieving the same effect is to use a special kind of pointer element to refer to the resources that need to be assembled, in exactly the same way as we proposed for the illustration in our anthology. The W3C Recommendation XML Inclusions (XInclude)[25] defines a generic mechanism for this purpose, which is supported by an increasing number of XML processors.

v.7.4 Stylesheet association and processing

As mentioned above, the processing of an XML document will usually involve the use of one or more stylesheets, often but not exclusively to provide specific details of how the document should be displayed or rendered.
In general, there is no reason to associate a document instance with any specific stylesheet and the schema languages we have discussed so far do not therefore make any special provision for such association. The association is made when the stylesheet processor is invoked, and is thus entirely application-specific. However, since one very common application for XML documents is to serve them as browsable documents over the Web, the W3C has defined a procedure and a syntax for associating a document instance with its stylesheet (see http://www.w3.org/TR/xml-stylesheet/). This Recommendation allows a document to supply a link to a default stylesheet and also to categorize the stylesheet according to its MIME type, for example to indicate whether the stylesheet is written in CSS or XSLT, using a specialized form of processing instruction.

Assuming therefore that we have made a CSS-conformant stylesheet for our anthology and stored it in a file called anthology.css which is available from the same location as the anthology itself, we could make it available over the Web simply by adding a processing instruction like the following to the anthology: Multiple stylesheets can be defined for the same document, and options are available to specify how a web browser should select amongst them. For example, if the document also contained a directive: a different stylesheet called anthology_m.css could be used when rendering the document on a handheld device such as a mobile phone. Most modern web browsers support CSS (although the extent of their implementation varies), and some of them support XSLT.

[25] http://www.w3.org/TR/xinclude/.

Content validation

As we noted above, most schema languages provide some degree of datatype validation for attribute values (v.5.1 Declaring attributes). They vary greatly in the validation facilities they offer for the content of elements, other than the syntactic constraints already discussed.
Thus, while we may very easily check that our elements contain only elements, we cannot easily check that elements contain between five and 500 correctly-spelled English words, should we wish to constrain our poetry in such a way. Also, because attributes and elements are treated differently, it is difficult or impossible to express co-occurrence constraints: for example, if the status of a poem is draft we might wish to permit elements such as within its content, but not otherwise.

The XML DTD language offers very little beyond syntactic checking of element content. By contrast, a major impetus behind the design and development of the W3C schema language was the addition of a much more general and powerful constraint language to the existing structural constraints of XML DTDs. In RELAX NG the opposite approach was taken, in that all datatype validation, whether of attributes or element content, is regarded as external to the schema language. For attributes, as we have seen, RELAX NG makes use of the W3C Schema Datatype Library (but permits use of others). Because RELAX NG treats both elements and attributes as special cases of patterns, the same datatype validation facilities are available for element content as for attribute values; it is unlike other schema languages in this respect.

In addition, for content validation, a different component of DSDL known as Schematron can be used. Schematron is a pattern matching (rather than a grammar-based) language, which allows us to test the components of a document against templates that express constraints such as those mentioned above. Like other XML processors, Schematron uses XPath to identify parts of an XML document; in addition, it provides elements that describe assertions to be tested and conditions which must be validated, as well as elements to report the results of the test.
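A co-occurrence constraint of the kind mentioned above might be sketched in Schematron as follows. This is an illustration only: the element name query is hypothetical, standing in for whatever element is to be permitted in draft poems, and the rule assumes that poem elements are in no namespace and carry the status attribute used in the earlier XPath examples.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- The context is an XPath pattern: every poem not marked as a draft -->
    <rule context="poem[not(@status = 'draft')]">
      <!-- The assertion fails, and its text is reported, when the test is false -->
      <assert test="not(.//query)">
        A query element may appear only in a poem whose status is draft.
      </assert>
    </rule>
  </pattern>
</schema>
```

Such a schema would normally be applied in addition to a grammar-based schema, which continues to check the overall structure of the document.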
vi Languages and Character Sets

The documents which users of these Guidelines may wish to encode encompass all kinds of material, potentially expressed in the full range of written and spoken human languages, including the extinct, the non-existent, and the conjectural. Because of this wide scope, special attention has been paid to two particular aspects of the representation of linguistic information often taken for granted: language identification, and character encoding.

Even within a single document, material in many different languages may be encountered. Human culture, and the texts which embody it, is intrinsically multilingual, and shows no sign of ceasing to be so. Traditional philologists and modern computational linguists alike work in a polyglot world, in which code-switching (in the linguistic sense) and accurate representation of differing language systems constitute the norm, not the exception. The current increased interest in studies of linguistic diversity, most notably in the recording and documentation of endangered languages, is one aspect of this long-standing tradition. Because of their historical importance, the needs of endangered and even extinct languages must be taken into account when formulating Guidelines and recommendations such as these.

Beyond the sheer number and diversity of human languages, it should be remembered that in their written forms they may deploy a huge variety of scripts or writing systems. These scripts are in turn composed of smaller units, which for simplicity we term here characters. A primary goal when encoding a text should be to capture enough information for subsequent users of it to identify language, script, and constituent characters correctly. In this chapter we address this requirement, and propose recommended mechanisms to indicate the languages, scripts and characters used in a document or a part thereof.

Identification of language is dealt with in vi.1 Language identification.
In summary, it recommends the use of pre-defined identifiers for a language where these are available, as they increasingly are, in part as a result of the twin pressures of an increasing demand for language-specific software and an increased interest in language documentation. Where such identifiers are not available or not standardized, these Guidelines recommend a way of documenting language identifiers and their significance, in the same way as other metadata is documented in the TEI Header.

Standardization of the means available to represent characters and scripts has moved on considerably since the publication of the first version of these Guidelines. At that time, it was essential to explicitly document the characters and encoded character sets used by almost any digital resource if it was to have any chance of being usable across different computer platforms or environments, but this is no longer the case. With the availability of the Unicode standard, almost 100,000 different characters representing almost all of the world's current writing systems are available and usable in any XML processing environment without formality.

Nevertheless, however large the number of standardized characters, there will always be a need to encode documents which use non-standard characters and glyphs, particularly but not exclusively in historical material. Furthermore, the full potential of Unicode is still not yet realised in all software which users of the Guidelines are likely to encounter. The second part of this chapter therefore discusses in some detail the concepts and practice underlying this standard, and also introduces the methods available for extending beyond it, which are more fully discussed in 5. Representation of Non-standard Characters and Glyphs.

vi.1 Language identification

Identification of the language a document or part thereof is written in is a crucial requirement for many envisioned usages of an electronic document.
The TEI therefore accommodates this need in the following way:

* A global attribute xml:lang is defined for all TEI elements. Its value identifies the language and writing system used.
* The TEI Header has a section set aside for the information about the languages used in a document: see further 2.4.2. Language Usage.

The value of the attribute xml:lang identifies the language using a coded value. For maximal compatibility with existing processes, modelling this value in the following way is recommended (this parallels the modelling of xml:lang):

* The identifier for the language should be constructed as in Best Current Practice 47.[1] This same identifier has to be used to identify the corresponding element in the TEI header, if one is present.

The first part of BCP 47 is called Tags for Identifying Languages,[2] and proposes the following mechanism for constructing an identifier (tag) for languages as administered by the Internet Assigned Numbers Authority (IANA). The tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script and a region for this language, each possibly followed by a variant subtag.

* The identifier consists of at least one `primary' subtag; it may be followed by one or more `extended' subtags.
* Languages are identified by a language subtag, which may be a two letter code taken from ISO 639-1 or a three letter code taken from ISO 639-2.
* ISO 639-2 reserves for private use codes in the range 'qaa' to 'qtz'. These codes should be used for nonregistered language subtags.
* A single letter primary subtag "x" indicates that the whole language tag is privately used.
* Extended language subtags must begin with the letter "s". They must follow the primary subtag and precede subtags that define other properties of the language. The order is significant.
* 4 character subtags are interpreted as script identifiers taken from ISO 15924.
* Region subtags can be either two letter country codes taken from ISO 3166 (with exceptions) or 3 digit codes from the UN Standard Country Codes for Statistical Use.
* Variant subtags may follow any of the above, but must precede private use extensions.
* Private use extensions are separated from the other subtags by the single letter subtag "x", which must be followed by at least one subtag. They might consist of several subtags separated with "-", but may not exceed a length of 32 characters.

[1] Currently BCP 47 comprises two Internet Engineering Task Force documents, referred to separately as RFC 4646 and RFC 4647; over time, other IETF documents may succeed these as the best current practice.
[2] Phillips, Addison and Davis, Mark, Tags for Identifying Languages, 2006-09: http://tools.ietf.org/html/bcp47

Examples of language tags

* Simple language subtag
  - de (German)
  - ja (Japanese)
  - zh (Chinese)
* Language subtag plus Script subtag
  - zh-Hant (Traditional Chinese)
  - en-Latn (English written in Latin script)
  - sr-Cyrl (Serbian written with Cyrillic script)
* Language-Script-Region
  - zh-Hans-CN (Simplified Chinese for the PRC)
  - sr-Latn-891 (Serbian, Latin script, Serbia and Montenegro)
* Language-Region
  - zh-SG (Chinese for Singapore)
  - de-DE (German for Germany)
* Other
  - zh-CN (Chinese in China, no script given)
  - zh-Latn (Chinese transcribed in the Latin script)
* Extended:
  - de-CH-x-phonebook (phonebook collation for Swiss German)
  - zh-s-nan (the Southern Min language of the macrolanguage Chinese)
  - zh-s-nan-Hans-CN (the Southern Min language of the macrolanguage Chinese as spoken in China written in simplified characters)
  - zh-Latn-x-pinyin (Chinese transcribed in the Latin script using the Pinyin system)

It should be noted that capitalization given here follows established convention (e.g.
capital letters for country codes, small letters for language codes), but BCP 47 does not ascribe any meaning to differences in capitalization.

As can be seen, both BCP 47 and ISO 639-2 provide extensions that can be employed by private convention. The constructs mentioned above can thus be used to generate identifiers for any language, past and present, used in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the element of the TEI header, which might also provide a prose description of the language described by the language tag.

While language, region and script can be adequately identified using this mechanism, there is only very rough provision to express a dimension of time for the language of a document; those codes provided (e.g. grc for `Greek, Ancient (to 1453)' in ISO 639-2) might not reflect the segments appropriate for a text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in BCP 47 and relate that to a element in the corresponding section of the TEI header.

Equivalences to language identifiers by other authorities can be given in the section as well, but no formal mechanism for doing so has been defined.

The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the xml:lang attribute, including all elements and all attributes where a language might apply.[3]

[3] This will exclude all attributes where a non-textual datatype has been specified, for example tokens, boolean values or predefined value lists.

vi.2 Characters and Character Sets

All document encoding has to do with representing one thing by another in an agreed and systematic way.
Applied to the smallest distinctive units in any given writing system, which for the moment we may loosely call `characters', such representation raises surprisingly complex and troublesome issues. The reasons are partly historical and partly to do with conceptual unclarities about what is involved in identifying, encoding, processing and rendering the characters of a natural language.

vi.2.1 Historical considerations

When the first methods of representing text for storage or transmission by machines were devised, long before the development of computers, the overriding aim was to identify the smallest set of symbols needed to convey the essential semantic content, and to encode that symbol set in the most economical way that the storage or transmission media allowed. The initial outcome was a set of systems that encoded only such content as could be expressed in uppercase letters in the Latin script, plus a few punctuation marks and some `control characters' needed to regulate the storage and transmission devices. Such encodings, originally developed for telegraphy, strongly influenced the way the pioneers of computing conceived of and implemented the handling of text, with consequences that are with us still.

For many years after the invention of computers, the way they represented text continued to be constrained by the imperative to use expensive resources with maximal efficiency. Even when storage and processing costs began their dramatic fall, the Anglo-centric outlook of most hardware designers and software engineers hampered initiatives to devise a more generous and flexible model for text representation. The wish to retain compatibility with `legacy' data was an additional disincentive.

Eventually, tension in East Asia between commitment to technological progress and the inability of existing computers to cope with local writing systems led to decisive developments.
Japanese, Korean and Chinese standards bodies, who long before the advent of computers had been engaged in the specification of character sets, joined with computer manufacturers and software houses to devise ways of mapping those character sets to numeric encodings and processing the resulting text data. Unfortunately, in the early years there was little or no co-ordination among either the national standards bodies or the manufacturers concerned, so that although commercial necessity dictated that these various local standards were all compatible with the representation of US-American English, they were not straightforwardly compatible with one another. Even within Japan itself there emerged a number of mutually incompatible systems, thanks to a mixture of commercial rivalry, disagreements about how best to manage certain intractable problems, and the fact that such pioneering work inevitably involved some false starts, leading to incompatibilities even between successive products of the same bodies.

Roughly at the same time, and for similar reasons, multiple and incompatible ways of representing languages that use Cyrillic scripts were devised, along with methods of encoding ancient writing systems which inevitably could not aim for compatibility with other writing systems apart from basic Latin script.

Many of the earliest projects that fed into the TEI were shaped in this developmental phase of the computerised representation of texts, and it was also the context in which SGML was devised and finalized. SGML had of necessity to offer ways of coping with multiple writing systems in multiple representations; or rather, it provided a framework within which SGML-compliant applications capable of handling such multiple representations might be developed by those with sufficient financial and personnel resources (such as are seldom found in academia).
Earlier editions of these Guidelines offered advice on character set and writing system issues addressed to the condition of those for whom SGML was the only feasible option. That advice must now be substantially altered because of two closely-related developments: the availability of the ISO/Unicode character set as an international standard, and the emergence of XML and related technologies which are committed to the theory and practice of character representation which Unicode embodies.

vi.2.2 Terminology and key concepts

Before the significance of Unicode and the implications of the association between XML and Unicode can be adequately explained, it is necessary to clarify some key concepts and attempt to establish an adequately precise terminology for them.

Figure vi.1: Examples of the small Latin a rendered with different fonts.

The word `character' will not of itself take us very far towards greater terminological precision. It tends to be used to refer indiscriminately both to the visible symbol on a page and to the letter or ideograph which that symbol represents, two things that it is essential to keep conceptually distinct. The visible symbol obviously has some aspects by which we interpret it as representing one character rather than another; but its appearance may also be significantly determined by features that have no effect on our notion of which character in a writing system it represents. A familiar instance is the lowercase a, which in printed texts may be represented either by a `single storey' symbol (cf. figure 1 in the examples from Baskerville SemiBold or Century) or by a `two storey' version (as in figure 1 in the examples from Arial Regular or Andale Mono Regular). We say that the single and double-storey symbols both represent one and the same abstract character a using two different glyphs.
Similarly, an uppercase A in a serif typeface has additional strokes that are absent from the same letter when printed using a sans-serif typeface, so that once again we have differing glyphs standing for the same abstract character. In figure 1 there is even a font, Capitals Regular, in which the glyph for the lowercase letter a looks like a typical glyph for the character uppercase A.

The distinction between abstract characters and glyphs is fundamental to all machine processing of documents. In most scholarly encoding projects, the accurate recording of the abstract characters which make up the text is of prime importance, because it is the essential prerequisite of digitizing and processing the document without semantic loss. In many cases (though there are important exceptions, to be touched on shortly) it may not be necessary to encode the specific glyphs used to render those abstract characters in the original document. An encoding that faithfully registers the abstract characters of a document allows us to search and analyse our document's content, language and structure and access its full semantics. That same encoding, however, may not contain sufficient information to allow an exact visual representation of the glyphs in the source text or manuscript to be recreated.

The importance of this distinction between information content and its visual representation is not always immediately apparent to people unused to the specific complexities of text handling by machine. Such users tend to ask first what (in order of conceptual priority) should actually be their very last question: how do I get a physical image that looks like character x in my source document to appear on the screen or the output page?
Their first question should in fact be: how can I get an abstract representation of character x into my encoded document in a way that will be universally and unambiguously identifiable, no matter what it happens to look like in printout or on any particular display? And occasionally the response they receive as a result of their misguided initial question is a custom `solution' that satisfies their immediate rendering wishes at the price of making their underlying document unintelligible to other users (or even to the original user in other times and places) because it encodes the abstract character in an idiosyncratic way.

That said, there will certainly be documents or projects where it is a matter of scholarly significance that the compositor or scribe chose to represent a given abstract character using one particular glyph or set of strokes rather than a semantically-equivalent but visually distinct alternative, and in that case the specific appearance of the form will have to be encoded in one way or another. But that encoding need not (and in most cases will not) involve a notation that visually resembles the original, any more than italicised text in an original document will be represented by the use of italic characters in the encoded version.

A collection of the abstract characters needed to represent documents in a given writing system is known as a character set, and the character set or character repertoire of a processing or rendering device is the set of abstract characters that it is equipped to recognise and handle appropriately. There is, however, a subtle distinction between these two parallel uses of the same term, involving one more key concept which it is essential to grasp. The character set of a document (or the writing system in which it is recorded) is purely a collection of abstract characters.
But the character set of a computing device is a set of abstract characters which have been mapped in a well-defined way to a set of numbers or code points by which the device represents those abstract characters internally. It can therefore be referred to as a coded character set, meaning a set of abstract characters each of which has been assigned a numerical code point (or in some instances a sequence of code points) which unambiguously identifies the character concerned.

It is now possible to use this terminology to say what Unicode is: it is a coded character set, devised and actively maintained by an international public body, where each abstract character is identified by a unique name and assigned a distinctive code point.[4] Unicode is distinguished from other, earlier and co-existing coded character sets by its (current and potential) size and scope; its built-in provision for (in practical terms) limitless expansion; the range and quality of linguistic and computational expertise on which it draws; the commitment in principle (and to an increasing degree in practice) to implement it by all important providers of hardware and software worldwide; and the stability, authority and accessibility it derives from its status as an international public standard.

vi.2.3 Abstract characters, glyphs and encoding scheme design

The distinction between abstract characters and glyphs can be crucial when devising an encoding scheme. Users performing text retrieval, searching or concordancing will expect the system to recognise and treat different glyphs as instances of the same character; but when perusing the text itself they may well expect to see glyph variants preserved and rendered. When encoding a pre-existing text, the encoder must determine whether a particular letter or symbol is a character or a glyphic variant.
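As a rough illustration of what a coded character set provides, the following Python sketch looks up the code point and the Unicode name of a few abstract characters. It is not part of these Guidelines: it relies only on Python's standard unicodedata module, and the helper name describe is a hypothetical one chosen for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    # ord() gives the code point; unicodedata.name() gives the unique
    # name that the Unicode standard assigns to the abstract character.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Whatever glyph a font uses to draw them, these characters are
# identified by the same code point and name everywhere.
for ch in ["a", "A", "\u00e9", "\u4e2d"]:
    print(describe(ch))
```

Nothing in this output says anything about appearance: the choice between a single-storey and a two-storey glyph for a, for instance, is left entirely to the font.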
A detailed model of the relationship between characters and glyphs has been developed within the Unicode Consortium and an ISO work group (ISO/IEC JTC1 SC2/WG2). Its report (Unicode Technical Report 17: Character Encoding Model) will form the base for much future standards work. The model makes explicit the distinction between two different properties of the components of written language:

* their content, i.e. their meaning and phonetic value (represented by a character)
* their graphical appearance (represented by a glyph)

When searching for information, a system generally operates on the content aspects of characters, with little or no attention to their appearance. A layout or formatting process, on the other hand, must of necessity be concerned with the exact appearance of characters. Of course, some operations (hyphenation for example) require attention to both kinds of feature, but in general the kind of text encoding described in these Guidelines tends to focus on content rather than appearance (see further 6.3 Highlighting and Quotation).

[4] Although only Unicode is mentioned here explicitly, it should be noted that the character repertoire and assigned code points of Unicode and the ISO standard 10646 are identical and maintained in a way that ensures this continues to be the case.

An encoder wishing to record information about which glyphs are present in a given document may do so at either or both of two levels:

* the level of character encoding, using an appropriate Unicode code point to represent the glyph concerned
* the markup level, with the glyph indicated via appropriate elements and/or attributes

The encoding practice adopted may be guided by, among other things, an assessment of the most frequent uses to which the encoded text will be put.
For example, if recognition of identical characters represented by a variety of glyphs is the main priority, it may be advisable to represent the glyph variations at markup level, so that the character value can be immediately exposed to the indexing and retrieval software. Plainly, an encoding project will need to consider such issues carefully and embody the outcome of its deliberations in local manuals of procedure to ensure encoding consistency.

Using Unicode code points to represent glyph information requires that such choices be documented in the TEI Header. Such documentation cannot of itself guarantee proper display of the desired glyph but at least makes the intention of the encoder discoverable.

At present the Unicode Standard does not offer detailed specifications for the encoding of glyph variations. These Guidelines do give some recommendations; some discussion of related matters is given in Chapter 18 Transcription of Primary Sources, and Chapter 25 Representation of non-standard Characters and Glyphs offers some features for the definition of variant glyphs.

vi.2.4 Entry of characters

Text characters may be entered into a document using any of three methods, in any convenient combination. First, where suitable input facilities make this possible, the characters concerned may be entered directly into the document, either by normal keystrokes or by the use of one of the Input Method Editors (IMEs) commonly used for the entry of ideographic characters. This is most likely to be convenient where the display used for text entry and/or the printer used to produce output for proofreading purposes is capable of rendering the characters concerned using correct and readily identifiable glyphs. Where such easily checkable rendering is not available, or where there is no suitable method of inputting certain characters directly, they may be input by one of two possible forms of indirect notation or `reference'.
The first form of reference is a Numeric Character Reference (NCR), which takes the general form &#D;, where D is an integer representing the code point of the character in base 10, or &#xH;, where H is the code point in hexadecimal notation. This has the advantage that no declaration of what this notation means is required anywhere in the document instance or its associated schema. Every XML processor is capable of recognising NCRs and replacing them with the required code point value without needing access to any additional data. The disadvantage of NCRs as a means of entering, representing and proofing character data is that most human beings find them anything but `readable' and it is all too easy for the wrong character to be entered in error and retained undetected.

The second form of reference is a Character Entity Reference (though, as explained below, this should not be taken to imply that such entities constitute a `type' that could be distinctively recognised by a processing system). Character entity references can (and indeed should) have names whose significance is apparent to humans, but each and every entity name has to be associated with its replacement (which as explained below should be a character value, possibly in the form of an NCR) via a formal declaration in the document's internal or external subset. For a large number of characters defined by Unicode and commonly used in documents, there are ISO entity sets declaring mnemonic names which should be used wherever feasible: XML compatible character entity declarations using ISO names and suitable for inclusion into the subset are available on the TEI web sites.
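The two forms of reference just described can be compared in a short Python sketch using the standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the sample documents are invented for the purpose, and the entity name eacute is borrowed from the ISO entity sets but declared by hand here in the internal subset.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal NCRs for the same code point, U+00E9.
# No declaration of any kind is needed for the parser to resolve them.
ncr_doc = "<p>&#233;|&#xE9;</p>"
ncr_text = ET.fromstring(ncr_doc).text

# A character entity reference only works if the name is declared;
# here the declaration sits in the document's internal subset and
# maps the name to an NCR.
cer_doc = '<!DOCTYPE p [<!ENTITY eacute "&#xE9;">]><p>caf&eacute;</p>'
cer_text = ET.fromstring(cer_doc).text

print(ncr_text)
print(cer_text)
```

In both cases the application sees only the resulting character; the mnemonic name helps the human reader of the source, not the processor.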
Where characters are not defined in Unicode and so have to be assigned both a local code point and a local entity name of the project's choosing (see Non Unicode characters in XML documents below) it is highly desirable to follow the same nomenclature principles as ISO and to emulate the practice in the ISO character entity declarations of appending a string giving the character a unique descriptive name as a comment to the actual entity declaration. In addition, where different groups or projects are working on texts with geographical, historical, linguistic or other similarities that give rise to common issues of character encoding, it is highly advisable in the interests of consistency that they should consult one another when devising entity names. The TEI mailing list may provide a suitable first point of contact for such consultations. Further advice on the matter of locally-defined characters is contained in Chapter 25 Representation of non-standard Characters and Glyphs.

vi.2.5 Output of characters

Rendering of the encoded text is a complicated process that depends largely on the purpose, external requirements, local equipment and so forth; it is thus outside the scope of these Guidelines. It may nevertheless be helpful to put some of the terminology used for the rendering process in the context of the discussion of this chapter.

As was mentioned above, Unicode encodes abstract characters, not specific glyphs. For any process that makes characters visible, however, concrete, specifically designed glyph shapes have to be used. For a printing process, for example, these shapes describe exactly at which point ink has to be put on the paper and which areas have to be left blank.
If we want to print a character from the Latin script, besides the selection of the overall glyph shape, this process also requires that a specific weight, a specific size and a specific degree of slant have been selected for the font. Beyond individual characters, the overall typesetting process also follows specific rules of how to calculate the distance between characters, how much whitespace occurs between words, at which points line breaks might occur and so forth.

If we concern ourselves only with the rendering process of the characters themselves, leaving out all these other parameters, we will realize that of all the information required for this process, only a small amount will be drawn from the encoded text itself. This information is the code point used to encode the character in the document. With this information, the font selected for printing will be queried to provide a glyph shape for this character. Some modern font formats (e.g. OpenType) do implement a sophisticated mapping from a code point to the glyph selected, which might take into account surrounding characters (to create ligatures where necessary) and the language or even area this character is printed for, to accommodate different typesetting traditions and differences in the usage of glyphs.

A TEI document might provide some of the information that is required for this process, for example by identifying the linguistic context with the xml:lang attribute. The selection of fonts and sizes is usually done in a stylesheet, while the actual layout of a page is determined by the typesetting system used. Similarly, if a document is rendered for publication on the Web, information of this kind can be shipped with the document in a stylesheet.[5]

vi.2.6 Unicode and XML

The devisers of the XML standard took the view that Unicode should be the only means of representing abstract characters which conformant XML processors were obliged to support.
That certainly does not preclude the use of other character encoding schemes or character sets in documents which are to be handled by XML processors, but it does mean that all the abstract characters which are encoded as characters (as distinct from being represented indirectly via markup) in an XML document must either possess an assigned code point within the public Unicode standard, or be assigned a code point devised by and specific to the local project, taken from a reserved range set aside by the standard expressly for this purpose, the so-called Private Use Areas or PUAs.

For the vast majority of projects to which these Guidelines are applicable, the Unicode standard will already offer code points for all the abstract characters their documents employ, and so the requirement that all such characters should be resolvable by XML processors to Unicode code points will not involve any definition or use of PUA code points. Indeed, such projects are not obliged by their choice of XML to use Unicode in their documents. Provided they correctly declare at the requisite points any non-Unicode coded character set they may use, ensure that all their XML processors support their declared encoding, and then consistently employ that encoding in strict conformity with their declarations, they need not consciously concern themselves with Unicode unless and until they feel it is appropriate to do so.

[5] The World Wide Web Consortium provides recommendations for two standard stylesheet languages: either CSS or XSL could be used for this purpose.

Non-Unicode character sets and XML processors

There are, however, strict limits to the way conformant XML processors handle documents whose character set is not Unicode, and unless these limits are understood it is likely that projects not yet ready to commit to Unicode across the board will run into unexpected and baffling problems as they attempt to operate with their legacy character encodings.
First, it must be repeated that nothing in the XML standard requires conformant processors to handle non-Unicode documents. But even if there were any actual processors which on that basis refused to process non-Unicode documents, that would not limit their usefulness as severely as might at first appear. The reason is that there is a way of internally representing Unicode code points (explained further in Encoding errors related to UTF-8 below) where there is no detectable difference between a document which is actually encoded in ASCII employing only 7-bit values and one which is encoded in Unicode but which happens to contain only the abstract characters encompassed by the 7-bit ASCII standard. And the XML standard specifies that this way of representing Unicode is the one which processors must assume as the default for any document that does not explicitly declare an encoding. At a stroke, this provision ensures that all pure 7-bit ASCII encoded documents can be processed without further ado by all conformant XML processors. Add to this the provision, also within the XML standard, that allows any Unicode code point to be indirectly specified using only 7-bit ASCII characters via a Numeric Character Reference (NCR), and the upshot is that all documents in non-Unicode encodings which can be pre-processed to rewrite any characters outside the 7-bit ASCII range as Unicode code points in NCR notation (a simple batch procedure for which software is readily available) can be handled even by processors which have no inbuilt support for any encoding other than Unicode.

In fact, every XML processor so far released has implemented methods, specified in the standard though not mandatory, which allow the processing of documents in at least some non-Unicode character sets. Such processors include in their documentation a statement of the non-Unicode encodings they support, and the use of such an encoding must be declared to the processor in the correct way.
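The pre-processing step mentioned above, rewriting everything outside the 7-bit ASCII range as NCRs, might be sketched in Python as follows. This is a minimal example rather than a recommended tool: the sample bytes are invented, and the legacy encoding is assumed to be ISO-8859-1. The only library facilities used are Python's built-in codecs and the xmlcharrefreplace error handler.

```python
import xml.etree.ElementTree as ET

# Pure ASCII text has exactly the same bytes in ASCII and in UTF-8,
# which is why an undeclared ASCII document is valid default input.
sample = "plain ASCII text"
same_bytes = sample.encode("ascii") == sample.encode("utf-8")

# An invented legacy document, assumed to be encoded in ISO-8859-1.
legacy_bytes = b"<p>caf\xe9 na\xefve</p>"

# Decode with the legacy encoding, then re-encode as ASCII, turning
# every character outside the ASCII range into a decimal NCR.
text = legacy_bytes.decode("iso-8859-1")
ncr_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# The result needs no encoding declaration at all.
parsed_text = ET.fromstring(ncr_bytes).text

print(same_bytes)
print(ncr_bytes)
print(parsed_text)
```

The rewritten document is larger and harder to proofread, but it can be handled by a processor that supports nothing beyond the mandatory Unicode encodings.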
To avoid confusion when taking advantage of such encoding support, it is first of all essential to grasp that an encoding declaration in an XML document is indeed simply a declaration: it is not an incantation that magically converts the document that follows into the encoding concerned. It is a common error to think that simply declaring a document's encoding to be, say, ISO-8859-1 (or for that matter UTF-8 or UTF-16, the representations of Unicode for which support is mandatory) is sufficient to `make it so'. Such a declaration is useless unless the document that follows actually is encoded strictly in conformance with the declaration. Some of the circumstances in which that may not in fact be the case are outlined in vi.2.9 Issues arising from the internal representations of Unicode below.

Secondly, an encoding declaration does not somehow switch an XML processor into a mode where it works entirely in the declared encoding for as long as the declaration is in scope. On the contrary, all it does is instruct the processor to pass its input through a filter that immediately converts all the code points in the declared encoding into their Unicode counterparts; from that point onwards the document as seen by all subsequent stages of processing is actually in Unicode, even though that may not be apparent to the user.

Thirdly, this invariable internal conversion has a crucial consequence: the fact that a processor can successfully accept a document in a non-Unicode encoding does not mean that it will necessarily convert any output it may produce back into the declared input encoding. Internally, the document has been converted to and processed in Unicode, and there is nothing in the XML standard that requires the reverse conversion to be performed at the output stage. Most processors go beyond the standard by offering a facility to output in various encodings: but whether it is available and how to use it must be ascertained from the processor's documentation.
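The three points above can be observed with one particular processor, Python's standard xml.etree.ElementTree module. The sketch below is illustrative only: the sample documents are invented, and other processors may behave differently at the output stage, as the text notes.

```python
import xml.etree.ElementTree as ET

decl = b'<?xml version="1.0" encoding="ISO-8859-1"?>'

# Bytes that really are ISO-8859-1: 0xE9 is e with acute accent.
honest = decl + b"<p>caf\xe9</p>"
# Bytes that are actually UTF-8 (0xC3 0xA9) under the same declaration.
mislabelled = decl + b"<p>caf\xc3\xa9</p>"

# The declaration is believed, not verified: the mislabelled bytes
# are read as two ISO-8859-1 characters instead of one.
good = ET.fromstring(honest).text
bad = ET.fromstring(mislabelled).text

# After parsing, the text is held as Unicode. The output encoding is
# a separate choice and is not inherited from the input declaration.
root = ET.fromstring(honest)
as_utf8 = ET.tostring(root, encoding="utf-8")
as_latin1 = ET.tostring(root, encoding="iso-8859-1")

print(repr(good), repr(bad))
print(as_utf8)
print(as_latin1)
```

The mislabelled document parses without any error, which is exactly why such mistakes tend to surface late, as corrupted characters in the output.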
Should it be unavailable or unreliable, the output may need to be post-processed through a character converter to restore the original encoding, and again such software is freely available and easy to use.

Non Unicode characters in XML documents

In the cases considered in the preceding section, there was a suitable Unicode code point corresponding to each abstract character contained in the non-Unicode character set of the input document. In such instances, the mandatory internal conversion to Unicode carried out by the processor can be more or less transparent to a user who wishes to continue to work with a non-Unicode character set. Things become rather different when the non-Unicode character set contains abstract characters for which there is no code point in the Unicode standard, or when a project that is attempting to work in Unicode throughout finds that it needs to represent abstract characters not currently provided for in the Unicode standard.

Here, a significant difference between SGML and XML emerges in a rather troublesome way. Following their agenda to devise a subset of SGML that would be significantly easier to implement, the authors of the XML specification decided that one particular type of entity available in SGML, known as an internal SDATA entity, should not be carried over into XML. It would be idle to question that decision here, but its consequences for the handling of abstract characters for which there is no Unicode definition were significant. The procedures recommended in earlier versions of these Guidelines for encoding, processing and exchanging what we might call locally defined abstract characters were reliant on the availability of entities declared as of type SDATA, but that type is not supported in XML, and there is therefore no ready equivalent for XML-based projects to the recommendations previously offered.[6]

Entities in XML are really only of two basic types, parsed and unparsed.
Unparsed entities are of no relevance here. References to parsed entities in an XML document result in only one kind of behaviour: when they appear in the parser's input stream, the parser expects to be able to resolve them by locating a declaration in the document's internal or external subset which maps the entity name to its replacement text. The parser then inserts that replacement text into the document in place of the entity reference, which is discarded without trace. The act of replacement is not notified to the application, except where it fails because the entity is undeclared or the declaration is in some way defective (in which case the parser signals a fatal error and stops).

Though for explanatory convenience much XML-related documentation, including these Guidelines, refers specifically to Character Entities and Character Entity References, a character entity in XML is not a distinct `type' in the sense that `type' is understood in Computer Science terminology, for example when referring to the type of an attribute. Hence there is no way in which editing or other software can check that the replacement to be inserted is indeed a single character or its equivalent rather than an arbitrary chunk of text, possibly including markup. A character entity is simply a general entity whose replacement text happens to be declared as a character value or an NCR representing that value.

This has two important consequences if it is proposed to use such an entity reference to stand for a character that has no Unicode equivalent. First, the entity name reference will disappear at an early stage in the parse and be replaced by the declared value of the entity, so that no processing which requires access in the parsed document to the entity reference as originally entered is possible.
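The behaviour of parsed entities described above can be observed in a short Python sketch using the standard xml.etree.ElementTree module. The entity names eacute and stlig, the element name hi and the sample documents are all assumptions made for this illustration; they are not recommendations.

```python
import xml.etree.ElementTree as ET

# Two hand-written declarations: one whose replacement is a single
# character, and one whose replacement contains markup.
dtd = ('<!DOCTYPE p ['
       '<!ENTITY eacute "&#xE9;">'
       '<!ENTITY stlig "<hi>st</hi>">'
       ']>')

# The reference is replaced without trace: the parsed tree holds the
# character, and nothing records that 'eacute' was ever written.
plain = ET.fromstring(dtd + "<p>&eacute;</p>")
serialized = ET.tostring(plain, encoding="unicode")

# Nothing restricts the replacement to a single character: here the
# "character entity" silently introduces a child element.
with_markup = ET.fromstring(dtd + "<p>fa&stlig;</p>")
child = with_markup.find("hi")

# An undeclared entity is a fatal error for the parser.
try:
    ET.fromstring("<p>&undeclared;</p>")
    failed = False
except ET.ParseError:
    failed = True

print(serialized, child.text, failed)
```

Because the tree built by the parser no longer contains the entity names, any processing that depends on them has to happen before parsing or be expressed in markup instead.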
Secondly, if a character entity is to be used as a true equivalent to a normal character, and consequently be employed at all points in a document where a single character could legitimately occur (apart from in element and attribute names, where no references of any kind are allowed) then it is essential that its replacement value indeed be pure character data. If the replacement value of the entity were to contain any markup, or a processing instruction, there would be many places in a document where simple character data would be legitimate, but where the substitution of markup or some other replacement could cause the document to become invalid or malformed. Taken together, these considerations mean that the transparent use of a CER to stand for a non-Unicode character in an XML document is simply not possible.

[6] In essence, when an SGML parser encounters a reference to an entity of type SDATA, it supplies to the application which it is servicing the name of that entity, as found in the document, plus a pointer to a location somewhere on the local system, and what is present at that location may in turn allow or instruct the application to do one of a number of things, including looking up the entity name in a table and deriving information about the referenced entity which can trigger specific behaviours in the application appropriate to the processing of that abstract character. There is however no way to make an XML parser do anything of the kind in response to an entity reference.

vi.2.7 Special aspects of Unicode character definitions

Compatibility characters

The principles of Unicode are judiciously tempered with pragmatism.
This means, among other things, that the actual repertoire of characters which the standard encodes, especially those parts dating from its earlier days, includes a number of items which on a strict interpretation of the Unicode Consortium's theoretical approach should not have been regarded as abstract characters in their own right. Some of these characters are grouped together into code-point regions assigned to compatibility characters.

Ligatures are a case in point. Ligatures (e.g. the joining of adjacent lowercase letters `s' and `t' or `f' and `i' in Latin scripts, whether produced by a scribal practice of not lifting the pen between strokes or dictated by the aesthetics of a type design) are representational features with no added semantic value beyond that of the two letters they unite (though for historians of typography their presence and form in a given edition may be of scholarly significance). However, by the time the Unicode standard was first being debated, it had become common practice to include single glyphs representing the more common ligatures in the repertoires of some typesetting devices and high-end printers, and for the coded character sets built into those devices to use a single code point for such glyphs, even though they represent two distinct abstract characters. So as to increase the acceptance of Unicode among the makers and users of such devices, it was agreed that some such pseudo-characters should be incorporated into the standard. Nevertheless, if a project requires the presence of such ligatured forms to be encoded, this should normally be done via markup, not by the use of a compatibility character.
That way, the presence of the ligature can still be identified (and if desired, rendered visually) where appropriate, but indexing and retrieval software will treat the code points in the document as a simple sequential occurrence of the two constituent characters concerned and so correctly align their semantics with non-ligatured equivalents. Such ligatures should not be confused with digraphs (usually) indicating diphthongs, as in the French word "cœur". Digraphs are atomic orthographic units representing abstract characters in their own right, not purely glyphic amalgamations, and indexing and retrieval software must treat them as such. Where a digraph occurs in a source text, it should normally be encoded using the appropriate code point for the single abstract character which it indeed represents, either by direct entry of the character concerned or through the appropriate CER or NCR.

Precomposed and combining characters and normalization

The treatment of characters with diacritical marks within Unicode shows a similar combination of rigour and pragmatism. It is obvious enough that it would be feasible to represent many characters with diacritical marks in Latin and some other scripts by a sequence of code points, where one code point designated the base character and the remainder represented one or more diacritical marks that were to be combined with the base character to produce an appropriate glyphic rendering of the abstract character concerned. From its earliest phase, the Unicode Consortium espoused this view in theory but was prepared in practice to compromise by assigning single code points to precomposed characters which were already commonly assigned a single distinctive code point in existing encoding schemes.
This means, however, that for quite a large number of commonly-occurring abstract characters, Unicode has two different, but logically and semantically equivalent encodings: a precomposed single code point, and a code point sequence of a base character plus one or more combining diacritics. Scripts more recently added to Unicode no longer exhibit this code-point duplication (in current practice no new precomposed characters are defined where the use of combining characters is possible) but this does nothing to remove the problem caused by the duplications permanently embodied in older strata of the character set. Together with essentially analogous issues arising from the encoding of certain East Asian ideographs, this duplication gives rise to the need to practice normalization of Unicode documents. Normalization is the process of ensuring that a given abstract character is represented in one way only in a given Unicode document or document collection. The Unicode Consortium provides four standard normalization forms, of which the Normalization Form C (NFC) seems to be most appropriate for text encoding projects. The World Wide Web Consortium has produced a document entitled Character Model for the World Wide Web 1.0 [7], which among other things discusses normalization issues and outlines some relevant principles. An authoritative reference is Unicode Standard Annex #15 -- Unicode Normalization Forms [8]. Individual projects will have to decide how far their decisions on normalization need be influenced by the fact that at present, by no means all hardware or software can correctly render (or even consistently identify) abstract characters encoded using combining symbols. It should be noted, however, that normalization as discussed in the documents above does not cover the problems mentioned above with East-Asian characters, except for issues connected with composed characters in Hangul.
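The duplication described above, and its removal by normalization, can be illustrated with a short sketch. It assumes Python and its standard unicodedata module, neither of which is prescribed by these Guidelines; the helper name to_nfc is invented for the illustration.

```python
import unicodedata

# Two encodings of the same abstract character,
# LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"    # one precomposed code point
decomposed = "e\u0301"    # base letter + COMBINING ACUTE ACCENT


def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# Compared code point by code point, the two strings differ,
# although they stand for the same abstract character.
raw_equal = precomposed == decomposed

# After normalization to NFC both use the precomposed code point.
nfc_equal = to_nfc(precomposed) == to_nfc(decomposed)
nfc_length = len(to_nfc(decomposed))

print(raw_equal, nfc_equal, nfc_length)
```

A project that normalizes every file on ingest in some such way can then rely on simple string comparison in its indexing and retrieval software.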
It is important that every Unicode-based project should agree on, consistently implement and fully document a comprehensive and coherent normalization practice. As well as ensuring data integrity within a given project, a consistently implemented and properly documented normalization policy is essential for successful document interchange.

Character semantics

In addition to the Universal Character Set itself, the Unicode Consortium maintains a database of additional character semantics [9]. This includes names for each character code point and normative properties for it. Character properties, as given in this database, determine the semantics and thus the intended use of a code point or character. It also contains information that might be needed for correctly processing this character for different purposes. This database is an important reference in determining which Unicode code point to use to encode a certain character. In addition to the printed documentation and lists made available by the Unicode Consortium, the information it contains may also be accessed by a number of search systems over the Web (e.g. http://www.eki.ee/letter/). Examples of character properties included in the database include case, numeric value, directionality, and, where applicable, status as a `compatibility character' [10]. Where a project undertakes local definition of characters with code points in the PUA, it is desirable that any relevant additional information about the characters concerned should be recorded in an analogous way, as further discussed under 5. Representation of Non-standard Characters and Glyphs.

vi.2.8 Character entities in non-validated documents

An important difference between SGML and XML is that the latter allows for the processing of non-validated documents. Since validity and validation are central TEI concerns, it is unlikely that documents prepared according to these Guidelines will ever be designed or implemented as merely well-formed in the XML sense.
However, in the domain of XML technologies, even where a document invokes a DTD or schema, it is not always necessarily the case that an XML processor will perform a full validation of it. XSLT transformation is a common case in point. By the workflow stage at which a document is handed off to an XSLT process for transformation, it is likely that its associated DTD or schema will already have fulfilled its role of integrity assurance and quality control, and so it may be undesirable to add validation to the processing overhead. For this reason, most XSLT processors do not attempt validation by default, even if a DTD or schema is declared and accessible.

[7] Available at http://www.w3.org/TR/charmod.
[8] Available at http://www.unicode.org/reports/tr15/
[9] http://www.unicode.org/ucd/
[10] For further details, see The Unicode Character Property Model (Unicode Technical Report #23), at http://www.unicode.org/reports/tr23/.

This can, however, create a problem where parsed entities (and character entities in particular in the present context) are referenced. A validating parser reads all entity declarations from the DTD (including those for character entities) in the initial phase of processing, so that they can be resolved as and when required. However, where no validation takes place, it cannot automatically be assumed that the parser will be able to resolve such entities in all circumstances. The XML standard requires a non-validating parser to read and act on entity declarations only if they are located within the document's internal subset (which does not, of course, mean that the entity declarations have to be manually merged into the document instance in advance of processing: character entity sets, for instance, count as being in the internal subset if they are placed there via a parameter entity, as is normal TEI practice).
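The behaviour of a non-validating parser described above can be observed with a small sketch. It assumes Python's standard xml.etree.ElementTree module, whose underlying parser does not validate; the sample documents and the helper parse_text are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal subset, so even a
# non-validating parser is required to read and apply it.
WITH_DECLARATION = (
    '<!DOCTYPE p [ <!ENTITY eacute "&#233;"> ]>'
    "<p>caf&eacute;</p>"
)

# The same reference with no declaration the parser can see.
WITHOUT_DECLARATION = "<p>caf&eacute;</p>"


def parse_text(xml_source: str) -> str:
    """Parse without validation and return the root element's text."""
    return ET.fromstring(xml_source).text or ""


resolved = parse_text(WITH_DECLARATION)

try:
    parse_text(WITHOUT_DECLARATION)
    unresolved_error = None
except ET.ParseError as exc:
    # The parser cannot resolve the reference and stops.
    unresolved_error = exc

print(repr(resolved), unresolved_error)
```

The first document parses because its declaration sits in the internal subset; the second fails because the parser has nowhere to find a declaration for the entity.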
Some parsers when in non-validating mode will also access entity declarations in the external subset, but this behaviour is not mandated by the standard and should not be relied upon. Provided these facts are borne in mind, the presence of character entities in a document when parser validation is switched off should not cause any difficulties.

vi.2.9 Issues arising from the internal representations of Unicode

In theory it should not be necessary for encoders to have any knowledge of the various ways in which Unicode code points can be represented internally within a document or in the memory of a processing system, but experience shows that problems frequently arise in this area because of mistaken practice or defective software, and in order to recognise the resulting symptoms and correct their causes an outline knowledge of certain aspects of Unicode internal representation is desirable.

Encoding errors related to UTF-8

The code points assigned by Unicode 3.0 and later are notionally 32-bit integers, and the most straightforward way to represent each such integer in computer storage would be to use 4 eight-bit bytes. However, many of the code points for characters most commonly used in Latin scripts can be represented in one byte only and the vast majority of the remainder which are in common use (including those assigned from the most frequently used PUA range) can be expressed in two bytes alone. This accounts for the use of UTF-8 and UTF-16 and their special place in the XML standard. UTF-8 and UTF-16 are ways of representing 32-bit code points in an economical way. UTF-8 is a variable-length encoding: the more significant bits there are in the underlying code point (or in everyday terminology the bigger the number used to represent the character), the more bytes UTF-8 uses to encode it.
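The variable-length character of UTF-8 is easy to observe directly. The following sketch assumes Python; the four sample characters and the helper utf8_length are chosen only for illustration.

```python
# Sample characters with progressively larger code points.
SAMPLES = {
    "LATIN SMALL LETTER A": "\u0061",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "LINEAR B SYLLABLE B008 A": "\U00010000",
}


def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(char.encode("utf-8"))


lengths = {name: utf8_length(ch) for name, ch in SAMPLES.items()}

for name, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X} {name}: {lengths[name]} byte(s)")
```

The byte count rises from one to four as the code point grows, which is the economy (and the source of the hazards) discussed in this section.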
What makes UTF-8 particularly attractive for representing Latin scripts, explaining its status as the default encoding in XML documents, is that all code points that can be expressed in seven or fewer bits (the 128 values in the original ASCII character set) are also encoded as the same seven or fewer bits (and therefore in a single byte) in UTF-8. That is why a document which is actually encoded in pure 7-bit ASCII can be fed to an XML processor without alteration and without its encoding being explicitly declared: the processor will regard it as being in the UTF-8 representation of Unicode and be able to handle it correctly on that basis. However, even within the domain of Latin-based scripts, some projects have documents which use characters from 8-bit extensions to ASCII, e.g. those in the ISO-8859-n series of encodings, and the way characters which under ISO-8859-n use all eight bits are encoded in UTF-8 is significantly different, giving rise to puzzling errors. Abstract characters that have a single-byte code point where the highest bit is set (that is, they have a decimal numeric representation between 128 and 255) are encoded in ISO-8859-n as a single byte with the same value as the code point. But in UTF-8 code-point values inside that range are expressed as a two-byte sequence. That is to say, the abstract character in question is no longer represented in the file or in memory by the same number as its code-point value: it is transformed (hence the T in UTF) into a sequence of two different numbers. Now as a side-effect of the way such UTF-8 sequences are derived from the underlying code-point value, many of the single-byte eight-bit values employed in ISO-8859-n encodings are illegal in UTF-8. This complicated situation has a simple consequence which can cause great bewilderment.
XML processors will effortlessly handle character data in pure 7-bit ASCII without that encoding needing to be declared to the parser, and will similarly accept documents encoded in an undeclared ISO-8859-n encoding if they happen to use no characters outside the strict ASCII subset of the ISO character sets; but the parse will immediately fail if an eight-bit character from an ISO-8859-n set is encountered in the input stream, unless the document's encoding has been explicitly and correctly declared. Explicitly declaring the encoding ought to solve the problem, and if the file is correctly encoded throughout, it will do so. But since text editors and word processors are currently acquiring different degrees of Unicode support at different rates, projects are likely to find that they have to deal with some files encoded in UTF-8 along with others in, say, ISO-8859-1. Such encoding differences may go unnoticed, especially if the proportion of characters where the internal encodings are distinguishable is relatively small (for example in a long English text with a smattering of French words). If in the process of document preparation two such files have been merged, or intermixed via `cut and paste' techniques, it is all too possible that the internal encodings of the resulting files will have become mixed as well. Thanks to misplaced notions of `user friendliness' some current editing software silently corrects such miscodings as it displays the text, so that they remain hidden until the XML parser terminates with a fatal `invalid character' error. Where erroneously mixed encodings are the source of such an error, altering the encoding declaration will not solve the problem, though it may obfuscate it. Eight-bit character codes in a file declared as UTF-8 will always stop the parser.
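The failure just described, and the converse case in which nothing fails but the text is misread, can both be reproduced outside any XML toolchain. The sketch below assumes Python's built-in codecs; the sample word and the helper decodes_as are invented for the illustration.

```python
# The word "café" as stored in an ISO-8859-1 file:
# é is the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

# The same word as stored in a UTF-8 file:
# é becomes the two-byte sequence 0xC3 0xA9.
utf8_bytes = "caf\u00e9".encode("utf-8")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Report whether data is a legal byte sequence in encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII is accepted when read as UTF-8.
ascii_only_ok = decodes_as(b"cafe", "utf-8")

# An eight-bit ISO-8859-1 byte is rejected when read as UTF-8.
latin1_as_utf8_ok = decodes_as(latin1_bytes, "utf-8")

# Converse case: UTF-8 bytes read as ISO-8859-1 raise no error,
# but each byte of the sequence becomes a separate character.
misread = utf8_bytes.decode("iso-8859-1")

print(ascii_only_ok, latin1_as_utf8_ok, repr(misread))
```

A check of this kind, run over every incoming file before it reaches the parser, is one simple way of detecting mixed encodings at an early stage.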
More insidiously, UTF-8 sequences in a file declared as ISO-8859-1 will not halt the parse, but will cause data corruption, because the parser will silently but erroneously convert each byte in every UTF-8 sequence into a spurious separate character, introducing semantic errors which may not become apparent until much later in the processing chain. In projects that routinely handle documents in non-Latin scripts, everyone is well aware of the need to ensure correct and consistent encoding, so in such places mixed encoding problems seldom arise, and when they do are readily identified and remedied. Real confusion tends to arise, however, in projects which have a low awareness of the issues because they employ predominantly unaccented Latin characters, with only thinly distributed instances of accented letters, or other `special characters' where the internal representations under ISO-8859-n and UTF-8 are different (such as the copyright symbol, or, a frequent troublemaker where eventual HTML output is envisaged, the `non-breaking space'). Even, or especially, if such projects view themselves as concerned only with English documents, the close relationship between XML and Unicode means they will need to acquire an understanding of these encoding issues and develop procedures which assure consistency and integrity of encoding and its correct declaration, including the use of appropriate software for transcoding and verification.

Encoding errors related to UTF-16

The advantages of UTF-8 as an internal representation of Unicode code points outlined above do not obtain where documents are in scripts other than Latin, Cyrillic or Hebrew. Where characters with code points in the sixteen-bit range (two-byte) predominate, UTF-8 is inappropriate, because it requires three or more bytes to represent each abstract character.
Here the preferred representation of Unicode (which all XML-conformant parsers must support) is UTF-16, where each code point corresponding to an abstract character is represented in two eight-bit bytes [11]. This encoding presents a different hazard, especially while support for Unicode in editing software is relatively uneven and immature.

[11] The use of `surrogate' values to represent code points beyond the 16-bit range is passed over here, since it adds a complication that does not affect the key points at issue.

Because the code points are represented as sixteen-bit integers stored (in most popular computers) in two separate bytes, the order in which those bytes are stored becomes important. This is dependent on the underlying hardware. In the realm of desktop computing, Macintosh machines, for example, store (on disk as well as in memory) byte pairs representing 16-bit integers with the higher-value byte first, whereas PCs using Intel processors store the bytes in the reverse order (this is often referred to with Swiftian nomenclature as big-endian versus little-endian byte order). This means that if a semantically identical plain text file encoded in UTF-16 is prepared on a Macintosh and on a PC, and the two files are then saved to disk, each byte pair in one file will be in the reverse order from the corresponding byte pair in the other file. To avoid the obvious incompatibility problems, the XML standard requires that all documents whose declared encoding is UTF-16 must begin with a special pseudo-character which is not itself part of the document, but merely a Byte Order Marker (BOM) from which the processor can determine the byte order of the document that follows.
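The effect of byte order, and the role of the BOM in resolving it, can be seen in a short sketch. It assumes Python and its standard codecs module; the sample character and the helper with_bom are invented for the illustration.

```python
import codecs

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, U+00E9

# The same code point stored in the two possible byte orders.
big_endian = text.encode("utf-16-be")      # higher-value byte first
little_endian = text.encode("utf-16-le")   # lower-value byte first


def with_bom(payload: bytes, bom: bytes) -> bytes:
    """Prefix UTF-16 data with a byte order mark (hypothetical helper)."""
    return bom + payload


be_file = with_bom(big_endian, codecs.BOM_UTF16_BE)
le_file = with_bom(little_endian, codecs.BOM_UTF16_LE)

# The generic "utf-16" codec reads the BOM and selects the byte
# order accordingly, so both files yield the same character.
decoded_be = be_file.decode("utf-16")
decoded_le = le_file.decode("utf-16")

# Without a BOM, assuming the wrong byte order raises no error here:
# it silently yields a different code point.
garbled = big_endian.decode("utf-16-le")

print(big_endian, little_endian, decoded_be == decoded_le, hex(ord(garbled)))
```

The last line of the sketch shows why byte-order mistakes can pass unnoticed: the bytes remain decodable, but as the wrong character.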
Now the insertion of a correct BOM and the consistent maintenance of the byte order throughout the file ought to be taken care of transparently by software, but experience, especially from environments where work is distributed across big-endian and little-endian hardware, shows that this cannot always be taken for granted in the current state of software development. As with mixed encoding problems involving UTF-8, inconsistent byte-order in UTF-16 files seems to be the result of merging or cutting and pasting between files using software which does not correctly enforce byte order integrity, and out of misconceived `user friendliness' which conceals byte-order inconsistencies from the user. Once more, the result can be files which look correct in an editor, but which the XML parser either rejects outright or silently passes on in a seriously garbled form. Again, to avoid the consequent errors, projects need to cultivate an informed awareness of relevant encoding issues and devise policies to avoid them in the first place or detect them at an early stage.

Chapter 1 The TEI Infrastructure

This chapter describes the infrastructure for the encoding scheme defined by these Guidelines. It introduces the conceptual framework within which the following chapters are to be understood, and the means by which that conceptual framework is implemented. It assumes some familiarity with XML and XML schemas (see chapter v A Gentle Introduction to XML) but is intended to be accessible to any user of these Guidelines. Other chapters supply further technical details, in particular chapter 22. Documentation Elements which describes the XML schema used to express the Guidelines themselves, and chapter 23. Using the TEI which combines a discussion of modification and conformance issues with a description of the intended behaviour of an ODD processor; these chapters should be read by anyone intending to implement a new TEI-based system.
The TEI encoding scheme consists of a number of modules, each of which declares particular XML elements and their attributes. Part of an element's declaration includes its assignment to one or more element classes. Another part defines its possible content and attributes with reference to these classes. This indirection gives the TEI system much of its strength and its flexibility. Elements may be combined more or less freely to form a schema appropriate to a particular set of requirements. It is also easy to add new elements which reference existing classes or elements to a schema, as it is to exclude some of the elements provided by any module included in a schema.

In principle, a TEI schema may be constructed using any combination of modules. However, certain TEI modules are of particular importance, and should always be included in all but exceptional circumstances: the module tei described in the present chapter is of this kind because it defines classes, macros, and datatypes which are used by all other modules. The core module, defined in chapter 3. Elements Available in All TEI Documents, contains declarations for elements and attributes which are likely to be needed in almost any kind of document, and is therefore recommended for global use. The header module defined in chapter 2. The TEI Header provides declarations for the metadata elements and attributes constituting the TEI Header, a component which is required for TEI conformance, while the textstructure module defined in chapter 4. Default Text Structure declares basic structural elements needed for the encoding of most book-like objects. Most schemas will therefore need to include these four modules.

The specification for a TEI schema is itself a TEI document, using elements from the module described in chapter 22. Documentation Elements: we refer to such a document informally as an ODD document, from the design goal originally formulated for the system: `One Document Does it all'.
Stylesheets for maintaining and processing ODD documents are maintained by the TEI, and these Guidelines are also maintained as such a document. As further discussed in 23.4. Implementation of an ODD System, an ODD document can be processed to generate a schema expressed using any of the three schema languages currently in wide use: the XML DTD language, the ISO RELAX NG language, or the W3C Schema language, as well as to generate documentation such as the Guidelines and their associated web site.

The bulk of this chapter describes the TEI infrastructure module itself. Although it may be skipped at a first reading, an understanding of the topics addressed here is essential for anyone planning to take full advantage of the TEI customization techniques described in chapter 23.2. Personalization and Customization.

The chapter begins by briefly characterizing each of the modules available in the TEI scheme. Section 1.2. Defining a TEI Schema describes in general terms the method of constructing a TEI schema in a specific schema language such as XML DTD language, RELAX NG, or W3C Schema. The next and largest part of the chapter introduces the attribute and element classes used to define groups of elements and their characteristics (section 1.3. The TEI Class System). Finally, section 1.4. Macros introduces the concept of macros, which are used to express some commonly used content models, and lists the datatypes used to constrain the range of legal values for TEI attributes (section 1.4.2. Datatype Macros).

1.1 TEI Modules

These Guidelines define several hundred elements and attributes for marking up documents of any kind.
Each definition has the following components:

* a prose description
* a formal declaration, expressed using a special-purpose XML vocabulary defined by these Guidelines in combination with elements taken from the ISO schema language RELAX NG
* usage examples

Each chapter of the Guidelines presents a group of related elements, and also defines a corresponding set of declarations, which we call a module. All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind of usage. A module is thus simply a convenient way of grouping together a number of associated element declarations. In the simple case, a TEI schema is made by combining together a small number of modules, as further described in section 1.2. Defining a TEI Schema below.

The following table lists the modules defined by the current release of the Guidelines:

Module name    Formal public identifier              Where defined
analysis       Analysis and Interpretation           17. Simple Analytic Mechanisms
certainty      Certainty and Uncertainty             21. Certainty and Responsibility
core           Common Core                           3. Elements Available in All TEI Documents
corpus         Metadata for Language Corpora         15. Language Corpora
dictionaries   Print Dictionaries                    9. Dictionaries
drama          Performance Texts                     7. Performance Texts
figures        Tables, Formulae, Figures             14. Tables, Formulæ, and Graphics
gaiji          Character and Glyph Documentation     5. Representation of Non-standard Characters and Glyphs
header         Common Metadata                       2. The TEI Header
iso-fs         Feature Structures                    18. Feature Structures
linking        Linking, Segmentation, and Alignment  16. Linking, Segmentation, and Alignment
msdescription  Manuscript Description                10. Manuscript Description
namesdates     Names, Dates, People, and Places      13. Names, Dates, People, and Places
nets           Graphs, Networks, and Trees           19. Graphs, Networks, and Trees
spoken         Transcribed Speech                    8. Transcriptions of Speech
tagdocs        Documentation Elements                22. Documentation Elements
tei            TEI Infrastructure                    1. The TEI Infrastructure
textcrit       Text Criticism                        12. Critical Apparatus
textstructure  Default Text Structure                4. Default Text Structure
transcr        Transcription of Primary Sources      11. Representation of Primary Sources
verse          Verse                                 6. Verse

For each module listed above, the corresponding chapter gives a full description of the classes, elements, and macros which it makes available when it is included in a schema. Other chapters of these Guidelines explore other aspects of using the TEI scheme.

1.2 Defining a TEI Schema

To determine that an XML document is valid (as opposed to merely well-formed), its structure must be checked against a schema, as discussed in chapter v A Gentle Introduction to XML. For a valid TEI document, this schema must be a conformant TEI schema, as further defined in chapter 23.3. Conformance. Local systems may allow their schema to be implicit, but for interchange purposes the schema associated with a document must be made explicit. The method of doing this recommended by these Guidelines is to provide explicitly or by reference a TEI schema specification against which the document may be validated.

A TEI-conformant schema is a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The TEI provides an application-independent way of specifying a TEI schema by means of the element defined in chapter 22. Documentation Elements. The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies.
In either case, the specification may be processed to generate a formal schema, expressed in a variety of specific schema languages, such as XML DTD language, RELAX NG, or W3C Schema. These output schemas can then be used by an XML processor such as a validator or editor to validate or otherwise process documents. Further information about the processing of a TEI formal specification is given in chapter 23. Using the TEI.

1.2.1 A Simple Customization

The simplest customization of the TEI scheme combines just the four recommended modules mentioned above. In ODD format, this schema specification takes this form:

This schema specification contains references to each of four modules, identified by the key attribute on the element. The schema specification itself is also given an identifier (TEI-minimal). An ODD processor will generate an appropriate schema from this set of declarations, expressed using the XML DTD language, the ISO RELAX NG language, the W3C Schema language, or in principle any other adequately powerful schema language. The resulting schema may then be associated with the document instance by one of a number of different mechanisms, as further described in chapter v A Gentle Introduction to XML. The start point (or root element) of document instances to be validated against the schema is specified by means of the start attribute. Further information about the processing of an ODD specification is given in 23.4. Implementation of an ODD System.

1.2.2 A Larger Customization

These Guidelines introduce each of the modules making up the TEI scheme one by one, and therefore, for clarity of exposition, each chapter focusses on elements drawn from a single module. In reality, of course, the markup of a text will draw on elements taken from many different modules, partly because texts are heterogeneous objects, and partly because encoders have different goals.
Some examples of this heterogeneity include:

* a text may be a collection of other texts of different types: for example, an anthology of prose, verse, and drama;
* a text may contain other smaller, embedded texts: for example, a poem or song included in a prose narrative;
* some sections of a text may be written in one form, and others in a different form: for example, a novel where some chapters are in prose, others take the form of dictionary entries, and still others the form of scenes in a play;
* an encoded text may include detailed analytic annotation, for example of rhetorical or linguistic features;
* an encoded text may combine a literal transcription with a diplomatic edition of the same or different sources;
* the description of a text may require additional specialised metadata elements, for example when describing manuscript material in detail.

The TEI provides mechanisms to support all of these and many other use cases. The architecture permits elements and attributes from any combination of modules to co-exist within a single schema. Within particular modules, elements and attributes are provided to support differing views of the `granularity' of a text, for example:

* a definition of a corpus or collection as a series of documents, sharing a common TEI header (see chapter 15. Language Corpora)
* a definition of composite texts which combine optional front- and back-matter with a group of collected texts, themselves possibly composite (see section 4.3.1. Grouped Texts)
* an element for the representation of embedded texts, where one narrative appears to `float' within another (see section 4.3.2. Floating Texts)

Subsequent chapters of these Guidelines describe in detail markup constructs appropriate for these and many other possible features of interest. The markup constructs can be combined as needed for any given set of applications or project.
For example, a project aiming to produce an ambitious digital edition of a collection of manuscript materials, to include detailed metadata about each source, digital images of the content, along with a detailed transcription of each source, and a supporting biographical and geographical database might need a schema combining several modules, as follows:

Alternatively, a simpler schema might be used for a part of such a project: those preparing the transcriptions, for example, might need only elements from the core, textstructure, and transcr modules, and might therefore prefer to use a simpler schema such as that generated by the following:

The TEI architecture also supports more detailed customization beyond the simple selection of modules. A schema may suppress elements from a module, suppress some of their attributes, change their names, or even add new elements and attributes. Detailed discussion of the kind of modification possible in this way is provided in 23.2. Personalization and Customization and conformance rules relating to their application are discussed in 23.3. Conformance. These facilities are available for any schema language (though some features may not be available in all languages). The ODD language also makes it possible to combine TEI and non-TEI modules into a single schema, provided that the non-TEI module is expressed using the RELAX NG schema language (see further 22.6. Combining TEI and Non-TEI Modules).

1.3 The TEI Class System

The TEI scheme distinguishes about five hundred different elements. To aid comprehension, modularity, and modification, the majority of these elements are formally classified in some way. Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model.
A class is known as an attribute class if its members share attributes, and as a model class if its members appear in the same locations. In either case, an element is said to inherit properties from any classes of which it is a member.

Classes (and therefore elements which are members of those classes) may also inherit properties from other classes. For example, supposing that class A is a member (or a subclass) of class B, any element which is a member of class A will inherit not only the properties defined by class A, but also those defined by class B. In such a situation, we also say that class B is a superclass of class A. The properties of a superclass are inherited by all members of its subclasses.

A basic understanding of the classes into which the TEI scheme is organized is strongly recommended and is essential for any successful customization of the system.

1.3.1 Attribute Classes

An attribute class groups together elements which share some set of common attributes. Attribute classes are given names beginning att. and are usually adjectival. For example, the members of the class att.canonical have in common a key and a ref attribute, both of which are inherited from their membership in the class rather than individually defined for each element. These attributes are said to be defined by (or inherited from) the att.canonical class. If another element were to be added to the TEI scheme for which these attributes were considered useful, the simplest way to provide them would be to make the new element a member of the att.canonical class. Note also that this method ensures that the attributes in question are always defined in the same way, taking the same default values etc., no matter which element they are attached to.

Some attribute classes are defined within the tei infrastructural module and are thus globally available. Other attribute classes are specific to particular modules and thus defined in other chapters.
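The inheritance rule described above can be modelled in a few lines. The sketch below is an illustration only, written in Python rather than in the ODD language, and it is not how a TEI schema is actually generated. The class att.canonical and its attributes key and ref come from the text; the subclass att.example, its attribute exampleAttr, and both function names are hypothetical.

```python
# Each class lists the attributes it defines itself and the names
# of its superclasses.
CLASSES = {
    "att.canonical": {"attributes": ["key", "ref"], "superclasses": []},
    # Hypothetical subclass, invented purely for this illustration.
    "att.example": {
        "attributes": ["exampleAttr"],
        "superclasses": ["att.canonical"],
    },
}


def inherited_attributes(class_name, classes=CLASSES):
    """Attributes a class provides: those of its superclasses, then its own."""
    spec = classes[class_name]
    collected = []
    for parent in spec["superclasses"]:
        for attr in inherited_attributes(parent, classes):
            if attr not in collected:
                collected.append(attr)
    for attr in spec["attributes"]:
        if attr not in collected:
            collected.append(attr)
    return collected


def element_attributes(member_of, classes=CLASSES):
    """Attributes an element gains from all the classes it belongs to."""
    collected = []
    for class_name in member_of:
        for attr in inherited_attributes(class_name, classes):
            if attr not in collected:
                collected.append(attr)
    return collected


print(element_attributes(["att.canonical"]))
print(element_attributes(["att.example"]))
```

An element that joins the subclass alone still receives key and ref, because they reach it through the superclass; this is the sense in which the properties of a superclass are inherited by all members of its subclasses.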
Attributes defined by such classes will not be available unless the module concerned is included in a schema.

The attributes provided by an attribute class are those specified by the class itself, either directly, or by inheritance from another class. For example, the attribute class att.pointing.group provides attributes domains and targFunc to all of its members. This class is however a subclass of the att.pointing class, from which its members also inherit the attributes type and evaluate. Members of the class att.pointing will thus have these two attributes, while members of the class att.pointing.group will have all four.

Note that some modules define superclasses of an existing infrastructural class. For example, the global attribute class att.divLike makes attributes org, part, and sample available, while the att.metrical class, which is specific to the verse module, provides attributes met, real, and rhyme. Because att.metrical is defined as a superclass of att.divLike, all six of these attributes are available to members of att.divLike when the verse module is included in a schema: the declaration for att.metrical adds its three attributes to the three already defined by att.divLike. If, however, this module is not included in a schema, then the att.divLike class supplies only the three attributes first mentioned.

Attributes specific to particular modules are documented along with the relevant module rather than in the present chapter. One particular attribute class, known as att.global, is common to all modules, and is therefore described in some detail in the next section. A full list of all attribute classes is given in Appendix B Attribute Classes below.

1.3.1.1 Global Attributes

The following attributes are defined for every TEI element.

att.global provides attributes common to all elements in the TEI encoding scheme.
  @xml:id (identifier) provides a unique identifier for the element bearing the attribute.
  @n (number) gives a number (or other label) for an element, which is not necessarily unique within the document.
  @xml:lang (language) indicates the language of the element content using a `tag' generated according to BCP 47.
  @rend (rendition) indicates how the element in question was rendered or presented in the source text.
  @rendition points to a description of the rendering or presentation used for this element in the source text.
  @xml:base provides a base URI reference with which applications can resolve relative URI references into absolute URI references.

These attributes are optionally available for any TEI element; none of them is required.

1.3.1.1.1 Element Identifiers and Labels

The value supplied for the xml:id attribute must be a legal name, as defined in the World Wide Web Consortium's XML Recommendation. This means that it must begin with a letter, or the underscore character (`_'), and contain no characters other than letters, digits, hyphens, underscores, full stops, and certain combining and extension characters (note 1). In XML names (and thus the values of xml:id in an XML TEI document) uppercase and lowercase letters are distinguished, and thus partTime and parttime are two distinctly different names, and could (though perhaps unwisely) be used to denote two different element occurrences.

If two elements are given the same identifier, a validating XML parser will signal an error. The following example, therefore, is not valid:

What's it going to be then, eh?

There was me, that is Alex, and my three droogs, that is Pete, Georgie, and Dim, ...
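The markup of this example has not survived in the present copy; only its text remains. A minimal sketch of the kind of invalid encoding described above might look as follows, assuming that both paragraphs are tagged with p and that the quoted speech is tagged with q. The identifier value PAGE1 is purely illustrative.

```xml
<!-- Invalid: the same xml:id value is given to two different elements -->
<p xml:id="PAGE1">
  <q>What's it going to be then, eh?</q>
</p>
<p xml:id="PAGE1">There was me, that is Alex, and my three droogs,
  that is Pete, Georgie, and Dim, ...</p>
```

Changing either value (for example to PAGE2) would remove the conflict, since identifiers need only be unique within the document.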

Source: [24]

For a discussion of methods of providing unique identifiers for elements, see section 3.10.2. Creating New Reference Systems.

The n attribute also provides an identifying name or number for an element, but in this case the information need not be a legal xml:id value. Its value may be any string of characters; typically it is a number or other similar enumerator or label. For example, the numbers given to the items of a numbered list may be recorded with the n attribute; this would make it possible to record errors in the numeration of the original, as in this list of chapters, transcribed from a faulty original in which the number 10 is used twice, and 11 is omitted:

About These Guidelines
A Gentle Introduction to SGML
Verse
Drama
Spoken Materials
Print Dictionaries

The n attribute may also be used to record non-unique names associated with elements in a text, possibly together with a unique identifier as in the following example:
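Neither of the two uses of the n attribute described above retains its markup in the present copy. The sketch below shows how each might be encoded. The element names (list, item, div) are standard TEI, but the specific n values, the type values, and the identifier TXT0101 are assumptions made for illustration only.

```xml
<!-- n records the numbering found in the source, including its error:
     10 is used twice and 11 is omitted -->
<list type="ordered">
  <item n="1">About These Guidelines</item>
  <item n="2">A Gentle Introduction to SGML</item>
  <!-- ... -->
  <item n="9">Verse</item>
  <item n="10">Drama</item>
  <item n="10">Spoken Materials</item>
  <item n="12">Print Dictionaries</item>
</list>

<!-- A non-unique name in n, combined with a unique identifier in xml:id -->
<div type="book" n="One" xml:id="TXT0101">
  <!-- ... -->
</div>
```

Because n is not required to be unique, the repeated value 10 is legal here, whereas a repeated xml:id value would not be.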
Source: [174]

As noted above, there is no requirement to record a value for either the xml:id or the n attribute. Any XML processor can identify the sequential position of one element within another in an XML document without any additional tagging. An encoding in which each line of a long poem is explicitly labelled with its numerical sequence is therefore probably redundant.

Note 1. The colon is also by default a valid name character; however, it has a specific purpose in XML (to indicate namespace prefixes), and may not therefore be used in any other way within a name.

1.3.1.1.2 Language Indicators

The xml:lang attribute indicates the natural language and writing system applicable to the content of a given element. If it is not specified, the value is inherited from that of the immediately enclosing element. As a rule, therefore, it is simplest to specify the base language of the text on the <TEI> element, and allow most elements to take the default value for xml:lang; the language of an element then need be explicitly specified only for elements in languages other than the base language.

It is strongly recommended that all language shifts in the source be explicitly identified by use of the xml:lang attribute, as further described in chapter vi Languages and Character Sets. The values used for the xml:lang attribute must be constructed in a particular way, using values from standard lists. See further vi.1 Language identification.

The following two encodings convey the same information about the language of the text. In the first, the xml:lang attributes on the child elements specify the same value as that on the parent <p> element, while in the second they inherit that value without specifying it.

... Both parties deprecated war, but one of them would make war rather than let the nation survive, and the other would accept war rather than let it perish, and the war came.

Source: [131]

... Both parties deprecated war, but one of them would make war rather than let the nation survive, and the other would accept war rather than let it perish, and the war came.
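The two encodings are shown above without their markup. As a hedged sketch of what is being compared, the fragment below assumes that the emphasized words are tagged with emph and that the language is English (en); which words carry the emphasis is a guess made for illustration.

```xml
<!-- First encoding: xml:lang is repeated on each child element -->
<p xml:lang="en"> ... Both parties deprecated war, but one of them would
  <emph xml:lang="en">make</emph> war rather than let the nation survive,
  and the other would <emph xml:lang="en">accept</emph> war rather than
  let it perish, and the war came.</p>

<!-- Second encoding: the child elements inherit xml:lang from the p -->
<p xml:lang="en"> ... Both parties deprecated war, but one of them would
  <emph>make</emph> war rather than let the nation survive,
  and the other would <emph>accept</emph> war rather than
  let it perish, and the war came.</p>
```

A processor that resolves xml:lang by inheritance would report the same language for every element in both versions.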

In the following example, by contrast, the xml:lang attribute must be given on the element containing the technical terms if we wish to record the fact that those terms are Latin rather than English; no xml:lang attribute is needed on an element whose content is in the same language as that of its parent.

The constitution declares that no bill of attainder or ex post facto law shall be passed. ...
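A sketch of how this example might be tagged follows. It assumes that the Latin phrase is marked with term and the quoted clause with q; both choices, and the extent of each element, are assumptions, since the markup is missing from the present copy.

```xml
<p xml:lang="en">The constitution declares
  <!-- q needs no xml:lang: it is English, like its parent p -->
  <q>that no bill of attainder or
    <!-- term overrides the inherited value for the Latin phrase -->
    <term xml:lang="la">ex post facto</term>
    law shall be passed.</q> ...</p>
```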

Source: [140]

Note that additional information about a particular language may be supplied in the <language> element within the header (see section 2.4.2. Language Usage).

1.3.1.1.3 Rendition Indicators

The rend attribute is used to give information about the physical presentation of the text in the source. In the following example, it is used to indicate that both the emphasized word and the proper name are printed in italics:

... Their motives might be pure and pious; but he was equally alarmed by his knowledge of the ambitious Bohemond, and his ignorance of the Transalpine chiefs: ...
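The example text above has lost its tags. A minimal sketch of the encoding it describes is given below, assuming that the emphasized word is might, that the proper name is Bohemond, and that the project's chosen rend value for italic type is the string italics (rend values are free text, so any consistent vocabulary would serve).

```xml
<p> ... Their motives <emph rend="italics">might</emph> be pure and pious;
  but he was equally alarmed by his knowledge of the ambitious
  <name rend="italics">Bohemond</name>, and his ignorance of the
  Transalpine chiefs: ...</p>
```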

Source: [85]

If all or most <emph> and <name> elements are rendered in the text by italics, it will be more convenient to register that fact in the TEI header once and for all (using the <rendition> element discussed below) and specify a rend value only for any elements which deviate from the stated rendition.

Although the contents of the rend attribute are free text, in any given project, encoders are advised to adopt a standard vocabulary with which to describe typographic or manuscript rendition of the text.

The <rendition> element defined in 2.3.4. The Tagging Declaration may be used to hold such descriptions, expressed in free text, or using a formal language. A <rendition> element can then be associated with any element, either by default, or by means of the global rendition attribute. For example, a <rendition> element might hold the CSS declaration font-style: italic.
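The association between a rendition description and the elements that use it can be sketched as follows. The identifier IT and the choice of emph as the element pointing to it are illustrative assumptions; the scheme attribute is assumed here to name the formal language in which the description is written.

```xml
<!-- In the TEI header, within the tagging declaration -->
<tagsDecl>
  <rendition xml:id="IT" scheme="css">font-style: italic</rendition>
</tagsDecl>

<!-- In the text: the rendition attribute points to the description -->
<p> ... Their motives <emph rendition="#IT">might</emph> be pure and
  pious; ...</p>
```

The value of rendition is a pointer, so the same description can be shared by any number of elements without being repeated.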

The rendition attribute always points to one or more <rendition> elements, each of which defines some aspect of the rendering or appearance of the text in its original form. These details may be described using a formal language, such as CSS (Lie and Bos (eds.) (1999)) or XSL-FO (Berglund (ed.) (2006)); in some other formal language developed for a specific project; or informally in running prose. Although languages such as CSS and XSL-FO are generally used to describe document output to screen or print, they nonetheless provide formal and precise mechanisms for describing the appearance of many source documents, especially print documents, but also many aspects of manuscript documents. For example, both CSS and XSL-FO provide mechanisms for describing typefaces, weight, and styles; character and line spacing; and so on.

If both rendition and rend attributes are provided for a given element, the latter always takes precedence. The rendition attribute is analogous to the XHTML or HTML class attribute, which references style declarations in a Cascading Style Sheet. The rend attribute is analogous to the XHTML or HTML style attribute, which provides a mechanism for embedding inline rendition information at the point of use within a document. Note that, in either case, the TEI attributes describe the rendition or appearance of the source document, not intended output renditions, although often the two may be closely related.

1.3.2 Model Classes

As noted above, the members of a given TEI model class share the property that they can all appear in the same location within a document. Wherever possible, the content model of a TEI element is expressed not directly in terms of specific elements, but indirectly in terms of particular model classes. This makes content models simpler and more consistent; it also makes them much easier to understand and to modify.

Like attribute classes, model classes may have subclasses or superclasses.
Just as elements inherit from a class the ability to appear in certain locations of a document (wherever the class can appear), so all members of a subclass inherit the ability to appear wherever any superclass can appear. To some extent, the class system thus provides a way of reducing the whole TEI galaxy of elements into a tidy hierarchy. This is however not entirely the case. In fact, the nature of a given class of elements can be considered along two dimensions: as noted, it defines a set of places where the class members are permitted within the document hierarchy; it also implies a semantic grouping of some kind. For example, the very large class of elements which can appear within a paragraph comprises a number of other classes, all of which have the same structural property, but which differ in their field of application. Some are related to highlighting, while others relate to names or places, and so on. In some cases, the `set of places where class members are permitted' is very constrained: it may just be within one specific element, or one class of element, for example. In other cases, elements may be permitted to appear in very many places, or in more than one such set of places.

These factors are reflected in the way that model classes are named. If a model class has a name containing part, such as model.divPart or model.biblPart, then it is primarily defined in terms of its structural location. For example, those elements (or classes of element) which appear as content of a <div> constitute the model.divPart class; those which appear as content of a <bibl> constitute the model.biblPart class. If, however, a model class has a name containing like, such as model.biblLike or model.nameLike, the implication is that its members all have some additional semantic property in common, for example containing a bibliographic description, or containing some form of name, respectively. These semantically-motivated classes often provide a useful way of dividing up large structurally-motivated classes: for example, the very general structural class model.pPart.data (`data elements that form part of a paragraph') has four semantically-motivated member classes (model.addressLike, model.dateLike, model.measureLike, and model.nameLike), the last of these being itself a superclass with three members.

Although most classes are defined by the tei infrastructure module, a class cannot be populated unless some other specific module is included in a schema, since element declarations are contained by modules. Classes are not declared `top down', but instead gain their members as a consequence of individual elements' declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active.

Some classes contain only a single member, even when all modules are loaded. One reason for declaring such a class is to make it easier for a customization to add new member elements in a specific place, particularly in areas where the TEI does not make fully elaborated proposals. For example, the TEI class model.rdgLike, initially empty, is expanded by the textcrit module to include just the <rdg> element. A project wishing to add an alternative way of structuring text-critical information could do so by defining its own elements and adding them to this class.
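The customization just described might be sketched in ODD as follows. Everything specific here is hypothetical: the schema name myTEI, the element name altRdg, its namespace URI, and its content model are invented for illustration, and the set of modules selected is only one plausible choice.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            xmlns:rng="http://relaxng.org/ns/structure/1.0"
            ident="myTEI" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="textcrit"/>

  <!-- Hypothetical project-specific element, declared as a new
       member of the existing class model.rdgLike -->
  <elementSpec ident="altRdg" ns="http://www.example.org/ns/project"
               mode="add">
    <desc>a project-specific alternative to the TEI reading element</desc>
    <classes>
      <memberOf key="model.rdgLike"/>
    </classes>
    <content>
      <rng:ref name="macro.paraContent"/>
    </content>
  </elementSpec>
</schemaSpec>
```

Because content models refer to the class rather than to its members, the new element becomes available wherever model.rdgLike is referenced, without any of those content models being edited.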
Another reason for declaring single-member classes is where the class members are not needed in all documents, but appear in the same place as elements which are very frequently required. For example, the specialised element <g>, used to represent a non-Unicode character or glyph, is provided as the only member of the model.gLike class when the gaiji module is added to a schema. References to this class are included in almost every content model, since if it is used at all the <g> element must be available wherever text is available; however these references have no effect unless the gaiji module is loaded.

At the other end of the scale, a few of the classes predefined by the tei module are subsequently populated with very many members. For example, the class model.pPart groups all the classes of element which can appear within a <p> or paragraph element. The core module alone adds more than fifty elements to this class; the namesdates module adds another twenty, as does the tagdocs module. Since the <p> element is one of the basic building blocks of a TEI document it is not surprising that each module will need to add elements to it. The class system here provides a very convenient way of controlling the resulting complexity. Typically, elements are not added directly to these very general classes, but via some intermediate semantically-motivated class.

Just as there are a few classes which have a single member, so there are some classes which are used only once in the TEI architecture. These classes, which have no superclass and therefore do not fit into the class hierarchy defined here, are a convenient way of maintaining elements which are highly structured internally, but which appear from the outside to be uniform objects like others at the same level (note 2). Members of such classes can only ever appear within one element, or one class of elements. For example, the class model.addrPart is used only to express the content model for the element <address>; it references some other classes of elements, which can appear elsewhere, and also some elements which can only appear inside an address.

1.3.2.1 Basic Model Classes

The TEI class system makes the following threefold division of elements:

divisions
  high level, possibly self-nesting, major divisions of texts. These elements populate the classes model.divLike, model.div1Like, etc.
chunks
  elements such as paragraphs and other paragraph-level elements, which can appear directly within texts or within such divisions, but not within other chunks. These elements populate the class model.divPart, either directly or by means of other classes such as model.pLike (paragraph-like elements), model.entryLike, etc.
phrase-level elements
  elements such as highlighted phrases, book titles, or editorial corrections which can occur only within chunks (paragraphs or paragraph-level elements), but not between them (and thus cannot appear directly within a division). These elements populate the class model.phrase (note 3).

The TEI identifies the following fundamental groupings derived from these three:

inter-level elements
  elements such as lists, notes, quotations, etc. which can appear either between chunks (as children of a <div>) or within them; these elements populate the class model.inter. Note that this class is not a superset of the model.phrase and model.chunk classes but rather the group of elements which are both chunk-like and phrase-like; the classes model.phrase, model.pLike, and model.inter are all disjoint.
components
  elements which can appear directly within texts or text divisions; this is a combination of the inter- and chunk-level elements defined above. These elements populate the class model.common, which is defined as a superset of the classes model.divPart, model.inter, and (when the dictionary module is included in a schema) model.entryLike.

Broadly speaking, the front, body, and back of a text each comprises a series of components, optionally grouped into divisions.

As noted above, some elements and element classes belong to none of these groupings; however, over two-thirds of the 500+ elements defined in the present edition of these Guidelines are classified in this way. Future editions of these recommendations will extend and develop this classification scheme. A complete alphabetical list of all model classes is provided in Appendix A Model Classes.

Note 2. In former editions of these Guidelines, such elements were known metaphorically as `crystals'.

Note 3. Note that in this context, phrase means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically-motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense.
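The division of elements into divisions, chunks, phrase-level elements, and inter-level elements can be illustrated with a short fragment. The content is invented; the comments indicate which grouping each element is taken to belong to, on the assumption that p is paragraph-like, hi is phrase-level, and note and list are inter-level.

```xml
<body>
  <div>                                   <!-- division -->
    <head>Chapter One</head>
    <p>A paragraph is a chunk; it may contain
      <hi>highlighted words</hi>          <!-- phrase-level: only within chunks -->
      or a <note>short note</note>.       <!-- inter-level, here within a chunk -->
    </p>
    <list>                                <!-- inter-level, here between chunks -->
      <item>An item</item>
    </list>
  </div>
</body>
```

Moving the hi element out of the paragraph so that it became a direct child of the div would make the fragment invalid, whereas the list and the note may appear in either position.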
Content model              Number of elements using this   Description
macro.phraseSeq            83                              any combination of text with elements from the model.gLike, model.global, or model.phrase classes
macro.paraContent          49                              macro.phraseSeq with the addition of model.inter
empty                      39                              elements that have no content
macro.specialPara          24                              macro.paraContent with the addition of model.divPart
macro.phraseSeq.limited    24                              a subset of macro.phraseSeq appropriate for use in non-transcriptional contexts
text                       21                              plain untagged text
macro.xtext                19                              any combination of text with elements from the model.gLike class

Table 1.2

1.4 Macros

The infrastructure module defined by this chapter also declares a number of macros, or shortcut names for frequently occurring parts of other declarations. Macros are used in two ways in the TEI scheme: to stand for frequently-encountered content models, or parts of content models (1.4.1. Standard Content Models); and to stand for attribute datatypes (1.4.2. Datatype Macros).

1.4.1 Standard Content Models

As far as possible, the TEI schemas use the following set of frequently-encountered content models to help achieve consistency among different elements.

  macro.paraContent (paragraph content) defines the content of paragraphs and similar elements.
  macro.limitedContent (paragraph content) defines the content of prose elements that are not used for transcription of extant materials.
  macro.phraseSeq (phrase sequence) defines a sequence of character data and phrase-level elements.
  macro.phraseSeq.limited (limited phrase sequence) defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents.
  macro.schemaPattern provides a pattern to match elements from the chosen schema language.
  macro.specialPara ('special' paragraph content) defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements.
  macro.xtext (extended text) defines a sequence of character data and gaiji elements.

The present version of the TEI Guidelines includes some 500 different elements. Table 1.2 shows, in descending order of frequency, the seven most commonly used content models.

1.4.2 Datatype Macros

The values which attributes may take in a TEI schema are defined, for the most part, by reference to a TEI datatype. Each such datatype is defined in terms of other primitive datatypes, derived mostly from W3C Schema Datatypes, literal values, or other datatypes. This indirection makes it possible for a TEI application to set constraints either globally or in individual cases, by redefining the datatype definition or the reference to it respectively. In some cases, the TEI datatype includes additional usage constraints which cannot be enforced by existing schema languages, although a TEI-compliant processor should attempt to validate them (see further discussion in chapter 23.3. Conformance). Where literal values or name tokens are used in a datatype definition, an associated value list supplies definitions for the significance of suggested or (in the case of closed lists) all possible values.

TEI-defined datatypes may be grouped into those which define normalised values for numeric quantities, probabilities, or temporal expressions, those which define various kinds of shorthand codes or keys, and those which define pointers or links.

The following datatypes are used for attributes which are intended to hold normalized values of various kinds.
First, expressions of quantity or probability:

  data.certainty defines the range of attribute values expressing a degree of certainty.
  data.probability defines the range of attribute values expressing a probability.
  data.numeric defines the range of attribute values used for numeric values.
  data.count defines the range of attribute values used for a non-negative integer value used as a count.

Examples of attributes using the data.probability datatype include degree; examples of data.numeric include quantity on members of the att.measurement class, and value; examples of data.count include cols on <cell> and <table>.

Next, the datatypes used for attributes which are intended to hold normalized dates or times, durations, or truth values:

  data.duration.w3c defines the range of attribute values available for representation of a duration in time using W3C datatypes.
  data.duration.iso defines the range of attribute values available for representation of a duration in time using ISO 8601 standard formats.
  data.temporal.w3c defines the range of attribute values expressing a temporal expression such as a date, a time, or a combination of them, that conform to the W3C XML Schema Part 2: Datatypes specification.
  data.temporal.iso defines the range of attribute values expressing a temporal expression such as a date, a time, or a combination of them, that conform to the international standard Data elements and interchange formats - Information interchange - Representation of dates and times.
  data.truthValue defines the range of attribute values used to express a truth value.
  data.xTruthValue (extended truth value) defines the range of attribute values used to express a truth value which may be unknown.
  data.language defines the range of attribute values used to identify a particular combination of human language and writing system.
  data.sex defines the range of attribute values used to identify human or animal sex.

Note that in each of these cases the values used are those recommended by existing international standards: ISO 8601 as profiled by XML Schema Part 2: Datatypes Second Edition in the case of durations, times, and dates; W3C Schema datatypes in the case of truth values; BCP 47 in the case of language; and ISO 5218 in the case of sex.

The following datatypes have more specialised uses:

  data.outputMeasurement defines a range of values for use in specifying the size of an object that is intended for display on the web.
  data.namespace defines the range of attribute values used to indicate XML namespaces as defined by the W3C Namespaces in XML Technical Recommendation.
  data.pattern (regular expression pattern) defines attribute values which are expressed as a regular expression.
  data.pointer defines the range of attribute values used to provide a single URI pointer to any other resource, either within the current document or elsewhere.

By far the largest number of TEI attributes take values which are coded values or names of some kind. These values may be constrained or defined in a number of different ways, each of which is given a different name, as follows:

  data.key defines the range of attribute values expressing a coded value by means of an arbitrary identifier, typically taken from a set of externally-defined possibilities.
  data.word defines the range of attribute values expressed as a single word or token.
  data.name defines the range of attribute values expressed as an XML Name.
  data.enumerated defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities.
  data.code defines the range of attribute values expressing a coded value by means of a pointer to some other element which contains a definition for it.

The attribute key provided by the att.canonical class is currently the only attribute of type data.key. It is used to supply an externally-defined identifier, such as a database key or filename. Because such identifiers are externally-defined, no constraints are placed on their possible values: any string of Unicode characters may be used. Any constraints on their values, such as the rules for constructing a valid database key in a particular system, may be documented in the TEI Header, but are not enforced by the datatype as defined here. Such system-specific constraints may however be added to a TEI schema by using the customization techniques described in 23.2. Personalization and Customization.

Attributes of type data.word, such as age, are used to supply an identifier expressed as any kind of single token or word.
The TEI places a few constraints on the characters which may be used for this purpose: only Unicode characters classified as letters, digits, punctuation characters, or symbols can appear in an attribute value of this kind. Note in particular that such values cannot include whitespace characters. Legal values include cholmondeley, été, 1234, _content, or xml:id, but not grand wazoo. Attributes of this kind are sometimes used to associate (by co-reference) elements of different types.

Attributes of type data.name are also words in this sense, but they have the additional constraint that they must be legal XML identifiers, as defined by the XML 1.0 specification, or successors. As such, they may not begin with digits or punctuation characters. Legal identifiers include cholmondeley, été, e_content, or xml:id, but not grand wazoo or 1234. Attributes of this kind are typically used to represent XML element or attribute names.

Attributes of type data.enumerated, such as new, or evidence supplied by att.editLike, have the same definition as data.word above, with the added constraint that the word supplied is taken from a specific list of possibilities. In each case, the element or class specification which includes the definition for the attribute will also contain a list of possible values, together with a prose description of their intended significance. This list may be open (in which case the list is advisory), or closed (in which case it determines the range of legal values). In this latter case, the datatype will not be data.enumerated, but an explicit list of the possible values.

Attributes of type data.code are similar in function, in that they also supply encoded names for values which are defined in more detail elsewhere. In this case, however, the full definition is supplied as content of another XML element, typically but not necessarily in the same document, and it is referenced by means of a pointer.
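The difference between data.word and data.name described above can be sketched as a pair of RELAX NG definitions. This is an illustration under stated assumptions rather than the normative declaration: the regular expression simply transcribes the prose rule (letters, digits, punctuation, or symbols, one or more times), and the element name example and its attributes w and n are hypothetical, added only so that the grammar is complete.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

  <!-- A single token made only of letters (L), digits (N),
       punctuation (P), or symbols (S); whitespace is excluded -->
  <define name="data.word">
    <data type="token">
      <param name="pattern">(\p{L}|\p{N}|\p{P}|\p{S})+</param>
    </data>
  </define>

  <!-- A legal XML Name: may not begin with a digit -->
  <define name="data.name">
    <data type="Name"/>
  </define>

  <!-- Hypothetical element used only to exercise the two datatypes -->
  <start>
    <element name="example">
      <attribute name="w"><ref name="data.word"/></attribute>
      <attribute name="n"><ref name="data.name"/></attribute>
    </element>
  </start>
</grammar>
```

Under these definitions 1234 would be accepted as the value of w but rejected as the value of n, while grand wazoo would be rejected by both because it contains a space.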
An attribute may, of course, take more than one value of a given type, for example a list of pointer values, or a list of words. In the TEI scheme, this information is regarded as a property of the element used to document the attribute in question rather than as a distinct `datatype'. See further 22.4.5.1. Datatypes.

1.5 The TEI Infrastructure Module

The tei module defined by this chapter is a required component of any TEI schema. It provides declarations for all datatypes, and initial declarations for the attribute classes, model classes, and macros used by other modules in the TEI scheme. Its components are listed below in alphabetical order:

Module tei: Declarations for classes, datatypes, and macros available to all TEI modules

* Classes defined: att.ascribed att.canonical att.damaged att.datable att.datable.w3c att.declarable att.declaring att.dimensions att.divLike att.duration.w3c att.editLike att.global att.handFeatures att.internetMedia att.interpLike att.measurement att.naming att.personal att.placement att.segLike att.sourced att.spanning att.tableDecoration att.timed att.transcriptional att.translatable att.typed att.xmlspace model.addrPart model.addressLike model.biblLike model.biblPart model.castItemPart model.catDescPart model.choicePart model.common model.dateLike model.div1Like model.div2Like model.div3Like model.div4Like model.div5Like model.div6Like model.div7Like model.divBottom model.divBottomPart model.divGenLike model.divLike model.divPart model.divTop model.divTopPart model.divWrapper model.egLike model.emphLike model.entryPart model.entryPart.top model.featureVal model.featureVal.complex model.featureVal.single model.frontPart model.frontPart.drama model.gLike model.global model.global.edit model.global.meta model.glossLike model.graphicLike model.handDescPart model.headLike model.hiLike model.highlighted model.imprintPart model.inter model.lLike model.lPart model.labelLike model.limitedPhrase
model.listLike model.measureLike model.milestoneLike model.msItemPart model.nameLike model.nameLike.agent model.noteLike model.oddDecl model.oddRef model.offsetLike model.orgStateLike model.pLike model.pLike.front model.pPart.data model.pPart.edit model.pPart.editorial model.pPart.msdesc model.pPart.transcriptional model.persEventLike model.persStateLike model.persTraitLike model.personLike model.personPart model.phrase model.phrase.xml model.physDescPart model.placeEventLike model.placeLike model.placeNamePart model.placeStateLike model.placeTraitLike model.ptrLike model.publicationStmtPart model.qLike model.quoteLike model.resourceLike model.respLike model.segLike model.settingPart model.specDescLike model.stageLike model.textDescPart model.titlepagePart

* Macros defined: data.certainty data.code data.count data.duration.iso data.duration.w3c data.enumerated data.key data.language data.name data.namespace data.numeric data.outputMeasurement data.pattern data.pointer data.probability data.sex data.temporal.iso data.temporal.w3c data.truthValue data.word data.xTruthValue macro.anyXML macro.limitedContent macro.paraContent macro.phraseSeq macro.phraseSeq.limited macro.schemaPattern macro.specialPara macro.xtext

The order in which declarations are made within the infrastructure module is critical, since several class declarations refer to others, which must therefore precede them. Other constraints on the order of declarations derive from the way in which the modularity of the TEI scheme is implemented in different schema languages. The XML DTD fragment implementing this TEI module makes extensive use of parameter entities and marked sections to effect a kind of conditional construction; the RELAX NG schema fragment similarly predeclares a number of patterns with null (`notAllowed') values. These issues are further discussed in chapter 23.4. Implementation of an ODD System.
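The RELAX NG technique mentioned above, in which a pattern is predeclared with a notAllowed value and later extended, can be sketched as follows. The class name model.gLike is taken from this chapter, but the declarations of g and seg are deliberately simplified stand-ins and do not reproduce the content models of the real TEI elements.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">

  <!-- Infrastructure module: the class is predeclared with no members,
       so references to it match nothing -->
  <define name="model.gLike">
    <notAllowed/>
  </define>

  <!-- A later module adds a member; combine="choice" merges this
       definition with the one above -->
  <define name="model.gLike" combine="choice">
    <ref name="g"/>
  </define>

  <!-- Simplified stand-in for the element contributed by that module -->
  <define name="g">
    <element name="g"><text/></element>
  </define>

  <!-- Simplified stand-in for a content model that refers to the class -->
  <start>
    <element name="seg">
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="model.gLike"/>
        </choice>
      </zeroOrMore>
    </element>
  </start>
</grammar>
```

If the second definition is omitted, the grammar remains well-formed and the reference to model.gLike simply has no effect, which is the behaviour the infrastructure module relies on when a module is not loaded.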
Chapter 2 The TEI Header

This chapter addresses the problems of describing an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented. Such documentation is equally necessary for scholars using the texts, for software processing them, and for cataloguers in libraries and archives. Together these descriptions and declarations provide an electronic analogue to the title page attached to a printed work. They also constitute an equivalent for the content of the code books or introductory manuals customarily accompanying electronic data sets.

Every TEI-conformant text must carry such a set of descriptions, prefixed to it and encoded as described in this chapter. The set is known as the TEI header, tagged <teiHeader>, and has four major parts:

1. a file description, tagged <fileDesc>, containing a full bibliographical description of the computer file itself, from which a user of the text could derive a proper bibliographic citation, or which a librarian or archivist could use in creating a catalogue entry recording its presence within a library or archive. The term computer file here is to be understood as referring to the whole entity or document described by the header, even when this is stored in several distinct operating system files. The file description also includes information about the source or sources from which the electronic document was derived. The TEI elements used to encode the file description are described in section 2.2. The File Description below.

2. an encoding description, tagged <encodingDesc>, which describes the relationship between an electronic text and its source or sources. It allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, and similar matters. The TEI elements used to encode the encoding description are described in section 2.3. The Encoding Description below.

3.
a text profile, tagged <profileDesc>, containing classificatory and contextual information about the text, such as its subject matter, the situation in which it was produced, the individuals described by or participating in producing it, and so forth. Such a text profile is of particular use in highly structured composite texts such as corpora or language collections, where it is often highly desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin. The text profile may however be of use in any form of automatic text processing. The TEI elements used to encode the profile description are described in section 2.4. The Profile Description below.

4. a revision history, tagged <revisionDesc>, which allows the encoder to provide a history of changes made during the development of the electronic text. The revision history is important for version control and for resolving questions about the history of a file. The TEI elements used to encode the revision description are described in section 2.5. The Revision Description below.

A TEI header can be a very large and complex object, or it may be a very simple one. Some application areas (for example, the construction of language corpora and the transcription of spoken texts) may require more specialized and detailed information than others. The present proposals therefore define both a core set of elements (all of which may be used without formality in any TEI header) and some additional elements which become available within the header as the result of including additional specialized modules within the schema. When the module for language corpora (described in chapter 15. Language Corpora) is in use, for example, several additional elements are available, as further detailed in that chapter.

The next section of the present chapter briefly introduces the overall structure of the header and the kinds of data it may contain.
This is followed by a detailed description of all the constituent elements which may be used in the core header. Section 2.6. Minimal and Recommended Headers, at the end of the present chapter, discusses the recommended content of a minimal TEI header and its relation to standard library cataloguing practices.

2.1 Organization of the TEI Header

2.1.1 The TEI Header and its Components

The <teiHeader> element should be clearly distinguished from the front matter of the text itself (for which see section 4.5. Front Matter). A composite text, such as a corpus or collection, may contain several headers, as further discussed below. In the usual case, however, a TEI-conformant text will contain a single <teiHeader> element, followed by a single <text> element.

The header element has the following description:

<teiHeader> (TEI Header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text.
    @type specifies the kind of document to which the header is attached, for example whether it is a corpus or individual text.

As discussed above, the <teiHeader> element has four principal components:

<fileDesc> (file description) contains a full bibliographic description of an electronic file.
<encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
<profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.
<revisionDesc> (revision description) summarizes the revision history for a file.

Of these, only the <fileDesc> element is required in all TEI headers; the others are optional. The top level elements in the full form of a TEI header are thus:

    <teiHeader>
      <fileDesc> <!-- ... --> </fileDesc>
      <encodingDesc> <!-- ... --> </encodingDesc>
      <profileDesc> <!-- ... --> </profileDesc>
      <revisionDesc> <!-- ... --> </revisionDesc>
    </teiHeader>
while a minimal header takes the form:

    <teiHeader>
      <fileDesc> <!-- ... --> </fileDesc>
    </teiHeader>

In the case of language corpora or collections, it may be desirable to record header information either at the level of the individual components in the corpus or collection, or at the level of the corpus or collection itself (more details concerning the tagging of composite texts are given in section 15. Language Corpora, which should be read in conjunction with the current chapter). The type attribute may be used to indicate whether the header applies to a corpus or a single text. A corpus may thus take the form:

    <teiCorpus>
      <teiHeader type="corpus"> <!-- corpus-level metadata --> </teiHeader>
      <TEI>
        <teiHeader type="text"> <!-- metadata specific to this text --> </teiHeader>
        <text> <!-- ... --> </text>
      </TEI>
      <!-- further TEI elements, one for each text in the corpus -->
    </teiCorpus>

2.1.2 Types of Content in the TEI Header

The elements occurring within the TEI header may contain several types of content; the following list indicates how these types of content are described in the following sections:

free prose
    Most elements contain simple running prose at some level. Many elements may contain either prose (possibly organized into paragraphs) or more specific elements, which themselves contain prose. In this chapter's descriptions of element content, the phrase prose description should be understood to imply a series of paragraphs, each marked as a <p> element. The word phrase, by contrast, should be understood to imply character data, interspersed as need be with phrase-level elements, but not organized into paragraphs. For more information on paragraphs, highlighted phrases, lists, etc., see section 3.1. Paragraphs.

grouping elements
    Elements whose names end with the suffix Stmt (e.g. <editionStmt>, <titleStmt>) usually enclose a group of specialized elements recording some structured information. In the case of the bibliographic elements, the suffix Stmt is used in names of elements corresponding to the `areas' of the International Standard Bibliographic Description (see note 1). In most cases grouping elements may contain prose descriptions as an alternative to the set of specialized elements, thus allowing the encoder to choose whether the information concerned should be presented in a structured form or in prose.

declarations
    Elements whose names end with the suffix Decl (e.g. <tagsDecl>, <refsDecl>) enclose information about specific encoding practices applied in the electronic text; often these practices are described in coded form. Typically, such information takes the form of a series of declarations, identifying a code with some more complex structure or description. A declaration which applies to more than one text or division of a text need not be repeated in the header of each such text or subdivision. Instead, the decls attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section 15.3. Associating Contextual Information with a Text.

descriptions
    Elements whose names end with the suffix Desc (e.g. <settingDesc>, <projectDesc>) contain a prose description, possibly, but not necessarily, organized under some specific headings by suggested sub-elements.

2.1.3 Model Classes in the TEI Header

The TEI Header provides a very rich collection of metadata categories, but makes no claim to be exhaustive.
It is certainly the case that individual projects may wish to record specialized metadata which either does not fit within one of the predefined categories identified by the TEI Header or requires a more specialized element structure than is proposed here. To overcome this problem, the encoder may elect to define additional elements using the customization methods discussed in 23.2. Personalization and Customization. The TEI class system makes such customizations simpler to effect and easier to use in interchange.

These classes are specific to parts of the header:

model.applicationLike groups elements used to record application-specific information about a document in its header.
model.catDescPart groups component elements of the TEI Header Category Description.
model.editorialDeclPart groups elements which may be used inside <editorialDecl> and appear multiple times.
model.encodingPart groups elements which may be used inside <encodingDesc> and appear multiple times.
model.profileDescPart groups elements which may be used inside <profileDesc> and appear multiple times.
model.headerPart groups high level elements which may appear more than once in a TEI Header.
model.sourceDescPart groups elements which may be used inside <sourceDesc> and appear multiple times.
model.textDescPart groups elements used to categorise a text, for example in terms of its situational parameters.

Note 1: For more information on this highly influential family of standards, first proposed in 1969 by the International Federation of Library Associations, see http://www.ifla.org/VII/s13/pubs/isbd.htm. On the relation between the TEI proposals and other standards for bibliographic description, see further section 2.7. Note for Library Cataloguers.

2.2 The File Description

This section describes the <fileDesc> element, which is the first component of the <teiHeader> element.

The bibliographic description of a machine-readable or digital text resembles in structure that of a book, an article, or any other kind of textual object.
The file description element of the TEI header has therefore been closely modelled on existing standards in library cataloguing; it should thus provide enough information to allow users to give standard bibliographic references to the electronic text, and to allow cataloguers to catalogue it. Bibliographic citations occurring elsewhere in the header, and also in the text itself, are derived from the same model (on bibliographic citations in general, see further section 3.11. Bibliographic Citations and References). See further section 2.7. Note for Library Cataloguers.

The bibliographic description of an electronic text should be supplied by the mandatory <fileDesc> element:

<fileDesc> (file description) contains a full bibliographic description of an electronic file.

The <fileDesc> element contains three mandatory elements and four optional elements, each of which is described in more detail in sections 2.2.1. The Title Statement to 2.2.6. The Notes Statement below. These elements are listed below in the order in which they must be given within the <fileDesc> element.

<titleStmt> (title statement) groups information about the title of a work and those responsible for its intellectual content.
<editionStmt> (edition statement) groups information relating to one edition of a text.
<extent> describes the approximate size of a text as stored on some carrier medium, whether digital or non-digital, specified in any convenient units.
<publicationStmt> (publication statement) groups information concerning the publication or distribution of an electronic or other text.
<seriesStmt> (series statement) groups information about the series, if any, to which a publication belongs.
<notesStmt> (notes statement) collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description.
<sourceDesc> (source description) describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence.
A file description containing all possible sub-elements has the following structure:

    <teiHeader>
      <fileDesc>
        <titleStmt> <!-- ... --> </titleStmt>
        <editionStmt> <!-- ... --> </editionStmt>
        <extent> <!-- ... --> </extent>
        <publicationStmt> <!-- ... --> </publicationStmt>
        <seriesStmt> <!-- ... --> </seriesStmt>
        <notesStmt> <!-- ... --> </notesStmt>
        <sourceDesc> <!-- ... --> </sourceDesc>
      </fileDesc>
    </teiHeader>

Several of these elements may be omitted; a minimal file description has the following structure:

    <teiHeader>
      <fileDesc>
        <titleStmt> <!-- ... --> </titleStmt>
        <publicationStmt> <!-- ... --> </publicationStmt>
        <sourceDesc> <!-- ... --> </sourceDesc>
      </fileDesc>
    </teiHeader>

2.2.1 The Title Statement

The <titleStmt> element is the first component of the <fileDesc> element, and is mandatory:

<titleStmt> (title statement) groups information about the title of a work and those responsible for its intellectual content.

It contains the title given to the electronic work, together with one or more optional statements of responsibility which identify the encoder, editor, author, compiler, or other parties responsible for it:

<title> contains a title for any kind of work.
<author> in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
<editor> secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization (or of several such) acting as editor, compiler, translator, etc.
<sponsor> specifies the name of a sponsoring organization or institution.
<funder> (funding body) specifies the name of an individual, institution, or organization responsible for the funding of a project or text.
<principal> (principal researcher) supplies the name of the principal researcher responsible for the creation of an electronic text.
<respStmt> (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply.
<resp> (responsibility) contains a phrase describing the nature of a person's intellectual responsibility.
<name> (name, proper noun) contains a proper noun or noun phrase.

The <title> element contains the chief name of the electronic work, including any alternative title or subtitles it may have.
It may be repeated, if the work has more than one title (perhaps in different languages) and takes whatever form is considered appropriate by its creator. Where the electronic work is derived from an existing source text, it is strongly recommended that the title for the former should be derived from the latter, but clearly distinguishable from it, for example by the addition of a phrase such as `: an electronic transcription' or `a digital edition'. This will distinguish the electronic work from the source text in citations and in catalogues which contain descriptions of both types of material.

The electronic work will also have an external name (its `filename' or `data set name') or reference number on the computer system where it resides at any time. This name is likely to change frequently, as new copies of the file are made on the computer system. Its form is entirely dependent on the particular computer system in use and thus cannot always easily be transferred from one system to another. Moreover, a given work may be composed of many files. For these reasons, these Guidelines strongly recommend that such names should not be used as the <title> for any electronic work.

Helpful guidance on the formulation of useful descriptive titles in difficult cases may be found in the Anglo-American Cataloguing Rules (Gorman and Winkler, 1978, chapter 25) or in equivalent national-level bibliographical documentation.

The elements <author>, <editor>, <sponsor>, <funder>, and <principal> are specializations of the more general <respStmt> element. These elements are used to provide the statements of responsibility which identify the person(s) responsible for the intellectual or artistic content of an item and any corporate bodies from which it emanates.

Any number of such statements may occur within the title statement. At a minimum, identify the author of the text and (where appropriate) the creator of the file.
If the bibliographic description is for a corpus, identify the creator of the corpus. Optionally include also names of others involved in the transcription or elaboration of the text, sponsors, and funding agencies. The name of the person responsible for physical data input need not normally be recorded, unless that person is also intellectually responsible for some aspect of the creation of the file.

Where the person whose responsibility is to be documented is not an author, sponsor, funding body, or principal researcher, the <respStmt> element should be used. This has two subcomponents: a <name> element identifying a responsible individual or organization, and a <resp> element indicating the nature of the responsibility. No specific recommendations are made at this time as to appropriate content for the <resp>: it should make clear the nature of the responsibility concerned, as in the examples below.

Names given may be personal names or corporate names. Give all names in the form in which the persons or bodies wish to be publicly cited. This would usually be the fullest form of the name, including first names (see note 2).

Examples:

    <titleStmt>
      <title>Capgrave's Life of St. John Norbert: a machine-readable transcription</title>
      <respStmt>
        <resp>compiled by</resp>
        <name>P.J. Lucas</name>
      </respStmt>
    </titleStmt>

    <titleStmt>
      <title>Two stories by Edgar Allan Poe: electronic version</title>
      <author>Poe, Edgar Allan (1809-1849)</author>
      <respStmt>
        <resp>compiled by</resp>
        <name>James D. Benson</name>
      </respStmt>
    </titleStmt>

    <titleStmt>
      <title>Yogadarśanam (arthāt yogasūtrapāṭhaḥ): a digital edition.</title>
      <title>The Yogasūtras of Patañjali: a digital edition.</title>
      <funder>Wellcome Institute for the History of Medicine</funder>
      <principal>Dominik Wujastyk</principal>
      <respStmt>
        <name>Wieslaw Mical</name>
        <resp>data entry and proof correction</resp>
      </respStmt>
      <respStmt>
        <name>Jan Hajic</name>
        <resp>conversion to TEI-conformant markup</resp>
      </respStmt>
    </titleStmt>

Note 2: Agencies compiling catalogues of machine-readable files are recommended to use available authority lists, such as the Library of Congress Name Authority List, for all common personal names.

2.2.2 The Edition Statement

The <editionStmt> element is the second component of the <fileDesc> element. It is optional but recommended.

<editionStmt> (edition statement) groups information relating to one edition of a text.

It contains either phrases or more specialized elements identifying the edition and those responsible for it:

<edition> (edition) describes the particularities of one edition of a text.
<respStmt> (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply.
<name> (name, proper noun) contains a proper noun or noun phrase.
<resp> (responsibility) contains a phrase describing the nature of a person's intellectual responsibility.

For printed texts, the word edition applies to the set of all the identical copies of an item produced from one master copy and issued by a particular publishing agency or a group of such agencies. A change in the identity of the distributing body or bodies does not normally constitute a change of edition, while a change in the master copy does.

For electronic texts, the notion of a `master copy' is not entirely appropriate, since they are far more easily copied and modified than printed ones; nonetheless the term edition may be used for a particular state of a machine-readable text at which substantive changes are made and fixed. Synonymous terms used in these Guidelines are version, level, and release. The words revision and update, by contrast, are used for minor changes to a file which do not amount to a new edition.

No simple rule can specify how `substantive' changes have to be before they are regarded as producing a new edition, rather than a simple update. The general principle proposed here is that the production of a new edition entails a significant change in the intellectual content of the file, rather than its encoding or appearance. The addition of analytic coding to a text would thus constitute a new edition, while automatic conversion from one coded representation to another would not.
Changes relating to the character code or physical storage details, corrections of misspellings, simple changes in the arrangement of the contents and changes in the output format do not normally constitute a new edition, whereas the addition of new information (e.g. a linguistic analysis expressed in part-of-speech tagging, sound or graphics, referential links to external data sets) almost always does.

Clearly, there will always be borderline cases and the matter is somewhat arbitrary. The simplest rule is: if you think that your file is a new edition, then call it such. An edition statement is optional for the first release of a computer file; it is mandatory for each later release, though this requirement cannot be enforced by the parser.

Note that all changes in a file, whether or not they are regarded as constituting a new edition or simply a new revision, should be independently noted in the revision description section of the file header (see section 2.5. The Revision Description).

The <edition> element should contain phrases describing the edition or version, including the word edition, version, or equivalent, together with a number or date, or terms indicating difference from other editions such as new edition, revised edition etc. Any dates that occur within the edition statement should be marked with the <date> element. The n attribute of the <edition> element may be used as elsewhere to supply any formal identification (such as a version number) for the edition.

One or more <respStmt> elements may also be used to supply statements of responsibility for the edition in question. These may refer to individuals or corporate bodies and can indicate functions such as that of a reviser, or can name the person or body responsible for the provision of supplementary matter, of appendices, etc., in a new edition. For further detail on the <respStmt> element, see section 3.11. Bibliographic Citations and References.

Some examples follow:

    <editionStmt>
      <edition>Second draft, substantially extended, revised, and corrected.</edition>
    </editionStmt>
    <editionStmt>
      <edition>Student's edition, <date>June 1987</date></edition>
      <respStmt>
        <resp>New annotations by</resp>
        <name>George Brown</name>
      </respStmt>
    </editionStmt>

2.2.3 Type and Extent of File

The <extent> element is the third component of the <fileDesc> element. It is optional.

<extent> describes the approximate size of a text as stored on some carrier medium, whether digital or non-digital, specified in any convenient units.

For printed books, information about the carrier, such as the kind of medium used and its size, is of great importance in cataloguing procedures. The print-oriented rules for bibliographic description of an item's medium and extent need some re-interpretation when applied to electronic media. An electronic file exists as a distinct entity quite independently of its carrier and remains the same intellectual object whether it is stored on a magnetic tape, a CD-ROM, a set of floppy disks, or as a file on a mainframe computer. Since, moreover, these Guidelines are specifically aimed at facilitating transparent document storage and interchange, any purely machine-dependent information should be irrelevant as far as the file header is concerned.

This is particularly true of information about file-type, although library-oriented rules for cataloguing often distinguish two types of computer file: `data' and `programs'. This distinction is quite difficult to draw in some cases, for example, hypermedia or texts with built-in search and retrieval software.

Although it is equally system-dependent, some measure of the size of the computer file may be of use for cataloguing and other practical purposes. Because the measurement and expression of file size is fraught with difficulties, only very general recommendations are possible; the <extent> element is provided for this purpose. It contains a phrase indicating the size or approximate size of the computer file in one of the following ways:

* in bytes of a specified length (e.g.
`4000 16-bit bytes')
* as falling within a range of categories, for example:
  - less than 1 Mb
  - between 1 Mb and 5 Mb
  - between 6 Mb and 10 Mb
  - over 10 Mb
* in terms of any convenient logical units (for example, words or sentences, citations, paragraphs)
* in terms of any convenient physical units (for example, blocks, disks, tapes)

The use of standard abbreviations for units of quantity is recommended where applicable, here as elsewhere (see http://physics.nist.gov/cuu/Units/binary.html).

Examples:

    <extent>between 1 16-bit MB and 2 16-bit MB</extent>
    <extent>4.2 MiB</extent>
    <extent>4532 bytes</extent>
    <extent>3200 sentences</extent>
    <extent>5 90 mm High Density Diskettes</extent>

2.2.4 Publication, Distribution, etc.

The <publicationStmt> element is the fourth component of the <fileDesc> element and is mandatory.

<publicationStmt> (publication statement) groups information concerning the publication or distribution of an electronic or other text.

It may contain either a simple prose description organized as one or more paragraphs, or one or more elements from the model.publicationStmtPart class. This class groups a number of elements which are discussed in order below.

<publisher> provides the name of the organization responsible for the publication or distribution of a bibliographic item.
<distributor> supplies the name of a person or other agency responsible for the distribution of a text.
<authority> (release authority) supplies the name of a person or other agency responsible for making an electronic file available, other than a publisher or distributor.

The publisher is the person or institution by whose authority a given edition of the file is made public. The distributor is the person or institution from whom copies of the text may be obtained. Where a text is not considered formally published, but is nevertheless made available for circulation by some individual or organization, this person or institution is termed the release authority.

At least one of the above three elements must be present, unless the entire publication statement is given as prose.
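The two alternatives just described might be sketched as follows. Both fragments are illustrations only: the organization named and the wording of the prose are invented for the purpose, and only the element names are taken from the descriptions above.

```xml
<!-- Structured form: the text circulates without being formally
     published, so the body making it available is recorded as the
     release authority. The name is hypothetical. -->
<publicationStmt>
  <authority>Department of Linguistics, Example University</authority>
</publicationStmt>

<!-- Prose form: the entire statement is given as one or more
     paragraphs, and none of the three elements is then needed. -->
<publicationStmt>
  <p>Unpublished working transcription, circulated among project
     members only.</p>
</publicationStmt>
```

The structured form is usually preferable where the information is to be processed automatically, for example when building a catalogue; the prose form is adequate where a human reader is the only expected consumer.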
Each of the <publisher>, <distributor>, and <authority> elements may be followed by one or more of the following elements, in the following order (see note 3):

<pubPlace> (publication place) contains the name of the place where a bibliographic item was published.

<address> contains a postal address, for example of a publisher, an organization, or an individual.
<idno> (identifying number) supplies any standard or non-standard number used to identify a bibliographic item.
    @type categorizes the number, for example as an ISBN or other standard series.
<availability> supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc.
    @status supplies a code identifying the current availability of the text.
<date> contains a date in any format.

Note 3: This constraint is not however enforced by the current version of the TEI Guidelines.

Note that the dates, places, etc., given in the publication statement relate to the publisher, distributor, or release authority most recently mentioned. If the text was created at some date other than its date of publication, its date of creation should be given within the <profileDesc> element, not in the publication statement. Give any other useful dates (e.g., dates of collection of data) in a note.

Additional detailed elements may be used for the encoding of names, dates, and addresses, as further described in section 3.5. Names, Numbers, Dates, Abbreviations, and Addresses, when the module described in chapter 13. Names, Dates, People, and Places is included in a schema.

Examples:

    <publicationStmt>
      <publisher>Oxford University Press</publisher>
      <pubPlace>Oxford</pubPlace>
      <date>1989</date>
      <idno type="ISBN">0-19-254705-4</idno>
      <availability>
        <p>Copyright 1989, Oxford University Press</p>
      </availability>
    </publicationStmt>

    <publicationStmt>
      <authority>James D. Benson</authority>
      <pubPlace>London</pubPlace>
      <date>1984</date>
    </publicationStmt>

    <publicationStmt>
      <publisher>Sigma Press</publisher>
      <address>
        <addrLine>21 High Street,</addrLine>
        <addrLine>Wilmslow,</addrLine>
        <addrLine>Cheshire M24 3DF</addrLine>
      </address>
      <date>1991</date>
      <distributor>Oxford Text Archive</distributor>
      <idno>1256</idno>
      <availability>
        <p>Available with prior consent of depositor for purposes of academic research and teaching only.</p>
      </availability>
    </publicationStmt>
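The status attribute of <availability> described above is not exercised by the preceding examples. A minimal sketch of its use follows; it reuses the distributor and wording of the last example, and assumes that restricted is among the codes permitted by the schema in use (this release of the Guidelines is understood to suggest free, unknown, and restricted, but the reference documentation for the element should be checked).

```xml
<publicationStmt>
  <distributor>Oxford Text Archive</distributor>
  <idno>1256</idno>
  <!-- status="restricted" is an assumed code indicating that the
       text is not freely available; the prose states the terms. -->
  <availability status="restricted">
    <p>Available with prior consent of depositor for purposes of
       academic research and teaching only.</p>
  </availability>
</publicationStmt>
```

Supplying the code as well as the prose allows software to filter texts by availability without having to interpret the wording of the statement.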

2.2.5 The Series Statement

The <seriesStmt> element is the fifth component of the <fileDesc> element and is optional.

<seriesStmt> (series statement) groups information about the series, if any, to which a publication belongs.

In bibliographic parlance, a series may be defined in one of the following ways:

* A group of separate items related to one another by the fact that each item bears, in addition to its own title proper, a collective title applying to the group as a whole. The individual items may or may not be numbered.
* Each of two or more volumes of essays, lectures, articles, or other items, similar in character and issued in sequence.
* A separately numbered sequence of volumes within a series or serial.

The <seriesStmt> element may contain a prose description or one or more of the following more specific elements:

<title> contains a title for any kind of work.
<idno> (identifying number) supplies any standard or non-standard number used to identify a bibliographic item.
<respStmt> (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply.
<resp> (responsibility) contains a phrase describing the nature of a person's intellectual responsibility.
<name> (name, proper noun) contains a proper noun or noun phrase.

The <idno> may be used to supply any identifying number associated with the item, including both standard numbers such as an ISSN and particular issue numbers. (Arabic numerals separated by punctuation are recommended for this purpose: 6.19.33, for example, rather than VI/xix:33). Its type attribute is used to categorize the number further, taking the value ISSN for an ISSN, for example.

Examples:

    <seriesStmt>
      <title level="s">Machine-Readable Texts for the Study of Indian Literature</title>
      <respStmt>
        <resp>ed. by</resp>
        <name>Jan Gonda</name>
      </respStmt>
      <idno>1.2</idno>
      <idno type="ISSN">0 345 6789</idno>
    </seriesStmt>

2.2.6 The Notes Statement

The <notesStmt> element is the sixth component of the <fileDesc> element and is optional.
If used, it contains one or more <note> elements, each containing a single piece of descriptive information of the kind treated as `general notes' in traditional bibliographic descriptions.

<notesStmt> (notes statement) collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description.
<note> contains a note or annotation.

Some information found in the notes area in conventional bibliography has been assigned specific elements in these Guidelines; in particular the following items should be tagged as indicated, rather than as general notes:

* the nature, scope, artistic form, or purpose of the file; also the genre or other intellectual category to which it may belong: e.g. `Text types: newspaper editorials and reportage, science fiction, westerns, and detective stories'. These should be formally described within the <profileDesc> element (section 2.4. The Profile Description).
* summary description providing a factual, non-evaluative account of the subject content of the file: e.g. `Transcribes interviews on general topics with native speakers of English in 17 cities during the spring and summer of 1963.' These should also be formally described within the <profileDesc> element (section 2.4. The Profile Description).
* bibliographic details relating to the source or sources of an electronic text: e.g. `Transcribed from the Norton facsimile of the 1623 Folio'. These should be formally described in the <sourceDesc> element (section 2.2.7. The Source Description).
* further information relating to publication, distribution, or release of the text, including sources from which the text may be obtained, any restrictions on its use or formal terms on its availability. These should be placed in the appropriate division of the <publicationStmt> element (section 2.2.4. Publication, Distribution, etc.).
* publicly documented numbers associated with the file: e.g. `ICPSR study number 1803' or `Oxford Text Archive text number 1243'.
  These should be placed in an <idno> element within the appropriate division of the <publicationStmt> element. International Standard Serial Numbers (ISSN), International Standard Book Numbers (ISBN), and other internationally agreed upon standard numbers that uniquely identify an item should be treated in the same way, rather than as specialized bibliographic notes.

Nevertheless, the <notesStmt> element may be used to record potentially significant details about the file and its features, e.g.:

* dates, when they are relevant to the content or condition of the computer file: e.g. `manual dated 1983', `Interview wave I: Apr. 1989; wave II: Jan. 1990'
* names of persons or bodies connected with the technical production, administration, or consulting functions of the effort which produced the file, if these are not named in statements of responsibility in the title or edition statements of the file description: e.g. `Historical commentary provided by Mark Cohen'
* availability of the file in an additional medium or information not already recorded about the availability of documentation: e.g. `User manual is loose-leaf in eleven paginated sections'
* language of work and abstract, if not encoded in the <langUsage> element, e.g. `Text in English with summaries in French and German'
* the unique name assigned to a serial by the International Serials Data System (ISDS), if not encoded in an <idno>
* lists of related publications, either describing the source itself, or concerned with the creation or use of the electronic work, e.g. `Texts used in Burrows (1987)'

Each such item of information may be tagged using the general-purpose <note> element, which is described in section 3.8. Notes, Annotation, and Indexing. Groups of notes are contained within the <notesStmt> element, as in the following example:

<notesStmt>
 <note>Historical commentary provided by Mark Cohen.</note>
 <note>OCR scanning done at University of Toronto.</note>
</notesStmt>
There are advantages, however, to encoding such information with more precise elements elsewhere in the TEI header, when such elements are available. For example, the notes above might be encoded as follows:

<titleStmt>
 <title>...</title>
 <respStmt>
  <name>Mark Cohen</name>
  <resp>historical commentary</resp>
 </respStmt>
 <respStmt>
  <name>University of Toronto</name>
  <resp>OCR scanning</resp>
 </respStmt>
</titleStmt>

2.2.7 The Source Description

The <sourceDesc> element is the seventh and final component of the <fileDesc> element. It is a mandatory element and is used to record details of the source or sources from which a computer file is derived. This might be a printed text or manuscript, another computer file, an audio or video recording of some kind, or a combination of these. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form.

<sourceDesc> (source description) describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence.

The <sourceDesc> element may contain little more than a simple prose description, or a brief note stating that the document has no source:

<sourceDesc>
 <p>Born digital.</p>
</sourceDesc>

Alternatively, it may contain elements drawn from the following three classes:

model.biblLike groups elements containing a bibliographic description.
model.sourceDescPart groups elements which may be used inside <sourceDesc> and appear multiple times.
model.listLike groups list-like elements.

These classes make available by default a range of ways of providing bibliographic citations which specify the provenance of the text. For written or printed sources, the source may be described in the same way as any other bibliographic citation, using one of the following elements:

<bibl> (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
<biblStruct> (structured bibliographic citation) contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order.
<listBibl> (citation list) contains a list of bibliographic citations of any kind.

These elements are described in more detail in section 3.11. Bibliographic Citations and References. Using them, a source might be described in very simple terms:

<sourceDesc>
 <bibl>The first folio of Shakespeare, prepared by Charlton Hinman (The Norton Facsimile, 1968)</bibl>
</sourceDesc>

or with more elaboration:

<sourceDesc>
 <biblStruct xml:lang="fr">
  <monogr>
   <author>Eugène Sue</author>
   <title>Martin, l'enfant trouvé</title>
   <title type="sub">Mémoires d'un valet de chambre</title>
   <imprint>
    <pubPlace>Bruxelles et Leipzig</pubPlace>
    <publisher>C. Muquardt</publisher>
    <date when="1846">1846</date>
   </imprint>
  </monogr>
 </biblStruct>
</sourceDesc>

When the header describes a text derived from some pre-existing TEI-conformant or other digital document, it may be simpler to use the following element:

<biblFull> (fully-structured bibliographic citation) contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.

This element is designed specifically for documents derived from texts which were `born digital', as further discussed in section 2.2.8. Computer Files Derived from Other Computer Files.
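As a rough illustration of the <biblFull> option just described, the following sketch shows how the file description of an earlier electronic edition might be nested inside the <sourceDesc> of a new file. All titles, names, and the distributor are invented for illustration, and the choice of child elements is only one plausible arrangement; it should be checked against the schema in use.

```xml
<!-- Hypothetical example: every title and name below is invented -->
<sourceDesc>
  <biblFull>
    <titleStmt>
      <title>Sample Poems: an electronic edition</title>
    </titleStmt>
    <publicationStmt>
      <distributor>Example Text Archive</distributor>
    </publicationStmt>
    <!-- the earlier file's own source is carried along unchanged -->
    <sourceDesc>
      <bibl>Sample Poems. London: Example Press, 1900.</bibl>
    </sourceDesc>
  </biblFull>
</sourceDesc>
```

Nesting the earlier <sourceDesc> in this way keeps the chain of derivation visible: the printed book remains recorded as the source of the first electronic file, which is in turn the source of the new one.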
When the module for manuscript description is included in a schema, this class also makes available the following element:

<msDesc> (manuscript description) contains a description of a single identifiable manuscript.

This element enables the encoder to record very detailed information about one or more manuscript or analogous sources, as further discussed in 10. Manuscript Description.

The model.sourceDescPart class also makes available additional elements when additional modules are included. For example, when the spoken module is included, the <sourceDesc> element may also include the following special-purpose elements, intended for cases where an electronic text is derived from a spoken text rather than a written one:

<scriptStmt> (script statement) contains a citation giving details of the script used for a spoken text.
<recordingStmt> (recording statement) describes a set of recordings used as the basis for transcription of a spoken text.

Full descriptions of these elements and their contents are given in section 8.2. Documenting the Source of Transcribed Speech.

The source description may also include lists of names, persons, places, etc. when these are considered to form part of the source for an encoded document. When such information is recorded using the specialized elements discussed in the namesdates module (13. Names, Dates, People, and Places), the class model.listLike makes available the following elements to hold such information:

<listNym> (list of canonical names) contains a list of nyms, that is, standardized names for any thing.
<listOrg> (list of organizations) contains a list of elements, each of which provides information about an identifiable organization.
<listPerson> (list of persons) contains a list of descriptions, each of which provides information about an identifiable person or a group of people, for example the participants in a language interaction, or the people referred to in a historical source.
<listPlace> (list of places) contains a list of places, optionally followed by a list of relationships (other than containment) defined amongst them.

2.2.8 Computer Files Derived from Other Computer Files

If a computer file (call it B) is derived not from a printed source but from another computer file (call it A) which includes a TEI file header, then the source text of computer file B is another computer file, A. The four sections of A's file header will need to be incorporated into the new header for B in slightly differing ways, as listed below:

fileDesc
  A's file description should be copied into the <sourceDesc> section of B's file description, enclosed within a <biblFull> element.
profileDesc
  A's <profileDesc> should be copied into B's, in principle unchanged; it may however be expanded by project-specific information relating to B.
encodingDesc
  A's encoding practice may or (more likely) may not be the same as B's. Since the object of the encoding description is to define the relationship between the current file and its source, in principle only changes in encoding practice between A and B need be documented in B. The relationship between A and its source(s) is then only recoverable from the original header of A. In practice it may be more convenient to create a new complete <encodingDesc> for B based on A's.
revisionDesc
  B is a new computer file, and should therefore have a new revision description. If, however, it is felt useful to include some information from A's <revisionDesc>, for example dates of major updates or versions, such information must be clearly marked as relating to A rather than to B.

This concludes the discussion of the <fileDesc> element and its contents.

2.3 The Encoding Description

The <encodingDesc> element is the second major subdivision of the TEI header. It specifies the methods and editorial principles which governed the transcription or encoding of the text in hand and may also include sets of coded definitions used by other components of the header. Though not formally required, its use is highly recommended.
<encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.

The encoding description may contain paragraphs of text, marked up using the <p> element, or it may contain more specialised elements taken from the model.encodingPart class. By default, this class makes available the following elements:

<projectDesc> (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
<samplingDecl> (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
<editorialDecl> (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
<tagsDecl> (tagging declaration) provides detailed information about the tagging applied to a document.
<refsDecl> (references declaration) specifies how canonical references are constructed for this text.
<classDecl> (classification declarations) contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
<appInfo> (application information) records information about an application which has edited the TEI file.

Each of these elements is further described in the appropriate section below. Other modules have the ability to extend this class; examples are noted in section 2.3.8. Module-Specific Declarations.

2.3.1 The Project Description

The <projectDesc> element may be used to describe, in prose, the purpose for which a digital resource was created, together with any other relevant information concerning the process by which it was assembled or collected. This is of particular importance for corpora or miscellaneous collections, but may be of use for any text, for example to explain why one kind of encoding practice has been followed rather than another.

<projectDesc> (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.

For example:

<projectDesc>
 <p>Texts collected for use in the Claremont Shakespeare Clinic, June 1990.</p>
</projectDesc>

2.3.2 The Sampling Declaration

The <samplingDecl> element may be used to describe, in prose, the rationale and methods used in selecting texts, or parts of text, for inclusion in the resource.

<samplingDecl> (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.

It should include information about such matters as

* the size of individual samples
* the method or methods by which they were selected
* the underlying population being sampled
* the object of the sampling procedure used

but is not restricted to these.

<samplingDecl>
 <p>Samples of 2000 words taken from the beginning of the text.</p>
</samplingDecl>

It may also include a simple description of any parts of the source text included or excluded.

<samplingDecl>
 <p>Text of stories only has been transcribed. Pull quotes, captions, and advertisements have been silently omitted. Any mathematical expressions requiring symbols not present in the ISOnum or ISOpub entity sets have been omitted, and their place marked with a <gi>gap</gi> element.</p>
</samplingDecl>

A sampling declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the decls attribute of each text (or subdivision of the text) to which the sampling declaration applies may be used to supply a cross-reference to it, as further described in section 15.3. Associating Contextual Information with a Text.

2.3.3 The Editorial Practices Declaration

The <editorialDecl> element is used to provide details of the editorial practices applied during the encoding of a text.

<editorialDecl> (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.

It may contain a prose description only, or one or more of a set of specialized elements, members of the TEI model.editorialDeclPart class. Where an encoder wishes to record an editorial policy not covered by these elements, this may be done by adding a new element to this class, using the mechanisms discussed in chapter 23.2. Personalization and Customization.

Some of these policy elements carry attributes to support automated processing of certain well-defined editorial decisions; all of them contain a prose description of the editorial principles adopted with respect to the particular feature concerned. Examples of the kinds of questions which these descriptions are intended to answer are given in the list below.

<correction> (correction principles) states how and under what circumstances corrections have been made in the text.
  @status indicates the degree of correction applied to the text.
  @method indicates the method adopted to indicate corrections within the text.
  Was the text corrected during or after data capture? If so, were corrections made silently or are they marked using the tags described in section 3.4. Simple Editorial Changes? What principles have been adopted with respect to omissions, truncations, dubious corrections, alternate readings, false starts, repetitions, etc.?

<normalization> indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form.
  @source indicates the authority for any normalization carried out.
  @method indicates the method adopted to indicate normalizations within the text.
  Was the text normalized, for example by regularizing any non-standard spellings, dialect forms, etc.? If so, were normalizations performed silently or are they marked using the tags described in section 3.4. Simple Editorial Changes? What authority was used for the regularization? Also, what principles were used when normalizing numbers to provide the standard values for the value attribute described in section 3.5.3. Numbers and Measures, and what format was used for them?

<quotation> specifies editorial practice adopted with respect to quotation marks in the original.
  @marks (quotation marks) indicates whether or not quotation marks have been retained as content within the text.
  @form specifies how quotation marks are indicated within the text.
  How were quotation marks processed? Are apostrophes and quotation marks distinguished? How? Are quotation marks retained as content in the text or replaced by markup? Are there any special conventions regarding, for example, the use of single or double quotation marks when nested? Is the file consistent in its practice or has this not been checked?

<hyphenation> summarizes the way in which hyphenation in a source text has been treated in an encoded version of it.
  @eol (end-of-line) indicates whether or not end-of-line hyphenation has been retained in a text.
  Does the encoding distinguish `soft' and `hard' hyphens? What principle has been adopted with respect to end-of-line hyphenation where source lineation has not been retained? Have soft hyphens been silently removed, and if so what is the effect on lineation and pagination?

<segmentation> describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.
  How is the text segmented? If <s> or <seg> segmentation units have been used to divide up the text for analysis, how are they marked and how was the segmentation arrived at?

<stdVals> (standard values) specifies the format used when standardized date or number values are supplied.
  In most cases, attributes bearing standardized values (such as the when or when-iso attribute on dates) should conform to a defined W3C or ISO datatype. In cases where this is not appropriate, this element may be used to describe the standardization methods underlying the values supplied.

<interpretation> describes the scope of any analytic or interpretive information added to the text in addition to the transcription.
  Has any analytic or `interpretive' information been provided (that is, information which is felt to be non-obvious, or potentially contentious)? If so, how was it generated? How was it encoded? If feature-structure analysis has been used, are <fsdDecl> elements (section 18.11. Feature System Declaration) present?

Any information about the editorial principles applied not falling under one of the above headings should be recorded in a distinct list of items. Experience shows that a full record should be kept of decisions relating to editorial principles and encoding practice, both for future users of the text and for the project which produced the text in the first instance. Some simple examples follow:

<editorialDecl>
 <segmentation>
  <p><gi>s</gi> elements mark orthographic sentences and are numbered sequentially within their parent <gi>div</gi> element</p>
 </segmentation>
 <interpretation>
  <p>The part of speech analysis applied throughout section 4 was added by hand and has not been checked.</p>
 </interpretation>
 <correction>
  <p>Errors in transcription controlled by using the WordPerfect spelling checker.</p>
 </correction>
 <normalization>
  <p>All words converted to Modern American spelling following Websters 9th Collegiate dictionary.</p>
 </normalization>
 <quotation>
  <p>All opening quotation marks represented by entity reference odq; all closing quotation marks represented by entity reference cdq.</p>
 </quotation>
</editorialDecl>
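The attributes listed above (status and method on <correction>, method on <normalization>, marks and form on <quotation>, eol on <hyphenation>) allow some of these decisions to be recorded in machine-readable form alongside the prose. The following is a minimal sketch only: the prose statements are invented, and the attribute values shown are assumed to be among those permitted by the schema in use, which should be checked before they are relied upon.

```xml
<!-- Illustrative sketch: prose is invented; attribute values are assumptions -->
<editorialDecl>
  <correction status="medium" method="silent">
    <p>Obvious typographical errors were corrected without comment
       during proofreading.</p>
  </correction>
  <normalization method="markup">
    <p>Original spellings are retained; regularized forms are supplied
       using markup.</p>
  </normalization>
  <quotation marks="all" form="std">
    <p>All quotation marks are retained in the text, in a standardized form.</p>
  </quotation>
  <hyphenation eol="some">
    <p>End-of-line hyphens were removed except where the hyphen belongs
       to the word itself.</p>
  </hyphenation>
</editorialDecl>
```

An application can then select texts by, say, the degree of correction without having to interpret the prose, while human readers still find the full explanation in the paragraphs.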

An editorial practices declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the decls attribute of each text (or subdivision of the text) to which it applies may be used to supply a cross-reference to it, as further described in section 15.3. Associating Contextual Information with a Text.

2.3.4 The Tagging Declaration

The <tagsDecl> element is used to record the following information about the tagging used within a particular text:

* the namespace to which elements appearing within the transcribed text belong.
* how often particular elements appear within the text, so that a recipient can validate the integrity of a text during interchange.
* any comment relating to the usage of particular elements not specified elsewhere in the header.
* a default rendition applicable to all instances of an element.

This information is conveyed by the following elements:

<rendition> supplies information about the rendition or appearance of one or more elements in the source text.
  @scheme identifies the language used to describe the rendition.
<namespace> supplies the formal name of the namespace to which the elements documented by its children belong.
<tagUsage> supplies information about the usage of a specific element within a text.

The <tagsDecl> element consists of an optional sequence of <rendition> elements, each of which must bear a unique identifier, followed by an optional sequence of one or more <namespace> elements, each containing a series of <tagUsage> elements, one for each distinct element from that namespace occurring within the outermost <text> element of a TEI document.
2.3.4.1 Rendition

The <rendition> element allows the encoder to specify how one or more elements are rendered in the original source in any of the following ways:

* using an informal prose description
* using a standard stylesheet language such as CSS or XSL-FO
* using a project-defined formal language

One or more such specifications may be associated with elements of a document in two ways:

* the render attribute of the appropriate <tagUsage> element may be used to indicate a default rendition for all occurrences of the named element
* the global rendition attribute may be used on any element to indicate its rendition, over-riding any supplied default value

The global rend attribute may also be used to supply an informal description of the rendering for an element; if this is supplied in addition to the rendition attribute it takes precedence, just as it also overrides any default specified for that element.

For example, the following schematic shows how an encoder might specify that all <p> elements are by default to be rendered using one set of specifications identified as style1, while <hi> elements are to use a different set, identified as style2:

<tagsDecl>
 <rendition xml:id="style1">... description of one default rendition here ...</rendition>
 <rendition xml:id="style2">... description of another default rendition here ...</rendition>
 <namespace name="http://www.tei-c.org/ns/1.0">
  <tagUsage gi="p" render="#style1">...</tagUsage>
  <tagUsage gi="hi" render="#style2">...</tagUsage>
 </namespace>
</tagsDecl>

<p>This paragraph, mostly rendered in style1, contains a few words <hi>rendered in style2</hi></p>
<p rendition="#style2">This paragraph is all rendered in style2</p>
<p>This is back to style1</p>

As noted above, the content of the <rendition> element may describe the appearance of the source material using prose, a project-defined formal language, or either of the existing standard languages: the Cascading Stylesheet Language (Lie and Bos (eds.) (1999)) and the XML vocabulary for specifying formatting semantics which forms a part of the W3C's Extensible Stylesheet Language (Berglund (ed.) (2006)). The scheme attribute indicates which of these applies to a given <rendition> element, and takes the following values:

free   Informal free text description
css    Cascading Stylesheet Language
xslfo  Extensible Stylesheet Language Formatting Objects
other  A user-defined formal description language

In the following extended example we consider how best to capture the appearance of a typical early 20th century titlepage, such as the one shown in the figure accompanying this section. Elements for the encoding of the information on a titlepage are presented in 4.6. Title Pages; here we consider how we might go about encoding some of the visual information as well, using the <rendition> element and its corresponding attribute.

First we define a rendition element for each aspect of the source page rendition that we wish to retain. Details of CSS are given in Lie and Bos (eds.) (1999); we use it here simply to provide a vocabulary with which to describe such aspects as font size and style, letter and line spacing, and colour.

<rendition xml:id="center" scheme="css">text-align: center;</rendition>
<rendition xml:id="small" scheme="css">font-size: small;</rendition>
<rendition xml:id="large" scheme="css">font-size: large;</rendition>
<rendition xml:id="x-large" scheme="css">font-size: x-large;</rendition>
<rendition xml:id="xx-large" scheme="css">font-size: xx-large</rendition>
<rendition xml:id="expanded" scheme="css">letter-spacing: +3pt;</rendition>
<rendition xml:id="x-space" scheme="css">line-height: 150%;</rendition>
<rendition xml:id="xx-space" scheme="css">line-height: 200%;</rendition>
<rendition xml:id="red" scheme="css">color: red;</rendition>

The global rendition attribute can now be used to specify on any element which of the above rendition features apply to it. For example, a title page might be encoded as follows:

<titlePage>
 <docTitle rendition="#center #x-space">
  <titlePart>
   <lb/><hi rendition="#x-large">THE POEMS</hi>
   <lb/><hi rendition="#small">OF</hi>
   <lb/><hi rendition="#red #xx-large">ALGERNON CHARLES SWINBURNE</hi>
   <lb/><hi rendition="#large #xx-space">IN SIX VOLUMES</hi>
  </titlePart>
  <titlePart rendition="#xx-space">VOLUME I.
   <lb/><hi rendition="#red #x-large">POEMS AND BALLADS</hi>
   <lb/><hi rendition="#x-space">FIRST SERIES</hi>
  </titlePart>
 </docTitle>
 <docImprint rendition="#center">
  <lb/><pubPlace rendition="#xx-space">LONDON</pubPlace>
  <lb/><publisher rendition="#red #expanded">CHATTO &amp; WINDUS</publisher>
  <lb/><docDate when="1904" rendition="#small">1904</docDate>
 </docImprint>
</titlePage>

Source: [191]

2.3.4.2 Tag usage

As noted above, each <namespace> element, if present, should contain exactly one occurrence of a <tagUsage> element for each distinct element from the given namespace that occurs within the outermost <text> element associated with the <teiHeader> in which it appears.[4] The <tagUsage> element is used to supply a count of the number of occurrences of this element within the text, which is given as the value of its occurs attribute. It may also be used to hold any additional usage information, which is supplied as running prose within the element itself. For example:

<tagUsage gi="hi" occurs="28">Used only to mark English words italicised in the copy text.</tagUsage>

This indicates that the <hi> element appears a total of 28 times in the <text> element in question, and that the encoder has used it to mark italicised English words only.

The withId attribute may optionally be used to specify how many of the occurrences of the element in question bear a value for the global xml:id attribute, as in the following example:

<tagUsage gi="pb" occurs="321" withId="321">Marks page breaks in the York (1734) edition only</tagUsage>

This indicates that the <pb> element occurs 321 times, on each of which an identifier is provided.

The content of the <tagUsage> element is not susceptible of automatic processing. It should not therefore be used to hold information for which provision is already made by other components of the encoding description.

A TEI conformant document is not required to provide any <tagUsage> elements, but if it does, then TEI recommended practice is to provide <tagUsage> and <namespace> elements for each distinct element and namespace used in the associated text. If, in addition, counts are specified by the occurs attributes, these must correspond with the number of such elements present in the document.

2.3.5 The Reference System Declaration

The <refsDecl> element is used to document the way in which any standard referencing scheme built into the encoding works.
It may contain either a series of prose paragraphs or the following specialized elements:

<refsDecl> (references declaration) specifies how canonical references are constructed for this text.
<cRefPattern> (canonical reference pattern) specifies an expression and replacement pattern for transforming a canonical reference into a URI.
<refState> (reference state) specifies one component of a canonical reference defined by the milestone method.

Note that not all possible referencing schemes are equally easily supported by current software systems. A choice must be made between the convenience of the encoder and the likely efficiency of the particular software applications envisaged, in this context as in many others. For a more detailed discussion of referencing systems supported by these Guidelines, see section 3.10. Reference Systems below.

A referencing scheme may be described in one of three ways using this element:

* as a prose description
* as a series of pairs of regular expressions and XPaths
* as a concatenation of sequentially organized milestones

Each method is described in more detail below. Only one method can be used within a single <refsDecl> element.

More than one <refsDecl> element can be included in the header if more than one canonical reference scheme is to be used in the same document, but the current proposals do not check for mutual inconsistency.

[4] In the case of a TEI corpus (15. Language Corpora), a <tagsDecl> in a corpus header will describe tag usage across the whole corpus, while one in an individual text header will describe tag usage for the individual text concerned.

2.3.5.1 Prose Method

The referencing scheme may be specified within the <refsDecl> by a simple prose description. Such a description should indicate which elements carry identifying information, and whether this information is represented as attribute values or as content. Any special rules about how the information is to be interpreted when reading or generating a reference string should also be specified here.
Such a prose description cannot be processed automatically, and this method of specifying the structure of a canonical reference system is therefore not recommended for automatic processing. For example:

<refsDecl>
 <p>The n attribute of each text in this corpus carries a unique identifying code for the whole text. The title of the text is held as the content of the first head element within each text. The n attribute on each div1 and div2 contains the canonical reference for each such division, in the form 'XX.yyy', where XX is the book number in Roman numerals, and yyy the section number in arabic. Line breaks are marked by empty lb elements, each of which includes the through line number in Casaubon's edition as the value of its n attribute.</p>
 <p>The through line number and the text identifier uniquely identify any line. A canonical reference may be made up by concatenating the n values from the text, div1, or div2 and calculating the line number within each part.</p>
</refsDecl>

2.3.5.2 Search-and-Replace Method

This method often requires a significant investment of effort initially, but permits extremely flexible addressing. For details, see section 16.2.5. Canonical References.

<cRefPattern> (canonical reference pattern) specifies an expression and replacement pattern for transforming a canonical reference into a URI.

2.3.5.3 Milestone Method

This method is appropriate when only `milestone' tags (see section 3.10.3. Milestone Elements) are available to provide the required referencing information. It does not provide any abilities which cannot be mimicked by the search-and-replace referencing method discussed in the previous section, but in the cases where it applies, it provides a somewhat simpler notation.

A reference based on milestone tags concatenates the values specified by one or more such tags. Since each tag marks the point at which a value changes, it may be regarded as specifying the refState of a variable. A reference declaration using this method therefore specifies the individual components of the canonical reference as a sequence of <refState> elements:

<refState> (reference state) specifies one component of a canonical reference defined by the milestone method.
  @unit indicates what kind of state is changing at this milestone.
  @delim (delimiter) supplies a delimiting string following the reference component.
  @length specifies the fixed length of the reference component.

For example, the reference `Matthew 12:34' might be thought of as representing the state of three variables: the book variable is in state `Matthew'; the chapter variable is in state `12', and the verse variable is in state `34'. If milestone tagging has been used, there should be a tag marking the point in the text at which each of the above `variables' changes its state.[5] To find `Matthew 12:34' therefore an application must scan left to right through the text, monitoring changes in the state of each of these three variables as it does so.
When all three are simultaneously in the required state, the desired point will have been reached. There may of course be several such points.

The delim and length attributes are used to specify components of a canonical reference using this method in exactly the same way as for the stepwise method described in the preceding section. The other attributes are used to determine which instances of <milestone> tags in the text are to be checked for state-changes. A state-change is signalled whenever a new <milestone> tag is found with unit and, optionally, ed attributes identical to those of the <refState> element in question. The value for the new state may be given explicitly by the n attribute on the <milestone> element, or it may be implied, if the n attribute is not specified.

For example, for canonical references in the form xx.yyy where the xx represents the page number in the first edition, and yyy the line number within this page, a reference system declaration such as the following would be appropriate:

<refsDecl>
 <refState ed="first" unit="page" length="2" delim="."/>
 <refState ed="first" unit="line" length="3"/>
</refsDecl>

This implies that milestone tags of the form

<milestone n="II" ed="first" unit="page"/>
<milestone ed="first" unit="line"/>

will be found throughout the text, marking the positions at which page and line numbers change. Note that no value has been specified for the n attribute on the second milestone tag above; this implies that its value at each state change is monotonically increased. For more detail on the use of milestone tags, see section 3.10.3. Milestone Elements.

The milestone referencing scheme, though conceptually simple, is not supported by a generic SGML or XML parser. Its use places a correspondingly greater burden of verification and accuracy on the encoder.

A reference system declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the decls attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section 15.3. Associating Contextual Information with a Text.
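To make the search-and-replace method of section 2.3.5.2 and the decls cross-reference more concrete, the following sketch shows a reference declaration that a text then points to. It is an illustration under stated assumptions, not a recommended pattern: the identifier biblicalRefs, the regular expression, the nested div structure, and the use of matchPattern and replacementPattern attributes with an xpath() pointer are all assumed here, and section 16.2.5. Canonical References should be consulted for the authoritative description.

```xml
<!-- Hypothetical sketch: the xml:id, both patterns, and the div
     structure they address are assumptions made for illustration -->
<encodingDesc>
  <refsDecl xml:id="biblicalRefs">
    <cRefPattern
      matchPattern="(\w+) (\d+):(\d+)"
      replacementPattern="#xpath(//div[@n='$1']/div[@n='$2']/div[@n='$3'])"/>
  </refsDecl>
</encodingDesc>

<!-- elsewhere in the same document: the text names the declaration
     that applies to it rather than repeating it -->
<text decls="#biblicalRefs">
  <!-- body omitted -->
</text>
```

Under these assumptions, a reference such as `Matthew 12:34' would match the pattern with the three captured groups `Matthew', `12', and `34', which are then substituted for $1, $2, and $3 in the replacement pattern.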
[5] On the <milestone> tag itself, what are here referred to as `variables' are identified by the combination of the ed and unit attributes.

2.3.6 The Classification Declaration

The <classDecl> element is used to group together definitions or sources for any descriptive classification schemes used by other parts of the header. Each such scheme is represented by a <taxonomy> element, which may contain either a simple bibliographic citation, or a definition of the descriptive typology concerned; the following elements are used in defining a descriptive classification scheme:

<classDecl> (classification declarations) contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
<taxonomy> defines a typology used to classify texts either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.
<category> contains an individual descriptive category, possibly nested within a superordinate category, within a user-defined taxonomy.
<catDesc> (category description) describes some category within a taxonomy or text typology, either in the form of a brief prose description or in terms of the situational parameters used by the TEI formal textDesc.

The <taxonomy> element has two slightly different, but related, functions. For well-recognized and documented public classification schemes, such as Dewey or other published descriptive thesauri, it contains simply a bibliographic citation indicating where a full description of a particular taxonomy may be found.

<taxonomy xml:id="DDC12">
 <bibl>
  <title>Dewey Decimal Classification</title>
  <edition>Abridged Edition 12</edition>
 </bibl>
</taxonomy>

For less easily accessible schemes, the <taxonomy> element contains a description of the taxonomy itself as well as an optional bibliographic citation. The description consists of a number of <category> elements, each defining a single category within the given typology. The category is defined by the contents of a nested <catDesc> element, which may contain either a phrase describing the category, or any number of elements from the model.catDescPart class.
When the corpus module is included in a schema, this class provides the <textDesc> element whose components allow the definition of a text type in terms of a set of `situational parameters' (see further section 15.2.1. The Text Description); if the corpus module is not included in a schema, this class is empty and the <catDesc> element may contain only plain text.

If the category is subdivided, each subdivision is represented by a nested <category> element, having the same structure. Categories may be nested to an arbitrary depth in order to reflect the hierarchical structure of the taxonomy. Each <category> element bears a unique xml:id attribute, which is used as the target for <catRef> elements referring to it. The taxonomy used in the Brown Corpus, for instance, has the following outline:

Brown Corpus
  Press Reportage
    Daily
    Sunday
    National
    Provincial
    Political
    Sports
  Religion
    Books
    Periodicals and tracts

Linkage between a particular text and a category within such a taxonomy is made by means of the <catRef> element within the <textClass> element, as described in section 2.4.3. The Text Classification. Where the taxonomy permits of classification along more than one dimension, more than one category will be referenced by a particular <catRef>; for example, a single <catRef> may identify a text with the sub-categories `Daily', `National', and `Political' within the category `Press Reportage' as defined above.

2.3.7 The Application Information Element

It is sometimes convenient to store information relating to the processing of an encoded resource within its header.
Typical uses for such information might be:

* to allow an application to discover that it has previously opened or edited a file, and what version of itself was used to do that;
* to show (through a date) which application last edited the file, to allow for diagnosis of any problems that might have been caused by that application;
* to allow users to discover information about an application used to edit the file;
* to allow the application to declare an interest in elements of the file which it has edited, so that other applications or human editors may be more wary of making changes to those sections of the file.

The class model.applicationLike provides an element, <application>, which may be used to record such information within the <appInfo> element.

<appInfo> (application information) records information about an application which has edited the TEI file.
<application> provides information about an application which has acted upon the document.
  @ident Supplies an identifier for the application, independent of its version number or display name.
  @version Supplies a version number for the application, independent of its identifier or display name.

Each <application> element identifies the current state of one software application with regard to the current file. This element is a member of the att.datable class, which provides a variety of attributes for associating this state with a date and time, or a temporal range. The ident and version attributes should be used to uniquely identify the application and its major version number (for example, ImageMarkupTool 1.5). It is not intended that an application should add a new <application> each time it touches the file.

These elements might be used, for example, to document the fact that version 1.5 of an application called `Image Markup Tool' has an interest in two parts of a document which was last saved on June 6 2006. The parts concerned are identified by the URLs given as target for two <ptr> elements within the <application> element.
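The Image Markup Tool scenario just described could be encoded along the following lines. This is a sketch under stated assumptions: the pointer targets #P1 and #P2 are hypothetical identifiers for the two parts of the document, and the choice of the when attribute (rather than another att.datable attribute such as notAfter) to carry the date is likewise an assumption.

```xml
<appInfo>
 <!-- ident and version together identify the application and its major version -->
 <application ident="ImageMarkupTool" version="1.5" when="2006-06-06">
  <label>Image Markup Tool</label>
  <!-- hypothetical pointers to the two parts in which the tool declares an interest -->
  <ptr target="#P1"/>
  <ptr target="#P2"/>
 </application>
</appInfo>
```

Because the element records current state rather than a history, an application revisiting the file would be expected to update this single <application> element instead of appending another.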
2.3.8 Module-Specific Declarations

The elements discussed so far are available to any schema. When the schema in use includes some of the more specialised TEI modules, these make available other more module-specific components of the encoding declaration. These are discussed fully in the documentation for the module in question, but are also noted briefly here for convenience.

The <fsdDecl> element is available only when the iso-fs module is included in a schema. Its purpose is to document the feature system declaration (as defined in section 18.11. Feature System Declaration) underlying any analytic feature structures (as defined in chapter 18. Feature Structures) present in the text documented by this header.

The <metDecl> element is available only when the verse module is included in a schema. Its purpose is to document any metrical notation scheme used in the text, as further discussed in section 6.3. Rhyme and Metrical Analysis. It consists either of a prose description or a series of <metSym> elements.

The <variantEncoding> element is available only when the textcrit module is included in a schema. Its purpose is to document the method used to encode textual variants in the text, as discussed in section 12.2. Linking the Apparatus to the Text.

2.4 The Profile Description

The <profileDesc> element is the third major subdivision of the TEI Header. It is an optional element, the purpose of which is to enable information characterizing various descriptive aspects of a text or a corpus to be recorded within a single unified framework.

<profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.

In principle, almost any component of the header might be of importance as a means of characterizing a text.
The author of a written text, its title or its date of publication, may all be regarded as characterizing it at least as strongly as any of the parameters discussed in this section. The rule of thumb applied has been to exclude from discussion here most of the information which generally forms part of a standard bibliographic style description, if only because such information has already been included elsewhere in the TEI header.

The core <profileDesc> element has three optional components, represented by the following elements:

<creation> contains information about the creation of a text.
<langUsage> (language usage) describes the languages, sublanguages, registers, dialects, etc. represented within a text.
<textClass> (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.

These elements are further described in the remainder of this section. Three other elements may also appear within the <profileDesc> element when the corpus module described in chapter 15. Language Corpora is included in a schema:

<textDesc> (text description) provides a description of a text in terms of its situational parameters.
<particDesc> (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
<settingDesc> (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.

For descriptions of these elements, see section 15.2. Contextual Information.

The following element can appear in the <profileDesc> element when the transcr module for the transcription of primary sources described in chapter 11. Representation of Primary Sources is included in a schema:

<handNotes> contains one or more <handNote> elements documenting the different hands identified within the source texts.

For a description of this element, see section 11.4.1. Document Hands.
Its purpose is to group together a number of <handNote> elements, each of which describes a different hand or equivalent identified within a manuscript. The <handNote> element can also appear within a structured manuscript description, when the msdescription module described in chapter 10. Manuscript Description is included in a schema. For this reason, the <handNote> element is actually declared within the header module, but is only accessible to a schema when one or other of the transcr or msdescription modules is included in a schema. See further the discussion at 11.4.1. Document Hands.

2.4.1 Creation

The <creation> element contains phrases describing the origin of the text, e.g. the date and place of its composition.

<creation> contains information about the creation of a text.

The date and place of composition are often of particular importance for studies of linguistic variation; since such information cannot be inferred with confidence from the bibliographic description of the copy text, the <creation> element may be used to provide a consistent location for this information, for example a date (August 1992) and a place (Taos, New Mexico).

2.4.2 Language Usage

The <langUsage> element is used within the <profileDesc> element to describe the languages, sublanguages, registers, dialects, etc. represented within a text. It contains one or more <language> elements, each of which provides information about a single language, notably the quantity of that language present in the text. Note that this element should not be used to supply information about any non-standard characters or glyphs used by this language; such information should be recorded in the <charDecl> element in the encoding description (see further 5. Representation of Non-standard Characters and Glyphs).

<langUsage> (language usage) describes the languages, sublanguages, registers, dialects, etc. represented within a text.
<language> characterizes a single language or sublanguage used within a text.
  @usage specifies the approximate percentage (by volume) of the text which uses this language.
  @ident (identifier) Supplies a language code constructed as defined in BCP 47 which is used to identify the language documented by this element, and which is referenced by the global xml:lang attribute.

A <language> element may be supplied for each different language used in a document. If used, its ident attribute should specify an appropriate language identifier, as further discussed in section vi.1 Language identification. This is particularly important if extended language identifiers have been used as the value of xml:lang attributes elsewhere in the document.

For example, a text might be documented as using three languages: Québecois, Canadian business English, and British English.

2.4.3 The Text Classification

The third component of the core <profileDesc> element is the <textClass> element. This element is used to classify a text according to one or more of the following methods:

* by reference to a recognized international classification such as the Dewey Decimal Classification, the Universal Decimal Classification, the Colon Classification, the Library of Congress Classification, or any other system widely used in library and documentation work
* by providing a set of keywords, as provided for example by British Library or Library of Congress Cataloguing in Publication data
* by referencing any other taxonomy of text categories recognized in the field concerned, or peculiar to the material in hand; this may include one based on recurring sets of values for the situational parameters defined in section 15.2.1. The Text Description, or the demographic elements described in section 15.2.2. The Participant Description

The last of these may be particularly important for dealing with existing corpora or collections, both as a means of avoiding the expense or inconvenience of reclassification and as a means of documenting the organizing principles of such materials.

The following elements are provided for this purpose:

<keywords> contains a list of keywords or phrases identifying the topic or nature of a text.
  @scheme identifies the controlled vocabulary within which the set of keywords concerned is defined.
<classCode> (classification code) contains the classification code used for this text in some standard classification system.
  @scheme identifies the classification system or taxonomy in use.
<catRef> (category reference) specifies one or more defined categories within some taxonomy or text typology.
  @target identifies the categories concerned

The <keywords> element simply categorizes an individual text by supplying a list of keywords which may describe its topic or subject matter, its form, date, etc. In some schemes, the order of items in the list is significant, for example, from major topic to minor; in others, the list has an organized substructure of its own. No recommendations are made here as to which method is to be preferred. Wherever possible, such keywords should be taken from a recognized source, such as the British Library/Library of Congress Cataloguing in Publication data in the case of printed books, or a published thesaurus appropriate to the field.

The scheme attribute should be used to indicate the source of the keywords used. This is done by supplying the value used for the xml:id attribute of a <taxonomy> element within which further details of the source concerned may be found. The <taxonomy> element occurs in the <classDecl> part of the encoding declarations within the TEI Header and is described in section 2.3.6. The Classification Declaration. For example, keyword lists such as the following might be supplied:

Data base management
SQL (Computer program language)

English literature -- History and criticism -- Data processing.
English literature -- History and criticism -- Theory, etc.
English language -- Style -- Data processing.
Style, Literary -- Data processing.

The <classCode> element also categorizes an individual text, by supplying a numerical or other code used in a recognized classification scheme, such as the Dewey Decimal Classification.
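The components of <profileDesc> discussed so far in section 2.4 can be sketched together as follows. This is a minimal illustration rather than a definitive encoding: the language identifiers and usage percentages are assumed values, and #LCSH and #DDC are hypothetical xml:id values of <taxonomy> elements that would have to be declared in the <classDecl>.

```xml
<profileDesc>
 <creation>
  <date when="1992-08">August 1992</date>
  <name type="place">Taos, New Mexico</name>
 </creation>
 <langUsage>
  <!-- ident values follow BCP 47; usage values are illustrative percentages -->
  <language ident="fr-CA" usage="60">Québecois</language>
  <language ident="en-CA" usage="20">Canadian business English</language>
  <language ident="en-GB" usage="20">British English</language>
 </langUsage>
 <textClass>
  <!-- scheme points at a taxonomy by its (hypothetical) xml:id -->
  <keywords scheme="#LCSH">
   <list>
    <item>Data base management</item>
    <item>SQL (Computer program language)</item>
   </list>
  </keywords>
  <classCode scheme="#DDC">005.756</classCode>
 </textClass>
</profileDesc>
```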
The scheme attribute is used to indicate the source of the classification scheme: this may be a pointer of any kind, either to a TEI <taxonomy> element, likely in the current document, as in the examples above, or to some canonical source for the scheme. Codes such as 005.756, QA76.9, and 820.285 might each be supplied in this way, every <classCode> pointing to the scheme from which its code is drawn.

The <catRef> element categorizes an individual text by pointing to one or more <category> elements. The <category> element (which is fully described in section 2.3.6. The Classification Declaration) holds information about a particular classification or category within a given taxonomy. Each such category must have a unique identifier, which may be supplied as the value of the target attribute for <catRef> elements in texts which are regarded as falling within the category indicated.

A text may, of course, fall into more than one category, in which case more than one identifier will be supplied as the value for the target attribute on the <catRef> element.

The scheme attribute may be supplied to specify the taxonomy to which the categories identified by the target attribute belong, if this is not adequately conveyed by the resource pointed to. For example, the same text might be classified by one <catRef> as of categories b.a4 and b.d2 within the Brown classification scheme (presumed to be available from http://www.example.com/browncorpus), and by a second <catRef> as of category `A45' within the SUC classification scheme, documented at the URL given as its scheme.

The distinction between the <catRef> and <classCode> elements is that the values used as identifying codes are exhaustively enumerated, typically within the header, for the former, while the latter may be used to indicate a more open-ended or descriptive classification system.

2.5 The Revision Description

The final sub-element of the TEI header, the <revisionDesc> element, provides a detailed change log in which each change made to a text may be recorded. Its use is optional but highly recommended.
It provides essential information for the administration of large numbers of files which are being updated, corrected, or otherwise modified as well as extremely useful documentation for files being passed from researcher to researcher or system to system. Without change logs, it is easy to confuse different versions of a file, or to remain unaware of small but important changes made in the file by some earlier link in the chain of distribution. No change should be made in any TEI-conformant file without corresponding entries being made in the change log.

<revisionDesc> (revision description) summarizes the revision history for a file.
<change> summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers.

The main purpose of the revision description is to record changes in the text to which a header is prefixed. However, it is recommended TEI practice to include entries also for significant changes in the header itself (other than the revision description itself, of course). At the very least, an entry should be supplied indicating the date of creation of the header.

The log consists of a list of entries, one for each change. This may be encoded using either the regular <list> element, as described in section 3.7. Lists, or as a series of special purpose <change> elements, each of which contains a more detailed description of the changes made. The attributes when and who are used to indicate the date of the change and the person responsible for it respectively. The description of the change itself can range from a simple phrase to a series of paragraphs. If a number is to be associated with one or more changes (for example, a revision number), the global n attribute may be used to indicate it.

It is recommended to give changes in reverse chronological order, most recent first. For example:

The Amorous Prince, or, the Curious Husband, 1671
Behn, Aphra
Caton, Paul (electronic publication editor)
Gui, Weihsin (encoder)
Wernimont, Jacqueline (encoder)

Changed drama.verse lgs to ps. We have opened a discussion about the need for a new value for type of lg, drama.free.verse, in order to address the verse of Behn which is not in regular iambic pentameter. For the time being these instances are marked with a comment note until we are able to fully consider the best way to encode these instances.
Added key and reg to names.
Completed renovation. Validated.

In the above example, the who attributes point to <respStmt> elements; they could equally well point to <person> elements.

2.6 Minimal and Recommended Headers

The TEI header allows for the provision of a very large amount of information concerning the text itself, its source, its encodings, and revisions of it, as well as a wealth of descriptive information such as the languages it uses and the situation(s) in which it was produced, together with the setting and identity of participants within it. This diversity and richness reflects the diversity of uses to which it is envisaged that electronic texts conforming to these Guidelines will be put.

It is emphatically not intended that all of the elements described above should be present in every TEI Header. The amount of encoding in a header will depend both on the nature and the intended use of the text. At one extreme, an encoder may expect that the header will be needed only to provide a bibliographic identification of the text adequate to local needs. At the other, wishing to ensure that their texts can be used for the widest range of applications, encoders will want to document as explicitly as possible both bibliographic and descriptive information, in such a way that no prior or ancillary knowledge about the text is needed in order to process it. The header in such a case will be very full, approximating to the kind of documentation often supplied in the form of a manual. Most texts will lie somewhere between these extremes; textual corpora in particular will tend more to the latter extreme.
In the remainder of this section we demonstrate first the minimal, and next a commonly recommended, level of encoding for the bibliographic information held by the TEI header.

Supplying only the minimal level of encoding required, the TEI header of a single text might look like the following example:

Thomas Paine: Common sense, a machine-readable transcript
compiled by Jon K Adams
Oxford Text Archive
The complete writings of Thomas Paine, collected and edited by Phillip S. Foner (New York, Citadel Press, 1945)

The only mandatory component of the TEI Header is the <fileDesc> element. Within this, <titleStmt>, <publicationStmt>, and <sourceDesc> are all required constituents. Within the title statement, a title is required, and an author should be specified, even if it is unknown, as should some additional statement of responsibility, here given by the <respStmt> element. Within the <publicationStmt>, a publisher, distributor, or other agency responsible for the file must be specified. Finally, the source description should contain at the least a loosely structured bibliographic citation identifying the source of the electronic text if (as is usually the case) there is one.

We now present the same example header, expanded to include additionally recommended information, adequate to most bibliographic purposes, in particular to allow for the creation of an AACR2-conformant bibliographic record. We have also added information about the encoding principles used in this (imaginary) encoding, about the text itself (in the form of Library of Congress subject headings), and about the revision of the file.

Common sense, a machine-readable transcript
Paine, Thomas (1737-1809)
compiled by Jon K Adams
1986
Oxford Text Archive.
Oxford University Computing Services, 13 Banbury Road, Oxford OX2 6RB, UK
Brief notes on the text are in a supplementary file.

Foner, Philip S.
The collected writings of Thomas Paine
New York
Citadel Press
1945

Editorial notes in the Foner edition have not been reproduced.

Blank lines and multiple blank spaces, including paragraph indents, have not been preserved.

The following errors in the Foner edition have been corrected:
p. 13 l. 7 cotemporaries contemporaries
p. 28 l. 26 [comma] [period]
p. 84 l. 4 kin kind
p. 95 l. 1 stuggle struggle
p. 101 l. 4 certainy certainty
p. 167 l. 6 than that
p. 209 l. 24 publshed published

No normalization beyond that performed by Foner, if any.

All double quotation marks rendered with ", all single quotation marks with apostrophe.

Hyphenated words that appear at the end of the line in the Foner edition have been reformed.

The values of when-iso on the time element always end in the format HH:MM or HH; i.e., seconds, fractions thereof, and time zone designators are not present.

Compound proper names are marked.

Dates are marked.

Italics are recorded without interpretation.

Library of Congress Subject Headings Library of Congress Classification
1774 English. Political science United States -- Politics and government -- Revolution, 1775-1783 JC 177 CMSMcQ finished proofreading L.B. finished proofreading R.G. finished proofreading R.G. finished data entry R.G. began data entry
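The minimal header described earlier in this section can be sketched in markup as follows. The element structure follows the stated requirements (a <fileDesc> containing <titleStmt>, <publicationStmt>, and <sourceDesc>); the exact choice of <resp> and <name> inside the <respStmt>, and of <distributor> for the Oxford Text Archive, is an assumption about how the responsibility and publication statements would be expressed.

```xml
<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>Thomas Paine: Common sense, a machine-readable transcript</title>
   <!-- statement of responsibility for the electronic transcript -->
   <respStmt>
    <resp>compiled by</resp>
    <name>Jon K Adams</name>
   </respStmt>
  </titleStmt>
  <publicationStmt>
   <!-- the agency responsible for making the file available -->
   <distributor>Oxford Text Archive</distributor>
  </publicationStmt>
  <sourceDesc>
   <!-- a loosely structured citation of the source text -->
   <bibl>The complete writings of Thomas Paine, collected and edited
    by Phillip S. Foner (New York, Citadel Press, 1945)</bibl>
  </sourceDesc>
 </fileDesc>
</teiHeader>
```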
Many other examples of recommended usage for the elements discussed in this chapter are provided here, in the reference index and in the associated tutorials.

2.7 Note for Library Cataloguers

A strong motivation in preparing the material in this chapter was to provide in the TEI file header a viable chief source of information for cataloguing computer files. The file header is not a library catalogue record, and so will not make all of the distinctions essential in standard library work. It also includes much information generally excluded from standard bibliographic descriptions. It is the intention of the developers, however, to ensure that the information required for a catalogue record be retrievable from the TEI file header, and moreover that the mapping from the one to the other be as simple and straightforward as possible. Where the correspondence is not obvious, it may prove useful to consult one of the works which were influential in developing the content of the TEI file header. These include:

ISBD(G): The International Standard Bibliographic Description (General) is an international standard setting out what information should be recorded in a description of a bibliographical item. There are also separate ISBDs covering different types of material, e.g. ISBD(M) for monographs, ISBD(ER) for electronic resources. These separate ISBDs follow the same general scheme as the main ISBD(G), but provide appropriate interpretations for the specific materials under consideration.

AACR2: The Anglo-American Cataloguing Rules, Second Edition, 2002 Revision: 2005 Update are the official guidelines for the construction of catalogues in general libraries in the English-speaking world. Other national cataloguing codes exist as well. AACR2 is explicitly based on the general framework of the ISBD(G) and the subsidiary ISBDs: it gives a description of how to catalogue items according to the ISBDs, and how to construct indexes and cross-references.
ANSI Z.39.29: ANSI Z.39.29 is an American national standard governing bibliographic references for use in bibliographies, end-of-work lists, references in abstracting and indexing publications, and outputs from computerized bibliographic data bases. This standard has however now been withdrawn, pending substantial revision. The international standard which covers the same area is ISO 690:1987. Other relevant standards include BS 1629:1989, BS 5605:1978, and BS 6371:1983.

2.8 The TEI Header Module

The module described in this chapter makes available the following components:

Module header: The TEI Header

* Elements defined: appInfo application authority availability biblFull cRefPattern catDesc catRef category change classCode classDecl correction creation distributor edition editionStmt editorialDecl encodingDesc extent fileDesc funder geoDecl handNote hyphenation idno interpretation keywords langUsage language namespace normalization notesStmt principal profileDesc projectDesc publicationStmt quotation refState refsDecl rendition revisionDesc samplingDecl segmentation seriesStmt sourceDesc sponsor stdVals tagUsage tagsDecl taxonomy teiHeader textClass titleStmt typeNote
* Classes defined: model.applicationLike model.editorialDeclPart model.encodingPart model.headerPart model.profileDescPart model.sourceDescPart

The selection and combination of modules to form a TEI schema is described in 1.2. Defining a TEI Schema.

Chapter 3 Elements Available in All TEI Documents

This chapter describes elements which may appear in any kind of text and the tags used to mark them in all TEI documents. Most of these elements are freely floating phrases, which can appear at any point within the textual structure, although they must generally be contained by a higher-level element of some kind (such as a paragraph).
A few of the elements described in this chapter (for example, bibliographic citations and lists) have a comparatively well-defined internal structure, but most of them have no consistent inner structure of their own. In the general case, they contain only a few words, and are often identifiable in a conventionally printed text by the use of typographic conventions such as shifts of font, use of quotation or other punctuation marks, or other changes in layout.

This chapter begins by describing the <p> tag used to mark paragraphs, the prototypical formal unit for running text in many TEI modules. This is followed, in section 3.2. Treatment of Punctuation, by a discussion of some specific problems associated with the interpretation of conventional punctuation, and the methods proposed by the Guidelines for resolving ambiguities therein.

The next section (section 3.3. Highlighting and Quotation) describes a number of phrase-level elements commonly marked by typographic features (and thus well-represented in conventional markup languages). These include features commonly marked by font shifts (section 3.3.2. Emphasis, Foreign Words, and Unusual Language) and features commonly marked by quotation marks (section 3.3.3. Quotation) as well as such features as terms, cited words, and glosses (section 3.3.4. Terms, Glosses, Equivalents, and Descriptions).

Section 3.4. Simple Editorial Changes introduces some phrase-level elements which may be used to record simple editorial interventions, such as emendation or correction of the encoded text. The elements described here constitute a simple subset of the full mechanisms for encoding such information (described in full in chapter 11. Representation of Primary Sources), which should be adequate to most commonly encountered situations.

The next section (section 3.5. Names, Numbers, Dates, Abbreviations, and Addresses) describes several phrase-level and inter-level elements which, although often of interest for analysis or processing, are rarely explicitly identified in conventional printing. These include names (section 3.5.1. Referring Strings), numbers and measures (section 3.5.3. Numbers and Measures), dates and times (section 3.5.4. Dates and Times), abbreviations (section 3.5.5. Abbreviations and Their Expansions), and addresses (section 3.5.2. Addresses).

In the same way, the following section (section 3.6.
Simple Links and Cross-References) presents only a subset of the facilities available for the encoding of cross-references or text-linkage. The full story may be found in chapter 16. Linking, Segmentation, and Alignment; the tags presented here are intended to be usable for a wide variety of simple applications.

Sections 3.7. Lists, and 3.8. Notes, Annotation, and Indexing, describe two kinds of quasi-structural elements: lists and notes. These may appear either within chunk-level elements such as paragraphs, or between them. Several kinds of lists are catered for, of an arbitrary complexity. The section on notes discusses both notes found in the source and simple mechanisms for adding annotations of an interpretive nature during the encoding; again, only a subset of the facilities described in full elsewhere (specifically, in chapter 17. Simple Analytic Mechanisms) is discussed.

Section 3.9. Graphics and other non-textual components introduces some simple ways of representing graphic or other non-textual content found in a text. A fuller discussion of the multimedia facilities supported by these Guidelines may be found in chapters 14. Tables, Formulæ, and Graphics and 16. Linking, Segmentation, and Alignment.

Next, section 3.10. Reference Systems, describes methods of encoding within a text the conventional system or systems used when making references to the text. Some reference systems have attained canonical authority and must be recorded to make the text useable in normal work; in other cases, a convenient reference system must be created by the creator or analyst of an electronic text.

Like lists and notes, the bibliographic citations discussed in section 3.11. Bibliographic Citations and References, may be regarded as structural elements in their own right.
A range of possibilities is presented for the encoding of bibliographic citations or references, which may be treated as simple phrases within a running text, or as highly-structured components suitable for inclusion in a bibliographic database.

Additional elements for the encoding of passages of verse or drama (whether prose or verse) are discussed in section 3.12. Passages of Verse or Drama.

The chapter concludes with a technical overview of the structure and organization of the module described here. This should be read in conjunction with chapter 1. The TEI Infrastructure, describing the structure of the TEI document type definition.

3.1 Paragraphs

The paragraph is the fundamental organizational unit for all prose texts, being the smallest regular unit into which prose can be divided. Prose can appear in all TEI texts, even those that are primarily of another genre (e.g., verse); thus the paragraph is described here, as an element which can appear in any kind of text.

Paragraphs can contain any of the other elements described within this chapter, as well as some other elements which are specific to individual text types. We distinguish phrase-level elements, which must be entirely contained within a paragraph and cannot appear except within one, from chunks, which can appear between, but not within, paragraphs, and from inter-level elements, which can appear either within a single paragraph or between paragraphs. The class of phrases includes emphasized or quoted phrases, names, dates, etc. The class of inter-level elements includes bibliographic citations, notes, lists, etc. The class of chunks includes the paragraph itself, and other elements which have similar structural properties, notably the <ab> (anonymous block) element (described in 16.3. Blocks, Segments, and Anchors), which may be used as an alternative to the paragraph in some kinds of texts.
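The three-way distinction between phrase-level elements, inter-level elements, and chunks can be made concrete with a small sketch. The text content below is invented purely for illustration, and the enclosing <div> is an assumption about where such material would sit.

```xml
<div>
 <!-- chunk: a paragraph, appearing between other chunks -->
 <p>The vessel sailed on
  <!-- phrase-level: may occur only inside a chunk -->
  <date when="1851-11-14">14 November 1851</date>,
  carrying <emph>two</emph> kinds of cargo:
  <!-- inter-level: a list occurring within a paragraph -->
  <list>
   <item>oil</item>
   <item>bone</item>
  </list>
 </p>
 <!-- inter-level: the same element type occurring between paragraphs -->
 <list>
  <item>first port of call</item>
  <item>second port of call</item>
 </list>
 <p>A second paragraph follows the free-standing list.</p>
</div>
```

The same <list> element is legal in both positions, which is what marks it as inter-level; <date> and <emph> would not be permitted as direct children of the <div>.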
Because paragraphs may appear in different base or additional tag sets, their possible contents may differ in different kinds of documents. In particular, additional elements not listed in this chapter may appear in paragraphs in certain kinds of text. However, the elements described in this chapter are always by default available in all kinds of text.

The paragraph is marked using the <p> element:

<p> (paragraph) marks paragraphs in prose.

If a consistent internal subdivision of paragraphs is desired, the <seg> (`segment') or <s> elements may be used, as discussed in chapters 16. Linking, Segmentation, and Alignment and 17. Simple Analytic Mechanisms respectively. More usually, however, paragraphs have no firm internal structure, but contain prose encoded as a mix of characters, entity references, phrases marked as described in the rest of this chapter, and embedded elements like lists, figures, or tables.

Since paragraphs are usually explicitly marked in Western texts, typically by indentation, the application of the <p> tag usually presents few problems.

In some cases, the body of a text may comprise but a single paragraph:

<body> <p>I fully appreciate Gen. Pope's splendid achievements with their invaluable results; but you must know that Major Generalships in the Regular Army, are not as plenty as blackberries.</p> </body>

Source: [130]

This news story shows typically short journalistic paragraphs:

SARAJEVO, Bosnia and Herzegovina, April 19

<p>Serbs seized more territory in this struggling new country today as the United States Air Force ended a two-day airlift of humanitarian aid into the capital, Sarajevo.</p>

<p>International relief workers called on European Community nations to step up their humanitarian aid to the former Yugoslav republic, in conjunction with new American aid flights if necessary.</p>

<p>A special envoy from the European Community, Colin Doyle, harshly condemned the decision by Serbs to shell Sarajevo on Saturday night during a visit to the Bosnian capital by a senior American official, Deputy Assistant Secretary of State Ralph R. Johnson.</p>

...

The following extract from a Russian fairy tale demonstrates how other phrase level elements (in this case <q> elements representing direct speech; see section 3.3.3. Quotation) may be nested within, but not across, paragraphs:

<p>A fly built a castle, a tall and mighty castle. There came to the castle the Crawling Louse. <q>Who, who's in the castle? Who, who's in your house?</q> said the Crawling Louse. <q>I, I, the Languishing Fly. And who art thou?</q> <q>I'm the Crawling Louse.</q></p>

<p>Then came to the castle the Leaping Flea. <q>Who, who's in the castle?</q> said the Leaping Flea. <q>I, I, the Languishing Fly, and I, the Crawling Louse. And who art thou?</q> <q>I'm the Leaping Flea.</q></p>

<p>Then came to the castle the Mischievous Mosquito. <q>Who, who's in the castle?</q> said the Mischievous Mosquito. <q>I, I, the Languishing Fly, and I, the Crawling Louse, and I, the Leaping Flea. And who art thou?</q> <q>I'm the Mischievous Mosquito.</q></p>

Source: [32]

3.2 Treatment of Punctuation

Punctuation marks cause problems for text markup when they are not available in the character set used and when they are significantly ambiguous. To a large extent, the availability of the Unicode character set addresses most such problems, since it provides specific code points for most punctuation marks, and also distinguishes glyphs (such as stop, comma, and hyphen) which are used with different functions. Thus, for example, different Unicode code points are available for the hyphen used as a minus sign, as a word breaking hyphen, as a soft hyphen, or as a `non-breaking' hyphen. The facilities described in chapter 5. Representation of Non-standard Characters and Glyphs may also be used to define markup for non-standard punctuation characters.

Full stop (period) may mark (orthographic) sentence boundaries, abbreviations, decimal points, or serve as a visual aid in printing numbers. These usages can be distinguished by tagging S-units, abbreviations, and numbers, as described in sections 16.3. Blocks, Segments, and Anchors, 3.5.5. Abbreviations and Their Expansions, and 3.5.3. Numbers and Measures. However, there are independent reasons for tagging these, whether or not they are marked by full stops, and the polysemy of the full stop itself is perhaps no different from that of any character in the writing system.

Question mark and exclamation mark typically mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author (! to express surprise or some other strong feeling, ? to query a word or expression or mark a sentence as dubious in linguistic discussion). These uses may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked, or tagged using the <c> element discussed in 17.1. Linguistic Segment Categories.
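The disambiguation of full stops by tagging, described earlier in this section, might look as follows. This is a minimal sketch: the sentences are invented, and it assumes that the <s> element (which belongs to the analysis module rather than to the core) is available in the schema in use.

```xml
<!-- Illustrative fragment; the text content is invented. -->
<p>
  <!-- each <s> marks one orthographic sentence, so the final stop is unambiguous -->
  <s><abbr>Dr.</abbr> Grey paid <num value="3.5">3.5</num> guineas.</s>
  <!-- here the stop inside the number is only a visual aid for the thousands -->
  <s>The bill came to <num value="1500">1.500</num> francs.</s>
</p>
```

Once the abbreviation and the numbers are tagged, any remaining full stop at the end of an <s> can safely be read as a sentence boundary without inspecting the character itself.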
Dashes are used for a variety of purposes: insertion, interruption, new speaker (in dialogue), list item. In the latter two cases it is preferable to mark the underlying feature using the elements <q> or <item>, on which see section 3.3.3. Quotation, and section 3.7. Lists, respectively.

Quotation marks may be removed from text contained by <q> or <quote> elements, especially as quotations are not always marked by quotation marks (notably long quotations) or may be marked in a variety of ways; see the discussion of quotation and related features in section 3.3.3. Quotation.

Apostrophes must be distinguished from single quote marks. As with hyphens, this disambiguation may be performed by selecting an appropriate Unicode character, but it may also be represented by using explicit XML tags for quotations as suggested above. However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Full disambiguation of these uses belongs to the level of linguistic analysis and interpretation.

Parentheses and other marks of suspension such as dashes or ellipses are often used to signal information about the syntactic structure of a text fragment. Full disambiguation of their uses also belongs to the level of linguistic analysis and interpretation, and is therefore discussed in chapter 17. Simple Analytic Mechanisms.

Where punctuation marks are disambiguated by tagging the underlying feature they signal, it may be debated whether they should be excluded or left as part of the text. In the case of quotation marks, it may be more convenient to distinguish opening from closing marks simply by using the appropriate Unicode character than to use the <q> element, with or without a rend attribute. The solution chosen will vary depending upon the feature and depending upon the purpose of the project.
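The Unicode distinctions mentioned in this section can be made explicit with numeric character references. The fragment below is a sketch with invented text; the code points shown are one reasonable selection, and a project may equally enter the characters directly or choose differently (for instance, using U+0027 rather than U+2019 for the apostrophe).

```xml
<!-- Illustrative fragment; the text content is invented. -->
<p>
  <!-- U+2212 MINUS SIGN -->
  The temperature fell to &#x2212;5 degrees.
  <!-- U+2010 HYPHEN, an ordinary hyphen joining two words -->
  She is a well&#x2010;known author.
  <!-- U+00AD SOFT HYPHEN, marking a permitted break point -->
  It was un&#xAD;believable.
  <!-- U+2011 NON-BREAKING HYPHEN -->
  Send an e&#x2011;mail.
  <!-- U+2019 used as an apostrophe; the quotation itself is marked
       by an element, so no quotation mark characters are needed -->
  <q>It&#x2019;s late,</q> she said.
</p>
```

Because the quotation is carried by <q>, the only single-quote-like character left in the text is the apostrophe, which removes the ambiguity discussed above.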
3.3 Highlighting and Quotation

This section deals with a variety of textual features, all of which have in common that they are frequently realized in conventional printing practice by the use of such features as underlining, italic fonts, or quotation marks, collectively referred to here as highlighting. After an initial discussion of this phenomenon and alternate approaches to encoding it, this section describes ways of encoding the following textual features, all of which are conventionally rendered using some kind of highlighting:

* emphasis, foreign words and other linguistically distinct uses of highlighting
* representation of speech and thought, quotation, etc.
* technical terms, glosses, etc.

3.3.1 What Is Highlighting?

By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings.¹ The purpose of highlighting is generally to draw the reader's attention to some feature or characteristic of the passage highlighted; this section describes the elements recommended by these Guidelines for the encoding of such textual features.

In conventionally printed modern texts, highlighting is often employed to identify words or phrases which are regarded as being one or more of the following:

* distinct in some way -- as foreign, dialectal, archaic, technical, etc.
* emphatic, and which would for example be stressed when spoken
* not part of the body of the text, for example cross-references, titles, headings, labels, etc.
* identified with a distinct narrative stream, for example an internal monologue or commentary.
* attributed by the narrator to some other agency, either within the text or outside it: for example, direct speech or quotation.
* set apart from the text in some other way: for example, proverbial phrases, words mentioned but not used, names of persons and places in older texts, editorial corrections or additions, etc.

The textual functions indicated by highlighting may not be rendered consistently in different parts of a text or in different texts. (For example, a foreign word may appear in italics if the surrounding text is in roman, but in roman if the surrounding text is in italics.) For this reason, these Guidelines distinguish between the encoding of rendering itself and the encoding of the underlying feature expressed by it.

Highlighting as such may be encoded by using either of the global attributes rend or rendition (see 1.3.1.1. Global Attributes). This allows the encoder both to specify the function of a highlighted phrase or word, by selecting the appropriate element described here or elsewhere in the Guidelines, and to further describe the way in which it is highlighted, by means of the rend attribute. If the encoder wishes to offer no interpretation of the feature underlying the use of highlighting in the source text, then the <hi> element may be used, which indicates only that the text so tagged was highlighted in some way.

<hi> (highlighted) marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.

The <hi> element is provided by the model.hiLike class.

The possible values carried by the rend attribute are not formally defined in this version of the Guidelines. Since the rend attribute may be used to document any peculiarity of the way a given segment of text was rendered in the original source text, it may need to express a very large range of typographic features, by no means restricted to typeface, type size, etc.

Where it is both appropriate and feasible, these Guidelines recommend that the textual feature marked by the highlighting should be encoded, rather than just the simple fact of the highlighting.
This is for the following reasons:

* the same kind of highlighting may be used for different purposes in different contexts
* the same textual function may be highlighted in different ways in different contexts
* for analytic purposes, it is in general more useful to know the intended function of a highlighted phrase than simply that it is distinct.

In many, if not most, cases the underlying function of a highlighted phrase will be obvious and noncontroversial, since the distinctions indicated by a change of highlighting correspond with distinctions discussed elsewhere in these Guidelines. The elements available to record such distinctions are, for the most part, members of the model.emphLike class. This and the model.hiLike class mentioned above constitute the model.highlighted class, which is a phrase level class. Members of this class may appear anywhere within paragraph level elements. The distinction between the two classes is simple, and typified by the two elements <hi> and <emph>: the former marks simply that a passage is typographically distinct in some way, while the latter asserts that a passage is linguistically emphasized for some purpose. These two properties, though often combined, are not identical.

¹ Although the way in which a spoken text is performed (for example, the voice quality, loudness, etc.) might be regarded as analogous to `highlighting' in this sense, these Guidelines recommend distinct elements for the encoding of such `highlighting' in spoken texts. See further section 8.3.6. Shifts.

It should be recognized, however, that cases do exist in which it is not economically feasible to mark the underlying function (e.g. in the preparation of large text corpora), as well as cases in which it is not intellectually appropriate (as in the transcription of some older materials, or in the preparation of material for the study of typographic practice).
In such cases, the <hi> element or some other element from the model.hiLike class should be used.

Elements which are sometimes realized by typographic distinction but which are not discussed in this section include <title> (discussed in section 3.11. Bibliographic Citations and References) and <name> (discussed in section 3.5.1. Referring Strings).

3.3.2 Emphasis, Foreign Words, and Unusual Language

This subsection discusses the following elements:

<foreign> (foreign) identifies a word or phrase as belonging to some language other than that of the surrounding text.
<emph> (emphasized) marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect.
<distinct> identifies any word or phrase which is regarded as linguistically distinct, for example as archaic, technical, dialectal, non-preferred, etc., or as forming part of a sublanguage.

These elements are all members of the model.emphLike class.

3.3.2.1 Foreign Words or Expressions

Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is indicated in the text. Where the word or phrase concerned is already distinguished from the rest of the text by virtue of its function (for example, because it is a name, a technical term, a quotation, a mentioned word, etc.) then the global xml:lang attribute should be used to specify additionally that its language distinguishes it from the surrounding text. Any element in the TEI scheme may take an xml:lang attribute, which specifies both the writing system and the language used by its content (see section vi.1 Language identification for discussion of this attribute). Where there is no other applicable element, the element <foreign> may be used to provide a peg onto which the xml:lang may be attached.

<q>Aren't you confusing <foreign xml:lang="la">post hoc</foreign> with <foreign xml:lang="la">propter hoc</foreign>?</q> said the Bee Master.
<q>Wax-moth only succeed when weak bees let them in.</q>

Source: [112]

The <foreign> element should not be used to represent foreign words which are mentioned or glossed within the text: for these use the appropriate element from section 3.3.4. Terms, Glosses, Equivalents, and Descriptions below. Compare the following example sentences:

John eats a <foreign xml:lang="fr">croissant</foreign> every morning.

<mentioned xml:lang="fr">Croissant</mentioned> is difficult to pronounce with your mouth full.

A <term xml:lang="fr">croissant</term> is a crescent-shaped piece of light, buttery, pastry that is usually eaten for breakfast, especially in France.

Source: [45]

3.3.2.2 Emphatic Words and Phrases

The <emph> element is provided to mark words or phrases which are linguistically emphatic or stressed. Text which is only typographically `emphasized' falls into the class of highlighted text, and may be tagged with the <hi> element. In printed works, emphasis is generally indicated by devices such as the use of an italic font, a large typeface, or extra wide letter spacing; in manuscripts and typescripts, it is usually indicated by the use of underlining.

As the following examples demonstrate, an encoder may choose whether or not to make explicit the particular type of rendition associated with the emphasis by use of the rend attribute. If a source text consistently renders a particular feature (e.g. emphasis or words in foreign languages) in a particular way, the rendering associated with that feature may be described in the TEI header using the <rendition> element. The rend attribute may then be used to describe examples which deviate from the norm. For example, assuming that the TEI Header has defined a default rendering for the <emph> element, the following encoding would use it:

<q>Sex, sir, is <emph>purely</emph> a question of appetite!</q> Tarr exclaimed.
Source: [128]

If on the other hand no such default has been defined for the element, the encoder may specify it informally using the rend attribute:

<q>What it all comes to is this,</q> he said. <q> <emph rend="italic">What does Christopher Robin do in the morning nowadays?</emph> </q>

Source: [144]

or, if a <rendition> element has been provided in the header (but not necessarily associated with any other element), the rendition attribute may be used to point to it:

<l>Here Thou, great <name rend="italics">Anna</name>! whom three Realms obey,</l>
<l>Doth sometimes Counsel take -- and sometimes <emph rendition="#italic">Tea</emph>.</l>
<!-- in the header ... -->
<rendition xml:id="italic" scheme="css">font-style:italic</rendition>

Source: [160]

Further information on the use of the <rendition> element is provided at 2.3.4. The Tagging Declaration.

The <hi> element is used to mark words or phrases which are highlighted in some way, but for which identification of the intended distinction is difficult, controversial, or impossible. It enables an encoder simply to record the fact of highlighting, possibly describing it by the use of a rend or rendition attribute, as discussed above, without however taking a position as to the function of the highlighting. This may also be useful if the text is to be processed in two stages: representing simply typographic distinctions during a first pass, and then replacing the <hi> elements with more specific elements in a second pass.

Some simple examples:

<hi rend="gothic">And this Indenture further witnesseth</hi> that the said <hi rend="italic">Walter Shandy</hi>, merchant, in consideration of the said intended marriage ...

Source: [189]

In this example, the first highlighted phrase uses black letter or gothic print to mimic the appearance of a legal document, and the second uses italic to mark Walter Shandy as a name.
In a second pass, the elements <head> or <label> might be appropriate for the first use, and the element <name> for the second.

The heaviest rain, and snow, and hail, and sleet, could boast of the advantage over him in only one respect. They often <hi rend="quoted">came down</hi> handsomely, and Scrooge never did.

Source: [59]

In this example, the phrase came down uses inverted commas to indicate a play on words.² In a second pass, the element <soCalled> might be preferred.

3.3.2.3 Other Linguistically Distinct Material

For some kinds of analysis, it may be desirable to encode the linguistic distinctiveness of words and phrases with more delicacy than is allowed by the <foreign> element. The <distinct> element is provided for this purpose. Its attributes allow for additional information characterizing the nature of the linguistic distinction to be made in two distinct ways: the type attribute simply assigns a user-defined code of some kind to the word or phrase which assigns it to some register, sub-language, etc. No recommendations as to the set of values for this attribute are provided at this time, as little consensus exists in the field.

Alternatively, the remaining three attributes may be used in combination to place a word or phrase on a three-dimensional scale sometimes used in descriptive linguistics, as for example in Mattheier et al., 1988. The time attribute places a word diachronically, for example as archaic, old-fashioned, contemporary, futuristic, etc.; the space attribute places a word diatopically, that is, with respect to a geographical classification, for example as national, regional, international, etc.; the social attribute places a word diastatically, that is, with respect to a social classification, for example as technical, polite, impolite, restricted, etc.
Again, no recommendations are made for the values of these attributes at this time; the encoder should provide a description of the scheme used in the appropriate section of the header (see section 2.3. The Encoding Description).

² The Oxford English Dictionary documents the phrase to come down in the sense `to bring or put down; esp. to lay down money; to make a disbursement' as being in use, mostly in colloquial or humorous contexts, from at least 1700 to the latter half of the 19th century.

Examples:

Next morning a boy in that dormitory confided to his bosom friend, a <distinct type="psSlang">fag</distinct> of Macrea's, that there was trouble in their midst which King <distinct type="archaic">would fain</distinct> keep secret.

Source: [113]

Next morning a boy in that dormitory confided to his bosom friend, a <distinct time="1900" space="GB" social="publicschool">fag</distinct> of Macrea's, that there was trouble in their midst which King <distinct time="archaic">would fain</distinct> keep secret.

Where more complex (or more rigorous) interpretive analyses of the associations of a word are required, the more detailed and general mechanisms described in chapter 18. Feature Structures should be preferred to these simple characterizations. It may also be preferable to record the kinds of analysis suggested here by means of the simple annotation element <note> described in section 3.8. Notes, Annotation, and Indexing, or the <span> element described in section 17.3. Spans and Interpretations.

3.3.3 Quotation

One form of presentational variation found particularly frequently in written and printed texts is the use of quotation marks. As with the typographic variations discussed in the preceding section, it is generally helpful to separate the encoding of the underlying textual feature (for example, a quotation or a piece of direct speech) from the encoding of its rendering (for example, the use of a particular style of quotation marks).
This section discusses the following elements, all of which are often rendered by the use of quotation marks:

<q> (separated from the surrounding text with quotation marks) contains material which is marked as (ostensibly) being somehow different than the surrounding text, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.
<said> (speech or thought) indicates passages thought or spoken aloud, whether explicitly indicated in the source or not, whether directly or indirectly reported, whether by real people or fictional characters.
    @direct may be used to indicate whether the quoted matter is regarded as direct or indirect speech.
    @aloud may be used to indicate whether the quoted matter is regarded as having been vocalized or signed.
<quote> (quotation) contains a phrase or passage attributed by the narrator or author to some agency external to the text.
<cit> (cited quotation) contains a quotation from some other document, together with a bibliographic reference to its source. In a dictionary it may contain an example text with at least one occurrence of the word form, used in the sense being described, or a translation of the headword, or an example.
<mentioned> marks words or phrases mentioned, not used.
<soCalled> contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics.

The elements <mentioned> and <soCalled> are members of the class model.emphLike; the <q> and <said> elements are members of the class model.qLike in their own right, while <cit> and <quote> are members of model.quoteLike, a subclass of model.qLike. The model.qLike class is in turn a subclass of model.inter; hence all of the elements in it are permitted both within and between paragraph-level elements.
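The practical effect of this inter-level status can be sketched with <quote>, which may appear either inside a paragraph or as a block between paragraphs. The text content below is invented for illustration.

```xml
<!-- Illustrative fragment; the text content is invented. -->
<div>
  <!-- <quote> within a paragraph -->
  <p>The report called the result <quote>entirely predictable</quote>.</p>
  <!-- <quote> between paragraphs, as a block of its own -->
  <quote>
    <p>Nothing in the data suggested otherwise.</p>
  </quote>
  <p>The committee accepted this conclusion.</p>
</div>
```

By contrast, <mentioned> and <soCalled>, being members of model.emphLike, are phrase-level and would be expected only inside a paragraph or similar element.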
The most common and important use of quotation marks is, of course, to mark quotation, by which we mean simply any part of the text attributed by the author or narrator to some agency other than the narrative voice. The <q> element may be used if no further distinction beyond this is judged necessary. If however it is felt necessary to distinguish passages which are in some sense external to the work from passages of direct speech or thought, a more precise element may be chosen from the list above. Typical examples include passages cited from other works, for which the element <quote> may be used, and words or phrases spoken or thought by people or characters within the current work, for which the element <said> may be used.

The <soCalled> element is used for cases where the author or narrator distances himself or herself from the words in question without however attributing them to any other voice in particular. The <mentioned> element is appropriate for a case where a word or phrase is being discussed in the body of a text rather than forming part of the text directly.

As noted above, if the distinction among these various reasons why a passage is offset from surrounding text cannot be made reliably, or is not of interest, then all quoted matter may simply be marked using the <q> element.

Quotation may be indicated in a printed source by changes in type face, by special punctuation marks (single or double or angled quotes, dashes, etc.) and by layout (indented paragraphs, etc.). If these characteristics are of interest, one or other of the global rend or rendition attributes discussed in section 1.3.1.1. Global Attributes may be used to record them.

Quotation marks themselves may, like other punctuation marks, be felt for some purposes to be worth retaining within a text, quite independently of their description by the rend attribute.
This should generally be done using the appropriate Unicode character, or, if this is not possible, a numeric character reference (see v.6.1 Character References). Alternatively, the encoder may suppress all quotation marks, possibly recording their form using some appropriate set of conventions in the rend attribute. Some examples are shown below:

<said rend="PRE lsquo POST rsquo">Who-e debel you?</said> -- he at last said -- <said rend="PRE lsquo POST rsquo">you no speak-e, damme, I kill-e.</said> And so saying, the lighted tomahawk began flourishing about me in the dark.

Source: [142]

Adolphe se tourna vers lui : <said>-- Alors, Albert, quoi de neuf?</said> <said>-- Pas grand-chose.</said> <said>-- Il fait beau,</said> dit Robert.

Adolphe se tourna vers lui : <said rend="PRE mdash">Alors, Albert, quoi de neuf ?</said> <said rend="PRE mdash">Pas grand-chose.</said> <said rend="PRE mdash">Il fait beau,</said> dit Robert.

Source: [164]

As members of the att.ascribed class, elements <said> and <q> share the following attribute:

att.ascribed provides attributes for elements representing speech or action that can be ascribed to a specific individual.
    @who indicates the person, or group of people, to whom the element content is ascribed.

This may be used to make explicit who is speaking:

Adolphe se tourna vers lui : <said who="#Adolphe">-- Alors, Albert, quoi de neuf?</said> <said who="#Albert">-- Pas grand-chose.</said> <said who="#Robert">-- Il fait beau,</said> dit Robert.
<!-- .... -->
<list type="speakers"> <item xml:id="Adolphe"/> <item xml:id="Albert"/> <item xml:id="Robert"/> </list>

The who attribute may be supplied whether or not an indication of the speaker is given explicitly in the text. It may take the form (as above) of a normalized form of the speaker's name, but its role is to act as a pointer to a location elsewhere in the text where data about each speaker may be supplied.
The most appropriate place for such information is within the participant description component of the TEI Header, as further discussed in 15.2.2. The Participant Description, but for simple cases like the above, a simple list of speakers located in the front or back matter of the text may suffice.

It may also be useful to distinguish representations of speech from representations of thought, in modern printed texts often indicated by a change of typeface. The aloud attribute is provided for this purpose, as in this example:

<said aloud="true">Oh yes,</said> said Henry, <said aloud="false">I mean Gordon Macrae, for example...</said> <said aloud="false">Jungian Analyst with Winebox! That's what you called him, you callous bastard, didn't you? Eh? Eh?</said>

Source: [210]

Quoted matter may be embedded within quoted matter, as when one speaker reports the speech of another:

<said who="#Wilson">Spaulding, he came down into the office just this day eight weeks with this very paper in his hand, and he says:-- <said who="#WilsonSpaulding">I wish to the Lord, Mr. Wilson, that I was a red-headed man.</said> </said>
<!-- ... -->
<list type="speakers"> <item xml:id="Wilson">Wilson</item> <item xml:id="WilsonSpaulding">Spaulding reported by Wilson</item> <!-- ...--> </list>

Source: [63]

Direct speech nested in this way is treated in the same way as elsewhere: a change of rendition may occur, but the same element should be used. An encoder may however choose to distinguish between direct speech which contains quotations from extra-textual matter and direct speech itself, as in the following example:

<p> <said>The Lord! The Lord! It is Sakya Muni himself,</said> the lama half sobbed; and under his breath began the wonderful Buddhist invocation:-<said> <quote>
<l>To Him the Way -- the Law -- Apart --</l>
<l>Whom Maya held beneath her heart</l>
<l>Ananda's Lord -- the Bodhisat</l>
</quote> And He is here!
The Most Excellent Law is here also. My pilgrimage is well begun. And what work! What work!</said> </p>

Source: [114]

Quotations from other works are often accompanied by a reference to their source. The <cit> element may be used to group together the quotation and its associated bibliographic reference, which should be encoded using the elements for bibliographic references discussed in section 3.11. Bibliographic Citations and References, as in the following example:

<div xml:id="mm01" type="chapter">
<head>Chapter 1</head>
<epigraph>
<cit>
<quote>
<l>Since I can do no good because a woman</l>
<l>Reach constantly at something that is near it.</l>
</quote>
<bibl> <title>The Maid's Tragedy</title> <author>Beaumont and Fletcher</author> </bibl>
</cit>
</epigraph>
<p>Miss Brooke had that kind of beauty which seems to be thrown into relief by poor dress...</p>
</div>

Source: [69]

Like other bibliographic references, the citation attached to a quotation may be represented simply by a pointer, as in this example:

Lexicography has shown little sign of being affected by the work of followers of J.R. Firth, probably best summarized in his slogan, You shall know a word by the company it keeps. (Firth, 1957)

Source: [97]

Unlike most of the other elements discussed in this chapter, direct speech and quotations may frequently contain other high-level elements such as paragraphs or verse lines, as well as being themselves contained by such elements. Three possible solutions exist for this well-known structural problem:

* the quotation is broken into segments, each of which is entirely contained within a paragraph
* the quotation is marked up using stand-off markup
* the quotation boundaries are represented by empty segment boundary delimiter elements

For further discussion and several examples, see chapter 20. Non-hierarchical Structures.

Finally, the <soCalled> element is provided for all cases in which quotation marks are used to distance the quoted text from the narrator or speaker. Common examples include the `scare' quotes often found in newspaper headlines and advertising copy, where the effect is to cast doubts on the veracity of an assertion:

PM dodges election threat in interview

Source: [194]

The same element should be used to mark a variety of special ironic usages. Some further examples follow:

He hated good books.

Croissants indeed! toast not good enough for you?

Although Chomsky's decision that all NL sentences are finite objects was never justified by arguments from the attested properties of NLs, it did have a certain social justification. It was commonly assumed in works on logic until fairly recently that the notion language is necessarily restricted to finite strings.
Source: [119]

3.3.4 Terms, Glosses, Equivalents, and Descriptions

This section describes a set of textual elements which are used to provide a gloss, alternate identification, or description of something. Technical terms are often italicized or emboldened upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in languages under discussion in italics, providing a gloss immediately following marked with single quotation marks. Other texts in which individual words or phrases are mentioned (for example, as examples) rather than used may mark them either with italics or with quotation marks, and will gloss them less regularly.

<term> contains a single-word, multi-word, or symbolic designation which is regarded as a technical term.
<gloss> identifies a phrase or word used to provide a gloss or definition for some other word or phrase.

These elements are also members of the class model.emphLike. A <term> may appear with or without a gloss, as may a <mentioned> element. Where the <gloss> is present, it may be linked to the term it is glossing by means of its target attribute. To establish such a link, the encoder should give an xml:id value to the <term> or <mentioned> element and provide that id as the value of the target attribute on the <gloss> element. The following examples demonstrate this facility: We may define discoursal point of view as the relationship, expressed through discourse structure, between the implied author or some other addresser, and the fiction. Source: [123] A computational device that infers structure from grammatical strings of words is known as a parser, and much of the history of NLP over the last 20 years has been occupied with the design of parsers.
Source: [82]

Note that the <term> element is intended for use with words or phrases identified as terminological in nature; where words or phrases are simply being cited, discussed, or glossed in a text, it will often be more appropriate to use the <mentioned> element, as in the following example: There is thus a striking accentual difference between a verbal form like eluthemen we were released, accented on the second syllable of the word, and its participial derivative lutheis released, accented on the last. Source: [170]

For technical terminology in particular, and generally in terminological studies, it may be useful to associate an instance of a term within a text with a canonical definition for it, which is stored either elsewhere in the same text (for example in a glossary of terms) or externally, for example in a database, authority file, or published standard. The attributes key and ref discussed in section 3.5.1. Referring Strings below are available on the <term> element for this purpose.

Another group of elements is used to supply different kinds of names for objects described by the TEI. Examples of this are documentation of elements, attributes, classes (and also attribute values where appropriate), and description of glyphs.

<altIdent> (alternate identifier) supplies the recommended XML name for an element, class, attribute, etc. in some language.
<desc> (description) contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate.
<equiv> (equivalent) specifies a component which is considered equivalent to the parent element, either by co-reference, or by external link.
  @uri (uniform resource identifier) references the underlying concept of which the parent is a representation by means of some external identifier
  @filter references an external script which contains a method to transform instances of this element to canonical TEI
  @name names the underlying concept of which the parent is a representation

Along with the <gloss> element mentioned above, these elements constitute the model.glossLike class.

The <gloss> element may be used to provide a brief explanation for the name of the object if this is not self-explanatory. For example, the specification for the element <ab> used to mark arbitrary blocks of text begins as follows: anonymous block

A <gloss> may also be supplied for an attribute name or an attribute value in similar circumstances:
suspension the abbreviation provides the first letter(s) of the word or phrase, omitting the remainder.
contraction the abbreviation omits some letter(s) in the middle.
Note that this is quite distinct from the use of the <desc> element, which contains a full description of the intended semantics for the object.

The <equiv> element is used to document equivalencies between the concept represented by this object and the same concept as described in other schemes or ontologies. The uri attribute is used to supply a pointer to some location where such external concepts are defined. For example, to indicate that the TEI <death> element corresponds to the concept defined by the CIDOC CRM category E69, the declaration for the former might begin as follows:

The <equiv> element may also be used to map newly-defined elements onto existing constructs in the TEI, using the filter and name attributes to point to an implementation of the mapping. This is useful when a TEI customization (see 23.2. Personalization and Customization) defines `shortcuts' for convenience of data entry or markup readability.
For example, suppose that in some TEI customization an element <bold> has been defined which is conceptually equivalent to the standard markup construct <hi rend='bold'>. The following declarations would additionally indicate that instances of the <bold> element can be converted to canonical TEI by obtaining a filter from the URI specified, and running the procedure with the name bold. The mimeType attribute specifies the language (in this case XSL) in which the filter is written: bold contains a sequence of characters rendered in a bold face.

The <altIdent> element is used to provide an alternative name for an object, for example using a different natural language. Thus, the following might be used to indicate that the <abbr> element should be identified using the German word Abkürzung: Abkürzung

In the same way, the following specification for the <graphic> element indicates that the attribute url may also be referred to using the alternate identifier href: href

By default, the <altIdent> of a component is identical to the value of its ident attribute.

The contents of the <desc> element provide a brief characterization of the intended function of the object being documented in a form that permits its quotation out of context, as in the following example: identifies a word or phrase as belonging to some language other than that of the surrounding text.

By convention, a <desc> element begins with a verb such as contains, indicates, specifies, etc. and contains a single clause.

3.3.5 Some Further Examples

As a simple example of the elements discussed here, consider the following sentence: On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.
A first approximation to the encoding of this sentence might be simply to record the fact that the phrases printed above in italics are highlighted, as follows: On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, ... Source: [5]

This encoding would, however, lose the important distinction between an italicized title and an italicized foreign phrase. Many other phrases might also be italicized in the text, and a retrieval program seeking to identify foreign terms (for example) would not be able to produce reliable results by simply looking for italicized words. Where economic and intellectual constraints permit, therefore, it would be preferable to encode both the function of the highlighted phrases and their appearance, as follows: On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, ...

In this example, the decision as to which textual features are distinguished by the highlighting is relatively uncontroversial. As a less straightforward example, consider the use of italic font in the following passage: A pretty common case, I believe; in all vehement debatings. She says I am too witty; Anglicé, too pert; I, that she is too wise; that is to say, being likewise put into English, not so young as she has been: in short, she is grown so much into a mother, that she had forgotten she ever was a daughter. ...

Clearly, the word vehement is not italicized for the same reason as the phrase not so young as she has been; the former is emphasized, while the latter is proverbial. It also provides an ironic gloss for the words too wise, in the same way as too pert glosses too witty. The glossed phrases are not, however, technical terms or cited words, but quoted phrases, as if the writer were putting words into her own and her mother's mouths.
Finally, the words mother and daughter are apparently italicized simply to oppose them in the sentence; certainly they do not fit into any of the categories so far proposed as reasons for italicizing. Note also that the word Anglicé is not italicized although it is not generally considered an English word. The following sample encoding for the above passage attempts to take into account all the above points: A pretty common case, I believe; in all vehement debatings. She says I am too witty; Anglicé, too pert; I, that she is too wise; that is to say, being likewise put into English, not so young as she has been: in short, she is grown so much into a mother, that she had forgotten she ever was a daughter. Source: [166]

3.4 Simple Editorial Changes

As in editing a printed text, so in encoding a text in electronic form, it may be necessary to accommodate editorial comment on the text and to render account of any changes made to the text in preparing it. The tags described in this section may be used to record such editorial interventions, whether made by the encoder, by the editor of a printed edition used as a copy text, by earlier editors, or by the copyists of manuscripts.

The tags described here handle most common types of editorial intervention and stereotyped comment; where less structured commentary of other types is to be included, it should be marked using the <note> element described in section 3.8. Notes, Annotation, and Indexing. Systematic interpretive annotation is also possible using the various methods described in chapter 16. Linking, Segmentation, and Alignment. The examples given here illustrate only simple cases of editorial intervention; in particular, they permit economical encoding of a simple set of alternative readings of a short span of text. To encode multiple views of large or heterogeneous spans of text, the mechanisms described in chapter 16. Linking, Segmentation, and Alignment should be used.
To encode multiple witnesses of a particular text, a similar mechanism designed specifically for critical editions is described in chapter 12. Critical Apparatus.

For most of the elements discussed here, some encoders may wish to indicate both a responsibility, that is, a code indicating the person or agency responsible for making the editorial intervention in question, and also an indication of the degree of certainty which the encoder wishes to associate with the intervention. Because these requirements are common to many of the elements discussed in this section, they are provided by an attribute class, called att.editLike. All members of this class carry the following optional attributes:

att.editLike provides attributes describing the nature of an encoded scholarly intervention or interpretation of any kind.
  @cert (certainty) signifies the degree of certainty associated with the intervention or interpretation.
  @resp (responsible party) indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber.
  @evidence indicates the nature of the evidence supporting the reliability or accuracy of the intervention or interpretation.

Many of the elements discussed here can be used in two ways. Their primary purpose is to indicate that the text encoded as the element's content represents an editorial intervention (or non-intervention) of a specific kind, indicated by the element itself. However, pairs or other meaningful groupings of such elements can also be supplied, wrapped within a special purpose element:

<choice> groups a number of alternative encodings for the same point in a text.

This element enables the encoder to represent for example a text in its `original' uncorrected and unaltered form, alongside the same text in one or more `edited' forms.
This usage permits software to switch automatically between one `view' of a text and another, so that (for example) a stylesheet may be set to display either the text in its original form or after the application of editorial interventions of particular kinds. Elements which can be combined in this way constitute the model.choicePart class. The default members of this class include <sic>, <corr>, <reg>, and <orig>; their functions and usage are described further below.

Three categories of editorial intervention are discussed in this section:
* indication or correction of apparent errors
* indication or regularization of variant, irregular, non-standard, or eccentric forms
* editorial additions, suppressions, and omissions
A more extended treatment of the use of these tags in transcriptional and editorial work is given in chapter 11. Representation of Primary Sources.

3.4.1 Apparent Errors

When the copy text is manifestly faulty, an encoder or transcriber may elect simply to correct it without comment, although for scholarly purposes it will often be more generally useful to record both the correction and the original state of the text. The elements described here enable all three approaches, and allow the last to be done in such a way as to make it easy for software to present either the original or the correction.

<sic> (Latin for thus or so) contains text reproduced although apparently incorrect or inaccurate.
<corr> (correction) contains the correct form of a passage apparently erroneous in the copy text.

The following examples show alternative treatment of the same material.
The copy text reads: Another property of computer-assisted historical research is that data modelling must permit any one textual feature or part of a textual feature to be a part of more than one information model and to allow the researcher to draw on several such models simultaneously, for example, to select from a machine-readable text those marginal comments which indicate that the date's mentioned in the main body of the text are incorrect.

An encoder may choose to correct the typographic error, either silently or with an indication that a correction has been made, as follows:
... marginal comments which indicate that the <corr>dates</corr> mentioned in the main body of the text are incorrect.

Alternatively, the encoder may simply record the typographic error without correcting it, either without comment or with a <sic> element to indicate the error is not a transcription error in the encoding:
... marginal comments which indicate that the <sic>date's</sic> mentioned in the main body of the text are incorrect.

If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, both <corr> and <sic> are used, wrapped in a <choice>:
... marginal comments which indicate that the <choice><corr>dates</corr><sic>date's</sic></choice> mentioned in the main body of the text are incorrect.

The <sic> and <corr> elements can appear in either order.

If it is desired to indicate the person or edition responsible for the emendation, this might be done as follows:
... marginal comments which indicate that the <choice><corr resp="#msm">dates</corr><sic>date's</sic></choice> mentioned in the main body of the text are incorrect.
<respStmt xml:id="msm"> <resp>editor</resp> <name>C.M. Sperberg McQueen</name> </respStmt>

Here the resp attribute has been used to indicate responsibility for the correction. Its value (#msm) is an example of the pointer values discussed in section 3.6.
Simple Links and Cross-References; in this case, it points to a <respStmt> element within the TEI Header, but any element might be indicated in this way, including for example a <person> element (if the module described in 13. Names, Dates, People, and Places has been included), or one of the bibliographic elements described in 3.11. Bibliographic Citations and References, if the correction has been taken from some other source.

The resp attribute is available for all elements which are part of the att.editLike class. The same class makes available a cert attribute, which may be used to indicate the degree of editorial confidence in a particular correction, as in the following example: An Autumn Antony it was, That grew the more by reaping Source: [176]

See further the discussion in section 11.3.3. Correction and Conjecture.

Where, as here, the correction takes the form of adding text not otherwise present in the text being encoded, the encoder should use the <corr> element. Where the correction is present in the text being encoded, and consists of some combination of visible additions and deletions, the elements <add> or <del> should be used: see further section 3.4.3. Additions, Deletions, and Omissions below. Where the correction takes the form of addition of material not present in the original because of physical damage or illegibility, the <supplied> element may be used. Where the `correction' is simply a matter of expanding an abbreviation the <ex> element may be used. These and other elements to support the detailed encoding of authorial or scribal interventions of this kind are all provided by the module described in chapter 11. Representation of Primary Sources.
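Returning to the cert attribute described above, a minimal sketch of a conjectural correction paired with the reading it replaces might look as follows. The value high and the pointer #ed1 are assumptions made for illustration; neither is taken from an actual edition.

```xml
<!-- Illustrative sketch: the cert value and the resp pointer are assumed -->
<l>An <choice>
  <corr cert="high" resp="#ed1">Autumn</corr>
  <sic>Antony</sic>
 </choice> it was,</l>
<l>That grew the more by reaping</l>
```

Keeping both readings inside a choice lets a stylesheet present either the transmitted text or the emendation, while cert records how far the editor is prepared to vouch for the latter.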
3.4.2 Regularization and Normalization

When the source text makes extensive use of variant forms or non-standard spellings, it may be desirable for a number of reasons to regularize it: that is, to provide `standard' or `regularized' forms equivalent to the nonstandard forms (see note 3). As with other such changes to the copy text, the changes may be made silently (in which case the TEI header should specify the types of silent changes made) or may be explicitly marked using the following elements:

<reg> (regularization) contains a reading which has been regularized or normalized in some sense.
<orig> (original form) contains a reading which is marked as following the original, rather than being normalized or corrected.
<choice> groups a number of alternative encodings for the same point in a text.

Typical applications for these elements include the production of editions intended for student or lay readers, linguistic research in which spelling or usage variation is not the main question at issue, production of spelling dictionaries, etc.

Note 3: In some contexts, the term regularization has a narrower and more specific significance than that proposed here: the <reg> element may be used for any kind of regularization, including normalization, standardization, and modernization.

Consider this 16th-century text: how godly a dede it is to overthrowe so wicked a race the world may judge: for my part I thinke there canot be a greater sacryfice to God. An encoder may choose to preserve the original spelling of this text, but simply flag it as nonstandard by using the <orig> element with no attributes specified, as follows:

...how godly a <orig>dede</orig> it is to <orig>overthrowe</orig> so wicked a race the world may judge: for my part I <orig>thinke</orig> there <orig>canot</orig> be a greater <orig>sacryfice</orig> to God

Alternatively, the encoder may simply indicate that certain words have been modernized by using the <reg> element with no attributes specified, as follows:

...how godly a <reg>deed</reg> it is to <reg>overthrow</reg> so wicked a race the world may judge: for my part I <reg>think</reg> there <reg>cannot</reg> be a greater <reg>sacrifice</reg> to God.

Alternatively, the encoder may elect to record both old and new spellings, so that (for example) the same electronic text may serve as the basis of an old- or new-spelling edition:

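...how godly a <choice><orig>dede</orig><reg>deed</reg></choice> it is to <choice><orig>overthrowe</orig><reg>overthrow</reg></choice> so wicked a race the world may judge: for my part I <choice><orig>thinke</orig><reg>think</reg></choice> there <choice><orig>canot</orig><reg>cannot</reg></choice> be a greater <choice><orig>sacryfice</orig><reg>sacrifice</reg></choice> to God.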

Source: [30]

As elsewhere, the resp attribute may be used to specify the agency responsible for the regularization.

3.4.3 Additions, Deletions, and Omissions

The following elements are used to indicate when words or phrases have been omitted from, added to, or marked for deletion from, a text. Like the other editorial elements, they allow for a wide range of editorial practices:

<gap> (gap) indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible.
  @reason gives the reason for omission. Sample values include sampling, inaudible, irrelevant, cancelled.
<unclear> contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source.
  @reason indicates why the material is hard to transcribe.
<add> (addition) contains letters, words, or phrases inserted in the text by an author, scribe, annotator, or corrector.
<del> (deletion) contains a letter, word, or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator, or corrector.

Encoders may choose to omit parts of the copy text for reasons ranging from illegibility of the source or impossibility of transcribing it, to editorial policy, e.g. a systematic exclusion of poetry or prose from an encoding. The full details of the policy decisions concerned should be documented in the TEI Header (see section 2.3. The Encoding Description). Each place in the text at which omission has taken place should be marked with a <gap> element, with optionally further information about the reason for the omission, its extent, and the person or agency responsible for it, as in the following examples:

Note that the extent of the gap may be marked precisely using attributes unit and quantity, or more descriptively using the extent attribute.
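The two ways of recording extent can be sketched as follows. The reason values are drawn from the sample values listed above, while the quantities, units, free-text extent, and the resp pointer #ed1 are hypothetical.

```xml
<!-- Extent stated precisely with unit and quantity (values are illustrative) -->
<gap reason="irrelevant" unit="lines" quantity="12" resp="#ed1"/>

<!-- Extent stated descriptively with the extent attribute -->
<gap reason="sampling" extent="two pages"/>
```

The precise form is easier for software to total or compare; the descriptive form suits cases where the omission was never measured.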
Other, more detailed, options are also available for representing dimensions of any kind; see further 10.3.4. Dimensions.

The <desc> element may be used to supply a description of the material omitted, where that is considered useful: irrelevant commentary ... Their arrangement with respect to Jupiter and to each other was as follows: astrological figure That is, there were two stars on the easterly side and one to the west; ... Source: [78]

The <add> and <del> elements may be used to record where words or phrases have been added or deleted in the copy text. They are not appropriate where longer passages have been added or deleted, which span several elements; for these, the elements <addSpan> and <delSpan> described in chapter 11.3.4. Additions and Deletions must be used.

Additions to a text may be recorded for a number of reasons. Sometimes they are marked in a distinctive way in the source text, for example by brackets or insertion above the line (supralinear insertion), as in the following example, taken from a 19th century manuscript: The story I am going to relate is true as to its main facts, and as to the consequences of these facts from which this tale takes its title. Source: [79]

The <add> element should not be used to mark editorial changes, such as supplying a word omitted by mistake from the source text or a passage present in another version. In these cases, either the <corr> or <supplied> tags should be used, as discussed above in section 3.4.1. Apparent Errors, and in section 11.3.3. Correction and Conjecture, respectively.

The <unclear> element is used to mark passages in the original which cannot be read with confidence, or about which the transcriber is uncertain for other reasons, as for example when transcribing a partially inaudible or illegible source. Its reason and resp attributes are used, as with the <gap> element, to indicate the cause of uncertainty and the person responsible for the conjectured reading.
For example: And where the sandy mountain Fenwick scald The sea between yet hence his pray'r prevail'd Source: [141] or from a spoken text:

... and then marbled queen...
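One way the spoken fragment above might be tagged is sketched here. The reason value is an assumption chosen for illustration, and the u (utterance) wrapper presumes that the module for transcribed speech is in use.

```xml
<!-- Hypothetical tagging: the reason value is invented for this sketch -->
<u>... and then <unclear reason="background-noise">marbled queen</unclear>...</u>
```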

Where the material affected is entirely illegible or inaudible, the <gap> element discussed above should be used in preference.

The <del> element is used to mark material which is deleted in the source but which can still be read with some degree of confidence, as opposed to material which has been omitted by the encoder or transcriber either because it is entirely illegible or for some other reason. This is of particular importance in transcribing manuscript material, though deletion is also found in printed texts, sometimes for humorous purposes: One day I will sojourn to your shores I live in the middle of England But! Norway! My soul resides in your watery fiords fyords fiiords Inlets. Source: [198]

The rend attribute may be used to distinguish different methods of deletion in manuscript or typescript material, as in this line from the typescript of Eliot's Waste Land: Mein Frisch schwebt weht der Wind

Deletion in manuscript or typescript is often associated with addition: Inviolable Inexplicable splendour of Corinthian white and gold Source: [71]

The <subst> element discussed in 11.3.5. Substitutions provides a way of grouping additions and deletions of this kind.

The <del> element should not be used where the deletion is such that material cannot be read with confidence, or read at all, or where the material has been omitted by the transcriber or editor for some other reason. Where the material deleted cannot be read with confidence, the <unclear> tag should be used with the reason attribute indicating that the difficulty of transcription is due to deletion. Where material has been omitted by the transcriber or editor, this may be indicated by use of the <gap> element. A deletion in which some parts may be read but not others may thus be represented by one or more <gap> elements intermingled with text, all contained by a <del> element.
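A partly legible deletion of the kind just described might be sketched as follows. The wording, the rend value, and the stated extent are all invented for illustration.

```xml
<!-- Hypothetical deletion: the opening and closing words can be read,
     two words in the middle cannot -->
<del rend="overstrike">I have <gap reason="illegible" unit="words" quantity="2"/> before now</del>
```

The outer element records that the whole phrase was struck out, while the empty element inside it marks only the portion the transcriber could not recover.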
3.5 Names, Numbers, Dates, Abbreviations, and Addresses

This section describes a number of textual features which it is often convenient to distinguish from their surrounding text. Names, dates, and numbers are likely to be of particular importance to the scholar treating a text as source for a database; distinguishing such items from the surrounding text is however equally important to the scholar primarily interested in lexis.

The treatment of these textual features proposed here is not intended to be exhaustive: fuller treatments for names, numbers, measures, and dates are provided in the names and dates module (see chapter 13. Names, Dates, People, and Places).

3.5.1 Referring Strings

A referring string is a phrase which refers to some person, place, object, etc. Two elements are provided to mark such strings:

<rs> (referencing string) contains a general purpose name or referring string.
<name> (name, proper noun) contains a proper noun or noun phrase.

These elements are both members of the att.typed class, from which they inherit the following attributes, which may be used to further categorize the kind of object referred to:

att.typed provides attributes which can be used to classify or subclassify elements in any way.
  @type characterizes the element in some sense, using any convenient classification scheme or typology.
  @subtype provides a sub-categorization of the element, if needed.

Examples include:

My dear Mr. Bennet , said his lady to him one day, have you heard that Netherfield Park is let at last?

Source: [8]

Collectors of water-rents were appointed by the Watering Committee. They were paid a commission not exceeding four per cent, and gave bond.

Source: [3]

It being one of the principles of the Circumlocution Office never, on any account whatsoever, to give a straightforward answer, Mr Barnacle said, Possibly.

Source: [60]

As the following example shows, the <rs> element may be used for any reference to a person, place, etc., not only to references in the form of a proper noun or noun phrase.

My dear Mr. Bennet , said his lady to him one day ...
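As a sketch of this usage, both the proper name and the descriptive phrase his lady can be marked as referring strings. The type value person is an assumed classification, not one fixed by these Guidelines.

```xml
<!-- Both a proper name and a common-noun phrase are tagged as referring strings;
     the type values are illustrative -->
<p><q>My dear <rs type="person">Mr. Bennet</rs>, </q>said <rs type="person">his lady</rs>
to him one day ...</p>
```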

The <name> element by contrast is provided for the special case of referencing strings which consist only of proper nouns; it may be used synonymously with the <rs> element, or nested within it if a referring string contains a mixture of common and proper nouns. The following example shows an alternative way of encoding the short sentence from Pride and Prejudice quoted above:

My dear Mr. Bennet, said his lady to him one day, have you heard that Netherfield Park is let at last?
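A sketch of this alternative, in which only the proper nouns are tagged and the phrase his lady is left unmarked; the type values person and place are assumptions for illustration.

```xml
<!-- Only proper nouns are marked; type values are illustrative -->
<p><q>My dear <name type="person">Mr. Bennet</name>, </q>said his lady to him one day,
<q>have you heard that <name type="place">Netherfield Park</name> is let at last?</q></p>
```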

As the following example shows, a proper name may be nested within a referring string: His Excellency the Life President, Ngwazi Dr H. Kamuzu Banda Source: [138]

Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as van or de la may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.

Two issues arise in this context: firstly, there may be a need to encode a regularised form of a name, distinct from the actual form in the source to hand; secondly, there may be a need to identify the particular person, place, etc. referred to by the name, irrespective of whether the name itself is normalized or not. The element <reg>, introduced in 3.4.2. Regularization and Normalization, is provided for the former purpose; the attributes key or ref for the latter.

The key and ref attributes are common to all members of the att.canonical class and are defined as follows:

att.canonical provides attributes which can be used to associate a representation such as a name or title with canonical information about the object being named or referenced.
  @key provides an externally-defined means of identifying the entity (or entities) being named, using a coded value of some kind.
  @ref (reference) provides an explicit means of locating a full definition for the entity being named by means of one or more URIs.

A very useful application for them is as a means of gathering together all references to the same individual or location scattered throughout a document:

My dear Mr. Bennet, said his lady to him one day, have you heard that Netherfield Park is let at last?

Mme. de Volanges marie sa fille: c'est encore un secret; mais elle m'en a fait part hier.

Source: [117]

The value of the key attribute may be an unexpanded code, as in the examples above, with no particular significance. More usually however, it will be an externally defined code of some kind, as provided by a standard reference source.

Heathrow
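A sketch of such a coded reference, assuming that the three-letter airport code LHR serves as the externally defined key and that the name is classified as a place (both choices are illustrative):

```xml
<!-- key carries an externally defined code; LHR and the type value are assumptions -->
<name type="place" key="LHR">Heathrow</name>
```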

The ref attribute can be used to point directly to some other resource providing more information about the entity named by the element, such as an authority record in a database, an encyclopaedia entry, another element in the same or a different document etc.

Heathrow
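A comparable sketch for ref follows. The URL is a placeholder standing in for an authority record and the fragment identifier #heathrow is hypothetical; neither refers to a real resource.

```xml
<!-- ref points at an external resource (placeholder URL) -->
<name type="place" ref="http://example.org/places/heathrow">Heathrow</name>

<!-- ref may equally point at an element elsewhere in the same document (hypothetical id) -->
<name type="place" ref="#heathrow">Heathrow</name>
```

Unlike key, whose value is only meaningful to those who know the coding scheme, a ref value can be followed directly by software.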

This use should be distinguished from the use of a nested <reg> (regularization) element to provide the standard form of a referring string, as in this example:

My personal life during the administration of Col. Polk (Polk, James K.) has but poorly compensated me for the suspended enjoyments and pursuits of private and professional spheres

Source: [52]

The <choice> element discussed in 3.4. Simple Editorial Changes may be used if it is desired to record both a normalized form of a name and the name used in the source being encoded:

Walter de la Mare de la Mare, Walter was born at Charlton, in Kent, in 1873.
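
One plausible encoding of this sentence, offered as a sketch: the source form and the regularized form are paired inside <choice>, which is itself nested in the <name>. The type value is an assumption.

```xml
<!-- Sketch only: the type value "person" is assumed. -->
<p><name type="person"><choice>
    <orig>Walter de la Mare</orig>
    <reg>de la Mare, Walter</reg>
  </choice></name> was born at Charlton, in Kent, in 1873.</p>
```

An application can then display the <orig> reading while sorting or indexing on the <reg> reading.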

Source: [192]

The <index> element discussed in 3.8.2. Index Entries may be more appropriate if the function of the regularization is to provide a consistent index:

Montaillou is not a large parish. At the time of the events which led to Fournier Benedict XII, Pope of Avignon (Jacques Fournier) 's investigations, the local population consisted of between 200 and 250 inhabitants.

Source: [118]

Although adequate for many simple applications, these methods have two inconveniences: if the name occurs many times, then its regularised form must be repeated many times; and the burden of additional XML markup in the body of the text may be inconvenient to maintain and complex to process. For applications such as onomastics, relating to persons or places named rather than the name itself, or wherever a detailed analysis of the component parts of a name is needed, the specialized elements described in chapter 13. Names, Dates, People, and Places or the analytical tools described in chapter 18. Feature Structures should be used.

3.5.2 Addresses

These Guidelines propose the following elements to distinguish postal and electronic addresses:

<address> contains a postal address, for example of a publisher, an organization, or an individual.
<email> (electronic mail address) contains an e-mail address identifying a location to which e-mail messages can be delivered.

These two elements constitute the class of model.addressLike elements; for other kinds of address this class may be extended by adding new elements if necessary. These Guidelines provide no particular means for encoding the substructure of an email address (for example, distinguishing the local part from the domain part), nor of distinguishing personal email addresses from generic or fictitious ones.

editors@tei-c.org

The simplest way of encoding a postal address is to regard it as a series of distinct lines, just as they might be written on an envelope. The following element supports this view:

<addrLine> (address line) contains one line of a postal address.

Here is an example of a postal address encoded using this approach:
110 Southmoor Road, Oxford OX2 6RB, UK
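
As a sketch, the line-by-line view of this address might be marked up as below. The division into three lines is an assumption based on the commas in the text.

```xml
<!-- Sketch only: the split into three lines is assumed. -->
<address>
  <addrLine>110 Southmoor Road,</addrLine>
  <addrLine>Oxford OX2 6RB,</addrLine>
  <addrLine>UK</addrLine>
</address>
```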
Alternatively, an address may be encoded as a structure of more semantically rich elements. The model.addrPart element class identifies a number of such possible components:

<street> a full street address including any name or number identifying a building as well as the name of the street or route on which it is located.
<name> (name, proper noun) contains a proper noun or noun phrase.
<postCode> (postal code) contains a numerical or alphanumeric code used as part of a postal address to simplify sorting or delivery of mail.
<postBox> (postal box or post office box) contains a number or other identifier for some postal delivery point other than a street address.
model.nameLike groups elements which name or refer to a person, place, or organization.
model.persNamePart groups elements which form part of a personal name.
model.placeNamePart groups elements which form part of a place name.

Any number of elements from the model.addrPart class may appear within an address and in any order. None of them is required. Where code letters are commonly used in addresses (for example, to identify regions or countries) a useful practice is to supply the full name of the region or country as the content of the element, but to supply the abbreviatory code as the value of the global n attribute, so that (for example) an application preparing formatted labels can readily find the required information. Other components of addresses may be represented using the general-purpose <name> element or (when the additional module for names and dates is included) the more specialized elements provided for that purpose. Using just the elements defined by the core module, the above address could thus be represented as follows:
110 Southmoor Road Oxford OX2 6RB United Kingdom
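
A sketch of this address using only core-module elements. The type values on <name> are assumptions; the n attribute carries the country code, following the practice described above.

```xml
<!-- Sketch only: the type values "city" and "country" are assumed. -->
<address>
  <street>110 Southmoor Road</street>
  <name type="city">Oxford</name>
  <postCode>OX2 6RB</postCode>
  <name type="country" n="UK">United Kingdom</name>
</address>
```

A label-printing application could read the n value where space is short and the element content otherwise.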
The order of elements within an address is highly culture-specific, and is therefore unconstrained:
Università di Bologna Italy 40126 Bologna via Marsala 24
For further discussion of ways of regularizing the names of places, see section 3.5. Names, Numbers, Dates, Abbreviations, and Addresses. A full postal address may also include the name of the addressee, tagged as above using the general purpose <name> element. When a schema includes the names and dates module discussed in chapter 13. Names, Dates, People, and Places, a large number of more specific elements such as <country> or <settlement> will be available from the class model.addrPart. The above example might then be encoded as follows:
110 Southmoor Road Oxford OX2 6RB United Kingdom
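
With the names and dates module loaded, the same address might be sketched with more specific elements. The choice of <settlement> and <country> here is an assumption about which elements the encoder would pick.

```xml
<!-- Sketch only: assumes the names and dates module is part of the schema. -->
<address>
  <street>110 Southmoor Road</street>
  <settlement>Oxford</settlement>
  <postCode>OX2 6RB</postCode>
  <country>United Kingdom</country>
</address>
```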
3.5.3 Numbers and Measures

This section describes elements provided for the simple encoding of numbers and measurements and gives some indication of circumstances in which this may usefully be done. The following phrase level elements are provided for this purpose:

<num> (number) contains a number, written in any form.
  @type indicates the type of numeric value.
  @value supplies the value of the number in standard form.
<measure> contains a word or phrase referring to some quantity of an object or commodity, usually comprising a number, a unit, and a commodity name.
  @type specifies the type of measurement in any convenient typology.
<measureGrp> (measure group) contains a group of dimensional specifications which relate to the same object, for example the height and width of a manuscript page.

Like names or abbreviations, numbers can occur virtually anywhere in a text. Numbers are special in that they can be written with either letters or digits (twenty-one, xxi, and 21) and their presentation is language-dependent (e.g. English 5th becomes Greek 5.; English 123,456.78 equals French 123.456,78). For many kinds of application, e.g. natural-language processing or machine translation, numbers are not regarded as `lexical' in the same way as other parts of a text. For these and other applications, the <num> element provides a convenient method of distinguishing numbers from the surrounding text. For other kinds of application, numbers are only useful if normalized: here the <num> element is useful precisely because it provides a standardized way of representing a numerical value. For example:

xxxiii
twenty-one
ten percent
10%
5th
one half
1/2

In its fullest form, a measure consists of a number, a phrase expressing units of measure and a phrase expressing the commodity being measured, though not all of these components need be present in every case. It may be helpful to distinguish measures from surrounding text for two reasons.
Firstly, a measure may be expressed using a particular notation or system of abbreviations which the encoder does not wish to regard as lexical. Secondly, a quantitative application may wish to distinguish and normalize the internal components of a measure, in order to perform calculations on them. Consider, as an example of the first case, the following list of Celia's charms, in which the encoder has chosen to make explicit the measurements:
Unimportant Small and round Green White yellow Mobile 13" 11"
Source: [12]

In the same way, it may be convenient to mark representations of currency which might otherwise be misinterpreted as lexical:

...the sum of 12s 6d...
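
A minimal sketch of the currency case. It assumes that currency is an entry in the encoder's own typology for type, since the typology is left to the encoder.

```xml
<!-- Sketch only: the type value "currency" is an assumed typology entry. -->
<p>...the sum of <measure type="currency">12s 6d</measure>...</p>
```

Marked in this way, the string 12s 6d can be skipped by a tokenizer or routed to a currency-aware routine instead of being treated as two unknown words.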

In general, normalization of a measure will require specification of one or more of its three parts: the quantity, the units, and possibly also the commodity being measured. This is accomplished by supplying values for the three attributes quantity, unit, and commodity, which are supplied by the att.measurement class:

att.measurement provides attributes to represent a regularized or normalized measurement.
  @quantity specifies the number of the specified units that comprise the measurement
  @unit indicates the units used for the measurement, usually using the standard symbol for the desired units.
  @commodity indicates the substance that is being measured

With these attributes, the measurement of Celia's neck may be specified in a normalized form: 13"

Such techniques are particularly useful when representing historical data such as inventories: ii bags hops six trusses Woolen and linen goods 5 tonnes coale Source: [207]

The <measureGrp> element is provided as a means of grouping several related measurements together, either because the measurement involves several dimensions (for example height and width) or to avoid the need to repeat all the normalizing attributes: xiv v x

3.5.4 Dates and Times

Dates and times, like numbers, can appear in widely varying culture- and language-dependent forms, and can pose similar problems in automatic language processing. Such elements constitute the model.dateLike class, of which the default members are:

<date> contains a date in any format.
  @calendar indicates the system or calendar to which the date represented by the content of this element belongs.
Suitably classy music starts. Mix through to Wilde's drawing room. A crowd of suitably dressed folk are engaged in typically brilliant conversation, laughing affectedly and drinking champagne. Prince of Wales

My congratulations, Wilde. Your latest play is a great success.

Source: [104]

7.3.1 Technical Information

Traditional stage scripts may contain additional technical information about such production-related factors as lighting, `blocking' (that is, detailed notes on actors' movements), or props required at particular points. More technical information about intended production effects may also appear in published versions of screenplays or movie scripts. Where these are presented simply as marginal notes, they may be encoded using the general-purpose <note> element defined in section 3.8. Notes, Annotation, and Indexing. Alternatively, they may be formally distinguished from other stage directions by using the specialized <tech> element:

<tech> (technical stage direction) describes a special-purpose stage direction that is not meant for the actors.
  @type categorizes the technical stage direction.
  @perf (performance) identifies the performance or performances to which this technical direction applies.

Like stage directions, <tech> elements can appear anywhere within a speech or between speeches.

7.4 Module for Performance Texts

The module described in this chapter makes available the following components:

Module drama: Performance texts
* Elements defined: actor camera caption castGroup castItem castList epilogue move performance prologue role roleDesc set sound tech view

The selection and combination of modules to form a TEI schema is described in 1.2. Defining a TEI Schema.

Chapter 8 Transcriptions of Speech

The module described in this chapter is intended for use with a wide variety of transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support unmodified every variety of research undertaken upon spoken material now or in the future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions.
Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction.

This chapter begins with a discussion of some of the problems commonly encountered in transcribing spoken language (section 8.1. General Considerations and Overview). Section 8.2. Documenting the Source of Transcribed Speech documents some additional TEI Header elements which may be used to document the recording or other source from which transcribed text is taken. Section 8.3. Elements Unique to Spoken Texts describes the basic structural elements provided by this module. Finally, section 8.4. Elements Defined Elsewhere of this chapter reviews further problems specific to the encoding of spoken language, demonstrating how mechanisms and elements discussed elsewhere in these Guidelines may be applied to them.

8.1 General Considerations and Overview

There is great variation in the ways different researchers have chosen to represent speech using the written medium.¹ This reflects the special difficulties which apply to the encoding or transcription of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (for example, tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts.
A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones.

¹ For a discussion of several of these see Edwards and Lampert (eds.) (1993); Johansson (1994); and Johansson et al. (1991).

Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or turns of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes. Below the level of the individual utterance, speech may be segmented into units defined by phonological, prosodic, or syntactic phenomena; no clear agreement exists, however, even as to appropriate names for such segments.

Spoken texts transcribed according to the guidelines presented here are organized as follows. The overall structure of a TEI spoken text is identical to that of any other TEI text: the <TEI> element for a spoken text contains a <teiHeader> element, followed by a <text> element.
Even texts primarily composed of transcribed speech may also include conventional front and back matter, and may even be organized into divisions like printed texts.

We may say, therefore, that these Guidelines regard transcribed speech as being composed of arbitrary high-level units called texts. A spoken <text> might typically be a conversation between a small number of people, a lecture, a broadcast TV item, or a similar event. Each such unit has associated with it a <teiHeader> providing detailed contextual information such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, the physical and social setting in which the discourse takes place and a range of other aspects. Details of the header in general are provided in chapter 2. The TEI Header; the particular elements it provides for use with spoken texts are described below (8.2. Documenting the Source of Transcribed Speech). Details concerning additional elements which may be used for the documentation of participant and contextual information are given in 15.2. Contextual Information.

Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a lecture, a broadcast item, a meeting, etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that:
* it is internally cohesive,
* it is describable by a single header, and
* it represents a single stretch of time with no significant discontinuities.

Deviation from these assumptions may be specified (for example, the org attribute on the <text> element may take the value compos to specify that the components of the text are discrete) but is not recommended.
Within a <text> it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. The neutral <div> element discussed in section 4.1. Divisions of the Body is recommended for this purpose. It may be found useful also for representing subdivisions relating to discourse structure, speech act theory, transactional analysis, etc., provided only that these divisions are hierarchically well-behaved. Where they are not, as is often the case, the mechanisms discussed in chapters 16. Linking, Segmentation, and Alignment and 20. Non-hierarchical Structures may be used.

A spoken text may contain any of the following components:
* utterances
* pauses
* vocalized but non-lexical phenomena such as coughs
* kinesic (non-verbal, non-lexical) phenomena such as gestures
* entirely non-linguistic incidents occurring during and possibly influencing the course of speech
* writing, regarded as a special class of incident in that it can be transcribed, for example captions or overheads displayed during a lecture
* shifts or changes in vocal quality

Elements to represent all of these features of spoken language are discussed in section 8.3. Elements Unique to Spoken Texts below.

An utterance (tagged <u>) may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic incidents may occur and written materials may be presented. The <u> element can thus contain any of the other elements listed, interspersed with a transcription of the lexical items of the utterance; the other elements may all appear between utterances or next to each other, but except for <writing> they do not contain any other elements nor any data.

A spoken text itself may be without substructure, that is, it may consist simply of units such as utterances or pauses, not grouped together in any way, or it may be subdivided. If the notion of what constitutes a `text' in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a `text' may appear even more debatable.
Nevertheless, such divisions may be useful for such types of discourse as debates, broadcasts, etc., where structural subdivisions can easily be identified, or more generally wherever it is desired to aggregate utterances or other parts of a transcript into units smaller than a complete `text'. Examples might include `conversations' or `discourse fragments', or more narrowly, `that part of the conversation where topic x was discussed', provided only that the set of all such divisions is coextensive with the text.

Each such division of a spoken text should be represented by the numbered or un-numbered <div> elements defined in chapter 4. Default Text Structure. For some detailed kinds of analysis a hierarchy of such divisions may be found useful; nested <div> elements may be used for this purpose, as in the following example showing how a collection made up of transcribed `sound bites' taken from speeches given by a politician on different occasions, might be encoded. Each extract is regarded as a distinct <div>, nested within a single composite <div> as follows:
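
A sketch of the nested structure just described. The type and n values are invented for illustration, the org value is an assumption about how the composite division would be flagged, and the utterances themselves are left as placeholders.

```xml
<!-- Sketch only: type, n and org values are assumed; content is elided. -->
<div type="soundbites" org="composite">
  <div type="extract" n="1">
    <!-- utterances transcribed from the first speech -->
  </div>
  <div type="extract" n="2">
    <!-- utterances transcribed from the second speech -->
  </div>
  <div type="extract" n="3">
    <!-- utterances transcribed from the third speech -->
  </div>
</div>
```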
As a member of the class att.declaring, the <div> element may also carry a decls attribute, for use where the divisions of a text do not all share the same set of the contextual declarations specified in the TEI header. (See further section 15.3. Associating Contextual Information with a Text).

8.2 Documenting the Source of Transcribed Speech

Where a computer file is derived from a spoken text rather than a written one, it will usually be desirable to record additional information about the recording or broadcast which constitutes its source. Several additional elements are provided for this purpose within the source description component of the TEI Header:

<scriptStmt> (script statement) contains a citation giving details of the script used for a spoken text.
<recordingStmt> (recording statement) describes a set of recordings used as the basis for transcription of a spoken text.
<recording> (recording event) details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast.
  @type the kind of recording.

As a member of the att.duration class, the <recording> element inherits the following attribute:

att.duration.w3c attributes for recording normalized temporal durations.
  @dur (duration) indicates the length of this element in time.

Note that detailed information about the participants or setting of an interview or other transcript of spoken language should be recorded in the appropriate division of the profile description, discussed in chapter 15. Language Corpora, rather than as part of the source description. The source description is used to hold information only about the source from which the transcribed speech was taken, for example, any script being read and any technical details of how the recording was produced. If the source was a previously-created transcript, it should be treated in the same way as any other source text.

The <scriptStmt> element should be used where it is known that one or more of the participants in a spoken text is speaking from a previously prepared script.
The script itself should be documented in the same way as any other written text, using one of the three citation tags mentioned above. Utterances or groups of utterances may be linked to the script concerned by means of the decls attribute, described in section 15.3. Associating Contextual Information with a Text.

CNN Network News News headlines 12 Jun 91

The <recordingStmt> is used to group together information relating to the recordings from which the spoken text was transcribed. The element may contain either a prose description or, more helpfully, one or more <recording> elements, each corresponding with a particular recording. The linkage between utterances or groups of utterances and the relevant recording statement is made by means of the decls attribute, described in section 15.3. Associating Contextual Information with a Text.

The <recording> element should be used to provide a description of how and by whom a recording was made. This information may be provided in the form of a prose description, within which such items as statements of responsibility, names, places, and dates may be identified using the appropriate phrase-level tags. Alternatively, a selection of elements from the model.recordingPart class may be provided. This element class makes available the following elements:

<date> contains a date in any format.
<caption> contains the text of a caption or other text displayed as part of a film script or screenplay.
<sound> describes a sound effect or musical sequence specified within a screen play or radio script.
  @type categorizes the sound in some respect, e.g. as music, special effect, etc.
  @discrete indicates whether the sound overlaps the surrounding speeches or interrupts them.

Some examples of the use of these elements follow:

Angle on Olivia. Ryan's wife, standing nervously alone on the sidelines, biting her lip. She's scared and she shows it.

Where particular words or phrases within a direction are emphasized (by change of typeface or use of capital letters), an appropriate phrase-level element may be used to indicate the fact, as in the following examples, where certain words in the original are given in small capitals:

George glances at the window--and freezes. New angle--shock cut Out the window the body of a dead man suddenly slams into frame. He dangles grotesquely, held up by his coat caught on a protruding bolt. George gasps. The train whistle screams.

Ext. TV control van--Early morning. The T.V. announcer from the Ryan interview stands near the Control Van, the lake in b.g.

T.V. Announcer

Several years ago, Jack Ryan was a highly successful hydroplane racer ...

All of these elements, like other stage directions, can appear both within and between speeches. TV Announcer VO

Working with Ryan are his two coworkers-- Strut Bowman, the mechanical engineer-- Angle on Strut standing in the tow boat, walkie-talkie in hand, watching Ryan carefully. --and Roger Dalton, a rocket systems analyst, and one of the scientists from the Jet Propulsion Lab ...

Benjy

Now to business.

Ford and Zaphod

To business.

Glasses clink.

Benjy

I beg your pardon?

Ford

I'm sorry, I thought you were proposing a toast.

Source: [1] Zoom in to overlay showing some stock film of hansom cabs galloping past.
London, 1895. The residence of Mr Oscar Wilde.
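
As a sketch, the camera direction and on-screen text above might be tagged with the <camera> and <caption> elements of this module. Treating the displayed text as two separate captions is an assumption.

```xml
<!-- Sketch only: the split into two captions is assumed. -->
<camera>Zoom in to overlay showing some stock film of hansom cabs
  galloping past.</camera>
<caption>London, 1895.</caption>
<caption>The residence of Mr Oscar Wilde.</caption>
```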
<table> contains text displayed in tabular form, in rows and columns.
  @rows indicates the number of rows in the table.
  @cols (columns) indicates the number of columns in each row of the table.
<row> contains one row of a table.
<cell> contains one cell of a table.

The <table> element is defined as a member of the class inter; it may therefore appear both within other components (such as paragraphs), or between them, provided that the module defined in this chapter has been enabled, as described at the beginning of this chapter.

It is to a large extent arbitrary whether a table should be regarded as a series of rows or as a series of columns. For compatibility with currently available systems, however, these Guidelines require a row-by-row description of a table. It is also possible to describe a table simply as a series of cells; this may be useful for tabular material which is not presented as a simple matrix.

The attributes rows and cols may be used to indicate the size of a table, or to indicate that a particular cell or row of a table spans more than one row or column. For both tables and cells, rows and columns are always given in top-to-bottom, left-to-right order, although formatting properties such as those provided by CSS may be used to specify that they should be displayed differently. These Guidelines do not require that the size of a table be specified; for most formatting and many other applications, it will be necessary to process the whole table in two passes in any case.

Where cells span more than one column or row, the encoder must determine whether this is a purely presentational effect (in which case the rend attribute may be more appropriate), whether the part of the table affected would be better treated as a nested table, or whether to use the spanning attributes listed above.

The role attribute may be used to categorize a single cell, or set a default for all the cells in a given row. The present Guidelines distinguish the roles of label and data only, but the encoder may define other roles, such as `derived', `numeric', etc., as appropriate.

These three attributes are provided by the attribute class att.tableDecoration of which both <cell> and <row> are members; see further 1.3.1. Attribute Classes.
The following simple example demonstrates how the data presented as a labelled list in section 3.7. Lists might be represented by an encoder wishing to preserve its original appearance as a table:
Report of the conduct and progress of Ernest Pontifex. Upper Vth form -- half term ending Midsummer 1851

Classics            Idle listless and unimproving
Mathematics         ditto
Divinity            ditto
Conduct in house    Orderly
General conduct     Not satisfactory, on account of his great unpunctuality and inattention to duties
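
One way this report might be encoded row by row, given as a sketch. Marking the first cell of each row with role="label" and stating the table size with rows and cols are choices assumed here, not requirements.

```xml
<!-- Sketch only: use of role="label", rows and cols is an assumed choice. -->
<table rows="5" cols="2">
  <head>Report of the conduct and progress of Ernest Pontifex.
    Upper Vth form -- half term ending Midsummer 1851</head>
  <row>
    <cell role="label">Classics</cell>
    <cell>Idle listless and unimproving</cell>
  </row>
  <row>
    <cell role="label">Mathematics</cell>
    <cell>ditto</cell>
  </row>
  <row>
    <cell role="label">Divinity</cell>
    <cell>ditto</cell>
  </row>
  <row>
    <cell role="label">Conduct in house</cell>
    <cell>Orderly</cell>
  </row>
  <row>
    <cell role="label">General conduct</cell>
    <cell>Not satisfactory, on account of his great unpunctuality
      and inattention to duties</cell>
  </row>
</table>
```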
Source: [26]

Note that this encoding makes no attempt to represent the full significance of the `ditto' cells above; these might be regarded as simple links between the cells containing them and that to which they refer, or as virtual copies of it. For ways of representing either interpretation, see chapter 16. Linking, Segmentation, and Alignment.

The following example demonstrates how a simple statistical table may be represented using this scheme:

Poor Man's Lodgings in Norfolk (Mayhew, 1843)

                    Dossing Cribs or Lodging Houses    Beds    Needys or Nightly Lodgers
Bury St Edmund's    5                                  8       128
Thetford            3                                  6       36
Attleboro'          3                                  5       20
Wymondham           1                                  11      22
Note the use of a blank cell in the first row to ensure that the column labels are correctly aligned with the data. Again, this encoding does not explicitly represent the alignment between column and row labels and the data to which they apply. Where the primary emphasis of an encoding is on the semantic content of a table, a more explicit mechanism for the representation of structured information such as that provided by the feature structure mechanism described in chapter 18. Feature Structures may be preferred. Alternatively, the general purpose linkage and alignment mechanisms described in chapter 16. Linking, Segmentation, and Alignment may also be applied to individual cells of a table.

The content of a table cell need not be simply character data. It may also contain any sequence of the phrase-level elements described in chapter 3. Elements Available in All TEI Documents, thus allowing for the encoding of potentially more useful semantic information, as in the following example, where the fact that one cell contains a number and the other contains a place name has been explicitly recorded:

US State populations, 1990

Wyoming         453,588
Alaska          550,043
Montana         799,065
Rhode Island    1,003,464
The role attribute provides a slightly less verbose means of conveying the same information:

US State populations, 1990

Wyoming         453,588
Alaska          550,043
Montana         799,065
Rhode Island    1,003,464
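
The two approaches can be contrasted on the first two rows of this table. This is a sketch: the role values statename and pop are encoder-defined names assumed here, and the value attribute on <num> gives the figure without its thousands separator.

```xml
<!-- Sketch only: "statename" and "pop" are assumed, encoder-defined roles. -->

<!-- Variant 1: phrase-level elements inside each cell -->
<table>
  <head>US State populations, 1990</head>
  <row>
    <cell><name type="place">Wyoming</name></cell>
    <cell><num value="453588">453,588</num></cell>
  </row>
  <row>
    <cell><name type="place">Alaska</name></cell>
    <cell><num value="550043">550,043</num></cell>
  </row>
</table>

<!-- Variant 2: the role attribute on each cell -->
<table>
  <head>US State populations, 1990</head>
  <row>
    <cell role="statename">Wyoming</cell>
    <cell role="pop">453,588</cell>
  </row>
  <row>
    <cell role="statename">Alaska</cell>
    <cell role="pop">550,043</cell>
  </row>
</table>
```

The first variant records a machine-readable value for each figure; the second is shorter but leaves the interpretation of each role to the application.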
The use of semantically marked elements within a <table> enables the encoder to convey something about the nature and significance of the information, rather than merely suggesting how to display it in rows and columns.

14.1.2 Other Table Schemas

Many authoring systems include built-in support for their own or for public table schemas. These provide an enhanced user interface and good formatting capabilities, but are often product-specific, despite their use of an XML markup language.

The DTD developed by the Association of American Publishers (AAP) and standardized in ANSI Z39.59 provided a very simple encoding for correspondingly simple tables. This has been further developed, together with the table DTD documented in ISO Technical Report 9537, and now forms part of ISO 12083. The TEI table model described above has functionality very similar to that defined by ISO 12083.

For more complex tables, the most effective publicly-available DTD is probably that developed by the US Department of Defense CALS project. This supports vertical and horizontal spanning and various kinds of text rotation and justification within cells and is also directly supported by a number of existing SGML software systems. The CALS table model is much too complex to describe fully here; for historical background see http://www.hbingham.com/technical/tables/calstbhs.htm; for more recent simplifications of it and current implementations see http://www.oasis-open.org/specs/tablemodels.php. As with any other XML vocabulary, the XML version of the CALS model may readily be included in a TEI schema, using the techniques described in 23.2. Personalization and Customization.

The XHTML table model (XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition) (2000)) is based on the HTML table model (Raggett et al. (eds.) (1999)). Both models support arrangement of arbitrary data into rows and columns of cells.
Table rows and columns may be grouped to convey additional structural information and may be rendered by user agents in ways that emphasize this structure. Support for incremental rendering of tables and for rendering on 'non-visual' user agents is also available. Special elements and attributes are provided to associate metadata with tables. These indicate the table's purpose, or are for the benefit of people using speech or Braille-based user agents.

Tables are not recommended for use purely as a means to lay out document content, as this leads to many accessibility problems (see further http://www.w3.org/TR/WCAG10-HTML-TECHS/#tables). Stylesheets provide a far more effective means of controlling layout and other visual characteristics in both HTML and XML documents.

14.2 Formulæ and Mathematical Expressions

Mathematical and chemical formulæ pose problems similar to those posed by tables in that rendition may be of great significance and hard to disentangle from content. They also require access to a wide range of special characters, for most of which standard entity names already exist in the documented ISO entity sets (see further chapters vi Languages and Character Sets and 5. Representation of Non-standard Characters and Glyphs).

Formulæ and tables are also similar in that well-researched and detailed DTD fragments have already been developed for them independently of the TEI. They differ in that (for mathematics at least) there also exists a richly detailed text-based but non-SGML notation which is very widely used: this is the TeX system, and the sets of descriptive macros developed for it such as LaTeX, AMS-TeX, and AMS-LaTeX.

The AAP and ISO standards mentioned in section 14.1. Tables above both provide DTDs for equations as well as for tables, which now form part of ISO 12083.
The European Mathematical Trust, an organization set up specifically to enhance research support for European mathematicians, has also defined a general purpose mathematical DTD known as EuroMath (http://www.dcs.fmph.uniba.sk/~emt/), for which it provides both software and services. Most if not all of the functionality provided by these DTDs can now be found in the OpenMath and MathML XML-based systems briefly described below.

As with tables, in all the SGML and XML solutions a tension exists between the need to encode the way a formula is written (its appearance) and the need to represent its semantics. If the object of the encoding is purely to act as an interchange format among different formatting programs, then there is no need to represent the mathematical meaning of an expression. If however the object is to use the encoding as input to an algebraic manipulation system (such as Mathematica or Maple) or a database system, simply representing superscripts and subscripts will clearly be inadequate.

The present Guidelines make no attempt to add to the number of available DTDs for representing formulæ. Instead, we recommend that the user make an informed choice from those already available. The module described in this chapter makes available only the following element, which should be used to encode any formula, no matter what notation is employed:

<formula> contains a mathematical or other formula.
  @notation supplies the name of a previously defined notation used for the content of the element.

By default, a <formula> is assumed to contain character data which is not validated in any way:

$e=mc^2$

The character data must still be well-formed, of course, which means that < and & must be escaped with entity references or numeric character references, e.g.

$\matrix{0 &amp; 1\cr&lt;0&amp;>1}$

If desired, the content of the <formula> element may be redefined to include elements defined by some other module, such as that of ISO 12083, or to use elements from the more recently defined OpenMath or MathML schemas.

When the content of a <formula> element is not expressed in XML, the notation used should be specified using the notation attribute as above, and in the following longer example:

Achilles runs ten times faster than the tortoise and gives the animal a headstart of ten meters. Achilles runs those ten meters, the tortoise one; Achilles runs that meter, the tortoise runs a decimeter; Achilles runs that decimeter, the tortoise runs a centimeter; Achilles runs that centimeter, the tortoise, a millimeter; Fleet-footed Achilles, the millimeter, the tortoise, a tenth of a millimeter, and so on to infinity, without the tortoise ever being overtaken. . . Such is the customary version. The problem does not change, as you can see; but I would like to know the name of the poet who provided it with a hero and a tortoise. To those magical competitors and to the series $$ {1 \over 10} + {1 \over 100} + {1 \over 1000} + {1 \over 10,\!000} + \dots $$ the argument owes its fame.

Source: [20]

The notation attribute supplies the name of a notation ('TeX'), which is expected to be identified somewhere in document metadata.

Mathematical Markup Language (MathML) (Carlisle et al. (eds.) (2003)) is a vocabulary for describing mathematical notation, capturing both its structure and content. It provides two types of markup: Presentation Markup, which captures the notational structure of an expression and could be seen as the 'TeX for the Web', and Content Markup, which captures the mathematical structure of an expression. Most of its content elements correspond with the range of operators, relations, and named functions typically found at the high-school level of mathematics.

The tortoise example given above in TeX can be re-expressed in MathML as

1 10 + 1 100 + 1 1000 + 1 10000 + ...

MathML 2.0 provides support for a 'Semantic Math-Web', XML namespaces, and other current XML standards, such as XML DOM, OMG IDL, ECMAScript, and Java. It also provides a modularized version of the MathML DTD so that MathML fragments 'embedded' in XHTML 1.1 documents can be correctly validated.

The OpenMath (http://www.nag.co.uk/projects/OpenMath.html) project is coordinated by the OpenMath Society (http://www.openmath.org/) and funded by the European Commission under the Esprit Multimedia Standards Initiative that commenced in September 1997. It is likely to become a key standard for communicating semantically rich representations of mathematical objects both on and off the Web in a platform-independent manner. The OpenMath Standard (http://www.openmath.org/V2/standard/index.html) consists of specifications for

1. OpenMath objects, representing the structure of formulæ (http://www.openmath.org/V2/standard/objects.html);
2. Content Dictionaries, providing semantic context (http://www.openmath.org/V2/standard/cd.html);
3. Encodings, both binary (http://www.openmath.org/V2/standard/binary.html) and XML (http://www.openmath.org/V2/standard/xml.html).

OpenMath and MathML have certain common aspects. They both use prefix operators, both are XML-based, and they both construct their objects by applying certain rules recursively. Such similarities facilitate mapping between the two standards.

There are also some key differences between MathML and OpenMath. OpenMath does not provide support for presentation of mathematical objects, and its scope of semantically-oriented elements is much broader than that of MathML, with the expressive power to cover virtually all areas of computational mathematics. In fact, a particular set of Content Dictionaries, the 'MathML CD Group', covers the same areas of mathematics as the Content Markup elements of MathML 2.0.

Finally, OMDoc (http://www.mathweb.org/omdoc/) is an extension of the OpenMath standard that supplies markup for structures such as axioms, theorems, proofs, definitions, and texts (mixing formal content with mathematical text).

In-line versus block placement for an equation can be distinguished if desired, via the global rend attribute. The global n and xml:id attributes may also be used to label or identify the formula, as in the following example:

The volume of a sphere is given by the formula: $V = \frac{4}{3}\pi r^{3}$ which is readily calculated.

As we have seen in equation , ...
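The labelling and cross-referencing just illustrated might be marked up along the following lines. This is a hedged sketch rather than the markup of the example itself: the identifier f12, the label 12, and the rend value inline are assumptions, and the formula is written here in TeX notation for brevity where any other notation could equally be used.

```xml
<p>The volume of a sphere is given by the formula:
  <!-- xml:id, n and rend values are illustrative assumptions -->
  <formula xml:id="f12" n="12" rend="inline" notation="TeX">$V = \frac{4}{3}\pi r^{3}$</formula>
  which is readily calculated.</p>
<p>As we have seen in equation <ptr target="#f12"/>, ...</p>
```

The pointer in the second paragraph refers to the formula by its xml:id, so the visible equation number can be generated from the n attribute at rendering time.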

14.3 Specific Elements for Graphic Images

The following special purpose elements are used to indicate the presence of graphic images within a document:
<figure> groups elements representing or containing graphic information such as an illustration or figure.
<graphic> indicates the location of an inline graphic, illustration, or figure.
<binaryObject> provides encoded binary data representing an inline graphic or other object.
<figDesc> (description of figure) contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it.

The <graphic> and <binaryObject> elements form part of the common core module, and are discussed in section 3.9. Graphics and other non-textual components. The <figure> element is used to contain images, captions, and textual descriptions of the pictures. The images themselves are specified using the <graphic> element, whose url attribute provides the location of an image. For example:
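A minimal sketch of such an encoding follows. The file name fig1.png is a hypothetical location chosen for illustration.

```xml
<figure>
  <!-- url value is an illustrative assumption -->
  <graphic url="fig1.png"/>
</figure>
```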
Three kinds of content may be supplied inside a <figure> element: the <head> element may be used to transcribe (or supply) a descriptive heading or title for the graphic itself as in this example:
Figure One: The View from the Bridge
Figures are often accompanied not only by a title or heading, but by a paragraph or so of commentary or caption. One or more <p> or <ab> elements may be used to transcribe any caption or discussion of the figure in the source:

Above:

The drawing room of the Pullman house, the white and gold saloon where the magnate delighted in giving receptions for several hundred people.

The figure shows an elaborately decorated room, at least twenty-five feet side to side and fifty feet long, with ornate mouldings and Corinthian columns on the walls, overstuffed armchairs and loveseats arranged in several conversational groupings, and two large chandeliers.
Source: [133]

Here, the paragraph 'The drawing room ... several hundred people' is transcribed from the source, while the description is provided by the encoder, for use by applications which cannot display the graphic directly. In documents created in electronic form with the needs of print-handicapped readers in mind, the <figDesc> element may be provided by the author rather than a subsequent encoder.
Figure One: The View from the Bridge A Whistleresque view showing four or five sailing boats in the foreground, and a series of buoys strung out between them.
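A figure carrying both a heading and an author-supplied description might be encoded roughly as follows. This is a sketch only: the url value is an assumption, and the element order shown is one plausible arrangement.

```xml
<figure>
  <!-- url value is an illustrative assumption -->
  <graphic url="fig1.png"/>
  <head>Figure One: The View from the Bridge</head>
  <figDesc>A Whistleresque view showing four or five sailing boats in the
    foreground, and a series of buoys strung out between them.</figDesc>
</figure>
```

An application unable to display the image can fall back on the content of <figDesc>, while the <head> remains available as the visible caption.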
Where the graphic itself contains large amounts of text, perhaps with a complex structure, and perhaps difficult to distinguish from the graphic, the encoder should choose whether to regard the graphic as containing the text (in which case, a nested <text> element may be included within the <figure> element) or to regard the enclosed text as being a separate division of the <text> element in which the graphic appears. In this latter case, an appropriate <div> or <div1> (etc.) element may be used for the text represented within the graphic, and the <figure> element embedded within it. The choice will depend to a large degree on the encoder's understanding of the relationship between the graphic and the surrounding text.

A figure which is internally divided, or contains sub-figures, may be encoded with nested <figure> elements, as in the following example.
Parallel
Perspective
The two canonical view volumes, for the (a) parallel and (b) perspective projections. Note that -z is to the right.
Source: [76]

Like any other element in the TEI scheme, figures may be given identifiers so that they can be aligned with other elements, and linked to or from them, as described in chapter 16. Linking, Segmentation, and Alignment. Some common examples are discussed briefly here; full information is provided in that chapter.

It is often desirable to maintain two versions of an image in an electronic file: one a low resolution or 'thumbnail' version which, when selected by the user, causes the other, high resolution, version to be accessed. In TEI terms, the thumbnail image acts as a reference to the other. Supposing that a thumbnail version of the figure discussed above is available as 'fig1th.png', we might embed a reference to the image using the simple <ref> element discussed in section 3.6. Simple Links and Cross-References:

Click here for enlightenment
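One way of expressing this thumbnail reference is sketched below. The identifier fig1 and the file name fig1.png are assumptions made for illustration; only fig1th.png is named in the text above.

```xml
<!-- full-size figure; xml:id and url are illustrative assumptions -->
<figure xml:id="fig1">
  <graphic url="fig1.png"/>
</figure>

<!-- elsewhere in the document: the thumbnail acts as the link text -->
<ref target="#fig1">
  <graphic url="fig1th.png"/>
  Click here for enlightenment
</ref>
```

Here the thumbnail is simply part of the content of the <ref>, so selecting it takes the reader to the element identified as fig1.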
Another common requirement is to associate part or the whole of an image with a textual element not necessarily contiguous to it in the text; this is sometimes known as a callout. When the module for transcription is included in a schema, specific attributes for parts of a text and parts (or all) of a digital image are available; these are discussed in 11.1. Digital Facsimiles. In addition, chapter 16. Linking, Segmentation, and Alignment may be consulted for other mechanisms available for this purpose.

The following example assumes that we wish to associate one portion of the image held as 'fig1' with chapter two of some text, and another portion of it with chapter three. The application may be thought of as a hypertext browser in which the user selects from a graphic image which part of a text to read next, but the mechanism is independent of this particular application.

The first requirement is some way of identifying and hence pointing to sub-parts of a graphic image. This may be done by pointing into an XML graphic representation, for example an SVG file. Thus

These elements identify two areas within the image 'Fig1' by pointing at elements inside the XML file Fig1.svg, which contains the following.

The next requirement is some way of identifying the parts of the document to which a link is to be made. The most obvious way of doing this is to use the global xml:id attribute:

Now, all that is needed to link these areas to the relevant chapters is a <link> element, as described in section 16.1. Links:

In this example, the SVG representation of the graphic is stored externally to the TEI document and linked by means of a pointer. It is also possible to embed the SVG representation directly within the TEI by extending the content model of the <figure> element to permit an element from the SVG namespace. Like other customizations of the TEI scheme, this is carried out using the techniques documented in section 1.2. Defining a TEI Schema; further examples are provided in chapter 16. Linking, Segmentation, and Alignment.

14.4 Overview of Basic Graphics Concepts

The first major distinction in graphic representation is that between raster graphics and vector graphics. A raster image is a list of points, or dots. Scanners, fax machines and other simple devices easily produce digital raster images, and such images are therefore quite common. A vector image, in contrast, is a list of geometrical objects, such as lines, circles, arcs, or even cubes. These are much more difficult to produce, and so are mainly encountered as the output of sophisticated systems such as architectural and engineering CAD programs.

Raster images are difficult to modify because by definition they only encode single points: a line, for example, cannot grow or shrink as such, since it is not identified as such. Only its component parts are identified, and only they can be manipulated. Therefore the resolution or dot-size of a raster image is important, which is not the case with vector images. It is also far more difficult to convert raster images to vector images than to perform the opposite conversion.

Raster images generally require more storage space than vector images, and a wide variety of methods exists for compressing them; the variation in these methods leads to corresponding variations in representations for storage and transmission of raster images.

Motion video usually consists of a long series of raster images. Data compression is even more effective on video than on single raster images (mainly owing to redundancy which arises from the usual similarity of adjacent frames).
Notations for representing full-motion video are hotly debated at this time, and any user of these Guidelines would do well to obtain up-to-date expert advice before undertaking a project using them.

The compression methods used with any of these image types may be 'lossy' or 'lossless'. Methods for lossy compression save space by discarding a small portion of the image's detail, such as fine distinctions of shading. When decompressed, therefore, such an image will be only a close approximation of the original. In contrast, lossless compression guarantees that the exact uncompressed image will be reproducible from the compressed form: only truly redundant information is removed. In general, therefore, lossless compression does not save quite so much space as lossy compression, though it does guarantee fidelity to the original uncompressed image.

Raster images may be characterized by their resolution, which is the number of dots per inch used to represent the image. Doubling the resolution will give a more precise image, but also quadruple the storage requirement (before compression), and affect processing time for any operations to be performed, such as displaying an image for a reader. Motion video also has resolution in time: the number of frames to be shown per second. Encoders should consider carefully what resolution(s) and frame rate(s) to use for particular applications; these Guidelines express no recommendation in this matter, save the universal ones of consistency and documentation.

Within any image, it is typical to refer to locations via Cartesian coordinate axes: values for x, y, and sometimes z and/or time. However, graphic notations vary in whether coordinates count from left-to-right and top-to-bottom, or another way. They also vary in whether coordinates are considered real (inches, millimeters, and so on), or virtual (dots).
These Guidelines do not recommend any of these methods over another, but all decisions made should be applied consistently, and documented in the <encodingDesc> section of the TEI header (see note 1). Methods of aligning images and text are discussed in 11.1. Digital Facsimiles.

The chromatic values of an image may be rendered in many different ways. In monochrome images every displayed point is either black or white. In gray-scale images, each point is rendered in some shade of gray, the number of shades varying from system to system. In true polychrome images, points are rendered in different hues, again with varying limitations affecting the number of distinct shades and the means by which they are displayed.

Note 1: Since no special purpose element is provided for this purpose by the current version of the Guidelines, such information should be provided as one or more distinct paragraphs at the end of the <encodingDesc> element described in section 2.3. The Encoding Description.

14.5 Graphic Image Formats

As noted above, there exists a wide variety of different graphics formats, and the following list is in no way exhaustive. Moreover, inclusion of any format in this list should not be taken as indicating endorsement by the TEI of this format or any products associated with it. Some of the formats listed here are proprietary to a greater or lesser extent and cannot therefore be regarded as standards in any meaningful sense. They are however widely used by many different vendors.
The following formats are widely used at the present time, and likely to remain supported by more than one vendor's software:

* BMP: Microsoft bitmap format
* CGM: Computer Graphics Metafile
* GIF: Graphics Interchange Format
* JPEG: Joint Photographic Experts Group
* PBM: Portable Bit Map
* PCX: IBM PC raster format
* PICT: Macintosh drawing format
* PNG: Portable Network Graphics format
* Photo-CD: Kodak Photo Compact Disk format
* QuickTime: Apple real-time image system
* SMIL: Synchronized Multimedia Integration Language format
* SVG: Scalable Vector Graphics format
* TIFF: Tagged Image File Format

Brief descriptions of all the above are given below. Where possible, current addresses or other contact information are shown for the originator of each format. Many formal standards, especially those promulgated by ISO and many related national organizations (ANSI, DIN, BSI, and many more), are available from those national organizations. Addresses may be found in any standard organizational directory for the country in question.

14.5.1 Vector Graphic Formats

CGM: Computer Graphics Metafile
This vector graphics format is specified by an ISO standard, ISO 8632:1987, amended in 1990. It defines binary, character, and plain-text encodings; the non-binary forms are safer for blind interchange, especially over networks. Documentation on CGM is available from ISO and from its member national bodies such as AFNOR, ANSI, BSI, DIN, JIS, etc.

SVG: Scalable Vector Graphics format
SVG is a language for describing two-dimensional vector and mixed vector or raster graphics in XML. It is defined by the Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 04 September 2001, and is available at http://www.w3.org/TR/2001/REC-SVG-20010904/.

PICT: Macintosh drawing format
This format is universally supported on Macintosh (tm) systems, and readable by a limited range of software for other systems. Documentation is available from Apple Computer Company, Cupertino, California USA.
14.5.2 Raster Graphic Formats

PNG: Portable Network Graphics format
PNG is a non-proprietary raster format currently widely available. It provides an extensible file format for the lossless, portable, well-compressed storage of raster images. Indexed-color, grayscale, and truecolor images are supported, plus an optional alpha channel. Sample depths range from 1 to 16 bits. It is defined by IETF RFC 2083, March 1997.

TIFF: Tagged Image File Format
Currently the most widely supported raster image format, especially for black and white images, TIFF is also one of the few formats commonly supported on more than one operating system. The drawback to TIFF is that it actually is a wrapper for several formats, and some TIFF-supporting software does not support all variants. TIFF files may use LZW, CCITT Group 4, or PackBits compression methods, or may use no compression at all. Also, TIFF files may be monochrome, greyscale, or polychromatic. All such options should be specified in prose at the end of the <encodingDesc> section of the TEI header for any document including TIFF images. TIFF is owned by Aldus Corporation. Documentation on TIFF is available from them at Craigcook Castle, Craigcook Road, Edinburgh EH4 3UH, Scotland, or 411 First Avenue South, Seattle, Washington 98104 USA.

GIF: Graphics Interchange Format
Raster images are widely available in this form, which was created by CompuServe Information Services, but has by now been implemented for many other systems as well. Documentation on GIF is copyright by, and is available from, CompuServe Incorporated, Graphics Technology Department, 5000 Arlington Center Boulevard, Columbus, Ohio 43220 USA.

PBM: Portable Bit Map
PBM files are easy to process, eschewing all compression in favor of transparency of file format. PBM files can, of course, be compressed by generic file-compression tools for storage and transfer. Public domain software exists which will convert many other formats to and from PBM.
Documentation on PBM is copyright by Jef Poskanzer, and is available widely on the Internet.

PCX: IBM PC raster format
This format is used by most IBM PC paint programs, and supports both monochrome and polychromatic images. Documentation is available from ZSoft Corporation, Technical Support Department, ATTN: Technical Reference Manual, 450 Franklin Rd. Suite 100, Marietta, GA 30067 USA.

BMP: Microsoft bitmap format
This format is the standard raster format for computers using Microsoft Windows (tm) or Presentation Manager (tm). Documentation is available from Microsoft Corporation.

14.5.3 Photographic and Motion Video Formats

JPEG: Joint Photographic Experts Group
This standard is sponsored by CCITT and by ISO. It is ISO/IEC Draft International Standard 10918-1, and CCITT T.81. It handles monochrome and polychromatic images with a variety of compression techniques. JPEG per se, like CCITT Group IV, must be encapsulated before transmission; this can be done via TIFF, or via the JPEG File Interchange Format (JFIF), as commonly done for Internet delivery.

QuickTime: Apple real-time image system
QuickTime is a proprietary method introduced by Apple Computer Company to synchronize the display of various data. The data can include frames of video, sound, lighting control mechanisms, and other things. Viewers for QuickTime productions are available for Apple and other computers. Further information is available from Apple Computer Incorporated, 10201 North de Anza Boulevard MS 23AQ, Cupertino, California 95014 USA.

Photo-CD: Kodak Photo Compact Disk format
This format was introduced by Kodak for rasterizing photographs and storing them on CD-ROMs (about one hundred 35mm file images fit on one disk), for display on televisions or CD-I systems. Information on Photo-CD is available from Kodak Limited, Research and Development, Headstone Drive, Harrow, Middlesex HA1 4TY, UK.

SMIL: Synchronized Multimedia Integration Language format
SMIL is a W3C Recommendation which supports the integration of independent multimedia objects into a synchronized multimedia presentation. It provides multimedia authors with easily-defined basic timing relationships, fine-tuned synchronization, spatial layout, direct inclusion of non-text and non-image media objects, hyperlink support for time-based media, and adaptiveness to varying user and system characteristics.

SMIL 1.0 (http://www.w3.org/TR/REC-smil/) became a W3C Recommendation on June 15, 1998, and was further developed in SMIL 2.0. SMIL 2.0 adds native support for transitions, animation, event-based interaction, extended layout facilities, and more sophisticated timing and synchronization primitives to the SMIL 1.0 language. It also allows reuse of SMIL syntax and semantics in other XML-based languages, in particular those which need to represent timing and synchronization. For example, SMIL 2.0 components are used for integrating timing into XHTML Document Types and into SVG.

SMIL 2.0 also provides recommendations for Document Types based on SMIL 2.0 Modules (http://www.w3.org/TR/smil20/smil-modules.html). One such Document Type is the SMIL 2.0 Language Profile (http://www.w3.org/TR/smil20/smil20-profile.html). It contains support for all of the major SMIL 2.0 features including animation, content control, layout, linking, media object, meta-information, structure, timing, and transition effects, and is designed for Web clients that support direct playback from SMIL 2.0 markup.

SMIL 2.0 (http://www.w3.org/TR/smil20/) became a W3C Recommendation on August 7, 2001, becoming the first vocabulary to provide XML Schema support and to have reached such status.

As noted above, the reader will encounter many, many other graphics formats.
14.6 Module for Tables, Formulæ, and Graphics

The module described in this chapter provides the following features:

Module figures: Tables, formulæ, and figures
* Elements defined: cell figDesc figure formula row table

The selection and combination of modules to form a TEI schema is described in 1.2. Defining a TEI Schema.

Chapter 15 Language Corpora

The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.

These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed 'core' language. A corpus may be made up of whole texts or of fragments or text samples. It may be a 'closed' corpus, or an 'open' or 'monitor' corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section 15.5. Recommendations for the Encoding of Large Corpora). For simplicity, therefore, our discussion largely concerns ways of encoding closed corpora, regarded as single but composite texts.
Language corpora are regarded by these Guidelines as composite texts rather than unitary texts (on this distinction, see chapter 4. Default Text Structure). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules.

Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter vi Languages and Character Sets); nor do we attempt to summarize the variety of elements provided for encoding basic structural features such as quoted or highlighted phrases, cross-references, lists, notes, editorial changes and reference systems (see chapter 3. Elements Available in All TEI Documents). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter 6. Verse), drama (chapter 7. Performance Texts), transcriptions of spoken text (chapter 8. Transcriptions of Speech), etc. Chapter 1. The TEI Infrastructure should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application.

In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators. This chapter does however include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section 15.1. Varieties of Composite Text). Section 15.2. Contextual Information describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section 15.3. Associating Contextual Information with a Text discusses a mechanism by which individual parts of the TEI Header may be associated with different parts of a TEI-conformant text. Section 15.4. Linguistic Annotation of Corpora reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section 15.5. Recommendations for the Encoding of Large Corpora provides some general recommendations about the use of these Guidelines in the building of large corpora.

15.1 Varieties of Composite Text

Both unitary and composite texts may be encoded using these Guidelines; composite texts, including corpora, will typically make use of the following tags for their top-level organization.

<teiCorpus> contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
<TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a <teiCorpus> element.
<teiHeader> (TEI Header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text.
    @type specifies the kind of document to which the header is attached, for example whether it is a corpus or individual text.
<text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
<group> contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.

Full descriptions of these may be found in chapter 2. The TEI Header (for <teiHeader>), and chapter 4. Default Text Structure (for the remaining elements); this section discusses their application to composite texts in particular.

In these Guidelines, the word text refers to any stretch of discourse, whether complete or incomplete, unitary or composite, which the encoder chooses (perhaps merely for purposes of analytic convenience) to regard as a unit. The term composite text refers to texts within which other texts appear; the following common cases may be distinguished:

* language corpora
* collections or anthologies
* poem cycles and epistolary works (novels or essays written in the form of collections or series of letters)
* otherwise unitary texts, within which one or more subordinate texts are embedded

The elements listed above may be combined to encode each of these varieties of composite text in different ways.

In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too.
Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.

The <teiCorpus> element is intended for the encoding of language corpora, though it may also be useful in encoding newspapers, electronic anthologies, and other disparate collections of material. The individual samples in the corpus are encoded as separate <TEI> elements, and the entire corpus is enclosed in a <teiCorpus> element. Each sample has the usual structure for a <TEI> document, comprising a <teiHeader> followed by a <text> element. The corpus, too, has a corpus-level <teiHeader> element, in which the corpus as a whole, and encoding practices common to multiple samples, may be described. The overall structure of a TEI-conformant corpus is thus a single <teiCorpus> element containing one corpus-level <teiHeader> followed by one or more <TEI> elements, each with its own <teiHeader> and <text>.

Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the <teiHeader> element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section 15.2. Contextual Information, as well as in chapter 2. The TEI Header. Information of this type should in general be specified only once: a variety of methods are provided for associating it with individual components of a corpus, as further described in section 15.3. Associating Contextual Information with a Text.

In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The <teiCorpus> element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged <TEI>.
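The two-level structure described above, with a corpus-level header followed by individually headed samples, might be sketched as follows. This is a minimal illustration rather than a complete corpus: the titles, the sample contents, and the type values "corpus" and "text" on the headers are assumptions made for the sketch and should be checked against the element specifications.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Corpus-level header: information common to the whole corpus -->
  <teiHeader type="corpus">
    <fileDesc>
      <titleStmt><title>A hypothetical sample corpus</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Compiled from several distinct sources.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- First sample: its own header, then its text -->
  <TEI>
    <teiHeader type="text">
      <fileDesc>
        <titleStmt><title>Sample one</title></titleStmt>
        <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
        <sourceDesc><p>Hypothetical source of sample one.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text><body><p>Text of the first sample.</p></body></text>
  </TEI>
  <!-- Second sample, structured in the same way -->
  <TEI>
    <teiHeader type="text">
      <fileDesc>
        <titleStmt><title>Sample two</title></titleStmt>
        <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
        <sourceDesc><p>Hypothetical source of sample two.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text><body><p>Text of the second sample.</p></body></text>
  </TEI>
</teiCorpus>
```

Anything stated once in the corpus-level header need not be repeated in the headers of the individual samples.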
If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the <group> element described below and in section 4.3.1. Grouped Texts. The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section 15.2.1. The Text Description, and those for associating declarations with corpus components described in section 15.3. Associating Contextual Information with a Text. These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and mis-classification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.

Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.

Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts.
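As a sketch of this structure, an anthology or cycle might be encoded as a single text whose body is replaced by a group of component texts, with the common front matter held once at the outer level. The titles and contents below are hypothetical, and only the elements needed to show the nesting are included.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A hypothetical anthology</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Hypothetical printed anthology.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <!-- Front matter shared by the whole anthology -->
    <front>
      <div type="preface"><p>Editorial preface to the collection.</p></div>
    </front>
    <!-- The group takes the place of a body and holds the collected texts -->
    <group>
      <text>
        <body><p>First collected text.</p></body>
      </text>
      <text>
        <body><p>Second collected text.</p></body>
      </text>
    </group>
  </text>
</TEI>
```

Each inner text remains identifiable as a distinct object for study, while the outer text represents the collection as a unit.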
The <group> element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the <group> element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4. Default Text Structure.

Some composite texts, finally, are neither corpora, nor anthologies, nor cyclic works: they are otherwise unitary texts within which other texts are embedded. In general, they may be treated in the same way as unitary texts, using the normal elements for such texts. The embedded text itself may be encoded using the <floatingText> element, which may occur within quotations or between paragraphs or other chunk-level elements inside the sections of a larger text. For further discussion, see chapter 4. Default Text Structure.

All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts can be encoded using the same module, then no problem arises. If however they require different modules, then these must be included in the schema. This process is described in more detail in section 1.1. TEI Modules.

15.2 Contextual Information

Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook.
Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).

Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI Header, as described in chapter 2. The TEI Header. In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section 15.3. Associating Contextual Information with a Text below.

Chapter 2. The TEI Header, which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section 2.2. The File Description); information about the encoding practices followed with the corpus, for example its design principles, editorial practices, reference system, etc. (see section 2.3. The Encoding Description); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section 2.4.
The Profile Description); and version information documenting any changes made in the electronic text (see section 2.5. The Revision Description).

In addition to the elements defined by chapter 2. The TEI Header, several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being `for language corpora'.

When the module defined in this chapter is included in a schema, a number of additional elements become available within the <profileDesc> element of the TEI Header (discussed in section 2.4. The Profile Description).

<textDesc> (text description) provides a description of a text in terms of its situational parameters.
<particDesc> (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
<settingDesc> (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.

These elements, members of the model.profileDescPart class, are discussed in the remainder of the chapter.

15.2.1 The Text Description

The <textDesc> element provides a full description of the situation within which a text was produced or experienced, and thus characterizes it in a way relatively independent of any a priori theory of text-types. It is provided as an alternative or a supplement to the common use of descriptive taxonomies used to categorize texts, which is fully described in section 2.4.3. The Text Classification, and section 2.3.6.
The Classification Declaration. The description is organized as a set of values and optional prose descriptions for the following eight situational parameters, each represented by one of the following eight elements:

<channel> (primary channel) describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc.
    @mode specifies the mode of this channel with respect to speech and writing.
<constitution> describes the internal composition of a text or text sample, for example as fragmentary, complete, etc.
    @type specifies how the text was constituted.
<derivation> describes the nature and extent of originality of this text.
    @type categorizes the derivation of the text.
<domain> (domain of use) describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc.
    @type categorizes the domain of use.
<factuality> describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world.
    @type categorizes the factuality of the text.
<interaction> describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary, etc.
    @type specifies the degree of interaction between active and passive participants in the text.
    @active specifies the number of active participants (or addressors) producing parts of the text.
    @passive specifies the number of passive participants (or addressees) to whom a text is directed or in whose presence it is created or performed.
<preparedness> describes the extent to which a text may be regarded as prepared or spontaneous.
    @type a keyword characterizing the type of preparedness.
<purpose> characterizes a single purpose or communicative function of the text.
    @type specifies a particular kind of purpose.
    @degree specifies the extent to which this purpose predominates.

These elements constitute a model class called model.textDescPart; new parameters may be defined by defining new elements and adding them to that class, as further described in 23.2. Personalization and Customization.

By default, a text description will contain each of the above elements, supplied in the order specified. Except for the <purpose> element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content `not applicable' or an equivalent phrase.

Texts may be described along many dimensions, according to many different taxonomies. No generally accepted consensus as to how such taxonomies should be defined has yet emerged, despite the best efforts of many corpus linguists, text linguists, sociolinguists, rhetoricians, and literary theorists over the years. Rather than attempting the task of proposing a single taxonomy of text-types (or the equally impossible one of enumerating all those which have been proposed previously), the closed set of situational parameters described above can be used in combination to supply useful distinguishing descriptive features of individual texts, without insisting on a system of discrete high-level text-types. Such text-types may however be used in combination with the parameters proposed here, with the advantage that the internal structure of each such text-type can be specified in terms of the parameters proposed.
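The default content model described above (each of the eight parameters once, in the stated order, with only the purpose element repeated) might be sketched for a scripted radio news bulletin as follows. The text being described is hypothetical, and the attribute values shown are assumptions chosen for illustration; the values actually permitted should be checked against the specification of each element.

```xml
<profileDesc xmlns="http://www.tei-c.org/ns/1.0">
  <textDesc n="scripted radio news bulletin">
    <!-- Spoken channel, qualified by a short prose description -->
    <channel mode="s">radio broadcast</channel>
    <constitution type="single"/>
    <derivation type="original"/>
    <domain type="public"/>
    <factuality type="fact"/>
    <!-- One addressor, an unrestricted audience, no interaction -->
    <interaction type="none" active="singular" passive="world"/>
    <preparedness type="scripted"/>
    <!-- purpose is the only parameter that may be repeated -->
    <purpose type="inform" degree="high"/>
    <purpose type="entertain" degree="low"/>
  </textDesc>
</profileDesc>
```

Elements whose attributes say everything needed are left empty; only the channel carries a prose qualification here.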
This approach has the following analytical advantages:[1]

* it enables a relatively continuous characterization of texts (in contrast to discrete categories based on type or topic)
* it enables meaningful comparisons across corpora
* it allows analysts to build and compare their own text-types based on the particular parameters of interest to them
* it is equally applicable to spoken, written, or signed texts

Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section 2.4.3. The Text Classification, and elements for defining or declaring such classification schemes in section 2.3.6. The Classification Declaration. A second approach is to develop an application-specific set of feature structures and an associated feature system declaration, as described in chapters 18. Feature Structures and 18.11. Feature System Declaration.

Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a text-type in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a taxonomy. The mechanisms described in section 2.3.6. The Classification Declaration may be used to define hierarchic taxonomies of such text-types, provided that the <catDesc> component of the <category> element contains a <textDesc> element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in section 2.4.3. The Text Classification.
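A text-type defined in this way might be declared and referenced as in the following sketch, in which a category description holds a text description instead of prose and a text is then assigned to that category. The identifiers textTypes and tt-phone, the titles, and the attribute values are hypothetical; in a corpus the taxonomy would more usually sit in the corpus header and the reference in the header of an individual text, whereas both are shown in one header here for brevity.

```xml
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>A hypothetical telephone conversation</title></titleStmt>
    <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
    <sourceDesc><p>Hypothetical recording.</p></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <classDecl>
      <taxonomy xml:id="textTypes">
        <!-- A text-type defined by a set of situational parameter values -->
        <category xml:id="tt-phone">
          <catDesc>
            <textDesc n="unscripted telephone conversation">
              <channel mode="s">telephone</channel>
              <constitution type="single"/>
              <derivation type="original"/>
              <domain type="domestic"/>
              <factuality type="mixed"/>
              <interaction type="complete" active="plural" passive="single"/>
              <preparedness type="none"/>
              <purpose type="express" degree="medium"/>
            </textDesc>
          </catDesc>
        </category>
      </taxonomy>
    </classDecl>
  </encodingDesc>
  <profileDesc>
    <!-- The text is assigned to the text-type declared above -->
    <textClass>
      <catRef target="#tt-phone"/>
    </textClass>
  </profileDesc>
</teiHeader>
```

Several texts sharing the same parameter values can then point at the one category rather than each repeating the full description.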
Using these situational parameters, an informal domestic conversation might be characterized as follows (only the prose content of the parameters is reproduced here):

* <channel>: informal face-to-face conversation
* <constitution>: each text represents a continuously recorded interaction among the specified participants
* <domain>: plans for coming week, local affairs
* <factuality>: mostly factual, some jokes

The following example demonstrates how the same situational parameters might be used to characterize a novel:

* <channel>: print; part issues

[1] Schemes similar to that proposed here were developed in the 1960s and 1970s by researchers such as Hymes, Halliday, and Crystal and Davy, but have rarely been implemented; one notable exception being the pioneering work on the Helsinki Diachronic Corpus of English, on which see Kytö and Rissanen (1988).

15.2.2 The Participant Description

The <particDesc> element in the <profileDesc> element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the namesdates module described in 13. Names, Dates, People, and Places are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a who attribute.

It should be noted that although the terms speaker or participant are used throughout this section, it is intended that the same mechanisms may be used to characterize fictional persons or `voices' within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written, spoken, or signed texts.
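The use of such a code might be sketched as follows: each participant is given an identifier in the header, and each utterance in the transcription points back to it by means of the who attribute. The identifiers SPK1 and SPK2, the participants, and the utterances are hypothetical, and the sketch assumes a schema which includes the namesdates module (for listPerson and person) and the module for transcribed speech described in chapter 8 (for u).

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A hypothetical recorded conversation</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Hypothetical recording.</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <particDesc>
        <listPerson>
          <!-- Each participant carries a code that the text can point to -->
          <person xml:id="SPK1"><p>First speaker, a hypothetical informant.</p></person>
          <person xml:id="SPK2"><p>Second speaker, a hypothetical interviewer.</p></person>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- who links each utterance to the description of its speaker -->
      <u who="#SPK1">Are you coming on Saturday?</u>
      <u who="#SPK2">I think so.</u>
    </body>
  </text>
</TEI>
```

Because the descriptions live in the header, a participant's characteristics are recorded once however many utterances refer to them.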
The <particDesc> element contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual <list> and <item> elements, or alternatively using the more specific and detailed <person> element provided by the namesdates module described in 13. Names, Dates, People, and Places. For example, a participant in a recorded conversation might be described informally as follows:

Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme.

Alternatively, when the namesdates module is included in a schema, information about the same participant described above might be provided in a more structured way, as a set of separately tagged details:

* birth: 12 Jan 1950, Shropshire, UK
* languages known: English, French
* residence: Long term resident of Hull
* education: University postgraduate
* occupation: Unknown

An identified character in a drama or a novel may also be regarded as a participant in this sense, and encoded using the same techniques:[2]

The chief speaking characters in this novel are

* Emma Woodhouse
* Mr Darcy

Here, the characters are simply listed without the detailed structure which use of the <person> element permits.

[2] It is particularly useful to define participants in a dramatic text in this way, since it enables the who attribute to be used to link <sp> elements to definitions for their speakers; see further section 7.2.2. Speeches and Speakers.

15.2.3 The Setting Description

The <settingDesc> element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings.

Each distinct setting is described by means of a <setting> element.

<setting> describes one particular setting in which a language interaction takes place.

Individual settings may be associated with particular participants by means of the optional who attribute, which this element inherits as a member of the att.ascribed class, if, for example, participants are in different places. This attribute identifies one or more individual participants or participant groups, as discussed earlier in section 15.2.2. The Participant Description. If this attribute is not specified, the setting details provided are assumed to apply to all participants represented in the language interaction. Note however that it is not possible to encode different settings for the same participant: a participant is deemed to be a person within a specific setting.

The <setting> element may contain either a prose description or a selection of elements from the classes model.nameLike.agent, model.dateLike, or model.settingPart. By default, when the module defined by this chapter is included in a schema, these classes thus provide the following elements:

<name> (name, proper noun) contains a proper noun or noun phrase.
<date> contains a date in any format.