Project Documentation

Preparation of document transcriptions

The authoritative guide to TEI markup is the TEI Guidelines.

These guidelines are based on the TEI Tite subset.


  1. The transcription should be exactly as it appears in the exemplar (i.e. the facsimile page), except in cases specified below.
  2. For special, non-ASCII characters (e.g. letters with macrons) or accents, we use the relevant Unicode character as described here.
  3. We use Unicode for vowel ligatures such as Æ, æ, and œ (see appendix), but we don't mark or substitute ligatures involving consonants (i.e. leave the letters apart).
  4. We preserve hyphenation and distinguish clearly between hard and soft hyphens. Soft hyphens are inserted in order to break a line (i.e. as a tactic for managing line length). Hard hyphens are those that connect the parts of a compound word. We mark all soft hyphens at line breaks like this: e.g. “Hi<lb break="no" rend="hyphen"/>storicall”. Hard hyphens are left as bare hyphens: e.g. “death-bed”. Hyphenation of words at line breaks can be ambiguous (i.e. sometimes the word might be hyphenated anyway). In such cases, wrap the whole word in <unclear> </unclear> tags. A double hyphen (=) at the break takes the @rend value "dhyphen"; if there is no punctuation at the break, use the value "none".
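
The hyphenation rule above can be checked mechanically. What follows is a minimal sketch, not part of the project's tooling: it assumes un-namespaced fragments, wraps them in a hypothetical <ab> element so they parse, and maps the three @rend values to the mark seen on the page. The names BREAK_MARKS and render are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from @rend on <lb break="no"/> to the mark printed at
# the end of the line in the exemplar.
BREAK_MARKS = {"hyphen": "-", "dhyphen": "=", "none": ""}

def render(xml_fragment, diplomatic=False):
    """Flatten a fragment to text.

    diplomatic=False rejoins words broken at <lb break="no"/>;
    diplomatic=True restores the end-of-line mark and the line break.
    """
    # <ab> is only a wrapper so that a bare fragment is well-formed.
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")
    out = []

    def walk(el):
        if el.tag == "lb":
            if el.get("break") == "no":
                if diplomatic:
                    mark = BREAK_MARKS.get(el.get("rend", "none"), "")
                    out.append(mark + "\n")
            else:
                out.append("\n")  # ordinary line break between words
            return
        out.append(el.text or "")
        for child in el:
            walk(child)
            out.append(child.tail or "")

    walk(root)
    return "".join(out)

sample = 'Hi<lb break="no" rend="hyphen"/>storicall death-bed'
print(render(sample))                    # reading text
print(render(sample, diplomatic=True))   # text as laid out on the page
```

Because hard hyphens stay in the text as bare characters, they survive both renderings unchanged, while soft hyphens exist only as markup and disappear from the reading text.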

Tagging textual and basic structural features


<unclear reason="illegible">p</unclear>lainely

  • obscured (in blot, scribbling, or strikethrough)
  • illegible (the letter forms can't be discerned)
  • ambiguous (the letter form might be one or another letter and you can't decide)


plai<gap extent="1" unit="char" reason="damaged"/>ely

In cases where we create gaps in a transcription for editorial reasons (chiefly, in excerpting passages from a document), we indicate a gap at each level of the document hierarchy, before and after the excerpt, where gaps are created: <gap reason="irrelevant" resp="ed"/>

  • damaged (there is some damage to the support, i.e. paper)
  • omission (apparent intentional scribal omission)
  • irrelevant (with @resp="ed": omitted by editors as not relevant to our purposes).


Used in combination with <gap>:

plai<gap extent="1" unit="char" reason=""/><supplied>n</supplied>ely


  • Italic: <hi rend="italic">italic</hi> 
  • Superscript: W<hi rend="sup">th</hi> 
  • Black-letter or Gothic: <hi rend="black">black-letter</hi> 
  • Underline: <hi rend="ul">underline</hi>
  • Underdot: <hi rend="ud">underdot</hi>
  • Strikethrough: <hi rend="strike">strikethrough</hi> 
  • Small-caps: C<hi rend="small-caps">ollins</hi> 
  • Caps: <hi rend="caps">Collins</hi> 


We are conservative in our use of <sic>, reserving it for cases where there is a clear error in word or punctuation, and using <orig>/<reg> instead in cases where spelling or punctuation might cause confusion for the reader.

e.g. both <choice><sic>hand</sic><corr>hands</corr></choice>


We regularize most peculiarities of early modern print publication, such as double capitals at the start of major divisions, as well as spelling and punctuation that might cause confusion.

e.g. both <choice><orig>ONce</orig><reg>Once</reg></choice> upon a time

Abbreviation and Expansion

<choice><abbr>w<hi rend="sup">th</hi></abbr><expan>with</expan></choice>
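
The three <choice> pairings used in these guidelines (<sic>/<corr>, <orig>/<reg>, <abbr>/<expan>) all follow the same pattern: one child records what is on the page, the other the editorial reading. As a rough sketch under that assumption, a single function can produce either text. The function name flatten and the <ab> wrapper are hypothetical, and the fragments are assumed to carry no namespace.

```python
import xml.etree.ElementTree as ET

# Children of <choice> that record the page versus the editorial reading.
ORIGINAL = {"sic", "orig", "abbr"}
EDITED = {"corr", "reg", "expan"}

def flatten(xml_fragment, edited=True):
    """Return the text, keeping one side of every <choice>."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")  # wrapper for parsing only
    skip = ORIGINAL if edited else EDITED
    out = []

    def walk(el):
        inside_choice = el.tag == "choice"
        if not inside_choice:
            out.append(el.text or "")
        for child in el:
            if not (inside_choice and child.tag in skip):
                walk(child)
            if not inside_choice:
                # Text directly inside <choice> is only layout whitespace,
                # so tails are kept everywhere except there.
                out.append(child.tail or "")

    walk(root)
    return "".join(out)

sample = ('both <choice><sic>hand</sic><corr>hands</corr></choice> '
          '<choice><abbr>w<hi rend="sup">th</hi></abbr>'
          '<expan>with</expan></choice> care')
print(flatten(sample))                # editorial reading
print(flatten(sample, edited=False))  # text as on the page
```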

Deletions and additions

<del rend="strike">hands</del><add place="above">hearts</add>

  • strikethrough
  • overwrite
  • erased
  • blotted
  • scribble
  • above (above the line)
  • below (below the line)
  • inline (inserted within the line)
  • margin (in the margin, usually with marker in the intended location)
  • end (at the end of the document, usually with marker in the intended location)
  • top (at the top of the document, usually with marker in the intended location)
  • overleaf (on the other side of the leaf, usually with marker in the intended location)
  • opposite (on the page opposite, usually with marker in the intended location)
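
Because <del> and <add> record two states of the same passage, both states can be recovered from one transcription. The sketch below is illustrative only: revision_states is a hypothetical helper, the fragment is assumed to be un-namespaced, and the <ab> wrapper exists solely so that the fragment parses.

```python
import xml.etree.ElementTree as ET

def revision_states(xml_fragment):
    """Return (before, after): the text before and after revision."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")  # wrapper for parsing only

    def text(el, drop):
        # Omit the dropped element entirely, but keep the text that follows it.
        if el.tag == drop:
            return ""
        return (el.text or "") + "".join(
            text(child, drop) + (child.tail or "") for child in el
        )

    # Dropping <add> gives the first state; dropping <del> gives the revised one.
    return text(root, "add"), text(root, "del")

sample = 'our <del rend="strike">hands</del><add place="above">hearts</add> joined'
before, after = revision_states(sample)
print(before)
print(after)
```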


If the note is original in the document:

<note type="orig" place="margin-left">marginalia</note>

Editorial explanation:

<note type="ed" resp="BN" place="margin-right">marginalia</note>

Line breaks


In cases where a word is broken:

Hi<lb break="no" rend="hyphen"/>storicall.   ... if there is a normal hyphen
Hi<lb break="no" rend="dhyphen"/>storicall.    ... if there is a double hyphen (=)
Hi<lb break="no" rend="none"/>storicall.   ... if there is no hyphen

Note: if there is a word break at the end of the page, place the <lb break="no" rend="hyphen"/> at the end of the line where the word is broken.

Page breaks

If there is a broken word at the end of the last line of a page, code it this way:
... here is a brok<lb break="no"/><pb n="23"/>en word ...

Running headers, footers, catchwords, signatures 

<fw type="header" place="tm">Header</fw>: a header, in the top margin, centre
<fw type="catch" place="br">Catchword</fw>: a catchword, bottom right
<fw type="footer" place="bl">Footer</fw>: a footer, bottom left
<fw type="pageNum" place="tr">1</fw>: a page number, top right


If it is a side margin:
  • margin-right
  • margin-left
If it is a top or bottom margin:
  • tl=top left
  • tr=top right
  • tm=top middle
  • bl=bottom left
  • br=bottom right
  • bm=bottom middle
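
A closed list of @place values like the one above lends itself to a quick consistency check. The following is a hedged sketch rather than a project tool: it assumes the list applies to <fw> and to <note> (the <note> examples above use the same margin values), that fragments are un-namespaced, and that a hypothetical <ab> wrapper is acceptable for parsing. The names ALLOWED_PLACES and invalid_places are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The values listed above for side, top, and bottom margins.
ALLOWED_PLACES = {
    "margin-left", "margin-right",
    "tl", "tr", "tm",
    "bl", "br", "bm",
}

# Assumption: only these elements draw their @place from the list above.
CHECKED_TAGS = {"fw", "note"}

def invalid_places(xml_fragment):
    """Return (tag, place) for each checked element with an unlisted @place."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")  # wrapper for parsing only
    problems = []
    for el in root.iter():
        place = el.get("place")
        if el.tag in CHECKED_TAGS and place is not None and place not in ALLOWED_PLACES:
            problems.append((el.tag, place))
    return problems

sample = ('<fw type="header" place="tm">Header</fw>'
          '<fw type="catch" place="right">Catchword</fw>')
print(invalid_places(sample))
```

<add> is deliberately left out of the check, since its @place values (above, below, inline, and so on) come from a different list.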

Languages other than the dominant language of the text

<foreign xml:lang="la">Latin language text</foreign>

Note: the @xml:lang attribute can be used on any element.

<hi rend="italic" xml:lang="la">Latin language text</hi>
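
Since @xml:lang may sit on any element, a search for foreign-language passages has to look at the attribute rather than at <foreign> alone. A minimal sketch, assuming un-namespaced fragments and a hypothetical <ab> wrapper; language_spans is an illustrative name. Python's ElementTree exposes xml:lang under the predefined XML namespace, which is why the attribute key is written out in full.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang using the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def language_spans(xml_fragment):
    """Return (language code, text) for every element carrying xml:lang."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")  # wrapper for parsing only
    return [
        (el.get(XML_LANG), "".join(el.itertext()))
        for el in root.iter()
        if el.get(XML_LANG)
    ]

sample = ('<foreign xml:lang="la">Latin language text</foreign> and '
          '<hi rend="italic" xml:lang="la">more Latin</hi>')
print(language_spans(sample))
```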


Images, drawings, illustrations, etc. are represented as <figure> with a description wrapped in <figDesc> and square brackets: <figure><figDesc>[Fig: text of description]</figDesc></figure>


<lb/><quote rend="center" xml:lang="la">A quotation from Cicero in Latin</quote>

Changes in hand

<handShift scribe="scribeB"/>
  • The start of a new scribe
<handShift script="scriptB"/>
  • The start of a new script but by the same scribe
<handShift medium="red-ink"/>
  • Marks the beginning of a sequence of text written in a new hand or new quality in the inscription (e.g. colour or medium)


Tagging entities and other semantic content


People are tagged with <name> and @type="person" and a reference to the corresponding entry in the project database: 

<name type="person" ref="26">John Tradescant, the Younger</name>

Places are handled in the same manner:

<name type="place" ref="342">South-Lambeth</name>
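
Tagging people and places with @type and @ref makes it straightforward to gather every mention of one database entry. The sketch below only illustrates that idea: index_names is a hypothetical helper, fragments are assumed to be un-namespaced, and the <ab> wrapper is there so that a bare fragment parses. It does not consult the project database.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_names(xml_fragment):
    """Group the text of each <name> by its (type, ref) pair."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")  # wrapper for parsing only
    index = defaultdict(list)
    for el in root.iter("name"):
        key = (el.get("type"), el.get("ref"))
        # itertext() also collects text inside nested markup such as <hi>.
        index[key].append("".join(el.itertext()))
    return dict(index)

sample = ('<name type="person" ref="26">John Tradescant, the Younger</name> of '
          '<name type="place" ref="342">South-Lambeth</name>')
for key, mentions in index_names(sample).items():
    print(key, mentions)
```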


<date when="1678-07-18" period="cont">July 18th 1678</date> 

We record dates as they appear on the page.
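
Keeping the date as written in the element content and the normalized value in @when means both remain available to software. A small sketch, assuming @when always holds a full YYYY-MM-DD value (a partial value such as a year alone would raise an error here) and that fragments are un-namespaced; read_dates and the <ab> wrapper are hypothetical. Note that Python's date type uses the proleptic Gregorian calendar, so it says nothing about which calendar the document's author was using.

```python
import xml.etree.ElementTree as ET
from datetime import date

def read_dates(xml_fragment):
    """Return (text as written, parsed @when) for each <date> with @when."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")  # wrapper for parsing only
    return [
        ("".join(el.itertext()), date.fromisoformat(el.get("when")))
        for el in root.iter("date")
        if el.get("when")
    ]

sample = '<date when="1678-07-18" period="cont">July 18th 1678</date>'
for written, normalized in read_dates(sample):
    print(written, "->", normalized.isoformat())
```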


At the most basic level, for documents that refer to, mention, or describe real-world objects, we must be able to identify what constitutes a reference to an object, identify a string of text that can function as a verbal identifier (a “name” in the loosest sense), and then correlate with that named object all the relevant context. 

We identify each mention and its context using <seg> with @type="object".

We identify a handle for each object using <rs> with @type="object".

We identify all groups of objects, distinguishing between general groupings and sets.
