{"id":89,"date":"2021-05-15T12:04:56","date_gmt":"2021-05-15T16:04:56","guid":{"rendered":"https:\/\/johnsonsdictionaryonline.org\/blog\/?p=89"},"modified":"2021-05-19T12:08:01","modified_gmt":"2021-05-19T16:08:01","slug":"what-happens-after-typing-the-words","status":"publish","type":"post","link":"https:\/\/johnsonsdictionaryonline.com\/blog\/what-happens-after-typing-the-words\/","title":{"rendered":"What happens after typing the words?"},"content":{"rendered":"\n<p>When we started this project, people would often say something like, \u201cSo, you have to get a computer to recognize the words on the page. That\u2019s basically it, right? What else is there to do?\u201d<br><br>Lots more! <br><br>One very important task has been marking up the words using XML (eXtensible Markup Language). XML works a lot like HTML, for those of you familiar with HTML, but instead of relying on a limited set of tags, XML lets us invent new tags to fit our project. Additionally, where HTML normally addresses the appearance of a text, XML lets us mark the kind of content that the text represents. Our project mainly uses a variety of XML called TEI, named for the <em>Text Encoding Initiative<\/em>, the consortium that develops and maintains it.<br><br>Our project owes an enormous debt of gratitude to Ian Lancashire of the <a rel=\"noreferrer noopener\" href=\"https:\/\/leme.library.utoronto.ca\/\" data-type=\"URL\" data-id=\"https:\/\/leme.library.utoronto.ca\/\" target=\"_blank\">Lexicons of Early Modern English (LEME)<\/a> project at the University of Toronto, who donated to us a complete TEI-encoded transcription of Johnson\u2019s 1755 first folio edition. <br><br>Without that gift, we might still be cleaning up text that was generated via OCR (Optical Character Recognition). OCR technology does not yet do a good job of recognizing 18<sup>th<\/sup>-century print. 
For example, here\u2019s the OCR output for <em>kecksy<\/em>, followed by a facsimile image of the entry for <a rel=\"noreferrer noopener\" href=\"https:\/\/johnsonsdictionaryonline.com\/1755\/kecksy_ns\" data-type=\"URL\" data-id=\"https:\/\/johnsonsdictionaryonline.com\/1755\/kecksy_ns\" target=\"_blank\"><em>kecksy<\/em><\/a>:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"638\" height=\"133\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/OCR-kecksy-1.png\" alt=\"\" class=\"wp-image-95\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/OCR-kecksy-1.png 638w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/OCR-kecksy-1-300x63.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/OCR-kecksy-1-150x31.png 150w\" sizes=\"auto, (max-width: 638px) 100vw, 638px\" \/><figcaption>OCR output for <em>kecksy<\/em><\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/kecksy-facsimile-1755.png\" alt=\"\" class=\"wp-image-96\" width=\"463\" height=\"139\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-facsimile-1755.png 572w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-facsimile-1755-300x90.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-facsimile-1755-150x45.png 150w\" sizes=\"auto, (max-width: 463px) 100vw, 463px\" \/><figcaption>facsimile of 1755 entry for <em>kecksy<\/em><\/figcaption><\/figure><\/div>\n\n\n\n<p>Predictably, the software has trouble recognizing the long \u201cs\u201d character. 
But when it fails to understand \u017f, it doesn\u2019t simply insert an <em>f<\/em>. Here it substitutes <em>f<\/em> or <em>j<\/em>; in other passages it might substitute <em>t<\/em> or <em>l<\/em> or <em>p<\/em> or <em>1<\/em> or a wide variety of other characters, or it might just skip the letter altogether. The software also had trouble with <em>E, c, n, s, e,<\/em> and <em>h<\/em> . . . and that\u2019s just in this relatively short entry!&nbsp;<\/p>\n\n\n\n<p>You can see how the transcript from LEME has made our work much easier!<\/p>\n\n\n\n<p>Even though the LEME file was already marked up in XML-TEI, we still had plenty of markup to do. Here\u2019s a portion of the entry for <em>kecksy<\/em> as marked in the original LEME transcription and then as marked in our edited transcription:&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/leme-kecksy-beginning.png\" alt=\"\" class=\"wp-image-97\" width=\"608\" height=\"123\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/leme-kecksy-beginning.png 789w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/leme-kecksy-beginning-300x60.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/leme-kecksy-beginning-150x30.png 150w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/leme-kecksy-beginning-768x155.png 768w\" sizes=\"auto, (max-width: 608px) 100vw, 608px\" \/><figcaption>excerpt of XML from LEME text<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/sjd-kecksy-beginning.png\" alt=\"\" class=\"wp-image-98\" width=\"579\" 
height=\"283\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/sjd-kecksy-beginning.png 811w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/sjd-kecksy-beginning-300x146.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/sjd-kecksy-beginning-150x73.png 150w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/sjd-kecksy-beginning-768x375.png 768w\" sizes=\"auto, (max-width: 579px) 100vw, 579px\" \/><figcaption>excerpt of XML from our text<\/figcaption><\/figure><\/div>\n\n\n\n<p>The LEME markup faithfully preserves the layout on the printed page, which is an important goal of digital humanities.&nbsp;<\/p>\n\n\n\n<p>However, our dictionary prioritizes accessibility over preservation. Readers can view the original printed pages using our \u201cBrowse\u201d menu, and soon we will display a facsimile of every entry alongside every transcription. Because our readers can easily view the original print layout, our transcription doesn&#8217;t need to preserve every detail of that layout. Instead, our goal is to provide transcribed text that can be easily read, quoted, etc. To reach this goal, we removed some markup, such as the &lt;lb\/&gt; tags that marked line breaks. 
Our transcribed text flows to match whatever margins are set by the reader\u2019s device.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1003\" height=\"221\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/kecksy-wide-margin.png\" alt=\"\" class=\"wp-image-99\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-wide-margin.png 1003w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-wide-margin-300x66.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-wide-margin-150x33.png 150w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-wide-margin-768x169.png 768w\" sizes=\"auto, (max-width: 1003px) 100vw, 1003px\" \/><figcaption>entry for <em>kecksy<\/em> on widescreen device<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/kecksy-on-phone-Screenshot_20210508-163854-512x1024.jpg\" alt=\"\" class=\"wp-image-100\" width=\"229\" height=\"458\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-on-phone-Screenshot_20210508-163854-512x1024.jpg 512w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-on-phone-Screenshot_20210508-163854-150x300.jpg 150w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-on-phone-Screenshot_20210508-163854-75x150.jpg 75w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-on-phone-Screenshot_20210508-163854.jpg 720w\" sizes=\"auto, (max-width: 229px) 100vw, 229px\" \/><figcaption>entry for <em>kecksy<\/em> on 
smartphone<\/figcaption><\/figure><\/div>\n\n\n\n<p>Additionally, we added markup to enable our database to work more effectively. For example, we marked \u201cSkinner\u201d as the name of an author, with a link to our personography table that will soon allow readers to display additional information about him. We also added attributes to language names to enable our database to locate mentions of a language no matter how it is abbreviated. Our database can find mentions of French whether the name is written as <em>French<\/em> or <em>Fr.<\/em>, and it can distinguish between the French language and other things that are French.&nbsp;<\/p>\n\n\n\n<p>We were able to use a software process for some of this markup, but much of it we completed by hand: teams of people, combing through the dictionary one entry at a time, looking for features to mark. To make the process easier, we built an administrative display that indicated the presence of some tags with colored text. That way, when we saw a word in blue (for example), we could tell that it was already marked with a &lt;placeName&gt; tag without having to open up the XML file.&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"533\" height=\"269\" src=\"https:\/\/johnsonsdictionaryonline.org\/blog\/wp-content\/uploads\/2021\/05\/kecksy-colored-admin-display.png\" alt=\"\" class=\"wp-image-101\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-colored-admin-display.png 533w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-colored-admin-display-300x151.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2021\/05\/kecksy-colored-admin-display-150x76.png 150w\" sizes=\"auto, (max-width: 533px) 100vw, 533px\" \/><figcaption>administrative working display of <em>kecksy<\/em><\/figcaption><\/figure><\/div>\n\n\n\n<p>Of course, if 
the name of a place did not appear in blue, we <em>did<\/em> need to open up the XML file for that word to add the tag. Happily, eXide, the web-based editor for the database that stores our XML files, makes this editing easier. Our volunteers have spent many hours of quality time editing the XML.&nbsp;<\/p>\n\n\n\n<p>Would we consider this work to be <em><a href=\"https:\/\/johnsonsdictionaryonline.com\/1755\/lexicographer_ns\" target=\"_blank\" rel=\"noreferrer noopener\">drudgery<\/a><\/em>? Sometimes! But Johnson\u2019s definitions are a delight, so the work can be surprisingly engaging.&nbsp;<\/p>\n\n\n\n<p>As the project moves forward, we will provide additional search features that have been made possible through this labor.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>People often ask us something like, \u201cSo, you have to get a computer to recognize the words on the page. That\u2019s basically it, right? What else is there to do?\u201d Lots! Here, we explain XML markup.<\/p>\n","protected":false},"author":1,"featured_media":68,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[6,4],"tags":[],"class_list":["post-89","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backstage","category-dictionary"],"_links":{"self":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts\/89","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:
\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/comments?post=89"}],"version-history":[{"count":9,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts\/89\/revisions"}],"predecessor-version":[{"id":119,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts\/89\/revisions\/119"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/media\/68"}],"wp:attachment":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/media?parent=89"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/categories?post=89"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/tags?post=89"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}