{"id":956,"date":"2022-07-10T08:54:00","date_gmt":"2022-07-10T12:54:00","guid":{"rendered":"http:\/\/net4573.net.ucf.edu\/sjd\/blog\/?p=956"},"modified":"2022-06-27T15:56:12","modified_gmt":"2022-06-27T19:56:12","slug":"fixing-tangled-xmlids","status":"publish","type":"post","link":"https:\/\/johnsonsdictionaryonline.com\/blog\/fixing-tangled-xmlids\/","title":{"rendered":"Fixing tangled xml:ids"},"content":{"rendered":"\n<p>Usually when people hear about the <em>Johnson&#8217;s Dictionary Online<\/em> project, they assume that our work involves:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Getting a computer to turn page images into text via Optical Character Recognition (OCR)<\/li><li>Fixing typos<\/li><li>That&#8217;s it.<\/li><\/ol>\n\n\n\n<p>In reality, this project involves innumerable tasks that most people would never think about. Some of these tasks are more interesting than others. This post is about one of these tasks that consumed a big part of my week.<\/p>\n\n\n\n<p>You may have noticed that Johnson&#8217;s dictionary often links multiple headwords to the same definition:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"614\" height=\"294\" src=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-aspick-1755.png\" alt=\"1755 entry for Asp and Aspick, with headwords joined by a bracket to the same definition. https:\/\/johnsonsdictionaryonline.com\/1755\/asp_ns_(3). 
\" class=\"wp-image-957\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-aspick-1755.png 614w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-aspick-1755-300x144.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-aspick-1755-150x72.png 150w\" sizes=\"auto, (max-width: 614px) 100vw, 614px\" \/><figcaption>Entry for &#8220;Asp&#8221; and &#8220;Aspick&#8221;*<\/figcaption><\/figure>\n<\/div>\n\n\n<p>These linked-headword entries need to be checked to make sure that our transcription contains the right headwords. Every entry is contained in a separate XML file which must be individually edited, removing any headwords that don\u2019t belong and adding any headwords that are missing. When a headword needs to be added, the XML needs to be added also. So, for example, to add the headword \u201casp\u201d to an entry, I added the following to the XML file:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"277\" height=\"142\" src=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-headword-xml.png\" alt=\"XML for the headword &quot;Asp&quot; that shows superEntry, entry, form, and hi elements\" class=\"wp-image-958\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-headword-xml.png 277w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-headword-xml-150x77.png 150w\" sizes=\"auto, (max-width: 277px) 100vw, 277px\" \/><figcaption>xml for headword <em>Asp<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<p>I was happily reviewing these multiple-headword entries (the 984 instances in the 1773 edition\u20141755 having already been checked), when suddenly it occurred to me that some of our xml:id values were getting duplicated.<\/p>\n\n\n\n<p>Each headword is supposed to have a unique 
xml:id, and our software checks to make sure that none of the xml:ids are duplicated in a single file . . . but nothing checks to make sure that xml:ids are unique across the 83,938 entry files of the entire project.<\/p>\n\n\n\n<p>And I realized that there were indeed duplicates, because when I was adding headwords\u2014such as \u201casp,\u201d above\u2014I was actually duplicating the xml:ids from a different entry for \u201cAsp\u201d:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"598\" height=\"30\" src=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-alone-1755.png\" alt=\"1755 entry for asp https:\/\/johnsonsdictionaryonline.com\/1755\/asp_ns_(1)\" class=\"wp-image-959\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-alone-1755.png 598w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-alone-1755-300x15.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/asp-alone-1755-150x8.png 150w\" sizes=\"auto, (max-width: 598px) 100vw, 598px\" \/><figcaption>Entry for &#8220;Asp&#8221;<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Both entries in 1773 were using the xml:id   <em>f1773-asp-1<\/em>. &nbsp;Duplicated xml:ids will cause problems when we build hyperlinks for Johnson\u2019s cross-references. A cross-reference is a direction in one entry to consult another entry. 
For example, Johnson says \u201cSee ASP\u201d in the second entry for \u201cAspick\u201d:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"593\" height=\"116\" src=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/aspick-alone.png\" alt=\"1755 entry for aspick https:\/\/johnsonsdictionaryonline.com\/1755\/aspick_ns_(2)\" class=\"wp-image-960\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/aspick-alone.png 593w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/aspick-alone-300x59.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/aspick-alone-150x29.png 150w\" sizes=\"auto, (max-width: 593px) 100vw, 593px\" \/><figcaption>Entry for &#8220;Aspick&#8221;<\/figcaption><\/figure>\n<\/div>\n\n\n<p>We plan to build a link from that cross reference to the other entry, so when you click on the words &#8220;See ASP,&#8221; the entry for &#8220;Asp&#8221; opens up. And what will make that link possible? 
The xml:id!<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"319\" height=\"100\" src=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/xml-for-xr.png\" alt=\"XML for the cross reference from &quot;Aspick&quot; to &quot;Asp.&quot; Contains xr, ref, hi elements\" class=\"wp-image-961\" srcset=\"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/xml-for-xr.png 319w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/xml-for-xr-300x94.png 300w, https:\/\/johnsonsdictionaryonline.com\/blog\/wp-content\/uploads\/2022\/06\/xml-for-xr-150x47.png 150w\" sizes=\"auto, (max-width: 319px) 100vw, 319px\" \/><figcaption>XML for cross-reference<\/figcaption><\/figure>\n<\/div>\n\n\n<p>We can&#8217;t link without specifying which unique entry we are linking to. I realized that I needed to find all duplicated xml:ids in both editions and change the duplicates to something unique. Then I needed to figure out which entries were already referencing the duplicated xml:ids and update them to use the newly &#8220;unique-ified&#8221; xml:ids.<\/p>\n\n\n\n<p>Software tools are a big help with this kind of thing. It would be very hard to review 85,991** xml:id values and find duplications just with my own eyeballs, and I&#8217;d probably make mistakes trying to do it. But even so, the process isn&#8217;t what you&#8217;d call &#8220;fast,&#8221; and keeping track of the changes can be quite the challenge.<\/p>\n\n\n\n<p>Our next step will be to figure out a good process for building links out of the xml:id values. We had envisioned that each xml:id could be easily turned into a unique filename. For example, <em>f1773-asp-1<\/em> can be automatically transformed to the filename <em>f1773-asp-1.xml<\/em>. 
However, when we distinguish between homographs, an entry will sometimes cross-reference an xml:id that doesn\u2019t match a filename. We have some ideas for solving the problem, and I\u2019m sure we\u2019ll figure it out.<\/p>\n\n\n\n<p>I said above that some tasks are more interesting than others. I\u2019ll be honest: this is one of the less interesting tasks. Still, it is satisfying to identify an XML snarl and untangle it.<\/p>\n\n\n\n<p class=\"has-small-font-size\">*This post uses images from 1755, even though I was editing 1773, because it was easier for me to grab them.<\/p>\n\n\n\n<p class=\"has-small-font-size\">**There are more xml:ids than xml files because some entries have more than one headword.<\/p>\n\n\n\n<p>&#8211;<em>Beth Rapp Young<\/em> <em>(PI)<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Usually when people hear about the Johnson&#8217;s Dictionary Online project, they assume that our work involves: Getting a computer to turn page images into text via Optical Character Recognition (OCR) Fixing typos That&#8217;s it. In reality, this project involves innumerable tasks that most people would never think about. Some of these tasks are more interesting than others. 
This post is&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":68,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[6,4,19],"tags":[5,13,18],"class_list":["post-956","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backstage","category-dictionary","category-xml","tag-backstage","tag-dictionary","tag-xml"],"_links":{"self":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts\/956","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/comments?post=956"}],"version-history":[{"count":5,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts\/956\/revisions"}],"predecessor-version":[{"id":966,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/posts\/956\/revisions\/966"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/media\/68"}],"wp:attachment":[{"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/media?parent=956"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/categories?post=956"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/johnsonsdictionaryonline.com\/blog\/wp-json\/wp\/v2\/tags?post=956"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","
templated":true}]}}