Fixing tangled xml:ids
Usually when people hear about the Johnson’s Dictionary Online project, they assume that our work involves:
- Getting a computer to turn page images into text via Optical Character Recognition (OCR)
- Fixing typos
- That’s it.
In reality, this project involves innumerable tasks that most people would never think about. Some of these tasks are more interesting than others. This post is about one of these tasks that consumed a big part of my week.
You may have noticed that Johnson’s dictionary often links multiple headwords to the same definition:
These linked-headword entries need to be checked to make sure that our transcription contains the right headwords. Each entry lives in its own XML file, which must be edited individually to remove any headwords that don’t belong and add any that are missing. Adding a headword means adding its XML markup as well. So, for example, to add the headword “asp” to an entry, I added the following to the XML file:
I was happily reviewing these multiple-headword entries (the 984 instances in the 1773 edition—1755 having already been checked), when suddenly it occurred to me that some of our xml:id values were getting duplicated.
Each headword is supposed to have a unique xml:id, and our software checks to make sure that none of the xml:ids are duplicated in a single file . . . but nothing checks to make sure that xml:ids are unique across the 83,938 entry files of the entire project.
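A project-wide check of this kind can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: it assumes the entry files sit together in one directory, and the function names (`collect_ids`, `find_duplicates`) are made up for the example. It records every file in which each xml:id appears and reports any id that shows up more than once.

```python
from collections import defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def collect_ids(entry_dir):
    """Map each xml:id value to the list of file names in which it appears.

    Assumes (hypothetically) that all entry files are *.xml files in one folder.
    """
    locations = defaultdict(list)
    for path in sorted(Path(entry_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        # Look at every element, whatever its tag, for an xml:id attribute.
        for element in root.iter():
            xml_id = element.get(XML_ID)
            if xml_id is not None:
                locations[xml_id].append(path.name)
    return locations


def find_duplicates(locations):
    """Keep only the xml:id values that occur more than once."""
    return {xml_id: files for xml_id, files in locations.items() if len(files) > 1}
```

Calling `find_duplicates(collect_ids("entries"))`, where `entries` is a placeholder folder name, would return a dictionary keyed by each duplicated xml:id, with the list of files that use it.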
And I realized that there were indeed duplicates, because when I was adding headwords—such as “asp,” above—I was actually duplicating the xml:ids from a different entry for “Asp”:
Both entries in 1773 were using the xml:id f1773-asp-1. Duplicated xml:ids will cause problems when we build hyperlinks for Johnson’s cross-references. A cross-reference is a direction in one entry to consult another entry. For example, Johnson says “See ASP” in the second entry for “Aspick”:
We plan to build a link from that cross-reference to the other entry, so when you click on the words “See ASP,” the entry for “Asp” opens up. And what will make that link possible? The xml:id!
We can’t link without specifying which unique entry we are linking to. I realized that I needed to find all duplicated xml:ids in both editions and change the duplicates to something unique. Then I needed to figure out which entries were already referencing the duplicated xml:ids and update them to use the newly “unique-ified” xml:ids.
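One way to keep track of such changes is to draw up a rename plan before touching any file. The sketch below is an assumption-laden illustration rather than the process we actually followed: it assumes ids end in a number (as in f1773-asp-1), it makes a duplicate unique by bumping that number to the next unused value, and it arbitrarily lets the first file in each list keep the original id. In practice an editor still has to decide which entry each existing cross-reference really means. The names `next_free_id` and `plan_renames` are hypothetical.

```python
import re


def next_free_id(xml_id, taken):
    """Return the first id of the form stem-N, counting up from xml_id, not in `taken`."""
    match = re.fullmatch(r"(.+)-(\d+)", xml_id)
    if match is None:
        raise ValueError(f"xml:id does not end in a number: {xml_id!r}")
    stem, number = match.group(1), int(match.group(2))
    candidate = xml_id
    while candidate in taken:
        number += 1
        candidate = f"{stem}-{number}"
    return candidate


def plan_renames(locations):
    """Build a rename plan from a mapping of xml:id -> list of files using it.

    The first file listed keeps the id (an arbitrary choice for this sketch);
    every other file gets a new, unused id. The result maps
    (file name, old id) -> new id, which doubles as a change log.
    """
    taken = set(locations)
    plan = {}
    for xml_id, files in sorted(locations.items()):
        for filename in files[1:]:
            new_id = next_free_id(xml_id, taken)
            taken.add(new_id)
            plan[(filename, xml_id)] = new_id
    return plan
```

The plan can be reviewed by a person before any file is edited, and afterwards it records which old id became which new id, so that entries referencing the old value can be updated.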
Software tools are a big help with this kind of thing. It would be very hard to review 85,991** xml:id values and find duplications just with my own eyeballs, and I’d probably make mistakes trying to do it. But even so, the process isn’t what you’d call “fast,” and keeping track of the changes can be quite the challenge.
Our next step will be to figure out a good process for building links out of the xml:id values. We had envisioned that each xml:id could be easily turned into a unique filename. For example, f1773-asp-1 can be automatically transformed to the filename f1773-asp-1.xml. But when we distinguish between homographs, an entry will sometimes cross-reference an xml:id that doesn’t match any filename. We have some ideas for solving the problem, and I’m sure we’ll figure it out.
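The filename idea, and the way it can fail, can be shown with a short Python sketch. It is illustrative only: it assumes the entry files sit in one directory and that the ids targeted by cross-references have already been gathered into a list by some other means. The function names are hypothetical.

```python
from pathlib import Path


def filename_for(xml_id):
    """Turn an xml:id into the file name we would expect, e.g. f1773-asp-1.xml."""
    return f"{xml_id}.xml"


def unresolved_targets(target_ids, entry_dir):
    """List the cross-reference targets whose expected file does not exist.

    `target_ids` is any iterable of xml:id values that cross-references point to;
    how those values are collected is outside this sketch.
    """
    existing = {path.name for path in Path(entry_dir).glob("*.xml")}
    return sorted(
        xml_id for xml_id in set(target_ids) if filename_for(xml_id) not in existing
    )
```

Any id returned by `unresolved_targets` is one for which the simple id-to-filename rule is not enough, so the link would need some other lookup.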
I said above that some tasks are more interesting than others. I’ll be honest: this is one of the less interesting tasks. Still, it is satisfying to identify an XML snarl and untangle it.
*This post uses images from 1755, even though I was editing 1773, because it was easier for me to grab them.
**There are more xml:ids than xml files because some entries have more than one headword.
–Beth Rapp Young (PI)