In recent posts, we’ve been giving you a backstage look at how we are building this online edition of Samuel Johnson’s A Dictionary of the English Language. In this post, we’ll explain how the computer calls up words when you search for them—and why the search results look the way they do.
Briefly, our results display follows modern dictionary conventions, which differ somewhat from the conventions that Johnson followed.
We created “labels” for our search results
When you search for a word, or click the “Random Word” button multiple times, the site returns a list of words that runs down the left side of the screen.
Internally, we call these words labels because each one identifies a particular entry. When you click on a word/label, the online dictionary displays the entry for that word in Johnson’s Dictionary.
This concept seems straightforward enough . . . until you realize that the words in the list don’t actually appear in that form anywhere in the dictionary entry! For example, here’s the entry for rhombick, adj.:
Although the terms rhombick and adj appear in the entry, they do not appear in this form. For one thing, the word rhombick is written with a combination of regular and small caps, and it has an accent mark after the o. Also, a peek inside the XML reveals a lot of markup around these terms that wouldn’t belong in a results list:
Because nothing in the transcribed entry was easy to grab and turn into a label, we decided to supply the labels ourselves. Our editorial philosophy has been to align every XML file with Johnson’s text and to keep our modern additions separate where possible. Accordingly, we store the word labels in their own database. Here’s a small piece:
When someone types a word into the search box, the computer looks for matches in the search term column, finds the corresponding files in the filename column, and displays the corresponding labels from the label column.
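The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: the three column names come from the description, but the `LabelRow` type, the `find_labels` function, the filenames, and the sample rows are all made up for the example.

```python
from typing import NamedTuple


class LabelRow(NamedTuple):
    """One row of the label table: search term, filename, label."""
    search_term: str
    filename: str
    label: str


# Illustrative rows only; the filenames are invented for this sketch.
LABEL_ROWS = [
    LabelRow("rhombick", "rhombick_adj.xml", "rhombick, adj."),
    LabelRow("fumingly", "fumingly_adv.xml", "fumingly, adv."),
]


def find_labels(query, rows=LABEL_ROWS):
    """Return (filename, label) pairs whose search term matches the query."""
    needle = query.strip().lower()
    return [(row.filename, row.label) for row in rows if row.search_term == needle]
```

Calling `find_labels("Rhombick")` against these sample rows returns the single pair for rhombick, adj., while a word with no matching search term returns an empty list.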
Our labels follow modern conventions
In keeping with our goal of making Johnson’s text as accessible as a contemporary dictionary, our word labels follow modern conventions, which differ from Johnson’s conventions in several ways. All these differences are intended to make the word labels easier to read:
We omit stress marks
Johnson indicates a word’s primary stress by putting an accent mark into the stressed syllable. Stress is the relative emphasis with which the syllables of a word are spoken. Compare the word “abstract” pronounced AB-stract with the same word pronounced ab-STRACT.
The accent mark gives information about pronouncing the word, but the mark is not part of the word’s normal spelling. Our transcriptions preserve Johnson’s stress marks, but our word labels omit them.
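As a rough sketch of what omitting stress marks could look like in code, the Python function below removes a small set of accent characters from a headword. It rests on an assumption: that the stress mark is transcribed as a prime (U+2032), a spacing acute accent (U+00B4), or a combining acute accent (U+0301). The post does not say which character the transcriptions use, and the name `strip_stress_marks` is hypothetical.

```python
import unicodedata

# Assumed encodings of the stress mark; the real transcriptions may differ.
STRESS_MARKS = {"\u2032", "\u00b4", "\u0301"}


def strip_stress_marks(headword):
    """Return the headword with any assumed stress-mark characters removed."""
    # Decompose first so an accent fused onto a vowel becomes a separate character.
    decomposed = unicodedata.normalize("NFD", headword)
    kept = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    return unicodedata.normalize("NFC", kept)
```

The plain apostrophe is deliberately left out of the set so that a headword such as Hippocrates's sleeve keeps its possessive. One caveat: removing the combining acute would also strip a genuine acute accent from a borrowed word, so a real pipeline would need a human check.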
We modernize capitalization
Johnson developed a format for headword capitalization that provides information about each word’s origins. Most headwords appear in ALL CAPS. Headwords that Johnson considered to be derived from another English word are presented in Small Capital Letters (which, unfortunately, WordPress will not display; see the image of the 1755 entry for fumingly, below, for an example). And headwords that Johnson considered to be more foreign than English appear in ITALICIZED CAPITAL LETTERS.
Our entry transcriptions preserve Johnson’s headword formatting. Our labels, however, follow modern lexicographic practices; words appear in lowercase letters unless capitalization is required by modern convention, as with proper nouns; see, for example, this entry for Hippocrates’s sleeve, n.s.:
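Using the Hippocrates's sleeve label as a test case, the lowercasing rule might be sketched in Python as follows. The function name `modernize_case` and the `PROPER_NOUN_FORMS` table are hypothetical. Deciding which words count as proper nouns is an editorial judgment, so the sketch simply assumes a hand-maintained list.

```python
# Hypothetical, hand-maintained table of words that keep a capital letter.
PROPER_NOUN_FORMS = {"hippocrates's": "Hippocrates's"}


def modernize_case(headword, proper_forms=PROPER_NOUN_FORMS):
    """Lowercase a headword, restoring capitals only for listed proper nouns."""
    words = headword.lower().split()
    return " ".join(proper_forms.get(word, word) for word in words)
```

With this table, an all-caps headword for Hippocrates's sleeve comes back with only the proper noun capitalized, and an ordinary headword such as RHOMBICK comes back entirely in lowercase.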
We omit articles and particles
Johnson’s headwords are often preceded by an article (A, An, The) or particle (To), a practice that was common at the time. Modern dictionaries omit these words, and so do our word labels, though we preserve them in the transcription.
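Dropping a leading article or particle is simple to express in code. The Python sketch below is one possible approach, not a description of how the labels were actually produced. The name `strip_leading_word` is hypothetical, and the function assumes the article or particle is separated from the headword by a space.

```python
# The articles and particle named in the text, in lowercase for comparison.
LEADING_WORDS = ("a", "an", "the", "to")


def strip_leading_word(headword):
    """Remove one leading article or particle, if another word follows it."""
    text = headword.strip()
    first, _, rest = text.partition(" ")
    if rest and first.lower() in LEADING_WORDS:
        return rest.lstrip()
    # A headword that consists only of "A" or "To" is left alone.
    return text
```

The check on `rest` matters: without it, the entries for the words a and to themselves would end up with empty labels.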
We standardize parts of speech
Johnson notes the parts of speech of many headwords but, as is typical for the eighteenth century, does not consistently use the same tidy abbreviation every time. His part of speech designations range from the brief n.s. to the more elaborate this is a kind of substantive, being, according to its signification, singular or plural. Some of his abbreviations are clear, if variable, such as part. / partic. / particip. / participle. Others are more opaque, such as ad., which could signify either adjective or adverb. At times, Johnson lists no part of speech at all.
Our entry transcriptions preserve Johnson’s part of speech designations as he provided them. Our labels, however, standardize Johnson’s abbreviations in order to make search results easier to read. For example, the labels use the abbreviation part. for participle no matter how Johnson abbreviated the term in the entry.
Note that our labels only abbreviate Johnson’s designations; they do not attempt to interpret them. Where Johnson’s part of speech designation is opaque, as with ad., we change nothing. And where Johnson’s part of speech designation is too long to fit on a label, we omit it from the label but do not attempt to substitute something else.
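The three rules in this section (standardize known variants, leave opaque designations alone, omit overlong ones) can be sketched as a small Python function. The participle variants come from the text above. The `MAX_LABEL_POS_LENGTH` cutoff and the name `standardize_pos` are assumptions made for the example, since the post does not say how long is too long.

```python
# Variants of "participle" mentioned in the text, all mapped to "part."
PARTICIPLE_VARIANTS = {"part.", "partic.", "particip.", "participle"}

# Hypothetical cutoff for a designation that is too long to fit on a label.
MAX_LABEL_POS_LENGTH = 12


def standardize_pos(designation):
    """Return the label form of a part of speech, or None if it is omitted."""
    if designation is None or not designation.strip():
        return None  # Johnson listed no part of speech
    text = designation.strip()
    if text.lower() in PARTICIPLE_VARIANTS:
        return "part."
    if len(text) > MAX_LABEL_POS_LENGTH:
        return None  # too long for a label; nothing is substituted
    return text  # everything else, including the opaque "ad.", is unchanged
```

Note that the function never guesses: ad. passes through untouched, and the long description quoted earlier yields no part of speech at all rather than an editor's interpretation.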
We number homographs
Often, multiple headwords have identical spellings; in other words, they are homographs. At times, different headwords share not only a spelling but also a part of speech, which means that their labels would also be identical. Unfortunately, a list of identical labels can look like an error:
To distinguish these labels, we number them:
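A minimal Python sketch of that numbering step follows. The exact format of the numbers on the site is not reproduced here, so the parenthesized suffix is an assumption, as are the function name `number_homographs` and the sample labels in the usage note.

```python
from collections import Counter


def number_homographs(labels):
    """Append (1), (2), ... to labels that occur more than once, in order."""
    totals = Counter(labels)  # how many times each label occurs overall
    seen = Counter()          # how many times each label has occurred so far
    numbered = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            numbered.append(f"{label} ({seen[label]})")
        else:
            numbered.append(label)
    return numbered
```

Given two identical labels followed by a unique one, the first two receive the suffixes (1) and (2) and the third is returned unchanged.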
Yes, we did this work by hand
We were able to automate some of the label-making process. For example, we were able to extract a list of (most) headwords and their parts of speech from our XML files. But a great deal of this work has been carried out by actual people, looking at words one at a time.
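To give a sense of the automated part, here is a sketch of pulling a headword and part of speech out of one entry with Python's standard XML parser. The element names `entry`, `headword`, `sc`, and `pos` are invented for the example and are not the project's actual markup. The function name `extract_headword_and_pos` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up entry using hypothetical element names.
SAMPLE_ENTRY = (
    "<entry>"
    "<headword>R<sc>hombick</sc></headword>"
    "<pos>adj.</pos>"
    "</entry>"
)


def extract_headword_and_pos(xml_text):
    """Return (headword, part of speech) as plain text; missing parts are None."""
    root = ET.fromstring(xml_text)
    head = root.find(".//headword")
    pos = root.find(".//pos")
    # itertext() gathers text from nested markup, such as small-caps spans.
    headword = "".join(head.itertext()).strip() if head is not None else None
    part = "".join(pos.itertext()).strip() if pos is not None else None
    return headword or None, part or None
```

A script like this gets only as far as the raw text. Entries with unusual markup or no part of speech still need a person to look at them, which is why so much of the work was done by hand.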
Yes, we’re still working on the labels (you can help)
We’re still working hard to improve the site infrastructure so that our labels display as intended. Sometimes we discover that a particular label has a typo, or that a search finds the correct entry but displays the wrong label, or that a search yields unexpected results.
Please help by alerting us to any surprising or confusing search results you encounter!