Parsing Author Names
The vagaries of parsing and handling proper names are legendary (for the general lay of the problem, see this article from the W3C). The problem gets even worse once you add the many ways in which (i) authors format their names for publication and (ii) bibliographic data providers format that name data before we receive it.
RLetters is thus forced to make a few assumptions about the data to maximize search results. First, all data in the Solr server is assumed to be in some variation of “First Last” format (with appropriate allowances made for abbreviations and missing middle names; see below). This allows the authors list to remain comma-separated, as our Solr schema specifies.
When a user attempts to search for articles by author, then, the following processing steps are performed.
- Parse the name to determine what the user is searching for. Our name-parsing step accounts for suffixes (Jr., Sr., etc.) as well as what BibTeX traditionally calls the “von-part” (von, van, van der, etc.). It can also detect when users provide a string of initials instead of a name (e.g., “JHQ Doe”). Finally, it recognizes several varieties of “Last, First” ordering and converts them automatically.
- Create a Solr query for (1) just the first and last name, and (2, if available) the first, last, and middle names.
- For each set of names, check the following:
  - If a name is a single initial, search it with a wildcard, so it might match a full name.
  - If a name is not a single initial, query it both as the full name and as an initial without a wildcard.
  - If multiple initials in a row are present in any search term, create a new search term which has them collapsed together.
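As a rough illustration of the parsing step, here is a minimal Python sketch that handles the cases named above: “Last, First” ordering, suffixes, the von-part, and strings of initials. It is not the parser RLetters actually uses. The names `parse_name`, `ParsedName`, `SUFFIXES`, and `VON_WORDS`, the short word lists, and the rule that an all-uppercase token is a run of initials are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Illustrative, deliberately incomplete word lists (assumptions of this sketch).
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
VON_WORDS = {"von", "van", "der", "den", "de", "la"}


@dataclass
class ParsedName:
    first: list = field(default_factory=list)  # first and middle names, in order
    von: str = ""
    last: str = ""
    suffix: str = ""


def is_suffix(token):
    return token.lower().rstrip(".") in SUFFIXES


def expand_initials(token):
    """Turn "JHQ" or "J.H.Q." into ["J", "H", "Q"]; strip the period from "L."."""
    bare = token.replace(".", "")
    # Assumption: a multi-letter, all-uppercase token is a string of initials.
    if len(bare) > 1 and bare.isalpha() and bare.isupper():
        return list(bare)
    return [bare]


def parse_name(raw):
    # Comma-separated pieces; a piece that is only a suffix is set aside.
    pieces = [p.strip() for p in raw.split(",") if p.strip()]
    suffix = next((p for p in pieces if is_suffix(p)), "")
    pieces = [p for p in pieces if not is_suffix(p)]

    if len(pieces) == 2:
        # "Last, First" ordering: read the two halves in swapped order.
        last_tokens, first_tokens = pieces[0].split(), pieces[1].split()
    else:
        tokens = " ".join(pieces).split()
        if len(tokens) > 1 and is_suffix(tokens[-1]):
            suffix = tokens.pop()
        # The last name is the final token plus any von-words right before it.
        i = max(len(tokens) - 1, 0)
        while i > 0 and tokens[i - 1].lower() in VON_WORDS:
            i -= 1
        first_tokens, last_tokens = tokens[:i], tokens[i:]

    # Split the leading von-part off the last name.
    k = 0
    while k < len(last_tokens) - 1 and last_tokens[k].lower() in VON_WORDS:
        k += 1

    first = [n for t in first_tokens for n in expand_initials(t) if n]
    return ParsedName(first, " ".join(last_tokens[:k]),
                      " ".join(last_tokens[k:]), suffix)
```

Under these assumptions, `parse_name("Doe, John Jay L.")` gives the first names `["John", "Jay", "L"]` with the last name `Doe`, and `parse_name("JHQ Doe")` splits the initials into `["J", "H", "Q"]`. A real parser needs far longer word lists and more care with ambiguous tokens (a short all-caps first name would be misread as initials here).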
This logic is easiest to show in action. Consider a particularly complex case, where the user searches for “Doe, John Jay L.”. The parsing of this name goes like this:
Recognize the “Last, First” format.
first: John Jay L. last: Doe
Construct one query for just “John Doe”.
There is only one first name here, and it’s not a single initial. We thus produce two search terms for this name:
"John Doe" "J Doe"
Construct one query for “John Jay L. Doe”.
This has three first and middle names, and one of them is a single initial. First, we produce every combination of the names that are present, both as full names and as initials (also, note the wildcard, as L was specified by the user as an initial):
"John Jay L* Doe" "John J L* Doe" "J Jay L* Doe" "J J L* Doe"
Now, we add to this list the result of combining together multiple runs of initials:
"John JL* Doe" "JJ L* Doe" "J JL* Doe" "JJL* Doe"
Finally, combine them all together to get:
"John Doe" "J Doe" "John Jay L* Doe" "John J L* Doe" "John JL* Doe" "J Jay L* Doe" "J J L* Doe" "JJ L* Doe" "J JL* Doe" "JJL* Doe"
Thus we’ve created a sum total of ten search terms from this particular name. They’ll match everything from “JJL Doe” to “John Jay Lucas Doe”.
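The expansion walked through above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the RLetters implementation. The function names `variants`, `collapsings`, and `search_terms` are hypothetical, and the sketch assumes that a wildcarded initial may only end a collapsed run (so the `*` never lands in the middle of a token).

```python
from itertools import product


def variants(name):
    """A single initial gets a wildcard; a full name is tried whole and as an initial."""
    if len(name) == 1:
        return [name + "*"]
    return [name, name[0]]


def is_initial(token):
    return len(token.rstrip("*")) == 1


def collapsings(tokens):
    """Yield every way of gluing adjacent initials together, including none."""
    # A gap can be closed when a plain initial is followed by another initial.
    gaps = [i for i in range(len(tokens) - 1)
            if len(tokens[i]) == 1 and is_initial(tokens[i + 1])]
    for choice in product([False, True], repeat=len(gaps)):
        glued = {g for g, chosen in zip(gaps, choice) if chosen}
        out = tokens[0]
        for i in range(1, len(tokens)):
            out += ("" if i - 1 in glued else " ") + tokens[i]
        yield out


def search_terms(first_names, last):
    """Build the search terms for a parsed name (first and middle names, last name)."""
    if not first_names:
        return [last]
    # (1) just the first name, and (2) first plus middle names when present.
    name_sets = [first_names[:1]]
    if len(first_names) > 1:
        name_sets.append(first_names)

    terms = []
    for names in name_sets:
        for combo in product(*(variants(n) for n in names)):
            for given in collapsings(list(combo)):
                term = f"{given} {last}"
                if term not in terms:  # keep the list free of duplicates
                    terms.append(term)
    return terms


terms = search_terms(["John", "Jay", "L"], "Doe")
```

For the parsed name from the example, `search_terms(["John", "Jay", "L"], "Doe")` produces the same ten terms listed above, though not necessarily in the same order. How the resulting terms are joined into an actual Solr query is left out, since that depends on the schema and query parser in use.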
If you can think of some edge cases we’ve missed, we’d love to hear about them! We know, of course, that this will fail miserably on non-Western names in non-Latin scripts. Unfortunately, there’s very little test data available in such formats, so we don’t really have anything to go on, and most data from journal publishers is released in romanized or latinized forms (for better and/or worse).