12 August 2010

The Google Book Count is Bunk (Probably)

This past weekend, I pointed to an essay by Leonid Taycher, summarizing the process by which Google Books determines how many books there are.

Jon Stokes at ars technica has responded with "Google's count of 130 million books is probably bunk."

Stokes's principal criticism has to do with the notoriously faulty metadata at Google Books. He writes:
But the problem with Google's count, as is clear from the GBS [Google Book Search] count post itself, is that GBS's metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is "train wreck: a mish-mash wrapped in a muddle wrapped in a mess."
(I've retained the link to Nunberg's post at Language Log on 29 August 2009, "Google Books: A Metadata Train Wreck.")

Just a few days ago, I had occasion to be reminded just how poor Google's metadata is.

(Exactly why we have to use highfalutin jargon like "metadata" when talking about dates of publication is not yet entirely clear to me.)

For a historian, Google Books would seem to have tremendous potential for research: in theory, it is intended to include everything ever published, digitized and searchable. The "Advanced Book Search" interface is still a bit crude for historical research, but it's possible to set searches for (in effect) "everything before 1800." I've done this myself, and found some quite interesting things this way. I'll probably be writing about some of these results here in the near future.

A few days ago, I started reading The Cambridge Companion to Jazz, edited by Mervyn Cooke and David Horn. The first essay in the book is "The word jazz," by Krin Gabbard. On page 2, Gabbard writes:
According to several researchers, the earliest appearance of the word jazz in written form was probably in San Francisco newspapers. In 1913, Ernest J. Hopkins offered this definition: 'something like life, vigor, energy, effervescence of spirit, joy, pep, magnetism, verve, virility, ebulliency, courage, happiness—oh, what's the use?—JAZZ.'
Gabbard doesn't disclose the identity of any of the "several researchers."  Instead, he gives just a single blanket reference further along in the paragraph to Lewis Porter, Jazz: A Century of Change, where perhaps one is supposed to go to look for this information.

Now, The Cambridge Companion to Jazz was published in 2002, before Google Books existed.  So it occurred to me that perhaps a search of Google Books would turn up an earlier printed occurrence of "jazz."  Visions of glory, wealth, and appearances on Oprah danced before my mind's eye. So with eager anticipation, I used Advanced Book Search to look for occurrences of "jazz" before December 1915, with the interface set to show 30 results per page.  (Here's my search.)

And here's what happened.

Every one of the 30 results on the first page of hits was faulty in some way.

Here are the first 15. I've summarized the listings, but give the date exactly as it appears in the search results.

1. F. Scott Fitzgerald, Tales of the Jazz Age, 1910.
No preview (so no way to check the metadata), and no indication of the source of the copy. Fitzgerald's book actually appeared in 1922 (according to Wikipedia, which in this case is certainly more trustworthy).
2. Philippine Magazine, 1909.
A snippet view of an article by Charles E. Griffith, Jr., “Jazz and Music Education.” The header visible in the snippet reads "August 192[6]" (the last digit is too blurry to distinguish reliably from “5”). One supposes that 1909 is the date when the magazine began to appear, but there's no evident way to use Google Books to check this for an item given only in snippet view.
3. School of Music (University of Michigan) publications, 1880.
Snippet view. Here, the accompanying text on the results page makes clear that this hit isn’t relevant. (The text given on the results page is, literally: "versify oi Michigan JAZZ WEEK PRESENTS: hool of Music Wednesday, March 22, 1989 Rackham Auditorium 8:00 pm"). 

I suppose the 1880 comes from the earliest dated School of Music publication, not from the first use of the word "jazz."  (But is it really true that the word "jazz" did not appear in any of these publications until 1989?  Who knows?  At least, there's no evident way to use Google Books to answer that question.)
4. The directory of recorded jazz and swing music, vol. 2. Delphic Press, 1900.
No preview, no source location. A search of WorldCat shows this to be David Arthur Carey and Albert McCarthy, The directory of recorded jazz and swing music. There are two records for this item in WorldCat, one with the date 1950 (which I suppose is correct), and the other 1900 (the likely source of the error in Google Books).
5. The Publishers Weekly, “p. 739,” 1911.
Here we finally have a full view.  The full reference for the hit on "jazz" is:
"Thurm, T: G., comp. / Hawaii: almanac and annual for 1910. '10 / (Ja22) c. il. O. pap., 75 c."
So here the date may be right, but the hit on "jazz" seems to be a mistranscription of the code "Ja22."
6. The living age, 1844.
The snippet view shows references to "jazz band," "jazz orchestra," and "jazz bands."  But this is obviously impossible in 1844, so one supposes, again, that the date refers to the earliest publication of the periodical, not the date of the reference.  But given the highly restricted snippet view, there's no way to tell.
7. Nelson’s Encyclopaedia, 1907.
This one was briefly exciting (what can I say; it was after midnight, and my skepticism was already asleep). We get a full view, and there is an entire paragraph on jazz in the “America” section of the article “Music.” 

However, the footer on the page reads “VOL. VIII.—March ’25”
8. Oscar Peterson live at the Northsea Jazz festival, 1081.
No preview, but we can safely assume this is a typo for 1981, not a reference to a performance by Peterson shortly after the Norman Conquest.
9. Country life, 1901. Vol. 182
7 hits on "jazz," but again a snippet view, so there is no way within Google Books to determine the dates of the issues in which the hits actually appear.  Very likely not 1901.
10. 1865 to the Present[:] A United States History for High Schools, 1865.
9 references to "jazz," and we get a full view.  But we can also see where the "1865" came from.  The copyright date is 1967 (it's a Teacher's Edition, by the way).
11. Practice, New York Institute for Social Therapy and Research, 1891.
This turns out to be a snippet view again, with 10 hits in volume 5, in a context referring to James P. Johnson, Fats Waller, Willie "The Lion" Smith, and Andre Previn.  So not really 1891, I guess.
12. Technology review - Page 438. Massachusetts Institute of Technology. Association of Class Secretaries, Massachusetts Institute of Technology. Alumni Association, MIT Alumni/ae Association, 1899.
A full view.  The context of the hits is rather charming:
Of course, every one remembers Joe Champagne of Tech Show fame. We regret that lack of space forbids reproducing in full the articles appearing in the " Boston Post " beginning January 26, entitled " Can you Jazz? Learn in your own home how to do all the new fascinating steps."
We read on: " Here is a chance to learn from high terpsichorean authority the latest dances from Jazz to Military. Mr. Joseph L. Champagne, maitre de danse at the Copley-Plaza, will instruct Sunday Post readers how to do the Jazz, the Butterfly, the Bounce and the new military steps."
I confess I gave up looking for the title page of the issue (it is very long), but it appears to date from 1919.
13. The Negro; a list of significant books, The New York Public Library, 1900.
Why in the world is this available only in snippet view?  Since the first hit on "jazz" also contains an entry on Roy Campanella, I guess 1900 must be wrong.
14. Deformation III: a chamber concerto for jazz drum set and mallet [percussion ensemble], Gary Powell Nash, 1900.
Well, obviously this date is wrong, but there's no preview at all, so I can't tell you when the piece was actually written or published.
15. The life of the bee, Maurice Maeterlinck, 1901, p. 155.
I was actually excited about this one for a few minutes.  The hit is:  “In the jazz orchestra you can hear the same sort of thing in the facetious jests of the saxophones or the bassoons.”

It turns out that the date is wrong. The title page reads: “The Life of the Bee / by / Maurice Maeterlinck / Translated by Alfred Sutro / NEW YORK / DODD, MEAD AND COMPANY 1911”

But 1911 is actually even more plausible than 1901. So I'm still hopeful.

But wait, something is wrong. The running head on page 155, where the hit occurs, is “Heart and Mind,” and this isn’t in the table of contents of Maeterlinck's book. And why would a book on bees have a reference to jazz bands anyway?

It turns out that there is a second, completely unrelated book beginning immediately after the end of Google's digitized copy of Life of the Bee. It is Franc-Nohain, Life’s An Art, New York: Henry Holt, 1930. And that's where the hit occurs. Since both books are rather long, it seems unlikely that they are bound together in the University of Michigan Library, from which these scans come. But I can't be sure. I think someone simply forgot to divide the digital files for the two books.
The remaining 15 results out of the first 30 for my search are similarly faulty in a variety of ways.  Google Book Search has returned several books on chemistry and medicine because it has mistranscribed "Gazz." (for Gazzetta) as "Jazz."  The Xi Psi Phi Quarterly is given a date of 1909, but the issue in which the word "jazz" occurs seems to date from around 1919 or 1920.  And so on.

So much for my appearance on Oprah...

It was rather alarming to find that several pages in the copy of Life of the Bee were badly ripped. Here are two examples (pp. 390 and 414, respectively). I have the sinking suspicion that the damage may have been caused during the scanning process.

Google Books has a very, very long way to go before it becomes a reliable resource for research. Let's hope it doesn't destroy irreplaceable originals in the meantime.
    "(Exactly why we have to use highfalutin jargon like 'metadata' when talking about dates of publication is not yet entirely clear to me.)"

    It's an example of Joel Spolsky's "Law of Leaky Abstractions" -- Google Books' software developers are concerned with delivering "data" to the end user, that is, the content of the books that they index. The "metadata" is information about the data, i.e., about the physical books whose content they are delivering.

    Google Books delivers content (data).

    Your library's online catalog delivers only metadata.

    But Google Books needs both, as the user needs the functionality of the online catalog, in order to get to the content, just as the visitor to the brick-and-mortar library needs the catalog to retrieve the book whose content is sought (Ha! Just avoided a he/she/they by going passive voice!).

    What I've never understood is why Google chose to reconstruct the metadata from the content, instead of simply importing it from the library catalogs of the institutions whose books were scanned. Every book in Google Books is from an institution with a fully developed online catalog, and in the vast majority of cases that catalog likely held complete and accurate information for each book (often also linking to authority records like MARC and OCLC, which could be used to link different copies of the same book).

    This has always puzzled me. I can't come up with a justification for abandoning the benefit of all that good data, so I'm left with the alarming idea that the people designing the Google Books data acquisition project included no professional librarians, or anybody with any academic or scholarly background (not even in information services -- any scholar would, I'd expect, think of this).

    How else could they have missed something that seems so obvious?

    David W. Fenton