13 August 2010

Daily Digest, 2010.08.12

Thursday: The benefits of self-organizing peer review; the dwindling choice of "other" e-readers; free software for creating ePub e-books; the pros and cons of science blog networks; the sky is always falling for the "content industry" (just ask John Philip Sousa); a review of Faye's Heidegger; defending evolution in the 31st century; the worst recording ever made?; The Beatles complete, on ukulele; Thompson and Bordwell on exposition and cross-cutting in Inception; A Hundred Thousand Billion poems online; two four-eared cats (making eight ears altogether).

12 August 2010

The Google Book Count is Bunk (Probably)

This past weekend, I pointed to an essay by Leonid Taycher, summarizing the process by which Google Books determines how many books there are.

Jon Stokes at ars technica has responded with "Google's count of 130 million books is probably bunk."

Stokes's principal criticism has to do with the notoriously faulty metadata at Google Books. He writes:
But the problem with Google's count, as is clear from the GBS [Google Book Search] count post itself, is that GBS's metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is "train wreck: a mish-mash wrapped in a muddle wrapped in a mess."
(I've retained the link to Nunberg's post at Language Log on 29 August 2009, "Google Books: A Metadata Train Wreck.")

Just a few days ago, I had occasion to be reminded just how poor Google's metadata is.

(Exactly why we have to use highfalutin jargon like "metadata" when talking about dates of publication is not yet entirely clear to me.)

For a historian, Google Books would seem to have tremendous potential for research:  in theory, it is intended to include everything ever published, digitized and searchable.  The "Advanced Book Search" interface is still a bit crude for historical research, but it's possible to set searches for (in effect) "everything before 1800."  I've done this myself, and found some quite interesting things this way.  I'll probably be writing about some of these results here in the near future.

A few days ago, I started reading The Cambridge Companion to Jazz, edited by Mervyn Cooke and David Horn. The first essay in the book is "The word jazz," by Krin Gabbard. On page 2, Gabbard writes:
According to several researchers, the earliest appearance of the word jazz in written form was probably in San Francisco newspapers. In 1913, Ernest J. Hopkins offered this definition: 'something like life, vigor, energy, effervescence of spirit, joy, pep, magnetism, verve, virility, ebulliency, courage, happiness—oh, what's the use?—JAZZ.'
Gabbard doesn't disclose the identity of any of the "several researchers."  Instead, he gives just a single blanket reference further along in the paragraph to Lewis Porter, Jazz: A Century of Change, where perhaps one is supposed to go to look for this information.

Now, The Cambridge Companion to Jazz was published in 2002, before Google Books existed.  So it occurred to me that perhaps a search of Google Books would turn up an earlier printed occurrence of "jazz."  Visions of glory, wealth, and appearances on Oprah danced before my mind's eye. So with eager anticipation, I used Advanced Book Search to look for occurrences of "jazz" before December 1915, with the interface set to show 30 results per page.  (Here's my search.)

And here's what happened.

Every one of the 30 results on the first page of hits was faulty in some way.

Here are the first 15.  I've summarized the listings, but give the date exactly as it appears in the search results:

1. F. Scott Fitzgerald, Tales of the Jazz Age, 1910.
No preview (so no way to check the metadata), and no indication of the source of the copy. Fitzgerald's book actually appeared in 1922 (according to Wikipedia, which in this case is certainly more trustworthy).
2. Philippine Magazine, 1909.
A snippet view of an article by Charles E. Griffith, Jr., "Jazz and Music Education." The header visible in the snippet reads "August 192[6]" (the last digit is too blurry to distinguish reliably from "5"). One supposes that 1909 is the date when the magazine began to appear, but there's no evident way to use Google Books to check this for an item given only in snippet view.
3. School of Music (University of Michigan) publications, 1880.
Snippet view. Here, the accompanying text on the results page makes clear that this hit isn’t relevant. (The text given on the results page is, literally: "versify oi Michigan JAZZ WEEK PRESENTS: hool of Music Wednesday, March 22, 1989 Rackham Auditorium 8:00 pm"). 

I suppose the 1880 comes from the earliest dated School of Music publication, not from the first use of the word "jazz."  (But is it really true that the word "jazz" did not appear in any of these publications until 1989?  Who knows?  At least, there's no evident way to use Google Books to answer that question.)
4. The directory of recorded jazz and swing music, vol. 2. Delphic Press, 1900.
No preview, no source location. A search of WorldCat shows this to be David Arthur Carey and Albert McCarthy, The directory of recorded jazz and swing music. There are two records for this item in WorldCat, one with the date 1950 (which I suppose is correct), and the other 1900 (the likely source of the error in Google Books).
5. The Publishers Weekly, “p. 739,” 1911.
Here we finally have a full view.  The full reference for the hit on "jazz" is:
"Thurm, T: G., comp. / Hawaii: almanac and annual for 1910. '10 / (Ja22) c. il. O. pap., 75 c."
So here the date may be right, but the hit on "jazz" seems to be a mistranscription of the code "Ja22."
6. The living age, 1844.
The snippet view shows references to "jazz band," "jazz orchestra," and "jazz bands."  But this is obviously impossible in 1844, so one supposes, again, that the date refers to the earliest publication of the periodical, not the date of the reference.  But given the highly restricted snippet view, there's no way to tell.
7. Nelson’s Encyclopaedia, 1907.
This one was briefly exciting (what can I say; it was after midnight, and my skepticism was already asleep). We get a full view, and there is an entire paragraph on jazz in the “America” section of the article “Music.” 

However, the footer on the page reads “VOL. VIII.—March ’25”
8. Oscar Peterson live at the Northsea Jazz festival, 1081.
No preview, but we can safely assume this is a typo for 1981, not a reference to a performance by Peterson shortly after the Norman Conquest.
9. Country life, 1901. Vol. 182
7 hits on "jazz," but again a snippet view, so there is no way within Google Books to determine the dates of the issues in which the hits actually appear.  Very likely not 1901.
10. 1865 to the Present[:] A United States History for High Schools, 1865.
9 references to "jazz," and we get a full view.  But we can also see where the "1865" came from.  The copyright date is 1967 (it's a Teacher's Edition, by the way).
11. Practice, New York Institute for Social Therapy and Research, 1891.
This turns out to be a snippet view again, with 10 hits in volume 5, in a context referring to James P. Johnson, Fats Waller, Willie "The Lion" Smith, and Andre Previn.  So not really 1891, I guess.
12. Technology review - Page 438. Massachusetts Institute of Technology. Association of Class Secretaries, Massachusetts Institute of Technology. Alumni Association, MIT Alumni/ae Association, 1899.
A full view.  The context of the hits is rather charming:
Of course, every one remembers Joe Champagne of Tech Show fame. We regret that lack of space forbids reproducing in full the articles appearing in the " Boston Post " beginning January 26, entitled " Can you Jazz? Learn in your own home how to do all the new fascinating steps."
We read on: " Here is a chance to learn from high terpsichorean authority the latest dances from Jazz to Military. Mr. Joseph L. Champagne, maitre de danse at the Copley-Plaza, will instruct Sunday Post readers how to do the Jazz, the Butterfly, the Bounce and the new military steps."
I confess I gave up looking for the title page of the issue (it is very long), but it appears to date from 1919.
13. The Negro; a list of significant books, The New York Public Library, 1900.
Why in the world is this available only in snippet view?  Since the first hit on "jazz" also contains an entry on Roy Campanella, I guess 1900 must be wrong.
14. Deformation III: a chamber concerto for jazz drum set and mallet [percussion ensemble], Gary Powell Nash, 1900.
Well, obviously this date is wrong, but there's no preview at all, so I can't tell you when the piece was actually written or published.
15. The life of the bee, Maurice Maeterlinck, 1901, p. 155.
I was actually excited about this one for a few minutes.  The hit is:  “In the jazz orchestra you can hear the same sort of thing in the facetious jests of the saxophones or the bassoons.”

 It turns out that the date is wrong. The title page reads:  “The Life of the Bee / by / Maurice Maeterlinck / Translated by Alfred Sutro / NEW YORK / DODD, MEAD AND COMPANY 1911”

But 1911 is actually even more plausible than 1901. So I'm still hopeful.

But wait, something is wrong. The running head on page 155, where the hit occurs, is “Heart and Mind,” and this isn’t in the table of contents of Maeterlinck's book. And why would a book on bees have a reference to jazz bands anyway?

It turns out that there is a second, completely unrelated book beginning immediately after the end of Google's digitized copy of Life of the Bee.  It is Franc-Nohain, Life’s An Art,  New York: Henry Holt, 1930.  And that's where the hit occurs.  Since both books are rather long, it seems unlikely that they are bound together in the University of Michigan Library, from which these scans come.  But I can't be sure.  I think someone simply forgot to divide the digital files for the two books.
The remaining 15 results out of the first 30 for my search are similarly faulty in a variety of ways.  Google Book Search has returned several books on chemistry and medicine because it has mistranscribed "Gazz." (for Gazzetta) as "Jazz."  The Xi Psi Phi Quarterly is given a date of 1909, but the issue in which the word "jazz" occurs seems to date from around 1919 or 1920.  And so on.
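
The check I kept doing by hand above (does the visible text mention a year later than the date in the listing?) can be roughed out in a few lines. This is only a sketch in Python, not anything Google Books offers: the function name later_years and the year pattern are assumptions made for illustration, and it looks only at four-digit years, so a footer like "March ’25" would slip past it.

```python
import re

# Hypothetical helper, not part of any Google Books interface.
# Matches four-digit years from 1000 to 2099 standing alone as words.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")

def later_years(listed_year, visible_text):
    """Return the years mentioned in visible_text that come after listed_year."""
    years = {int(y) for y in YEAR.findall(visible_text)}
    return sorted(y for y in years if y > listed_year)

# Result 3: listed as 1880, but the snippet text itself mentions 1989.
snippet = ("versify oi Michigan JAZZ WEEK PRESENTS: hool of Music "
           "Wednesday, March 22, 1989 Rackham Auditorium 8:00 pm")
print(later_years(1880, snippet))

# Result 15: listed as 1901, but the title page reads 1911.
print(later_years(1901, "DODD, MEAD AND COMPANY 1911"))
```

An empty list doesn't mean the date is right (result 5 passes this check and is still a false hit on "Ja22"); a non-empty one just means the listing deserves a second look.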

So much for my appearance on Oprah...

It was rather alarming to find that several pages in the copy of Life of the Bee were badly ripped. Here are two examples (pp. 390 and 414, respectively).  I have the sinking suspicion that the damage may have been caused during the scanning process.



Google Books has a very, very long way to go before it becomes a reliable resource for research.  Let's hope it doesn't destroy irreplaceable originals in the meantime.

Daily Digest, 2010.08.11

Wednesday: more on Marc Hauser; the case against tenure; copyright protects monopoly rights, not creators; philosopher of science David Hull has died; evidence that A. afarensis (Lucy's species) used stone tools for butchering; the Perseid Meteor Shower; problems with the Hamilton Rating Scale for Depression; a collection of Photochroms from 1890-1910; my personal meat cleaver; the Barrison Sisters (Adults Only).

Daily Digest, 2010.08.10

Tuesday: Fish on plagiarism; college students transfer bad study habits from paper to computer; review of a new history of copyright; books are dying, and that's okay; review of a history of the German language; the 15 most overrated contemporary American writers; the genetics of morphological variation in dogs; Britain's oldest house; Culture Evolves!; a bibliographic database on human evolution; a neuroscientist's description of her own psychosis; the Seattle Opera girds for battle; Magic Purple Sunshine; 1906 color photos of Europe.

11 August 2010

Daily Digest, 2010.08.09

Monday: Negroponte predicts demise of the book in 5 years; Gates predicts demise of university; Reynolds predicts burst of higher education bubble; Marc Hauser under investigation for research misconduct; new neurons tunnel their way to their new homes; the evolution of ecology; adaptation and admixture in evolution; the evolution of the human shoulder; a new review-article on gene-culture co-evolution; negatively associated stimuli generalize more broadly than positively associated ones; neuroeconomics and the role of oxytocin in economic decision making; an interview with Greil Marcus about Van Morrison; Villazón walks out of Copenhagen concert after 7 minutes (no refunds); how Star Trek: The Next Generation predicted the iPad.

09 August 2010

Weekend Roundup, 6 to 8 August 2010

This Weekend (including Friday): Eric Schmidt on the information explosion; Pentagon wants leaked Afghan documents "returned" (?!); the death of the phone call; how many books are there?; Manjoo reviews Shirky; on the job with a BMI field agent; more on Rozin's critique of scientific psychology (and a link); emotional memories and sensory triggers; genome-wide association study fails to find any simple link between genes and personality; Munch's cure; Heisenberg's uncertainty principle still certainly uncertain; the human Y chromosome (pathetic looking little devil); Diana Deutsch on music and language; Wagner for Children at Bayreuth; Joseph Horowitz on "reinventing the orchestra"; a musical memorial for the Hiroshima anniversary; on the non-existence of music (or: music theory as a figment of the imagination); Thompson and Bordwell on Inception; compensation to hunters for radioactive boars; and a gypsy prediction that comes true.

08 August 2010

More on Frink and the Iterative Translation Game

Yesterday I wrote a long post on Frink, a programmable calculator that is magnificently aware of units of measurement, in a way that your current calculator simply isn't.

That post ended with an introductory look at Frink's translation functions, and their use in the iterative translation game originally suggested by games with words. I suggested that it would be simple to write a program in Frink to extend the number of iterations, but I left this as an exercise for my readers.

However, my readers were slow on the uptake, and since I was looking for a way to fritter away time last night (I was taking the day off from practicing), I couldn't help playing around with this myself.  Thus my two crude first-approximation programs.

The first uses Frink's own translation functions (the ones I used in yesterday's post):

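// Prompt for a sentence, then for the number of round trips to run.
while phrase=input["Enter test sentence: "]
{
    n=input["Enter number of iterations: "]
    i=0
    // eval[] turns the text typed at the prompt into a number.
    while i < eval[n]
    {
        // One round trip: English to Japanese and back to English.
        phrase=Japanese[phrase] -> JapaneseToEnglish
        println[phrase]
        i=i+1
    }
}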

The "intermediate" language here (Japanese) is hard-wired into the program (which is one of the reasons I call the program "crude"), but it is easy enough to change it. The eval[] function forces Frink to evaluate the input for "number of iterations" as an integer rather than a string.

The iterative results for Japanese using Frink's translation function can be wonderfully weird.  For example, here are the first 16 iterations for "Mary had a little lamb, whose fleece was white as snow":
  • The wool there was a white small lamb in Mary as a snow.
  • Then the wool was the lamb whose Mary is small white as a snow.
  • Then as for the wool Mary it was the lamb which is small white as a snow.
  • Then Mary's because of the wool those where it is small white as a snow were the lamb.
  • Then because the snow was the lamb, the place where it is the white where that is small because of the wool Mary those.
  • Then because the snow was the lamb, as for those that being small Mary's because of the wool the place where it is white.
  • Then small Mary is white because of the wool, assuming, that it is, because the snow was the lamb, because of those the place. 
  • Then because small Mary the snow was the lamb, that because of those which are something which has become the place so thing is white because of the wool which is supposed, is.
  • Then because small Mary was the snow lamb, it is that, that because some became the place and it is white because of the wool where therefore thing is supposed.
  • Then because small Mary was the lamb of the snow, therefore because part became the place, the fact that it is white because of the wool where that is supposed that is.
  • Then therefore because small Mary was the lamb of the snow, because the part became the place, the fact that it is white because of the wool where that it is supposed.
  • Then therefore because the part became the place, because small Mary was the lamb of the snow, the fact that that is white because of the wool which is supposed.
  • Then therefore because small Mary was the lamb of the snow, because the part became the place, the fact that it is white because of the wool where that is supposed.
  • Then therefore because the part became the place, because small Mary was the lamb of the snow, the fact that it is white because of the wool where that is supposed. 
  • Then therefore because small Mary was the lamb of the snow, because the part became the place, the fact that it is white because of the wool where that is supposed.
  • Then therefore because the part became the place, because small Mary was the lamb of the snow, the fact that it is white because of the wool where that is supposed.
Note that the translation reaches a kind of "repeating decimal," alternating between the last two states. Because I don't know how Frink's translation function works (you have to be connected to the Internet to use it), I have no idea why these results turn out as they do, although it's clear that these functions don't draw on Google Translate, as Frink has a separate function for that, translate[].

Here's my similarly crude program using the translate[] function:

// Same loop as before, but each round trip goes through translate[].
while phrase=input["Enter test sentence: "]
{
    n=input["Enter number of iterations: "]
    i=0
    while i < eval[n]
    {
        // English to Japanese, then Japanese back to English.
        phrase=translate[phrase,"en","ja"]
        phrase=translate[phrase,"ja","en"]
        println[phrase]
        i=i+1
    }
}

Again using "Mary had a little lamb...," it turns out that using Google Translate to Japanese and back will continually return the original sentence.

Provided, that is, that you don't forget the comma in the middle of the sentence.  The first time I ran the program, I forgot the comma, and got the following result:
  • Mary had a little lamb its fleece as white as snow.
  • Mary's little lamb, its fleece was white as snow.
  • Mary Lamb, its fleece was white as snow.
  • Mary Lamb, its fleece was white as snow.
which reaches equilibrium after three iterations.
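
Rather than asking for a fixed number of iterations and eyeballing the output, the loop could stop by itself as soon as an output repeats. Here is a minimal sketch of that idea, written in Python rather than Frink. The names iterate_until_repeat and replay are hypothetical, and replay calls no translation service at all: it simply looks up the outputs recorded in the list above, with the comma-less input sentence assumed.

```python
def iterate_until_repeat(step, phrase, max_iters=50):
    """Apply step over and over until an output has been seen before.

    Returns (outputs, first_seen, period): the distinct outputs in order,
    the 1-based iteration at which the repeated output first appeared,
    and the length of the cycle (1 means a true equilibrium, 2 means the
    output alternates between two states). If nothing repeats within
    max_iters, first_seen and period are both None.
    """
    seen = {}      # output -> iteration at which it first appeared
    outputs = []
    for i in range(1, max_iters + 1):
        phrase = step(phrase)
        if phrase in seen:
            return outputs, seen[phrase], i - seen[phrase]
        seen[phrase] = i
        outputs.append(phrase)
    return outputs, None, None

# Stand-in for a real round trip: replays the results listed above.
START = "Mary had a little lamb whose fleece was white as snow"  # assumed input
RECORDED = {
    START: "Mary had a little lamb its fleece as white as snow.",
    "Mary had a little lamb its fleece as white as snow.":
        "Mary's little lamb, its fleece was white as snow.",
    "Mary's little lamb, its fleece was white as snow.":
        "Mary Lamb, its fleece was white as snow.",
    "Mary Lamb, its fleece was white as snow.":
        "Mary Lamb, its fleece was white as snow.",
}

def replay(phrase):
    return RECORDED[phrase]

outputs, first_seen, period = iterate_until_repeat(replay, START)
print(first_seen, period)
```

For the recorded run this reports an equilibrium (period 1) first reached at iteration 3. The same function would report a period of 2 for the kind of alternation that Frink's own translation functions settle into.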

And finally, the opening sentence from Pride & Prejudice, run iteratively through Google Translate to Japanese and back, which reaches amusing equilibrium after 5 iterations:
  • Truth universally acknowledged that the wife must have selected a man possessed of good fortune.
  • Is a universal truth, admitted that his wife should be held to select a lucky guy.
  • Universal truth, it is acknowledged that his wife should be held to select the best.
  • Universal truth is that his wife should have accepted to host the best choice.
  • Universal truth, his wife should have accepted to host the best choice.
  • Universal truth, his wife should have accepted to host the best choice.
I could spend the whole day playing with this, but it's time for breakfast...

[UPDATE: OK, I admit it, I'm an addict. I couldn't stop. But this one is too good not to add.

Here's the iterative Google translation to Japanese and back of the first sentence of Nabokov's Lolita, "Lolita, light of my life, fire of my loins.":
  • My waist Lolita, light of my life, fire.
  • Lolita my waist, my life, light the fire.
  • Lolita back my life, fire light.
  • My life back to Lolita, light of fire.
  • Back to my life Lolita, light of fire.
  • Back to the top of my life Lolita, light of fire.
  • Lolita my life back to the top of the light of fire.
  • Light the fire on my life back to Lolita.
  • Lolita is a light in my life to fight back.
  • Lolita is a light in my life to fight back.
This reaches equilibrium after 9 iterations.]