Creating a Trillion-Field Catalog: Metadata in Google Books

Jon Orwant

Those who stayed for the last plenary presentation on Friday enjoyed a treat.  One of the most fascinating presentations of the conference was by Jon Orwant, Engineering Manager on the Google Books project.  Google Books is in accord with Google’s mission to organize the world’s information and make it universally accessible and useful.  So far, Google has scanned about 15 million books, about 10% of those available.  This amounts to about 4 billion pages and 2 trillion words.  Google collects metadata from over 100 sources, parses the records, creates a “best” record for each data cluster, and displays appropriate parts of it on the site.  Inconsistencies cause problems, particularly with multi-volume works and languages using non-Roman character sets.  One might think that ISBNs would help, but they are far from unique; in fact, ISBN 753305353 is shared by 1,413 books, and 6,000 ISBNs are associated with more than 20 titles each!  Google has scanned books in 463 languages, some of them used in only a small area and some of which are no longer used.  There are even 3 books in the database in Klingon!  (Don’t try to search for them; many of the languages do not appear in the dropdown box on the Advanced Search page.)  Books in many of the unusual languages have come from Christian missionaries as a result of their evangelical work.
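To make the "best record per cluster" idea concrete, here is a minimal sketch in Python.  It is not Google's method, which the talk did not describe in this detail.  It assumes a crude cluster key (normalized title plus author surname) and a simple majority vote per field; the function names, source names, sample records, and ISBN values are all hypothetical placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical records from different metadata sources (ISBNs are placeholders).
records = [
    {"source": "library_a", "isbn": "1111111111", "title": "Moby-Dick",
     "author": "Melville, Herman", "year": "1851"},
    {"source": "publisher_b", "isbn": "1111111111", "title": "Moby Dick",
     "author": "Herman Melville", "year": "1851"},
    {"source": "retailer_c", "isbn": "1111111111", "title": "Moby Dick",
     "author": "Herman Melville", "year": "1952"},
    {"source": "library_a", "isbn": "2222222222", "title": "Walden",
     "author": "Thoreau, Henry David", "year": "1854"},
    {"source": "retailer_c", "isbn": "1111111111", "title": "Typee",
     "author": "Herman Melville", "year": "1846"},
]


def surname(author):
    """Return a lowercased surname from 'Last, First' or 'First Last'."""
    author = author.strip().lower()
    if "," in author:
        return author.split(",")[0].strip()
    return author.split()[-1] if author else ""


def cluster_key(record):
    """Assumed clustering rule: alphanumeric-only title plus author surname."""
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (title, surname(record["author"]))


def build_clusters(records):
    """Group records that appear to describe the same work."""
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_key(record)].append(record)
    return clusters


def best_record(cluster):
    """Pick the most common non-empty value for each field."""
    best = {}
    for field in ("title", "author", "year", "isbn"):
        values = [r[field] for r in cluster if r.get(field)]
        if values:
            best[field] = Counter(values).most_common(1)[0][0]
    return best


def overloaded_isbns(clusters, min_titles=2):
    """Report ISBNs that are attached to several distinct clusters."""
    keys_by_isbn = defaultdict(set)
    for key, cluster in clusters.items():
        for record in cluster:
            if record.get("isbn"):
                keys_by_isbn[record["isbn"]].add(key)
    return {isbn: len(keys) for isbn, keys in keys_by_isbn.items()
            if len(keys) >= min_titles}


clusters = build_clusters(records)
moby = best_record(clusters[("mobydick", "melville")])
shared = overloaded_isbns(clusters)
```

Because the placeholder ISBN `1111111111` appears on records for two different works, it shows up in `shared`, which is a small-scale version of the non-unique ISBN problem described above.  A real system would need a far more robust clustering rule than title and surname matching.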

Google has developed special handling methods to scan books from libraries without damaging them, and it uses sophisticated algorithms to identify textual areas, images, tables, etc.  The aim is to understand the various parts of a book so that all the pages can be tagged.  Once the books have been digitized and run through optical character recognition, a large corpus of data is available for searching and for other interesting purposes.  Using their well-known 20% “free” time, several Google engineers have developed fascinating applications, such as a mashup with Google Maps showing all place names mentioned in a book, insights into human knowledge such as language changes over time, and publication rates of book subjects as a function of publication date.  Google even makes grants available to scientists and linguistic analysts for research projects because it regards books as a corpus of human knowledge and a reflection of cultural and societal trends over time.
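As a rough illustration of how a digitized corpus can reveal language change over time, the following Python sketch computes a word's relative frequency per publication year.  The tiny corpus, the whitespace tokenizer, and the function name `yearly_frequency` are assumptions made for the example; they do not describe how the Google engineers built their applications.

```python
from collections import Counter

# Hypothetical mini-corpus: (publication year, OCR text).
corpus = [
    (1900, "the telegraph and the railway"),
    (1900, "a telegraph office"),
    (2000, "the internet and the telephone"),
]


def yearly_frequency(corpus, word):
    """Return {year: occurrences of word / total tokens published that year}."""
    totals = Counter()
    hits = Counter()
    target = word.lower()
    for year, text in corpus:
        tokens = text.lower().split()  # naive whitespace tokenization
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


telegraph_trend = yearly_frequency(corpus, "telegraph")
```

Dividing by the total token count for each year matters because far more text is published in some years than in others; raw counts alone would mostly reflect the size of the corpus.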

Don Hawkins
Columnist, Information Today and Conference Circuit Blog Editor

