A total of 720 attendees assembled in Boston for the 33rd annual meeting of the Society for Scholarly Publishing (SSP) on June 1-3. (This is significantly more than last year’s attendance of 595.)

Jon Orwant

The opening session drew an overflow crowd to hear Jon Orwant, Engineering Manager at Google, speak on “Approximating Omniscience”.  He began by observing that we have access to more ideas than ever before, but because our efficiency at finding scholarly information has not kept pace, an era of experimentation is beginning.  Scholarly publishing is an inefficient market of ideas, with an excess of both supply and demand.  Publishers can reduce this inefficiency by packaging ideas into new forms targeted at people who are not currently being reached.  The more structured the content, the more cheaply this can be done.

Google wants to digitize the world’s books, and so far it has digitized about 10% of them.  Twice a week, it counts all the books in the world by looking at union catalogs, WorldCat, and other major catalogs, from which it has concluded that there are 129 million books in the world.  Extending this estimate, if you count all the books, scholarly publications, and inventions, there are probably about 200 million objects.  Can they all be visualized?  Orwant showed them as a graph over time from 1600 to date, which yielded a fascinating visualization.

Graph of the world’s literature by subject, according to Google

Other visualization techniques, like a tree map, can be used to examine the corpus in different ways.

In a financial model of reading, readers invest time in books and are paid out, slowly, in ideas.  You can therefore think of a book as being like a savings bond.  Papers are then like stocks, and journals are like mutual funds.

Information financial model

Comparing scholarship to finance, publishing tells customers what’s good but does not support research directly.  Finance develops new instruments in response to customer demand, but publishing rarely does this.  Can publishing do more to support research?  One experiment is the mutual fund approach:  mix and match articles and chapters to create a book tailored to an individual or classroom.

Google is funding researchers to do interesting types of data mining on its huge data sets; it has received a number of proposals and has funded 29 of them.  The idea of a “semantic stack” was developed to evaluate the proposals; the research will allow us to move up the stack.  Here are 2 semantic stacks:  one for books and one for videos.

Semantic stack for books

Semantic stack for videos

Orwant has developed a “books ngram viewer” that shows graphs of how often phrases appear in books published in various time periods and can lead to fascinating conclusions.  For example, here is a comparison of “kindergarten”, “nursery school”, and “child care” in … that shows that “child care” has become much more popular in recent years.

Using this technique, a comparison of “The United States is” and “The United States are” between 1780 and 1900 in American English literature enables one to conclude that the US as a single unified country became a much more popular mindset after the Civil War.
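To make the idea concrete, here is a minimal sketch of the kind of counting that could sit behind such a comparison.  It is not Google's implementation:  the function names, the crude tokenizer, and the four-sentence toy corpus are all assumptions made up for illustration.  For each year it computes how often a phrase occurs as a fraction of all phrases of the same length.

```python
import string


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation (a crude assumption)."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into a phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frequency_by_year(corpus, phrase):
    """Map each year to the share of same-length n-grams that equal the phrase.

    corpus is a hypothetical structure: {year: [text, text, ...]}.
    """
    target = " ".join(tokenize(phrase))
    n = len(target.split())
    frequencies = {}
    for year in sorted(corpus):
        hits = 0
        total = 0
        for text in corpus[year]:
            grams = ngrams(tokenize(text), n)
            total += len(grams)
            hits += grams.count(target)
        frequencies[year] = hits / total if total else 0.0
    return frequencies


# Invented sentences, used only to exercise the functions above.
toy_corpus = {
    1800: ["The United States are bound by treaty.",
           "The United States are many."],
    1900: ["The United States is a nation.",
           "The United States is strong."],
}

is_freq = phrase_frequency_by_year(toy_corpus, "The United States is")
are_freq = phrase_frequency_by_year(toy_corpus, "The United States are")

for year in sorted(toy_corpus):
    print(year, round(is_freq[year], 3), round(are_freq[year], 3))
```

Dividing by the total number of n-grams for each year matters because far more books survive from later years; raw counts alone would make almost every phrase look as if it were becoming more popular.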

Another researcher has developed a “Music Ngram Viewer” that shows a time series of a particular melody appearing in a database of sheet music.

This is a new entry into the world of digital humanities and suggests promising areas of research that might be able to tell us how publishing should change.  Here are 2 examples:

  • “Transcoding”, format conversion (now used to create PDFs for print, HTML for the web), might be used to render an article simultaneously into various versions for different audiences (academic peers, lay audiences, non-native speakers).
  • Articles and books could be treated as apps, letting readers play with the data, like this

The problems with this include the high cost of apps (which is expected to decline sharply soon) and getting the rights to the works (which is much harder).

Orwant concluded with this list of 7 experiments he would like to see done.


