
A Future for Libraries and Scholarship: SSP 2011 Closing Plenary Session

David Smith, Moderator (L), and John Palfrey

The closing plenary was presented by John Palfrey, Professor, Harvard Law School, a faculty director of the Berkman Center for Internet & Society, and a co-author of Born Digital.  Palfrey is also Vice Dean for Library and Information Resources at Harvard Law School, so he focused much of his presentation on 4 broad issues currently affecting libraries and librarians.  He noted that although there is common cause among scholars, teachers, publishers, and librarians, we each face perils in our own way.  Why do we still need libraries, publishers, etc.?  What is our role in the learning space?

1. Changing patterns of learning

Youth and media are both born digital, which is not difficult to observe.

Kids have great digital skills and access information in different ways.  But it is not just kids who are learning differently: everyone has a smartphone, BlackBerry, or similar device, and they are widely used.  By 2012, we will be more likely to access the web on a mobile device than on a PC.  This will not be a distraction but a shift toward interaction and multitasking; even now, many of us do a variety of things while we learn.

The media we interact with are digital, whether images (Flickr), audio/video (YouTube), or print (Google).  Only when reading a monograph do students customarily use printed materials, and the reasons they give are "Bed, Bath, and Beach."  In those circumstances, they prefer print by an 80/20 margin.


Credibility.  There is a lot of misinformation on the web, as well as other hidden influences.  Almost all students will look for information first on Google, and then on Wikipedia.  Some cut and paste information from the Web into their papers; others don't trust anything online.  Almost none of them go to the history or discussion pages of Wikipedia to assess the quality of the information, but most will go to the bottom of the Wikipedia page and click on the source links.

Overload.  There is too much information.  A major challenge is how to make more use of time when we are connected.

2.  Innovative teaching

How do we harness what is great about the digital era?  What will connected learning look like?  Being in the mode of a creator is very important.  The creators of today’s information are also creators of the code for such services as Facebook, Google, etc.  They were students when they started creating their services.

3.  Changing patterns of research and publishing

Open access is a major innovation in digital scholarship.  How do we make more of our libraries (Harvard has 73 of them)?  It is not about budget cutting.  How do we learn from our peers?  What can we do to support digital scholarship? Harvard Law School has committed to open access in faculty publications,

Harvard Law School Faculty Policy on Open Access

and is facilitating open access for student publications as well.

Open Access to Student Writings

A fund has been created to pay the fees if necessary.

How is our work related to mass digitization projects?  Harvard participates in the Google book scanning project and has launched a Digital Public Library of America project.  The challenge is whether we can create a free library on the scale of a large academic library while remaining consistent with copyright law.  This will be a useful project for the many people who are involved in text mining of these huge databases.

4.  Changing Roles for libraries and librarians

Even the richest schools do not have increasing library budgets. The best case today is that they will be flat.  We are asked to get more materials from more places.  How do we make a future for ourselves?  We are not doing enough to connect our users to all types of content, especially digital materials.  How do we think about space in a way that connects the physical and digital?  Only 1% of the employees of the Harvard Library are devoted to managing the library’s website, even though half of the library’s traffic comes through it.

How do we architect classrooms for the digital age?  We need to think about how the information architecture relates to the physical architecture.  How do we think about sharing our collections differently?  No great library can go it alone in today’s environment of non-growing budgets.  We must be more precise with our acquisition policies and determine what we have that no other library does and which we therefore have an obligation to collect.

Should we be in the business of creating better and different interfaces for people to access repositories?  We need to be aggressively in the business of creating more content online.  People are worried about losing the idea of serendipity, so we need to present information in ways that enhance serendipity.  We can create interfaces that will allow people to interact with information in ways that they cannot physically.  Circulation data can be used to see which materials are most popular and can influence how they are "arranged" on a digital shelf, helping people find information more easily.
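The "digital shelf" idea can be sketched in a few lines: rank titles by how often they circulate and surface the most popular first.  The titles and counts below are invented for illustration; real systems would draw on actual circulation records.

```python
# Illustrative sketch: rank titles on a "digital shelf" by circulation count.
# The title/count data below are hypothetical, not real library records.
circulation = {
    "Born Digital": 412,
    "The Shallows": 387,
    "Data Smog": 151,
    "A Treatise on Probability": 23,
}

def digital_shelf(counts, top_n=3):
    """Return the top_n most-circulated titles, most popular first."""
    return [title for title, _ in
            sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]]

print(digital_shelf(circulation))
# → ['Born Digital', 'The Shallows', 'Data Smog']
```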

We need to pay attention to new technologies and be in the business of adopting and shaping them.  They may well be disruptive.


Look at where the problems lie and turn the challenges into opportunities.  Opportunities lie in the areas of information creation, participation, and empowering individuals.  We need to recognize that we have a role to play in recreating our institutions.  We are in a digital-plus era that is driving a profound transition in every field.  It comes back to our mission as teachers, librarians, and publishers, which is the same even though we are all being disrupted in different ways.

Here are Palfrey’s final conclusions.

Don Hawkins
Columnist, Information Today and Conference Circuit Blog Editor

Can You Trust Scholarly Information?

Trust Panel: (L-R) Carol Anne Meyer, Howard Ratner, Jan Brase

CrossMark Initiatives:  Why a Monkey Matters
Carol Anne Meyer, Director of Business Development and Marketing, CrossRef

What do monkeys have to do with publishing?  Well, of course there's the "infinite monkey theorem" about an infinite number of monkeys typing for an infinite time and eventually producing something sensible, like a work of Shakespeare!  (Even with just 50 monkeys, the probability of producing just a single word, like "Hamlet," has been estimated at approximately 1 in 15 billion.)  But I digress:  Carol Anne Meyer explained that monkeys matter in scholarly publishing because a scholarly paper on monkey behavior was retracted because of misconduct by the author.  So the question of trust is very important.  Although a blog, Retraction Watch, tracks retractions, websites do not handle them consistently, with the result that readers may never know that an article has been retracted.  ScienceDirect adds "RETRACTED" to the titles of such articles, but some websites do not offer any type of indication.  And what about e-books or results from federated search systems?

Meyer pointed out that many things may happen to an article after it is published.  Here are some of them:

Documents on the web are living and can be easily changed.  When content changes, readers need to be aware of it.  Which version is the version of record?  Most reputable publishers are trying to communicate this information, some better than others.  Here is a record of an article from Science, showing a link to a correction.

Link to an article correction

CrossRef has attempted to solve this problem by developing CrossMark, a logo that can be attached to a paper indicating that updates exist.  When the user clicks on the logo, a popup window opens listing the available updates.

CrossMark logo and popup window

This logo can be applied to PDFs or other web documents, providing a way for the publisher to list the Publication Record information, such as funding agencies, publication history, plagiarism screening, license types, etc.  The logo could even be displayed in Google search results, indicating the version of record.  It is important to note that CrossMark is not a DRM system.  A pilot test of CrossMark is underway now, with a 3rd quarter launch envisioned.
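As a hedged illustration of the underlying idea: update notices can be carried in a document's machine-readable metadata and detected programmatically.  The record below is invented, shaped loosely like CrossRef's JSON metadata (the "update-to" field is how a notice can point at the DOI it amends); treat the exact field names as assumptions, not CrossRef's definitive schema.

```python
# Illustrative sketch: flag documents that announce updates (retractions,
# corrections) via CrossRef-style metadata. The record below is a made-up
# example; field names are assumptions modeled on CrossRef's JSON.
sample_record = {
    "DOI": "10.1000/retraction.123",
    "title": ["Retraction notice"],
    "update-to": [
        {"DOI": "10.1000/monkey.paper.456", "type": "retraction",
         "updated": {"date-parts": [[2011, 3, 15]]}},
    ],
}

def updates_in(record):
    """Return (type, target_doi) pairs for every update this record announces."""
    return [(u.get("type", "update"), u["DOI"]) for u in record.get("update-to", [])]

for kind, doi in updates_in(sample_record):
    print(f"This document is a {kind} of {doi}")
```

A reader-facing service (or a CrossMark-style popup) could then render these pairs as "this article has been retracted" banners.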

ORCID: An Open Registry of Scholarly IDs
Howard Ratner, Chairman, ORCID, Inc. and CTO, Nature Publishing Group

Researchers care about their identity when they join a faculty, apply for a grant, or submit a manuscript to a publisher.  ORCID (the Open Researcher & Contributor ID project) supports the scholarly record by creating reliable identifiers for authors.  It was started in December 2009, with launch planned for early 2012.

ORCID is a non-profit consortium of 230+ participants, with the largest group being international universities and societies.  It is open to any organization with an interest in scholarly communication.  There are many identifier silos; ORCID hopes to bridge them.  ORCID’s mission is to create a permanent, clear, and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors.  All software developed by ORCID will be released under an Open Source Software license, and fees collected will be used to ensure the longevity of ORCID.

ORCID's first effort will be disambiguation of author names, using "trusted linking partners" (TLPs) to link with self-asserted identity systems.  Input of records is very easy; building the author record is key.  Knowing an author's publications is important.  ORCID/DOI pairs will be sent to publishers during the article creation process.

Jan Brase, DataCite

The concept behind DataCite is that data should be citable just like articles: citation gives data higher visibility, eases re-use and verification, rewards collection and documentation (for example, in a citation index), and helps avoid duplication.  To accomplish this mission, DataCite assigns DOI names, which scientists already know how to use, to datasets, thus linking them to the supporting scientific article.  DataCite is a global consortium of local institutions, hosted and managed by TIB, the German National Library of Science & Technology.  Most members of DataCite are libraries because libraries are organizations that scientists trust.  Here are the 3 main goals of DataCite.

DataCite goals

DataCite has registered over 1 million datasets so far and has published a metadata schema for all of its members.  This metadata will shortly be uploaded into the Web of Science and other indexes.  In this way, DataCite supports researchers, data centers, and publishers.
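The citation idea can be illustrated with a short sketch that formats a dataset citation from its metadata.  The field names and the record below are hypothetical, modeled loosely on DataCite's metadata schema (creator, title, publisher, year, DOI); they are assumptions, not the schema itself.

```python
# Illustrative sketch: format a dataset citation from DataCite-style metadata.
# Field names and the record below are hypothetical.
def cite_dataset(meta):
    """Render a simple human-readable citation string for a dataset."""
    return (f"{meta['creator']} ({meta['year']}): {meta['title']}. "
            f"{meta['publisher']}. doi:{meta['doi']}")

record = {
    "creator": "Mustermann, E.",
    "year": 2011,
    "title": "Arctic sea-ice thickness measurements",
    "publisher": "Example Data Centre",
    "doi": "10.1234/example.dataset.1",
}
print(cite_dataset(record))
```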



Inanimate Alice: A Unique Educational Experience

Educational content is being transformed.  Laura Fleming, an educational consultant and librarian, discussed Inanimate Alice, one of the more innovative current projects.  Inanimate Alice is a born-digital, multimedia, and interactive story for young children.  It is interactive because it requires the reader's input to move the story forward.  As the story unfolds, the episodes become more complex, and the level of interactivity increases, holding the child's interest.  The system has several distinctive features: it is multimedia, it can be viewed in several languages, and it is an example of "transmedia" because it can be viewed on any device capable of running Adobe's Flash player.  According to the Alice website:

” ‘Alice’ connects technologies, languages, cultures, generations and curricula within a sweeping narrative accessible by all. As Alice’s journey progresses, new storylines appear elsewhere providing more details and insights, enriching the tale through surprising developments. Students are encouraged to co-create developing episodes of their own, either filling in the gaps or developing new strands…children will grow up with Alice, from class-to-class from year-to-year, engaged with an ever-growing story in which they become part of the narrative.”

A downloadable education pack is available for teachers and educators to accompany the story.

Inanimate Alice is a unique example of how education and e-books will advance in the future.


E-Books: Where Do We Go From Here?

E-books have become widely accepted, but many users will not be satisfied for long with static e-books that simply recreate the print book experience online.  Print reference is still selling, but new technologies are having a significant influence on e-books.  Rolf Janke, Vice President and Publisher for Sage Reference, a division of SAGE Publications, said that librarians have a love-hate relationship with e-books.  Although students expect all the features, librarians are nervous about the associated costs.

Reference used to mean print: static, black-and-white content.  Today it is online (meaning e-books) and on platforms, and e-book aggregators are beginning to appear.  There are some dynamic e-books, but many products are still static; some e-books have color.  Simultaneous usage has completely transformed the market.  Interactivity exists, but it generally consists of videos and podcasts, which does not qualify as "animated."  And the next generation will be heavily mobile.

Reference is going digital.  The basic tools provide a starting point, with interactive features adding value to the user experience.  Adding interactivity is desirable, but it involves costs.  Can publishers assume an ROI?

A survey of librarians revealed some surprising opinions on what is valuable in reference services:  desirable features included cross-searching all content on a single platform, "Did you mean?" spelling corrections, citation builders, and videos.  Features not seen as valuable were saving searches, editing content, linking content to social networks, and animation.  Video is the most prominently used technology in reference sources; it must be built into an article to create value, and it must provide transcripts.  SAGE released its first multimedia product in January and quickly observed that articles with video have been used more than any others in the entire SAGE collection.

Here is SAGE’s view of the e-book market.



Who Are You and What Are You Doing On My Site?–Web Analytics

Web Analytics Panel: (L-R) Melissa Blaney, David Smith (Moderator), Mike Sweet, Mark Johns, Jake Zarnegar

When I saw the title of this panel, I wondered if it should be called “The Big Brother Panel”.  Web analytics provide lots of data on who is visiting your site, and in turn that allows you to develop strategies for understanding your business.

Melissa Blaney, Manager of Platform Analytics and Communications at the American Chemical Society (ACS), led off with a description of some of the common tools that the ACS uses.  These include Atypon's Literatum, which provides COUNTER reports, identity and content reports, and advertising statistics; Google Analytics, which is used to track usage of ACS online journals and community sites; and Omniture SiteCatalyst, which is used for tracking accesses to ACS's weekly news magazine, Chemical and Engineering News.

Analytics are used by many internal stakeholders in an organization:  executives, advertising, sales, web strategy and innovation, marketing, editorial, and sales analysis and support.  The ACS has been providing COUNTER reports to stakeholders since 2002, and among other uses, online usage is one of the factors in determining renewal prices.  Platform enhancements and web development initiatives are also influenced by the data.  Beyond COUNTER, other statistics can measure key performance indicators (KPIs) for the business, such as referrals, searches, geography, unique registrants, etc.  Useful information can be obtained from data on usage in various time periods, such as seasonality.

Here are some useful data that can be obtained from search analytics:

Data on referrals show how people get to the site and where they come from.  Although Google may drive 90% of the traffic to a site, discovery-tool use may be more valuable.  In general, browsing seems to be decreasing, which indicates users are finding information by serendipity less often.  Much web traffic is international, so geographic data are important.  And world events can influence traffic: when the Olympics were in China, there was a big decrease in traffic, and when storms keep people at home, traffic from that area decreases.  Tracking social media is often a challenge.  You must know your audience, define expectations, and document and categorize what works.
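Referral analysis of this kind reduces to tallying referring domains out of the access logs.  A minimal sketch, using invented referrer URLs (real analytics packages do far more, of course):

```python
# Illustrative sketch: tally where site visits come from, given referrer URLs
# pulled from (hypothetical) access logs.
from collections import Counter
from urllib.parse import urlparse

referrers = [
    "https://www.google.com/search?q=benzene+synthesis",
    "https://scholar.google.com/scholar?q=polymer",
    "https://www.google.com/search?q=acs+journal",
    "https://en.wikipedia.org/wiki/Benzene",
    "",  # an empty referrer usually means a direct visit
]

def referral_counts(urls):
    """Count visits per referring domain; empty referrers count as 'direct'."""
    return Counter(urlparse(u).netloc or "direct" for u in urls)

print(referral_counts(referrers).most_common())
```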

Using analytic data, future developments can be planned.  For example, when the ACS wanted to develop products for mobile platforms, the following data were used to prioritize which platforms to develop for first.

In conclusion, it is important to recognize that web analytics are only one source of data.  Others include focus groups, user testing, customer surveys, and direct feedback from sales teams.

Mike Sweet, CEO of Credo Reference, titled his talk "Web Analytics: Pragmatics Rule. It's the People."  Many analytics tools give you a lot of data, but what does it mean?  You need analysts to interpret the data and then figure out how to implement the results.  If you try to focus on all the data that's available, you won't achieve anything.

Credo's primary platform goal is to increase traffic.  They began by choosing a web analytics package that offered a number of techniques (on-site web analytics, usability testing, focus groups, market scanning) but found it not useful, so they switched to Google Analytics.  Choose wisely.  Focusing on collecting lots of data (especially from log files) and generating reports may obscure what is best for the business.  It is better to define what you are trying to do on the site and whether you were able to do it.  From this insight, you can make all the platform improvements you need to make.  Prompting users to say why something isn't helpful is very simple to do and tells you a lot.  Here are Sweet's lessons learned:

  1. Assess where you are on the evolutionary curve.
  2. Choose packages and data mining projects carefully.
  3. Don’t plan to rely solely on on-site web analytics data.  Mix on-site and off-site data to get a complete picture.
  4. Assess your teams’ agility and your platform’s extensibility.  Only gather insights into things you are actually ready to act on.

Conclusions to ponder

  1. Don’t bite off more than you can chew.
  2. Numbers aren't customers.  Take a broader approach to improvements.
  3. The experimenter’s mindset is a key–get started and have fun!

Mark Johns, Manager, Publication Management Group at HighWire Press, noted that robots are widely used to crawl sites and gather usage data.  "Good" robots are good for business, so exposing your websites to them is critical.  The "not so good" robots are overzealous uptime trackers that hit a website so frequently they cause problems, or malicious crawlers.

The web is becoming more personalized and is molding itself to users in real time.  Publishers cannot get direct access to information about users because of the institutional purchasing model.  It is time to start thinking about individual users.  One example of this is the BBC, which lets users rearrange their home page to their liking (except for the ads).  We know a lot about subscribers, so we can target things to them.  But we also have data about anonymous users, such as their IP address, search terms, geographic location, language settings, and content viewed in a session.  That metadata can be used to create a user profile.
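To make this concrete, here is a toy sketch of the kind of profile that can be assembled for an anonymous visitor from session metadata alone.  All field names and values are invented for illustration:

```python
# Illustrative sketch: derive a crude anonymous-user profile from session
# metadata alone. Every value here is invented (192.0.2.x is a
# documentation-only IP range).
session = {
    "ip": "192.0.2.10",
    "language": "de-DE",
    "search_terms": ["crystallography", "x-ray diffraction"],
    "articles_viewed": ["doi:10.1000/xyz1", "doi:10.1000/xyz2"],
}

def anonymous_profile(s):
    """Summarize a session into locale, interests, and reading depth."""
    return {
        "locale": s["language"].split("-")[-1],
        "interests": sorted(set(s["search_terms"])),
        "depth": len(s["articles_viewed"]),  # articles read this session
    }

print(anonymous_profile(session))
```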


Semantic User Profiling
Jake Zarnegar, President, Silverchair

The shift to the institutional subscription has created a widespread problem of anonymous users making up the bulk of a site's users.  In an era when it is possible to track personal topic interests more closely than ever before, many publishers currently know less about their customers than ever before.  Two ways to overcome this problem are to require users to give you information about themselves, or to build a semantic user profile invisibly, based on the user's activity on the site.  This can even be done for anonymous users, but it immediately raises questions about privacy.  Silverchair has developed a privacy statement that it uses with its customers:

Silverchair privacy statement

How to build up profiles:

  • Have your content semantically tagged.  Semantics provide a normalized, logical metadata layer on content.
  • Accumulation: look at user interactions by accumulating the tags of documents they look at.  Build this over time and look for patterns.
  • Construct basic semantic profiles of users.  Parse your raw logs into semantic profiles.  The rules for doing this are proprietary and vary from one organization to another.  (photo of a typical report for an anonymous user)
  • Create affinities and put profiles together.  Affinities can be to topical interest groups, ads, products, events, individuals, etc.  User affinities are constantly updated as the site captures more usage.
  • Use affinities to create personalized profiles for users, create a marketing campaign, or promote products when people come on to the site.
  • Use the resulting semantic profiles to understand your audience as individual information consumers, tackle the anon user problem, and provide more detailed targeting for marketing and advertising efforts.
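The accumulation step in the list above can be sketched briefly: merge the semantic tags of every document a user views into a running profile, then read off the strongest affinities.  The tags and data below are invented, and as noted, real systems' rules are proprietary.

```python
# Illustrative sketch of semantic accumulation: each viewed document's tags
# are merged into the user's running tag counts; the most frequent tags
# become the user's affinities. All tags below are hypothetical.
from collections import Counter

def accumulate(profile, document_tags):
    """Add one viewed document's tags into a user's running tag counts."""
    profile.update(document_tags)
    return profile

def affinities(profile, top_n=2):
    """The user's strongest topical affinities so far."""
    return [tag for tag, _ in profile.most_common(top_n)]

user = Counter()
for doc_tags in (["cardiology", "hypertension"],
                 ["cardiology", "statins"],
                 ["oncology"]):
    accumulate(user, doc_tags)

print(affinities(user, top_n=1))  # cardiology appears most often
```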


Information Overload: Reaching Readers in a State of Distraction

Like most people in this day of information abundance, I suffer from overload. So I thought it would be interesting to hear some solutions to the problem. (Evidently, many other SSP attendees thought the same because the room was packed to overflowing, so I wondered if the session should have been called “Room overload”!)  This was a fascinating and interesting session offering different views on a common problem that besets everyone.

Phil Davis, a Postdoctoral Associate at Cornell University, began his two-part talk by describing the problems he faced in coping with many distractions and interruptions as he attempted to write his Ph.D. dissertation.  He noted that dissertations are not like scholarly articles because they contain things a journal editor does not want to see:  details of experimental research, an extensive bibliography, and suggestions for future research.

Once his research was completed, Davis carved out an hour or two every day to get away to the library and write his dissertation.  But libraries are changing and are encouraging user interactions; quiet study is now the exception, not the rule.  We understand the attention economy well, but most solutions to information overload have focused on the person receiving the information: the reader.  This raises some paradoxical questions:

  • Why are journals still published when authors can reach readers without them?
  • Why has repository publishing failed to become more popular?
  • Why has post-publication review failed?

Many of these problems can be traced to information overload, but as noted author Clay Shirky said, "It's not information overload. It's filter failure."  Here are some symptoms of this:

Scholarly communication is a 2-sided market of authors, who are also readers, with publishers in between them.  Authors know more about the quality of their work than readers do, which has led to the following behavior patterns:

The principal function of journals is to organize and mediate quality signaling in the author-reader market. We need to think of overload as a market problem. Journals are mediators of quality signaling, which gives them many opportunities to alleviate information overload.

Creating and consuming scholarship in the age of information overload
Oliver Goodenough, Faculty Fellow, Berkman Center for Internet and Society, Harvard University

In high school we were taught to be literal readers.  Then at college, we were bombarded with information, experienced overload, and needed to learn to skim and select.  Thus, we were told to behave in a way that is now our problem!  What is this doing to our brains, to scholarship, and to our livelihoods?

What we do and read shapes how we think.  Culture and education work together, making us smarter at some tasks than we would otherwise be, and language allows us to share and store information.  For further information, a good reference is Nicholas Carr's book The Shallows: What the Internet Is Doing to Our Brains.  The internet increases access and storage, puts many traditional economic models at risk, multiplies but shrinks the workspace, and fragments our attention.

Pre-printing practices (memorization, dialog and debate, etc.) eventually gave way to printed books and papers.  Did we gain or lose?  Consider what scholars produce and share.

In educating future scholars, we must help them to understand their field. Wikis are not a substitute for a review. Communal workspaces may be growing as personal ones shrink. It is important to write good articles and share them with your readers.  Publishers can have a role in this:

Users: Let’s Throw Them a Bone, Already
Kristen Fisher Ratan, Director, Strategic Development, HighWire

We have not served users well and have left them to their own devices. Our products are not targeted at them: there is a wall between the user and publisher.  David Shenk’s book, Data Smog: Surviving the Information Glut is a good reference on the symptoms of overload.

A HighWire survey of 45 Stanford researchers uncovered some interesting opinions:

  • Respect the workflow.
  • Productivity is more important than novelty.
  • Produce time-saving tools and information.

Admittedly, this was not a typical user audience.  None of the respondents use mobile phones for their information needs.  Instead, they take their laptops everywhere. To them, communication means email and Skype. Reading means skimming and using e-TOC alerts.  Still, some of their opinions are revealing to a more general audience:

Serendipity is how ideas are generated, and publishers could do better at producing tools to help it. The HighWire users were conservative towards change and did not want to take time to learn how to use new tools.  We need productivity, not novelty, but our websites tend to be like Swiss Army Knives with all the blades out!

Some things we can we do are:

  • Show abstracts in popup windows, saving clicks.
  • Clean up the real estate by providing more intelligent choices.
  • Use scrolling-in-place widgets, as Amazon does.
  • Provide visibility to content not necessarily associated with the article.
  • Don’t muddy search results with a lot of extra information (but have it nearby).
  • Think about new ways of doing things using touch screens or augmented reality.


3 SSP Presidents

At the conference luncheon today, I managed to get this photo of 3 SSP presidents.

(L-R) Terry Van Schaik, incoming president; Lois Smith, outgoing president; Ray Fastiggi, past president



Startups–At the Cutting Edge

One of the best ways to learn what's coming down the proverbial pike is to see what startup companies are doing.  So I was intrigued by a session (which the moderator called "The Startup Beauty Parade") featuring a series of short presentations by 6 startup companies in publishing.  And I was not disappointed: these were not only interesting and forward-looking products, but they also address real problems of today's users.

Startup Panel (L-R): Nathan Watson, Bill Ladd, Bill Park, Niko Gonchanoff, Kathleen Fitzgerald, Jan Reichelt, Dan Pollock (Moderator)

Here are brief summaries of each of the presentations:

Data from the Collective Desktop
Jan Reichelt, Co-Founder, Mendeley

Too Many Documents!

We are drowning in PDF documents we have collected, and when we try to remember what was in the ones we have already read, we can't.  Mendeley helps researchers work smarter.  It works on any platform, and it's free.  It organizes PDF documents, extracts research data, allows highlighting sections of a PDF and adding "sticky notes," and aggregates research data into the cloud.  This makes science more collaborative and sharable.

Using Mendeley, one can set up a social research network, like Facebook, and thus enhance collaboration.  Because all the uploaded articles are collected on a single server, Mendeley can identify the most-read ones and do other statistical analyses, thereby identifying trends.  Tags show popular groups, popular papers, etc.  Clicking on an article brings up a page with metadata, sources, lists of related articles, and readership statistics.
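The readership statistics described here reduce to a simple aggregation once every user's library sits on one server: "most read" is the number of distinct libraries a paper appears in.  A minimal sketch with invented libraries (not Mendeley's actual code):

```python
# Illustrative sketch: with all users' libraries aggregated centrally,
# "most read" is a count of how many libraries hold each paper.
# The users and papers below are invented.
from collections import Counter

libraries = {
    "alice": {"paperA", "paperB"},
    "bob":   {"paperA", "paperC"},
    "carol": {"paperA", "paperB", "paperD"},
}

def readership(libs):
    """Number of distinct users holding each paper."""
    return Counter(paper for lib in libs.values() for paper in lib)

print(readership(libraries).most_common(1))
```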

Opportunities for publishers include deriving insights and analytics, getting traffic and usage statistics on their data, and connecting academia to the “general public”.   People not in academia are using the system to pursue their interests.  Mendeley currently has 950,000 users.

The Reading Revolution–Scribd
Kathleen Fitzgerald

Currently, there is no easy way to share documents on the web, or to search for or share research, which led to the development of Scribd.  Scribd can turn any file type into an HTML page.  It has become the world’s largest reading and sharing website and is integrated with Twitter, Facebook, and other platforms to make it easy to share.  Text of any document can be made searchable by OCR. Connections are made through content interests, not just friends, as in normal social networking platforms.


Scribd Facts

Scribd readers tend to be highly educated: 60% are ages 18-49, and 60% have college or postgrad degrees.  Many publishers are partnering with Scribd.

Scribd provides reading statistics  for any document, which allows companies to develop marketing campaigns.  This allows questions like “Where is your content being shared?” or “Where are documents being embedded?” to be answered.  Authors can revise their documents without losing existing read counts, comments, etc.

Readers can create collections of documents, such as things to read later, etc.  The National Archives created a small collection of documents on Presidents’ mothers that drew over 25,000 readers in a very short time.

Reading is ready for a revolution.  Scribd is re-imagining a reading experience for mobile platforms, ignoring the barriers between different types of content.

Take my content — please!  SureChem: the service-based business model.
Niko Goncharoff

SureChem is the first acquisition by Digital Science, a unit of Macmillan.  It is a chemical patent search service designed for scientists that allows searching for structures embedded in the text of patents.  SureChem has compiled a database of about 12 million structures from 20 million patents and 12 million Medline records.

From this data, 3 products have been developed.

SureChem--3 products from the data

SureChem can be used for intellectual property "landscape analysis," large-scale chemical analysis, comparison of internal and external data, or competitive analysis.  Today's customers need to store external data beside their internal data, manipulate external data behind the firewall, and freely share results throughout the organization.  All of this is much easier if the data are owned by the user.

SureChem customers subscribe to the service, not the content.  They can generate their own content from public sources and mine the data, adding value by showing the text where each structure was found. Advantages of this approach include:

  • Owning the data is a better investment for an organization than renting it.
  • Price increases are tied solely to improvements in functionality.
  • It is easy to add other content sources.
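To make the idea of a structure-to-patent database concrete, here is a minimal sketch of the kind of index the talk describes: chemical structures mined from patent text, keyed so that a search for a structure returns the documents, along with the text where the structure was found.  This is purely illustrative; the class, the SMILES strings, and the patent numbers below are invented, and SureChem’s actual system is far richer.

```python
from collections import defaultdict

class StructureIndex:
    """Toy index mapping a chemical structure to the documents it was mined from."""

    def __init__(self):
        # canonical structure string -> set of (document id, text snippet)
        self._index = defaultdict(set)

    def add(self, structure, doc_id, snippet):
        """Record that `structure` was extracted from `doc_id` at `snippet`."""
        self._index[structure].add((doc_id, snippet))

    def search(self, structure):
        """Return every (document, snippet) pair where the structure was found."""
        return sorted(self._index.get(structure, set()))

# Invented example data: aspirin's SMILES string in two fictitious patents.
index = StructureIndex()
index.add("CC(=O)Oc1ccccc1C(=O)O", "US1234567", "...acetylsalicylic acid...")
index.add("CC(=O)Oc1ccccc1C(=O)O", "EP7654321", "...the compound of claim 1...")
print(index.search("CC(=O)Oc1ccccc1C(=O)O"))
```

Because subscribers generate this index themselves from public sources, the data live behind their own firewall, which is the ownership advantage listed above.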

Bill Park, CEO, DeepDyve

Today’s market landscape is made up of 250 million knowledge workers who are generating 4 billion visits/year to publisher sites.  About half of them do not find what they want and go away, which represents a large missed opportunity for the publishers.  Many users are discovering content they cannot access because publishers charge high prices for single copies of articles.  DeepDyve has tried to solve this problem by creating a rent-an-article model, in which an article costs $0.99 to $4.99, expires after 24 hours, and cannot be printed or downloaded.
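The access rule of the rental model is simple enough to sketch: a rented article stays readable for 24 hours from the time of rental, then expires.  This is not DeepDyve’s code, just a minimal illustration of the policy described above; the function name and dates are invented.

```python
from datetime import datetime, timedelta

RENTAL_PERIOD = timedelta(hours=24)

def rental_active(rented_at, now):
    """True while a 24-hour article rental is still valid."""
    return now - rented_at < RENTAL_PERIOD

# A reader rents an article at 9:00 on June 3...
start = datetime(2011, 6, 3, 9, 0)
print(rental_active(start, now=datetime(2011, 6, 3, 20, 0)))  # same day: still valid
print(rental_active(start, now=datetime(2011, 6, 4, 10, 0)))  # next morning: expired
```

Since nothing is printed or downloaded, enforcement reduces to this one check at view time.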

DeepDyve partners with many publishers in both scientific and humanities areas.  For the last year, most of their effort went into building relationships and acquiring content.  Now they are concentrating on making subscriptions worthwhile.  Customers arrive via Google or via DeepDyve rental links that publishers place on their sites.

DeepDyve thinks of itself as a data and technology company, not a content company.  It operates as a Software as a Service (SaaS) company, where the sale is to the user, not the enterprise.

Bill Ladd, Chief Analytic Officer, Recorded Future

Recorded Future (RF) has built the largest temporal index in the world.  The company discovered analytic demands for external data: what technology areas are changing, who is talking about what, what’s coming next, etc.  Many of the answers are in newspapers or on public websites.  Search engines make it easy to search for specific things, but they operate in a browse paradigm; RF is organizing the internet for analysis.

The web is loaded with temporal signals, but it is impossible to search on “next month”, “last month”, etc.  Similarly, events provide additional structure.

RF’s engine takes unstructured data from the Web, applies natural language processing to it, and organizes the future events that have been discussed online to determine what impact is expected.   An analytic engine processes this content, and historical models are used to find relationships and test predictive models.
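The core difficulty with temporal signals like “next month” is that they are relative: they only become searchable once resolved against the document’s publication date.  The toy resolver below handles just two expression patterns to illustrate the idea; it is an invented sketch, not Recorded Future’s pipeline, which handles a far wider range of language.

```python
import re
from datetime import date

def resolve(text, pub_date):
    """Map simple relative time expressions in `text` to concrete periods,
    anchored to the document's publication date."""
    results = []
    for match in re.finditer(r"\b(next|last)\s+(month|year)\b", text, re.I):
        direction = 1 if match.group(1).lower() == "next" else -1
        if match.group(2).lower() == "month":
            month = pub_date.month + direction
            year = pub_date.year + (month - 1) // 12   # carry across year boundary
            month = (month - 1) % 12 + 1
            results.append((match.group(0), f"{year:04d}-{month:02d}"))
        else:
            results.append((match.group(0), f"{pub_date.year + direction:04d}"))
    return results

# An article published in December 2011:
print(resolve("Earnings are due next month; last year was weak.",
              date(2011, 12, 15)))
```

Once expressions are resolved this way, “what is expected to happen in January 2012?” becomes an ordinary index lookup, which is what makes a temporal index searchable by period rather than by keyword alone.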


Recorded Future's Accomplishments

Nathan Watson, founder and CEO of BioRaft

Information outside of journals was formerly locked in researchers’ heads as personal knowledge.  If you worked for one of those researchers, you could get that knowledge; otherwise you couldn’t.  Information technology allows that information to leave their heads and go into separate applications and systems, so that everything becomes clickable and findable.

BioRaft was developed using publicly available regulatory compliance data to identify hazards to scientists, track what they use, and organize the information for safety.  Regulatory compliance requires companies to provide information on who is in their laboratories, what they work with, their projects, and their equipment.

BioRaft capabilities

These data are then linked to journal articles and other references to research projects.  Users want direct links to journals as well as updates, etc.  Sometimes researchers buy access to journals, but nobody else in the company knows about it or is able to get access.  BioRaft is building an enterprise management site to solve these access hurdles.

Don Hawkins
Columnist, Information Today and Conference Circuit Blog Editor

SSP 2011 Opening Keynote

Lois Smith, SSP President, at the opening session

720 attendees assembled in Boston for the 33rd annual meeting of the Society for Scholarly Publishing (SSP) on June 1-3.  (This is significantly more than last year’s attendance of 595.)

Jon Orwant

The opening session drew an overflow crowd to hear Jon Orwant, Engineering Manager at Google, speak on “Approximating Omniscience.”  He began by observing that we have access to more ideas than ever before, but because our efficiency at finding scholarly information has not kept pace, an era of experimentation is beginning.  Scholarly publishing is an inefficient market of ideas, with an excess of both supply and demand.  Publishers can reduce this inefficiency by packaging ideas into new forms targeted at people who are not currently being reached. The more structured the content, the more cheaply this can be done.

Google wants to digitize the world’s books, and so far it has done about 10% of them.  Twice a week, they count all the books in the world by looking at union catalogs, WorldCat, and other major catalogs, from which they have concluded that there are 129 million books in the world.  Extending this estimate, if you count all the books, scholarly publications, and inventions, there are probably about 200 million objects.  Can they all be visualized?  Orwant showed them as a graph over time from 1600 to date, which yielded a fascinating visualization.

Graph of the world’s literature by subject, according to Google

Other visualization techniques, like a tree map, can be used to examine the corpus in different ways.

In a financial model of reading, readers invest time in books and are paid out, slowly, in ideas.  You can therefore think of a book as a savings bond.  Papers are then like stocks, and journals are like mutual funds.

Information financial model

Comparing scholarship to finance, publishing tells customers what’s good but does not support research directly.  Finance develops new instruments in response to customer demand, but publishing rarely does this.  Can publishing do more to support research?  One experiment is the mutual fund approach:  mix and match articles and chapters to create a book tailored to an individual or classroom.

Google is funding researchers to do interesting types of data mining on their huge sets of data, has received a number of proposals, and has funded 29 of them.  The idea of a “semantic stack” was developed to evaluate the proposals; the research will allow us to move up the stack.  Here are 2 semantic stacks:  one for books and one for videos.

Semantic stack for books

Semantic stack for videos

Orwant has developed a “Books Ngram Viewer” that shows graphs of phrase frequencies in books published in various time periods and can produce fascinating conclusions. For example, here is a comparison of “kindergarten”, “nursery school”, and “child care” in … that shows that “child care” has become much more popular in recent years.

Using this technique, a comparison of “The United States is” and “The United States are” between 1780 and 1900 in American English literature enables one to conclude that the US as a single unified country became a much more popular mindset after the Civil War.
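The computation behind such comparisons is straightforward: an n-gram’s count in a year is divided by the total number of n-grams published that year, and the resulting relative frequencies are compared across phrases.  The sketch below uses invented counts purely to illustrate the calculation; the real Ngram Viewer draws on Google’s digitized corpus, whose raw count files are publicly downloadable.

```python
# Invented yearly counts for two phrases (real data comes from Google's
# Books Ngram exports) and invented corpus totals per year.
counts = {
    "The United States is":  {1840: 20, 1880: 90},
    "The United States are": {1840: 80, 1880: 30},
}
totals = {1840: 1_000_000, 1880: 1_200_000}

def frequency(phrase, year):
    """Relative frequency: phrase count divided by all n-grams that year."""
    return counts[phrase].get(year, 0) / totals[year]

for year in sorted(totals):
    singular = frequency("The United States is", year)
    plural = frequency("The United States are", year)
    print(year, "singular dominates" if singular > plural else "plural dominates")
```

Normalizing by the yearly total matters because the number of books published grows over time; raw counts alone would make nearly every phrase look increasingly popular.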

Another researcher has developed a “Music Ngram Viewer” that shows a time series of a particular melody appearing in a database of sheet music.

This is a new entry into the world of digital humanities and suggests promising areas of research that might be able to tell us how publishing should change.  Here are 2 examples:

  • “Transcoding” (format conversion, now used to create PDFs for print and HTML for the web) might be used to render an article simultaneously into versions for different audiences: academic peers, lay audiences, non-native speakers.
  • Articles and books could be treated as apps, letting readers play with the data.

The problems with this include the high cost of apps (which is expected to decline sharply soon) and getting the rights to the works (which is much harder).

Orwant concluded with this list of 7 experiments he would like to see done.


Don Hawkins
Columnist, Information Today and Conference Circuit Blog Editor


SSP 2011 Opens

The SSP 2011 conference opened last night with a reception in the exhibit hall. Here are some scenes from that event.


As usual, the bar attracted a good crowd of attendees.

Publishing is Evolving

After the conference reception, another one, sponsored by Silverchair, was held across the street in the magnificent Boston Public Library.


Don Hawkins
Columnist, Information Today and Conference Circuit Blog Editor