Delivering Value Through Open Data

Open Data Panel: (L-R) Joy Palmer, Andy Powell, Stephen Dale (Moderator), Jared McGinnis

Jared McGinnis, Research Manager, Semantic Technologies, began this session by describing semantic approaches to news at the Press Association (PA), a wire service for the UK and Ireland. The PA produces 1 million text stories a year, including detailed sports data for every game, and has an archive of 350,000 photos and 50 million video clips. It will be the official news service for the 2012 Olympics. Semantics is at the heart of its strategy; it has between 6 billion and 10 billion RDF triples.

McGinnis listed the challenges in dealing with semantic news.

Another challenge is capturing semantic metadata from journalists without requiring them to enter it manually. This is important to the PA because it reduces costs and allows the content and metadata to be separated, which provides a higher level of abstraction and makes better and more timely services possible.

The PA is not a technology company, but because it covers every topic, its contribution to the data cloud has a major impact. The PA’s reputation is “fast, fair, accurate”, so its products must be unbiased. Using semantic technology in its products increases the sustainability and feasibility of the Semantic Web and enhances the development of standards and the community.

The architecture is not dependent on a single vendor because the content and metadata are separated, and new formats can be integrated flexibly. The PA uses an XML database on a MarkLogic platform. Concept extraction is used to suggest metadata terms to journalists, who then select the appropriate ones for the articles they write. The result is metadata of human-like quality, with 90% accuracy. Here are the advantages of this strategy.

Strategic benefits

Through metadata management, one can capture a relationship between people and locations even though they are not explicitly mentioned in the story, which greatly enhances retrieval and creates navigation and SEO advantages.
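As a rough illustration of how such implicit relationships might be recovered, the sketch below follows links in a small set of triples from a person tagged in a story to locations the story never names. All identifiers, predicates, and data here are hypothetical and are not taken from the PA's system.

```python
# Hypothetical triples (subject, predicate, object); not the PA's actual data.
TRIPLES = {
    ("ex:AlexExample", "ex:playsFor", "ex:ExampleCityFC"),
    ("ex:ExampleCityFC", "ex:basedIn", "ex:ExampleCity"),
    ("ex:ExampleCity", "ex:locatedIn", "ex:ExampleShire"),
}

# Assumed set of predicates whose object is a location.
LOCATION_PREDICATES = {"ex:basedIn", "ex:locatedIn"}


def related_locations(entity, triples, max_hops=3):
    """Walk outgoing links from `entity` and collect every object
    reached through a location predicate within `max_hops` steps."""
    found = set()
    seen = {entity}
    frontier = {entity}
    for _ in range(max_hops):
        next_frontier = set()
        for subject, predicate, obj in triples:
            if subject not in frontier:
                continue
            if predicate in LOCATION_PREDICATES:
                found.add(obj)
            if obj not in seen:
                seen.add(obj)
                next_frontier.add(obj)
        if not next_frontier:
            break
        frontier = next_frontier
    return found


def index_terms(story_tags, triples):
    """Combine the tags a journalist chose with locations inferred from them."""
    terms = set(story_tags)
    for tag in story_tags:
        terms |= related_locations(tag, triples)
    return terms


if __name__ == "__main__":
    # The story is tagged only with the person; the locations are inferred.
    print(sorted(index_terms({"ex:AlexExample"}, TRIPLES)))
```

A search for the inferred location could then retrieve the story even though its text mentions only the person.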

The PA has created a Simple News and Press (SNaP) ontology for the news industry that provides relationships between terms and creates a basis for sharing. It allows mapping between sets of data regardless of originator, provided both sets are created with the same standards. New products are easy to build because they have a shared view of how data is stored.

Andy Powell, Research Program Director, Eduserv, described a project he did for a Resource Discovery Task Force to develop metadata guidelines for libraries, museums, and archives. (The complete report is available online.)

A draft proposal was developed using the Linked Open Data Star Scheme (at 3-, 4-, and 5-star levels) to suggest three approaches: community formats, RDF data, and Linked Data. In response to the 196 comments received on the draft proposal, the guidelines were re-conceptualized, and the 5-star level was adopted, which will provide a rich semantic framework for the metadata and allow easy use of other people’s ontologies.
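For readers unfamiliar with the star scheme, the sketch below shows its cumulative logic: each star requires all the criteria of the stars below it. The criteria follow the widely published 5-star open data scheme; the `Dataset` class, its field names, and the mapping of star levels to the three approaches named above are illustrative assumptions, not part of the guidelines themselves.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    open_licence: bool         # 1 star: on the web under an open licence
    structured: bool           # 2 stars: machine-readable structured data
    open_format: bool          # 3 stars: non-proprietary format
    uses_uris_and_rdf: bool    # 4 stars: things identified by URIs, using RDF
    links_to_other_data: bool  # 5 stars: linked to other people's data


# Approaches from the draft proposal, keyed by star level.
APPROACHES = {3: "community formats", 4: "RDF data", 5: "Linked Data"}


def star_rating(dataset):
    """Count stars cumulatively: stop at the first criterion not met."""
    criteria = [
        dataset.open_licence,
        dataset.structured,
        dataset.open_format,
        dataset.uses_uris_and_rdf,
        dataset.links_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    catalogue = Dataset(True, True, True, True, False)
    stars = star_rating(catalogue)
    print(stars, APPROACHES.get(stars, "below the levels in the proposal"))
```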

The lessons learned in this project were:

Joy Palmer, Senior Manager, Resource Discovery Services, at the University of Manchester, described a vision for a ‘virtuous’ flow of metadata across the web (a metadata ecology for UK education and research). The discovery process is as much about cultural change as technology. The web works, and users behave, in new ways, and the ecosystem is about creating healthy relationships between the various components, thriving on the collaboration and cooperation of stakeholders.

Making data open and reusable means getting the legal issues right. Your data is not open unless it has an explicit open licensing statement. See discovery.ac.uk/principles. You cannot ‘de-risk’ open, and open also means being open to machines.
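One way to make "open to machines" concrete is a check that a metadata record carries an explicit, recognizable open licence statement. The sketch below is only an assumption about how such a check might look: the `licence` field name and the short list of licence URIs are illustrative, and a real service would follow whatever the published principles require.

```python
# Illustrative, incomplete set of open licence URIs (an assumption for this sketch).
OPEN_LICENCE_URIS = {
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "https://opendatacommons.org/licenses/pddl/1-0/",
    "https://creativecommons.org/licenses/by/4.0/",
}


def normalise(uri):
    """Lower-case the URI, prefer https, and ensure a trailing slash."""
    uri = uri.strip().lower()
    if uri.startswith("http://"):
        uri = "https://" + uri[len("http://"):]
    return uri if uri.endswith("/") else uri + "/"


def is_explicitly_open(record):
    """Return True only if the record states a recognised open licence.

    A missing or unrecognised licence counts as not open, because
    openness has to be stated explicitly rather than assumed.
    """
    licence = record.get("licence")  # hypothetical field name
    if not isinstance(licence, str) or not licence.strip():
        return False
    return normalise(licence) in OPEN_LICENCE_URIS


if __name__ == "__main__":
    print(is_explicitly_open({"title": "Catalogue records"}))
    print(is_explicitly_open({
        "title": "Catalogue records",
        "licence": "http://creativecommons.org/publicdomain/zero/1.0",
    }))
```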

The principles of open data have now been developed; during the next year they will be implemented. One of the cornerstones will be the creation of case studies of services and outcomes in libraries, archives, and museums. The case studies will look at terms of use, data characteristics, interfaces, and services and sustainability.

Don Hawkins
Columnist, Information Today and Conference Circuit Blog Editor

 
