This session, entitled “Linking It All Together: Discovering the Benefits of Connecting Data,” explored linking large data sets and why one would want to do that, and included reports on two interesting case studies: one on the British National Bibliography, and the other on how the BBC uses large data sets in its coverage of sports as well as its preparations for the 2012 Olympics to be held in London next summer.
Richard Wallis, Technology Evangelist at Talis, discussed why you would want to link data. He noted that large datasets are becoming more prominent; for example, the Library of Congress has 147 million assets, which occupy 147 terabytes of storage. In 2011, 1.8 zettabytes (a zettabyte is 1 billion terabytes) of data will be created, and this is expected to increase to 7.9 zettabytes by 2015. This is equivalent to 1.8 million Libraries of Congress. In such a mass of data, the problem is how to find something, but that is not usually the ultimate goal. This is not a human-scale challenge! Machines must get involved, and we must make it easy for them to do so. We need to identify the “things” we are referencing. Linked data will help us do that because it builds on semantic web standards and is about identifying and linking “things”.
To identify something, we must put a label on it and categorize it. Things are identified with Uniform Resource Identifiers (URIs), which look like web addresses. Things have attributes that can be linked together so a human can understand them, as in this example describing a spacecraft.
The URIs are shown at the left of the photo. Statements about them are expressed in the Resource Description Framework (RDF), one of the standards on which linked data is built. RDF is expressed as “triples”: a Thing has Properties, each of which has a Value. Regardless of where a Thing is found, the same identifier is used to identify it, which allows data from different sources to be easily linked.
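The triple model can be sketched in a few lines of Python. This is only an illustration of the idea, not anything shown in the presentation: the example.org URIs, the property names, and the `merge` and `describe` helpers are all hypothetical. The point is that two sources that use the same URI for a Thing can be combined without any matching step.

```python
# All URIs and property names below are hypothetical, for illustration only.
SPACECRAFT = "http://example.org/id/spacecraft/1"
AGENCY = "http://example.org/id/agency/1"
NAME = "http://example.org/prop/name"
LAUNCH_YEAR = "http://example.org/prop/launchYear"
OPERATOR = "http://example.org/prop/operator"

# Each triple is (thing, property, value).
source_a = [
    (SPACECRAFT, NAME, "Example Probe"),
    (SPACECRAFT, LAUNCH_YEAR, "1969"),
]
source_b = [
    (SPACECRAFT, OPERATOR, AGENCY),  # the value is itself a URI, i.e. a link
    (AGENCY, NAME, "Example Agency"),
]


def merge(*sources):
    """Combine triples from several sources, dropping exact duplicates."""
    merged = set()
    for triples in sources:
        merged.update(triples)
    return merged


def describe(triples, thing):
    """Return every (property, value) pair recorded for one URI, sorted."""
    return sorted((prop, value) for subj, prop, value in triples if subj == thing)


graph = merge(source_a, source_b)
description = describe(graph, SPACECRAFT)
```

Because both sources identify the spacecraft with the same URI, `describe` returns properties contributed by each of them, and the operator value can be followed to reach the triples about the agency.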
These are the general principles for using linked data.
Linked data facilitates sharing and is easy to use because it is built on web standards. The barrier to using other people’s data is low. It is important to understand that not all linked data is open, because some of it is used inside enterprises, and not all open data is linked. Linking open data liberates its value and helps others discover it. A linked open data community has grown up, and many organizations are using it.
All of Wallis’s slides from this presentation are available online.
Neil Wilson, Head of Metadata Services at the British Library, described how the British Library is creating the linked open British National Bibliography (BNB). He noted that McKinsey has predicted that the value of open public data could be as much as €250 billion. Libraries, a source of trusted information, will find many benefits from linking their data and making it available:
The British Library is meeting the challenge as part of its 2020 vision:
So far, it has signed agreements with over 450 organizations in 71 countries to cooperate in offering free data services and has produced and supplied three 15-million XML datasets under a Creative Commons License.
As part of its linked open data initiative, the Library has produced an open version of the BNB, which is a description of UK published output. The reasons for this project were to:
- Advance the discussion of linked data from theory to practice by releasing a critical mass of data,
- Show commitment by using a core dataset, and
- Create a service that others could build on.
The data were released under a CC0 license (the least restrictive) and hosted on a platform developed by Talis. Using existing tools provided staff and organizational development opportunities. Mentoring and training were done by Talis staff. The project involved matching and generating links, then embedding them into the metadata. MARC records were converted to RDF/XML using a series of automated steps. This resulted in a dataset of 250K records with 80M unique RDF triples.
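To give a feel for what one automated conversion step might look like, here is a minimal sketch that maps a few MARC fields to triples. It is not the British Library's pipeline or data model: the field-to-property mapping, the example.org base URI, and the sample record are assumptions made for illustration, and the output is written as N-Triples lines rather than RDF/XML to keep the example short. The MARC tags used (245 for title, 100 for main author, 020 for ISBN) and the Dublin Core terms namespace are standard.

```python
DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/resource/"  # hypothetical base URI for records

# Assumed mapping from (MARC tag, subfield code) to an RDF property.
FIELD_MAP = {
    ("245", "a"): DCTERMS + "title",
    ("100", "a"): DCTERMS + "creator",
    ("020", "a"): DCTERMS + "identifier",
}


def record_to_triples(record):
    """Turn one simplified MARC-like record into (subject, property, value) triples."""
    subject = BASE + record["id"]
    triples = []
    for tag, code, value in record["fields"]:
        prop = FIELD_MAP.get((tag, code))
        if prop is not None:  # fields without a mapping are skipped
            triples.append((subject, prop, value))
    return triples


def to_ntriples(triples):
    """Serialize triples with plain literal values as N-Triples lines."""
    lines = []
    for subj, prop, value in triples:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subj}> <{prop}> "{escaped}" .')
    return lines


# Made-up sample record; real MARC records carry far more structure.
sample = {
    "id": "rec001",
    "fields": [
        ("245", "a", "An Example Title"),
        ("100", "a", "Author, Example"),
        ("020", "a", "0000000000"),
        ("650", "a", "Unmapped subject heading"),
    ],
}

triples = record_to_triples(sample)
lines = to_ntriples(triples)
```

A real conversion would also replace literal values such as author names with URIs wherever a match can be found, which is the "matching and generating links" step mentioned above.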
This project showed that legacy data was not designed for this process, so care had to be taken with data modeling and sustainability. They also found that there are often tools or expertise readily available, and the effort to find them pays off and prevents reinventing the wheel. In all such projects, hidden issues will surface, but it is better to release the results early on, even if they are imperfect, and improve them as time proceeds. The learning curve can be steep; using pre-existing tools will save development time and assist in data evaluation. The effort expended to produce the BNB has resulted in significant benefits:
Based on the results of the initial BNB project, further material will be released, and the data model will be revised. New sources to link to will be identified, and monthly updates will occur.
In closing, Wilson urged anyone contemplating a similar journey to do it. Even though mistakes will be made, the lessons learned will benefit everyone.
James Howard, Executive Product Manager at the BBC, finished the session with a presentation on “Preparing for the Olympics and Beyond: Metadata, Tagging, and Lots of Sport”. The BBC Sport site is now 11 years old, and many changes have occurred. Over time, approximately 320 manually managed pages had been created, but staff resources had not increased. New sports teams and organizations had emerged, and because of resource limitations, information on them could not be effectively integrated.
Beginning with the 2010 Winter Olympics, an aggregated index for each of 15 top-level sports was created. For the 2010 South Africa World Cup, a page for every team, group, and player was created, as well as an “event ontology”. The BBC data was linked with external suppliers’ data, but this process incurred too much manual overhead. FIFA, the organization managing the World Cup, has an identifier for every team and player. Joining these data together will cut costs and maximize publication. Content repositories are separated from the ontology to enrich user experiences. Journalists were asked to tag specific areas; the tags were used to populate the indexes. The challenge was to get enough relevance from data that someone else has tagged. Howard said that it was important not to let the developers near the tagging tools.
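The two mechanisms described here, joining in-house data to a supplier's data on a shared identifier and populating index pages from journalists' tags, can be sketched as follows. This is an illustrative sketch only, not the BBC's system: the identifiers, field names, sample data, and the `join_on_id` and `build_index` functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical editorial data, keyed by a shared team identifier.
editorial_teams = {
    "T001": {"name": "Example United", "page": "/sport/teams/example-united"},
}

# Hypothetical supplier feed that uses the same identifiers.
supplier_rows = [
    {"team_id": "T001", "played": 3, "won": 2},
    {"team_id": "T999", "played": 3, "won": 0},  # no editorial record yet
]


def join_on_id(editorial, rows):
    """Merge supplier rows into editorial records that share an identifier."""
    joined, unmatched = {}, []
    for row in rows:
        team = editorial.get(row["team_id"])
        if team is None:
            unmatched.append(row["team_id"])  # left for manual follow-up
            continue
        stats = {key: value for key, value in row.items() if key != "team_id"}
        joined[row["team_id"]] = {**team, **stats}
    return joined, unmatched


def build_index(articles):
    """Group article headlines by tag so each tag can drive an index page."""
    index = defaultdict(list)
    for article in articles:
        for tag in article["tags"]:
            index[tag].append(article["headline"])
    return dict(index)


articles = [
    {"headline": "Example United win opener", "tags": ["T001", "football"]},
    {"headline": "Group stage preview", "tags": ["football"]},
]

joined, unmatched = join_on_id(editorial_teams, supplier_rows)
index = build_index(articles)
```

With a shared identifier, the join needs no name matching, so only the rows with no editorial counterpart require manual attention.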
Here are some of the questions that must be answered in designing such a project:
- What do you need to drive your product or domain?
- What can you use from other people?
- What do you need to keep hold of?
- How do you use the data?
- How do we contribute to the datasets?
For the London 2012 Olympics, the goal is “to show every piece of live action”. There will be 24 concurrent live streams, with 5,000 hours of live video over 16 days. The International Olympic Committee has defined the names of events and has assigned the venues. They will supply the names of 8,000 to 15,000 athletes as they are determined. These data will be joined with event results, and every event will have a page. The BBC is working with its suppliers to make sure they supply the data as cleanly and as organized as possible.
Columnist, Information Today and Conference Circuit Blog Editor