Data Papers in the Network Era

MacKenzie Smith

MacKenzie Smith, Director of Research at the MIT Libraries, made a good case for the creation and publication of “data papers”: formal publications whose primary purpose is to expose and describe data, as opposed to analyzing it and drawing conclusions from it. Data sharing is important because:

  • Many institutions and granting organizations, such as NIH and NSF, now require researchers to share their data, and they require a “data management plan” as part of every grant application.
  • The underlying desire of researchers is to make their results reproducible. Most of them do not object to sharing their data, but doing so is difficult and labor-intensive. Researchers’ concerns include losing control over their data, confidentiality, privacy, and intellectual property rights. Lack of credit for sharing data and lack of an infrastructure stop researchers from investing the effort to share their data.

In this context, “Data” means the scientific data underlying research. One key property of this type of data is that it would be prohibitively expensive or difficult to reproduce, as with time-based sampling for example. The data can be very costly to collect in the first place. Data can be in many forms, and may exist in a proprietary format from a specific instrument. It cannot be neatly packaged like a book, and the distinction between data and software is becoming quite blurred.

Interpreting the data must be part of the current research workflow. Reusable data is structured, versioned, and documented; formatted for long-term access; archived; findable and citable; and either legally unrestricted or covered by a clear usage policy. A “data paper” is one way to help meet these requirements. Data papers are like regular journal articles, but they describe the data itself. Recent forms of data papers support downloads from the web. NISO is developing a standard for supplemental data in journal articles.
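The criteria for reusable data listed above can be read as a checklist. The following is a minimal sketch in Python, not a published standard: the `DatasetDescription` class, its field names, and the `missing_criteria` helper are all hypothetical, and the example dataset is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetDescription:
    """Hypothetical description of a dataset; field names are illustrative only."""
    title: str
    is_structured: bool = False
    version: Optional[str] = None               # e.g. "1.0"
    documentation: Optional[str] = None         # README, codebook, or data paper
    preservation_format: Optional[str] = None   # format suited to long-term access
    archive_location: Optional[str] = None      # repository holding the data
    identifier: Optional[str] = None            # persistent identifier used for citation
    usage_policy: Optional[str] = None          # "unrestricted" or a named policy


def missing_criteria(d: DatasetDescription) -> List[str]:
    """Return the reusability criteria this description does not yet satisfy."""
    checks = [
        ("structured", d.is_structured),
        ("versioned", bool(d.version)),
        ("documented", bool(d.documentation)),
        ("formatted for long-term access", bool(d.preservation_format)),
        ("archived", bool(d.archive_location)),
        ("findable and citable", bool(d.identifier)),
        ("usage policy stated", bool(d.usage_policy)),
    ]
    return [name for name, ok in checks if not ok]


# Invented example: a dataset that is organized and documented but not yet archived.
example = DatasetDescription(
    title="Hourly river temperature samples",
    is_structured=True,
    version="1.0",
    documentation="README describing columns and units",
)
print(missing_criteria(example))
```

For this example the helper reports the four criteria still unmet: long-term format, archiving, a citable identifier, and a usage policy. A data paper would be one place to record the values that fill these fields.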

A data publishing infrastructure must be web-based, so achieving interoperability means looking at linked data. The web requires identifiers, called URIs, but data papers need more types of identifiers, some of which have been proposed by ORCID, I2 (Institutional Identifiers), DataCite, and CrossRef. We need identifiers for people, institutions, and datasets and their subsets.
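To illustrate why separate identifiers for people, institutions, datasets, and subsets matter, here is a small sketch of linked data as subject-predicate-object triples in plain Python. All `example.org` URIs are placeholders, not real identifiers from ORCID, DataCite, or CrossRef; the predicates borrow names from the Dublin Core terms and schema.org vocabularies; the helper functions `objects` and `creators` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all URIs here

DCT = "http://purl.org/dc/terms/"   # Dublin Core terms namespace
SCHEMA = "https://schema.org/"      # schema.org vocabulary

# Placeholder identifiers (assumptions for illustration only).
AUTHOR = "https://example.org/person/jane-researcher"
INSTITUTION = "https://example.org/org/example-university"
DATASET = "https://example.org/dataset/river-temps"
SUBSET = "https://example.org/dataset/river-temps/2011"

TRIPLES: List[Triple] = [
    (DATASET, DCT + "creator", AUTHOR),
    (SUBSET, DCT + "isPartOf", DATASET),
    (AUTHOR, SCHEMA + "affiliation", INSTITUTION),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """All objects linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def creators(triples: List[Triple], resource: str,
             seen: Optional[Set[str]] = None) -> List[str]:
    """Creators of a resource, including those inherited from parent datasets."""
    seen = set() if seen is None else seen
    if resource in seen:          # guard against cycles in isPartOf links
        return []
    seen.add(resource)
    found = objects(triples, resource, DCT + "creator")
    for parent in objects(triples, resource, DCT + "isPartOf"):
        found.extend(creators(triples, parent, seen))
    return found


# Citing only the 2011 subset still leads back to the person who made the data.
for person in creators(TRIPLES, SUBSET):
    print(person, objects(TRIPLES, person, SCHEMA + "affiliation"))
```

Because each entity has its own URI, a citation of a subset can be traced to the full dataset, its creator, and the creator's institution without any of that information being repeated in the subset's own record.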

Another aspect of the data publishing infrastructure is visualization. Data browsers will be the key to the success of data papers; Web browsers are not able to support linked data. The Exhibit browser, developed at MIT, is one example of a data browser. All data can be converted into linked data and viewed in a data browser. Ontologies are necessary but are not always available. We need a registry of ontologies or schemas.

Who will do all this work to allow formal data publication on the web? Many players are involved; researchers are at the center as they always are, and their role will not change. They will need to be tapped for many of the peer review and validation functions.

Players in data publication

Publishers (and societies) can produce data journals and acquire data deposits to support the data papers. They can organize peer review and quality control as they always have, recognizing that data has a very different intellectual framework than an article (it cannot be copyrighted, for example). New mechanisms for a sustainable business model will therefore need to be developed.

Libraries are exploring data curation and ontology creation. These are excellent roles for libraries. Some organizations require libraries to sign off on any grant application to confirm that it has a good data management plan.

We need technology companies to provide the tools for managing the data and developing uses for it.

