DOIs and the danger of data “quality”

I’ve just spent a moment looking at guidelines [PDF] from the UK’s Natural Environment Research Council (NERC) on how NERC-funded research can obtain a persistent identifier through the DOI® system.

Just DOI it, just don't DOI it like that.

NERC have a data sharing policy, and fund data centres for sharing and long-term data preservation. Like us here at GESIS, they have an interest in promoting stable persistent identifiers (in both cases Digital Object Identifier (DOI) names) that allow datasets to be cited as one would a publication. All well and all good.

I certainly have no issue with the advice they provide for researchers on obtaining a DOI name. It’s good, clear, and concise. However, I’m going to expand on my reaction to one line in their guidance document. NERC state “by assigning a DOI the [Environmental Data Centre] are giving it a ‘data center stamp of approval’”. Effectively they see a DOI name (or by implication any other form of persistent identifier, such as a Persistent Uniform Resource Locator (PURL)) as a quality check-mark in addition to its role as a reference to an object. Except the DOI system isn’t designed to suggest that the “quality goes in before the name goes on”. Just to remind myself, I quickly looked at the International DOI Foundation handbook and it doesn’t mention data quality. Identification, yes. Resolution, yes. Management, yes. Quality, no.

There is no standardized quality symbol for data themselves. Instead we have informal ones that act as proxies: they are not of direct concern to researchers themselves, but they do closely correlate with the contestable idea of “quality”. But remember, they remain proxies, not the variable of interest. For example, just because a data set is available from a social science data archive doesn’t mean it is any good. It means the archive thinks people will use it (or that we are contractually obliged to take it), that it can be understood and isn’t just a set of numbers, that it doesn’t violate data protection laws or intellectual property rights, and that taking the data into our collection doesn’t break our will or budget. So, if you order data from a data archive it is preserved and contextualized and probably good quality data, but it need not be good quality. Indeed, I suspect most archives have a data set or two that somehow ended up accepted into the collection as the result of an impenetrable act of madness or despair. Likewise, receiving a DOI name might be a stamp of approval if minted by a NERC data center, but other assigners might not be so fussed about the quality of what’s getting DOIed. As this blog post reminds us, anything can be given a DOI, multiple times.

Now, archives are working towards establishing their own stamps of approval for digital preservation and archiving. The Data Seal of Approval, the nestor Seal for Trustworthy Digital Archives, and the ISO 16363 standard are recognized levels. Yet these are explicit symbols of quality in digital preservation, showing that an archive knows what to do, knows how to do it, and is doing it. Thus they indicate the quality of curation, not the quality of the data being curated. The best preserved and contextualized data set in the world could still be junk.

So should we as a community be moving towards quality symbols for data themselves? The risk of starting down that route is encountering a host of problems defining contestable notions of “quality”. Digital preservation is, after all, measurable in the sense that something is either both preserved and accessible or it isn’t. However, research data is subject to all kinds of challenges as to its quality, even to the point of dismissing entire research approaches. I have no problem with NERC specifying their own concept of quality (which they effectively do); my objection is only to the use of DOI names as a tool to signify this. To that effect, we shouldn’t start using tools designed for one end to serve another.

About CESSDA Training

CESSDA Training offers and coordinates training activities for CESSDA, the Consortium of European Social Science Data Archives (http://www.cessda.net/). Hosted by the GESIS - Leibniz Institute for Social Sciences, our center promotes awareness throughout the research lifecycle of good research data management practice and emphasizes the importance of long-term data curation.

One Response to DOIs and the danger of data “quality”

  1. Ah, quality – it’s a tricky one!

    I was part of the team who wrote that NERC guidance about DOIs and quality, and yes, we are using DOIs in a way that’s not part of the original specification. In our defense though, given that we have many thousands of datasets in our archives, and we’re not going to be giving DOIs to everything, it makes sense to give DOIs to the datasets that are complete, and have a more-than-sufficient amount of metadata associated with them, so that they will be usable in the long term (our definition of “quality”). We also want to encourage dataset authors to take that bit of extra time to give us that bit of extra metadata, and if we’re dangling a DOI in front of them as a carrot and it works, so be it.

    In the academic publishing world, an indicator of the quality of the paper is inferred from who the authors and the publisher are. In the data publishing world (at least for NERC) we’re using DOIs as an extra indicator. Time will tell how DOIs for data really pan out.

    The only way we can really tell the quality of a dataset is through its use, reuse, and user comments and citations over the long term. DOIs and data citation make this a lot easier.
