I have spent two days in Berlin at a workshop entitled “Metadata and Persistent Identifiers for Social and Economic Data”. The workshop was organised by the RatSWD, GESIS, ZBW, IDSC, and Nestor, and hosted by the Berlin-Brandenburg Academy of Sciences. What follows are my own thoughts on the issues the workshop raised.
The essential problem the workshop tackled was how to get data cited in a meaningful way – meaningful in a formal citation, not just a ‘Thanks to…’ acknowledgement. Underpinning that was how to ensure meaningful metadata accompanies data and that citations are persistent, not simple URLs subject to the dreaded fate of ‘link rot’. Getting metadata and PIDs right will lead to meaningful data citation.
We all agree, don’t we, that data reuse is a good thing. It’s economical, it’s good science (I’m talking replication standard and verification here). However, one of the strong themes emerging across the presentations was that metadata is vital to data, and that a lack of metadata – especially metadata created by depositors during data collection – is problematic.
Why is this a problem? Well, as was nicely illustrated in Arofan Gregory’s keynote speech, this is a changing world. The world of, as we were in Berlin, das Fahrrad versus das Facebook. Our social interactions are changing. Whereas once you’d have to get on your bike to go and interact with people, these days almost all of us are connected to some form of social network that means instantaneous communication with little regard for geographical obstacles. So what does this mean for research? People are used to Google and wikis. They’re easy, intuitive ways to find information. Maybe not the best information, but the low marginal cost of using these kinds of resources outweighs more deliberate searching in most cases. The case in point being the fate of Encyclopaedia Britannica – comprehensive, authoritative, and now out of print. It’s the same with data. It needs to be discoverable and it needs to be usable, with minimal effort to start using it (and increasingly, it needs to be linked to other resources). If we don’t provide this (we being archives and researchers), then they (users) will go elsewhere. There’s no escaping this new reality. And to provide it, we need metadata and Persistent Identifiers (PIDs).
Over the two days we heard from a variety of projects working with a variety of data objects, but all facing this new challenge.
Now the issues. Stefan Kramer summarised some themes of the workshop (although these interpretations of the themes identified are very much my own, mistakes included).
How much or how little metadata should be required/offered to obtain a PID? Metadata is costly to collect, but without it, isn’t the data worthless anyway? After all, in this new world, if it’s not discoverable and usable, it doesn’t exist.
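To make the trade-off concrete, here is a minimal sketch of one possible gatekeeping rule: a repository refuses to mint a PID until a small set of descriptive fields is present. The required fields loosely mirror DataCite’s mandatory properties, but the field names and the function itself are illustrative – not any archive’s actual API.

```python
# Hypothetical gatekeeping check: no PID without a minimum of metadata.
# The required fields loosely mirror DataCite's mandatory properties
# (the identifier itself is assigned at mint time, so it isn't listed).
REQUIRED_FIELDS = {"creator", "title", "publisher", "publication_year"}

def can_mint_pid(metadata: dict) -> bool:
    """Allow minting only if every required field is present and non-empty."""
    return all(metadata.get(field) for field in REQUIRED_FIELDS)

# A complete record passes; a sparse one is turned away at the door.
complete = {"creator": "Smith, J.", "title": "Household Survey 2011",
            "publisher": "Example Archive", "publication_year": 2011}
sparse = {"title": "Household Survey 2011"}
```

The interesting policy question is where to draw the `REQUIRED_FIELDS` line – too low and the PID points at something undiscoverable; too high and depositors never get past the gate.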
We all agree that PIDs are necessary and great, but at what level of granularity do PIDs get assigned? Do we assign them to single files, or to the collection of files as a whole? And if we are being good scientists and pushing the replication standard, how about PIDs and metadata for syntax/command files and other data-processing files, so users can see what was done to the data to produce published results?
Versioning of PIDs and redirection of users to revised resources.
Another question: what do we do with updated data – data updated to fix an error, or repeated datasets? Should the PID point to the current version, or should each version get a new PID? Matthew Woollard from the UK Data Archive outlined how the UKDA recently adopted PIDs, and how their archive addressed this problem through a major/minor versioning system – with end users only seeing a major revision reflected in the PID. But how resource-intensive is this, and what about issues of provenance? Do researchers need to know, or care?
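The major/minor scheme described above can be sketched in a few lines. This toy registry is purely illustrative (the names and structure are my own invention, not the UKDA’s actual system): the citable PID carries only the major version, while minor fixes bump a hidden revision counter that resolution picks up transparently.

```python
# Toy sketch of major/minor PID versioning: users cite the major-level
# PID; resolution redirects to the latest minor revision.
class PidRegistry:
    def __init__(self):
        self._minor = {}  # (base_pid, major) -> latest minor revision

    def register(self, base_pid: str, major: int) -> str:
        """Mint the public, citable PID for a major version."""
        self._minor[(base_pid, major)] = 0
        return f"{base_pid}-v{major}"

    def add_minor_revision(self, base_pid: str, major: int) -> None:
        # A minor fix (e.g. correcting an error) bumps the hidden counter;
        # the PID that end users see and cite stays the same.
        self._minor[(base_pid, major)] += 1

    def resolve(self, public_pid: str) -> str:
        """Redirect a major-level PID to the current full version."""
        base_pid, major = public_pid.rsplit("-v", 1)
        minor = self._minor[(base_pid, int(major))]
        return f"{base_pid}-v{major}.{minor}"

registry = PidRegistry()
pid = registry.register("study-1", 1)      # citable PID: "study-1-v1"
registry.add_minor_revision("study-1", 1)  # fix an error; PID unchanged
```

The cost Woollard’s question points at is hidden in `add_minor_revision`: someone has to decide, for every change, whether it is minor (same citation) or major (new PID) – and document the provenance either way.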
How meaningful should a PID be to an end user?
For example, should they contain a recognisable text string, such as one identifying a publisher, if anything at all? Basically, does it matter if users understand how a PID is constructed? Or is it enough just to have a PID? Does it bother users? Indeed, what is persistence about? Is it the identifier, the resolution service, or the target?
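One way to see the trade-off is to contrast an “opaque” identifier with a “speaking” one. A hypothetical sketch – the `10.1234` prefix is a placeholder, not a registered prefix, and both functions are my own illustration:

```python
import uuid

def opaque_pid(prefix: str = "10.1234") -> str:
    # No human-readable semantics: nothing to go stale if the
    # publisher renames, merges, or transfers the collection.
    return f"{prefix}/{uuid.uuid4().hex[:8]}"

def speaking_pid(prefix: str, publisher: str, study: str) -> str:
    # Encodes the publisher and study name: readable at a glance,
    # but brittle if either ever changes.
    return f"{prefix}/{publisher}.{study}"
```

The speaking form is friendlier in a reference list; the opaque form pushes all the meaning into the metadata and the resolution service, which is arguably where persistence actually lives.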
What best practice is emerging for where a PID takes you?
How can metadata be captured early in the life-cycle and with what tools?
There is agreement that researchers capture the best metadata during the project. It’s so much harder for archives to add study metadata retrospectively. How then should we educate researchers to consider and value metadata in a meaningful, comprehensible way? Suggestions here were the fear of becoming irrelevant (personally, I don’t think that fear is apparent at the moment), and professional credit and acclaim for, ummm… data citations. Of course the thought occurs that this may bias future research agendas in favour of creating data with high potential reuse value, rather than publications with impact. But that’s for another workshop…
Using PIDs to create a link between researcher data and publications.
Should the PID point to the output, i.e. the article, book, or chapter, and should there then be a link from that output to the data underpinning it? Again it’s the value argument: we modern people don’t like dead ends, we like things to take us places and, when we arrive, to take us somewhere else. Is that the job of PIDs, though? Or are they simply to point to the object, and that is that?
Use of metadata “profiles” (declarations of the elements used) to enable interoperability.
Role of PIDs and metadata in research data management plans.
With DMPs being introduced more and more at the funding stage, how should these plans get researchers to address PIDs and metadata? Again, we all agree earlier is better, but when is early too early?