Refreshed with croissants, coffee, and strange-tasting tea, part two of our ESRA "bringing researchers and archivists closer together" session began. What follows is based on my notes of the session, so any gross or minor misinterpretations of their work are entirely my fault, a fact I will blame on chairing the session and keeping one eye on the clock and the other on where my fancy 5-minute, 1-minute, and bright red STOP signs were (not that they were needed much).
First up was Kristine Witkowski from ICPSR and the University of Michigan, presenting on Enhancing Data Sharing via "Safe Designs". Social science data carries a significant concern that isn't shared by most of our natural science colleagues – we deal with human subjects, humans who possess moral and legal rights to protection and anonymity. Dr Witkowski's paper set out a model for answering the important question of what level of anonymisation or aggregation is enough. The growing requirement in the United States to produce a Data Management Plan (see NIH, NSF), plus recent efforts to revamp the process for protecting human subjects (ANPRM 7/22/2011), mean a multifaceted approach is required when formulating data for safe and optimal use (Lane 2007). Anonymisation and protection are issues that need to be thought about early in the research design. As a result, data producers must be able to draw effectively upon disclosure research to accurately determine the work required to meet data sharing goals and enhance the value and safe use of social science data, particularly for contextualized microdata. The presentation outlined how they simulate scientific practice using representative series of artificial microdata files to estimate disclosure outcomes, measured by a comprehensive set of risk, utility, and cost elements. Models based on this artificial data and population display re-identification probabilities by generating summary statistics of likely disclosure outcomes. Disclosure outcome is estimated at the personal level (age, sex, race, ethnicity, obesity, household composition, and spousal characteristics), for geography elements (direct identifiers of region, indirect identifiers, or contextual variables), and under intruder scenarios (access by strangers and acquaintances, or by those with unlimited time and money to attempt re-identification).
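To give a flavour of the general approach (and only that – the variables, the uniqueness-based risk measure, and all the numbers below are my own assumptions for illustration, not Dr Witkowski's actual model), here is a minimal sketch of estimating re-identification probabilities by simulating a series of artificial microdata files and summarising the disclosure outcomes:

```python
from collections import Counter
import random

def reidentification_risk(records, quasi_identifiers):
    """Per-record risk as 1/k, where k is the number of records sharing
    the same combination of quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [1.0 / counts[k] for k in keys]

def simulate(n_files=100, n_records=500, seed=0):
    """Generate artificial microdata files and summarise likely disclosure
    outcomes: mean risk and the share of sample-unique records."""
    rng = random.Random(seed)
    mean_risks, unique_shares = [], []
    for _ in range(n_files):
        # Hypothetical quasi-identifiers standing in for the personal-level
        # and geographic elements described in the talk.
        recs = [{"age": rng.randint(18, 90),
                 "sex": rng.choice("MF"),
                 "region": rng.randint(1, 4)} for _ in range(n_records)]
        risks = reidentification_risk(recs, ("age", "sex", "region"))
        mean_risks.append(sum(risks) / len(risks))
        unique_shares.append(sum(r == 1.0 for r in risks) / len(risks))
    return sum(mean_risks) / n_files, sum(unique_shares) / n_files

mean_risk, unique_share = simulate()
print(f"mean risk {mean_risk:.3f}, sample-unique share {unique_share:.3f}")
```

In this toy version, coarsening a variable (say, banding age into decades) drops both summary statistics, which is the kind of trade-off between risk and utility the paper's models quantify far more rigorously.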
GESIS’s Monika Linne introduced DATORIUM. When fully operational, DATORIUM will be a web-based data sharing repository designed as a user-friendly tool for researchers to make their research data accessible and available. The sharing, managing, documentation, and publishing of data, structured metadata (compatible with codebook standards, with four mandatory and various optional properties using controlled vocabularies), and publications will be carried out autonomously by researchers, not by the archive, although uploaded research data and the corresponding documentation will be peer-reviewed and preserved as part of the Data Archive at GESIS. DATORIUM data and the related information will be available to the scholarly community free of charge. The aim of DATORIUM is to expand the types of data held in the GESIS collection, which is currently limited to quantitative data.
Tobias Gebel represented the DSC-BO at Bielefeld University, which deals with qualitative data. Qualitative data has a weaker tradition than its quantitative twin when it comes to data sharing and documentation. The DSC-BO's documentation scheme is a first proposal for qualitative data reuse and preservation, focused on qualitative interview data. The demands of the qualitative research process and of organizational research, the experiences of other qualitative research projects, and the requirements of the DDI documentation standard all shape the design of the scheme. The scheme itself breaks down into three forms of documentation: microdata (individual data elements), paradata (data on the collection process), and metadata (data about data). Tobias pointed out how, as a standard designed for the quantitative research process, DDI misses the specific requirements of qualitative research. In the DDI model there is no direct relationship between microdata and paradata, so adaptations and additions are necessary. Most of DDI can be applied to qualitative data (title, date, funding, PI, abstract), but the unstandardised elements (centring on data collection), as well as the non-linear approach to doing qualitative research, mean individual adaptations are required, limiting the compatibility of qualitative data with other data.
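The three-layer split is easier to see with a concrete (and entirely hypothetical – every field name below is my own invention, not the DSC-BO scheme's) rendering of the documentation for a single interview:

```python
# Hypothetical sketch of the three documentation layers for one
# qualitative interview; the field names are illustrative assumptions.
documentation = {
    "microdata": {                 # individual data elements
        "transcript_file": "interview_01.txt",
        "respondent_role": "HR manager",
    },
    "paradata": {                  # data on the collection process
        "interview_date": "2013-07-15",
        "interview_mode": "face-to-face",
        "recording_length_min": 62,
    },
    "metadata": {                  # data about the data
        "title": "Organizational case study, firm A",
        "pi": "Principal Investigator",
        "abstract": "Semi-structured expert interview on ...",
    },
}
print(sorted(documentation))
```

The point Tobias made maps onto this picture: standard DDI covers most of the metadata layer, but gives no direct way to tie the paradata layer to the microdata it describes.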
Christoph Thewes from the University of Potsdam demonstrated a small plug-in for Stata that more or less automatically creates a data paper. What’s a data paper? It’s essentially an extended metadata description of the data, data set, or data sets. Often there is no direct link between a researcher, a publication, and the underlying data. The data is only some of the material used for creating a publication – not all of it. There is an incentive to create a direct link between research data and researcher, so that finding one means finding the other and credit is provided. While archives and repositories can often provide this service, there’s a problem for small research projects with low reuse potential, projects creating sensitive data, or those involving third-party data that makes archiving and sharing the data problematic. One solution is a data paper, published online and citable, containing simple metadata that assists discovery. The Stata plug-in automatically creates a list of variables, labels, value labels, descriptives, and file names, requiring only manual effort to fill in the title, abstract, keywords, and institutional affiliation metadata fields.
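The idea behind the plug-in (sketched here in Python rather than Stata, with a made-up function and data – this is not Christoph's actual code) is that nearly everything in a data paper can be derived mechanically from the dataset itself, leaving only the title and abstract to write by hand:

```python
# Hypothetical sketch: build a data-paper variable table from a dataset,
# leaving only title/abstract to be filled in manually.
def data_paper(rows, labels, title="<title>", abstract="<abstract>"):
    """rows: list of dicts (one per observation);
    labels: dict mapping variable name -> variable label."""
    lines = [f"# {title}", "", abstract, "", "Variable | Label | N | Distinct"]
    for var, label in labels.items():
        values = [r[var] for r in rows if r.get(var) is not None]
        lines.append(f"{var} | {label} | {len(values)} | {len(set(values))}")
    return "\n".join(lines)

rows = [{"age": 34, "sex": "F"}, {"age": 51, "sex": "M"}, {"age": 34, "sex": "F"}]
labels = {"age": "Age in years", "sex": "Sex of respondent"}
print(data_paper(rows, labels))
```

In Stata itself, commands like `describe` and `codebook` already emit most of this information, which is presumably why generating the bulk of a data paper automatically is feasible there.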