CMU Comment on Draft NIH Data Management and Sharing Policy

Carnegie Mellon University Libraries Comment on NIH's DRAFT Data Management and Sharing Policy and Supplemental DRAFT Guidance

On behalf of the Carnegie Mellon University (CMU) research community, the University Libraries has collated feedback from researchers and institutional leadership in response to NIH's Draft Data Management and Sharing Policy and Supplemental Draft Guidance. This response is based on CMU's data sharing practices, our experience and institutional support in the data sharing arena, and specific feedback from those in receipt of NIH funding.

We applaud the NIH for taking this important step in supporting data management and sharing, and we encourage the organization to follow through on implementation by providing clear guidelines for researchers and appropriate enforcement of the policy. As an academic institution already supporting the future of scientific research that is interdisciplinary, collaborative, reproducible, and reusable, we are excited to have the opportunity to comment on the draft NIH Policy for Data Management and Sharing and Supplemental Draft Guidance.

By way of introduction, our principal feedback is that:

     i. We encourage public dissemination of Data Management Plans.
     ii. We recommend a more generous definition of scientific data that reflects the expansion of the scholarly record to include laboratory notebooks, code, protocols, and other research outputs.
     iii. We would welcome clarification on how the broader usefulness of scientific data is to be determined.
     iv. We would welcome further information on mechanisms that might be used to encourage and monitor compliance with the final policy and supplementary guidance.
     v. We encourage the earliest possible implementation of the final policy; we note that institutions, research libraries, and data management professionals have been building appropriate infrastructures for some time.

The new policy appears to support data management plans (hereafter DMPs) as living documents through a compliance period factoring in plan updates, which is an important step in encouraging researchers to regularly engage with their DMPs and ensure their research is following the protocol identified in the plans. As the draft policy states DMPs may be made publicly available, we believe this is a good practice in supporting compliance, education on DMP development, and facilitating broader best practices for a culture of open science. 

As a research-intensive academic institution, CMU has identified several areas of opportunity in the draft policy, which the NIH may wish to consider when implementing the final version of the Policy for Data Management and Sharing. These areas are organized into five thematic sections, in the order of (1) definitions of scientific data and DMP guidelines, (2) data sharing, (3) costs, (4) compliance and enforcement, and (5) effective dates. 

(1) Definitions of scientific data and DMP guidelines. The policy's scientific data definition notes laboratory notebooks are not considered data and do not need to be digitized. As an institution, we consider research products including laboratory notebooks to be valuable even if they are not considered data in this context, as they support reproducibility and reusability when included alongside data. While we understand the NIH does not consider these to be scientific data, we believe the NIH should encourage researchers to share relevant documentation along with data when possible. More broadly, we consider code, analysis environments, protocols, metadata schema, stimuli, analysis pipelines, and other documentation to be essential accompaniments to data, and implore the NIH to include language in the final policy encouraging local institutions to make these ancillary research outputs part of the scholarly record, available alongside the data.

The current policy document states the researcher should limit their DMP to two pages. However, we encourage the NIH not to enforce a page limit, as projects will require differing levels of information depending on the type of data and the field of research.

(2) Data sharing. Within this draft policy, NIH encourages shared scientific data to be made available as long as it is deemed useful to the research community or the public. We would welcome clarification on how the broader usefulness of data is to be determined. At CMU, we encourage our researchers to err on the side of sharing data, as we cannot predict all the future scenarios in which our data will be useful. As an institution in which a large proportion of our scholarly excellence and innovations are rooted in secondary reuse (computation, re-analysis, modeling) of scientific data, we deem it incredibly important to produce datasets not only for dissemination, but also for reuse within and outside of our own research communities. We encourage the NIH to include language in the final policy document encouraging researchers to de-identify and share data when possible, and to include language that clarifies budget allowance for related costs (further discussed in section 3). We also suggest encouraging researchers to share intermediate data when it is needed to ensure reproducibility of the funded project. We recommend that data be shared within 12 months of the project end date. In Supplemental DRAFT Guidance: Elements of a NIH Data Management and Sharing Plan (Plan), more information on what constitutes 'findable' and 'trackable' would be helpful for researchers, as would a statement on ethical data use and governance. We would encourage the NIH and funded researchers to consider the FAIR (findable, accessible, interoperable, and reusable) principles, which allow for broad reuse and aggregation of data outside of the original discipline, including making de-identified data discoverable, machine-readable, and combinable. On a related theme, we note the relationship between research data and the tools and software used in their generation. We encourage the NIH to consider making recommendations around standardization and/or curation and emulation of specialist software that may be required to make full use of data shared under this policy.

(3) Costs. As one of many academic institutions with an institutional repository, we are unclear on the cost structure and allowable costs surrounding the use of these repositories. In reference to the Supplemental DRAFT Guidance: Allowable Costs for Data Management and Sharing, would large data storage in CMU's institutional repository, KiltHub, which is hosted on the Figshare platform, be considered an allowable repository cost or could this be considered institutional infrastructure that should be covered by overhead? KiltHub allows storage of up to 1TB per project free of charge to CMU researchers, but additional storage needs require cost-sharing. Could our researchers include these additional costs within their funding proposal? Would this support be subtracted from research funding, or would this be considered separate from and therefore in addition to research funds? Similarly, we also would welcome clarification on the kinds of curation and de-identification services researchers can include within their budgets, including hiring a third-party curation service and/or using their institutional library's curation services. We also suggest providing language on additional costs the researcher(s) should consider in cases of data reuse. Will tools needed to run the data be usable or accessible in 10 years? Cost considerations for software migration, software preservation, etc. should be highlighted to the researcher(s) and encouraged for inclusion in the DMPs. In general, we encourage the NIH to include more detailed information on data archiving and allowable costs for the researcher. 

We have witnessed a general trend of steadily declining costs of storage. It is therefore reasonable to expect that, in the long run, data could be preserved in perpetuity. We encourage the NIH to determine appropriate responsibility for payment of long-term stewardship. Our recommendation is to focus on institutional stewardship of data for a fixed period (10 years), after which a review process determines whether data are dark archived or discarded.

(4) Compliance and enforcement. Regarding compliance and enforcement, we would like to see more information on concrete, trackable metrics that could be placed in the policy to encourage compliance, such as supplying a citation with a DOI or permanent URL for all datasets produced in grant reports. We are also unclear on how non-compliance will affect future funding decisions for the institution, including what constitutes non-compliance and which stakeholders will track compliance. As it currently stands, the policy seems to suggest an audit risk to the institution at large if researchers are not compliant with their plans. Clearer information on non-compliance in the final policy would be useful to both researchers and their host institutions; for example, would changing the metadata schema used for the data from what is proposed in the DMP be considered non-compliance, or does this refer to larger failures such as not appropriately sharing the required data? We also encourage the NIH to clarify how compliance with the DMPs and overall policy will be enforced. In support of discoverability, we suggest the NIH implement a discovery layer across trusted/established repositories in which stakeholders can efficiently verify the location of shared data; this would also require the organization to encourage researchers to use appropriate metadata within their datasets to increase discoverability.

(5) Effective dates. Regarding Section IV (Effective Date(s)), we encourage the earliest possible implementation of the final data management and sharing policy. We believe the scientific community has had ample time to prepare for these data management and sharing mandates (given the 2013 OSTP data sharing memorandum), and institutions, research libraries, and data management professionals have been building appropriate infrastructures and policies to support these coming mandates.

Carnegie Mellon University welcomes the dissemination of the NIH's DRAFT Data Management and Sharing Policy and Supplemental DRAFT Guidance, and we look forward to the publication of the final policy and supplemental guidance in due course. Please do not hesitate to contact us should you have any questions or require clarification on any points made in this response.

Yours sincerely,
Keith Webster
Dean of University Libraries & Director of Emerging and Integrative Media Initiatives
Carnegie Mellon University