CLIR Postdoctoral Fellowship in Energy Social Science Data Curation Series Part IV: Understanding Needs of Data Support Among the CMU Energy Research Community

Introduction
This is the last in a four-part series about my energy social science data curation work at the Carnegie Mellon University Libraries (University Libraries hereafter) and the Wilton E. Scott Institute for Energy Innovation (Scott Institute hereafter). In this final post, I’ll talk about the preliminary findings of our data management survey and interviews with members of the CMU energy research community regarding how the University Libraries might improve data support. If you don’t have time to read the full post, five takeaways are listed in the Summary at the end.

Rationale for the project
To better support the CMU research community’s data management activities, the Libraries want to understand the community’s current needs. Because I had already built relationships within the energy research community, I had a good opportunity to learn about this community’s current data management practices and needs.

Combining survey and interview
I created a survey in collaboration with personnel of the Scott Institute and received helpful feedback from several data management experts and liaison librarians at the Libraries. Some survey questions were adapted from Federer et al. (2016), and some interview questions were adapted from Cooper et al. (2019). The survey was then distributed among the Scott Institute faculty affiliates. Unfortunately, this was during the first year of the COVID pandemic, and the number of responses was very low. I then sent the survey link individually to researchers with whom I was working and also asked them to participate in a 20-to-30-minute interview with me. When this blog post was written, I had collected eight survey responses and conducted five interviews. Although this is a small and limited sample (it skews towards PhD students), it still offers several insights worth sharing.

Brief overview of current data management practices
All participants are in the College of Engineering and on average spend 92% of their research time on energy-related projects. Most of the participants took data management courses as part of their required coursework, where they used programming languages (mostly R and Python) to process and analyze research data. Based on the interviews, I found that these engineers by training tend to be excellent independent problem solvers and self-learners. Some mentioned specific data management challenges (e.g., coming across a hard-to-read data dictionary or not knowing how to implement an ArcGIS function in Python), but most of the time they were able to work out a solution themselves after spending some time on the problem.

In terms of data sources, most participants rely heavily on publicly available data, and some rely on data provided by utility companies through either purchase or private agreement. Most participants download publicly available data directly from primary sources (e.g., governments and non-government organizations) and rarely use open energy data aggregators or hubs (e.g., Open Energy Modeling Initiative). The energy-related databases listed on the University Libraries’ website (e.g., GlobalData Power and OECD iLibrary) are also rarely used.

Opportunities for data support and data education
In the survey, we asked several questions about how relevant specific data management procedures are to participants’ research and how much expertise participants believe they have in each. Of the survey participants, 86% think cleaning data is highly relevant to their research, and the same share (86%) believe they have a medium or high level of expertise in data cleaning. A similar story can be told regarding data visualization.

The responses to other questions on data management procedures reveal opportunities for data support and education. For some of these procedures, perceived relevance is higher than perceived expertise; for others, perceived relevance or perceived expertise is low. The Libraries may want to focus on the following areas for data support or data education.

First, 71% of the survey participants think creating metadata is highly relevant to their research, but only 14% think they have a high or very high level of expertise in metadata (the rest report low to medium expertise). The interviews reveal that the participants consider metadata critical to understanding a new dataset and to sharing data with others. Several interviewees said writing comments in code files is a common practice; one interviewee said she had created a data dictionary herself in order to share her data with others. But understanding the more technical side of metadata (e.g., metadata standards or schemas) does not seem to be a required skill. Thus, when the Libraries provide data services or education for this community, they should continue to promote the good practice of writing neatly structured code files with easily understandable comments.
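To make the idea of a shareable data dictionary concrete, here is a minimal Python sketch of one way to draft such a file. It is not the interviewee’s method. The sample data, the column names, and the helper `build_data_dictionary` are all hypothetical and invented for illustration; the hand-written descriptions are the part that a script cannot infer.

```python
import csv
import io

# Hypothetical sample data; column names and values are invented for illustration.
SAMPLE_CSV = """meter_id,timestamp,kwh
A1,2021-01-01T00:00,1.2
A2,2021-01-01T00:00,
A1,2021-01-01T01:00,0.9
"""

# Hand-written descriptions (also hypothetical) that explain each column to a new user.
DESCRIPTIONS = {
    "meter_id": "Anonymized identifier of the electricity meter",
    "timestamp": "Start of the metering interval (local time, ISO 8601)",
    "kwh": "Electricity consumed during the interval, in kilowatt-hours",
}


def build_data_dictionary(csv_text, descriptions):
    """Return one entry per CSV column: description, example value, missing count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        non_empty = [v for v in values if v != ""]
        entries.append({
            "column": name,
            "description": descriptions.get(name, "TODO: describe this column"),
            "example": non_empty[0] if non_empty else "",
            "missing": len(values) - len(non_empty),
        })
    return entries


data_dictionary = build_data_dictionary(SAMPLE_CSV, DESCRIPTIONS)
for entry in data_dictionary:
    print(entry)
```

A draft like this can be saved next to the dataset and edited by hand, so that someone who receives the data does not have to guess what each column means.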

Next, only 29% of the survey participants believe that creating a long-term data storage plan is highly relevant, and another 29% are not sure whether it is relevant. Meanwhile, 43% think that they have only low expertise in creating data storage plans. As writing such a plan has become almost a requirement in grant proposals, this is an area where the University Libraries should raise awareness and promote best practices. One way to work towards this goal is to incorporate a section on data backup and storage in topic- or course-specific LibGuides (here is an example).

Similarly, only 29% of the survey participants believe that writing data management plans is highly relevant, and another 29% are not sure whether it is relevant. All of them believe that they have low expertise in this area. This could be a result of graduate students being overrepresented in the sample, since graduate students may have limited experience in writing formal data management plans. However, a library workshop on writing data management plans should still be helpful, because it can prepare graduate students at an early stage of their careers.

Data sharing
Sharing data among researchers is a common practice: 75% of the survey participants have shared research data either privately or via a repository. GitHub is the most mentioned platform for sharing data publicly (75%), followed by Figshare and Zenodo. Notably, no one chose KiltHub for data sharing. The interviews showed that some participants had never heard of KiltHub, while others had heard of it but had not used it. Therefore, the Libraries may want to promote KiltHub more among the energy research community.

The interviewees also shared their thoughts about the open data movement. Most of them recognized the benefits of sharing research data for transparency and replicability but also stated that there is more to be done. For example, on writing data description files for data sharing, one interviewee said, “... we don’t really need to do these things. And if we don’t need to do them, we won’t necessarily do them. I think it’s a good idea to push ourselves to do these things, because [this is] part of trying to make the research more open and more reproducible.”

Interests in workshop topics
Based on the survey, data extraction (e.g., web scraping) is the most requested workshop topic (86% said very interested), followed by data visualization (67%) and data cleaning (57%). There is moderate interest in copyright and licensing (29%) and research process documentation (29%). Although not directly related to data curation, one interviewee mentioned that it would be helpful to have a workshop on how to combine search terms in a systematic way during literature review.
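As a rough illustration of what combining search terms systematically can look like, the Python sketch below groups synonyms by concept, joins the terms within a group with OR, and joins the groups with AND. The concept groups and the helper `build_query` are hypothetical, and the exact query syntax accepted by a given database is an assumption that should be checked against that database’s own search help.

```python
def build_query(concept_groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = []
    for terms in concept_groups:
        # Quote multi-word terms so they are treated as phrases.
        quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical concept groups for a literature search.
concept_groups = [
    ["energy", "electricity"],
    ["data curation", "data management"],
]

query = build_query(concept_groups)
print(query)  # (energy OR electricity) AND ("data curation" OR "data management")
```

Keeping the concept groups in a list like this also documents the search strategy, which makes it easier to report or rerun the search later.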

Summary

  • In general, the CMU energy research community has a sufficient level of knowledge and skills in data curation.
  • This community has been adopting open science and open data practices, mostly in response to requirements from funders and journals.
  • KiltHub can be promoted more among this community.
  • The University Libraries may continue to receive requests from this community for help with writing data management plans (including data storage plans).
  • This community is interested in attending library workshops on web scraping, data cleaning, and data visualization.

Acknowledgments
First and foremost, many thanks to the survey participants and the interviewees. I would also like to thank Neelam Bharti (Associate Dean for Research / Senior Librarian, University Libraries), Julie Chen (Library Liaison to Civil and Environmental Engineering, Engineering and Public Policy, and Mechanical Engineering, University Libraries [at the time the current project was planned]), Hannah Gunderman (Former Data, Gaming, and Popular Culture Librarian, University Libraries), Emma Slayton (Data Curation, Visualization, and GIS Specialist, University Libraries), and Sarah Young (Principal Librarian, University Libraries) for giving me feedback on the survey questions. My fellowship has been supervised by Rikk Mulligan (Digital Scholarship Strategist, University Libraries) and Anna Siefken (Executive Director [at the time the current project was conducted; on leave as of July 2022], Scott Institute). This fellowship is made possible in partnership with the Council on Library and Information Resources (CLIR), with the generous support of the Alfred P. Sloan Foundation.

References with links (by order of appearance)

 

by Luling Huang, CLIR Postdoctoral Fellow in Data Curation for Energy Social Science