Open Science and Open Data in the Era of COVID-19


As researchers from across multiple disciplines grapple with the challenges of COVID-19, the open science movement and its themes of sharing well-curated, reusable data and conducting research collaboratively and transparently appear more relevant than ever. Advocates argue that open science can accelerate discovery, enable rapid and robust peer-review, and enhance the public impact of research. 

A conversation about open science and data collaboration during the COVID-19 pandemic with Huajin Wang, Program Director for Open Science & Data Collaborations, and Hannah Gunderman, Research Data Management Consultant, both at Carnegie Mellon University Libraries, ranged from observations about how the research community has responded to the crisis, to incentives driving such response, to the considerations and tools that matter most as we move towards more open science.

More open, more collaborative, and faster sharing of research outcomes

COVID-19, as a global humanitarian crisis, has already shifted mindsets in the research community. From the release of full viral genome sequences and virus testing protocols, to case tracking dashboards and prediction models, to clinical trials of antiviral drugs and vaccines, data and research outcomes related to COVID-19 are being shared at a speed that has never been seen before.

In scholarly journals, articles focused on COVID-19 now head much more rapidly towards publication. Wang said, 'Before COVID, it was common for a biomedical paper to take months or even a year to publish. But for a lot of these COVID-19 papers, journals have sped up their review process to make it possible to publish in about a month. Many scientists release their manuscript in preprint servers, such as bioRxiv or medRxiv, before submitting to a peer-reviewed journal, which enables new research discoveries to be seen by the public almost immediately.'

While paywalls at many peer-reviewed journals still limit the utility of research data and results, Wang has been gratified to see much data also being made openly available online.

'It was amazing how fast the [virus] genomes got shared,' she said. 'You sequence the genome and share it in a public repository, and the world knows right away; immediately somebody else can use the genome to develop a test.'

The rapid data sharing has spurred increased collaboration between researchers, which had previously been hard to initiate.

'The urgency has created a shared sense of working towards a joint cause,' Wang said. 'People from different disciplines are all of a sudden focused on the same issue, so they naturally start to collaborate.'

Gunderman added that research collaboration has not been limited to medical or biological fields. Researchers have also responded to the social impacts of COVID-19, using qualitative data. Dialogues are initiated both within and across social sciences and humanities disciplines.

'New collaborations are coming across the board,' Gunderman said, 'including all of what we think data can be. I think we're seeing it from quantitative to qualitative to ethnographic to geospatial to whatever else. It's pretty cool.'

Intensified sense of social good drives open science and collaboration  

Whether researchers engage in research openly, as opposed to holding data back until after publication (if not indefinitely), is often influenced by a tension many feel between advancing their careers and working for social good. 

Pressure to publish and fear of being scooped have always been hurdles to practicing open science.

'There is the motivation to publish, to make an impact, and to be recognized in the science community,' Wang said, 'and there's competition, so you want to publish fast and be the first to publish the story. But at the same time, I think a lot of scientists came into scientific research because they want to do something good for society.'

Gunderman said she's observed an association between humanitarian work and a commitment to public and open data. She mentioned her work at the Oak Ridge National Lab, where researchers tackled issues like human rights and climate change. Those researchers, said Gunderman, tended to see research in a distinctive way that made open science and data a clear choice: 'we've got an obvious problem…and we want more people or more researchers to be able to quickly access it.'

Wang observed that somehow the challenges of COVID-19 have 'brought out the best in people.' The profound sense of urgency around COVID-19 has been driven not only by the lack of vaccine or targeted treatment but also by an awareness of the magnitude of COVID-19's social impacts. 

'That motivation of doing social good has gotten stronger and is more urgent... I mean there's still, undeniably, a motivation of fame and pressure,' Wang said. 'That's always there, but still I think that for the most part the good intentions have intensified.'

Data management and curation are key to making shared data more valuable

Even with the shifts in motivation that COVID-19 has inspired, a simple commitment to freely sharing data isn't enough to achieve the kind of agility promised by open science. Wang said, 'The way I would phrase it is: open is not enough.'

Not only does data need to be shared openly, she said, but it needs to be 'open in a way that is reusable, reproducible, and understandable by others so when others see your data they will be able to know what's entailed in the data.'

Open science and open data principles offer enormous potential gains in the agility of research; at a time of urgent crisis this promise can't be ignored. But if data isn't well-curated, well-structured, well-managed, and reusable, the benefit gained from sharing it and opening up to collaboration evaporates. The time potentially saved is lost.

'You know, it's funny. Ever since I started working in research data management, I've heard this "joke" that my job, on paper, does not seem that exciting. Because I'm literally helping people by saying "Hey, maybe use these file naming schemes,"' Gunderman said. 'But situations like this make me push back on jokes like that…Having something like a good file naming convention seems so arbitrary but when you suddenly have 300 datasets related to the same research topic…If you didn't implement a good file naming scheme, all that data could potentially be useless. It could take hours and hours and hours to go through and see how it is organized.'
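To make the idea of a file naming scheme concrete, here is a minimal Python sketch. The convention it checks (project, date, description, version, extension) and the function names are hypothetical illustrations, not a scheme described in the interview; a real project would agree on its own pattern.

```python
import re
from typing import Optional

# Hypothetical convention (an assumption for illustration only):
#   <project>_<YYYY-MM-DD>_<description>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<description>[a-z0-9-]+)_"
    r"v(?P<version>\d{2})\."
    r"(?P<ext>[a-z0-9]+)$"
)


def parse_filename(name: str) -> Optional[dict]:
    """Return the named parts of a conforming file name, or None otherwise."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


def find_nonconforming(names):
    """List the file names that do not follow the convention."""
    return [n for n in names if parse_filename(n) is None]
```

With a check like this, a collection of hundreds of datasets can be sorted by project or date from the file names alone, and any stray files are flagged before they are shared.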

Wang pointed out that there are different kinds of platforms and formats for sharing open data and some include more built-in structural standards than others. She went on, 'One thing I'm glad about is that we have more and more infrastructure that has well-developed, domain-specific data standards that make data curation and data sharing much easier.' For example, the National Center for Biotechnology Information (NCBI) GenBank allows researchers to deposit virus genome sequences with ease. Other researchers are then able to download a genome sequence and reuse it right away.

These platforms are repositories designed for sharing specific research data, and they maintain rigid metadata schemas. Basically, Wang said, 'the research community came together and agreed, "This is the metadata that we want and this is the metadata that we'll stick with." As a result, when I see new material coming into this specific repository I know what to look for.'

The ability of the research community to rapidly share and respond in a situation like COVID-19 is enhanced by these platforms that have been carefully designed to receive specific and formally structured data. When sharing data, researchers don't need to pause to consider how to structure their data, because a well-designed standard schema already exists. Researchers on the receiving end also save time: they don't have to interpret the structure and schema of the data set, because their expectations for how information will be organized are already in place.
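As a rough illustration of why an agreed schema saves time on both ends, the Python sketch below checks a metadata record against a fixed set of required fields. The field names and types are assumptions made up for this example; they are not the actual schema of GenBank or any other repository.

```python
# Hypothetical required fields and types (illustrative only;
# NOT the real schema of GenBank or any other repository).
REQUIRED_FIELDS = {
    "organism": str,
    "collection_date": str,
    "country": str,
    "sequence_length": int,
}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems
```

A depositor can run a check like this before sharing, and a recipient can rely on every accepted record having the same fields in the same form.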

These kinds of platforms are helpful, particularly when the tools and techniques of research data management aren't an automatic part of every scholar's repertoire.

Gunderman explained, 'Much of my job is having conversations with people for whom, maybe that's not something they learned in their programs as they went through the Masters or PhD. At least in geography, my field, [research reproducibility is] not something you necessarily would talk about a lot.'

Wang noted that the issue of data curation is particularly relevant in light of rising interest in how machine learning and artificial intelligence (AI) technologies might be applied to COVID-19 forecasting and prediction.

'A lot of research on forecasting and predictions has been done with machine learning and AI,' Wang said. 'But... Really, you can't do anything without open data.'

CMU's recurring conference on Artificial Intelligence for Data Discovery and Reuse (AIDR) is organized by Wang and others. Wang said, 'One major consensus from the community last time we had the AIDR conference, it ended up being clear to everyone: As a community we have to be good citizens of data. We have to build a healthy data ecosystem before AI can really happen.'

Wang and Gunderman were both excited about a course that Gunderman is teaching with colleague Emma Slayton in the fall, called 'Discovering the Data Universe.'

As Gunderman explained, 'a lot of the time we don't catch researchers until they are in grad school—or they might be postdocs, they might be faculty—to have these conversations. And so we thought, if we can have these conversations with undergraduates that is the best case scenario for actually teaching them about reproducibility, the data lifecycle, data management, data visualization and different things like that at an early stage.'

On the work she and Gunderman do as part of the Open Science & Data Collaborations program within the libraries at CMU, Wang said, 'I think it's our responsibility to help educate people in data literacy, and what it means to make your data not only open but really reusable and reproducible, and basically open for scrutiny. It's important for us to do this.'

As our conversation drew to a close, Wang noted a pervasive sentiment that the response to COVID-19 is likely to change research forever. Gunderman agreed, offering her hope that 'the same momentum that we've seen around COVID-19 continues and is also applied to other issues as well…to other really pressing issues related to science and humanity.'

Chloe Woida, Project Coordinator, Open Science & Data Collaborations