A Conversation with: Huajin Wang

Librarian Huajin Wang joined the University Libraries in 2017. A cell biologist by training, with more than 10 years of research experience, she is also a member of the AIDR 2019 Program Committee.

What is AIDR 2019?
AIDR stands for Artificial Intelligence for Data Discovery and Reuse. It is a conference that aims to bring together everyone whose work is related to using AI or machine learning to facilitate data discovery and reuse. It takes place May 13-15 at Simmons Auditorium at Carnegie Mellon University.

What does 'data discovery and reuse' mean and why is it important?
'Data discovery and reuse' means finding existing data that are out there and reusing these data to solve a new problem or give an old problem a deeper look. Scientists generate lots of data every day, at a cost of millions of dollars. These large and complex datasets often contain lots of information, making it impossible for a single investigator to extract all of the useful material. So, it makes sense for multiple investigators with different expertise to look at the data from different angles.

Unfortunately, the reality is that most datasets only get used once, in the original publication. After that, these datasets often either live on the PI's server or in data repositories, and few of them get used again. All the rich information contained in the datasets that was so expensive to produce in the first place gets buried. As you can see, facilitating the reuse of data would allow science to move more quickly and be more economical.

How does AI help with data discovery and reuse?
With the recent advances in machine learning and AI, it is possible to train computers to learn certain tasks and find optimal solutions to a problem, such as integrating different datasets. Machine learning and AI are already being used extensively to build search engines and databases, and to facilitate data analytics and automation in almost every discipline. It's about time that people working in all these disciplines come together, benefit from mutual expertise, and address these challenges together, using the power of AI.

How did you end up bringing this conference to CMU?
My interest in data discovery and reuse started a long time ago when I was a biologist working in a big lab. After joining the Libraries as a research liaison, I felt that I was in an ideal position to help researchers with data problems. When I saw a funding opportunity from the National Science Foundation on data reuse, I knew this was a perfect opportunity. It happened that Nick Nystrom, my long-time collaborator at the Pittsburgh Supercomputing Center (PSC), had the same idea. We collaborated with Dean Webster and Paola Buitrago, Artificial Intelligence and Big Data Group Leader at PSC, to submit a grant application together. And we got funded!

Why is CMU the right place to have this conference?
CMU has a strong community for AI research; almost everyone working at CMU, from every college, has some connection to AI. Our library is a leader in many national and international efforts to advance the state of research data management. And PSC is a national leader in scientific computing and hosts many important community datasets. We are a perfect team.

More information about AIDR 2019.