Interdisciplinary CMU Project Restores Access to Essential Humanities Resource

Sam Lemley, curator of the Inventing Shakespeare exhibit, and Associate Professor of English Christopher Warren, who led the Print & Probability project that identified the printers of the Fourth Folio.

Before October 28, 2023, the University Libraries’ Curator of Special Collections Sam Lemley made frequent use of an online resource called the “English Short Title Catalogue” (ESTC) in his research.

A key database for scholars like Lemley investigating English literature, bibliography, and the history of the book, the ESTC is a shared catalog devoted to books, serials, pamphlets, and ephemeral materials published between 1473 and 1800. Co-managed by the British Library (BL) and the University of California, Riverside's Center for Bibliographical Studies and Research, it includes works published in Britain and its colonies, particularly in North America, during that time. The database was freely searchable, and also pointed researchers to libraries where physical copies of each item could be found.

“I would use the ESTC, if not daily, at least a couple of times a week,” Lemley recalled. “Through it, you could find a list of every book printed by a particular printer in a date range, or which printers tended to work together in London in a particular year. It was the essential resource for bibliographical information — in my field, I and so many of my colleagues depended on it.”

On October 28, 2023, however, access to the catalog suddenly disappeared. A ransomware attack brought down most of the BL's digital capabilities, including Wi-Fi, computer access, and even the phone lines, and took the ESTC offline with them.

Now, Lemley and a team of researchers from CMU and the University of California, San Diego are working to rebuild the resource. Led by UCSD Ph.D. student Nikolai Vogler, they have created an independent, temporary search interface as a stopgap for scholars.

“Working in a library, we’re all about providing information and evidence to people that need it,” Lemley said. “CMU as a whole, with its interdisciplinary focus on collaborative work between areas like the humanities and computer science, was the perfect environment to solve this problem.”


Print, Probability, and Preservation

Much of the earlier work Lemley did with the ESTC was related to the Print & Probability project, which uses computational tools and methods to detect new evidence in early printed books. Situated at the intersection of book history, computer vision, and machine learning, the project seeks to discover letterpress printers whose identities have eluded scholars for several hundred years. It’s currently funded by National Science Foundation and National Endowment for the Humanities grants.

The project is led by Professor of English Christopher Warren, along with Max G’Sell, a former professor in the Department of Statistics & Data Science, and Taylor Berg-Kirkpatrick, an associate professor in the Department of Computer Science and Engineering at UCSD. Lemley joined the team when he first started working with the Libraries in 2020.

The team has worked to identify anonymous printers of controversial books and pamphlets published during an era of censorship and political unrest, investigating pieces like John Milton’s 1644 pamphlet on freedom of the press, “Areopagitica,” and Thomas Hobbes’ 1651 exploration of social contract theory, “Leviathan.” They even solved the mystery of who printed Shakespeare’s Fourth Folio, which was highlighted in the Libraries’ exhibition “Inventing Shakespeare: Text, Technology, and the Four Folios.”

For a majority of these investigations, team members needed to consult the ESTC.

“One of the necessary underpinnings of the project is using works with known printers. If you want to find out who printed something anonymous, you have to identify the particularities of that book or pamphlet, like distinctive pieces of damaged type, and then determine who they belonged to,” Warren explained. “In order to answer that question, you have to have a huge set of background knowledge — and a lot of that metadata lives in the ESTC.”

Thanks to a seminar at the Folger Shakespeare Library back in 2014, Warren was in possession of a snapshot from the ESTC. The workshop, which introduced scholars of the early modern period to techniques and data associated with digital humanities, shared the metadata with participants in order to illustrate the many possibilities of querying the resource at a large scale.

The snapshot proved useful for the team over the years as they investigated different texts. But when the ESTC went down, it quickly became essential for an entire ecosystem of scholars around the world.

Rebuilding Efforts

That day in October, it was clear almost immediately that the ESTC was not going to be restored any time soon.

The hacker gang Rhysida, which orchestrated the attack, likely targeted the BL simply to demonstrate its capabilities. When the BL refused to pay a ransom of 20 bitcoin (nearly $750,000 at the time), the hackers released the data on the dark web instead of restoring the library’s systems. The BL devoted its resources to bringing back key on-site services, which are focused on the library’s own holdings.

For the foreseeable future, the original form of the ESTC was no more.

Vogler, who is studying computer science at UCSD, proposed the initial idea of creating a temporary stopgap. A project he had started a few months earlier, applying modern AI methods to clean up and enrich much of the metadata that the ESTC serves, had left him very familiar with the resource. Though he didn't have much experience with web development, he taught himself the necessary skills and, with help from others in his department, set up a working search interface.

“Nikolai is an astonishing scholar, working at the cutting edge of digital humanities and AI,” Warren said. “Where most of us bemoaned the fact that the ESTC had been taken down, he has a kind of maker’s appreciation for the possibilities of technology — he understood that this was an opportunity to serve our community of scholars.”

Vogler was able to share a rudimentary version of the resource with the public only a week after he had the idea. Over the next few months, with input from other Print & Probability members, it grew from serving 150,000 records to roughly half a million.

“It was really impactful to unite forces with this group for a project like this,” Vogler said. “Print & Probability is one of the truly successful interdisciplinary projects out there. There aren’t a lot of teams that have had such a long-lasting relationship, with the same cohesion and drive to make a difference.”


The scholarly community has turned to the new interface as the replacement for the ESTC. In the last month, the website has received more than 80,000 searches — sometimes, as many as 27,000 in a day.

Beyond the original ESTC data, the new resource includes information that may improve scholars’ experience. For example, the team has added metadata developed at Texas A&M University regarding printers and publishers. Vogler has also enriched much of the data as a part of his Ph.D. thesis, which focuses on building better machine learning models for early printing.

“This is a good illustration not only of how multiple copies keep things safe, but also of the ways that resources like the ESTC can form the core of larger data aggregation projects that enhance the initial data,” Warren explained.

With the ESTC metadata available to search once more, Lemley and Warren could resume their research — which currently focuses on identifying the printers of John Locke’s “Two Treatises of Government.” Research has also become much more accessible for scholars and Ph.D. students working on their own projects around the world.

“When Print & Probability began hosting the ESTC, it resolved a major obstacle hindering book history scholarship,” said Laura DeLuca, a doctoral candidate in the Department of English studying literary and cultural studies. DeLuca, who is also a member of the Print & Probability team, relies on the ESTC to locate specific editions of early modern books, visiting them in person or requesting digital versions of them for her research. “The ESTC is an essential bibliographic resource, and I am proud to be part of such an innovative team.”

by Sarah Bender, Communications Coordinator