Awash in Data

Juliana Spahr and Stephanie Young—joined in co-authorship, later, by Claire Grossman—were poets who began to notice something: “at most of the readings we attend, the room is mainly white.” This observation bothered them. They gathered anecdotes. They queried friends. When they heard about rooms that were other-than-white, they sought them out “only to find that they no longer existed or barely existed.” Something was wrong with poetry, so they set out to learn more. Early in their research they realized that if they wanted to know the truth about whiteness and US poetry they would need more than anecdotes. They would need data: lots of data.

They thought that creative writing programs might be less white, but they were wrong. They thought that prizewinners might be less white, and found that, though prizes were more racially equitable in the twenty-first century, the winners shared elite educational backgrounds. They thought that technology—desktop publishing, the internet—might have democratized publishing, but they learned that racism, Ivy League degrees, and elite creative writing programs maintain gross inequality in poetry.

Grossman, Spahr, and Young aren’t alone in recognizing two facts: contemporary literature is vast, and it is awash in data. While they surveyed rooms and rosters, literary scholars across the US had also quietly begun gathering data to document the complex, interlocking system of literary prejudice and prestige. Conspicuous blurbs create networks that subtly organize the field into coteries. Online social media platforms have resurrected and repurposed authorship. One scholar collected Publishers Weekly bestseller lists for fiction between 1950 and 2000 and discovered that 98% of the titles were written by white authors. Meanwhile, a graduate student at the University of Chicago gathered The New York Times’ bestseller lists for his dissertation. And a professor at the University of New Orleans, wanting to address the absence of African American literature from these lists, assembled Essence’s bestseller lists, which captured hugely popular books like Omar Tyree’s Flyy Girl and Sister Souljah’s The Coldest Winter Ever that the mainstream lists failed to register.

Increased scholarly attention to data is the complement to a changing literary culture. “The personal computer, mobile devices, the cloud, the server farm, the search engine, the algorithm, and the network are now indispensable parts of daily life,” write Jessica Pressman and Aarthi Vadde. “They are equally indispensable to the reading, writing, and distribution of literature.” Thanks to ebooks and the ease of self-publishing, the number of fiction titles published in the US each year increased, between 1990 and 2010, by a factor of ten, from around 5,000 to more than 50,000. Since 2013, that number has risen above 70,000. Amid this vastness, new genres proliferate like mushrooms after a rain, none more “quintessential,” argues Mark McGurl, than Adult Baby Diaper Erotica. The participatory culture of Web 2.0 has lent itself to immense collections of fan fiction at fanfiction.net and on Wattpad. More and more people listen to audiobooks. The performance of authorship in the digital literary sphere is defined against participatory culture, whether one is an enthusiastic participant, like Margaret Atwood, or a curmudgeonly holdout, like Jonathan Franzen—or a curmudgeonly participant, like Joyce Carol Oates, describing works of autofiction as “wan little husks.” Increasingly, publishers look for large social media follower counts from prospective authors who, in turn, often live Very Online. Sales can be driven by tears on TikTok.

At this point, data is the foundation of contemporary literary culture. Prominent novelists such as Teju Cole, Joshua Cohen, and Jennifer Egan have, for some years, been provoking us—through Twitter fiction, livestreaming the writing of a novel, and incorporating PowerPoint into a novel—to update our accounts of literature’s material culture in the digital age. We ought to recognize data’s function in much the way book historians have long recognized the role of print technology in production, circulation, and reception: from the printing press to offset printing to the mimeograph to desktop publishing. Adobe’s Portable Document Format was to the 1990s what wood pulp was to the 1860s. The work is ongoing. Scholars have written literary histories of word processing and the digital literary sphere, not to mention critiques of the exorbitant energy use and the military histories of the infrastructure that stores this voluminous data. But what about the data itself?

The data is voluminous, and the process of collecting it has only begun. Not long ago, we noticed that data collection in literary studies was haphazard. Some scholars in California wanted to investigate the whiteness of poetry, so they started filling spreadsheets. Some scholars in Iowa wanted to understand the influence of the Iowa Writers’ Workshop, so they started filling spreadsheets. One of us began filling spreadsheets with information about literary agents. One of us began filling spreadsheets with lists from publishing imprints. But we were figuring out how to do it as we went, more or less separately. We lacked criteria for what good data collection looked like. We knew we wanted to share our data eventually, but we didn’t know how best to do so. And we knew that this data, collected by these scholars, would be much more powerful together than apart.

We—Laura and Dan—decided to form a collective to address the challenges that we and our colleagues were beginning to face. We formed an editorial board. We partnered with the journal Post45 and the Emory Center for Digital Scholarship. We launched the Post45 Data Collective in April 2021.

The Post45 Data Collective peer reviews and houses post-1945 literary data on our open-access website, for use, reuse, research, and teaching. Copyright prevents us from hosting literature itself, so we include, whenever possible, IDs that allow users to access texts in the HathiTrust digital library and study them through the HathiTrust Research Center. The Data Collective hosts, as of today, a series of datasets that describe who attended the Iowa Writers’ Workshop, their advisors, and the titles of their theses, among other metadata. Building a dataset is difficult work that ought to be cited and recognized by institutions; to make that easier, we send datasets through double-blind peer review and, upon acceptance, issue them a Digital Object Identifier, or DOI, to enable citation. Several further datasets stand in the queue. These include weekly bestseller lists for fiction from The New York Times, black bestseller lists from Essence, and the data about poetry and prizes that undergirded the landmark scholarship by Claire Grossman, Juliana Spahr, and Stephanie Young.

Imagine what will be possible. One day soon we will be able to map the network of the literary field. We will be able to trace connections from literary agents, blurbs, editors, imprints, and Goodreads reviews to prizes and institutions of higher education. We will be able to investigate tax documents from conglomerates to understand how they manage publishers and film studios. We will be able to study the influence of these networks and this management—for better and, certainly, for worse—on literature itself. This is a beginning. It’s time.

The Post45 Data Collective is supported and maintained by the Emory Center for Digital Scholarship.