New York Times Hardcover Fiction Bestsellers (1931–2020)
The New York Times Hardcover Fiction Bestsellers (1931–2020) contains three related datasets:
The first dataset provides a tabular representation of the hardcover fiction bestseller list of The New York Times every week between 1931 and 2020.
The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.
The third dataset provides HathiTrust Digital Library identifiers for every unique title that appeared on the hardcover fiction bestseller list and that also has a corresponding volume in the HathiTrust Digital Library.
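The relationship between the first two datasets can be illustrated with a short sketch: rows of a weekly list are collapsed into one summary record per unique title. The field names (`week`, `rank`, `title`, `author`, `weeks_on_list`, `best_rank`) and the sample rows are hypothetical placeholders chosen for illustration. They are not the actual column names or contents of the published files.

```python
from datetime import date

# Hypothetical weekly rows; placeholder titles and authors, not real list data.
weekly_rows = [
    {"week": date(1950, 1, 1), "rank": 1, "title": "Example Title A", "author": "Author One"},
    {"week": date(1950, 1, 1), "rank": 2, "title": "Example Title B", "author": "Author Two"},
    {"week": date(1950, 1, 8), "rank": 1, "title": "Example Title B", "author": "Author Two"},
    {"week": date(1950, 1, 8), "rank": 2, "title": "Example Title A", "author": "Author One"},
    {"week": date(1950, 1, 15), "rank": 1, "title": "Example Title B", "author": "Author Two"},
]


def summarize_titles(rows):
    """Collapse weekly rows into one summary record per (title, author) pair."""
    summary = {}
    for row in rows:
        key = (row["title"], row["author"])
        entry = summary.setdefault(
            key,
            {
                "first_week": row["week"],
                "last_week": row["week"],
                "weeks_on_list": 0,
                "best_rank": row["rank"],
            },
        )
        entry["first_week"] = min(entry["first_week"], row["week"])
        entry["last_week"] = max(entry["last_week"], row["week"])
        entry["weeks_on_list"] += 1
        # A lower rank number means a higher position on the list.
        entry["best_rank"] = min(entry["best_rank"], row["rank"])
    return summary


title_level = summarize_titles(weekly_rows)
```

Keying on the title and author together, rather than the title alone, is one way to keep distinct works that share a title from being merged. Whether the published title-level dataset was built this way is not stated here.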
Previous research using similar data has been limited to partial segments of the list, such as the top 200 longest-running bestsellers since a certain date (Piper et al. 2016a) or bestsellers from only particular years (Sorensen 2007). By contrast, this dataset covers the full list since its inception in 1931, along with each reported work’s title, author(s), date of appearance, and rank.
Significance and Context
These datasets provide valuable metadata for researchers of 20th-century American literature working in fields such as cultural analytics, book and publishing history, and the sociology of literature. In cultural analytics, recent scholarship has used bestseller status as a rough proxy for popularity, enabling researchers to computationally model the textual boundaries between, for instance, popular and prizewinning fiction (Algee-Hewitt and McGurl 2015; Piper et al. 2016b; English 2016). Previous research of this kind has often relied on the Publishers Weekly annual bestseller list. Although Publishers Weekly also publishes a weekly list, it is not readily accessible to researchers. In contrast to the Publishers Weekly annual list, this dataset reports weekly bestsellers and therefore captures a much broader subset of the historical literary marketplace.
This larger and more granular New York Times dataset presents researchers with a number of potential uses. First, existing experiments on bestsellers and prizewinners could be reproduced with this new data. The broader scope of this dataset is likely to dampen the apparent difference between prizewinners and bestsellers, as many prizewinners made it onto the Times list without making it onto the Publishers Weekly annual list. Second, the broader scope of the Times list provides a valuable resource for constructing corpora of historical popular literature, since weekly bestsellers have been neglected in humanities corpora relative to yearly bestsellers. Finally, the Times list could be used to support ongoing research at the intersection of literary and publishing history. As the most closely followed public-facing bestseller list, the Times list offers insight into the works considered valuable by publishers.
Collection and Creation
Data for the New York Times bestsellers was scraped from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year of the list going back to 1931. Though the Hawes files are of high quality, they are only available as PDF images. Plain text was extracted from the Hawes files programmatically with the open-source Python library pdfminer. Though the Hawes files did not come as a structured or tabular dataset, they do report bestseller information in a relatively standardized format. This allowed author, title, date, and rank information to be extracted from the plain text with a mixture of regular expressions and logical operations.
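The general approach of combining regular expressions with simple logic can be sketched as follows. This is a minimal illustration, not the code used to build the dataset. It starts from text that has already been extracted from a PDF, and the line layout in `SAMPLE_TEXT` (a date line, then lines of the form `rank TITLE, by Author. (Publisher.)`) is an assumption made for the example. The actual Hawes layout may differ, and the titles and authors are placeholders.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Assumed date line, e.g. "January 1, 1950".
DATE_RE = re.compile(
    r"^(?P<month>" + "|".join(MONTHS) + r") (?P<day>\d{1,2}), (?P<year>\d{4})$"
)

# Assumed entry line, e.g. "1 EXAMPLE TITLE A, by Author One. (Example Press.)".
ENTRY_RE = re.compile(
    r"^(?P<rank>\d{1,2})\s+(?P<title>.+?), by (?P<author>.+?)\.\s*(?:\(|$)"
)

# Hypothetical extracted text; placeholder entries, not real list data.
SAMPLE_TEXT = """\
January 1, 1950
Fiction
1 EXAMPLE TITLE A, by Author One. (Example Press.)
2 EXAMPLE TITLE B, by A. B. Author. (Example Press.)

January 8, 1950
Fiction
1 EXAMPLE TITLE B, by A. B. Author. (Example Press.)
2 EXAMPLE TITLE A, by Author One. (Example Press.)
"""


def parse_list_text(text):
    """Turn extracted plain text into rows of week, rank, title, and author."""
    rows = []
    current_week = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        date_match = DATE_RE.match(line)
        if date_match:
            # Remember the most recent date so later entries inherit it.
            current_week = date(
                int(date_match["year"]),
                MONTHS[date_match["month"]],
                int(date_match["day"]),
            )
            continue
        entry_match = ENTRY_RE.match(line)
        # Lines that match neither pattern (such as "Fiction") are skipped.
        if entry_match and current_week is not None:
            rows.append({
                "week": current_week,
                "rank": int(entry_match["rank"]),
                "title": entry_match["title"],
                "author": entry_match["author"],
            })
    return rows


rows = parse_list_text(SAMPLE_TEXT)
```

The regular expressions identify what kind of line is being read, while the surrounding logic carries the current date forward to each entry beneath it. Real transcripts would likely need further handling for entries that wrap across lines or deviate from the pattern.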
At a later stage, Matt Miller (Post45 DC Data Analyst) computationally added persistent identifiers, such as VIAF, LCCN, and Wikidata identifiers. Miller also added book information from OCLC, a global library organization that contains information from more than 16,000 member libraries in more than 100 countries. This information includes how many copies of each edition are held by these libraries (oclc_holdings or oclc_eholdings). It was accessed through the OCLC Classify API, which was shut down in January 2024.