New York Times Hardcover Fiction Bestsellers, 1931–2020
Author: Jordan Pruett
DOI: https://doi.org/10.18737/CNJV1733p4520220211
The New York Times Hardcover Fiction Bestsellers (1931–2020) contains three related datasets. The first dataset provides a tabular representation of the hardcover fiction bestseller list of The New York Times every week between 1931 and 2020. The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period. The third dataset provides HathiTrust Digital Library identifiers for every unique title that appeared on the hardcover fiction bestseller list and that also has a corresponding volume in the HathiTrust Digital Library.
Previous research using similar data has been limited to partial segments of the list, such as the top 200 longest-running bestsellers since a certain date (Piper and Portelance, 2016) or bestsellers from only particular years (Sorenson, 2007). By contrast, this dataset covers the full list since its inception in 1931, along with each reported work’s title, author(s), date of appearance, and rank.
Significance and Context
These datasets provide valuable metadata for researchers of 20th-century American literature working in fields such as cultural analytics, book and publishing history, and the sociology of literature. In cultural analytics, recent scholarship has used bestseller status as a rough proxy for popularity, enabling researchers to computationally model the textual boundaries between, for instance, popular and prizewinning fiction (Algee-Hewitt and McGurl, 2015; Piper and Portelance, 2016; English, 2016). Previous research of this kind has often relied on the Publishers Weekly annual bestseller list. Although Publishers Weekly also publishes a weekly list, it is not readily accessible to researchers. In contrast to the Publishers Weekly annual list, this dataset reports weekly bestsellers, and therefore captures a much broader subset of the historical literary marketplace.
This larger and more granular New York Times dataset presents researchers with a number of potential uses. First, existing experiments on bestsellers and prizewinners could be reproduced with this new data. The broader scope of this dataset is likely to dampen the apparent difference between prizewinners and bestsellers, as many prizewinners made it onto the Times list without making it onto that of Publishers Weekly. Second, the broader scope of the Times list provides a valuable resource for constructing corpora of historical popular literature. Weekly bestsellers have been neglected in humanities corpora relative to yearly bestsellers. Finally, the Times list could be used to support ongoing research at the intersection of literary and publishing history. As the most closely followed public-facing bestseller list, the Times list offers insight into the works considered valuable by publishers.
1. NYT Hardcover Fiction Bestseller List (Dataset)
Collection and Creation
Data for New York Times bestsellers was scraped from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931. Though the Hawes files are high-quality, they are only available as PDF images. Plain text was extracted from the Hawes files programmatically with the open-source Python library pdfminer. Though the Hawes files did not come as a structured or tabular dataset, they do report bestseller information in a relatively standardized format. This allowed author, title, date, and rank information to be extracted from the plain text with a mixture of regular expressions and logical operations.
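The regular-expression step can be illustrated with a minimal sketch. The line layout assumed here (a rank, a title, then ", by" and the author) is a hypothetical stand-in rather than the documented Hawes format, and ENTRY_RE and parse_entries are illustrative names, not the code used to build the dataset. Converting each PDF to plain text is left out, and the week is passed in by the caller.

```python
import re

# Assumed (hypothetical) entry layout: "<rank> <TITLE>, by <Author>."
ENTRY_RE = re.compile(
    r"^\s*(?P<rank>\d{1,2})\s+(?P<title>.+?),\s+by\s+(?P<author>.+?)\.?\s*$"
)

def parse_entries(text, week):
    """Return one dict per line of `text` that looks like a list entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # headers, blank lines, and anything unrecognized
        entries.append({
            "week": week,
            "rank": int(match["rank"]),
            "title": match["title"].strip(),
            "author": match["author"].strip(),
        })
    return entries

# Placeholder text, not real list data.
sample = "This Week\n1 AN EXAMPLE TITLE, by Jane Doe.\n2 ANOTHER TITLE, by J. R. Roe.\n"
rows = parse_entries(sample, week="1950-01-01")
```

Lines that do not fit the assumed pattern are skipped silently here. A real pipeline would probably log them for manual review, since the source describes the format as only relatively standardized.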
Description
Each row of the dataset is a single “entry” on the list, that is, a single slot for a single week. For each week, there will typically be 10 or 15 works listed. However, since the Times has varied the number of bestsellers featured in a given week, there may be 3, 6, 7, 8, or 16. A single “entry” on the list is treated as the basic unit of the dataset so that researchers can easily count the number of weeks that a given book appeared on a list, as well as the first and last weeks that it appeared.
nyt_hardcover_fiction_bestsellers-lists.tsv
- year – the year of appearance
- week – the weekly issue of the bestseller list
- rank – the book’s rank on the list for that week
- title_id – a unique ID mapping titles to the unique titles dataset
- title – title of the novel, as reported by the New York Times
- author – author of the novel, as reported by the New York Times
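Because each row is one slot in one week, per-title counts follow from a simple grouping. The sketch below assumes that the TSV header row uses the field names listed above and that the week column holds ISO-formatted dates (YYYY-MM-DD). Neither is confirmed here, and summarize_runs is an illustrative name.

```python
import csv
import io
from collections import defaultdict

def summarize_runs(tsv_file):
    """Summarize each title's run on the list from the entry-level rows."""
    runs = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        runs[row["title_id"]].append((row["week"], int(row["rank"])))

    summary = {}
    for title_id, entries in runs.items():
        entries.sort()  # assumes ISO dates, so string order is date order
        summary[title_id] = {
            "total_weeks": len(entries),  # assumes one row per title per week
            "first_week": entries[0][0],
            "last_week": entries[-1][0],
            "debut_rank": entries[0][1],
            "best_rank": min(rank for _, rank in entries),  # rank 1 is best
        }
    return summary

# Placeholder rows, not real list data.
SAMPLE = (
    "year\tweek\trank\ttitle_id\ttitle\tauthor\n"
    "1950\t1950-01-01\t3\t1\tEXAMPLE TITLE\tJane Doe\n"
    "1950\t1950-01-08\t1\t1\tEXAMPLE TITLE\tJane Doe\n"
    "1950\t1950-01-08\t2\t2\tANOTHER TITLE\tJohn Roe\n"
)
summary = summarize_runs(io.StringIO(SAMPLE))
```

With the real file, the same function could be given an open file handle in place of the StringIO object, for example one opened with newline="" and an explicit encoding (UTF-8 would be an assumption).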
2. NYT Hardcover Fiction Bestseller List — Unique Titles (Dataset)
The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.
Collection and Creation
Data for New York Times bestsellers was scraped from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931. Though the Hawes files are high-quality, they are only available as PDF images. Plain text was extracted from the Hawes files programmatically with the open-source Python library pdfminer. Though the Hawes files did not come as a structured or tabular dataset, they do report bestseller information in a relatively standardized format. This allowed author, title, date, and rank information to be extracted from the plain text with a mixture of regular expressions and logical operations.
Description
nyt_hardcover_fiction_bestsellers-titles.tsv
- id – an arbitrary unique id for the novel
- title – the title of the novel, as reported by the New York Times
- author – the author of the novel, as reported by the New York Times
- year – the first year that the novel appears on the bestseller list. Note that this year may be different from the publication year
- total_weeks – the total number of weeks the title was on the list
- first_week – the first week that the novel appears on the bestseller list
- debut_rank – the book’s bestseller rank in the week of its first appearance
- best_rank – the highest rank achieved by the title while on the list
3. NYT Hardcover Fiction Bestseller List — HathiTrust Metadata (Dataset)
Collection and Creation
HathiTrust volume identifiers were matched based on string comparisons against the “Post45 HathiTrust Fiction” dataset described by Underwood et al. (2020), specifically the shorttitle and author metadata fields.
First, author surnames were extracted heuristically based on spacing and punctuation. Then, title and author fields in both datasets were lowercased and stripped of punctuation. Two works were then considered a match if surnames were exact matches and the Times title field overlapped at the beginning of the HathiTrust shorttitle field. This yielded 4,978 matches. Note that this includes duplicates, as HathiTrust is a volume-level collection.
This conservative matching procedure was chosen over a more generous fuzzy matching procedure in order to maximize the accuracy of matches at the expense of recall. Manual inspection suggests that many of the missed matches were in fact absent from the HathiTrust collection, but the exact number of missed potential matches is uncertain.
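The matching rule described above could be sketched roughly as follows. The surname heuristic shown here (the text before a comma if there is one, otherwise the last whitespace-separated token) is an assumption, since the exact heuristic is not specified. The function names are illustrative and this is not the code used to produce the dataset.

```python
import string

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())

def surname(author):
    """Assumed heuristic: 'Surname, Given' or 'Given Surname'."""
    if "," in author:
        name = author.split(",")[0]
    else:
        tokens = author.split()
        name = tokens[-1] if tokens else ""
    return normalize(name)

def is_match(nyt_title, nyt_author, ht_shorttitle, ht_author):
    """Exact surname match plus Times title as a prefix of shorttitle."""
    title = normalize(nyt_title)
    name = surname(nyt_author)
    return (
        bool(title)
        and bool(name)
        and name == surname(ht_author)
        and normalize(ht_shorttitle).startswith(title)
    )

# Placeholder records, not real catalog data.
hit = is_match("EXAMPLE TITLE", "Jane Doe", "Example title : a novel", "Doe, Jane, 1900-")
miss = is_match("EXAMPLE TITLE", "Jane Doe", "Another example title", "Doe, Jane, 1900-")
```

A prefix test of this kind accepts subtitles appended to the HathiTrust short title but rejects any variation at the start of the title, which is in keeping with the conservative approach described above.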
Description
nyt_hardcover_fiction_bestsellers-hathitrust_metadata.tsv
- HTID – unique volume ID from HathiTrust
- title_id – the ID for that title in nyt_hardcover_fiction_bestsellers-titles.tsv. Note that since HathiTrust is organized around volumes rather than titles, this field contains duplicates, such as in the case of frequently reprinted works.
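Since one title can correspond to several HathiTrust volumes, joining this file to the unique titles dataset is a one-to-many lookup. The sketch below assumes that the header rows use the field names listed in this document (HTID and title_id here, id in the titles file) and that title_id values correspond to id values. The function names and sample identifiers are placeholders.

```python
import csv
import io
from collections import defaultdict

def htids_by_title(hathi_file):
    """Map each title_id to the list of HathiTrust volume IDs matched to it."""
    volumes = defaultdict(list)
    for row in csv.DictReader(hathi_file, delimiter="\t"):
        volumes[row["title_id"]].append(row["HTID"])
    return dict(volumes)

def attach_volumes(titles_file, volumes):
    """Yield each title row with an added 'htids' list (empty if unmatched)."""
    for row in csv.DictReader(titles_file, delimiter="\t"):
        row["htids"] = volumes.get(row["id"], [])
        yield row

# Placeholder rows, not real identifiers.
HATHI_SAMPLE = "HTID\ttitle_id\nvol.0001\t1\nvol.0002\t1\n"
TITLES_SAMPLE = (
    "id\ttitle\tauthor\n"
    "1\tEXAMPLE TITLE\tJane Doe\n"
    "2\tANOTHER TITLE\tJohn Roe\n"
)
volumes = htids_by_title(io.StringIO(HATHI_SAMPLE))
joined = list(attach_volumes(io.StringIO(TITLES_SAMPLE), volumes))
```

Keeping every matched volume in a list leaves the choice of edition to the researcher. Titles with no match keep an empty list rather than being dropped.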
Ethical Considerations
Researchers who use this dataset are encouraged to consider the limitations of drawing historical or cultural conclusions from bestseller data. Bestseller lists are not a transparent window into what the American public was “really reading” at a given historical moment; rather, they reflect editorial decisions about how and what to count. In particular, historical trends on this list are complicated by institutional shifts in book distribution that occurred during the period it covers. The increased importance of mall stores, chain stores, and retail distributors continually altered the composition of the bookstores surveyed by the New York Times (Miller, 2006). As such, the contents of this dataset likely reflect the purchasing habits of only a particular segment of the American population, namely, those who shop at malls and chain bookstores. This population was disproportionately suburban, white, and middle-class for much of the history of the list. The list likely undercounts sales at other outlets, such as independent bookstores and religious stores.
Users of this data should also be aware that hardcover sales at bookstores are especially unrepresentative of the broader book market in the early years of the “paperback revolution” after WWII, when most popular novels were sold in paperback format at non-bookstore outlets like drugstores. These sales are entirely uncounted on bestseller lists, leading to the conspicuous absence of authors like Erle Stanley Gardner and Mickey Spillane, two of the most popular novelists of the early postwar period.
The Times expanded its coverage to include nationwide bestsellers only in September of 1945. Before that, entries are based on sales in New York or other metropolitan areas. The exact methods used by the Times are not public, and the newspaper has come under periodic criticism for its bestseller reporting. For a full discussion of how the bestseller list is constructed, see Miller (2000). This dataset does not reveal anything that might be considered sensitive. All of the data in this dataset is freely available in publicly accessible archives, as well as in the pages of the New York Times itself.