New York Times Hardcover Fiction Bestsellers (1931–2020)
The New York Times Hardcover Fiction Bestsellers (1931–2020) contains three related datasets:
- The first dataset provides a tabular representation of the hardcover fiction bestseller list of The New York Times every week between 1931 and 2020.
- The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.
- The third dataset provides HathiTrust Digital Library identifiers for every unique title that appeared on the hardcover fiction bestseller list and that also has a corresponding volume in the HathiTrust Digital Library.
Previous research using similar data has been limited to partial segments of the list, such as the top 200 longest-running bestsellers since a certain date (Piper et al. 2016a) or bestsellers from only particular years (Sorensen 2007). By contrast, this dataset covers the full list since its inception in 1931, along with each reported work’s title, author(s), date of appearance, and rank.
Significance and Context
These datasets provide valuable metadata for researchers of 20th-century American literature working in fields such as cultural analytics, book and publishing history, and the sociology of literature. In cultural analytics, recent scholarship has used bestseller status as a rough proxy for popularity, enabling researchers to computationally model the textual boundaries between, for instance, popular and prizewinning fiction (Algee-Hewitt and McGurl 2015; Piper et al. 2016b; English 2016). Previous research of this kind has often relied on the Publishers Weekly annual bestseller list. Although Publishers Weekly also publishes a weekly list, it is not readily accessible to researchers. In contrast to the Publishers Weekly annual list, this dataset reports weekly bestsellers, and therefore captures a much broader subset of the historical literary marketplace.
This larger and more granular New York Times dataset presents researchers with a number of potential uses. First, existing experiments on bestsellers and prizewinners could be reproduced with this new data. The broader scope of this dataset is likely to dampen the apparent difference between prizewinners and bestsellers, as many prizewinners made it onto the Times list without making it onto that of Publishers Weekly. Second, the broader scope of the Times list provides a valuable resource for constructing corpora of historical popular literature. Weekly bestsellers have been neglected in humanities corpora relative to yearly bestsellers. Finally, the Times list could be used to support ongoing research at the intersection of literary and publishing history. As the most closely followed public-facing bestseller list, the Times list offers insight into the works considered valuable by publishers.
Collection and Creation
Data for the New York Times bestsellers was scraped from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931. Though the Hawes files are high-quality, they are only available as PDF images. Plain text was extracted from the Hawes files programmatically with the open-source Python library pdfminer. Though the Hawes files did not come as a structured or tabular dataset, they do report bestseller information in a relatively standardized format. This allowed author, title, date, and rank information to be extracted from the plain text with a mixture of regular expressions and logical operations.
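As a rough illustration of the regular-expression step, the sketch below parses plain text that has already been extracted from a PDF. The line layout it expects (a date line followed by "rank TITLE, by Author." lines), the patterns `DATE_RE` and `ENTRY_RE`, and the function `parse_list_text` are all assumptions made for this example. They are not the patterns used to build the dataset, and the placeholder sample text is invented.

```python
import re

# ASSUMED layout (hypothetical): a date line such as "October 12, 1931",
# followed by entry lines such as "1 SOME TITLE, by Some Author."
DATE_RE = re.compile(r"^[A-Z][a-z]+ \d{1,2}, \d{4}$")
ENTRY_RE = re.compile(r"^(?P<rank>\d{1,2})\s+(?P<title>.+?), by (?P<author>.+?)\.?$")


def parse_list_text(text):
    """Return one dict per list entry: week, rank, title, author."""
    rows = []
    current_week = None  # the most recent date line seen so far
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if DATE_RE.match(line):
            # The "logical operation": every entry inherits the last date line.
            current_week = line
            continue
        entry = ENTRY_RE.match(line)
        if entry and current_week is not None:
            rows.append(
                {
                    "week": current_week,
                    "rank": int(entry.group("rank")),
                    "title": entry.group("title"),
                    "author": entry.group("author"),
                }
            )
    return rows


# Invented placeholder text, not taken from the Hawes files.
SAMPLE_TEXT = """
October 12, 1931
1 FIRST PLACEHOLDER TITLE, by Author One.
2 SECOND PLACEHOLDER TITLE, by Author Two.
"""

parsed_rows = parse_list_text(SAMPLE_TEXT)
```

Lines that match neither pattern are skipped silently here. A real pipeline would probably log them for manual review.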
At a later stage, Matt Miller (Post45 DC Data Analyst) computationally added persistent identifiers, such as VIAF, LCCN, and Wikidata identifiers. Miller also added book information from OCLC, a global library organization that contains information from more than 16,000 member libraries in more than 100 countries. This information includes how many copies of each edition are held by these libraries (oclc_holdings or oclc_eholdings). It was accessed through the OCLC Classify API, which was shut down in January 2024.
1. NYT Hardcover Fiction Weekly Bestseller List
Data Table
Description
Each row of the dataset is a single “entry” on the list, that is, a single slot for a single week. A typical week lists 10 or 15 works. However, because the Times has varied the number of bestsellers featured in a given week, some weeks list 3, 6, 7, 8, or 16 works instead. A single “entry” on the list is treated as the basic unit of the dataset so that researchers can easily count the number of weeks that a given book appeared on a list, as well as the first and last weeks that it appeared.
- year – the year of appearance
- week – the weekly issue of the bestseller list
- rank – the book’s rank on the list for that week
- title_id – a unique ID mapping titles to the unique titles dataset
- title – title of the novel, as reported by the New York Times
- author – author of the novel, as reported by the New York Times
This dataset also includes various persistent identifiers and information from OCLC:
- author_lccn – Author’s LCCN from id.loc.gov
- author_viaf – Author’s viaf.org cluster number
- author_wikidata – Author’s Wikidata Q number
- author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
- oclc_eholdings – from OCLC Classify – the electronic holdings count
- oclc_holdings – from OCLC Classify – the total holdings count
- oclc – a unique identifier for this volume as registered in WorldCat
- oclc_owi – from OCLC Classify – the Classify work identifier
- oclc_isbn – the ISBN from the OCLC MARC record
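Because each row is one entry for one week, per-title counts fall out of a simple grouping on title_id. The sketch below uses only the standard library and a few invented rows. It assumes that week values are ISO-formatted date strings (so that string comparison follows chronological order), which should be checked against the actual file. The function name `weeks_on_list` is hypothetical.

```python
from collections import defaultdict

# Invented rows that reuse the documented column names; not real list data.
sample_entries = [
    {"year": 1950, "week": "1950-01-01", "rank": 2, "title_id": 1},
    {"year": 1950, "week": "1950-01-08", "rank": 1, "title_id": 1},
    {"year": 1950, "week": "1950-01-08", "rank": 2, "title_id": 2},
    {"year": 1950, "week": "1950-01-22", "rank": 4, "title_id": 1},
]


def weeks_on_list(rows):
    """Map each title_id to its week count and its first and last week."""
    weeks_by_title = defaultdict(set)
    for row in rows:
        weeks_by_title[row["title_id"]].add(row["week"])
    return {
        title_id: {
            "total_weeks": len(weeks),
            # ASSUMPTION: ISO date strings, so min/max are chronological.
            "first_week": min(weeks),
            "last_week": max(weeks),
        }
        for title_id, weeks in weeks_by_title.items()
    }


week_summary = weeks_on_list(sample_entries)
```

Collecting weeks in a set means a title is counted once per week even if a row were accidentally duplicated.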
2. NYT Hardcover Fiction Bestseller List — Unique Titles
The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.
Data Table
Description
- id – an arbitrary unique id for the novel
- title – the title of the novel, as reported by the New York Times
- author – the author of the novel, as reported by the New York Times
- year – the first year that the novel appears on the bestseller list. Note that this year may be different from the publication year
- total_weeks – the total number of weeks the title was on the list
- first_week – the first week that the novel appears on the bestseller list
- debut_rank – the book’s bestseller rank in the week of its first appearance
- best_rank – the highest rank achieved by the title while on the list
This dataset also includes various persistent identifiers and information from OCLC:
- author_lccn – Author’s LCCN from id.loc.gov
- author_viaf – Author’s viaf.org cluster number
- author_wikidata – Author’s Wikidata Q number
- author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
- oclc_eholdings – from OCLC Classify – the electronic holdings count
- oclc_holdings – from OCLC Classify – the total holdings count
- oclc – a unique identifier for this volume as registered in WorldCat
- oclc_owi – from OCLC Classify – the Classify work identifier
- oclc_isbn – the ISBN from the OCLC MARC record
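The debut_rank and best_rank fields can be read as summaries of the weekly entries: the rank in the earliest week, and the numerically smallest rank overall (rank 1 being the top of the list). The following sketch shows one way such fields could be derived. It is an illustration under those assumptions, not the code used to produce this dataset, and both the rows and the function name `rank_summary` are invented.

```python
# Invented weekly entries for two hypothetical titles.
sample_entries = [
    {"week": "1950-01-15", "rank": 3, "title_id": 1},
    {"week": "1950-01-01", "rank": 5, "title_id": 1},
    {"week": "1950-01-08", "rank": 2, "title_id": 1},
    {"week": "1950-01-08", "rank": 7, "title_id": 2},
]


def rank_summary(rows):
    """Map each title_id to its first_week, debut_rank and best_rank."""
    summary = {}
    # ASSUMPTION: week values sort chronologically as strings (ISO dates).
    for row in sorted(rows, key=lambda r: r["week"]):
        title_id = row["title_id"]
        if title_id not in summary:
            # First time we see this title: this is its debut week.
            summary[title_id] = {
                "first_week": row["week"],
                "debut_rank": row["rank"],
                "best_rank": row["rank"],
            }
        else:
            # A smaller number is a better (higher) position on the list.
            current = summary[title_id]
            current["best_rank"] = min(current["best_rank"], row["rank"])
    return summary


title_ranks = rank_summary(sample_entries)
```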
3. NYT Hardcover Fiction Bestseller List — HathiTrust Metadata
Data Table
Collection and Creation
HathiTrust volume identifiers were matched based on string comparisons against the “Post45 HathiTrust Fiction” dataset described by Ted Underwood et al. (2020), specifically the shorttitle and author metadata fields.
First, author surnames were extracted heuristically based on spacing and punctuation. Then, title and author fields in both datasets were lowercased and stripped of punctuation. Two works were then considered a match if surnames were exact matches and the Times title field overlapped at the beginning of the HathiTrust shorttitle field. This yielded 4,978 matches. Note that this includes duplicates, as HathiTrust is a volume-level collection.
This conservative matching procedure was chosen over a more generous fuzzy matching procedure in order to maximize the accuracy of matches at the expense of recall. Manual inspection suggests that many of the missed matches were in fact absent from the HathiTrust collection, but the exact number of missed potential matches is uncertain.
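The matching rule described above (exact surname match plus a title that overlaps the start of shorttitle) can be sketched as follows. The surname heuristic here is a guess at what "spacing and punctuation" might mean: text before a comma if there is one, otherwise the last whitespace-separated token. The helper names `normalize`, `surname`, and `is_match` are hypothetical, and the example strings are placeholders rather than real records.

```python
import string

# Translation table that deletes ASCII punctuation only; typographic
# (curly) quotes and dashes would need extra handling.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def surname(author):
    """ASSUMED heuristic: 'Surname, Given' or 'Given Surname'."""
    if "," in author:
        return normalize(author.split(",", 1)[0])
    tokens = normalize(author).split()
    return tokens[-1] if tokens else ""


def is_match(nyt_title, nyt_author, ht_shorttitle, ht_author):
    """Exact surname match plus NYT title as a prefix of the shorttitle."""
    nyt_norm = normalize(nyt_title)
    if not nyt_norm:
        return False  # an empty title would trivially prefix-match anything
    same_surname = surname(nyt_author) == surname(ht_author)
    return same_surname and normalize(ht_shorttitle).startswith(nyt_norm)


# Placeholder inputs for illustration only.
hit = is_match(
    "A PLACEHOLDER TITLE", "Jane Q. Example",
    "A placeholder title : a novel", "Example, Jane Q.",
)
miss = is_match(
    "A PLACEHOLDER TITLE", "Jane Q. Example",
    "A placeholder title : a novel", "Sample, John",
)
```

A prefix test of this kind can still pair a short title with a longer, different title by the same author, which is one reason manual inspection of the matches remains useful.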
Description
- htid – unique volume ID from HathiTrust
- title_id – the ID for that title in the titles dataset. Note that since HathiTrust is organized around volumes rather than titles, this field contains duplicates, such as in the case of frequently-reprinted works.
- hathi_rights – the rights code from HathiTrust
HathiTrust Rights Codes (Source)
id | name | type | dscr |
---|---|---|---|
1 | pd | copyright | public domain |
2 | ic | copyright | in-copyright |
3 | op | copyright | out-of-print (implies in-copyright) |
4 | orph | copyright | copyright-orphaned (implies in-copyright) |
5 | und | copyright | undetermined copyright status |
6 | umall | access | available to UM affiliates and walk-in patrons (all campuses) |
7 | ic-world | access | in-copyright and permitted as world viewable by the copyright holder |
8 | nobody | access | available to nobody; blocked for all users |
9 | pdus | copyright | public domain only when viewed in the US |
10 | cc-by-3.0 | copyright | Creative Commons Attribution license, 3.0 Unported |
11 | cc-by-nd-3.0 | copyright | Creative Commons Attribution-NoDerivatives license, 3.0 Unported |
12 | cc-by-nc-nd-3.0 | copyright | Creative Commons Attribution-NonCommercial-NoDerivatives license, 3.0 Unported |
13 | cc-by-nc-3.0 | copyright | Creative Commons Attribution-NonCommercial license, 3.0 Unported |
14 | cc-by-nc-sa-3.0 | copyright | Creative Commons Attribution-NonCommercial-ShareAlike license, 3.0 Unported |
15 | cc-by-sa-3.0 | copyright | Creative Commons Attribution-ShareAlike license, 3.0 Unported |
16 | orphcand | copyright | orphan candidate - in 90-day holding period (implies in-copyright) |
17 | cc-zero | copyright | Creative Commons Zero license (implies pd) |
18 | und-world | access | undetermined copyright status and permitted as world viewable by the depositor |
19 | icus | copyright | in copyright in the US |
20 | cc-by-4.0 | copyright | Creative Commons Attribution 4.0 International license |
21 | cc-by-nd-4.0 | copyright | Creative Commons Attribution-NoDerivatives 4.0 International license |
22 | cc-by-nc-nd-4.0 | copyright | Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license |
23 | cc-by-nc-4.0 | copyright | Creative Commons Attribution-NonCommercial 4.0 International license |
24 | cc-by-nc-sa-4.0 | copyright | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license |
25 | cc-by-sa-4.0 | copyright | Creative Commons Attribution-ShareAlike 4.0 International license |
26 | pd-pvt | access | public domain but access limited due to privacy concerns |
27 | supp | access | suppressed from view; see note for details |
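
The rights codes can be used to select volumes by status, for example to find titles with at least one public-domain volume. The sketch below assumes that hathi_rights may hold either the short name (such as pd) or the numeric id from the table above. Which of the two the file actually uses should be checked. The rows, the `RIGHTS_NAMES` subset, and the helper `rights_name` are invented for this illustration.

```python
from collections import Counter

# Subset of the id-to-name mapping from the rights table above.
RIGHTS_NAMES = {1: "pd", 2: "ic", 9: "pdus", 17: "cc-zero"}


def rights_name(code):
    """Return the short rights name for a numeric id or a name."""
    text = str(code).strip()
    if text.isdigit():
        # Unknown ids are passed through unchanged as strings.
        return RIGHTS_NAMES.get(int(text), text)
    return text


# Invented volume rows; the htid values are placeholders.
sample_volumes = [
    {"htid": "vol.0001", "title_id": 1, "hathi_rights": "pd"},
    {"htid": "vol.0002", "title_id": 1, "hathi_rights": "ic"},
    {"htid": "vol.0003", "title_id": 2, "hathi_rights": "9"},
]

# How many volumes carry each rights code.
rights_counts = Counter(rights_name(v["hathi_rights"]) for v in sample_volumes)

# Titles with at least one volume marked plain "pd".
public_domain_titles = sorted(
    {v["title_id"] for v in sample_volumes if rights_name(v["hathi_rights"]) == "pd"}
)
```

Whether codes such as pdus or cc-zero should also count as open depends on the researcher's location and purpose, so the filter above is deliberately narrow.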
This dataset also includes various persistent identifiers and information from OCLC:
- author_lccn – Author’s LCCN from id.loc.gov
- author_viaf – Author’s viaf.org cluster number
- author_wikidata – Author’s Wikidata Q number
- author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
- oclc_eholdings – from OCLC Classify – the electronic holdings count
- oclc_holdings – from OCLC Classify – the total holdings count
- oclc – a unique identifier for this volume as registered in WorldCat
- oclc_owi – from OCLC Classify – the Classify work identifier
- oclc_isbn – the ISBN from the OCLC MARC record
Ethical Considerations
Researchers who use this dataset are encouraged to consider the limitations of drawing historical or cultural conclusions from bestseller data. Bestseller lists are not a transparent window into what the American public was “really reading” at a given historical moment; rather, they reflect editorial decisions about how and what to count. In particular, historical trends on this list are complicated by institutional shifts in book distribution that occurred during the period it covers. The increased importance of mall stores, chain stores, and retail distributors continually altered the composition of the bookstores surveyed by the New York Times (Laura J. Miller 2007). As such, the contents of this dataset likely reflect the purchasing habits of only a particular segment of the American population, namely, those who shop at malls and chain bookstores. This population was disproportionately suburban, white, and middle-class for much of the history of the list. The list likely undercounts sales at other outlets, such as independent bookstores and religious stores.
Users of this data should also be aware that hardcover sales at bookstores are especially unrepresentative of the broader book market in the early years of the “paperback revolution” after WWII, when most popular novels were sold in paperback format at non-bookstore outlets like drugstores. These sales are entirely uncounted on bestseller lists, leading to the conspicuous absence of authors like Erle Stanley Gardner and Mickey Spillane, two of the most popular novelists of the early postwar period.
The Times expanded its coverage to include nationwide bestsellers only in September of 1945. Before that, entries are based on sales in New York or other metropolitan areas. The exact methods used by the Times are not public, and the newspaper has come under periodic criticism for its bestseller reporting. For a full discussion of how the bestseller list is constructed, see Laura J. Miller (2000). This dataset does not reveal anything that might be considered sensitive. All of the data in this dataset is freely available in publicly accessible archives, as well as in the pages of the New York Times itself.
References
Reuse
Citation
@article{pruett2022,
author = {Pruett, Jordan},
editor = {Sinykin, Dan and Walsh, Melanie},
title = {*New {York} {Times*} {Hardcover} {Fiction} {Bestsellers}
(1931–2020)},
journal = {Post45 Data Collective},
date = {2022-02-01},
url = {https://data.post45.org/nyt-fiction-bestsellers-data/},
doi = {10.18737/CNJV1733p4520220211},
langid = {en},
abstract = {The *New York Times* Hardcover Fiction Bestsellers
includes datasets about bestselling books from 1931 to 2020.}
}