Post45 Data Collective

Table of Contents

  • Significance and Context
  • Collection and Creation
  • 1. NYT Hardcover Fiction Weekly Bestseller List
    • Data Table
    • Description
  • 2. NYT Hardcover Fiction Bestseller List — Unique Titles
    • Data Table
    • Description
  • 3. NYT Hardcover Fiction Bestseller List — HathiTrust Metadata
    • Data Table
    • Collection and Creation
    • Description
  • Ethical Considerations
  • References

New York Times Hardcover Fiction Bestsellers (1931–2020)

bestsellers
fiction
dataset
Author

Jordan Pruett

Published

February 1, 2022

Doi

10.18737/CNJV1733p4520220211

Abstract
The New York Times Hardcover Fiction Bestsellers includes datasets about bestselling books from 1931 to 2020.

The New York Times Hardcover Fiction Bestsellers (1931–2020) contains three related datasets:

  1. The first dataset provides a tabular representation of the hardcover fiction bestseller list of The New York Times every week between 1931 and 2020.

  2. The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.

  3. The third dataset provides HathiTrust Digital Library identifiers for every unique title that appeared on the hardcover fiction bestseller list and that also has a corresponding volume in the HathiTrust Digital Library.

Previous research using similar data has been limited to partial segments of the list, such as the top 200 longest-running bestsellers since a certain date (Piper et al. 2016a) or bestsellers from only particular years (Sorensen 2007). By contrast, this dataset covers the full list since its inception in 1931, along with each reported work’s title, author(s), date of appearance, and rank.

Significance and Context

These datasets provide valuable metadata for researchers of 20th-century American literature working in fields such as cultural analytics, book and publishing history, and the sociology of literature. In cultural analytics, recent scholarship has used bestseller status as a rough proxy for popularity, enabling researchers to computationally model the textual boundaries between, for instance, popular and prizewinning fiction (Algee-Hewitt and McGurl 2015; Piper et al. 2016b; English 2016). Previous research of this kind has often relied on the Publishers Weekly annual bestseller list. Although Publishers Weekly also publishes a weekly list, it is not readily accessible to researchers. In contrast to the Publishers Weekly annual list, this dataset reports weekly bestsellers, and therefore captures a much broader subset of the historical literary marketplace.

This larger and more granular New York Times dataset presents researchers with a number of potential uses. First, existing experiments on bestsellers and prizewinners could be reproduced with this new data. The broader scope of this dataset is likely to dampen the apparent difference between prizewinners and bestsellers, as many prizewinners made it onto the Times list without making it onto that of Publishers Weekly. Second, the broader scope of the Times list provides a valuable resource for constructing corpora of historical popular literature. Weekly bestsellers have been neglected in humanities corpora relative to yearly bestsellers. Finally, the Times list could be used to support ongoing research at the intersection of literary and publishing history. As the most closely followed public-facing bestseller list, the Times list offers insight into the works considered valuable by publishers.

Collection and Creation

Data for the New York Times bestsellers was scraped from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931. Though the Hawes files are high-quality, they are only available as PDF images. Plain text was extracted from the Hawes files programmatically with the open-source Python library pdfminer. Though the Hawes files do not come as a structured or tabular dataset, they report bestseller information in a relatively standardized format. This allowed author, title, date, and rank information to be extracted from the plain text with a mixture of regular expressions and logical operations.
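The extraction step described above might look roughly like the following Python sketch. The line layout it parses (a date heading followed by "rank TITLE, by Author" lines) is an assumption made for illustration, as are the function name parse_list_text and the invented sample text; the actual Hawes transcripts and the patterns used to build this dataset may differ.

```python
import re
from datetime import datetime

# Hypothetical layout: a date line opens each weekly list, followed by
# "<rank> <TITLE>, by <Author>" lines. The real transcripts may differ.
DATE_RE = re.compile(r"^(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})$")
ENTRY_RE = re.compile(r"^(?P<rank>\d{1,2})\s+(?P<title>.+?),\s+by\s+(?P<author>.+)$")


def parse_list_text(text):
    """Return one dict per list entry: year, week, rank, title, author."""
    rows = []
    current_week = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        date_match = DATE_RE.match(line)
        if date_match:
            # A date heading starts a new weekly list.
            current_week = datetime.strptime(date_match["date"], "%B %d, %Y").date()
            continue
        entry_match = ENTRY_RE.match(line)
        # Logical check: accept an entry only after a week heading has been seen.
        if entry_match and current_week is not None:
            rows.append({
                "year": current_week.year,
                "week": current_week.isoformat(),
                "rank": int(entry_match["rank"]),
                "title": entry_match["title"].strip(),
                "author": entry_match["author"].strip(),
            })
    return rows


# Invented sample text, not taken from the real list.
sample = """\
January 5, 1931
1 EXAMPLE TITLE ONE, by Jane Doe
2 EXAMPLE TITLE TWO, by John Roe
"""
rows = parse_list_text(sample)
```

Lines that match neither pattern are skipped silently in this sketch; a real pipeline would probably log them for manual review.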

At a later stage, persistent identifiers, such as VIAF, LCCN, and Wikidata identifiers, were added by Matt Miller (Post45 DC Data Analyst) computationally. Miller also added book information from OCLC—a global library organization that contains information from more than 16,000 member libraries in more than 100 countries—such as how many copies of each edition are held by these libraries (oclc_holdings or oclc_eholdings). This information was accessed through the OCLC Classify API, which was shut down in January 2024.

1. NYT Hardcover Fiction Weekly Bestseller List

Data Table


This work is licensed under CC BY 4.0

Description

Each row of the dataset is a single “entry” on the list, that is, a single slot for a single week. A typical week lists 10 or 15 works. However, because the Times has varied the number of bestsellers featured over the years, some weeks list 3, 6, 7, 8, or 16 works. A single “entry” on the list is treated as the basic unit of the dataset so that researchers can easily count the number of weeks that a given book appeared on the list, as well as the first and last weeks that it appeared.

  • year – the year of appearance
  • week – the weekly issue of the bestseller list
  • rank – the book’s rank on the list for that week
  • title_id – a unique ID mapping titles to the unique titles dataset
  • title – title of the novel, as reported by the New York Times
  • author – author of the novel, as reported by the New York Times

This dataset also includes various persistent identifiers and information from OCLC:

  • author_lccn – Author’s LCCN from id.loc.gov
  • author_viaf – Author’s viaf.org cluster number
  • author_wikidata – Author’s Wikidata Q number
  • author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
  • oclc_eholdings – from OCLC Classify – the electronic holdings count
  • oclc_holdings – from OCLC Classify – the total holdings count
  • oclc – a unique identifier for this volume as registered in WorldCat
  • oclc_owi – from OCLC Classify – the Classify work identifier
  • oclc_isbn – the ISBN from the OCLC MARC record
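Because each row is one slot in one week, the counts mentioned in the description above reduce to simple tallies. The following Python sketch uses invented rows; it assumes that the week field holds ISO-formatted dates (so that string comparison is chronological) and that a title occupies at most one slot per week.

```python
from collections import Counter

# Invented rows shaped like the weekly dataset; values are illustrative only.
entries = [
    {"week": "1931-10-12", "rank": 1, "title_id": 101},
    {"week": "1931-10-12", "rank": 2, "title_id": 102},
    {"week": "1931-10-19", "rank": 1, "title_id": 102},
    {"week": "1931-10-19", "rank": 2, "title_id": 101},
    {"week": "1931-10-19", "rank": 3, "title_id": 103},
]

# Number of slots the list had in each week.
list_sizes = Counter(row["week"] for row in entries)

# Number of weekly appearances per title (assumes one slot per title per week).
weeks_per_title = Counter(row["title_id"] for row in entries)

# First and last week of appearance per title.
# Assumes ISO dates, so comparing strings compares dates.
spans = {}
for row in entries:
    first, last = spans.get(row["title_id"], (row["week"], row["week"]))
    spans[row["title_id"]] = (min(first, row["week"]), max(last, row["week"]))
```

Tallying list_sizes is also a quick way to find the weeks in which the list had an unusual number of entries.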

2. NYT Hardcover Fiction Bestseller List — Unique Titles

The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.

Data Table

Description

  • id – an arbitrary unique id for the novel
  • title – the title of the novel, as reported by the New York Times
  • author – the author of the novel, as reported by the New York Times
  • year – the first year that the novel appears on the bestseller list. Note that this year may be different from the publication year
  • total_weeks – the total number of weeks the title was on the list
  • first_week – the first week that the novel appears on the bestseller list
  • debut_rank – the book’s bestseller rank in the week of its first appearance
  • best_rank – the highest rank achieved by the title while on the list

This dataset also includes various persistent identifiers and information from OCLC:

  • author_lccn – Author’s LCCN from id.loc.gov
  • author_viaf – Author’s viaf.org cluster number
  • author_wikidata – Author’s Wikidata Q number
  • author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
  • oclc_eholdings – from OCLC Classify – the electronic holdings count
  • oclc_holdings – from OCLC Classify – the total holdings count
  • oclc – a unique identifier for this volume as registered in WorldCat
  • oclc_owi – from OCLC Classify – the Classify work identifier
  • oclc_isbn – the ISBN from the OCLC MARC record
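The title-level fields listed above can in principle be derived from the weekly entries. The Python sketch below shows one way to do so, under the assumptions that weeks are ISO-formatted date strings and that rank 1 is the top slot. The function name summarize_titles and the sample rows are invented for illustration, and this is not necessarily how the published table was built.

```python
def summarize_titles(entries):
    """Collapse weekly entries into one record per title_id."""
    titles = {}
    # Sort chronologically so the first row seen for a title is its debut.
    for row in sorted(entries, key=lambda r: r["week"]):
        record = titles.get(row["title_id"])
        if record is None:
            titles[row["title_id"]] = {
                "id": row["title_id"],
                "year": int(row["week"][:4]),  # year of first appearance
                "first_week": row["week"],
                "debut_rank": row["rank"],
                "best_rank": row["rank"],
                "total_weeks": 1,
            }
        else:
            record["total_weeks"] += 1
            # Rank 1 is the top slot, so the best rank is the numeric minimum.
            record["best_rank"] = min(record["best_rank"], row["rank"])
    return titles


# Invented rows shaped like the weekly dataset.
entries = [
    {"week": "1931-10-26", "rank": 2, "title_id": 101},
    {"week": "1931-10-12", "rank": 3, "title_id": 101},
    {"week": "1931-10-19", "rank": 1, "title_id": 101},
    {"week": "1931-10-19", "rank": 2, "title_id": 102},
]
titles = summarize_titles(entries)
```

Note that "best" here means the numerically lowest rank, which is why the sketch uses min rather than max.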

3. NYT Hardcover Fiction Bestseller List — HathiTrust Metadata

Data Table

Collection and Creation

HathiTrust volume identifiers were matched based on string comparisons against the “Post45 HathiTrust Fiction” dataset described by Ted Underwood et al. (2020), specifically the shorttitle and author metadata fields.

First, author surnames were extracted heuristically based on spacing and punctuation. Then, title and author fields in both datasets were lowercased and stripped of punctuation. Two works were then considered a match if their surnames matched exactly and the Times title matched the beginning of the HathiTrust shorttitle field. This yielded 4,978 matches. Note that this count includes duplicates, as HathiTrust is a volume-level collection.

This conservative matching procedure was chosen over a more generous fuzzy matching procedure in order to maximize the accuracy of matches at the expense of recall. Manual inspection suggests that many of the missed matches were in fact absent from the HathiTrust collection, but the exact number of missed potential matches is uncertain.
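The matching rule described above can be sketched in Python as follows. The surname heuristic shown here (text before a comma, otherwise the last whitespace-separated token) is an assumption standing in for the unspecified heuristic, and the function names and sample strings are invented.

```python
import string

# Translation table that deletes ASCII punctuation (curly quotes are not covered).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def surname(author):
    """Assumed heuristic: text before a comma ("Last, First"),
    otherwise the final token ("First Last")."""
    if "," in author:
        candidate = author.split(",", 1)[0]
    else:
        tokens = author.split()
        candidate = tokens[-1] if tokens else ""
    return normalize(candidate)


def is_match(nyt_title, nyt_author, hathi_shorttitle, hathi_author):
    """Exact surname match plus Times title as a prefix of the shorttitle."""
    title = normalize(nyt_title)
    same_surname = surname(nyt_author) == surname(hathi_author)
    return bool(title) and same_surname and normalize(hathi_shorttitle).startswith(title)


# Invented examples, not drawn from either dataset.
hit = is_match("The Example Novel", "Jane Doe", "The example novel : a story", "Doe, Jane, 1900-")
miss_title = is_match("Example", "Jane Doe", "Another example", "Doe, Jane")
miss_author = is_match("The Example Novel", "Jane Doe", "The example novel", "Roe, Jane")
```

A prefix test at the string level is strict about word order but would still accept a short Times title that merely begins a longer, unrelated shorttitle by the same surname, which is one reason manual inspection remains useful.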

Description

  • htid – unique volume ID from HathiTrust
  • title_id – the ID for that title in the titles dataset. Note that since HathiTrust is organized around volumes rather than titles, this field contains duplicates, such as in the case of frequently reprinted works.
  • hathi_rights – the rights code from HathiTrust

HathiTrust Rights Codes (Source)

id | name | type | description
1 | pd | copyright | public domain
2 | ic | copyright | in-copyright
3 | op | copyright | out-of-print (implies in-copyright)
4 | orph | copyright | copyright-orphaned (implies copyright)
5 | und | copyright | undetermined copyright status
6 | umall | access | available to UM affiliates and walk-in patrons (all campuses)
7 | ic-world | access | in-copyright and permitted as world viewable by the copyright holder
8 | nobody | access | available to nobody; blocked for all users
9 | pdus | copyright | public domain only when viewed in the US
10 | cc-by-3.0 | copyright | Creative Commons Attribution license, 3.0 Unported
11 | cc-by-nd-3.0 | copyright | Creative Commons Attribution-NoDerivatives license, 3.0 Unported
12 | cc-by-nc-nd-3.0 | copyright | Creative Commons Attribution-NonCommercial-NoDerivatives license, 3.0 Unported
13 | cc-by-nc-3.0 | copyright | Creative Commons Attribution-NonCommercial license, 3.0 Unported
14 | cc-by-nc-sa-3.0 | copyright | Creative Commons Attribution-NonCommercial-ShareAlike license, 3.0 Unported
15 | cc-by-sa-3.0 | copyright | Creative Commons Attribution-ShareAlike license, 3.0 Unported
16 | orphcand | copyright | orphan candidate - in 90-day holding period (implies in-copyright)
17 | cc-zero | copyright | Creative Commons Zero license (implies pd)
18 | und-world | access | undetermined copyright status and permitted as world viewable by the depositor
19 | icus | copyright | in copyright in the US
20 | cc-by-4.0 | copyright | Creative Commons Attribution 4.0 International license
21 | cc-by-nd-4.0 | copyright | Creative Commons Attribution-NoDerivatives 4.0 International license
22 | cc-by-nc-nd-4.0 | copyright | Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license
23 | cc-by-nc-4.0 | copyright | Creative Commons Attribution-NonCommercial 4.0 International license
24 | cc-by-nc-sa-4.0 | copyright | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
25 | cc-by-sa-4.0 | copyright | Creative Commons Attribution-ShareAlike 4.0 International license
26 | pd-pvt | access | public domain but access limited due to privacy concerns
27 | supp | access | suppressed from view; see note for details

This dataset also includes various persistent identifiers and information from OCLC:

  • author_lccn – Author’s LCCN from id.loc.gov
  • author_viaf – Author’s viaf.org cluster number
  • author_wikidata – Author’s Wikidata Q number
  • author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
  • oclc_eholdings – from OCLC Classify – the electronic holdings count
  • oclc_holdings – from OCLC Classify – the total holdings count
  • oclc – a unique identifier for this volume as registered in WorldCat
  • oclc_owi – from OCLC Classify – the Classify work identifier
  • oclc_isbn – the ISBN from the OCLC MARC record
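As one possible use of the hathi_rights field, the Python sketch below selects titles that have at least one volume whose rights code the table above describes as public domain. It assumes the column stores the short code name (for example "pd") rather than the numeric id; the rows and htid values are placeholders, not real HathiTrust identifiers.

```python
# Codes that the rights table describes as public domain in at least some
# jurisdiction: pd, pdus (US only), and cc-zero (implies pd).
PUBLIC_DOMAIN_CODES = {"pd", "pdus", "cc-zero"}

# Invented rows shaped like the HathiTrust metadata table.
volumes = [
    {"htid": "example.0001", "title_id": 101, "hathi_rights": "ic"},
    {"htid": "example.0002", "title_id": 101, "hathi_rights": "pd"},
    {"htid": "example.0003", "title_id": 102, "hathi_rights": "pdus"},
    {"htid": "example.0004", "title_id": 103, "hathi_rights": "ic"},
]

# Keep only volumes whose rights code is in the public-domain set.
public_domain = [v for v in volumes if v["hathi_rights"] in PUBLIC_DOMAIN_CODES]

# Titles with at least one such volume; a set removes volume-level duplicates.
title_ids = {v["title_id"] for v in public_domain}
```

Since pdus applies only to viewing from within the US, researchers elsewhere may prefer to drop that code from the set.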

Ethical Considerations

Researchers who use this dataset are encouraged to consider the limitations of drawing historical or cultural conclusions from bestseller data. Bestseller lists are not a transparent window into what the American public was “really reading” at a given historical moment; rather, they reflect editorial decisions about how and what to count. In particular, historical trends on this list are complicated by institutional shifts in book distribution that occurred during the period it covers. The increased importance of mall stores, chain stores, and retail distributors continually altered the composition of the bookstores surveyed by the New York Times (Laura J. Miller 2007). As such, the contents of this dataset likely reflect the purchasing habits of only a particular segment of the American population, namely, those who shopped at malls and chain bookstores. This population was disproportionately suburban, white, and middle-class for much of the history of the list. The list likely undercounts sales at other outlets, such as independent bookstores and religious stores.

Users of this data should also be aware that hardcover sales at bookstores are especially unrepresentative of the broader book market in the early years of the “paperback revolution” after WWII, when most popular novels were sold in paperback format at non-bookstore outlets like drugstores. These sales are entirely uncounted on bestseller lists, leading to the conspicuous absence of authors like Erle Stanley Gardner and Mickey Spillane, two of the most popular novelists of the early postwar period.

The Times expanded its coverage to include nationwide bestsellers only in September of 1945. Before that, entries are based on sales in New York or other metropolitan areas. The exact methods used by the Times are not public, and the newspaper has come under periodic criticism for its bestseller reporting. For a full discussion of how the bestseller list is constructed, see Laura J. Miller (2000). This dataset does not reveal anything that might be considered sensitive. All of the data in this dataset is freely available in publicly accessible archives, as well as in the pages of the New York Times itself.

References

Algee-Hewitt, Mark, and Mark McGurl. 2015. “Between Canon and Corpus: Six Perspectives on 20th-Century Novels.” Pamphlets of the Literary Lab 8.
English, James F. 2016. “Now, Not Now: Counting Time in Contemporary Fiction Studies.” Modern Language Quarterly 77 (3): 395–418. https://read.dukeupress.edu/modern-language-quarterly/article-abstract/77/3/395/19914.
Miller, Laura J. 2000. “The Best-Seller List as Marketing Tool and Historical Fiction.” Book History 3 (1): 286–304. https://muse.jhu.edu/pub/2/article/3606.
Miller, Laura J. 2007. Reluctant Capitalists: Bookselling and the Culture of Consumption. Chicago, IL: University of Chicago Press.
Piper, Andrew, and Eva Portelance. 2016a. “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading.” Post45: Peer Reviewed, May. https://post45.org/2016/05/how-cultural-capital-works-prizewinning-novels-bestsellers-and-the-time-of-reading/.
———. 2016b. “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading.” Post45: Peer Reviewed, May. https://post45.org/2016/05/how-cultural-capital-works-prizewinning-novels-bestsellers-and-the-time-of-reading/.
Sorensen, Alan T. 2007. “Bestseller Lists And Product Variety.” Journal of Industrial Economics 55 (4): 715–38. https://ideas.repec.org//a/bla/jindec/v55y2007i4p715-738.html.

Reuse

CC BY 4.0

Citation

BibTeX citation:
@article{pruett2022,
  author = {Pruett, Jordan},
  editor = {Sinykin, Dan and Walsh, Melanie},
  title = {*New {York} {Times*} {Hardcover} {Fiction} {Bestsellers}
    (1931–2020)},
  journal = {Post45 Data Collective},
  date = {2022-02-01},
  url = {https://data.post45.org/nyt-fiction-bestsellers-data/},
  doi = {10.18737/CNJV1733p4520220211},
  langid = {en},
  abstract = {The *New York Times* Hardcover Fiction Bestsellers
    includes datasets about bestselling books from 1931 and 2020.}
}
For attribution, please cite this work as:
Pruett, Jordan. 2022. “*New York Times* Hardcover Fiction Bestsellers (1931–2020).” Edited by Dan Sinykin and Melanie Walsh. Post45 Data Collective, February. https://doi.org/10.18737/CNJV1733p4520220211.
 
