*New York Times* Hardcover Fiction Bestsellers (1931–2020)

Jordan Pruett

doi:10.18737/CNJV1733p4520220211

bestsellers

fiction

dataset

Author

Jordan Pruett

Published

February 1, 2022

DOI

10.18737/CNJV1733p4520220211

Abstract

The New York Times Hardcover Fiction Bestsellers includes datasets about bestselling books from 1931 to 2020.

The New York Times Hardcover Fiction Bestsellers (1931–2020) contains three related datasets:

The first dataset provides a tabular representation of the hardcover fiction bestseller list of The New York Times every week between 1931 and 2020.
The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.
The third dataset provides HathiTrust Digital Library identifiers for every unique title that appeared on the hardcover fiction bestseller list and that also has a corresponding volume in the HathiTrust Digital Library.

Previous research using similar data has been limited to partial segments of the list, such as the top 200 longest-running bestsellers since a certain date (Piper et al. 2016a) or bestsellers from only particular years (Sorensen 2007). By contrast, this dataset covers the full list since its inception in 1931, along with each reported work’s title, author(s), date of appearance, and rank.

Significance and Context

These datasets provide valuable metadata for researchers of 20th-century American literature working in fields such as cultural analytics, book and publishing history, and the sociology of literature. In cultural analytics, recent scholarship has used bestseller status as a rough proxy for popularity, enabling researchers to computationally model the textual boundaries between, for instance, popular and prizewinning fiction (Algee-Hewitt and McGurl 2015; Piper et al. 2016b; English 2016). Previous research of this kind has often relied on the Publishers Weekly annual bestseller list. Although Publishers Weekly also publishes a weekly list, it is not readily accessible to researchers. In contrast to the Publishers Weekly annual list, this dataset reports weekly bestsellers, and therefore captures a much broader subset of the historical literary marketplace.

This larger and more granular New York Times dataset presents researchers with a number of potential uses. First, existing experiments on bestsellers and prizewinners could be reproduced with this new data. The broader scope of this dataset is likely to dampen the apparent difference between prizewinners and bestsellers, as many prizewinners made it onto the Times list without making it onto that of Publishers Weekly. Second, the broader scope of the Times list provides a valuable resource for constructing corpora of historical popular literature. Weekly bestsellers have been neglected in humanities corpora relative to yearly bestsellers. Finally, the Times list could be used to support ongoing research at the intersection of literary and publishing history. As the most closely followed public-facing bestseller list, the Times list offers insight into the works considered valuable by publishers.

Collection and Creation

Data for the New York Times bestsellers was scraped from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931. Though the Hawes files are high-quality, they are only available as PDF images. Plain text was extracted from the Hawes files programmatically with the open-source Python library pdfminer. Though the Hawes files did not come as a structured or tabular dataset, they do report bestseller information in a relatively standardized format. This allowed author, title, date, and rank information to be extracted from the plain text with a mixture of regular expressions and logical operations.
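As a rough illustration of the regular-expression step, the sketch below parses entries out of extracted plain text. The line layout it assumes (a date line followed by lines of the form "rank TITLE, by Author.") and the names DATE_RE, ENTRY_RE, and parse_entries are hypothetical; the actual layout of the Hawes transcripts and the patterns used to build this dataset may well differ. The sample text is invented.

```python
import re
from datetime import datetime

# Assumed (hypothetical) layout: a date line such as "January 5, 1950",
# followed by entry lines such as "1 FIRST EXAMPLE NOVEL, by Jane Q. Writer."
DATE_RE = re.compile(r"^(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})$")
ENTRY_RE = re.compile(r"^(?P<rank>\d{1,2})\s+(?P<title>.+?), by (?P<author>.+?)\.?$")


def parse_entries(text):
    """Yield one dict per list entry found in the extracted plain text."""
    week = None
    for line in text.splitlines():
        line = line.strip()
        date_match = DATE_RE.match(line)
        if date_match:
            # Remember the most recent date line; entries below belong to it.
            week = datetime.strptime(date_match["date"], "%B %d, %Y").date()
            continue
        entry_match = ENTRY_RE.match(line)
        if entry_match and week is not None:
            yield {
                "year": week.year,
                "week": week.isoformat(),
                "rank": int(entry_match["rank"]),
                "title": entry_match["title"],
                "author": entry_match["author"],
            }


# Invented sample text, not taken from the real list.
sample = """January 5, 1950
1 FIRST EXAMPLE NOVEL, by Jane Q. Writer.
2 SECOND EXAMPLE NOVEL, by John Q. Author.
"""
rows = list(parse_entries(sample))
```

Tracking the current date as state while scanning lines is one way to combine regular expressions with the "logical operations" mentioned above: a line only counts as an entry once a date has been seen.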

At a later stage, persistent identifiers, such as VIAF, LCCN, and Wikidata identifiers, were added computationally by Matt Miller (Post45 DC Data Analyst). Miller also added book information from OCLC, a global library organization that aggregates information from more than 16,000 member libraries in more than 100 countries. This information includes how many copies of each edition are held by these libraries (oclc_holdings or oclc_eholdings). It was accessed through the OCLC Classify API, which was shut down in January 2024.

1. NYT Hardcover Fiction Weekly Bestseller List

Data Table

import {viewof dataSummaryView, Tabulator, viewof selectedColumns, viewof dataSet, tableContainer, fetchData, generateTabulatorTableFromCSV, progress, progressbar} from "8bb63a6cde9addff"

generateTabulatorTableFromCSV(
  "#table-container-nyt-lists",
  "https://raw.githubusercontent.com/Post45-Data-Collective/data/refs/heads/main/nyt_hardcover_fiction_bestsellers/nyt_hardcover_fiction_bestsellers-lists.csv",
  {
    displayedColumns: ["week", "year", "rank", "title", "author", "title_id",
                       "oclc_eholdings", "oclc_holdings", "oclc_isbn", "oclc_owi",
                       "author_authorized_heading", "author_lccn", "author_viaf", "author_wikidata"],
    columnPopups: [
      "Date of the bestseller list (week)",
      "Year of the bestseller list",
      "Rank on the list (1 = top)",
      "Title of the novel, as reported by the New York Times",
      "Author, as reported by the New York Times",
      "Internal title id mapping to the unique titles dataset",
      "OCLC electronic holdings count",
      "OCLC total holdings count",
      "OCLC ISBN",
      "OCLC Classify work identifier",
      "Author's NACO authorized heading",
      "Author's LCCN",
      "Author's VIAF cluster number",
      "Author's Wikidata Q number"
    ],
    columnWidths: { "rank": "60px", "year": "75px", "author": "120px" },
    rangeSliderColumns: ["year"],
    dateSliderColumns: ["week"],
    numericColumns: ["rank"],
    sortColumns: ["week"],
    sortOrders: ["desc"],
    buttonContainerId: "#button-container-nyt-lists",
    rawButtonId: "#download-raw-nyt-lists",
    urlCopyButtonId: "#copy-url-nyt-lists",
  }
);

Creative Commons License

This work is licensed under CC BY 4.0

Description

Each row of the dataset is a single “entry” on the list, that is, a single slot for a single week. For each week, there will typically be 10 or 15 works listed. However, since the Times has varied the number of bestsellers featured in a given week, there may be 3, 6, 7, 8, or 16. A single “entry” on the list is treated as the basic unit of the dataset so that researchers can easily count the number of weeks that a given book appeared on a list, as well as the first and last weeks that it appeared.

year – the year of appearance
week – the weekly issue of the bestseller list
rank – the book’s rank on the list for that week
title_id – a unique ID mapping titles to the unique titles dataset
title – title of the novel, as reported by the New York Times
author – author of the novel, as reported by the New York Times
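Because each row is one slot in one week, the counts mentioned above reduce to a simple grouping. The following is a minimal sketch using only the standard library; it assumes that week values are ISO-formatted dates (so that string comparison is chronological), and the sample rows and the helper name summarize_weeks are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the column layout described above (not real list data).
sample_csv = """year,week,rank,title_id,title,author
1950,1950-01-01,2,10,EXAMPLE NOVEL,Jane Q. Writer
1950,1950-01-08,1,10,EXAMPLE NOVEL,Jane Q. Writer
1950,1950-01-08,2,11,ANOTHER NOVEL,John Q. Author
1950,1950-01-15,3,10,EXAMPLE NOVEL,Jane Q. Writer
"""


def summarize_weeks(rows):
    """Return {title_id: (weeks_on_list, first_week, last_week)}."""
    weeks_by_title = defaultdict(set)
    for row in rows:
        weeks_by_title[row["title_id"]].add(row["week"])
    return {
        title_id: (len(weeks), min(weeks), max(weeks))
        for title_id, weeks in weeks_by_title.items()
    }


summary = summarize_weeks(csv.DictReader(io.StringIO(sample_csv)))
```

To run this against the published file, the StringIO object would be replaced with an open handle on the downloaded CSV. Note that csv.DictReader returns every value as a string, so title_id keys are strings here.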

This dataset also includes various persistent identifiers and information from OCLC:

author_lccn – Author’s LCCN from id.loc.gov
author_viaf – Author’s viaf.org cluster number
author_wikidata – Author’s Wikidata Q number
author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
oclc_eholdings – from OCLC Classify – the electronic holdings count
oclc_holdings – from OCLC Classify – the total holdings count
oclc – a unique identifier for this volume as registered in WorldCat
oclc_owi – from OCLC Classify – the Classify work identifier
oclc_isbn – the ISBN from the OCLC MARC record

2. NYT Hardcover Fiction Bestseller List — Unique Titles

The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period.

Data Table

generateTabulatorTableFromCSV(
  "#table-container-nyt-titles",
  "https://raw.githubusercontent.com/Post45-Data-Collective/data/refs/heads/main/nyt_hardcover_fiction_bestsellers/nyt_hardcover_fiction_bestsellers-titles.csv",
  {
    displayedColumns: ["first_week", "year", "id", "title", "author", "total_weeks",
                       "best_rank", "debut_rank", "oclc_holdings", "oclc_eholdings",
                       "author_authorized_heading", "author_lccn", "author_viaf", "author_wikidata",
                       "oclc", "oclc_isbn", "oclc_owi"],
    columnPopups: [
      "First week the title appeared on the bestseller list",
      "First year the title appeared on the list",
      "Arbitrary unique id for the title",
      "Title as reported by the New York Times",
      "Author as reported by the New York Times",
      "Total number of weeks the title was on the list",
      "Highest rank achieved by the title",
      "Bestseller rank in the week of first appearance",
      "OCLC total holdings count",
      "OCLC electronic holdings count",
      "Author's NACO authorized heading",
      "Author's LCCN",
      "Author's VIAF cluster number",
      "Author's Wikidata Q number",
      "OCLC unique volume identifier (WorldCat)",
      "OCLC ISBN",
      "OCLC Classify work identifier"
    ],
    rangeSliderColumns: ["year"],
    dateSliderColumns: ["first_week"],
    numericColumns: ["total_weeks", "best_rank", "debut_rank"],
    sortColumns: ["first_week"],
    sortOrders: ["desc"],
    buttonContainerId: "#button-container-nyt-titles",
    rawButtonId: "#download-raw-nyt-titles",
    urlCopyButtonId: "#copy-url-nyt-titles",
  }
);

Description

id – an arbitrary unique id for the novel
title – the title of the novel, as reported by the New York Times
author – the author of the novel, as reported by the New York Times
year – the first year that the novel appears on the bestseller list. Note that this year may be different from the publication year
total_weeks – the total number of weeks the title was on the list
first_week – the first week that the novel appears on the bestseller list
debut_rank – the book’s bestseller rank in the week of its first appearance
best_rank – the highest rank achieved by the title while on the list
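The two rank fields can be read as follows: debut_rank is the rank in the earliest week, and best_rank is the numerically smallest rank, since rank 1 is the top slot. The sketch below shows that reading on invented data. The function name rank_summary is hypothetical, and this is an assumption about how the fields relate to the weekly list rather than the code used to produce the dataset.

```python
def rank_summary(entries):
    """Compute debut and best rank from (week, rank) pairs for one title.

    Weeks are assumed to be ISO-formatted date strings, so sorting the
    pairs puts them in chronological order.
    """
    ordered = sorted(entries)
    debut_rank = ordered[0][1]                    # rank in the first week on the list
    best_rank = min(rank for _, rank in ordered)  # rank 1 is the top of the list
    return {"debut_rank": debut_rank, "best_rank": best_rank}


# Invented appearances for a single title, deliberately out of order.
appearances = [("1950-01-08", 1), ("1950-01-01", 4), ("1950-01-15", 3)]
result = rank_summary(appearances)
```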

This dataset also includes various persistent identifiers and information from OCLC:

author_lccn – Author’s LCCN from id.loc.gov
author_viaf – Author’s viaf.org cluster number
author_wikidata – Author’s Wikidata Q number
author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
oclc_eholdings – from OCLC Classify – the electronic holdings count
oclc_holdings – from OCLC Classify – the total holdings count
oclc – a unique identifier for this volume as registered in WorldCat
oclc_owi – from OCLC Classify – the Classify work identifier
oclc_isbn – the ISBN from the OCLC MARC record

3. NYT Hardcover Fiction Bestseller List — HathiTrust Metadata

Data Table

nyt_hathi_data = fetchData("https://raw.githubusercontent.com/Post45-Data-Collective/data/refs/heads/main/nyt_hardcover_fiction_bestsellers/nyt_hardcover_fiction_bestsellers-hathitrust_metadata.csv")

generateTabulatorTableFromCSV(
  "#table-container-nyt-hathi",
  "https://raw.githubusercontent.com/Post45-Data-Collective/data/refs/heads/main/nyt_hardcover_fiction_bestsellers/nyt_hardcover_fiction_bestsellers-hathitrust_metadata.csv",
  {
    displayedColumns: ["first_week", "year", "title", "author", "htid", "hathi_rights",
                       "best_rank", "debut_rank", "author_authorized_heading", "author_lccn",
                       "author_viaf", "author_wikidata_qid", "isbn", "oclc",
                       "oclc_eholdings", "oclc_holdings", "oclc_owi"],
    columnPopups: [
      "First week the title appeared on the bestseller list",
      "First year the title appeared on the list",
      "Title as reported by the New York Times",
      "Author as reported by the New York Times",
      "HathiTrust unique volume identifier",
      "HathiTrust rights code",
      "Highest rank achieved by the title",
      "Bestseller rank in the week of first appearance",
      "Author's NACO authorized heading",
      "Author's LCCN",
      "Author's VIAF cluster number",
      "Author's Wikidata Q number",
      "ISBN",
      "OCLC unique volume identifier (WorldCat)",
      "OCLC electronic holdings count",
      "OCLC total holdings count",
      "OCLC Classify work identifier"
    ],
    rangeSliderColumns: ["year"],
    dateSliderColumns: ["first_week"],
    numericColumns: ["best_rank", "debut_rank"],
    categoryColumns: ["hathi_rights"],
    sortColumns: ["first_week"],
    sortOrders: ["desc"],
    buttonContainerId: "#button-container-nyt-hathi",
    rawButtonId: "#download-raw-nyt-hathi",
    urlCopyButtonId: "#copy-url-nyt-hathi",
  }
);

Collection and Creation

HathiTrust volume identifiers were matched based on string comparisons against the “Post45 HathiTrust Fiction” dataset described by Ted Underwood et al. (2020), specifically the shorttitle and author metadata fields.

First, author surnames were extracted heuristically based on spacing and punctuation. Then, title and author fields in both datasets were lowercased and stripped of punctuation. Two works were then considered a match if surnames were exact matches and the Times title field overlapped at the beginning of the HathiTrust shorttitle field. This yielded 4,978 matches. Note that this includes duplicates, as HathiTrust is a volume-level collection.

This conservative matching procedure was chosen over a more generous fuzzy matching procedure in order to maximize the accuracy of matches at the expense of recall. Manual inspection suggests that many of the missed matches were in fact absent from the HathiTrust collection, but the exact number of missed potential matches is uncertain.
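The matching steps described above can be sketched roughly as follows. The surname heuristic shown here (text before a comma if there is one, otherwise the last word) and the helper names normalize, surname, and is_match are assumptions made for illustration; the heuristics actually used for this dataset are not specified in detail and may differ. The example strings are invented.

```python
import string

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def surname(author):
    """Heuristic surname: text before a comma if present, else the last word."""
    if "," in author:
        return normalize(author.split(",")[0])
    words = normalize(author).split()
    return words[-1] if words else ""


def is_match(nyt_title, nyt_author, hathi_shorttitle, hathi_author):
    """Exact surname match plus Times title as a prefix of the short title."""
    nyt_surname = surname(nyt_author)
    if not nyt_surname or nyt_surname != surname(hathi_author):
        return False
    title = normalize(nyt_title)
    return bool(title) and normalize(hathi_shorttitle).startswith(title)


# Invented example records.
matched = is_match("EXAMPLE NOVEL", "Jane Q. Writer",
                   "Example novel : a story", "Writer, Jane Q., 1900-")
unmatched = is_match("EXAMPLE", "Jane Writer", "An example", "Writer, Jane")
```

Requiring an exact surname and a title prefix rejects near misses that a fuzzy matcher would accept, which is the precision-over-recall trade-off described above.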

Description

htid – unique volume ID from HathiTrust
title_id – the ID for that title in the titles dataset. Note that since HathiTrust is organized around volumes rather than titles, this field contains duplicates, such as in the case of frequently reprinted works.
hathi_rights – the rights code from HathiTrust

All HathiTrust Rights Codes

HathiTrust Rights Codes (Source)

id	name	type	dscr
1	pd	copyright	public domain
2	ic	copyright	in-copyright
3	op	copyright	out-of-print (implies in-copyright)
4	orph	copyright	copyright-orphaned (implies copyright)
5	und	copyright	undetermined copyright status
6	umall	access	available to UM affiliates and walk-in patrons (all campuses)
7	ic-world	access	in-copyright and permitted as world viewable by the copyright holder
8	nobody	access	available to nobody; blocked for all users
9	pdus	copyright	public domain only when viewed in the US
10	cc-by-3.0	copyright	Creative Commons Attribution license, 3.0 Unported
11	cc-by-nd-3.0	copyright	Creative Commons Attribution-NoDerivatives license, 3.0 Unported
12	cc-by-nc-nd-3.0	copyright	Creative Commons Attribution-NonCommercial-NoDerivatives license, 3.0 Unported
13	cc-by-nc-3.0	copyright	Creative Commons Attribution-NonCommercial license, 3.0 Unported
14	cc-by-nc-sa-3.0	copyright	Creative Commons Attribution-NonCommercial-ShareAlike license, 3.0 Unported
15	cc-by-sa-3.0	copyright	Creative Commons Attribution-ShareAlike license, 3.0 Unported
16	orphcand	copyright	orphan candidate - in 90-day holding period (implies in-copyright)
17	cc-zero	copyright	Creative Commons Zero license (implies pd)
18	und-world	access	undetermined copyright status and permitted as world viewable by the depositor
19	icus	copyright	in copyright in the US
20	cc-by-4.0	copyright	Creative Commons Attribution 4.0 International license
21	cc-by-nd-4.0	copyright	Creative Commons Attribution-NoDerivatives 4.0 International license
22	cc-by-nc-nd-4.0	copyright	Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license
23	cc-by-nc-4.0	copyright	Creative Commons Attribution-NonCommercial 4.0 International license
24	cc-by-nc-sa-4.0	copyright	Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
25	cc-by-sa-4.0	copyright	Creative Commons Attribution-ShareAlike 4.0 International license
26	pd-pvt	access	public domain but access limited due to privacy concerns
27	supp	access	suppressed from view; see note for details
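Since title_id repeats across volumes, a common first step is to choose one volume per title, often restricted by rights status. The sketch below keeps the first volume per title whose rights code is a public-domain code from the table above. It assumes that the hathi_rights column stores the short name (for example "pd") rather than the numeric id, which should be checked against the file; the rows, the htid values, and the function name public_domain_volumes are invented for illustration.

```python
# Codes from the table above that indicate public-domain status.
# "pdus" is public domain only when viewed in the US.
PUBLIC_DOMAIN_CODES = {"pd", "pdus", "cc-zero"}


def public_domain_volumes(rows):
    """Return {title_id: htid}, keeping the first public-domain volume per title."""
    chosen = {}
    for row in rows:
        if row["hathi_rights"] in PUBLIC_DOMAIN_CODES:
            # setdefault keeps the first qualifying volume and ignores later ones.
            chosen.setdefault(row["title_id"], row["htid"])
    return chosen


# Invented rows; the htid values are placeholders, not real identifiers.
sample_rows = [
    {"title_id": "10", "htid": "example.0001", "hathi_rights": "ic"},
    {"title_id": "10", "htid": "example.0002", "hathi_rights": "pd"},
    {"title_id": "10", "htid": "example.0003", "hathi_rights": "pdus"},
    {"title_id": "11", "htid": "example.0004", "hathi_rights": "ic"},
]
selected = public_domain_volumes(sample_rows)
```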

This dataset also includes various persistent identifiers and information from OCLC:

author_lccn – Author’s LCCN from id.loc.gov
author_viaf – Author’s viaf.org cluster number
author_wikidata – Author’s Wikidata Q number
author_authorized_heading – Author’s authorized Name Authority Cooperative (NACO) heading
oclc_eholdings – from OCLC Classify – the electronic holdings count
oclc_holdings – from OCLC Classify – the total holdings count
oclc – a unique identifier for this volume as registered in WorldCat
oclc_owi – from OCLC Classify – the Classify work identifier
oclc_isbn – the ISBN from the OCLC MARC record

Ethical Considerations

Researchers who use this dataset are encouraged to consider the limitations of drawing historical or cultural conclusions from bestseller data. Bestseller lists are not a transparent window into what the American public was “really reading” at a given historical moment; rather, they reflect editorial decisions about how and what to count. In particular, historical trends on this list are complicated by institutional shifts in book distribution that occurred during the period that it covers. The increased importance of mall stores, chain stores, and retail distributors continually altered the composition of the bookstores surveyed by the New York Times (Miller 2007). As such, the contents of this dataset likely reflect the purchasing habits of only a particular segment of the American population, namely, those who shopped at malls and chain bookstores. This population was disproportionately suburban, white, and middle-class for much of the history of the list. The list likely undercounts sales at other outlets, such as independent bookstores and religious stores.

Users of this data should also be aware that hardcover sales at bookstores are especially unrepresentative of the broader book market in the early years of the “paperback revolution” after WWII, when most popular novels were sold in paperback format at non-bookstore outlets like drugstores. These sales are entirely uncounted on bestseller lists, leading to the conspicuous absence of authors like Erle Stanley Gardner and Mickey Spillane, two of the most popular novelists of the early postwar period.

The Times only expanded its coverage to include nationwide bestsellers in September of 1945. Before that, entries are based on sales in New York or other metropolitan areas. The exact methods used by the Times are not public, and the newspaper has come under periodic criticism for its bestseller reporting. For a full discussion of how the bestseller list is constructed, see Miller (2000). This dataset does not reveal anything that might be considered sensitive. All of the data in this dataset is freely available in publicly accessible archives, as well as in the pages of the New York Times itself.

References

Algee-Hewitt, Mark, and Mark McGurl. 2015. “Between Canon and Corpus: Six Perspectives on 20th-Century Novels.” Pamphlets of the Literary Lab 8.

English, James F. 2016. “Now, Not Now: Counting Time in Contemporary Fiction Studies.” Modern Language Quarterly 77 (3): 395–418. https://read.dukeupress.edu/modern-language-quarterly/article-abstract/77/3/395/19914.

Miller, Laura J. 2000. “The Best-Seller List as Marketing Tool and Historical Fiction.” Book History 3 (1): 286–304. https://muse.jhu.edu/pub/2/article/3606.

Miller, Laura J. 2007. Reluctant Capitalists: Bookselling and the Culture of Consumption. University of Chicago Press.

Piper, Andrew, and Eva Portelance. 2016a. “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading.” Post45: Peer Reviewed, May 10. https://post45.org/2016/05/how-cultural-capital-works-prizewinning-novels-bestsellers-and-the-time-of-reading/.

Piper, Andrew, and Eva Portelance. 2016b. “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading.” Post45: Peer Reviewed, May 10. https://post45.org/2016/05/how-cultural-capital-works-prizewinning-novels-bestsellers-and-the-time-of-reading/.

Sorensen, Alan T. 2007. “Bestseller Lists and Product Variety.” Journal of Industrial Economics 55 (4): 715–38. https://ideas.repec.org//a/bla/jindec/v55y2007i4p715-738.html.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@article{pruett2022,
  author = {Pruett, Jordan},
  editor = {Sinykin, Dan and McGrath, Laura},
  title = {*New {York} {Times*} {Hardcover} {Fiction} {Bestsellers}
    (1931–2020)},
  journal = {Post45 Data Collective},
  date = {2022-02-01},
  url = {https://data.post45.org/nyt-fiction-bestsellers-data/},
  doi = {10.18737/CNJV1733p4520220211},
  langid = {en},
  abstract = {The *New York Times* Hardcover Fiction Bestsellers
    includes datasets about bestselling books from 1931 to 2020.}
}

For attribution, please cite this work as:

Pruett, Jordan. 2022. “*New York Times* Hardcover Fiction Bestsellers (1931–2020).” Post45 Data Collective, accepted, February 1. https://doi.org/10.18737/CNJV1733p4520220211.