You have a great research question that you want to answer with text data mining (TDM) methods, and you've got some Python under your belt or you've decided to see what you can learn from a browser-based tool like Voyant. You're ready to get started on a computational text analysis project. But wait!
Where do you get the texts?
Finding usable data -- full-text collections of novels, newspaper articles, scholarly papers, or other content -- can be challenging because of license restrictions and other roadblocks. (Please don't scrape an entire library database: providers will typically respond by shutting down access for the entire campus.)
Fortunately, the Library is here to help!
Here are some popular choices from our guide:
- HathiTrust Research Center (HTRC): Download ngrams for 14 million books (similar to the content in Google Books) or analyze HTRC's full-text collection through its Data Capsule program.
- Project Gutenberg's mirrored sites: Over 50,000 public domain ebooks, with a strength in literature. The mirrored sites allow you to scrape books at scale.
- The New York Times Annotated Corpus: 1.8 million articles from the New York Times between January 1, 1987 and June 19, 2007.
- JSTOR Data for Research: Download word frequencies, citations, key terms, and ngrams for scholarly journal articles in JSTOR.
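Several of these sources hand you ngrams or word frequencies rather than full text. If you're working from plain-text files instead (say, public domain books), you can compute similar counts yourself. Here's a minimal Python sketch of the idea. The function names `tokenize` and `count_ngrams` are our own, the tokenizer is deliberately crude, and the sample sentence stands in for a file you would read from disk.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, n=1):
    """Count every run of n consecutive tokens (n=1 gives word frequencies)."""
    # zip over n staggered copies of the token list to form the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Stand-in for the contents of a text file you have permission to use
sample = "It was the best of times, it was the worst of times."

tokens = tokenize(sample)
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

For a real project you'd likely want a more careful tokenizer (one that handles hyphens, accented characters, and numbers), but the counting step stays much the same.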
Campus experts are available by email at firstname.lastname@example.org to answer questions or help you figure out access to data not already on our list. Happy computing!