Public text data

To help you get started, we've compiled a variety of datasets and APIs to draw inspiration from. Many of these datasets have already been cleaned and normalized, so they are ready to be explored with AI tools. They are often intended for research use only. If you want to use the data in your startup, be sure to read the associated license agreements to check for commercial restrictions. You are also not restricted to basing your idea on the datasets below: you may discover other open-source datasets that inspire you, or you may bring your own proprietary datasets.

And if there’s a dataset you think we should add to the list, please send it to us.


Contents

  1. Data categories
  2. Data sources: most well-known
  3. Text-related APIs
  4. Hidden gems

Data categories

1. Web Pages and Articles: Textual content from websites, blogs, news articles, and online forums. It covers a wide range of topics and can be used for web scraping, sentiment analysis, and information retrieval.

2. Books and Literature: Digitized texts from classic literature, novels, academic books, and research papers. It can be used for language modeling, text summarization, and topic modeling.

3. Social Media Texts: User-generated content from social media platforms like tweets, Facebook posts, and Instagram captions. Useful for sentiment analysis, social trends analysis, and opinion mining.

4. Question-Answer Data: Pairs of questions and corresponding answers from forums, Q&A platforms, and conversational datasets. Used for natural language understanding and question-answering tasks.

5. Emails and Correspondence: Text data from emails, letters, and other correspondence. Often used for email classification and filtering tasks.

6. Reviews and Ratings: User reviews and ratings for products, services, restaurants, etc. Valuable for sentiment analysis and opinion mining.

7. Transcripts and Subtitles: Textual transcripts of audio and video content, including movie subtitles and speech transcripts. Used for speech recognition and language understanding tasks.

8. News and Media Articles: Texts from news outlets, magazines, and media sources. Useful for text categorization, topic modeling, and sentiment analysis.

9. Scientific Publications: Texts from academic papers, journals, conference proceedings, and scientific articles. Valuable for research, text summarization, and citation analysis.

10. Dictionaries and Lexicons: Lexical resources, word lists, and dictionaries that can be used for sentiment analysis and language processing.

11. Chat Logs and Conversations: Textual conversations and chat logs, useful for dialogue systems and chatbot training.

12. Legal Texts: Legal documents, contracts, court decisions, and statutes. Often used for legal text classification and information retrieval.

13. Health and Medical Texts: Medical articles, health records, and clinical notes. Valuable for medical text analysis and natural language processing in healthcare.

14. Wikipedia and WikiText: Texts from Wikipedia articles and other wiki platforms. Valuable for knowledge extraction and language modeling.

15. Educational Resources: Educational materials, textbooks, and learning resources. Useful for educational applications and content analysis.

16. Sentiment Datasets: Datasets with labeled sentiments (positive, negative, neutral) for sentiment analysis tasks.

17. Translation Datasets: Parallel texts in multiple languages, useful for machine translation and cross-lingual tasks.

18. Natural Language Inference (NLI) Datasets: Datasets with sentence pairs and entailment labels for natural language understanding tasks.

19. Image Captions: Textual descriptions or captions associated with images, commonly used for image captioning tasks.

20. Text-to-Speech (TTS) Datasets: Text data used for training text-to-speech systems.

Data sources: most well-known

1. Common Crawl: An open repository of web crawl data, containing a vast amount of text from various websites and domains. - Website: https://commoncrawl.org/

2. Project Gutenberg: Provides a large collection of free eBooks, including classic literature and historical texts. - Website: https://www.gutenberg.org/

3. Open Library: A project that aims to provide access to every book ever published, including full-text access to some works. - Website: https://openlibrary.org/

4. Internet Archive: A vast digital library with texts, audio, video, and other formats, including books, articles, and documents. - Website: https://archive.org/

5. PubMed Central (PMC): A repository of full-text scientific articles in the biomedical and life sciences. - Website: https://www.ncbi.nlm.nih.gov/pmc/

6. arXiv: A repository of preprints in various fields of science, including physics, mathematics, computer science, and more. - Website: https://arxiv.org/

7. DBpedia: A crowd-sourced knowledge graph that extracts structured data from Wikipedia and makes it available for use. - Website: https://wiki.dbpedia.org/

8. Wikidata: A free and open knowledge graph that provides structured data from Wikipedia and other Wikimedia projects. - Website: https://www.wikidata.org/

9. Project MUSE: Provides access to academic journals and books in the humanities and social sciences. - Website: https://muse.jhu.edu/

10. U.S. Census Bureau: Offers a wide range of public data, including statistics and reports on various topics related to the United States. - Website: https://www.census.gov/

11. Data.gov: The U.S. government's open data portal, providing access to a diverse collection of datasets, including text data. - Website: https://www.data.gov/

12. European Data Portal: Offers access to public datasets from European countries and institutions, including text data. - Website: https://www.europeandataportal.eu/

13. Quandl: A platform for financial and economic data, including text-based data like news articles and financial reports. - Website: https://www.quandl.com/

14. Kaggle: A platform that hosts a wide range of public datasets, including text data for natural language processing tasks. - Website: https://www.kaggle.com/

15. UCI Machine Learning Repository: Provides various datasets, some of which include text data for text mining and sentiment analysis. - Website: https://archive.ics.uci.edu/ml/index.php

16. NLP Progress: A collection of resources and datasets for natural language processing tasks. - Website: https://nlpprogress.com/

17. COVID-19-TweetIDs: A collection of IDs of COVID-19 related tweets for research and analysis. - Website: https://github.com/echen102/COVID-19-TweetIDs

18. Reddit: A social media platform with various subreddits containing user-generated text content on diverse topics. - Website: https://www.reddit.com/

19. Twitter API: Provides access to public tweets on specific topics or hashtags for research and analysis. - Website: https://developer.twitter.com/en/docs/twitter-api

20. Wikipedia Dumps: Offers XML dumps of Wikipedia articles, useful for large-scale text analysis. - Website: https://dumps.wikimedia.org/

21. NIST Text REtrieval Conference (TREC): Provides test collections and datasets for information retrieval research. - Website: https://trec.nist.gov/

22. WikiText: A large collection of Wikipedia articles with minimal markup, useful for language modeling tasks. - Website: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

23. CoNLL: A series of shared tasks and datasets for natural language processing and computational linguistics. - Website: https://www.conll.org/

24. Sentiment140: A dataset of tweets labeled for sentiment analysis (positive, negative, neutral). - Website: http://help.sentiment140.com/for-students

25. Amazon Product Reviews: A dataset of product reviews from Amazon, often used for sentiment analysis tasks. - Website: https://nijianmo.github.io/amazon/index.html

26. Reuters Corpus: A collection of news articles from the Reuters news agency, useful for text classification and information retrieval. - Website: https://trec.nist.gov/data/reuters/reuters.html

27. IMDb Reviews: A dataset of movie reviews from IMDb, commonly used for sentiment analysis. - Website: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

28. WikiSQL: A dataset for text-to-SQL tasks, involving natural language questions and SQL queries over Wikipedia tables. - Website: https://github.com/salesforce/WikiSQL

29. Enron Email Dataset: A collection of emails from the Enron Corporation, useful for email analysis and natural language processing. - Website: https://www.cs.cmu.edu/~enron/

30. SNLI: The Stanford Natural Language Inference (SNLI) Corpus, containing sentence pairs labeled as entailment, contradiction, or neutral. - Website: https://nlp.stanford.edu/projects/snli/

31. Google Books Ngrams: A dataset of n-grams (word sequences) extracted from Google Books, useful for language modeling and linguistics research. - Website: https://books.google.com/ngrams

32. WebText: A collection of text from the web, extracted from URLs and commonly used for language modeling tasks. - Website: https://www.tensorflow.org/datasets/catalog/webtext

33. Multi30k: A dataset of sentence-based translations in multiple languages, commonly used for machine translation tasks. - Website: https://github.com/multi30k/dataset

34. OpenAI GPT-2 Dataset: A dataset of articles and text generated by the GPT-2 language model. - Website: https://github.com/openai/gpt-2-output-dataset

35. BookCorpus: A large collection of text from books, commonly used for language modeling tasks. - Website: https://yknzhu.wixsite.com/mbweb

36. AG News Corpus: A dataset of news articles drawn from AG's corpus, useful for text classification tasks. - Website: https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

37. Quora Question Pairs: A dataset of question pairs from Quora, commonly used for question similarity tasks. - Website: https://www.kaggle.com/c/quora-question-pairs

38. SQuAD: The Stanford Question Answering Dataset, containing question-answer pairs on various articles from Wikipedia. - Website: https://rajpurkar.github.io/SQuAD-explorer/

39. 20 Newsgroups: A collection of newsgroup documents, commonly used for text classification and topic modeling. - Website: http://qwone.com/~jason/20Newsgroups/

40. LAMBADA: A dataset of narrative passages in which the final word must be predicted, often used to evaluate how well language models use long-range context. - Website: https://zenodo.org/record/2630551

41. MultiNLI: The Multi-Genre Natural Language Inference corpus, containing sentence pairs from multiple genres annotated with entailment labels. - Website: https://cims.nyu.edu/~sbowman/multinli/

42. Blogger Corpus: A dataset of blog posts from different bloggers, useful for stylistic analysis and author profiling. - Website: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

43. COCO Captions: A dataset of image captions from the COCO dataset, commonly used for image captioning tasks. - Website: https://cocodataset.org/#download

44. Amazon Fine Food Reviews: A dataset of food reviews from Amazon, useful for sentiment analysis and text classification. - Website: https://www.kaggle.com/snap/amazon-fine-food-reviews

45. OpenSubtitles: A collection of movie subtitles in multiple languages, useful for language modeling and translation tasks. - Website: http://opus.nlpl.eu/OpenSubtitles-v2018.php

46. SNLI-VE: A dataset for visual entailment, containing image-sentence pairs with entailment labels. - Website: https://github.com/necla-ml/SNLI-VE

47. GPT-3 Prompt Engineering Dataset: A dataset of prompts and completions used to fine-tune GPT-3 for specific tasks. - Website: https://github.com/openai/gpt-3.5-turbo/tree/main/data

48. Craigslist Corpus: A dataset of Craigslist ads, useful for text classification and analysis of classifieds data. - Website: https://www.kaggle.com/rmisra/clothing-fit-data-for-size-recommendation

49. Yelp Reviews: A dataset of user reviews from Yelp, commonly used for sentiment analysis and text classification. - Website: https://www.yelp.com/dataset

50. TED Talks Transcripts: A dataset of TED Talk transcripts, useful for language modeling and topic analysis. - Website: https://wit3.fbk.eu/mt.php?release=2016-01

51. Tatoeba: A collection of sentences and their translations. - Website: https://tatoeba.org/en

52. ChangeMyView: A dataset drawn from the “ChangeMyView” subreddit. - Website: https://chenhaot.com/pages/changemyview.html

53. TriviaQA: A large-scale dataset for reading comprehension and question answering. - Website: http://nlp.cs.washington.edu/triviaqa/

54. Cornell Movie Dialog Corpus: A large, metadata-rich collection of fictional conversations extracted from raw movie scripts. - Website: https://www.cs.cornell.edu/~cristian//Cornell_Movie-Dialogs_Corpus.html

55. Fake News Dataset: Text and metadata from fake and biased news sources around the web. - Website: https://www.kaggle.com/datasets/mrisdal/fake-news

56. Open Domain Deception Dataset: A crowdsourced deception dataset consisting of short open domain truths and lies from 512 users. Seven lies and seven truths are provided for each user. The dataset also includes users’ demographic information, such as gender, age, country of origin, and education level. - Website: http://web.eecs.umich.edu/~mihalcea/downloads.html#OpenDeception
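
Many of the labeled datasets above are distributed as line-delimited JSON, so a first look at the data can be quite short. The following is a minimal sketch, assuming records shaped like SNLI's JSONL release (field names "sentence1", "sentence2", and "gold_label", with "-" marking pairs that have no consensus label). The sample rows are made up for illustration, and the helper name read_pairs is hypothetical.

```python
import io
import json
from collections import Counter

# Made-up records shaped like an SNLI-style JSONL file (assumed field names).
# They are illustrative only and are not rows from the real corpus.
SAMPLE_JSONL = """\
{"sentence1": "A man is playing a guitar.", "sentence2": "A person is making music.", "gold_label": "entailment"}
{"sentence1": "A man is playing a guitar.", "sentence2": "The man is asleep.", "gold_label": "contradiction"}
{"sentence1": "A man is playing a guitar.", "sentence2": "The man is on stage.", "gold_label": "neutral"}
{"sentence1": "A dog runs.", "sentence2": "An animal moves.", "gold_label": "-"}
"""


def read_pairs(handle):
    """Yield (premise, hypothesis, label) tuples from a JSONL file handle.

    Blank lines and pairs without a consensus label ("-") are skipped.
    """
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        label = record["gold_label"]
        if label == "-":
            continue
        yield record["sentence1"], record["sentence2"], label


# For a real download, replace io.StringIO(...) with open(path, encoding="utf-8").
pairs = list(read_pairs(io.StringIO(SAMPLE_JSONL)))
label_counts = Counter(label for _, _, label in pairs)
print(label_counts)
```

Checking the label distribution like this before training is a cheap way to spot class imbalance or unexpected label values.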

Text-related APIs

1. Google Cloud Natural Language API: This API provides a range of natural language processing capabilities, including sentiment analysis, entity recognition, syntax analysis, and content classification. It can analyze text from various sources like social media, web pages, and documents. - Website: https://cloud.google.com/natural-language

2. IBM Watson Natural Language Understanding: IBM Watson NLU API offers features like sentiment analysis, entity extraction, keyword analysis, emotion detection, and content categorization. It can process unstructured text from web pages, news articles, and social media posts. - Website: https://www.ibm.com/products/natural-language-understanding

3. Microsoft Text Analytics API: Microsoft's API offers sentiment analysis, key phrase extraction, entity recognition, and language detection. It is useful for analyzing customer feedback, social media posts, and other text data sources. - Website: https://azure.microsoft.com/en-us/products/ai-services/text-analytics

4. Aylien Text Analysis API: This API provides features like sentiment analysis, entity recognition, language detection, and summarization. It can analyze text from articles, blogs, social media, and other sources. - Website: https://aylien.com/ (marked obsolete)

5. MeaningCloud Text Analytics API: MeaningCloud offers various text analysis capabilities, including sentiment analysis, topic extraction, language identification, and entity recognition. It is suitable for analyzing customer feedback, social media content, and surveys. - Website: https://www.meaningcloud.com/

6. Amazon Comprehend API: Amazon Comprehend API offers sentiment analysis, entity recognition, key phrase extraction, and language detection. It can analyze text from diverse sources like emails, social media, and documents. - Website: https://docs.aws.amazon.com/comprehend/latest/APIReference/welcome.html

7. TextRazor API: TextRazor provides features like entity recognition, sentiment analysis, language detection, and topic labeling. It can process web pages, articles, and documents to extract insights and metadata. - Website: https://www.textrazor.com/

8. ParallelDots Natural Language Understanding API: This API offers sentiment analysis, text classification, emotion analysis, and keyword extraction. It can process social media content, customer reviews, and user-generated text. - Website: https://apis.paralleldots.com/text_docs/index.html
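
Most of the APIs above accept a small JSON document over HTTPS and return scores or annotations. As a rough illustration, the sketch below builds the request body for a sentiment call to the Google Cloud Natural Language API. It assumes the v1 REST endpoint and its documented request shape; the helper name build_sentiment_request is hypothetical, authentication is left out, and nothing is actually sent. Check the provider's current documentation before relying on any of these details.

```python
import json

# Assumed v1 REST endpoint; verify against the provider's current documentation.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSentiment"


def build_sentiment_request(text):
    """Return the JSON-serializable body for a plain-text sentiment request."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }


body = build_sentiment_request(
    "The battery life is great, but the screen scratches easily."
)
# Bytes that would be sent as the POST body, with a JSON content type
# and credentials supplied separately.
payload = json.dumps(body).encode("utf-8")

print(ENDPOINT)
print(json.dumps(body, indent=2))
```

The other providers follow the same pattern with different endpoints, field names, and authentication schemes, so it is worth wrapping each one behind a small function like this if you plan to compare them.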

Hidden gems

1. Common Crawl: Common Crawl is an open repository of web crawl data, capturing a large portion of the web's content. It includes diverse and extensive unstructured text data from various websites and domains. - Website: https://commoncrawl.org/

2. Project Gutenberg Newsletter: Project Gutenberg offers a newsletter archive containing unstructured text data in the form of emails with discussions, announcements, and updates related to their eBook collection. - Website: https://www.gutenberg.org/

3. Wikimedia Dumps: Wikimedia Foundation provides data dumps of Wikipedia articles and discussions, offering unstructured text data on various topics beyond the standard Wikipedia API. - Website: https://dumps.wikimedia.org/

4. CrisisLex: CrisisLex is a dataset of crisis-related social media messages during various disaster events, providing unstructured text data for disaster response and information dissemination analysis. - Website: https://crisislex.org/

5. Debates and Transcripts: Websites like debates.org and debate.linux.org offer unstructured text data from debates, discussions, and Q&A sessions on various topics. - Website: debate.linux.org (marked obsolete)

6. Project MUSE: Project MUSE is a digital collection of scholarly journals and books, offering unstructured text data from humanities and social science fields. - Website: https://muse.jhu.edu/

7. Wikipedia Current Events: Wikipedia Current events pages provide unstructured text data with daily summaries of notable current events worldwide. - Website: https://en.wikipedia.org/wiki/Portal:Current_events

8. EU Press Releases: The European Union provides press releases on various topics, offering unstructured text data related to EU policies and initiatives. - Website: https://europa.eu/newsroom/home_en

9. COVID-19 Open Research Dataset (CORD-19): While COVID-19 research data has gained some attention, specific sections of CORD-19 offer unstructured text data on ethical and social implications of COVID-19 research. - Website: https://www.semanticscholar.org/cord19

10. Movie Scripts: Websites like imsdb.com and scripts.com offer unstructured text data in the form of movie scripts from various films. - Website: https://imsdb.com/, https://www.scripts.com/
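
Wikimedia dumps are large XML files, so it usually helps to stream them rather than load them whole. The sketch below is one way to do that with Python's standard library, assuming the usual MediaWiki export layout (page elements containing a title and a revision with a text element). The sample document is a tiny made-up stand-in for a real dump, and the helper names local_name and iter_pages are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for a MediaWiki XML export (assumed layout).
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example article</title>
    <revision><text>Some '''wiki''' markup.</text></revision>
  </page>
  <page>
    <title>Another article</title>
    <revision><text>More text here.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs one page at a time from a binary stream."""
    title, text = None, None
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


# For a real dump, pass an open (optionally decompressed) file object instead.
pages = list(iter_pages(io.BytesIO(SAMPLE_XML)))
for page_title, wikitext in pages:
    print(page_title, len(wikitext))
```

The text that comes out is still wiki markup, so a further cleaning step is needed before most language modeling or analysis work.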

