Public text data

To help you get started, we've compiled a variety of datasets and APIs to draw inspiration from. Many of these datasets have already been cleaned and normalized, so they are ready to be explored using AI tools. These datasets are often intended for research purposes only, so if you want to use the data in your startup, be sure to read any associated license agreements to understand whether there are commercial restrictions. You are also not restricted to basing your idea on the datasets below: you may discover other open source datasets that inspire your creativity, or you may bring your own proprietary datasets if you wish.

And if there’s a dataset you think we should add to the list, please send it to us.


  1. Data categories
  2. Data sources: most well-known
  3. Text-related APIs
  4. Hidden gems

Data categories

1. Web Pages and Articles: Textual content from websites, blogs, news articles, and online forums. Typically gathered by web scraping, it covers a wide range of topics and can be used for sentiment analysis and information retrieval.

2. Books and Literature: Digitized texts from classic literature, novels, academic books, and research papers. It can be used for language modeling, text summarization, and topic modeling.

3. Social Media Texts: User-generated content from social media platforms, such as tweets, Facebook posts, and Instagram captions. Useful for sentiment analysis, social trends analysis, and opinion mining.

4. Question-Answer Data: Pairs of questions and corresponding answers from forums, Q&A platforms, and conversational datasets. Used for natural language understanding and question-answering tasks.

5. Emails and Correspondence: Text data from emails, letters, and other correspondence. Often used for email classification and filtering tasks.

6. Reviews and Ratings: User reviews and ratings for products, services, restaurants, etc. Valuable for sentiment analysis and opinion mining.

7. Transcripts and Subtitles: Textual transcripts of audio and video content, including movie subtitles and speech transcripts. Used for speech recognition and language understanding tasks.

8. News and Media Articles: Texts from news outlets, magazines, and media sources. Useful for text categorization, topic modeling, and sentiment analysis.

9. Scientific Publications: Texts from academic papers, journals, conference proceedings, and scientific articles. Valuable for research, text summarization, and citation analysis.

10. Dictionaries and Lexicons: Lexical resources, word lists, and dictionaries that can be used for sentiment analysis and language processing.

11. Chat Logs and Conversations: Textual conversations and chat logs, useful for dialogue systems and chatbot training.

12. Legal Texts: Legal documents, contracts, court decisions, and statutes. Often used for legal text classification and information retrieval.

13. Health and Medical Texts: Medical articles, health records, and clinical notes. Valuable for medical text analysis and natural language processing in healthcare.

14. Wikipedia and WikiText: Texts from Wikipedia articles and other wiki platforms. Valuable for knowledge extraction and language modeling.

15. Educational Resources: Educational materials, textbooks, and learning resources. Useful for educational applications and content analysis.

16. Sentiment Datasets: Datasets with labeled sentiments (positive, negative, neutral) for sentiment analysis tasks.

17. Translation Datasets: Parallel texts in multiple languages, useful for machine translation and cross-lingual tasks.

18. Natural Language Inference (NLI) Datasets: Datasets with sentence pairs and entailment labels for natural language understanding tasks.

19. Image Captions: Textual descriptions or captions associated with images, commonly used for image captioning tasks.

20. Text-to-Speech (TTS) Datasets: Text paired with recorded speech, used for training text-to-speech systems.
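
As a rough illustration of how two of the categories above fit together, here is a minimal sketch that uses a lexicon (category 10) to produce the positive, negative, and neutral labels found in sentiment datasets (category 16). The tiny word list and the `score` and `label` helpers are made up for this example and do not come from any of the resources listed on this page; a real lexicon would be far larger.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (illustrative values only).
LEXICON = {
    "good": 1, "great": 2, "love": 2,
    "bad": -1, "terrible": -2, "hate": -2,
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def score(text, lexicon=LEXICON):
    """Sum the polarity weights of every token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))

def label(text, lexicon=LEXICON):
    """Map the numeric score to a positive / negative / neutral label."""
    total = score(text, lexicon)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

if __name__ == "__main__":
    for sentence in [
        "I love this great phone",
        "Terrible service, bad food",
        "The package arrived on Tuesday",
    ]:
        print(sentence, "->", label(sentence))
```

A word-counting approach like this ignores negation and context, so it is best treated as a quick baseline to compare against models trained on labeled data.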

Data sources: most well-known

1. Common Crawl: An open repository of web crawl data, containing a vast amount of text from various websites and domains.

2. Project Gutenberg: Provides a large collection of free eBooks, including classic literature and historical texts.

3. Open Library: A project that aims to provide access to every book ever published, including full-text access to some works.

4. Internet Archive: A vast digital library with texts, audio, video, and other formats, including books, articles, and documents.

5. PubMed Central (PMC): A repository of full-text scientific articles in the biomedical and life sciences.

6. arXiv: A repository of preprints in various fields of science, including physics, mathematics, computer science, and more.

7. DBpedia: A crowd-sourced knowledge graph that extracts structured data from Wikipedia and makes it available for use.

8. Wikidata: A free and open knowledge graph that stores structured data for Wikipedia and other Wikimedia projects.

9. Project MUSE: Provides access to academic journals and books in the humanities and social sciences.

10. U.S. Census Bureau: Offers a wide range of public data, including statistics and reports on various topics related to the United States.

11. Data.gov: The U.S. government's open data portal, providing access to a diverse collection of datasets, including text data.

12. European Data Portal: Offers access to public datasets from European countries and institutions, including text data.

13. Quandl: A platform for financial and economic data, including text-based data like news articles and financial reports.

14. Kaggle: A platform that hosts a wide range of public datasets, including text data for natural language processing tasks.

15. UCI Machine Learning Repository: Provides various datasets, some of which include text data for text mining and sentiment analysis.

16. NLP Progress: A collection of resources and datasets for natural language processing tasks.

17. TweetsCOV19: A collection of COVID-19 related tweets for research and analysis.

18. Reddit: A social media platform with various subreddits containing user-generated text content on diverse topics.

19. Twitter API: Provides access to public tweets on specific topics or hashtags for research and analysis.

20. Wikipedia Dumps: Offers XML dumps of Wikipedia articles, useful for large-scale text analysis.

21. NIST TREC (Text REtrieval Conference): Provides test collections and datasets for information retrieval research.

22. WikiText: A large collection of Wikipedia articles with minimal markup, useful for language modeling tasks.

23. CoNLL: A series of shared tasks and datasets for natural language processing and computational linguistics.

24. Sentiment140: A dataset of tweets labeled for sentiment analysis (positive, negative, neutral).

25. Amazon Product Reviews: A dataset of product reviews from Amazon, often used for sentiment analysis tasks.

26. Reuters Corpus: A collection of news articles from the Reuters news agency, useful for text classification and information retrieval.

27. IMDB Reviews: A dataset of movie reviews from IMDB, commonly used for sentiment analysis.

28. WikiSQL: A dataset for text-to-SQL tasks, involving natural language questions and SQL queries over Wikipedia tables.

29. Enron Email Dataset: A collection of emails from the Enron Corporation, useful for email analysis and natural language processing.

30. SNLI: The Stanford Natural Language Inference Corpus, containing sentence pairs labeled as entailment, contradiction, or neutral.

31. Google Books Ngrams: A dataset of n-grams (word sequences) extracted from Google Books, useful for language modeling and linguistics research.

32. WebText: A collection of text from web pages, gathered by following outbound links shared on Reddit, and commonly associated with language modeling tasks.

33. Multi30k: A dataset of sentence-based translations in multiple languages, commonly used for machine translation tasks.

34. OpenAI GPT-2 Dataset: A dataset of articles and text generated by the GPT-2 language model.

35. BookCorpus: A large collection of text from books, commonly used for language modeling tasks.

36. AG News Corpus: A collection of categorized news articles drawn from the larger AG corpus, useful for text classification tasks.

37. Quora Question Pairs: A dataset of question pairs from Quora, commonly used for question similarity tasks.

38. SQuAD: The Stanford Question Answering Dataset, containing question-answer pairs on various articles from Wikipedia.

39. 20 Newsgroups: A collection of newsgroup documents, commonly used for text classification and topic modeling.

40. LAMBADA: A dataset of narrative passages in which the final word must be predicted, often used for language modeling and completion tasks.

41. NLI Large: A dataset for natural language inference, including multiple-choice questions and candidate sentences.

42. Blogger Corpus: A dataset of blog posts from different bloggers, useful for stylistic analysis and author profiling.

43. COCO Captions: A dataset of image captions from the COCO dataset, commonly used for image captioning tasks.

44. Amazon Fine Food Reviews: A dataset of food reviews from Amazon, useful for sentiment analysis and text classification.

45. OpenSubtitles: A collection of movie subtitles in multiple languages, useful for language modeling and translation tasks.

46. SNLI-VE: A dataset for visual entailment, containing image-sentence pairs with entailment labels.

47. GPT-3 Prompt Engineering Dataset: A dataset of prompts and completions used to fine-tune GPT-3 for specific tasks.

48. Craigslist Corpus: A dataset of Craigslist ads, useful for text classification and analysis of classifieds data.

49. Yelp Reviews: A dataset of user reviews from Yelp, commonly used for sentiment analysis and text classification.

50. TED Talks Transcripts: A dataset of TED Talk transcripts, useful for language modeling and topic analysis.

51. Tatoeba: A collection of sentences and their translations.

52. ChangeMyView: A dataset of discussions from the “ChangeMyView” subreddit.

53. TriviaQA: A large-scale dataset for reading comprehension and question answering.

54. Cornell Movie Dialog Corpus: A large, metadata-rich collection of fictional conversations extracted from raw movie scripts.

55. Fake News Dataset: Text and metadata from fake and biased news sources around the web.

56. Open Domain Deception Dataset: A crowdsourced deception dataset consisting of short open domain truths and lies from 512 users. Seven lies and seven truths are provided for each user. The dataset also includes users’ demographic information, such as gender, age, country of origin, and education level.
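
Several of the sources above, such as the Wikipedia dumps, ship as XML files that are too large to load into memory at once. The sketch below shows one way to stream such a file page by page using Python's standard library. The in-memory sample is a heavily simplified stand-in for a real dump (which declares an XML namespace and carries many more fields per page), and `iter_pages` is a hypothetical helper name chosen for this example.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, made-up sample standing in for a real (much larger) dump file.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Alpha</title>
    <revision><text>First article body.</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <revision><text>Second article body.</text></revision>
  </page>
</mediawiki>"""

def local_name(tag):
    """Drop any '{namespace}' prefix so tags match with or without one."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, text) for each <page>, freeing its children afterwards."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the page's children to keep memory use low

if __name__ == "__main__":
    # With a real dump, pass an open binary file object instead of BytesIO.
    for title, text in iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))):
        print(title, "->", text)
```

Because pages are yielded one at a time, the same loop can feed a tokenizer or a filter without holding the whole corpus in memory.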

Text-related APIs

1. Google Cloud Natural Language API: This API provides a range of natural language processing capabilities, including sentiment analysis, entity recognition, syntax analysis, and content classification. It can analyze text from various sources like social media, web pages, and documents.

2. IBM Watson Natural Language Understanding: The IBM Watson NLU API offers features like sentiment analysis, entity extraction, keyword analysis, emotion detection, and content categorization. It can process unstructured text from web pages, news articles, and social media posts.

3. Microsoft Text Analytics API: Microsoft's API offers sentiment analysis, key phrase extraction, entity recognition, and language detection. It is useful for analyzing customer feedback, social media posts, and other text data sources.

4. Aylien Text Analysis API: This API provides features like sentiment analysis, entity recognition, language detection, and summarization. It can analyze text from articles, blogs, social media, and other sources. (Flagged as obsolete.)

5. MeaningCloud Text Analytics API: MeaningCloud offers various text analysis capabilities, including sentiment analysis, topic extraction, language identification, and entity recognition. It is suitable for analyzing customer feedback, social media content, and surveys.

6. Amazon Comprehend API: Amazon Comprehend offers sentiment analysis, entity recognition, key phrase extraction, and language detection. It can analyze text from diverse sources like emails, social media, and documents.

7. TextRazor API: TextRazor provides features like entity recognition, sentiment analysis, language detection, and topic labeling. It can process web pages, articles, and documents to extract insights and metadata.

8. ParallelDots Natural Language Understanding API: This API offers sentiment analysis, text classification, emotion analysis, and keyword extraction. It can process social media content, customer reviews, and user-generated text.

Hidden gems

1. Common Crawl: Common Crawl is an open repository of web crawl data, capturing a large portion of the web's content. It includes diverse and extensive unstructured text data from various websites and domains.

2. Project Gutenberg Newsletter: Project Gutenberg offers a newsletter archive containing unstructured text data in the form of emails with discussions, announcements, and updates related to their eBook collection.

3. Wikimedia Dumps: Wikimedia Foundation provides data dumps of Wikipedia articles and discussions, offering unstructured text data on various topics beyond the standard Wikipedia API.

4. CrisisLex: CrisisLex is a dataset of crisis-related social media messages during various disaster events, providing unstructured text data for disaster response and information dissemination analysis.

5. Debates and Transcripts: Several websites offer unstructured text data from debates, discussions, and Q&A sessions on various topics. (Flagged as obsolete.)

6. Project MUSE: Project MUSE is a digital collection of scholarly journals and books, offering unstructured text data from humanities and social science fields.

7. Wikipedia Current Events: Wikipedia's Current events pages provide unstructured text data with daily summaries of notable current events worldwide.

8. EU Press Releases: The European Union provides press releases on various topics, offering unstructured text data related to EU policies and initiatives.

9. COVID-19 Open Research Dataset (CORD-19): While COVID-19 research data has gained some attention, specific sections of CORD-19 offer unstructured text data on ethical and social implications of COVID-19 research.

10. Movie Scripts: Several websites offer unstructured text data in the form of movie scripts from various films.
