If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time scouring the internet for interesting datasets to analyze.
It can be fun to sift through dozens of datasets to find the best fit, but it can also be frustrating to download and import multiple CSV files only to find that the data is missing or just not that interesting. Fortunately, there are online repositories that curate datasets and (mostly) weed out the uninteresting ones.
In this article, we’ll look at different types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find datasets for each.
Whether you want to strengthen your data science portfolio by showing that you can visualize data well, or if you have a few hours to spare and want to practice your machine learning skills, we’ve got you covered.
Data sets for your Data Visualization Projects
A typical data visualization project might be something like “I want to create an infographic on how income varies in different states in the United States.”
There are a few considerations to keep in mind when looking for a good dataset for a data visualization project:
- The dataset shouldn’t be messy, because you don’t want to spend a lot of time cleaning the data.
- It should be nuanced and interesting enough to make charts about.
- Ideally, each column should be well documented, so that your visualization is accurate.
- The dataset shouldn’t have too many rows or columns, so it’s easy to work with.
A good place to find datasets for data visualization projects is news sites that publish their own data. They usually clean the data for you, and they already have charts you can reproduce or improve on.
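As a concrete sketch of the first step of the example project above, here’s how you might load a small income-by-state table and check whether it varies enough to be worth charting. The figures below are invented for illustration, assuming a two-column CSV like a news site might publish:

```python
import csv
import io
import statistics

# Invented sample shaped like a news-site dataset: two columns,
# state and median household income (values are made up).
SAMPLE = """state,median_income
California,78700
Mississippi,46500
Massachusetts,84400
West Virginia,48000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
incomes = {r["state"]: int(r["median_income"]) for r in rows}

# A quick sanity pass before charting: a wide spread suggests the
# variation is interesting enough to visualize.
spread = max(incomes.values()) - min(incomes.values())
mean = statistics.mean(incomes.values())
print(f"spread: {spread}, mean: {mean:.0f}")
```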
1. Newsdata.io (for news datasets)
Newsdata.io is a great platform if you are interested in historical news datasets, as they also provide a news API for breaking and historical news and collect news data daily. They also provide free data samples before you request your full historical news dataset.
2. FiveThirtyEight
FiveThirtyEight is an incredibly popular interactive news and sports site launched by Nate Silver.
They write interesting data-driven articles, such as “Don’t Blame Lack of Skills For Lack of Production Hires” and “The 2016 NFL Predictions.”
FiveThirtyEight makes the datasets used in its articles available online on GitHub.
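FiveThirtyEight’s datasets are plain CSV files in their GitHub repository, so loading one takes only a few lines of Python. The URL path in the comment and the sample rows below are placeholders, not a real dataset:

```python
import csv
import io

# FiveThirtyEight's datasets live as CSV files in the
# github.com/fivethirtyeight/data repository. To fetch a real one:
#   from urllib.request import urlopen
#   url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/<dataset>/<file>.csv"
#   text = urlopen(url).read().decode("utf-8")
# The placeholder rows below just mimic the shape of such a file.
text = """team,elo_rating
Patriots,1689
Browns,1328
"""

reader = csv.DictReader(io.StringIO(text))
ratings = {row["team"]: int(row["elo_rating"]) for row in reader}
print(ratings)
```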
3. BuzzFeed
BuzzFeed started out as a provider of low-quality articles, but has since evolved and now writes investigative pieces such as “The Court That Rules the World” and “The Short Life of Deonte Hoard”.
BuzzFeed makes the datasets used in its articles available on GitHub.
4. Socrata OpenData
Socrata OpenData is a portal that contains a number of clean datasets that can be viewed in the browser or downloaded for analysis. A significant portion of the data comes from US government sources, and many of the datasets are out of date.
You can browse and download data from OpenData without registering. You can also use view and navigation tools to explore the data in the browser.
Data sets for your Data Processing Projects
Sometimes you just want to work with a large dataset. The end result doesn’t matter as much as the process of reading in and analyzing the data.
You can use tools like Spark or Hadoop to distribute the processing across multiple nodes. There are a few things to keep in mind when looking for a good dataset for data processing:
- The cleaner the data, the better — cleaning a large dataset can take a long time.
- The dataset should be interesting.
- There should be an interesting question the data can answer.
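The split, map, and reduce pattern that Spark and Hadoop distribute across a cluster can be sketched on a single machine. Here’s a toy word count using only Python’s standard library; it’s a stand-in for the idea, not how you’d process data at scale:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# The text chunks below stand in for a large file split into
# per-node pieces; in Spark/Hadoop each would live on a different node.
chunks = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
]

def count_words(chunk: str) -> Counter:
    """Map step: count words in one chunk."""
    return Counter(chunk.split())

# Run the map step over the chunks in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_words, chunks))

# Reduce step: merge the per-chunk counts into one total.
total = sum(partials, Counter())
print(total.most_common(2))
```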
Cloud hosting providers like Amazon and Google are good places to find large public datasets. They have an incentive to host datasets, because you analyze them using their infrastructure (and pay them for it).
5. AWS Public Data sets
Amazon makes large datasets available on its Amazon Web Services platform. You can download the data and use it on your computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works here.
Amazon has a page that lists all the datasets for you to browse. You will need an AWS account, although Amazon gives new accounts a free tier of access that lets you explore the data at no cost.
6. Google Public Data sets
Just like Amazon, Google also offers a cloud hosting service, called the Google Cloud Platform. With GCP, you can use a tool called BigQuery to explore large sets of data.
Google lists all of the datasets on a page. You’ll need to create a GCP account, but the first 1TB of queries you make each month is free.
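As a sketch of what a BigQuery exploration looks like, here’s a standard-SQL query against one of Google’s real public tables, `bigquery-public-data.samples.shakespeare`. The client call is left in comments because running it requires the google-cloud-bigquery package and GCP credentials:

```python
# The query itself can be inspected (and tested) offline; only the
# commented-out client call needs a GCP account.
QUERY = """
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total DESC
LIMIT 10
"""

# from google.cloud import bigquery
# client = bigquery.Client()           # uses your GCP credentials
# for row in client.query(QUERY):      # billed against the free 1TB/month tier
#     print(row.word, row.total)

print(QUERY.strip())
```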
7. Wikipedia
Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing expanse of knowledge, with pages on everything from the Ottoman-Habsburg wars to Leonard Nimoy.
As part of Wikipedia’s commitment to the advancement of knowledge, they offer all of their content free of charge and regularly generate dumps of all articles on the site. In addition, Wikipedia offers a history of changes and activities, so you can track the progress of a page on a topic over time and know who is contributing to it.
You can find different ways to download the data on the Wikipedia site. You will also find scripts to reformat the data in various ways.
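Once downloaded, the dumps are XML files that can be parsed with standard tools. Below is a heavily simplified stand-in for the export format; the real dumps are namespaced and multi-gigabyte, so you would stream-parse them rather than load them whole:

```python
import xml.etree.ElementTree as ET

# Tiny invented sample with the same page/title/revision/text structure
# as Wikipedia's XML export (namespace omitted for brevity).
DUMP = """<mediawiki>
  <page>
    <title>Leonard Nimoy</title>
    <revision><text>Leonard Nimoy was an American actor...</text></revision>
  </page>
  <page>
    <title>Ottoman-Habsburg wars</title>
    <revision><text>A series of conflicts...</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(DUMP)
# Pull out every page title in the dump.
titles = [page.findtext("title") for page in root.iter("page")]
print(titles)
```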
Data sets for your Machine Learning Projects
When working on a machine learning project, you want to be able to predict one column from the other columns in a dataset. To do this, you need to make sure that:
- The dataset is not too complicated — if it is, we’ll be spending all of our time cleaning up the data.
- There is an interesting target column for making predictions.
- The other variables have some explanatory power for the target column.
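To make the last two criteria concrete, here’s a tiny least-squares fit in plain Python that predicts a target column from a single feature column. The numbers are invented, and a real project would use a library like scikit-learn:

```python
# One-variable least-squares regression: does the feature column have
# explanatory power for the target column? (Data is made up.)
xs = [1.0, 2.0, 3.0, 4.0]          # feature column
ys = [2.1, 3.9, 6.2, 7.8]          # target column (roughly 2x the feature)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Classic closed-form slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x: float) -> float:
    return slope * x + intercept

print(round(predict(5.0), 2))
```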
There are online repositories of datasets specifically for machine learning. These datasets are typically cleaned up beforehand and allow algorithms to be tested very quickly.
8. Kaggle
Kaggle is a data science community that hosts machine learning competitions. There are a variety of interesting, externally contributed datasets on the site. Kaggle offers both live and historical competitions.
You can download Kaggle data by entering a competition. Each competition has its own associated dataset. There are also user-contributed datasets in the newer Kaggle Datasets offering.
9. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the oldest sources of datasets on the web. While the datasets are user-contributed and therefore have varying levels of documentation and cleanliness, the vast majority are clean and ready to use.
UCI is a great first stop when looking for interesting datasets.
You can download the data directly from the UCI Machine Learning repository, without registration. These datasets tend to be quite small and don’t have a lot of nuances, but they are useful for machine learning.
10. Quandl
Quandl is a repository of economic and financial data. Some of it is free, but many datasets require purchase. Quandl is useful for building models to predict economic indicators or stock prices. Because so many datasets are available, it’s possible to build a complex model that uses many datasets to predict values in another.
Data sets for Data Cleaning Projects
Sometimes it can be very satisfying to take a dataset that is spread across multiple files, clean it up, condense it into one, and then perform an analysis. In data cleaning projects, it can take hours of research to figure out what each column in the dataset means.
Sometimes it may turn out that the dataset you are analyzing is not suitable for what you are trying to do and you will have to start over.
When looking for a good dataset for a data cleaning project, you want it to:
- Be spread across multiple files.
- Have many nuances, and many possible angles to take.
- Require a fair amount of research to understand.
- Be as “real-world” as possible.
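A miniature version of such a project, with invented data: two “files” from the same source are joined on a shared ID, messy values are tidied, and rows with missing values are dropped:

```python
import csv
import io

# Invented stand-ins for two files from one source: scores in one,
# school names in another, linked by school_id. Note the stray
# whitespace and the missing score -- typical cleaning targets.
scores_csv = "school_id,score\n1,88\n2,  72 \n3,\n"
names_csv = "school_id,name\n1,Lincoln High\n2,Washington High\n3,Central High\n"

# Build a lookup from the names "file".
names = {r["school_id"]: r["name"]
         for r in csv.DictReader(io.StringIO(names_csv))}

# Merge, trim whitespace, and drop rows with a missing score.
merged = []
for row in csv.DictReader(io.StringIO(scores_csv)):
    raw = row["score"].strip()
    if not raw:
        continue
    merged.append({"name": names[row["school_id"]], "score": int(raw)})

print(merged)
```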
These types of datasets are typically found on dataset aggregators. Aggregators tend to collect datasets from multiple sources without much curation, and that’s what you want here: too much curation produces overly clean datasets that leave nothing to practice cleaning on.
11. data.world
data.world describes itself as “the social network for data people”, but it could be more accurately described as “GitHub for data”. It’s a place where you can search for, copy, analyze, and download datasets.
Additionally, you can upload your data to data.world and use it to collaborate with others. In a relatively short time, it has become one of the go-to places for acquiring data, with many user-contributed datasets as well as fantastic datasets available through data.world’s partnerships with various organizations, including a large amount of US federal government data.
A key differentiator of data.world is the tools they have built to make working with data easier: you can write SQL queries in their interface to explore data and join multiple datasets. They also have SDKs for R and Python to make it easier to pull data into your favorite tool.
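data.world’s own query interface needs an account, so as an offline stand-in, here’s the same join-two-datasets idea using Python’s built-in sqlite3 (the tables and figures below are invented):

```python
import sqlite3

# Two invented "datasets" loaded as tables in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, state TEXT)")
conn.execute("CREATE TABLE incomes (state TEXT, median_income INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Austin", "TX"), ("Boston", "MA")])
conn.executemany("INSERT INTO incomes VALUES (?, ?)",
                 [("TX", 64000), ("MA", 84000)])

# Merge the datasets with a SQL join, the same way you would in
# data.world's query interface.
rows = conn.execute("""
    SELECT c.name, i.median_income
    FROM cities c JOIN incomes i ON c.state = i.state
    ORDER BY c.name
""").fetchall()
print(rows)
```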
12. Data.gov
Data.gov is a relatively new site that is part of a US effort toward open government. Data.gov allows you to download data from several US government agencies.
Data can range from government budgets to school performance scores. Most of the data require further research and it can sometimes be difficult to understand which dataset is the “correct” version.
Anyone can download the data, although some data sets require additional steps, such as accepting license agreements.
You can browse the datasets on Data.gov directly, without registering. You can browse by domain or search for a specific data set.
13. The World Bank
The World Bank is a global development organization that provides loans and advice to developing countries. The World Bank regularly funds programs in developing countries and then collects data to track the success of those programs.
You can browse the World Bank datasets directly without registering. Data sets have many missing values and sometimes require multiple clicks to actually access the data.
14. Reddit /r/datasets
Reddit, a popular community discussion site, has a section dedicated to sharing interesting datasets: the /r/datasets subreddit. The scope of these datasets varies a lot, since they are all user-submitted, but they tend to be very interesting and nuanced.
15. Academic Torrents
Academic Torrents is a newer site focused on sharing the datasets behind scientific papers. Because it is so new, it’s hard to say what the most common types of datasets will look like. For now, it has tons of interesting datasets that lack context.
You can browse the datasets directly on the site. Since this is a torrent site, all datasets can be downloaded immediately, but you will need a Bittorrent client. Deluge is a good free option.