January 26, 2021

Top Five Websites for Generating Datasets for Machine Learning Projects

Bala Venkatesh (balavenkatesh.s)

DURATION

15min

Mendeley Data

https://data.mendeley.com/

This website assigns a unique DOI to each version of your dataset. You can use the search field to find existing datasets. If you have a large amount of data, you can store it there and share it with others securely, so you may not need to pay to store terabyte-scale data. The site hosts around 26.3 million datasets, which suggests it is a well-established place to keep your data.

Topcoder

https://www.topcoder.com/thrive/search?title=datasets

With Topcoder Datasets you have access to high-quality labeled data, some of which you can’t find anywhere else. We also provide the tools to shape your own data. These datasets are intended to help you shape and strengthen algorithmic models and sharpen your data science abilities along the way.

Start by exploring our favorite datasets to get a better understanding of what they contain and, if you are interested, try pairing them with your own algorithms or others found in Models & Algos.

DeepMind

https://deepmind.com/research?filters={"tags":["Datasets"]}

DeepMind is a research company that was founded in September 2010 and acquired by Google in 2014. It publishes many datasets, including video datasets such as Kinetics, along with useful research papers. More recently, it developed AlphaFold, which predicts protein structures and is a major milestone in artificial intelligence.

Google

https://datasetsearch.research.google.com/

Google has a dataset search engine that you can use to search more than twenty-five million publicly available datasets. Each result contains a description of the dataset's contents as well as author citations. Dataset publishers use schema.org markup to describe their metadata. If you would like to publish your own dataset, you can describe it with schema.org metadata so that the search engine can index it.
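To make the publishing step more concrete, here is a minimal sketch in Python that builds schema.org Dataset metadata as JSON-LD and wraps it in a script tag for a dataset's landing page. The property names (name, description, url, license, creator, distribution) come from the schema.org Dataset type. The function names and all example values (the dataset name, the example.com URLs, the author) are hypothetical placeholders, not anything from a real dataset.

```python
import json


def build_dataset_jsonld(name, description, url, license_url,
                         creator_name, download_url,
                         encoding_format="text/csv"):
    """Return schema.org Dataset metadata as a plain dict (JSON-LD)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": {"@type": "Person", "name": creator_name},
        # One downloadable file; add more DataDownload entries if needed.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def to_script_tag(metadata):
    """Wrap the metadata in the script tag used to embed JSON-LD in HTML."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical example values, for illustration only.
example = build_dataset_jsonld(
    name="Street Sign Images",
    description="Labeled photos of street signs for image classification.",
    url="https://example.com/datasets/street-signs",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Jane Doe",
    download_url="https://example.com/datasets/street-signs.csv",
)
print(to_script_tag(example))
```

The printed tag would go into the HTML of the page that describes the dataset. Whether and when a search engine picks it up depends on the crawler, so treat this as a starting point rather than a guarantee of indexing.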

Limitation:
It does not provide an API for searching or downloading datasets.
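Since there is no API, one small workaround is to build the search page URL yourself and open it in a browser. The sketch below assumes the site accepts the search term in a `query` parameter on its `/search` path, which is what the browser address bar shows after a manual search. That is an observation, not a documented interface, so it may change. The function name `dataset_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed search page path; not a documented API endpoint.
SEARCH_PAGE = "https://datasetsearch.research.google.com/search"


def dataset_search_url(query):
    """Build a URL that opens the dataset search results for a query."""
    return SEARCH_PAGE + "?" + urlencode({"query": query})


print(dataset_search_url("chest x-ray"))
```

This only produces a link for a person to follow. It does not download results or return them in a machine-readable form.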

Amazon

https://registry.opendata.aws/

Amazon hosts a huge amount of data through the Registry of Open Data on AWS, including Facebook Data for Good, the NASA Space Act Agreement, and the Amazon Sustainability Data Initiative. You can also add your own dataset to the registry. Not every dataset is maintained by the AWS team; some are maintained by third-party individuals and companies. For more information, check out the registry's GitHub repository.

Microsoft

https://msropendata.com/

Microsoft provides a beta version of its open data repository. You can browse the data by category, which makes it easier to find a suitable dataset. In total, there are ten categories, such as healthcare, computer science, and physics. You can copy datasets directly to Azure-based virtual machines, including data science virtual machines. You can also find research papers related to each dataset. The data is maintained by Microsoft researchers, industry partners, and academic advisers.

Conclusion

I hope this list of third-party websites is helpful for your machine learning project. If you are a beginner, I suggest trying Kaggle, where you can get a notebook together with a dataset. That makes it easier to start experimenting with a machine learning project.
