The following are links to datasets:
https://www.data.gov/ (US Government Open Data)
Data.gov is managed and hosted by the U.S. General Services Administration, Technology Transformation Service. It is powered by two open source applications, CKAN and WordPress, and is developed publicly on GitHub. Data.gov follows the Project Open Data schema: a set of required fields (Title, Description, Tags, Last Update, Publisher, Contact Name, etc.) for every dataset displayed on Data.gov.
The site lists more than 200,000 datasets. They cover Agriculture, Climate, Consumer, Ecosystems, Education, Energy, Finance, Health, Local Government, Manufacturing, Maritime, Ocean, Public Safety, Science & Research, and many more areas.
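Because Data.gov runs on CKAN, its catalog can in principle be searched programmatically. The following is only a minimal sketch: it assumes the catalog exposes the standard CKAN Action API at catalog.data.gov (that URL is my assumption, not something stated above), and the function names are hypothetical. It pulls out a few of the Project Open Data style fields mentioned above (title, publisher, last update, tags).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalog serves the standard CKAN Action API here.
CKAN_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return a CKAN package_search URL for a free-text query."""
    return CKAN_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def summarize_results(payload):
    """Pick a few descriptive fields out of a CKAN package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    summaries = []
    for dataset in payload["result"]["results"]:
        # "organization" can be missing or null for some records.
        organization = dataset.get("organization") or {}
        summaries.append({
            "title": dataset.get("title"),
            "publisher": organization.get("title"),
            "last_update": dataset.get("metadata_modified"),
            "tags": [tag["name"] for tag in dataset.get("tags", [])],
        })
    return summaries


def search_datasets(query, rows=5):
    """Query the live catalog (needs network access)."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return summarize_results(json.load(response))
```

Calling `search_datasets("climate")` would return a short list of dictionaries, one per matching dataset, provided the endpoint behaves as assumed.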
https://data.gov.in/ (India Government Open Data)
Open Government Data (OGD) Platform India (data.gov.in) supports the Open Data initiative of the Government of India and is the home of the Indian Government’s open data. The portal is intended for Government of India Ministries, Departments, and their organizations to publish the datasets, documents, services, tools, and applications they collect for public use. It aims to increase transparency in the functioning of the Government and to open avenues for more innovative uses of government data that offer different perspectives.
The base Open Government Data Platform India is a joint initiative of the Government of India and the US Government. It is also packaged as a product and made available as open source for implementation by countries globally.
The entire product is available for download from the open source code-sharing platform GitHub.
Open Government Data Platform India has four major modules, detailed below, implemented on a single instance of Drupal, an open source content framework:
- Data Management System (DMS) – Module through which government agencies contribute data catalogs, which appear on the front-end website after approval through a defined workflow.
- Content Management System (CMS) – Module for managing and updating the functionalities and content types of the Open Government Data Platform India.
- Visitor Relationship Management (VRM) – Module for collating and disseminating viewer feedback on the data catalogs.
- Communities – Module for community users to interact and share their enthusiasm and views with others who have common interests.
https://data.worldbank.org/ (World Bank Open Data)
This site makes World Bank data easy to find, download, and use. All of the data found here can be used free of charge with minimal restrictions. The platform provides several tools, including the Open Data Catalog, data access through a Web API, World Development Indicators, education indices, and microdata.
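To give a feel for the Web API access mentioned above, here is a rough sketch that builds a request for one indicator and turns the response into a year-to-value mapping. It assumes the version 2 endpoint at api.worldbank.org and the `SP.POP.TOTL` (total population) indicator code, and it assumes annual data so that each record's date is a plain year. The function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: version 2 of the World Bank Indicators API.
API_ROOT = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, start_year, end_year):
    """Return the URL for one indicator, one country, and a range of years."""
    query = urlencode({
        "format": "json",
        "date": f"{start_year}:{end_year}",
        "per_page": 1000,
    })
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_response(payload):
    """Return {year: value} from the [metadata, records] pair the API sends back.

    Records with no value are skipped. Anything that is not a two-element
    list (for example an error message) yields an empty dict.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return {}
    return {
        int(record["date"]): record["value"]
        for record in payload[1]
        if record["value"] is not None
    }


def fetch_indicator(country, indicator, start_year, end_year):
    """Download and parse one indicator series (needs network access)."""
    url = build_indicator_url(country, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as response:
        return parse_indicator_response(json.load(response))
```

For example, `fetch_indicator("IN", "SP.POP.TOTL", 2010, 2020)` should return India's population by year, if the API responds as assumed here.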
https://rbi.org.in/Scripts/Statistics.aspx (Data on Indian economy, banking and finance, metrics on money market)
This data comes from the Reserve Bank of India. It covers various aspects of the Indian economy, banking, and finance, including metrics on money market operations, balance of payments, use of banking, and several products. It is also helpful for the BFSI (Banking, Financial Services and Insurance) domain in India.
https://dbie.rbi.org.in/DBIE/dbie.rbi?site=home (Database on Indian Economy)
The Reserve Bank of India (RBI) has a rich tradition of publishing data on various aspects of the Indian economy through several of its publications. On this website (DBIE), data are presented mainly as time-series reports, organized under sectors and sub-sectors according to their periodicities. Reports can be saved as Excel sheets for further analysis.
https://data.fivethirtyeight.com/ (FiveThirtyEight Datasets)
Here is a link to the datasets used by FiveThirtyEight in its stories. Each dataset includes the data, a dictionary explaining the data, and a link to the FiveThirtyEight story that used it. If you want to learn how to create data stories, it can’t get better than this.
https://cloud.google.com/bigquery/public-data/ (Google DataSets)
Google provides a few datasets as part of its BigQuery tool. These include baby names, data from GitHub public repositories, all stories and comments from Hacker News, and more.
A public dataset is any dataset that is stored in BigQuery and made available to the general public. The public datasets listed in the BigQuery documentation are datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via a project. You pay only for the queries that you perform on the data (the first 1 TB per month is free, subject to query pricing details).
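The pricing rule above is simple arithmetic, so a small sketch may help. The helper below works out how much of a query falls outside the monthly free allowance, and a second function asks BigQuery for a dry-run estimate before anything is billed. Several things here are assumptions on my part: the table name `bigquery-public-data.usa_names.usa_1910_2013` and its `name` and `number` columns, the treatment of 1 TB as 10**12 bytes, and the function names. The dry run needs the `google-cloud-bigquery` package and configured credentials.

```python
FREE_TB_PER_MONTH = 1.0  # free query allowance quoted in the text above

# Assumption: the baby names public table and its column names.
BABY_NAMES_SQL = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""


def billable_tb(tb_already_scanned, tb_for_query, free_tb=FREE_TB_PER_MONTH):
    """Return how many TB of a new query fall outside the monthly free allowance."""
    remaining_free = max(free_tb - tb_already_scanned, 0.0)
    return max(tb_for_query - remaining_free, 0.0)


def estimate_query_tb(sql):
    """Dry-run a query and return the TB it would scan, without running it."""
    # Imported lazily so the arithmetic above works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    # Assumption: 1 TB is treated as 10**12 bytes for this estimate.
    return job.total_bytes_processed / 1e12
```

A typical use would be `billable_tb(tb_used_so_far, estimate_query_tb(BABY_NAMES_SQL))`, where `tb_used_so_far` is a figure you track yourself. The actual charge depends on the current query pricing details.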
https://research.google.com/youtube8m/ (YouTube-8M video DataSets)
Google Research released YouTube-8M, a labelled dataset that at release consisted of about 8 million YouTube video IDs and associated labels from 4800 visual entities. It comes with pre-computed, state-of-the-art vision features from billions of frames.
YouTube-8M is a large-scale labelled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.
https://registry.opendata.aws/ (Amazon Web Services – AWS DataSets)
Amazon provides a few big datasets, which can be used on its platform or on your local computers. You can also analyze the data in the cloud using EC2 and Hadoop via EMR. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, NASA NEX datasets, the Million Song Dataset, and many more.
Kaggle has created a platform where people can donate datasets and other community members can vote on them and run kernels or scripts on them. It has more than 350 datasets in total, with more than 200 as Featured datasets. While some of the initial datasets were already available elsewhere, I have seen a few interesting datasets on the platform that are not available anywhere else. Along with new datasets, another benefit of the interface is that you can see scripts and questions from community members in the same place.
https://archive.ics.uci.edu/ml/datasets.html (UCI Machine Learning Repository)
UCI Machine Learning Repository is clearly the most famous data repository. It is usually the first place to go if you are looking for datasets for machine learning. The collection ranges from popular datasets like Iris and Titanic survival to recent contributions like Air Quality and GPS trajectories. The repository contains more than 350 datasets labelled by attributes such as domain and type of problem (Classification / Regression). You can use these filters to identify good datasets for your needs.
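As a quick illustration of working with one of these datasets, the sketch below parses the Iris data and counts the examples per class, which is a sensible first check for a classification dataset. The download URL and the column names are my assumptions (the raw file has no header row), and the function names are hypothetical.

```python
import csv
import io
from collections import Counter
from urllib.request import urlopen

# Assumption: the classic location of the raw Iris file in the repository.
IRIS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Assumption: descriptive column names chosen for this sketch.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def parse_iris(text):
    """Turn the comma-separated Iris text into a list of row dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(IRIS_COLUMNS):
            continue  # skip blank or malformed lines
        *measurements, species = record
        values = [float(x) for x in measurements] + [species]
        rows.append(dict(zip(IRIS_COLUMNS, values)))
    return rows


def class_counts(rows):
    """Count how many rows belong to each species label."""
    return Counter(row["species"] for row in rows)


def download_iris():
    """Fetch and parse the dataset (needs network access)."""
    with urlopen(IRIS_URL, timeout=30) as response:
        return parse_iris(response.read().decode("utf-8"))
```

Running `class_counts(download_iris())` shows how the examples are split across the species labels, assuming the file is still served at that address.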
Quandl provides financial, economic, and alternative data from various sources through its website, its API, or direct integration with a few tools. Its datasets are classified as Open or Premium. You can access all the open datasets for free, but you need to pay for the premium datasets. If you search, you can still find good datasets on the platform. For example, stock exchange data from India is available for free.
http://www.kdd.org/kdd-cup (Past KDD Cups)
KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). The archives include datasets and instructions. Winners are listed for most years.
SIGKDD’s mission is to provide the premier forum for advancement, education, and adoption of the “science” of knowledge discovery and data mining from all types of data stored in computers and networks of computers.
https://www.drivendata.org/ (Driven Data)
Driven Data finds real-world challenges where data science can be used to create a positive social impact. It then runs online modelling competitions for data scientists to develop the best models to solve them. If you are interested in the use of data science for social good, this is the place to be.
http://drivendata.co/ (Driven Data Labs Projects)
This link provides data for projects on Big Data, Data Science, and Business Analytics.
https://dhsprogram.com/data/available-datasets.cfm (Demographic & Health Survey Data)
DHS supports a range of data collection options that can be tailored to fit the specific monitoring and evaluation needs of countries:
- Demographic and Health Surveys (DHS) – Provide data for a wide range of monitoring and impact evaluation indicators in the areas of population, health, and nutrition.
- AIDS Indicator Surveys (AIS) – Provide countries with a standardized tool to obtain indicators for the effective monitoring of national HIV/AIDS programs.
- Service Provision Assessment (SPA) Surveys – Provide information about the characteristics of health facilities and services available in a country.
- Malaria Indicator Surveys (MIS) – Provide data on bednet ownership and use, prevention of malaria during pregnancy, and prompt and effective treatment of fever in young children. In most cases, biomarker tests for malaria and anemia are also included.
- Key Indicators Surveys (KIS) – Provide monitoring and evaluation data for population and health activities in small areas (regions, districts, catchment areas) that may be targeted by an individual project. KIS can also be used in nationally representative surveys.
- Other Quantitative Surveys – Include Benchmark Surveys, KAP Surveys, Panel Surveys, and other specialized surveys.
- Qualitative Research – Provides informed answers to questions that lie outside the purview of standard quantitative approaches.
https://data.opendatasoft.com/pages/home/ (Open DataSoft’s Data Network)
OpenDataSoft updates the data daily and geocodes it for better service. It offers dashboards based on the data, which you can build in minutes, and provides a state-of-the-art API and fuzzy search at a competitive price.
Thank you for reading my post. I hope the resources listed above prove useful for people working on Data Science, Big Data, Business Analytics, Artificial Intelligence, IoT, Neural Networks, and Fuzzy Logic projects. I would love to hear your thoughts in the comments below, which would motivate me to write more articles like this one.
(Dr. Kamal Gulati)
Ph.D., M.C.A, M.Sc. (CS), M.B.A
Professional Certification: Wiley Big Data Analyst (USA), R Programming by Johns Hopkins University (USA), CCNA (Cisco), Data Science, R Language, Python, SQL, Big Data, MCP (Microsoft), DBMS (I.I.T, Mumbai-India), Brainbench Certified on (MS Access, MS Project, MySQL 5.7 Administration, Computer Fundamentals, Advanced Ms. Excel & Windows OS)
Website: http://mybigdataanalytics.in | Twitter: DrKamalGulati