Data Science
How To Efficiently Download Any Dataset from Kaggle
Using the Python library “opendatasets”
Table Of Contents
· Overview
· Step 1: Install the required libraries
· Step 2: Import the library
· Step 3: Get the data set URL from Kaggle
· Step 4: Get the Kaggle API token
· Step 5: Download the data set files
· Step 6: List the files
· Step 7: Read data set content
· Conclusion
Overview
Kaggle is a household name for any Data Scientist: a platform that provides a free environment to compete, collaborate, learn, and share your work.
Beyond that, it gives you the opportunity to practice your skills on real-world datasets in domains ranging from Education to Finance. Whether you are a novice Data Analyst or an experienced Scientist, you can choose from the wide range of data sets available on Kaggle.
Often, we need to download data sets to work in our local environment or on a platform other than Kaggle, such as Google Colaboratory, Binder, or Amazon SageMaker. The most common approach is to manually download the data sets from Kaggle and upload them to the platform we are working on; this can be tedious and time-consuming, especially when the data set is large.
In this post, we will go through the simple steps by which, using an existing Python library, we can download a data set from Kaggle into any working platform of our choice.
Step 1: Install the required libraries
Here we use a Python library called opendatasets.
Let's install opendatasets. If you wish, you can also install any other libraries you might need, such as pandas.
!pip install opendatasets --upgrade --quiet
Step 2: Import the library
Here we import the required libraries; we just need a few to download and view the data sets along with opendatasets
import pandas as pd
import os
import opendatasets as od
Step 3: Get the data set URL from Kaggle
Next, we get the Kaggle URL for the specific data set we want to download. We chose the DL Course data set for this post, but you can pick any one of your choice. The total size of the data set we are downloading is ~586 MB.
Step 4: Get the Kaggle API token
Before we start downloading the data set, we need a Kaggle API token. To get one:
- Log in to your Kaggle account
- Go to your account settings page
- Click on Create New API Token
- This prompts you to download a kaggle.json file to your system. Save the file; we will use it in the next step.
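As an alternative to typing credentials at the prompt later, opendatasets also looks for a kaggle.json file in the current working directory. A minimal sketch that recreates the file there (the placeholder values are assumptions; substitute the username and key from your own downloaded token):

```python
import json

# Sketch: place the Kaggle credentials in ./kaggle.json so that
# opendatasets can pick them up without prompting.
# Replace the placeholders with the values from your downloaded token file.
creds = {"username": "<userID>", "key": "<userKey>"}

with open("kaggle.json", "w") as f:
    json.dump(creds, f)
```

If you skip this, the download step below will simply prompt you for the same two values interactively.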
Step 5: Download the data set files
Now that we have all the required information, let's download the data set from Kaggle.
# Assign the Kaggle data set URL to a variable
dataset = 'https://www.kaggle.com/ryanholbrook/dl-course-data'

# Use opendatasets to download the data set
od.download(dataset)
After running the above statements, you will be prompted for your Kaggle username and then the key. You can get both from the kaggle.json file you downloaded earlier; its content looks something like this:
{"username":"<userID>","key":"<userKey>"}
After you provide the credentials, the data set files are downloaded into your working environment (local or any other platform). If the download succeeds, the message looks something like this:
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: <userID>
Your Kaggle Key: ········
Downloading dl-course-data.zip to ./dl-course-data
100%|██████████| 231M/231M [00:03<00:00, 64.4MB/s]
For the data set we are working with, the related files are downloaded into a directory named ./dl-course-data (you can see the directory name in the message above). Now that we have all the files in our working environment, let's list them.
Step 6: List the files
We will use the os library to list the files inside the directory, in our case dl-course-data.
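A minimal sketch of this step with os.listdir; the directory name comes from the download message above:

```python
import os

def list_dataset_files(data_dir):
    """Return a sorted list of the files inside the dataset directory."""
    return sorted(os.listdir(data_dir))

# e.g. list_dataset_files('./dl-course-data')
```

Sorting the names just makes the listing easier to scan; os.listdir alone returns them in arbitrary order.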
Great! We have successfully downloaded the data set files from Kaggle. There are quite a few data sets (CSV files) to start your analysis with. Let's read the contents of one of them using pandas.
Step 7: Read data set content
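A small sketch of this step using pandas.read_csv; the file path in the usage comment is a placeholder, so substitute one of the CSV names from the listing in the previous step:

```python
import pandas as pd

def peek_csv(path, n=5):
    """Read a CSV file into a DataFrame and show its shape and first rows."""
    df = pd.read_csv(path)
    print(df.shape)    # (rows, columns)
    print(df.head(n))  # first n rows
    return df

# e.g. peek_csv('./dl-course-data/<one-of-the-csv-files>.csv')
```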
The contents of the CSV files look good. You can further load any other file required for Data Analysis and get valuable insights from it.
Conclusion
Although most of the time we use Kaggle itself as the platform to do our Data Analysis and collaborate with other users, whenever you need to bring data from Kaggle into platforms like Google Colab or Jupyter Notebooks, the above steps will come in handy.
I hope you learned something new from this. If so, do provide your comments, and do not forget to clap.
Thanks for reading, and Happy coding until next time !!