Data Science

Exploratory Data Analysis — Loan Investment Case Study — Part 1

Data preprocessing and cleaning the dataset using Pandas — Getting ready to analyze data!

Ravi Chandra

--

Index Of Contents
· What is Exploratory Data Analysis
· Project Outline
· Download and read the dataset
· Data Preprocessing & Cleaning
Data Cleansing: DataType changes and fixing column values
· Summary statistics of relevant columns and filling the missing values
Column: revol_util
Column: pub_rec_bankruptcies
Column: home_ownership
Column: emp_length
Column: issue_d
Calculate the Return of Investment (ROI)
· Conclusion

This project is about a consumer finance company that specialized in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

What is Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of exploring, investigating, and gathering meaningful insights and nuggets using different kind of statistical measures and visualizations. The objective of EDA is to develop an understanding of data by uncovering trends, relationships, and patterns.

When it comes to the requirement of statistical knowledge, visualization technique, and data analysis tools like Numpy, Pandas, Matplotlib, etc. we categorize it as an art. When there is a requirement of asking interesting questions to guide the investigation for generating meaningful insight we call it a science. So it is a mixture of both art and science.

Project Outline

  • Download and read the dataset.
  • Data Processing & Cleaning with Pandas
  • Exploratory Analysis and Visualization
  • Asking and Answering Questions

As part of this post, we will concentrate on the initial two steps i.e Downloading along with preprocessing the data set so that we can use it further to analyze and draw insights on Loan defaulters.

Download and read the dataset

Let’s start by installing the required libraries and importing the useful modules.

Load the CSV file using Pandas library and store the data into a dataframe named as loan.

To look at all the columns in the dataframe we can use columns property.

Data Preprocessing & Cleaning

As we can see there are a number of columns in the dataframe but in this project, we will limit our analysis to only a few columns. So let’s check the missing values and remove unnecessary columns

It seems we have so many columns with 100% missing values and these columns don’t provide us any information. Let’s look at the graphical representation of columns with missing values.

From the above graph we can see that there are many (around 30+) columns which are having more than 30% of their values as Null. So these columns are not useful and hence we will remove them from our dataframe

Let’s drop the columns with more than 30% missing values

Although we have removed few columns based on NULL values, let’s also drop unnecessary columns which we are not using for analysis

Now that we have removed the non relevant columns and many columns which had missing values from our dataframe, Let’s update the data type formats are columns which we will be use for our analysis further in next section.

Data Cleansing: DataType changes and fixing column values

Let’s convert issue_id and last_pymnt_d columns into datetime data type

Similarly let’s remove % and change the datatype to float of columns int_rate and revol_util

There are extra spaces in term column let’s change it and do some cleaning

Now that we have made the changes to the column. Let’s check a few records in the data set

Looks good, in the next section we will do some statistics of numerical columns and fill the missing values

Summary statistics of relevant columns and filling the missing values

Column: revol_util

Since there is no much spread of data and the difference between mean and median is very small, let’s impute the missing values with mean

Column: pub_rec_bankruptcies

Let’s convert to Category type

Display frequency of each column

The most common value of pub_rec_bankruptcies is 0, so let’s impute the missing values

Column: home_ownership

Unique values of home_ownership

Check the number of loans by home_ownership

Replace the ‘NONE’ with ‘RENT’

Column: emp_length

Unique values of emp_length

Check the number of loans by emp_length

Let’s replace mode of em_length column and change the data type

Column: issue_d

Extract the month, year from issue_d column

Calculate the Return of Investment (ROI)

Return on investment (ROI) is a widely used financial metric for measuring the probability of gaining a return from an investment. It is a ratio that compares the gain or loss from an investment relative to its cost. It is as useful in evaluating the potential return from a stand-alone investment as it is in comparing returns from several investments.

Let’s calculate ROI based on data available for us which we will use for analysis later on.

ROI = (Final Value Of Investment — Initial Value Of Investment / Cost of Investment) — 1

For further analysis, let’s filter the loan data by ignoring loan_status of ‘Current’, since these are currently running loans.

Finally, we have preprocessed and cleaned our data so that we can get some insights and nuggets out of the data in the next post.

Conclusion

As part of this post we have

  • Downloaded and read the dataset.
  • Data Processing & Cleaning with Pandas

In the next post, we will use the cleaned data to do Exploratory Analysis and answer the questions related to the data

If you’ve made it this far, thank you for reading and if you enjoyed reading this post, consider dropping a clap and a follow. I post articles on interesting Python topics and write beginner-friendly Python tutorials.

Until next time…Happy coding !!

--

--

Ravi Chandra

Full Stack Developer | Java | Python | Machine Learning | Data Science | Self-Development