Why are tree models preferred in credit risk modeling?
Credit risk modeling is an area where machine learning can offer analytical solutions, because it can extract answers from large amounts of heterogeneous data. In credit risk modeling, it is also necessary to interpret the features, as they are very important in data-driven decision making. In this article, we will look at what credit risk is and how it can be modeled using various machine learning algorithms. We will implement credit risk modeling with different machine learning models and see how tree models outperform other models in this task. Here are the main points to discuss.
- What is credit risk
- What is Credit Risk Modeling
- How is machine learning used in credit risk modeling?
- Implementation of credit risk modeling
- The outperformance of tree-based models
Let’s start the discussion by understanding what credit risk is.
What is credit risk
Credit risk refers to the likelihood that a borrower will not be able to make regular payments and will default on their obligations. In other words, it is the possibility that a lender will not be repaid the principal or interest on time. When that happens, the lender's cash flow is disrupted and the cost of collection increases. In the worst case, the lender may be forced to write off all or part of the loan, resulting in a loss.
It is incredibly difficult and complex to predict the likelihood of someone defaulting on a debt. At the same time, an appropriate credit risk assessment can help limit the risk of losses due to defaults and late payments. In return for taking credit risk, the lender receives interest from the borrower.
The lender or investor will charge a higher interest rate or refuse to grant the loan if the credit risk is higher. For the same loan, a loan seeker with a strong credit history and regular income will be charged a lower interest rate than an applicant with a bad credit history.
What is Credit Risk Modeling
A person's credit risk is influenced by a variety of factors, so determining a borrower's credit risk is a difficult endeavor. Credit risk modeling has come into the picture because so much money depends on our ability to predict a borrower's credit risk accurately. Credit risk modeling involves applying data models to determine two key factors. The first is the probability that the borrower will default on the loan. The second is the financial impact on the lender in the event of default.
Credit risk models are used by financial organizations to assess the credit risk of potential borrowers. Based on the output of the credit risk model, they decide whether or not to approve a loan, as well as the loan's interest rate.
New ways to estimate credit risk have emerged as technology has advanced, such as modeling credit risk in R and Python with the latest big data and analytics techniques. Other factors, such as economic growth and the emergence of new categories of credit risk, have also influenced credit risk modeling.
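The two factors above are often combined into a single expected-loss figure. The sketch below is only an illustration: it assumes the common convention of multiplying the probability of default (PD) by the share of the exposure lost on default (LGD) and by the outstanding amount (EAD). The function name and the numbers are hypothetical and do not come from the dataset used later.

```python
def expected_loss(pd_default: float, lgd: float, ead: float) -> float:
    """Expected loss = PD x LGD x EAD (a common convention, assumed here)."""
    if not 0.0 <= pd_default <= 1.0:
        raise ValueError("pd_default must be a probability between 0 and 1")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("lgd must be a fraction between 0 and 1")
    return pd_default * lgd * ead


# Hypothetical loan: 5% chance of default, 40% of the exposure lost
# if the borrower defaults, 10,000 currently outstanding.
loss = expected_loss(0.05, 0.40, 10_000)
print(round(loss, 2))  # prints 200.0
```

A model that predicts the probability of default, like the ones built below, supplies the first input; the other two come from the loan terms and recovery history.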
How is machine learning used in credit risk modeling?
Machine learning allows the use of more advanced modeling approaches such as decision trees and neural networks. This introduces non-linearities into the model, allowing the discovery of more complex relationships between variables. We chose to use an XGBoost model fed with features selected using the permutation importance technique.
ML models, on the other hand, are often so complex that they are difficult to understand. We chose to combine XGBoost and logistic regression because interpretability is essential in a highly regulated industry like credit risk assessment.
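As a rough illustration of permutation-based feature selection, the sketch below shuffles one feature at a time on held-out data and measures how much the model's score drops. It makes several assumptions: it uses synthetic data, scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, and an arbitrary 0.01 cut-off for keeping a feature.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, of which only 3 carry signal
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Gradient-boosted trees (stand-in for XGBoost in this sketch)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 5 times on the validation set and record the score drop
result = permutation_importance(model, X_val, y_val,
                                n_repeats=5, random_state=0)

# Keep features whose mean importance exceeds an assumed threshold
selected = [i for i, imp in enumerate(result.importances_mean) if imp > 0.01]
print(selected)
```

The selected column indices could then be passed to a simpler, more interpretable model such as logistic regression.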
Implementation of credit risk modeling
Modeling credit risk in Python can help banks and other financial institutions reduce risk and prevent financial disasters in society. The purpose of this article is to create a model that can predict the likelihood that someone will default on a loan. Let’s start by loading the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# load the data
loan_data = pd.read_csv('/content/drive/MyDrive/data/loan_data_2007_2014.csv')
When you examine the Colab notebook for this implementation, you will find that many columns are identifiers and do not include any meaningful information for building our machine learning model. The id and member_id columns are examples. Remember, we want to build a model that predicts the likelihood of a borrower defaulting on a loan, so we won't need features related to events that occur after a person defaults, because this information is not available at the time of loan approval. Recoveries and collection recovery fees are examples of such features. The code below lists the columns that were dropped.
# dropping irrelevant columns
columns_to_drop = ['id', 'member_id', 'sub_grade', 'emp_title', 'url', 'desc', 'title',
                   'zip_code', 'next_pymnt_d', 'recoveries', 'collection_recovery_fee',
                   'total_rec_prncp', 'total_rec_late_fee', 'mths_since_last_record',
                   'mths_since_last_major_derog', 'annual_inc_joint', 'dti_joint',
                   'verification_status_joint', 'open_acc_6m', 'open_il_6m', 'open_il_12m',
                   'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util',
                   'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi',
                   'total_cu_tl', 'inq_last_12m', 'policy_code']
loan_data.drop(columns=columns_to_drop, inplace=True)

# drop rows with missing values
loan_data.dropna(inplace=True)
As you may know, multicollinearity should be avoided when preparing the data: highly correlated variables carry the same information and are therefore redundant. If we do not remove them, the models will struggle to estimate the relationship between the dependent and independent variables.
To check for multicollinearity, we will draw a heat map of the correlation matrix obtained using the pandas corr() method. The heat map is shown below.
As can be seen, several variables are strongly correlated and must be eliminated. 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'total_pymnt_inv' and 'out_prncp_inv' are multicollinear variables.
If you browse through the notebook, you will notice that several variables have the wrong data types and need to be preprocessed into the correct format. We are going to define some functions to help automate this process. The functions used to convert these variables are shown below.
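One way to turn the heat map into a concrete list of columns to drop is to scan the correlation matrix for pairs above a cut-off. The sketch below is only an illustration on a small synthetic table: the values are made up, annual_inc is included purely as an uncorrelated example column, and the 0.9 threshold is an assumption rather than the value used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
loan_amnt = rng.uniform(1_000, 35_000, n)

# Synthetic stand-in for the loan data (values are made up)
toy = pd.DataFrame({
    "loan_amnt": loan_amnt,
    # almost a copy of loan_amnt, so the two are highly correlated
    "funded_amnt": loan_amnt * 0.98 + rng.normal(0, 100, n),
    # generated independently of the loan amount
    "annual_inc": rng.uniform(20_000, 150_000, n),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Passing the same `corr` matrix to `sns.heatmap(corr, annot=True)` draws the heat map itself.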
def term_numeric(data, col):
    # strip the ' months' suffix and convert the column to numbers
    data[col] = pd.to_numeric(data[col].str.replace(' months', '', regex=False))

term_numeric(loan_data, 'term')

def emp_length_convert(data, col):
    # regex=False makes '+' a literal character rather than a regex quantifier
    data[col] = data[col].str.replace('+ years', '', regex=False)
    data[col] = data[col].str.replace('< 1 year', str(0), regex=False)
    data[col] = data[col].str.replace(' years', '', regex=False)
    data[col] = data[col].str.replace(' year', '', regex=False)
    data[col] = pd.to_numeric(data[col])
    data[col] = data[col].fillna(0)

def date_columns(data, col):
    today_date = pd.to_datetime('2020-08-01')
    data[col] = pd.to_datetime(data[col], format="%b-%y")
    # months elapsed, using the average month length (30.436875 days)
    # that np.timedelta64(1, 'M') stands for
    months = ((today_date - data[col]).dt.days / 30.436875).round()
    # two-digit years can parse into the future and yield negative values;
    # replace those with the column maximum
    data['mths_since_' + col] = months.mask(months < 0, months.max())
    data.drop(columns=[col], inplace=True)
In our dataset, the target column is the loan status (loan_status), which has several unique values. These values must be converted to binary: 0 for a bad borrower and 1 for a good borrower. In our case, a bad borrower is someone whose loan status is one of the following: Charged Off, Default, Late (31-120 days), or Does not meet the credit policy. Status:Charged Off. The remaining borrowers are considered good borrowers.
# creating a new column based on the loan_status
loan_data['good_bad'] = np.where(
    loan_data.loc[:, 'loan_status'].isin([
        'Charged Off',
        'Default',
        'Late (31-120 days)',
        'Does not meet the credit policy. Status:Charged Off',
    ]),
    0, 1)

# Drop the original 'loan_status' column
loan_data.drop(columns=['loan_status'], inplace=True)
We now have other categorical variables that need to be converted to numbers for modeling. For this, we will use the LabelEncoder class from the sklearn library, as shown below.
categorical_column = loan_data.select_dtypes('object').columns
for col in categorical_column:
    # fit a separate encoder per column and replace categories with integer codes
    le = LabelEncoder()
    loan_data[col] = le.fit_transform(loan_data[col])
Now we are ready to train the different algorithms and check which one works best. Here, we evaluate a linear model, a nearest-neighbor model, two tree models and a Naive Bayes model. We will cross-validate using KFold with 10 folds and check the average accuracy across those folds.
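# train/test split (assumed 80/20 with a fixed seed; adjust as needed)
X = loan_data.drop(columns=['good_bad'])
y = loan_data['good_bad']
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# compare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))

results = []
names = []
for name, model in models:
    # 10-fold cross-validation; the default score for classifiers is accuracy
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)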
The outperformance of tree models
As we can see from the average accuracies above, the tree models performed much better than the others. Tree-based algorithms give prediction models high accuracy, stability and interpretability. Unlike linear models, they capture nonlinear relationships quite well. They can also be applied to both classification and regression problems.
Because building a decision tree requires no domain knowledge and little parameter tuning, it is well suited to exploratory knowledge discovery. Decision trees can also handle multidimensional data.
Attribute selection measures are used during tree construction to choose the attribute that best splits the tuples into distinct classes. Many branches of a tree may reflect noise or outliers in the training data. Tree pruning aims to identify and remove these branches in order to improve classification accuracy on unseen data.
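To make the idea of an attribute selection measure concrete, here is a minimal sketch of Gini impurity, one commonly used measure. The helper names and the toy labels (1 for a good borrower, 0 for a bad one) are assumptions for illustration; scikit-learn computes this internally when a tree is trained with its default criterion.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split, weighting each side by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy node with four good (1) and four bad (0) borrowers
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# Split A separates the classes perfectly; split B leaves both sides mixed
split_a = ([1, 1, 1, 1], [0, 0, 0, 0])
split_b = ([1, 1, 0, 0], [1, 1, 0, 0])

gain_a = gini(parent) - weighted_gini(*split_a)
gain_b = gini(parent) - weighted_gini(*split_b)
print(gain_a, gain_b)  # prints 0.5 0.0
```

The tree builder would prefer the attribute behind split A, because it reduces impurity the most.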
In addition, in applications such as credit risk modeling, feature importance plays a very important role because it shows what drives the predictions. Using decision trees and other tree-based algorithms, we can obtain feature importance plots and refine the models accordingly. Below you can see the feature importance plot given by the decision tree algorithm.
In many types of data science problems, methods such as decision trees, random forests, and gradient boosting are often used.
In this article, we discussed credit risk and credit risk modeling. We looked at the factors affecting credit risk and saw how ML can be used to model it in place of conventional methods. We then walked through a practical implementation in which we tested various models and found that tree-based algorithms performed better, which is why they are preferred for such tasks.
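A feature importance ranking like the one described can be read from a fitted tree's `feature_importances_` attribute. The sketch below is only an illustration: it uses synthetic data and hypothetical feature names rather than the loan dataset, and the depth limit is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 features, of which 2 carry signal
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, random_state=1)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Shallow tree, assumed depth of 4 for readability
tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(tree.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would draw the ranking as a horizontal bar chart.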