Data Science Case Study — Credit Default Prediction: Part 1

Feature Engineering, Model Training and Evaluation, and Classification Threshold Selection

Saankhya Mondal
Towards AI


In financial institutions, credit default occurs when a borrower fails to fulfill their debt obligations, breaching the loan agreement. It represents the risk that a borrower will not repay their debt, impacting lenders and investors. Machine learning models are increasingly used for predictive modelling of credit default. Our task is to design a binary classifier that predicts whether a customer who has been granted a loan by a bank will default.

Image Generated Using Stable Diffusion

This article is Part 1 of the Data Science Case Study — Credit Default Prediction. Its main objective is to discuss feature engineering, how to choose the right metric for model evaluation, and how to select a classification threshold. Part 2 will focus on Explainable AI: it will discuss model explainability and how concepts borrowed from game theory, such as Shapley values, can help us better understand the predictions of our model.

Let’s jump into the feature engineering part!

Feature Engineering

The features used to predict credit default comprise a wide array of financial and transactional metrics, which provide a comprehensive view of an individual’s financial behavior and history. Suppose you are working for a fintech, banking, or financial services company. You’ll have access to various aspects of banking activity, such as

  1. Average transaction amounts.
  2. Frequency of transactions.
  3. Balances.
  4. Loan disbursements and liabilities.
  5. Credit and debit card usage, credit score.
  6. Missed payments.
  7. Loan application and approval history.
  8. Instances of returned checks, declined transactions, and defaults.

Feature Creation and Feature Aggregation

Let’s suppose we have credit card usage data. We can create features by aggregating the total amount transacted using a credit card, the number of credit card transactions, the average transaction amount using a credit card, and the number of times a credit card repayment was missed. Depending on the frequency (weekly/monthly/quarterly), we can aggregate the features over different time windows such as the last 7, 14, 28, 30, 90, 180, and 360 days, as well as lifetime. We can build the same kind of aggregations for other data types, such as debit card data and loan application data. These features capture trends in the customer’s financial habits and can help improve the model’s predictive power.
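As an illustration, here is a minimal sketch of such window-based aggregations with pandas. The file name and column names (customer_id, txn_date, amount, missed_payment) are hypothetical and assumed only for the example.

import pandas as pd

# Hypothetical raw credit card transactions: one row per transaction.
# Assumed columns: customer_id, txn_date, amount, missed_payment (0/1).
txns = pd.read_csv("credit_card_transactions.csv", parse_dates=["txn_date"])

reference_date = txns["txn_date"].max()

def aggregate_window(df, days):
    """Aggregate credit card usage over the trailing `days` window."""
    window = df[df["txn_date"] >= reference_date - pd.Timedelta(days=days)]
    agg = window.groupby("customer_id").agg(
        total_amount=("amount", "sum"),
        num_transactions=("amount", "count"),
        avg_transaction_amount=("amount", "mean"),
        num_missed_payments=("missed_payment", "sum"),
    )
    return agg.add_suffix(f"_last_{days}_days")

# Aggregate over several trailing windows and join into one feature table.
features = pd.concat(
    [aggregate_window(txns, d) for d in (7, 30, 90, 180, 360)], axis=1
).fillna(0)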

By collecting and analyzing these features, predictive models can identify patterns and correlations indicative of creditworthiness or risk of default. For instance, higher average credit transaction amounts, consistent closing balances, and a history of timely payments may suggest lower default risk, while frequent missed payments, high loan liabilities, and a pattern of declined transactions could signal high risk. In our dataset we have the following set of features —

Transaction Patterns:
  - Average ATM withdrawal amount per transaction
  - Average credit transaction amount
  - Average debit card transaction amount
  - Average debit transaction amount
  - Average daily closing balance
  - Monthly average balance
  - No. of transactions declined due to insufficient funds
  - No. of cheques returned due to insufficient funds

Loan and Debt Related:
  - Loan disbursement amount
  - Current loan liability
  - No. of loans disbursed and closed
  - No. of loan defaults, payments missed, and applications rejected

Credit Card and Banking Activity:
  - Credit bureau scores (FICO, Experian, CIBIL)
  - Count of credit and debit transactions
  - Number of ATM transactions
  - Total credit and debit transaction amounts
  - No. of credit card defaults and payments missed
  - No. of credit card applications rejected

Income:
  - Salary

Flag indicators for various types of loans, missed payments, and other financial events:
  - gold_loan
  - auto_loan
  - business_loan
  - education_loan
  - home_loan
  - salaried

Derived Features

We can enhance our dataset by preparing derived features. Suppose we have two features, avg_credit_per_transaction_90_days and avg_debit_per_transaction_90_days. We can take the credit/debit ratio to create a new feature. Similarly, we can take the ratio of the same feature over two different time periods, such as avg_missed_payment_amount_last_90_days / avg_missed_payment_amount_last_180_days. The idea is that the model can capture more nuanced customer behavior when two related features are combined into one.
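A minimal sketch of such ratio features, assuming the aggregated columns named above already exist in a features DataFrame (the epsilon guard is an illustrative choice, not part of the original pipeline):

eps = 1e-6  # guard against division by zero; value chosen only for illustration

# Ratio of credit to debit behavior over the same 90-day window.
features["credit_debit_ratio_90_days"] = (
    features["avg_credit_per_transaction_90_days"]
    / (features["avg_debit_per_transaction_90_days"] + eps)
)

# Ratio of the same feature over two different windows captures short-term
# versus long-term trends in missed payments.
features["missed_payment_ratio_90_180_days"] = (
    features["avg_missed_payment_amount_last_90_days"]
    / (features["avg_missed_payment_amount_last_180_days"] + eps)
)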

Data Pre-processing

The dataset contains around 4K rows and around 280 features, both numerical and categorical. In an industry setting, we would train the model on in-time data and test it on out-of-time (OOT) data. For the purpose of this article, we will split the data randomly into train and test sets.

  1. The data may have missing values and may contain outliers. It would be interesting to explore how to tackle these issues. You may want to fill missing values with a nearest-neighbors imputation technique or replace them with the mean, and remove outliers based on quantiles.
  2. Depending on the type of ML algorithm you want to use, you may need to standardize your numerical features and convert your categorical variables into one-hot or label encodings.
  3. Lastly, your data may be imbalanced. Defaults make up around 22% of this dataset. You may balance the dataset by under-sampling the majority class, by using SMOTE to synthetically generate minority-class samples, or by using class weights during training. You may also skip this step altogether and let the model learn the original distribution. A rough sketch of these pre-processing steps is shown below.
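This sketch uses scikit-learn and assumes the train/test split (X_train, X_test) created in the next section; the column lists, the imputation method, and the clipping quantiles are illustrative assumptions, not the exact pre-processing used here.

import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical column lists; in practice they come from the dataset schema.
num_cols = [c for c in X_train.columns if X_train[c].dtype != "object"]
cat_cols = [c for c in X_train.columns if X_train[c].dtype == "object"]

# Impute missing numerical values with a nearest-neighbors imputer.
imputer = KNNImputer(n_neighbors=5)
X_train[num_cols] = imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = imputer.transform(X_test[num_cols])

# Clip outliers to the 1st/99th percentiles computed on the training split.
lower, upper = X_train[num_cols].quantile(0.01), X_train[num_cols].quantile(0.99)
X_train[num_cols] = X_train[num_cols].clip(lower, upper, axis=1)
X_test[num_cols] = X_test[num_cols].clip(lower, upper, axis=1)

# For linear models or neural nets, standardize numericals and one-hot encode
# categoricals (e.g., pd.get_dummies). For LightGBM this is unnecessary;
# casting categoricals to the 'category' dtype is enough.
for c in cat_cols:
    X_train[c] = X_train[c].astype("category")
    X_test[c] = X_test[c].astype("category")

# Class imbalance can be handled with class weights at training time,
# e.g., lgb.LGBMClassifier(class_weight="balanced", ...), with majority-class
# under-sampling, or with SMOTE from the imbalanced-learn package.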

Model Training

Tree-based models such as XGBoost and gradient boosting are our best friends when it comes to tabular data, and they are quite popular in industry applications. I’ve used the LightGBM implementation of gradient boosting. Selecting which algorithm to use is beyond the scope of this article.

We will train a LightGBM model to predict the probability of default using the engineered features. Tree-based models are insensitive to outliers, can deal with missing values, and do not require standardization of numerical features. We will use the Optuna hyperparameter optimization library to tune the hyper-parameters of the LightGBM model. The following is a code snippet for the same —

import lightgbm as lgb
import optuna
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


# Hold out a test set, then carve a validation set out of the remaining data.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    df.drop('default', axis=1), df['default'],
    test_size=0.2, stratify=df['default'], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, stratify=y_train_val, random_state=42
)
print(X_train.shape, X_val.shape, X_test.shape)


def objective(trial):
    params = {
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
        'subsample_for_bin': trial.suggest_int('subsample_for_bin', 20000, 300000),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'max_depth': trial.suggest_int('max_depth', -1, 20),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-5, 1e2, log=True),
        'random_state': 42,
        'n_jobs': -1,
    }

    # cat_cols is the list of categorical feature columns in the dataset.
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              eval_metric='auc', categorical_feature=cat_cols)

    # Maximize validation ROC AUC.
    y_pred = model.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, y_pred)


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

best_params = study.best_params

# Retrain on train + validation data with the best hyper-parameters.
model = lgb.LGBMClassifier(**best_params)
model.fit(X_train_val, y_train_val, eval_set=[(X_test, y_test)],
          eval_metric='auc', categorical_feature=cat_cols)

Best Hyper-parameters obtained using Optuna —

{
'num_leaves': 112,
'learning_rate': 0.025903226539801448,
'n_estimators': 294,
'subsample_for_bin': 203049,
'min_child_samples': 93,
'reg_alpha': 7.284466282053885e-07,
'reg_lambda': 0.009140074090267514,
'colsample_bytree': 0.5767956241204047,
'subsample': 0.9312222959859626,
'max_depth': 0,
'min_child_weight': 0.010452731370112569
}

The following is a plot of split feature importance for the top 50 features. The number next to each feature counts how many times that feature is used to split the data across all trees in the model. According to the plot below, current_loan_liability_in_the_last_3_months is the most important feature.

“Split” Feature Importance Plot — Image by Author
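For reference, such a plot can be generated with LightGBM’s built-in plotting utility (a minimal sketch; the figure size and the top-50 cut-off are arbitrary choices):

import matplotlib.pyplot as plt
import lightgbm as lgb

# 'split' importance counts how many times each feature is used in a split.
lgb.plot_importance(model, importance_type='split', max_num_features=50, figsize=(8, 12))
plt.tight_layout()
plt.show()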

Model Evaluation

For binary classification, accuracy, precision, recall, F1-Score, and ROC AUC are some of the popular metrics for evaluation. In the case of credit default prediction, precision answers the question — Of all the instances predicted as defaults, how many were actually defaults? Recall answers the question — Of all the actual defaults, how many did the model correctly identify?

Precision or Recall?

We have two scenarios —

  1. We want to minimize the number of false positives (i.e., minimize the number of non-defaulters incorrectly classified as defaulters). This is particularly important if the cost of mistakenly identifying non-defaulters as defaulters is high, such as in the case of denying credit to low-risk customers. Precision is the metric we would want to maximize here.
  2. We want to capture as many actual defaulters as possible (i.e., minimize the number of false negatives). This is important if the cost of missing a potential defaulter is high, such as in the case of financial losses due to defaulted loans. Recall is the metric we would want to maximize here.

In most cases, financial institutions would prefer to trade off false positives for fewer false negatives. In other words, they would rather lose a few potential non-defaulters than offer loans to potential defaulters. In such cases, we pick Recall as the evaluation metric when selecting the best model. If we also don’t want to miss out on potential non-defaulters, we can use the F1-Score (the harmonic mean of Precision and Recall) and ROC AUC as our evaluation metrics. A sketch of computing these metrics is shown below.

ROC AUC Curve — Image by Author
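Here is a minimal sketch of computing these metrics on the held-out test set with scikit-learn; the 0.5 cut-off is just the default threshold and is revisited in the next section:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 cut-off, revisited below

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-Score: ", f1_score(y_test, y_pred))
print("ROC AUC:  ", roc_auc_score(y_test, y_prob))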

Classification Threshold

We have to decide on a classification threshold to calculate precision, recall, and F1-Score. A naïve choice would be 0.5. However, our dataset is imbalanced, and the model’s predicted probabilities may be skewed towards 0. We can obtain an optimal threshold using the Precision-Recall (PR) curve, which shows the trade-off between precision and recall at different thresholds. We can select a threshold that satisfies the business requirements on precision and recall (a sketch of this idea follows the figure), or choose the threshold that maximizes the F1-Score. Here, the threshold 0.2499 maximizes the F1-Score, so any customer with a model score above 0.2499 will be denied a loan.

Precision-Recall Curve, Optimal Threshold computed by maximizing F1-Score— Image by Author
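For example, if the business requirement were a minimum precision level, we could pick the lowest threshold that meets it. The sketch below illustrates this; the 0.80 target is purely an illustrative figure:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_prob = model.predict_proba(X_test)[:, 1]
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)

# precision/recall have one more element than pr_thresholds; drop the last point.
target_precision = 0.80  # illustrative business requirement
candidates = np.where(precision[:-1] >= target_precision)[0]
if len(candidates) > 0:
    threshold = pr_thresholds[candidates[0]]  # lowest threshold meeting the target
    print(f"Threshold {threshold:.4f} gives precision {precision[candidates[0]]:.2f} "
          f"and recall {recall[candidates[0]]:.2f}")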

Alternatively, the G-Mean (the geometric mean of Recall and Specificity) and Youden’s J statistic (the difference between the True Positive Rate and the False Positive Rate) are other metrics that can help us arrive at an optimal threshold.

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

y_prob = model.predict_proba(X_test)[:, 1]

precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_prob)

# F1-Score at each PR threshold (the last precision/recall point has no threshold).
f1_scores = (2 * precision * recall) / (precision + recall + 1e-12)
best_f1_threshold = pr_thresholds[np.argmax(f1_scores[:-1])]

# G-Mean and Youden's J statistic at each ROC threshold.
g_mean = np.sqrt(tpr * (1 - fpr))
youden_j = tpr - fpr
best_gmean_threshold = roc_thresholds[np.argmax(g_mean)]
best_youden_threshold = roc_thresholds[np.argmax(youden_j)]
ROC Curve, Optimal Threshold computed by maximizing G-Mean — Image by Author
ROC Curve, Optimal Threshold computed by maximizing Youden J Statistic— Image by Author

Our model can now make predictions when we feed it the features. Suppose the model predicts that a customer will not default on the credit/loan. The financial institution wants to understand why the model predicts what it predicts, so model explainability and interpretability become essential. Part 2 of this discussion will focus on exactly that.

Thank you for reading!
