Auto Loan Default Prediction Using SEC EDGAR ABS-EE Data¶

Objective: Build a credit scoring model for auto loan default prediction using publicly available loan-level data from SEC EDGAR Asset-Backed Securities (ABS-EE) filings.

Data Source: Ally Auto Receivables Trust filings (2024), containing loan-level attributes for ~168,000 auto loans including borrower credit scores, loan terms, vehicle information, payment history, and delinquency status.

Approach:

  1. Acquire loan-level data from SEC EDGAR EFTS API
  2. Define a rigorous default target variable
  3. Engineer predictive features from raw loan attributes
  4. Train and compare three model families: Logistic Regression, Gradient Boosting (XGBoost), and a Neural Network
  5. Evaluate using credit-scoring metrics: AUC, Kolmogorov-Smirnov statistic, and Gini coefficient
  6. Conduct fairness analysis across geography and credit tiers
In [1]:
import sys
sys.path.insert(0, "../src")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, confusion_matrix
from sklearn.calibration import calibration_curve
import xgboost as xgb
import torch
import torch.nn as nn
import json
import pickle
import warnings
warnings.filterwarnings("ignore")

sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
plt.rcParams["figure.dpi"] = 120

from features import load_raw, engineer_target, engineer_features, select_model_features
from train import DefaultNet

1. Data Overview¶

The raw data comes from SEC EDGAR ABS-EE filings -- standardized disclosures required for asset-backed securities. Each filing contains an EX-102 XML attachment with loan-level data for every asset in the trust pool. We pulled three Ally Auto Receivables Trust filings from 2024.
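Data acquisition happens outside this notebook, but the sketch below illustrates the idea: query the EDGAR full-text search (EFTS) endpoint for ABS-EE filings, then flatten the EX-102 XML into one row per loan. The endpoint parameters, the placeholder User-Agent, and the assumption that each loan record's child elements carry the field names listed in Section 1 are unverified here -- treat this as a starting point, not the pipeline actually used.

# Sketch only: acquisition of ABS-EE loan-level data (not the production pipeline).
import requests
import xml.etree.ElementTree as ET
import pandas as pd

HEADERS = {"User-Agent": "research-project contact@example.com"}  # SEC asks for a real contact here

def search_absee_filings(query='"Ally Auto Receivables Trust"'):
    """Query EDGAR full-text search for ABS-EE filings; parameter names assumed from the search UI."""
    resp = requests.get(
        "https://efts.sec.gov/LATEST/search-index",
        params={"q": query, "forms": "ABS-EE"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def parse_ex102(xml_bytes):
    """Flatten every XML element that has an assetNumber child into one row per loan."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for node in root.iter():
        children = {c.tag.split("}")[-1]: (c.text or "").strip() for c in node}  # drop namespaces
        if "assetNumber" in children:
            rows.append(children)
    return pd.DataFrame(rows)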

In [2]:
raw = load_raw()
print(f"Total loan records: {len(raw):,}")
print(f"Unique loans (by assetNumber): {raw['assetNumber'].nunique():,}")
print(f"Filings: {raw['filing_date'].nunique()}")
print(f"Reporting periods: {raw['period_ending'].unique().tolist()}")
print(f"\nColumns ({len(raw.columns)}):")
print(raw.dtypes.to_string())
Total loan records: 168,643
Unique loans (by assetNumber): 168,643
Filings: 3
Reporting periods: ['2024-06-30', '2024-05-31', '2024-11-30']

Columns (40):
assetNumber                                      str
reportingPeriodBeginningDate                     str
reportingPeriodEndingDate                        str
originatorName                                   str
originationDate                                  str
originalLoanAmount                           float64
originalLoanTerm                               int64
loanMaturityDate                                 str
originalInterestRatePercentage               float64
originalInterestRateTypeCode                     str
vehicleManufacturerName                          str
vehicleModelName                                 str
vehicleNewUsedCode                               str
vehicleModelYear                                 str
vehicleTypeCode                                  str
vehicleValueAmount                           float64
obligorCreditScore                           float64
obligorCreditScoreType                           str
coObligorIndicator                               str
paymentToIncomePercentage                    float64
obligorGeographicLocation                        str
remainingTermToMaturityNumber                float64
reportingPeriodBeginningLoanBalanceAmount    float64
reportingPeriodActualEndBalanceAmount        float64
reportingPeriodScheduledPaymentAmount        float64
totalActualAmountPaid                        float64
actualInterestCollectedAmount                float64
actualPrincipalCollectedAmount               float64
reportingPeriodInterestRatePercentage        float64
scheduledInterestAmount                      float64
scheduledPrincipalAmount                     float64
currentDelinquencyStatus                     float64
repossessedIndicator                             str
chargedoffPrincipalAmount                    float64
recoveredAmount                              float64
zeroBalanceCode                                  str
primaryLoanServicerName                          str
subvented                                        str
filing_date                                      str
period_ending                                    str

2. Exploratory Data Analysis¶

In [3]:
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Credit score distribution
axes[0, 0].hist(raw["obligorCreditScore"].dropna(), bins=60, color="#4C72B0", edgecolor="white", alpha=0.85)
axes[0, 0].set_xlabel("Credit Score")
axes[0, 0].set_ylabel("Count")
axes[0, 0].set_title("Credit Score Distribution")
axes[0, 0].axvline(x=660, color="red", linestyle="--", alpha=0.7, label="Subprime threshold (660)")
axes[0, 0].legend(fontsize=9)

# Original loan amount
axes[0, 1].hist(raw["originalLoanAmount"].dropna(), bins=60, color="#55A868", edgecolor="white", alpha=0.85)
axes[0, 1].set_xlabel("Original Loan Amount ($)")
axes[0, 1].set_ylabel("Count")
axes[0, 1].set_title("Loan Amount Distribution")
axes[0, 1].xaxis.set_major_formatter(mtick.StrMethodFormatter("${x:,.0f}"))

# Interest rate
axes[0, 2].hist(raw["originalInterestRatePercentage"].dropna() * 100, bins=50, color="#C44E52", edgecolor="white", alpha=0.85)
axes[0, 2].set_xlabel("Interest Rate (%)")
axes[0, 2].set_ylabel("Count")
axes[0, 2].set_title("Interest Rate Distribution")

# Original loan term
term_counts = raw["originalLoanTerm"].value_counts().sort_index()
axes[1, 0].bar(term_counts.index, term_counts.values, color="#8172B2", edgecolor="white", alpha=0.85)
axes[1, 0].set_xlabel("Loan Term (months)")
axes[1, 0].set_ylabel("Count")
axes[1, 0].set_title("Loan Term Distribution")

# Vehicle type
vtype_map = {"1": "Car", "2": "Truck", "3": "SUV", "4": "Motorcycle"}
vtype_counts = raw["vehicleTypeCode"].map(vtype_map).value_counts()
axes[1, 1].bar(vtype_counts.index, vtype_counts.values, color="#CCB974", edgecolor="white", alpha=0.85)
axes[1, 1].set_xlabel("Vehicle Type")
axes[1, 1].set_ylabel("Count")
axes[1, 1].set_title("Vehicle Type Distribution")

# New vs Used
nu_map = {"1": "New", "2": "Used"}
nu_counts = raw["vehicleNewUsedCode"].map(nu_map).value_counts()
axes[1, 2].bar(nu_counts.index, nu_counts.values, color="#64B5CD", edgecolor="white", alpha=0.85)
axes[1, 2].set_xlabel("Vehicle Condition")
axes[1, 2].set_ylabel("Count")
axes[1, 2].set_title("New vs Used")

plt.tight_layout()
plt.savefig("../figures/eda_distributions.png", bbox_inches="tight")
plt.show()
In [4]:
# Geographic distribution - top 15 states
state_counts = raw["obligorGeographicLocation"].value_counts().head(15)

fig, ax = plt.subplots(figsize=(12, 5))
ax.barh(state_counts.index[::-1], state_counts.values[::-1], color="#4C72B0", edgecolor="white")
ax.set_xlabel("Number of Loans")
ax.set_title("Loan Volume by State (Top 15)")
for i, v in enumerate(state_counts.values[::-1]):
    ax.text(v + 200, i, f"{v:,}", va="center", fontsize=9)
plt.tight_layout()
plt.savefig("../figures/geographic_distribution.png", bbox_inches="tight")
plt.show()
In [5]:
import plotly.express as px
import plotly.io as pio

# Aggregate by state
state_stats = raw.copy()
state_stats = engineer_target(state_stats)
state_agg = state_stats.groupby("obligorGeographicLocation").agg(
    loan_count=("assetNumber", "count"),
    default_rate=("default", "mean"),
    avg_credit_score=("obligorCreditScore", "mean"),
    avg_loan_amount=("originalLoanAmount", "mean"),
).reset_index()
state_agg.columns = ["state", "loan_count", "default_rate", "avg_credit_score", "avg_loan_amount"]

# Loan volume map
fig1 = px.choropleth(
    state_agg, locations="state", locationmode="USA-states",
    color="loan_count", color_continuous_scale="Blues",
    scope="usa", title="Loan Volume by State",
    labels={"loan_count": "Loans"},
)
fig1.update_layout(margin=dict(l=0, r=0, t=40, b=0), width=800, height=450)
fig1.write_image("../figures/map_loan_volume.png", scale=2)
fig1.write_html("../docs/map_loan_volume.html", include_plotlyjs="cdn")
fig1.show()

# Default rate map
fig2 = px.choropleth(
    state_agg, locations="state", locationmode="USA-states",
    color="default_rate", color_continuous_scale="RdYlGn_r",
    scope="usa", title="Default Rate by State",
    labels={"default_rate": "Default Rate"},
    range_color=[0, state_agg["default_rate"].quantile(0.95)],
)
fig2.update_layout(margin=dict(l=0, r=0, t=40, b=0), width=800, height=450)
fig2.write_image("../figures/map_default_rate.png", scale=2)
fig2.write_html("../docs/map_default_rate.html", include_plotlyjs="cdn")
fig2.show()

3. Target Variable: Defining Default¶

We define default using the standard ABS credit analysis convention. A loan is considered in default if any of the following conditions hold during the reporting period:

$$ \text{Default}_i = \mathbb{1}\left[\text{DPD}_i \geq 60 \;\lor\; \text{Repossessed}_i = \text{true} \;\lor\; \text{ChargeOff}_i > 0\right] $$

where:

  • $\text{DPD}_i$ is days past due, derived from currentDelinquencyStatus, which reports delinquency in 30-day increments
  • A status of 2 therefore corresponds to 60+ days delinquent, the conventional threshold for "serious delinquency"
  • Repossession and charge-off are terminal default events
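engineer_target is imported from ../src/features.py; a minimal sketch of the logic it presumably applies, given the definition above (the string comparison for repossessedIndicator is an assumption about how the flag is encoded):

# Sketch of the default definition above -- not the actual engineer_target in features.py.
def engineer_target_sketch(df):
    df = df.copy()
    # currentDelinquencyStatus counts 30-day buckets, so a value >= 2 means 60+ days past due
    df["is_delinquent_60plus"] = (df["currentDelinquencyStatus"].fillna(0) >= 2).astype(int)
    df["is_repossessed"] = df["repossessedIndicator"].astype(str).str.lower().eq("true").astype(int)
    df["is_chargedoff"] = (df["chargedoffPrincipalAmount"].fillna(0) > 0).astype(int)
    # Default if any of the three conditions holds
    df["default"] = df[["is_delinquent_60plus", "is_repossessed", "is_chargedoff"]].max(axis=1)
    return df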
In [6]:
df = engineer_target(raw.copy())

print(f"Default rate: {df['default'].mean():.4f} ({df['default'].sum():,} / {len(df):,})")
print(f"\nDefault composition:")
print(f"  60+ DPD:      {df['is_delinquent_60plus'].sum():>6,}")
print(f"  Repossessed:  {df['is_repossessed'].sum():>6,}")
print(f"  Charged off:  {df['is_chargedoff'].sum():>6,}")
print(f"  Total unique: {df['default'].sum():>6,}")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Default rate
labels = ["Performing", "Default"]
sizes = [1 - df["default"].mean(), df["default"].mean()]
colors = ["#55A868", "#C44E52"]
axes[0].pie(sizes, labels=labels, colors=colors, autopct="%1.1f%%", startangle=90, textprops={"fontsize": 12})
axes[0].set_title("Default Rate")

# Delinquency distribution
delinq = df["currentDelinquencyStatus"].fillna(0).clip(upper=6)
delinq_counts = delinq.value_counts().sort_index()
delinq_labels = {0: "Current", 1: "30 DPD", 2: "60 DPD", 3: "90 DPD", 4: "120 DPD", 5: "150 DPD", 6: "180+ DPD"}
axes[1].bar(
    [delinq_labels.get(k, str(k)) for k in delinq_counts.index],
    delinq_counts.values,
    color=["#55A868"] + ["#4C72B0"] + ["#C44E52"] * 5,
    edgecolor="white"
)
axes[1].set_ylabel("Count")
axes[1].set_title("Delinquency Status Distribution")
axes[1].set_yscale("log")

plt.tight_layout()
plt.savefig("../figures/target_distribution.png", bbox_inches="tight")
plt.show()
Default rate: 0.0618 (10,416 / 168,643)

Default composition:
  60+ DPD:       9,392
  Repossessed:     972
  Charged off:     113
  Total unique: 10,416

4. Feature Engineering¶

We construct 22 features from the raw loan attributes. Key engineered features include:

| Feature | Formula | Intuition |
| --- | --- | --- |
| LTV Ratio | $\text{LTV}_i = \frac{\text{LoanAmount}_i}{\text{VehicleValue}_i}$ | Higher LTV = more underwater risk |
| Balance Ratio | $\beta_i = \frac{\text{CurrentBalance}_i}{\text{OriginalAmount}_i}$ | How much principal remains |
| Payment Shortfall | $\max(0,\; \text{Scheduled}_i - \text{ActualPaid}_i)$ | Dollar amount of underpayment |
| Seasoning | $s_i = \frac{\text{OriginalTerm}_i - \text{RemainingTerm}_i}{\text{OriginalTerm}_i}$ | Loan maturity progress |
| Rate Spread | $r_i - \tilde{r}$ | Deviation from median portfolio rate |
| Vehicle Age at Origination | $\text{OrigYear}_i - \text{ModelYear}_i$ | Depreciation exposure |
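The features are built in engineer_features (../src/features.py). An illustrative sketch of a few of the columns in the table above is shown here; the actual implementation may handle missing values and edge cases (zero terms, zero vehicle values) differently.

# Sketch of a subset of the engineered features -- not the actual engineer_features.
def engineer_features_sketch(df):
    df = df.copy()
    df["ltv_ratio"] = df["originalLoanAmount"] / df["vehicleValueAmount"]
    df["balance_ratio"] = df["reportingPeriodActualEndBalanceAmount"] / df["originalLoanAmount"]
    df["payment_shortfall"] = (
        df["reportingPeriodScheduledPaymentAmount"] - df["totalActualAmountPaid"]
    ).clip(lower=0)
    df["underpaid"] = (df["payment_shortfall"] > 0).astype(int)
    df["seasoning_pct"] = (
        df["originalLoanTerm"] - df["remainingTermToMaturityNumber"]
    ) / df["originalLoanTerm"]
    df["rate_spread"] = (
        df["originalInterestRatePercentage"] - df["originalInterestRatePercentage"].median()
    )
    orig_year = pd.to_datetime(df["originationDate"], errors="coerce").dt.year
    df["vehicle_age_at_orig"] = orig_year - pd.to_numeric(df["vehicleModelYear"], errors="coerce")
    return df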
In [7]:
df = engineer_features(df)
model_df, feature_cols, target_col = select_model_features(df)
print(f"Final dataset: {model_df.shape[0]:,} rows x {len(feature_cols)} features")
print(f"Default rate: {model_df[target_col].mean():.4f}")
print(f"\nFeature summary statistics:")
model_df[feature_cols].describe().round(4).T
Final dataset: 163,454 rows x 22 features
Default rate: 0.0617

Feature summary statistics:
Out[7]:
count mean std min 25% 50% 75% max
obligorCreditScore 163454.0 721.9396 74.2322 193.0000 680.0000 716.0000 769.0000 999.0000
originalLoanAmount 163454.0 27579.4240 13759.2939 4000.0000 18428.2800 24776.1650 33366.5525 215736.2400
originalLoanTerm 163454.0 70.6090 9.5114 13.0000 62.0000 73.0000 76.0000 92.0000
originalInterestRatePercentage 163454.0 0.0772 0.0270 0.0000 0.0599 0.0724 0.0919 0.2500
reportingPeriodInterestRatePercentage 163454.0 0.0778 0.0237 0.0000 0.0649 0.0744 0.0869 0.2500
vehicleValueAmount 163454.0 30489.3098 15248.3133 4750.0000 19932.2500 26998.0000 37275.0000 330646.0000
paymentToIncomePercentage 163454.0 0.0698 0.0469 0.0000 0.0375 0.0610 0.0935 0.4736
remainingTermToMaturityNumber 163454.0 38.2417 15.3922 0.0000 31.0000 40.0000 47.0000 83.0000
reportingPeriodBeginningLoanBalanceAmount 163454.0 15983.2430 10231.4786 -4121.2300 10343.1200 14417.1000 18950.5200 187199.1500
reportingPeriodActualEndBalanceAmount 163454.0 15360.2716 10213.0216 -8190.6200 9772.2700 13882.7000 18426.9300 184673.0100
ltv_ratio 163454.0 0.9320 0.1968 0.2453 0.8167 0.9714 1.0869 1.2050
payment_burden 163454.0 0.0180 0.0028 0.0000 0.0169 0.0176 0.0187 0.0878
balance_ratio 163454.0 0.5709 0.2198 -0.3290 0.4825 0.6046 0.6954 0.9970
payment_shortfall 163454.0 26.1457 122.7786 0.0000 0.0000 0.0000 0.0000 2639.5100
underpaid 163454.0 0.0709 0.2566 0.0000 0.0000 0.0000 0.0000 1.0000
months_elapsed 163454.0 32.9392 14.7983 1.0000 26.0000 31.0000 38.0000 86.0000
seasoning_pct 163454.0 0.4653 0.2034 0.0137 0.3562 0.4384 0.5410 1.0000
rate_spread 163454.0 0.0038 0.0270 -0.0734 -0.0135 -0.0010 0.0185 0.1766
vehicle_age_at_orig 163454.0 2.7459 2.8686 -1.0000 0.0000 2.0000 5.0000 21.0000
is_used_vehicle 163454.0 0.6491 0.4773 0.0000 0.0000 1.0000 1.0000 1.0000
has_coborrower 163454.0 0.3608 0.4802 0.0000 0.0000 0.0000 1.0000 1.0000
is_subvented 163454.0 0.2073 0.4053 0.0000 0.0000 0.0000 0.0000 1.0000
In [8]:
# Correlation of features with default
corr_with_target = model_df[feature_cols + [target_col]].corr()[target_col].drop(target_col).sort_values()

fig, ax = plt.subplots(figsize=(10, 8))
colors = ["#C44E52" if v > 0 else "#4C72B0" for v in corr_with_target.values]
ax.barh(corr_with_target.index, corr_with_target.values, color=colors, edgecolor="white")
ax.set_xlabel("Pearson Correlation with Default")
ax.set_title("Feature Correlation with Default")
ax.axvline(x=0, color="black", linewidth=0.5)
plt.tight_layout()
plt.savefig("../figures/feature_correlation.png", bbox_inches="tight")
plt.show()
In [9]:
# Default rate by credit score bin and vehicle type
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# By credit tier
tier_default = model_df.groupby("fico_bin", observed=True)[target_col].agg(["mean", "count"])
tier_default = tier_default[tier_default["count"] >= 50]
axes[0].bar(tier_default.index.astype(str), tier_default["mean"] * 100, color="#C44E52", edgecolor="white", alpha=0.85)
axes[0].set_xlabel("Credit Tier")
axes[0].set_ylabel("Default Rate (%)")
axes[0].set_title("Default Rate by Credit Tier")
axes[0].tick_params(axis="x", rotation=45)
for i, (idx, row) in enumerate(tier_default.iterrows()):
    axes[0].text(i, row["mean"] * 100 + 0.2, f"n={row['count']:,.0f}", ha="center", fontsize=8)

# By state (top 15 by volume)
top_states = model_df["obligorGeographicLocation"].value_counts().head(15).index
state_default = model_df[model_df["obligorGeographicLocation"].isin(top_states)].groupby("obligorGeographicLocation")[target_col].mean().sort_values(ascending=False)
axes[1].barh(state_default.index[::-1], state_default.values[::-1] * 100, color="#4C72B0", edgecolor="white")
axes[1].set_xlabel("Default Rate (%)")
axes[1].set_title("Default Rate by State (Top 15 by Volume)")

plt.tight_layout()
plt.savefig("../figures/default_by_segment.png", bbox_inches="tight")
plt.show()

5. Model Training and Comparison¶

We split the data 80/20 with stratification on the target, then train three model families:

  1. Logistic Regression -- the interpretable baseline, standard in credit risk
  2. XGBoost -- gradient-boosted trees, the industry workhorse for tabular data
  3. Neural Network -- a 4-layer feedforward network (128-64-32-1) with batch normalization and dropout

All models handle class imbalance: logistic regression via class_weight="balanced", XGBoost via scale_pos_weight, and the neural network via weighted BCE loss with:

$$ w^+ = \frac{N^-}{N^+} $$
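For reference, a sketch of how those three treatments can be wired up (hyperparameters are illustrative and y_train refers to the training labels created in the next cell; the pre-trained models loaded below were fit in ../src/train.py and may use different settings):

# Sketch: class-imbalance handling for each model family (settings illustrative).
n_pos, n_neg = int(y_train.sum()), int((1 - y_train).sum())

lr_sketch = LogisticRegression(class_weight="balanced", max_iter=1000)             # reweight classes by inverse frequency
xgb_sketch = xgb.XGBClassifier(scale_pos_weight=n_neg / n_pos, eval_metric="auc")  # upweight positive-class gradients
bce_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_neg / n_pos))            # weighted BCE for the neural net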

In [10]:
X = model_df[feature_cols]
y = model_df[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {len(X_train):,} | Test: {len(X_test):,}")
print(f"Train default rate: {y_train.mean():.4f} | Test default rate: {y_test.mean():.4f}")

scaler = StandardScaler()
scaler.fit(X_train)

# Load pre-trained models
with open("../models/logistic.pkl", "rb") as f:
    lr_bundle = pickle.load(f)
lr_model = lr_bundle["model"]

xgb_model = xgb.XGBClassifier()
xgb_model.load_model("../models/xgboost.json")

nn_model = DefaultNet(len(feature_cols))
nn_state = torch.load("../models/neural_net.pt", weights_only=False)
nn_model.load_state_dict(nn_state["model_state"])
nn_model.eval()

# Generate predictions
X_test_scaled = scaler.transform(X_test)
lr_probs = lr_model.predict_proba(X_test_scaled)[:, 1]
xgb_probs = xgb_model.predict_proba(X_test)[:, 1]

with torch.no_grad():
    nn_probs = torch.sigmoid(nn_model(torch.FloatTensor(np.array(X_test_scaled, dtype=np.float32)))).squeeze().numpy()

print("\nPredictions generated for all three models.")
Train: 130,763 | Test: 32,691
Train default rate: 0.0617 | Test default rate: 0.0617
Predictions generated for all three models.
In [11]:
def ks_statistic(y_true, y_prob):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return max(tpr - fpr)

# Results table
results = []
for name, probs in [("Logistic Regression", lr_probs), ("XGBoost", xgb_probs), ("Neural Network", nn_probs)]:
    auc = roc_auc_score(y_test, probs)
    ks = ks_statistic(y_test, probs)
    gini = 2 * auc - 1
    results.append({"Model": name, "AUC": f"{auc:.4f}", "KS Statistic": f"{ks:.4f}", "Gini": f"{gini:.4f}"})

results_df = pd.DataFrame(results)
results_df.style.set_caption("Model Performance Comparison (Test Set)")
Out[11]:
Model Performance Comparison (Test Set)
   Model                  AUC     KS Statistic  Gini
0  Logistic Regression    0.8749  0.5902        0.7498
1  XGBoost                0.9219  0.6935        0.8438
2  Neural Network         0.9558  0.8022        0.9116
In [12]:
# ROC Curves
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# ROC
colors = {"Logistic Regression": "#4C72B0", "XGBoost": "#55A868", "Neural Network": "#C44E52"}
for name, probs in [("Logistic Regression", lr_probs), ("XGBoost", xgb_probs), ("Neural Network", nn_probs)]:
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc = roc_auc_score(y_test, probs)
    axes[0].plot(fpr, tpr, label=f"{name} (AUC={auc:.3f})", color=colors[name], linewidth=2)
axes[0].plot([0, 1], [0, 1], "k--", alpha=0.3)
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].set_title("ROC Curves")
axes[0].legend(fontsize=9)

# KS Plot (using XGBoost as the best traditional model)
fpr, tpr, thresholds = roc_curve(y_test, xgb_probs)
ks_val = max(tpr - fpr)
ks_idx = np.argmax(tpr - fpr)
axes[1].plot(np.linspace(0, 1, len(tpr)), tpr, label="Cumulative Default (TPR)", color="#C44E52", linewidth=2)
axes[1].plot(np.linspace(0, 1, len(fpr)), fpr, label="Cumulative Non-Default (FPR)", color="#4C72B0", linewidth=2)
x_ks = np.linspace(0, 1, len(tpr))[ks_idx]
axes[1].vlines(x_ks, fpr[ks_idx], tpr[ks_idx], color="black", linestyle="--", label=f"KS = {ks_val:.3f}")
axes[1].set_xlabel("Score Percentile")
axes[1].set_ylabel("Cumulative Proportion")
axes[1].set_title("KS Plot (XGBoost)")
axes[1].legend(fontsize=9)

# Precision-Recall
for name, probs in [("Logistic Regression", lr_probs), ("XGBoost", xgb_probs), ("Neural Network", nn_probs)]:
    precision, recall, _ = precision_recall_curve(y_test, probs)
    axes[2].plot(recall, precision, label=name, color=colors[name], linewidth=2)
axes[2].axhline(y=y_test.mean(), color="gray", linestyle="--", alpha=0.5, label=f"Baseline ({y_test.mean():.3f})")
axes[2].set_xlabel("Recall")
axes[2].set_ylabel("Precision")
axes[2].set_title("Precision-Recall Curves")
axes[2].legend(fontsize=9)

plt.tight_layout()
plt.savefig("../figures/model_comparison.png", bbox_inches="tight")
plt.show()
In [13]:
# Feature importance (XGBoost)
importance = pd.Series(xgb_model.feature_importances_, index=feature_cols).sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(10, 8))
importance.plot.barh(ax=ax, color="#55A868", edgecolor="white")
ax.set_xlabel("Feature Importance (Gain)")
ax.set_title("XGBoost Feature Importance")
plt.tight_layout()
plt.savefig("../figures/feature_importance.png", bbox_inches="tight")
plt.show()
In [14]:
# Calibration curves - how well do predicted probabilities match observed frequencies?
fig, ax = plt.subplots(figsize=(8, 7))

for name, probs, color in [
    ("Logistic Regression", lr_probs, "#4C72B0"),
    ("XGBoost", xgb_probs, "#55A868"),
    ("Neural Network", nn_probs, "#C44E52"),
]:
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=15, strategy="quantile")
    ax.plot(prob_pred, prob_true, marker="o", label=name, color=color, linewidth=2, markersize=5)

ax.plot([0, 1], [0, 1], "k--", alpha=0.3, label="Perfect calibration")
ax.set_xlabel("Mean Predicted Probability")
ax.set_ylabel("Observed Default Frequency")
ax.set_title("Calibration Curves")
ax.legend()
plt.tight_layout()
plt.savefig("../figures/calibration_curves.png", bbox_inches="tight")
plt.show()

6. Fairness Analysis¶

Lending models must be evaluated for disparate impact across protected groups. While race and ethnicity are not available in ABS-EE data, geography serves as a proxy for demographic composition. We apply the four-fifths rule (EEOC guidelines, adopted in fair lending):

$$ \text{Disparate Impact Ratio} = \frac{\text{Denial Rate}_{\text{group}}}{\text{Denial Rate}_{\text{reference}}} $$

A ratio below 0.8 or above 1.25 relative to the reference group (median state) flags potential disparate impact.

In [15]:
# Geographic fairness using logistic regression (most interpretable model for fairness)
test_df = model_df.loc[X_test.index].copy()
test_df["pred_prob"] = lr_probs

# Per-state metrics
state_counts = test_df["obligorGeographicLocation"].value_counts()
valid_states = state_counts[state_counts >= 50].index

state_metrics = []
for state in valid_states:
    sdf = test_df[test_df["obligorGeographicLocation"] == state]
    denial_rate = (sdf["pred_prob"] >= 0.5).mean()
    actual_default = sdf["default"].mean()
    state_metrics.append({"state": state, "n": len(sdf), "denial_rate": denial_rate, "actual_default": actual_default})

state_df = pd.DataFrame(state_metrics).sort_values("denial_rate", ascending=False)
reference_rate = state_df["denial_rate"].median()
state_df["di_ratio"] = state_df["denial_rate"] / reference_rate
state_df["flagged"] = (state_df["di_ratio"] < 0.8) | (state_df["di_ratio"] > 1.25)

print(f"Reference denial rate (median state): {reference_rate:.4f}")
print(f"States flagged: {state_df['flagged'].sum()} / {len(state_df)}")
print(f"\nFlagged states:")
print(state_df[state_df["flagged"]][["state", "n", "denial_rate", "actual_default", "di_ratio"]].to_string(index=False))
Reference denial rate (median state): 0.1588
States flagged: 14 / 49

Flagged states:
state    n  denial_rate  actual_default  di_ratio
   MS  343     0.282799        0.084548  1.780776
   NM  119     0.235294        0.067227  1.481640
   ND   84     0.226190        0.083333  1.424315
   NV  230     0.217391        0.073913  1.368906
   OK  387     0.214470        0.046512  1.350513
   GA  933     0.204716        0.080386  1.289090
   WV  223     0.201794        0.080717  1.270689
   DE  143     0.125874        0.020979  0.792626
   IA  311     0.122186        0.054662  0.769405
   OR  257     0.120623        0.035019  0.759557
   NJ 1517     0.114700        0.048121  0.722263
   MT  113     0.106195        0.035398  0.668705
   KS  368     0.105978        0.019022  0.667342
   MO 1107     0.105691        0.043360  0.665533
In [16]:
# Visualize DI ratios
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# DI ratio by state
top_20 = state_df.head(20).sort_values("di_ratio")
bar_colors = ["#C44E52" if f else "#4C72B0" for f in top_20["flagged"]]
axes[0].barh(top_20["state"], top_20["di_ratio"], color=bar_colors, edgecolor="white")
axes[0].axvline(x=0.8, color="red", linestyle="--", alpha=0.6, label="Lower bound (0.8)")
axes[0].axvline(x=1.25, color="red", linestyle="--", alpha=0.6, label="Upper bound (1.25)")
axes[0].axvline(x=1.0, color="gray", linestyle="-", alpha=0.3)
axes[0].set_xlabel("Disparate Impact Ratio")
axes[0].set_title("Disparate Impact Ratio by State (Top 20 by volume)")
axes[0].legend(fontsize=9)

# Actual default vs predicted denial rate
axes[1].scatter(state_df["actual_default"], state_df["denial_rate"],
                c=state_df["flagged"].map({True: "#C44E52", False: "#4C72B0"}),
                s=state_df["n"] / 10, alpha=0.7, edgecolors="white")
axes[1].plot([0, state_df["actual_default"].max()], [0, state_df["actual_default"].max()],
             "k--", alpha=0.3, label="Perfect alignment")
axes[1].set_xlabel("Actual Default Rate")
axes[1].set_ylabel("Model Denial Rate")
axes[1].set_title("Actual Default vs Model Denial Rate by State\n(size = loan volume, red = flagged)")
axes[1].legend(fontsize=9)

plt.tight_layout()
plt.savefig("../figures/fairness_analysis.png", bbox_inches="tight")
plt.show()
In [17]:
# Disparate impact map
fig3 = px.choropleth(
    state_df, locations="state", locationmode="USA-states",
    color="di_ratio", color_continuous_scale="RdBu_r",
    color_continuous_midpoint=1.0,
    scope="usa", title="Disparate Impact Ratio by State (1.0 = neutral, red = higher denial rate)",
    labels={"di_ratio": "DI Ratio"},
    range_color=[0.5, 1.8],
)
fig3.update_layout(margin=dict(l=0, r=0, t=40, b=0), width=800, height=450)
fig3.write_image("../figures/map_disparate_impact.png", scale=2)
fig3.write_html("../docs/map_disparate_impact.html", include_plotlyjs="cdn")
fig3.show()

7. Conclusions¶

Key findings:

  1. The neural network achieves the highest discriminative power (AUC 0.956), but XGBoost (AUC 0.922) offers a strong balance of performance and interpretability for production use.

  2. The dominant predictive feature is underpaid -- whether the borrower paid less than scheduled in the current period. This is behavioral data that would not be available at origination, suggesting a two-model architecture: one for origination scoring (using only static features) and one for ongoing monitoring.

  3. Geographic fairness analysis flags 14 of 49 states under the four-fifths rule, with Mississippi and New Mexico showing the highest disparate impact ratios. This warrants investigation into whether the model is picking up legitimate risk factors or geographic proxies for protected characteristics.

  4. Calibration analysis shows all three models tend to overpredict default probability, likely due to the class-weighting strategy. A post-hoc calibration step (Platt scaling or isotonic regression) would improve probability estimates for pricing applications.

Next steps:

  • Separate origination-time features from behavioral features for a proper two-stage model
  • Incorporate multiple issuers (Ford, Honda, Toyota trusts) for cross-issuer generalization
  • Add temporal validation (train on earlier periods, test on later)
  • Apply Platt scaling for probability calibration
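As a concrete example of the calibration step flagged above, a sketch of Platt scaling applied to the XGBoost scores -- splitting the test set in half is only for illustration; in practice the calibrator would be fit on a dedicated validation fold:

# Sketch: post-hoc Platt scaling of XGBoost scores (illustrative split of the test set).
cal_s, hold_s, cal_y, hold_y = train_test_split(
    xgb_probs, y_test.values, test_size=0.5, random_state=0, stratify=y_test
)
platt = LogisticRegression()                                  # 1-D logistic fit on raw scores
platt.fit(cal_s.reshape(-1, 1), cal_y)
calibrated = platt.predict_proba(hold_s.reshape(-1, 1))[:, 1]
print(f"mean raw score {hold_s.mean():.4f} | mean calibrated {calibrated.mean():.4f} | "
      f"observed default rate {hold_y.mean():.4f}")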

Appendix A: Mathematical Foundations¶

A.1 Logistic Regression¶

The logistic regression model estimates the probability of default as:

$$ P(Y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^\top \mathbf{x}_i + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_i + b)}} $$

where $\sigma(\cdot)$ is the sigmoid function, $\mathbf{w} \in \mathbb{R}^d$ is the weight vector, and $b$ is the bias term. The model is trained by minimizing the weighted binary cross-entropy loss:

$$ \mathcal{L}(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N} \omega_i \left[ y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right] $$

where $\omega_i = \frac{N}{2 N^+}$ for positive samples and $\omega_i = \frac{N}{2 N^-}$ for negative samples, implementing the balanced class-weight strategy that weights each class inversely to its frequency.
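The equivalence with scikit-learn's class_weight="balanced" can be verified directly on toy labels:

# Check: sklearn's "balanced" weights equal N / (2 * N_class) for each class.
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

y_toy = np.array([0] * 9 + [1])                                # N = 10, N+ = 1, N- = 9
print(compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy))
# -> approximately [0.556, 5.0], i.e. [10 / (2 * 9), 10 / (2 * 1)]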

A.2 Gradient Boosted Trees (XGBoost)¶

XGBoost constructs an additive ensemble of $K$ decision trees:

$$ \hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F} $$

where $\mathcal{F}$ is the space of regression trees. Each tree $f_k$ is added greedily to minimize the regularized objective:

$$ \mathcal{L}^{(k)} = \sum_{i=1}^{N} \ell\left(y_i,\; \hat{y}_i^{(k-1)} + f_k(\mathbf{x}_i)\right) + \Omega(f_k) $$

with the regularization term:

$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 $$

where $T$ is the number of leaves and $w_j$ are the leaf weights. The split gain for a candidate split is:

$$ \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma $$

where $G = \sum_{i \in I} g_i$ and $H = \sum_{i \in I} h_i$ are sums of first and second order gradients of the loss.
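A small numeric illustration of the gain formula, with invented gradient and hessian sums:

# Toy illustration of the split-gain formula (all values invented for the example).
G_L, H_L = -12.0, 30.0        # left child: sums of first- and second-order gradients
G_R, H_R = 8.0, 20.0          # right child
lam, gamma = 1.0, 0.5         # L2 regularization and per-leaf penalty

gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
              - (G_L + G_R)**2 / (H_L + H_R + lam)) - gamma
print(f"split gain = {gain:.3f}")   # positive gain -> the split is worth making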

A.3 Neural Network Architecture¶

The feedforward network maps input features to a default probability through a sequence of affine transformations and nonlinearities:

$$ \begin{aligned} \mathbf{h}_1 &= \text{Dropout}_{0.3}\left(\text{BN}\left(\text{ReLU}\left(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1\right)\right)\right) & \mathbf{W}_1 \in \mathbb{R}^{128 \times 22} \\ \mathbf{h}_2 &= \text{Dropout}_{0.2}\left(\text{BN}\left(\text{ReLU}\left(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2\right)\right)\right) & \mathbf{W}_2 \in \mathbb{R}^{64 \times 128} \\ \mathbf{h}_3 &= \text{ReLU}\left(\mathbf{W}_3 \mathbf{h}_2 + \mathbf{b}_3\right) & \mathbf{W}_3 \in \mathbb{R}^{32 \times 64} \\ \hat{p} &= \sigma\left(\mathbf{w}_4^\top \mathbf{h}_3 + b_4\right) & \mathbf{w}_4 \in \mathbb{R}^{32} \end{aligned} $$

where $\text{BN}$ denotes batch normalization: $\hat{x}_j = \frac{x_j - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} \cdot \gamma_j + \beta_j$, with learned scale $\gamma_j$ and shift $\beta_j$ parameters.

The network is trained with weighted binary cross-entropy loss:

$$ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[w^+ \cdot y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right] $$

where $w^+ = N^-/N^+$ compensates for class imbalance. We optimize with Adam ($\beta_1 = 0.9, \beta_2 = 0.999, \eta = 10^{-3}$) for 30 epochs with batch size 2048.
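A PyTorch sketch consistent with the layer sizes, ordering, and dropout rates above; the actual DefaultNet lives in ../src/train.py and may differ in detail:

# Sketch of the 128-64-32-1 network described in A.3 (not necessarily identical to train.DefaultNet).
import torch
import torch.nn as nn

class DefaultNetSketch(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.BatchNorm1d(128), nn.Dropout(0.3),
            nn.Linear(128, 64),         nn.ReLU(), nn.BatchNorm1d(64),  nn.Dropout(0.2),
            nn.Linear(64, 32),          nn.ReLU(),
            nn.Linear(32, 1),           # raw logit; sigmoid is applied at prediction time
        )

    def forward(self, x):
        return self.net(x)

# Weighted BCE on logits, per the loss above:
# criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_neg / n_pos))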

A.4 Evaluation Metrics¶

Area Under the ROC Curve (AUC): Measures the probability that a randomly chosen defaulter is scored higher than a randomly chosen non-defaulter:

$$ \text{AUC} = P(\hat{p}(X^+) > \hat{p}(X^-)) = \frac{\sum_{i: y_i=1}\sum_{j: y_j=0} \mathbb{1}[\hat{p}_i > \hat{p}_j]}{N^+ \cdot N^-} $$
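The pairwise form above agrees with scikit-learn's roc_auc_score whenever there are no tied scores; a quick check on toy data:

# Check: pairwise win-rate definition of AUC vs. sklearn's roc_auc_score.
import numpy as np
from sklearn.metrics import roc_auc_score

y_toy = np.array([0, 0, 1, 0, 1])
p_toy = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
pos, neg = p_toy[y_toy == 1], p_toy[y_toy == 0]
pairwise_auc = (pos[:, None] > neg[None, :]).mean()   # fraction of (defaulter, non-defaulter) pairs ranked correctly
print(pairwise_auc, roc_auc_score(y_toy, p_toy))       # both ~0.667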

Kolmogorov-Smirnov Statistic (KS): The maximum vertical distance between the cumulative distribution functions of scores for defaulters and non-defaulters:

$$ \text{KS} = \max_t \left|F_1(t) - F_0(t)\right| = \max_t \left|\text{TPR}(t) - \text{FPR}(t)\right| $$

where $F_1$ is the CDF of scores for the default class and $F_0$ for the non-default class. In credit scoring, KS > 0.4 is generally considered a strong model.

Gini Coefficient: A rescaled version of AUC that maps the [0.5, 1.0] range to [0, 1]:

$$ \text{Gini} = 2 \cdot \text{AUC} - 1 $$

Equivalently, it equals the area between the Lorenz curve and the line of equality. In credit scoring, Gini > 0.6 indicates good discriminative power.

A.5 Disparate Impact Analysis¶

The four-fifths rule, originally from EEOC employment guidelines and widely applied in fair lending, states that a selection procedure has adverse impact if the selection rate for any protected group is less than four-fifths (80%) of the rate for the group with the highest rate:

$$ \text{DI}_{g} = \frac{P(\text{Deny} \mid G = g)}{P(\text{Deny} \mid G = g_{\text{ref}})} \quad \Longrightarrow \quad \text{Flag if } \text{DI}_g < 0.8 \text{ or } \text{DI}_g > 1.25 $$

We use the median state denial rate as the reference group $g_{\text{ref}}$ to identify outlier states on both ends of the distribution. Note that disparate impact is a necessary but not sufficient condition for discrimination -- flagged disparities may be justified by legitimate business necessity if the underlying risk factors are demonstrably predictive and there is no less discriminatory alternative.