People who are good at calculating probability and risk are few and far between, which is why understanding statistics is so vital to the market. Like finance, the insurance industry is a collective of individuals who understand how risks change over time. The majority of Americans, on the other hand, find the sector overwhelming and […]

# Risk Management

## Predictive Modelling: Comparing Model Results

The AUC for each model, and its performance when we set the probability cutoff at 50%, is summarised below. The Kappa statistics from all models exceed 20% by only a small amount, which indicates that they perform moderately better than chance. XGB takes advantage of receiving the full downsampled data and provides the highest AUC. Comparing performance across models may not […]
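As a reminder of how the Kappa statistic is computed, here is a minimal base-R sketch from a 2×2 confusion matrix (the counts are made up for illustration, not our actual results):

```r
# Cohen's kappa from a 2x2 confusion matrix (illustrative counts only)
cm <- matrix(c(120,  80,
                60, 140), nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("Default", "Fully Paid"),
                             obs  = c("Default", "Fully Paid")))
n     <- sum(cm)
po    <- sum(diag(cm)) / n                     # observed agreement
pe    <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
kappa
```

A Kappa a little above 0.2, as we observed, means the models agree with the true labels only moderately more often than a chance-level classifier with the same marginal class frequencies would.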

## Extreme Gradient Boosting in R

Extreme Gradient Boosting has a very efficient implementation. Unlike SVM and Random Forest, we can tune its parameters using the whole downsampled set. We focus on varying the Ridge and Lasso regularization terms and the learning rate. We use 10% of the data for validating the tuning parameters. The best tuning parameters are eta = 0.1, alpha = 0.5, and lambda = 1.0. We retrain […]
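A tuning search of this kind can be sketched in base R as a grid over the three parameters we vary; the candidate values below are illustrative assumptions, with only the winning combination taken from the results above:

```r
# Candidate grid for the XGBoost parameters we tune:
# eta (learning rate), alpha (Lasso), lambda (Ridge)
grid <- expand.grid(eta    = c(0.05, 0.1, 0.3),
                    alpha  = c(0, 0.5, 1),
                    lambda = c(0.5, 1.0, 2.0))
nrow(grid)   # number of candidate combinations to evaluate

# each row would be passed to xgboost as, e.g.,
# params <- list(objective = "binary:logistic",
#                eta = grid$eta[i], alpha = grid$alpha[i], lambda = grid$lambda[i])

# the combination that performed best on the 10% validation split
best <- subset(grid, eta == 0.1 & alpha == 0.5 & lambda == 1.0)
```

Each candidate row is scored on the held-out 10% validation split, and the row with the best validation AUC is kept for retraining.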

## Random Forest Model in R

Now we will tune the Random Forest model. Like SVM, we tune parameters on 5% of the downsampled data. The procedure is exactly the same as for the SVM model. Below we have reproduced the code for the Random Forest model. The best parameter is mtry (the number of predictors sampled at each split) = 2. Like SVM, we fit 10% of the downsampled data with this […]
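To make the mtry search concrete, here is a sketch of a candidate grid in base R; the predictor count p = 20 and the grid values are illustrative assumptions, not our actual setup:

```r
# mtry = number of predictors randomly sampled at each split.
# For p predictors the usual classification default is floor(sqrt(p));
# we search a small grid around it (p = 20 is illustrative).
p         <- 20
mtry_grid <- data.frame(mtry = c(2, floor(sqrt(p)), 8, p %/% 2))
mtry_grid
# with caret, this grid would be supplied as
# train(..., method = "rf", tuneGrid = mtry_grid)
```

The small value mtry = 2 winning suggests the trees benefit from being decorrelated: each split sees only two candidate predictors, which reduces the dominance of any single strong feature.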

## Support Vector Machine (SVM) Model in R

A support vector machine (SVM) is a supervised learning technique that analyzes data and isolates patterns applicable to both classification and regression. The classifier is useful for choosing between two or more possible outcomes that depend on continuous or categorical predictor variables. Based on training and sample classification data, the SVM algorithm assigns the target […]
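Once fitted, a linear SVM reduces to a separating hyperplane, and prediction is just the sign of the decision value. A base-R sketch with made-up weights (not a fitted model) shows the rule:

```r
# A fitted linear SVM is a weight vector w and intercept b;
# the predicted class is the sign of the decision value w . x + b
w <- c(0.8, -0.5)   # illustrative weights, not fitted from data
b <- -0.1
svm_predict <- function(x) {
  ifelse(sum(w * x) + b >= 0, "Fully Paid", "Default")
}
svm_predict(c(1.0, 0.2))   # lands on the positive side of the hyperplane
svm_predict(c(0.1, 1.5))   # lands on the negative side
```

Kernel SVMs generalize this by computing the decision value in a transformed feature space, but the sign-of-decision-value rule is the same.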

## Credit Risk – Logistic Regression Model in R

To build our first model, we will tune Logistic Regression on our training dataset. First we set the seed (to any number; we have chosen 100) so that we can reproduce our results. Then we create a downsampled dataset called samp which contains an equal number of Default and Fully Paid loans. We can use the table() function to check that the downsampling […]
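The steps above can be sketched in base R; the simulated data frame and its column names (status, int_rate) are stand-ins for the real training data:

```r
set.seed(100)  # so the downsampling is reproducible

# simulated loans standing in for the real training data
train <- data.frame(status   = c(rep("Fully Paid", 700), rep("Default", 300)),
                    int_rate = c(rnorm(700, 11), rnorm(300, 14)))

# downsample: keep all Defaults, sample an equal number of Fully Paid loans
defaults <- train[train$status == "Default", ]
paid_idx <- sample(which(train$status == "Fully Paid"), nrow(defaults))
samp     <- rbind(defaults, train[paid_idx, ])

table(samp$status)   # check that the two classes are now balanced

# fit the logistic regression on the balanced sample
fit <- glm(I(status == "Default") ~ int_rate, data = samp, family = binomial)
```

With balanced classes, the model is no longer rewarded for simply predicting the majority class.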

## Building Credit Risk Model

Loan data typically have a higher proportion of good loans, so we can achieve high accuracy just by labeling every loan as Fully Paid. On our test data, this strategy alone yields 70.3% accuracy. Recall that we have yet to include the outcome of ‘Current’ loans. In a real situation, the ratio […]
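The arithmetic behind that baseline is simple; the counts below are illustrative, chosen only to match the 70.3% proportion:

```r
# If 70.3% of test loans are Fully Paid, predicting "Fully Paid" for
# every loan is right exactly that often (counts are illustrative)
n_test    <- 1000
n_paid    <- 703
n_default <- n_test - n_paid
naive_acc <- n_paid / n_test
naive_acc   # high accuracy, yet every Default is missed
```

This is why accuracy alone is a poor yardstick here, and why we also report AUC and Kappa, which account for the class imbalance.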

## Create a Function and Prepare Test Data in R

When we build the model, we will need the same set of columns in the test data, and we will also need to apply all of the same transformations to it. The steps are: keep the columns, create the function, and prepare the test data. We will now take our test data and apply our data transformations to it. […]
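A minimal sketch of such a helper in base R; kept_cols, the ratio transform, and the column names are assumptions standing in for the real pipeline:

```r
# Hypothetical helper that applies the same column selection and
# transformations used on the training data
kept_cols <- c("funded_amnt", "annual_inc", "int_rate")

prepare_data <- function(df) {
  df <- df[, kept_cols]                             # keep the training columns
  df$annual_inc  <- df$annual_inc / df$funded_amnt  # same ratio transform
  df$funded_amnt <- NULL                            # drop once ratios are computed
  df
}

test_raw <- data.frame(funded_amnt = c(10000, 20000),
                       annual_inc  = c(50000, 40000),
                       int_rate    = c(0.12, 0.15),
                       extra_col   = c(1, 2))  # a column the model never saw
test_data <- prepare_data(test_raw)
names(test_data)   # only the transformed training columns remain
```

Wrapping the transformations in one function guarantees that training and test data pass through identical preprocessing, which is essential for the fitted coefficients to remain meaningful on new data.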

## Remove Dimensions By Fitting Logistic Regression

We will use the preProcess function from the caret package to center and scale (normalize) the data. The center transform calculates the mean of an attribute and subtracts it from each value; the scale transform calculates the standard deviation of an attribute and divides each value by it. We can then try to reduce the number of dimensions […]
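The same center-and-scale transform can be computed by hand in base R, which makes it easy to see what preProcess is doing with methods c("center", "scale"):

```r
# Center-and-scale by hand: subtract the column mean,
# then divide by the column standard deviation
x  <- c(2, 4, 6, 8)
xs <- (x - mean(x)) / sd(x)
mean(xs)   # ~0 after centering
sd(xs)     # 1 after scaling

# base R's scale() computes the same standardization
all.equal(as.numeric(scale(x)), xs)
```

After this transform every numeric attribute has mean 0 and standard deviation 1, so coefficients and distance-based methods are not dominated by whichever attribute happens to have the largest raw scale.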

## Data Cleaning in R – Part 5

**Numeric Features.** Let’s look at all the numeric features we have left. We will transform annual_inc, revol_bal, avg_cur_bal, and bc_open_to_buy by dividing each by funded_amnt (the loan amount). We can then remove the funded amount attribute. **Character Features.** Let’s look at all the character features we have left. We will remove verification_status_joint. Now consider the home_ownership data. There are only three options: MORTGAGE, OWN, and RENT. Even […]
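The ratio transform on the numeric features can be sketched in base R; the toy values below are illustrative, and only two of the four transformed columns are shown:

```r
# Express dollar amounts relative to the loan size, then drop funded_amnt
# (toy values, not the real loan data)
loans <- data.frame(funded_amnt = c(10000, 25000),
                    annual_inc  = c(60000, 50000),
                    revol_bal   = c(5000, 10000))

for (col in c("annual_inc", "revol_bal")) {
  loans[[col]] <- loans[[col]] / loans$funded_amnt
}
loans$funded_amnt <- NULL
loans
```

Dividing by funded_amnt puts borrowers with very different loan sizes on a comparable footing: a $5,000 revolving balance means something quite different against a $10,000 loan than against a $25,000 one.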