
P2P FinBee loan portfolio analysis using Machine Learning

This P2P FinBee loan portfolio analysis is made with Python, Pandas and Sklearn.

Data is provided by FinBee. Full details of the company:

UAB "Finansų bitė" | Address: Šv. Ignoto g. 5, Vilnius, Lithuania | Postal code: LT-01144

https://www.finbee.lt/

 

The total number of loans is increasing day by day. At the start of the project, the total number of loans was 8592.

Descriptive statistics

df.mean()

loan_amount            2112.64
loan_period              33.00
effective_rate            0.20
preferred_rate            0.20
monthly_installment     484.83
age                      38.22
dependants                0.72
monthly_income          758.90
monthly_expenses        106.28
available_income        652.62

The overall quality of the data is good. There are several issues with bad coding, such as '---' values or empty cells.
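A minimal sketch of cleaning these with pandas (assuming the bad coding is literally the string '---'):

import numpy as np

# Replace placeholder strings with proper missing values
df = df.replace('---', np.nan)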

Variable types

Numeric 15
Categorical 16
Boolean 2
Date 1
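This summary can be reproduced roughly with pandas (a sketch; the exact type inference behind the counts above is an assumption):

# Count columns by inferred pandas dtype
df.dtypes.value_counts()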

Number and percentage of missing values:

                        Total   Percent
post_town                   9  0.104215
gender                      8  0.092635
occupation                  3  0.034738
marital_status              3  0.034738
years_working_in_total      1  0.011579
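A sketch of how this table can be computed with pandas (assuming 'Percent' is the percentage of all loans):

import pandas as pd

total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing.head())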

 

 

Strangely, the most popular loan amount is 340 EUR (255 loans), and the second most popular is 1110 EUR (215 loans):

import pandas as pd

# Bin loan amounts into discrete intervals
bins = pd.cut(df['Loan_amount'], [0, 500, 1000, 2000, 3000, 5000, 10000])
df.groupby(bins)['Loan_amount'].agg(['count', 'sum'])
 

Loan_amount      count      sum
(0, 500]          1029   384640
(500, 1000]       1599  1164045
(1000, 2000]      2386  3425405
(2000, 3000]      1508  3722470
(3000, 5000]      1450  5556385
(5000, 10000]      611  3830980

Loan intervals histogram

# Plot histogram of loan amounts using seaborn
# (distplot is deprecated; histplot with kde=True is the modern equivalent)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
sns.histplot(df['loan_amount'], kde=True)
plt.xlabel("Loan intervals (EUR)", labelpad=14)
plt.show()


Some other distributions, such as age, dependants, effective rate and income, look fairly ordinary and might not give valuable insights for the model.

Feature Selection with sklearn

 

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation heatmap of the candidate features
plt.figure(figsize=(12, 10))
cor = df.loc[:, features_list].corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

Correlation heatmap

The Pearson correlation heatmap shows that some features have strong correlation with each other. Such features will be eliminated from the model. Priority is given to the feature which has the higher correlation with 'loan_status'.

For example, the correlation between monthly income and monthly expenses is 0.8. In this case I use monthly expenses, as it has the higher correlation with loan status. The correlation of each feature with 'LOAN_STATUS' is listed below:

 

Loan_amount 0.077005
Loan_period 0.102977
Loan_rate 0.223578
Gender 0.089568
City 0.021701
Age  0.109887
Loan_type 0.034739
Education  0.000327
Employment_status 0.027008
Occupation 0.008057
Work_duration 0.118933
Total_work_duration 0.055459
Marital_status 0.094221
Dependants 0.008923
Monthly_expenses 0.059593
Last_debt 0.013316
Monthly_payment  0.000211
Credit_rating 0.193291
Monthly_income 0.042510
LOAN_STATUS 1.000000
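These values can be computed from the correlation matrix above (a sketch, assuming they are absolute Pearson correlations and that 'LOAN_STATUS' is included in features_list):

# Absolute correlation of each feature with the target
cor_target = cor['LOAN_STATUS'].abs()
print(cor_target)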

 

 Model building

Data is split into two DataFrames: one for model training (confirmed loans) and one for predictions (unconfirmed loans).

# One DataFrame for confirmed loans (training), another for unconfirmed loans (prediction)
confirmed = df[df['confirm_status'] == 0]
unconfirmed = df[df['confirm_status'] != 0]

 

Encoding

Encoding of categorical data (such as 'City', 'Loan_type', 'Education', etc.) is performed with sklearn. Missing values are filled with 'Nan'.
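A minimal sketch of this step, assuming sklearn's LabelEncoder (the source does not name the exact encoder) and the categorical columns from the feature list below:

from sklearn.preprocessing import LabelEncoder

categorical_cols = ['Gender', 'City', 'Loan_type', 'Education',
                    'Employment_status', 'Occupation', 'Marital_status']

for col in categorical_cols:
    # Fill missing values with the string 'Nan' so they form their own category
    df[col] = df[col].fillna('Nan').astype(str)
    df[col] = LabelEncoder().fit_transform(df[col])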

 

Creating vectors

X vector features:

['Loan_amount', 'Loan_period', 'Loan_rate', 'Gender', 'City', 'Age', 'Loan_type', 'Education', 'Employment_status', 'Occupation', 'Work_duration','Total_work_duration','Marital_status', 'Dependants','Monthly_expenses', 'Last_debt'] 

y vector:

['LOAN_STATUS'] 
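In code, the vectors might be built like this (a sketch; 'confirmed' is the training DataFrame from the split above):

features_list = ['Loan_amount', 'Loan_period', 'Loan_rate', 'Gender', 'City',
                 'Age', 'Loan_type', 'Education', 'Employment_status',
                 'Occupation', 'Work_duration', 'Total_work_duration',
                 'Marital_status', 'Dependants', 'Monthly_expenses', 'Last_debt']

X = confirmed[features_list]
y = confirmed['LOAN_STATUS']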

 

Partition into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

 

Confusion matrix

[[   3    4    2   16]
 [   3   17    5   92]
 [   1    3    7   48]
 [  13  109   63 1367]]

The confusion matrix shows that the model is not very good at predicting 'default' loans, which is the most important class.
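For reference, a sketch of how such a matrix is produced with sklearn ('model' is assumed to be the fitted classifier, which the source does not show):

from sklearn.metrics import confusion_matrix

# 'model' is the fitted classifier (assumption); predict on the held-out test set
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))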

 

Decision tree

To optimize the decision tree's performance, GridSearchCV() was used with the following parameters:

param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7, 20]}
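A sketch of the search itself (the CV fold count and scoring are not stated in the source, so sklearn defaults are assumed):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

grid = GridSearchCV(DecisionTreeClassifier(), param_grid)
grid.fit(X_train, y_train)

# Mean cross-validated score for each parameter combination
print(list(zip(grid.cv_results_['mean_test_score'], grid.cv_results_['params'])))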

Scores of Decision Tree GridSearch: 

[(0.8949346405228759, {'criterion': 'gini', 'max_depth': 3}),
 (0.8928104575163399, {'criterion': 'gini', 'max_depth': 5}),
 (0.8857843137254902, {'criterion': 'gini', 'max_depth': 7}),
 (0.8101307189542484, {'criterion': 'gini', 'max_depth': 20}),
 (0.8962418300653595, {'criterion': 'entropy', 'max_depth': 3}),
 (0.894281045751634, {'criterion': 'entropy', 'max_depth': 5}),
 (0.8834967320261438, {'criterion': 'entropy', 'max_depth': 7}),
 (0.8217320261437908, {'criterion': 'entropy', 'max_depth': 20})]

 

Best estimator:

 

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

 

Decision tree visual graph

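The graph can be rendered directly from the fitted estimator, for example with sklearn's plot_tree (a sketch; the source does not show the plotting code):

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(grid.best_estimator_, feature_names=features_list, filled=True)
plt.show()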

 

The decision tree model accuracy is ~89%. Predictions are made on the unconfirmed loans to help better evaluate investment choices.
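A sketch of that final step ('unconfirmed' is the prediction DataFrame from the earlier split, assumed to be encoded the same way as the training data):

# Predict loan status for the unconfirmed loans
predictions = grid.best_estimator_.predict(unconfirmed[features_list])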

Drawbacks of the model: 'loan_status' is encoded by FinBee in four categories: ['ok', 'late', 'arrears', 'default']. The full meaning is provided on the FinBee website: loan status (ok – no overdue installments, late – one missed loan installment, arrears – two missed loan installments, default – three or more missed loan installments). Loan status varies over time, which means the model can have a logical deficiency (a loan's status can go from 'default' back to 'ok' and vice versa). Therefore it might be better to separate and evaluate only finished loans, or to calculate and predict how many times and for how many days a loan is late (unfortunately this data is not provided).