Decision Tree and Random Forest

Assignment 5

Posted by Xiaolu on May 3, 2019

Introduction

In this article, we mainly use Decision Tree and Random Forest to solve the heart-disease classification problem and discuss a practical issue. We use the same data as in the Classification post. Please click here for data and code.

Preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
df = pd.read_csv(r"C:\Documents\Data Analysis\Assignments3\heart-disease-uci/heart.csv")
df.head()
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
a = pd.get_dummies(df['cp'], prefix = "cp")
b = pd.get_dummies(df['thal'], prefix = "thal")
c = pd.get_dummies(df['slope'], prefix = "slope")
frames = [df, a, b, c]
df = pd.concat(frames, axis = 1)
df = df.drop(columns = ['cp', 'thal', 'slope'])
df.head()
age sex trestbps chol fbs restecg thalach exang oldpeak ca ... cp_1 cp_2 cp_3 thal_0 thal_1 thal_2 thal_3 slope_0 slope_1 slope_2
0 63 1 145 233 1 0 150 0 2.3 0 ... 0 0 1 0 1 0 0 1 0 0
1 37 1 130 250 0 1 187 0 3.5 0 ... 0 1 0 0 0 1 0 1 0 0
2 41 0 130 204 0 0 172 0 1.4 0 ... 1 0 0 0 0 1 0 0 0 1
3 56 1 120 236 0 1 178 0 0.8 0 ... 1 0 0 0 0 1 0 0 0 1
4 57 0 120 354 0 1 163 1 0.6 0 ... 0 0 0 0 0 1 0 0 0 1

5 rows × 22 columns

y=df.target.values
x_data=df.drop(['target'],axis=1)

#Normalize
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())

#Splitting training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Decision Tree

clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
print('Train Accuracy:', clf.score(x_train, y_train))
print('Test Accuracy:', clf.score(x_test, y_test))
Train Accuracy: 1.0
Test Accuracy: 0.8032786885245902

The decision tree model apparently overfits the training data. Thus, we limit the depth of the tree and see how the performance improves.

scoreList = []
for i in range(1,20):
    clf2 = DecisionTreeClassifier(max_depth=i)  
    clf2.fit(x_train, y_train)
    scoreList.append(clf2.score(x_test, y_test))
    
plt.plot(range(1,20), scoreList)
plt.xticks(np.arange(1,20,1))
plt.xlabel("Depth")
plt.ylabel("Score")
plt.show()

print('The best depth:',scoreList.index(max(scoreList))+1)
print('Corresponding accuracy on test set:',max(scoreList))

[Figure: test-set accuracy vs. tree depth]

The best depth: 3
Corresponding accuracy on test set: 0.8524590163934426

After limiting the depth, the decision tree performs much better than before.
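Note that choosing the depth by maximizing test-set accuracy effectively tunes a hyperparameter on the test set. A more robust alternative is to pick the depth with cross-validation on the training data only; here is a minimal sketch of that idea (assuming 5-fold cross-validation and a fixed random_state):

from sklearn.model_selection import cross_val_score

# Evaluate each candidate depth with 5-fold cross-validation on the training set
cv_scores = []
for depth in range(1, 20):
    clf_cv = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_scores.append(cross_val_score(clf_cv, x_train, y_train, cv=5).mean())

print('Best depth by cross-validation:', int(np.argmax(cv_scores)) + 1)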

As we know, one of the most important advantages of a decision tree is that it gives us a direct and easily comprehensible tree-like result. What does it look like in this situation?

import graphviz
from sklearn import tree
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
clf3=DecisionTreeClassifier(max_depth=3)
clf_out=clf3.fit(x_train,y_train)

dot_data = tree.export_graphviz(clf_out, out_file=None,
                                # class_names must follow the sorted class labels: 0 = no disease, 1 = disease
                                class_names=['No Disease', 'Disease'],
                                feature_names=x.columns,
                                filled=True, rounded=True,
                                special_characters=True)
graph=graphviz.Source(dot_data) 
graph

[Figure: the fitted decision tree (max_depth=3)]

Clarification:

  • thal_2 represents a fixed defect in the heart
  • cp_0 represents typical angina
  • ca represents the number of major vessels (0-3) colored by fluoroscopy
  • oldpeak represents ST depression induced by exercise relative to rest, which is a sign of heart disease

All of the above are key features for diagnosing heart disease, and they appear to make sense. However, it is hard to believe that only 4 features can tell whether a person has heart disease or not. We turn to a more advanced method, Random Forest, to test this further.

Random Forest

scoreList2 = []
for i in range(10,500):
    rf2=RandomForestClassifier(n_estimators=i, max_depth=3,random_state=0)
    rf2.fit(x_train, y_train)
    scoreList2.append(rf2.score(x_test, y_test))
    
plt.plot(range(10,500), scoreList2)
plt.xticks(np.arange(10,550,50))
plt.xlabel("The number of trees in the forest")
plt.ylabel("Score")
plt.show()

[Figure: test-set accuracy vs. the number of trees in the forest]

rf=RandomForestClassifier(n_estimators=100, max_depth=3,random_state=0)
rf.fit(x_train, y_train)
print('Train Accuracy:', rf.score(x_train, y_train))
print('Test Accuracy:', rf.score(x_test, y_test))
Train Accuracy: 0.8636363636363636
Test Accuracy: 0.9016393442622951

Random Forest seems to work better than the decision tree, giving a higher accuracy on the test set.

# Sort the feature importances in descending order and plot them as a bar chart
importances = pd.Series(rf.feature_importances_, index=x.columns).sort_values(ascending=False)
importances.plot(kind='bar')
plt.title('Importance of Features')
plt.show()

[Figure: importance of features]

The important features are consistent with what we found with the Decision Tree. Next, we check the confusion matrix to see what kinds of mistakes the model makes.

from sklearn.metrics import confusion_matrix


y_head_rf = rf.predict(x_test)

cm_rf = confusion_matrix(y_test,y_head_rf)

#plt.figure(figsize=(15,6))
plt.title("Confusion Matrixes",fontsize=24)
sns.heatmap(cm_rf,annot=True,cmap="Blues",fmt="d",cbar=False)
plt.show()

[Figure: confusion matrix of the random forest]

In terms of overall accuracy, the model performs quite well, since only a few people are mispredicted. In practice, however, a False Negative is more serious than a False Positive in diagnosis: the two patients who have heart disease but are predicted as healthy could miss treatment and die of heart disease. Thus, we want the number of False Negatives to be 0, which we try to achieve by adjusting the class weights.
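Before re-weighting, it helps to quantify these mistakes. A small sketch, relying on the fact that scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]] with class 1 (disease) as the positive class:

# Unpack the confusion matrix computed above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm_rf.ravel()
print('Sensitivity (recall on disease cases):', tp / (tp + fn))
print('Specificity (recall on healthy cases):', tn / (tn + fp))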

rf2=RandomForestClassifier(n_estimators=100, max_depth=5,random_state=0,class_weight={0:1,1:25})
rf2.fit(x_train,y_train)
y_head_rf2 = rf2.predict(x_test)
cm_rf2 = confusion_matrix(y_test,y_head_rf2)
plt.title("Confusion Matrixes",fontsize=24)
sns.heatmap(cm_rf2,annot=True,cmap="Blues",fmt="d",cbar=False)
plt.show()

print('Train Accuracy:', rf2.score(x_train, y_train))
print('Test Accuracy:', rf2.score(x_test, y_test))

[Figure: confusion matrix after adjusting class weights]

Train Accuracy: 0.78099173553719
Test Accuracy: 0.7540983606557377

Although the number of False Negatives is reduced to 0, the number of False Positives is now quite high. However, this trade-off is inevitable. If we want to reduce the False Positives without giving up the zero False Negative rate, more features need to be added, or we can use another model to further test those who have been flagged as positive.
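Class weights are only one way to move along this trade-off. Another option, shown as a sketch below under the assumption that we keep the original rf model, is to lower the decision threshold on the predicted probability of disease instead of retraining:

# Sweep the probability threshold instead of retraining with class weights
proba = rf.predict_proba(x_test)[:, 1]  # predicted probability of disease
for threshold in [0.5, 0.4, 0.3, 0.2]:
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print('threshold =', threshold, ' FN =', fn, ' FP =', fp)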

Reference

Heart Disease - Classifications (Machine Learning) in Kaggle

Decision Tree - scikit-learn

Ensemble methods - scikit-learn