Decision Trees are a non-parametric supervised learning method often used for classification and regression. The idea is to learn simple decision rules from the available data and combine them to reach a final outcome. Because the rules form a tree, the model can be drawn as a graph, which makes the reasoning behind each decision easy to explain.
Advantages of decision trees:
- Simple to understand and to interpret; trees can be visualized.
- Require little data preparation.
- Able to handle both numerical and categorical data.
- Useful in data exploration.
- Non-parametric, so no distributional assumptions about the sample space are needed.
Disadvantages:
- Prone to overfitting and can be unstable at times.
- Lose a lot of information when categorizing continuous variables.
- Not sensitive to skewed distributions.
A few terms related to decision trees are as follows (a small worked example follows the list):
- Root Node: the starting point of the tree, which contains the entire data set.
- Splitting: the process of dividing a node into two or more sub-nodes.
- Decision Node: a sub-node that splits into further sub-nodes.
- Leaf or Terminal Node: a node that is not split any further.
- Pruning: removing sub-nodes of a decision node; the opposite of splitting.
- Branch or Sub-Tree: a subsection of the entire tree.
- Parent and Child Node: a node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.
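To make these terms concrete, here is a minimal sketch (our own illustration, using the rpart package and its built-in kyphosis dataset, both introduced in the walkthrough below) that grows a tree of depth one, so the printout shows just a root node, one split, and two leaf nodes:
# A one-split tree: the printout shows the root node, the split rule, and two leaves
library(rpart)
stump <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               control = rpart.control(maxdepth = 1))
print(stump)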
Next we will run a decision tree in R and look at the output. We will use the rpart package for this. rpart comes with a number of built-in datasets, and we will use the one named "kyphosis" for our project.
# rpart is the package for fitting decision trees
install.packages("rpart")
library(rpart)

# kyphosis is a built-in dataset in the rpart package
kyphosis

# Basic code for fitting a decision tree
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Plot the tree and label it; use.n = TRUE adds the observation counts to each node
plot(fit1)
text(fit1, use.n = TRUE)
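The same tree can also be inspected as text, which is often easier to read than the base plot; print() shows each node with its split rule and predicted class, and printcp() shows the complexity-parameter table. Both functions ship with rpart:
# Text view of the fitted tree
print(fit1)
# Complexity table (useful if you later want to prune the tree)
printcp(fit1)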
# A fancier plot can be obtained using the following packages; this step is optional
install.packages("rattle")
install.packages("rpart.plot")
install.packages("RColorBrewer")
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(fit1)
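If you would rather not install rattle, the rpart.plot package alone also produces a readable plot:
# Alternative plot using only the rpart.plot package
rpart.plot(fit1)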
# Predict and score the dataset using the decision tree created above
# In this case we are doing an in-sample validation
pred1 <- as.data.frame(predict(fit1, kyphosis))

# Creating a final table and labelling a case as "Pred_Present" where the
# predicted probability of kyphosis being present is at least 0.5
pred_kyphosis1 <- data.frame(act_kyphosis = kyphosis$Kyphosis, prob = pred1$present)
pred_kyphosis1$prediction[pred_kyphosis1$prob >= 0.5] = "Pred_Present"
pred_kyphosis1$prediction[pred_kyphosis1$prob < 0.5] = "Pred_Absent"
# Creating the classification (or confusion) matrix
prop.table(table(pred_kyphosis1$act_kyphosis, pred_kyphosis1$prediction))
          Pred_Absent   Pred_Present
Absent       0.65          0.14
Present      0.02          0.19
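The check above is in-sample, so it will look optimistic. Here is a minimal sketch of a hold-out validation (the 70/30 split and the seed are our own additions, not part of the original walkthrough):
# Hold-out validation: fit on 70% of the rows, score the remaining 30%
set.seed(123)
idx      <- sample(seq_len(nrow(kyphosis)), size = floor(0.7 * nrow(kyphosis)))
train    <- kyphosis[idx, ]
test     <- kyphosis[-idx, ]
fit_oos  <- rpart(Kyphosis ~ Age + Number + Start, data = train)
pred_oos <- predict(fit_oos, test)   # matrix of class probabilities
table(test$Kyphosis,
      ifelse(pred_oos[, "present"] >= 0.5, "Pred_Present", "Pred_Absent"))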
There are multiple options available when growing a decision tree. Here we will explore one such example: using a loss matrix.
Suppose that in our case the precision rate should be high, while we can allow the recall rate to drop a bit. In other words, we want our positive predictions to be more reliable, i.e. we want the FALSE POSITIVES to be low, even if this means some actual 'present' cases are not captured, so the FALSE NEGATIVES go up.
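As a quick check of where we stand, precision and recall can be computed from the predictions above; a minimal sketch (the calculation is our own, and it relies on the factor levels of Kyphosis being "absent" and "present"):
# Precision = TP / (TP + FP); Recall = TP / (TP + FN), with "present" as the positive class
tab1 <- table(pred_kyphosis1$act_kyphosis, pred_kyphosis1$prediction)
tab1["present", "Pred_Present"] / sum(tab1[, "Pred_Present"])   # precision
tab1["present", "Pred_Present"] / sum(tab1["present", ])        # recall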
We can include this loss matrix in the code where we build the decision tree.
# Creating a decision tree with a loss matrix: misclassifying an actual 'absent' case
# as 'present' (a false positive) is penalized twice as heavily as the reverse
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              parms = list(split = "information",
                           loss = matrix(c(0, 2, 1, 0), byrow = TRUE, nrow = 2)))
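# The rows and columns of the loss matrix follow the factor level order of the outcome;
# checking the levels confirms that "absent" comes first in the built-in kyphosis data
levels(kyphosis$Kyphosis)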
pred2 <- as.data.frame(predict(fit2, kyphosis))
pred_kyphosis2 <- data.frame(act_kyphosis = kyphosis$Kyphosis, prob = pred2$present)
pred_kyphosis2$prediction[pred_kyphosis2$prob >= 0.5] = "Pred_Present"
pred_kyphosis2$prediction[pred_kyphosis2$prob < 0.5] = "Pred_Absent"
# Classification matrix for the new tree
prop.table(table(pred_kyphosis2$act_kyphosis, pred_kyphosis2$prediction))
          Pred_Absent   Pred_Present
Absent       0.74          0.05
Present      0.10          0.11
We can clearly see that the FALSE POSITIVE rate has gone down, while the FALSE NEGATIVE rate has increased.
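Repeating the same precision and recall calculation for the loss-matrix tree makes the trade-off explicit; from the proportions in the two tables above, precision rises (roughly 0.58 to 0.69) while recall falls (roughly 0.90 to 0.52):
# Precision and recall for the tree grown with the loss matrix
tab2 <- table(pred_kyphosis2$act_kyphosis, pred_kyphosis2$prediction)
tab2["present", "Pred_Present"] / sum(tab2[, "Pred_Present"])   # precision (up vs. fit1)
tab2["present", "Pred_Present"] / sum(tab2["present", ])        # recall (down vs. fit1)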