
Tuesday, November 15, 2016

K Nearest Neighbor (KNN) algorithm in Python

Today's post is on K Nearest Neighbor (KNN) and its implementation in Python.

Let's take a hypothetical problem. There are two sections in a class. One section was provided a special coaching program in Mathematics, Physics and Chemistry (i.e., it was exposed to a particular treatment), and the objective is to measure the efficiency of the program, or how much better that section performed.

This is a common problem, and one of the most widely used methods to measure such an effect is test-and-control analysis. However, the main problem is how to find the ideal control group for a particular test group. One solution is KNN, the k-nearest neighbor algorithm: for every test student, we can find k control students based on some pre-determined criteria.

There are a number of articles on the web about the KNN algorithm, and I will not waste your time digressing on the theory here. Rather, I would like to share the Python code that can be used to implement the KNN algorithm on your data.

So we decide to find the control students based on the marks obtained in the last examination in Physics, Chemistry and Mathematics. Our aim is to quantify the increase in total marks in the latest examination (Post_Total_Marks) compared to the last examination (Pre_Total_Marks).

The data of the section that was provided the coaching program is saved in Test_Data.csv and the data of the other section is saved in Control_Data.csv.

import csv
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

#reading in test and control files
control = pd.read_csv('/Users/Desktop/Control_Data.csv', sep = ',')
test = pd.read_csv('/Users/Desktop/Test_Data.csv', sep = ',')

test_rcnt = len(test.index)

#subsetting data for only variables to match on
knn_test = test[['Physics','Chemistry','Mathematics']]
knn_control = control[['Physics','Chemistry','Mathematics']]

#number of nearest neighbors
k=30

#running KNN algorithm
nbrs = NearestNeighbors(n_neighbors = k, algorithm = 'kd_tree').fit(knn_control)
neigh = nbrs.kneighbors(knn_test, return_distance = False)

#creating a blank dataframe to store output
output1 = pd.DataFrame(columns=['roll_number','Physics','Chemistry','Mathematics','Pre_Total_Marks','Post_Total_Marks','nbr_avg_Physics','nbr_avg_Chemistry','nbr_avg_Mathematics','nbr_avg_Pre_Total_Marks','nbr_avg_Post_Total_Marks'])

#looping over the KNN result to summarise the results for all control students
for i in range (0, test_rcnt):
    roll_number = test.loc[i,'roll_number']
    Physics = test.loc[i,'Physics']
    Chemistry = test.loc[i,'Chemistry']
    Mathematics = test.loc[i,'Mathematics']
    Pre_Total_Marks = test.loc[i,'Pre_Total_Marks']
    Post_Total_Marks = test.loc[i,'Post_Total_Marks']

    nbr_Physics = 0
    nbr_Chemistry = 0
    nbr_Mathematics = 0
    nbr_Pre_Total_Marks = 0
    nbr_Post_Total_Marks = 0

    #neigh[i,j] is the positional index of the j-th nearest control student
    for j in range(0, k):
        nbr_Physics += float(control.iloc[neigh[i,j]]['Physics'])
        nbr_Chemistry += float(control.iloc[neigh[i,j]]['Chemistry'])
        nbr_Mathematics += float(control.iloc[neigh[i,j]]['Mathematics'])
        nbr_Pre_Total_Marks += float(control.iloc[neigh[i,j]]['Pre_Total_Marks'])
        nbr_Post_Total_Marks += float(control.iloc[neigh[i,j]]['Post_Total_Marks'])

    nbr_avg_Physics = float(nbr_Physics/k)
    nbr_avg_Chemistry = float(nbr_Chemistry/k)
    nbr_avg_Mathematics = float(nbr_Mathematics/k)
    nbr_avg_Pre_Total_Marks = float(nbr_Pre_Total_Marks/k)
    nbr_avg_Post_Total_Marks = float(nbr_Post_Total_Marks/k)

    output1.loc[len(output1)] = [roll_number, Physics, Chemistry, Mathematics, Pre_Total_Marks, Post_Total_Marks, nbr_avg_Physics, nbr_avg_Chemistry, nbr_avg_Mathematics, nbr_avg_Pre_Total_Marks, nbr_avg_Post_Total_Marks]

#calculating the incremental value and sending the output to csv
output1['est_post_marks']=output1['Pre_Total_Marks'] * (output1['nbr_avg_Post_Total_Marks'] / output1['nbr_avg_Pre_Total_Marks'])
output1['marks_inc']=output1['Post_Total_Marks'] - output1['est_post_marks']
output1.to_csv('Output_File_1.csv')

#final summary table
output2=pd.DataFrame(columns=['student_count','tot_marks_inc','avg_marks_inc'])

#summary of the incremental value and sending the output to csv
student_count=output1.roll_number.nunique()
tot_marks_inc=sum(output1['marks_inc'])
avg_marks_inc=tot_marks_inc/student_count
output2.loc[0]=[student_count,tot_marks_inc,avg_marks_inc]

output2.to_csv('Output_File_2.csv')

#Note: The indentation may get broken when you copy and paste this code. Please check and correct the indentation before running it.

Output_File_1.csv contains the data at a student level for the test section. If you notice, the marks in Physics will be close to nbr_avg_Physics, and the same holds for Chemistry and Mathematics. This is because the KNN algorithm finds control students whose marks in the three subjects are very similar to those of the test student.

The incremental value is calculated as follows:
Estimated Post Period Marks = Test Pre Marks * (Control Post Marks / Control Pre Marks)
Marks Increment = Test Post Marks - Estimated Post Period Marks
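For example, with hypothetical numbers: if a test student scored 400 in the pre-period and 460 in the post-period, and that student's matched control students averaged 390 (pre) and 410 (post), then the estimated post period marks are 400 * (410 / 390) ≈ 420.5, giving a marks increment of 460 - 420.5 ≈ 39.5.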

Output_File_2.csv contains the summarized data of the above calculation.

In my next post, I will share the KNN algorithm implemented in Spark, for the case where the base tables sit on a big data platform in the form of Hive tables.

Saturday, June 11, 2016

Python: Basic Codes

SAS has been the leader in statistical software in large corporates and industries for a long time, while SPSS and Stata have been on top in academics. However, with the huge influx of start-ups, R and Python are quickly rising to the top and have challenged SAS for the leading position.

In this blog, I will share a basic set of Python commands and code snippets that will help you start working with Python. These are easily available online, and this post is just a small step towards consolidating the important ones in one place. If you really want to learn Python, I recommend browsing through Coursera and edX.

I had shared similar codes about data exploration in R in a previous blog. You can read it here.

In this blog, we will go through the following commands:

  1. Reading Data in Python
  2. List of columns in a dataset
  3. Appending two datasets
  4. Different joins in Python
  5. Where conditions and multiple where conditions
  6. Group by command
  7. Finding distinct values in a vector
There are multiple modules available in python, and I have used Pandas for the basic operations.

# Importing the pandas module in the memory
import pandas as pd

# 1. Reading a .csv file in Python
Tran_Table1 = pd.read_csv(r'C:\Users\Desktop\Python_Dir\Tran_Table1.csv')
Tran_Table2 = pd.read_csv(r'C:\Users\Desktop\Python_Dir\Tran_Table2.csv')
Cust_Details = pd.read_csv(r'C:\Users\Desktop\Python_Dir\Cust_Details.csv')

# 2. The top 5 rows and the column names of a data frame
Cust_Details.head()

# The list of column names for a data frame
Colname1 = list(Tran_Table1.columns.values)
Colname2 = list(Cust_Details.columns.values)

# 3. Appending (concatenating) two data frames
frame1 = [Tran_Table1,Tran_Table2]
Total_Tran_Table = pd.concat(frame1)

# 4. Joins - Inner, Left, Right, Outer
Inner_Join = pd.merge(Tran_Table1, Cust_Details, how = 'inner', on = 'Cust_No')
Left_Join = pd.merge(Tran_Table1, Cust_Details, how = 'left', left_on = 'Cust_No', right_on = 'Cust_No')
Right_Join = pd.merge(Tran_Table1, Cust_Details, how = 'right', left_on = 'Cust_No', right_on = 'Cust_No')
Outer_Join = pd.merge(Tran_Table1, Cust_Details, how = 'outer', left_on = 'Cust_No', right_on = 'Cust_No')

# 5. 'Where' condition
Asian_Customer = Cust_Details[Cust_Details['Cust_Continent']=='Asia']
Asian_Customer

# Multiple 'Where' conditions
Asian_Male_Customer = Cust_Details[(Cust_Details['Cust_Continent']=='Asia') & (Cust_Details['Cust_Gender']=='Male')]
Asian_Male_Customer

# 6. Group By
Tran_Table1_Details = pd.merge(Tran_Table1, Cust_Details, how = 'inner', on = 'Cust_No')
Tran_Table1_Details.groupby('Cust_Continent')['X1'].sum()
Tran_Table1_Details.groupby(['Cust_Gender','Cust_Continent'])['Cust_No'].count()

# 7. Unique (distinct) values in a column
Unique_Continents = list(set(Cust_Details.Cust_Continent))


Tuesday, April 26, 2016

Decision Tree in R

Decision Trees are a non-parametric supervised learning method that is often used for classification and regression. The basis of a decision tree is to create simple rules to decide the final outcome based on the available data. It uses a visual or graphical representation to explain the rules and arrive at the final decision.

Advantages of decision trees:
  • Simple to understand and to interpret. Trees can be visualized.
  • Requires little data preparation.
  • Able to handle both numerical and categorical data
  • Useful in data exploration
  • Non-parametric method, thus it does not need any assumptions on the sample space

Disadvantages:
  • Prone to overfitting and can be unstable at times
  • Loses a lot of information while trying to categorize continuous variables
  • Not sensitive to skewed distributions
A few terms related to decision trees are as follows:

  1. Root Node: The starting point of a tree, which captures the entire data.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf or Terminal Node: Nodes that are not split any further are called terminal nodes or leaves.
  5. Pruning: Stopping a node from splitting into further sub-nodes.
  6. Branch or Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.

Next, we will run a decision tree in R and look at the output. We will use the rpart package for this. rpart comes with a number of built-in datasets, and we will use the dataset named “kyphosis” for our project.

# RPART is the package for running decision tree
install.packages("rpart")
library(rpart)

# This is an in-built dataset in the RPART package
kyphosis

# The basic code for implementing a Decision Tree
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# These will plot the tree
plot(fit1)
text(fit1, use.n = TRUE)




# A more fancy plot can be obtained using the following packages. This is optional.
install.packages("rattle")
install.packages("rpart.plot")
install.packages("RColorBrewer")
library(rattle)
library(rpart.plot)
library(RColorBrewer)

fancyRpartPlot(fit1)




# Code to predict and score dataset using the decision tree created above.
# In this case, we are trying to do an in-sample validation
pred1 <- as.data.frame(predict(fit1,kyphosis))

# Creating a final table and labelling a prediction as present where the predicted probability is at least 0.5
pred_kyphosis1 <- data.frame(act_kyphosis = kyphosis$Kyphosis, prob = pred1$present)

pred_kyphosis1$prediction[pred_kyphosis1$prob >= 0.5] = "Pred_Present"
pred_kyphosis1$prediction[pred_kyphosis1$prob < 0.5] = "Pred_Absent"

# Creating the classification (or confusion matrix)
prop.table(table(pred_kyphosis1$act_kyphosis,pred_kyphosis1$prediction))

              Pred_Absent   Pred_Present
Absent           0.65           0.14
Present          0.02           0.19


There are multiple options available while building a decision tree. Here, we will explore one such option: using a loss function.

Suppose, in our case, the precision should be high, while we can allow the recall to drop a bit. In other words, we want our positive predictions to be more accurate, i.e. we want the FALSE POSITIVES to be low, even if that means some actual 1's are not captured and the FALSE NEGATIVES go up.

We can include this loss matrix in the code where we build the decision tree. In the call below, matrix(c(0,2,1,0), byrow = TRUE, nrow = 2) assigns a loss of 2 to misclassifying an 'absent' case as 'present' (a false positive) and a loss of 1 to the reverse, which pushes the tree to avoid false positives.

# Creating a Decision Tree with loss function
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              parms=list(split="information", loss=matrix(c(0,2,1,0), byrow=TRUE, nrow=2)))

pred2 <- as.data.frame(predict(fit2,kyphosis))
pred_kyphosis2 <- data.frame(act_kyphosis = kyphosis$Kyphosis, prob = pred2$present)

pred_kyphosis2$prediction[pred_kyphosis2$prob >= 0.5] = "Pred_Present"
pred_kyphosis2$prediction[pred_kyphosis2$prob < 0.5] = "Pred_Absent"

# Classification matrix for the new tree
prop.table(table(pred_kyphosis2$act_kyphosis,pred_kyphosis2$prediction))

              Pred_Absent   Pred_Present
Absent           0.74           0.05
Present          0.10           0.11

We can clearly see how the FALSE POSITIVES have gone down, while the FALSE NEGATIVES have increased.

Saturday, March 12, 2016

R: Basic Codes

SAS has been the leader in statistical software in corporates and industries for a long time, while SPSS and Stata have been on top in academics. However, with the huge influx of start-ups, R and Python are quickly rising to the top and have challenged SAS for the leading position.

In this blog, I will share a basic set of R commands and code snippets that will help you start working with R. These are easily available online, and this post is just a small step towards consolidating the important ones in one place. If you really want to learn R, I recommend browsing through Coursera and edX.

In this blog, we will go through the following commands:

  1. Reading Data in R
  2. List of columns in a dataset
  3. Different joins in R
  4. If conditions and multiple if conditions
  5. Appending two datasets
  6. Group by command
  7. Finding distinct values in a vector


# 1. Reading Data in R
SampleData <- read.csv('C:/Users/Desktop/Sample_R_Data.csv',header = TRUE,dec=".")
SampleData1 <- read.table('C:/Users/Desktop/Sample_R_Data.csv',sep=",",header = TRUE,dec=".")

# 2. List of columns in the dataset
colnames(SampleData)
colist <- colnames(SampleData)

data1 <- subset(SampleData, select=c("cust_i", "y_var"))
data2 <- SampleData[,c(2,3,4)]
data2_1 <- SampleData[c("cust_i", "y_var")]
data3 <- SampleData[1:10,c(2,3,4)]

# 3. Joins in R
data_LeftJoin <- merge(x = data1, y = data2, by = "cust_i", all.x = TRUE)
data_RightJoin <- merge(x = data1, y = data2, by = "cust_i", all.y = TRUE)
data_OuterJoin <- merge(x = data1, y = data2, by = "cust_i", all = TRUE)
data_CrossJoin <- merge(x = data1, y = data2, by = NULL)
data_InnerJoin <- merge(x = data1, y = data3, by = "cust_i")

# 4a. If conditions
data4 <- subset(SampleData, y_var==1)
data5 <- SampleData[SampleData$y_var==0,]
# install.packages("sqldf")
library(sqldf)
data6 <- sqldf("select * from SampleData where y_var==1")

# 4b. Multiple if conditions (SUBSET can be inefficient)
data7 <- subset(SampleData, y_var==1 & str_trnx_gap <= 300)
data8 <- SampleData[SampleData$y_var==0 & SampleData$str_trnx_gap <= 300,]
data9 <- SampleData[SampleData$y_var==0 | SampleData$str_trnx_gap <= 300,1:4]

# 5. Append
data10 <- rbind(data4,data5)

# 6. Group By Summary Statistics in R
# Using TABLE or AGGREGATE or TAPPLY
# Note the use of USER_DEFINED function

table(SampleData$y_var)
data.frame(table(SampleData$y_var))

aggregate(SampleData$cust_i, by=list(Category=SampleData$y_var), FUN=function(x){NROW(x)})

tapply(SampleData$y_var, SampleData$y_var, FUN=function(x){NROW(x)})
tapply(SampleData$y_var, SampleData$y_var, FUN=function(x){NROW(x)/NROW(SampleData)})

# 7. Get distinct values of a vector
unique(SampleData$y_var)

Tuesday, March 8, 2016

Outlier Treatment in SAS

A very important step in any type of analysis is outlier treatment. An outlier is an observation which lies at a distance from the majority of the observations, perhaps because of some exceptional case, or due to issues in data storage and management. For a regression model to be robust, it is essential to run a sanity check on the data and remove any such anomaly.

A very basic way to remove outliers is to delete such extreme observations completely from the data. Another way to handle them is capping and flooring, where the upper extreme values and the lower extreme values are replaced by pre-determined thresholds. It is also common to implement both of these together.

When the number of variables is huge, it might not be possible to check the distribution of every variable one by one; instead, a rule can be created to treat the outliers. Let us create the following rule:

For all variables, we will delete the values which lie above the 99th percentile or below the 1st percentile. After that, we will cap all values above the 95th percentile at the 95th percentile value, and floor all values below the 5th percentile at the 5th percentile value.
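For readers who prefer Python, here is a minimal pandas sketch of the same rule (the file name and the column list X1, X2, X3 are only illustrative):

import pandas as pd

df = pd.read_csv('Base_Data.csv')          # input dataset
varlist = ['X1', 'X2', 'X3']               # variables to treat

clean = df.copy()
for var in varlist:
    # percentile cut-offs computed on the original data
    p1, p5, p95, p99 = df[var].quantile([0.01, 0.05, 0.95, 0.99])
    # delete observations beyond the 1st / 99th percentile
    clean = clean[(clean[var] >= p1) & (clean[var] <= p99)].copy()
    # floor at the 5th and cap at the 95th percentile
    clean[var] = clean[var].clip(lower=p5, upper=p95)

clean.to_csv('Clean_Data.csv', index=False)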

The following SAS code has been automated to implement the above rule and create an outlier-treated dataset. The rule can be modified as per the scenario, or at the analyst's discretion.


libname dataloc "C:/Desktop/Data";  /* MODEL DATASET LOCATION  */
%let inset=Base_Data;               /* MODEL DATASET NAME      */
%let libout=C:/Desktop/Output;
libname outlib "&libout.";          /* OUTPUT LOCATION         */
%let outset=Clean_Data;             /* OUTPUT DATASET          */
%let upper_del_threshold = 99;      /* UPPER LIMIT TO DELETE   */
%let lower_del_threshold = 1;       /* LOWER LIMIT TO DELETE   */
%let upper_cap_threshold = 95;      /* UPPER LIMIT TO CAP      */
%let lower_floor_threshold = 5;     /* LOWER LIMIT TO FLOOR    */
%let varlist = X1 X2 X3;            /* LIST OF VARIABLES       */

data inset1;
set dataloc.&inset.;
run;

proc means data=inset1 StackODSOutput P&upper_del_threshold. P&lower_del_threshold. P&upper_cap_threshold. P&lower_floor_threshold.;
var &varlist;
ods output summary=LongPctls;
run;

data LongPctls;
set LongPctls;
run;

data _null_;
set LongPctls;
call symput(compress("var"||_n_),compress(variable));
call symput(compress("up_del"||_n_),compress(P&upper_del_threshold.));
call symput(compress("low_del"||_n_),compress(P&lower_del_threshold.));
call symput(compress("up_ext"||_n_),compress(P&upper_cap_threshold.));
call symput(compress("low_ext"||_n_),compress(P&lower_floor_threshold.));
call symput(compress("n"),compress(_n_));
run;

%macro dataset;
data outset1;
set inset1;
%do i = 1 %to &n.;

if &&var&i..  gt &&up_del&i.. then delete;
else if &&var&i..  lt &&low_del&i.. then delete;

else if &&var&i..  gt &&up_ext&i.. then &&var&i.. = &&up_ext&i..;
else if &&var&i..  lt &&low_ext&i.. then &&var&i.. = &&low_ext&i..;

%end;
run;
%mend;
%dataset;

data outlib.&outset.;
set outset1;
run;

/*A quick check in the univariate analysis to validate the outlier treatment*/

%macro chck;
%do i = 1 %to &n.;
proc univariate data=inset1;
var &&var&i..;
run;

proc univariate data=outset1;
var &&var&i..;
run;

%end;
%mend;
%chck;

Tuesday, February 2, 2016

Logistic Regression in SAS

Logistic regression is one of the most used techniques in the analytics world; for propensity modelling, risk modelling and the like, it is one of the most important and widely accepted steps. The main difference between logistic regression and linear regression is that the dependent variable (the Y variable) is a continuous variable in linear regression, but a dichotomous or categorical variable in logistic regression. You can read more about logistic regression here or on the wiki page.
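For reference, the logistic model estimates the probability of the event directly: P(Y = 1) = 1 / (1 + exp(-(b0 + b1*X1 + ... + bk*Xk))), so the predicted value always lies between 0 and 1.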

While validating a logistic model, we look at statistics such as concordance and discordance, sensitivity and specificity, precision and recall, and the area under the ROC curve. One of the most important aspects is precision and recall. The problem faced by analysts is how to balance the two: if we try to increase one of them, the other reduces. A measure that tries to balance both is the F-ratio or F-measure, calculated as the harmonic mean of precision and recall. However, based on the project requirement, one can calculate an adjusted F-ratio, which takes the harmonic mean of precision and recall after giving a higher weight to one of them.
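As a quick illustration in Python, the F-measure and the recall-weighted adjusted F-ratio can be computed as below; the beta = 2 case mirrors the F_stat2 calculation in the SAS code that follows (the precision and recall values here are hypothetical):

def f_measure(precision, recall, beta=1.0):
    # F-beta: harmonic-mean style combination of precision and recall,
    # with recall weighted beta times as heavily as precision
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 0.60, 0.45                 # hypothetical values from a classification table
f1 = f_measure(precision, recall)              # plain F-measure
f2 = f_measure(precision, recall, beta=2.0)    # adjusted F-ratio favouring recall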

The following SAS code is an attempt to simplify this workflow, and it has been automated for future use. A detailed documentation of the logistic regression output is given here. The various outputs like parameter estimates, concordance-discordance, the classification table, etc. will be stored as tables. The HTML output contains the regular output, along with the ROC curve for the training data as well as the ROC curve for the validation data. The score statement gives the classification table for the validation data, and also scores the validation data, which can be used for calculating other validation statistics such as Kolmogorov-Smirnov.



%let train = TRAINING_DATA;           /* TRAINING DATA */
%let validate = VALIDATION_DATA;      /* VALIDATION DATA    */
%let targetvar = Y;                   /* DEPENDENT BINARY VARIABLE */
%let varlist = X1 X2 X3 X4;           /* INDEPENDENT VARIABLES   */
%let binvarlist = X2 X3;              /* LIST OF BINARY INDEPENDENT VARIABLES */


ods graphics on;
ods html;
ods output           
parameterestimates = TBL_ParamEst     /* PARAMETER ESTIMATES     */
OddsRatios =TBL_OddRatio              /* ODD RATIOS              */
LackFitPartition = TBL_HLpartition    /* HOSMER-LEMESHOW PARTITIONS    */
LackFitChiSq = TBL_HLstatistic        /* HOSMER-LEMESHOW STATISTIC     */
Association = TBL_Association         /* CONCORDANCE DISCORDANCE ETC   */
FitStatistics = TBL_FitStatistic      /* AIC, -2LOG, SC, ETC           */
GlobalTests = TBL_GlobalTests         /* WALD, LOGLIKELIHOOD, ETC      */
Classification = TBL_Classification   /* CLASSIFICATION TABLE    */
;

proc logistic data= &train. descending outest=LogisticTest outmodel=LogisticModel plots(only)=(roc) PLOTS(MAXPOINTS=NONE);
class      &binvarlist.    /param=reference ref=first;
model &targetvar.(event='1')=
&varlist.
/lackfit ctable selection=stepwise slentry= .05 sls=.05 ridging=none;
score data=&train. out= Scored_Training_Data outroc= ROC_TABLE_TRAINING_Data;
score data=&validate. out= Scored_Validation_Data outroc= ROC_TABLE_VALIDATION_DATA;
run;

ods html close;
ods graphics off;
ods output close;

proc sort data= TBL_ParamEst;
by descending waldchisq;
run;

/* The above code sorts the significant estimates based on the Wald Chi Square */

data TBL_Classification (drop = B);
set TBL_classification;
precision = correctevents/(correctevents + Incorrectevents);
recall = correctevents/(correctevents + Incorrectnonevents);
F_stat1 = harmean(precision,recall);
B = 2;
F_stat2 = (((1 + B*B) * (precision * recall))/((B*B*precision) + recall));
run;

proc sort data= TBL_Classification;
by descending F_stat1;
run;

/* The above code uses the Classification Table and calculates the precision, recall and sorts the table based on the F ratio*/
/* It also calculates the adjusted F ratio where higher weightage is given on Recall */
/* The probability value where the F ratio (or adjusted F ratio) is maximum should be treated as the threshold probability during validation */

data ROC_TABLE_TRAINING_Data(drop = B);
set ROC_TABLE_TRAINING_Data;
precision = _POS_/(_POS_ + _FALPOS_);
recall = _POS_/(_POS_ + _FALNEG_);
F_stat = harmean(precision,recall);
B = 2;
F_stat2 = (((1 + B*B) * (precision * recall))/((B*B*precision) + recall));
run;

proc sort data= ROC_TABLE_TRAINING_Data;
by descending F_Stat;
run;

/* This code should give the same precision and recall value as the above table. Instead of using the classification table, here the ROC table made on the training data is used to calculate precision and recall     */

data prob_threshold;
set ROC_TABLE_TRAINING_Data;
if _n_ = 1 then output;
run;

data _null_;
set prob_threshold;
call symput("prob_thresh",compress(_prob_));
run;

/* Saving the threshold value of probability for demarcating between 1 and 0 */

data ROC_TABLE_VALIDATION_DATA (drop = B);
set ROC_TABLE_VALIDATION_DATA;
precision = _POS_/(_POS_ + _FALPOS_);
recall = _POS_/(_POS_ + _FALNEG_);
F_stat = harmean(precision,recall);
B = 2;
F_stat2 = (((1 + B*B) * (precision * recall))/((B*B*precision) + recall));
run;

data Validation_Check;
set ROC_TABLE_VALIDATION_DATA;
if _prob_ ge &prob_thresh. then output;     /* Use the Probability_Threshold from the step above */
run;

proc sort data= Validation_Check;
by _prob_;
run;
/* This part of the code calculates the precision and recall in the validation data at the pre-fixed level of probability threshold value. */