CollapseLevels

Krishanu Mukherjee

2017-12-04

This package provides utility functions for binary classification problems.

This package provides functions to collapse levels of an attribute based on response rates.

It also provides functions to compute and display Information Value and Weight of Evidence for attributes, and to convert numeric variables to categorical ones by binning.

These functions only work for binary classification problems.

The functions are intended for the data exploration stage of binary classification.

The binary outcome variable may be a factor with two levels, or an integer (or numeric) with two unique values.

Data Set

This package includes a data set named “German_Credit”. This data set classifies customers as “Good” or “Bad” according to their credit risk. It was contributed by Professor Dr. Hans Hofmann, and can be downloaded from the UCI Machine Learning Repository. The outcome variable of the downloaded data set is an integer with two unique values, 1 and 2.

library(CollapseLevels)
## Loading required package: magrittr
data("German_Credit")

str(German_Credit)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Account_Balance            : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ Duration                   : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Credit_History             : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ Purpose                    : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ Credit_Amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Saving_Accounts_Bonds      : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ Current_Employment_Length  : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ Installment_Rate           : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ MaritalStatusnGender       : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ Guarantors                 : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ Duration in Current Address: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Valuable_Asset             : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ Age                        : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Other_Credit               : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Housing                    : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ Existing_Credits           : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ Job                        : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ Dependents                 : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                  : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ ForeignWorker              : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Good_Bad                   : int  1 2 1 1 2 1 1 1 1 2 ...

Functions

The functions in the package are as follows.

levelsCollapser

This function displays the response rates by the levels of an attribute. Levels with similar response rates may be combined.

We will explore the levels of the attribute “Credit_History” in the German_Credit data set.

data("German_Credit")

# Create an empty list to hold the data structures returned by levelsCollapser

l<-list()

l<-levelsCollapser(German_Credit,resp="Good_Bad",bins=10)

# dset holds the data set; here German_Credit is passed as dset
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins for categorizing/binning numeric variables
# The default value of bins is 10, so bins need not be specified when the default is wanted
# The function returns a list.
# For every attribute in the data set, the list contains a table that shows the response rates
# by the levels of the attribute
# Levels with similar response percentages may be collapsed.

l$Credit_History
##   Credit_History tot response non_response response_pct non_response_pct
## 1            A30  40       25           15     8.333333         2.142857
## 2            A31  49       28           21     9.333333         3.000000
## 4            A33  88       28           60     9.333333         8.571429
## 5            A34 293       50          243    16.666667        34.714286
## 3            A32 530      169          361    56.333333        51.571429
##   response_pct_change
## 1             0.00000
## 2            10.71429
## 4             0.00000
## 5            44.00000
## 3            70.41420

The column response gives the total number of responses (binary outcome variable is 1) for the level.

The column response_pct gives the response percentage for the level.

The table is sorted by response_pct.

The column response_pct_change gives the percentage change in response_pct from one row (level) of the sorted table to the next.
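The printed values are consistent with a change measured relative to the current row's response_pct. The sketch below reproduces the column from the counts shown above using base R only. The formula is inferred from the printed numbers, not taken from the package source, so treat it as an assumption.

```r
# Response counts copied from the Credit_History table printed above
tab <- data.frame(
  Credit_History = c("A30", "A31", "A33", "A34", "A32"),
  response       = c(25, 28, 28, 50, 169),
  stringsAsFactors = FALSE
)

# Share of all responses that falls in each level, in percent
tab$response_pct <- 100 * tab$response / sum(tab$response)

# Sort by response_pct, as in the table above
tab <- tab[order(tab$response_pct), ]

# Previous row's response_pct; the first row is compared with itself
prev <- c(tab$response_pct[1], head(tab$response_pct, -1))

# Assumed formula: change relative to the current row's response_pct
tab$response_pct_change <- 100 * (tab$response_pct - prev) / tab$response_pct

print(tab)
```

Under this assumption the last column comes out as 0, 10.71, 0, 44 and 70.41, matching the output of levelsCollapser shown above.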

As seen from the response_pct_change column, the change in response_pct from level A31 to level A33 is 0, so these two levels may be combined.
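Combining the levels themselves can be done with base R. The following is a minimal sketch on a small made-up factor, not a function of this package; the merged label "A31_A33" is a hypothetical name chosen for illustration.

```r
# Small illustrative factor using the Credit_History level codes
credit_history <- factor(c("A30", "A31", "A32", "A33", "A34", "A33", "A31"))

# Work on a copy so the original variable is preserved
collapsed <- credit_history

# Assigning the same label to several levels merges them into one level
levels(collapsed)[levels(collapsed) %in% c("A31", "A33")] <- "A31_A33"

print(levels(collapsed))
print(table(collapsed))
```

The same assignment could be applied to German_Credit$Credit_History once the levels to combine have been chosen.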

numericToCategorical

This function categorizes a numeric attribute by binning it.

We will categorize the numeric attribute “Duration” in the German_Credit data set.

# Create an empty list to hold the data structures returned by numericToCategorical
l<-list()

# Call the function numericToCategorical to categorize the numeric attribute Duration
# dset holds the data set
# German_Credit is the data set
# col specifies the name of the numeric variable we want to categorize
# resp specifies the name of the binary response variable 
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level


l<-numericToCategorical(dset=German_Credit,col="Duration",resp="Good_Bad",bins=10,adjFactor=0.5)

# The default value of bins is 10, and that of adjFactor is 0.5.
# Parameters left at their default values need not be specified in the function call
# l$categoricalVariable gives the binned categorized variable.
# A bin [a,b) denotes >=a and <b
# A bin [a,b] denotes >=a and <=b

head(l$categoricalVariable)
## [1] [0,9) [0,9) [0,9) [0,9) [0,9) [0,9)
## 8 Levels: [0,9) [9,12) [12,15) [15,18) [18,24) [24,30) ... [36,72]
#  l$IVTable gives  the Information values of the levels of the binned categorized variable

l$IVTable
##   categoricalDuration tot response non_response response_pct
## 1               [0,9)  94       10           84   0.03333333
## 2              [9,12)  86       17           69   0.05666667
## 3             [12,15) 187       50          137   0.16666667
## 4             [15,18)  66       13           53   0.04333333
## 5             [18,24) 153       52          101   0.17333333
## 6             [24,30) 201       62          139   0.20666667
## 7             [30,36)  43       14           29   0.04666667
## 8             [36,72] 170       82           88   0.27333333
##   non_response_pct         woe           iv
## 1       0.12000000  1.28093385 0.1110142666
## 2       0.09857143  0.55359530 0.0231982792
## 3       0.19571429  0.16066006 0.0046667922
## 4       0.07571429  0.55804470 0.0180700187
## 5       0.14428571 -0.18342106 0.0053279451
## 6       0.19857143 -0.03995831 0.0003234721
## 7       0.04142857 -0.11905936 0.0006236443
## 8       0.12571429 -0.77668029 0.1146528052
#  l$IV gives the Information Value for the binned categorized variable

l$IV
## [1] 0.2778772
#  l$collapseLevels  gives a table of the response rates by the levels of the categorized variable
#  Levels with similar response rates may be collapsed

l$collapseLevels
##   categoricalDuration tot response non_response response_pct
## 1               [0,9)  94       10           84     3.333333
## 4             [15,18)  66       13           53     4.333333
## 7             [30,36)  43       14           29     4.666667
## 2              [9,12)  86       17           69     5.666667
## 3             [12,15) 187       50          137    16.666667
## 5             [18,24) 153       52          101    17.333333
## 6             [24,30) 201       62          139    20.666667
## 8             [36,72] 170       82           88    27.333333
##   non_response_pct response_pct_change
## 1        12.000000            0.000000
## 4         7.571429           23.076923
## 7         4.142857            7.142857
## 2         9.857143           17.647059
## 3        19.571429           66.000000
## 5        14.428571            3.846154
## 6        19.857143           16.129032
## 8        12.571429           24.390244

The change in response_pct from level [15,18) to [30,36) is only about 7 percent.

So these levels may be combined.

Similarly, levels [12,15) and [18,24) may be combined, since response_pct changes by less than 4 percent between them.
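The bin labels in the tables above use the same interval notation that base R's cut() produces with right = FALSE and include.lowest = TRUE. The sketch below is not the package's implementation: it assumes the break points read off the printed levels and only illustrates how individual Duration values map to bins.

```r
# Break points read off the levels printed above (an assumption for illustration)
breaks <- c(0, 9, 12, 15, 18, 24, 30, 36, 72)

# A few Duration values taken from the str() output of German_Credit
duration <- c(6, 48, 12, 42, 24, 36)

# right = FALSE gives left-closed bins [a,b);
# include.lowest = TRUE closes the last bin on the right, giving [36,72]
binned <- cut(duration, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(data.frame(duration, bin = binned))
```

A value equal to a break point, such as 12 or 36, falls into the bin that starts at that value.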

IVCalc2

This function displays the Information Value of each attribute as a whole (not level-wise), for all the attributes in the data set.

# No pre-allocation is needed; IVCalc2 returns a data frame, stored below in d

# dset holds the data set; here German_Credit is passed as dset
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# Parameters left at their default values need not be specified in the function call
# The function returns a data frame
# with one row per attribute, giving the Information Value of that attribute


d<-IVCalc2(dset=German_Credit,resp="Good_Bad")


d
##                       Variable          IV
## 1              Account_Balance 0.666011503
## 2                     Duration 0.277877223
## 3               Credit_History 0.293233547
## 4                      Purpose 0.169195066
## 5                Credit_Amount 0.113980630
## 6        Saving_Accounts_Bonds 0.196009557
## 7    Current_Employment_Length 0.086433631
## 8             Installment_Rate 0.020522345
## 9         MaritalStatusnGender 0.044670678
## 10                  Guarantors 0.032019322
## 11 Duration in Current Address 0.003247037
## 12              Valuable_Asset 0.112638262
## 13                         Age 0.121227707
## 14                Other_Credit 0.057614542
## 15                     Housing 0.083293434
## 16            Existing_Credits 0.010083557
## 17                         Job 0.008762766
## 18                  Dependents 0.000000000
## 19                   Telephone 0.006377605
## 20               ForeignWorker 0.043877412
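To see the most informative attributes first, the returned data frame can be sorted by IV. The sketch below rebuilds a few rows of the output above by hand so that it runs without the package; it assumes the columns are named Variable and IV, as printed.

```r
# A few rows copied from the IVCalc2 output above
d_small <- data.frame(
  Variable = c("Account_Balance", "Duration", "Credit_History", "Purpose", "Dependents"),
  IV       = c(0.666011503, 0.277877223, 0.293233547, 0.169195066, 0.000000000),
  stringsAsFactors = FALSE
)

# Sort from the highest Information Value to the lowest
ranked <- d_small[order(d_small$IV, decreasing = TRUE), ]

print(ranked)
```

With the full result, the equivalent call would be d[order(d$IV, decreasing = TRUE), ].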

IVCalc

This function displays the Information Values by the levels of an attribute. This information is displayed for all attributes in the data set.

# Create an empty list to hold the data structures returned by IVCalc function
l<-list()

# dset holds the data set; here German_Credit is passed as dset
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# Parameters left at their default values need not be specified in the function call
# The function returns a list.
# For every attribute, the list holds the Information Values by levels of the
# attribute (IVTable) and the Information Value for the entire attribute (IV)

l<-IVCalc(dset=German_Credit,resp="Good_Bad")

#Explore Information Values for the attribute Credit_History

l$Credit_History
## $IVTable
##   Credit_History tot response non_response response_pct non_response_pct
## 1            A30  40       25           15   0.08333333       0.02142857
## 2            A31  49       28           21   0.09333333       0.03000000
## 3            A32 530      169          361   0.56333333       0.51571429
## 4            A33  88       28           60   0.09333333       0.08571429
## 5            A34 293       50          243   0.16666667       0.34714286
##           woe           iv
## 1 -1.35812348 0.0840743109
## 2 -1.13497993 0.0718820624
## 3 -0.08831862 0.0042056484
## 4 -0.08515781 0.0006488214
## 5  0.73374058 0.1324227042
## 
## $IV
## [1] 0.2932335

displayWOE

This function displays the Weight of Evidence of the levels of an attribute.

# dset holds the data set; here German_Credit is passed as dset
# col specifies the name of the variable for which we want to display the Weight of Evidence values
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# Parameters left at their default values need not be specified in the function call

# Display the Weight of Evidence for the levels of the Job attribute

displayWOE(German_Credit,col="Job",resp="Good_Bad")

displayResponseRatebyLevels

This function displays the response percentages of the levels of an attribute.

# dset holds the data set; here German_Credit is passed as dset
# col specifies the name of the variable for which we want to display the response percentages
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# Parameters left at their default values need not be specified in the function call

# Display the response percentages for the levels of the Account_Balance attribute

displayResponseRatebyLevels(German_Credit,col="Account_Balance",resp="Good_Bad")

displayIV

This function displays the Information Values of the levels of an attribute.

# dset holds the data set; here German_Credit is passed as dset
# col specifies the name of the variable for which we want to display the IV values
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# Parameters left at their default values need not be specified in the function call

# Display the IV values for the levels of the Account_Balance attribute

displayIV(German_Credit,col="Account_Balance",resp="Good_Bad")