Random Forest and Decision Tree Analysis

Decision Tree and Random Forest

This analysis will utilize the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv and the purpose is to find the differences between the decision tree method and the random forest method for predicting wine quality from the 11 variables that indicate each wine’s chemical readings. The 12th variable is the quality rating given to these 1599 wines and is used as the predictor in the models. The quality has ratings from 3 to 8. To make the ratings work in the models there will be three groups “Fair” = (3-4), “Satisfactory” = (5-6), and “Excellent” = (7-8).

# Load All Libraries
library(C50)
library(gmodels)
library(randomForest)
library(rpart)
library(rpart.plot)
library(caret)

#Load Data set
winequality.red <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", row.names=NULL, sep=";")
vino1 <- winequality.red

Glancing over the summary table we can see the quality is 3 to 8. The density is interesting because it is approximately 1. Maybe this is an important measure or maybe not, we will leave it in for now. The total.sulfur.dioxide is also interesting because it the max=289 while the mean=46.47 indicates some outlier values, but again lets keep the data to gather for now.

summary(vino1)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Here the quality variable is grouped into categories because we can see in the barplot that there is a three way split in the quality ratings. Fair, Satisfactory, and Excellent will be used as the dependent variable. From the graph we can see the Satisfactory quality is about 82% of the data, Excellent is approximately 14% of the data and Fair is 0.4% of the data. In this example we have a good amount of data to indicate a normal quality wine in the middle. The Excellent wines are a metric that may be something we aim for if producing high quality wines. While the fair wines don’t have much data and possibly a good measure of what not to do when making a wine.

barplot(table(vino1$quality))

#group quality variable into 3
vino1$quality[vino1$quality >= 7] <- "Excellent"
vino1$quality[vino1$quality >= 5 & vino1$quality <=6] <- "Satisfactory"
vino1$quality[vino1$quality <= 4] <- "Fair"
#change quality to factor
vino1$quality <- as.factor(vino1$quality)
#view data set
str(vino1$quality)

##  Factor w/ 3 levels "Excellent","Fair",..: 3 3 3 3 3 3 3 1 1 3 ...

table(vino1$quality)

## 
##    Excellent         Fair Satisfactory 
##          217           63         1319

barplot(table(vino1$quality))

Part 1 - Decision Tree with C5.0 package

The first section is going to use the C5.0 package to model a decision tree. The data is split 70%/30% train/test for this project. The data is randomly sampled into training and test objects. In the training set we can see 83% Satisfactory, 13% Excellent, and 0.4% fair which are close to representing the data split mentioned earlier. The C5.0 model shows 1119 samples and the 11 predictors with a tree size of 54. The output is a bit lengthy and toward the end we can see the error at 5.8% which doesn’t seem too bad, but we will try to improve on later. The accuracy would be (106+21+927)/1119= 94% over the training data. The attributes usage is interesting and shows the top two attributes alcohol (100%) and volatile.acidity(96.25%). Then a sharp drop to sulfates at 34.5%. As mentioned earlier the density was about 1 and here we see its last for attributes at 2.32%. The attribute usage chart could possibly be used to limit variables in future decision trees by dropping the bottom four under 10%. For now we will proceed through the exercise with all the attributes to see what happens.

###############Decision Tree in C5.0
set.seed(1234)
train_sample <- sample(1599, 1119) #70/30 data split
str(train_sample) #check sample

##  int [1:1119] 1308 1018 1125 1004 623 905 645 934 400 900 ...

vino_train <- vino1[train_sample, ] #training set
vino_test <- vino1[-train_sample, ] #testing set
prop.table((table(vino_train$quality))) #Check split

## 
##    Excellent         Fair Satisfactory 
##   0.13404826   0.04468275   0.82126899

#library(C50)
set.seed(765)
vino_model <- C5.0(vino_train[-12], vino_train$quality) #training set less quality variable

vino_model #check model

## 
## Call:
## C5.0.default(x = vino_train[-12], y = vino_train$quality)
## 
## Classification Tree
## Number of samples: 1119 
## Number of predictors: 11 
## 
## Tree size: 69 
## 
## Non-standard options: attempt to group attributes

This shows the predictive model and the cross table produces a summary. The model accuracy is 83% (40+1+359)/(480). The false positive of 34 and the false negative of 19 show there could be a fuzzy area between Excellent and Satisfactory. There is also a discrepancy between Fair and Satisfactory. With satisfactory being a large grouping there is room for error but it would be nice to have more accuracy on the higher end wines in the Excellent category.

vino_pred <- predict(vino_model, vino_test) #run prediction
#library(gmodels)
#Table outcome
CrossTable(vino_test$quality, vino_pred, prop.chisq = FALSE, prop.c = FALSE, 
           prop.r = FALSE, dnn = c('actual quality', 'predicted quality'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  480 
## 
##  
##                | predicted quality 
## actual quality |    Excellent |         Fair | Satisfactory |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##      Excellent |           28 |            1 |           38 |           67 | 
##                |        0.058 |        0.002 |        0.079 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##           Fair |            1 |            1 |           11 |           13 | 
##                |        0.002 |        0.002 |        0.023 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##   Satisfactory |           28 |           15 |          357 |          400 | 
##                |        0.058 |        0.031 |        0.744 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |           57 |           17 |          406 |          480 | 
## ---------------|--------------|--------------|--------------|--------------|
## 
##

confusionMatrix(vino_test$quality, vino_pred, dnn = c('actual quality', 'predicted quality'))

## Confusion Matrix and Statistics
## 
##               predicted quality
## actual quality Excellent Fair Satisfactory
##   Excellent           28    1           38
##   Fair                 1    1           11
##   Satisfactory        28   15          357
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8042          
##                  95% CI : (0.7658, 0.8387)
##     No Information Rate : 0.8458          
##     P-Value [Acc > NIR] : 0.9941          
##                                           
##                   Kappa : 0.2946          
##                                           
##  Mcnemar's Test P-Value : 0.5458          
## 
## Statistics by Class:
## 
##                      Class: Excellent Class: Fair Class: Satisfactory
## Sensitivity                   0.49123    0.058824              0.8793
## Specificity                   0.90780    0.974082              0.4189
## Pos Pred Value                0.41791    0.076923              0.8925
## Neg Pred Value                0.92978    0.965739              0.3875
## Prevalence                    0.11875    0.035417              0.8458
## Detection Rate                0.05833    0.002083              0.7438
## Detection Prevalence          0.13958    0.027083              0.8333
## Balanced Accuracy             0.69951    0.516453              0.6491

The next part of the C5.0 model is to try to improve the accuracy by increasing the number of trials to 10, which is the base point for most boosted models. The output is rather long. The Attribute usage is now showing 100% usage of alcohol, sulphates (from the first model) and includes density, total.sulfur.dioxide, and volatile.acidity. The remaining are in the 90% range with pH down to 86.6%. Thus with 10 trials more chances for the attributes to be utilized.The model classified 99% or a 0.71% error rate on the training data which is a big improvement from the 5.8% error we had earlier with one pass over the data. The accuracy only climbed up to 86% with 10 trials. A small gain but worth noting and looking further into possibly increasing the trials with more time for analysis. Not sure what to do to increase accuracy, the error rate over the training data is really low with the boost.

vino_boost <- C5.0(vino_train[-12], vino_train$quality, trials = 10) #try to improve
vino_boost

## 
## Call:
## C5.0.default(x = vino_train[-12], y = vino_train$quality, trials = 10)
## 
## Classification Tree
## Number of samples: 1119 
## Number of predictors: 11 
## 
## Number of boosting iterations: 10 
## Average tree size: 60.8 
## 
## Non-standard options: attempt to group attributes

vino_boost$boostResults # long summary

##    Trial Size Errors Percent         Data
## 1      1   69     52     4.6 Training Set
## 2      2   40    140    12.5 Training Set
## 3      3   59    145    13.0 Training Set
## 4      4   62    134    12.0 Training Set
## 5      5   70    110     9.8 Training Set
## 6      6   60    122    10.9 Training Set
## 7      7   68    105     9.4 Training Set
## 8      8   48    192    17.2 Training Set
## 9      9   77    133    11.9 Training Set
## 10    10   55    121    10.8 Training Set

set.seed(354)
vino_boost_pred <- predict(vino_boost, vino_test) #run prediction on improved model

confusionMatrix(vino_test$quality, vino_boost_pred, dnn = c('actual', 'predicted'))

## Confusion Matrix and Statistics
## 
##               predicted
## actual         Excellent Fair Satisfactory
##   Excellent           28    0           39
##   Fair                 0    1           12
##   Satisfactory        16    1          383
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8583          
##                  95% CI : (0.8239, 0.8883)
##     No Information Rate : 0.9042          
##     P-Value [Acc > NIR] : 0.9995          
##                                           
##                   Kappa : 0.3936          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Excellent Class: Fair Class: Satisfactory
## Sensitivity                   0.63636    0.500000              0.8825
## Specificity                   0.91055    0.974895              0.6304
## Pos Pred Value                0.41791    0.076923              0.9575
## Neg Pred Value                0.96126    0.997859              0.3625
## Prevalence                    0.09167    0.004167              0.9042
## Detection Rate                0.05833    0.002083              0.7979
## Detection Prevalence          0.13958    0.027083              0.8333
## Balanced Accuracy             0.77346    0.737448              0.7565

This section attempts to add penalties to the model errors in the false positive and false negative values. Its didn’t work out that well and reduced the accuracy down to 82.9%. The errors where slightly increased in the false positives and decreased in the false negatives.Maybe the penalty values need to be changed a bit for this to work. Further analysis is needed in this error correction section.

matrix_dimensions <- list(c("Excellent", "Fair", "Satisfactory"), c("Excellent", "Fair", "Satisfactory"))
names(matrix_dimensions) <- c("predicted", "actual")
matrix_dimensions

## $predicted
## [1] "Excellent"    "Fair"         "Satisfactory"
## 
## $actual
## [1] "Excellent"    "Fair"         "Satisfactory"

##### doesnt seem to make a difference
error_cost <- matrix(c(0,1,4,1,0,4,4,1,0), nrow = 3, dimnames = matrix_dimensions)
error_cost

##               actual
## predicted      Excellent Fair Satisfactory
##   Excellent            0    1            4
##   Fair                 1    0            1
##   Satisfactory         4    4            0

vino_cost <- C5.0(vino_train[-12], vino_train$quality, costs = error_cost)
vino_cost_pred <- predict(vino_cost, vino_test)
CrossTable(vino_test$quality, vino_cost_pred, prop.chisq = FALSE, prop.c = FALSE, 
           prop.r = FALSE, dnn = c('actual quality', 'predicted quality'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  480 
## 
##  
##                | predicted quality 
## actual quality |    Excellent |         Fair | Satisfactory |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##      Excellent |           28 |            1 |           38 |           67 | 
##                |        0.058 |        0.002 |        0.079 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##           Fair |            1 |            1 |           11 |           13 | 
##                |        0.002 |        0.002 |        0.023 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##   Satisfactory |           28 |           15 |          357 |          400 | 
##                |        0.058 |        0.031 |        0.744 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |           57 |           17 |          406 |          480 | 
## ---------------|--------------|--------------|--------------|--------------|
## 
##

confusionMatrix(vino_test$quality, vino_cost_pred, dnn = c('actual quality', 'predicted quality'))

## Confusion Matrix and Statistics
## 
##               predicted quality
## actual quality Excellent Fair Satisfactory
##   Excellent           28    1           38
##   Fair                 1    1           11
##   Satisfactory        28   15          357
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8042          
##                  95% CI : (0.7658, 0.8387)
##     No Information Rate : 0.8458          
##     P-Value [Acc > NIR] : 0.9941          
##                                           
##                   Kappa : 0.2946          
##                                           
##  Mcnemar's Test P-Value : 0.5458          
## 
## Statistics by Class:
## 
##                      Class: Excellent Class: Fair Class: Satisfactory
## Sensitivity                   0.49123    0.058824              0.8793
## Specificity                   0.90780    0.974082              0.4189
## Pos Pred Value                0.41791    0.076923              0.8925
## Neg Pred Value                0.92978    0.965739              0.3875
## Prevalence                    0.11875    0.035417              0.8458
## Detection Rate                0.05833    0.002083              0.7438
## Detection Prevalence          0.13958    0.027083              0.8333
## Balanced Accuracy             0.69951    0.516453              0.6491

Part 2 - Decision Tree

#Decision Tree in rpart
#library(rpart)
set.seed(8765)
#split data into 70/30
ind <- sample(1599, 1119) #70/30 data split
vinoDT_train <- vino1[ind, ] #training set
vinoDT_test <- vino1[-ind, ] #testing set
#run rpart
vinoDT <- rpart(quality ~ ., method = "class", data = vino1, control = rpart.control(minsplit=10))
plotcp(vinoDT)

The graph above shows the plot of the cost complexity parameters and indicates where pruning off of the splits that are not worthwhile. Any split that does not improve the fit by the cp will be pruned off. The graph shows where the size of the tree is higher at 14 the relative classification error rate is lower on the training set. This graph doesn’t show the optimal level before the error grows again. The table in the summary also shows these figures and describes the best place to cut off the tree. The graph shows a convenient line at approx 8 for optimal cut off.

((vinoDT$cptable))

##           CP nsplit rel error    xerror       xstd
## 1 0.06428571      0 1.0000000 1.0000000 0.05427741
## 2 0.02500000      2 0.8714286 0.9035714 0.05211953
## 3 0.02142857      3 0.8464286 0.8892857 0.05178265
## 4 0.01785714      5 0.8035714 0.8964286 0.05195167
## 5 0.01666667      6 0.7857143 0.9000000 0.05203575
## 6 0.01428571     10 0.7178571 0.9035714 0.05211953
## 7 0.01071429     11 0.7035714 0.8607143 0.05109473
## 8 0.01000000     13 0.6821429 0.8357143 0.05047682

attributes(vinoDT)

## $names
##  [1] "frame"               "where"               "call"               
##  [4] "terms"               "cptable"             "method"             
##  [7] "parms"               "control"             "functions"          
## [10] "numresp"             "splits"              "variable.importance"
## [13] "y"                   "ordered"            
## 
## $xlevels
## named list()
## 
## $ylevels
## [1] "Excellent"    "Fair"         "Satisfactory"
## 
## $class
## [1] "rpart"

summary(vinoDT$frame ) #too long

##                   var           n                wt              dev       
##  <leaf>             :14   Min.   :   3.0   Min.   :   3.0   Min.   :  0.0  
##  free.sulfur.dioxide: 3   1st Qu.:  16.5   1st Qu.:  16.5   1st Qu.:  4.0  
##  volatile.acidity   : 3   Median :  50.0   Median :  50.0   Median : 16.0  
##  alcohol            : 2   Mean   : 209.4   Mean   : 209.4   Mean   : 37.7  
##  sulphates          : 2   3rd Qu.: 119.0   3rd Qu.: 119.0   3rd Qu.: 38.5  
##  pH                 : 1   Max.   :1599.0   Max.   :1599.0   Max.   :280.0  
##  (Other)            : 2                                                    
##       yval         complexity          ncompete       nsurrogate   
##  Min.   :1.000   Min.   :0.000000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.008036   1st Qu.:0.000   1st Qu.:0.000  
##  Median :3.000   Median :0.010000   Median :0.000   Median :0.000  
##  Mean   :2.259   Mean   :0.015101   Mean   :1.926   Mean   :1.963  
##  3rd Qu.:3.000   3rd Qu.:0.016667   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :3.000   Max.   :0.064286   Max.   :4.000   Max.   :5.000  
##                                                                    
##        yval2.V1              yval2.V2              yval2.V3              yval2.V4              yval2.V5              yval2.V6              yval2.V7           yval2.nodeprob   
##  Min.   :1.0000000     Min.   :  0.00000     Min.   : 0.00000      Min.   :   0.0000     Min.   :0.0000000     Min.   :0.00000000    Min.   :0.0000000     Min.   :0.0018762   
##  1st Qu.:1.0000000     1st Qu.:  8.00000     1st Qu.: 0.00000      1st Qu.:   5.5000     1st Qu.:0.1173055     1st Qu.:0.00000000    1st Qu.:0.3475232     1st Qu.:0.0103189   
##  Median :3.0000000     Median : 16.00000     Median : 0.00000      Median :  18.0000     Median :0.2631579     Median :0.00000000    Median :0.6842105     Median :0.0312695   
##  Mean   :2.2592593     Mean   : 35.96296     Mean   : 7.37037      Mean   : 166.1111     Mean   :0.3877042     Mean   :0.01276578    Mean   :0.5995300     Mean   :0.1309846   
##  3rd Qu.:3.0000000     3rd Qu.: 48.00000     3rd Qu.: 2.00000      3rd Qu.:  90.5000     3rd Qu.:0.6524768     3rd Qu.:0.02089000    3rd Qu.:0.8505476     3rd Qu.:0.0744215   
##  Max.   :3.0000000     Max.   :217.00000     Max.   :63.00000      Max.   :1319.0000     Max.   :1.0000000     Max.   :0.05263158    Max.   :1.0000000     Max.   :1.0000000   
##

#library(rpart.plot)
rpart.plot(vinoDT, main="Wine Decision Tree") #plot tree

optimal <- which.min(vinoDT$cptable[,"xerror"])
cp <- vinoDT$quality[optimal, "CP"]
#Tree Pruning 
vinoDT_prune <- prune(vinoDT, cp=vinoDT$cptable[which.min(vinoDT$cptable[,"xerror"]),"CP"])

plotcp(vinoDT_prune)

rpart.plot(vinoDT_prune, main="Wine Decision Tree") #plot tree

Here the accuracy of the model prediction is up to 88%. This method brought down the error rate in the false positive and false negative areas 34 down to 24 and 19 down to 11. A good result even above the C5.0 boosted output. This model optimized the error rate by cutting off the tree in the 8 area.

vinoDT_pred <- predict(vinoDT_prune, newdata = vinoDT_test, type = "class")
confusionMatrix(vinoDT_test$quality, vinoDT_pred, dnn = c('actual quality', 'predicted quality'))

## Confusion Matrix and Statistics
## 
##               predicted quality
## actual quality Excellent Fair Satisfactory
##   Excellent           41    0           31
##   Fair                 0    0           21
##   Satisfactory         8    0          379
## 
## Overall Statistics
##                                          
##                Accuracy : 0.875          
##                  95% CI : (0.842, 0.9032)
##     No Information Rate : 0.8979         
##     P-Value [Acc > NIR] : 0.9553         
##                                          
##                   Kappa : 0.5206         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Excellent Class: Fair Class: Satisfactory
## Sensitivity                   0.83673          NA              0.8794
## Specificity                   0.92807     0.95625              0.8367
## Pos Pred Value                0.56944          NA              0.9793
## Neg Pred Value                0.98039          NA              0.4409
## Prevalence                    0.10208     0.00000              0.8979
## Detection Rate                0.08542     0.00000              0.7896
## Detection Prevalence          0.15000     0.04375              0.8063
## Balanced Accuracy             0.88240          NA              0.8580

plot(vinoDT_pred ~ quality, data=vinoDT_test, xlab="Actual", ylab = "Predicted")

Random Forest

#Random Forest 
#library(randomForest)
#Split data 70/30
set.seed(9876)
ind <- sample(1599, 1119) #70/30 data split
trainData <- vino1[ind, ] #training set
testData <- vino1[-ind, ] #testing set
rf <- randomForest(quality ~., data = trainData, ntree=100, proximity=TRUE)
table(predict(rf), trainData$quality)

##               
##                Excellent Fair Satisfactory
##   Excellent           87    1           24
##   Fair                 0    0            0
##   Satisfactory        72   43          892

((rf)) #print

## 
## Call:
##  randomForest(formula = quality ~ ., data = trainData, ntree = 100,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 12.51%
## Confusion matrix:
##              Excellent Fair Satisfactory class.error
## Excellent           87    0           72  0.45283019
## Fair                 1    0           43  1.00000000
## Satisfactory        24    0          892  0.02620087

attributes(rf)

## $names
##  [1] "call"            "type"            "predicted"       "err.rate"       
##  [5] "confusion"       "votes"           "oob.times"       "classes"        
##  [9] "importance"      "importanceSD"    "localImportance" "proximity"      
## [13] "ntree"           "mtry"            "forest"          "y"              
## [17] "test"            "inbag"           "terms"          
## 
## $class
## [1] "randomForest.formula" "randomForest"

plot(rf)

The random forest shows alcohol at 50.67 and volatile.acidity at 39.99 under the MeanDecrease which is similar to the decision tree is highest ranking. Some of the other variables are also similar we see pH not at the lowest at 24.68 but fee.sulfur.dioxide is lowest at 21.55 in this model. The preceding graph demonstrates the same numbers from the table.

importance(rf)

##                      MeanDecreaseGini
## fixed.acidity                25.42739
## volatile.acidity             44.73501
## citric.acid                  27.87692
## residual.sugar               25.44751
## chlorides                    27.65742
## free.sulfur.dioxide          20.05222
## total.sulfur.dioxide         26.89442
## density                      30.71208
## pH                           22.67565
## sulphates                    42.21175
## alcohol                      50.04752

varImpPlot(rf)

vinoPred <- predict(rf, newdata = testData, type = "class")
table(vinoPred, testData$quality)

##               
## vinoPred       Excellent Fair Satisfactory
##   Excellent           22    0           12
##   Fair                 0    0            0
##   Satisfactory        36   19          391

This model shows an accuracy of 87% which is an increase from other models used.

confusionMatrix(testData$quality, vinoPred,dnn = c('actual', 'predicted'))

## Confusion Matrix and Statistics
## 
##               predicted
## actual         Excellent Fair Satisfactory
##   Excellent           22    0           36
##   Fair                 0    0           19
##   Satisfactory        12    0          391
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8604          
##                  95% CI : (0.8262, 0.8902)
##     No Information Rate : 0.9292          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3395          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Excellent Class: Fair Class: Satisfactory
## Sensitivity                   0.64706          NA              0.8767
## Specificity                   0.91928     0.96042              0.6471
## Pos Pred Value                0.37931          NA              0.9702
## Neg Pred Value                0.97156          NA              0.2857
## Prevalence                    0.07083     0.00000              0.9292
## Detection Rate                0.04583     0.00000              0.8146
## Detection Prevalence          0.12083     0.03958              0.8396
## Balanced Accuracy             0.78317          NA              0.7619

Bryan

2018/06/01

Decision Tree and Random Forest

Part 1 - Decision Tree with C5.0 package

Part 2 - Decision Tree

Random Forest