Data Preprocessing
The following analysis is of the diamonds dataset downloaded from the tidyverse/ggplot2 Github repository. The data is in a .csv file. The purpose of the analysis is to explore the data and perform data exploration, cleaning, and preprocessing needed for modeling. Data cleaning and preprocessing involves checking for missing records, removing missing data, imputing missing data, converting categorical variables with one-hot encoding or dummy variables, scaling data, normalizing data. The majority of code is not the focus of this analysis but rest assured, and there are close to 700 lines written for the graphing presented in this article.
The tidyverse package is the primary tool used in R for this analysis in addition to a few other R packages.
library(tidyverse)
Import the Data
The data imports into the diamonds
tibble, also specify the data type for each data column as factors (categorical), doubles (digits), and an integer.
diamonds <- read_csv("https://github.com/tidyverse/ggplot2/raw/master/data-raw/diamonds.csv", col_types = "dfffddiddd")
Data Summary
R’s summary methods are summary
from base R and glimpse
from the tidyverse. The data has 53,940 records and ten columns of data. There are three categorical data types: cut
, color
, and clarity
, while the remaining variables carat
, depth
, table
, x, y, & z
are digits and price
(integers). Within the categorical variables of cut, color
, and clarity
, there is an imbalance and noted if using machine learning algorithms in modeling the data. The min and max values of x, y, z
standout in the numeric data. As shown later, x, y, z
are measurements, and a zero measurement would not be possible. The price
variable has an $18,497 range. The median is $2401, indicating the possibility of outlier values, and the carat
range from 0.2 to 5.0, with a median of 0.7 carats. And overall, there are no missing values or NA’s in this data.
tibble::glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <fct> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <fct> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <fct> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Ideal :21551 E: 9797 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Premium :13791 I: 5422 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Good : 4906 J: 2808 SI2 : 9194 Median :61.80
## Mean :0.7979 Very Good:12082 H: 8304 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Fair : 1610 F: 9542 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 G:11292 VVS1 : 3655 Max. :79.00
## D: 6775 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Identify the Zero Values in x, y, z
The summary table above lists the x, y, and z
variables having a minimum value of 0. The minimum value of zero would not be possible. Table 1 lists out all the instances of x, y, z
== 0, and there are twenty observations total. These rows are removed from the data because the sample is low and won’t impact the analysis or modeling later. Table 2 below shows that the values are removed and correctly display the lowest x, y , z
values in the data.
xyzZero = filter(diamonds, x == 0 | y == 0 | z == 0)
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
1 | Premium | G | SI2 | 59 | 59 | 3142 | 7 | 6 | 0 |
1 | Premium | H | I1 | 58 | 59 | 3167 | 7 | 7 | 0 |
1 | Premium | G | SI2 | 63 | 59 | 3696 | 6 | 6 | 0 |
1 | Premium | F | SI2 | 59 | 58 | 3837 | 6 | 6 | 0 |
2 | Good | G | I1 | 64 | 61 | 4731 | 7 | 7 | 0 |
1 | Ideal | F | SI2 | 62 | 56 | 4954 | 0 | 7 | 0 |
1 | Very Good | H | VS2 | 63 | 53 | 5139 | 0 | 0 | 0 |
1 | Ideal | G | VS2 | 59 | 56 | 5564 | 7 | 7 | 0 |
1 | Fair | G | VS1 | 58 | 67 | 6381 | 0 | 0 | 0 |
2 | Premium | H | SI2 | 59 | 61 | 12631 | 8 | 8 | 0 |
2 | Ideal | G | VS2 | 62 | 54 | 12800 | 0 | 0 | 0 |
2 | Premium | I | SI1 | 61 | 58 | 15397 | 9 | 8 | 0 |
1 | Premium | D | VVS1 | 62 | 59 | 15686 | 0 | 0 | 0 |
2 | Premium | H | SI1 | 61 | 59 | 17265 | 8 | 8 | 0 |
2 | Premium | H | SI2 | 63 | 59 | 18034 | 0 | 0 | 0 |
2 | Premium | H | VS2 | 63 | 53 | 18207 | 8 | 8 | 0 |
3 | Good | G | SI2 | 64 | 58 | 18788 | 9 | 9 | 0 |
1 | Good | F | SI2 | 64 | 60 | 2130 | 0 | 0 | 0 |
1 | Good | F | SI2 | 64 | 60 | 2130 | 0 | 0 | 0 |
1 | Premium | G | I1 | 60 | 59 | 2383 | 7 | 7 | 0 |
Remove all x, y, z
equal to zero rows from data. A total of 20 rows will be removed.
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
Categorical Variables - (Cut, Color, Clarity)
Figure 1 depicts the four primary classifications of diamond quality and is proportional to value according to the Gemological Appraisal Industry (2018a), an independent company that does gemstone appraisals and services. Clarity
describes the type and amount of blemishing in the diamond stone. Color
is the lightness of the diamond. Cut
is the dimensions of the stone. Carat
is the weight and proportional to the size of the stone. In the data, cut
is a grading while depth
, table
, x
, y
, z
are the measurements leading to the cut proportions.
Diamonds Cut Variable
The graph below is the categorical variable cut
and the distribution of records within the dataset. Ideal
, Premium
, Very Good
, Good
, and Fair
make up the cut
category. 21,548 or 40% of the data is an Ideal
cut. 1,609 or 3% of the dataset is Fair
cut. Thus an imbalance within the cut
variable.
Diamonds Color Variable
The data contains the color
categorical variable and has seven classes: D, E, F, G, H, I, J
in order from colorless to light yellow. The G
color class makes up 11,284 or 21% of the total data, and the J
color class makes up 2,808 or 5% of the total data. There is an imbalance of the classes in color
.
Diamonds Clarity Variable
The data has a third categorical variable clarity
and contains eight classes: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1
grades in descending order of clarity in the stone. The lower grade SI1
is 13,063 or 24% of the total data, while the highest grade I1
is 738 or 1.2% of the total data. There is an imbalance in the ‘clarity’ classes.
Alluvial Plot - Color
The following section is the categorical variables plotted using the alluvial plot method suitable for categorical variables. The alluvial plot will help see the relationship between these three complex categorical variables with multiple classes per variable.
library(alluvial)
The above diagram highlights where the color
classes on the right spread out through the cut
classes and then into the clarity
classes. Yes, this is a complicated-looking pattern of lines; however, there are patterns in the E
class moving into the Ideal
class and from the Ideal
class spanning out into SI2, SI1, VS1, VS2
with fewer occurrences into VVS1, VVS2, I1, & IF
.
Alluvial Plot - Cut
The Alluvial plot above gives a focus on the cut
variable classes. The Ideal
class has green lines moving left into a clear direction into the majority of SI2, SI1, VS1, & VS2
. On the right side, Idea
moves into all the classes of color
; however, a strong move into class E
of color
variable. Patterns can be observed in orange from the Premium
class and magenta Very Good
class. Further down is the itemization of these cut classes out into the clarity
and color
classes.
Alluvial Plot - Clarity
The alluvial plot above describes the movement of clarity
through cut
and into color
. The relationships are discernable, and the relationship that all classes of clarity
and cut
span out into the color
classes.
Alluvial Plot - Fair
The next five alluvial plots make the variable relationship clearer showing the relationship of cut
and its classes to clarity
and color
. The patterns become much clearer in this format.
Alluvial Plot - Very Good
Alluvial Plot - Good
Alluvial Plot - Ideal
Correlation Matrix Analysis
Before going further with the individual numeric variables, this section will present the correlation between the numeric variables. Table 3 is the correlation matrix, and Spearman’s Rank Correlation Heat Map is below. One of the better robust measures of correlation is Spearman’s. The data shows skewness with long tails and many outliers; thus, using a robust correlation method. I did look at Pearson’s, and there were minor differences in this analysis, which may not be the case with other datasets.
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
carat | 1 | 0.03 | 0.19 | 0.96 | 1.00 | 1.00 | 0.99 |
depth | NA | 1.00 | -0.25 | 0.01 | -0.02 | -0.03 | 0.10 |
table | NA | NA | 1.00 | 0.17 | 0.20 | 0.20 | 0.16 |
price | NA | NA | NA | 1.00 | 0.96 | 0.96 | 0.96 |
x | NA | NA | NA | NA | 1.00 | 1.00 | 0.99 |
y | NA | NA | NA | NA | NA | 1.00 | 0.99 |
z | NA | NA | NA | NA | NA | NA | 1.00 |
Spearman’s Correlation Matrix
Purple shows the highest positive correlation between variables, and the lighter gold color shows no linear correlation between variables. Close to a coefficient of 0 indicates no linear relationship between variables. We consider any value below 0.3 or -0.3 as a weak correlation between variables, while a strong positive/negative linear relationship is any value 0.7 to -0.7.
The carat
variable displays a strong positive uphill relationship of 1 - 0.99 to the x
, y
, & z
variables. See Figure 5 below that demonstrates where the x, y, z
measurements are taken on the diamond stone. Larger measurements certainly mean a larger stone, equating to a higher carat weight. Note the relationship in carat
wt to the x, y, z
in Table 4 below. And as we will see, the price
and carat
are strongly correlated. Oddly the depth
shows practically no linear relationship to z
, and table
shows a weak linear relationship to x
. However, Figure 5 indicates depth
and table
derived from z
& x
.
Note the relationship in carat wt
to the x, y, z
in Table 4. The smaller sizes are smaller carats, while the larger sizes are higher carats. Dimensions are proportional to the carat weight.
carat | x | y | z | |
---|---|---|---|---|
Min. :0.2000 | Min. : 3.730 | Min. : 3.680 | Min. : 1.07 | |
1st Qu.:0.4000 | 1st Qu.: 4.710 | 1st Qu.: 4.720 | 1st Qu.: 2.91 | |
Median :0.7000 | Median : 5.700 | Median : 5.710 | Median : 3.53 | |
Mean :0.7977 | Mean : 5.732 | Mean : 5.735 | Mean : 3.54 | |
3rd Qu.:1.0400 | 3rd Qu.: 6.540 | 3rd Qu.: 6.540 | 3rd Qu.: 4.04 | |
Max. :5.0100 | Max. :10.740 | Max. :58.900 | Max. :31.80 |
Remove x, y, z Variables
The decision is to remove the x
, y
, and z
variable from the dataset.
diamonds = select(diamonds, carat, cut, color, clarity, depth, table, price)
Analysis of Numeric Variables - (Caret, Price, Depth, Table)
This next section explores the remaining numeric variables in the diamonds dataset. The variables: carat
, depth
, table
, price
are the remaining numeric variables in the dataset.
I will explore the population distribution, standard deviation, medians and quartiles, skew, outliers, and normal probability variables. There is a summary table for each variable, histogram with normal distribution curve imposed on it, showing how well the data fits a normal distribution. The normal probability plot for continuous variables determines if it from a normally distributed sample with a reference line to look for closest to a straight line. And finally, a boxplot to also describe the distribution of data. They show the whole range of the variable where the whiskers end—Left and right of the box in the 25th and 75th percentiles. The upper/lower whisker projects from the hinge 1.5 x the interquartile range (IQR). A solid black line through the box is the median, and the red circles are the outliers in the variable. The summary table summarizes the variable: min, 25%, mean, median, 75%, max, sd, and the lower and upper IQR and the number of outliers beyond the IQR* 1.5.
Carat
The range of carat
size is 0.2ct. up to 5.01ct., and the median is 0.7ct, which is a robust estimate not influenced by the extreme outliers. The histogram shows the heavily grouped size between 0.25ct. and 0.5ct. However, a median of 0.7 and a standard deviation of 0.47 puts the bulk of data between 0.23ct and 1.17ct. The standard deviation (squared deviation) is not robust to the outliers that skewed the data. However, the majority of the data points are between 0.2ct. and 2ct. The interquartile range is 0.64. The data has a positive skew to the right indicated by the data points curving from above to below the line then back above it on the Q-Q plot. From the boxplot, the 1,883 outliers skew the variable to the right.
Price
The price
histogram shows the distribution of data points with a majority in the lower price range. There is an $18,497 range in price across the data points. The median is $2,401, which is not influenced by the upper value of $18,823. We can see a skew from the median with a 75% percentile of $5,323. The $3,532 outliers affect the standard deviation above the IQR*1.5. The IQR range is $4,374. The data has a positive skew to the right, shown by the Q-Q Plot. The points curve from above the line to below the line and then back above the Q-Q Plotline. However, the upper points return closer to the line at the end. Price
is not a normal distribution. However, price
is the predictor variable at a future point.
Depth
The depth
variable displays a normal curve; however, there is a peak that increases kurtosis. There are long tails on either side of the 25% and 75% quartiles due to the 2,543 outliers. The standard deviation is 1.43 and is not too affected by outliers. There seems like an even distribution of outliers on either side of the IQR*1.5 whiskers. The Q-Q Plot has data points below the line and then at the red bar and finishing above it. The data indicates a normal distribution with a sharp peak and fatter tails relative to a normal distribution.
Table
Table
produces and normal distribution with a sharp peak and more sweeping tails due to outliers. The median is 57 and tight between the 25th and 75th percentiles. The Q-Q plot shows some kurtosis but little skew. The standard deviation is 2.23 and seems high, given the normal-looking curve. Standard deviation is affected by the outliers again and may not be the best measure. There are 604 outliers in the table
data points.
CARAT AND PRICE
We noticed a strong uphill correlation of 0.96 between the price
and carat
in the Spearman Matrix above. The plot below displays that correlated relationship to a point on the carat
scale. However, we can see a large dispersion of prices for the same carat weight, which should happen based on the other factors that grade the diamond. The anomalous drop in price occurs at 2.25ct, and the loess smoothing curve dips in price to approximately 3.25ct and then resumes its climb, but doesn’t exceed in price until after 4ct. Thus 2.25ct to 4ct sees a drop in the linear price. Also, the data becomes very disperse after 2.5ct. According to the carat
boxplot from earlier, the outliers start around 2ct. and above. On the graph, the blue line is climbing sharply and begins to stop climbing 1.75ct. as it approaches the 2.0ct mark.
This graph takes a randomized sample and plots a scatterplot of the price
and carat
to help visualize where the loess curve begins to decline or deviate from its upward growth in price/carat relationship. Around 2ct, the data starts to ease up and at 2.25ct starts to fall in price. Not sure why this is happening; however, let’s look at some more data relationships.
The price-carat
boxplot is more telling but very similar to the scatterplot in a curve over carat
weight and price. The boxplot was made by binning the carat sizes down from 5.25ct to 0.2ct in 0.5ct intervals. Grouping the carat
helped organize the plots and better visualize 1ct and 1.25ct standout, showing a skew with many outliers and spanning the price range $2000/$2500 up to the $18,000 limit. Thus, other variables determine the price
as not only the diamond’s carat weight. Note the sparseness of the data points after 3.5ct. The table is a summary of the grouping of carat
into 0.5ct intervals. The table reiterates what is in the histogram plot from earlier. The majority of diamond data points are in the 0.25ct to 1.25ct range, with the most extensive grouping 0.25ct to 0.75ct. The sparse data points after 2.25ct jump out and will consider what to do with the data pre-modeling.
The grouping and distribution plot below shows the dispersion of carat weights binned into 0.5ct groups. 0.25ct to 0.75ct hold the most observations at 29,496, which is 54.7% of the total data, followed by the next size group 0.75ct to 1.25ct, which is 29.6% of the entire data. The question I have is there enough data after 2.75ct to 5ct to determine or predict diamond prices? More than likely, not, and maybe we remove this data.
COLOR AND PRICE
The scatter plot demonstrates the carat
to price
and introduces a diamond color
grade. However, there is a trend; it’s the D, E, F
considered higher grades that are colorless, which seem to be in the smaller carat weights and span across the entire price range. The outlier carat weights above 3ct are H, I, J
, which have more color than the D, E, F
, and assumes that there are few large carat colorless diamonds, certainly in this dataset.
The color
and price
boxplot shows low to high median value left to right and the outliers in red. Again H, I, & J
all have higher prices, while D
, E
, & F
have lower prices, but their outliers reach up into the higher cost diamonds. The number in the boxes gives the distribution of color numbers in the total dataset. G
has the most observations, followed by E
and F
color grades.
The density curve demonstrates the skewed color data to the right and the distribution of data points in the data. It’s another way to see what is happening with color
and price
.
The box plot above summarized the distribution of color
, and here is each color grade that shows the skewness of each grading. Heavy skew to the right. Notice the increase of price in H, I, & J
around the $4,500 to $5,000 mark.
CLARITY AND PRICE
The plot below is for the price
, carat
, and & clarity
. This graphic is a bit better because we can see the clear bands of color representing each of the clarity
grades in proportion to price
and carat
weight. I1
is the “Included” rating and in the lowest clarity rating because it has more blemishes and irregularities compared to the IF
(internally flawless). We find the I1
clarity in the higher carat wt diamonds, and there are outliers in the higher price range. With this scatterplot, it is clear that the clarity
grades are following a curve and span a defined carat
range but extent through all prices from lowest to highest price.
The price/clarity boxplot gives more insight into what is happening. All the clarity
grades span the price range, but all of them are outliers at higher prices. SI1
, VS1
& SI2
make up the majority of the data for clarity
. The boxes are in ascending order of each median, and the lower quality clarity
grades are higher in price than the top clarity grades IF, VVS1, & VVS2
. Figure 8 does elude to higher the clarity the rarer the diamond, but in this data, IF
is the second-lowest distribution above I1
, but I1
isn’t considered rare it’s the least clear diamond, and the outliers are few however do reach the higher prices. The five clarity grades to the plots right all have a 75th percentile hovering around $5,000 in price.
The density plot of price to clarity gets confusing because of the overlap in color
and the low price. The IF
clarity is in more of the lower-priced diamonds. And so is I1
.
These histograms show the distribution of each clarity grade to price. We find a heavy skew right, and the graphs show the majority of clarity grades are in the lower cost diamonds. SI2
& I1
have a broader price range, with many costing up to $5,000. The two from the left edge are under $2,500 and are the highest clarity grade.
CUT AND PRICE
The fair cut appears to follow the larger carat sizes similar to the I1
clarity grade above, and the fair cut is found on the biggest diamonds and at the highest prices. The Ideal cut in red spans the carat range densely from 0.2ct to approximately 1.5ct then becomes more dispersed in the higher carat ranges, yet the price is higher when the carat
and Ideal
grade are together.
The boxplot explains the cut
distribution better. The Ideal
cut is the largest proportion of the dataset, followed by Premium
and Very Good
, Premium
is the grade with a 75th percentile over $5,000 in the price above Very Good
and then Fair
. All cut grades have outliers that reach the highest prices and possibly Ideal
with the most outliers extending the price range. But statistically, the Premium
grade extends the IQR*1.5 the highest in price.
The density curve is more clear to read with fewer categories. We can see this in the boxplots, and here Ideal
takes the majority of data points followed by Premium
. The Fair
grade has a narrow price range, as seen in the boxplot, with the lowest number of observations in the data.
No big surprise at this point that the data is skewed right because of outliers. The histogram plots give some more view into how the cut dispersed in the data. The properties combined with the boxplot and density curve show that most of the data reside in the under $5,000 price range and under the 2.5ct weight range.
Conclusion
It is best to look at the outliers and decide if they should be taken out of the data or stay. For regression analysis, it recommends removing outliers and normalizing the data to represent a normal distribution. Also, scaling the data would help with the accuracy of the algorithm. If we wanted to use machine learning algorithms on this data, I would remove extreme outliers, normalize/scale the data and one-hot-encode all of the categorical variables. I am leaning toward using a decision tree classifier and may post up future analysis. I would like to continue this process in a future post when time is available.
References
Gemological Appraisal Industry. (2018). Diamond Education. Retrieved from: https://gailab.org/content/diamond-education
Gemological Institute of America Inc. (2018a). Diamond Quality Factors. Retrieved from: https://www.gia.edu/diamond-quality-factor
Gemological Institute of America Inc. (2018b). Diamonds - Overview. Retrieved from: https://www.gia.edu/diamond
Gemological Institute of America Inc. (2018c). Diamond Cut: The wow factor. Retrieved from: https://www.gia.edu/diamond-cut/diamond-cut-basic-overview
STHDA, Statistical tools for high-throughput data analysis. (2018). ggplot2: Quick correlation matrix heatmap - R software and data visualization. Retrieved from: (http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
Tidyverse/ggplot2. (2018). diamonds.R. [Data file]. Retrieved from: https://github.com/tidyverse/ggplot2/tree/master/data-raw
Tremonti Fine Gems & Jewellery. (2012). Buying diamonds safely - Why we won’t let you get caught out. Retrieved from: http://tremontijewellery.blogspot.com/2012/07/buying-diamonds-safely-why-we-wont-let.html
Wickham, H.(2016).ggplot2 Elegant Graphics for Data Analysis. 2nd ed, pg. 65 Springer. doi:10.1007/978-3-319-24277-4