Using the dplyr Package in R for Simple Data Analysis

This article will briefly describe the dplyr package in R and the simple functions to analyze data with R for data analytics. The dplyr package in R is used for data manipulation and transformation because, typically, the data is not in a usable form. The package is useful for exploratory data analysis and a method of statistical analysis using graphical techniques. Some of the benefits are:

get insights from the data intuitively
find hidden trends
find outliers and anomalous data
test assumptions
help develop models that the data will fit

The dplyr package is used by many data scientists and analysts to manipulate data. All the functions in dplyr have alternate methods in the base version of R and the other myriad of packages that do similar functions. However, R’s code base may be slower in processing time for the modern data problems. The dplyr documentation says dplyr is the grammar of data manipulation. The plyr package was the old version, and the “d” represented data frames and added functionality. I further learned that the “Tibble” package is also used in conjunction because it is a simplified data frame structure for subsetting and printing. The dplyr package provides fast performance with large data sets, easy to use function names, with different data frames, tables, database formats.

Below R code shows rename(), select(), and filter(), group_by(), and summarize() functions. I brought this data into a data frame and noticed there are no descriptive column headings. I want to give a descriptive name to 5 variables (rename) and then reorder them in a new data frame (select), then I would like to see all the grade variables that are the number four (filter). I would like to see the break down of the size variable and use (group_by) and then (summarize) to get a full count of the size occurrences.

Load the data into a data frame using readr which is part of the tidyverse collection including dplyr. The default output shows the column variable names and the data type.

read_csv()

df = readr::read_csv("../../static/data/df.csv")

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   V1 = col_double(),
##   V2 = col_double(),
##   V3 = col_double(),
##   V4 = col_double(),
##   V5 = col_double(),
##   V6 = col_double(),
##   V7 = col_double(),
##   V8 = col_double(),
##   V9 = col_double(),
##   V10 = col_double(),
##   V11 = col_double(),
##   V12 = col_character(),
##   V13 = col_character(),
##   V14 = col_double()
## )

This is produced using kable() from the knitr package in R. Displayed are the first 10 rows of data.

Table 1: The raw data
X1	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14
1	63	1	1	145	233	1	2	150	0	2.3	3	0.0	6.0	0
2	67	1	4	160	286	0	2	108	1	1.5	2	3.0	3.0	2
3	67	1	4	120	229	0	2	129	1	2.6	2	2.0	7.0	1
4	37	1	3	130	250	0	0	187	0	3.5	3	0.0	3.0	0
5	41	0	2	130	204	0	2	172	0	1.4	1	0.0	3.0	0
6	56	1	2	120	236	0	0	178	0	0.8	1	0.0	3.0	0
7	62	0	4	140	268	0	2	160	0	3.6	3	2.0	3.0	3
8	57	0	4	120	354	0	0	163	1	0.6	1	0.0	3.0	0
9	63	1	4	130	254	0	2	147	0	1.4	2	1.0	7.0	2
10	53	1	4	140	203	1	2	155	1	3.1	3	0.0	7.0	1

The names() function will display all the variable names in the data.

names(df)

##  [1] "X1"  "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
## [13] "V12" "V13" "V14"

rename()

For this exercise I will only rename() five variables for demonstration.

df_names = rename(df, color=V1, round=V2, length=V3, size=V12, grade=V14)

Table 2: The rename five variables
X1	color	round	length	V4	V5	V6	V7	V8	V9	V10	V11	size	V13	grade
1	63	1	1	145	233	1	2	150	0	2.3	3	0.0	6.0	0
2	67	1	4	160	286	0	2	108	1	1.5	2	3.0	3.0	2
3	67	1	4	120	229	0	2	129	1	2.6	2	2.0	7.0	1
4	37	1	3	130	250	0	0	187	0	3.5	3	0.0	3.0	0
5	41	0	2	130	204	0	2	172	0	1.4	1	0.0	3.0	0
6	56	1	2	120	236	0	0	178	0	0.8	1	0.0	3.0	0
7	62	0	4	140	268	0	2	160	0	3.6	3	2.0	3.0	3
8	57	0	4	120	354	0	0	163	1	0.6	1	0.0	3.0	0
9	63	1	4	130	254	0	2	147	0	1.4	2	1.0	7.0	2
10	53	1	4	140	203	1	2	155	1	3.1	3	0.0	7.0	1

select()

Narrow the data to the five variables that I am interested in analyzing.

df_select = select(df_names, grade, size, color, length, round)

Table 3: The selected variables
grade	size	color	length	round
0	0.0	63	1	1
2	3.0	67	4	1
1	2.0	67	4	1
0	0.0	37	3	1
0	0.0	41	2	0
0	0.0	56	2	1
3	2.0	62	4	0
0	0.0	57	4	0
2	1.0	63	4	1
1	0.0	53	4	1

filter()

Refine the variables to only display the grade variable when it is equal to four.

df_filter = filter(df_select, grade == 4)

Table 4: Filter out all grade = 4 records
grade	size	color	length	round
4	2.0	60	4	1
4	3.0	65	4	0
4	1.0	58	3	1
4	0.0	50	4	1
4	3.0	63	4	0
4	0.0	70	4	1
4	3.0	57	4	1
4	1.0	55	4	1
4	3.0	77	4	1
4	2.0	64	4	1

group_by() and summarize()

Further refine and summarize the data. This returns a count of each size.

by_size = group_by(df_filter, size) %>% summarize(count = n())

Table 5: The size variable
size	count
0.0	3
1.0	3
2.0	2
3.0	5

Conclusion

This article covered the dplyr package from the tidyverse in R. To set up the data first, the readr() package loaded the dataset into a working data frame or “tibble” Each time a function runs, the output is in a labeled table. The table comes from knitr and the kable() function. Table 1 described the dataset and showed the first ten records. The name() function lists the variable or column names as a starting point. For the exercise, I was interested in five variables: V1, V2, V3, V12, & V14. I put them in a new data frame df_names and renamed them descriptively as: “color”, “round”, “length”, “size”, and “grade” with the rename function. The result is in table 2. I used the select() function to only show the five variables of interest in table 3. The further refine the data, the filter() function helped to narrow down all the “grade” variables that are number four and shown in table 4. To go a bit further in the data the group_by function insolated the “size” variable and summarize() by count = n() to output on the four categories of “size” and the number of instances in the data, shown in table 5. The functions used from the dplyr package very useful when exploring the data. See the documentation for the other functions arrange() and merge() not covered for this example.

References

Peng, R. D. (2015). Exploratory Data Analysis with R. Retrieved from: https://leanpub.com/exdata

RDocumentation. (2018). Introduction to dplyr. Retrieved from: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. p. 43-75, O’Reilly.