Using the dplyr Package in R for Simple Data Analysis

Bryan

2018/07/04

This article will briefly describe the dplyr package in R and the simple functions to analyze data with R for data analytics. The dplyr package in R is used for data manipulation and transformation because, typically, the data is not in a usable form. The package is useful for exploratory data analysis and a method of statistical analysis using graphical techniques. Some of the benefits are:

The dplyr package is used by many data scientists and analysts to manipulate data. All the functions in dplyr have alternate methods in the base version of R and the other myriad of packages that do similar functions. However, R’s code base may be slower in processing time for the modern data problems. The dplyr documentation says dplyr is the grammar of data manipulation. The plyr package was the old version, and the “d” represented data frames and added functionality. I further learned that the “Tibble” package is also used in conjunction because it is a simplified data frame structure for subsetting and printing. The dplyr package provides fast performance with large data sets, easy to use function names, with different data frames, tables, database formats.

Below R code shows rename(), select(), and filter(), group_by(), and summarize() functions. I brought this data into a data frame and noticed there are no descriptive column headings. I want to give a descriptive name to 5 variables (rename) and then reorder them in a new data frame (select), then I would like to see all the grade variables that are the number four (filter). I would like to see the break down of the size variable and use (group_by) and then (summarize) to get a full count of the size occurrences.

Load the data into a data frame using readr which is part of the tidyverse collection including dplyr. The default output shows the column variable names and the data type.

read_csv()

df = readr::read_csv("../../static/data/df.csv")
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   V1 = col_double(),
##   V2 = col_double(),
##   V3 = col_double(),
##   V4 = col_double(),
##   V5 = col_double(),
##   V6 = col_double(),
##   V7 = col_double(),
##   V8 = col_double(),
##   V9 = col_double(),
##   V10 = col_double(),
##   V11 = col_double(),
##   V12 = col_character(),
##   V13 = col_character(),
##   V14 = col_double()
## )


This is produced using kable() from the knitr package in R. Displayed are the first 10 rows of data.

Table 1: The raw data
X1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
7 62 0 4 140 268 0 2 160 0 3.6 3 2.0 3.0 3
8 57 0 4 120 354 0 0 163 1 0.6 1 0.0 3.0 0
9 63 1 4 130 254 0 2 147 0 1.4 2 1.0 7.0 2
10 53 1 4 140 203 1 2 155 1 3.1 3 0.0 7.0 1

The names() function will display all the variable names in the data.

names(df)
##  [1] "X1"  "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
## [13] "V12" "V13" "V14"

rename()

For this exercise I will only rename() five variables for demonstration.

df_names = rename(df, color=V1, round=V2, length=V3, size=V12, grade=V14)
Table 2: The rename five variables
X1 color round length V4 V5 V6 V7 V8 V9 V10 V11 size V13 grade
1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
7 62 0 4 140 268 0 2 160 0 3.6 3 2.0 3.0 3
8 57 0 4 120 354 0 0 163 1 0.6 1 0.0 3.0 0
9 63 1 4 130 254 0 2 147 0 1.4 2 1.0 7.0 2
10 53 1 4 140 203 1 2 155 1 3.1 3 0.0 7.0 1

select()

Narrow the data to the five variables that I am interested in analyzing.

df_select = select(df_names, grade, size, color, length, round)
Table 3: The selected variables
grade size color length round
0 0.0 63 1 1
2 3.0 67 4 1
1 2.0 67 4 1
0 0.0 37 3 1
0 0.0 41 2 0
0 0.0 56 2 1
3 2.0 62 4 0
0 0.0 57 4 0
2 1.0 63 4 1
1 0.0 53 4 1

filter()

Refine the variables to only display the grade variable when it is equal to four.

df_filter = filter(df_select, grade == 4)
Table 4: Filter out all grade = 4 records
grade size color length round
4 2.0 60 4 1
4 3.0 65 4 0
4 1.0 58 3 1
4 0.0 50 4 1
4 3.0 63 4 0
4 0.0 70 4 1
4 3.0 57 4 1
4 1.0 55 4 1
4 3.0 77 4 1
4 2.0 64 4 1

group_by() and summarize()

Further refine and summarize the data. This returns a count of each size.

by_size = group_by(df_filter, size) %>% summarize(count = n())
Table 5: The size variable
size count
0.0 3
1.0 3
2.0 2
3.0 5


Conclusion

This article covered the dplyr package from the tidyverse in R. To set up the data first, the readr() package loaded the dataset into a working data frame or “tibble” Each time a function runs, the output is in a labeled table. The table comes from knitr and the kable() function. Table 1 described the dataset and showed the first ten records. The name() function lists the variable or column names as a starting point. For the exercise, I was interested in five variables: V1, V2, V3, V12, & V14. I put them in a new data frame df_names and renamed them descriptively as: “color”, “round”, “length”, “size”, and “grade” with the rename function. The result is in table 2. I used the select() function to only show the five variables of interest in table 3. The further refine the data, the filter() function helped to narrow down all the “grade” variables that are number four and shown in table 4. To go a bit further in the data the group_by function insolated the “size” variable and summarize() by count = n() to output on the four categories of “size” and the number of instances in the data, shown in table 5. The functions used from the dplyr package very useful when exploring the data. See the documentation for the other functions arrange() and merge() not covered for this example.

References

Peng, R. D. (2015). Exploratory Data Analysis with R. Retrieved from: https://leanpub.com/exdata

RDocumentation. (2018). Introduction to dplyr. Retrieved from: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. p. 43-75, O’Reilly.