This article will briefly describe the dplyr package in R and the simple functions to analyze data with R for data analytics. The dplyr package in R is used for data manipulation and transformation because, typically, the data is not in a usable form. The package is useful for exploratory data analysis and a method of statistical analysis using graphical techniques. Some of the benefits are:
- get insights from the data intuitively
- find hidden trends
- find outliers and anomalous data
- test assumptions
- help develop models that the data will fit
The dplyr package is used by many data scientists and analysts to manipulate data. All the functions in dplyr have alternate methods in the base version of R and the other myriad of packages that do similar functions. However, R’s code base may be slower in processing time for the modern data problems. The dplyr documentation says dplyr is the grammar of data manipulation. The plyr package was the old version, and the “d” represented data frames and added functionality. I further learned that the “Tibble” package is also used in conjunction because it is a simplified data frame structure for subsetting and printing. The dplyr package provides fast performance with large data sets, easy to use function names, with different data frames, tables, database formats.
Below R code shows rename()
, select()
, and filter()
, group_by()
, and summarize()
functions. I brought this data into a data frame and noticed there are no descriptive column headings. I want to give a descriptive name to 5 variables (rename) and then reorder them in a new data frame (select), then I would like to see all the grade variables that are the number four (filter). I would like to see the break down of the size variable and use (group_by) and then (summarize) to get a full count of the size occurrences.
Load the data into a data frame using readr which is part of the tidyverse collection including dplyr. The default output shows the column variable names and the data type.
read_csv()
df = readr::read_csv("../../static/data/df.csv")
## Parsed with column specification:
## cols(
## X1 = col_double(),
## V1 = col_double(),
## V2 = col_double(),
## V3 = col_double(),
## V4 = col_double(),
## V5 = col_double(),
## V6 = col_double(),
## V7 = col_double(),
## V8 = col_double(),
## V9 = col_double(),
## V10 = col_double(),
## V11 = col_double(),
## V12 = col_character(),
## V13 = col_character(),
## V14 = col_double()
## )
This is produced using kable()
from the knitr package in R. Displayed are the first 10 rows of data.
X1 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
2 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
3 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
4 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
5 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
6 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
7 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
8 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
9 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
10 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |
The names()
function will display all the variable names in the data.
names(df)
## [1] "X1" "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11"
## [13] "V12" "V13" "V14"
rename()
For this exercise I will only rename()
five variables for demonstration.
df_names = rename(df, color=V1, round=V2, length=V3, size=V12, grade=V14)
X1 | color | round | length | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | size | V13 | grade |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
2 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
3 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
4 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
5 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
6 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
7 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
8 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
9 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
10 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |
select()
Narrow the data to the five variables that I am interested in analyzing.
df_select = select(df_names, grade, size, color, length, round)
grade | size | color | length | round |
---|---|---|---|---|
0 | 0.0 | 63 | 1 | 1 |
2 | 3.0 | 67 | 4 | 1 |
1 | 2.0 | 67 | 4 | 1 |
0 | 0.0 | 37 | 3 | 1 |
0 | 0.0 | 41 | 2 | 0 |
0 | 0.0 | 56 | 2 | 1 |
3 | 2.0 | 62 | 4 | 0 |
0 | 0.0 | 57 | 4 | 0 |
2 | 1.0 | 63 | 4 | 1 |
1 | 0.0 | 53 | 4 | 1 |
filter()
Refine the variables to only display the grade variable when it is equal to four.
df_filter = filter(df_select, grade == 4)
grade | size | color | length | round |
---|---|---|---|---|
4 | 2.0 | 60 | 4 | 1 |
4 | 3.0 | 65 | 4 | 0 |
4 | 1.0 | 58 | 3 | 1 |
4 | 0.0 | 50 | 4 | 1 |
4 | 3.0 | 63 | 4 | 0 |
4 | 0.0 | 70 | 4 | 1 |
4 | 3.0 | 57 | 4 | 1 |
4 | 1.0 | 55 | 4 | 1 |
4 | 3.0 | 77 | 4 | 1 |
4 | 2.0 | 64 | 4 | 1 |
group_by() and summarize()
Further refine and summarize the data. This returns a count of each size.
by_size = group_by(df_filter, size) %>% summarize(count = n())
size | count |
---|---|
0.0 | 3 |
1.0 | 3 |
2.0 | 2 |
3.0 | 5 |
Conclusion
This article covered the dplyr package from the tidyverse in R. To set up the data first, the readr() package loaded the dataset into a working data frame or “tibble” Each time a function runs, the output is in a labeled table. The table comes from knitr and the kable()
function. Table 1 described the dataset and showed the first ten records. The name()
function lists the variable or column names as a starting point. For the exercise, I was interested in five variables: V1, V2, V3, V12, & V14. I put them in a new data frame df_names and renamed them descriptively as: “color”, “round”, “length”, “size”, and “grade” with the rename
function. The result is in table 2. I used the select()
function to only show the five variables of interest in table 3. The further refine the data, the filter()
function helped to narrow down all the “grade” variables that are number four and shown in table 4. To go a bit further in the data the group_by
function insolated the “size” variable and summarize()
by count = n()
to output on the four categories of “size” and the number of instances in the data, shown in table 5. The functions used from the dplyr package very useful when exploring the data. See the documentation for the other functions arrange()
and merge()
not covered for this example.
References
Peng, R. D. (2015). Exploratory Data Analysis with R. Retrieved from: https://leanpub.com/exdata
RDocumentation. (2018). Introduction to dplyr. Retrieved from: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. p. 43-75, O’Reilly.