Introduction
This exploratory data analysis uses the “hourly_44201_2014” data set obtained from the EPA website: https://aqs.epa.gov/aqsweb/airdata/download_files.html#Raw The data falls under “Criteria Gases” and is labeled “Ozone (44201)”. Loaded into RStudio, it contains 9,060,694 rows and 24 variables.
There are 24 variables in this data set. Looking through them, the first four are coded, presumably by the EPA. Latitude, Longitude, and Datum are geo-location data. NAD83 is the North American Datum coordinate system, while WGS84 is a world coordinate system; its presence suggests that some measurements were taken in U.S. regions or territories outside the contiguous states. The “Date_Local” variable shows 365 unique dates, covering every day of the year. The “Time_Local” variable has 24 unique values, covering each hour of the day. The ozone readings are stored in “Sample_Measurement”, followed by more coded variables. The last variables include “State_Name” and “County_Name”. The 53 unique names in “State_Name” stand out; this likely coincides with the WGS84 coordinate system, covering territories beyond the 48 contiguous states. In summary, the data set provides essential geographic information, date/time information, and an ozone measurement. For this analysis, we can disregard the EPA's internal coding.
Load the Tidyverse Library
library(tidyverse)
Importing the Data into R Studio
A note about loading the data: this analysis uses R and the Tidyverse package. At this writing, the Tidyverse cannot download and open a .zip file directly from a URL. Thus a temp file is created, the .zip data file is downloaded into it, and the data is then read from this temp location into a tibble (data frame). In this coding step, I identify each variable's data type per the EPA specifications describing this data. In other data sets pulled from the web, the data types may not be known in advance and may need to be identified after performing summary statistics. The last line of code keeps with acceptable coding standards and formats the variable names to work with R by replacing white space with underscores.
# place data into frame
url = "https://aqs.epa.gov/aqsweb/airdata/hourly_44201_2014.zip"
# create temp file for the zip file
zip_file = tempfile(fileext = ".zip")
# download the zip file into the temp location
download.file(url, zip_file, method = 'libcurl', mode = "wb")
# read data into a tibble and specify data type for each variable, see EPA data source
ozn = readr::read_csv(zip_file, col_types = "iiiiinnccDtDtncnlccicccc")
# remove spaces in variables and replace with underscore
names(ozn) = gsub("\\s", "_", names(ozn))
unlink(zip_file)
Viewing the Dataset
Showing the variable names using the names() function.
names(ozn)
## [1] "State_Code" "County_Code" "Site_Num"
## [4] "Parameter_Code" "POC" "Latitude"
## [7] "Longitude" "Datum" "Parameter_Name"
## [10] "Date_Local" "Time_Local" "Date_GMT"
## [13] "Time_GMT" "Sample_Measurement" "Units_of_Measure"
## [16] "MDL" "Uncertainty" "Qualifier"
## [19] "Method_Type" "Method_Code" "Method_Name"
## [22] "State_Name" "County_Name" "Date_of_Last_Change"
Using glimpse() quickly displays the data when in a tibble format. The column type abbreviations are:
- chr: character
- dbl: double (real number)
- int: integer
- date: calendar date
- time: time of day
- lgl: logical
- fctr: factor (categorical)
tibble::glimpse(ozn)
## Rows: 9,060,694
## Columns: 24
## $ State_Code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ County_Code <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ Site_Num <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,...
## $ Parameter_Code <int> 44201, 44201, 44201, 44201, 44201, 44201, 44201...
## $ POC <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Latitude <dbl> 30.49748, 30.49748, 30.49748, 30.49748, 30.4974...
## $ Longitude <dbl> -87.88026, -87.88026, -87.88026, -87.88026, -87...
## $ Datum <chr> "NAD83", "NAD83", "NAD83", "NAD83", "NAD83", "N...
## $ Parameter_Name <chr> "Ozone", "Ozone", "Ozone", "Ozone", "Ozone", "O...
## $ Date_Local <date> 2014-03-01, 2014-03-01, 2014-03-01, 2014-03-01...
## $ Time_Local <time> 01:00:00, 02:00:00, 03:00:00, 04:00:00, 05:00:...
## $ Date_GMT <date> 2014-03-01, 2014-03-01, 2014-03-01, 2014-03-01...
## $ Time_GMT <time> 07:00:00, 08:00:00, 09:00:00, 10:00:00, 11:00:...
## $ Sample_Measurement <dbl> 0.047, 0.047, 0.043, 0.038, 0.035, 0.035, 0.034...
## $ Units_of_Measure <chr> "Parts per million", "Parts per million", "Part...
## $ MDL <dbl> 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005...
## $ Uncertainty <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Qualifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Method_Type <chr> "FEM", "FEM", "FEM", "FEM", "FEM", "FEM", "FEM"...
## $ Method_Code <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,...
## $ Method_Name <chr> "INSTRUMENTAL - ULTRA VIOLET", "INSTRUMENTAL - ...
## $ State_Name <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Al...
## $ County_Name <chr> "Baldwin", "Baldwin", "Baldwin", "Baldwin", "Ba...
## $ Date_of_Last_Change <chr> "2014-06-30", "2014-06-30", "2014-06-30", "2014...
State_Code | County_Code | Site_Num | Parameter_Code | POC | Latitude | Longitude | Datum | Parameter_Name | Date_Local | Time_Local | Date_GMT | Time_GMT | Sample_Measurement | Units_of_Measure | MDL | Uncertainty | Qualifier | Method_Type | Method_Code | Method_Name | State_Name | County_Name | Date_of_Last_Change |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 01:00:00 | 2014-03-01 | 07:00:00 | 0.047 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 02:00:00 | 2014-03-01 | 08:00:00 | 0.047 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 03:00:00 | 2014-03-01 | 09:00:00 | 0.043 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 04:00:00 | 2014-03-01 | 10:00:00 | 0.038 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 05:00:00 | 2014-03-01 | 11:00:00 | 0.035 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 06:00:00 | 2014-03-01 | 12:00:00 | 0.035 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 07:00:00 | 2014-03-01 | 13:00:00 | 0.034 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 08:00:00 | 2014-03-01 | 14:00:00 | 0.037 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 09:00:00 | 2014-03-01 | 15:00:00 | 0.044 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 10:00:00 | 2014-03-01 | 16:00:00 | 0.046 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
Data Summary Function
The summary() function displays the data format and shows a nice breakdown of the data. The “Site_Num” variable has a maximum value of 9,997; since site numbers are identifiers, this is not necessarily a count of monitoring sites. Because I specified the data types, the date/time variables imported precisely into the tibble. “Sample_Measurement” shows no missing data (no NA's), with a minimum of -0.005, a maximum of 0.213, and a mean of about 0.03027. “Method_Name” describes the ultraviolet method of measurement with a chemiluminescence API model. The “State_Name” and “County_Name” variables list familiar names. The data is reasonably clean, and looking ahead, we could drop most of the variables in favor of a few vital ones (ozone, county, state, time, date, latitude, longitude).
summary(ozn)
## State_Code County_Code Site_Num Parameter_Code POC
## Min. : 1.00 Min. : 1.00 Min. : 1 Min. :44201 Min. :1.00
## 1st Qu.: 8.00 1st Qu.: 23.00 1st Qu.: 6 1st Qu.:44201 1st Qu.:1.00
## Median :25.00 Median : 61.00 Median : 27 Median :44201 Median :1.00
## Mean :26.52 Mean : 76.71 Mean :1362 Mean :44201 Mean :1.07
## 3rd Qu.:42.00 3rd Qu.:103.00 3rd Qu.:1016 3rd Qu.:44201 3rd Qu.:1.00
## Max. :80.00 Max. :800.00 Max. :9997 Max. :44201 Max. :6.00
## Latitude Longitude Datum Parameter_Name
## Min. :18.18 Min. :-158.09 Length:9060694 Length:9060694
## 1st Qu.:33.92 1st Qu.:-109.91 Class :character Class :character
## Median :38.01 Median : -92.01 Mode :character Mode :character
## Mean :37.44 Mean : -95.35
## 3rd Qu.:40.92 3rd Qu.: -81.97
## Max. :64.85 Max. : -65.92
## Date_Local Time_Local Date_GMT Time_GMT
## Min. :2014-01-01 Length:9060694 Min. :2014-01-01 Length:9060694
## 1st Qu.:2014-04-20 Class1:hms 1st Qu.:2014-04-20 Class1:hms
## Median :2014-07-05 Class2:difftime Median :2014-07-05 Class2:difftime
## Mean :2014-07-04 Mode :numeric Mean :2014-07-04 Mode :numeric
## 3rd Qu.:2014-09-19 3rd Qu.:2014-09-20
## Max. :2014-12-31 Max. :2015-01-01
## Sample_Measurement Units_of_Measure MDL Uncertainty
## Min. :-0.00500 Length:9060694 Min. :0.0006 Mode:logical
## 1st Qu.: 0.01900 Class :character 1st Qu.:0.0050 NA's:9060694
## Median : 0.03000 Mode :character Median :0.0050
## Mean : 0.03027 Mean :0.1361
## 3rd Qu.: 0.04100 3rd Qu.:0.0050
## Max. : 0.21300 Max. :8.0000
## Qualifier Method_Type Method_Code Method_Name
## Length:9060694 Length:9060694 Min. : 19.00 Length:9060694
## Class :character Class :character 1st Qu.: 47.00 Class :character
## Mode :character Mode :character Median : 87.00 Mode :character
## Mean : 71.08
## 3rd Qu.: 87.00
## Max. :901.00
## State_Name County_Name Date_of_Last_Change
## Length:9060694 Length:9060694 Length:9060694
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
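The negative minimum for “Sample_Measurement” stands out, since an ozone concentration cannot truly be below zero. As a quick sanity check, we can count the sub-zero readings (a sketch, assuming the ozn tibble loaded above; the column name n_negative is my own):

```r
library(dplyr)

# Count readings below zero; these are physically impossible and likely
# reflect instrument noise near the method detection limit.
ozn %>%
  filter(Sample_Measurement < 0) %>%
  summarize(n_negative = n())
```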
Below, the skimr library function provides an excellent summary of the data, particularly the histograms at the bottom, which show the distribution of each numeric variable.
skimr::skim(ozn)
Name | ozn |
Number of rows | 9060694 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
character | 9 |
Date | 2 |
difftime | 2 |
logical | 1 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Datum | 0 | 1.00 | 5 | 5 | 0 | 2 | 0 |
Parameter_Name | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Units_of_Measure | 0 | 1.00 | 17 | 17 | 0 | 1 | 0 |
Qualifier | 8937542 | 0.01 | 1 | 2 | 0 | 19 | 0 |
Method_Type | 0 | 1.00 | 3 | 7 | 0 | 2 | 0 |
Method_Name | 0 | 1.00 | 27 | 63 | 0 | 8 | 0 |
State_Name | 0 | 1.00 | 4 | 20 | 0 | 53 | 0 |
County_Name | 0 | 1.00 | 3 | 20 | 0 | 631 | 0 |
Date_of_Last_Change | 0 | 1.00 | 10 | 10 | 0 | 406 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date_Local | 0 | 1 | 2014-01-01 | 2014-12-31 | 2014-07-05 | 365 |
Date_GMT | 0 | 1 | 2014-01-01 | 2015-01-01 | 2014-07-05 | 366 |
Variable type: difftime
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Time_Local | 0 | 1 | 0 secs | 82800 secs | 43200 secs | 24 |
Time_GMT | 0 | 1 | 0 secs | 82800 secs | 43200 secs | 24 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Uncertainty | 9060694 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
State_Code | 0 | 1 | 26.52 | 16.99 | 1.00 | 8.00 | 25.00 | 42.00 | 80.00 | ▇▆▇▂▁ |
County_Code | 0 | 1 | 76.71 | 79.70 | 1.00 | 23.00 | 61.00 | 103.00 | 800.00 | ▇▁▁▁▁ |
Site_Num | 0 | 1 | 1362.11 | 2672.34 | 1.00 | 6.00 | 27.00 | 1016.00 | 9997.00 | ▇▁▁▁▁ |
Parameter_Code | 0 | 1 | 44201.00 | 0.00 | 44201.00 | 44201.00 | 44201.00 | 44201.00 | 44201.00 | ▁▁▇▁▁ |
POC | 0 | 1 | 1.07 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 | 6.00 | ▇▁▁▁▁ |
Latitude | 0 | 1 | 37.44 | 5.05 | 18.18 | 33.92 | 38.01 | 40.92 | 64.85 | ▁▆▇▁▁ |
Longitude | 0 | 1 | -95.35 | 16.32 | -158.09 | -109.91 | -92.01 | -81.97 | -65.92 | ▁▂▇▇▇ |
Sample_Measurement | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.02 | 0.03 | 0.04 | 0.21 | ▇▃▁▁▁ |
MDL | 0 | 1 | 0.14 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 8.00 | ▇▁▁▁▁ |
Method_Code | 0 | 1 | 71.08 | 43.25 | 19.00 | 47.00 | 87.00 | 87.00 | 901.00 | ▇▁▁▁▁ |
These next lines are ways to narrow in on the data metrics. First, we can count how many distinct Latitude and Longitude values each variable contains: 1,305 unique latitudes and 1,306 unique longitudes.
ozn %>%
summarize(Total_Lat = n_distinct(Latitude))
## # A tibble: 1 x 1
## Total_Lat
## <int>
## 1 1305
ozn %>%
summarize(Total_Long = n_distinct(Longitude))
## # A tibble: 1 x 1
## Total_Long
## <int>
## 1 1306
Next, I break down the counties by latitude and longitude, based on the variable’s 1,306 unique longitude values. These coordinate pairs approximate where the ozone monitors are positioned.
ozn %>%
group_by(County_Name, Latitude, Longitude) %>%
summarize(Lat = n_distinct(Longitude))
## # A tibble: 1,306 x 4
## # Groups: County_Name, Latitude [1,305]
## County_Name Latitude Longitude Lat
## <chr> <dbl> <dbl> <int>
## 1 Abbeville 34.3 -82.4 1
## 2 Ada 43.6 -116. 1
## 3 Ada 43.6 -116. 1
## 4 Adair 35.8 -94.7 1
## 5 Adams 39.8 -105. 1
## 6 Adams 39.9 -91.3 1
## 7 Adams 39.9 -77.3 1
## 8 Adams 39.9 -77.3 1
## 9 Aiken 33.3 -81.8 1
## 10 Alachua 29.5 -82.3 1
## # ... with 1,296 more rows
If I want to view the first ten rows of Latitude and Longitude and render them as a table, I can specify that in R.
Latitude | Longitude |
---|---|
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
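A table like the one above can be produced with a short pipeline (a sketch, assuming the ozn tibble from earlier and the knitr package for rendering):

```r
library(dplyr)

# Keep only the coordinate columns, take the first ten rows, and render
# them as a table.
ozn %>%
  select(Latitude, Longitude) %>%
  head(10) %>%
  knitr::kable()
```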
Taking the concept further, I can create a table showing the State, County, and Sample Measurement, only displaying the first ten rows of data.
State_Name | County_Name | Sample_Measurement |
---|---|---|
Alabama | Baldwin | 0.047 |
Alabama | Baldwin | 0.047 |
Alabama | Baldwin | 0.043 |
Alabama | Baldwin | 0.038 |
Alabama | Baldwin | 0.035 |
Alabama | Baldwin | 0.035 |
Alabama | Baldwin | 0.034 |
Alabama | Baldwin | 0.037 |
Alabama | Baldwin | 0.044 |
Alabama | Baldwin | 0.046 |
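The same pattern extends to any selection of columns (again a sketch, assuming ozn and knitr are available):

```r
library(dplyr)

# State, county, and ozone reading for the first ten rows, rendered as
# a table.
ozn %>%
  select(State_Name, County_Name, Sample_Measurement) %>%
  head(10) %>%
  knitr::kable()
```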
A more detailed look at the State, County, count of measurements, and mean ozone reading is shown in this table.
summarize(group_by(ozn,
State_Name,
County_Name),
count = n(),
meanOzone = mean(Sample_Measurement))
## # A tibble: 797 x 4
## # Groups: State_Name [53]
## State_Name County_Name count meanOzone
## <chr> <chr> <int> <dbl>
## 1 Alabama Baldwin 5492 0.0293
## 2 Alabama Colbert 5613 0.0249
## 3 Alabama DeKalb 8232 0.0320
## 4 Alabama Elmore 5585 0.0292
## 5 Alabama Etowah 5578 0.0261
## 6 Alabama Houston 5562 0.0272
## 7 Alabama Jefferson 41443 0.0262
## 8 Alabama Madison 11678 0.0300
## 9 Alabama Mobile 10874 0.0259
## 10 Alabama Montgomery 5560 0.0282
## # ... with 787 more rows
Here we can summarize the States and Counties with the lowest mean ozone readings. Puerto Rico and Alaska, along with Oklahoma, Oregon, Washington, etc., have the lowest mean ozone readings in this data set. The arrange() function sorts in ascending order by default.
ozn %>%
group_by(State_Name, County_Name) %>%
summarize(count = n(), MeanO = mean(Sample_Measurement)) %>%
arrange(MeanO)
## # A tibble: 797 x 4
## # Groups: State_Name [53]
## State_Name County_Name count MeanO
## <chr> <chr> <int> <dbl>
## 1 Puerto Rico Catano 8038 0.00449
## 2 Puerto Rico Bayamon 2794 0.00852
## 3 Alaska Fairbanks North Star 8575 0.0133
## 4 Puerto Rico Juncos 8180 0.0135
## 5 Oklahoma Caddo 7507 0.0175
## 6 Oregon Washington 9215 0.0196
## 7 Washington Whatcom 3098 0.0201
## 8 Louisiana West Baton Rouge 8635 0.0201
## 9 New Jersey Warren 8620 0.0201
## 10 Georgia DeKalb 8594 0.0205
## # ... with 787 more rows
This table displays the highest Mean Ozone readings per State and County in this data set. Colorado, California, and Wyoming are at the top, with Utah and Nevada.
ozn %>%
group_by(State_Name, County_Name) %>%
summarize(count = n(), MeanO = mean(Sample_Measurement)) %>%
arrange(desc(MeanO))
## # A tibble: 797 x 4
## # Groups: State_Name [53]
## State_Name County_Name count MeanO
## <chr> <chr> <int> <dbl>
## 1 Colorado Clear Creek 6447 0.0502
## 2 California Mariposa 12130 0.0485
## 3 California Nevada 12748 0.0482
## 4 Colorado Park 8127 0.0480
## 5 Wyoming Albany 8056 0.0474
## 6 Colorado Chaffee 3748 0.0473
## 7 California Inyo 8667 0.0447
## 8 Utah San Juan 10794 0.0444
## 9 California El Dorado 16521 0.0436
## 10 Nevada Elko 3431 0.0435
## # ... with 787 more rows
This summary table shows the number of readings taken per day for the entire year across all states. The summary indicates that not every monitoring station records consistently.
ozn %>%
group_by(Date_Local) %>%
summarize(readings_day = n())
## # A tibble: 365 x 2
## Date_Local readings_day
## <date> <int>
## 1 2014-01-01 18115
## 2 2014-01-02 17901
## 3 2014-01-03 17992
## 4 2014-01-04 17929
## 5 2014-01-05 17769
## 6 2014-01-06 17561
## 7 2014-01-07 17433
## 8 2014-01-08 17706
## 9 2014-01-09 17820
## 10 2014-01-10 18156
## # ... with 355 more rows
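The inconsistency can be quantified by summarizing the daily counts themselves (a sketch; the min_day, max_day, and mean_day names are my own):

```r
library(dplyr)

# How much does the number of readings vary from day to day?
ozn %>%
  count(Date_Local, name = "readings_day") %>%
  summarize(min_day = min(readings_day),
            max_day = max(readings_day),
            mean_day = mean(readings_day))
```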
As alluded to earlier, each monitoring station records once an hour, totaling 24 measurements a day. However, as the previous table shows, the number of readings per day is not the same every day of the year, and this should be noted. We can further confirm that monitoring occurs on all 365 days, and across 53 states and territories; we saw earlier that Puerto Rico counts as a State. The next table will itemize each State.
ozn %>%
summarize(Times_Recorded = n_distinct(Time_Local))
## # A tibble: 1 x 1
## Times_Recorded
## <int>
## 1 24
ozn %>%
summarize(Total_Days_Recorded = n_distinct(Date_Local))
## # A tibble: 1 x 1
## Total_Days_Recorded
## <int>
## 1 365
ozn %>%
summarize(Total_n_States = n_distinct(State_Name))
## # A tibble: 1 x 1
## Total_n_States
## <int>
## 1 53
Below is the table with each State name. Looking through the table, we find the 50 U.S. states (the 48 contiguous states plus Alaska and Hawaii), then the District of Columbia, Puerto Rico, and the “Country Of Mexico”, totaling 53. This information can lead us to further analysis using the latitude and longitude to select the data we desire for continued research, particularly the “Country Of Mexico”: is that measurement station desirable in our analysis? Or maybe we are only interested in the continental U.S.
ozn %>%
group_by(State_Name) %>%
distinct(State_Name)
## # A tibble: 53 x 1
## # Groups: State_Name [53]
## State_Name
## <chr>
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## 7 Connecticut
## 8 Delaware
## 9 District Of Columbia
## 10 Florida
## # ... with 43 more rows
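If only the continental U.S. were of interest, one option is a simple filter on “State_Name” (a sketch; the excluded names are assumed to match the spellings used in this data set):

```r
library(dplyr)

# Drop the non-contiguous states and territories identified above.
ozn_conus <- ozn %>%
  filter(!State_Name %in% c("Alaska", "Hawaii", "Puerto Rico",
                            "Country Of Mexico"))
```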
The “Sample_Measurement” variable can be further analyzed by itemizing each State and breaking out the measurements into quantiles at 1%, 25%, 50%, 75%, and 99%.
probs = c(0.01, 0.25, 0.5, 0.75, 0.99)
ozn %>%
group_by(State_Name) %>%
summarize(p =list(probs), q = list(quantile(Sample_Measurement, probs))) %>%
unnest(cols = c(p, q))
## # A tibble: 265 x 3
## State_Name p q
## <chr> <dbl> <dbl>
## 1 Alabama 0.01 0
## 2 Alabama 0.25 0.014
## 3 Alabama 0.5 0.027
## 4 Alabama 0.75 0.039
## 5 Alabama 0.99 0.062
## 6 Alaska 0.01 0
## 7 Alaska 0.25 0.011
## 8 Alaska 0.5 0.022
## 9 Alaska 0.75 0.032
## 10 Alaska 0.99 0.051
## # ... with 255 more rows
In an earlier table, Clear Creek County, Colorado, showed the highest “Sample_Measurement” value. Here is the mean ozone measurement for each month. The ozone value climbs in April/May, peaks in June, then drops in July. Note that March is absent from the output, indicating missing data for that month.
ozn %>%
filter(State_Name == "Colorado" & County_Name == "Clear Creek") %>%
mutate(Month = lubridate::month(Date_Local, label = TRUE)) %>%
group_by(Month) %>%
summarize(Ozone = mean(Sample_Measurement))
## # A tibble: 11 x 2
## Month Ozone
## <ord> <dbl>
## 1 Jan 0.0471
## 2 Feb 0.0474
## 3 Apr 0.0585
## 4 May 0.0592
## 5 Jun 0.0621
## 6 Jul 0.0529
## 7 Aug 0.0494
## 8 Sep 0.0473
## 9 Oct 0.0471
## 10 Nov 0.0448
## 11 Dec 0.0443
Washington County, Oregon, was listed among the six lowest mean ozone measurements in a previous table. Below is the monthly summary for Washington County, Oregon. Readings from January through April are missing; noting this missing data furthers our exploratory analysis and marks a point to examine in a more thorough study.
ozn %>%
filter(State_Name == "Oregon" & County_Name == "Washington") %>%
mutate(Month = lubridate::month(Date_Local, label=TRUE)) %>%
group_by(Month) %>%
summarize(Ozone = mean(Sample_Measurement))
## # A tibble: 8 x 2
## Month Ozone
## <ord> <dbl>
## 1 May 0.0261
## 2 Jun 0.0199
## 3 Jul 0.0219
## 4 Aug 0.0209
## 5 Sep 0.0201
## 6 Oct 0.0108
## 7 Nov 0.0155
## 8 Dec 0.0123
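Gaps like this can be flagged programmatically by comparing the months present against the full calendar (a sketch, assuming the ozn tibble and the lubridate package):

```r
library(dplyr)

# List the calendar months with no Date_Local readings for this county.
present <- ozn %>%
  filter(State_Name == "Oregon", County_Name == "Washington") %>%
  mutate(Month = lubridate::month(Date_Local, label = TRUE)) %>%
  distinct(Month) %>%
  pull(Month)
setdiff(month.abb, as.character(present))
```

For this county, the result should be the January-through-April gap noted above.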
After this analysis, one would start to develop a hypothesis or business question that can be answered with the data. Graphical representations of the data should also be employed to better visualize the relationships; we use graphics in other posts.
References
EPA - United States Environmental Protection Agency (2017). Pre-Generated Data Files. Retrieved from: https://aqs.epa.gov/aqsweb/airdata/download_files.html
EPA - United States Environmental Protection Agency (2017). AirData Download Files Documentation. Retrieved from: https://aqs.epa.gov/aqsweb/airdata/FileFormats.html
Grolemund, G., Wickham, H. (2017) R for Data Science. Retrieved from: https://r4ds.had.co.nz/
Peng, R. D. (2015). Exploratory Data Analysis with R. Retrieved from: https://leanpub.com/exdata