Introduction
This exploratory data analysis uses the “hourly_44201_2014” data set obtained from the EPA website: https://aqs.epa.gov/aqsweb/airdata/download_files.html#Raw The data falls under “Criteria Gases” and is labeled “Ozone (44201)”. Loaded into RStudio, it contains 9,060,694 rows and 24 variables.
There are 24 variables in this data set. Looking through them, the first four are coded, presumably by the EPA. Latitude, Longitude, and Datum are geo-location data. NAD83 is the North American Datum coordinate system, while WGS84 is a world coordinate system; its presence suggests that some measurements were taken in U.S. regions or territories outside the contiguous states. The “Date_Local” variable shows 365 unique dates, covering every day of the year. The “Time_Local” variable has 24 unique values, covering each hour of the day. The ozone readings are stored in “Sample_Measurement”, followed by more coded variables. The last variables include “State_Name” and “County_Name”. The 53 unique names in “State_Name” stand out; this likely coincides with the WGS84 coordinate system, covering territories beyond the 48 contiguous states. In summary, the data set provides essential geographic information, date/time information, and an ozone measurement. For this analysis, we can disregard the EPA's internal coding.
Load the Tidyverse Library
library(tidyverse)
Importing the Data into R Studio
A note about loading the data: this analysis uses R and the Tidyverse package. At this writing, the Tidyverse cannot download and open a .zip file directly from a URL. Thus a temp file is created, the .zip data file is downloaded into it, and the data is then read from this temp location into a tibble (data frame). In this coding step, I identify each variable's data type per the EPA specifications describing this data. In other data sets pulled from the web, the data types may not be known in advance and may need to be identified after performing summary statistics. The last line of code keeps with acceptable coding standards and formats the variable names to work with R by replacing white space with underscores.
# place data into frame
url = "https://aqs.epa.gov/aqsweb/airdata/hourly_44201_2014.zip"
# create temp file for the zip file
zip_file = tempfile(fileext = ".zip")
# download the zip file into the temp location
download.file(url, zip_file, method = 'libcurl', mode = "wb")
# read data into a tibble and specify data type for each variable, see EPA data source
ozn = readr::read_csv(zip_file, col_types = "iiiiinnccDtDtncnlccicccc")
# remove spaces in variables and replace with underscore
names(ozn) = gsub("\\s", "_", names(ozn))
unlink(zip_file)
Viewing the Dataset
Showing the variable names using the names() function.
names(ozn)
## [1] "State_Code" "County_Code" "Site_Num"
## [4] "Parameter_Code" "POC" "Latitude"
## [7] "Longitude" "Datum" "Parameter_Name"
## [10] "Date_Local" "Time_Local" "Date_GMT"
## [13] "Time_GMT" "Sample_Measurement" "Units_of_Measure"
## [16] "MDL" "Uncertainty" "Qualifier"
## [19] "Method_Type" "Method_Code" "Method_Name"
## [22] "State_Name" "County_Name" "Date_of_Last_Change"
Using glimpse() quickly displays the data when in a tibble format. The column type abbreviations are:
- chr: character
- dbl: double (real number)
- int: integer
- date: calendar date
- time: time of day
- lgl: logical
- fctr: factor (categorical)
tibble::glimpse(ozn)
## Rows: 9,060,694
## Columns: 24
## $ State_Code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ County_Code <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ Site_Num <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,...
## $ Parameter_Code <int> 44201, 44201, 44201, 44201, 44201, 44201, 44201...
## $ POC <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Latitude <dbl> 30.49748, 30.49748, 30.49748, 30.49748, 30.4974...
## $ Longitude <dbl> -87.88026, -87.88026, -87.88026, -87.88026, -87...
## $ Datum <chr> "NAD83", "NAD83", "NAD83", "NAD83", "NAD83", "N...
## $ Parameter_Name <chr> "Ozone", "Ozone", "Ozone", "Ozone", "Ozone", "O...
## $ Date_Local <date> 2014-03-01, 2014-03-01, 2014-03-01, 2014-03-01...
## $ Time_Local <time> 01:00:00, 02:00:00, 03:00:00, 04:00:00, 05:00:...
## $ Date_GMT <date> 2014-03-01, 2014-03-01, 2014-03-01, 2014-03-01...
## $ Time_GMT <time> 07:00:00, 08:00:00, 09:00:00, 10:00:00, 11:00:...
## $ Sample_Measurement <dbl> 0.047, 0.047, 0.043, 0.038, 0.035, 0.035, 0.034...
## $ Units_of_Measure <chr> "Parts per million", "Parts per million", "Part...
## $ MDL <dbl> 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005...
## $ Uncertainty <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Qualifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Method_Type <chr> "FEM", "FEM", "FEM", "FEM", "FEM", "FEM", "FEM"...
## $ Method_Code <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,...
## $ Method_Name <chr> "INSTRUMENTAL - ULTRA VIOLET", "INSTRUMENTAL - ...
## $ State_Name <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Al...
## $ County_Name <chr> "Baldwin", "Baldwin", "Baldwin", "Baldwin", "Ba...
## $ Date_of_Last_Change <chr> "2014-06-30", "2014-06-30", "2014-06-30", "2014...
State_Code | County_Code | Site_Num | Parameter_Code | POC | Latitude | Longitude | Datum | Parameter_Name | Date_Local | Time_Local | Date_GMT | Time_GMT | Sample_Measurement | Units_of_Measure | MDL | Uncertainty | Qualifier | Method_Type | Method_Code | Method_Name | State_Name | County_Name | Date_of_Last_Change |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 01:00:00 | 2014-03-01 | 07:00:00 | 0.047 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 02:00:00 | 2014-03-01 | 08:00:00 | 0.047 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 03:00:00 | 2014-03-01 | 09:00:00 | 0.043 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 04:00:00 | 2014-03-01 | 10:00:00 | 0.038 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 05:00:00 | 2014-03-01 | 11:00:00 | 0.035 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 06:00:00 | 2014-03-01 | 12:00:00 | 0.035 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 07:00:00 | 2014-03-01 | 13:00:00 | 0.034 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 08:00:00 | 2014-03-01 | 14:00:00 | 0.037 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 09:00:00 | 2014-03-01 | 15:00:00 | 0.044 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
1 | 3 | 10 | 44201 | 1 | 30.49748 | -87.88026 | NAD83 | Ozone | 2014-03-01 | 10:00:00 | 2014-03-01 | 16:00:00 | 0.046 | Parts per million | 0.005 | NA | NA | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | Baldwin | 2014-06-30 |
Data Summary Function
The summary() function displays the data format and shows a nice breakdown of the data. The “Site_Num” variable has a maximum value of 9,997; since site numbers are identifiers, this is not necessarily a count of monitoring sites. Because I specified the data types, the date/time variables imported precisely into the tibble. “Sample_Measurement” shows no missing data (no NA's), with a minimum of -0.005, a maximum of 0.213, and a mean of about 0.03027. “Method_Name” describes the ultraviolet method of measurement with a chemiluminescence API model. The “State_Name” and “County_Name” variables list familiar names. The data is reasonably clean, and looking ahead, we could drop most of the variables in favor of a few vital ones (ozone, county, state, time, date, latitude, longitude).
summary(ozn)
## State_Code County_Code Site_Num Parameter_Code POC
## Min. : 1.00 Min. : 1.00 Min. : 1 Min. :44201 Min. :1.00
## 1st Qu.: 8.00 1st Qu.: 23.00 1st Qu.: 6 1st Qu.:44201 1st Qu.:1.00
## Median :25.00 Median : 61.00 Median : 27 Median :44201 Median :1.00
## Mean :26.52 Mean : 76.71 Mean :1362 Mean :44201 Mean :1.07
## 3rd Qu.:42.00 3rd Qu.:103.00 3rd Qu.:1016 3rd Qu.:44201 3rd Qu.:1.00
## Max. :80.00 Max. :800.00 Max. :9997 Max. :44201 Max. :6.00
## Latitude Longitude Datum Parameter_Name
## Min. :18.18 Min. :-158.09 Length:9060694 Length:9060694
## 1st Qu.:33.92 1st Qu.:-109.91 Class :character Class :character
## Median :38.01 Median : -92.01 Mode :character Mode :character
## Mean :37.44 Mean : -95.35
## 3rd Qu.:40.92 3rd Qu.: -81.97
## Max. :64.85 Max. : -65.92
## Date_Local Time_Local Date_GMT Time_GMT
## Min. :2014-01-01 Length:9060694 Min. :2014-01-01 Length:9060694
## 1st Qu.:2014-04-20 Class1:hms 1st Qu.:2014-04-20 Class1:hms
## Median :2014-07-05 Class2:difftime Median :2014-07-05 Class2:difftime
## Mean :2014-07-04 Mode :numeric Mean :2014-07-04 Mode :numeric
## 3rd Qu.:2014-09-19 3rd Qu.:2014-09-20
## Max. :2014-12-31 Max. :2015-01-01
## Sample_Measurement Units_of_Measure MDL Uncertainty
## Min. :-0.00500 Length:9060694 Min. :0.0006 Mode:logical
## 1st Qu.: 0.01900 Class :character 1st Qu.:0.0050 NA's:9060694
## Median : 0.03000 Mode :character Median :0.0050
## Mean : 0.03027 Mean :0.1361
## 3rd Qu.: 0.04100 3rd Qu.:0.0050
## Max. : 0.21300 Max. :8.0000
## Qualifier Method_Type Method_Code Method_Name
## Length:9060694 Length:9060694 Min. : 19.00 Length:9060694
## Class :character Class :character 1st Qu.: 47.00 Class :character
## Mode :character Mode :character Median : 87.00 Mode :character
## Mean : 71.08
## 3rd Qu.: 87.00
## Max. :901.00
## State_Name County_Name Date_of_Last_Change
## Length:9060694 Length:9060694 Length:9060694
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
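The negative minimum for “Sample_Measurement” stands out, since an ozone concentration cannot truly be below zero. As a quick sanity check, we can count the sub-zero readings (a sketch, assuming the ozn tibble loaded above; the column name n_negative is my own):

```r
library(dplyr)

# Count readings below zero; these are physically impossible and likely
# reflect instrument noise near the method detection limit.
ozn %>%
  filter(Sample_Measurement < 0) %>%
  summarize(n_negative = n())
```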
Below, the skimr library function provides an excellent summary of the data, particularly the histograms at the bottom, which show the distribution of each numeric variable.
skimr::skim(ozn)
Name | ozn |
Number of rows | 9060694 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
character | 9 |
Date | 2 |
difftime | 2 |
logical | 1 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Datum | 0 | 1.00 | 5 | 5 | 0 | 2 | 0 |
Parameter_Name | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Units_of_Measure | 0 | 1.00 | 17 | 17 | 0 | 1 | 0 |
Qualifier | 8937542 | 0.01 | 1 | 2 | 0 | 19 | 0 |
Method_Type | 0 | 1.00 | 3 | 7 | 0 | 2 | 0 |
Method_Name | 0 | 1.00 | 27 | 63 | 0 | 8 | 0 |
State_Name | 0 | 1.00 | 4 | 20 | 0 | 53 | 0 |
County_Name | 0 | 1.00 | 3 | 20 | 0 | 631 | 0 |
Date_of_Last_Change | 0 | 1.00 | 10 | 10 | 0 | 406 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date_Local | 0 | 1 | 2014-01-01 | 2014-12-31 | 2014-07-05 | 365 |
Date_GMT | 0 | 1 | 2014-01-01 | 2015-01-01 | 2014-07-05 | 366 |
Variable type: difftime
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Time_Local | 0 | 1 | 0 secs | 82800 secs | 43200 secs | 24 |
Time_GMT | 0 | 1 | 0 secs | 82800 secs | 43200 secs | 24 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Uncertainty | 9060694 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
State_Code | 0 | 1 | 26.52 | 16.99 | 1.00 | 8.00 | 25.00 | 42.00 | 80.00 | ▇▆▇▂▁ |
County_Code | 0 | 1 | 76.71 | 79.70 | 1.00 | 23.00 | 61.00 | 103.00 | 800.00 | ▇▁▁▁▁ |
Site_Num | 0 | 1 | 1362.11 | 2672.34 | 1.00 | 6.00 | 27.00 | 1016.00 | 9997.00 | ▇▁▁▁▁ |
Parameter_Code | 0 | 1 | 44201.00 | 0.00 | 44201.00 | 44201.00 | 44201.00 | 44201.00 | 44201.00 | ▁▁▇▁▁ |
POC | 0 | 1 | 1.07 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 | 6.00 | ▇▁▁▁▁ |
Latitude | 0 | 1 | 37.44 | 5.05 | 18.18 | 33.92 | 38.01 | 40.92 | 64.85 | ▁▆▇▁▁ |
Longitude | 0 | 1 | -95.35 | 16.32 | -158.09 | -109.91 | -92.01 | -81.97 | -65.92 | ▁▂▇▇▇ |
Sample_Measurement | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.02 | 0.03 | 0.04 | 0.21 | ▇▃▁▁▁ |
MDL | 0 | 1 | 0.14 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 8.00 | ▇▁▁▁▁ |
Method_Code | 0 | 1 | 71.08 | 43.25 | 19.00 | 47.00 | 87.00 | 87.00 | 901.00 | ▇▁▁▁▁ |
These next lines are ways to narrow in on the data metrics. First, we can count how many distinct Latitude and Longitude values each variable contains: 1,305 unique latitudes and 1,306 unique longitudes.
ozn %>%
summarize(Total_Lat = n_distinct(Latitude))
## # A tibble: 1 x 1
## Total_Lat
## <int>
## 1 1305
ozn %>%
summarize(Total_Long = n_distinct(Longitude))
## # A tibble: 1 x 1
## Total_Long
## <int>
## 1 1306
Next, I break down the counties by latitude and longitude, based on the variable’s 1,306 unique longitude values. These coordinate pairs approximate where the ozone monitors are positioned.
ozn %>%
group_by(County_Name, Latitude, Longitude) %>%
summarize(Lat = n_distinct(Longitude))
## # A tibble: 1,306 x 4
## # Groups: County_Name, Latitude [1,305]
## County_Name Latitude Longitude Lat
## <chr> <dbl> <dbl> <int>
## 1 Abbeville 34.3 -82.4 1
## 2 Ada 43.6 -116. 1
## 3 Ada 43.6 -116. 1
## 4 Adair 35.8 -94.7 1
## 5 Adams 39.8 -105. 1
## 6 Adams 39.9 -91.3 1
## 7 Adams 39.9 -77.3 1
## 8 Adams 39.9 -77.3 1
## 9 Aiken 33.3 -81.8 1
## 10 Alachua 29.5 -82.3 1
## # ... with 1,296 more rows
If I want to view the first ten rows of Latitude and Longitude and render them as a table, I can specify that in R.
Latitude | Longitude |
---|---|
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
30.49748 | -87.88026 |
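A table like the one above can be produced with a short pipeline (a sketch, assuming the ozn tibble from earlier and the knitr package for rendering):

```r
library(dplyr)

# Keep only the coordinate columns, take the first ten rows, and render
# them as a table.
ozn %>%
  select(Latitude, Longitude) %>%
  head(10) %>%
  knitr::kable()
```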
Taking the concept further, I can create a table showing the State, County, and Sample Measurement, only displaying the first ten rows of data.
State_Name | County_Name | Sample_Measurement |
---|---|---|
Alabama | Baldwin | 0.047 |
Alabama | Baldwin | 0.047 |
Alabama | Baldwin | 0.043 |
Alabama | Baldwin | 0.038 |
Alabama | Baldwin | 0.035 |
Alabama | Baldwin | 0.035 |
Alabama | Baldwin | 0.034 |
Alabama | Baldwin | 0.037 |
Alabama | Baldwin | 0.044 |
Alabama | Baldwin | 0.046 |
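The same pattern extends to any selection of columns (again a sketch, assuming ozn and knitr are available):

```r
library(dplyr)

# State, county, and ozone reading for the first ten rows, rendered as
# a table.
ozn %>%
  select(State_Name, County_Name, Sample_Measurement) %>%
  head(10) %>%
  knitr::kable()
```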
A more detailed look at the State, County, count of measurements, and mean ozone reading is shown in this table.
summarize(group_by(ozn,
State_Name,
County_Name),
count = n(),
meanOzone = mean(Sample_Measurement))
## # A tibble: 797 x 4
## # Groups: State_Name [53]
## State_Name County_Name count meanOzone
## <chr> <chr> <int> <dbl>
## 1 Alabama Baldwin 5492 0.0293
## 2 Alabama Colbert 5613 0.0249
## 3 Alabama DeKalb 8232 0.0320
## 4 Alabama Elmore 5585 0.0292
## 5 Alabama Etowah 5578 0.0261
## 6 Alabama Houston 5562 0.0272
## 7 Alabama Jefferson 41443 0.0262
## 8 Alabama Madison 11678 0.0300
## 9 Alabama Mobile 10874 0.0259
## 10 Alabama Montgomery 5560 0.0282
## # ... with 787 more rows
Here we can summarize the States and Counties with the lowest mean ozone readings. Puerto Rico and Alaska, along with Oklahoma, Oregon, Washington, etc., have the lowest mean ozone readings in this data set. The arrange() function sorts in ascending order by default.
ozn %>%
group_by(State_Name, County_Name) %>%
summarize(count = n(), MeanO = mean(Sample_Measurement)) %>%
arrange(MeanO)
## # A tibble: 797 x 4
## # Groups: State_Name [53]
## State_Name County_Name count MeanO
## <chr> <chr> <int> <dbl>
## 1 Puerto Rico Catano 8038 0.00449
## 2 Puerto Rico Bayamon 2794 0.00852
## 3 Alaska Fairbanks North Star 8575 0.0133
## 4 Puerto Rico Juncos 8180 0.0135
## 5 Oklahoma Caddo 7507 0.0175
## 6 Oregon Washington 9215 0.0196
## 7 Washington Whatcom 3098 0.0201
## 8 Louisiana West Baton Rouge 8635 0.0201
## 9 New Jersey Warren 8620 0.0201
## 10 Georgia DeKalb 8594 0.0205
## # ... with 787 more rows
This table displays the highest Mean Ozone readings per State and County in this data set. Colorado, California, and Wyoming are at the top, with Utah and Nevada.
ozn %>%
group_by(State_Name, County_Name) %>%
summarize(count = n(), MeanO = mean(Sample_Measurement)) %>%
arrange(desc(MeanO))
## # A tibble: 797 x 4
## # Groups: State_Name [53]
## State_Name County_Name count MeanO
## <chr> <chr> <int> <dbl>
## 1 Colorado Clear Creek 6447 0.0502
## 2 California Mariposa 12130 0.0485
## 3 California Nevada 12748 0.0482
## 4 Colorado Park 8127 0.0480
## 5 Wyoming Albany 8056 0.0474
## 6 Colorado Chaffee 3748 0.0473
## 7 California Inyo 8667 0.0447
## 8 Utah San Juan 10794 0.0444
## 9 California El Dorado 16521 0.0436
## 10 Nevada Elko 3431 0.0435
## # ... with 787 more rows
This summary table shows the number of readings taken per day for the entire year across all states. The summary indicates that not every monitoring station records consistently.
ozn %>%
group_by(Date_Local) %>%
summarize(readings_day = n())
## # A tibble: 365 x 2
## Date_Local readings_day
## <date> <int>
## 1 2014-01-01 18115
## 2 2014-01-02 17901
## 3 2014-01-03 17992
## 4 2014-01-04 17929
## 5 2014-01-05 17769
## 6 2014-01-06 17561
## 7 2014-01-07 17433
## 8 2014-01-08 17706
## 9 2014-01-09 17820
## 10 2014-01-10 18156
## # ... with 355 more rows
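The inconsistency can be quantified by summarizing the daily counts themselves (a sketch; the min_day, max_day, and mean_day names are my own):

```r
library(dplyr)

# How much does the number of readings vary from day to day?
ozn %>%
  count(Date_Local, name = "readings_day") %>%
  summarize(min_day = min(readings_day),
            max_day = max(readings_day),
            mean_day = mean(readings_day))
```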
As alluded to earlier, each monitoring station records once an hour, totaling 24 measurements a day. However, as the previous table shows, the number of readings per day is not the same every day of the year, and this should be noted. We can further confirm that monitoring occurs on all 365 days, and across 53 states and territories; we saw earlier that Puerto Rico counts as a State. The next table will itemize each State.
ozn %>%
summarize(Times_Recorded = n_distinct(Time_Local))
## # A tibble: 1 x 1
## Times_Recorded
## <int>
## 1 24
ozn %>%
summarize(Total_Days_Recorded = n_distinct(Date_Local))
## # A tibble: 1 x 1
## Total_Days_Recorded
## <int>
## 1 365
ozn %>%
summarize(Total_n_States = n_distinct(State_Name))
## # A tibble: 1 x 1
## Total_n_States
## <int>
## 1 53
Below is the table with each State name. Looking through the table, we find the 50 U.S. states (the 48 contiguous states plus Alaska and Hawaii), then the District of Columbia, Puerto Rico, and the “Country Of Mexico”, totaling 53. This information can lead us to further analysis using the latitude and longitude to select the data we desire for continued research, particularly the “Country Of Mexico”: is that measurement station desirable in our analysis? Or maybe we are only interested in the continental U.S.
ozn %>%
group_by(State_Name) %>%
distinct(State_Name)
## # A tibble: 53 x 1
## # Groups: State_Name [53]
## State_Name
## <chr>
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## 7 Connecticut
## 8 Delaware
## 9 District Of Columbia
## 10 Florida
## # ... with 43 more rows
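If only the continental U.S. were of interest, one option is a simple filter on “State_Name” (a sketch; the excluded names are assumed to match the spellings used in this data set):

```r
library(dplyr)

# Drop the non-contiguous states and territories identified above.
ozn_conus <- ozn %>%
  filter(!State_Name %in% c("Alaska", "Hawaii", "Puerto Rico",
                            "Country Of Mexico"))
```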
The “Sample_Measurement” variable can be further analyzed by itemizing each State and breaking out the measurements into quantiles at 1%, 25%, 50%, 75%, and 99%.
probs = c(0.01, 0.25, 0.5, 0.75, 0.99)
ozn %>%
group_by(State_Name) %>%
summarize(p =list(probs), q = list(quantile(Sample_Measurement, probs))) %>%
unnest(cols = c(p, q))
## # A tibble: 265 x 3
## State_Name p q
## <chr> <dbl> <dbl>
## 1 Alabama 0.01 0
## 2 Alabama 0.25 0.014
## 3 Alabama 0.5 0.027
## 4 Alabama 0.75 0.039
## 5 Alabama 0.99 0.062
## 6 Alaska 0.01 0
## 7 Alaska 0.25 0.011
## 8 Alaska 0.5 0.022
## 9 Alaska 0.75 0.032
## 10 Alaska 0.99 0.051
## # ... with 255 more rows
In an earlier table, Clear Creek County, Colorado, showed the highest “Sample_Measurement” value. Here is the mean ozone measurement for each month. The ozone value climbs in April/May, peaks in June, then drops in July. Note that March is absent from the output, indicating missing data for that month.
ozn %>%
filter(State_Name == "Colorado" & County_Name == "Clear Creek") %>%
mutate(Month = lubridate::month(Date_Local, label = TRUE)) %>%
group_by(Month) %>%
summarize(Ozone = mean(Sample_Measurement))
## # A tibble: 11 x 2
## Month Ozone
## <ord> <dbl>
## 1 Jan 0.0471
## 2 Feb 0.0474
## 3 Apr 0.0585
## 4 May 0.0592
## 5 Jun 0.0621
## 6 Jul 0.0529
## 7 Aug 0.0494
## 8 Sep 0.0473
## 9 Oct 0.0471
## 10 Nov 0.0448
## 11 Dec 0.0443
Washington County, Oregon, was listed among the six lowest mean ozone measurements in a previous table. Below is the monthly summary for Washington County, Oregon. Readings from January through April are missing; noting this missing data furthers our exploratory analysis and marks a point to examine in a more thorough study.
ozn %>%
filter(State_Name == "Oregon" & County_Name == "Washington") %>%
mutate(Month = lubridate::month(Date_Local, label=TRUE)) %>%
group_by(Month) %>%
summarize(Ozone = mean(Sample_Measurement))
## # A tibble: 8 x 2
## Month Ozone
## <ord> <dbl>
## 1 May 0.0261
## 2 Jun 0.0199
## 3 Jul 0.0219
## 4 Aug 0.0209
## 5 Sep 0.0201
## 6 Oct 0.0108
## 7 Nov 0.0155
## 8 Dec 0.0123
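Gaps like this can be flagged programmatically by comparing the months present against the full calendar (a sketch, assuming the ozn tibble and the lubridate package):

```r
library(dplyr)

# List the calendar months with no Date_Local readings for this county.
present <- ozn %>%
  filter(State_Name == "Oregon", County_Name == "Washington") %>%
  mutate(Month = lubridate::month(Date_Local, label = TRUE)) %>%
  distinct(Month) %>%
  pull(Month)
setdiff(month.abb, as.character(present))
```

For this county, the result should be the January-through-April gap noted above.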
After this analysis, one would start to develop a hypothesis or business question that can be answered with the data. Graphical representations of the data should also be employed to better visualize the relationships; we use graphics in other posts.
References
EPA - United States Environmental Protection Agency (2017). Pre-Generated Data Files. Retrieved from: https://aqs.epa.gov/aqsweb/airdata/download_files.html
EPA - United States Environmental Protection Agency (2017). AirData Download Files Documentation. Retrieved from: https://aqs.epa.gov/aqsweb/airdata/FileFormats.html
Grolemund, G., Wickham, H. (2017) R for Data Science. Retrieved from: https://r4ds.had.co.nz/
Peng, R. D. (2015). Exploratory Data Analysis with R. Retrieved from: https://leanpub.com/exdata