forked from rdpeng/RepData_PeerAssessment1
PA1_template.Rmd
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{R results="hide"}
library(utils) #load utils library that contains unzip function
unzip("activity.zip") #extract activity.csv from zip container.
df <- read.csv("activity.csv") #load csv into variable, "df"
```
## What is mean total number of steps taken per day?
```{R message=FALSE}
stepsPerDay <- aggregate(steps ~ date, df, FUN=sum) #sum the daily number of steps and store results in a new variable, "stepsPerDay"
colnames(stepsPerDay) <- c("date", "steps") #name columns for readability
library(ggplot2) #load ggplot library for plotting
# plots the histogram of the steps taken per day
hist_with_NA_values <- ggplot(stepsPerDay, aes(x=steps))
hist_with_NA_values <- hist_with_NA_values + geom_histogram(fill="darkgreen", colour="black")
hist_with_NA_values + labs(title="Histogram of the number of steps taken per day", x="Steps per day")
#calculate mean of steps taken per day, ignoring days where the value is NA.
print(mean_with_NA_values <- mean(stepsPerDay$steps, na.rm=TRUE))
#calculate median of steps taken per day, ignoring days where the value is NA.
print(median_with_NA_values <- median(stepsPerDay$steps, na.rm=TRUE))
```
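One detail worth making explicit: with the formula interface, `aggregate()` uses `na.action = na.omit` by default, so rows with a missing `steps` value are dropped before summing, and a day with no recorded steps disappears from the result instead of showing up as zero. A minimal sketch on made-up data (the `toy` data frame and its values are illustrative, not taken from `activity.csv`):

```{R}
# Toy data: day 1 fully recorded, day 2 entirely missing, day 3 partly missing
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  steps = c(10, 20, NA, NA, 5, NA)
)
# Formula interface: NA rows are removed first, so 2012-10-02 is absent
toy_formula <- aggregate(steps ~ date, toy, FUN = sum)
# tapply with na.rm = TRUE keeps every day; an all-NA day sums to 0
toy_tapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
toy_formula
toy_tapply
```

The two approaches answer slightly different questions, so the choice affects both the histogram and the mean: dropping a day excludes it, while a zero total counts it as a day with no activity.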
## What is the average daily activity pattern?
```{R}
meanStepsAtEachInterval <- aggregate(steps ~ interval, df, FUN=mean) #calculate mean steps per interval and store in new variable, "meanStepsAtEachInterval"
#plot time series for numbers of steps in each 5 minute interval
ggplot(meanStepsAtEachInterval, aes(x=interval, y=steps)) + geom_line() + labs(title="Time series for mean number of steps in each interval")
#identify which 5-minute interval has the max average number of steps across the dataset
meanStepsAtEachInterval[meanStepsAtEachInterval$steps==max(meanStepsAtEachInterval$steps), 1]
```
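The lookup on the last line of the chunk compares every interval mean against the maximum. `which.max()` expresses a similar idea more directly, but it returns only the first position of the maximum, which matters if two intervals tie. A small sketch on invented values (the `demo_means` data frame is hypothetical):

```{R}
demo_means <- data.frame(interval = c(0, 5, 10, 15),
                         steps    = c(1.5, 9.0, 3.2, 9.0))
# Logical comparison returns every interval tied for the maximum
tied <- demo_means[demo_means$steps == max(demo_means$steps), "interval"]
# which.max returns only the first position of the maximum
first_max <- demo_means$interval[which.max(demo_means$steps)]
tied
first_max
```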
## Imputing missing values
```{R message=FALSE}
#Total number of missing values in the dataset
sum(is.na(df$steps))
#Replace NA values with the mean value of the interval across the data collection period
df$steps_with_imputted_missing_values <- df$steps #Duplicate and work on new column so that we can preserve the original data
for (i in seq_len(nrow(df))) {
  if (is.na(df[i, "steps"])) {
    df[i, "steps_with_imputted_missing_values"] <-
      meanStepsAtEachInterval[meanStepsAtEachInterval$interval == df[i, "interval"], "steps"]
  }
}
stepsPerDay_with_imputted_missing_values <- aggregate(steps_with_imputted_missing_values ~ date, df, FUN=sum) #sum the daily number of steps, including imputed values, and store results in a new variable
colnames(stepsPerDay_with_imputted_missing_values) <- c("date", "steps") #name columns for readability
# plots the histogram of the steps taken per day, with imputed values for NA
hist_with_imputted_values <- ggplot(stepsPerDay_with_imputted_missing_values, aes(x=steps))
hist_with_imputted_values <- hist_with_imputted_values + geom_histogram(fill="purple", colour="black") #add the histogram layer to the imputed-data plot
hist_with_imputted_values + labs(title="Histogram of the number of steps taken per day", x="Steps per day")
#calculate mean of steps taken per day, with imputed values for NA
print(mean_with_imputted_values <- mean(stepsPerDay_with_imputted_missing_values$steps))
#calculate median of steps taken per day, with imputed values for NA
print(median_with_imputted_values <- median(stepsPerDay_with_imputted_missing_values$steps))
```
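The row-by-row loop in the chunk above can also be written without a loop. The sketch below uses `ave()` to compute each interval's mean alongside the original rows and `ifelse()` to fill only the missing entries. It runs on a made-up data frame (`demo`, with invented values) rather than the activity data, and is one possible alternative, not the method used in this report.

```{R}
demo <- data.frame(
  interval = rep(c(0, 5), times = 3),
  steps    = c(2, NA, 4, 10, NA, 20)
)
# Per-interval mean, repeated for every row of that interval (NAs ignored)
interval_mean <- ave(demo$steps, demo$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
# Keep observed values; substitute the interval mean where steps is NA
demo$steps_filled <- ifelse(is.na(demo$steps), interval_mean, demo$steps)
demo
```

Because the whole column is processed at once, this form avoids one data frame lookup per row, which can matter on larger datasets.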
### Comparison between analysis with and without imputed values
```{R message=FALSE}
#create matrix of values to be compared
comparison <- rbind( c(mean_with_NA_values, median_with_NA_values)
, c(mean_with_imputted_values, median_with_imputted_values))
colnames(comparison) <- c("mean", "median") #name columns
rownames(comparison) <- c("Original (with NA values)", "With imputed values") #name rows
comparison #print comparison
library(gridExtra) #load gridExtra library to help with arranging two ggplots
grid.arrange(hist_with_NA_values, hist_with_imputted_values, ncol=1) #plot
```
* There aren't any major differences when imputed values replace the NA values. The mean remains the same. The median has increased slightly with the use of imputed values.
* The shape of the histograms remains largely similar as well. However, if we had used smaller bins, it might be possible to see differences where values on the x axis have been inserted in place of NA.
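
A possible explanation for the unchanged mean, assuming the missing values cover whole days rather than scattered intervals (an assumption, not something checked above): a day filled entirely with interval means gets a total equal to the sum of those means, which is exactly the mean daily total of the complete days. Adding days that sit at the mean leaves the mean where it was and pulls the median toward it. A sketch with invented numbers (the `sim` data frame is hypothetical):

```{R}
# Three complete days and one fully missing day, two intervals per day
sim <- data.frame(
  date     = rep(c("d1", "d2", "d3", "d4"), each = 2),
  interval = rep(c(0, 5), times = 4),
  steps    = c(1, 2, 3, 4, 5, 15, NA, NA)
)
# Daily totals of the complete days only (the all-NA day d4 is dropped)
totals_before <- aggregate(steps ~ date, sim, FUN = sum)$steps
# Fill each NA with the mean of its interval
sim$filled <- ifelse(is.na(sim$steps),
                     ave(sim$steps, sim$interval,
                         FUN = function(x) mean(x, na.rm = TRUE)),
                     sim$steps)
totals_after <- aggregate(filled ~ date, sim, FUN = sum)$filled
c(mean_before = mean(totals_before), mean_after = mean(totals_after))
c(median_before = median(totals_before), median_after = median(totals_after))
```

In this toy case the mean of the daily totals is the same before and after filling, while the median moves closer to the mean.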
## Are there differences in activity patterns between weekdays and weekends?
```{R}
df <- transform(df, date=as.Date(date)) #change class type to "Date" so that weekdays function can be used on column
df$day <- weekdays(df$date) #create new column indicating the day of the week for the corresponding date
df_weekdays <- df[which(df$day != "Saturday" & df$day != "Sunday"), ] #subset data for weekdays
df_weekends <- df[which(df$day == "Saturday" | df$day == "Sunday"), ] #subset data for weekends
meanStepsAtEachInterval_weekday <- aggregate(steps ~ interval, df_weekdays, FUN=mean) #calculate mean steps per interval for weekdays
meanStepsAtEachInterval_weekend <- aggregate(steps ~ interval, df_weekends, FUN=mean) #calculate mean steps per interval for weekends
meanStepsAtEachInterval_weekday$daytype <- "weekday" #populate daytype column with the type of day
meanStepsAtEachInterval_weekend$daytype <- "weekend" #populate daytype column with the type of day
meansStepsAtEachInterval_wkendwkday <- rbind(meanStepsAtEachInterval_weekend,meanStepsAtEachInterval_weekday) #form combined dataset
#meansStepsAtEachInterval_wkendwkday <- transform(meansStepsAtEachInterval_wkendwkday, daytype=as.factor(daytype))
#plot panel plot
pp <- ggplot(meansStepsAtEachInterval_wkendwkday, aes(x=interval, y=steps))
pp <- pp + geom_line() + labs(title="Panel plot of mean steps at each interval by weekend/weekday")
pp + facet_grid(daytype ~ .)
```
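`weekdays()` returns day names in the current locale, so the comparisons with "Saturday" and "Sunday" above only match when R runs with English day names. A locale-independent sketch uses the numeric day of the week from `as.POSIXlt()` (0 = Sunday, 6 = Saturday). The `sample_dates` vector and the `demo_daytype` labels are illustrative names, not part of the analysis above:

```{R}
sample_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
# $wday is numeric (0 = Sunday ... 6 = Saturday) and does not depend on locale
day_number <- as.POSIXlt(sample_dates)$wday
demo_daytype <- factor(ifelse(day_number %in% c(0, 6), "weekend", "weekday"),
                       levels = c("weekday", "weekend"))
data.frame(date = sample_dates, daytype = demo_daytype)
```

A factor built this way could be passed straight to `facet_grid()`, in place of the two separate subsets bound together with `rbind()`.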