-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathExponential Simulation script.Rmd
More file actions
242 lines (206 loc) · 10.9 KB
/
Exponential Simulation script.Rmd
File metadata and controls
242 lines (206 loc) · 10.9 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
---
title: "Central Limit Theorem Simulation"
author: "Alex Fennell"
output: html_document
---
# Simulating the Exponential distribution
## Synopsis
This project will compare how statistics of a non-normal distribution are
themselves normally distributed. Samples will be drawn from the exponential
distribution to generate the mean and variance. The distribution of these statistics
will then be compared to the theoretical mean and standard deviation of the
distribution. The simulated distributions of statistics will then be compared to a normal
distribution to see if the statistics derived from a skewed distribution are themselves
normally distributed. Code is contained in an appendix following the report.
```{r document options,echo=FALSE}
knitr::opts_chunk$set(echo=FALSE)
```
```{r,message=FALSE,warning=FALSE}
library(ggplot2)
library(gridExtra)
library(cowplot)
library(qqplotr)
```
## Simulations
This simulation will use samples of size 40 with lambda=.2. There will be 1000
simulations used to create a distribution
``` {r initial simulation}
set.seed(11111)
lambda=.2
#Number of samples
S=1000
#size of samples
n=40
#samples stored in matrix where columns are the size of the samples
#and each row is an individual simulation
sim<-matrix(rexp(n*S,lambda),nrow=S,ncol=n)
```
## Sample Statistics vs. Theoretical Statistics
The first thing to be examined is a histogram of the means/variances from the
simulation with the theoretical mean/variance overlayed as a dashed red vertical line. The
theoretical mean is calculated as 1/lambda while the theoretical variance is calculated
as 1/lambda^2
``` {r simulated mean vs theoretical mean histogram}
# Theoretical mean
Tmean<-1/lambda
Tmean<-data.frame(Type="Theoretical Mean",val=Tmean)
# The function rowMeans calulates the mean for each row (one simulation), which
#creates a distribution of means based on the simulation
simmean<-data.frame(means=rowMeans(sim))
# standardize bin width for all plots
binsize=.2
#combine mean and theoretical mean into one dataframe
meanplot<-ggplot(data=simmean,aes(means))+
ggtitle("Distribution of Means \nof Exponential Simulation") +
xlab("Means")+
theme(plot.title = element_text(hjust = 0.5),
legend.title=element_blank(),
legend.position="bottom")
meanplot1<-meanplot+geom_histogram(aes(y=..count..),binwidth = binsize,color="black",fill="lightblue")+
geom_vline(data=Tmean,aes(xintercept=val,color=Type),linetype="dashed",size=1.5)
```
```{r Simulated Variance vs. Theoretical variance histogram,fig.height=3}
#Theoretical variance
Tvar<-(1/lambda^2)
Tvar<-data.frame(Type="Theoretical Variance",val=Tvar)
# Using the apply function to calculate the variance for each row (one simulation),
# which creates a distribution of simulated variances
simvar<-data.frame(VAR=apply(sim,1,var))
varplot<-ggplot(data=simvar,aes(VAR))+
ggtitle("Distribution of Variance \nof Exponential Simulation") +
xlab("Variance")+
theme(plot.title = element_text(hjust = 0.5),
legend.title=element_blank(),
legend.position="bottom")
varplot1<-varplot+
geom_histogram(aes(y=..count..),binwidth=2.5,color="black",fill="darkgreen")+
geom_vline(data=Tvar,aes(xintercept=val,color=Type),linetype="dashed",size=1.5)
# the grid.arrange function is used to organize multiple plots in a figure
grid.arrange(meanplot1,varplot1,ncol=2)
```
The histogram on the left shows that the distibution of means sampled from the exponential
distribution is roughly normal with the highest density of responses centered
around the theoretical mean (vertical red dashed line). The histogram on the right
shows the right skew of the simulated variance. The theoretical variance (vertical
red dashed line) lies to the right of the highest density in the plot.
### Exact Theoretical and Simulated Statistics
The actual values of the theoretical and simulated means and varinaces are presented
below. The simulated values are incredibly close to the theoretical values for
both mean and variance.
```{r simulated vs theoretical actual values}
statcompare<-data.frame(Statistic=c("Theoretical Mean","Simulated Mean",
"Theoretical Variance", "Average Simulated Variance"),
Value=c(1/lambda,mean(simmean$means),1/lambda^2,mean(simvar$VAR)))
statcompare
```
## Normality of Simulated Statistics
I will now examine the distribution of the mean, and variance, of the Monte
Carlo Simulations to see how close they approximate a normal distribution
and conform to the central limit theorem. I will do this in two ways. The first
is to overlay a normal PDF (black curve) on the histograms, and the second is to use a QQ plot.
The QQ plot provides a very obvious comparison of data to the theoretical normal
distribution. Simulated data that does not fall on the diagonal reference line
does not conform to a normal distribution, with larger deviations indicating a
greater departure from normality.The dark gray region surrounding the diagonal
reference line is a 95% confidence interval so as to indicate how much the
simulated data differs or conforms to a normal distribution.
```{r Figure to demonstrate normality}
#Creating histogram of simulated means with normal pdf overlayed
meanplot2<-meanplot+
geom_histogram(aes(y=..density..),binwidth = binsize,color="black",fill="lightblue")+
geom_vline(data=Tmean,aes(xintercept=val),color='red',linetype="dashed",size=1.5)+
stat_function(fun=dnorm,lwd=1.5,args=list(mean=mean(simmean$means),
sd=sd(simmean$means)))
#Creating histogram of simulated variances with normal pdf overlayed
varplot2<-varplot+
geom_histogram(aes(y=..density..),binwidth=2.5,color="black",fill="darkgreen")+
geom_vline(data=Tvar,aes(xintercept=val),color='red',linetype="dashed",size=1.5)+
stat_function(fun=dnorm,lwd=1.5,args=list(mean=mean(simvar$VAR),
sd=sd(simvar$VAR)))
# The stat_qq_* functions are from the qqplotr package and generate a 95% confidence
# interval around a reference line that indicates normality.
qqmeanplot<-ggplot(simmean,aes(sample=means))+
stat_qq_band()+
stat_qq_line(color='red',lwd=1)+
stat_qq_point() +
labs(title="QQ plot of Simulated Means \nfrom Exponential Distribution") +
xlab("Theoretical Normal Values")+ylab("Simulated Values")+
theme(plot.title = element_text(hjust = 0.5))
qqvarplot<-ggplot(simvar,aes(sample=VAR))+
stat_qq_band()+
stat_qq_line(color='red',lwd=1)+
stat_qq_point() +
labs(title="QQ plot of Simulated Variance \nfrom Exponential Distribution") +
xlab("Theoretical Normal Values")+ylab("Simulated Values")+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(meanplot2,varplot2,qqmeanplot,qqvarplot,nrow=2,ncol=2)
```
The plots in the left column of this figure show that the distribution of
simulated means approximates a normal distribution with the exception of a few values.
On the other hand, the plots in the right column show the distribution of simulated
variances is highly skewed and deviates from normality.The QQ plot shows that the simulated variances deviate both in the leading edge of the distribution and the tail of the distribution. This is because the sample size used in the simulation (n=40) is too small of a sample
for variance to conform to a normal distribution as the central limit theorem
states. Variance is heavily influenced by skewed distribution, and given the
heavily skewed nature of the exponential distribution this result is not surprising.
For the interested reader, in the appendix, I will re-run the simulation with a
larger sample and demonstrate how this produces a normally distributed variance.
### Exponential Simulation with Large Sample Size
```{r normality of simulated data-larger sample size}
#I will still use the same number of simulations for this exercise
#size of samples
n=300
#Simulation
set.seed(11111)
sim2<-matrix(rexp(n*S,lambda),nrow=S,ncol=n)
#Calculating means and variances using the newly simulated data
simmean2<-data.frame(means=rowMeans(sim2))
simvar2<-data.frame(VAR=apply(sim2,1,var))
simsd2<-data.frame(SD=apply(sim2,1,sd))
binsize=.06
#New plot for the distribution of simulated means with n=300
meanplot3<-ggplot(data=simmean2,aes(means))+
ggtitle("Distribution of Means \nof Exponential Simulation n=300") +
xlab("Means")+
theme(plot.title = element_text(hjust = 0.5),
legend.title=element_blank())+
geom_histogram(aes(y=..density..),binwidth = binsize,color="black",fill="lightblue")+
geom_vline(data=Tmean,aes(xintercept=val),color='red',linetype="dashed",size=1.5)+
stat_function(fun=dnorm,lwd=1.5,args=list(mean=mean(simmean2$means),
sd=sd(simmean2$means)))
varplot3<-ggplot(data=simvar2,aes(VAR))+
ggtitle("Distribution of Variance \nof Exponential Simulation n=300") +
xlab("Variance")+
theme(plot.title = element_text(hjust = 0.5),
legend.title=element_blank())+
geom_histogram(aes(y=..density..),binwidth=.7,color="black",fill="darkgreen")+
geom_vline(data=Tvar,aes(xintercept=val),color='red',linetype="dashed",size=1.5)+
stat_function(fun=dnorm,lwd=1.5,args=list(mean=mean(simvar2$VAR),
sd=sd(simvar2$VAR)))
qqmeanplot2<-ggplot(simmean2,aes(sample=means))+
stat_qq_band()+
stat_qq_line(color='red',lwd=1)+
stat_qq_point() +
labs(title="QQ plot of Simulated Means from \nExponential Distribution n=300") +
xlab("Theoretical Normal Values")+ylab("Simulated Values")+
theme(plot.title = element_text(hjust = 0.5))
qqvarplot2<-ggplot(simvar2,aes(sample=VAR))+
stat_qq_band()+
stat_qq_line(color='red',lwd=1)+
stat_qq_point() +
labs(title="QQ plot of Simulated Variance from \nExponential Distribution n=300") +
xlab("Theoretical Normal Values")+ylab("Simulated Values")+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(meanplot3,varplot3,qqmeanplot2,qqvarplot2,nrow=2,ncol=2)
```
With 300 samples in each simulation, both the distributions of simulated means and
variances appear to be normally distributed. The pdf of the normal distribution
(black curve) overlays the histogram of simulated values and conforms quite well.
In the QQ plot most all values for the simulated means and variances fall within
the 95% confidence interval (dark grey area) indicating the distributions of
simulated means and variances are normally distributed. It can also be observed
that spread of the simulated means and variances is much more concentrated around
the true theoretical values in comparison to the previous simulations with n=40.
This is an excellent demonstration of the power of the central limit theorem.
### Appendix
```{r, ref.label=knitr::all_labels(),echo=TRUE,eval=FALSE}
```