# Motive/Background

This analysis is part 1 of 2 of the Statistical Inference course project in Johns Hopkins University’s Data Science Specialization. This first part investigates the Central Limit Theorem by looking at the distribution of 1000 simulated averages, each taken over 40 exponentially distributed values.

# Writeup

**Click here** to view the full RPubs writeup/report for this project with all of the included code. The rest of this document will summarize/restate what is already on that page.

**Simulation Exercise: Investigating the Exponential Distribution**

To begin, we are asked to investigate the exponential distribution and compare it with the Central Limit Theorem. In R, we can simulate exponential values using `rexp(n, rate)`, where `rate` is lambda (𝜆). For every simulation from here on, we fix the rate parameter at 0.2.

**Analyzing a sample of 40 random exponential values**

Let's begin by loading some packages and setting our seed:
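The original code is on the RPubs page and is not reproduced in this summary. A minimal sketch of the setup, assuming `ggplot2` is the only package needed (it is the only one named here) and using an arbitrary seed of 123 as a placeholder:

```r
# ggplot2 is the only package the text names; any others are unknown
library(ggplot2)

# The seed value is an assumption, chosen only to make runs reproducible
set.seed(123)
```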

We will now simulate 40 random exponential values with a rate (lambda) of 0.2, and assign them to an object named *nums*:
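A sketch of this step, assuming the same placeholder seed as above (the variable name `lambda` is introduced here for illustration):

```r
set.seed(123)   # assumed seed, not necessarily the one used in the report
lambda <- 0.2   # rate parameter fixed for the whole analysis

# Draw 40 values from an exponential distribution with rate lambda
nums <- rexp(40, rate = lambda)
head(nums)
```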

Let's now take a look at a histogram of these 40 values using ggplot:
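One way such a histogram could be drawn is sketched below. The bin count, colours, and labels are illustrative choices, not taken from the original report:

```r
library(ggplot2)

set.seed(123)                      # assumed seed
nums <- rexp(40, rate = 0.2)

# Histogram of the 40 values with a dotted vertical line at the sample mean
p <- ggplot(data.frame(value = nums), aes(x = value)) +
  geom_histogram(bins = 10, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = mean(nums), linetype = "dotted") +
  labs(title = "40 simulated exponential values", x = "Value", y = "Count")
print(p)
```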

The dotted line marks the mean of this particular sample.

**Comparing the sample mean to the theoretical mean of the distribution**

To confirm the sample mean numerically:
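A minimal sketch, again assuming a placeholder seed, so the printed value will differ from the one in the report:

```r
set.seed(123)                      # assumed seed
nums <- rexp(40, rate = 0.2)

# Sample mean of the 40 simulated values
sample_mean <- mean(nums)
sample_mean
```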

Now let's compare that to what the mean should be in theory. The mean (as well as the standard deviation) of an exponential distribution is equal to (1/𝜆). As stated previously, we are using a lambda (𝜆) equal to 0.2:
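The theoretical mean follows directly from the rate, as this short sketch shows (the name `theoretical_mean` is illustrative):

```r
lambda <- 0.2

# Mean of an exponential distribution is 1 / lambda
theoretical_mean <- 1 / lambda
theoretical_mean   # 5
```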

As you can see, our sample mean is fairly close to the theoretical mean of the exponential distribution, but not exactly equal to it.

**Comparing the sample variance to the theoretical variance of the distribution**

Now let's do the same for the variance. To calculate our sample variance we run the following code:
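A sketch of the sample variance calculation, assuming a placeholder seed. Note that `var()` in R uses the n - 1 denominator:

```r
set.seed(123)                      # assumed seed
nums <- rexp(40, rate = 0.2)

# Unbiased sample variance of the 40 simulated values
sample_var <- var(nums)
sample_var
```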

Since the mean and standard deviation of an exponential distribution are the same, we can simply square our prior (1/𝜆) value to get our theoretical variance:
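In code, the theoretical variance is just the square of (1/𝜆), sketched here with an illustrative variable name:

```r
lambda <- 0.2

# Standard deviation is 1 / lambda, so the variance is (1 / lambda)^2
theoretical_var <- (1 / lambda)^2
theoretical_var   # 25
```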

Once again, our sample variance is fairly close to the theoretical variance of the exponential distribution, but not exactly the same.

**Showing that the distribution of means is approximately normal using the C.L.T.**

The Central Limit Theorem is arguably one of the most important theorems in all of statistics. From an article on Scribbr.com written by Shaun Turney:

“The central limit theorem states that if you take sufficiently large samples from a population, the samples’ means will be normally distributed, even if the population isn’t normally distributed.”

To show that this is true using simulations, we will be examining the difference between the distribution of a large collection of random exponential values and the distribution of a large collection of averages of 40 exponential values.

Let's begin by simulating another set of random exponential values, this time with 1000 values instead of 40. These values will be assigned to an object named *nums_1000*:
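A sketch of this larger simulation, under the same assumed seed and rate:

```r
set.seed(123)   # assumed seed
lambda <- 0.2

# Draw 1000 exponential values instead of 40
nums_1000 <- rexp(1000, rate = lambda)
mean(nums_1000)
```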

Let's take a look at how this distribution appears now that we have many more values:
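The plot could be produced along these lines. The styling and the dashed line at the theoretical mean are illustrative additions, not confirmed details of the original figure:

```r
library(ggplot2)

set.seed(123)                      # assumed seed
lambda <- 0.2
nums_1000 <- rexp(1000, rate = lambda)

# Histogram of 1000 raw exponential values; dotted line = sample mean,
# dashed line = theoretical mean 1 / lambda
p_1000 <- ggplot(data.frame(value = nums_1000), aes(x = value)) +
  geom_histogram(bins = 40, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = mean(nums_1000), linetype = "dotted") +
  geom_vline(xintercept = 1 / lambda, linetype = "dashed") +
  labs(title = "1000 simulated exponential values", x = "Value", y = "Count")
print(p_1000)
```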

As you can see, the shape of the distribution looks much more clearly exponential now that we have more values in our sample. Once again, our sample mean lies close to the theoretical mean, which in this case is 5.

Now, let's investigate the distribution of 1000 **averages** of 40 exponential values each. To do this, we will use a simple `for` loop that calculates the mean of 40 randomly generated exponential values, 1000 times. Each time a mean is calculated, it is added to the object named *avgs*:
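A minimal sketch of such a loop is shown below. Preallocating `avgs` is a choice made here for clarity; the original may instead grow the vector on each iteration. The names `n` and `n_sims` are illustrative:

```r
set.seed(123)    # assumed seed
lambda <- 0.2
n      <- 40     # values per average
n_sims <- 1000   # number of averages

# Preallocate the result vector, then fill in one mean per iteration
avgs <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  avgs[i] <- mean(rexp(n, rate = lambda))
}
mean(avgs)
```

The same result could also be written without an explicit loop, for example with `replicate(n_sims, mean(rexp(n, rate = lambda)))`.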

Now let's plot the distribution of these 1000 averages:
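A sketch of this plot, with illustrative styling and a dotted line at the theoretical mean (whether the original figure marks the sample or the theoretical mean is not stated here):

```r
library(ggplot2)

set.seed(123)    # assumed seed
lambda <- 0.2
avgs <- replicate(1000, mean(rexp(40, rate = lambda)))

# Histogram of the 1000 averages with the theoretical mean marked
p_avgs <- ggplot(data.frame(average = avgs), aes(x = average)) +
  geom_histogram(bins = 40, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = 1 / lambda, linetype = "dotted") +
  labs(title = "1000 averages of 40 exponential values",
       x = "Average", y = "Count")
print(p_avgs)
```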

As you can see, this distribution no longer looks exponential, but rather approximately normal! Furthermore, it is once again centered around the theoretical mean, 5. The mean of the simulated averages would only equal (1/𝜆) exactly in the limit of infinitely many simulations, but 1000 averages are already enough to illustrate the point of the Central Limit Theorem.
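As a rough numerical check that goes slightly beyond the plots, the Central Limit Theorem also predicts the spread of the averages: their standard deviation should be close to (1/𝜆)/√40, about 0.79. The sketch below compares the simulated and theoretical values, using an assumed seed and illustrative variable names:

```r
set.seed(123)    # assumed seed
lambda <- 0.2
n      <- 40
avgs <- replicate(1000, mean(rexp(n, rate = lambda)))

# Theoretical mean and standard deviation of the average of n values
theoretical_mean <- 1 / lambda
theoretical_sd   <- (1 / lambda) / sqrt(n)

# Compare simulated and theoretical values side by side
comparison <- data.frame(
  quantity    = c("mean", "sd"),
  simulated   = c(mean(avgs), sd(avgs)),
  theoretical = c(theoretical_mean, theoretical_sd)
)
print(comparison)
```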