==Standard deviation and Variance==
===SD/Variance of Quantitative data ~ mean===
{|class="wikitable" style="text-align:center"
|-
!style="width:80px"|
!style="width:30px"|Size
!style="width:110px"|Mean
!style="width:180px"|Standard deviation
!style="width:170px"|Variance
!style="width:250px"|notes
|-
!Population
|<math>N</math>
|<math>\mu = \frac{\textstyle \sum_{i=1}^N X_i}{N}</math>
|<math>\sigma=\sqrt{\frac{\sum_{i=1}^N (X_i-\mu)^2}{N}}</math>
|<math>\sigma^2=\frac{\sum_{i=1}^N (X_i-\mu)^2}{N}</math>
|style="text-align:left"|
*<math>X_i</math> is each value in population
**'''Quantitative'''
***continuous or discrete
|-
!Sample
|<math>n</math>
|<math>\overline{x} = \frac{\textstyle \sum_{i=1}^n x_i}{n}</math>
|<math>s=\sqrt{\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{\color{Red}n-1}}</math>
|<math>s^2=\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{\color{Red}n-1}</math>
|style="text-align:left"|
*<math>x_i</math> is each value in sample
**'''Quantitative'''
***continuous or discrete
*<math>\color{Red}n-1</math> is derived from the degrees of freedom
|}
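The two divisors can be checked with Python's standard library, where <code>statistics.pstdev</code> uses the population divisor <math>N</math> and <code>statistics.stdev</code> uses <math>n-1</math> (the data values below are made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5

# Treating data as the whole population: divide by N
pop_sd = statistics.pstdev(data)     # sigma = sqrt(32 / 8) = 2.0
# Treating data as a sample: divide by n - 1 (degrees of freedom)
sample_sd = statistics.stdev(data)   # s = sqrt(32 / 7), slightly larger

print(pop_sd, sample_sd)
```

Because the divisor <math>n-1</math> is smaller than <math>N</math>, the sample statistic always comes out a little larger on the same values.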
===SD/Variance of Binomial data ~ proportion===
{|class="wikitable" style="text-align:center"
|-
!style="width:80px"|
!style="width:30px"|Size
!style="width:110px"|Proportion
!style="width:180px"|Standard deviation ¶
!style="width:170px"|Variance ¶
!style="width:250px"|notes
|-
!Population
|<math>N</math>
|<math>\pi = \frac{\sum_{i=1}^N X_i}{N}</math>
<nowiki>*</nowiki> <math>X_i = 0</math> or <math>1</math>
|<math>\sigma = \sqrt{\pi (1 - \pi)}</math>
|<math>\sigma^2 = \pi (1 - \pi)</math>
|style="text-align:left"|
*<math>X_i</math> is each value in population
**'''Binary'''
***'''0''' or '''1'''
|-
!Sample
|<math>n</math>
|<math>p = \frac{\sum_{i=1}^n x_i}{n}</math>
<nowiki>*</nowiki> <math>x_i = 0</math> or <math>1</math>
|<math>
\begin{align}
s & = \sqrt{\frac{n}{n-1} \cdot p (1 - p)} \\
& \approx \sqrt{p (1-p)}
\end{align}
</math>
|<math>
\begin{align}
s^2 & = \frac{n}{n-1} \cdot p (1 - p) \\
& \approx p (1 - p)
\end{align}
</math>
|style="text-align:left"|
*<math>x_i</math> is each value in sample
**'''Binary'''
***'''0''' or '''1'''
|}
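A quick numeric sketch (with made-up 0/1 data) showing that the <math>\frac{n}{n-1}</math>-corrected <math>p(1-p)</math> formula agrees exactly with the ordinary sample-variance formula applied to the raw 0/1 values:

```python
import statistics

# Binary sample: each x_i is 0 or 1 (made-up data)
xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n = len(xs)
p = sum(xs) / n                        # sample proportion = 0.7

# Exact sample variance via the n/(n-1) correction, and its approximation
s2_exact = n / (n - 1) * p * (1 - p)
s2_approx = p * (1 - p)

# The corrected form matches the usual sample-variance formula on the 0/1 values
assert abs(s2_exact - statistics.variance(xs)) < 1e-9
```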
<div class="toccolours mw-collapsible mw-collapsed" style="width:450px">
¶ How to derive '''variance''' and '''standard deviation''' of '''proportion''' in population:
<div class="mw-collapsible-content">
The variance of values in a population is defined as <math>\frac{\sum_{i=1}^N (X_i - \mu)^2}{N}</math> .
Here, <math>{\color{Green}\mu}</math> is <math>{\color{Green}\frac{\sum_{i=1}^N X_i}{N}}</math> according to its definition.
This is <math>{\color{Green}\pi}</math> itself (refer to the above table).
::<math>{\color{Green}\mu} = {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} = {\color{Green}\pi}</math>
And when we consider <math>{\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}}</math> , note that <math>X_i = 0</math> or <math>1</math> implies <math>X_i^2 = X_i</math>, so:
::<math>{\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} = {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} = {\color{Green}\pi}</math>
Thus the variance of the population proportion can be calculated as follows:
::<math>
\begin{align}
\sigma^2 & = \frac{\sum_{i=1}^N (X_i - {\color{Green}\mu})^2}{N} \\
& = \frac{\sum_{i=1}^N (X_i - {\color{Green}\pi})^2}{N} \\
& = \frac{\sum_{i=1}^N (X_i^2 - 2 {\color{Green}\pi} \cdot X_i + {\color{Green}\pi^2})}{N} \\
& = \frac{\sum_{i=1}^N X_i^2}{N} - 2 {\color{Green}\pi} \cdot \frac{\sum_{i=1}^N X_i}{N} + {\color{Green}\pi^2} \cdot \frac{\sum_{i=1}^N 1}{N}\\
& = {\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} - 2 {\color{Green}\pi} \cdot {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} + {\color{Green}\pi^2} \cdot \frac{{\color{Orange}\sum_{i=1}^N 1}}{N}\\
& = {\color{Green}\pi} - 2 {\color{Green}\pi} \cdot {\color{Green}\pi} + {\color{Green}\pi^2} \cdot \frac{\color{Orange}N}{N} \\
& = \pi - 2\pi^2 + \pi^2 \\
& = \pi - \pi^2 \\
& = \pi(1-\pi)
\end{align}
</math>
Then standard deviation is also obtained:
::<math>
\begin{align}
\sigma & = \sqrt{\sigma^2} \\
& = \sqrt{\pi(1-\pi)}
\end{align}
</math>
</div>
</div>
<div class="toccolours mw-collapsible mw-collapsed" style="width:450px">
¶ How to derive '''variance''' and '''standard deviation''' of '''proportion''' in sample:
<div class="mw-collapsible-content">
The variance of values in a sample is defined as <math>\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}</math> .
This can be transformed into <math>\frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}</math> .
Here, <math>{\color{Green}\bar x}</math> is <math>{\color{Green}\frac{\sum_{i=1}^n x_i}{n}}</math> according to its definition.
This is <math>{\color{Green}p}</math> itself (refer to the above table).
::<math>{\color{Green}\bar x} = {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} = {\color{Green}p}</math>
And when we consider <math>{\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}}</math> , note that <math>x_i = 0</math> or <math>1</math> implies <math>x_i^2 = x_i</math>, so:
::<math>{\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} = {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} = {\color{Green}p}</math>
Thus the variance of the sample proportion can be calculated as follows:
::<math>
\begin{align}
s^2 & = \frac{\sum_{i=1}^n (x_i - {\color{Green}\bar x})^2}{n-1} \\
& = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - {\color{Green}\bar x})^2}{n} \\
& = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - {\color{Green}p})^2}{n} \\
& = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i^2 - 2 {\color{Green}p} \cdot x_i + {\color{Green}p^2})}{n} \\
& = \frac{n}{n-1} \left ( \frac{\sum_{i=1}^n x_i^2}{n} - 2 {\color{Green}p} \cdot \frac{\sum_{i=1}^n x_i}{n} + {\color{Green}p^2} \cdot \frac{\sum_{i=1}^n 1}{n} \right ) \\
& = \frac{n}{n-1} \left ( {\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} - 2 {\color{Green}p} \cdot {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} + {\color{Green}p^2} \cdot \frac{{\color{Orange}\sum_{i=1}^n 1}}{n} \right ) \\
& = \frac{n}{n-1} \left ( {\color{Green}p} - 2 {\color{Green}p} \cdot {\color{Green}p} + {\color{Green}p^2} \cdot \frac{\color{Orange}n}{n} \right ) \\
& = \frac{n}{n-1} \left ( p - 2p^2 + p^2 \right ) \\
& = \frac{n}{n-1} \left ( p - p^2 \right ) \\
& = \frac{n}{n-1} \cdot p(1-p)
\end{align}
</math>
Here, if <math>n</math> is large enough, <math>\frac{n}{n-1} \approx 1</math> and the factor can be dropped from the calculation.
::<math>
s^2 \approx p(1-p)
</math>
Then standard deviation is also obtained:
::<math>
\begin{align}
s & = \sqrt{s^2} \\
& = \sqrt{\frac{n}{n-1} \cdot p(1-p)} \\
& \approx \sqrt{p(1-p)}
\end{align}
</math>
</div>
</div>
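The population derivation above can be verified numerically on any made-up 0/1 population: the definition-based variance and <math>\pi(1-\pi)</math> agree exactly.

```python
# Made-up population of 0/1 values
X = [1, 1, 0, 1, 0, 0, 1, 1]
N = len(X)
pi = sum(X) / N                               # population proportion = 0.625

# Definition-based population variance ...
var_def = sum((x - pi) ** 2 for x in X) / N
# ... equals pi * (1 - pi), as derived above
var_formula = pi * (1 - pi)

print(var_def, var_formula)
```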
==Standard Error==
If we repeated '''sampling from the population infinitely many times''', each time with a sample size of <math>n</math> (<math>n</math> being large enough), then no matter what the population distribution is, those infinitely many sample mean'''s''' follow a normal distribution whose mean is identical to the population mean <math>\mu</math> and whose variance is <math>\frac{\sigma^2}{n}</math>, derived from the population variance <math>\sigma^2</math> (not the population variance itself). This is the '''central limit theorem'''.
{{quote|content=Derivation of <math>\frac{\sigma^2}{n}</math> requires more advanced mathematics such as the ''Maclaurin expansion'', ''characteristic functions'' or ''moment-generating functions''.}}
Hence the standard deviation of sample mean'''s''' is the square root of this variance: <math>\sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}</math>.
[[File:Central limit theorem.png|none|400px]]
This standard deviation of sample mean'''s''', <math>\frac{\sigma}{\sqrt{n}}</math>, is defined as the '''Standard error'''.
In reality, the population mean <math>\mu</math> and population standard deviation <math>\sigma</math> are unknown, so the only way to utilize the '''standard error''' is to assume that the '''sample standard deviation''' is '''close to the population standard deviation''':
<math>Standard\ error \approx \frac{s}{\sqrt{n}}</math>, where <math>s</math> = sample standard deviation
{|class="wikitable"
|-
!
!Standard error
!notes
|-
!mean
|<math>
\begin{align}
SEM & = \frac{\sigma}{\sqrt{n}} \\
& \approx \frac{s}{\sqrt{n}}
\end{align}
</math>
|rowspan="2"|
*<math>\sigma</math> is population standard deviation
*<math>N</math> is population size
*<math>\pi</math> is population proportion
*<math>s</math> is sample standard deviation
*<math>n</math> is sample size
*<math>p</math> is sample proportion
|-
!proportion
|<math>
\begin{align}
SE_p & = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\pi (1-\pi)}{n}} \\
& \approx \frac{s}{\sqrt{n}} = \sqrt{\frac{p (1-p)}{n}}
\end{align}
</math>
|}
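A minimal sketch of both standard-error formulas, using made-up sample data and an assumed proportion of 35 successes out of 50 trials:

```python
import math
import statistics

# Standard error of the mean, from a made-up sample
sample = [4.1, 5.2, 6.3, 5.0, 4.8, 5.5, 6.0, 4.9]
n = len(sample)
s = statistics.stdev(sample)     # sample standard deviation approximates sigma
sem = s / math.sqrt(n)

# Standard error of a proportion: assumed 35 successes in 50 trials
p = 35 / 50
se_p = math.sqrt(p * (1 - p) / 50)

print(sem, se_p)
```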
==Standard Error and Confidence Interval==
===When sample size is large enough and assumed to follow normal distribution===
According to the central limit theorem, the distribution of sample mean'''s''' follows a '''normal distribution''' with mean <math>\mu</math> (the population mean) and standard deviation <math>\frac{\sigma}{\sqrt{n}}</math>.
The mean of one single sample will lie somewhere within the distribution of sample mean'''s''', around their mean = <math>\mu</math> (the population mean!) with standard deviation <math>\frac{\sigma}{\sqrt{n}}</math>.
As a simple rule, in a '''normal distribution''', each range of ±<math>k</math> SD contains the following proportion of the total values.
{|class="wikitable"
|-
!±<math>k</math> SD
!Proportion
|-
!style="text-align:left"|±1 SD
|68.2%
|-
!style="text-align:left"|±1.96 SD
|95 %
|-
!style="text-align:left"|±2 SD
|95.4%
|-
!style="text-align:left"|±2.58 SD
|99 %
|-
!style="text-align:left"|±3 SD
|99.7%
|}
We cannot know '''how far''' a single sample mean <math>\bar{x}</math> is from <u>the true mean of sample means = the population mean <math>\mu</math></u>,
but we can estimate '''the probability''' that '''a certain range''' around a single sample mean contains <u>the true mean of sample means = the population mean <math>\mu</math></u> according to the above table.
The standard deviation of sample mean'''s''' = Standard Error is <math>\frac{\sigma}{\sqrt{n}}</math>,
and it can be approximated by using the standard deviation of a single sample <math>s</math> as <math>\frac{s}{\sqrt{n}}</math>.
Thus, <math>\bar{x}\ \pm\ k \frac{s}{\sqrt{n}}</math> is a range around a single sample mean <math>\bar{x}</math>, and its corresponding proportion is the probability that the range contains <math>\mu</math>.
<math>\bar{x}\ \pm\ 1.96 \frac{s}{\sqrt{n}}</math> is the 95% Confidence Interval, and <math>\bar{x}\ \pm\ 2.58 \frac{s}{\sqrt{n}}</math> is the 99% Confidence Interval.
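A short sketch of both intervals, with an illustrative (made-up) sample of <math>n = 30</math> values:

```python
import math
import statistics

# Made-up sample, n = 30 (large enough for the normal approximation)
sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0,
          12.1, 11.9, 12.6, 12.2, 11.8, 12.0, 12.3, 12.1, 11.9, 12.2,
          12.0, 12.4, 11.8, 12.1, 12.2, 12.0, 11.9, 12.3, 12.1, 12.0]
n = len(sample)
xbar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)   # approximate standard error

ci95 = (xbar - 1.96 * se, xbar + 1.96 * se)    # 95% confidence interval
ci99 = (xbar - 2.58 * se, xbar + 2.58 * se)    # 99% (wider) confidence interval

print(ci95, ci99)
```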
===When sample size is small (roughly <30) and assumed to follow t distribution===
We have to refer to the '''t-distribution table''' instead of the normal distribution table (Z table),
and take into account the '''degrees of freedom''', <math>n-1</math>.
Find the relevant coefficient <math>k</math> for <math>\bar{x}\ \pm\ k \frac{s}{\sqrt{n}}</math> in the t-distribution table by using the desired CI range and the degrees of freedom.
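A sketch with a made-up sample of <math>n = 10</math>; the coefficient <math>k \approx 2.262</math> is the two-tailed 95% t value for 9 degrees of freedom, taken from a t-distribution table (larger than the normal-distribution 1.96, so the interval is wider):

```python
import math
import statistics

sample = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0, 4.9]  # made-up, n = 10
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)

k = 2.262  # t-table value for 95% CI with df = n - 1 = 9
ci = (xbar - k * s / math.sqrt(n), xbar + k * s / math.sqrt(n))

print(ci)
```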