==Standard deviation and Variance==
===SD/Variance of Quantitative data ~ mean===
{|class="wikitable" style="text-align:center"
|-
!style="width:80px"|
!style="width:30px"|Size
!style="width:110px"|Mean
!style="width:180px"|Standard deviation
!style="width:170px"|Variance
!style="width:250px"|notes
|-
!Population
|<math>N</math>
|<math>\mu = \frac{\textstyle \sum_{i=1}^N X_i}{N}</math>
|<math>\sigma=\sqrt{\frac{\sum_{i=1}^N (X_i-\mu)^2}{N}}</math>
|<math>\sigma^2=\frac{\sum_{i=1}^N (X_i-\mu)^2}{N}</math>
|style="text-align:left"|
*<math>X_i</math> is each value in population
**'''Quantitative'''
***continuous or discrete
|-
!Sample
|<math>n</math>
|<math>\overline{x} = \frac{\textstyle \sum_{i=1}^n x_i}{n}</math>
|<math>s=\sqrt{\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{\color{Red}n-1}}</math>
|<math>s^2=\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{\color{Red}n-1}</math>
|style="text-align:left"|
*<math>x_i</math> is each value in sample
**'''Quantitative'''
***continuous or discrete
*<math>\color{Red}n-1</math> is derived from the degrees of freedom
|}
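
A minimal numeric sketch of the formulas above, assuming NumPy is available (the data values are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data, N = n = 8

# Treated as the whole population: divide by N (ddof=0)
mu     = x.mean()          # population mean = 5.0
sigma2 = x.var(ddof=0)     # population variance = sum((x - mu)**2) / N = 4.0
sigma  = x.std(ddof=0)     # population standard deviation = 2.0

# Treated as a sample: divide by n - 1 (ddof=1, the degrees of freedom)
xbar = x.mean()            # sample mean = 5.0
s2   = x.var(ddof=1)       # sample variance = sum((x - xbar)**2) / (n - 1) ~ 4.571
s    = x.std(ddof=1)       # sample standard deviation ~ 2.138
</syntaxhighlight>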

===SD/Variance of Binomial data ~ proportion===
{|class="wikitable" style="text-align:center"|
|-
!style="width:80px"|
!style="width:30px"|Size
!style="width:110px"|Proportion
!style="width:180px"|Standard deviation &para;
!style="width:170px"|Variance &para;
!style="width:250px"|notes
|-
!Population
|<math>N</math>
|<math>\pi = \frac{\sum_{i=1}^N X_i}{N}</math>
<nowiki>*</nowiki> <math>X_i = 0</math> or <math>1</math>
|<math>\sigma = \sqrt{\pi (1 - \pi)}</math>
|<math>\sigma^2 = \pi (1 - \pi)</math>
|style="text-align:left"|
*<math>X_i</math> is each value in population
**'''Binary'''
***'''0''' or '''1'''
|-
!Sample
|<math>n</math>
|<math>p = \frac{\sum_{i=1}^n x_i}{n}</math>
<nowiki>*</nowiki> <math>x_i = 0</math> or <math>1</math>
|<math>
\begin{align}
s & = \sqrt{\frac{n}{n-1} \cdot p (1 - p)} \\
& \approx \sqrt{p (1-p)}
\end{align}
</math>
|<math>
\begin{align}
s^2 & = \frac{n}{n-1} \cdot p (1 - p) \\
& \approx p (1 - p)
\end{align}
</math>
|style="text-align:left"|
*<math>x_i</math> is each value in sample
**'''Binary'''
***'''0''' or '''1'''
|}
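
The same kind of hypothetical NumPy sketch for binary data: the sample variance of 0/1 values computed with the <math>n-1</math> divisor equals <math>\frac{n}{n-1} \cdot p(1-p)</math> exactly, and <math>p(1-p)</math> is the large-sample approximation.

<syntaxhighlight lang="python">
import numpy as np

x = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])  # hypothetical 0/1 observations, n = 10
n = len(x)

p          = x.mean()                    # sample proportion = 0.4
s2_exact   = x.var(ddof=1)               # sample variance with the n - 1 divisor ~ 0.2667
s2_formula = n / (n - 1) * p * (1 - p)   # same value, computed from p directly
s2_approx  = p * (1 - p)                 # large-sample approximation = 0.24
</syntaxhighlight>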

<div class="toccolours mw-collapsible mw-collapsed" style="width:450px">
&para; How to derive '''variance''' and '''standard deviation''' of '''proportion''' in population:

<div class="mw-collapsible-content">
The definition of the variance of the values in a population is <math>\frac{\sum_{i=1}^N (X_i - \mu)^2}{N}</math> .

Here, <math>{\color{Green}\mu}</math> is <math>{\color{Green}\frac{\sum_{i=1}^N X_i}{N}}</math> according to its definition.

This is <math>{\color{Green}\pi}</math> itself (refer to the table above).


::<math>{\color{Green}\mu} = {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} = {\color{Green}\pi}</math>


And when we consider <math>{\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}}</math> , provided that <math>X_i = 0</math> or <math>1</math>, it follows that:


::<math>{\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} = {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} = {\color{Green}\pi}</math>


Thus the variance of the population proportion can be calculated as follows:


::<math>
\begin{align}
\sigma^2 & = \frac{\sum_{i=1}^N (X_i - {\color{Green}\mu})^2}{N} \\

& = \frac{\sum_{i=1}^N (X_i - {\color{Green}\pi})^2}{N} \\

& = \frac{\sum_{i=1}^N (X_i^2 - 2 {\color{Green}\pi} \cdot X_i + {\color{Green}\pi^2})}{N} \\

& = \frac{\sum_{i=1}^N X_i^2}{N} - 2 {\color{Green}\pi} \cdot \frac{\sum_{i=1}^N X_i}{N} + {\color{Green}\pi^2} \cdot \frac{\sum_{i=1}^N 1}{N}\\

& = {\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} - 2 {\color{Green}\pi} \cdot {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} + {\color{Green}\pi^2} \cdot \frac{{\color{Orange}\sum_{i=1}^N 1}}{N}\\

& = {\color{Green}\pi} - 2 {\color{Green}\pi} \cdot {\color{Green}\pi} + {\color{Green}\pi^2} \cdot \frac{\color{Orange}N}{N} \\

& = \pi - 2\pi^2 + \pi^2 \\

& = \pi - \pi^2 \\

& = \pi(1-\pi)
\end{align}
</math>


Then the standard deviation is also obtained:


::<math>
\begin{align}
\sigma & = \sqrt{\sigma^2} \\
& = \sqrt{\pi(1-\pi)}
\end{align}
</math>
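
As a quick check with hypothetical numbers: if a population of <math>N = 10</math> binary values contains three 1s, then <math>\pi = 0.3</math> and

::<math>\sigma^2 = \frac{3 (1-0.3)^2 + 7 (0-0.3)^2}{10} = \frac{1.47 + 0.63}{10} = 0.21 = 0.3 \times 0.7 = \pi(1-\pi)</math>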

</div>
</div>


<div class="toccolours mw-collapsible mw-collapsed" style="width:450px">
&para; How to derive '''variance''' and '''standard deviation''' of '''proportion''' in sample:

<div class="mw-collapsible-content">
The definition of the variance of the values in a sample is <math>\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}</math> .

This can be transformed into <math>\frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}</math> .

Here, <math>{\color{Green}\bar x}</math> is <math>{\color{Green}\frac{\sum_{i=1}^n x_i}{n}}</math> according to its definition.

This is <math>{\color{Green}p}</math> itself (refer to the above table).


::<math>{\color{Green}\bar x} = {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} = {\color{Green}p}</math>


And when we consider <math>{\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}}</math> , provided that <math>x_i = 0</math> or <math>1</math>, it follows that:


::<math>{\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} = {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} = {\color{Green}p}</math>


Thus the variance of the sample proportion can be calculated as follows:


::<math>
\begin{align}
s^2 & = \frac{\sum_{i=1}^n (x_i - {\color{Green}\bar x})^2}{n-1} \\

& = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - {\color{Green}\bar x})^2}{n} \\

& = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - {\color{Green}p})^2}{n} \\

& = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i^2 - 2 {\color{Green}p} \cdot x_i + {\color{Green}p^2})}{n} \\

& = \frac{n}{n-1} \left ( \frac{\sum_{i=1}^n x_i^2}{n} - 2 {\color{Green}p} \cdot \frac{\sum_{i=1}^n x_i}{n} + {\color{Green}p^2} \cdot \frac{\sum_{i=1}^n 1}{n} \right ) \\

& = \frac{n}{n-1} \left ( {\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} - 2 {\color{Green}p} \cdot {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} + {\color{Green}p^2} \cdot \frac{{\color{Orange}\sum_{i=1}^n 1}}{n} \right ) \\

& = \frac{n}{n-1} \left ( {\color{Green}p} - 2 {\color{Green}p} \cdot {\color{Green}p} + {\color{Green}p^2} \cdot \frac{\color{Orange}n}{n} \right ) \\

& = \frac{n}{n-1} \left ( p - 2p^2 + p^2 \right ) \\

& = \frac{n}{n-1} \left ( p - p^2 \right ) \\

& = \frac{n}{n-1} \cdot p(1-p)
\end{align}
</math>

Here, if <math>n</math> is large enough, <math>\frac{n}{n-1}</math> is close to <math>1</math> and can be dropped from the calculation.

::<math>
s^2 \approx p(1-p)
</math>

Then the standard deviation is also obtained:


::<math>
\begin{align}
s & = \sqrt{s^2} \\
& = \sqrt{\frac{n}{n-1} \cdot p(1-p)} \\
& \approx \sqrt{p(1-p)}
\end{align}
</math>
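
As a quick check with hypothetical numbers: if a sample of <math>n = 10</math> binary values contains three 1s, then <math>p = 0.3</math> and

::<math>s^2 = \frac{3 (1-0.3)^2 + 7 (0-0.3)^2}{10-1} = \frac{2.1}{9} \approx 0.233 = \frac{10}{9} \cdot 0.3 \times 0.7 = \frac{n}{n-1} \cdot p(1-p)</math>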

</div>
</div>

==Standard Error==
If we repeated '''sampling infinitely many times''' from the population with a sample size of <math>n</math> each time (<math>n</math> being large enough), then no matter what the population distribution was, those infinitely many sample mean'''s''' would follow a normal distribution with mean identical to the population mean <math>\mu</math>, and with variance <math>\frac{\sigma^2}{n}</math> derived from the population variance (not the population variance <math>\sigma^2</math> itself). This is the '''central limit theorem'''.

{{quote|content=The derivation of <math>\frac{\sigma^2}{n}</math> requires more advanced mathematics such as the ''Maclaurin expansion'', the ''characteristic function'' or the ''moment-generating function''.}}

Hence the standard deviation of the sample mean'''s''' is derived as the square root of this variance: <math>\sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}</math>.

[[File:Central limit theorem.png|none|400px]]

This standard deviation of the sample mean'''s''', <math>\frac{\sigma}{\sqrt{n}}</math>, is defined as the '''standard error'''.

In reality, only God knows the population mean <math>\mu</math> and the population standard deviation <math>\sigma</math>; thus the only way to utilize the '''standard error''' is to assume that the '''sample standard deviation''' is '''close to the population standard deviation''', as follows:

<math>\text{Standard error} \approx \frac{s}{\sqrt{n}}</math>, where <math>s</math> = sample standard deviation
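
A small simulation sketch of this behaviour, assuming NumPy is available; the population distribution, sample size and number of repetitions are hypothetical choices:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, clearly non-normal population: exponential with sigma = 1
population_sigma = 1.0
n = 50                 # sample size
repeats = 100_000      # number of repeated samplings

# Draw many samples and keep each sample mean
sample_means = rng.exponential(scale=1.0, size=(repeats, n)).mean(axis=1)

print(sample_means.std())               # empirical SD of the sample means
print(population_sigma / np.sqrt(n))    # standard error sigma / sqrt(n) ~ 0.1414
</syntaxhighlight>

The two printed values should be close to each other, and a histogram of <code>sample_means</code> would look approximately normal even though the population itself is skewed.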

{|class="wikitable"
|-
!
!Standard error
!notes
|-
!mean
|<math>
\begin{align}
SEM & = \frac{\sigma}{\sqrt{n}} \\
& \approx \frac{s}{\sqrt{n}}
\end{align}
</math>
|rowspan="2"|
*<math>\sigma</math> is population standard deviation
*<math>N</math> is population size
*<math>\pi</math> is population proportion
*<math>s</math> is sample standard deviation
*<math>n</math> is sample size
*<math>p</math> is sample proportion
|-
!proportion
|<math>
\begin{align}
SE_p & = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\pi (1-\pi)}{n}} \\
& \approx \frac{s}{\sqrt{n}} = \sqrt{\frac{p (1-p)}{n}}
\end{align}
</math>
|}
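
For example, with hypothetical numbers <math>p = 0.3</math> and <math>n = 100</math>:

::<math>SE_p \approx \sqrt{\frac{0.3 \times 0.7}{100}} = \sqrt{0.0021} \approx 0.046</math>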

==Standard Error and Confidence Interval==
===When sample size is large enough and assumed to follow normal distribution===
According to the Central Limit Theorem, the distribution of sample mean'''s''' follows a '''normal distribution''' with mean <math>\mu</math> (the population mean) and standard deviation <math>\frac{\sigma}{\sqrt{n}}</math>.

The mean of one single sample will lie somewhere within the distribution of sample mean'''s''', around their mean = <math>\mu</math> (the population mean!) with standard deviation <math>\frac{\sigma}{\sqrt{n}}</math>.

As a simple rule, in a '''normal distribution''', each range of &plusmn;<math>k</math> SD contains the following proportion of the total values.
{|class="wikitable"
|-
!&plusmn;<math>k</math> SD
!Proportion
|-
!style="text-align:left"|&plusmn;1  SD
|68.2%
|-
!style="text-align:left"|&plusmn;1.96 SD
|95%
|-
!style="text-align:left"|&plusmn;2  SD
|95.4%
|-
!style="text-align:left"|&plusmn;2.58 SD
|99%
|-
!style="text-align:left"|&plusmn;3  SD
|99.7%
|}

We cannot estimate '''how far''' a single sample mean <math>\bar{x}</math> is from <u>the true mean of the sample means = the population mean <math>\mu</math></u>,

but we can estimate '''the probability''' that '''a certain range around''' a single sample mean contains <u>the true mean of the sample means = the population mean <math>\mu</math></u>, according to the table above.

The standard deviation of the sample mean'''s''' = the Standard Error is <math>\frac{\sigma}{\sqrt{n}}</math>,

and it can be approximated by using the standard deviation of a single sample <math>s</math> as <math>\frac{s}{\sqrt{n}}</math>.

Thus, <math>\bar{x}\ \pm\ k \frac{s}{\sqrt{n}}</math> is a range around a single sample mean <math>\bar{x}</math>, and its corresponding proportion is the probability that this range contains <math>\mu</math>.

<math>\bar{x}\ \pm\ 1.96 \frac{s}{\sqrt{n}}</math> is the 95% Confidence Interval, and <math>\bar{x}\ \pm\ 2.58 \frac{s}{\sqrt{n}}</math> is the 99% Confidence Interval.
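
A minimal sketch of both intervals in NumPy, using hypothetical summary statistics:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical summary statistics of one large sample
xbar = 5.2    # sample mean
s    = 1.4    # sample standard deviation
n    = 200    # sample size

se   = s / np.sqrt(n)                        # standard error ~ 0.099
ci95 = (xbar - 1.96 * se, xbar + 1.96 * se)  # approximately (5.01, 5.39)
ci99 = (xbar - 2.58 * se, xbar + 2.58 * se)  # approximately (4.94, 5.46)
</syntaxhighlight>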

===When sample size is small (roughly <30) and assumed to follow the t-distribution===
We have to refer to the '''t-distribution table''' instead of the normal distribution table (Z table),

as well as take into account the '''degrees of freedom''', <math>n-1</math>.

Find the relevant coefficient <math>k</math> of <math>\bar{x}\ \pm\ k \frac{s}{\sqrt{n}}</math> in the t-distribution table by using the desired CI level and the degrees of freedom.
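
A sketch of looking up <math>k</math> programmatically instead of from a printed table, assuming SciPy is available; the sample values are hypothetical:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Hypothetical small sample (n = 8)
x    = np.array([4.1, 5.3, 4.8, 5.9, 4.4, 5.1, 4.7, 5.5])
n    = len(x)
xbar = x.mean()
s    = x.std(ddof=1)

# Two-sided 95% CI: k is the 97.5th percentile of the t-distribution with n - 1 degrees of freedom
k    = stats.t.ppf(0.975, df=n - 1)    # ~2.365 for df = 7 (vs 1.96 for the normal distribution)
se   = s / np.sqrt(n)
ci95 = (xbar - k * se, xbar + k * se)  # approximately (4.48, 5.47)
</syntaxhighlight>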
