Data distribution

Source: Vaccipedia | Resources for Vaccines, Tropical medicine and Travel medicine

Standard deviation and Variance

Why is variance a sum of squares?

SD/Variance of Quantitative data ~ mean

| | Size | Mean | Standard deviation | Variance | Notes |
|---|---|---|---|---|---|
| Population | [math]\displaystyle{ N }[/math] | [math]\displaystyle{ \mu = \frac{\textstyle \sum_{i=1}^N X_i}{N} }[/math] | [math]\displaystyle{ \sigma=\sqrt{\frac{\sum_{i=1}^N (X_i-\mu)^2}{N}} }[/math] | [math]\displaystyle{ \sigma^2=\frac{\sum_{i=1}^N (X_i-\mu)^2}{N} }[/math] | [math]\displaystyle{ X_i }[/math] is each value in the population (quantitative: continuous or discrete) |
| Sample | [math]\displaystyle{ n }[/math] | [math]\displaystyle{ \overline{x} = \frac{\textstyle \sum_{i=1}^n x_i}{n} }[/math] | [math]\displaystyle{ s=\sqrt{\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{\color{Red}n-1}} }[/math] | [math]\displaystyle{ s^2=\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{\color{Red}n-1} }[/math] | [math]\displaystyle{ x_i }[/math] is each value in the sample (quantitative: continuous or discrete); [math]\displaystyle{ \color{Red}n-1 }[/math] is derived from the degrees of freedom |
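
As a minimal sketch of the two sets of formulas, the following Python snippet computes the variance and standard deviation of the same numbers twice: once dividing by the number of values (population formulas) and once dividing by one less (sample formulas). The data list is hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math
import statistics

# Hypothetical data, used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n
sum_sq = sum((x - mean) ** 2 for x in values)  # sum of squared deviations

# Treating the data as the whole population: divide by N.
pop_var = sum_sq / n
pop_sd = math.sqrt(pop_var)

# Treating the data as a sample: divide by n - 1.
sample_var = sum_sq / (n - 1)
sample_sd = math.sqrt(sample_var)

# The standard library offers the same two definitions.
assert math.isclose(pop_var, statistics.pvariance(values))
assert math.isclose(sample_var, statistics.variance(values))

print("population:", pop_var, pop_sd)
print("sample:    ", sample_var, sample_sd)
```

The sample variance is always slightly larger than the population-style variance of the same numbers, and the gap shrinks as the number of values grows.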

SD/Variance of Binomial data ~ proportion

| | Size | Proportion | Standard deviation ¶ | Variance ¶ | Notes |
|---|---|---|---|---|---|
| Population | [math]\displaystyle{ N }[/math] | [math]\displaystyle{ \pi = \frac{\sum_{i=1}^N X_i}{N} }[/math], where [math]\displaystyle{ X_i = 0 }[/math] or [math]\displaystyle{ 1 }[/math] | [math]\displaystyle{ \sigma = \sqrt{\pi (1 - \pi)} }[/math] | [math]\displaystyle{ \sigma^2 = \pi (1 - \pi) }[/math] | [math]\displaystyle{ X_i }[/math] is each value in the population (binary: 0 or 1) |
| Sample | [math]\displaystyle{ n }[/math] | [math]\displaystyle{ p = \frac{\sum_{i=1}^n x_i}{n} }[/math], where [math]\displaystyle{ x_i = 0 }[/math] or [math]\displaystyle{ 1 }[/math] | [math]\displaystyle{ \begin{align} s & = \sqrt{\frac{n}{n-1} \cdot p (1 - p)} \\ & \approx \sqrt{p (1-p)} \end{align} }[/math] | [math]\displaystyle{ \begin{align} s^2 & = \frac{n}{n-1} \cdot p (1 - p) \\ & \approx p (1 - p) \end{align} }[/math] | [math]\displaystyle{ x_i }[/math] is each value in the sample (binary: 0 or 1) |

¶ How to derive variance and standard deviation of proportion in population:

The variance of the values in a population is defined as [math]\displaystyle{ \frac{\sum_{i=1}^N (X_i - \mu)^2}{N} }[/math] .

Here, [math]\displaystyle{ {\color{Green}\mu} }[/math] is [math]\displaystyle{ {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} }[/math] according to its definition.

This is [math]\displaystyle{ {\color{Green}\pi} }[/math] itself (refer to the above table).


[math]\displaystyle{ {\color{Green}\mu} = {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} = {\color{Green}\pi} }[/math]


Next, consider [math]\displaystyle{ {\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} }[/math] . Because [math]\displaystyle{ X_i = 0 }[/math] or [math]\displaystyle{ 1 }[/math], we have [math]\displaystyle{ X_i^2 = X_i }[/math] for every [math]\displaystyle{ i }[/math], so:


[math]\displaystyle{ {\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} = {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} = {\color{Green}\pi} }[/math]


Thus the variance of population proportion can be calculated as follows:


[math]\displaystyle{ \begin{align} \sigma^2 & = \frac{\sum_{i=1}^N (X_i - {\color{Green}\mu})^2}{N} \\ & = \frac{\sum_{i=1}^N (X_i - {\color{Green}\pi})^2}{N} \\ & = \frac{\sum_{i=1}^N (X_i^2 - 2 {\color{Green}\pi} \cdot X_i + {\color{Green}\pi^2})}{N} \\ & = \frac{\sum_{i=1}^N X_i^2}{N} - 2 {\color{Green}\pi} \cdot \frac{\sum_{i=1}^N X_i}{N} + {\color{Green}\pi^2} \cdot \frac{\sum_{i=1}^N 1}{N}\\ & = {\color{Red}\frac{\sum_{i=1}^N X_i^2}{N}} - 2 {\color{Green}\pi} \cdot {\color{Green}\frac{\sum_{i=1}^N X_i}{N}} + {\color{Green}\pi^2} \cdot \frac{{\color{Orange}\sum_{i=1}^N 1}}{N}\\ & = {\color{Green}\pi} - 2 {\color{Green}\pi} \cdot {\color{Green}\pi} + {\color{Green}\pi^2} \cdot \frac{\color{Orange}N}{N} \\ & = \pi - 2\pi^2 + \pi^2 \\ & = \pi - \pi^2 \\ & = \pi(1-\pi) \end{align} }[/math]


Then standard deviation is also obtained:


[math]\displaystyle{ \begin{align} \sigma & = \sqrt{\sigma^2} \\ & = \sqrt{\pi(1-\pi)} \end{align} }[/math]


¶ How to derive variance and standard deviation of proportion in sample:

The variance of the values in a sample is defined as [math]\displaystyle{ \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1} }[/math] .

This can be rewritten as [math]\displaystyle{ \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n} }[/math] .

Here, [math]\displaystyle{ {\color{Green}\bar x} }[/math] is [math]\displaystyle{ {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} }[/math] according to its definition.

This is [math]\displaystyle{ {\color{Green}p} }[/math] itself (refer to the above table).


[math]\displaystyle{ {\color{Green}\bar x} = {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} = {\color{Green}p} }[/math]


Next, consider [math]\displaystyle{ {\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} }[/math] . Because [math]\displaystyle{ x_i = 0 }[/math] or [math]\displaystyle{ 1 }[/math], we have [math]\displaystyle{ x_i^2 = x_i }[/math] for every [math]\displaystyle{ i }[/math], so:


[math]\displaystyle{ {\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} = {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} = {\color{Green}p} }[/math]


Thus the variance of sample proportion can be calculated as follows:


[math]\displaystyle{ \begin{align} s^2 & = \frac{\sum_{i=1}^n (x_i - {\color{Green}\bar x})^2}{n-1} \\ & = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - {\color{Green}\bar x})^2}{n} \\ & = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i - {\color{Green}p})^2}{n} \\ & = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^n (x_i^2 - 2 {\color{Green}p} \cdot x_i + {\color{Green}p^2})}{n} \\ & = \frac{n}{n-1} \left ( \frac{\sum_{i=1}^n x_i^2}{n} - 2 {\color{Green}p} \cdot \frac{\sum_{i=1}^n x_i}{n} + {\color{Green}p^2} \cdot \frac{\sum_{i=1}^n 1}{n} \right ) \\ & = \frac{n}{n-1} \left ( {\color{Red}\frac{\sum_{i=1}^n x_i^2}{n}} - 2 {\color{Green}p} \cdot {\color{Green}\frac{\sum_{i=1}^n x_i}{n}} + {\color{Green}p^2} \cdot \frac{{\color{Orange}\sum_{i=1}^n 1}}{n} \right ) \\ & = \frac{n}{n-1} \left ( {\color{Green}p} - 2 {\color{Green}p} \cdot {\color{Green}p} + {\color{Green}p^2} \cdot \frac{\color{Orange}n}{n} \right ) \\ & = \frac{n}{n-1} \left ( p - 2p^2 + p^2 \right ) \\ & = \frac{n}{n-1} \left ( p - p^2 \right ) \\ & = \frac{n}{n-1} \cdot p(1-p) \end{align} }[/math]

Here, if [math]\displaystyle{ n }[/math] is large enough, the factor [math]\displaystyle{ \frac{n}{n-1} }[/math] is close to 1 and can be dropped from the calculation.

[math]\displaystyle{ s^2 \approx p(1-p) }[/math]

Then standard deviation is also obtained:


[math]\displaystyle{ \begin{align} s & = \sqrt{s^2} \\ & = \sqrt{\frac{n}{n-1} \cdot p(1-p)} \\ & \approx \sqrt{p(1-p)} \end{align} }[/math]
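
The two derivations can be checked numerically. The sketch below uses a hypothetical binary sample and confirms that the general variance formulas reduce to the proportion-based expressions: dividing by the number of values gives the product of the proportion and its complement, and dividing by one less gives the same product scaled up.

```python
import math
import statistics

# Hypothetical binary sample: 1 = event, 0 = no event.
x = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
n = len(x)
p = sum(x) / n  # proportion of 1s

sum_sq = sum((xi - p) ** 2 for xi in x)

# Dividing by n (population-style) should equal p(1 - p).
var_div_n = sum_sq / n
# Dividing by n - 1 (sample-style) should equal n/(n-1) * p(1 - p).
var_div_n_minus_1 = sum_sq / (n - 1)

assert math.isclose(var_div_n, p * (1 - p))
assert math.isclose(var_div_n_minus_1, n / (n - 1) * p * (1 - p))
assert math.isclose(var_div_n_minus_1, statistics.variance(x))

print(p, var_div_n, var_div_n_minus_1)
```

With only ten values the correction factor is still noticeable (10/9); with a few hundred values the two variances are nearly identical.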

Sum of squares

In ANOVA, the total sum of squares and the sums of squares in groups are compared.

The sum of squares is the numerator of the variance.

Variance [math]\displaystyle{ = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1} }[/math]
Sum of squares [math]\displaystyle{ = {\sum_{i=1}^n (x_i - \bar x)^2} }[/math]

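Consider dividing the [math]\displaystyle{ n }[/math] observations into [math]\displaystyle{ k }[/math] groups, where group [math]\displaystyle{ j }[/math] contains [math]\displaystyle{ n_j }[/math] observations [math]\displaystyle{ x_{j1}, x_{j2}, \cdots, x_{j n_j} }[/math] with mean [math]\displaystyle{ \bar x_j }[/math], and [math]\displaystyle{ n_1 + n_2 + \cdots + n_k = n }[/math].

[math]\displaystyle{ \begin{Bmatrix} x_1, x_2, \cdots ,x_n \end{Bmatrix} }[/math] is divided into [math]\displaystyle{ k }[/math] groups: [math]\displaystyle{ \begin{Bmatrix} x_{11}, x_{12}, \cdots, x_{1 n_1} & (sample\ size=n_1)\\ x_{21}, x_{22}, \cdots, x_{2 n_2} & (sample\ size=n_2)\\ \vdots & \vdots \\ x_{k1}, x_{k2}, \cdots, x_{k n_k} & (sample\ size=n_k) \end{Bmatrix} }[/math]

The sum of squares of the total observations can be decomposed into a within-group term and a between-group term as follows:

[math]\displaystyle{ \begin{align} {\sum_{i=1}^n (x_i - \bar x)^2} & = \sum_{j=1}^k \sum_{i=1}^{n_j} (x_{ji} - \bar x)^2 \\ & = \sum_{j=1}^k \sum_{i=1}^{n_j} (x_{ji} - \bar x_j)^2 + \sum_{j=1}^k n_j (\bar x_j - \bar x)^2 \end{align} }[/math]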
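
The standard ANOVA identity, that the total sum of squares equals the within-group sum of squares plus the between-group sum of squares, can be checked numerically. The sketch below uses hypothetical observations in three groups of unequal size; the variable names are chosen for illustration only.

```python
import math

# Hypothetical observations split into k = 3 groups.
groups = [[4.0, 5.0, 6.0], [7.0, 9.0], [1.0, 2.0, 3.0, 2.0]]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)


def sum_of_squares(values, center):
    """Sum of squared deviations of values from a given center."""
    return sum((x - center) ** 2 for x in values)


def group_mean(g):
    return sum(g) / len(g)


# Total: every observation against the grand mean.
ss_total = sum_of_squares(all_values, grand_mean)
# Within: every observation against its own group mean.
ss_within = sum(sum_of_squares(g, group_mean(g)) for g in groups)
# Between: each group mean against the grand mean, weighted by group size.
ss_between = sum(len(g) * (group_mean(g) - grand_mean) ** 2 for g in groups)

print(ss_total, ss_within, ss_between)
```

The identity holds for any grouping of the data; ANOVA then asks whether the between-group part is large relative to the within-group part.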

Standard Error

Suppose we repeatedly draw samples of size [math]\displaystyle{ n }[/math] from the population, where [math]\displaystyle{ n }[/math] is large enough. Whatever the population distribution is (provided its variance is finite), the means of those samples approximately follow a normal distribution whose mean is identical to the population mean [math]\displaystyle{ \mu }[/math] and whose variance is [math]\displaystyle{ \frac{\sigma^2}{n} }[/math] (not the population variance [math]\displaystyle{ \sigma^2 }[/math] itself). This is the central limit theorem.

The variance [math]\displaystyle{ \frac{\sigma^2}{n} }[/math] follows from the independence of the observations, because the variance of a sum of independent values is the sum of their variances. Proving that the distribution of sample means approaches a normal distribution needs more advanced mathematics such as Maclaurin expansion, characteristic functions or moment-generating functions.

Hence the standard deviation of sample means is the square root of this variance: [math]\displaystyle{ \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} }[/math].

[Figure: Central limit theorem]

This standard deviation of sample means, [math]\displaystyle{ \frac{\sigma}{\sqrt{n}} }[/math], is called the standard error.
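
A small simulation can make this concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with rate 1 (so its mean and standard deviation are both 1), "infinitely many" samples is replaced by 5000, and the seed is fixed so the run is repeatable.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 50          # size of each sample
repeats = 5000  # finite stand-in for "infinitely many" samples

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0

sample_means = []
for _ in range(repeats):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(n)

print("mean of sample means:", mean_of_means, "(population mean:", mu, ")")
print("SD of sample means:  ", sd_of_means, "(sigma / sqrt(n):", theoretical_se, ")")
```

Although the exponential population is strongly skewed, the sample means cluster around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.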

In reality, only God knows the population mean [math]\displaystyle{ \mu }[/math] and the population standard deviation [math]\displaystyle{ \sigma }[/math], so the only practical way to use the standard error is to assume that the sample standard deviation is close to the population standard deviation, as follows:

[math]\displaystyle{ Standard\ error \approx \frac{s}{\sqrt{n}} }[/math], where [math]\displaystyle{ s }[/math] = sample standard deviation

| | Standard error | Notes |
|---|---|---|
| Mean | [math]\displaystyle{ \begin{align} SEM & = \frac{\sigma}{\sqrt{n}} \\ & \approx \frac{s}{\sqrt{n}} \end{align} }[/math] | [math]\displaystyle{ \sigma }[/math] is the population standard deviation; [math]\displaystyle{ s }[/math] is the sample standard deviation; [math]\displaystyle{ n }[/math] is the sample size |
| Proportion | [math]\displaystyle{ \begin{align} SE_p & = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\pi (1-\pi)}{n}} \\ & \approx \frac{s}{\sqrt{n}} \approx \sqrt{\frac{p (1-p)}{n}} \end{align} }[/math] | [math]\displaystyle{ \pi }[/math] is the population proportion; [math]\displaystyle{ p }[/math] is the sample proportion; [math]\displaystyle{ n }[/math] is the sample size |
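
The two estimated standard errors can be sketched as small helper functions. The function names and example numbers below are hypothetical; the formulas are the sample-based approximations, with the sample size in the denominator.

```python
import math
import statistics


def se_mean(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def se_proportion(p, n):
    """Estimated standard error of a proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical quantitative sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sem = se_mean(sample)

# Hypothetical survey: 40 events among 100 subjects.
sep = se_proportion(0.4, 100)

print(sem, sep)
```

Quadrupling the sample size halves either standard error, because the sample size enters through its square root.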

Standard Error and Confidence Interval

When sample size is large enough and assumed to follow normal distribution

According to the central limit theorem, sample means follow a normal distribution with mean [math]\displaystyle{ \mu }[/math] (the population mean) and standard deviation [math]\displaystyle{ \frac{\sigma}{\sqrt{n}} }[/math].

The mean of one single sample will lie somewhere within this distribution of sample means, which is centered on [math]\displaystyle{ \mu }[/math] (the population mean!) with standard deviation [math]\displaystyle{ \frac{\sigma}{\sqrt{n}} }[/math].

As a simple rule, in a normal distribution each range of ±[math]\displaystyle{ k }[/math] SD around the mean contains the following proportion of all values.

| ±[math]\displaystyle{ k }[/math] SD | Proportion |
|---|---|
| ±1 SD | 68.3% |
| ±1.96 SD | 95% |
| ±2 SD | 95.4% |
| ±2.58 SD | 99% |
| ±3 SD | 99.7% |
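
These proportions can be recomputed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is chosen here for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, SD 1


def coverage(k):
    """Proportion of a normal distribution lying within +/- k SD of the mean."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 1.96, 2, 2.58, 3):
    print(f"+/-{k} SD: {coverage(k):.1%}")

# Going the other way: the multiplier that leaves 2.5% in each tail.
k_95 = z.inv_cdf(0.975)
print(k_95)
```

The inverse lookup shows where the multipliers 1.96 and 2.58 come from: they are the points that cut off 2.5% and 0.5% in each tail, respectively.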

We cannot tell how far a single sample mean [math]\displaystyle{ \bar{x} }[/math] is from the true mean of sample means, i.e., the population mean [math]\displaystyle{ \mu }[/math],

but we can state the probability that a given range around a single sample mean contains the population mean [math]\displaystyle{ \mu }[/math], using the proportions of the normal distribution listed above.

The standard deviation of sample means, i.e., the standard error, is [math]\displaystyle{ \frac{\sigma}{\sqrt{n}} }[/math],

and it can be approximated as [math]\displaystyle{ \frac{s}{\sqrt{n}} }[/math] by using the standard deviation [math]\displaystyle{ s }[/math] of a single sample.

Thus, [math]\displaystyle{ \bar{x}\ \pm\ k \frac{s}{\sqrt{n}} }[/math] is a range around a single sample mean [math]\displaystyle{ \bar{x} }[/math], and its corresponding proportion is the probability that the range contains [math]\displaystyle{ \mu }[/math].

[math]\displaystyle{ \bar{x}\ \pm\ 1.96 \frac{s}{\sqrt{n}} }[/math] is the 95% confidence interval, and [math]\displaystyle{ \bar{x}\ \pm\ 2.58 \frac{s}{\sqrt{n}} }[/math] is the 99% confidence interval.
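
A minimal sketch of the large-sample interval follows. The function name `normal_ci` and the simulated data (100 values drawn from a normal distribution with an arbitrary mean and spread) are assumptions for illustration; the multiplier is taken from the standard normal distribution rather than typed in by hand.

```python
import math
import random
import statistics
from statistics import NormalDist


def normal_ci(sample, level=0.95):
    """Large-sample confidence interval: mean +/- k * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    k = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return mean - k * se, mean + k * se


# Hypothetical sample of 100 measurements (simulated, fixed seed).
rng = random.Random(1)
sample = [rng.gauss(170, 8) for _ in range(100)]

low95, high95 = normal_ci(sample, 0.95)
low99, high99 = normal_ci(sample, 0.99)
print("95% CI:", low95, high95)
print("99% CI:", low99, high99)
```

The 99% interval is wider than the 95% interval computed from the same sample, since a larger multiplier is needed to cover more of the distribution.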

When sample size is small (roughly <30) and assumed to follow t distribution

We have to refer to a t-distribution table instead of a normal distribution table (Z table),

and take into account the degrees of freedom, [math]\displaystyle{ n-1 }[/math].

Find the relevant coefficient [math]\displaystyle{ k }[/math] of [math]\displaystyle{ \bar{x}\ \pm\ k \frac{s}{\sqrt{n}} }[/math] in the t-distribution table by using the desired confidence level and the degrees of freedom.
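
In place of a printed table, the coefficient can be approximated numerically. The sketch below is one possible approach using only the Python standard library: it integrates the t density with Simpson's rule and searches for the critical value by bisection. The function names and the small data set are hypothetical, and a statistics package would normally be used instead.

```python
import math
import statistics


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t] (steps is even)."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(level, df):
    """Two-sided critical value k such that P(-k <= T <= k) = level."""
    target = 0.5 + level / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical small sample (n = 10, so 9 degrees of freedom).
sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.4, 5.0]
n = len(sample)
k = t_critical(0.95, n - 1)
mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(n)
print("k =", k)
print("95% CI:", mean - k * se, mean + k * se)
```

For small samples the coefficient is noticeably larger than the normal-distribution value of 1.96, and it approaches 1.96 as the degrees of freedom increase.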

Probability, Likelihood

Statistics is an attempt to estimate a population through a sample.

A population always follows some kind of distribution, i.e., some kind of probability distribution. But the parameters of a population (e.g., mean and standard deviation in the normal distribution, success probability and number of trials in the binomial distribution, location and scale in the logistic distribution) are known only to God.

Each distribution has its own equation describing its probability distribution, and that equation is determined by the parameters.

Probability

When a sample is drawn from a population, the parameters of the sample can be calculated, and they serve as estimates of the parameters of the population.

But only God knows the true parameters of the population, and the parameters of the sample always carry random error.

The equation for each distribution is built from the parameters of the sample, so any value the equation gives also carries random error.

That value is a probability: the chance that given data are observed under the distribution. More specifically, it is a conditional probability, because the parameters are held fixed (= the condition) and the equation gives the probability of the data under that condition.


Likelihood

When a sample is drawn from a population, the observations in the sample follow the distribution and the parameters that only God knows, and which we cannot know directly.

There are many possible sets of parameters under which the observed data in the sample could have arisen, and different sets of parameters give the observed data different chances of arising.

Those chances are the likelihood. Under one set of parameters the observed data may arise with a very low chance, under another set with a relatively high chance, and under yet another set with the highest chance.

Naturally, the set of parameters with the highest chance, i.e., the most likely set of parameters (the parameters with the maximum likelihood), is adopted and used to build the relevant equation. This is the maximum likelihood estimation method.

In contrast to the (conditional) probability described above, which fixes the parameters (hypothesis) and varies the observations (data), likelihood fixes the observations (data) and varies the parameters (hypothesis).
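
The contrast can be illustrated with the binomial distribution. The sketch below is a hypothetical example (3 successes in 10 trials, and a coarse grid of candidate success probabilities): the same formula is evaluated once with the parameter fixed and the data varying, and once with the data fixed and the parameter varying.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10
k_observed = 3  # hypothetical observed data: 3 successes in 10 trials

# Probability: fix the parameter (p = 0.5), vary the data (k = 0..n).
p_fixed = 0.5
probabilities = [binom_pmf(k, n, p_fixed) for k in range(n + 1)]
print("sum over all possible data:", sum(probabilities))

# Likelihood: fix the data (k = 3), vary the parameter over a grid of candidates.
grid = [i / 100 for i in range(1, 100)]
likelihoods = [binom_pmf(k_observed, n, p) for p in grid]

# Maximum likelihood estimate: the candidate with the highest likelihood.
p_mle = grid[likelihoods.index(max(likelihoods))]
print("maximum likelihood estimate of p:", p_mle)
```

The probabilities over all possible data sum to 1, whereas the likelihoods over candidate parameters need not; they are only compared with one another, and the largest one picks out the observed proportion of successes.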