Difference between revisions of "Basics & Definition"

Revision as of 21:57, 4 February 2023 (Sat)

Basics & Definition
Epidemiology
Odds in statistics and Odds in a horse race
Collider bias
Data distribution
Statistical test
Regression model
Multivariate analysis
Marginal effects
Prediction and decision
Table-related commands in STATA
Missing data and imputation

Self-assessment quizzes

Types of variable

Ratio, Rate, Proportion

Every fraction is a ratio.

Refer to this page too

Probability, Likelihood

Probability

Given a dataset of a sample [math]\displaystyle{ X = ( x_1,\ x_2,\ x_3,\ \cdots ) }[/math] derived from a population, the sample mean [math]\displaystyle{ \bar{X} }[/math] and the sample standard deviation [math]\displaystyle{ s }[/math] can be calculated.

The chance that a value in [math]\displaystyle{ X }[/math] falls within a given range, such as [math]\displaystyle{ x_i \gt n }[/math], can also be calculated from the dataset. This chance is the [math]\displaystyle{ probability }[/math].

[math]\displaystyle{ probability = P(x_i\gt n|\bar{X},s) }[/math]

For example, if a die is rolled 600 times and each number from 1 to 6 appears 100 times, the mean is 3.5 and the standard deviation is 1.71.

The probability that a value in that sample is 1 is

[math]\displaystyle{ probability = P(x_i=1|3.5,1.71) = 100/600 = 1/6 }[/math]
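The die example above can be reproduced with a few lines of Python. This is only a minimal sketch: the list `rolls` is a constructed sample that matches the counts in the text (not real experimental data), and the variable names are chosen for illustration.

```python
import statistics

# Constructed sample matching the text: 600 rolls, each face 1-6 seen 100 times.
rolls = [face for face in range(1, 7) for _ in range(100)]

sample_mean = statistics.mean(rolls)   # arithmetic mean of the sample
sample_sd = statistics.stdev(rolls)    # sample SD (n - 1 in the denominator)

# Empirical probability: the share of observations that satisfy a condition.
p_one = sum(1 for x in rolls if x == 1) / len(rolls)       # P(x_i = 1)
p_gt_four = sum(1 for x in rolls if x > 4) / len(rolls)    # P(x_i > 4)

print(sample_mean, round(sample_sd, 2), p_one, p_gt_four)
```

The same counting approach works for any range condition such as [math]\displaystyle{ x_i \gt n }[/math]: count the values that meet the condition and divide by the sample size.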

Likelihood

Origin of terminology

Why is it called "Z"?

Why is it called "Student's t"?

William Gosset was a mathematician active in the late 19th and early 20th centuries who also worked for the famous brewery Guinness. He discovered a new statistical distribution, but Guinness did not allow its employees to publish papers related to confidential business matters. Gosset therefore published his work under the pseudonym "Student".

The symbol t itself was named later, through correspondence between Gosset and the statistician R.A. Fisher. The first description of t appeared in an article by Fisher in 1924.

Why is it called "regression"?

In the 19th century, Sir Francis Galton investigated the association between parents' heights and their offspring's heights. He found that the taller the parents were, the taller their offspring were, but that the offspring of tall parents tended to be shorter than their parents, and vice versa. He described this association as offspring's heights tending 'to regress (go back) towards mediocrity (average)'.

Since then, the idea of regression to the mean has expanded into the regression model, which estimates the association between one dependent variable and one or more independent variables by fitting a line.
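As a rough illustration of fitting such a line, the sketch below applies ordinary least squares to a handful of parent and offspring heights. The data are invented for this example (they are not Galton's measurements), and `fit_line` is a hypothetical helper name.

```python
# Hypothetical (parent, offspring) heights in cm, invented for illustration only.
parent = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
offspring = [166.0, 167.5, 171.0, 173.0, 176.5, 178.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(parent, offspring)
print(f"offspring = {intercept:.1f} + {slope:.2f} * parent")
```

In this made-up dataset the fitted slope lies between 0 and 1: taller parents have taller offspring, but the offspring are on average closer to the overall mean than their parents are, which is the pattern Galton called regression.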

Why is it called "logistic"?

The true reason remains unclear.

The Belgian mathematician Pierre-François Verhulst, who coined the term, first used the word "logistique" (Fr.) in his 1845 paper "Recherches mathématiques sur la loi d'accroissement de la population," in NOUVEAUX MÉMOIRES DE L'ACADÉMIE ROYALE DES SCIENCES ET BELLES-LETTRES DE BRUXELLES, vol. 18, p 3.

In a figure, Verhulst labeled the ordinary exponential curve "logarithmique" and coined the new word "logistique" for the distinct curve produced by his formula, which is now known as the logistic function. He did not explain how he derived the word.

A description of the logistic function is available on Wikipedia.

In any case, the term seems to have nothing to do with the general term "logistics".

Why is it called "bootstrapping"?

A bootstrap is a piece of cloth or leather at the back or side of a boot that helps you pull the boot on. The word has also acquired a broader meaning: an approach to creating something with the minimum possible resources.

There is also the idiom "pull oneself up by one's bootstraps", which means to improve one's situation by one's own efforts, without help from anyone else.

The bootstrapping method derives new samples from the original observations by sampling with replacement, not from any other data source. In other words, the samples are pulled up from themselves, which echoes the idiom "pull oneself up by one's bootstraps".
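The resampling step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `heights` values are invented, `bootstrap_means` is a hypothetical helper name, and the percentile interval is just one simple way to summarize the resampled means.

```python
import random
import statistics


def bootstrap_means(observations, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `observations`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(observations)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the original observations WITH replacement.
        resample = rng.choices(observations, k=n)
        means.append(statistics.mean(resample))
    return means


# Hypothetical observations, invented for illustration only.
heights = [158.0, 162.5, 165.0, 167.5, 170.0, 171.5, 174.0, 176.5, 180.0, 183.0]

means = sorted(bootstrap_means(heights))
# Percentile interval: roughly the 2.5th and 97.5th percentiles of the means.
lower = means[len(means) * 25 // 1000]
upper = means[len(means) * 975 // 1000 - 1]

print(f"sample mean = {statistics.mean(heights):.2f}")
print(f"approximate 95% percentile interval = ({lower:.2f}, {upper:.2f})")
```

Every resample is built only from the original ten observations, so no outside data source is involved; the spread of the resampled means gives a rough sense of the uncertainty in the sample mean.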

@@ Line 34: / Line 34: @@
 ==Probability, Likelihood==
+===Probability===
+Given a dataset of a sample <math>X = ( x_1,\ x_2,\ x_3,\ \cdots )</math> derived from a population, the sample mean <math>\bar{X}</math> and the sample standard deviation <math>s</math> can be calculated.
+The chance that a value in <math>X</math> falls within a given range, such as <math>x_i > n</math>, can also be calculated from the dataset. This chance is the <math>probability</math>.
+:<math>probability = P(x_i>n|\bar{X},s)</math>
+For example, if a die is rolled 600 times and each number from 1 to 6 appears 100 times, the mean is 3.5 and the standard deviation is 1.71.
+The probability that a value in that sample is 1 is
+:<math>probability = P(x_i=1|3.5,1.71) = 100/600 = 1/6 </math>
+===Likelihood===
 ==Origin of terminology==