Advanced Undergraduate Econometrics
by Stacey Chen
4 October 2010
Royal Holloway University of London
http://personal.rhul.ac.uk/utte/278/econometrics.html
Abstract
These are lecture notes for my students in EC3400. Other students who have taken statistics and an introduction to econometrics may benefit from these notes too. In 18 lectures, I start with a review of statistics and probability theory. I then cover topics including large-sample properties and hypothesis testing of Ordinary Least Squares (OLS), binary choice models, omitted variable bias, instrumental variable (IV) methods, difference-in-differences models, matching methods, regression discontinuity, and panel regression models. Recommended texts: the latest editions of Stock and Watson's (SW) Introduction to Econometrics and Angrist and Pischke's (AP) Mostly Harmless Econometrics.
1 Review of statistics and probability theory
1.1 Primary Goals of Empirical Studies
1.1.1 Causal effects and idealised experiments
Policy evaluation and inference on human behaviour (or even climate change, for example) require data analysts to comment on the causal impact of a treatment on outcome variables. Ideally, we would implement a randomised controlled experiment to ensure a valid causal interpretation of the observed comparison between treated and untreated outcomes. But ideal randomised controlled experiments are often infeasible or too costly, so we often need to rely on observational data to make statistical inferences.
1.1.2 Data
Experimental data come from experiments designed to evaluate a treatment or policy, while observational data are derived from observing actual behaviour outside an experimental setting; they are often collected by surveys or derived from administrative records. Cross-sectional data are data on different entities (e.g., workers, firms, and countries) for a single time period, for example a national annual census. Time-series data are data from a single entity (one person, firm, or country) collected at multiple time periods, for example the daily stock price of a given firm during a year. Panel or longitudinal data are data on multiple entities where each entity is observed multiple times, for example a campus survey that traces a given set of students for several years.
1.2 Review of probability
Potential outcomes are the results of random events or experiments, to which we can assign numeric values. The assignment is one-to-one; only one value of the potential outcomes can possibly be observed, because history only happens once and because we live one and only one life at a time.
For example, raining or not and attending university or not are binary potential outcomes (1 or 0). Formally, we can assign the value 1 if it rains on a given day, and 0 otherwise. Thus, for each day the outcome Y has two potential values: Y = 1 if it rains; Y = 0 if it doesn't.
Given economic conditions (which could be viewed as the result of an experiment in real life), earnings and stock prices are examples of continuous potential outcomes. Formally, we can assign a value Y_x given economic conditions x.
The sample space is the set of all possible potential outcomes. An event is a subset of the sample space. For example, in the sample space of all possible education levels, the event of "having a university education" consists of multiple potential outcomes, including "having some university education without a degree," "having a university degree," and "having postgraduate training." A random variable is a numerical summary of a random outcome. We use capital letters (e.g., Y) to denote a random variable, and small letters (e.g., y) to denote the realised/actual value of the random variable. Thus, Y is random but y is a given number, not random.
A function of one or several random variables assigns a random value, so the function is also a random variable; the sample average is an important example. The probability of an outcome is the proportion of the time that the outcome would occur. The probability distribution of a discrete random variable is the list of all possible values of the variable (on the x-axis) and the probability that each value would occur (on the y-axis).
A binary random variable (Y = 0 or 1) is a Bernoulli random variable. Its probability distribution is called the Bernoulli distribution and is described by p, the probability that Y = 1. Formally, the Bernoulli distribution is characterised by

$$ Y = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases} $$
The probability distribution described above for a discrete random variable doesn't work for continuous random variables, because the probability that a continuous random variable takes any one single point on the real line is zero. The event that the commuting time is exactly equal to 20 minutes, for example, has probability zero. We therefore summarise the distribution of a continuous random variable Y with its cumulative distribution function (CDF), denoted by F(y), which gives the probability that the commuting time is no more than y. Graphical representations of PDFs and CDFs help us to visualise their nature. The CDF of the commuting time can be depicted by a graph with the commuting time y on the x-axis and the probability Pr{Y ≤ y} on the y-axis. Recall that the probability density function (PDF), denoted by f(y), is the derivative/slope of the CDF of Y. In other words, the area underneath the PDF between commuting times a and b is the probability that the commuting time ranges from a to b:

$$ \Pr[a \leq Y \leq b] = \int_a^b f(y)\, dy = F(b) - F(a). $$
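As a quick numerical check of this identity (my own sketch, not part of the original notes; the exponential distribution and the values of a and b are arbitrary illustrative choices):

```python
# Minimal sketch: verify Pr[a <= Y <= b] = integral of the PDF = F(b) - F(a).
# The exponential distribution is an arbitrary illustrative choice of Y.
from scipy import stats
from scipy.integrate import quad

a, b = 10, 20                      # e.g., commuting times in minutes (hypothetical)
Y = stats.expon(scale=15)          # hypothetical continuous random variable

area, _ = quad(Y.pdf, a, b)        # integral of f(y) from a to b
cdf_diff = Y.cdf(b) - Y.cdf(a)     # F(b) - F(a)

print(area, cdf_diff)              # the two numbers agree up to numerical error
```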
The following subsection will give several important examples of continuous random variables and
their CDFs and PDFs.
1.2.1 Characterising the probability distribution of a random variable#p#分页标题#e#
We characterise the central tendency and the dispersion of a random variable by its mean and variance.
The mean (or expected value) of a random variable Y, denoted by E[Y] or μ, is the average value of the random variable over repeated trials. The mean of Y is also called the first moment of Y. The mean of Y^r, denoted by E[Y^r], is called the r-th moment of the random variable.
Assume that a discrete random variable Y has k possible outcomes, (y_1, y_2, …, y_k). Its expected value is the weighted average of all of the possible outcomes, weighted by the probability of each outcome (p_1, p_2, …, p_k):

$$ E[Y] \equiv \mu = y_1 p_1 + y_2 p_2 + \cdots + y_k p_k. $$
In the case k = 2 and Y = 0 or 1 (failure or success), the random variable is Bernoulli. Letting p be the probability of success, the mean of the Bernoulli random variable is E[Y] = 1·p + 0·(1 − p) = p. Thus, the expectation of a Bernoulli random variable equals the probability of success.
Similarly, the expected value of a continuous random variable Y is the probability-weighted average of all possible outcomes:

$$ E[Y] \equiv \mu = \int y\, f(y)\, dy. $$
The variance and standard deviation measure the dispersion/spread of a probability distribution. The variance of a random variable, denoted by V(Y), Var(Y), or σ², is the expected value of the squared deviation of Y from its mean: V(Y) = E[(Y − μ)²]. We note that the units of the variance are the units of Y squared, which are harder to interpret than the units of Y itself. For convenience of interpretation, we often use the standard deviation, which equals the square root of the variance and is denoted by σ.
If we have a discrete random variable Y with n possible different values (y_1, y_2, …, y_n) and corresponding probabilities (p_1, p_2, …, p_n), the variance is

$$ V(Y) = \sum_{i=1}^{n} (y_i - \mu)^2 p_i. $$
A special case is the Bernoulli random variable (n = 2). Given the probability of success p, the variance of the Bernoulli random variable is V(Y) = (1 − p)²p + (0 − p)²(1 − p) = p(1 − p). Thus, the standard deviation of a Bernoulli random variable is √(p(1 − p)).
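A short simulation confirms these Bernoulli moments (a sketch of my own, not from the notes; the value p = 0.3 is arbitrary):

```python
# Simulation check: a Bernoulli(p) random variable has mean p and variance p(1-p).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
y = rng.binomial(n=1, p=p, size=1_000_000)    # 0/1 Bernoulli draws

print(y.mean(), p)                            # sample mean is close to p
print(y.var(), p * (1 - p))                   # sample variance is close to p(1-p)
print(y.std(), np.sqrt(p * (1 - p)))          # sample s.d. is close to sqrt(p(1-p))
```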
For a continuous random variable Y, the variance is

$$ V(Y) = \int (y - \mu)^2 f(y)\, dy. $$
Remark 1. Both the mean and the variance are defined through the expectation operator E[·], which is linear. In particular, we can easily calculate both the mean and the variance of a linear function of a random variable. Let Y be a linear function of X, written as Y = a + bX, where a and b are constants. Then we can easily show that

$$ E[Y] = a + b\,E[X], \qquad V[Y] = b^2\, V[X]. $$
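A small numerical illustration of Remark 1 (my own sketch; the values of a and b and the gamma distribution for X are arbitrary assumptions):

```python
# Check E[a + bX] = a + b*E[X] and V[a + bX] = b^2 * V[X] by simulation.
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, -3.0
x = rng.gamma(shape=2.0, scale=1.5, size=1_000_000)   # arbitrary distribution for X
y = a + b * x

print(y.mean(), a + b * x.mean())   # means agree
print(y.var(), b**2 * x.var())      # variances agree
```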
To measure the shape of a distribution, we use skewness and kurtosis. The skewness of the distribution of a random variable Y is measured by

$$ \text{skewness} \equiv \frac{E[(Y-\mu)^3]}{\sigma^3}, $$

which measures the lack of symmetry of the distribution. For a symmetric distribution, the skewness equals zero. If the distribution has a long right (left) tail, the skewness is positive (negative).
The kurtosis of the distribution of a random variable Y is measured by

$$ \text{kurtosis} \equiv \frac{E[(Y-\mu)^4]}{\sigma^4}, $$

which measures how thick or heavy the tails of the distribution are. The kurtosis of a normally distributed random variable is 3. Thus, a random variable with kurtosis greater than 3 has more mass in its tails than a normal random variable. Distributions with kurtosis greater than 3 are said to have heavy tails.
Remark 2. Both skewness and kurtosis are unit free, so any change in the unit of a random
variable Y , such as multiplying by 1000 or changing from meters to centimetres, would not change
either of the measures.
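The scale invariance in Remark 2 is easy to verify numerically (a sketch of my own, not from the notes; note that scipy.stats.kurtosis reports excess kurtosis by default, so fisher=False is needed to match the definition above):

```python
# Skewness and kurtosis are unit free: rescaling Y (e.g., metres -> centimetres)
# leaves both measures unchanged.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.0, sigma=0.5, size=1_000_000)   # a right-skewed variable
y_cm = 100 * y                                           # change of units

print(stats.skew(y), stats.skew(y_cm))                                       # identical
print(stats.kurtosis(y, fisher=False), stats.kurtosis(y_cm, fisher=False))   # identical
```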
1.2.2 The normal distribution
The normal distribution is a bell-shaped probability density function (PDF). The PDF of a
normally distributed random variable Y with mean μ and variance σ² is

$$ f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2}\right]. $$

Note that this PDF is symmetric around its mean. A normally distributed random variable has a 95% chance of falling into the range between (μ − 1.96σ) and (μ + 1.96σ). We often use the following notation to denote a normally distributed random variable:

$$ Y \sim N(\mu, \sigma^2). $$
We can normalise or standardise any random variable Y ~ (μ, σ²) to obtain another random variable

$$ Z \equiv \frac{Y-\mu}{\sigma}, $$

so that we have Z ~ (0, 1). In particular, the distribution of a standardised normal random variable is standard normal, Z ~ N(0, 1). The standard normal CDF is denoted by Φ(z) ≡ Pr[Z ≤ z] for a given constant z. The values of this CDF are tabulated in the standard normal table, which you can find in any statistics or elementary econometrics textbook.
We can use the standard normal table to calculate probabilities for a normal random variable Y with mean μ and variance σ², simply by normalising/standardising the random variable. In particular, if we need to know the cumulative probability of the normal random variable at a value y, we can standardise Y and y to get the answer:

$$ \Pr[Y \leq y] = \Pr\left[\frac{Y-\mu}{\sigma} \leq \frac{y-\mu}{\sigma}\right] = \Phi\!\left(\frac{y-\mu}{\sigma}\right). $$

Because (y, μ, σ) are given, we know the value of (y − μ)/σ. We can then use the standard normal table to look up Φ((y − μ)/σ) and thus Pr[Y ≤ y]. In addition, we can also easily derive the following values:

$$ \Pr[Y > y] = 1 - \Pr[Y \leq y], \qquad \Pr[y_1 \leq Y \leq y_2] = \Pr[Y \leq y_2] - \Pr[Y \leq y_1]. $$
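A short computational analogue of the standard normal table (my own sketch; the values μ = 20, σ = 5, y₁ = 15, y₂ = 25 are made up for illustration):

```python
# Pr[Y <= y] for Y ~ N(mu, sigma^2), computed by standardising and using Phi.
from scipy.stats import norm

mu, sigma = 20.0, 5.0          # hypothetical mean and s.d. of commuting time
y1, y2 = 15.0, 25.0

z2 = (y2 - mu) / sigma
print(norm.cdf(z2))                          # Phi((y - mu)/sigma) = Pr[Y <= y2]
print(norm.cdf(y2, loc=mu, scale=sigma))     # same answer without standardising by hand

print(1 - norm.cdf(z2))                                    # Pr[Y > y2]
print(norm.cdf(y2, mu, sigma) - norm.cdf(y1, mu, sigma))   # Pr[y1 <= Y <= y2]
```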
1.2.3 Characterising the relationship between two random variables
One of the most important goals of econometric analysis is to characterise the relationship between two or more random variables. To this end, we use the joint, marginal, and conditional probability distributions. In what follows we focus on the relationship between two random variables for the purpose of illustration.
For two discrete random variables (X, Y), their joint probability distribution is the probability that they simultaneously take certain values (x, y), written as Pr[X = x, Y = y], or expressed in a joint probability matrix. Given the matrix, the marginal probability distributions are obtained simply by summing along the bottom and right-hand margins; that is, Pr[X = x] and Pr[Y = y]. The conditional distribution of Y given X = x is defined by

$$ \Pr[Y = y \mid X = x] \equiv \frac{\Pr[Y = y, X = x]}{\Pr[X = x]}. $$
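To make these definitions concrete, here is a small sketch with a made-up 2×2 joint probability matrix (the numbers are hypothetical, not from the notes):

```python
# Joint, marginal, and conditional distributions for two binary random variables.
import numpy as np

# Rows index x in {0, 1}, columns index y in {0, 1}; entries are Pr[X = x, Y = y].
joint = np.array([[0.15, 0.25],
                  [0.35, 0.25]])          # hypothetical joint probability matrix

px = joint.sum(axis=1)                    # marginal Pr[X = x] (right-hand margin)
py = joint.sum(axis=0)                    # marginal Pr[Y = y] (bottom margin)

cond_y_given_x0 = joint[0, :] / px[0]     # Pr[Y = y | X = 0] = Pr[X=0, Y=y] / Pr[X=0]
print(px, py, cond_y_given_x0)
```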
If (X, Y) are continuous, their joint (cumulative) probability is the probability of the joint event {X ≤ x, Y ≤ y}, written as

$$ F(x, y) \equiv \Pr[X \leq x,\, Y \leq y] = \int_{s=-\infty}^{y} \int_{t=-\infty}^{x} f(t, s)\, dt\, ds. $$
The marginal probabilities f(x) and f(y) are defined for the individual random variables. We can derive the marginal probability of one random variable from their joint probability by integrating out the other random variable:

$$ f(x) = \int_{y} f(x, y)\, dy, \qquad f(y) = \int_{x} f(x, y)\, dx. $$
The conditional probability of y for each value of x is defined by their joint probability and the marginal probability of X:

$$ f(y \mid x) = \frac{f(x, y)}{f(x)}. \qquad (1) $$
Thus, the joint probability can be expressed as the product of the marginal and conditional probabilities:

$$ f(x, y) = f(y \mid x)\, f(x). \qquad (2) $$
Two random variables (X, Y) are said to be statistically independent if their joint density is the product of their marginal densities:

$$ f(x, y) = f(x)\, f(y) \quad \text{if } X \text{ and } Y \text{ are independent.} $$

By the relationship between the joint and marginal distributions described in equations (1) and (2), we can also write

$$ f(y \mid x) = f(y) \quad \text{and} \quad f(x \mid y) = f(x) \quad \text{if } X \text{ and } Y \text{ are independent.} $$

Intuitively, if knowing the realised value of x provides no extra information about Y, and if knowing y provides no extra information about X, their conditional distributions are the same as their marginal distributions.
The conditional expectation of Y given X = x is the mean of the conditional distribution of Y given X = x. For discrete random variables,

$$ E[Y \mid X = x] = \sum_{i=1}^{n} y_i \Pr[Y = y_i \mid X = x]. $$

For continuous random variables,

$$ E[Y \mid X = x] = \int y\, f(y \mid x)\, dy. $$
The law of iterated expectations. The mean of Y is the weighted average of the conditional expectation of Y given X, weighted by the probability distribution of X. Precisely,

$$ E[Y] = E_X[\,E(Y \mid X)\,], $$

where the notation E_X indicates the expectation over the values of X. For example, the mean wage is the weighted average of men's and women's average wage levels, weighted by the population fractions of men and women; that is, E[Y] = E(Y | X = male) Pr[male] + E(Y | X = female) Pr[female].
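The wage example can be checked with hypothetical numbers (a sketch; the wage levels and population shares below are invented for illustration, not taken from the notes):

```python
# Law of iterated expectations: E[Y] = sum over x of E[Y | X = x] * Pr[X = x].
# Hypothetical figures: men earn 15/hour on average, women 13/hour,
# and the population is 48% men, 52% women.
cond_mean = {"male": 15.0, "female": 13.0}   # E[wage | sex]
prob = {"male": 0.48, "female": 0.52}        # Pr[sex]

overall_mean = sum(cond_mean[x] * prob[x] for x in cond_mean)
print(overall_mean)   # 15*0.48 + 13*0.52 = 13.96
```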
Remark 3. With the law of iterated expectation, it can be shown that in any bivariate distribution,
we have Cov[X, Y ]=Cov[X,E(Y |X)]. I leave the proof of this remark for Problem Set A1.
The conditional variance is the variance of Y conditional on another variable X. For discrete random variables,

$$ V[Y \mid X = x] = \sum_{i=1}^{n} \big[y_i - E(Y \mid X = x)\big]^2 \Pr[Y = y_i \mid X = x]. $$

For continuous random variables,

$$ V[Y \mid X = x] = \int \big[y - E(Y \mid X = x)\big]^2 f(y \mid x)\, dy. $$
The covariance between X and Y is defined by

$$ \operatorname{Cov}(X, Y) \equiv \sigma_{XY} \equiv E[(X - \mu_X)(Y - \mu_Y)]. $$

The coefficient of correlation between X and Y is their covariance divided by the product of their standard deviations (in order to make the coefficient scale free):

$$ \operatorname{Corr}(X, Y) \equiv \frac{\sigma_{XY}}{\sigma_X \sigma_Y}. $$
Two random variables are said to be uncorrelated if their covariance (equivalently, their correlation coefficient) is zero. It can be shown that Corr(X, Y) ∈ [−1, 1].
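A quick numerical illustration with simulated data (my own sketch; the linear relationship between the simulated X and Y is an arbitrary assumption):

```python
# Covariance and correlation between two simulated random variables.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)     # Y depends linearly on X plus noise

cov_xy = np.cov(x, y)[0, 1]                          # sample covariance
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # covariance / (sd_X * sd_Y)
print(cov_xy, corr_xy, np.corrcoef(x, y)[0, 1])      # last two agree and lie in [-1, 1]
```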
Remark 4. If the conditional mean of Y doesn’t change with the value of X, then X and Y are
uncorrelated:
$$ E[Y \mid X] = \mu \;\Longrightarrow\; \operatorname{Corr}(X, Y) = 0. \qquad (3) $$
I leave the proof of this remark for Problem Set A1.
Remark 5. The opposite direction of (3) does not necessarily hold. Consider a counterexample in which Y depends nonlinearly on X although their coefficient of correlation is zero. This is because the coefficient of correlation only captures linear dependence and cannot detect nonlinear dependence.
Remark 6. Consider a linear model, E[Y | X] = α + βX, where α and β are parameters. Then we can show that

$$ \alpha = E[Y] - \beta\, E[X], \qquad \beta = \frac{\operatorname{Cov}[X, Y]}{V[X]}. $$
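Remark 6 is the population analogue of the OLS slope; a simulated sanity check (a sketch of my own, with made-up true values α = 0.5 and β = 1.5):

```python
# In the linear model E[Y|X] = alpha + beta*X:
#   beta = Cov[X, Y] / V[X]  and  alpha = E[Y] - beta * E[X].
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)        # true alpha = 0.5, beta = 1.5

beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)                            # close to (0.5, 1.5)
```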
1.2.4 Bivariate Normal, Chi-squared, Student-t, and F Distributions
The bivariate normal distribution is the distribution of two normally distributed random
variables (X, Y). The PDF is a bit complicated:

$$ f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^{2} - 2\rho \left(\frac{x-\mu_x}{\sigma_x}\right)\!\left(\frac{y-\mu_y}{\sigma_y}\right) + \left(\frac{y-\mu_y}{\sigma_y}\right)^{2} \right] \right\}, $$

where ρ is the coefficient of correlation between X and Y.
If X and Y are bivariate normal with correlation coefficient ρ, then:

i. any linear combination of X and Y is also normal,

$$ (aX + bY) \sim N\big(a\mu_x + b\mu_y,\; a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\sigma_{xy}\big); $$

ii. their marginal distributions are normal,

$$ X \sim N(\mu_x, \sigma_x^2) \quad \text{and} \quad Y \sim N(\mu_y, \sigma_y^2); $$

iii. they are independent whenever X and Y are uncorrelated.

More generally, for n > 2 jointly normal random variables, any linear combination of them (including their sum) is also normal, and their marginal distributions are normal. In addition, if n jointly normal random variables are uncorrelated, they are independent. Together with the relationship in (3), this means that for jointly normal random variables, zero correlation is a necessary and sufficient condition for independence.
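Properties (i) and (ii) can be illustrated by simulation (a sketch of my own; the means, covariance matrix, and weights a, b are hypothetical):

```python
# Draws from a bivariate normal: a linear combination aX + bY is again normal
# with mean a*mu_x + b*mu_y and variance a^2*var_x + b^2*var_y + 2ab*cov_xy.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, 2.0])                       # (mu_x, mu_y), hypothetical
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])                    # covariance matrix, hypothetical

xy = rng.multivariate_normal(mu, cov, size=1_000_000)
a, b = 2.0, -1.0
w = a * xy[:, 0] + b * xy[:, 1]                 # linear combination aX + bY

print(w.mean(), a * mu[0] + b * mu[1])
print(w.var(), a**2 * cov[0, 0] + b**2 * cov[1, 1] + 2 * a * b * cov[0, 1])
```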
The chi-squared distribution with n degrees of freedom is defined as the distribution of the sum of n squared independent standard normal random variables:

$$ \text{If } Z_1, Z_2, \ldots, Z_n \text{ are independent and } Z_i \sim N(0, 1), \text{ then } \sum_{i=1}^{n} Z_i^2 \sim \chi_n^2. $$

Every elementary econometrics textbook should have a table of percentiles of the chi-squared distribution. Note that E[χ²_n / n] → 1 as n → ∞.
The Student-t or t distribution with m degrees of freedom is defined as the distribution of the ratio of a standard normal random variable to the square root of an independently distributed chi-squared random variable divided by its degrees of freedom m:

$$ \text{If } Z \sim N(0, 1) \text{ and } W \sim \chi_m^2 \text{ are independent, then } t \equiv \frac{Z}{\sqrt{W/m}} \sim t_m. $$

See any elementary econometrics textbook for a table of percentiles of the t distribution. Although the t distribution is bell-shaped like the normal distribution, it has slightly more mass in the tails, especially for small degrees of freedom m. When m approaches infinity, the t distribution can be approximated by the standard normal distribution:

$$ t_\infty = N(0, 1). $$
The F distribution with (m, n) degrees of freedom, denoted by F_{m,n}, is the distribution of the ratio of χ²_m/m to χ²_n/n, where χ²_m and χ²_n are independent. Precisely,

$$ \text{if } W \sim \chi_m^2 \text{ and } V \sim \chi_n^2 \text{ are independent, then } F \equiv \frac{W/m}{V/n} \sim F_{m,n}. $$

Note that as n approaches infinity, the denominator V/n converges to its mean of 1. Thus,

$$ F_{m,\infty} = W/m. $$
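The constructions above can be reproduced directly from standard normal draws and compared against SciPy's built-in distributions (a sketch of my own; the degrees of freedom m = 5, n = 10 are arbitrary):

```python
# Build chi-squared, t, and F random variables from standard normals
# and compare simulated 95th percentiles with scipy's theoretical values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
m, n, reps = 5, 10, 200_000

z = rng.standard_normal((reps, m))
chi2_m = (z**2).sum(axis=1)                     # sum of m squared N(0,1) draws ~ chi2_m

t_m = rng.standard_normal(reps) / np.sqrt(chi2_m / m)     # Z / sqrt(W/m) ~ t_m
chi2_n = (rng.standard_normal((reps, n))**2).sum(axis=1)
f_mn = (chi2_m / m) / (chi2_n / n)                        # (W/m)/(V/n) ~ F_{m,n}

print(np.quantile(chi2_m, 0.95), stats.chi2.ppf(0.95, m))
print(np.quantile(t_m, 0.95), stats.t.ppf(0.95, m))
print(np.quantile(f_mn, 0.95), stats.f.ppf(0.95, m, n))
```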
1.2.5 Sampling Distribution of the Sample Average
A sampling scheme is said to be random sampling if the objects are selected at random from a population and each member of the population is equally likely to be included in the sample. It would not be a random sample, for example, if we surveyed students in a pub for a study of university students' binge-drinking behaviour, since those in the pub are more likely to binge than those not in a pub. The data collected in a pub are thus not representative of, and hence not a random sample from, the population of university students.
The n observations in the sample are denoted by (Y_1, Y_2, …, Y_n), where Y_i is the i-th randomly selected observation. Under random sampling, Y_i and Y_j are independent for i ≠ j, so (Y_1, …, Y_n) are independently distributed. In addition, under random sampling, because Y_1, …, Y_n are drawn from the same population (and thus from the same distribution), each of them shares the same marginal distribution; we say (Y_1, …, Y_n) are identically distributed. Therefore, under random sampling, we write

$$ (Y_1, \ldots, Y_n) \sim \text{i.i.d.}, $$

or say that (Y_1, …, Y_n) are independently and identically distributed.
Remark 7. Suppose that a random sample of n observations (Y_1, …, Y_n) is i.i.d. and that Y_i ~ (μ, σ²) for each observation i = 1, 2, …, n. The sampling distribution of the sample average Ȳ is then

$$ \bar{Y} \sim (\mu, \sigma^2/n) $$

for any n.
Remark 8. If (Y_1, …, Y_n) are normally distributed, then Ȳ is normally distributed.
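Remarks 7 and 8 can be checked by simulation (a sketch of my own; the population values μ = 2, σ = 3 and the sample size n = 25 are arbitrary):

```python
# Sampling distribution of the sample average: mean mu and variance sigma^2 / n.
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 2.0, 3.0, 25, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))   # reps random samples of size n
ybar = samples.mean(axis=1)                       # one sample average per sample

print(ybar.mean(), mu)                 # E[Ybar] = mu
print(ybar.var(), sigma**2 / n)        # V[Ybar] = sigma^2 / n
```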
1.3 Problem Set A1
1. Prove Remark 3.
2. Prove Remark 4.
3. Prove Remark 6.
4. Show that the mean, variance, and covariance of a linear combination of random variables satisfy

$$ E[a + bX + cY] = a + b\,E[X] + c\,E[Y], $$
$$ V[a + bX + cY] = b^2 V[X] + c^2 V[Y] + 2bc\operatorname{Cov}[X, Y], $$
$$ \operatorname{Cov}[a + bX + cY,\, Y] = b\,\sigma_{XY} + c\,\sigma_Y^2, $$

where a, b, and c are parameters.
5. Show Remark 7.
6. Show Remark 8.
2 Large-sample Properties of Sample Average