Title: The Stata Bible: Everything you need for Econometrics in Stata
Description: Here's a simple, practical and complete guide about everything you need to know and be able to perform in order to go through Applied Econometrics and get a guaranteed first class (ECN336, QMUL).

----------------------------------------------------------------------------------------

~ IMPORTING DATA ~
----------------------------------------------------------------------------------------

Clear, set the directory and import (
...
\Downloads"
use dataset
...
csv):
clear
cd "C:\Users\
...
csv
insheet using dataset
...
csv, delim(;)
(OR)
insheet using dataset
...
& aux_black_pop2010!=
...

What is the average age of the sample?
summ age
Which is the region with the most women interviewed in the sample? (frequency)
tab region
How many women are not working? (CATEGORICAL DATA to NUMERICAL)
tab employment //DOES NOT WORK
tab employment, nol
How many of the sampled women living in Scotland work? (CAT
...
& x2!=
...

We can also sort observations by two variables (sort x1 x2).


----------------------------------------------------------------------------------------

~ DATA MANIPULATION ~
----------------------------------------------------------------------------------------

If the first line gives the variable names and data only starts from line 2 on,
insheet using gini
...
If values are delimited by ";" and not by "," you need
to use the option:
delimiter(";")

Missing values are not denoted by “
...

We can use the destring command to force STATA to interpret the data as numbers
...
On the other hand, we cannot use this loop for groups
of variables with different names)

foreach is useful when we want to loop over variables with different names (foreach LOOP)
local groupname 1974 1975 1976
foreach i of local groupname {
rename yr`i' y`i'
}
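As a minimal sketch of the two loop types, assuming a toy dataset created on the spot (the variables yr1974 to yr1976 and the local name years are made up for illustration):

```stata
clear
set obs 3
* forvalues loops over a numeric range: create yr1974, yr1975, yr1976
forvalues i = 1974/1976 {
    gen yr`i' = _n
}
* foreach loops over an explicit list stored in a local macro
local years 1974 1975 1976
foreach i of local years {
    rename yr`i' y`i'
}
describe
```

Run the whole block in one go from a do-file: a local macro disappears once the run that defined it ends.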

Renaming variables
rename countryname country
rename countrycode ccode
drop seriesname seriescode //(if unnecessary)

We can format the width of columns:
format country %30s
We can now drop observations that have all missing values
...
Could it be the case that some country never reports? (DROP ROWS)
egen report=rowtotal(y1981-y2016), missing
tab country if report==
...

drop report

We now introduce the command reshape
...
etc, observation that has data for several years (year=columns)
...

reshape long y, i(country) j(year)
To return to the original data:
reshape wide y, i(country) j(year)
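A small sketch of the round trip between wide and long, assuming a made-up two-country dataset (the values and the country names "A" and "B" are purely illustrative):

```stata
clear
input str10 country y1981 y1982
"A" 30 31
"B" 40 41
end
* wide to long: one row per country-year, the year is taken from the suffix of y
reshape long y, i(country) j(year)
list
* long back to wide: one row per country, one column per year
reshape wide y, i(country) j(year)
list
```

In the long format each country appears once per year, which is the layout panel commands such as xtset expect.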

Write a foreach loop to replace all n
...
with readable missing values:
local group x1 x2 x3 x4 x5 …
foreach c of local group {
destring `c', replace force
}
Count and drop observations with all missing values:
count if x1==
...
& x3==
...
& x5==
...
…
browse if x1==
...
& x3==
...
& x5==
...
…
drop if x1==
...
& x3==
...
& x5==
...
…
Using a foreach loop, create deviation-from-the-average variables:
local group X1 X2
foreach i of local group {
by code: egen mean_`i'=mean(`i')
gen `i'_dev=`i'-mean_`i' if `i'!=
...

corr Y X1, c //(OR cov)
di -7996
...
472
Calculate the coefficient of the constant in the regression of Y on X1 without actually running the regression
...
257-(b1*21
...

reg Y X1
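The slope and the constant can be recovered from the covariance, the variance and the sample means, without running the regression. A sketch on Stata's built-in auto dataset, assuming price plays the role of Y and mpg the role of X1 (the scalar names are illustrative):

```stata
sysuse auto, clear
* covariance of Y and X1, and variance of X1
quietly corr price mpg, cov
scalar b1 = r(cov_12)/r(Var_2)
* sample means of Y and X1
quietly summ price
scalar ybar = r(mean)
quietly summ mpg
scalar xbar = r(mean)
* constant = mean(Y) - b1*mean(X1)
scalar b0 = ybar - b1*xbar
di b1
di b0
* compare with the regression output
reg price mpg
```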
Show in graph that the slope and the intercept are in line with results
...

Calculate the t-statistic of the b1 coefficient
di (b1-b1h0)/SEb1
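As a sketch, the same calculation with the stored results of a regression on the built-in auto dataset (the null value b1h0 = 0 is an assumption chosen for illustration):

```stata
sysuse auto, clear
reg price mpg
scalar b1 = _b[mpg]        // estimated coefficient
scalar SEb1 = _se[mpg]     // its standard error
scalar b1h0 = 0            // value of b1 under H0 (assumed here)
di (b1 - b1h0)/SEb1
* two-sided p-value from the t distribution with n-k degrees of freedom
di 2*ttail(e(df_r), abs((b1 - b1h0)/SEb1))
```

With b1h0 = 0 the result should match the t column of the regression table.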
In the regression, is there scope for adjusting for heteroskedasticity? (RVFPLOT)
reg y x1 x2
predict res,res
predict yhat,xb
scatter res yhat
(or)
reg y x1 x2
rvfplot, yline(0)

Now we adjust for heteroskedasticity
reg y x1 x2, robust
rvfplot, yline(0)
(OR, now let's see if there is a need to adjust for heteroskedasticity)

scatter res yhat, mlab(msa) mlabsize(tiny) msize(tiny) yline(0)
rvfplot, mlab(msa) mlabsize(tiny) msize(tiny) yline(0)
//the two graphs are identical! IN THIS CASE, WE USE HETTEST

hettest //(if heteroskedastic, then run the robust regression)
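A sketch of the test-then-adjust sequence on the built-in auto dataset, assuming price, weight and mpg stand in for y, x1 and x2. The robust option changes only the standard errors: coefficients and residuals stay the same, which is why the residual plot looks identical before and after.

```stata
sysuse auto, clear
reg price weight mpg
scalar b_ols = _b[weight]
* Breusch-Pagan test, H0: constant variance
estat hettest
scalar p_het = r(p)
di p_het
* if H0 is rejected, re-estimate with heteroskedasticity-robust SEs
reg price weight mpg, vce(robust)
di _b[weight] - b_ols     // zero: the coefficient is unchanged
```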


----------------------------------------------------------------------------------------

~ RESIDUALS ~
----------------------------------------------------------------------------------------

Analyze the residuals of your regression
...

reg Y X1, robust
----------------------------------------------------------------------------------------

~ GOODNESS OF FIT ~
----------------------------------------------------------------------------------------
Run regression and see goodness of fit:
reg violentcrime prop_black
*You can either exploit the information from the table…
*R-sq
di ESS/0
...

reg Y X1 //(save b1univ)
reg Y X1 X2 //(save b1multiv and b2multiv)

What is the bias on X1?
di b1univ-b1multiv
How can we retrieve the bias to double-check? We are missing one element: the slope from regressing X2 on X1
reg X2 X1 //(save the coefficient on X1 as b1univ_3)
di b2multiv*b1univ_3
Did we expect this direction of the omitted variable bias? Discuss (INTUITIVELY!)
corr Y X1 X2
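A sketch of the double-check on the built-in auto dataset, assuming price is Y, mpg is X1 and weight is the omitted X2 (the scalar d1 is an illustrative name for the slope of X2 on X1). The two displayed numbers should coincide, because the bias equals b2multiv times that slope.

```stata
sysuse auto, clear
reg price mpg
scalar b1univ = _b[mpg]
reg price mpg weight
scalar b1multiv = _b[mpg]
scalar b2multiv = _b[weight]
* auxiliary regression of the omitted variable on the included one
reg weight mpg
scalar d1 = _b[mpg]
di b1univ - b1multiv      // bias on X1
di b2multiv*d1            // same number, retrieved from the auxiliary regression
```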


----------------------------------------------------------------------------------------

~ F-STATISTIC ~
----------------------------------------------------------------------------------------

Calculate manually the F-test:
egen av_Y=mean(Y)
gen res_Y=Y-av_Y
gen res_Y_sq=res_Y^2
egen sum_res_Y_sq=sum(res_Y_sq)
//notice that this is the residual sum of the squares of the restricted model

rename sum_res_Y_sq rrss
format rrss %20
...
95)^2
//note the rounding

IF F-TEST IS WRONG:
drop av_Y *res_Y*
egen av_Y=mean(Y) if X1!=
...
& X3!=
...

gen res_Y=Y-av_Y
gen res_Y_sq=res_Y*res_Y
egen sum_res_Y_sq=sum(res_Y_sq)
disp ((RRSS-URSS)/q)/(URSS/(n-k))

//now I get the right number!
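A sketch of the manual F-test for overall significance on the built-in auto dataset, assuming price is Y and mpg and weight are the regressors (all scalar names are illustrative). The restricted model contains only the constant, so its RSS is the total sum of squares of Y.

```stata
sysuse auto, clear
reg price mpg weight
scalar F_reg = e(F)          // F reported by Stata
scalar urss = e(rss)         // unrestricted RSS
scalar n = e(N)
scalar k = e(df_m) + 1       // slopes plus the constant
scalar q = e(df_m)           // number of restrictions
* restricted RSS = sum of squared deviations of Y from its mean
quietly summ price
scalar rrss = r(Var)*(r(N) - 1)
scalar F_manual = ((rrss - urss)/q)/(urss/(n - k))
di F_manual
di F_reg
```

Note the parentheses around n - k: without them Stata divides URSS by n and then subtracts k.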

----------------------------------------------------------------------------------------

~ DFbeta ~
----------------------------------------------------------------------------------------
How many observations are influential in terms of years of education according to the DFbeta method?
reg av_trust av_ethnicgroups av_yearsed av_popij av_temp
dfbeta av_yearsed
sort _dfbeta_1
Calculate threshold:
disp 2/sqrt(67)=0
...
24433 & _dfbeta_1!=
...
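A sketch of the DFbeta check on the built-in auto dataset, assuming mpg is the regressor of interest and using the threshold 2/sqrt(n) from the notes (thr is an illustrative scalar name):

```stata
sysuse auto, clear
reg price mpg weight
scalar thr = 2/sqrt(e(N))    // threshold 2/sqrt(n)
dfbeta mpg                   // creates _dfbeta_1
di thr
* count and list the influential observations
count if abs(_dfbeta_1) > thr & _dfbeta_1 != .
list make _dfbeta_1 if abs(_dfbeta_1) > thr & _dfbeta_1 != .
```

The condition _dfbeta_1 != . matters because Stata treats missing values as larger than any number.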
Calculate the F-test using both methods
...

Show results are the same as directly calculating the test in STATA after unrestricted model

*the restricted model is therefore: Y=sum_X1_X2_X3 X4

gen sum_X1_X2_X3=X1+X2+X3
reg Y sum_X1_X2_X3 X4
di ((RRSS-URSS)/q)/(URSS/(n-k)) //first method
di ((R2U - R2R)/1)/((1 - R2U)/70) //second method
We want to test b1=1
...
Do the results make sense?
Are they identical to those obtained using the other method? Why?

di ((R2U - R2R)/1)/((1 - R2U)/70) //second method

Frisch-Waugh theorem: Partialling out interpretation (partial slope coefficients)
reg price mpg foreign
predict res1, res
reg weight mpg foreign
predict res2, res
reg res1 res2
//and we obtain the coeff of X1 (weight) in the initial multiple regression (reg price weight mpg foreign)
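A sketch that checks the partialling-out result numerically on the built-in auto dataset (the scalar names b_full and b_fwl are illustrative; predict double is used only to limit rounding error):

```stata
sysuse auto, clear
* full multiple regression: keep the coefficient on weight
reg price weight mpg foreign
scalar b_full = _b[weight]
* partial the other regressors out of Y ...
reg price mpg foreign
predict double res1, res
* ... and out of X1
reg weight mpg foreign
predict double res2, res
* residual-on-residual regression
reg res1 res2
scalar b_fwl = _b[res2]
di b_full
di b_fwl
```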


replace the string value of the variable "description"
...
Use the command split (and look at the option "parse"!)
split area, p(,)
split area2
replace area21="TX" if area21=="TX-Texarkana"
Tabulate and summarize the observations: which state has the most MSAs? (notice you
should select either avwage or numjob so as not to duplicate the observations!)
by state: tab msa if description=="avwage"
by state: summ time1969 if description=="avwage"
Tell Stata that the time-invariant component is the MSA-linecode pair, and what changes is the year
reshape long time, i(msa linecode) j(year)
*Now better, but it still doesn't work
...

***********************************************
Rearrange to make it suitable for regression analysis (merge)
The problem is that the two variables are stacked on top of each other, so we can't use them in a regression!
The easiest way to solve the problem is to make two separate datasets and MERGE them together
First, save the full reshaped dataset
save msa_work_reshape, replace
Now that we have the main dataset at hand, generate two separate datasets from it
...
Don't forget to sort the dataset like the other one
Now we can proceed to the merge
...

*Idea: you specify the variables that are used to uniquely identify observations in the two datasets
...

clear


use jobnum
merge 1:1 fips year using avwage
*the command "merge" generates a variable _m that tells you how many observations are matched and how many are not
drop _m
format msa %20s
save full_msa_work,replace
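A self-contained sketch of a 1:1 merge, assuming two tiny made-up datasets keyed on fips and year (all values are invented, and a tempfile stands in for the saved jobnum file):

```stata
clear
input fips year numjob
1 1990 100
1 1991 110
2 1990 200
2 1991 210
end
sort fips year
tempfile jobnum
save `jobnum'

clear
input fips year avwage
1 1990 20
1 1991 21
2 1990 30
2 1991 31
end
sort fips year
* fips and year together uniquely identify observations in both datasets
merge 1:1 fips year using `jobnum'
tab _merge                  // 3 = matched in both datasets
drop _merge
list
```

Here every observation is matched; in real data, check the table of _merge before dropping it.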
***********************************************
Now we can proceed to regression analysis
Generate lagged values and growth rates
...
Call it "after"
gen after=1 if year==1992
replace after=0 if after==
...
Call it "njafter"
gen njafter=nj*after

Diff-in-Diff by hand
tab nj after //this would give you frequencies
tab nj after, summ(fte) //this would give you all the sum stats
tab nj after, summ(fte) means //this only gives you the mean!

DID = (Y2,TREAT-Y2,CONTR) - (Y1,TREAT-Y1,CONTR)
di (17
...
2538)-(17
...
3)

[2
...
328]

reg fte after nj njafter //(after: time period dummy, nj: treatment dummy, njafter: interaction term)
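A sketch showing that the by-hand DiD and the coefficient on the interaction term coincide, assuming a made-up dataset of eight restaurants (the fte values are invented and the scalar names are illustrative):

```stata
clear
input nj after fte
0 0 20
0 0 22
0 1 19
0 1 21
1 0 17
1 0 19
1 1 20
1 1 22
end
gen njafter = nj*after
* the four cell means
quietly summ fte if nj==1 & after==1
scalar y2t = r(mean)
quietly summ fte if nj==0 & after==1
scalar y2c = r(mean)
quietly summ fte if nj==1 & after==0
scalar y1t = r(mean)
quietly summ fte if nj==0 & after==0
scalar y1c = r(mean)
* DID = (Y2,TREAT-Y2,CONTR) - (Y1,TREAT-Y1,CONTR)
scalar did = (y2t - y2c) - (y1t - y1c)
di did
reg fte after nj njafter
di _b[njafter]
```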


~ 2WAY F
...
– LSDV - F
...
~
Show we get the same coefficient of the DiD estimator with a two-way FE model
(later we discuss why the s
...
are so different)

Generate the time fixed-effects (time dummies, year)
tab year,gen(year) //[generates two year dummies]

Define the panel variables (TIME INVARIANT - TIME VARIANT)
xtset restaurant year

Run the regression
xtreg employment njafter year1 year2,fe

[bnjafter=2
...

quietly tab restaurant,gen(id) //[generates a lot of dummies!]

reg employment njafter year1 year2 id*,noc

[bnjafter=2
...
E
...

sort year
by year:egen `c'_m_year=mean(`c') //generate the time mean
...

gen `c'_tilda=`c'-`c'_m_rest-`c'_m_year+`c'_m_sample //generate new variable

}

sort restaurant year

Run the regression
reg fte_tilda njafter_tilda, noc

Correct the SE
di sqrt(697/347) // (N*T-1)/[(N-1)*(T-1)-k]

di 1
...
8314387


k = n
...
E
...
E
...
of bs

di 1
...
3609877

Show in a graph that the identification only comes from the treatment group
twoway (scatter fte_tilda2 njafter_tilda2, mlab(nj) msize(tiny) ///
mlabsize(tiny) jitter(2) legend(off) ytitle(demeaned(fte)) ///
xtitle(demeaned(affected)))

~ Different S
...
in DiD and tw F
...
~
Notice that the SE of the DiD are much larger than those in the two-way FE! This
is because there is much between-variation left in the DiD
To see this, compare the variables in the FE estimator (in its LSDV version,
explicitly including all individual and period FE) against the DiD
Do it on the marriage dataset (easier to see!)
...
D
...
D
...
Do it with a loop
...

sort id

local group wage marr //wage is Y, marr is the TIME VARIANT regressor
foreach c of local group {
by id: gen `c'_l=`c'[_n-1] //id is the TIME INVARIANT identifier
by id: gen d`c'=`c'-`c'_l
}

reg dwage dmarr, noc
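A sketch of the first-difference loop on a made-up three-person, two-year panel (all values are invented). The manual differences can be checked against Stata's D. operator once the panel is declared with xtset.

```stata
clear
input id year wage marr
1 1 10 0
1 2 12 1
2 1 8 0
2 2 9 0
3 1 11 1
3 2 11 1
end
sort id year
foreach c in wage marr {
    by id: gen `c'_l = `c'[_n-1]     // lag within id, missing in the first year
    by id: gen d`c' = `c' - `c'_l    // first difference
}
xtset id year
gen check = D.wage                   // built-in first difference
list id year wage dwage check
reg dwage dmarr, noc
```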


Remember:


When a variable that should be a number is saved as a string, something is wrong



When you import data from Excel save the Excel file in the comma-separated values (
...
”



Variable names in Stata cannot be numbers (a name cannot start with a digit)



When you have a sequence of variables with similar names that differ only in the year they refer to, it is more efficient to use the forvalues loop

* Asterisk: you can use the asterisk for variable names in commands
...

STRINGS
Nonnumeric data are recorded as strings, typically enclosed in double quotes such as “UK”
...
They are stored as str# where # is an integer specifying the number of characters in the string (between 1 and 2045 in Stata 13 and later; older versions allowed fewer)
...
