Title: Statistics Extensive Guide
Description: This study guide will take you from basic statistical principles such as study designs to when to use specific tests, how to read SPSS output, and how to conduct statistical analyses. Odds ratios, Confidence Intervals, T-Tests, Chi Square Tests, Mann Whitney U Tests, Fisher's Exact Test, Wilcoxon Signed-Rank Test, McNemar Test, Linear Regression, Correlations, Multiple Regression, and dummy coding are all covered. These notes came from the Research Methods course at King's College London at the Master's level. They were used for the final exam; as a result I received a distinction mark (equivalent to a 4.0 GPA in the USA).
Document Preview
Extracts from the notes are below; to see the PDF you'll receive, please use the links above.
Contents
Study Design
Prospective Studies
Nested Case Control
Factorial Design
Matching- Why?
Selection Bias
Performance Bias
Attrition Bias
Validity
External Validity
Construct Validity
Ecological Validity
Effect Size
Sample Sizes
Continuous
Discrete
Bivariate or Binomial
Distributions
Normal (Gaussian)
Sampling Distributions
Measures of Central Tendency
Median
Proportion
Variance
Risk Ratio
Odds
Relative Risk and Odds Ratio Calculation Example
Standard Deviation
Confidence Intervals
Histograms
QQ Plots
Hypothesis Testing
Type 1
The power of a statistical test
Observed Confounder
Matched Case Control
Simpson's paradox
Comparing Two Samples using CI
Binomial test
Transformations
Exact Non-Parametric Tests
Chi Square Cross Tabs
Expected Value in a Chi Square
Tests for Paired Samples
SPSS Output
McNemar test
Y = β0 + β1x + ε
Linear Regression SPSS Output
Standardisation/ Z Score/ Coefficients
SPSS Output
Fitting Linear Models- Maximum Likelihood
R in Regression
Binary Predictor Values
Multiple Groups & Predictors
F Test Versions (numerator and denominator)
Multiple Regression
Interpreting a Regression equation Example
Unadjusted association
Introducing a New Covariate
Modeling Interaction Effect
Maternal Sensitivity Example
Dummy Variables
Study Design
Retrospective Studies
Outcome first (fixed), then work backwards to identify exposures (the outcome is predetermined)
o Can't use relative risk
Subjects sampled by outcome status
Comparing cases (subjects w/disease) vs
...
Estimate prevalence of disease and exposure (prevalence study)
Not suitable for rare diseases or rare exposures
Difficult to differentiate cause and effect
Nested Case Control
Follow a group of people, some will develop the disease, use those that don’t as the
control group
Cohort Study
Follow up 2 or more groups to see how a disease develops over time (groups identified
before appearance of disease)
Factorial Design
To answer 2 separate research questions in a single sample of subjects
Allows you to examine interactions
o Ex: Drugs A and B, Drug A and Placebo, Placebo and Drug B, 2 placebos
Cross-Over Design
To make between group and within group analyses
o One group placebo, one group treatment, allow wash out period and then swap
conditions
Matching- Why?
Matching at the design stage is carried out for two main reasons:
o To improve the precision of estimators and power of tests
We want to reduce the amount of sampling error in the estimate (reduce
standard error or variance of estimator)
...
Biases
The systematic over or under estimation of the true measurement
Statistically, bias is the difference between the expected value of the estimate and the
population value
the difference between the mean of the sampling distribution and the population
parameter
o In regression, Bias = E(β̂) – β
Selection Bias
Using selective cases, limits generalizability
Using selective controls- invalid estimation of relative risks
Recall Bias
Difficulties in assessing the direction and extent of the bias
Performance Bias
Unequal provision of care to the comparison group, apart from the intervention being evaluated
Detection Bias
Are the participants, providers, and assessors aware of the allocation (looks at blinding)
Attrition Bias
Systematic differences between groups in dropouts/withdrawals from the study
Systematic Bias
Related to validity, as a systematic bias can reflect the fact that the measure actually measures something other than that intended
A systematic distortion in the estimated E-O association
Validity
Internal Validity
Ability of subjects to provide valid and reliable data
Expected compliance with regimen
Low dropout rate
External Validity
Generalizability of the results
Enhancing the power of the trial
Face Validity
Does the test appear to measure what it aims to test (subjective)
Construct Validity
“the degree to which a test measures what it claims, or purports, to be measuring”
...
0.2 is small,
...
0.8 is large
EX:
...
e
...
There is an inverse square root relationship between
confidence intervals and sample sizes
...
0.5 (half the distribution)
...
1.96 (leaves 0.025 in each tail, i.e. 2 x 0.025 = 0.05, which is the p-value)
Sampling Distributions
If you take different sample scores, you get a different distribution each time
...
I.e. the distribution of the statistic across all possible samples
...
Splitting data into 4 equal parts
IQR=Q3-Q1
Q2 is the median
...
Average of the squared deviations (differences) of the measurements from the mean
...
Risk is a simple probability: number of desired outcomes (successes) / number of total possible outcomes; the relative risk (RR) is the ratio of the risks in the two groups
...
In a case-control
study, the outcomes are predetermined, so calculating the RR is redundant
...
4.2 times the risk (as likely) of post-operative wound infection compared to patients who did not undergo incidental appendectomy
% increase = (RR - 1) x 100, e.g. (4.2 - 1) x 100 = 320%
Those who had the incidental appendectomy had a 320% increase in risk of getting a post-operative wound infection
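The RR arithmetic above can be sketched in a few lines of Python; the 2x2 counts below are made up, chosen only so that the relative risk comes out at 4.2 as in the example.

```python
# Hypothetical counts (not from the notes): infections / group size.
a, n_exposed = 42, 100     # with incidental appendectomy
b, n_unexposed = 10, 100   # without

risk_exposed = a / n_exposed
risk_unexposed = b / n_unexposed

rr = risk_exposed / risk_unexposed
pct_increase = (rr - 1) * 100  # % increase = (RR - 1) x 100

print(round(rr, 2), round(pct_increase))  # 4.2 320
```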
...
The OR represents the odds that an outcome will occur given a particular exposure, compared
to the odds of the outcome occurring in the absence of that exposure
...
The odds calculation is different from the risk calculation: odds = number of successes / number of non-successes; the OR is the ratio of the two odds
...
(Exposed here meaning, for example, exposure to a
treatment/allergen/virus, etc
...
3 with a 95%CI: 0
...
4
...
o The odds of panic attacks in the fluoxetine arm are 0
...
o The odds of a man drinking wine are 90 to 10 (x/(n-x)), or 9:1
o The odds of a woman drinking wine are only 20 to 80, or 1:4 = 0.25
o The odds ratio is thus 9/0.25 = 36
...
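The wine example translates directly into code: the odds for each group are x/(n − x), and the OR is their ratio.

```python
# Odds = successes / non-successes, i.e. x / (n - x).
men_wine, men_total = 90, 100
women_wine, women_total = 20, 100

odds_men = men_wine / (men_total - men_wine)          # 90/10 = 9.0
odds_women = women_wine / (women_total - women_wine)  # 20/80 = 0.25

odds_ratio = odds_men / odds_women
print(odds_ratio)  # 36.0
```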
OR Calculation Example: A cohort study of students finds that those with higher levels of
anxiety have a pass-rate of 80% while those with low levels have a pass rate of 50%
...
996 is substantially larger than 1, the value it would be for no association or equal rates of
high scoring depression in the two groups
...
838, 5
...
15
When to use RR, When to use OR
Example: To study divorce and depression
o Ruth samples 50 women from the divorce court and 50 married women and follows them for a
year to assess their depression
...
In a case-control
study, the outcomes are predetermined, so calculating the RR is redundant
...
Standard deviation is the square root of the variance
...
o
Standard Error
A measure of how much an estimate will vary over repeated samples
How different the estimate will be each time you re-run the experiment with a different sample
o The more different, the more spread, and the more error
Standard error is the standard deviation of the sampling distribution of an estimate (not of the raw sample)
The standard error of our estimated mean decreases (precision increases) as the sample size increases
To estimate it in a large sample, divide the standard deviation by the square root of n
Confidence Intervals
Gives a range of values within which the true value of the population parameter is believed to lie
Measures of precision= how wide are the 95% confidence intervals
Margin of Error
Calculates the amount of variation in your estimate you would get if you repeatedly took new samples of the
same size
Lower limit = estimate (Mean) - 1.96 x SE
Upper limit = estimate (Mean) + 1.96 x SE
o Use 1.65 for a 90% confidence interval
SE is calculated by SD/√n OR σ/√n
Example:
N=491
x̄ = 170
...
1
SE= 7
...
7=
...
7 – 1
...
32 = 170
CI Upper = 170
...
96 x
...
o In regression: B/standard error gives the t statistic; the CI is B ± 1.96 x standard error
For smaller samples, the CI is better defined by using values from the t distribution (2
...
There is an inverse square root relationship between confidence
intervals and sample sizes
...
From past experience, they know σ= 18
...
Give a 95% confidence interval for
the average blood pressure of all patients
...
Hint: approximate 1.96 by 2 for easy calculation;
Hint: remember that standard deviations (sds) and standard errors (ses) have
different uses
...
Lower limit = p̂ – 1.96 x SE(p̂), Upper limit = p̂ + 1.96 x SE(p̂)
o Example: p̂ = 509/1000 = 0.509
o SE(p̂) = √(p̂(1 – p̂)/n) = √(0.509 x 0.491/1000) = 0.0158
o Lower limit: 0.509 – (1.96 x 0.0158) = 0.478
o Upper limit: 0.509 + (1.96 x 0.0158) = 0.540
...
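The 509-out-of-1000 proportion above can be checked with the standard large-sample formula; a minimal sketch using only the Python standard library:

```python
import math

x, n = 509, 1000
p = x / n                          # 0.509
se = math.sqrt(p * (1 - p) / n)    # SE(p) = sqrt(p(1-p)/n)

lower = p - 1.96 * se
upper = p + 1.96 * se
print(round(se, 4), round(lower, 3), round(upper, 3))  # 0.0158 0.478 0.54
```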
Whiskers extend up to 1.5 x IQR beyond Q1 and Q3
o beyond this we call those values outliers
...
These scores have a long upper tail and are bunched up at the bottom
...
The plot suggests that the CSA=no group might
have potential outliers, but none of these are more exceptional than those in the CSA=yes group
...
o If you have a graph like this, the whiskers are the
confidence intervals
...
If the data are normal the dots appear as a straight diagonal line
Scatter Plots
Display 1: Scatter graph for baseline bdi scores vs 6 month bdi scores
...
Q1
...
You can see that both the baseline and the
outcome scores have some skew
...
Though these scores may be skewed, the change in score is probably normally distributed (which is what we will need
for the paired t-tests that follow)
Hypothesis Testing
Ho = null hypothesis
H1 = alternative hypothesis
Null is written as: Ho : μ = μo
Possible Alternatives
o H1 : μ > μo or H1 : μ < μo which are both one sided
o H1 : μ ≠ μo (2 sided)
Test statistic, its p-value
Rejection Region
Conclusion/decision
Types of Error
Type 1
Rejecting the null when it isn't true
False positive
Significance level (α) is the probability of type 1 error
Type 2
Accepting the null when it is false
False negative
Probability of type 2 error = β
The power of a statistical test = (1 – β), which is the probability of making a correct decision by rejecting the null hypothesis
Confounds
Observed Confounder
A variable associated with both the exposure and the outcome that is not on the causal pathway
o Example: exposure- abuse >>>>> outcome- depression
o Confound = poor care in the family
o [Diagram: the E–O association (marked “?”), with confounder C linked to both E and O]
Matching
Match exposed group and non-exposed group on the confound
...
o Ex: Same level of care for those abused and not abused
Matched Case Control
Match case and non-case on the confound
...
Example: white felons get death penalty more than black felons without stratified
analysis, but when accounting for the ‘extraneous variable’ of the victim’s race, black
felons are more likely to get the death penalty
• A stratified analysis involves the following steps:
o 1. ...
o 2. Estimate the association between exposure and outcome within each stratum separately;
o 3. ... carry out a test for this combined index
...
Drug A reduced 2 symptoms with a 95% CI [
...
7]
...
o For example, whether the proportion of females (female) differs significantly from 50%, i.e. from 0.5
...
(mean 1 – mean 2) ± 1.96 x SE(mean 1 – mean 2)
o – for lower limit, + for upper limit
SPSS Output
t = (mean 1 – mean 2) / SE(mean 1 – mean 2), with n-2 degrees of freedom
If Levene's test is significant for equal variances, use the second row (equal variances not assumed)
Homogeneity of variance
Using this example: “There was strong evidence (p<
...
9 cm taller than girls (95% CI from 8
...
7 cm)
T = mean difference/standard error
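An independent-samples t-test like the one behind this SPSS output can be run with scipy (assumed available here); the two height samples below are made up. The equal_var flag mirrors the Levene's-test choice: switch to equal_var=False when Levene's test is significant.

```python
from scipy import stats

# Made-up samples, not the data behind the SPSS output above.
boys = [152, 158, 160, 163, 165, 170, 172]
girls = [148, 150, 153, 155, 158, 160, 162]

# t = mean difference / standard error of the difference
t, p = stats.ttest_ind(boys, girls, equal_var=True)
print(round(t, 2), round(p, 3))
```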
Transformations
Consider using a symptom count as a measure
...
If we rank the scores then we are also applying a transformation, which stretches the scale at those parts of the distribution where there are a lot of scores and compresses it where there are few
...
If we apply a parametric transformation (e.g. log(1+score)) to make the distribution more normal, we are changing the scale, making the difference between 3 and 2 symptoms smaller than the difference between 2 and 1 symptom
...
Assumes that 2 groups follow identical distributions, differing only by location
o The difference in location is the shift between the medians
o Ho: shift = 0 vs H1: shift ≠ 0
SPSS Output
Exact Non-Parametric Tests
Can be used for ordinal data
When ordinal outcomes with few possible categories are compared ties might arise
Software typically assigns them the mean rank
o e.g. if the 2nd, 3rd and 4th largest values all take the same value they would all be given the rank “3”
To solve: Construct an exact p-value by numerical simulation – This approach is preferred in the presence of
ties
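Recent versions of scipy (assumed here) expose exactly this choice in the Mann-Whitney U test: method="exact" enumerates the permutation distribution for small samples instead of using the normal approximation. The data below are made up and tie-free.

```python
from scipy import stats

group_a = [3, 5, 6, 9, 12]
group_b = [1, 2, 4, 7, 8]

u, p = stats.mannwhitneyu(group_a, group_b,
                          alternative="two-sided", method="exact")
print(u, round(p, 3))  # U statistic for group_a is 18.0
```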
...
For a 2x2 table (ex: male/female by intervention/control) this gives (2-1) x (2-1) = 1 degree of freedom
...
5%
...
Comparing the observed and expected counts, the expected count being calculated on the basis of the null hypothesis that the rates in the two groups are the same, the observed frequency is 76, more than double the expected frequency of 35
...
It is suitable for unpaired data from large samples
...
o For example, let's suppose that we believe that the general population consists of 10% Hispanic, 10%
Asian, 10% African American and 70% White folks
...
Calculate the chi square statistic x² by completing the following steps:
o For each observed number in the table subtract the corresponding expected number (O – E)
o Square the difference (O – E)²
o Divide the squares obtained for each cell in the table by the expected number for that cell [(O – E)² / E]
o Sum all the (O – E)² / E values. This is the chi square statistic
05)
...
When P <
...
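The (O − E)²/E recipe above, spelled out on a made-up 2x2 table (scipy assumed for the p-value):

```python
from scipy import stats

observed = [[30, 20],   # group 1: success / failure (made-up counts)
            [15, 35]]   # group 2: success / failure

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected count under H0
        chi2 += (o - e) ** 2 / e               # (O - E)^2 / E, summed

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2-1) x (2-1) = 1
p = stats.chi2.sf(chi2, df)
print(round(chi2, 2), df)  # 9.09 1
```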
o
o sd is the standard deviation of the differences
o n now is the number of pairs
95% confidence interval is given by
Null hypothesis: population means are the same
The paired t-test compares the mean of the within-pair differences to zero, with n – 1 degrees of freedom
...
1 years for husbands and 31
...
o Thus the estimated difference is 1.9 years (SE 0.5 years)
o 95% CI: 1.9 – 2 x 0.5 = 0.9 and 1.9 + 2 x 0.5 = 2.9, i.e. (0.9, 2.9)
o The difference in mean ages is statistically significant at the 5% level (t(99) = 3.8, p < 0.001)
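The paired-sample arithmetic above (difference 1.9 years, SE 0.5 years, t on 99 df, i.e. 100 pairs) can be reproduced from the summary figures alone; scipy is assumed for the p-value.

```python
from scipy import stats

mean_diff, se, n_pairs = 1.9, 0.5, 100

t = mean_diff / se                          # 3.8
p = 2 * stats.t.sf(abs(t), df=n_pairs - 1)  # two-sided p-value

# Rough 95% CI using 2 x SE, as in the notes
lower, upper = mean_diff - 2 * se, mean_diff + 2 * se
print(t, round(lower, 1), round(upper, 1))  # 3.8 0.9 2.9
```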
...
o The test statistic (“ Z”) consists of rank sums converted so that it can be compared
against a standard normal distribution to generate a p-value
...
The non-parametric equivalent of the dependent t-test
o Tests medians
Used to compare 2 sets of scores that come from the same participants
o Dependent variables should be ordinal or continuous
o Independent variable should have matched pairs (two categorical related groups)
McNemar test
For paired binary data
o Ex: – wifcat ‘1’ if wife’s age >= 40 years, ‘0’ otherwise
– huscat ‘1’ if husband’s age >= 40 years, ‘0’ otherwise
The McNemar test formally assesses the H0 that the distribution of a binary outcome (e.g. the proportion of ‘1’s) is the same in each of the groups making up the pairs
o – E.g. the proportion of individuals who are 40 years old or older at marriage is the same for males and females
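Only the discordant pairs enter the McNemar statistic, x² = (b − c)²/(b + c) on 1 df. The counts below are made up (b = wife ≥ 40 but husband < 40, c = the reverse); scipy is assumed for the p-value.

```python
from scipy import stats

b, c = 5, 15  # discordant pair counts (hypothetical)

chi2 = (b - c) ** 2 / (b + c)  # classic (uncorrected) McNemar statistic
p = stats.chi2.sf(chi2, df=1)
print(chi2, round(p, 3))  # 5.0 0.025
```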
...
Linear Regression
Most general way to measure association between exposure and outcome; assumes the relationship between the two variables is linear
o Continuous outcome = use linear regression
o Binary Outcome = use logistic regression
Line of best fit
o Estimates coefficients for slope and intercept of the line
X predicts y, x is the independent variable, y is the dependent variable
Y = β0 + β1x + ε
o
𝛽0 = c = constant = intercept = value of y when x = 0
If the intercept is zero then y increases in direct proportion to x (i.e. if x doubles then y also doubles)
o 𝛽1 = m = slope = gradient = line goes up when positive, down when negative = direction of the effect =
magnitude of the effect = change in y when x changes by one unit
o ε = y – E[y] is the error/residual
Difference between the expected and observed value of the outcome
Standard error measures uncertainty in the estimates and can be used to get a confidence interval
Confidence interval is given by β1 ± 1.96 x SE(β1)
If the distribution is normal, the points on such a plot should fall close to the diagonal
reference line
Variance Homogeneity- The error terms have the same variance irrespective of the values of x (variance does
not depend on x)
Linearity- There is a linear relationship between x and y
o Can generate a plot of predicted values and those that you observed to see if the relationship is indeed
linear
The points should be symmetrically distributed around a diagonal line in the former plot or around a horizontal line in the latter plot, with a roughly constant variance
Independence- The observations y are independent from each other (that is, conditional on x, the error terms are also independent)
o To test for violations of independence, you can look at plots of the residuals versus independent
variables
...
Be alert for evidence of residuals that grow larger as a function of the predicted value and/or the independent variable
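A minimal least-squares fit matching the formulas above, using numpy (assumed) on made-up height/weight-style data; the slope CI uses the large-sample β1 ± 1.96 x SE form quoted earlier.

```python
import numpy as np

x = np.array([150., 155., 160., 165., 170., 175., 180.])  # made-up heights
y = np.array([45., 50., 52., 58., 60., 66., 69.])         # made-up weights

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx  # slope
b0 = y.mean() - b1 * x.mean()                       # intercept

residuals = y - (b0 + b1 * x)          # ε = y - E[y]
s2 = np.sum(residuals ** 2) / (n - 2)  # residual variance
se_b1 = np.sqrt(s2 / sxx)              # SE of the slope

ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)
print(round(b0, 2), round(b1, 3), round(ci[0], 3), round(ci[1], 3))
# -74.86 0.8 0.727 0.873
```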
Linear Regression SPSS Output
E(weight) = -46
...
63 * height)
• The intercept (B) is interpreted as the predicted value of the dependent variable (here, weight) when the independent
variable (height) is zero
• So here we estimate that a 16-year-old who is 0cm tall will weigh -46
...
63kg for each one cm increase in height”
T= B/standard error
The effect of height on weight is
...
562,0
...
The new variable then has a mean of zero
...
But if you centre height, the intercept is the predicted weight of a child of
average height
The slope doesn’t change, but the intercept does which makes the interpretation easier
...
Subtract the mean and then divide the predictor by its standard deviation (this creates a Z score)
Can then run the regression model using the standardized height rather than the original variable
...
02kg
difference in weight
Can standardize both variables:
B z score / Std error z = t
Z score= (Sample mean – Reference mean)/Standard Error
Standardized Coefficients
Impact of your predictor
A 1 standard deviation change in x yields a change of beta (the standardized coefficient) standard deviations in y
1 standard deviation difference in height is associated with a
...
o ε² = (yobs – yfitted)², i.e. (yobs – yfitted) x (yobs – yfitted)
In this case the line that makes a² + b² + c² + d² … as small as possible
...
It is also the percentage of the response variable variation that is explained by a linear model
...
In this SPSS output: Height explains 27% of the variation in weight
Adjusted R Square accounts for variance that gets explained by chance
With just one predictor, the F equals the value of the t in the coefficients table, but squared
ANOVA Output from regression
o Total Sum of Squares: ∑(Weight – mean weight)²
Sum of squared deviations
o Regression Sum of Squares: sum of squared deviations from the mean that your predictor
explains
o Residual sum of squares: sum of squared deviations from the mean that your predictor does
not explain
o F Statistic: mean square regression/mean square residual
With only one predictor, F equals the value of t squared
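The one-predictor identity F = t² can be verified numerically; numpy/scipy are assumed and the data are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.arange(30, dtype=float)
y = 2.0 * x + rng.normal(0.0, 5.0, size=30)  # simulated linear data

res = stats.linregress(x, y)
t = res.slope / res.stderr  # t for the slope
r2 = res.rvalue ** 2        # share of variance in y explained by x

# F from the ANOVA decomposition: MS_regression / MS_residual
fitted = res.intercept + res.slope * x
ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - fitted) ** 2)
f = (ss_total - ss_resid) / (ss_resid / (len(x) - 2))

print(round(f / t ** 2, 6))  # 1.0 -- F equals t squared
```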
Fitting Linear Models- Maximum Likelihood
chooses the line under which the observed data was most likely to have occurred
o Needs to know the distributions under which the errors arise
o OLS = ML if the errors have normal distribution
If deviations are assumed to be normally distributed, then this line is also the maximum likelihood—the
line that makes observed data seem as likely an occurrence as possible
o
Residual- discrepancy between expected and observed
o Residuals should be normally distributed
o The difference between the observed value of the dependent variable (y) and the predicted
value (ŷ)
Pearson’s R Correlation
R in Regression
The Pearson product moment correlation r between two continuous variables measures the strength of their linear relationship
...
• Correlation ONLY measures LINEAR association
• We use the Greek letter ρ (pronounced “roe”, usually spelt “rho”) to represent the true correlation
between two variables, and r to represent our estimate from a sample
In a regression model the multiple correlation, R, measures the (Pearson) correlation between the
observed and predicted values of y
In a simple linear regression model (one independent variable) the correlation between the response
and explanatory variable is the multiple correlation
...
The square of the multiple correlation, R2, measures the amount of variance in y that can be explained
by differences in x
...
52
...
”
Spearman Correlation
is used when one or both of the variables are not assumed to be normally distributed and interval (but
are assumed to be ordinal)
...
defined as the Pearson correlation after replacing data points by their ranks
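That definition is easy to check: Spearman's coefficient equals Pearson's computed on the ranks (scipy assumed, toy data).

```python
from scipy import stats

x = [10, 20, 35, 40, 55]
y = [1, 3, 2, 5, 4]

rho, _ = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(round(rho, 3), round(r_on_ranks, 3))  # 0.8 0.8 -- identical
```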
Binary Predictor Values
Everything stays the same, same equation
However, have to give numeric values to your binary conditions
o Example: male= 0, female=1
Expected value is just beta 0 (since β1 x 0=0 leaving E(Y)=β0)
In the example of males =0 females= 1, β1 is going to describe the difference between females and
males
...
Multiple Groups & Predictors
Example Question: Are impairment scores for 3 different diagnoses different?
Between 3 groups, there are 3 different pairs of contrasts we can make, but if you know 2 of them, you can then
calculate the 3rd
o Ex: (g1-g2)+(g2-g3)=(g1-g3)
There are “2 degrees of freedom”
o We can choose to estimate any two differences, the third will then be known
However, it matters which 2 we choose when we test them one at a time
F-tests resolve this problem
F Test
Goes with multiple groups and predictors
Recognizes that the 3rd comparison is determined by the other 2
Identifies that there is a difference but not where the difference lies
F= (squared difference of the mean)/(error variance of difference)
F= t² and the p value that you get from the t-test and the F-test will be the same
F Test Versions (numerator and denominator)
Result is written as: F(numerator degrees of freedom, denominator degrees of freedom)
o Numerator df: # of groups -1
o Denominator df: sample size- # of groups
Example: with 3 groups and a sample size of 3 in each group (n=9), we estimate 2 differences
and a common residual variance giving F(2,6)
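The df bookkeeping from this example (3 groups, n = 9, so F(2, 6)) with a one-way ANOVA in scipy (assumed); the group scores are made up.

```python
from scipy import stats

g1 = [4.0, 5.0, 6.0]  # made-up scores, 3 per group
g2 = [6.0, 7.0, 8.0]
g3 = [9.0, 10.0, 11.0]

# Numerator df = 3 - 1 = 2, denominator df = 9 - 3 = 6: F(2, 6)
f, p = stats.f_oneway(g1, g2, g3)
print(round(f, 2), round(p, 4))  # 19.0 0.0025
```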
Multiple Linear Regression
Same thing as simple linear regression but with multiple predictors
SPSS Output
Coefficients
Model
a
Unstandardized
Coefficients
B
Std
...
815
-1
...
233
soc3
-1
...
Beta
1
...
791
...
084
-1
...
167
1
...
045
-
...
375
-1
...
161
-
...
201
...
723
1
...
125
-2
...
031
soc6
-3
...
661
-
...
959
...
358
1
...
031
-
...
435
(Constant)
1
a
...
707 cm shorter than those in social class 1
T—t test testing differences between each predictor and the reference category
ANOVA Output for multiple regression- R Square
Tests for any difference with a single test
If you take all the groups together, there isn't evidence of a significant difference between social classes in height
Model Summary
[Model Summary table: R = .087, with R Square, Adjusted R Square and Std. Error of the Estimate columns; not fully legible in this preview]
a. Predictors: (Constant), soc7, soc6, soc3, soc5, soc2, soc4
More predictors explain more variance, but much of the extra variance is related by chance
R = correlation coefficient
r² indicates the strength of the regression equation, which is used to predict the value of the y variable
ANOVAa
[ANOVA table: Regression df = 6, Residual df = 993, Total df = 999, with Mean Square, F and Sig. columns; not fully legible in this preview]
a. Predictors: (Constant), soc7, soc6, soc3, soc5, soc2, soc4
o 0= poor predictor 1= excellent predictor
Adjusted r square takes into account the variance that we explain by chance
o Ex: coincidentally sampling more short people from a poor social class
Multiple Regression
To adjust for other variables, just include them in your regression model (for multiple linear regression, the
fitted model is a hyperplane rather than a line)
o E(Y) = β0 + β1 X1 + β2 X2
The interpretation of any particular regression coefficient is that it is the increase in the outcome for a one unit
increase in the particular variable, keeping all other variables fixed
For instance, β1 is the expected increase in Y (the outcome) for every one unit increase in X1 keeping X2 fixed
...
Most of the difference in weights of men and women comes about because of differences in height
...
1 + (0
...
3 * sex)
We can use this to make predictions: the expected weight of a 170cm tall man is
o -52
...
67 * 170) = 62
...
1 + (0
...
34 = 67
...
672kg increase in weight (previous estimate was 0
...
34kg heavier
The intercept is interpreted as the expected weight of a man (because Male=0) with height 0cm
...
9*Y1990 - 10
...
The coefficient of Y1990 indicates that other things being equal, houses in this
neighbourhood built after 1990 command a $33
...
Similarly, houses on the East side cost $10
...
Thus, NW serves as the baseline or
reference level for E and SE
...
[Path diagram with labelled coefficients and p-values; not legible in this preview]
Have to look at the direction of the arrows
...
The Question thus asks what is the difference in the expected
anger proneness of a female child with average maternal sensitivity and low MAOA activity genotype compared to the
reference group
...
045*female=
...
Only
coefficients involving maternal sensitivity are involved
...
330 (boys are negatively
responsive) Responsiveness of girls is -
...
347=0
...
330+0
...
222 MAOA-H girls responsiveness is -0
...
552+0
...
285=0
...
For example, if you have one column for East (where 1 = on the east side, 0 otherwise) and one for SE (where 1 = on the south east side, 0 otherwise), you do not need a variable for NW, because setting both E and SE to 0 indicates that it's the third option
Note that this coding only works if the three levels are mutually exclusive (no overlap) and exhaustive (no other levels exist for this variable)
...
That is, one dummy variable cannot be a constant multiple or a simple linear relation of another
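The E/SE/NW coding described above, sketched with numpy (assumed); the location rows are hypothetical.

```python
import numpy as np

locations = ["E", "SE", "NW", "E", "NW"]  # hypothetical rows

east = np.array([1.0 if loc == "E" else 0.0 for loc in locations])
south_east = np.array([1.0 if loc == "SE" else 0.0 for loc in locations])

# Design matrix: intercept + two dummies; NW is the reference level,
# recovered whenever east == south_east == 0.
X = np.column_stack([np.ones(len(locations)), east, south_east])
print(X.shape)  # (5, 3)
```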
...
Which Test to Use and When