Chapter 7
Regression models
7
...
A statistical model takes account of the inherent variability of the data: this
will arise because of measuring or observational errors or chance circumstances
...
7
...
A simple linear
regression model relates a response variable y to a single explanatory variable x
...
2
...
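$(x_n, y_n)$, then the linear model is of the
form:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$
where:
$\beta_0$ and $\beta_1$ are the parameters of the model,
$\beta_1$ is the gradient and $\beta_0$ is the intercept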
...
Example:
Children of specified ages are observed and the average oral vocabulary is recorded
...
The data are as follows:
age
words
1
3
1
...
5
446
3
896
3
...
5
1870
5
2072
6
2562
7
...
2 The Scatter Diagram:
A plot of the data provides a scatter diagram showing the relationship between x and y
...
[Figure: scatter diagram of WORDS (vertical axis) against AGE (horizontal axis)]
This plot is approximately linear (although there is some evidence of a curvilinear pattern at the
beginning and a levelling off at the end)
...
2
...
The estimates are denoted by
...
average oral vocabulary by age in years
Y = -763
...
926X
R-Sq = 0
...
2
...
Predicting equation:
$\hat{y} = -764 + 562x$
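As a quick illustration, the predicting equation can be evaluated at any age. The sketch below assumes the rounded coefficients quoted in the predicting equation (-764 and 562); the function name `predict_words` is hypothetical and not part of the notes.

```python
def predict_words(age_years):
    """Predicted average vocabulary from the rounded fitted line."""
    intercept = -764.0  # rounded intercept from the predicting equation
    slope = 562.0       # rounded gradient: extra words per year of age
    return intercept + slope * age_years


# Evaluate the line at a few ages, including age 0 (the intercept).
for age in (0, 3, 5):
    print(age, predict_words(age))
```

At age 0 the line gives a negative number of words, which shows why values of x far from the observed ages should be treated with care.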
...
Comments:
i)
ii)
Predicting outside the range of values of x on which the equation was estimated
should be done with caution
...
The
intercept represents the value of y when x is zero
...
There is a strong argument here for fitting a line through the origin, i.e., making the intercept
zero
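A line through the origin has only one parameter, the gradient. A minimal sketch, assuming the usual least-squares estimate for the model y = b x, namely b = sum(x*y) / sum(x^2); the data below are hypothetical and are not the vocabulary data.

```python
def slope_through_origin(x, y):
    """Least-squares gradient for the no-intercept model y = b * x."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx


# Hypothetical illustrative data, not taken from the notes.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b = slope_through_origin(x, y)
print(round(b, 3))
```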
...
2
...
2
...
4
...
The errors are random errors with mean zero and variance
...
The errors are assumed to have constant variance,
...
Thus the model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are independent
$N(0, \sigma^2)$ for all $i = 1, 2, \ldots, n$
...
2
...
The residuals, $e_i = y_i - \hat{y}_i$, provide an estimate of the errors, $\varepsilon_i$
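The residuals are the observed values minus the fitted values from the least-squares line. The following is a small sketch under the assumption that the line is fitted by ordinary least squares; the function names and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and gradient for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y):
    """Observed minus fitted values for the least-squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
e = residuals(x, y)
print([round(ei, 3) for ei in e])
```

With an intercept in the model, the residuals sum to zero apart from rounding error, which is a useful check on the arithmetic.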
...
2
...
Method
1
AGEa
Enter
a
...
b
...
Error of
Model
R
R Square
R Square
the Estimate
a
1
...
985
...
74
a
...
Dependent Variable: WORDS
The R Square ($R^2$) measures the % of variation explained by the model
...
5 % of the variation in the y-variable (words) was explained by the model
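The percentage of variation explained can be computed directly from the observed and fitted values. A minimal sketch, assuming the standard definition R Square = 1 - (residual sum of squares) / (total sum of squares); the numbers are hypothetical.

```python
def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_resid / ss_total


# Hypothetical observed and fitted values.
y = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(round(100 * r_squared(y, fitted), 1))  # percentage explained
```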
...
Coefficientsa
Model
1
a
...
Error
Beta
-763
...
250
561
...
290
...
656
23
...
95% Confidence Interval for B
Lower Bound Upper Bound
...
362
-560
...
000
505
...
939
i) Inference about $\beta_1$, the slope
...
error
...
926
with std
...
29
...
=2
...
Hence 95% CI for $\beta_1$ is given by
561
...
29(2
...
913, 617
...
Hypothesis test about $\beta_1$:
$T = \hat{\beta}_1 / \text{std. error}(\hat{\beta}_1)$
$H_0: \beta_1 = 0$ No relationship between age and vocabulary
$H_1: \beta_1 \neq 0$ There is a linear relationship
Observed T = 23
...
000
Hence very strong evidence to reject the null hypothesis and conclude that there is a linear
relationship between age and vocabulary
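The confidence interval and the test statistic for the slope both use the estimate and its standard error. The sketch below assumes the usual forms, estimate ± t × std. error and T = estimate / std. error, with 8 degrees of freedom (n - 2 for n = 10). The estimate and standard error in the example are hypothetical, not the values from the output.

```python
T_CRIT_8DF = 2.306  # upper 2.5% point of the t-distribution with 8 df


def slope_ci(estimate, std_error, t_crit):
    """Two-sided confidence interval: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


def t_statistic(estimate, std_error):
    """Test statistic for H0: slope = 0."""
    return estimate / std_error


# Hypothetical estimate and standard error, for illustration only.
est, se = 10.0, 2.0
lower, upper = slope_ci(est, se, T_CRIT_8DF)
t_obs = t_statistic(est, se)
print(round(lower, 3), round(upper, 3), t_obs)
print(abs(t_obs) > T_CRIT_8DF)  # True means reject H0 at the 5% level
```

Rejecting the null hypothesis at the 5% level is equivalent to the 95% interval not containing zero.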
...
656 and
p = 0
...
c) Test of the significance of the regression using the F-distribution
...
b
...
8
7403118
Predictors: (Constant), AGE
Dependent Variable: WORDS
df
1
8
9
Mean Square
F
Sig
...
725 535
...
000
13629
...
The hypotheses may be
written as:
$H_0: \beta_1 = 0$ (y does not depend on x), model: $y_i = \beta_0 + \varepsilon_i$
$H_1: \beta_1 \neq 0$ (y does depend on x), model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Test Statistic
Reject $H_0$ if
...
Otherwise accept the null hypothesis and conclude that y does not depend on x
...
185 and p = 0
...
F > = 5
...
Very, very strong evidence to reject the null hypothesis and accept the
alternative
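The F statistic in the ANOVA table is the regression mean square divided by the residual mean square. A minimal sketch, assuming that standard definition; the sums of squares used here are hypothetical, while the degrees of freedom (1 and 8) match the table for the vocabulary data.

```python
def f_statistic(ss_regression, ss_residual, df_regression, df_residual):
    """F = (regression mean square) / (residual mean square)."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual


# Hypothetical sums of squares; df are 1 (regression) and 8 (residual).
f_obs = f_statistic(ss_regression=90.0, ss_residual=10.0,
                    df_regression=1, df_residual=8)
print(f_obs)
```

For a simple linear regression the F statistic equals the square of the T statistic for the slope, so the two tests lead to the same conclusion.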
...
7
...
8 Residual Analysis to validate the assumptions of the model
...
Plot of residuals v fitted values
The points should be randomly scattered about zero with fairly constant ‘spread’ if the assumptions of
independence and homogeneity of variance are valid
...
[Figure: scatterplot of Regression Standardized Residual (ZRESID) against standardized predicted value (ZPRED); Dependent Variable: WORDS]
The observations
may not be independent and also the linear fit in the model may be inadequate
...
[Figure: histogram of Regression Standardized Residual; vertical axis: Frequency; Dependent Variable: WORDS; N = 10]
Comment: Histogram looks relatively symmetric
...
[Figure: normal probability plot of the standardized residuals; axis label: Expected Cum Prob]
Hence Normality assumption looks valid
...
2
...
It is important to predict within a sensible range of values - the range of values on
which the model was based should be satisfactory
...
The prediction interval for a single value will always be wider than that for the ‘line’ or
‘expected value of y given x’
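The reason the prediction interval is wider can be seen from the two standard errors. The sketch below assumes the usual simple-regression formulas: the standard error for the line at x0 is s × sqrt(1/n + (x0 - x̄)²/Sxx), and for a single new value it is s × sqrt(1 + 1/n + (x0 - x̄)²/Sxx). The function name and data are hypothetical.

```python
import math


def mean_and_prediction_se(x, y, x0):
    """Return (fitted value, s.e. for the line, s.e. for a single value) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance estimate on n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    core = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    fitted = b0 + b1 * x0
    # The extra "1 +" term is the variance of one new observation.
    return fitted, math.sqrt(s2 * core), math.sqrt(s2 * (1.0 + core))


# Hypothetical illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
fitted, se_line, se_single = mean_and_prediction_se(x, y, x0=4.0)
print(se_single > se_line)
```

Both intervals are then fitted value ± t × standard error, so the larger standard error always gives the wider interval.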
...
Prediction of vocabulary when age is 5 years
...
95% CI is (1918
...
504)
95% PI is (1747
...
751)
The limits for the above two intervals can be obtained from the Save option within the
Regression procedure
...
2
...
The regression line can be influenced strongly by certain observations
...
This observation is said to be strongly influential
...
[Figure: plot of y against x with the fitted regression line]
e
...
Note that these points have
low leverage, as they are near the centre of the x’s
...
[Figure: Linear Regression plot of y against x, annotated with the fitted equation and R-Square]
2
...
The effect of a point, or set of points, can be determined by finding the change in sum of
squares when the point or points are removed
...
A quick approximate way of finding the likelihood displacement is to use Cook’s distance
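Cook's distance for a simple linear regression can be computed from the residuals and the leverages. This is a sketch under the assumption of the usual formula D_i = e_i² h_i / (p s² (1 - h_i)²) with p = 2 parameters. The x-values are those of the example with a single observation at x = 10; the y-values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and gradient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def cooks_distances(x, y):
    """Cook's distance for each observation in a simple linear regression."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    out = []
    for xi, ei in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of this point
        out.append(ei * ei * h / (p * s2 * (1.0 - h) ** 2))
    return out


x = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 10.0]
y = [2.0, 3.1, 2.7, 4.2, 4.8, 6.1, 5.6, 4.0]  # hypothetical y-values
distances = cooks_distances(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
print(x[most_influential])
```

With these hypothetical y-values the largest distance belongs to the point at x = 10, which is both far from the other x's and off the trend of the remaining points.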
...
e
...
The case of several points cancelling each other out, as in (iii), is not picked up
by Cook’s distance
...
Again, Cook’s distance
will not necessarily pick this up
...
We here do this for the first graph (i
...
the data with only one observation at x
= 10)
...
x
1
2
2
3
4
5
5
10
y
2
...
9
3
...
2
5
...
08411
0
...
00739
0
...
2075
0
...
0381
4
...
16071
0
...
07143
0
...
01786
0
...
64286
Unlike most of the statistics you have encountered, Cook’s distance does not have an
associated significance test
...
A convenient way of doing this is to plot Cook’s distances against the observed x-value or, perhaps preferably, to plot the Cook’s distances against an index denoting the
number of the observation
...
[Figure: Cook’s distance plotted against index]
Having found what appears to be an influential observation (in this case, the observation at
x=10), the observation should be checked for any obvious recording error
...
Cook’s distance can be very useful
...
Here is our example with the two points at x=10
...
91 to 0
...
x
1
2
2
3
4
5
5
10
10
y
2
...
9
3
...
2
5
...
00015
0
...
00011
0
...
03491
0
0
...
09582
1
...
15278
0
...
08081
0
...
00505
0
...
00126
0
...
32323
Note also that the leverage of an observation is sometimes quoted
...
As noted above, a high leverage does not imply that a point is unduly influencing the
regression line
...
It may be noted that SPSS seems to standardise the
leverages that it outputs in an unusual way
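The leverages can be checked by hand. The sketch below assumes the usual simple-regression leverage h_i = 1/n + (x_i - x̄)²/Sxx, and assumes that the standardisation SPSS uses is the centred leverage h_i - 1/n; that assumption appears consistent with the surviving leverage digits for the single-point-at-x=10 data. The function names are hypothetical.

```python
def centred_leverages(x):
    """Centred leverage (x_i - x_bar)^2 / Sxx for each observation."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [(xi - x_bar) ** 2 / sxx for xi in x]


def leverages(x):
    """Ordinary leverage h_i = 1/n + centred leverage."""
    n = len(x)
    return [1.0 / n + c for c in centred_leverages(x)]


# x-values from the example with one observation at x = 10.
x = [1, 2, 2, 3, 4, 5, 5, 10]
print([round(c, 5) for c in centred_leverages(x)])
print(round(sum(leverages(x)), 5))  # ordinary leverages sum to p = 2
```

The point at x = 10 has by far the largest leverage, but as the notes say, high leverage alone does not show that the point is distorting the fitted line.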
...
7
...
A multiple linear regression model relates a
response variable Y to more than one explanatory variable
...
We are usually looking for the ‘best’
subset of the explanatory variables
...
3
...
i 1,2,
...
Example
Are the size and weight of your brain indicators of your mental capacity? In this study by
Willerman et al
...
The researchers take into account gender and body size to draw
conclusions about the connection between brain size and intelligence
...
(1991) conducted their study at a large southwestern university
...
These subjects were drawn from a larger pool of introductory psychology students with total
Scholastic Aptitude Test Scores higher than 1350 or lower than 940 who had agreed to satisfy
a course requirement by allowing the administration of four subtests (Vocabulary,
Similarities, Block Design, and Picture Completion) of the Wechsler (1981) Adult
Intelligence Scale-Revised
...
The MRI Scans were
performed at the same facility for all 40 subjects
...
The computer counted all pixels with non-zero gray scale in each of the 18 images
and the total count served as an index for brain size
...
2
...
4
...
6
...
Gender: Male or Female
FSIQ: Full Scale IQ scores based on the four Wechsler (1981) subtests
VIQ: Verbal IQ scores based on the four Wechsler (1981) subtests
PIQ: Performance IQ scores based on the four Wechsler (1981) subtests
Weight: body weight in pounds
Height: height in inches
MRI_Count: total pixel Count from the 18 MRI scans
The Data:
Gender FSIQ
Female 133
Male 140
Male 139
Male 133
Female 137
Female 99
Female 138
Female 92
Male 89
Male 133
Female 132
Male 141
Male 135
Female 140
Female 96
Female 83
Female 132
Male 100
VIQ
132
150
123
129
132
90
136
90
93
114
129
150
129
120
100
71
132
96
PIQ
124
124
150
128
134
110
131
98
84
147
124
128
124
147
90
96
120
102
Weight Height MRI_Count
118
64
...
72
...
3 1038437
172
68
...
0 951545
146
69
...
5 991305
175
66
...
3 904858
172
68
...
5 833868
151
70
...
0 924059
155
70
...
0 878897
135
68
...
5 852244
178
73
...
186
122
132
114
171
140
187
106
159
127
191
192
181
143
153
144
139
148
179
66
...
0
...
5
62
...
0
63
...
0
68
...
0
63
...
5
62
...
0
75
...
0
66
...
5
70
...
5
74
...
5
808020
889083
892420
905940
790619
955003
831772
935494
798612
1062462
793549
866662
857782
949589
997925
879987
834344
948066
949395
893983
930016
935863
A new variable SEX was created from Gender, with Male=0 and Female=1
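The recoding of Gender into a 0/1 indicator can be sketched as follows. The function name `code_sex` is hypothetical; the five values shown are the first five Gender entries in the data listing.

```python
def code_sex(gender):
    """Recode Gender as an indicator variable: Male = 0, Female = 1."""
    mapping = {"Male": 0, "Female": 1}
    return [mapping[g] for g in gender]


# First five Gender entries from the data listing.
gender = ["Female", "Male", "Male", "Male", "Female"]
sex = code_sex(gender)
print(sex)
```

With this coding, the coefficient of SEX in a regression is the estimated difference in the response for females relative to males, holding the other variables fixed.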
...
3 2 The Multiple Scatter Diagrams:
The response variable y and all the continuous x variables are plotted against each other
...
Also there is some high correlation between some of the
explanatory variables
...
3
...
Regression
b
Variables Entered/Removed
Model
1
a
...
Variables
Entered
Male=0
Female=1,
PIQ,
WEIGHT,
HEIGHT, a
VIQ, FSIQ
Variables
Removed
Method
...
Dependent Variable: MRI_CNT
a) Variation explained by the model
Model Summary
Std
...
808
...
585
46759
...
Predictors: (Constant), Male=0 Female=1, PIQ,
WEIGHT, HEIGHT, VIQ, FSIQ
The variation explained here is 65
...
b) Testing whether the x-variables jointly are significant
...
b
...
27E+11
6
...
95E+11
6
31
37
Mean Square
F
2
...
684
2186409811
Sig
...
000
Predictors: (Constant), Male=0 Female=1, PIQ, WEIGHT, HEIGHT, VIQ, FSIQ
Dependent Variable: MRI_CNT
If the regression is not significant, then y does not depend on the x’s
...
Conclude that y does depend on x
...
The F-value is 9
...
000, a result that is highly
significant indicating that the x-variables jointly are significant
...
Coefficientsa
Model
1
a
...
Error
Beta
206819
...
2
-9389
...
638
-3
...
765
2761
...
704
6287
...
270
1
...
015
485
...
028
6883
...
980
...
7 24529
...
295
t
...
019
1
...
489
...
146
-1
...
...
052
...
018
...
040
...
e
...
507
...
Hypothesis test about $\beta_i$:
$H_0: \beta_i = 0$ No linear relationship between $x_i$ and y given the rest of the x-variables
$H_1: \beta_i \neq 0$ There is a linear relationship between $x_i$ and y given the rest of the x-variables
$T = \hat{\beta}_i / \text{std. error}(\hat{\beta}_i)$
489 and p = 0
...
Also, from the table, HEIGHT seems significant given the rest of the variables
...
3
...
There are many ways to construct a ‘best’ regression equation from a large set of x-variables
...
Forward selection: We start with the constant and add only significant variables
...
The following output uses the option STEPWISE
...
2
PIQ
...
...
050, Probability-of-F-to-remove >=
...
Stepwise (Criteria: Probability-of-F-to-enter <=
...
100)
...
050, Probability-of-F-to-remove >=
...
Dependent Variable: MRI_CNT
Firstly SEX, secondly PIQ and finally Height are found to be significant
...
The output below shows the R-squares, the ANOVA table and the regression coefficients
from all three models
...
Note that the final prediction model is
MRI_CNT=353207
...
4x(if female)+1267
...
095xHEIGHT
b) The R-Squares for the three fitted models
Model Summary
Adjusted
Std
...
649
...
405
55951
...
738
...
518
50352
...
778
...
570
47576
...
Predictors: (Constant), Male=0 Female=1
b
...
Predictors: (Constant), Male=0 Female=1, PIQ, HEIGHT
c) The ANOVA table for the three fitted models
ANOVAd
Model
1
2
3
a
...
21E+10
1
...
95E+11
1
...
87E+10
1
...
18E+11
7
...
95E+11
1
36
37
2
35
37
3
34
37
Mean Square
F
8
...
229
3130555653
a
...
304E+10
2535367942
20
...
000
3
...
355
c
...
Predictors: (Constant), Male=0 Female=1, PIQ
c
...
Dependent Variable: MRI_CNT
Sig
...
Unstandardized Coefficients
Standardized Coefficients
B
Std
...
7 13187
...
1 18178
...
649
829137
...
670
-90976
...
728
-
...
148
366
...
351
353207
...
8
-54561
...
139
-
...
677
351
...
395
6447
...
449
...
472
-5
...
344
-5
...
074
1
...
454
3
...
281
Sig
...
FSIQ
VIQ
PIQ
WEIGHT
HEIGHT
FSIQ
VIQ
WEIGHT
HEIGHT
FSIQ
VIQ
WEIGHT
Beta In a
...
223
a
...
173
a
...
329
b
-
...
187
b
...
179
c
-
...
051
t
2
...
796
3
...
060
1
...
023
-
...
284
2
...
569
-
...
324
Sig
...
022
...
004
...
157
...
479
...
029
...
864
...
377
...
290
...
461
...
176
...
238
...
173
...
122
...
215
...
364
...
099
...
030
...
056
...
Predictors in the Model: (Constant), Male=0 Female=1, PIQ
c
...
Dependent Variable: MRI_CNT
...
000
...
000
...
106
...
001
...
1: Regression analysis
1
...
age
words
1
3
1
...
5
446
3
896
3
...
5
1870
5
2072
6
2562
Use ANALYSE/REGRESSION/LINEAR to declare your y and x variables and fit the model
...
i)
Tick Histogram and Normal probability plot
ii)
Declare ZRESID as Y and ZPRED as X
...
The accompanying data are for 10 makes of car
...
The data are taken from Hogg, R
...
and Ledolter, J
...
New York: Macmillan
...
Weight
3
...
80
4
...
20
2
...
90
2
...
70
1
...
40
Fuel
5
...
90
6
...
30
3
...
60
2
...
60
3
...
90
The data are on the shared drive (K):\Sctms\som\ma2013\data\fuel
...
i)
ii)
Check Cook’s distance for any influential points
...
Show that the
slope is not greatly different
...
iv)
Fit a simple linear regression line and show that the constant is not needed
...
Find a 95% confidence interval for the gradient of the regression line which
passes through the origin
...
1
The fuel example
i)
Write down the simple linear regression equation with the constant included
...
iv)
How can you show that the slope is not greatly different when you fit the regression
line through the origin with the most influential point weighted out?
v)
Find a 95% confidence interval for the gradient of the regression line which passes
through the origin when all 10 cars are included
...
2: Multiple Regression analysis
1
...
The data are on the shared drive
(K):\Sctms\som\ma2013\data\brain
...
i)
Use Graph/scatter to produce a multiple plot for the brain data
...
iii)
Play with the different selection options
a
...
Backward
c
...
Are those
plots consistent with the assumption of normality?
2
...
This dataset contains concentrations of various
chemicals in 30 samples of mature cheddar cheese, and a subjective measure of taste for
each sample
...
The variable "Lactic" has
not been transformed
...
sav and try to identify
which explanatory variables are needed in the models to predict taste
...
Questions for practical 7
...
How do we interpret the coefficients in a multiple regression equation?
What does the ANOVA table test in the regression analysis context?
How are individual coefficients tested given all other variables?
What are the different selection techniques for selecting the ‘best’ subset of the x-variables?
What assumptions do we require for the regression analysis to be valid
...
What is the % variation explained in the final model?
How can you test that the final model is significant overall?
Test, using a t test, whether the coefficient of sex is significant or not
...
Interpret the coefficient for SEX
...