[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
answers to your lab questions from friday
Dear Student:
One bad thing about working in the lab the way we do is that you
individually and collectively say "What's this?" "What's that" and we/I
don't always have calm reflection and thought to arrive at an answer.
You know the expression RTFM ?
Last friday, we were fussing over "how to calculate the model
Chi-Square" for Logistic regression. The answer is quite obvious in the
lrm document, and if we had bothered to read it, we would have known.
Using glm, the model Chi-Square can easily be calculated.
The attached R script steps through many of these points, telling you
different ways you might get these values to persuade you they are
right. I'd urge you to run it if you have a moment before class. I'll
bring copies for everybody.
The other thing I noticed while reviewing the casualty reports from
Friday was that I could do a better job on explaining how to create new
data frames in the predict statement. I've documented my findings in
the Rtip sheet
http://www.ku.edu/~pauljohn/R/Rtips.html#7.5
The big new thing there is an R method called "expand.grid", which can
do some of the horrible detail work for you. Suppose you have some
variable that ranges from 3 to 44, and you want predictions for 10
values between 3 and 44, as well as its mean.
So phony up a newX by taking a sequence from the minimum to the maximum:
seq( min(x), max(x), length.out=10)
and tack the mean onto the end of that and create a new variable:
newx <- c( seq( min(x), max(x), length.out=10), mean(x))
You can write a function to automate that, of course.
You can replace "mean" by "mode" for factor variables.
Suppose you create new variables to represent x, y and z in this way.
The totally awesome news I have to share now is that this creates a new
dataset you really really want:
mynewdf <- expand,grid(x=newx, y=newy, z=newz)
in predict as newdata
predict (mymod, newdata=mynewdf, type="response", se.fit=T)
--
Paul E. Johnson email: pauljohn_AT_ku.edu
Dept. of Political Science http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas Office: (785) 864-9086
Lawrence, Kansas 66044-3177 FAX: (785) 864-5700
#Paul Johnson pauljohn_AT_ku.edu 2005-11-14
#set the random generator's starting point so
#everybody will get the same results when they run this.
set.seed(12341234)
x1 <- rnorm (1000)
x2 <- rnorm(1000)
e <- rnorm(1000)
y <- 3 + 4*x1 -14*x2 + 7*e
ybin <> mean(y),1,0)
myglm1<-glm(ybin~x1+x2,family=binomial)
summary(myglm1)
# Observe the residual and null deviance. Those will be important later.
# Want to do the "likelihood ratio" test, dropping one variable at a time?
# In OLS, you'd say "there's no need, the t-test does that." But in
# the logit model, and some other models, there is the problem that even
# huge b's are not significant because the estimated standard error
# grows faster than the estimate of b (see Hauck-Donner, 1977).
# If you have the Hauck-Donner problem with the Wald Chi-Square. then it
# is recommended as a more powerful significance test.
drop1(myglm1,test="Chisq")
# Could I reproduce that the old fashioned way?
# YES! Here's my restricted model:
myglm2 <- glm(ybin~x1,family=binomial)
# Note that the following test shows the same test value as drop1.
anova(myglm2, myglm1,test="Chisq")
# You can do that by hand as well. Compare
summary(myglm2)
# Note Residual Deviance of the full model is 590.11.
# The Residual deviance for the restricted model is 1331.6
# The difference of those 2 deviances is:
myglm2$deviance-myglm1$deviance
# DO THE MATH: write down the formula for deviance, note
# that the saturated likelihood gets canceled out, so the
# difference of the 2 deviances is actually the Chi-Square
# Likelihood Ratio Test.
# Note the numbers match up exactly. drop1 really does work!
# Of course, the beauty of it is that it reports that
# same test for all variables with a single swipe.
# Note you can replicate the drop1 findings by restricting b2=0
myglm3 <- glm(ybin~x2,family=binomial)
anova(myglm3,myglm1,test="Chisq")
### It is customary in our field to report -2LLR as a summary
### statistic, rather than deviance. Very few people will know
### the term deviance. So, how to get the value for -2LLR,
### the so-called "Model Likelihood Ratio Chi-Square Statistic."
### If you are looking at glm output, it can be calculated as
### the difference between the residual and null deviances.
myLRT <- myglm1$null.deviance - myglm1$deviance
myLRT
## What's the p-value?
pchisq(myLRT,df=2, lower.tail=F)
# To make sure the number you get is correct/meaningful, let's repeat
# the same analysis with lrm from Design.
# If you RTFM for lrm, you see the stat output called Model LR is
# indeed the much sought after "Model Likelihood Ratio Chi-Square"
library(Design)
mylrm1 <- lrm(ybin~x1+x2)
mylrm1
# The number reported in lrm as "Model LR" has the EXACT SAME
# value that you get by subtracting the residual and null deviances
# from the GLM.
# The percent correctly predicted is a troublesome number, for
# many reasons. But editors and reviewers seem to want to know.
# If you *at least* calculate a classification table, you
# can think about wheter PPC is meaningful to you.
predp <- predict(myglm1)
predYbin <>0.5, 1, 0)
table(ybin, predYbin)
## Want some proportions with that?
prop.table(table (ybin, predYbin))
prop.table(table (ybin, predYbin), 2)
pprob1 <- predict(myglm1, newdata=data.frame(x1=x1,x2=mean(x2)),type="response", se.fit=T)
plot(x1,pprob1$fit, main="Estimated Probability Y=1 when x2 is fixed at its mean")
## It seems to me the only horrible loose end is this:
## What in the heck is anova() giving us and how does it make sense?
anova(myglm1,test="Chisq")
## Analysis of Deviance Table
## Model: binomial, link: logit
## Response: ybin
## Terms added sequentially (first to last)
## Df Deviance Resid. Df Resid. Dev P(>|Chi|)
## NULL 999 1386.04
## x1 1 44.04 998 1341.99 3.211e-11
## x2 1 751.89 997 590.11 1.559e-165
# reverse that:
myglm4 <- glm(ybin~x2+x1,family=binomial)
anova(myglm4,test="Chisq")
## Analysis of Deviance Table
## Model: binomial, link: logit
## Response: ybin
## Terms added sequentially (first to last)
## Df Deviance Resid. Df Resid. Dev P(>|Chi|)
## NULL 999 1386.04
## x2 1 681.03 998 705.01 3.991e-150
## x1 1 114.90 997 590.11 8.266e-27
## The output from anova() always has the residual deviance of the full
## model on the last line, because it is the "unexplained" variance
## that exists after all models have been fitted.
## These other values indicate, if we insert variables one-by-one, how
## much Does the Residual deviance decline?