Hello, I have a problem when running the all-possible-subsets option in logistic regression (performed on SAS 6.11): I do not get the classification table! Is it possible to get the table when the all-possible-subsets option is chosen? Thank you in advance! Arild Breistøl
In article <589cqn$1rdo@yuma.ACNS.ColoState.EDU>, Tim Dierauf writes:
> As I remember, pre-5.0 Statistica has the K-S test.

Statistica v4.5 has the Shapiro-Wilk, Lilliefors, and K-S tests in the Descriptive Statistics routine. -- Chris Crocker
G Asha (asha@CAS.IISC.ERNET.IN) wrote:
: My friend has collected data which has the following info.
: Dependent variable: Achievement in Biology
: Ind. variables: self-confidence, home adjustment, health adjustment, etc. (11 in total).
: She needs to do multiple (step-wise) regression analysis. I am familiar with multiple regression, but do not know how to go about step-wise regression. Can someone please advise me about this? Even a ref book or some simple stat package will do.

-- I have seen how-to advice posted before.
-- If you have a stat package, it usually has examples. Or, Windows/graphical approaches let you build your request.
-- For most purposes, it is not wise to run step-wise regression. Below, I am including the commentary and references posted before by Frank Harrell, which I have mentioned before (on some .stat. groups).

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh

*** ================== posting on stepwise analyses ======
Frank E Harrell Jr feh@biostat.mc.duke.edu
Associate Professor of Biostatistics, Division of Biometry, Duke University Medical Center
Newsgroups: sci.stat.consult
Subject: Reasons not to do stepwise (or all possible regressions)
Date: 19 Feb 1996 19:22:19 GMT
Message-ID: <4gailc$cc0@news.duke.edu>
Keywords: variable selection

I post this every few months. I hope it helps. Here are SOME of the problems with stepwise variable selection.

1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Stat in Med).
4. It yields P-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman).
9. It allows us to not think about the problem.
10. It uses a lot of paper.

Note that 'all possible subsets' regression does not solve any of these problems.

References
----------
Altman, D. G. and Andersen, P. K. (1989). Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8, 771-783. [Shows that stepwise methods yield confidence limits that are far too narrow.]

Derksen, S. and Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology 45, 265-282. [Conclusions: "The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model. The number of candidate predictor variables affected the number of noise variables that gained entry to the model. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model. The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model."]

Roecker, Ellen B. (1991). Prediction error and its estimation for subset-selected models. Technometrics 33, 459-468. [Shows that all-possible regression can yield models that are "too small".]

Mantel, Nathan (1970). Why stepdown procedures in variable selection. Technometrics 12, 621-625.

Hurvich, C. M. and Tsai, C. L. (1990). The impact of model selection on inference in linear regression. American Statistician 44, 214-217.

Copas, J. B. (1983). Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society B 45, 311-354. [Shows why the number of CANDIDATE variables, and not the number in the final model, is the number of d.f. to consider.]

Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58, 267-288.
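Harrell's point 4 (P-values that do not have the proper meaning) can be made concrete with a little arithmetic: if k candidate predictors are pure noise, the chance that at least one of them clears the nominal 0.05 threshold grows quickly with k. A minimal sketch (the function name is mine, not from the thread):

```python
def prob_false_entry(k, alpha=0.05):
    """Chance that the 'best' of k independent pure-noise candidate
    variables passes an individual test at level alpha."""
    return 1 - (1 - alpha) ** k

# With the 11 candidate variables from the original question, selection
# has roughly a 43% chance of admitting at least one noise variable
# even if NOTHING is really predictive.
for k in (1, 5, 11, 50):
    print(k, round(prob_false_entry(k), 3))
```

This is also the arithmetic behind the Copas (1983) point above: the relevant degrees-of-freedom count is the number of CANDIDATE variables, not the number retained.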
<< address: Clay Helberg | Internet: helberg@execpc.com >>

Writing about testing hypotheses, Clay Helberg (chelberg@spss.com) cited:

: Richard F Ulrich (me) wrote:
me: > "null is nil." For an Odds Ratio, for instance, *ordinarily* 'nil' is OR = 1.0; but the statistical test, if you want to write out the terms, is
me: > absolute value of (Group1 - Group2) minus 1 equals 0
me: > or '... minus 5 Zuleks equals 0'.

CH: This is tautological: you can always rearrange an equation so that there is a zero on one side. The point I was objecting to was the automatic assumption that it refers to "no difference" or "no effect" (not, as in your example, where a specific difference is given, but the equation is rearranged so the hypothesis reads "the observed difference minus the hypothesized difference equals zero").

My point, though, was about NULL being meaningful, and NOT just an exercise in tautology. You need some reason to pick the value you want to call Null, even if it is just "the historical value". It is not a random value. As I went on:

me: > There is a 'nil' in there somewhere, or you have a funny idea of a null hypothesis. Usually, there is a very rational/logical reason for what constitutes the null = nil, though I do imagine the lax case as being, arbitrarily, 'some value previously observed', which is what people look at on process-control charts.

CH: Unfortunately, all too often the default null of "no difference" is used because it is convenient (it is generally what you get from computer-generated output), or because the theory under investigation is so vague as to preclude reasonable point predictions.

-- My experience says the null is usually just 0....

me: > I do NOT see a string of hypotheses, of which H-sub-zero is simply the lowest number.

CH: Well, in Hays (Statistics), he lists the symbol for the null hypothesis as H-sub-zero, and the symbol for the alternative as H-sub-one. This usage is also given in Hogg & Craig (Introduction to Mathematical Statistics) and Vogt (Dictionary of Statistics & Methodology). In fact, here is a relevant quote from Hays (4th ed., p. 249):

-- It reads to me like Hays, etc., is on MY side, if *all* the alternatives are grouped together as H-sub-one....

CH: Incidentally, there is an impression in some quarters that the term "null hypothesis" refers to the fact that in experimental work the parameter value specified in Ho is very often zero. Thus, in many experiments the hypothetical situation "no experimental effect" is represented by a statement that some mean or difference between means is exactly zero. However, as we have seen, the tested hypothesis can specify any of the possible values for one or more parameters, and this use of the word *null* is only incidental. It is far better for the student to think of the null hypothesis Ho as simply designating that hypothesis actually being tested, the one which, if true, determines the sampling distribution referred to in the test.
:
: I couldn't have said it better myself....

-- The hypothesis, he says, "if true, determines the sampling distribution referred to in the test." I think he could have said it a bit better. Emphasizing the null should help us keep in mind the fact that testing is not symmetrical. One REJECTS the null, or one FAILS to reject; one does not, in the latter case, thereby REJECT the alternative. There is only one sampling distribution used in the test, which is the NULL. Not -sub-one, -sub-two, etc.

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh
Assuming that all the effects you detect are about the same general size, your conclusions are correct. However, when looking at a sample of 20,000 subjects, I did see some F-tests for interactions that were significant but totally uninteresting, because the main effects were 5-fold larger in differences between means (thus, F-tests 25 times as large). (Besides being small, the interactions showed the data to have MULTIPLICATIVE effects rather than additive ones; under the proper model, the interactions would have disappeared completely, so that was another reason they deserved no attention.)

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh

===================== question that was posted about ANOVA
Danny Martin (admpt@hrp.health.ufl.edu) wrote:
: Against my better judgement I have allowed a student to design a project requiring a 3-way ANOVA for analysis (Factors A, B and C). This ANOVA will generate 3 main effects, 3 two-way interactions and one 3-way interaction. I think I understand how to interpret this ANOVA, but I'd like some expert confirmation, or correction if needed.
: Are the following statements correct?
: 1. If the 3-way interaction is significant, main-effects post hoc tests can't be used and simple-effects tests are needed.
: 2. If the 3-way interaction is not significant and the A x B interaction is significant, I can analyze effect C with a post hoc if it is significant. The A x B interaction will be explored with simple-effects tests.
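The "multiplicative effects" remark can be made concrete: if the cell means are products of row and column effects, the usual interaction term (cell mean minus row and column means plus the grand mean) is nonzero on the raw scale but vanishes exactly after a log transform. A small numeric sketch (the numbers are invented for illustration):

```python
import math

a = [1.0, 2.0]                                # multiplicative row effects
b = [1.0, 3.0]                                # multiplicative column effects
cell = [[ai * bj for bj in b] for ai in a]    # cell means mu_ij = a_i * b_j

def max_interaction(m):
    """Largest |cell - row mean - column mean + grand mean| over all cells."""
    rows = [sum(r) / len(r) for r in m]
    cols = [sum(c) / len(c) for c in zip(*m)]
    grand = sum(rows) / len(rows)
    return max(abs(m[i][j] - rows[i] - cols[j] + grand)
               for i in range(len(m)) for j in range(len(m[0])))

raw = max_interaction(cell)                                         # 0.5: spurious interaction
logged = max_interaction([[math.log(x) for x in r] for r in cell])  # ~0: it disappears
```

On the log scale, mu_ij = log(a_i) + log(b_j) is exactly additive, so the interaction contrast is zero to rounding error, which is why the "proper model" makes the interaction go away.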
Against my better judgement I have allowed a student to design a project requiring a 3-way ANOVA for analysis (Factors A, B and C). This ANOVA will generate 3 main effects, 3 two-way interactions and one 3-way interaction. I think I understand how to interpret this ANOVA, but I'd like some expert confirmation, or correction if needed. Are the following statements correct?

1. If the 3-way interaction is significant, main-effects post hoc tests can't be used and simple-effects tests are needed.
2. If the 3-way interaction is not significant and the A x B interaction is significant, I can analyze effect C with a post hoc if it is significant. The A x B interaction will be explored with simple-effects tests.

Thanks for your help.

Danny Martin, Ph.D., PT
U. Florida Physical Therapy, Gainesville, FL
admpt@hrp.health.ufl.edu PH (352) 395-0085 FAX (352) 395-0731
F. Bellour (bellour@upso.ucl.ac.be) wrote:
FB: Does anyone know how to calculate the validity (or reliability) of survey data?

*Reliability* is calculated as correlations. "Internal" is among items on a scale. "Concomitant" is with alternate scales. There is also reliability between raters, across time, etc.

*Validity* is an ideal, saying that you measured what you really WANTED to measure. You can show you do NOT have it, but any claim that you DO have it is subject to potential falsification. Types include predictive and discriminative.

FB: "I know I can use factor analysis to tap the construct validity. But how can I use Cronbach's alpha; to what conclusion does the alpha lead? Do I need to standardize data before calculating the alpha? If yes, do I still have to use standardized scores in further analyses?"

Cronbach's alpha implies items are correlated. Therefore, they might be measuring the same thing. If your items are not scored on the same range (for example, 1-4 vs 0-7) then you may want to standardize. Or else, describe two scales, one for each kind of item; it is very convenient to compute scores, and describe scores, as the simple sum (or average) of their component items. Besides not gaining much, "standardization" should be defined in terms of WHAT standard sample? The first one you have collected? SPSS reports a "standardized item alpha", which implies that standardized scores were used there; it also gives the raw-score results. If the non-standardized results were really worse, then you might need to standardize to hope for the better level of reliability. With luck, you might only need to rescale a single score or two.

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh
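For reference, raw-score Cronbach's alpha is just a ratio of item variances to total-score variance, so it is easy to compute by hand (standardizing the items first gives the "standardized item alpha" SPSS mentions). A minimal sketch; the function name is mine:

```python
def cronbach_alpha(items):
    """items: list of columns, one list per scale item, same respondents.
    Raw-score alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    def var(x):                          # sample variance, ddof = 1
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Two positively correlated 4-point items:
print(cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]]))  # 0.75
```

Perfectly redundant items give alpha = 1; items that do not covary drive it toward 0, which is the sense in which alpha says the items "might be measuring the same thing."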
> Michael Axelrod wrote that Eric Bohlman wrote:
>> The incorrect prediction in 1948 (Dewey vs. Truman) wasn't due to invalid methodology; political and economic events that occurred in between the poll and the election caused a lot of voters to change their minds.

It takes very little time for people to change their minds. However, there were two major technical flaws with the Crossley, Gallup, and Roper election polls. All three organizations used quota sampling and gave their interviewers general instructions for selecting their own interviewees. Today no credible organization would be caught making either mistake--they would at least cover it up. In 1948 all three quota samples 'fit' the six-variable (et al.) shape of the population very well, but Republicans were overrepresented. The Washington State Public Opinion Laboratory conducted two polls prior to the 1948 election: a probability sample and a quota sample. The probability sample predicted the correct outcome. Beginning in 1952, polls were conducted using probability sampling--a major step forward. This parallels the argument between Fisher and Gosset regarding the necessity of randomization in experiments. See F. Mosteller, The Pre-Election Polls of 1948. R.
Staff /Admin wrote:
: Warning to parents!
: Content of http://www.mrdoobie.com/ too controversial for children!

says who??? whose children????? why do you ordain yourself the minister of right and wrong???? oh, did i forget to mention that your post is horribly off-topic. try righteous.controlfreak.censor.censor.censor instead. your courtesy is appreciated.
Brian L. Bingham wrote:
>> We have an ongoing debate in our lab about nested factors and whether they should be fixed or random. There is no consensus among authors on the subject. Some say that nested factors are always random, while others state that it is possible (though unlikely) that a nested factor will be fixed. <<

There is a way of determining this. You have to think more about the multivariate normal approach to the analysis of variance, i.e. mixed models. The way I determine whether a factor is fixed or random is to ask whether I am trying to model the parameter as a mean (fixed) or as a component of the covariance matrix (random). I use random effects for nested effects if they are to be used to model the covariance matrix. I use fixed effects if they are to be used to model the mean.

Mark Von Tress 71530.1170@compuserve.com
Bruce Bradbury wrote: > > I can see two complications: the dependent variable is censored, and spells > are not independent (because individuals have multiple spells). The latter > point could conceivably be used to control for unobserved health status > characteristics. > > Is this a common data analysis problem? Is there any literature I can point > them to? Yes and yes. For statistical rigor (in a surprisingly readable book) I'd start with Kalbfleisch and Prentice's book entitled something like "Analysis of survival data." For more intuition, including lots of worked examples, I'd try the volume in the SAS user's series, "Survival analysis using the SAS system," by P. D. Allison. There are places in which I might quibble with this book (in particular, his emphasis on the Cox proportional hazards model as a sort of default model with which to start) but on the whole it's quite good.Return to Top
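For a feel of what those books cover: the basic object in survival analysis is the Kaplan-Meier (product-limit) estimate, which handles exactly the censoring Bruce mentions. A bare-bones sketch (my own simplification, not from either book; it assumes no tied observation times):

```python
def kaplan_meier(times, events):
    """Product-limit survival curve.
    times: observation times; events: 1 = spell ended, 0 = censored.
    Returns (time, S(t)) pairs; assumes no tied times for simplicity."""
    pairs = sorted(zip(times, events))
    n = len(pairs)
    surv, curve = 1.0, []
    for i, (t, d) in enumerate(pairs):
        at_risk = n - i              # everyone not yet failed or censored
        if d:                        # the curve drops only at event times
            surv *= 1 - 1 / at_risk
        curve.append((t, surv))
    return curve

# Three spells, the second censored: the curve steps down only at t = 1 and t = 3.
print(kaplan_meier([1, 2, 3], [1, 0, 1]))
```

A censored spell leaves the risk set without pulling the curve down, which is how the method uses incomplete spells without biasing the estimate.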
Hi Stat-ler-s:

I was reading a paper, "Comparisons of Treatments After an Analysis of Variance in Ecology," and came across a paragraph that puzzled me. The paragraph reads:

"Non-parametric-- The Kruskal-Wallis test for one-way designs and the Friedman test for two-way designs without replication do not require that the distributions be normal. However, except for a difference in medians, the distributions must be identical in all the treatment populations compared, either for the original or some TRANSFORM OF THE DATA (Hollander and Wolfe 1973, Conover 1980). It follows that while the variances need not be equal in the raw data, there MUST BE A SUITABLE TRANSFORM TO STABILIZE THE VARIANCES. If transforms are appropriate (see above Assumptions), PARAMETRIC ANALYSES OF THE TRANSFORMED DATA WOULD BE MORE POWERFUL. The standard nonparametric tests should not be used as a simple means to avoid the problem of unequal variances, as some authors of the papers surveyed appeared to do."

1.) Nonparametric statistics are not affected by transformation (e.g. log transformation), so why must there be a suitable transform to stabilize the variances across the treatments?

2.) In my reading, I have read that nonparametric methods were always as powerful as, if not more powerful than, parametric methods, even in cases of normality. Has anyone else come across similar or different arguments?

I would like to hear any opinions on the above paragraph, as well as on my questions.

Steven
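On question 1, the puzzle dissolves for MONOTONE transforms: rank tests depend on the data only through ranks, and a monotone transform leaves ranks unchanged, so the Kruskal-Wallis statistic is identical before and after. The quoted paragraph's "suitable transform" condition is about making the identical-shape assumption true, not about changing the test. A quick check with a hand-rolled H statistic (no tie correction; the data are made up):

```python
import math

def midranks(values):
    """Ranks of a flat list, ties sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def kruskal_h(groups):
    """Kruskal-Wallis H = 12/(N(N+1)) * sum R_i^2/n_i - 3(N+1)."""
    flat = [x for g in groups for x in g]
    r = midranks(flat)
    n, h, pos = len(flat), 0.0, 0
    for g in groups:
        h += sum(r[pos:pos + len(g)]) ** 2 / len(g)
        pos += len(g)
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)

a, b, c = [1.2, 3.4, 2.2, 5.1], [2.0, 6.5, 4.4, 8.0], [0.5, 1.1, 0.9, 2.3]
h_raw = kruskal_h([a, b, c])
h_log = kruskal_h([[math.log(x) for x in g] for g in (a, b, c)])
# identical: log preserves the ordering, hence the ranks, hence H
```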
I was recently hired on a contractor basis to tabulate and analyze the results of a customer satisfaction/demographics email survey. The company anticipated a response of about 500, but so far has received over 3,000 surveys. As a cost-saving measure (I'm basically being paid per questionnaire), they want me to take a sample of 200-300 responses and project the results over the 3,000-4,000 responses they will actually receive. My experience in survey analysis has been limited to 4 small, simple surveys, so I'm not sure whether their sample "suggestion" is mathematically or ethically sound. That is, is it statistically appropriate to, in effect, take a sample of a sample, and would you have any confidence in the accuracy of the results? Also, under these circumstances would it be unethical to state to prospective customers that "based on over 3,000 responses, our survey results indicate our customers' median household income (or age, education level, etc.) is $60,000/year"? If anyone has had experience with this type of situation I would appreciate your advice.
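On the statistical side, the precision of such a subsample can be quantified directly. For a proportion, the margin of error from sampling n of the N returned questionnaires is the ordinary binomial standard error times a finite-population correction. A sketch (it assumes the 200-300 are drawn at random from the batch, and it covers proportions, not medians):

```python
import math

def margin_of_error(p, n, N, z=1.96):
    """Approximate 95% half-width for a proportion estimated from a
    simple random sample of n out of a batch of N questionnaires."""
    fpc = math.sqrt((N - n) / (N - 1))      # finite-population correction
    return z * math.sqrt(p * (1 - p) / n) * fpc

# Worst case (p = 0.5), sampling 250 of 3000 responses: about +/- 6 points.
moe = margin_of_error(0.5, 250, 3000)
```

So a random subsample is statistically defensible with a quotable precision; whether it is ethical to say "based on over 3,000 responses" when only 250 were tabulated is a separate question.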
Actually, by exploring the engineering functions in Excel I was able to find the ones that process complex numbers. Thank you very much for responding to my query.

PETER HOMEL, PHD
HEALTH SCIENCE CENTER BROOKLYN, STATE UNIVERSITY OF NEW YORK
450 CLARKSON AVENUE, BOX 7, BROOKLYN, NY 11203-2098
EMAIL: HOMEL@SACC.HSCBKLYN.EDU HOMEL@SNYBKSAC.BITNET
TEL: (718) 270-7424 FAX: (718) 270-7461
MOTTO: STATISTICS DON'T LIE! (PEOPLE DO!)
Alf Tore Mjos wrote:
> Hello
> I have a problem when carrying out the all possible subsets option in logistic regression (performed on SAS 6.11), I do not get the classification table!
> Is this possible to do when the all possible subsets option is chosen?
> Thank you in advance!
> Arild Breistøl

Hi, I don't believe you can get classification tables when using SELECTION=SCORE for best-subsets regression. When you say you don't get "the" classification table, for which model should the table be printed? Once you have decided on a model, you can rerun the regression to get the classification table for that model. Dan
Hi, This is a beginners' SAS question. I am a social scientist trying to finish my final project in a research class. Is it possible to perform tests of simple effects (as defined in APPLIED STATISTICS by HINKEL/WIERSMA/JURS) in SAS? I am using the following setup (the first MEANS statement was missing its semicolon):

PROC ANOVA DATA=PROJECT;
  CLASSES A B;
  MODEL S=A B A*B;
  MEANS A B A*B;
  MEANS A B A*B/TUKEY BON;
  FORMAT A AA. B BB.;
  TITLE 'THE TWO-WAY FIXED-MODEL ANOVA';
RUN;

The second MEANS statement doesn't perform the simple-effects tests as I would have expected. Please help!
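For what it's worth, a simple-effects test is just a one-way test on one factor within a single level of the other factor (purists then re-use the error mean square from the full factorial rather than the slice's own, which this sketch does not do). A language-neutral illustration with invented numbers:

```python
def f_oneway(groups):
    """One-way ANOVA F: between-groups mean square / within-groups mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Cell data for a 2x2 design, indexed cells[level of A][level of B]:
cells = [[[3.0, 4.0], [1.0, 2.0]],
         [[5.0, 6.0], [1.5, 2.5]]]

# Simple effect of A at B = 0: compare cells[0][0] against cells[1][0].
f_a_at_b0 = f_oneway([cells[0][0], cells[1][0]])
# Simple effect of A at B = 1:
f_a_at_b1 = f_oneway([cells[0][1], cells[1][1]])
```

MEANS alone only gives marginal and cell means; the slicing into one-way tests is the extra step the second MEANS statement cannot do.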
> Dennis Roberts wrote:
>> i am wondering, except in certain linear transformation situations, WHY would one want to test a null where the null was that the correlation between X and Y is 1? maybe the original poster could give an example where this would be a reasonable hypothesis to even test?
>
> Let me jump in here, as this null does appear in my discipline, population genetics. In looking at genetic correlations, if two traits have a correlation of |1| then they cannot evolve independently. Thus, it is of interest to test this null for some types of trait combinations.
> Let me also put forward my testing method, at the risk of being taken to task. Using a tanh^-1 transform, one can test the null rho = some value other than 0. (For example, see the explanation in Zar.) The problem is that tanh^-1(1) is undefined. Instead, I test rho = 0.99. The logic being that if that is rejected, one can reject rho = 1. Obviously there is a slight loss of power, but acceptable under the circumstances.
>
> Sam

This is unnecessarily complicated. If rho = 1 (population value), then all sample correlations (r) from the population must equal 1. Thus, reject H0: rho = 1 when r < 1.

Hans-Peter Piepho
Institut f. Nutzpflanzenkunde WWW: http://www.wiz.uni-kassel.de/fts/
Universitaet Kassel Mail: piepho@wiz.uni-kassel.de
Steinstrasse 19 Fax: +49 5542 98 1230
37213 Witzenhausen, Germany Phone: +49 5542 98 1248
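Sam's workaround in symbols: the Fisher transform z = arctanh(r) is approximately normal with standard error 1/sqrt(n-3), so a test of H0: rho = 0.99 is one line. (Piepho's point stands: for rho = 1 itself no test is needed, since any r < 1 refutes it exactly.) A sketch:

```python
import math

def fisher_z(r, rho0, n):
    """z statistic for H0: rho = rho0, using the arctanh (Fisher) transform.
    Valid only for |rho0| < 1; arctanh(1) is undefined, hence Sam's rho0 = 0.99."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)

# e.g. r = 0.95 from n = 30 pairs, tested against H0: rho = 0.99
z = fisher_z(0.95, 0.99, 30)   # about -4.23: reject rho = 0.99, and hence rho = 1
```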
lucz@ix.netcom.com wrote:
> If you have access to SAS it is very easy to do stepwise regression.

There is a problem here. Conventional statistical theory assumes the model is known in advance, and all you have to do is estimate the parameters. There are unbiased methods for doing this: least squares and maximum likelihood. When the estimation is conditioned on the fact that the variables are 'significant' according to some stepwise method, they are most definitely NOT unbiased estimates. So... you have to split up your data set, build the model on one subset, and estimate the parameters on a different subset. Why does G Asha say the friend *needs* to do stepwise regression?

Blaise F Egan
Data Mining Group, BT Labs
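Blaise's split-sample recipe in miniature: choose the variable on one half of the data, then estimate its effect on the other half, so the selection step cannot bias the estimate. A toy sketch with one real predictor and one noise predictor (all names and numbers are invented; correlation stands in for the fitted coefficient):

```python
import random

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

random.seed(1)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]     # real predictor
x2 = [random.gauss(0, 1) for _ in range(n)]     # pure noise
y = [a + random.gauss(0, 1) for a in x1]        # y depends on x1 only

train, hold = slice(0, 50), slice(50, 100)
# Step 1: SELECT on the training half only.
name, xs = max([("x1", x1), ("x2", x2)],
               key=lambda kv: abs(corr(kv[1][train], y[train])))
# Step 2: ESTIMATE on the untouched holdout half.
estimate = corr(xs[hold], y[hold])
```

The holdout estimate has an honest sampling distribution precisely because the holdout half played no part in choosing the variable.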
Glen Barnett (barnett@agsm.unsw.edu.au) wrote:
: Can anyone suggest references for the loss of degrees of freedom in a regression situation under heteroscedasticity?
: Alternatively, the equivalent effect in unbalanced one-way ANOVA may be of help.

If you have much effect, then a skewed distribution will give you heteroscedasticity. I have read about the more general case of densities with long tails, or with a mixture of two populations with variances that are very different. This was Cressie, writing in about 1977, I think. He showed that, basically, if a few cases (out of a bigger N) determine your total variance, then the d.f. of your error should be regarded as not much more than the N of the few cases with large variance. He was looking at mixture models.

Here is some of the logic. For the t-test, for instance, the error d.f. for the test is approximated by using a formula that matches the VARIANCE of the variance terms. So, if just a few of your cases dominate the variance, then it will act like a variance with few d.f. That does mean that you are better off if your larger sample does have the larger variance, if you were comparing two samples that were each, separately, HOMOGENEOUS. But if your larger-variance, larger-N sample owes its variance to a "few" outliers, then your d.f. is really closer to "few".

Cressie also made some interesting comments, to the effect that skewed distributions, upon randomization and testing, yield t-tests with a short tail, and stumpy, symmetrical distributions yield t-tests with long tails.

Hope this helps. Maybe I will run across the exact reference in the next few days, if someone else does not post something useful.

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh

*** ==================== the rest of the note ============
: The simpler of the situations I'm in essentially has a model of a set of parallel lines, for which I'm interested in finding the p-values of the parameters representing the differences in height. The smaller sample sizes generally have the smaller variances. If the two-sample t is any guide, this indicates the effect of d.f. should be small, but I'd like to see what is out there on this problem.
: Most of what I've been able to find out there so far either just falls back on asymptotic normality, or pretends that the degrees of freedom don't change.
: Glen
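The "formula that matches the variance of the variance terms" Rich describes is the Welch-Satterthwaite approximation, which is easy to evaluate directly. Note how a small high-variance group drags the effective d.f. down toward its own sample size, just as the mixture argument says:

```python
def welch_df(s1sq, n1, s2sq, n2):
    """Welch-Satterthwaite approximate error d.f. for a two-sample t test."""
    v1, v2 = s1sq / n1, s2sq / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(welch_df(1.0, 10, 1.0, 10))    # equal variances: the full 2(n-1) = 18
print(welch_df(100.0, 5, 1.0, 100))  # 5 high-variance cases dominate: ~4 d.f.
```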
In article <9612121754.AA12393@mailhost.sfwmd.gov>, Steven Hill writes:
|> I was reading a paper "Comparisons of Treatments After an Analysis of Variance in Ecology" [quoted paragraph snipped; see the original post above]
|>
|> 1.) Nonparametric statistics are not affected by transformation (eg log transformation) so why must there be a suitable transform to stabilize the variances across the treatments?
|>
|> 2.) In my reading, I have read that nonparametric methods were always as, if not more powerful, than parametric methods even in cases of normality. Has anyone else come across similar or different arguments?

IIRC from my nonparametrics class, the Wilcoxon test *is* affected by heterogeneous variances: the alpha level inflates.

Barry
Hi folks, I often have to analyze data where the DVs are frequencies whose distribution is skewed toward zero. That is, often as many as half of the subjects will score 0 on a given DV, while the remaining scores will be spread across small integer values (e.g., 1, 2, 3, 4). I could dichotomize these variables and use logistic or loglinear models, I suppose, but I'm more comfortable and familiar with linear regression (and power calculations are easier for it), and would prefer to use it if I can. It seems to me that this should be okay. I can't see any obvious violation of regression assumptions. My intuition says using the frequencies rather than the dichotomies allows me to keep more information. What do you folks think? Is there a type of analysis better suited to this type of DV? Is it helpful or necessary to transform the frequency data before analysis? Thanks.

Bruce L. Lambert, Ph.D.
Department of Pharmacy Administration
University of Illinois at Chicago
lambertb@uic.edu http://ludwig.pmad.uic.edu/~bruce/
Phone: +1 (312) 996-2411 Fax: +1 (312) 996-0868