Hello, I have a problem when running the all-possible-subsets option in logistic regression (performed on SAS 6.11): I do not get the classification table! Is it possible to get the table when the all-possible-subsets option is chosen? Thank you in advance! Arild Breistøl
In article <589cqn$1rdo@yuma.ACNS.ColoState.EDU>, Tim Dierauf writes:
> As I remember, pre-5.0 Statistica has the K-S test.

Statistica v4.5 has the Shapiro-Wilk, Lilliefors, and K-S tests in the Descriptive Statistics routine. -- Chris Crocker
G Asha (asha@CAS.IISC.ERNET.IN) wrote:
: My friend has collected data which has the following info.
: Dependent variable: Achievement in Biology
: Ind. variables: self-confidence, home adjustment, health adjustment, etc. (11 in total).
: She needs to do multiple (step-wise) regression analysis. I am familiar with multiple regression, but do not know how to go about step-wise regression. Can someone please advise me about this? Even a ref book or some simple stat package will do.

-- I have seen how-to advice posted before.
-- If you have a stat package, it usually has examples. Or, Windows/graphical approaches let you build your request.
-- For most purposes, it is not wise to run step-wise regression. Below, I am including the commentary and references posted before by Frank Harrell, which I have mentioned before (on some .stat. groups).

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh

*** ================== posting on stepwise analyses ======
Frank E Harrell Jr feh@biostat.mc.duke.edu
Associate Professor of Biostatistics, Division of Biometry, Duke University Medical Center
Newsgroups: sci.stat.consult
Subject: Reasons not to do stepwise (or all possible regressions)
Date: 19 Feb 1996 19:22:19 GMT
Message-ID: <4gailc$cc0@news.duke.edu>
Keywords: variable selection

I post this every few months. I hope it helps. Here are SOME of the problems with stepwise variable selection.

1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Stat in Med).
4. It yields P-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman).
9. It allows us to not think about the problem.
10. It uses a lot of paper.

Note that 'all possible subsets' regression does not solve any of these problems.

References
----------
Altman, D. G. and Andersen, P. K. (1989). Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8, 771-783. [Shows that stepwise methods yield confidence limits that are far too narrow.]

Derksen, S. and Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology 45, 265-282. [Conclusions: "The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model. The number of candidate predictor variables affected the number of noise variables that gained entry to the model. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model. The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model."]

Roecker, Ellen B. (1991). Prediction error and its estimation for subset-selected models. Technometrics 33, 459-468. [Shows that all-possible regression can yield models that are "too small".]

Mantel, Nathan (1970). Why stepdown procedures in variable selection. Technometrics 12, 621-625.

Hurvich, C. M. and Tsai, C. L. (1990). The impact of model selection on inference in linear regression. American Statistician 44, 214-217.

Copas, J. B. (1983). Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society B 45, 311-354. [Shows why the number of CANDIDATE variables, and not the number in the final model, is the number of d.f. to consider.]

Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58, 267-288.
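Harrell's point 4 (P-values that do not have the proper meaning) can be made concrete with a little arithmetic: if k candidate predictors are pure noise, the chance that at least one of them clears the nominal 0.05 threshold grows quickly with k. A minimal sketch (the function name is mine, not from the thread):

```python
def prob_false_entry(k, alpha=0.05):
    """Chance that the 'best' of k independent pure-noise candidate
    variables passes an individual test at level alpha."""
    return 1 - (1 - alpha) ** k

# With the 11 candidate variables from the original question, selection
# has roughly a 43% chance of admitting at least one noise variable
# even if NOTHING is really predictive.
for k in (1, 5, 11, 50):
    print(k, round(prob_false_entry(k), 3))
```

This is also the arithmetic behind the Copas (1983) point above: the relevant degrees-of-freedom count is the number of CANDIDATE variables, not the number retained.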
<< address: Clay Helberg | Internet: helberg@execpc.com >>

Writing about testing hypotheses, Clay Helberg (chelberg@spss.com) cited:

: Richard F Ulrich (me) wrote:
me: > "null is nil." For an Odds Ratio, for instance, *ordinarily* 'nil' is OR = 1.0; but the statistical test, if you want to write out the terms, is
me: > absolute value of (Group1 - Group2) minus 1 equals 0
me: > or '... minus 5 Zuleks equals 0'.

CH: This is tautological: you can always rearrange an equation so that there is a zero on one side. The point I was objecting to was the automatic assumption that it refers to "no difference" or "no effect" (not, as in your example, where a specific difference is given, but the equation is rearranged so the hypothesis reads "the observed difference minus the hypothesized difference equals zero").

My point, though, was about NULL being meaningful, and NOT just an exercise in tautology. You need some reason to pick the value you want to call Null, even if it is just "the historical value". It is not a random value. As I went on:

me: > There is a 'nil' in there somewhere, or you have a funny idea of a null hypothesis. Usually, there is a very rational/logical reason for what constitutes the null = nil, though I do imagine the lax case as being, arbitrarily, 'some value previously observed', which is what people look at on process-control charts.

CH: Unfortunately, all too often the default null of "no difference" is used because it is convenient (it is generally what you get from computer-generated output), or because the theory under investigation is so vague as to preclude reasonable point predictions.

-- My experience says the null is usually just 0....

me: > I do NOT see a string of hypotheses, of which H-sub-zero is simply the lowest number.

CH: Well, in Hays (Statistics), he lists the symbol for the null hypothesis as H-sub-zero, and the symbol for the alternative as H-sub-one. This usage is also given in Hogg & Craig (Introduction to Mathematical Statistics) and Vogt (Dictionary of Statistics & Methodology). In fact, here is a relevant quote from Hays (4th ed., p. 249):

-- It reads to me like Hays, etc., is on MY side, if *all* the alternatives are grouped together as H-sub-one....

CH: Incidentally, there is an impression in some quarters that the term "null hypothesis" refers to the fact that in experimental work the parameter value specified in Ho is very often zero. Thus, in many experiments the hypothetical situation "no experimental effect" is represented by a statement that some mean or difference between means is exactly zero. However, as we have seen, the tested hypothesis can specify any of the possible values for one or more parameters, and this use of the word *null* is only incidental. It is far better for the student to think of the null hypothesis Ho as simply designating that hypothesis actually being tested, the one which, if true, determines the sampling distribution referred to in the test.
:
: I couldn't have said it better myself....

-- The hypothesis, he says, "if true, determines the sampling distribution referred to in the test." I think he could have said it a bit better. Emphasizing the null should help us keep in mind the fact that testing is not symmetrical. One REJECTS the null, or one FAILS to reject; one does not, in the latter case, thereby REJECT the alternative. There is only one sampling distribution used in the test, which is the NULL. Not -sub-one, -sub-two, etc.

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh
Assuming that all the effects you detect are about the same general size, your conclusions are correct. However, when looking at a sample of 20,000 subjects, I did see some F-tests for interactions that were significant but totally uninteresting, because the main effects were 5-fold larger in differences between means (thus, F-tests 25 times as large). (Besides being small, the interactions showed the data to have MULTIPLICATIVE effects rather than additive ones; under the proper model, the interactions would have disappeared completely, so that was another reason they deserved no attention.)

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh

===================== question that was posted about ANOVA
Danny Martin (admpt@hrp.health.ufl.edu) wrote:
: Against my better judgement I have allowed a student to design a project requiring a 3-way ANOVA for analysis (Factors A, B and C). This ANOVA will generate 3 main effects, 3 two-way interactions and one 3-way interaction. I think I understand how to interpret this ANOVA, but I'd like some expert confirmation, or correction if needed.
: Are the following statements correct?
: 1. If the 3-way interaction is significant, main-effects post hoc tests can't be used and simple-effects tests are needed.
: 2. If the 3-way interaction is not significant and the A x B interaction is significant, I can analyze effect C with a post hoc if it is significant. The A x B interaction will be explored with simple-effects tests.
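The "multiplicative effects" remark can be made concrete: if the cell means are products of row and column effects, the usual interaction term (cell mean minus row and column means plus the grand mean) is nonzero on the raw scale but vanishes exactly after a log transform. A small numeric sketch (the numbers are invented for illustration):

```python
import math

a = [1.0, 2.0]                                # multiplicative row effects
b = [1.0, 3.0]                                # multiplicative column effects
cell = [[ai * bj for bj in b] for ai in a]    # cell means mu_ij = a_i * b_j

def max_interaction(m):
    """Largest |cell - row mean - column mean + grand mean| over all cells."""
    rows = [sum(r) / len(r) for r in m]
    cols = [sum(c) / len(c) for c in zip(*m)]
    grand = sum(rows) / len(rows)
    return max(abs(m[i][j] - rows[i] - cols[j] + grand)
               for i in range(len(m)) for j in range(len(m[0])))

raw = max_interaction(cell)                                         # 0.5: spurious interaction
logged = max_interaction([[math.log(x) for x in r] for r in cell])  # ~0: it disappears
```

On the log scale, mu_ij = log(a_i) + log(b_j) is exactly additive, so the interaction contrast is zero to rounding error, which is why the "proper model" makes the interaction go away.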
Against my better judgement I have allowed a student to design a project requiring a 3-way ANOVA for analysis (Factors A, B and C). This ANOVA will generate 3 main effects, 3 two-way interactions and one 3-way interaction. I think I understand how to interpret this ANOVA, but I'd like some expert confirmation, or correction if needed. Are the following statements correct?

1. If the 3-way interaction is significant, main-effects post hoc tests can't be used and simple-effects tests are needed.
2. If the 3-way interaction is not significant and the A x B interaction is significant, I can analyze effect C with a post hoc if it is significant. The A x B interaction will be explored with simple-effects tests.

Thanks for your help.

Danny Martin, Ph.D., PT
U. Florida Physical Therapy, Gainesville, FL
admpt@hrp.health.ufl.edu PH (352) 395-0085 FAX (352) 395-0731
F. Bellour (bellour@upso.ucl.ac.be) wrote:
FB: Does anyone know how to calculate the validity (or reliability) of survey data?

*Reliability* is calculated as correlations. "Internal" is among items on a scale. "Concomitant" is with alternate scales. There is also reliability between raters, across time, etc.

*Validity* is an ideal, saying that you measured what you really WANTED to measure. You can show you do NOT have it, but any claim that you DO have it is subject to potential falsification. Types include predictive and discriminative.

FB: "I know I can use factor analysis to tap the construct validity. But how can I use Cronbach's alpha; to what conclusion does the alpha lead? Do I need to standardize data before calculating the alpha? If yes, do I still have to use standardized scores in further analyses?"

Cronbach's alpha implies items are correlated. Therefore, they might be measuring the same thing. If your items are not scored on the same range (for example, 1-4 vs 0-7) then you may want to standardize. Or else, describe two scales, one for each kind of item; it is very convenient to compute scores, and describe scores, as the simple sum (or average) of their component items. Besides not gaining much, "standardization" should be defined in terms of WHAT standard sample? The first one you have collected? SPSS reports a "standardized item alpha", which implies that standardized scores were used there; it also gives the raw-score results. If the non-standardized results were really worse, then you might need to standardize to hope for the better level of reliability. With luck, you might only need to rescale a single score or two.

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh
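For reference, raw-score Cronbach's alpha is just a ratio of item variances to total-score variance, so it is easy to compute by hand (standardizing the items first gives the "standardized item alpha" SPSS mentions). A minimal sketch; the function name is mine:

```python
def cronbach_alpha(items):
    """items: list of columns, one list per scale item, same respondents.
    Raw-score alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    def var(x):                          # sample variance, ddof = 1
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Two positively correlated 4-point items:
print(cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]]))  # 0.75
```

Perfectly redundant items give alpha = 1; items that do not covary drive it toward 0, which is the sense in which alpha says the items "might be measuring the same thing."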
> Michael Axelrod wrote that Eric Bohlman wrote:
>> The incorrect prediction in 1948 (Dewey vs. Truman) wasn't due to invalid methodology; political and economic events that occurred in between the poll and the election caused a lot of voters to change their minds.

It takes very little time for people to change their minds. However, there were two major technical flaws with the Crossley, Gallup, and Roper election polls. All three organizations used quota sampling and gave their interviewers general instructions for selecting their own interviewees. Today no credible organization would be caught making either mistake--they would at least cover it up. In 1948 all three quota samples 'fit' the six-variable (et al.) shape of the population very well, but Republicans were overrepresented. The Washington State Public Opinion Laboratory conducted two polls prior to the 1948 election: a probability sample and a quota sample. The probability sample predicted the correct outcome. Beginning in 1952, polls were conducted using probability sampling--a major step forward. This parallels the argument between Fisher and Gosset regarding the necessity of randomization in experiments. See F. Mosteller, The Pre-Election Polls of 1948. R.
Staff /Admin wrote:
: Warning to parents!
: Content of http://www.mrdoobie.com/ too controversial for children!

says who??? whose children????? why do you ordain yourself the minister of right and wrong???? oh, did i forget to mention that your post is horribly off-topic. try righteous.controlfreak.censor.censor.censor instead. your courtesy is appreciated.
Brian L. Bingham wrote:
>> We have an ongoing debate in our lab about nested factors and whether they should be fixed or random. There is no consensus among authors on the subject. Some say that nested factors are always random, while others state that it is possible (though unlikely) that a nested factor will be fixed. <<

There is a way of determining this. You have to think more about the multivariate normal approach to the analysis of variance, i.e. mixed models. The way I determine whether a factor is fixed or random is to ask whether I am trying to model the parameter as a mean (fixed) or as a component of the covariance matrix (random). I use random effects for nested effects if they are to be used to model the covariance matrix. I use fixed effects if they are to be used to model the mean.

Mark Von Tress 71530.1170@compuserve.com
Bruce Bradbury wrote: > > I can see two complications: the dependent variable is censored, and spells > are not independent (because individuals have multiple spells). The latter > point could conceivably be used to control for unobserved health status > characteristics. > > Is this a common data analysis problem? Is there any literature I can point > them to? Yes and yes. For statistical rigor (in a surprisingly readable book) I'd start with Kalbfleisch and Prentice's book entitled something like "Analysis of survival data." For more intuition, including lots of worked examples, I'd try the volume in the SAS user's series, "Survival analysis using the SAS system," by P. D. Allison. There are places in which I might quibble with this book (in particular, his emphasis on the Cox proportional hazards model as a sort of default model with which to start) but on the whole it's quite good.Return to Top
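For a feel of what those books cover: the basic object in survival analysis is the Kaplan-Meier (product-limit) estimate, which handles exactly the censoring Bruce mentions. A bare-bones sketch (my own simplification, not from either book; it assumes no tied observation times):

```python
def kaplan_meier(times, events):
    """Product-limit survival curve.
    times: observation times; events: 1 = spell ended, 0 = censored.
    Returns (time, S(t)) pairs; assumes no tied times for simplicity."""
    pairs = sorted(zip(times, events))
    n = len(pairs)
    surv, curve = 1.0, []
    for i, (t, d) in enumerate(pairs):
        at_risk = n - i              # everyone not yet failed or censored
        if d:                        # the curve drops only at event times
            surv *= 1 - 1 / at_risk
        curve.append((t, surv))
    return curve

# Three spells, the second censored: the curve steps down only at t = 1 and t = 3.
print(kaplan_meier([1, 2, 3], [1, 0, 1]))
```

A censored spell leaves the risk set without pulling the curve down, which is how the method uses incomplete spells without biasing the estimate.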
Hi Stat-ler-s:

I was reading a paper, "Comparisons of Treatments After an Analysis of Variance in Ecology," and came across a paragraph that puzzled me. The paragraph reads:

"Non-parametric-- The Kruskal-Wallis test for one-way designs and the Friedman test for two-way designs without replication do not require that the distributions be normal. However, except for a difference in medians, the distributions must be identical in all the treatment populations compared, either for the original or some TRANSFORM OF THE DATA (Hollander and Wolfe 1973, Conover 1980). It follows that while the variances need not be equal in the raw data, there MUST BE A SUITABLE TRANSFORM TO STABILIZE THE VARIANCES. If transforms are appropriate (see above Assumptions), PARAMETRIC ANALYSES OF THE TRANSFORMED DATA WOULD BE MORE POWERFUL. The standard nonparametric tests should not be used as a simple means to avoid the problem of unequal variances, as some authors of the papers surveyed appeared to do."

1.) Nonparametric statistics are not affected by transformation (e.g. log transformation), so why must there be a suitable transform to stabilize the variances across the treatments?

2.) In my reading, I have read that nonparametric methods were always as powerful as, if not more powerful than, parametric methods, even in cases of normality. Has anyone else come across similar or different arguments?

I would like to hear any opinions on the above paragraph, as well as on my questions.

Steven
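On question 1, the puzzle dissolves for MONOTONE transforms: rank tests depend on the data only through ranks, and a monotone transform leaves ranks unchanged, so the Kruskal-Wallis statistic is identical before and after. The quoted paragraph's "suitable transform" condition is about making the identical-shape assumption true, not about changing the test. A quick check with a hand-rolled H statistic (no tie correction; the data are made up):

```python
import math

def midranks(values):
    """Ranks of a flat list, ties sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def kruskal_h(groups):
    """Kruskal-Wallis H = 12/(N(N+1)) * sum R_i^2/n_i - 3(N+1)."""
    flat = [x for g in groups for x in g]
    r = midranks(flat)
    n, h, pos = len(flat), 0.0, 0
    for g in groups:
        h += sum(r[pos:pos + len(g)]) ** 2 / len(g)
        pos += len(g)
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)

a, b, c = [1.2, 3.4, 2.2, 5.1], [2.0, 6.5, 4.4, 8.0], [0.5, 1.1, 0.9, 2.3]
h_raw = kruskal_h([a, b, c])
h_log = kruskal_h([[math.log(x) for x in g] for g in (a, b, c)])
# identical: log preserves the ordering, hence the ranks, hence H
```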
I was recently hired on a contractor basis to tabulate and analyze the results of a customer satisfaction/demographics email survey. The company anticipated a response of about 500, but so far has received over 3,000 surveys. As a cost-saving measure (I'm basically being paid per questionnaire), they want me to take a sample of 200-300 responses and project the results over the 3,000-4,000 responses they will actually receive. My experience in survey analysis has been limited to 4 small, simple surveys, so I'm not sure whether their sample "suggestion" is mathematically or ethically sound. That is, is it statistically appropriate to, in effect, take a sample of a sample, and would you have any confidence in the accuracy of the results? Also, under these circumstances would it be unethical to state to prospective customers that "based on over 3,000 responses, our survey results indicate our customers' median household income (or age, education level, etc.) is $60,000/year"? If anyone has had experience with this type of situation I would appreciate your advice.
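On the statistical side, the precision of such a subsample can be quantified directly. For a proportion, the margin of error from sampling n of the N returned questionnaires is the ordinary binomial standard error times a finite-population correction. A sketch (it assumes the 200-300 are drawn at random from the batch, and it covers proportions, not medians):

```python
import math

def margin_of_error(p, n, N, z=1.96):
    """Approximate 95% half-width for a proportion estimated from a
    simple random sample of n out of a batch of N questionnaires."""
    fpc = math.sqrt((N - n) / (N - 1))      # finite-population correction
    return z * math.sqrt(p * (1 - p) / n) * fpc

# Worst case (p = 0.5), sampling 250 of 3000 responses: about +/- 6 points.
moe = margin_of_error(0.5, 250, 3000)
```

So a random subsample is statistically defensible with a quotable precision; whether it is ethical to say "based on over 3,000 responses" when only 250 were tabulated is a separate question.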
Actually, by exploring the engineering functions in Excel I was able to find the ones that process complex numbers. Thank you very much for responding to my query.

PETER HOMEL, PHD
HEALTH SCIENCE CENTER BROOKLYN, STATE UNIVERSITY OF NEW YORK
450 CLARKSON AVENUE, BOX 7, BROOKLYN, NY 11203-2098
EMAIL: HOMEL@SACC.HSCBKLYN.EDU HOMEL@SNYBKSAC.BITNET
TEL: (718) 270-7424 FAX: (718) 270-7461
MOTTO: STATISTICS DON'T LIE! (PEOPLE DO!)
Alf Tore Mjos wrote:
> Hello
> I have a problem when carrying out the all possible subsets option in logistic regression (performed on SAS 6.11), I do not get the classification table!
> Is this possible to do when the all possible subsets option is chosen?
> Thank you in advance!
> Arild Breistøl

Hi, I don't believe you can get classification tables when using SELECTION=SCORE for best-subsets regression. When you say you don't get "the" classification table, for which model should the table be printed? Once you have decided on a model, you can rerun the regression to get the classification table for that model. Dan
Hi, This is a beginners' SAS question. I am a social scientist trying to finish my final project in a research class. Is it possible to perform tests of simple effects (as defined in APPLIED STATISTICS by HINKEL/WIERSMA/JURS) in SAS? I am using the following setup (the first MEANS statement was missing its semicolon):

PROC ANOVA DATA=PROJECT;
  CLASSES A B;
  MODEL S=A B A*B;
  MEANS A B A*B;
  MEANS A B A*B/TUKEY BON;
  FORMAT A AA. B BB.;
  TITLE 'THE TWO-WAY FIXED-MODEL ANOVA';
RUN;

The second MEANS statement doesn't perform the simple-effects tests as I would have expected. Please help!
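For what it's worth, a simple-effects test is just a one-way test on one factor within a single level of the other factor (purists then re-use the error mean square from the full factorial rather than the slice's own, which this sketch does not do). A language-neutral illustration with invented numbers:

```python
def f_oneway(groups):
    """One-way ANOVA F: between-groups mean square / within-groups mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Cell data for a 2x2 design, indexed cells[level of A][level of B]:
cells = [[[3.0, 4.0], [1.0, 2.0]],
         [[5.0, 6.0], [1.5, 2.5]]]

# Simple effect of A at B = 0: compare cells[0][0] against cells[1][0].
f_a_at_b0 = f_oneway([cells[0][0], cells[1][0]])
# Simple effect of A at B = 1:
f_a_at_b1 = f_oneway([cells[0][1], cells[1][1]])
```

MEANS alone only gives marginal and cell means; the slicing into one-way tests is the extra step the second MEANS statement cannot do.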
> Dennis Roberts wrote:
>> i am wondering, except in certain linear transformation situations, WHY would one want to test a null where the null was that the correlation between X and Y is 1? maybe the original poster could give an example where this would be a reasonable hypothesis to even test?
>
> Let me jump in here, as this null does appear in my discipline, population genetics. In looking at genetic correlations, if two traits have a correlation of |1| then they cannot evolve independently. Thus, it is of interest to test this null for some types of trait combinations.
> Let me also put forward my testing method, at the risk of being taken to task. Using a tanh^-1 transform, one can test the null rho = some value other than 0. (For example, see the explanation in Zar.) The problem is that tanh^-1(1) is undefined. Instead, I test rho = 0.99. The logic being that if that is rejected, one can reject rho = 1. Obviously there is a slight loss of power, but acceptable under the circumstances.
>
> Sam

This is unnecessarily complicated. If rho = 1 (population value), then all sample correlations (r) from the population must equal 1. Thus, reject H0: rho = 1 when r < 1.

Hans-Peter Piepho
Institut f. Nutzpflanzenkunde WWW: http://www.wiz.uni-kassel.de/fts/
Universitaet Kassel Mail: piepho@wiz.uni-kassel.de
Steinstrasse 19 Fax: +49 5542 98 1230
37213 Witzenhausen, Germany Phone: +49 5542 98 1248
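Sam's workaround in symbols: the Fisher transform z = arctanh(r) is approximately normal with standard error 1/sqrt(n-3), so a test of H0: rho = 0.99 is one line. (Piepho's point stands: for rho = 1 itself no test is needed, since any r < 1 refutes it exactly.) A sketch:

```python
import math

def fisher_z(r, rho0, n):
    """z statistic for H0: rho = rho0, using the arctanh (Fisher) transform.
    Valid only for |rho0| < 1; arctanh(1) is undefined, hence Sam's rho0 = 0.99."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)

# e.g. r = 0.95 from n = 30 pairs, tested against H0: rho = 0.99
z = fisher_z(0.95, 0.99, 30)   # about -4.23: reject rho = 0.99, and hence rho = 1
```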
lucz@ix.netcom.com wrote:
> If you have access to SAS it is very easy to do stepwise regression.

There is a problem here. Conventional statistical theory assumes the model is known in advance, and all you have to do is estimate the parameters. There are unbiased methods for doing this: least squares and maximum likelihood. When the estimation is conditioned on the fact that the variables are 'significant' according to some stepwise method, they are most definitely NOT unbiased estimates. So... you have to split up your data set, build the model on one subset, and estimate the parameters on a different subset. Why does G Asha say the friend *needs* to do stepwise regression?

Blaise F Egan
Data Mining Group, BT Labs
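Blaise's split-sample recipe in miniature: choose the variable on one half of the data, then estimate its effect on the other half, so the selection step cannot bias the estimate. A toy sketch with one real predictor and one noise predictor (all names and numbers are invented; correlation stands in for the fitted coefficient):

```python
import random

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

random.seed(1)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]     # real predictor
x2 = [random.gauss(0, 1) for _ in range(n)]     # pure noise
y = [a + random.gauss(0, 1) for a in x1]        # y depends on x1 only

train, hold = slice(0, 50), slice(50, 100)
# Step 1: SELECT on the training half only.
name, xs = max([("x1", x1), ("x2", x2)],
               key=lambda kv: abs(corr(kv[1][train], y[train])))
# Step 2: ESTIMATE on the untouched holdout half.
estimate = corr(xs[hold], y[hold])
```

The holdout estimate has an honest sampling distribution precisely because the holdout half played no part in choosing the variable.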
Glen Barnett (barnett@agsm.unsw.edu.au) wrote:
: Can anyone suggest references for the loss of degrees of freedom in a regression situation under heteroscedasticity?
: Alternatively, the equivalent effect in unbalanced one-way ANOVA may be of help.

If you have much effect, then a skewed distribution will give you heteroscedasticity. I have read about the more general case of densities with long tails, or with a mixture of two populations with variances that are very different. This was Cressie, writing in about 1977, I think. He showed that, basically, if a few cases (out of a bigger N) determine your total variance, then the d.f. of your error should be regarded as not much more than the N of the few cases with large variance. He was looking at mixture models.

Here is some of the logic. For the t-test, for instance, the error d.f. for the test is approximated by using a formula that matches the VARIANCE of the variance terms. So, if just a few of your cases dominate the variance, then it will act like a variance with few d.f. That does mean that you are better off if your larger sample does have the larger variance, if you were comparing two samples that were each, separately, HOMOGENEOUS. But if your larger-variance, larger-N sample owes its variance to a "few" outliers, then your d.f. is really closer to "few".

Cressie also made some interesting comments, to the effect that skewed distributions, upon randomization and testing, yield t-tests with a short tail, and stumpy, symmetrical distributions yield t-tests with long tails.

Hope this helps. Maybe I will run across the exact reference in the next few days, if someone else does not post something useful.

Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic, Univ. of Pittsburgh

*** ==================== the rest of the note ============
: The simpler of the situations I'm in essentially has a model of a set of parallel lines, for which I'm interested in finding the p-values of the parameters representing the differences in height. The smaller sample sizes generally have the smaller variances. If the two-sample t is any guide, this indicates the effect of d.f. should be small, but I'd like to see what is out there on this problem.
: Most of what I've been able to find out there so far either just falls back on asymptotic normality, or pretends that the degrees of freedom don't change.
: Glen
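The "formula that matches the variance of the variance terms" Rich describes is the Welch-Satterthwaite approximation, which is easy to evaluate directly. Note how a small high-variance group drags the effective d.f. down toward its own sample size, just as the mixture argument says:

```python
def welch_df(s1sq, n1, s2sq, n2):
    """Welch-Satterthwaite approximate error d.f. for a two-sample t test."""
    v1, v2 = s1sq / n1, s2sq / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(welch_df(1.0, 10, 1.0, 10))    # equal variances: the full 2(n-1) = 18
print(welch_df(100.0, 5, 1.0, 100))  # 5 high-variance cases dominate: ~4 d.f.
```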
In article <9612121754.AA12393@mailhost.sfwmd.gov>, Steven Hill writes:
|> I was reading a paper "Comparisons of Treatments After an Analysis of Variance in Ecology" [quoted paragraph snipped; see the original post above]
|>
|> 1.) Nonparametric statistics are not affected by transformation (eg log transformation) so why must there be a suitable transform to stabilize the variances across the treatments?
|>
|> 2.) In my reading, I have read that nonparametric methods were always as, if not more powerful, than parametric methods even in cases of normality. Has anyone else come across similar or different arguments?

IIRC from my nonparametrics class, the Wilcoxon test *is* affected by heterogeneous variances: the alpha level inflates.

Barry
Hi folks, I often have to analyze data where the DVs are frequencies whose distribution is skewed toward zero. That is, often as many as half of the subjects will score 0 on a given DV, while the remaining scores will be spread across small integer values (e.g., 1, 2, 3, 4). I could dichotomize these variables and use logistic or loglinear models, I suppose, but I'm more comfortable and familiar with linear regression (and power calculations are easier for it), and would prefer to use it if I can. It seems to me that this should be okay. I can't see any obvious violation of regression assumptions. My intuition says using the frequencies rather than the dichotomies allows me to keep more information. What do you folks think? Is there a type of analysis better suited to this type of DV? Is it helpful or necessary to transform the frequency data before analysis? Thanks.

Bruce L. Lambert, Ph.D.
Department of Pharmacy Administration
University of Illinois at Chicago
lambertb@uic.edu http://ludwig.pmad.uic.edu/~bruce/
Phone: +1 (312) 996-2411 Fax: +1 (312) 996-0868