Subject: pronunciation of statisticians' names
From: Desmond Allen
Date: Fri, 08 Nov 1996 19:42:25 +0000
I am presently trying to render the names of various statisticians into
Japanese. Unfortunately, I don't know the national or linguistic origins
of many of them, which makes it impossible to guess the pronunciation of
their names, particularly the correct vowel sounds and their length, the
syllable stress, silent letters, v/w, ch/k, etc.
Could anyone help me with the following please:
Sidak, Hochberg, Einot, Welsch, Tamhame, Waller, Breslow, Meyer, Olkin,
Tarone, Levene, Geisser, Cronbach, Mantel, Haenszel, Mauchly, Desu,
Hamann, Jaccard, Dice, Wald
Many thanks
Des Allen
Machida Shi, Japan
e-mail: dw6d-alln@asahi-net.or.jp
Subject: Re: I need help with a t-test.
From: Warren
Date: 8 Nov 1996 14:26:30 GMT
"Andrew " wrote:
>I have a sample of five transgenic seedlings assayed for the presence of an
>enzyme. And then another sample of five transgenic seedlings assayed for
>the enzyme after treatment with an inducer.
>
>Both samples show visual trends to having different means; however, the
>variance for the treated plants is very high due to differing copy numbers
>of the inserted gene.
>
Andrew,
I am sure someone can help, but will probably need a little more
clarification. You indicate you are performing t-tests...did you take
measurements at just one time point, or did you measure enzyme presence
over time? You mention visual trends, which is puzzling and looks to me
like you took measurements over time...is that right? If so, I am not
sure that a simple t-test is appropriate. If "both samples show visual
trends...", you must have plotted either the individual observations
or the means for the 5 in each group.
You have 5 seedlings in each group...did you take replicate measures on
each of these seedlings or just one measurement? By presence of enzymes,
is this quantitative? Or "presence/absence"? On an ordinal scale? Did
you randomize seedlings to treatment? Do you have measurements of
enzymes both before and after treatment or does the assay destroy the
seedling?
With only 5 in each group, it will be difficult to pick up small
differences and the heteroscedasticity is a problem. Did you try a
transformation to see if that helped calm down the variances? The other
members of the group might have better suggestions, but one thing you
might look at is a randomization test. For two groups with 5 in each
group, there are 252 possible assignments of subjects to groups.
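The exact randomization test mentioned above can be sketched in a few lines
of Python. This is only an illustration (the function name and the test
statistic, a difference in means, are my own choices, not Warren's):

```python
from itertools import combinations

def randomization_test(group_a, group_b):
    """Exact two-sample randomization test on the difference in means.

    Enumerates every assignment of the pooled observations to two
    groups of the original sizes and returns the fraction of
    assignments whose absolute mean difference is at least as large
    as the observed one (an exact two-sided p-value).
    """
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    count = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed - 1e-12:  # small tolerance for float ties
            count += 1
        total += 1
    return count / total

# With 5 observations per group, combinations(range(10), 5) yields
# exactly the 252 assignments mentioned above.
```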
Subject: Re: ANOVA Question
From: wpilib+@pitt.edu (Richard F Ulrich)
Date: 8 Nov 1996 15:10:58 GMT
After I asked for clarification,
Henrik Heine (1heine@rzdspc3.informatik.uni-hamburg.de) wrote:
<< ... >>
: Yes, 'phone duration' is a quantitative variable which actually does
: have a distribution somewhat close to a gaussian.
: The problem is that there are some 'qualitative' variables like
: 'broad phonetic class' and 'phone'. These variables have
: a fixed number of possible values (do you say 'levels'?).
: 'broad phonetic class' has three levels: 'p,t,k', 'b,d,g' and 'others'.
: If you now wanted to calculate the variance of this variable
: given a number of samples, how would you do this?
: In 'Statistical Methods for the Social and Behavioral Science,
: L. A. Marascuilo' I read, that you use an ANOVA to test the
: H0 that several means are equal. But it says that this is very
: close to calculating the amount of variance of the dependent
: variable covered by each of the independent.
: But there seems to arise a problem when you have qualitative
: variables ... it says somewhere that you would have to do
: all your calculations for each 'level' separately, since the
: formulation of a 'mean' and thus of 'variance' does not make
: sense in this case.
: Can you make something out of this? :)
Well, what you cited DOES make sense to me. Maybe I can give you
a more explicit model.
Think of ANOVA as an analysis of Sums of Squares (SS). I have
been confused by textbooks before, where a SS is called a Variance,
but a SS divided by (N) or (N-1) is also called a variance,
where the latter kind is also the square of the Standard Deviation.
Formulas then would make the subtle distinction that the first kind
is symbolized by S-squared and the second by s-squared, where
capitalization is important.
For a group of numbers, the Mean is the best "estimator" of their
center, in the sense that it minimizes the SS of deviations; that
SS is the TOTAL for the group.
If you break the group into two, and compute separate means, and
compute SS separately around the two means, then the SS which is
WITHIN the two groups is less than the amount that was TOTAL, with
the limit of being zero if the two means were exactly equal.
Then, a difference between the two groups can be computed as
BETWEEN = TOTAL - WITHIN .
The BETWEEN is what is tested for a difference between groups. The
WITHIN is sometimes called Residual, or Error.
When you divide these SS terms by appropriate Degrees of freedom,
you get Mean Squares; whose ratio gives you the F-test.
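The bookkeeping above can be written down directly. Here is a minimal
pure-Python sketch of the one-way case (not tied to any particular
package; the function name is my own):

```python
def one_way_anova(groups):
    """One-way ANOVA via the sums-of-squares identity described above.

    groups: a list of lists of numbers.
    Returns (ss_between, ss_within, F).
    """
    all_y = [y for g in groups for y in g]
    n = len(all_y)
    grand_mean = sum(all_y) / n
    # TOTAL: squared deviations from the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    # WITHIN: squared deviations from each group's own mean.
    ss_within = sum(sum((y - sum(g) / len(g)) ** 2 for y in g)
                    for g in groups)
    ss_between = ss_total - ss_within      # BETWEEN = TOTAL - WITHIN
    df_between = len(groups) - 1
    df_within = n - len(groups)
    # Mean Squares: SS over degrees of freedom; their ratio is F.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat
```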
ANOVA in a regression uses a LINE to estimate each number Y, rather
than a separate mean for each of several groups. The variance is
still computed as a SS but this SS uses the square of Y-distance
from the line, rather than the distance from a Y-group mean.
Obviously, a line MIGHT conceivably fit a set of continuous Ys
perfectly, whereas, continuous scores will still have variation
around the means of two or more groups; but the same F-test can be
used. -- Especially if the fit to a line where X predicts Y is
pretty good, continuous scores will offer more power to test a
hypothesis, compared to the result if X were dichotomized, so the
fit could not be nearly so good.
Hope this helps.
Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic Univ. of Pittsburgh
Subject: Re: pseudo random sequence using (ax+b) mod c
From: hrubin@b.stat.purdue.edu (Herman Rubin)
Date: 8 Nov 1996 11:12:02 -0500
In article <32822FE4.3CA@cod.nosc.mil>, Richard Lay wrote:
>Hi all.
>I have recently heard of a (rather old) method for generating pseudo
>random sequences of numbers using the relation:
>x(new) = (ax+b)mod c
>How does one pick a and b to make sure that the x's go through all c
>numbers before repeating?
This cannot be done if c is odd, although if c is an odd prime, and
a is a primitive root mod c, c-1 values are obtained.
If c is a power of 2, a is congruent to 5 mod 8, and b is odd, all
c values are obtained.
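For small moduli, both conditions are easy to check by brute force.
A quick Python sketch (function name made up; it counts the distinct
states visited before the sequence repeats, assuming the seed lies on
the cycle):

```python
def lcg_period(a, b, c, seed=0):
    """Number of distinct states of x -> (a*x + b) mod c from seed.

    Equals the cycle length when the seed is on the cycle, which is
    always the case for a full-period generator.
    """
    seen = set()
    x = seed
    while x not in seen:
        seen.add(x)
        x = (a * x + b) % c
    return len(seen)

# c = 2^8, a = 5 (congruent to 5 mod 8), b = 1 (odd): full period c.
# c = 7 (odd prime), a = 3 (a primitive root mod 7), b = 0: c-1 values.
```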
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hrubin@stat.purdue.edu Phone: (317)494-6054 FAX: (317)494-0558
Subject: Re: Finite Population Correction
From: wpilib+@pitt.edu (Richard F Ulrich)
Date: 8 Nov 1996 15:48:56 GMT
Richard Reid (rreid@smartt.com) wrote:
: Greetings:
: How is the "finite population correction" derived?
This is a timely question for the U.S., since we just had elections,
and projecting election returns is the main place that anyone sees
the "finite population correction" in action.
If SUM = A + B, then, assuming a correlation of zero,
Variance(SUM) = Variance(A) + Variance(B)
Next: if Variance(A) = 0, say, then
Variance(SUM) = Variance(B)
For the usual formulas for "finite population correction", the
expected scores for B are assumed to be the same as A, where A
has already been measured. And we are trying to make a statement
about the SUM.
So, we look at A as if it were a "sample" in order to estimate
the mean and standard deviation (S.D.); but the S.D. is then
applied ONLY to the n-sub-B subjects who have not yet been
measured.
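That description can be followed literally in a short sketch. This is a
deliberately simplified illustration (the function name is made up, and
it ignores the sampling error in the estimated mean): the counted part A
contributes no variance, and the S.D. is applied only to the units not
yet measured.

```python
import math

def projected_sum_interval(counted, n_total, z=1.96):
    """Project a finite-population total from a partial count.

    Treats the counted values as a sample, estimates their mean and
    variance, and applies the variance ONLY to the (n_total - n)
    units not yet observed: Var(SUM) = Var(A) + Var(B), Var(A) = 0.
    """
    n = len(counted)
    mean = sum(counted) / n
    var = sum((x - mean) ** 2 for x in counted) / (n - 1)
    remaining = n_total - n
    point = sum(counted) + remaining * mean
    se = math.sqrt(remaining * var)   # only the unmeasured part varies
    return point - z * se, point + z * se
```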
In election parlance: once XX% of the precincts have reported,
certain leads are "insurmountable", based on reasonable assumptions
about how the uncounted votes will be cast -- that is, only some
REASONABLE proportion will go for the current underdog.
What is Reasonable? Your text-book formulas start with the easy
case, where what-is-in predicts what-is-to-come. But it is
common in elections that certain regions report earlier, and
different regions may go heavily for different candidates. From
the formula, it is easy to see that you can combine different
Expectations.
Here in Pennsylvania, the TV networks mis-called at least two
state-wide races on Election night. Two "winners", based on huge
leads coming out of (Democratic) Pittsburgh and Philadelphia, woke
up in the morning to be informed that the slower counts coming in
from all the (Republican) rural areas (mid-state, and the northern
tier) had given the race to their opponents. Some rural areas
are still using paper ballots instead of voting machines, which is
the main reason for slower reporting.
However, I think that the observers who MISSED the outcome might
not be so foolish as this looks at first glance. One news item
reported that those two candidates had made the mistake of (the news-
item said) ignoring those counties entirely -- and thus they would
have run BEHIND EXPECTATIONS if those expectations were based on
the last Democratic candidates for those offices. That last is
just my own interpolation, since the news item did not explicitly
excuse the broadcasters; and I do not know if the error was
small enough for that excuse to suffice.
Rich Ulrich, biostatistician wpilib+@pitt.edu
Western Psychiatric Inst. and Clinic Univ. of Pittsburgh
Subject: Re: Confounding variables in regression
From: aacbrown@aol.com
Date: 8 Nov 1996 16:55:23 GMT
Dan Kehler <005769k@ace.acadiau.ca> in <32827192.39D3@ace.acadiau.ca>
writes:
> Given the common situation of wanting to know the effect
> of a variable, X2 on Y, independent of the the effect of X1
> on Y, when X1 and X2 are correlated, which of the following
> two methods is more appropriate?
> method 1:
> model 1 = Y ~ X1
> model 2 = residuals(model 1) ~ X2 , test for significance of X2
> Method 2:
> model 1 = Y ~ X1 + X2, test for significance of X2.
As always, it depends on your application. No statistic will really
address the question of causality; all you can do is combine your data
results with your knowledge of the problem. Method 1 assigns the joint
effect to X1. Method 2 is more generally useful because it apportions the
joint effect; however, that apportionment may not be reasonable, which
depends on the application.
If you use method 1 I suggest also regressing X2 ~ X1 and using the
residuals of that regression to predict the residuals of Y ~ X1. Your
regression statistics and plots will be more reliable.
Even more general than Method 2 would be to create three variables, Z, Z1
and Z2 such that X1 = Z + Z1, X2 = aZ + Z2 for some constant a, and Z1 and
Z2 are independent. Then a regression of Y ~ Z, Z1, and Z2 would
explicitly separate the joint effect of X1 and X2 (Z), the additional
effect of X1 (Z1), and the additional effect of X2 (Z2). This might allow you to
draw better conclusions about what causes what.
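The contrast between Method 1 and Method 2 is easy to see numerically.
Below is a rough numpy sketch on simulated data (the true coefficients
2.0 and 1.0, the noise scales, and the helper name are all made up for
illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated predictors, and a response that depends on both.
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.3, size=n)

def ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Method 1: regress Y on X1, then the residuals on X2.
b_y_x1 = ols(x1, y)
resid_y = y - (b_y_x1[0] + b_y_x1[1] * x1)
b_m1 = ols(x2, resid_y)[1]

# Method 2: regress Y on X1 and X2 jointly.
b_m2 = ols(np.column_stack([x1, x2]), y)[2]

# Method 2 recovers the true coefficient on X2; Method 1's estimate is
# attenuated because the shared variation was already credited to X1.
```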
Aaron C. Brown
New York, NY
Subject: Re: Random number generation
From: wpilib+@pitt.edu (Richard F Ulrich)
Date: 8 Nov 1996 19:40:22 GMT
-----------------I am posting what follows for J P Chandler, whose local
-----------------netserver is having problems. I'm glad to see some
-----------------good information, in this note and some previous ones,
-----------------to replace my own casual assumptions.
------------------------------Rich Ulrich
X-From: IN%"jpc@a.cs.okstate.edu" "J P Chandler" 6-NOV-1996 18:26:11.33
X-From: J P Chandler
In article <55qs5h$2oc@usenet.srv.cis.pitt.edu>,
Richard F Ulrich wrote:
>==========================from latest message> (from Barry Hembree),
>
>> | : I'm looking for a pseudo random number algorithm which enables me to
>> | : choose the n:th number in the sequence without needing to generate the
>> | : first n-1 numbers (where n can be quite big). Does such a thing exist?
>I can understand the idea of somehow skipping through a cycle, if you
>are given the rules for generating it. But I have to admit that I
>really have little idea of what Herman Rubin was getting at, when
>he posted.
>
>"For the various pseudo-random algorithms in my ken, typically the
>n-th element can be generated in time O(log(n)), and except for n
>VERY large, it is not likely that better will be possible."
>
> -- How in heaven's name does one achieve an efficiency on the
>order of log(n)? That would imply that the 100th-to-come item
>is not much harder to pre-define than the 1000th. Is this a
>statement of theoretical principle (and can we have a hint,
>or a reference), or has someone actually achieved this for some
>interesting PRNGs?
Professor Rubin can speak for himself,
but for a multiplicative congruential generator
x(n+1) = a * x(n) mod m
we have that
x(n) = a^n * x(0) mod m
and we can first find the value of (a^n mod m) and finish
up with one more multiply-and-mod operation.
Computing (a^n mod m) is a well-known problem, covered by Knuth.
Compute (a^2 mod m), (a^4 mod m), (a^8 mod m), etc.,
obtaining each value from the previous value
by one multiply-and-mod operation.
Express n in binary notation, and use one of the quantities
just computed above for each "one" bit in the expression for n.
There are about log2(n) bits in the binary expansion of n.
"Voila!": O(log n).
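In Python the square-and-multiply scheme is exactly what the built-in
three-argument pow performs, so the jump-ahead for the multiplicative
generator can be sketched directly (helper names are my own):

```python
def lcg_step(a, m, x):
    """One step of the multiplicative generator x -> a*x mod m."""
    return (a * x) % m

def lcg_jump(a, m, x0, n):
    """n-th state in O(log n) multiplies: x(n) = a^n * x(0) mod m.

    pow(a, n, m) computes (a^n mod m) by repeated squaring, using
    about log2(n) multiply-and-mod operations, then one more
    multiply-and-mod finishes the job.
    """
    return (pow(a, n, m) * x0) % m
```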
Can this be done for a mixed congruential generator?
(This is left as a problem for the reader.)
Of course all linear congruential generators, both multiplicative
and mixed, suffer from the nonrandom "parallel planes property"
of Marsaglia, and are unfit for use by any serious researcher.
(Just my opinion, of course.)
Shuffling generators do not suffer from the parallel planes property.
Predicting the n-th variate from a shuffling generator
efficiently is an unsolved problem, and perhaps impossible.
Subject: Re: I need help with a t-test.
From: "Andrew "
Date: 9 Nov 1996 00:35:15 GMT
Thanks for your reply; I've reposted more information in the newsgroup. I
hope this is enough. When I say visual trends, I mean that when I plot the
means against each other with the standard error shown, the error bars
don't overlap and the means look different. I've posted some values so
you can see what I mean.
Sorry, but it's been a while since the stats classes, and I'm not sure what
a randomization test is (and the trusty first-year stats book is no help
on this).
I'm looking up transformations now.
Thanks,
Andrew
Subject: Re: Confounding variables in regression
From: Ellen Hertz
Date: Fri, 08 Nov 1996 20:18:56 -0500
Dan Kehler wrote:
>
> We've been struggling with this question for a while:
>
> Given the common situation of wanting to know the effect of a variable,
> X2 on Y, independent of the the effect of X1 on Y, when X1 and X2 are
> correlated, which of the following two methods is more appropriate?
>
> method 1:
>
> model 1 = Y ~ X1
> model 2 = residuals(model 1) ~ X2 , test for significance of X2
>
> Method 2:
>
> model 1 = Y ~ X1 + X2, test for significance of X2.
>
> This is what I think is going on:
>
> These two methods are quite different, and the difference between them
> depends on the degree of correlation between X1 and X2. In the first
> method, the two parameters, b1 and b2, are estimated separately, such that
> the entire variability in Y shared between X1 and X2 is allocated to the
> parameter estimate for X1. In the second method the two parameter
> estimates compete for the variability in Y shared between X1 and X2.
>
> Which method is preferable and why?
>
> We'd be very grateful for some insight or a useful reference.
>
> Thanks,
>
> Dan Kehler
> Acadia University
> Wolfville, NS, CANADA
> B0P IX0
The second approach treats the variables symmetrically
and allows them to control for each other. Suppose, for example,
one's chances of surviving the year were greater for people who exercise
(all other things being equal) but lower for older people (also
all other things being equal), and suppose a much higher proportion of
older subjects stuck to their exercise programs. Starting
with exercise alone in the model could lead to the conclusion that
exercise has a negative effect on survival.
If X1 and X2 are correlated enough, one of them may
turn out to be not significant in the presence of the other.
Then the most "parsimonious" model would be to leave it out.
Subject: More info on " I need help with a t-test"
From: "Andrew "
Date: 9 Nov 1996 00:27:07 GMT
Here is some more info on the analysis:
Five transgenic seedlings were grown on medium containing no inducer. These
seedlings were removed individually and assayed for presence of the enzyme
after 48 hours (a process which kills the seedlings). This gave five
results (one for each seedling).
Another five transgenic seedlings (from the same seedstock) were germinated
on medium with a suspected inducer. These seedlings were removed
individually and assayed after 48 hours. This gave another five results
(one for each transgenic seedling).
The enzyme measurement was quantitative, measuring the specific amount of
enzyme produced.
Here is a sample of the data (each seedling gave one value):
Treated enzyme activity: 107.0, 83.9, 106.9, 680.7 & 38.25
Untreated enzyme activity: 26.9, 3.8, 20.8, 37.5 & 15.1
The values would appear to show an induction of the gene, but the
heteroscedastic t-test (I assume this test is valid) shows that the
difference is not significant. The treated data have a lot of variation
because the plants are at an early growth stage and the number of gene
inserts within the plants varies.
I hope this is enough info.
Andrew Morgan
University of Queensland
P.S. If you're interested, the transgene is a reporter gene - defence
promoter construct. We've joined the promoter sequence of a defence gene
to the sequence of an enzyme (beta-glucuronidase) which we can assay for
expression levels, and put it back into the plants.
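For reference, the heteroscedastic (Welch) t statistic for the posted
numbers can be computed directly. This is a plain-Python sketch, not the
poster's actual software:

```python
import math

def welch_t(sample1, sample2):
    """Welch's unequal-variance t statistic and approximate df."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, v
    m1, v1 = mean_var(sample1)
    m2, v2 = mean_var(sample2)
    se2_1, se2_2 = v1 / len(sample1), v2 / len(sample2)
    t = (m1 - m2) / math.sqrt(se2_1 + se2_2)
    # Welch-Satterthwaite degrees of freedom.
    df = (se2_1 + se2_2) ** 2 / (
        se2_1 ** 2 / (len(sample1) - 1) + se2_2 ** 2 / (len(sample2) - 1))
    return t, df

treated = [107.0, 83.9, 106.9, 680.7, 38.25]
untreated = [26.9, 3.8, 20.8, 37.5, 15.1]
t, df = welch_t(treated, untreated)
# t is only about 1.5 on roughly 4 degrees of freedom, which is why the
# test fails to reach significance despite the very different means.
```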
Subject: [Q] Using pseudoinverse in Bayes discriminant function?
From: Dukki Chung
Date: Fri, 08 Nov 1996 00:13:19 -0500
Hi.
Recently, I had to use a Bayes classifier for a pattern classification
problem. The Bayes discriminant function is:
di(x) = - [ ln|Ci| + (x-mu)^t Ci^-1 (x-mu) ]
The problem was that the covariance matrix Ci was near singular, so the
inverse could not be calculated. So, I used the pseudoinverse instead of
the real inverse.
What I'm wondering is whether this is a valid, justifiable mathematical
or statistical approach.
I would appreciate any comments, suggestions, references, or any
pointers.
Dukki Chung
Department of Electrical Engineering & Applied Physics
Case Western Reserve University
Phone 216-368-8871, FAX 216-368-6039
dkc6@po.cwru.edu
dchung@pgmdi.com
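The substitution being asked about can be sketched as follows. Note that
with a singular Ci the log-determinant is also undefined, so this sketch
replaces it with the log pseudo-determinant (sum of logs of the
eigenvalues above a tolerance). That pairing is one common workaround,
shown here only as an assumption, not as the established answer to the
poster's question:

```python
import numpy as np

def bayes_discriminant(x, mu, cov, tol=1e-10):
    """Quadratic discriminant -[ln|C| + (x-mu)^T C^-1 (x-mu)], with a
    pseudoinverse standing in for the inverse of a (near-)singular C.

    Eigenvalues at or below tol are dropped: the pseudoinverse inverts
    only the retained eigenvalues, and the log-determinant becomes the
    log pseudo-determinant over the same retained eigenvalues.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric
    keep = eigvals > tol
    log_pdet = np.sum(np.log(eigvals[keep]))
    # Moore-Penrose pseudoinverse: V diag(1/lambda) V^T over kept modes.
    cov_pinv = (eigvecs[:, keep] / eigvals[keep]) @ eigvecs[:, keep].T
    d = x - mu
    return -(log_pdet + d @ cov_pinv @ d)
```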