CHAPTER I-2 STATISTICAL INFERENCE AND RANDOM SAMPLING

Continuity and sameness is the first fundamental concept in inference in general, as discussed in Chapter I-1. Random sampling is the second of the great concepts in inference, and it distinguishes probabilistic statistical inference from non-statistical inference as well as from non-probabilistic inference based on statistical data.

When the data of interest are not the result of random sampling, a sample drawn at random is the ideal to which the actual sample is compared. And the properties of a randomly-drawn sample are utilized on the assumption that the actual sample is sufficiently close to the ideal.<1>

The usual goal of a statistical inference is a decision about which of two or more hypotheses one will thereafter choose to believe and act upon. The strategy is to consider the behavior of a given universe in terms of the samples it is likely to produce, and if the observed sample is not a likely outcome we then proceed as if the sample did not in fact come from that universe. (The previous sentence is a restatement in somewhat different form of the core of statistical analysis.)

At a more technical level now: Probably the most important task of statistical inference is to determine the existence (or extent) of sameness when intuition alone does not provide a satisfactory answer. Two common cases are a) the extent of overlap between two distributions, and b) the probability that a sample should be said to be the same as a universe in the sense of having been drawn from it. The statistical inference may be thought of as an operational specification that makes more precise a previously-vague notion about sameness.

Let's begin the discussion with a simple though unrealistic situation. Your friend Arista a) looks into a cardboard carton, b) reaches in, c) pulls out her hand, and d) shows you a green ball. What might you reasonably infer?

You might at least be fairly sure that the green ball came from the carton, though you recognize that Arista might have had it concealed in her hand when she reached into the carton. But there is not much more you might reasonably conclude at this point except that there was at least one green ball in the carton to start with. There could be no more balls; there could be many green balls and no others; there could be a thousand red balls and just one green ball; and there could be one green ball, a hundred balls of different colors, and two pounds of mud - given that she looked in first, it is not improbable that she picked out the only green ball among other material of different sorts.

There is not much you could say with confidence about the likelihood of yourself reaching into the same carton with your eyes closed and pulling out a single green ball. To use other language (which some philosophers might say is not appropriate here because the situation is too specific), there is little basis for induction about the contents of the box. Nor is the situation very different if your friend three times in a row reaches in and then hands you a green ball each time.

So far we have put our question rather vaguely. Let us frame a more precise inquiry: What do we predict about the next item(s) we might draw from the carton? If we assume - based on who-knows-what information or notions - that another ball will emerge, we could simply use the principle of sameness and (until we see a ball of another color) predict that the next ball will be green, whether one or three or 100 balls is (are) drawn.
But now what if Arista pulls out nine green balls and one red ball? The principle of sameness cannot be applied as simply as before. Based on the most recent ball alone, we would predict that the next one will be red. But taking into account all the balls we have seen, the next will "probably" be green. We have no solid basis on which to go further. There cannot be any "solution" to the "problem" of reaching a general conclusion on the basis of these specific pieces of evidence.

Now consider what you might conclude if you were told that a single green ball had been drawn with a random sampling procedure from a box containing nothing but balls. Knowledge that the sample was drawn randomly from a given universe is grounds for belief that one knows much more than if the sample were not drawn randomly.

First, you would be sure - if you had reasonable basis to believe that the sampling really was random, which is not easy to guarantee - that the ball came from the box. Second, you would guess that the proportion of green balls is not very small, because if there are only a few green balls and many other-colored balls, it would be unusual - that is, the event would have a low probability - to draw a green ball. Not impossible, but unlikely. And we can compute the likelihood of drawing a green ball - or any other combination of colors - for different assumed compositions within the box. So the knowledge that the sampling process is random greatly increases our ability - or our confidence in our ability - to infer the contents of the box.

Let us note well the strategy of the previous paragraph: Ask about the probability that one or another of the various possible contents of the box (the "universe") would produce the observed sample, on the assumption that the sample was drawn randomly. This is the central strategy of all statistical inference, though I do not find it so stated elsewhere. We shall come back to this idea shortly.

There are several kinds of questions one might ask about the contents of the box. One general category includes questions about our best guesses of the box's contents - that is, questions of estimation; another category includes questions about our surety of that description, and our surety that the contents are similar to or different from the contents of other boxes. The estimation questions can be subtle and unexpected (Savage, 1954/1972, Chapter 15), but do not cause major controversy about the foundations of statistics. Hence I shall merely mention that the method of moments and the method of maximum likelihood serve most of our needs, and often agree in their conclusions; furthermore, we often know when the former may be inappropriate. So we can quickly move on to questions about the extent of surety in our estimations.

Consider your reaction if the sampling produces 10 green balls in a row, or 9 out of 10. If you had no other information (a very important assumption that we will leave aside for now), your best guess would be that the box contains all green balls, or a proportion of 9 of 10, in the two cases respectively. This estimation process seems natural enough.

You would be surprised if someone told you that instead of the box containing the proportion in the sample, it contained just half green balls. How surprised? Intuitively, the extent of your surprise would depend on the likelihood that a half-green "universe" would produce 10 or 9 green balls out of 10. This surprise is a key element in the logic of the hypothesis-testing branch of statistical inference.
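As an aside that is not part of the original text, here is a small Python sketch of that "surprise" computation: how often a half-green universe would yield 9 or more greens in 10 random draws. The book's own programs use RESAMPLING STATS (see endnote 3); the function names and the 100,000 trials below are my own choices, and the exact binomial formula is included only as a check on the simulation.

import random
from math import comb

def simulate(p_green, n_draws=10, threshold=9, trials=100_000):
    # Estimate P(at least `threshold` greens in `n_draws`) from a universe
    # whose proportion of green balls is `p_green`, by repeated random sampling.
    hits = 0
    for _ in range(trials):
        greens = sum(random.random() < p_green for _ in range(n_draws))
        if greens >= threshold:
            hits += 1
    return hits / trials

def exact(p_green, n_draws=10, threshold=9):
    # The same probability from the binomial formula.
    return sum(comb(n_draws, k) * p_green**k * (1 - p_green)**(n_draws - k)
               for k in range(threshold, n_draws + 1))

print(simulate(0.5))   # roughly 0.011
print(exact(0.5))      # 11/1024, about 0.0107

Either route puts the chance at about one percent, which is why 9 or 10 greens from a supposedly half-green box would indeed be surprising.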
We learn more about the likely contents of the box by asking about the probability that various specific populations of balls within the box would produce the particular sample that we received. That is, we can ask how likely a collection of 25 percent green balls is to produce (say) 9 of 10 greens, and how likely collections of 50 percent green, 75 percent green, 90 percent green (and any other collections of interest) are to produce the observed sample. That is, we ask about the consistency between any particular hypothesized collection within the box and the sample we observe. And it is reasonable to believe that those universes which have greater consistency with the observed sample - that is, those universes that are more likely to produce the observed sample - are more likely to be in the box than other universes.

What we have just done (to repeat, as I shall repeat many times) is the basic strategy of statistical investigation. If we observe 9 of 10 green balls, we then determine that universes with (say) 9/10 and 10/10 green balls are more consistent with the observed evidence than are universes of 0/10 and 1/10 green balls. So by this process of considering specific universes that the box might contain, we make possible more specific inferences about the box's contents based on the sample evidence than we could without this process.

Please notice the role of the concept of probability and the actual assessment of probabilities here: By one technical means or another (either resampling or formulas), we assess the probabilities that a particular universe will produce the observed sample, and other samples as well.

It is of the highest importance to recognize that without additional knowledge (or assumption) one cannot make any statements about the probability of the sample having come from any particular universe, on the basis of the sample evidence. (Better read that last sentence again.) We can only speak about the probability that a particular universe will produce (in contrast to did produce) the observed sample, a very different matter. This issue will arise again very sharply in the context of confidence intervals.

Let us generalize the steps in statistical inference:

1. Frame the original question as: What is the chance that the observed sample s came from population S? That is, what is the probability of (If s then S)?

2. Proceed to this question: What kinds of samples does the postulated<2> universe S produce, with which probability? That is, what is the probability of this particular s coming from S? That is, what is p(s|S)?

3. Actually investigate the behavior of S with respect to s and other samples. One can do this in two ways: a. One can use the calculus of probability, perhaps resorting to Monte Carlo methods if an appropriate formula does not exist. Or, b. One can use resampling (in the larger sense); the domain of resampling is meant here to equal all Monte Carlo experimentation except for the use of Monte Carlo methods for i) approximations, ii) investigation of complex functions in statistics and other theoretical mathematics, and iii) uses elsewhere in science. Resampling in its more restricted sense includes i) the bootstrap, ii) permutation tests, and iii) other non-parametric simulation methods of statistics.

4. Interpret the probabilities that result from step 3 in terms of i) acceptance or rejection of hypotheses, ii) surety of conclusions, or iii) inputs to decision theory.
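To make steps 2 and 3 concrete, here is a brief Python sketch (my own illustration, not the author's) that estimates p(s|S) by resampling for several hypothesized universes S, taking s to be the observed sample of 9 greens in 10 draws; the universes tried are simply the proportions mentioned earlier.

import random

def p_sample_given_universe(p_green, greens_observed=9, n_draws=10, trials=100_000):
    # Estimate the probability that a universe with proportion `p_green` of
    # green balls produces exactly `greens_observed` greens in `n_draws`.
    hits = 0
    for _ in range(trials):
        greens = sum(random.random() < p_green for _ in range(n_draws))
        if greens == greens_observed:
            hits += 1
    return hits / trials

for p in (0.25, 0.50, 0.75, 0.90):
    print(f"universe {p:.0%} green: estimated p(s|S) = {p_sample_given_universe(p):.4f}")

In line with the argument above, the 75 percent and 90 percent green universes turn out to be far more consistent with the observed sample than the 25 percent or 50 percent universes.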
Here is the short definition of statistical inference: The selection of a probabilistic model that might resemble the process you wish to investigate, the investigation of that model's behavior, and the interpretation of the results. We will get even more specific about the procedure when we discuss the canonical procedures for hypothesis testing and for the finding of confidence intervals in the chapters on those subjects.

The discussion so far has been in the spirit of what is known as hypothesis testing. The result of a hypothesis test is a decision about whether or not one believes that the sample is likely to have come from the "benchmark [postulated] universe" S. The logic is that if the probability of such a sample coming from that universe is low, we will then choose to believe the alternative - to wit, that the sample came from the universe that resembles the sample. The underlying idea is that if an event would be very surprising if it had really happened - as it would be very surprising if the dog had really eaten the homework - we are inclined not to believe that it happened. (This logic will be explored further in Chapter 00 on hypothesis testing.)

We have so far assumed that our only relevant knowledge is the sample. And though we almost never lack some additional information, this can be a sensible way to proceed when we wish to suppress any other information or speculation. This suppression is controversial; those known as Bayesians or subjectivists want us to take into account all the information we have. But even they would not dispute suppressing information in certain cases - such as a teacher who does not want to know students' SAT scores because s/he might want to avoid the possibility of unconsciously being affected by those scores, or an employer who wants not to know the potential employee's ethnic or racial background even though it might improve the hiring process, or a sports coach who refuses to pick the starting team each year until the players have competed for the positions. If the Bayesians will admit the reasonability of suppressing information in at least some situations, it will be a major step in accommodation and in bringing all views into greater harmony. (More about this topic in Chapter 00.)

Now consider a variant on the green-ball situation discussed above. Assume that you are told that there is (say) an equal probability that the sample of nine green balls and one red ball was drawn from each of two specified universes - for example, two urns of balls, one with 50 percent green balls and the other with 80 percent green balls. On the basis of your sample you can then say how probable it is that the sample came from one or the other. You proceed by computing the probabilities (often called the likelihoods in this situation) that each of those two universes would individually produce the observed sample - probabilities that you could arrive at with resampling, with Pascal's Triangle, with a table of binomial probabilities, with the Normal approximation and the Z distribution, or with yet other devices. Those probabilities are .01 and .27, and the ratio of the two is between .03 and .04. That is, fair betting odds are about 1 to 27.<3>

Actual situations that fit this Neyman-Pearson model are not frequently found. Let us consider a genetics problem on this model. Plant A produces 3/4 black seeds and 1/4 reds; plant B produces all reds. You get a red seed. Which plant would you guess produced it? You surely would guess plant B.
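For readers who want to check the numbers just quoted, here is a short Python sketch (an addition of mine, not the author's) that computes the two likelihoods from the exact binomial formula rather than from resampling, Pascal's Triangle, or a table.

from math import comb

def likelihood(p_green, greens=9, n=10):
    # P(exactly `greens` green balls in `n` draws | universe with proportion p_green).
    return comb(n, greens) * p_green**greens * (1 - p_green)**(n - greens)

like_half = likelihood(0.5)     # about .01
like_eighty = likelihood(0.8)   # about .27
print(like_half, like_eighty, like_half / like_eighty)   # ratio between .03 and .04
# With equal priors, the probability that the sample came from the 50 percent
# urn is the first likelihood divided by the sum of the two:
print(like_half / (like_half + like_eighty))   # about .035, odds of roughly 1 to 27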
Now, how about a sample of 9 reds and a black, drawn from plant A or from plant C, the latter producing 50 percent reds on average? Before answering that, let us put the earlier question - about the single red seed and plants A and B - more precisely: What betting odds would you give that the one red seed came from plant B? Let us reason this way: If you do this again and again, drawing equally often from each plant, 4 of 5 of the red seeds you see will come from B. Therefore, reasonable (or "fair") odds are 4 to 1, because this is in accord with the ratios with which red seeds are produced by the two plants - 4/4 to 1/4.

Now back to the sample of 9 reds and a black, and plants A and C. It would make sense that the appropriate odds would be derived from the probabilities of the two plants producing that particular sample, probabilities of the kind we computed above.

Now let us move to a bit more complex problem: Consider two urns - urn G with 2 red balls and 1 black ball, and urn H with 100 red and 100 black balls. Someone flips a coin to decide which urn will be drawn from, reaches into that urn, and chooses two balls without replacing the first one before drawing the second. Both are red. What are the odds that the sample came from urn G? Clearly, the answer should derive from the probabilities that the two urns would produce the observed sample.<4>

Let's restate the central issue. One can assess the probability that a particular plant which produces on average 1 red and 3 black seeds will produce one red seed, or 5 reds among a sample of 10. But without further assumptions - such as the assumption above that the possibilities are limited to two specific universes - one cannot say how likely a given red seed is to have come from a given plant, even if we know that that plant produces only reds. (For example, it may have come from other plants producing only red seeds.) When we limit the possibilities to two universes (or to a larger set of specified universes) we are able to put a probability on one hypothesis or another. But to repeat, in many or most cases, one cannot reasonably assume it is one or the other. And then we cannot state any odds that the sample came from a particular universe. This is a very difficult point to grasp, experience shows, but a crucial one. (It is the sort of subtle issue that makes statistics so difficult.)

The additional assumptions necessary to talk about the probability that the red seed came from a given plant are the stuff of statistical inference. And they must be combined with such "objective" probabilistic assessments as the likelihood that a 1-red-3-black plant will produce one red, or 5 reds of 10.

Now let us move one step further. Instead of stating as a fact under our control that there is a .5 chance of the sample being drawn from each of the two urns in the problem above, let us assume that we do not know the probability of each urn being picked, but instead we estimate a probability of .5 for each urn, based on a variety of other information, all of it uncertain. But though the facts are now different, the most reasonable estimate of the odds that the observed sample was drawn from one or the other urn will still be the same - because in both cases we were working with a "prior probability" of .5. (The term "prior probability" is Bayesian.) And when we view the situation this way, the Neyman-Pearson model may be seen perfectly well in a Bayesian framework.

Now let us go a step further by allowing the universes from which the sample may have come to have different assumed probabilities as well as different compositions. That is, we now consider prior probabilities other than .5.
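Here is a worked sketch, not in the original text, of the urn G and urn H problem together with the effect of a prior probability. The coin flip gives each urn a prior of .5; the alternative prior of .1 at the end is a number of my own choosing, added only to show how a prior other than .5 changes the answer.

from fractions import Fraction

# Chance that each urn yields two red balls when two are drawn without replacement.
p_two_red_G = Fraction(2, 3) * Fraction(1, 2)         # urn G: 2 red, 1 black
p_two_red_H = Fraction(100, 200) * Fraction(99, 199)  # urn H: 100 red, 100 black

def posterior_G(prior_G):
    # P(urn G | both balls red), by Bayes' rule, for a given prior on urn G.
    prior_H = 1 - prior_G
    joint_G = prior_G * p_two_red_G
    joint_H = prior_H * p_two_red_H
    return joint_G / (joint_G + joint_H)

print(float(p_two_red_G), float(p_two_red_H))   # about 0.333 and 0.249
print(float(posterior_G(Fraction(1, 2))))       # coin-flip prior: about 0.57
print(float(posterior_G(Fraction(1, 10))))      # a skeptical prior of .1: about 0.13

With the coin-flip prior the posterior probability for urn G is about .57 - odds of roughly 4 to 3 - and, as the text argues, that answer is the same whether the .5 prior is a stated fact or merely our best estimate, while a different prior shifts it.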
It was the contribution of Thomas Bayes to show how to incorporate formally into a computation the "prior" information (which we may choose to call speculation or belief) about the probabilities of drawing from the urns so as to derive a "posterior" probability. But in some or many cases, it is not possible to specify anything further about the "prior distribution" - not even to assume that all possibilities over a given range are of equal probability - and in such a case, you cannot make any reasonable statement about the probability of one or another population based on the sample alone. (People known as "strict Bayesians" say that it is always possible to make meaningful statements about the prior distributions. Whether one can or cannot do so in a particular case seems to me an issue of judgment, however.)

How do we decide which universe(s) to investigate for the likelihood of producing the observed sample, as well as producing samples that are even less likely, in the sense of being more surprising? That judgment depends upon the purpose of your analysis, upon your point of view of how statistics ought to be done, and upon some other factors. This decision is discussed in Section 00.

It should be noted that the logic described so far applies in exactly the same fashion whether we do our work estimating probabilities with the resampling method or with conventional methods. We can figure the probability of nine or more green balls from a universe of (say) p = .7 with either approach.

So far we have discussed the comparison of various hypotheses and possible universes. We must also mention where the consideration of the reliability of estimates comes in. This leads to the concept of confidence limits, which will be discussed in Chapter 00.

Samples Whose Observations May Have More Than Two Values

So far we have discussed samples and universes that we can characterize as proportions of elements which can have only one of two characteristics - green or other, in this case, which is equivalent to "1" or "0". This expositional choice has been solely for clarity. All the ideas discussed above pertain just as well to samples whose observations may have more than two values, and that may be either discrete or continuous.

SUMMARY AND CONCLUSIONS

A statistical question asks about the probabilities of possible generating universes in light of the evidence of a sample. In every case, the statistical answer comes from considering the behavior of particular specified universes in relation to the sample evidence and to the behavior of other possible universes. That is, a statistical problem is an exercise in postulating universes of interest and interpreting the probabilistic distributions of results of those universes. The preceding sentence is the key operational idea in statistical inference, though I do not seem to find a statement like this one in the literature.

Different sorts of realistic contexts call for different ways of framing the inquiry. For each of the established models there are types of problems that that model fits better than do the other models, and other types of problems for which the model is quite inappropriate. Limiting the domain of application in this fashion, together with using the operational definition of probability discussed in Chapter 00, removes the apparent conflicts between the Fisherian, Neyman-Pearson, and Bayesian models of statistical inference.
Fundamental wisdom in statistics, as in all other contexts, is to carry and use a large tool kit rather than applying only a hammer, screwdriver, or wrench no matter what problem is at hand. (Philosopher Abraham Kaplan once stated Kaplan's Law of scientific method: Give a small boy a hammer and there is nothing that he will encounter that does not require pounding.) Studying the text of a poem statistically to infer whether Shakespeare or Bacon is the more likely author is quite different from inferring whether bioengineer Smythe can produce an increase in the proportion of female calves, and both are different from decisions about whether to remove a basketball player from the game or whether to produce a new product.

Some key points:

1) In statistical inference as in all sound thinking, one's purpose is central. All judgments should be made relative to that purpose, and in light of costs and benefits. (This is the spirit of the Neyman-Pearson approach.)

2) One cannot avoid making judgments; the process of statistical inference cannot ever be perfectly routinized or objectified. Even in science, fitting a model to experience requires judgment.

3) The best ways to infer are different in different situations - economics, psychology, history, business, medicine, engineering, physics, and so on.

4) Different tools must be used when the situations call for them - sequential vs. fixed sampling, Neyman-Pearson vs. Fisher, and so on.

5) In statistical inference it is wise not to argue about the proper conclusion when the data and procedures are ambiguous. Instead, whenever doing so is possible, one should go back and get more data, hence lessening the importance of the efficiency of statistical tests. In some cases one cannot easily get more data, or even conduct an experiment, as in biostatistics with cancer patients. And with respect to the past one cannot produce more historical data. But one can gather more and different kinds of data, e.g. the history of research on smoking and lung cancer.

ENDNOTES

<1>: In the course of editing the first two editions of my text on research methods, my friend the late Hanan Selvin never ceased to brace me on writing about a "randomly drawn sample" rather than a random sample, because randomness refers to the process rather than to the outcome. I still slip occasionally into the lazy term, however. When I do so, please note that it is a mistake.

<2>: The postulated universe S bears some likeness to the Kantian-Einsteinian model created by the researcher against which to test the observed data. But instead of deriving from theory or insight or hunch or whatever, in statistical inference the model derives from the sample (plus perhaps a Bayesian prior distribution, about which more shortly). Another difference from the original "scientific" model is that the postulated universe S has no causal connection to the sample except through the process of sampling. Statistical inference resembles the scientific model in that it is assumed not to be a perfect picture of nature. But unlike a scientific model, in the case of a finite universe we assume that larger and larger samples can approach the actual universe.

<3>: Using RESAMPLING STATS, a program to find the probabilities is as follows. Ask: What is the probability of drawing a sample of nine green and one red ball from a) a 50/50 universe, and b) a universe that is 80% green, 20% red?
REPEAT 15000
GENERATE 10 1,2 a          Let 1 = green, 2 = red
COUNT a =1 b               Count the green balls in the sample
SCORE b z-one
END
COUNT z-one =9 k-one
DIVIDE k-one 15000 kk-one

REPEAT 15000
GENERATE 10 1,10 a         Let 1-8 = green, 9-10 = red
COUNT a <=8 b              Count the green balls in the sample
SCORE b z-two
END
COUNT z-two =9 k-two
DIVIDE k-two 15000 kk-two

DIVIDE kk-one kk-two k
PRINT kk-one kk-two k

kk-one = 0.0092
kk-two = 0.27247
k = 0.033766

[source: program redball.sta]

<4>: Just for fun, how about if the first ball drawn is thrown back after being examined? What are the appropriate odds now?
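For readers without RESAMPLING STATS, the following Python sketch (my addition, not the author's) mirrors the logic of the redball.sta program in endnote 3; the function name and the 15,000 trials simply echo the program above.

import random

def chance_of_nine_greens(p_green, trials=15_000):
    # Estimate the probability of exactly 9 greens (and so 1 red) in 10 draws
    # from a universe whose proportion of green balls is p_green.
    hits = 0
    for _ in range(trials):
        greens = sum(random.random() < p_green for _ in range(10))
        if greens == 9:
            hits += 1
    return hits / trials

kk_one = chance_of_nine_greens(0.5)   # about .01
kk_two = chance_of_nine_greens(0.8)   # about .27
print(kk_one, kk_two, kk_one / kk_two)   # ratio about .03 to .04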