
WormBook: The Online Review of C. elegans Biology [Internet]. Pasadena (CA): WormBook; 2005-2018.


A biologist's guide to statistical thinking and analysis*

David S. Fay and Ken Gerow.


The proper understanding and use of statistical tools are essential to the scientific enterprise. This is true both for designing one's own experiments and for critically evaluating studies carried out by others. Unfortunately, many researchers who are otherwise rigorous and thoughtful in their scientific approach lack sufficient knowledge of this field. This methods chapter is written with such individuals in mind. Although the majority of examples are drawn from the field of Caenorhabditis elegans biology, the concepts and practical applications are also relevant to those who work in the disciplines of molecular genetics and cell and developmental biology. Our intent has been to limit theoretical considerations to a necessary minimum and to use common examples as illustrations for statistical analysis. Our chapter includes a description of basic terms and central concepts and also contains in-depth discussions on the analysis of means, proportions, ratios, probabilities, and correlations. We also address issues related to sample size, normality, outliers, and non-parametric approaches.

1. The basics

1.1. Introduction

At the first group meeting that I attended as a new worm postdoc (1997, D.S.F.), I heard the following opinion expressed by a senior scientist in the field: “If I need to rely on statistics to prove my point, then I'm not doing the right experiment.” In fact, reading this statement today, many of us might well identify with this point of view. Our field has historically gravitated toward experiments that provide clear-cut “yes” or “no” types of answers. Yes, mutant X has a phenotype. No, mutant Y does not genetically complement mutant Z . We are perhaps even a bit suspicious of other kinds of data, which we perceive as requiring excessive hand waving. However, the realities of biological complexity, the sometimes-necessary intrusion of sophisticated experimental design, and the need for quantifying results may preclude black-and-white conclusions. Oversimplified statements can also be misleading or at least overlook important and interesting subtleties. Finally, more and more of our experimental approaches rely on large multi-faceted datasets. These types of situations may not lend themselves to straightforward interpretations or facile models. Statistics may be required.

The intent of these sections will be to provide C. elegans researchers with a practical guide to the application of statistics using examples that are relevant to our field. Namely, which common situations require statistical approaches and what are some of the appropriate methods (i.e., tests or estimation procedures) to carry out? Our intent is therefore to aid worm researchers in applying statistics to their own work, including considerations that may inform experimental design. In addition, we hope to provide reviewers and critical readers of the worm scientific literature with some criteria by which to interpret and evaluate statistical analyses carried out by others. At various points we suggest some general guidelines, which may lead to somewhat more uniformity in how our field conducts and presents statistical findings. Finally, we provide some suggestions for additional readings for those interested in a more systematic and in-depth coverage of the topics introduced ( Appendix A ).

1.2. Quantifying variation in population or sample data

There are numerous ways to describe and present the variation that is inherent to most data sets. Range (defined as the largest value minus the smallest) is one common measure and has the advantage of being simple and intuitive. Range, however, can be misleading because of the presence of outliers , and it tends to be larger for larger sample sizes even without unusual data values . Standard deviation (SD) is the most common way to present variation in biological data. It has the advantage that nearly everyone is familiar with the term and that its units are identical to the units of the sample measurement. Its disadvantage is that few people can recall what it actually means.

Figure 1 depicts density curves of brood sizes in two different populations of self-fertilizing hermaphrodites. Both have identical average brood sizes of 300. However, the population in Figure 1B displays considerably more inherent variation than the population in Figure 1A . Looking at the density curves, we would predict that 10 randomly selected values from the population depicted in Figure 1B would tend to show a wider range than an equivalent set from the more tightly distributed population in Figure 1A . We might also note from the shape and symmetry of the density curves that both populations are Normally 1 distributed (this is also referred to as a Gaussian distribution ). In reality, most biological data do not conform to a perfect bell-shaped curve, and, in some cases, they may profoundly deviate from this ideal. Nevertheless, in many instances, the distribution of various types of data can be roughly approximated by a normal distribution. Furthermore, the normal distribution is a particularly useful concept in classical statistics (more on this later) and in this example is helpful for illustrative purposes.

Two normal distributions.

The vertical red lines in Figure 1A and 1B indicate one SD to either side of the mean. From this, we can see that the population in Figure 1A has a SD of 20, whereas the population in Figure 1B has a SD of 50. A useful rule of thumb is that roughly 67% of the values within a normally distributed population will reside within one SD to either side of the mean. Correspondingly, 95% of values reside within two 2 SDs, and more than 99% reside within three SDs to either side of the mean. Thus, for the population in Figure 1A , we can predict that about 95% of hermaphrodites produce brood sizes between 260 and 340, whereas for the population in Figure 1B , 95% of hermaphrodites produce brood sizes between 200 and 400.
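As a quick numerical check of the rule of thumb above, the following sketch (in Python with NumPy; our own illustration rather than anything from the original chapter) simulates a population like that in Figure 1A and tallies the fraction of values falling within one, two, and three SDs of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate brood sizes for a population like Figure 1A: mean 300, SD 20
broods = rng.normal(loc=300, scale=20, size=100_000)

mean, sd = broods.mean(), broods.std()
for k in (1, 2, 3):
    fraction = np.mean(np.abs(broods - mean) <= k * sd)
    print(f"within {k} SD of the mean: {fraction:.3f}")   # ~0.68, ~0.95, ~0.997

# ~95% of broods should fall between roughly 260 and 340
print(f"mean - 2 SD = {mean - 2*sd:.0f}, mean + 2 SD = {mean + 2*sd:.0f}")
```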

In most cases, we can never really know the true mean or SD of a population because we cannot usually observe the entire population. Instead, we must use a sample to make an educated guess. In the case of experimental laboratory science, there is often no limit to the number of animals that we could theoretically test or the number of experimental repeats that we could perform. Admittedly, use of the term “populations” in this context can sound rather forced. It's awkward for us to think of a theoretical collection of bands on a western blot or a series of cycle numbers from a qRT-PCR experiment as a population, but from the standpoint of statistics, that's exactly what they are. Thus, our populations tend to be mythical in nature as well as infinite. Moreover, even the most sadistic advisor can only expect a finite number of biological or technical repeats to be carried out. The data that we ultimately analyze are therefore always just a tiny proportion of the population, real or theoretical, from whence they came.

It is important to note that increasing our sample size will not predictably increase or decrease the amount of variation that we are ultimately likely to record. What can be stated is that a larger sample size will tend to give a sample SD that is a more accurate estimate of the population SD. In the same vein, a larger sample size will also provide a more accurate estimation of other parameters , such as the population mean.

In some cases, standard numerical summaries (e.g., mean and SD) may not be sufficient to fully or accurately describe the data. In particular, these measures usually 3 tell you nothing about the shape of the underlying distribution. Figure 2 illustrates this point; Panels A and B show the duration (in seconds) of vulval muscle cell contractions in two populations of C. elegans . The data from both panels have nearly identical means and SDs, but the data from panel A are clearly bimodal, whereas the data from Panel B conform more to a normal distribution 4 . One way to present this observation would be to show the actual histograms (in a figure or supplemental figure). Alternatively, a somewhat more concise depiction, which still gets the basic point across, is shown by the individual data plot in Panel C. In any case, presenting these data simply as a mean and SD without highlighting the difference in distributions would be potentially quite misleading, as the populations would appear to be identical.

Two distributions with similar means and SDs. Panels A and B show histograms of simulated data of vulval muscle cell contraction durations derived from underlying populations with distributions that are either bimodal (A) or normal (B).

1.3. Quantifying statistical uncertainty

Before you become distressed about what the title of this section actually means, let's be clear about something. Statistics, in its broadest sense, effectively does two things for us—more or less simultaneously. (1) Statistics provides us with useful quantitative descriptors for summarizing our data. This includes fairly simple stuff such as means and proportions. It also includes more complex statistics such as the correlation between related measurements, the slope of a linear regression, and the odds ratio for mortality under differing conditions. These can all be useful for interpreting our data, making informed conclusions, and constructing hypotheses for future studies. However, statistics gives us something else, too. (2) Statistics also informs us about the accuracy of the very estimates that we've made. What a deal! Not only can we obtain predictions for the population mean and other parameters, we also estimate how accurate those predictions really are. How this comes about is part of the “magic” of statistics, which as stated shouldn't be taken literally, even if it appears to be that way at times.

In the preceding section we discussed the importance of SD as a measure for describing natural variation within an entire population of worms. We also touched upon the idea that we can calculate statistics, such as SD, from a sample that is drawn from a larger population. Intuition also tells us that these two values, one corresponding to the population, the other to the sample, ought to generally be similar in magnitude, if the sample size is large. Finally, we understand that the larger the sample size, the closer our sample statistic will be to the true population statistic. This is true not only for the SD but also for many other statistics as well.

It is now time to discuss SD in another context that is central to the understanding of statistics. We do this with a thought experiment. Imagine that we determine the brood size for six animals that are randomly selected from a larger population. We could then use these data to calculate a sample mean, as well as a sample SD, which would be based on a sample size of n = 6. Not being satisfied with our efforts, we repeat this approach every day for 10 days, each day obtaining a new mean and new SD ( Table 1 ). At the end of 10 days, having obtained ten different means, we can now use each sample mean as though it were a single data point to calculate a new mean, which we can call the mean of the means . In addition, we can calculate the SD of these ten mean values, which we can refer to for now as the SD of the means . We can then pose the following question: will the SD calculated using the ten means generally turn out to be a larger or smaller value (on average) than the SD calculated from each sample of six random individuals? This is not merely an idiosyncratic question posed for intellectual curiosity. The notion of the SD of the mean is critical to statistical inference. Read on.

Table 1. Ten random samples (trials) of brood sizes.

Thinking about this, we may realize that the ten mean values, being averages of six worms, will tend to show less total variation than measurements from individual worms. In other words, the variation between means should be less than the variation between individual data values. Moreover, the average of these means will generally be closer to the true population mean than would a mean obtained from just six random individuals. In fact, this idea is borne out in Table 1 , which used random sampling from a theoretical population (with a mean of 300 and SD of 50) to generate the sample values. We can therefore conclude that sample means will generally exhibit less variation than individual values. Furthermore, we can consider what might happen if we were to take daily samples of 20 worms instead of 6. Namely, the larger sample size would result in an even tighter cluster of mean values. This in turn would produce an even smaller SD of the means than from the experiment where only six worms were analyzed each day. Thus, increasing sample size will consistently lead to a smaller SD of the means. Note however, as discussed above, increasing sample size will not predictably lead to a smaller or larger SD for any given sample.

It turns out that this concept of calculating the SD of multiple means (or other statistical parameters) is a very important one. The good news is that rather than having to actually collect samples for ten or more days, statistical theory gives us a short cut that allows us to estimate this value based on only a single day's effort. What a deal! Rather than calling this value the “SD of the means”, as might make sense, the field has historically chosen to call this value the “ standard error of the mean ” (SEM). In fact, whenever a SD is calculated for a statistic (e.g., the slope from a regression or a proportion), it is called the standard error (SE) of that statistic. SD is a term generally reserved for describing variation within a sample or population only. Although we will largely avoid the use of formulas in this review, it is worth knowing that we can estimate the SEM from a single sample of n animals using the following equation:

SEM = SD / √n

From this relatively simple formula 5 , we can see that the greater the SD of the sample, the greater the SEM will be. Conversely, the larger our sample size, the smaller the SEM will be. Looking back at Table 1 , we can also see that the SEM estimate for each daily sample is reasonably 6 close, on average, to what we obtained by calculating the observed SD of the means from 10 days. This is not an accident. Rather, chalk one up for statistical theory.
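The thought experiment above is easy to reproduce computationally. This sketch (our own illustration; the sample size of six, ten daily trials, and a population mean of 300 with SD of 50 follow the setup of Table 1) compares the observed SD of the daily means with the SEM estimated from a single day's sample.

```python
import numpy as np

rng = np.random.default_rng(42)
pop_mean, pop_sd, n, days = 300, 50, 6, 10

# Ten "daily" samples of six brood sizes each, drawn from the same population
samples = rng.normal(pop_mean, pop_sd, size=(days, n))
daily_means = samples.mean(axis=1)

sd_of_means = daily_means.std(ddof=1)                    # observed SD of the ten means
sem_per_day = samples.std(axis=1, ddof=1) / np.sqrt(n)   # SEM estimated from each day

print(f"SD of the ten daily means:          {sd_of_means:.1f}")
print(f"average single-day SEM estimate:    {sem_per_day.mean():.1f}")
print(f"theoretical value (pop SD/sqrt(n)): {pop_sd / np.sqrt(n):.1f}")  # ~20.4
```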

Obviously, having a smaller SEM value reflects more precise estimates of the population mean. For that reason, scientists are typically motivated to keep SEM values as low as possible. In the case of experimental biology, variation within our samples may be due to inherent biological variation or to technical issues related to the methods we use. The former we probably can't control very much. The latter we may be able to control to some extent, but probably not completely. This leaves increasing sample size as a direct route to decreasing SE estimates and thus to improving the precision of the parameter estimates. However, increasing sample size in some instances may not be a practical or efficient use of our time. Furthermore, because the denominator in SE equations typically involves the square root of sample size, increasing sample size will have diminishing returns. In general, a quadrupling of sample size is required to yield a halving of the SEM. Moreover, as discussed elsewhere in this chapter, supporting very small differences with very high sample sizes might lead us to make convincing-sounding statistical statements regarding biological effects of no real importance, which is not something we should aspire to do.

1.4. Confidence intervals

Although SDs and SEs are all well and good, what we typically want to know is the accuracy of our parameter estimates. It turns out that SEs are the key to calculating a more directly useful measure, a confidence interval (CI). Although the transformation of SEs into CIs isn't necessarily that complex, we will generally want to let computers or calculators perform this conversion for us. That said, for sample means derived from sample sizes greater than about ten, a 95% CI will usually span about two SEMs above and below the mean 7 . When pressed for a definition, most people might say that with a 95% CI, we are 95% certain that the true value of the mean or slope (or whatever parameter we are estimating) is between the upper and lower limits of the given CI. Proper statistical semantics would more accurately state that a 95% CI procedure is such that 95% of properly calculated intervals from appropriately random samples will contain the true value of the parameter. If you can discern the difference, fine. If not, don't worry about it.

One thing to keep in mind about CIs is that, for a given sample, a higher confidence level (e.g., 99%) will invoke intervals that are wider than those created with a lower confidence level (e.g., 90%). Think about it this way. With a given amount of information (i.e., data), if you wish to be more confident that your interval contains the parameter, then you need to make the interval wider. Thus, less confidence corresponds to a narrower interval, whereas higher confidence requires a wider interval. Generally for CIs to be useful, the range shouldn't be too great. Another thing to realize is that there is really only one way to narrow the range associated with a given confidence level, and that is to increase the sample size. As discussed above, however, diminishing returns, as well as basic questions related to biological importance of the data, should figure foremost in any decision regarding sample size.
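For those who prefer to see the calculation, here is a minimal sketch using SciPy's t distribution to compute CIs at several confidence levels; the brood-size numbers are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical brood sizes from a single sample of ten worms
data = np.array([285, 310, 302, 291, 322, 298, 305, 288, 316, 301])

n, mean, sem = len(data), data.mean(), stats.sem(data)   # sem = sample SD / sqrt(n)

for level in (0.90, 0.95, 0.99):
    low, high = stats.t.interval(level, df=n - 1, loc=mean, scale=sem)
    print(f"{int(level * 100)}% CI: {low:.1f} to {high:.1f}")   # wider as confidence increases
```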

1.5. What is the best way to report variation in data?

Of course, the answer to this will depend on what you are trying to show and which measures of variation are most relevant to your experiment. Nevertheless, here is an important news flash: with respect to means, the SEM is often not the most informative parameter to display. This should be pretty obvious by now. SD is a good way to go if we are trying to show how much variation there is within a population or sample. CIs are highly informative if we are trying to make a statement regarding the accuracy of the estimated population mean. In contrast, SEM does neither of these things directly, yet remains very popular and is often used inappropriately ( Nagele, 2003 ). Some statisticians have pointed out that because SEM gives the smallest of the error bars, authors may often choose SEMs for aesthetic reasons. Namely, to make their data appear less variable or to convince readers of a difference between values that might not otherwise appear to be very different. In fairness, SEM is a perfectly legitimate descriptor of variation 8 . In contrast to CIs, the size of the SEM is not an artifact of the chosen confidence level. Furthermore, unlike the CI, the validity of the SEM does not require assumptions that relate to statistical normality 9 . However, because the SEM is often less directly informative to readers, presenting either SDs or CIs can be strongly recommended for most data. Furthermore, if the intent of a figure is to compare means for differences or a lack thereof, CIs are the clear choice.

1.6. A quick guide to interpreting different indicators of variation

Figure 3 shows a bar graph containing identical (artificial) data plotted with the SD, SEM, and CI to indicate variation. Note that the SD is the largest value, followed by the CI and SEM. This will be the case for all but very small sample sizes (in which case the CI could be wider than two SDs). Remember: SD is variation among individuals, SE is the variation for a theoretical collection of sample means (acquired in an identical manner to the real sample), and CI is a rescaling of the SE so as to be able to impute confidence regarding the value of the population mean. With larger sample sizes, the SE and CI will shrink, but there is no such tendency for the SD, which tends to remain the same but can also increase or decrease somewhat in a manner that is not predictable.

Illustration of SD, SE, and CI as measures of variability.

Figure 4 shows two different situations for two artificial means: one in which bars do not overlap ( Figure 4A ), and one in which they do, albeit slightly ( Figure 4B ). The following are some general guidelines for interpreting error bars, which are summarized in Table 2 . (1) With respect to SD, neither overlapping bars nor an absence of overlapping bars can be used to infer whether or not two sample means are significantly different. This again is because SD reflects individual variation and you simply cannot infer anything about significance of differences for the means. End of story. (2) With respect to SEM, overlapping bars ( Figure 4B ) can be used to infer that the sample means are not significantly different . However, the absence of overlapping bars ( Figure 4A ) cannot be used to infer that the sample means are different. (3) With respect to CIs, the absence of overlapping bars ( Figure 4A ) can be used to infer that the sample means are statistically different 6 . If the CI bars do overlap ( Figure 4B ), however, the answer is “maybe”. Here is why. The correct measure for comparing two means is in fact the SE of the difference between the means . In the case of equal SEMs, as illustrated in Figure 4 , the SE of the difference is ∼1.4 times the SEM. To be significantly different, 10 then, two means need to be separated by about twice the SE of the difference (2.8 SEMs). In contrast, visual separation using the CI bars requires a difference of four times the SEM (remember that CI ∼ 2 × SEM above and below the mean), which is larger than necessary to infer a difference between means. Therefore, a slight overlap can be present even when two means differ significantly. If they overlap a lot (e.g., the CI for Mean 1 includes Mean 2), then the two means are for sure not significantly different. If there is any uncertainty (i.e., there is some slight overlap), determination of significance is not possible; the test needs to be formally carried out.
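The “1.4 times the SEM” and “2.8 SEMs” figures quoted above follow from the way SEs combine (the SE of a difference between independent means is the square root of the sum of the squared SEMs). A small sketch with made-up, equal SEMs shows the arithmetic:

```python
import numpy as np

sem1 = sem2 = 2.0                          # two groups with equal SEMs (arbitrary units)
se_diff = np.sqrt(sem1**2 + sem2**2)       # SE of the difference between the two means

print(f"SE of difference / SEM        = {se_diff / sem1:.2f}")     # ~1.41
print(f"2 x SE of difference, in SEMs = {2 * se_diff / sem1:.2f}")  # ~2.83 SEMs needed for significance
```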

Comparing means using visual measures of precision.

Table 2. General guidelines for interpreting error bars.

1.7. The coefficient of variation

CV = SD / mean (usually expressed as a percentage, 100 × SD / mean). The coefficient of variation expresses the SD relative to the mean, which makes it useful for comparing the variability of measurements made on different scales or in different units.
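A minimal sketch of the calculation (the numbers are invented):

```python
import numpy as np

data = np.array([285, 310, 302, 291, 322, 298])   # hypothetical brood sizes
cv = 100 * data.std(ddof=1) / data.mean()         # coefficient of variation, in percent
print(f"CV = {cv:.1f}%")
```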

1.8. P -values

Most statistical tests culminate in a statement regarding the P -value, without which reviewers or readers may feel shortchanged. The P -value is commonly defined as the probability of obtaining a result (more formally a test statistic ) that is at least as extreme as the one observed, assuming that the null hypothesis is true. Here, the specific null hypothesis will depend on the nature of the experiment. In general, the null hypothesis is the statistical equivalent of the “innocent until proven guilty” convention of the judicial system. For example, we may be testing a mutant that we suspect changes the ratio of male-to-hermaphrodite cross-progeny following mating. In this case, the null hypothesis is that the mutant does not differ from wild type, where the sex ratio is established to be 1:1. More directly, the null hypothesis is that the sex ratio in mutants is 1:1. Furthermore, the complement of the null hypothesis, known as the experimental or alternative hypothesis , would be that the sex ratio in mutants is different than that in wild type or is something other than 1:1. For this experiment, showing that the ratio in mutants is significantly different than 1:1 would constitute a finding of interest. Here, use of the term “significantly” is short-hand for a particular technical meaning, namely that the result is statistically significant , which in turn implies only that the observed difference appears to be real and is not due only to random chance in the sample(s). Whether or not a result that is statistically significant is also biologically significant is another question . Moreover, the term significant is not an ideal one, but because of long-standing convention, we are stuck with it. Statistically plausible or statistically supported may in fact be better terms.

Getting back to P -values, let's imagine that in an experiment with mutants, 40% of cross-progeny are observed to be males, whereas 60% are hermaphrodites. A statistical significance test then informs us that for this experiment, P = 0.25. We interpret this to mean that even if there was no actual difference between the mutant and wild type with respect to their sex ratios, we would still expect to see deviations as great, or greater than, a 6:4 ratio in 25% of our experiments. Put another way, if we were to replicate this experiment 100 times, random chance would lead to ratios at least as extreme as 6:4 in 25 of those experiments. Of course, you may well wonder how it is possible to extrapolate from one experiment to make conclusions about what (approximately) the next 99 experiments will look like. (Short answer: There is well-established statistical theory behind this extrapolation that is similar in nature to our discussion on the SEM.) In any case, a large P -value, such as 0.25, is a red flag and leaves us unconvinced of a difference. It is, however, possible that a true difference exists but that our experiment failed to detect it (because of a small sample size, for instance). In contrast, suppose we found a sex ratio of 6:4, but with a corresponding P -value of 0.001 (this experiment likely had a much larger sample size than did the first). In this case, the likelihood that pure chance has conspired to produce a deviation from the 1:1 ratio as great or greater than 6:4 is very small, 1 in 1,000 to be exact. Because this is very unlikely, we would conclude that the null hypothesis is not supported and that mutants really do differ in their sex ratio from wild type. Such a finding would therefore be described as statistically significant on the basis of the associated low P -value.
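The sex-ratio example can be run as a binomial test against the null ratio of 1:1. The counts below are hypothetical (the chapter does not state the sample sizes behind its P-values), but the sketch shows how the same 6:4 ratio yields very different P-values at different sample sizes.

```python
from scipy.stats import binomtest

# Hypothetical counts of males among cross-progeny; null hypothesis: P(male) = 0.5
small = binomtest(k=12, n=30, p=0.5, alternative='two-sided')    # 40% males, small sample
large = binomtest(k=120, n=300, p=0.5, alternative='two-sided')  # 40% males, large sample

print(f"n = 30:  P = {small.pvalue:.3f}")    # not convincing on its own
print(f"n = 300: P = {large.pvalue:.6f}")    # the same ratio, now highly significant
```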

1.9. Why 0.05?

There is a long-standing convention in biology that P -values that are ≤0.05 are considered to be significant, whereas P -values that are >0.05 are not significant 11 . Of course, common sense would dictate that there is no rational reason for anointing any specific number as a universal cutoff, below or above which results must either be celebrated or condemned. Can anyone imagine a convincing argument by someone stating that they will believe a finding if the P -value is 0.04 but not if it is 0.06? Even a P -value of 0.10 suggests a finding for which there is some chance that it is real.

So why impose “cutoffs”, which are often referred to as the chosen α level , of any kind? Well, for one thing, it makes life simpler for reviewers and readers who may not want to agonize over personal judgments regarding every P -value in every experiment. It could also be argued that, much like speed limits, there needs to be an agreed-upon cutoff. Even if driving at 76 mph isn't much more dangerous than driving at 75 mph, one does have to consider public safety. In the case of science, the apparent danger is that too many false-positive findings may enter the literature and become dogma. Noting that the imposition of a reasonable, if arbitrary, cutoff is likely to do little to prevent the publication of dubious findings is probably irrelevant at this point.

The key is not to change the chosen cutoff—we have no better suggestion 12 than 0.05. The key is for readers to understand that there is nothing special about 0.05 and, most importantly, to look beyond P -values to determine whether or not the experiments are well controlled and the results are of biological interest. It is also often more informative to include actual P -values rather than simply stating P ≤ 0.05; a result where P = 0.049 is roughly three times more likely to have occurred by chance than when P = 0.016, yet both are typically reported as P ≤ 0.05. Moreover, reporting the results of statistical tests as P ≤ 0.05 (or any number) is a holdover to the days when computing exact P -values was much more difficult. Finally, if a finding is of interest and the experiment is technically sound, reviewers need not skewer a result or insist on authors discarding the data just because P ≤ 0.07. Judgment and common sense should always take precedent over an arbitrary number.

2. Comparing two means

2.1. Introduction

Many studies in our field boil down to generating means and comparing them to each other. In doing so, we try to answer questions such as, “Is the average brood size of mutant x different from that of wild type?” or “Is the expression of gene y in embryos greater at 25 °C than at 20 °C?” Of course we will never really obtain identical values from any two experiments. This is true even if the data are acquired from a single population; the sample means will always be different from each other, even if only slightly. The pertinent question that statistics can address is whether or not the differences we inevitably observe reflect a real difference in the populations from which the samples were acquired. Put another way, are the differences detected by our experiments, which are necessarily based on a limited sample size, likely or not to result from chance effects of sampling (i.e., chance sampling )? If chance sampling can account for the observed differences, then our results will not be deemed statistically significant 13 . In contrast, if the observed differences are unlikely to have occurred by chance, then our results may be considered significant insofar as statistics are concerned. Whether or not such differences are biologically significant is a separate question reserved for the judgment of biologists.

Most biologists, even those leery of statistics, are generally aware that the venerable t- test (a.k.a., Student's t -test) 14 is the standard method used to address questions related to differences between two sample means. Several factors influence the power of the t- test to detect significant differences. These include the size of the sample and the amount of variation present within the sample. If these sound familiar, they should. They were both factors that influence the size of the SEM, discussed in the preceding section. This is not a coincidence, as the heart of a t- test resides in estimating the standard error of the difference between two means (SEDM). Greater variance in the sample data increases the size of the SEDM, whereas higher sample sizes reduce it. Thus, lower variance and larger samples make it easier to detect differences. If the size of the SEDM is small relative to the absolute difference in means, then the finding will likely hold up as being statistically significant .

In fact, it is not necessary to deal directly with the SEDM to be perfectly proficient at interpreting results from a t -test. We will therefore focus primarily on aspects of the t -test that are most relevant to experimentalists. These include choices of carrying out tests that are either one- or two-tailed and are either paired or unpaired, assumptions of equal variance or not, and issues related to sample sizes and normality. We would also note, in passing, that alternatives to the t -test do exist. These tests, which include the computationally intensive bootstrap (see Section 6.7 ), typically require somewhat fewer assumptions than the t -test and will generally yield similar or superior results. For reasonably large sample sizes, a t -test will provide virtually the same answer and is currently more straightforward to carry out using available software and websites. It is also the method most familiar to reviewers, who may be skeptical of approaches that are less commonly used.
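For readers who want to see the mechanics, here is a minimal two-sample t-test in Python; the fluorescence values are simulated stand-ins, not the chapter's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated fluorescence measurements for a control and a test group (n = 55 each)
control = rng.normal(loc=100, scale=15, size=55)
test = rng.normal(loc=90, scale=15, size=55)

t_stat, p_value = stats.ttest_ind(control, test)   # two-sided, two-sample t-test
print(f"t = {t_stat:.2f}, P = {p_value:.4g}")
```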

2.2. Understanding the t-test: a brief foray into some statistical theory

To aid in understanding the logic behind the t -test, as well as the basic requirements for the t -test to be valid, we need to introduce a few more statistical concepts. We will do this through an example. Imagine that we are interested in knowing whether or not the expression of gene a is altered in comma-stage embryos when gene b has been inactivated by a mutation. To look for an effect, we take total fluorescence intensity measurements 15 of an integrated a ::GFP reporter in comma-stage embryos in both wild-type (Control, Figure 5A ) and b mutant (Test, Figure 5B ) strains. For each condition, we analyze 55 embryos. Expression of gene a appears to be greater in the control setting; the difference between the two sample means is 11.3 billion fluorescence units (henceforth simply referred to as “11.3 units”).

Summary of GFP-reporter expression data for a control and a test group.

Along with the familiar mean and SD, Figure 5 shows some additional information about the two data sets. Recall that in Section 1.2 , we described what a data set looks like that is normally distributed ( Figure 1 ). What we didn't mention is that distribution of the data 16 can have a strong impact, at least indirectly, on whether or not a given statistical test will be valid. Such is the case for the t -test. Looking at Figure 5 , we can see that the datasets are in fact a bit lopsided, having somewhat longer tails on the right. In technical terms, these distributions would be categorized as skewed right. Furthermore, because our sample size was sufficiently large (n=55), we can conclude that the populations from whence the data came are also skewed right. Although not critical to our present discussion, several parameters are typically used to quantify the shape of the data including the extent to which the data deviate from normality (e.g., skewness 17 , kurtosis 18 and A-squared 19 ). In any case, an obvious question now becomes, how can you know whether your data are distributed normally (or at least normally enough), to run a t -test?

Before addressing this question, we must first grapple with a bit of statistical theory. The Gaussian curve shown in Figure 6A represents a theoretical distribution of differences between sample means for our experiment. Put another way, this is the distribution of differences that we would expect to obtain if we were to repeat our experiment an infinite number of times. Remember that for any given population, when we randomly “choose” a sample, each repetition will generate a slightly different sample mean. Thus, if we carried out such sampling repetitions with our two populations ad infinitum, the bell-shaped distribution of differences between the two means would be generated ( Figure 6A ). Note that this theoretical distribution of differences is based on our actual sample means and SDs, as well as on the assumption that our original data sets were derived from populations that are normal, which is something we already know isn't true. So what to do?

Theoretical and simulated sampling distribution of differences between two means. The distributions are from the gene expression example. The mean and SE (SEDM) of the theoretical (A) and simulated (B) distributions are both approximately 11.3 and 1.4.

As it happens, this lack of normality in the distribution of the populations from which we derive our samples does not often pose a problem. The reason is that the distribution of sample means, as well as the distribution of differences between two independent sample means (along with many 20 other conventionally used statistics), is often normal enough for the statistics to still be valid. This is a consequence of the Central Limit Theorem , a “statistical law of gravity” that states (in its simplest form 21 ) that the distribution of a sample mean will be approximately normal provided the sample size is sufficiently large. How large is large enough? That depends on the distribution of the data values in the population from which the sample came. The more non-normal it is (usually, that means the more skewed), the larger the sample size requirement. Assessing this is a matter of judgment 22 . Figure 7 was derived using a computational sampling approach to illustrate the effect of sample size on the distribution of the sample mean. In this case, the sample was derived from a population that is sharply skewed right, a common feature of many biological systems where negative values are not encountered ( Figure 7A ). As can be seen, with a sample size of only 15 ( Figure 7B ), the distribution of the mean is still skewed right, although much less so than the original population. By the time we have sample sizes of 30 or 60 ( Figure 7C, D ), however, the distribution of the mean is indeed very close to being symmetrical (i.e., normal).

Illustration of the Central Limit Theorem for binomial proportions. Panels A–D show results from a computational sampling experiment where the proportion of successes in the population is 0.25.

Illustration of the Central Limit Theorem for a skewed population of values. Panel A shows the population (highly skewed right and truncated at zero); Panels B, C, and D show distributions of the mean for sample sizes of 15, 30, and 60, respectively.
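The effect illustrated in Figure 7 is straightforward to reproduce. In the sketch below, the choice of an exponential distribution as the sharply right-skewed population is our own; the point is simply that the skewness of the distribution of sample means shrinks as the sample size grows.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)

# A sharply right-skewed population, truncated at zero (exponential, for illustration)
population = rng.exponential(scale=10, size=1_000_000)
print(f"skewness of the population: {skew(population):.2f}")

for n in (15, 30, 60):
    # Distribution of the sample mean for 10,000 samples of size n
    means = rng.exponential(scale=10, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:2d}: skewness of the sample means = {skew(means):.2f}")
```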

The Central Limit Theorem having come to our rescue, we can now set aside the caveat that the populations shown in Figure 5 are non-normal and proceed with our analysis. From Figure 6 we can see that the center of the theoretical distribution (black line) is 11.29, which is the actual difference we observed in our experiment. Furthermore, we can see that on either side of this center point, there is a decreasing likelihood that substantially higher or lower values will be observed. The vertical blue lines show the positions of one and two SDs from the apex of the curve, which in this case could also be referred to as SEDMs. As with other SDs, roughly 95% of the area under the curve is contained within two SDs. This means that in 95 out of 100 experiments, we would expect to obtain differences of means that were between “8.5” and “14.0” fluorescence units. In fact, this statement amounts to a 95% CI for the difference between the means, which is a useful measure and amenable to straightforward interpretation. Moreover, because the 95% CI of the difference in means does not include zero, this implies that the P -value for the difference must be less than 0.05 (i.e., that the null hypothesis of no difference in means is not true). Conversely, had the 95% CI included zero, then we would already know that the P -value will not support conclusions of a difference based on the conventional cutoff (assuming application of the two-tailed t -test; see below).

The key is to understand that the t -test is based on the theoretical distribution shown in Figure 6A , as are many other statistical parameters including 95% CIs of the mean. Thus, for the t -test to be valid, the shape of the actual differences in sample means must come reasonably close to approximating a normal curve. But how can we know what this distribution would look like without repeating our experiment hundreds or thousands of times? To address this question, we have generated a complementary distribution shown in Figure 6B . In contrast to Figure 6A , Figure 6B was generated using a computational re-sampling method known as bootstrapping (discussed in Section 6.7 ). It shows a histogram of the differences in means obtained by carrying out 1,000 in silico repeats of our experiment. Importantly, because this histogram was generated using our actual sample data, it automatically takes skewing effects into account. Notice that the data from this histogram closely approximate a normal curve and that the values obtained for the mean and SDs are virtually identical to those obtained using the theoretical distribution in Figure 6A . What this tells us is that even though the sample data were indeed somewhat skewed, a t -test will still give a legitimate result. Moreover, from this exercise we can see that with a sufficient sample size, the t -test is quite robust to some degree of non-normality in the underlying population distributions. Issues related to normality are also discussed further below.
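A bootstrap of the difference in means, analogous in spirit to the procedure behind Figure 6B, can be sketched as follows; the two data sets are simulated placeholders for the GFP measurements.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated, right-skewed stand-ins for the control and test data (n = 55 each)
control = rng.lognormal(mean=4.7, sigma=0.3, size=55)
test = rng.lognormal(mean=4.6, sigma=0.3, size=55)

n_boot = 1_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement and record the difference in means
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(test, size=test.size, replace=True)
    diffs[i] = c.mean() - t.mean()

print(f"bootstrap mean difference:      {diffs.mean():.2f}")
print(f"bootstrap SE of the difference: {diffs.std(ddof=1):.2f}")
print(f"95% bootstrap CI: {np.percentile(diffs, [2.5, 97.5]).round(2)}")
```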

2.3. One- versus two-sample tests

Although t -tests always evaluate differences between two means, in some cases only one of the two mean values may be derived from an experimental sample. For example, we may wish to compare the number of vulval cell fates induced in wild-type hermaphrodites versus mutant m . Because it is broadly accepted that wild type induces (on average) three progenitor vulval cells, we could theoretically dispense with re-measuring this established value and instead measure it only in the mutant m background ( Sulston and Horvitz, 1977 ). In such cases, we would be obliged to run a one-sample t -test to determine if the mean value of mutant m is different from that of wild type.

There is, however, a problem in using the one-sample approach, which is not statistical but experimental. Namely, there is always the possibility that something about the growth conditions, experimental execution, or alignment of the planets, could result in a value for wild type that is different from that of the established norm. If so, these effects would likely conspire to produce a value for mutant m that is different from the traditional wild-type value, even if no real difference exists. This could then lead to a false conclusion of a difference between wild type and mutant m . In other words, the statistical test, though valid, would be carried out using flawed data. For this reason, one doesn't often see one-sample t -tests in the worm literature. Rather, researchers tend to carry out parallel experiments on both populations to avoid being misled. Typically, this is only a minor inconvenience and provides much greater assurance that any conclusions will be legitimate. Along these lines, historical controls, including those carried out by the same lab but at different times, should typically be avoided.

2.4. One versus two tails

One aspect of the t -test that tends to agitate users is the obligation to choose either the one or two-tailed versions of the test. That the term “tails” is not particularly informative only exacerbates the matter. The key difference between the one- and two-tailed versions comes down to the formal statistical question being posed. Namely, the difference lies in the wording of the research question. To illustrate this point, we will start by applying a two-tailed t -test to our example of embryonic GFP expression. In this situation, our typical goal as scientists would be to detect a difference between the two means. This aspiration can be more formally stated in the form of a research or alternative hypothesis . Namely, that the average expression levels of a ::GFP in wild type and in mutant b are different. The null hypothesis must convey the opposite sentiment. For the two-tailed t -test, the null hypothesis is simply that the expression of a ::GFP in wild type and mutant b backgrounds is the same. Alternatively, one could state that the difference in expression levels between wild type and mutant b is zero.

It turns out that our example, while real and useful for illustrating the idea that the sampling distribution of the mean can be approximately normal even when the distribution of the data is not (and indeed should be if a t -test is to be carried out), is not so useful for illustrating P -value concepts. Hence, we will continue this discussion with a contrived variation: suppose the SEDM was 5.0, reflecting a very large amount of variation in the gene expression data. This would lead to the distribution shown in Figure 8A , which is analogous to the one from Figure 6A . You can see how the increase in the SEDM affects the values that are contained in the resulting 95% CI. The mean is still 11.3, but now there is some probability (albeit a small one) of obtaining a difference of zero, our null hypothesis. Figure 8B shows the same curve and SEDMs. This time, however, we have shifted the values of the x axis to consider the condition under which the null hypothesis is true. Thus the postulated difference in this scenario is zero (at the peak of the curve).

Graphical representation of a two-tailed t -test. (A) The same theoretical sampling distribution shown in Figure 6A in which the SEDM has been changed to 5.0 units (instead of 1.4). The numerical values on the x -axis represent differences from the mean.

Now recall that the P -value answers the following question: If the null hypothesis is true, what is the probability that chance sampling could have resulted in a difference in sample means at least as extreme as the one obtained? In our experiment, the difference in sample means was 11.3, where a ::GFP showed lower expression in the mutant b background. However, to derive the P -value for the two-tailed t -test, we would need to include the following two possibilities, represented here in equation form:

GFP wt − GFP mut b ≥ 11.3
GFP wt − GFP mut b ≤ −11.3

Most notably, with a two-tailed t -test we impose no bias as to the direction of the difference when carrying out the statistical analysis. Looking at Figure 8B , we can begin to see how the P -value is calculated. Depicted is a normal curve , with the observed difference represented by the vertical blue line located at 11.3 units (∼2.3 SEs). In addition, a dashed vertical blue line at −11.3 is also included. Red lines are located at about 2 SEs to either side of the apex. Based on our understanding of the normal curve, we know that about 95% of the total area under the curve resides between the two red lines, leaving the remaining 5% to be split between the two areas outside of the red lines in the tail regions. Furthermore, the proportion of the area under the curve that is to the outside of each individual blue line is 1.3%, for a total of 2.6%. This directly corresponds to the calculated two-tailed P -value of 0.026. Thus, the probability of having observed an effect this large by mere chance is only 2.6%, and we can conclude that the observed difference of 11.3 is statistically significant.
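The 2.6% area calculation can be checked numerically. The sketch below assumes a t distribution with 108 degrees of freedom (our assumption, from two samples of 55); using a normal curve instead gives nearly the same answer.

```python
from scipy import stats

diff = 11.3    # observed difference in means (fluorescence units)
sedm = 5.0     # standard error of the difference (the contrived value from the text)

t_stat = diff / sedm            # ~2.26 SEs away from a difference of zero
df = 55 + 55 - 2                # assumed degrees of freedom for two samples of 55

p_two_tailed = 2 * stats.t.sf(t_stat, df)   # area in both tails beyond +/-11.3
p_one_tailed = stats.t.sf(t_stat, df)       # right tail only (see Section 2.4)

print(f"two-tailed P ~ {p_two_tailed:.3f}")   # ~0.026
print(f"one-tailed P ~ {p_one_tailed:.3f}")   # ~0.013
```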

Once you understand the idea behind the two-tailed t -test, the one-tailed type is fairly straightforward. For the one-tailed t -test, however, there will always be two distinct versions, each with a different research hypothesis and corresponding null hypothesis. For example, if there is sufficient a priori 23 reason to believe that GFP wt will be greater than GFP mut b , then the research hypothesis could be written as

GFP wt > GFP mut b

This means that the null hypothesis would be written as

GFP wt ≤ GFP mut b

Most importantly, the P -value for this test will answer the question: If the null hypothesis is true, what is the probability that the following result could have occurred by chance sampling?

GFP wt − GFP mut b ≥ 11.3

Looking at Figure 8B , we can see that the answer is just the proportion of the area under the curve that lies to the right of positive 11.3 (solid vertical blue line). Because the graph is perfectly symmetrical, the P -value for this right-tailed test will be exactly half the value that we determined for the two-tailed test, or 0.013. Thus in cases where the direction of the difference coincides with a directional research hypothesis, the P -value of the one-tailed test will always be half that of the two-tailed test. This is a useful piece of information. Anytime you see a P -value from a one-tailed t -test and want to know what the two-tailed value would be, simply multiply by two.

Alternatively, had there been sufficient reason to posit a priori that GFP mut b will be greater than GFP wt , then the research hypothesis could be written as

GFP mut b > GFP wt

Of course, our experimental result that GFP wt was greater than GFP mut b clearly fails to support this research hypothesis. In such cases, there would be no reason to proceed further with a t -test, as the P -value in such situations is guaranteed to be >0.5. Nevertheless, for the sake of completeness, we can write out the null hypothesis as

GFP mut b ≤ GFP wt

And the P -value will answer the question: If the null hypothesis is true, what is the probability that the following result could have occurred by chance sampling?

GFP mut b − GFP wt ≥ −11.3

Or, written slightly differently to keep things consistent with Figure 8B ,

GFP wt − GFP mut b ≤ 11.3

This one-tailed test yields a P -value of 0.987, meaning that the observed lower mean of a ::GFP in mut b embryos is entirely consistent with a null hypothesis of GFP mut b ≤ GFP wt .

Interestingly, there is considerable debate, even among statisticians, regarding the appropriate use of one- versus two-tailed t -tests. Some argue that because in reality no two population means are ever identical, all tests should be one-tailed, as one mean must in fact be larger (or smaller) than the other ( Jones and Tukey, 2000 ). Put another way, the null hypothesis of a two-tailed test is always a false premise. Others encourage standard use of the two-tailed test largely on the basis of its being more conservative. Namely, the P -value will always be higher, and therefore fewer false-positive results will be reported. In addition, two-tailed tests impose no preconceived bias as to the direction of the change, which in some cases could be arbitrary or based on a misconception. A universally held rule is that one should never choose a one-tailed t -test post hoc 24 after determining which direction is suggested by your data . In other words, if you are hoping to see a difference and your two-tailed P -value is 0.06, don't then decide that you really intended to do a one-tailed test to reduce the P -value to 0.03. Alternatively, if you were hoping for no significant difference, choosing the one-tailed test that happens to give you the highest P -value is an equally unacceptable practice.

Generally speaking, one-tailed tests are often reserved for situations where a clear directional outcome is anticipated or where changes in only one direction are relevant to the goals of the study. Examples of the latter are perhaps more often encountered in industry settings, such as testing a drug for the alleviation of symptoms. In this case, there is no reason to be interested in proving that a drug worsens symptoms, only that it improves them. In such situations, a one-tailed test may be suitable. Another example would be tracing the population of an endangered species over time, where the anticipated direction is clear and where the cost of being too conservative in the interpretation of data could lead to extinction. Notably, for the field of C. elegans experimental biology, these circumstances rarely, if ever, arise. In part for this reason, two-tailed tests are more common and further serve to dispel any suggestion that one has manipulated the test to obtain a desired outcome.

2.5. Equal or non-equal variances

It is common to read in textbooks that one of the underlying assumptions of the t -test is that both samples should be derived from populations of equal variance. Obviously this will often not be the case. Furthermore, when using the t -test, we are typically not asking whether or not the samples were derived from identical populations, as we already know they are not. Rather, we want to know if the two independent populations from which they were derived have different means. In fact, the original version of the t -test, which does not formally take into account unequal sample variances, is nevertheless quite robust for small or even moderate differences in variance. Nevertheless, it is now standard to use a modified version of the t -test that directly adjusts for unequal variances. In most statistical programs, this may simply require checking or unchecking a box. The end result is that for samples that do have similar variances, effectively no differences in P -values will be observed between the two methods. For samples that do differ considerably in their variances, P -values will be higher using the version that takes unequal variances into account. This method therefore provides a slightly more conservative and accurate estimate for P -values and can generally be recommended.
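In SciPy, for example, this adjustment corresponds to the equal_var argument of ttest_ind (the data below are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(loc=10, scale=1, size=30)    # smaller variance
b = rng.normal(loc=11, scale=4, size=30)    # considerably larger variance

print(stats.ttest_ind(a, b))                    # classic t-test: assumes equal variances
print(stats.ttest_ind(a, b, equal_var=False))   # Welch's t-test: adjusts for unequal variances
```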

Also, just to reinforce a point raised earlier, greater variance in the sample data will lead to higher P -values because of the effect of sample variance on the SEDM. This will make it more difficult to detect differences between sample means using the t -test. Even without any technical explanation, this makes intuitive sense given that greater scatter in the data will create a level of background noise that could obscure potential differences. This is particularly true if the differences in means are small relative to the amount of scatter. This can be compensated for to some extent by increasing the sample size. This, however, may not be practical in some cases, and there can be downsides associated with accumulating data solely for the purpose of obtaining low P -values (see Section 6.3 ).

2.6. Are the data normal enough?

Technically speaking, we know that for t -tests to be valid, the distribution of the differences in means must be close to normal (discussed above). For this to be the case, the populations from which the samples are derived must also be sufficiently close to normal. Of course we seldom know the true nature of populations and can only infer this from our sample data. Thus in practical terms the question often boils down to whether or not the sample data suggest that the underlying population is normal or normal enough . The good news is that in cases where the sample size is not too small, the distribution of the sample will reasonably reflect the population from which it was derived (as mentioned above). The bad news is that with small sample sizes (say below 20), we may not be able to tell much about the population distribution. This creates a considerable conundrum when dealing with small samples from unknown populations. For example, for certain types of populations, such as a theoretical collection of bands on a western blot, we may have no way of knowing if the underlying population is normal or skewed and probably can't collect sufficient data to make an informed judgment. In these situations, you would admittedly use a t -test at your own risk 25 .

Textbooks will tell you that using highly skewed data for t -tests can lead to unreliable P -values. Furthermore, the reliability of certain other statistics, such as CIs, can also be affected by the distribution of data. In the case of the t -test, we know that the ultimate issue isn't whether the data or populations are skewed but whether the theoretical population of differences between the two means is skewed. In the examples shown earlier ( Figures 6 and 8 ), the shapes of the distributions were normal, and thus the t -tests were valid, even though our original data were skewed ( Figure 5 ). A basic rule of thumb is that if the data are normal or only slightly skewed, then the test statistic will be normal and the t -test results will be valid, even for small sample sizes. Conversely, if one or both samples (or populations) are strongly skewed, this can result in a skewed test statistic and thus invalid statistical conclusions.
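As a practical check on skewness, the sample's skewness statistic and a formal normality test such as Shapiro-Wilk can be computed directly; the data below are simulated for illustration, and (as noted above) such checks are only informative when the sample is not too small.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
sample = rng.lognormal(mean=0, sigma=0.8, size=40)    # a right-skewed sample

print(f"sample skewness: {stats.skew(sample):.2f}")   # values well above 0 indicate right skew
w, p = stats.shapiro(sample)                          # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w:.3f}, P = {p:.4f}")      # a small P suggests non-normality
```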

Interestingly, although increasing the sample size will not change the underlying distribution of the population, it can often go a long way toward correcting for skewness in the test statistic 26 . Thus, the t -test often becomes valid, even with fairly skewed data, if the sample size is large enough. In fact, using data from Figure 5 , we did a simulation study and determined that the sampling distribution for the difference in means is indeed approximately normal with a sample size of 30 (data not shown). In that case, the histogram of that sampling distribution looked very much like that in Figure 6B , with the exception that the SD 27 of the distribution was ∼1.9 rather than 1.4. In contrast, carrying out a simulation with a sample size of only 15 did not yield a normal distribution of the test statistic and thus the t -test would not have been valid.

Unfortunately, there is no simple rule for choosing a minimum sample size to protect against skewed data, although some textbooks recommend 30. Even a sample size of 30, however, may not be sufficient to correct for skewness or kurtosis in the test statistic if the sample data (i.e., populations) are severely non-normal to begin with 28 . The bottom line is that there are unfortunately no hard and fast rules to rely on. Thus, if you have reason to suspect that your samples or the populations from which they are derived are strongly skewed, consider consulting your nearest statistician for advice on how to proceed. In the end, given a sufficient sample size, you may be cleared for the t -test. Alternatively, several classes of nonparametric tests can potentially be used ( Section 6.5 ). Although these tests tend to be less powerful than the t -test at detecting differences, the statistical conclusions drawn from these approaches will be much more valid. Furthermore, the computationally intensive method of bootstrapping retains the power of the t -test but doesn't require a normal distribution of the test statistic to yield valid results.

In some cases, it may also be reasonable to assume that the population distributions are normal enough. Normality, or something quite close to it, is typically found in situations where many small factors serve to contribute to the ultimate distribution of the population. Because such effects are frequently encountered in biological systems, many natural distributions may be normal enough with respect to the t -test. Another way around this conundrum is to simply ask a different question, one that doesn't require the t -test approach. In fact, the western blot example is one where many of us would intuitively look toward analyzing the ratios of band intensities within individual blots (discussed in Section 6.5 ).

2.7. Is there a minimum acceptable sample size?

You may be surprised to learn that nothing can stop you from running a t -test with sample sizes of two. Of course, you may find it difficult to convince anyone of the validity of your conclusion, but run it you may! Another problem is that very low sample sizes will render any test much less powerful . What this means in practical terms is that to detect a statistically significant difference with small sample sizes, the difference between the two means must be quite large. In cases where the inherent difference is not large enough to compensate for a low sample size, the P -value will likely be above the critical threshold. In this event, you might state that there is insufficient evidence to indicate a difference between the populations, although there could be a difference that the experiment failed to detect. Alternatively, it may be tempting to continue gathering samples to push the P -value below the traditionally acceptable threshold of 0.05. Whether this is a scientifically appropriate course of action is a matter of some debate, although in some circumstances it may be acceptable. However, this general tactic does have some grave pitfalls, which are addressed in later sections (e.g., Section 6.3 ).

One good thing about working with C. elegans , however, is that for many kinds of experiments, we can obtain fairly large sample sizes without much trouble or expense. The same cannot be said for many other biological or experimental systems. This advantage should theoretically allow us to determine if our data are normal enough or to simply not care about normality since our sample sizes are high. In any event, we should always strive to take advantage of this aspect of our system and not short-change our experiments. Of course, no matter what your experimental system might be, issues such as convenience and expense should not be principal driving forces behind the determination of sample size. Rather, these decisions should be based on logical, pragmatic, and statistically informed arguments (see Section 6.2 on power analysis).

Nevertheless, there are certain kinds of common experiments, such as qRT-PCR, where a sample size of three is quite typical. Of course, by three we do not mean three worms. For each sample in a qRT-PCR experiment, many thousands of worms may have been used to generate a single mRNA extract. Here, three refers to the number of biological replicates . In such cases, it is generally understood that worms for the three extracts may have been grown in parallel but were processed for mRNA isolation and cDNA synthesis separately. Better yet, the templates for each biological replicate may have been grown and processed at different times. In addition, qRT-PCR experiments typically require technical replicates . Here, three or more equal-sized aliquots of cDNA from the same biological replicate are used as the template in individual PCR reactions. Of course, the data from technical replicates will nearly always show less variation than data from true biological replicates. Importantly, technical replicates should never be confused with biological replicates. In the case of qRT-PCR, the former are only informative as to the variation introduced by the pipetting or amplification process. As such, technical replicates should be averaged, and this value treated as a single data point.
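The bookkeeping described above is easy to script; the sketch below uses invented expression values to show technical replicates being averaged within each biological replicate so that each biological replicate contributes exactly one data point to the downstream test. Treating the nine technical measurements per group as independent data points would overstate the sample size and understate the true biological variability.

```python
# Sketch: average technical replicates so that each biological replicate
# contributes one data point. Expression values are invented.
import numpy as np
from scipy import stats

# rows = biological replicates, columns = technical replicates
control = np.array([[1.02, 0.98, 1.01],
                    [0.91, 0.95, 0.93],
                    [1.10, 1.08, 1.12]])
mutant  = np.array([[0.61, 0.64, 0.60],
                    [0.72, 0.70, 0.73],
                    [0.55, 0.58, 0.57]])

control_bio = control.mean(axis=1)   # three values, one per biological replicate
mutant_bio  = mutant.mean(axis=1)

t_stat, p_value = stats.ttest_ind(control_bio, mutant_bio)
print(control_bio, mutant_bio, p_value)
```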

In this case, suppose for the sake of discussion that each replicate contains extracts from 5,000 worms. If all 15,000 worms can be considered to be from a single population (at least with respect to the mRNA of interest), then each observed value is akin to a mean from a sample of 5,000. In that case, one could likely argue that the three values do come from a normal population (the Central Limit Theorem having done its magic on each), and so a t -test using the mean of those three values would be acceptable from a statistical standpoint. It might possibly still suffer from a lack of power, but the test itself would be valid. Similarly, western blot quantitation values, which average proteins from many thousands of worms, could also be argued to fall under this umbrella.

2.8. Paired versus unpaired tests

The paired t -test is a powerful way to detect differences in two sample means, provided that your experiment has been designed to take advantage of this approach. In our example of embryonic GFP expression, the two samples were independent in that the expression within any individual embryo was not linked to the expression in any other embryo. For situations involving independent samples, the paired t -test is not applicable; we carried out an unpaired t -test instead. For the paired method to be valid, data points must be linked in a meaningful way. If you remember from our first example, worms that have a mutation in b show lower expression of the a ::GFP reporter. In this example of a paired t -test, consider a strain that carries a construct encoding a hairpin dsRNA corresponding to gene b . Using a specific promoter and the appropriate genetic background, the dsRNA will be expressed only in the rightmost cell of one particular neuronal pair, where it is expected to inhibit the expression of gene b via the RNAi response. In contrast, the neuron on the left should be unaffected. In addition, this strain carries the same a ::GFP reporter described above, and it is known that this reporter is expressed in both the left and right neurons at identical levels in wild type. The experimental hypothesis is therefore that, analogous to what was observed in embryos, fluorescence of the a ::GFP reporter will be weaker in the right neuron, where gene b has been inhibited.

In this scenario, the data are meaningfully paired in that we are measuring GFP levels in two distinct cells, but within a single worm. We then collect fluorescence data from 14 wild-type worms and 14 b(RNAi) worms. A visual display of the data suggests that expression of a ::GFP is perhaps slightly decreased in the right cell where gene b has been inhibited, but the difference between the control and experimental dataset is not very impressive ( Figure 9A, B ). Furthermore, whereas the means of GFP expression in the left neurons in wild-type and b(RNAi) worms are nearly identical, the mean of GFP expression in the right neurons in wild type is a bit higher than that in the right neurons of b(RNAi) worms. For our t -test analysis, one option would be to ignore the natural pairing in the data and treat left and right cells of individual animals as independent. In doing so, however, we would hinder our ability to detect real differences. The reason is as follows. We already know that GFP expression in some worms will happen to be weaker or stronger (resulting in a dimmer or brighter signal) than in other worms. This variability, along with a relatively small mean difference in expression, may preclude our ability to support differences statistically. In fact, a two-tailed t -test using the (hypothetical) data for right cells from wild-type and b(RNAi) strains ( Figure 9B ) turns out to give a P > 0.05.

Figure 9. Representation of paired data.

Figure 9C, D , in contrast, shows a slightly different arrangement of the same GFP data. Here the wild-type and b(RNAi) strains have been separated, and we are specifically comparing expression in the left and right neurons for each genotype. In addition, lines have been drawn between left and right data points from the same animal. Two things are quite striking. One is that worms that are bright in one cell tend to be bright in the other. Second, looking at b(RNAi) worms, we can see that within individuals, there is a strong tendency to have reduced expression in the right neuron as compared with its left counterpart ( Figure 9D ). However, because of the inherent variability between worms, this difference was largely obscured when we failed to make use of the paired nature of the experiment. This wasn't a problem in the embryonic analysis, because the difference between wild type and b mutants was large enough relative to the variability between embryos. In the case of neurons (and the use of RNAi), the difference was, however, much smaller and thus fell below the level necessary for statistical validation. Using a paired two-tailed t -test for this dataset gives a P < 0.01.

The rationale behind using the paired t -test is that it takes meaningfully linked data into account when calculating the P -value. The paired t -test works by first calculating the difference between each individual pair. Then a mean and variance are calculated for all the differences among the pairs. Finally, a one-sample t -test is carried out where the null hypothesis is that the mean of the differences is equal to zero. Furthermore, the paired t -test can be one- or two-tailed, and arguments for either are similar to those for two independent means. Of course, standard programs will do all of this for you, so the inner workings are effectively invisible. Given the enhanced power of the paired t -test to detect differences, it is often worth considering how the statistical analysis will be carried out at the stage when you are developing your experimental design. Then, if it's feasible, you can design the experiment to take advantage of the paired t -test method.
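To make the inner workings a little less invisible, the sketch below runs a paired t -test on invented left/right GFP intensities and confirms that it is equivalent to a one-sample t -test of the within-animal differences against zero (using SciPy's ttest_rel and ttest_1samp).

```python
# Sketch: a paired t-test is the same as a one-sample t-test on the
# within-animal differences. GFP intensities below are invented.
import numpy as np
from scipy import stats

left  = np.array([8.1, 9.4, 7.2, 10.3, 6.8, 9.0, 8.7])   # left neuron
right = np.array([7.6, 8.8, 6.9,  9.5, 6.1, 8.6, 8.0])   # right neuron, same worms

t_paired, p_paired = stats.ttest_rel(left, right)
t_one, p_one = stats.ttest_1samp(left - right, 0.0)

print(p_paired, p_one)    # identical P-values
```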

2.9. The critical value approach

Some textbooks, particularly older ones, present a method known as the critical value approach in conjunction with the t -test. This method, which traditionally involves looking up t -values in lengthy appendices, was developed long before computer software was available to calculate precise P -values. Part of the reason this method may persist, at least in some textbooks, is that it provides authors with a vehicle to explain the basis of hypothesis testing along with certain technical aspects of the t -test. As a modern method for analyzing data, however, it has long since gone the way of the dinosaur. Feel no obligation to learn this.

3. Comparisons of more than two means

3.1. Introduction

The two-sample t -test works well in situations where we need to determine if differences exist between two populations for which we have sample means. But what happens when our analyses involve comparisons between three or more separate populations? Here things can get a bit tricky. Such scenarios tend to arise in one of several ways. Most common to our field is that we have obtained a collection of sample means that we want to compare with a single standard. For example, we might measure the body lengths of young adult-stage worms grown on RNAi-feeding plates, each targeting one of 100 different collagen genes. In this case, we would want to compare mean lengths from animals grown on each plate with a control RNAi that is known to have no effect on body length. On the surface, the statistical analysis might seem simple: just carry out 100 two-sample t -tests where the average length from each collagen RNAi plate is compared with the same control. The problem with this approach is the unavoidable presence of false-positive findings (also known as Type I errors ). The more t -tests you run, the greater the chance of obtaining a statistically significant result through chance sampling. Even if all the populations were identical in their lengths, 100 t -tests would result on average in five RNAi clones showing differences supported by P- values of <0.05, including one clone with a P -value of <0.01. This type of multiple comparisons problem is common in many of our studies and is a particularly prevalent issue in high-throughput experiments such as microarrays, which typically involve many thousands of comparisons.

Taking a slightly different angle, we can calculate the probability of incurring at least one false significance in situations of multiple comparisons. For example, with just two t -tests and a significance threshold of 0.05, there would be an ∼10% chance 29 that we would obtain at least one P -value that was <0.05 just by chance [1 − (0.95)² = 0.0975]. With just fourteen comparisons, that probability leaps to >50% [1 − (0.95)¹⁴ = 0.512]. With 100 comparisons, there is a >99% chance of obtaining at least one statistically significant result by chance. Using probability calculators available on the web (also see Section 4.10 ), we can determine that for 100 tests there is a 56.4% chance of obtaining five or more false positives and a 2.8% chance of obtaining ten or more. Thinking about it this way, we might be justifiably concerned that our studies may be riddled with incorrect conclusions! Furthermore, reducing the chosen significance threshold to 0.01 will only help so much. In this case, with 50 comparisons, there is still an ∼40% probability that at least one comparison will sneak below the cutoff by chance. Moreover, by reducing our threshold, we run the risk of discarding results that are both statistically and biologically significant.
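These family-wise probabilities are easy to reproduce; the sketch below recalculates the numbers quoted above using the complement rule and the binomial distribution (SciPy is assumed to be available).

```python
# Sketch: chance of at least one false positive among k independent tests
# at alpha = 0.05, plus binomial probabilities of multiple false positives.
from scipy import stats

alpha = 0.05
for k in (2, 14, 100):
    print(f"{k:3d} tests: P(at least one false positive) = {1 - (1 - alpha)**k:.3f}")

# For 100 tests: probability of 5 or more, and 10 or more, false positives
print(stats.binom.sf(4, 100, alpha))    # P(X >= 5)  ~ 0.56
print(stats.binom.sf(9, 100, alpha))    # P(X >= 10) ~ 0.03
```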

A related but distinct situation occurs when we have a collection of sample means, but rather than comparing each of them to a single standard, we want to compare all of them to each other. As with multiple comparisons to a single control, the problem lies in the sheer number of tests required. With only five different sample means, we would need to carry out 10 individual t -tests to analyze all possible pair-wise comparisons [5(5 − 1)/2 = 10]. With 100 sample means, that number skyrockets to 4,950 [100(100 − 1)/2 = 4,950]. Based on a significance threshold of 0.05, this would lead to about 248 statistically significant results occurring by mere chance! Obviously, both common sense and the use of specialized statistical methods will come into play when dealing with these kinds of scenarios. In the sections below, we discuss some of the underlying concepts and describe several practical approaches for handling the analysis of multiple means.

3.2. Safety through repetition

Before delving into some of the common approaches used to cope with multiple comparisons, it is worth considering an experimental scenario that would likely not require specialized statistical methods. Specifically, we may consider the case of a large-scale “functional genomics screen”. In C. elegans , these would typically be carried out using RNAi-feeding libraries ( Kamath et al., 2003 ) or chemical genetics ( Carroll et al., 2003 ) and may involve many thousands of comparisons. For example, if 36 RNAi clones are ultimately identified that lead to resistance to a particular bacterial pathogen from a set of 17,000 clones tested, how does one analyze this result? No worries: the methodology is not composed of 17,000 statistical tests (each with some chance of failing). That's because the final reported tally, 36 clones, was presumably not the result of a single round of screening. In the first round (the primary screen), a larger number (say, 200 clones) might initially be identified as possibly resistant (with some “false significances” therein). A second or third round of screening would effectively eliminate virtually all of the false positives, reducing the number of clones that show a consistent biological effect to 36. In other words, secondary and tertiary screening would reduce to near zero the chance that any of the clones on the final list are in error because the chance of getting the same false positives repeatedly would be very slim. This idea of “safety through independent experimental repeats” is also addressed in Section 4.10 in the context of proportion data. Perhaps more than anything else, carrying out independent repeats is often the best way to solidify results and avoid the presence of false positives within a dataset.

3.3. The family-wise error rate

To help grapple with the problems inherent to multiple comparisons, statisticians came up with something called the family-wise error rate . This is also sometimes referred to as the family-wide error rate , which may provide a better description of the underlying intent. The basic idea is that rather than just considering each comparison in isolation, a statistical cutoff is applied that takes into account the entire collection of comparisons. Recall that in the case of individual comparisons (sometimes called the per-comparison or comparison-wise error rate ), a P -value of <0.05 tells us that for that particular comparison, there is less than a 5% chance of having obtained a difference at least as large as the one observed by chance. Put another way, in the absence of any real difference between two populations, there is a 95% chance that we will not render a false conclusion of statistical significance. In the family-wise error rate approach, the criterion used to judge the statistical significance of any individual comparison is made more stringent as a way to compensate for the total number of comparisons being made. This is generally done by lowering the P -value cutoff (α level) for individual t -tests. When all is said and done, a P -value of <0.05 will mean that there is less than a 5% chance that the entire collection of declared positive findings contains any false positives.

We can use our example of the collagen experiment to further illustrate the meaning of the family-wise error rate. Suppose we test 100 genes and apply a family-wise error rate cutoff of 0.05. Perhaps this leaves us with a list of 12 genes that lead to changes in body size that are deemed statistically significant . This means that there is only a 5% chance that one or more of the 12 genes identified is a false positive. This also means that if none of the 100 genes really controlled body size, then 95% of the time our experiment would lead to no positive findings. Without this kind of adjustment, and using an α level of 0.05 for individual tests, the presence of one or more false positives in a data set based on 100 comparisons could be expected to happen >99% of the time. Several techniques for applying the family-wise error rate are described below.

3.4. Bonferroni-type corrections

The Bonferroni method , along with several related techniques, is conceptually straightforward and provides conservative family-wise error rates. To use the Bonferroni method, one simply divides the chosen family-wise error rate (e.g., 0.05) by the number of comparisons to obtain a Bonferroni-adjusted P -value cutoff. Going back to our example of the collagen genes, if the desired family-wise error rate is 0.05 and the number of comparisons is 100, the adjusted per-comparison significance threshold would be reduced to 0.05/100 = 0.0005. Thus, individual t -tests yielding P -values as low as 0.0006 would be declared insignificant. This may sound rather severe. In fact, a real problem with the Bonferroni method is that for large numbers of comparisons, the significance threshold may be so low that one may fail to detect a substantial proportion of true positives within a data set. For this reason, the Bonferroni method is widely considered to be too conservative in situations with large numbers of comparisons.
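As a minimal illustration of the uniform Bonferroni adjustment, the sketch below compares a handful of invented P -values against the adjusted cutoff for 100 comparisons.

```python
# Sketch: uniform Bonferroni correction for 100 comparisons at a
# family-wise error rate of 0.05. The P-values are invented.
family_wise_alpha = 0.05
n_comparisons = 100
per_test_cutoff = family_wise_alpha / n_comparisons      # 0.0005

for p in (0.0004, 0.0006, 0.003, 0.04):
    verdict = "significant" if p < per_test_cutoff else "not significant"
    print(f"P = {p}: {verdict}")
```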

Another variation on the Bonferroni method is to apply significance thresholds for each comparison in a non-uniform manner. For example, with a family-wise error rate of 0.05 and 10 comparisons, a uniform cutoff would require any given t -test to have an associated P -value of <0.005 to be declared significant. Another way to think about this is that the sum of the 10 individual cutoffs must add up to 0.05. Interestingly, the integrity of the family-wise error rate is not compromised if one were to apply a 0.04 significance threshold for one comparison and a 0.00111 (0.01/9) significance threshold for the remaining nine. This is because 0.04 + 9(0.00111) ≈ 0.05. The rub, however, is that the decision to apply non-uniform significance cutoffs cannot be made post hoc based on how the numbers shake out! For this method to be properly implemented, researchers must first prioritize comparisons based on the perceived importance of specific tests, such as when a negative health or environmental consequence could result from failing to detect a particular difference. For example, it may be more important to detect a correlation between industrial emissions and childhood cancer rates than an effect on the rate of tomato ripening. This may all sound rather arbitrary, but it is nonetheless statistically valid.

3.5. False discovery rates

As discussed above, the Bonferroni method runs into trouble in situations where many comparisons are being made because a substantial proportion of true positives are likely to be discarded for failing to score below the adjusted significance threshold. Stated another way, the power of the experiment to detect real differences may become unacceptably low. Benjamini and Hochberg (1995) are credited with introducing the idea of the false discovery rate (FDR) , which has become an indispensable approach for handling the statistical analysis of experiments composed of large numbers of comparisons. Importantly, the FDR method has greater power than does the Bonferroni method. In the vernacular of the FDR method, a statistically significant finding is termed a “discovery”. Ultimately, the FDR approach allows the investigator to set an acceptable level of false discoveries (usually 5%), which means that, on average, 5% of the findings declared significant are expected to be false positives. This differs fundamentally from the idea behind the family-wise model, where an error rate of 5% means that there is a 5% chance that one or more of the declared significant findings is false. The latter method starts from the position that no differences exist. The FDR method does not suppose this.

The FDR method is carried out by first making many pairwise comparisons and then ordering them according to their associated P -values, from lowest at the top to highest at the bottom. In the examples shown in Table 3 , this was done with only 10 comparisons (for three different data sets), but this method is more commonly applied to studies involving hundreds or thousands of comparisons. What makes the FDR method conceptually unique is that each of the test-derived P -values is measured against a different significance threshold. In the example with 10 individual tests, the one giving the lowest P -value is measured against 0.005 (0.05/10) 30 . Conversely, the highest P -value is measured against 0.05. With ten comparisons, the other significance thresholds simply play out in ordered increments of 0.005 ( Table 3 ). For example, the five significance thresholds starting from the top of the list would be 0.005, 0.010, 0.015, 0.020, and 0.025. The formula is k (α/ C ), where C is the number of comparisons and k is the rank order (by sorted P -values) of the comparison. If 100 comparisons were being made, the highest threshold would still be 0.05, but the lowest five in order would be 0.0005, 0.0010, 0.0015, 0.0020, and 0.0025. Having paired off each experimentally derived P -value with a different significance threshold, one checks to see if the P -value is less than the prescribed threshold. If so, then the difference is declared to be statistically significant (a discovery), and one moves on to the next comparison, involving the second-lowest P -value. This process continues until a P -value is found that is higher than the corresponding threshold. At that point, this and all remaining results are deemed not significant.


Table 3. Illustration of FDR method, based on artificial P -values from 10 comparisons.

Examples of how this can play out are shown in Table 3 . Note that even though some of the comparisons below the first failed test may themselves be less than their corresponding significance thresholds (Data Set #3), these tests are nevertheless declared not significant. This may seem vexing, but without this property the test would not work. This is akin to a “one strike and you're out” rule. Put another way, that test, along with all those below it on the list, are declared persona non grata and asked to leave the premises!

Although the FDR approach is not hugely intuitive, and indeed the logic is not easily tractable, it is worth considering several scenarios to see how the FDR method might play out. For example, with 100 independent tests of two populations that are identical, chance sampling would be expected to result on average with a single t -test having an associated P -value of 0.01 31 . However, given that the corresponding significance threshold would be 0.0005, this test would not pass muster and the remaining tests would also be thrown out. Even if by chance a P -value of <0.0005 was obtained, the next likely lowest P -value on the list, 0.02, would be measured against 0.001, underscoring that the FDR method will be effective at weeding out imposters. Next, consider the converse situation: 100 t -tests carried out on two populations that are indeed different. Furthermore, based on the magnitude of the difference and the chosen sample size, we would expect to obtain an average P -value of 0.01 for all the tests. Of course, chance sampling will lead to some experimental differences that result in P -values that are higher or lower than 0.01, including on average one that is 0.0001 (0.01/100). Because this is less than the cutoff of 0.0005, this would be classified as a discovery, as will many, though not all, of the tests on this particular list. Thus, the FDR approach will also render its share of false-negative conclusions (often referred to as Type II errors). But compared with the Bonferroni method, where the significance threshold always corresponds to the lowest FDR cutoff, the proportion of these errors will be much smaller.
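For readers who prefer to see the procedure spelled out, the sketch below implements the stepwise rule exactly as described above: sort the P -values, compare the k -th smallest to k (α/ C ), and stop at the first failure. The P -values are invented; note that standard library implementations of Benjamini-Hochberg (e.g., statsmodels' multipletests with method="fdr_bh") use a closely related step-up rule and can declare slightly more discoveries.

```python
# Sketch of the stepwise FDR procedure as described in the text:
# sort the P-values, compare the k-th smallest to k*(alpha/C), and stop
# at the first failure ("one strike and you're out").
def fdr_stepwise(p_values, alpha=0.05):
    C = len(p_values)
    order = sorted(range(C), key=lambda i: p_values[i])
    discoveries = []
    for k, idx in enumerate(order, start=1):
        threshold = k * alpha / C
        if p_values[idx] < threshold:
            discoveries.append(idx)     # declared a discovery
        else:
            break                       # this and all remaining tests fail
    return discoveries

pvals = [0.001, 0.004, 0.014, 0.020, 0.047, 0.12, 0.20, 0.35, 0.50, 0.82]
print(fdr_stepwise(pvals))              # indices of the declared discoveries
```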

3.6. Analysis of variance

Entire books are devoted to the statistical method known as analysis of variance 32 (ANOVA) . This section will contain only three paragraphs. This is in part because of the view of some statisticians that ANOVA techniques are somewhat dated or at least redundant with other methods such as multiple regression (see Section 5.5 ). In addition, a casual perusal of the worm literature will uncover relatively scant use of this method. Traditionally, an ANOVA answers the following question: are any of the mean values within a dataset likely to be derived from populations 33 that are truly different? Correspondingly, the null hypothesis for an ANOVA is that all of the samples are derived from populations whose means are identical and that any differences among the sample means are due to chance sampling. Thus, an ANOVA will implicitly compare all possible pairwise combinations of samples to each other in its search for differences. Notably, in the case of a positive finding, an ANOVA will not directly indicate which of the populations are different from each other. An ANOVA tells us only that at least one sample is likely to be derived from a population that is different from at least one other population.

Because such information may be less than totally satisfying, an ANOVA is often used in a two-tiered fashion with other tests; these latter tests are sometimes referred to as post hoc tests . In cases where an ANOVA suggests the presence of different populations, t -tests or other procedures (described below) can be used to identify differences between specific populations. Moreover, so long as the P -value associated with the ANOVA is below the chosen significance threshold, the two means that differ by the greatest amount are assured of being supported by further tests. The converse, however, is not true. Namely, it is possible to “cherry pick” two means from a data set (e.g., those that differ by the greatest amount) and obtain a P -value that is <0.05 based on a t -test even if the P -value of the ANOVA (which simultaneously takes into account all of the means) is >0.05. Thus, ANOVA will provide a more conservative interpretation than t -tests using chosen pairs of means. Of course, focusing on certain comparisons may be perfectly valid in some instances (see discussion of planned comparisons below). In fact, ANOVA is most applicable in situations where there is insufficient structure among the treatment groups to inspire particular planned comparisons. In such cases, an insignificant ANOVA finding might indeed be grounds for proceeding no further.

In cases of a positive ANOVA finding, a commonly used post hoc method is Tukey's test , which goes by a number of different names including Tukey's honest significant difference test and the Tukey-Kramer test . The output of this test is a list of 95% CIs of the differences between means for all possible pairs of populations. Real differences between populations are indicated when the 95% CI for a given comparison does not include zero. Moreover, because of the family-wise nature of this analysis, the entire set of comparisons has only a 5% chance of containing any false positives. As is the case for other methods for multiple comparisons, the chance of obtaining false negatives increases with the number of populations being tested, and, with post hoc ANOVA methods, this increase is typically exponential. For Tukey's test, the effect of increasing the number of populations is manifest as a widening of 95% CIs, such that a higher proportion will encompass zero. Tukey's test does have more power than the Bonferroni method but does not generate precise P -values for specific comparisons. To get some idea of significance levels, however, one can run Tukey's test using several different family-wise significance thresholds (0.05, 0.01, etc.) to see which comparisons are significant at different thresholds. In addition to Tukey's test, many other methods have been developed for post hoc ANOVA including Dunnett's test, Holm's test, and Scheffe's test. Thus if your analyses take you heavily into the realm of the ANOVA, it may be necessary to educate yourself about the differences between these approaches.
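In practice, an ANOVA followed by Tukey's test can be run in a few lines; the sketch below uses invented body-length data for a control and two RNAi treatments. It assumes SciPy is available (scipy.stats.tukey_hsd requires SciPy 1.8 or later; statsmodels' pairwise_tukeyhsd is an alternative).

```python
# Sketch: one-way ANOVA followed by Tukey's post hoc test on invented
# body-length data (mm) for a control and two RNAi treatments.
from scipy import stats

control = [1.10, 1.05, 1.12, 1.08, 1.11, 1.07]
rnai_a  = [0.95, 0.98, 0.93, 0.97, 0.99, 0.94]
rnai_b  = [1.09, 1.06, 1.11, 1.04, 1.10, 1.08]

f_stat, p_anova = stats.f_oneway(control, rnai_a, rnai_b)
print(f"ANOVA: F = {f_stat:.2f}, P = {p_anova:.4f}")

if p_anova < 0.05:
    # Pairwise 95% CIs of the differences; intervals that exclude zero
    # indicate differences at a 5% family-wise error rate (SciPy >= 1.8).
    print(stats.tukey_hsd(control, rnai_a, rnai_b))
```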

3.7. Summary of multiple comparisons methods

Figure 10 provides a visual summary of the multiple comparisons methods discussed above. As can be seen, the likelihood of falsely declaring a result to be statistically significant is highest when conducting multiple t -tests without corrections and lowest using Bonferroni-type methods. Conversely, incorrectly concluding no significant difference even when one exists is most likely to occur using the Bonferroni method. Thus the Bonferroni method is the most conservative of the approaches discussed, with FDR occupying the middle ground. Additionally, there is no rule as to whether the uniform or non-uniform Bonferroni method will be more conservative as this will always be situation dependent. Though discussed above, ANOVA has been omitted from Figure 10 since this method does not apply to individual comparisons. Nevertheless, it can be posited that ANOVA is more conservative than uncorrected multiple t -tests and less conservative than Bonferroni methods. Finally, we can note that the statistical power of an analysis is lowest when using approaches that are more conservative (discussed further in Section 6.2 ).

Figure 10. Strength versus weakness comparison of statistical methods used for analyzing multiple means.

3.8. When are multiple comparison adjustments not required?

There is no law that states that all possible comparisons must be made. It is perfectly permissible to choose a small subset of the comparisons for analysis, provided that this decision is made prior to generating the data and not afterwards based on how the results have played out! In addition, with certain datasets, only certain comparisons may make biological sense or be of interest. Thus one can often focus on a subset of relevant comparisons. As always, common sense and a clear understanding of the biology is essential. These situations are sometimes referred to as planned comparisons, thus emphasizing the requisite premeditation. An example might be testing for the effect on longevity of a particular gene that you have reason to believe controls this process. In addition, you may include some negative controls as well as some “long-shot” candidates that you deduced from reading the literature. The fact that you included all of these conditions in the same experimental run, however, would not necessarily obligate you to compensate for multiple comparisons when analyzing your data.

In addition, when the results of multiple tests are internally consistent, multiple comparison adjustments are often not needed. For example, if you are testing the ability of gene X loss of function to suppress a gain-of-function mutation in gene Y, you may want to test multiple mutant alleles of gene X as well as RNAi targeting several different regions of X. In such cases, you may observe varying degrees of genetic suppression under all the tested conditions. Here you need not adjust for the number of tests carried out, as all the data are supporting the same conclusion. In the same vein, it could be argued that suppression of a mutant phenotype by multiple genes within a single pathway or complex could be exempt from issues of multiple comparisons. Finally, as discussed above, carrying out multiple independent tests may be sufficient to avoid having to apply statistical corrections for multiple comparisons.

3.9. A philosophical argument for making no adjustments for multiple comparisons

Imagine that you have written up a manuscript that contains fifteen figures (fourteen of which are supplemental). Embedded in those figures are 23 independent t -tests, none of which would appear to be obvious candidates for multiple comparison adjustments. However, you begin to worry. Since the chosen significance threshold for your tests was 0.05, there is nearly a 70% chance [1 − (0.95)²³ = 0.693] that at least one of your conclusions is wrong 34 . Thinking about this more, you realize that over the course of your career you hope to publish at least 50 papers, each of which could contain an average of 20 statistical tests. This would mean that over the course of your career you are 99.9999999999999999999947% likely to publish at least one error and will undoubtedly publish many (at least those based on statistical tests). To avoid this humiliation, you decide to be proactive and impose a career-wise Bonferroni correction to your data analysis. From now on, for results with corresponding statistical tests to be considered valid, they must have a P -value of <0.00005 (0.05/1000). Going through your current manuscript, you realize that only four of the 23 tests will meet your new criteria. With great sadness in your heart, you move your manuscript into the trash folder on your desktop.

Although the above narrative may be ridiculous (indeed, it is meant to be so), the underlying issues are very real. Conclusions based on single t -tests, which are not supported by additional complementary data, may well be incorrect. Thus, where does one draw the line? One answer is that no line should be drawn, even in situations where multiple comparison adjustments would seem to be warranted. Results can be presented with corresponding P -values, and readers can be allowed to make their own judgments regarding their validity. For larger data sets, such as those from microarray studies, an estimation of either the number or proportion of likely false positives can be provided to give readers a feeling for the scope of the problem. Even without this, readers could in theory look at the number of comparisons made, the chosen significance threshold, and the number of positive hits to come up with a general idea about the proportion of false positives. Although many reviewers and readers may not be satisfied with this kind of approach, know that there are professional statisticians who support this strategy. Perhaps most importantly, understand that whatever approaches are used, data sets, particularly large ones, will undoubtedly contain errors, including both false positives and false negatives. Wherever possible, seek to confirm your own results using multiple independent methods so that you are less likely to be fooled by chance occurrence.

4. Probabilities and Proportions

4.1. Introduction

Sections 2 and 3 dealt exclusively with issues related to means. For many experiments conducted in our field, however, mean values are not the end goal. For example, we may seek to determine the frequency of a particular defect in a mutant background, which we typically report as either a proportion (e.g., 0.7) or a percentage (e.g., 70%). Moreover, we may want to calculate CIs for our sample percentages or may use a formal statistical test to determine if there is likely to be a real difference between the frequencies observed for two or more samples. In other cases, our analyses may be best served by determining ratios or fold changes , which may require specific statistical tools. Finally, it is often useful, particularly when carrying out genetic experiments, to be able to calculate the probabilities of various outcomes. This section will cover major points that are relevant to our field when dealing with these common situations.

4.2. Calculating simple probabilities

Most readers are likely proficient at calculating the probability of two independent events occurring through application of the multiplication rule . Namely, if event A occurs 20% of the time and event B occurs 40% of the time, then the probability of event A and B both occurring is 0.2 × 0.4 = 0.08 or 8%. More practically, we may wish to estimate the frequency of EcoRI restriction endonuclease sites in the genome. Because the EcoRI binding motif is GAATTC and each nucleotide has a roughly one-in-four chance of occurring at each position, the chance that any six-nucleotide stretch in the genome will constitute a site for EcoRI is (0.25)⁶ = 0.000244140625, or 1 in 4,096. Of course, if all nucleotides are not equally represented or if certain sequences are more or less prevalent within particular genomes, then this will change the true frequency of the site. In fact, GAATTC is over-represented in phage genomes but under-represented in human viral sequences ( Burge et al., 1992 ). Thus, even when calculating straightforward probabilities, one should be careful not to make false assumptions regarding the independence of events.

In carrying out genetic studies, we will often want to determine the likelihood of obtaining a desired genotype. For example, if we are propagating an unbalanced recessive lethal mutation ( let ), we will need to pick phenotypically wild-type animals at each generation and then assess for the presence of the lethal mutation in the first-generation progeny. Basic Mendelian genetics (as applied to C. elegans hermaphrodites) states that the progeny of a let/+ parent will be one-fourth let/let , one-half let/+ , and one-fourth +/+ . Thus, among the non-lethal progeny of a let/+ parent, two-thirds will be let/+ and one-third will be +/+ . A practical question is how many wild-type animals should we single-clone at each generation to ensure that we pick at least one let/+ animal? In this case, using the complement of the multiplication rule, also referred to as the probability of “ at least one ”, will be most germane. We start by asking for the probability that an individual is not let/+ , which in this case is one-third or 0.333. Therefore the probability of picking five animals, none of which are of genotype let/+ , is (0.333)⁵ or 0.41%, and so the probability of picking at least one let/+ is 1 − 0.0041 = 99.59%. Thus, picking five wild-type animals will nearly guarantee that at least one of the F1 progeny is of our desired genotype. Furthermore, there is a (0.667)⁵ ≈ 13.2% chance that all five animals will be let/+ .
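The same “at least one” arithmetic can be scripted, which is handy when exploring different numbers of picks; the sketch below reproduces the values quoted above.

```python
# Sketch: "at least one" calculations for picks from a let/+ hermaphrodite,
# reproducing the numbers in the text.
p_het = 2 / 3        # chance a phenotypically wild-type pick is let/+
n_picks = 5

p_none = (1 - p_het) ** n_picks                              # all picks are +/+
print(f"P(no let/+ among {n_picks} picks) = {p_none:.4f}")            # ~0.0041
print(f"P(at least one let/+)        = {1 - p_none:.4f}")             # ~0.9959
print(f"P(all picks are let/+)       = {p_het ** n_picks:.4f}")       # ~0.1317
```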

4.3. Calculating more-complex probabilities

To calculate probabilities for more-complex problems, it is often necessary to account for the total number of combinations or permutations that are possible in a given situation. In this vernacular, the three different arrangements of the letters ABC, ACB, and BAC are considered to be distinct permutations but only one combination. Thus, for permutations the order matters, whereas for combinations it does not. Depending on the situation, either combinations or permutations may be most relevant. Because of the polarity inherent to DNA polymers, GAT and TAG are truly different sequences, and thus permutations would be germane. So far as standard mass spectrometry is concerned, however, the peptides DAVDKEN and KENDAVD are identical, and thus combinations might be more relevant in this case.

To illustrate the process of calculating combinations and permutations, we'll first use an example involving peptides. If each of the twenty standard amino acids (aa) is used only once in the construction of a 20-aa peptide, how many distinct sequences can be assembled? We start by noting that the order of the amino acids will matter, and thus we are concerned with permutations. In addition, given the setup where each amino acid can be used only once, we are sampling without replacement . The solution can be calculated using the following generic formula: # of permutations = n !. Here n is the total number of items and “!” is the mathematical factorial symbol, such that n ! = n × ( n − 1) × ( n − 2) … × 1. For example, 5! = 5 × 4 × 3 × 2 × 1 = 120. Also by convention, 1! and 0! are both equal to one. To solve this problem we therefore multiply 20 × 19 × 18 … 3 × 2 × 1, or 20! ≈ 2.4 × 10¹⁸, an impressively large number. Note that because we were sampling without replacement, the incremental decrease with each multiplier was necessary to reflect the reduced choice of available amino acids at each step. Had we been sampling with replacement, where each amino acid can be used any number of times, the equation would simply be 20²⁰ ≈ 1.05 × 10²⁶, an even more impressive number!

Going back to the previous genetic example, one might wish to determine the probability of picking five progeny from a parent that is let/+ where three are of genotype let/+ and two are +/+ . One thought would be to use the multiplication rule where we multiply 0.667 × 0.667 × 0.667 × 0.333 × 0.333, or more compactly, (0.667)³(0.333)² = 0.0329 or 3.29%. If this seems a bit lower than you might expect, your instincts are correct. The above calculation describes only the probability of obtaining any one particular sequence that produces three let/+ (L) and two +/+ (W) worms. For this reason, it underestimates the true frequency of interest, since there are multiple ways of getting the same combination. For example, one possible order would be WWLLL, but equally probable are WLWLL, WLLWL, WLLLW, LLLWW, LLWLW, LLWWL, LWLLW, LWWLL, and LWLWL, giving a total of ten possible permutations. Of course, unlike peptides or strands of DNA, all of the possible orders are equivalent with respect to the relevant outcome, obtaining three let/+ and two +/+ worms. Thus, we must take permutations into account in order to determine the frequency of the generic combination. Because deriving permutations by hand (as we did above) becomes cumbersome (if not impossible) very quickly, one can use the following equation, where n is the total number of items, with n₁ that are alike and n₂ that are alike, etc., up through nₖ.

# of permutations = n!/(n₁! × n₂! × ⋯ × nₖ!)

Thus plugging in the numbers for our example, we would have 5!/(3! × 2!) = 120/(6 × 2) = 10. Knowing the number of possible permutations, we can then multiply this by the probability of getting any single arrangement of three let/+ and two +/+ worms calculated above, such that 0.0329 × 10 = 0.329 or 32.9%, a number that makes much more sense. This illustrates a more general rule regarding the probability ( Pr ) of obtaining specific combinations:

Pr combination = (# of permutations) × (probability of obtaining any single permutation)

Note, however, that we may often be interested in a slightly different question than the one just posed. For example, what is the probability that we will obtain at least three let/+ animals with five picks from a let/+ parent? In this case, we would have to sum the probabilities for three out of five [(5!/(3! × 2!)) × (0.667)³(0.333)² = 0.329], four out of five [(5!/(4! × 1!)) × (0.667)⁴(0.333) = 0.330], and five out of five [(0.667)⁵ = 0.132] let/+ animals, giving us 0.329 + 0.330 + 0.132 = 0.79 or 79%.
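This is simply a binomial calculation, and it can be checked directly; the sketch below reproduces the total of roughly 0.79 both by summing the terms by hand and by using SciPy's binomial distribution.

```python
# Sketch: probability of at least three let/+ animals among five picks,
# done by hand and with SciPy's binomial distribution (both ~0.79).
from math import comb
from scipy import stats

p, n = 2 / 3, 5
by_hand = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (3, 4, 5))
by_scipy = stats.binom.sf(2, n, p)     # P(X >= 3)
print(by_hand, by_scipy)
```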

The ability to calculate permutations can also be used to determine the number of different nucleotide sequences in a 20-mer where each of the four nucleotides (G, A, T, C) is used five times. Namely, 20!/(5!)⁴ ≈ 1.2 × 10¹⁰. Finally, we can calculate the number of different peptides containing five amino acids where each of the twenty amino acids can be chosen at most once (i.e., without replacement). In this case, we can use a generic formula where n is the total number of items from which we select r items without replacement.

# of permutations = n!/(n − r)!

This would give us 20!/(20 − 5)! = 20!/15! = 20 × 19 × 18 × 17 × 16 = 1,860,480. The same scenario carried out with replacement would simply be 20⁵ = 3,200,000. Thus, using just a handful of formulas, we are empowered to make predictions for a wide range of probabilities that we may encounter. This is important because probabilities are not always intuitive, as illustrated by the classic “birthday problem”, which demonstrates that within a group of only 23 people, there is a >50% probability that at least two will share the same birthday 35 .
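Python's math module will happily do this sort of counting for you; the sketch below reproduces the numbers above (math.perm requires Python 3.8 or later).

```python
# Sketch: the counting calculations above, using Python's math module.
from math import factorial, perm

print(factorial(20))                      # 20! ~ 2.4e18 orderings of 20 aa
print(20**20)                             # with replacement: ~1.05e26
print(factorial(20) // factorial(5)**4)   # 20-mer, each base used 5x: ~1.2e10
print(perm(20, 5))                        # 5-aa peptides w/o replacement: 1,860,480
print(20**5)                              # 5-aa peptides with replacement: 3,200,000
```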

4.4. The Poisson distribution

Certain types of probabilistic events can be modeled using a distribution developed by the French mathematician Siméon Denis Poisson. Specifically, the Poisson distribution can be used to predict the probability that a given number of events will occur over a stipulated interval of time, distance, space, or other related measure, when said events occur independently of one another. For example, given a known forward mutation rate caused by a chemical mutagen, what is the chance that three individuals from a collection of 1,000 F1s (derived from mutagenized P0 parents) will contain a mutation in gene X? Also, what is the chance that any F1 worm would contain two or more independent mutations within gene X?

The generic formula used to calculate such probabilities is shown below, where µ is the mean number of expected events, x is the number of times that the event occurs over a specified interval, and e is the base of the natural logarithm.

P(x) = (µ^x × e^(−µ)) / x!

For this formula to predict probabilities accurately, it is required that the events be independent of each other and occur at a constant average rate over the given interval. If these criteria are violated, then the Poisson distribution will not provide a valid model. For example, imagine that we want to calculate the likelihood that a mutant worm that is prone to seizures will have two seizures (i.e., events or x ) within a 5-minute interval. For this calculation, we rely on previous data showing that, on average, mutant worms have 6.2 seizures per hour. Thus, the average (μ) for a 5-minute interval would be 6.2/12 = 0.517. Plugging these numbers into the above formula we obtain P(x) = 0.080 or 8%. Note that if we were to follow 20 different worms for 5 minutes and observed six of them to have two seizures, this would suggest that the Poisson distribution is not a good model for our particular event 36 . Rather, the data would suggest that multiple consecutive seizures occur at a frequency that is higher than predicted by the Poisson distribution, and thus the seizure events are not independent. In contrast, had only one or two of the 20 worms exhibited two seizures within the time interval, this would be consistent with a Poisson model.
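The seizure calculation maps directly onto SciPy's Poisson distribution, as sketched below.

```python
# Sketch: Poisson probabilities for the seizure example.
from scipy import stats

mu = 6.2 / 12                      # expected seizures per 5-minute interval
print(stats.poisson.pmf(2, mu))    # exactly two seizures  ~0.080
print(stats.poisson.sf(1, mu))     # two or more seizures  ~0.095
```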

4.5. Intuitive methods for calculating probabilities

Another useful strategy for calculating probabilities, as well as other parameters of interest that are governed by chance, is sometimes referred to as the intuitive approach . This includes the practice of plugging hypothetical numbers into scenarios to maximize the clarity of the calculations and conclusions. Our example here will involve efforts to maximize the efficiency of an F2-clonal genetic screen to identify recessive maternal-effect lethal or sterile mutations ( Figure 11 ). For this experiment, we will specify that 100 P0 adults are to be cloned singly onto plates following mutagenesis. Then ten F1 progeny from each P0 will be single-cloned, some small fraction of which will be heterozygous for a desired class of mutation ( m/+ ). To identify mutants of interest, however, F2s of genotype m/m must be single-cloned, and their F3 progeny must be inspected for the presence of the phenotype. The question is: what is the optimal number of F2s to single-clone from each F1 plate?

Figure 11. Schematic diagram of an F2-clonal genetic screen for recessive mutations in C. elegans .

Mendelian genetics states that the chance of picking an m/m F2 from an m/+ F1 parent is one in four or 25%, so picking more will of course increase the likelihood of obtaining the desired genotype. But will the returns prove diminishing and, if so, what is the most efficient practice? Table 4 plugs in real numbers to determine the frequency of obtaining m/m animals based on the number of cloned F2s. The first column shows the number of F2 animals picked per F1, which ranges from one to six. In the second column, the likelihood of picking at least one m/m animal is determined using the complement of the multiplication rule described above. As expected, the likelihood increases with larger numbers of F2s, but diminishing returns are evident as the number of F2s increases. Columns 3–5 tabulate the number of worm plates required, the implication being that more plates are both more work and more expense. Columns 6 and 8 calculate the expected number of m/m F2s that would be isolated given frequencies of ( m/+ ) heterozygotes of 0.01 (10 in 1,000 F1s) or 0.001 (1 in 1,000 F1s), respectively. Here, a higher frequency would imply that the desired mutations of interest are more common. Finally, columns 7 and 9 show the predicted efficiencies of the screening strategies by dividing the number of isolated m/m F2s by the total number of F1 and F2 plates required (e.g., 2.50/2,000 = 1.25 × 10⁻³).


Table 4. Intuitive approach to determine the maximum efficiency of an F2-clonal genetic screen.

From this we can see that either two or three F2s is the most efficient use of plates and possibly time, although other considerations could potentially influence the decision of how many F2s to pick. We can also see that the relative efficiencies are independent of the frequency of the mutation of interest. Importantly, this potentially useful insight was accomplished using basic intuition and a very rudimentary knowledge of probabilities. Of course, the outlined intuitive approach failed to address whether the optimal number of cloned F2s is 2.4 or 2.5 37 , but as we haven't yet developed successful methods to pick or propagate fractions of C. elegans , such details are irrelevant!
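One plausible way to reconstruct the logic behind Table 4 is sketched below: for k cloned F2s per F1, the chance of recovering at least one m/m from a heterozygous F1 is 1 − (0.75)^k, and the plate count is 1,000 F1 plates plus 1,000 × k F2 plates. The heterozygote frequency of 0.01 is the same assumption used in the text; the k = 1 row reproduces the 2.50/2,000 = 1.25 × 10⁻³ example, and the efficiency peaks at two to three F2s.

```python
# Sketch reconstructing the efficiency calculation behind Table 4, assuming
# 1,000 cloned F1s of which 10 are m/+ heterozygotes (frequency 0.01).
n_f1 = 1000
n_het_f1 = 10

for k in range(1, 7):
    p_hit = 1 - 0.75**k                   # at least one m/m among k F2 picks
    expected_mutants = n_het_f1 * p_hit
    total_plates = n_f1 + n_f1 * k        # F1 plates + F2 plates
    efficiency = expected_mutants / total_plates
    print(f"{k} F2s per F1: {expected_mutants:.2f} mutants, "
          f"{total_plates} plates, efficiency = {efficiency:.2e}")
```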

We note that an online tool has been created by Shai Shaham ( Shaham, 2007 ) that allows users to optimize the efficiency of genetic screens in C. elegans 38 . To use the tool, users enter several parameters that describe the specific genetic approach (e.g., F1 versus F2 clonal). The website's algorithm then provides a recommended F2-to-F1 screening ratio. Entering parameters that match the example used above, the website suggests picking two F2s for each F1, which corresponds to the number we calculated using our intuitive approach. In addition, the website provides a useful tool for calculating the screen size necessary to achieve a desired level of genetic saturation. For example, one can determine the number of cloned F1s required to ensure that all possible genetic loci will be identified at least once during the course of screening with a 95% confidence level.

4.6. Conditional probability: calculating probabilities when events are not independent

In many situations, the likelihood of two events occurring is not independent. This does not mean that the two events need be totally interdependent or mutually exclusive, just that one event occurring may increase or decrease the likelihood of the other. Put another way, having prior knowledge of one outcome may change the effective probability of a second outcome. Knowing that someone has a well-worn “1997 C. elegans International Meeting” t-shirt in his drawer does not guarantee that he is an aging nerd, but it certainly does increase the probability! The area of statistics that handles such situations is known as Bayesian analysis or inference, after an early pioneer in this area, Thomas Bayes. More generally, conditional probability refers to the probability of an event occurring based on the condition that another event has occurred. Although conditional probabilities are extremely important in certain types of biomedical and epidemiological research, such as predicting disease states given a set of known factors 39 , this issue doesn't arise too often for most C. elegans researchers. Bayesian models and networks have, however, been used in the worm field for applications that include phylogenetic gene tree construction ( Hoogewijs et al., 2008 ; Agarwal and States, 1996 ), modeling developmental processes ( Sun and Hong, 2007 ), and predicting genetic interactions ( Zhong and Sternberg, 2006 ). Bayesian statistics is also used quite extensively in behavioral neuroscience ( Knill and Pouget, 2004 ; Vilares and Kording, 2011 ), which is a growing area in the C. elegans field. We refer interested readers to textbooks or the web for additional information (see Appendix A ).

4.7. Binomial proportions

It is common in our field to generate data that take the form of binomial proportions . Examples would include the percentage of mutant worms that arrest as embryos or that display ectopic expression of a GFP reporter. As the name implies, binomial proportions arise from data that fall into two categories such as heads or tails, on or off, and normal or abnormal. More generically, the two outcomes are often referred to as a success or failure . To properly qualify, data forming a binomial distribution must be acquired by random sampling , and each outcome must be independent of all other outcomes. Coin flips are a classic example where the result of any given flip has no influence on the outcome of any other flip. Also, when using a statistical method known as the normal approximation (discussed below), the binomial dataset should contain a minimum of ten outcomes in each category (although some texts may recommend a more relaxed minimum of five). This is generally an issue only when relatively rare events are being measured. For example, flipping a coin 50 times would certainly result in at least ten heads or ten tails, whereas a phenotype with very low penetrance might be detected only in three worms from a sample of 100. In this latter case, a larger sample size would be necessary for the approximation method to be valid. Lastly, sample sizes should not be >10% of the entire population. As we often deal with theoretical populations that are effectively infinite in size, however, this stipulation is generally irrelevant.

An aside on the role of normality in binomial proportions is also pertinent here. It might seem counterintuitive, but the distribution of sample proportions arising from data that are binary does have, with sufficient sample size, an approximately normal distribution. This concept is illustrated in Figure 12 , which computationally simulates data drawn from a population with an underlying “success” rate of 0.25. As can be seen, the distribution becomes more normal with increasing sample size. How large a sample is required, you ask? The short answer is that the closer the underlying rate is to 50%, the smaller the required sample size is; with a more extreme rate (i.e., closer to 0% or 100%), a larger size is required. The requirements are reasonably met by the aforementioned minimum of ten rule.
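
For readers who want to see this for themselves, the short Python sketch below (our own illustration, not code from the chapter) repeats the simulation idea behind Figure 12: many samples are drawn from a population with a true success rate of 0.25, and the spread and skewness of the resulting sample proportions are summarized for several sample sizes:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
true_p = 0.25
for n in (10, 40, 160):                                # per-sample sizes to compare
    props = rng.binomial(n, true_p, size=10_000) / n   # 10,000 sample proportions
    print(f"n={n:4d}  mean={props.mean():.3f}  SD={props.std(ddof=1):.3f}  "
          f"skewness={skew(props):.2f}")
# The skewness moves toward 0 (more symmetric, more normal) and the spread
# narrows as the sample size grows.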

4.8. Calculating confidence intervals for binomial proportions

To address the accuracy of proportions obtained through random sampling, we will typically want to provide an accompanying CI. For example, political polls will often report the percentage in favor of a candidate along with a 95% CI, which may encompass several percentage points to either side of the midpoint estimate 40 . As previously discussed in the context of means, determining CIs for sample proportions is important because in most cases we can never know the true proportion of the population under study. Although different confidence levels can be used, binomial data are often accompanied by 95% CIs. As for means, lower confidence levels (e.g., 90%) are associated with narrower intervals, whereas higher confidence levels (e.g., 99%) are associated with wider intervals. Once again, the meaning of a 95% CI is the same as that discussed earlier in the context of means. If one were to repeat the experiment 100 times and calculate 95% CIs for each repeat, on average 95 of the calculated CIs would contain the true population proportion. Thus, there is a 95% chance that the CI calculated for any given experiment contains the true population proportion.

Perhaps surprisingly, there is no perfect consensus among statisticians as to which of several methods is best for calculating CIs for binomial proportions 41 . Thus, different textbooks or websites may describe several different approaches. That said, for most purposes we can recommend a test that goes by several names including the adjusted Wald, the modified Wald, and the Agresti-Coull (A-C) method ( Agresti and Coull, 1998 ; Agresti and Caffo, 2000 ). Reasons for recommending this test are: (1) it is widely accepted, (2) it is easy to implement (if the chosen confidence level is 95%), and (3) it gives more-accurate CIs than other straightforward methods commonly in use. Furthermore, even though this approach is based on the normal approximation method, the minimum of ten rule can be relaxed.

To implement the A-C method for a 95% CI (the most common choice), add two to both the number of successes and failures. Hence this is sometimes referred to as the plus-four or “+4” method. The doctored numbers, together with the normal approximation method, are then used to determine the CI for the population proportion. Admittedly this sounds weird, if not outright suspicious, but empirical studies have shown that this method gives consistently better 95% CIs than would be created using the simpler unadjusted method. For example, if in real life you assayed 83 animals and observed larval arrest in 22, you would change the total number of trials to 87 and the number of arrested larvae to 24. In addition, depending on the software or websites used, you may need to choose the normal approximation method and not something called the exact method for this to work as intended.

Importantly, the proportion and sample size that you report should be the actual proportion and sample size from what you observed; the doctored (i.e., plus-four) numbers are used exclusively to generate the 95% CI. Thus, in the case of the above example, you would report the proportion as 22/83 = 0.265 or 26.5%. The 95% CI would, however, be calculated using the numbers 24 and 87 to give a 95% CI of 18.2%–37.0%. Note that the +4 version of the A-C method is specific for 95% CIs and not for other intervals 42 . Finally, it is worth noting that CIs for proportions that are close to either 0% or 100% will get a bit funky. This is because the CI cannot include numbers <0% or >100%. Thus, CIs for proportions close to 0% or 100% will often be quite lopsided around the midpoint and may not be particularly accurate. Nevertheless, unless the obtained percentage is 0 or 100, we do not recommend doing anything about this as measures used to compensate for this phenomenon have their own inherent set of issues. In other words, if the percentage ranges from 1% to 99%, use the A-C method for calculation of the CI. In cases where the percentage is 0% or 100%, instead use the exact method.
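
As a minimal sketch (ours, not the authors'), the A-C calculation for the example above can be written out in a few lines of Python; it reproduces the 18.2%–37.0% interval quoted in the text:

from math import sqrt

successes, n = 22, 83
p_reported = successes / n                  # report the real proportion: ~0.265

# The doctored (+4) numbers are used only to build the interval itself.
s_adj, n_adj = successes + 2, n + 4         # 24 successes out of 87
p_adj = s_adj / n_adj
se = sqrt(p_adj * (1 - p_adj) / n_adj)
lower, upper = p_adj - 1.96 * se, p_adj + 1.96 * se

print(f"proportion = {p_reported:.3f}; 95% CI = {lower:.3f} to {upper:.3f}")
# Prints roughly 0.182 to 0.370, matching the 18.2%-37.0% quoted above.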

4.9. Tests for differences between two binomial proportions

Very often we will want to compare two proportions for differences. For example, we may observe 85% larval arrest in mutants grown on control RNAi plates and 67% arrest in mutants on RNAi-feeding plates targeting gene X . Is this difference significant from a statistical standpoint? To answer this, two distinct tests are commonly used. These are generically known as the normal approximation and exact methods. In fact, many website calculators or software programs will provide the P -value calculated by each method as a matter of course, although in some cases you may need to select one method. The approximation method (based on the so-called normal distribution) has been in general use much longer, and the theory behind this method is often outlined in some detail in statistical texts. The major reason for the historical popularity of the approximation method is that prior to the advent of powerful desktop computers, calculations using the exact method simply weren't feasible. Its continued use is partly due to convention, but also because the approximation and exact methods typically give very similar results. Unlike the normal approximation method, however, the exact method is valid in all situations, such as when the number of successes is less than five or ten, and can thus be recommended over the approximation method.

Regardless of the method used, the P -value derived from a test for differences between proportions will answer the following question: What is the probability that the two experimental samples were derived from the same population? Put another way, the null hypothesis would state that both samples are derived from a single population and that any differences between the sample proportions are due to chance sampling. Much like statistical tests for differences between means, proportions tests can be one- or two-tailed, depending on the nature of the question. For the purpose of most experiments in basic research, however, two-tailed tests are more conservative and tend to be the norm. In addition, analogous to tests with means, one can compare an experimentally derived proportion against a historically accepted standard, although this is rarely done in our field and comes with the possible caveats discussed in Section 2.3 . Finally, some software programs will report a 95% CI for the difference between two proportions. In cases where no statistically significant difference is present, the 95% CI for the difference will always include zero.
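
To make this concrete, the hedged Python sketch below compares the two arrest proportions described above using both the exact and approximation methods. The 85% and 67% figures come from the text, but the sample sizes (100 worms per condition) are invented here purely for illustration:

from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportions_ztest

arrested = [85, 67]           # control RNAi, gene X RNAi (from the text)
assayed = [100, 100]          # hypothetical sample sizes

# Exact method: Fisher's exact test on the 2x2 table of arrested vs. not arrested.
table = [[arrested[0], assayed[0] - arrested[0]],
         [arrested[1], assayed[1] - arrested[1]]]
odds_ratio, p_exact = fisher_exact(table, alternative="two-sided")

# Normal approximation method, for comparison; the two P-values are usually similar.
z_stat, p_approx = proportions_ztest(arrested, assayed)

print(f"exact P = {p_exact:.4f}; approximate P = {p_approx:.4f}")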

4.10. Tests for differences between more than one binomial proportion

A question that may arise when comparing more than two binomial proportions is whether or not multiple comparisons should be factored into the statistical analysis. The issues here are very similar to those discussed in the context of comparing multiple means ( Section 3 ). In the case of proportions, rather than carrying out an ANOVA, a Chi-square test (discussed below) could be used to determine if any of the proportions are significantly different from each other. Like an ANOVA, however, this may be a somewhat less-than-satisfying test in that a positive finding would not indicate which particular proportions are significantly different. In addition, FDR and Bonferroni-type corrections could also be applied at the level of P -value cutoffs, although these may prove to be too conservative and could reduce the ability to detect real differences (i.e., the power of the experiment).

In general, we can recommend that for findings confirmed by several independent repeats, corrections for multiple comparisons may not be necessary. We illustrate our rationale with the following example. Suppose you were to carry out a genome-wide RNAi screen to identify suppressors of larval arrest in the mutant Y background. A preliminary screen might identify ∼1,000 such clones ranging from very strong to very marginal suppressors. With retesting of these 1,000 clones, most of the false positives from the first round will fail to suppress in the second round and will be thrown out. A third round of retesting will then likely eliminate all but a few false positives, leaving mostly valid ones on the list. This effect can be quantified by imagining that we carry out an exact binomial test on each of ∼20,000 clones in the RNAi library, together with an appropriate negative control, and choose an α level (i.e., the statistical cutoff) of 0.05. By chance alone, 5% or 1,000 out of 20,000 would fall below the P -value threshold. In addition, let's imagine that 100 real positives would also be identified, giving us 1,100 positives in total. Admittedly, at this point, the large majority of identified clones would be characterized as false positives. In the second round of tests, however, the large majority of true positives would again be expected to exhibit statistically significant suppression, whereas only 50 of the 1,000 false positives will do so. Following the third round of testing, all but two or three of the false positives will have been eliminated. These, together with the ∼100 true positives, most of which will have passed all three tests, will leave a list of genes that is strongly enriched for true positives. Thus, by carrying out several experimental repeats, additional correction methods are not needed.

4.11. Probability calculations for binomial proportions

At times one may be interested in calculating the probability of obtaining a particular proportion or one that is more extreme, given an expected frequency. For example, what are the chances of tossing a coin 100 times and getting heads 55 times? This can be calculated using a small variation on the formulae already presented above.

P(Y \text{ successes in } n \text{ trials}) = \binom{n}{Y}\, p^{Y} (1 - p)^{n - Y} = \frac{n!}{Y!\,(n - Y)!}\, p^{Y} (1 - p)^{n - Y}

Here n is the number of trials, Y is the number of positive outcomes or successes, and p is the probability of a success occurring in each trial. Thus we can determine that the chance of getting exactly 55 heads is quite small, only 4.85%. Nevertheless, given an expected proportion of 0.5, we intuitively understand that 55 out of 100 heads is not an unusual result. In fact, we are probably most interested in knowing the probability of getting a result at least as or more extreme than 55 (whether that be 55 heads or 55 tails). Thus our probability calculations must also include the results where we get 56, 57, 58…100 heads as well as 45, 44, 43…0 heads. Adding up these probabilities then tells us that we have a 36.8% chance of obtaining at least 55 heads or tails in 100 tosses, which is certainly not unusual. Rather than having to calculate each probability and adding them up, however, a number of websites will do this for you. Nevertheless, be alert and understand the principles behind what you are doing, as some websites may only give you the probability of ≥55 heads, whereas what you really may need is the summed probability of both ≥55 and ≤45 heads.
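
The same calculation can be done directly from the binomial distribution; the short sketch below (ours, not the chapter's) reproduces the 4.85% and 36.8% figures quoted above:

from scipy.stats import binom

n, p = 100, 0.5
p_exactly_55 = binom.pmf(55, n, p)             # ~0.0485, i.e., 4.85%
p_55_or_more = binom.sf(54, n, p)              # P(55 or more heads)
p_45_or_less = binom.cdf(45, n, p)             # P(45 or fewer heads)
p_two_sided = p_55_or_more + p_45_or_less      # ~0.368, as quoted above

print(f"P(exactly 55 heads) = {p_exactly_55:.4f}")
print(f"P(a result at least as extreme as 55) = {p_two_sided:.3f}")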

4.12. Probability calculations when sample sizes are large relative to the population size

One of the assumptions for using the binomial distribution is that our population size must be very large relative to our sample size 43 . Thus, the act of sampling itself should not appreciably alter the course of future outcomes (i.e., the probability is fixed and does not change each trial). For example, if we had a (very large) jar with a million marbles, half of them black and half of them white, removing one black marble would not grossly alter the probability of the next marble being black or white. We can therefore treat these types of situations as though the populations were infinite or as though we were sampling with replacement . In contrast, with only ten marbles (five white and five black), picking a black marble first would reduce the probability of the next marble being black from 50% (5/10) to 44.4% (4/9), while increasing the probability of the next marble being white to 55.6% (5/9) 44 . For situations like this, in which the act of sampling noticeably affects the remaining population, the binomial is shelved in favor of something called the hyper-geometric distribution . For example, in the case of the ten marbles, the probability of picking out five marbles, all of which are black, is 0.0040 using the hypergeometric distribution. In contrast, the binomial applied to this situation gives an erroneously high value of 0.031.
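
As a quick illustration (our sketch, not the authors'), the marble example can be checked directly:

from scipy.stats import hypergeom, binom

total, black, draws = 10, 5, 5                    # 10 marbles, 5 black, pick 5
p_hyper = hypergeom.pmf(5, total, black, draws)   # all 5 picks black: ~0.0040
p_binom = binom.pmf(5, draws, black / total)      # binomial (wrongly) gives ~0.031

print(f"hypergeometric: {p_hyper:.4f}; binomial: {p_binom:.4f}")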

For our field, we often see hyper-geometric calculations applied to computational or genomics types of analyses. For example, imagine that we have carried out an RNA-seq experiment and have identified 1,000 genes that are mis-expressed in a particular mutant background. A gene ontology (GO) search reveals that 13 of these genes encode proteins with a RING domain. Given that there are 152 annotated RING domain–containing proteins in C. elegans , what is the probability that at least 13 would arise by chance in a dataset of this size? The rationale for applying a hyper-geometric distribution to this problem would be as follows. If one were to randomly pick one of ∼20,000 worm proteins out of a hat, the probability of it containing a RING domain would be 152/20,000 (0.00760). This leaves 151 remaining RING proteins that could be picked in future turns. The chance that the next protein would contain a RING domain is then 151/19,999 (0.00755). By the time we have come to the thirteenth RING protein in our sample of 1,000 differentially expressed genes, our chances might be something like 140/19,001 (0.00737). Plugging the required numbers into both binomial and hyper-geometric calculators 45 (available on the web), we get probabilities of 0.0458 and 0.0415, respectively. Admittedly, in this case the hyper-geometric method gives us only a slightly smaller probability than the binomial. Here the difference isn't dramatic because the population size (20,000) is not particularly small.
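
The same comparison for the RING-domain example can be sketched as follows (again our own illustration, not the chapter's); it reproduces the approximate probabilities quoted above:

from scipy.stats import hypergeom, binom

genome, ring_total, sample, ring_hits = 20_000, 152, 1_000, 13

# P(at least 13) = 1 - P(12 or fewer); the survival function sf(12) gives this directly.
p_hyper = hypergeom.sf(ring_hits - 1, genome, ring_total, sample)
p_binom = binom.sf(ring_hits - 1, sample, ring_total / genome)

print(f"hypergeometric P = {p_hyper:.4f}")    # ~0.042, as quoted above
print(f"binomial P = {p_binom:.4f}")          # ~0.046, as quoted above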

We would also underscore the importance of being conservative in our interpretations of GO enrichment studies. In the above example with RING finger proteins, had the representation in our dataset of 1,000 genes been 12 instead of 13, the probability would have been greater than 0.05 (0.0791 and 0.0844 by hyper-geometric and binomial methods, respectively). Furthermore, we need to consider that there are currently several thousand distinct GO terms used to classify C. elegans genes. Thus, the random chance of observing over-representation within any one particular GO class will be much less than observing over-representation within some—but no particular—GO class. Going back to marbles, if we had an urn with 100 marbles of ten different colors (10 marbles each), the chance that a random handful of 5 marbles would contain at least 3 of one particular color is only 0.00664. However, the chance of getting at least three out of five the same color is 0.0664. Thus, for GO over-representation to be meaningful, we should look for very low P -values (e.g., ≤1e –5 ).

4.13. Tests for differences between multinomial proportions

Multinomial proportions or distributions refer to data sets where outcomes are divided into three or more discrete categories. A common textbook example involves the analysis of genetic crosses where either genotypic or phenotypic results are compared to what would be expected based on Mendel's laws. The standard prescribed statistical procedure in these situations is the Chi-square goodness-of-fit test, an approximation method that is analogous to the normal approximation test for binomials. The basic requirements for multinomial tests are similar to those described for binomial tests. Namely, the data must be acquired through random sampling and the outcome of any given trial must be independent of the outcome of other trials. In addition, a minimum of five expected outcomes in each category is required for the Chi-square goodness-of-fit test to be valid. To run the Chi-square goodness-of-fit test, one can use standard software programs or websites. These will require that you enter the number of expected or control outcomes for each category along with the number of experimental outcomes in each category. This procedure tests the null hypothesis that the experimental data were derived from the same population as the control or theoretical population and that any differences in the proportion of data within individual categories are due to chance sampling.
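
As an illustration (with made-up counts), the sketch below runs a Chi-square goodness-of-fit test on a hypothetical dihybrid cross against Mendel's 9:3:3:1 expectation:

from scipy.stats import chisquare

observed = [95, 30, 28, 7]                    # hypothetical counts for the four phenotypes
total = sum(observed)
expected = [total * f for f in (9/16, 3/16, 3/16, 1/16)]   # Mendel's 9:3:3:1

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, P = {p_value:.3f}")
# A large P-value indicates the observed counts are consistent with 9:3:3:1.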

As is the case with binomials, exact tests can also be carried out for multinomial proportions. Such tests tend to be more accurate than approximation methods, particularly if the requirement of at least five expected outcomes in each category cannot be met. Because of the more-complicated calculations involved, however, exact multinomial tests are less commonly used than the exact binomial test, and web versions are currently difficult to come by. The good news is that, like the binomial tests, the approximate and exact methods for multinomials will generally yield very similar results. Also make sure that you don't confuse the Chi-square goodness-of-fit test with the Chi-square test of independence , which also has an exact version termed Fisher's exact test.

Although many of us will probably not require the Chi-square goodness-of-fit test to sort out if our proportion of yellow wrinkled peas is what we might have expected, understand that this test can be used for any kind of sample data where more than two categories are represented. Imagine that we are studying a gene in which mutations lead to pleiotropy, meaning that a spectrum of distinct phenotypes is observed. Here, the proportion of animals displaying each phenotype could be compared across different alleles of the same gene to determine if specific mutations affect certain developmental processes more than others. In other instances, numerical data, such as the number of mRNA molecules in an embryo, may also benefit from imposing some broader categorization. For example, embryos might be divided into those containing 0–10, 11–50, 51–200, and >200 transcripts. These outcomes could then be compared across different mutant backgrounds or at different developmental time points to identify broad categorical differences in expression. In all of the above cases, a Chi-square goodness-of-fit test would be appropriate to determine if any differences in the observed proportions are statistically significant.

5. Relative differences, ratios, and correlations

5.1. Comparing relative versus incremental differences

It is common in biology for relative changes to be more germane than incremental ones. There are two principal reasons for this. One is that certain biological phenomena can only be properly described and understood through relative changes. For example, if we were to count the number of bacterial cells in a specified volume of liquid culture every hour, we might derive the following numbers: 1,000, 2,000, 4,000, 8,000, 16,000. The pattern is clear; the cells are doubling every hour. Conversely, it would be ridiculous to take the mean of the observed changes in cell number and to state that, on average, the cells increase by 3,750 each hour with a 95% CI of −1,174.35 to 8,674.35! The second reason is due to experimental design. There are many instances where variability between experiments or specimens makes it difficult, if not impossible, to pool mean values from independent repeats in a productive way. Rather, the ratio of experimental and control values within individual experiments or specimens should be our focus. One example of this involves quantifying bands on a western blot and is addressed below. Most traditional statistical approaches, however, are oriented toward the analysis of incremental changes (i.e., where change is measured by subtraction). Thus, it may not always be clear how to analyze data when the important effects are relative.

As an example of a situation in which ratios are likely to be most useful, we consider the analysis of a western blot ( Figure 13 ) (also see Gassmann et al., 2009 ). This schematic blot shows the outcome of an experiment designed to test the hypothesis that loss of gene y activity leads to changes in the expression of protein X in C. elegans . In one scenario, the three blots (A–C) could represent independent biological repeats with lanes 1–3 serving as technical (e.g., loading) repeats. In another scenario, the three blots could serve as technical repeats with lanes 1–3 representing independent biological repeats. Regardless, either scenario will give essentially the same result. Quantification of each band, based on pixel intensity, is indicated in blue 46 . Based on Figure 13 , it seems clear that loss of gene y leads to an increase in the amount of protein X. So does the statistical analysis agree?

Representative western blot analysis.

Figure 14 shows the results of carrying out the statistical analysis in several different ways. This includes, for illustrative purposes, seven distinct two-tailed t -tests (1–7). In t -tests 1–3, data from wild-type and mut y bands were pooled within individual blots to obtain an average. Interestingly, only one of the three, blot A, showed a statistically significant difference ( P ≤ 0.05) between wild type and mut y , despite all three blots appearing to give the same general result. In this case, the failure of blots B and C to show a significant difference is due to slightly more variability between samples of the same kind (i.e., wild type or mut y ) and because with an n = 3, the power of the t -test to detect small or even moderate differences is weak. The situation is even worse when we combine subsets of bands from different blots, such as pooled lanes 1, 2, and 3 ( t -tests 4–6). Pooling all of the wild-type and mut y data ( t -test 7) does, however, lead to a significant difference ( P = 0.0065).

Statistical analysis of western blot data from Figure 13. A summary of test options is shown.

So was the t -test the right way to go? Admittedly, it probably wasn't very satisfying that only one of the first three t -tests indicated a significant difference, despite the raw data looking similar for all three. In addition, we need to consider whether or not pooling data from different blots was even kosher. On the one hand, the intensity of the protein X band in any given lane is influenced by the concentration of protein X in the lysate, which is something that we care about. On the other hand, the observed band intensity is also a byproduct of the volume of lysate loaded, the efficiency of protein transfer to the membrane, the activity of the radiolabel or enzymes used to facilitate visualization, and the length of the exposure time, none of which are relevant to our central question! In fact, in the case of western blots, comparing intensities across different blots is really an apples and oranges proposition, and thus pooling such data violates basic principles of logic 47 . Thus, even though pooling all the data ( t -test 7) gave us a sufficiently low P -value as to satisfy us that there is a statistically significant difference in the numbers we entered, the premise for combining such data was flawed scientifically.

The last test shown in Figure 14 is the output from confidence interval calculations for two ratios. This test was carried out using an Excel tool that is included in this chapter 48 . To use this tool, we must enter for each paired experiment the mean (termed “estimate”) and the SE (“SE of est”) and must also choose a confidence level ( Figure 15 ). Looking at the results of the statistical analysis of ratios ( Figure 14 ), we generally observe much crisper results than were provided by the t -tests. For example, in the three cases where comparisons were made only within individual blots, all three showed significant differences corresponding to P < 0.05 and two (blots A and B) were significant to P < 0.01 49 . In contrast, as would be expected, combining lane data between different blots to obtain ratios did not yield significant results, even though the ratios were of a similar magnitude to the blot-specific data. Furthermore, although combining all values to obtain means for the ratios did give P < 0.05, it was not significant at the α level of 0.01. In any case, we can conclude that this statistical method for acquiring confidence intervals for ratios is a clean and powerful way to handle this kind of analysis.

Confidence interval calculator for a ratio.

It is also worth pointing out that there is another way in which the t -test could be used for this analysis. Namely, we could take the ratios from the first three blots (3.33, 3.41, and 2.48), which average to 3.07, and carry out a one-sample two-tailed t -test. Because the null hypothesis is that there is no difference in the expression of protein X between wild-type and mut y backgrounds, we would use an expected ratio of 1 for the test. Thus, the P -value will tell us the probability of obtaining a ratio of 3.07 if the expected ratio is really one. Using the above data points, we do in fact obtain P = 0.02, which would pass our significance cutoff. In fact, this is a perfectly reasonable use of the t -test, even though the test is now being carried out on ratios rather than the unprocessed data. Note, however, that if we change the numbers only slightly to 3.33, 4.51, and 2.48, we get a mean of 3.44 but a corresponding P -value of 0.054. This again points out the problem with t -tests when one has very small sample sizes and moderate variation within samples.
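
For completeness, this one-sample test on the ratios is a one-liner in most statistical environments; the sketch below (our own) reproduces both P-values quoted above:

from scipy.stats import ttest_1samp

ratios = [3.33, 3.41, 2.48]                    # mut y / wild-type ratios for blots A-C
t_stat, p_value = ttest_1samp(ratios, popmean=1)
print(f"mean ratio = {sum(ratios) / len(ratios):.2f}, P = {p_value:.3f}")   # ~3.07, P ~0.02

# With slightly noisier ratios the same test loses significance, illustrating
# how fragile t-tests are at n = 3.
t_stat2, p_value2 = ttest_1samp([3.33, 4.51, 2.48], popmean=1)
print(f"P = {p_value2:.3f}")                   # ~0.054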

5.2. Ratio of means versus mean of ratios

There is also an important point to be made with respect to ratios that concerns the mean value that you would report. Based on the above paragraph, the mean of the three ratios is 3.07. However, looking at t -test 7, which used the pooled data, we can see that the ratio calculated from the total means would be 477/167 = 2.86. This points out a rather confounding property of ratio arithmetic: namely, that the mean of the ratios (MoR; in this case 3.07) is usually not equal to the ratio of the means (RoM; 2.86). Which of the two you choose to report will depend on the question you are trying to answer.
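
Using the numbers quoted above, the distinction is easy to see:

ratios = [3.33, 3.41, 2.48]                    # blot-specific ratios (MoR input)
mean_of_ratios = sum(ratios) / len(ratios)     # ~3.07
ratio_of_means = 477 / 167                     # ~2.86 (pooled band means from t-test 7)
print(f"MoR = {mean_of_ratios:.2f}, RoM = {ratio_of_means:.2f}")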

To use a non-scientific (but intuitive) example, we can consider changes in housing prices over time. In a given town, the current appraised value of each house is compared to its value 20 years prior. Some houses may have doubled in value, whereas others may have quadrupled ( Table 5 ). Taking an average of the ratios for individual houses (i.e., the relative increase)—the MoR approach—allows us to determine that the mean increase in value has been 3-fold. However, it turns out that cheaper houses (those initially costing ≤$100,000) have actually gone up about 4-fold on average, whereas more-expensive homes (those initially valued at ≥$300,000) have generally only doubled. Thus, the total increase in the combined value of all the homes in the neighborhood has not tripled but is perhaps 2.5-fold higher than it was 20 years ago (the RoM approach).

Table 5. A tinker-toy illustration for increases in house prices in TinyTown (which has only two households).


Which statistic is more relevant? Well, if you're the mayor and if property taxes are based on the appraised value of homes, your total intake will be only 2.5 times greater than it was 20 years ago. If, on the other hand, you are writing a newspaper article and want to convey the extent to which average housing prices have increased over the past 20 years, 3-fold would seem to be a more salient statistic. In other words, MoR tells us about the average effect on individuals, whereas RoM conveys the overall effect on the population as a whole. In the case of the western blot data, 3.07 (i.e., the MoR) is clearly the better indicator, especially given the stated issues with combining data from different blots. Importantly, it is critical to be aware of the difference between RoM and MoR calculations and to report the statistic that is most relevant to your question of interest.

5.3. Log scales

Data from studies where relative or exponential changes are pervasive may also benefit from transformation to log scales. For example, transforming to a log scale is the standard way to obtain a straight line from data that change exponentially. This can make for a more straightforward presentation and can also simplify the statistical analysis (see Section 6.4 on outliers). Thus, transforming 1, 10, 100, 1,000 into log 10 gives us 0, 1, 2, 3. Which log base you choose doesn't particularly matter, although ten and two are quite intuitive, and therefore popular. The natural log 50 (base ∼2.718), however, has historical precedent within certain fields and may be considered standard. In some cases, back transformation (from log scale to linear) can be done after the statistical analysis to make the findings clearer to readers.

5.4. Correlation and modeling

For some areas of research (such as ecology, field biology, and psychology), modeling, along with its associated statistics, is a predominant form of analysis. This is not the case for most research conducted in the worm field or, for that matter, by most developmental geneticists or molecular biologists. Correlation, in contrast, can be an important and useful concept. For this reason, we include a substantive, although brief, section on correlation, and a practically non-existent section on modeling.

Correlation describes the co-variation of two variables. For example, imagine that we have a worm strain that expresses reporters for two different genes, one labeled with GFP, the other with mCherry 51 . To see if expression of the two genes is correlated, GFP and mCherry are measured in 50 individual worms, and the data are plotted onto a graph known as a scatterplot . Here, each worm is represented by a single dot with associated GFP and mCherry values corresponding to the x and y axes, respectively ( Figure 16 ). In the case of a positive correlation, the cloud of dots will trend up to the right. If there is a negative correlation, the dots will trend down to the right. If there is little or no correlation, the dots will generally show no obvious pattern. Moreover, the closer the dots come to forming a unified tight line, the stronger the correlation between the variables. Based on Figure 16 , it would appear that there is a positive correlation, even if the dots don't fall exactly on a single line. Importantly, it matters not which of the two values (GFP or mCherry) is plotted on the x or the y axes. The results, including the statistical analysis described below, will come out exactly the same.

Scatterplot of GFP expression versus mCherry. The correlation coefficient is ∼0.68. The units on the axes are arbitrary.

The extent of correlation between two variables can be quantified through calculation of a statistical parameter termed the correlation coefficient (a.k.a. Pearson's product moment correlation coefficient , Pearson's r , or just r ). The formula is a bit messy and the details are not essential for interpretation. The value of r can range from −1 (a perfect negative correlation) to 1 (a perfect positive correlation), or can be 0 52 in the case of no correlation. Thus, depending on the tightness of the correlation, values will range from close to zero (weak or no correlation) to 1 or −1 (perfect correlation). In our example, if one of the two genes encodes a transcriptional activator of the other gene, we would expect to see a positive correlation. In contrast, if one of the two genes encodes a repressor, we should observe a negative correlation. If expression of the two genes is in no way connected, r should be close to zero, although random chance would likely result in r having either a small positive or negative value. Even in cases where a strong correlation is observed, however, it is important not to make the common mistake of equating correlation with causation 53 .

Like other statistical parameters, the SD, SE, and 95% CI can be calculated for r . In addition, a P -value associated with a given r can be determined, which answers the following question: What is the probability that random chance resulted in a correlation coefficient as far from zero as the one observed? As with other statistical tests, larger sample sizes make it easier to detect small correlations that are statistically significant. Nevertheless, it is important to look beyond the P -value in assessing biological significance, especially if r is quite small. The validity of these calculations also requires many of the same assumptions described for other parametric tests, including the assumption that the data have something close to a normal distribution. Furthermore, it is essential that the two parameters are measured separately and that the value for a given x is not somehow calculated using the value of y and vice versa.
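
As an illustration only (the actual values plotted in Figure 16 are not reproduced here), the sketch below simulates 50 worms with positively co-varying reporter levels and computes r, its square, and the associated P-value. The covariance of 0.7 is our own choice, picked so that r lands in the same general range as the chapter's ∼0.68:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
cov = [[1.0, 0.7],
       [0.7, 1.0]]                                       # positive co-variation
gfp, mcherry = rng.multivariate_normal([5, 5], cov, size=50).T

r, p_value = pearsonr(gfp, mcherry)
print(f"r = {r:.2f}, r squared = {r**2:.2f}, P = {p_value:.4g}")
# r squared is the fraction of variability in one reporter "explained" by the other.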

Examples of six different scatterplots with corresponding r and P -values are shown in Figure 17 . In addition to these values, a black line cutting through the swarm of red dots was inserted to indicate the slope. This line was determined using a calculation known as the least squares or linear least squares method . The basic idea of this method is to find the straight line that best represents the trend indicated by the data, which it does by minimizing the summed squared vertical distances between the data points and the line. Finally, blue dashed lines indicate the 95% CI of the slope. This means that we can be 95% certain that the true slope (for the population) resides somewhere between these boundaries.

Fits and misfits of regression lines. The units on the axes are arbitrary.

Panels A–C of Figure 17 show examples of a strong ( r = 0.86), weak ( r = −0.27), and nonexistent ( r = 0.005) correlation, respectively. The purple line in panel B demonstrates that a slope of zero can be fit within the 95% CI, which is consistent with the observed P -value of 0.092. Panel D illustrates that although small-sized samples can give the impression of a strong correlation, the P -value may be underwhelming because chance sampling could have resulted in a similar outcome. In other words, similar to SD, r is not affected by sample size 54 , but the P -value most certainly will be. Conversely, a large sample size will detect significance even when the correlation coefficient is relatively weak. Nevertheless, for some types of studies, a small correlation coefficient with a low P -value might be considered scientifically important. Panels E and F point out the dangers of relying just on P -values without looking directly at the scatterplot. In Panel E, we have both a reasonably high value for r and a low P -value. Looking at the plot, however, it is clear that a straight line is not a good fit for these data points, which curve up to the right and eventually level out. Thus, the reported r and P -values, though technically correct, would misrepresent the true nature of the relationship between these variables. In the case of Panel F, r is effectively zero, but it is clear that the two variables have a very strong relationship. Such examples would require additional analysis, such as modeling, which is described briefly below.

Another very useful thing about r is that it can be squared to give R 2 (or r 2 ), also called the coefficient of determination . The reason R 2 is useful is that it allows for a very easy interpretation of the relationship between the two variables. This is best shown by example. In the case of our GFP/mCherry experiment, we obtained r = 0.68, and squaring this gives us 0.462. Thus, we can say that 46.2% of the variability in mCherry can be explained by differences in the levels of GFP. The rest, 53.8%, is due to other factors. Of course, we can also say that 46.2% of the variability in GFP can be explained by differences in the levels of mCherry, as the R 2 itself does not imply a direction. That said, if GFP is a reporter for a transcription factor and mCherry is a reporter for a structural gene, a causal relationship, along with a specific regulatory direction, is certainly suggested. In this case, additional experiments would have to be carried out to clarify the underlying biology.

5.5. Modeling and regression

The basic idea behind modeling and regression methods is to come up with an equation that can make useful predictions or describe the behavior of a system. In simple linear regression , a single predictor or independent variable , such as the GFP intensity of a heat-shock reporter, might be used to predict the behavior of a response or dependent variable , such as the life span of a worm 55 . The end result would be an equation 56 that describes a line that is often, although not always, straight 57 . Multiple regression is an extension of simple linear regression, but it utilizes two or more variables in the prediction 58 . A classic example of multiple regression used in many statistics texts and classes concerns the weight of bears. Because it's not practical to weigh bears in the field, proxy measures such as head circumference, body length, and abdominal girth are acquired and fitted to an equation (by a human-aided computer or a computer-aided human), such that approximate weights can be inferred without the use of a scale. Like single and multiple linear regression, nonlinear regression also fits data (i.e., predictive variables) to a curve that can be described by an equation. In some cases, the curves generated by nonlinear regression may be quite complex. Unlike linear regression, however, the fit cannot be solved using simple algebra; nonlinear regression is an iterative method, and the mathematics behind its workings are relatively complex. It is used in a number of fields including pharmacology. Logistic regression uses one or more factors to predict the probability or odds of a binary or dichotomous outcome, such as life or death. It is often used to predict or model mortality given a set of factors; it is also used by employers in decisions related to hiring or by government agencies to predict the likelihood of criminal recidivism 59 .
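
As a minimal sketch of simple linear regression (the data below are invented and the variable names are ours), one might fit heat-shock reporter intensity against life span as follows:

from scipy.stats import linregress

gfp_intensity = [1.2, 2.0, 2.8, 3.1, 4.0, 4.6, 5.3, 6.1]   # arbitrary units (invented)
lifespan_days = [14, 15, 17, 16, 19, 20, 22, 23]           # hypothetical life spans

fit = linregress(gfp_intensity, lifespan_days)
print(f"life span ~ {fit.intercept:.1f} + {fit.slope:.1f} x GFP intensity "
      f"(r = {fit.rvalue:.2f}, P = {fit.pvalue:.3g})")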

6. Additional considerations and guidelines

6.1. When is a sample size too small?

Several issues can arise with small sample sizes. One is that with weak statistical power , we may simply fail to detect a finding of interest, a point that is addressed below. Even when we do detect a significant effect, however, our conclusions may be undermined because one of the underlying assumptions of many statistical tests—that the population from which the data are derived is approximately normally distributed —is not true. This is a bit of a chicken and egg conundrum. If our sample size is sufficiently large (e.g., >30), then we can check to see if the underlying population is likely to have a normal (or “normal enough”) distribution. If it does, then we're set. If not, we may still be safe because our sample size is sufficient to compensate for lack of normality in the population. Put another way, most statistical tests are robust to all but the most skewed distributions provided that the sample size is large enough. Conversely, with a small sample size, we can't tell much, or perhaps anything, about the underlying normality of the population. Moreover, if the population deviates far enough from normal, then our statistical tests will not be valid.

So, what to do? As already discussed in earlier sections, in some cases it may be reasonable to assume that the underlying distribution is likely normal enough to proceed with the test. Such a decision could be reasonable if this coincides with accepted standards in the field or in cases where the laboratory has previously generated similar larger data sets and shown them to have approximately normal distributions. In other cases, you may decide that collecting additional data points is necessary and justified. Alternatively, you may need to consider using a non-parametric statistical approach for which normality of the data is not a prerequisite (discussed below). Typically, non-parametric tests will have less power, however, so the effects may need to be stronger to achieve statistical significance. If you are uncertain about what to do, consulting a nearby statistician is recommended. In general, one should be cautious about interpreting P -values, especially ones that would be considered borderline (e.g., P = 0.049), when normality is uncertain and sample sizes are small.
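
The chapter does not prescribe a particular normality check, but as one commonly used option (our suggestion, not the authors'), a formal test such as Shapiro-Wilk can be run alongside a histogram of the data:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)
sample = rng.lognormal(mean=2.0, sigma=0.6, size=40)   # deliberately skewed example data

stat, p_value = shapiro(sample)
print(f"Shapiro-Wilk P = {p_value:.4f}")
# A small P-value (e.g., <0.05) is evidence against normality. With small samples
# the test has little power, so a large P-value is not proof of normality.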

6.2. Statistical power

We can also say that a sample size is too small when real effects that would be considered biologically interesting or important fail to be detected or supported by our analyses. Such cases may be referred to as false negatives or Type II errors. Of course, we are seldom in a position of knowing a priori what any outcome should be. Otherwise, why would we be doing the experiment? Thus, knowing when a Type II error has occurred is not possible. Likewise, knowing when a false positive or Type I error has occurred is also impossible. In this latter case, however, the chosen α level represents a stated degree of acceptable risk that a false positive will sneak through. So what should be the allowable risk for a false negative? There is no simple answer to this. In cases where significant health or environmental consequences could arise, the acceptable level of risk may be very low. In contrast, if we are screening an entire genome where we are likely to uncover hundreds or thousands of positive hits, missing a few genes may not be a problem.

The branch of statistics that deals with issues related to false-negative findings is called power analysis . Power analysis answers the following question: If there is a real effect of a certain magnitude, what is the probability that a study of a given size will detect it? Put another way, if the experiment was repeated many times, what fraction would lead to statistically significant outcomes? The answer to this will depend on several factors. Detecting a statistically significant effect is easier when the magnitude of the effect (e.g., the difference between means) is larger and when the variation within samples (e.g., SD) is smaller. In addition, false negatives are obviously more likely to occur with lower α levels than with higher ones (e.g., 0.001 versus 0.05).

In the case of clinical drug trials, the relevance of statistical power is self-evident. Pharmaceutical companies want to be efficient with their time and money, but not at the expense of falsely concluding that a drug is without significant benefits 60 . Thus, prior to any clinical trial, a power analysis is carried out to determine how many subjects must undergo the treatment in order for the study to stand a good chance of detecting a benefit of some specified size. Presumably, failing to detect an effect that is smaller in magnitude than the agreed-upon cutoff would be okay because a relatively ineffective drug would have little commercial value. Power analysis is also critical to biological field research, where collecting samples may be time consuming and costly. In other words, why spend a year chasing after data if an initial power test tells you that you'll be unlikely to detect an effect that you'd consider essential for the endeavor to be worthwhile?

Admittedly, power analysis is not something that springs to the minds of most C. elegans researchers prior to conducting an experiment. We tend to run the experiment with what we think is a reasonable number of specimens and just see what happens. There are potentially two problems with this “shoot from the hip” approach. One is that we may fail to detect true positives of interest. The second is that what we may be doing is already overkill. Namely, our sample sizes may be larger than necessary for detecting biologically significant effects. In particular, if we are doing genome-wide screens, it is in our interest to be as efficient with our time (and money) as possible.

Rather than discussing the formulas that go into calculating power, this section contains links to Excel tools, along with an instructional manual, that can be used to calculate the power of an analysis when comparing means, proportions, and other statistical parameters of interest. In addition, a number of websites and software programs can also serve this purpose. Our intention is that these tools be used by investigators to make accurate projections as to the sample size that will be required to observe effects of a given magnitude. For our example, we will imagine that we are testing RNAi feeding clones that correspond to several hundred transcription factors to see if they affect expression of an embryonic GFP reporter at the comma stage. Because quantifying expression in individual embryos is fairly time consuming, we want to calculate the minimum number of embryos required per RNAi strain, such that we have at least a 90% chance of identifying RNAi clones that lead to some minimum acceptable fold-change. Based on what we know about the biology of the system, we decide that expression increases or decreases of at least 1.5-fold are likely to be most relevant, although slightly smaller changes could also be of interest.

First we analyze GFP expression in 100 embryos from control RNAi plates. We find that the average number of pixels per embryo is 5000, that the SD = 2000, and that the coefficient of variation is therefore 0.4 (2000/5000). We also learn that expression in comma-stage embryos is approximately normally distributed. For an effect to be of interest then, the average embryonic expression would have to be >7500 or <3300. In addition, we will assume that the SDs of the test samples vary proportionally with the mean. In other words, the SD for our test clones will be identical to our control in cases where there is no difference in mean expression but will be proportionally larger or smaller when RNAi enhances or decreases expression, respectively. This latter assumption is common and fairly intuitive given that larger means are typically associated with larger SDs, including situations where the coefficient of variation remains constant. Finally, we choose the standard α of 0.05.

Plugging these numbers into the Excel tool, we find that with a power of 90%, 18 embryos should be sufficient to detect a 1.5-fold (or 50%) increase in mean GFP expression, whereas 9 embryos are sufficient to detect a 1.5-fold (or 34%) decrease (using a two-tailed t -test; Figure 18A, B ). Playing with the tool, we also notice that increasing our sample size to 22 would give us 95% power to detect 1.5-fold increases ( Figure 18C ) and >99% power to detect decreases of 1.5-fold or greater. In fact, with 22 embryos, we'd have a 90% chance of detecting a 1.33-fold (or 25%) decrease in expression ( Figure 18D ). This seems like a good deal to us and is experimentally feasible, so we decide ultimately to go with 22 embryos as our target sample size. In fact, this kind of back-and-forth negotiation with respect to sample size, sensitivity, and power is exactly what is supposed to take place. In any event, by carrying out the power analysis before we do our screen, we can be much more confident that our studies will be efficient and fruitful.

Examples of output from Power Calculator. This Excel tool is available along with this chapter.
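
For readers who prefer code to spreadsheets, a simplified version of the same calculation can be sketched with the statsmodels Python package. Note that this version assumes equal group sizes and a single pooled SD, so its answer will not exactly match the Excel tool's 18 and 9 embryos; it is included only to show the mechanics of solving for a sample size:

from math import sqrt
from statsmodels.stats.power import TTestIndPower

control_mean, control_sd = 5000, 2000
test_mean = 7500                                  # a 1.5-fold increase in expression
test_sd = 0.4 * test_mean                         # SD assumed proportional to the mean
pooled_sd = sqrt((control_sd**2 + test_sd**2) / 2)
effect_size = (test_mean - control_mean) / pooled_sd     # Cohen's d

n_per_group = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                          power=0.90, alternative="two-sided")
print(f"approximately {n_per_group:.0f} embryos per group")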

6.3. Can a sample size be too large?

Although it may sound paradoxical, a larger-than-necessary sample size not only may be wasteful but also could lead to results that are grossly misleading. There are two reasons for this, one biological and one ethical. In the case of the former, large sample sizes can lead to statistically significant outcomes even when differences are so small in magnitude as to be biologically irrelevant. In this case, the detected difference is real, it's just not important. As previously stated, statistical significance should never be confused with biological significance. Unfortunately, some readers may be tempted to rely on the reported P -value as a kind of proxy for discerning the bottom line with respect to “significance” in the broader sense. Don't fall into this trap.

The ethical issue surrounding large sample sizes can be illustrated by a thought experiment. Imagine that you are testing for differences between two populations that are in fact identical for some trait of interest (admittedly, something you could never know beforehand). An initial sample size of 10 from each group shows a slight difference in means, as could be expected, but this difference isn't significant at the standard α level of 0.05. Not to be discouraged, you then increase each sample size by 10 and try the test again. If there is still no significant difference, you try adding another 10 and repeat the analysis. Will you eventually get a P -value ≤ 0.05? To answer this question, first recall the meaning of α = 0.05. Five percent of the time, the test will show a statistically significant difference even when the two populations are identical. Put another way, if you were to carry out 100 tests, each with a sample size of 10, 5 tests, on average, would give P ≤ 0.05 by random chance. Had you carried out 100 tests, would you only report the five that were significant? Hopefully not! Well, it turns out that that is effectively what you would be doing if you were to repeatedly increase your sample size with the intent of uncovering a positive finding. At some point, the two samples will show a statistically significant difference by chance—it's guaranteed! In fact, over the lifetime of such a “never-ending study”, chance differences will lead to P ≤ 0.05 5% of the time, P ≤ 0.01 1% of the time, and, if one were to be particularly persistent, a P -value of <0.001 could most certainly be achieved. Obviously the magnitude of the effects in this scenario would be very small (irrelevant in fact), but that won't matter to the statistical test. This kind of ad hoc approach is clearly a major no-no, but it is a surprisingly easy trap to fall into. So remember that an experiment carried out ad infinitum will eventually give you a significant P -value even if none exists. Power analysis, as discussed above, can be a rational way to determine the necessary sample size in advance based on a specified α level and a pre-determined minimal difference of interest.
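
A small simulation (ours, not the chapter's) makes the point concrete: both groups below are drawn from the very same population, yet testing repeatedly as the samples grow declares a "significant" difference far more than 5% of the time:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(11)
n_simulations, false_positives = 1000, 0

for _ in range(n_simulations):
    a = list(rng.normal(0, 1, 10))            # both groups come from the SAME population
    b = list(rng.normal(0, 1, 10))
    for _ in range(20):                       # peek at the data, then add 10 more per group
        if ttest_ind(a, b).pvalue <= 0.05:
            false_positives += 1
            break
        a.extend(rng.normal(0, 1, 10))
        b.extend(rng.normal(0, 1, 10))

print(f"runs declared 'significant': {false_positives / n_simulations:.2f}")
# Far above the nominal 0.05, even though no true difference exists.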

6.4. Dealing with outliers

When it comes to outliers, there is no real consensus—not about how to formally define them, not about how best to detect them, and, most importantly, not about what to actually do with them. There is, of course, universal agreement that any data that are reported in the literature should be accurate and real. The slippery slope comes into play because discarding a data point that falls outside of the expected range of values also serves the purpose of yielding a cleaner, more appealing, and more statistically convincing result. In fact, getting rid of outliers, even ones that should likely be discarded, nearly always does just that. Note that we do not endorse the strategy of throwing out objectionable data points simply to make your results appear cleaner!

Outliers can arise in several ways. One is through incorrect data entry. Another is through experimental error, either on the part of the investigator or the apparatus being used. If there is good reason to believe that any data point, outlier or not, is likely to be flawed in either of these ways, getting rid of it is an easy call. Outliers, however, may also be legitimate, biologically correct data points that have occurred either through chance sampling or because the biology of the system is prone to producing such values. In fact, if it's the underlying biology that's driving things, this may turn out to be the most interesting outcome of the experiment.

Outliers principally affect statistics that are based on continuous variables , such as means and SDs, but do not generally have much of an impact on discrete or non-continuous variables , such as medians. For example, the median of the five values 4, 6, 7, 8, 10 is the same as the median of 4, 6, 7, 8, 235. In contrast, the means and SDs of these two data sets are very different. The resistance of median values to the effects of outliers provides a reasonable argument for using medians over means under certain circumstances. For example, in a subdivision with 100 houses, a mean value of $400,000 may be due to a handful of multi-million dollar properties. A median value of $250,000, however, might better reflect the cost of a typical home. Some problems with medians are that they are not as widely used in the scientific literature as means and also require a different set of statistical tools for their analysis. Specifically, the statistical analysis of medians is carried out using nonparametric approaches, such as a sign test , or through computational methods, such as bootstrapping (discussed below).

Concluding that an unexpected or anomalous value is an outlier deserving of banishment is not something that should ever be done informally. Some would argue that it's not something that should ever be done formally either. In any case, the danger of the informal or ad hoc approach is that our eyes, as well as our preconceptions, will identify many more outliers than really exist. In addition, our hopes and biases for certain experimental outcomes can cloud our judgment as to which values should be removed. Several formal tests can be used to identify “rejectable” outliers including the Grubbs outlier test and Dixon's Q test . These tests will answer the following question: Given a specified sample size, what is the probability that a value of this magnitude would occur assuming that the population from which the sample is derived is normally distributed ? If this probability is ≤0.05, you then have legitimate grounds for tossing the data point. Alternatively, you can report your findings with and without the outlier present and let readers make up their own minds.

Outliers can also appear to be present in samples from populations that have lognormal distributions . In fact, without taking this into account, you could mistakenly reject a legitimate data point based on a statistical test that assumes normality. Lognormal distributions turn out to be quite common in biology. Whereas normal distributions result when many factors contribute additively to some measured outcome, lognormal distributions occur when factors act in a multiplicative fashion. Thus, sample data that appear to contain an outlier or otherwise look to be non-Gaussian should be tested first by transforming the raw data into their corresponding log values and plotting them on a histogram. If the distribution of the transformed values looks normal, then the original distribution was probably lognormal. To handle things statistically, you can use the log-transformed data to carry out tests that assume normality. Back-transformations to the linear scale can also be carried out if necessary to report certain values. In the case of means, this will produce a parameter referred to as the geometric mean , which (like the median) will tend to give values that are more typical of the data set. For example, we can transform the numbers 6, 12, 19, 42, and 951 to their log 10 values (0.78, 1.1, 1.3, 1.6, 3.0), calculate their mean (1.55), and then back-transform (10 x ) to get a geometric mean of 35.3, which differs substantially from the standard or arithmetic mean of 206.
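
The geometric-mean calculation above takes only a few lines; the sketch below reproduces the values of 35.3 and 206:

import numpy as np
from scipy.stats import gmean

values = [6, 12, 19, 42, 951]
log_vals = np.log10(values)                   # ~0.78, 1.08, 1.28, 1.62, 2.98
geometric_mean = 10 ** log_vals.mean()        # back-transform the mean of the logs
print(f"arithmetic mean = {np.mean(values):.0f}, geometric mean = {geometric_mean:.1f}")
print(f"scipy shortcut: {gmean(values):.1f}") # same geometric mean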

Another approach, which may be particularly well suited to handling situations where inconsistent or spurious data tend to arise at either end of the spectrum, is to use trimmed or truncated data. For example, a 10% trimmed mean would remove 10% of the highest values AND 10% of the lowest values. Thus, for the series 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, both 1 and 10 would be removed before calculating the mean. This method can provide more information than the median 61 but is conveniently resistant to the effects of outliers. An example of an appropriate use for a trimmed mean might involve quantifying protein expression in C. elegans embryos using immunostaining. Even if done carefully, this method can result in a small fraction of embryos that have no staining as well as some that appear to have high levels of non-specific staining. Thus, the trimmed mean would provide a reasonable measure of the central tendency of the dataset.

6.5. Nonparametric tests

Most of the tests described in the above sections rely on the assumption that the data are sampled from populations with relatively normal distributions 62 . Such tests, which include the t-test, are referred to as parametric tests . Nonparametric tests should be used if the population distributions are known or suspected to deviate far from normality. In addition, nonparametric tests may be appropriate if a population deviates only moderately from a normal distribution but the sample size is insufficient to compensate for this at the level of the distribution of the statistic . Although it would be nice to be able to provide some cut-and-dried criteria for choosing between parametric and nonparametric tests, the minimum sample size required for conducting a parametric test will depend on the degree to which the underlying population deviates from normality. The larger the deviation, the larger the sample size must be for parametric tests to be valid. Although some texts recommend a sample size of 30 as being safe, this may be insufficient if the population distribution is very far from normal. Nonparametric tests are also required when the statistic of interest, such as the median, is not well handled by standard parametric methods.

The basis for many nonparametric tests involves discarding the actual numbers in the dataset and replacing them with numerical rankings from lowest to highest. Thus, the dataset 7, 12, 54, 103 would be replaced with 1, 2, 3, and 4, respectively. This may sound like an odd way to treat data, but the general rank-based approach is well grounded. In the case of the Mann-Whitney test, which is used to compare two unpaired groups, data from both groups are combined and ranked numerically (1, 2, 3, … n ). Then the rank numbers are sorted back into their respective starting groups, and a mean rank is tallied for each group 63 . If both groups were sampled from populations with identical distributions (the null hypothesis), then there should be relatively little difference in their mean ranks, although chance sampling will lead to some differences. Put another way, high- and low-ranking values should be more or less evenly distributed between the two groups. Thus for the Mann-Whitney test, the P-value will answer the following question: What is the probability of obtaining a difference in mean ranks at least this large if the two groups were sampled from identical populations? As for parametric tests, a P-value ≤ 0.05 is traditionally accepted as statistically significant.
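A minimal R sketch of the ranking idea, together with wilcox.test(), base R's implementation of the Mann-Whitney/Wilcoxon rank-sum test; the two small samples are invented for illustration:

    groupA <- c(7, 12, 54, 103)
    groupB <- c(5, 9, 16, 22)

    combined <- c(groupA, groupB)
    ranks <- rank(combined)                            # replace raw values with ranks 1-8
    tapply(ranks, rep(c("A", "B"), each = 4), mean)    # mean rank for each group

    wilcox.test(groupA, groupB)                        # Mann-Whitney / Wilcoxon rank-sum test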

The obvious strength of nonparametric tests is that they allow an investigator to conduct a statistical analysis that would otherwise not be possible (or at least valid) using traditional parametric tests. A downside is that nonparametric tests are somewhat less powerful than parametric tests and so should be applied only when necessary 64 . The reduced power is due to the fact that quantitative information is discarded when ranks are assigned. In addition, nonparametric tests often lack the ability to provide CIs, which can be quite useful. Finally, nonparametric tests can underestimate the effects of outliers, which may be valid data points of real biological significance. Below is a brief list of some common nonparametric tests and their applications, followed by a short sketch of how several of them are typically invoked in software. Note that although these are so-called nonparametric tests, most do require certain assumptions about the data and collection methods for their results to be valid. Be sure to read up on these methods or consult a statistician before you apply them.

The Mann-Whitney test (a.k.a. the Wilcoxon rank-sum test and Mann-Whitney U-test) is used in place of a two-sample t-test to compare two groups. Variations on this theme are also used to test for differences in medians.

The Wilcoxon matched pairs signed-rank test (a.k.a. Wilcoxon's test and Wilcoxon signed-rank test) is used in place of the paired t -test for matched data.

The one-sample sign test , which is similar to a one-sample t-test, is used for medians and can also provide CIs. No major assumptions are required.

The one-sample Wilcoxon signed-rank test has the same applications as the one-sample sign test but is more powerful. It does, however, assume that the distribution of values is symmetric around the median.

The Kruskal-Wallis test is used in place of a one-way ANOVA.

The rank correlation test (a.k.a. Spearman's rank correlation coefficient and Spearman's rho) is used in place of linear correlation.

The logrank test (a.k.a. the Mantel-Cox test) is used to compare survival curves.
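For reference, several of the tests listed above correspond directly to functions that ship with base R (the logrank test, which requires the survival package, is illustrated in the next section); a compact sketch with invented data:

    set.seed(1)

    before <- c(10, 12, 9, 14, 11)
    after  <- c(13, 15, 11, 16, 12)
    wilcox.test(after, before, paired = TRUE)    # Wilcoxon matched-pairs signed-rank test

    values <- c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 8))
    group  <- factor(rep(c("wt", "mutA", "mutB"), each = 10))
    kruskal.test(values ~ group)                 # Kruskal-Wallis in place of one-way ANOVA

    x <- rnorm(20); y <- x + rnorm(20)
    cor.test(x, y, method = "spearman")          # Spearman's rank correlation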

6.6. A brief word about survival

In the fields of aging and stress, it is common to compare two or more strains for differences in innate laboratory lifespan or in survival following an experimental insult. Relative to studies on human subjects, survival analysis in C. elegans and other simple laboratory organisms is greatly simplified by the nature of the system. The experiments generally begin on the same day for all the subjects within a given sample, and the study is typically concluded after all subjects have completed their lifespans. In addition, individual C. elegans rarely need to be censored . That is, they don't routinely move away without leaving a forwarding address (although they have been known to crawl off plates) or stop taking their prescribed medicine.

One relatively intuitive approach in these situations would be to calculate the mean duration of survival for each sample along with the individual SDs and SEs. These averages could then be compared using a t-test or related method. Though seemingly straightforward, this approach has several caveats and is not recommended (Suciu et al., 2004). One issue is that survival data are often quite skewed, and therefore tests that rely on assumptions of normality (e.g., the t-test) may not be valid. A second is that the t-test will have less power than some well-accepted alternative tests to detect small, but biologically meaningful, differences between populations. Another possible approach would be to focus on differences in survival at just one or perhaps several selected time points. However, this approach has the disadvantage of appearing arbitrary, and it also fails to make use of all the available data.

The preferred alternative to the approaches discussed above is to make use of statistical tests that compare viability over the duration of the entire experiment, wherein each time point represents a cumulative opportunity to detect differences. For the C. elegans field, this can be accomplished using a standard version of the logrank test . The null hypothesis of the logrank test is that there is no difference in survival (or some other trait of interest) between the two populations under study. Thus, a trend of consistent differences over time would lead to rejection of the null hypothesis and a statistically significant finding. The logrank test is accomplished mathematically in part by pooling the survival data from both populations and using the pooled data to calculate the expected number of events (such as deaths) predicted to occur in each population at each time point.

For example, from starting samples of 20 worms each, if five deaths occurred in strain A and one death occurred in strain B on day 1, then the total number of deaths on day 1 would be six. In this case, the predicted number of deaths for each strain on day 1 is simply the average of the two strains, namely three deaths. Thus, the difference between the observed and expected deaths (observed − expected) for strain A is 2 (5 − 3 = 2) and is −2 for strain B (1 − 3 = −2). These calculations get slightly more cumbersome at subsequent time points because adjustments must be made to compensate for differences in sample sizes at the start of each day (only the viable animals remain to be sampled), but the principle remains the same 65 . In this example, if the frequency of death in strain A was authentically higher, then summing the differences from each day (until the strain A sample had perished entirely) should give a positive number. If this number were sufficiently large 66 , a low P-value would result. If daily differences were due only to chance, however, a roughly equal number of positive and negative differences would be expected over the course of the experiment, and thus the summed values would be close to zero, leading to a high P-value (e.g., >0.05). It is important to note that this test, as described above, is always between two populations, and thus the two populations tested should be explicitly stated for each P-value. Furthermore, issues of multiple comparisons should be taken into account when multiple samples are being compared.
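In practice, few of us will tally the observed-minus-expected sums by hand; the logrank test described above is implemented as survdiff() in R's survival package. The sketch below, with invented lifespans, is meant only to show the form of the call:

    library(survival)   # supplies Surv() and survdiff()

    # Hypothetical lifespans (days) for two strains of six worms each;
    # status = 1 indicates that each death was observed (no censoring)
    days   <- c(10, 12, 13, 15, 15, 17, 14, 16, 18, 19, 20, 22)
    strain <- rep(c("A", "B"), each = 6)
    status <- rep(1, 12)

    survdiff(Surv(days, status) ~ strain)   # logrank (Mantel-Cox) test; rho = 0 by default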

6.7. Fear not the bootstrap

I (DF) have it on good word (from KG) that if statistics were being invented today, bootstrapping , as well as related re-sampling methods , would be the standard go-to approach. That the statistics field took so long to come up with bootstrapping is forgivable: until very recently, the computing power required to run re-sampling methods was unfathomable. Had statistics been invented in the last 10 years, however, the central statistical principles that underlie most of our favorite tests would probably have been relegated to the side stream of mathematical curiosities, interesting but obscure numerical properties of interest only to theoreticians. Because the genesis and much of the development of statistics took place in the absence of powerful computers, however, we are still entrenched in what could be viewed as an outdated way of doing things. This is not to suggest that the old ways don't work quite well. In fact, they usually give answers that are nearly identical to those provided by the newer computationally intensive methods. The newer methods, however, require fewer assumptions and are therefore more broadly applicable and have fewer associated caveats. In particular, re-sampling methods do not require that the data be sufficiently normal. As such, bootstrapping and related approaches are in fact nonparametric methods 67 . Re-sampling methods can also be applied to statistical parameters that are not handled well by traditional methods, such as the median. Unlike the nonparametric approaches outlined above, re-sampling methods are not limited to the use of ranked data and can therefore be applied to a wider range of statistical problems, such as the construction of CIs. These properties mean that re-sampling methods have greater statistical power and flexibility than traditional nonparametric approaches. All this points to a future where re-sampling methods will largely rule the day. Get ready.

The basic idea of bootstrapping is to use a sample dataset of modest size to simulate an entire population. An example is provided by carrying out calculations to derive a 95% CI of the mean using the data that were analyzed in Figure 5A ( Section 2.2 ; Figure 19 ). Here, 55 data points comprise the original sample of measured GFP intensities. In the first round of bootstrapping, the 55 data points are randomly re-sampled with replacement to obtain a new data set of 55 values ( Figure 19 ). Notice that, by sampling randomly with replacement, some of the data points from the original dataset are missing, whereas others are repeated two or more times. A mean is then calculated for the re-sampled set, along with other statistical parameters that may be of interest. Round one is done. The results of rounds two and three are also shown for clarity ( Figure 19 ). Now repeat 3,997 more times. At this point, there should be 4,000 means obtained entirely through re-sampling. Next, imagine lining up the 4,000 means from lowest to highest, top to bottom ( Figure 19 ). We can partition off the lowest (top of list) 2.5% of values by drawing a line between the mean values at positions 100 and 101. We can also do this for the highest (bottom of list) 2.5% of values by drawing a line between the means at positions 3,900 and 3,901. To get a 95% CI, we simply report the mean values at positions 101 (14.86890) and 3900 (19.96438). Put another way, had we carried out just 40 iterations 68 instead of 4,000, the 95% CI would range from the second highest to the second lowest number. Thus, at its core, bootstrapping is conceptually very simple 69 . The fact that it's a bear computationally matters only to your computer.
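Because the 55 measured GFP intensities are not reproduced here, the R sketch below uses a made-up stand-in vector, but the recipe is exactly the one just described: re-sample with replacement, collect the means, and read off the 2.5th and 97.5th percentiles.

    set.seed(1)
    gfp <- rnorm(55, mean = 17.3, sd = 9)    # stand-in for the 55 measured intensities

    boot_means <- replicate(4000, mean(sample(gfp, replace = TRUE)))

    quantile(boot_means, c(0.025, 0.975))    # percentile bootstrap 95% CI for the mean
    hist(boot_means)                         # should look approximately normal (see below)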

Illustration of the bootstrap process. The original dataset is the one used for Section 2.2, Figure 5A.

The 4,000 individual means obtained through re-sampling can also be used to construct a histogram, which you may notice looks strikingly similar to a normal curve ( Figure 20 ). Not surprisingly, the apex of the curve corresponds very closely to the mean value of the original 55 data points (17.32; Figures 5A and 19 ). Furthermore, if we examine the values at approximately two SDs (1.96 to be exact) to either side of the mean, marked by the red vertical lines in Figure 20 , we obtain 14.7 and 19.9, which again correspond closely to the 95% CI obtained by bootstrapping (14.9, 20.0) as well as the 95% CI obtained through traditional approaches (14.6 and 20.0; Figure 5A ). Thus, at some level, bootstrapping dovetails with some of the concepts covered in earlier sections, such as the idea of obtaining normal distributions through theoretical repeated sampling ( Section 2.6 ). In fact, one way to check the validity of data obtained from bootstrap methods is to generate histograms such as the one in Figure 20 .

Bootstrapped sampling distribution of the mean. A normal curve (blue line) has been inserted for reference. Red vertical lines indicate 95% CI boundaries (∼2 SDs to either side of the mean)

Re-sampling methods can also be applied to test for differences in means between two independent samples, a process generically referred to as permutation tests . In this case, data from the two samples are first combined into one set. For example, samples with 22 and 58 data points would be combined to give a single sample of 80. Next, re-sampling without replacement is carried out to generate two new samples of size 22 and 58. Because the re-sampling is done without regard to which group the original data points came from, this results in a random re-shuffling of the data points into the two groups. Next, means are obtained for each of the new sample sets and the difference between these means is calculated. Round one is now done. The original set of 80 is then re-sampled thousands more times. A P-value for the difference between means is then derived by noting the proportion of times that the two re-sampled sets gave a difference in means as large as or larger than the difference between the original datasets. If the resulting P-value is ≤ 0.05, the difference would typically be deemed statistically significant. Thus, much like the bootstrap calculation of the 95% CI, the re-sampling method is actually simpler and more intuitive to grasp than the classical statistical methods, which rely heavily on theory 70 .
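A sketch of the permutation procedure just described, using two invented samples of sizes 22 and 58; the P-value is simply the fraction of reshuffles that give a difference in means at least as extreme as the observed one.

    set.seed(1)
    a <- rnorm(22, mean = 10)
    b <- rnorm(58, mean = 12)

    obs_diff <- mean(a) - mean(b)
    pooled   <- c(a, b)

    perm_diff <- replicate(4000, {
      idx <- sample(length(pooled), length(a))    # randomly re-assign 22 values to "group a"
      mean(pooled[idx]) - mean(pooled[-idx])
    })

    mean(abs(perm_diff) >= abs(obs_diff))         # two-tailed permutation P-value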

Because bootstrapping takes a single experiment with a limited sample size and effectively turns it into many thousands of experiments, some skeptical scientists may suspect bootstrapping to constitute a form of cheating. Of course this isn't true. That said, whenever you make any inference regarding the population as a whole from a single sample of that population, some assumptions are required. In the case of using classical statistical methods, such as those that rely on the t , z , and other distributions, we are assuming that the sample statistic that we are estimating has a roughly normal distribution. Moreover, the conclusions we derive using these methods are based on the impossible scenario that we carry out the experiment an infinite number of times. Thus, although we tend to accept these standard methods, assumptions are certainly inherent. Bootstrapping is no different. Like the classical methods, it uses the sample data to estimate population parameters. However, instead of using theoretical distributions, such as z and t , we use re-sampling of the data to predict the behavior of the sample statistic. Moreover, in the case of bootstrapping, we don't have to assume that the distribution of our statistic is normal. Therefore, bootstrapping actually requires fewer assumptions than classical methods and is consequently more reliable. Of course, if the experiment wasn't conducted well (non-random or inaccurate sampling methods) or if the numbers are very low, then neither classical nor bootstrapping methods will give reliable results. Statistics cannot validate or otherwise rescue a scientifically flawed experiment.

7. Acknowledgments

We greatly appreciate input from Naomi Ward on the contents of this review. We are particularly indebted to Amy Fluet for editing this ungainly monolith and providing much useful critique. We also thank Oliver Walter for encouraging this project and both anonymous reviewers for taking on this task and providing constructive comments. This work was supported by NIH grant GM066868.

8. References

  • Agarwal, P., and States, D.J. (1996). A Bayesian evolutionary distance for parametrically aligned sequences. J. Comput. Biol. 3, 1–17. doi:10.1089/cmb.1996.3.1
  • Agresti, A., and Coull, B.A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician 52, 119–126.
  • Agresti, A., and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician 54, 280–288.
  • Bacchetti, P. (2010). Current sample size conventions: flaws, harms, and alternatives. BMC Med. 8, 17. doi:10.1186/1741-7015-8-17
  • Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 57, 289–300.
  • Brown, L.D., Cai, T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statist. Sci. 16, 101–133.
  • Burge, C., Campbell, A.M., and Karlin, S. (1992). Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. U. S. A. 89, 1358–1362.
  • Carroll, P.M., Dougherty, B., Ross-Macdonald, P., Browman, K., and FitzGerald, K. (2003). Model systems in drug discovery: chemical genetics meets genomics. Pharmacol. Ther. 99, 183–220.
  • Doitsidou, M., Flames, N., Lee, A.C., Boyanov, A., and Hobert, O. (2008). Automated screening for mutants affecting dopaminergic-neuron specification in C. elegans. Nat. Methods 10, 869–872. doi:10.1038/nmeth.1250
  • Gassmann, M., Grenacher, B., Rohde, B., and Vogel, J. (2009). Quantifying Western blots: pitfalls of densitometry. Electrophoresis 11, 1845–1855. doi:10.1002/elps.200800720
  • Houser, J. (2007). How many are enough? Statistical power analysis and sample size estimation in clinical research. J. Clin. Res. Best Pract. 3, 1–5.
  • Hoogewijs, D., De Henau, S., Dewilde, S., Moens, L., Couvreur, M., Borgonie, G., Vinogradov, S.N., Roy, S.W., and Vanfleteren, J.R. (2008). The Caenorhabditis globin gene family reveals extensive nematode-specific radiation and diversification. BMC Evol. Biol. 8, 279. doi:10.1186/1471-2148-8-279
  • Jones, L.V., and Tukey, J. (2000). A sensible formulation of the significance test. Psychol. Methods 5, 411–414. doi:10.1037/1082-989X.5.4.411
  • Kamath, R.S., Fraser, A.G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., Kanapin, A., Le Bot, N., Moreno, S., Sohrmann, M., Welchman, D.P., Zipperlen, P., and Ahringer, J. (2003). Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 16, 231–237. doi:10.1038/nature01278
  • Knill, D.C., and Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 12, 712–719.
  • Nagele, P. (2003). Misuse of standard error of the mean (SEM) when reporting variability of a sample. A critical evaluation of four anaesthesia journals. Br. J. Anaesth. 4, 514–516. doi:10.1093/bja/aeg087
  • Ramsey, F.L., and Schafer, D.W. (2013). The Statistical Sleuth: A Course in Methods of Data Analysis, Third Edition (Boston: Brooks/Cole Cengage Learning).
  • Shaham, S. (2007). Counting mutagenized genomes and optimizing genetic screens in Caenorhabditis elegans. PLoS One 2, e1117.
  • Suciu, G.P., Lemeshow, S., and Moeschberger, M. (2004). Handbook of Statistics, Volume 23: Advances in Survival Analysis, N. Balakrishnan and C.R. Rao, eds. (Amsterdam: Elsevier B.V.), pp. 251–261.
  • Sulston, J.E., and Horvitz, H.R. (1977). Post-embryonic cell lineages of the nematode, Caenorhabditis elegans. Dev. Biol. 56, 110–156.
  • Sun, X., and Hong, P. (2007). Computational modeling of Caenorhabditis elegans vulval induction. Bioinformatics 23, i499–i507. doi:10.1093/bioinformatics/btm214
  • Vilares, I., and Kording, K. (2011). Bayesian models: the structure of the world, uncertainty, behavior, and the brain. Ann. N.Y. Acad. Sci. 1224, 22–39. doi:10.1111/j.1749-6632.2011.05965.x
  • Zhong, W., and Sternberg, P.W. (2006). Genome-wide prediction of C. elegans genetic interactions. Science 311, 1481–1484. doi:10.1126/science.1123287

9. Appendix A: Microsoft Excel tools

Note: these tools require the Microsoft Windows operating system. Information about running Windows on a Mac (Apple Inc.) can be found at http://store.apple.com/us/browse/guide/windows .

  • PowercalcTool 1mean.xls
  • PowercalcTool 2mean.xls
  • PowercalcTool prop.xls
  • RatiosTool.xls

10. Appendix B: Recommended reading

  • Motulsky, H. (2010). Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, Second Edition (New York: Oxford University Press). This is easily my favorite biostatistics book. It covers a lot of ground, explains the issues and concepts clearly, and provides lots of good practical advice. It's also very readable. Some may want more equations or theory, but for what it is, it's excellent. (DF)
  • Triola M. F., and Triola M.M. (2006). Biostatistics for the Biological and Health Sciences (Boston: Pearson Addison-Wesley). This is a comprehensive and admirably readable conventional statistical text book aimed primarily at advanced undergraduates and beginning graduate students. It provides many great illustrations of the concepts as well as tons of worked examples. The main author (M. F. Triola) has been writing statistical texts for many years and has honed his craft well. This particular version is very similar to Elementary Statistics (10 th edition) by the same author. (DF)
  • Gonick, L. and Smith, W. (1993). The Cartoon Guide to Statistics (New York: HarperPerennial). Don't let the name fool you. This book is quite dense and contains the occasional calculus equation. It's an enjoyable read and explains some concepts very well. The humor is corny, but provides some needed levity. (DF)
  • van Emden, H. (2008). Statistics for Terrified Biologists (Malden, MA: Blackwell Publishing). This paperback explains certain concepts in statistics very well. In particular, its treatment of the theory behind the t-test, statistical correlation, and ANOVA is excellent. The downside is that the author operates under the premise that calculations are still done by hand and so uses many shortcut formulas (algebraic variations) that tend to obscure what's really going on. In addition, almost half of the text is dedicated to ANOVA, which may not be hugely relevant for the worm field. (DF)
  • Sokal, R. R., and Rohlf, F. J. (2012). Biometry: The Principles and Practice of Statistics in Biological Research, Fourth Edition (New York: W. H. Freeman and Co.). One of the “bibles” in the field of biostatistics. Weighing in at 937 pages, this book is chock full of information. Unfortunately, like many holy texts, this book will not be that easy for most of us to read and digest. (DF)
  • Ramsey, F.L. and Schafer, D.W. (2013) . The Statistical Sleuth: a Course in Methods of Data Analysis, Third Edition. (Boston: Brooks/Cole Cengage Learning). Contemporary statistics for a mature audience (meaning those who do real research), done with lots of examples and explanation, and minimal math. (KG)
  • Fay, D.S., and Gerow, K. (2013). A biologist's guide to statistical thinking and analysis. WormBook , ed. The C. elegans Research Community, WormBook. Quite simply a tour de farce, written by two authors with impeachable credentials and a keen eye for misguiding naïve biologists.

11. Appendix C: Useful programs for statistical calculations

  • Minitab A comprehensive and reasonably intuitive program. This program has many useful features and is frequently updated. The downside for Mac users is that you will have to either use a PC (unthinkable) or run it on parallels (or some similar program) that lets you employ a PC interface on your Mac (shudder). (DF)
  • Microsoft Excel Excel excels at doing arithmetic, but is not meant to be a statistics package. It does most simple things (t-tests) sort of reasonably well, but the interface is clunky. Minitab (or any other stats package) would be preferred. (KG)

12. Appendix D: Useful websites for statistical calculations

There are many such sites, which have the advantage of being free and generally easy to use. Of course, what is available at the time of this writing could change rapidly and without notice. As with anything else that's free on the web, rely on it at your own risk!

  • http://easycalculation.com/statistics/statistics.php
  • http://www.graphpad.com/quickcalcs/
  • http://stattrek.com/
  • http://www.numberempire.com/statisticscalculator.php
  • http://www.danielsoper.com/statcalc3/default.aspx
  • http://vassarstats.net/

In theory, we could always capitalize “Normal” to emphasize its role as the name of a distribution, not a reference to “normal”, meaning usual or typical. However, most texts don't bother and so we won't either.

A useful addendum: Four SDs captures the range of most (here, formally 95%) data values; it turns out this is casually true for the distributions of most real-life variables (i.e., not only those that are normally distributed). Most (but not quite all) of the values will span a range of approximately four SDs.

For example, in many instances, data values are known to be composed of only non-negative values. In that instance, if the coefficient of variation (SD/mean) is greater than ∼0.6, this would indicate that the distribution is skewed right.

Indeed, the data from Panel B were generated from a normal distribution. However, you can see that the distribution of the sample won't necessarily be perfectly symmetric and bell-shaped, though it is close. Also note that just because the distribution in Panel A is bimodal does not imply that classical statistical methods are inapplicable. In fact, a simulation study based on those data showed that the distribution of the sample mean was indeed very close to normal, so a usual t-based confidence interval or test would be valid. This is so because of the large sample size and is a predictable consequence of the Central Limit Theorem (see Section 2 for a more detailed discussion).

We note that the SE formula shown here is for the SE of a mean from a random sample. Changing the sample design (e.g., using stratified sampling) or choosing a different statistic requires the use of a different formula.

Our simulation had only ten random samples of size six. Had we used a much larger number of trials (e.g., 100 instead of 10), these two values would have been much closer to each other.

This calculation (two times the SE) is sometimes called the margin of error for the CI.

Indeed, given the ubiquity of “95%” as a usual choice for confidence level, and applying the concept in Footnote 2, a quick-and-dirty “pretty darn sure” (PDS) CI can be constructed by using 2 times the SE as the margin of error. This will approximately coincide with a 95% CI under many circumstances, as long as the sample size is not small.

The requirement for normality in the context of various tests will be discussed in later sections.

Here meaning by a statistical test where the P-value cutoff or “alpha level” (α) is 0.05.

R.A. Fisher, a giant in the field of statistics, chose this value as being meaningful for the agricultural experiments with which he worked in the 1920s.

Although one of us is in favor of 0.056, as it coincides with his age (modulo a factor of 1000).

The term “statistically significant”, when applied to the results of a statistical test for a difference between two means, implies only that it is plausible that the observed difference (i.e., the difference that arises from the data) likely represents a difference that is real. It does not imply that the difference is “biologically significant” (i.e., important). A better phrase would be “statistically plausible” or perhaps “statistically supported”. Unfortunately, “statistically significant” (in use often shortened to just “significant”) is so heavily entrenched that it is unlikely we can unseat it. It's worth a try, though. Join us, won't you?

When William Gosset introduced the test, it was in the context of his work for Guinness Brewery. To prevent the dissemination of trade secrets and/or to hide the fact that they employed statisticians, the company at that time had prohibited the publication of any articles by their employees. Gosset was allowed an exception, but the higher-ups insisted that he use a pseudonym. He chose the unlikely moniker “Student”.

These are measured by the number of pixels showing fluorescence in a viewing area of a specified size. We will use “billions of pixels” as our unit of measurement.

More accurately, it is the distribution of the underlying populations that we are really concerned with, although this can usually only be inferred from the sample data.

For data sets with distributions that are perfectly symmetric, the skewness will be zero. In this case the mean and median of the data set are identical. For left-skewed distributions, the mean is less than the median and the skewness will be a negative number. For right-skewed distributions, the mean is more than the median and the skewness will be a positive number.

Kurtosis describes the shape or “peakedness” of the data set. In the case of a normal distribution, this number is zero. Distributions with relatively sharp peaks and long tails will have a positive kurtosis value whereas distributions with relatively flat peaks and short tails will have a negative kurtosis value.

A-squared (A2) refers to a numerical value produced by the Anderson-Darling test for normality. The test ultimately generates an approximate P-value where the null hypothesis is that the data are derived from a population that is normal. In the case of the data in Figure 5 , the conclusion is that there is < 0.5% chance that the sample data were derived from a normal population. The conclusion of non-normality can also be reached informally by a visual inspection of the histograms. The Anderson-Darling test does not indicate whether test statistics generated by the sample data will be sufficiently normal.

The list is long, but it includes coefficients in regression models and estimated binomial proportions (and differences in proportions from two independent samples). For an illustration of this phenomenon for proportions, see Figure 12 and discussion thereof.

There are actually many Central Limit Theorems, each with the same conclusion: normality prevails for the distribution of the statistic under consideration. Why many? This is so mainly because details of the proof of the theorem depend on the particular statistical context.

And, as we all know, good judgment comes from experience, and experience comes from bad judgment.

Meaning reasons based on prior experience.

Meaning “after the fact”.

Also see discussion on sample sizes ( Section 2.7 ) and Section 5 for a more complete discussion of issues related to western blots.

This is due to a statistical “law of gravity” called the Central Limit Theorem: as the sample size gets larger, the distribution of the sample mean (i.e., the distribution you would get if you repeated the study ad infinitum) becomes more and more like a normal distribution.

Estimated from the data; again, this is also called the SEDM.

In contrast, you can, with data from sample sizes that are not too small, ask whether they (the data and, hence, the population from whence they came) are normal enough. Judging this requires experience, but, in essence, the larger the sample size, the less normal the distribution can be without causing much concern.

This discussion assumes that the null hypothesis (of no difference) is true in all cases.

Notice that this is the Bonferroni critical value against which all P-values would be compared.

If the null hypothesis is true, P-values are random values, uniformly distributed between 0 and 1.

The name is a bit unfortunate in that all of statistics is devoted to analyzing variance and ascribing it to random sources or certain modeled effects.

These are referred to in the official ANOVA vernacular as treatment groups.

This is true supposing that none are in fact real.

Proofs showing this abound on the internet.

Although this may seem intuitive, we can calculate this using some of the formulae discussed above. Namely, this boils down to a combinations problem described by the formula: Pr(combination) = (# permutations) (probability of any single permutation). For six events in 20, the number of permutations = 20!/6! 14! = 38,760. The probability of any single permutation = (0.08)6 (0.92)14 = 8.16e-8. Multiplying these together we obtain a value of 0.00316. Thus, there is about a 0.3% chance of observing six events if the events are indeed random and independent of each other. Of course, what we really want to know is the chance of observing at least six events, so we also need to include (by simple addition) the probabilities of observing 7, 8, …20 events. For seven events, the probability is only 0.000549, and this number continues to decrease precipitously with increasing numbers of events. Thus, the chance of observing at least six events is still <0.4%, and thus we would suspect that the Poisson distribution does not accurately model our event.
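For readers who would rather not sum the binomial terms by hand, the same tail probability can be obtained from R's built-in distribution functions (using the numbers from this footnote):

    dbinom(6, size = 20, prob = 0.08)       # P(exactly 6 events in 20): ~0.0032
    1 - pbinom(5, size = 20, prob = 0.08)   # P(at least 6 events in 20): ~0.0038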

Given that Table 4 indicates that the optimal number of F2s is between 2 and 3.

http://shahamlab.rockefeller.edu/cgi-bin/Genetic_screens/screenfrontpage.cgi

Calculating the probability that a medical patient has a particular (often rare) disease given a positive diagnostic test result is a classic example used to illustrate the utility of Bayes' Theorem. Two complementary examples on the web can be found at: http://vassarstats.net/bayes.html and http://www.tc3.edu/instruct/sbrown/stat/falsepos.htm .

This will often be stated in terms of a margin of error rather than the scientific formalism of a confidence interval.

The reasons for this are complex and due in large part to the demonstrated odd behavior of proportions (See Agresti and Coull 1998 , Agresti and Caffo 2000 ; and Brown et al., 2001 ).

A more precise description of the A-C method is to add the square of the appropriate z-value to the denominator and half of the square of the z-value to the numerator. Conveniently, for the 95% CI, the z-value is 1.96 and thus we add 1.96² = 3.84 (rounded to 4) to the denominator and 3.84/2 = 1.92 (rounded to 2) to the numerator. For a 99% A-C CI, we would add 6.6 (2.575²) to the denominator and 3.3 (6.6/2) to the numerator. Note that many programs will not accept anything other than integers (whole numbers) for the number of successes and failures and so rounding is necessary.

Other assumptions for the binomial include random sampling, independence of trials, and a total of two possible outcomes.

Card counters in Las Vegas use this premise to predict the probability of future outcomes to inform their betting strategies, which makes them unpopular with casino owners.

Note that the numbers you will need to enter for each method are slightly different. The binomial calculators will require you to enter the probability of a success (0.00760), the number of trials (1,000), and the number of successes (13). The hypergeometric calculator will require you to enter the population size (20,000), the number of successes in the population (152), the sample size (1,000), and the number of successes in the sample (13). Also note that because of the computational intensity of the hypergeometric approach, many websites will not accommodate a population size of >1,000. One website that will handle larger populations ( http://keisan.casio.com/has10/SpecExec.cgi?id=system/2006/1180573202 ) may use an approximation method.

Admittedly, standard western blots would also contain an additional probe to control for loading variability, but this has been omitted for simplification purposes and would not change the analysis following adjustments for differences in loading.

A similar, although perhaps slightly less stringent argument, can be made against averaging cycle numbers from independent qRT-PCR runs. Admittedly, if cDNA template loading is well controlled, qRT-PCR cycle numbers are not as prone to the same arbitrary and dramatic swings as bands on a western. However, subtle differences in the quality or amount of the template, chemical reagents, enzymes, and cycler runs can conspire to produce substantial differences between experiments.

This Excel tool was developed by KG.

The maximum possible P-values can be inferred from the CIs. For example, if a 99% CI does not encompass the number one, the ratio expected if no difference existed, then you can be sure the P-value from a two-tailed test is <0.01.

Admittedly, there is nothing particularly “natural” sounding about 2.718281828…

An example of this is described in Doitsidou et al., 2008

In the case of no correlation, the least-squares fit (which you will read about in a moment) will be a straight line with a slope of zero (i.e., a horizontal line). Generally speaking, even when there is no real correlation, however, the slope will always be a non-zero number because of chance sampling effects.

For example, nations that supplement their water with fluoride have higher cancer rates. The reason is not because fluoride is mutagenic. It is because fluoride supplementation is carried out by wealthier countries where health care is better and people live longer. Since cancer is largely a disease of old age, increased cancer rates in this case simply reflect a wealthier, longer-lived population. There is no meaningful cause and effect. On a separate note, it would not be terribly surprising to learn that people who write chapters on statistics have an increased tendency to become psychologically unhinged (a positive correlation). One possibility is that the very endeavor of writing about statistics results in authors becoming mentally imbalanced. Alternatively, volunteering to write a statistics chapter might be a symptom of some underlying psychosis. In these scenarios cause and effect could be occurring, but we don't know which is the cause and which is the effect.

In truth, SD is affected very slightly by sample size, hence SD is considered to be a “biased” estimator of variation. The effect, however, is small and generally ignored by most introductory texts. The same is true for the correlation coefficient, r.

Rea et al. (2005) Nat. Genet. 37, 894-898. In this case, the investigators did not conclude causation but nevertheless suggested that the reporter levels may reflect a physiological state that leads to greater longevity and robust health. Furthermore, based on the worm-sorting methods used, linear regression was not an applicable outcome of their analysis.

The standard form of simple linear regression equations takes the form y = b1x + b0, where y is the predicted value for the response variable, x is the predictor variable, b1 is the slope coefficient, and b0 is the y-axis intercept. Thus, because b1 and b0 are known constants, by plugging in a value for x, y can be predicted. For simple linear regression, where the fit is a straight line, the slope coefficient will be the same as that derived using the least-squares method.

Although seemingly nonsensical, the output of a linear regression equation can be a curved line. The confusion is the result of a difference between the common non-technical and mathematical uses of the term “linear”. To generate a curve, one can introduce an exponent, such as a square, to the predictor variable (e.g., x²). Thus, the equation could look like this: y = b1x² + b0.

A multiple regression equation might look something like this: Y = b1X1 + b2X2 − b3X3 + b0, where X1, X2, and X3 represent different predictor variables, b1, b2, and b3 represent the corresponding slope coefficients determined by the regression analysis, and b0 is the Y-axis intercept. Plugging in the values for X1, X2, and X3, Y could thus be predicted.

Even without the use of logistic regression, I can predict with near 100% certainty that I will never agree to author another chapter on statistics! (DF)

An online document describing this issue is available at: firstclinical.com/journal/2007/0703_Power.pdf. In addition, a recent critical analysis of this issue is provided by Bacchetti (2010) .

The median is essentially a trimmed mean where the trimming approaches 100%!

More accurately, the tests assume that the populations are normal enough and that the sample size is large enough such that the distribution of the calculated statistic itself will be normal. This was discussed in Section 1 .

Note that there are subtle variations on this theme, which (depending on the text or source) may go by the same name. These can be used to test for differences in additional statistical parameters such as the median.

More accurately, nonparametric tests will be less powerful than parametric tests if both tests were to be simultaneously carried out on a dataset that was normal. The diminished power of nonparametric tests in these situations is particularly exacerbated if sample sizes are small. Obviously, if the data were indeed normal, one would hopefully be aware of this and would apply a parametric test. Conversely, nonparametric tests can actually be more powerful than parametric tests when applied to data that are truly non-Gaussian. Of course, if the data are far from Gaussian, then the parametric tests likely wouldn't even be valid. Thus, each type of test is actually “better” or “best” when it is used for its intended purpose.

For example, strains A and B might have six and twelve animals remaining on day 5, respectively. If a total of three animals died by the next day (two from strain A and one from strain B), the expected number of deaths for strain B would be twice that of A, since its population on day 5 was twice that of strain A. Namely, two deaths would be expected for strain B and one for strain A. Thus, the difference between observed and expected deaths for strains A and B would be 1 (2 − 1 = 1) and −1 (1 − 2 = −1), respectively.

The final calculation also takes into account sample size and the variance for each sample.

Note that, although not as often used, there are also parametric bootstrapping methods.

This is a bad idea in practice. For some statistical parameters, such as SE, several hundred repetitions may be sufficient to give reliable results. For others, such as CIs, several thousand or more repetitions may be necessary. Moreover, because it takes a computer only two more seconds to carry out 4,000 repetitions than it takes for 300, there is no particular reason to scrimp.

Note that this version of the procedure, the percentile bootstrap, differs slightly from the standard bootstrapping method, the bias corrected and accelerated bootstrap (BCa). Differences are due to a potential for slight bias in the percentile bootstrapping procedure that are not worth discussing in this context. Also, don't be unduly put off by the term “bias”. SD is also a “biased” statistical parameter, as are many others. The BCa method compensates for this bias and also adjusts for skewness when necessary.

A brief disclaimer. Like everything else in statistics, there are some caveats to bootstrapping along with limitations and guidelines that one should become familiar with before diving into the deep end.

Edited by Oliver Hobert. Last revised January 14, 2013. Published June 26, 2013. This chapter should be cited as: Fay D.S. and Gerow K. A biologist's guide to statistical thinking and analysis (June 26, 2013), WormBook , ed. The C. elegans Research Community, WormBook, doi/10.1895/wormbook.1.150.1, http://www.wormbook.org .

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction


The two instances of modern in the title of this book reflect the two major recent revolutions in biological data analyses:

Biology, formerly a science with sparse, often only qualitative data, has turned into a field whose production of quantitative data is on par with high-energy physics or astronomy, and whose data are wildly more heterogeneous and complex.

Statistics, a field that in the 20th century had become an application ground for probability theory and calculus, often taught loaded with notation and perceived with a heavy emphasis on hypothesis testing, has been transformed by the ubiquity of computers and of data in machine-readable form. Exploratory data analysis, visualization, resampling, simulations, pragmatic hybridizations of Bayesian ideas and methods with frequentist data analysis have become part of the toolset.

The aim of this book is to enable scientists working in biological research to quickly learn many of the important ideas and methods that they need to make the best of their experiments and of other available data. The book takes a hands-on approach. The narrative is driven by classes of questions, or by certain data types. Methods and theory are introduced on a need-to-know basis. We don’t try to systematically deduce from first principles. The book will often throw readers into the pool and hope they can swim in spite of so many missing details.

By no means will this book replace systematic training in the underlying theory: probability, linear algebra, calculus, computer science, databases, multivariate statistics. Such training takes many semesters of coursework. Perhaps the book will whet your appetite to engage more deeply with one of these fields.

The challenge: heterogeneity

Any biological system or organism is composed of tens of thousands of components, which can be in different states and interact in multiple ways. Modern biology aims to understand such systems by acquiring comprehensive –and this means high-dimensional– data in their temporal and spatial context, with multiple covariates and interactions. Dealing with this complexity will be our primary challenge. This includes real, biological complexity as well as the practical complexities and heterogeneities of the data we are able to acquire with our always imperfect instruments.

Biological data come in all sorts of shapes: nucleic acid and protein sequences, rectangular tables of counts, multiple tables, continuous variables, batch factors, phenotypic images, spatial coordinates. Besides data measured in lab experiments, there are clinical data, environmental observations, measurements in space and time, networks and lineage trees, and heaps of previously accumulated knowledge in biological databases as free text or controlled vocabularies, \(...\)

“Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way.” The Anna Karenina principle.

It is such heterogeneity that motivates our choice of R and Bioconductor as the computational platform for this book – some more on this below.

What’s in this book?


Figure  1 outlines a sequential view of statistical data analysis. Motivated by the groundbreaking work on significance and hypothesis testing in the 1930s by Fisher ( 1935 ) and Neyman and Pearson ( 1936 ), it lends itself well to mathematical formalism, especially the part where we compute the distribution of test statistics under a hypothesis (null or alternative), or where we need to set up distributional assumptions and can search for analytical approximations.


Real scientific discovery rarely works in the caricature manner of Figure  1 . Tukey ( 1977 ) emphasized two separate approaches. The first he termed exploratory data analysis ( EDA ). EDA uses the data themselves to decide how to conduct the statistical analysis. EDA is built on data visualization, and is then complemented by confirmatory data analysis (CDA): hypothesis-driven inferential methods, which ideally should be robust and not rely on complex assumptions. Tukey recommended an iterative approach, schematized in Figure  2 , that enables us to see the data at different resolutions and from different perspectives. This enables successive refinement of our understanding of the data and the underlying natural phenomena.

Biology in the late 1990s raised the large- \(p\) , small- \(n\) problem: consider a gene expression dataset for \(n=200\) patient-derived tissue samples on \(p=20000\) genes. If we want to construct a regression or classification model that “predicts” a clinical variable, for instance the disease type or outcome, from the \(20000\) genes, or features, we immediately run into problems, since the potential number of model parameters could be orders of magnitude larger than the number of measurements. The problems derive from non-identifiability of the parameters or overfitting . At least, this is the case for common models, say, an ordinary multivariate linear model. Statisticians realized that they could remedy the situation by requiring sparsity through the use of regularization techniques ( Hastie, Tibshirani, and Friedman 2008 ), i.e., by requiring that many of the potential parameters are either zero or at least close to it.
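As a toy illustration of both the problem and the sparsity remedy, the sketch below fits a lasso with the CRAN package glmnet (an assumption: the package is installed); the simulated sizes are deliberately much smaller than the \(n=200\), \(p=20000\) example above:

    set.seed(1)
    n <- 50; p <- 500                        # many more features than samples
    x <- matrix(rnorm(n * p), nrow = n)
    y <- x[, 1] - 2 * x[, 2] + rnorm(n)      # only the first two features truly matter

    # An ordinary linear model with 500 coefficients and 50 observations is
    # non-identifiable; the lasso penalty (alpha = 1) shrinks most coefficients
    # to exactly zero, yielding a sparse, estimable model.
    library(glmnet)
    fit <- cv.glmnet(x, y, alpha = 1)
    coefs <- as.vector(coef(fit, s = "lambda.min"))
    sum(coefs != 0)                          # typically only a handful are nonzero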

A generalization of the sparsity principle is attained by invoking one of the most powerful recent ideas in high-dimensional statistics, which goes under the name of empirical Bayes : we don't try to learn all the parameters from scratch, but rather use the fact that groups of them will be similar, or even the same. There are several important book-length treatments ( Efron 2010 ) of the subject of large-scale inference, so essential in modern estimation and hypothesis testing. In the 2010s, computer scientists also discovered ways of engineering “deep neural networks” that can provide sometimes amazing predictive qualities without worrying too much about parameter identifiability or uniqueness.


Simulations play an essential role in this book, as many of the results we need escape the reach of standard analytic approaches. In other words, simulations liberate us from being able to only consider methods that are tractable by “paper and pencil mathematics”, and from worrying about the appropriateness of simplifying assumptions or approximations.

In this book, we try to cover a wide spectrum of these developments and their applications to current biological research. We cover many different types of data that modern biologists have to deal with, including RNA-Seq, flow-cytometry, taxa abundances, imaging data and single cell measurements. We assume no prior formal training in statistics. However, you’ll need some familiarity with R and willingness to engage in mathematical and analytical thinking.


As outlined in the following, the chapters in the book build upon each other, but they are reasonably self-contained, so they can also be studied separately. Each chapter starts with a motivations and goals section. Questions in the text will help you check whether you are following along. The text contains complete R code examples throughout. You don’t need to scrape R code from the HTML or manually copy it from the book. Please use the R files (extension .R ) on this website. Each chapter concludes with a summary of the main points and a set of exercises.

Generative models are our basic building blocks. In order to draw conclusions about complicated data it tends to be useful to have simple models for the data generated under this or that situation. We do this through the use of probability theory and generative models, which we introduce in Chapter 1 . We will use examples from immunology and DNA analysis to describe useful generative models for biological data: binomial, multinomial and Poisson random variables.
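To give a flavor of what Chapter 1 has in mind, here is a minimal sketch using base R's random number generators for the three named distributions (the parameter values are arbitrary):

    set.seed(1)
    rbinom(5, size = 20, prob = 0.3)                     # binomial: successes in 20 trials
    rpois(5, lambda = 4)                                 # Poisson: counts with mean 4
    rmultinom(2, size = 100, prob = c(0.1, 0.2, 0.7))    # multinomial: 100 counts over 3 categories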

Once we know what the data would look like under a certain model, we can start working our way backwards: given some data, what model is most likely to be able to explain it? This bottom-up approach is the core of statistical thinking, and we explain it in Chapter 2 .

We saw the primary role of graphics in Tukey’s scheme ( Figure  2 ), and so we’ll learn how to visualize our data in Chapter 3 . We’ll use the grammar of graphics and ggplot2 .

Real biological data often have more complex distributional properties than what we could cover in Chapter 1 . We’ll use mixtures , which we explore in Chapter 4 ; these enable us to build realistic models for heterogeneous biological data and provide solid foundations for choosing appropriate variance stabilizing transformations.
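
As a toy illustration of a mixture (with invented parameters): heterogeneous data arise when a hidden label first selects a component and the observation is then drawn from that component's distribution.

    ## Sketch of a two-component normal mixture; a latent label picks the component.
    set.seed(4)
    n <- 10000
    comp <- rbinom(n, size = 1, prob = 0.3)                 # hidden component membership
    x <- rnorm(n, mean = ifelse(comp == 1, 3, 0), sd = 1)   # draw from the chosen component
    hist(x, breaks = 100, main = "Two-component normal mixture")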

The large, matrix-like datasets in biology naturally lend themselves to clustering : once we define a distance measure between matrix rows (the features), we can cluster and group the genes by similarity of their expression patterns, and similarly, for the columns (the patient samples). We’ll cover clustering in Chapter 5 . Since clustering only relies on distances, we can even apply it to data that are not matrix-shaped, as long as there are objects and distances defined between them.
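
A small base-R sketch of distance-based clustering (the matrix and its two-group structure are simulated for the example):

    ## Hierarchical clustering of the rows of a small simulated matrix.
    set.seed(5)
    m <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),   # ten mock features around 0
               matrix(rnorm(50, mean = 3), nrow = 10))   # ten mock features around 3
    d  <- dist(m)         # Euclidean distances between rows
    hc <- hclust(d)       # agglomerative hierarchical clustering
    plot(hc)              # dendrogram
    cutree(hc, k = 2)     # cluster assignments when cutting into two groups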

Further following the path of EDA, we cover the most fundamental unsupervised analysis method for simple matrices – principal component analysis – in Chapter 7 . We turn to more heterogeneous data that combine multiple data types in Chapter 9 . There, we’ll see nonlinear unsupervised methods for counts from single cell data. We’ll also address how to use generalizations of the multivariate approaches covered in Chapter 7 to combinations of categorical variables and multiple assays recorded on the same observational units.
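
A minimal principal component analysis in base R, shown on the built-in iris data purely as an illustration:

    ## PCA of a small numeric matrix; variables are centered and scaled first.
    pca <- prcomp(iris[, 1:4], scale. = TRUE)
    summary(pca)                                  # variance explained by each component
    plot(pca$x[, 1], pca$x[, 2],
         col = as.integer(iris$Species),
         xlab = "PC1", ylab = "PC2")              # samples in the first two components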

The basic hypothesis testing workflow outlined in Figure  1 is explained in Chapter 6 . We use the opportunity to apply it to one of the most common queries to \(n\times p\) -datasets: which of the genes (features) are associated with a certain property of the samples, say, disease type or outcome? However, conventional significance thresholds would lead to lots of spurious associations: with a false positive rate of \(\alpha=0.05\) we expect \(p\alpha=1000\) false positives if none of the \(p=20000\) features has a true association. Therefore we also need to deal with multiple testing.
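
The arithmetic above is easy to reproduce; the sketch below simulates 20,000 features with no true associations, counts how many pass a raw 0.05 threshold, and then applies the Benjamini-Hochberg adjustment (all data simulated).

    ## Sketch: multiple testing across 20,000 null features (simulated data).
    set.seed(6)
    p <- 20000
    group <- rep(c("A", "B"), each = 10)
    pvals <- replicate(p, {
      y <- rnorm(20)                    # no real difference between the groups
      t.test(y ~ group)$p.value
    })
    sum(pvals < 0.05)                              # roughly p * 0.05 = 1000 false positives
    sum(p.adjust(pvals, method = "BH") < 0.05)     # few, if any, survive FDR adjustment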

One of the most fruitful ideas in statistics is that of variance decomposition, or analysis of variance (ANOVA). We’ll explore this, in the framework of linear models and generalized linear models, in Chapter 8 . Since we’ll draw our example data from an RNA-Seq experiment, this also gives us an opportunity to discuss models for such count data, and the concept of robustness .
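
A minimal sketch of a generalized linear model for count data, using simulated counts rather than the chapter's RNA-Seq dataset; real RNA-Seq analyses typically use negative binomial models (for example in DESeq2 or edgeR), so the Poisson fit here only illustrates the modeling syntax.

    ## Poisson GLM comparing simulated counts between two conditions.
    set.seed(7)
    condition <- rep(c("control", "treated"), each = 6)
    counts <- rpois(12, lambda = ifelse(condition == "treated", 40, 20))
    fit <- glm(counts ~ condition, family = poisson)
    summary(fit)$coefficients     # the 'conditiontreated' estimate is a log fold change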

Nothing in biology makes sense except in the light of evolution 1 , and evolutionary relationships are usefully encoded in phylogenetic trees. We’ll explore networks and trees in Chapter 10 .

1  Theodosius Dobzhansky, https://en.wikipedia.org/wiki/Nothing_in_Biology_Makes_Sense_Except_in_the_Light_of_Evolution

A rich source of data in biology is images, and in Chapter 11 we reinforce our willingness to do EDA on all sorts of heterogeneous data types by exploring feature extraction from images and spatial statistics.

In Chapter 12 , we look at supervised learning: training an algorithm to distinguish between different classes of objects, given a multivariate set of features for each object. We’ll start simple, with low-dimensional feature vectors and linear methods, and then step forward to some of the issues of classification in high-dimensional settings. Here, we focus on “classical” supervised learning, where (at least conceptually) a training set of ground-truth classifications is available to the algorithm all at once to learn from. We do not (yet…) cover reinforcement learning, a more flexible framework in which one or several agents learn by interacting with a so-called environment, successively performing actions and receiving feedback on them; notions of time and state therefore enter that framework.
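
As a low-dimensional, linear starting point (a sketch on simulated data, not the classifiers developed in the chapter): train a logistic regression on labeled observations and use it to predict the class of new ones.

    ## Sketch of supervised classification with logistic regression on simulated data.
    set.seed(8)
    n <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    class <- rbinom(n, 1, plogis(1.5 * x1 - 1.0 * x2))    # labels depend on x1 and x2
    train <- data.frame(class, x1, x2)
    fit <- glm(class ~ x1 + x2, data = train, family = binomial)
    newdata <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))
    predict(fit, newdata, type = "response") > 0.5        # predicted class for new objects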

We wrap up in Chapter 13 with considerations on good practices in the design of experiments and of data analyses. For this we’ll use and reflect on what we have learned in the course of the preceding chapters.


Computational tools for modern biologists

As we’ll see over and over again, the analysis approaches, tools and choices to be made are manifold. Our work can only be validated by keeping careful records in a reproducible script format. R and Bioconductor provide such a platform.

Although we are tackling many different types of data, questions and statistical methods hands-on, we maintain a consistent computational approach by keeping all the computation under one roof: the R programming language and statistical environment, enhanced by the biological data infrastructure and specialized method packages from the Bioconductor project. The reader will have to start by acquiring some familiarity with R before using the book. There are many good books and online resources. One of them is by Grolemund and Wickham ( 2017 ) , online at http://r4ds.had.co.nz .

R code is a major component of this book. It is how we make the textual explanations explicit. Essentially every data visualization in the book is produced with code that is shown, and the reader should be able to replicate all of these figures, and any other result shown.

Even if you have some familiarity with R, don’t worry if you don’t immediately understand every line of code in the book. Although we have tried to keep the code explicit and to give tips and hints at potentially challenging places, there will be instances where

there is a function invoked that you have not seen before and that does something mysterious,

there is a complicated R expression that you don’t understand (perhaps involving apply -functions or data manipulations from the dplyr package).

Don’t panic. For the mysterious function, have a look at its manual page. Open up RStudio and use the object explorer to look at the variables that go into the expression, and those that come out. Split up the expression to look at intermediate values.

In Chapters 1 and 2 , we use base R functionality for light doses of plotting and data manipulation. As we successively need more sophisticated operations, we introduce the ggplot2 way of making graphics in Chapter 3 . Besides the powerful grammar of graphics concepts that enable us to produce sophisticated plots using only a limited set of instructions, this implies using the dplyr way of data manipulation.

Why R and Bioconductor?

There are many reasons why we have chosen to present all analyses on the R ( Ihaka and Gentleman 1996 ) and Bioconductor ( Huber et al. 2015 ) platforms:

Cutting edge solutions. The availability of over \(10,000\) packages ensures that almost all statistical methods are available, including the most recent developments. Moreover, there are implementations of, or interfaces to, many methods from computer science, mathematics, machine learning, data management, visualization and internet technologies. This puts thousands of person-years of work by experts at your fingertips.

Open source and community-owned. R and Bioconductor have been built collaboratively by a large community of developers. They are constantly tried and tested by thousands of users.

Data input and wrangling. Bioconductor packages support the reading of many of the data types and formats produced by measurement instruments used in modern biology, as well as the needed technology-specific “preprocessing” routines. The community is actively keeping these up-to-date with the rapid developments on the instrument market.

Simulation. There are random number generators for every known statistical distribution and powerful numeric routines for linear algebra, optimization, etc.

Visualization and presentation. R can make attractive, publication-quality graphics. We’ve dedicated Chapter 3 to this and practice data visualization extensively throughout the book.

Easy to use interactive development environment. RStudio is easy and fun to use and helps with all aspects of programming in R. It is an essential piece in following the iterative approach to data analysis schematized in Figure  2 .

Reproducibility. As an equivalent to the laboratory notebook that is standard good practice in labwork, we advocate the use of a computational diary written in the R markdown or quarto formats. We use the quarto system to convert these files into easy-to-read and shareable HTML or PDF documents. These can even become full-fledged scientific articles or supplements. Together with a version control system, this approach also helps with tracking changes.

Collaborative environment. Quarto enables the creation of websites containing code, text, figures and tables with a minimum of work.

Rich data structures. The Bioconductor project has defined specialized data containers to represent complex biological datasets. These help to keep your data consistent, safe and easy to use.

Interoperability and distributed development. Bioconductor in particular contains packages from diverse authors that cover a wide range of functionalities but still interoperate because of the common data containers.

Documentation. Many R packages come with excellent documentation in their function manual pages and vignettes. The vignettes are usually the best starting point into a package, as they give you a high-level narrative on what the package does, whereas the manual pages give detailed information on input, output and inner workings of each function. There are online tutorials, fora and mailing lists for many aspects of working with R and Bioconductor.

High-level language. R is an interpreted high-level language. Its roots in LISP and its functional programming features mean that code is data and can be computed on, which enables efficient programming and is fun. These features facilitate constructing powerful domain specific languages 2 . R is not a fixed language – throughout its history, it has been actively evolving and is constantly improving.

2  Examples include R’s formula interface, the grammar of graphics in ggplot2 , the data manipulation functionality of dplyr and R markdown.


Research and Industry

Applications in biology and medicine.

There is a long and fruitful history of joint development between Statistics and Biology and Medicine, with data at the core. For instance, Mendel’s fundamental laws of heredity were entirely based on statistical inference applied to data from carefully designed experiments. Most recently, the advent of novel high-throughput and high-resolution biological assays has allowed the exploration of biological processes on a genomic scale and at the resolution of single cells. Applications range from addressing fundamental science questions (e.g., how does the brain work?) to disease prevention, diagnosis, and treatment. Statistical methods are essential to make sense of the massive amounts of data generated by these biotechnologies.

Our faculty have been at the forefront of research at the interface of Statistics with Biology and Medicine, contributing statistical methods and software for genome sequencing, the study of stem cell differentiation, neuroscience, evolutionary biology, epidemiology, infectious disease modeling, clinical trials, and personalized medicine, among others. A hallmark of the Berkeley approach is our close collaboration with biologists and clinicians and our engagement throughout the data science pipeline, including the framing of questions, study design, exploratory data analysis, and the interpretation, validation, and translation of the results into domain insight. 

Our faculty have played an essential role in the creation and growth of the Center for Computational Biology and comprise its largest group (10 of the Center’s 48 faculty). 

Researchers

  • Ahmed Alaa
  • Peter Bickel: statistics, machine learning, semiparametric models, asymptotic theory, hidden Markov models, applications to molecular biology
  • Ben Brown
  • Iain Carmichael
  • Peng Ding: causal inference in experiments and observational studies, with applications to biomedical and social sciences; contaminated data including missing data, measurement error, and selection bias
  • Sandrine Dudoit: statistics, applied statistics, data science, statistical computing, computational biology and genomics
  • Steven Evans
  • Adrián González-Casanova: probability theory and its applications in theoretical biology, including modelling, simulation and data-driven approaches
  • Haiyan Huang: high dimensional and integrative genomic data analysis, network modeling, hierarchical multi-label classification, translational bioinformatics
  • Nicholas P. Jewell: infectious diseases (specifically HIV), chronic disease epidemiology, environmental epidemiology, survival analysis, human rights statistics
  • Michael Jordan: computer science, artificial intelligence, computational biology, statistics, machine learning, electrical engineering, applied statistics, optimization
  • Jon McAuliffe: machine learning, statistical prediction, variational inference, statistical computing, optimization
  • Rasmus Nielsen: evolution, molecular evolution, population genetics, human variation, human genetics, phylogenetics, applied statistics, genetics, evolutionary processes, evolutionary biology
  • Sam Pimentel: causal inference, health services & policy analysis, biostatistics, discrete optimization
  • Elizabeth Purdom: computational biology, bioinformatics, statistics, data analysis, sequencing, cancer genomics
  • Yun S. Song: computational biology, statistical genetics, applied probability
  • Philip B. Stark: uncertainty quantification and inference, inverse problems, nonparametrics, risk assessment, earthquake prediction, election auditing, geomagnetism, cosmology, litigation, food/nutrition
  • Alexander Strang: stochastic processes, hierarchical Bayesian inference, random graph theory, multi-agent training, computational biology, solution continuation, optimization
  • Ryan Tibshirani: high-dimensional statistics, nonparametric estimation, distribution-free inference, machine learning, convex optimization, numerical methods, tracking and forecasting epidemics
  • Mark van der Laan: statistics, computational biology and genomics, censored data and survival analysis, medical research, inference in longitudinal studies
  • Adam Yala
  • statistical inference for high dimensional data and interdisciplinary research in neuroscience, remote sensing, and text summarization


Teaching Statistics in Biology: Using Inquiry-based Learning to Strengthen Understanding of Statistical Analysis in Biology Laboratory Courses

  • Anneke M. Metz

Department of Cell Biology and Neuroscience, Montana State University, Bozeman, MT 59717


There is an increasing need for students in the biological sciences to build a strong foundation in quantitative approaches to data analyses. Although most science, engineering, and math field majors are required to take at least one statistics course, statistical analysis is poorly integrated into undergraduate biology course work, particularly at the lower-division level. Elements of statistics were incorporated into an introductory biology course, including a review of statistics concepts and opportunity for students to perform statistical analysis in a biological context. Learning gains were measured with an 11-item statistics learning survey instrument developed for the course. Students showed a statistically significant 25% ( p < 0.005) increase in statistics knowledge after completing introductory biology. Students improved their scores on the survey after completing introductory biology, even if they had previously completed an introductory statistics course (9% improvement, p < 0.005). Students retested 1 yr after completing introductory biology showed no loss of their statistics knowledge as measured by this instrument, suggesting that the use of statistics in biology course work may aid long-term retention of statistics knowledge. No statistically significant differences in learning were detected between male and female students in the study.

INTRODUCTION

In Bio2010: Transforming Undergraduate Education for Future Research Biologists , the National Research Council (NRC) recommends an increased emphasis on interdisciplinarity in biology curricula, including more emphasis on mathematics and statistics ( NRC, 2003 ). Little work has been done on the teaching of statistics in an integrated manner in undergraduate biology courses. Horgan et al. (1999) describe continuing education workshops in statistics for biological research scientists, but they provide only anecdotal evidence of student learning. A'Brook and Weyers (1996) surveyed the level of statistics teaching in undergraduate biology degree programs (in the United Kingdom), and they found that statistics course work is almost invariably separated from biology course work. Both Horgan and A'Brook conclude that biology students are not learning statistics at the undergraduate (and even graduate) levels, which results in the lack of, or misused, statistical analysis by many professional biologists.

The experience of our faculty teaching undergraduate biology mirrors the observations of both A'Brook and Weyers. In our traditional curriculum, statistics was a required course for the major, but it was not a prerequisite for any particular biology course. As a result, students took the course at greatly varying points in their curriculum, usually as juniors or seniors. Furthermore, because understanding of statistics was not consistently or uniformly stressed within the biology curriculum, students made few intellectual connections between their general statistics course (which uses examples from a wide range of fields, including economics, agriculture, medicine, sociology, and engineering) and the use of statistics in biology. Anecdotal evidence from our courses also suggested that biology students had difficulty understanding quantitative information presented graphically or in tables, were unable to manipulate raw data sets, and could not interpret statistical measures of significance such as p value. The Montana State University (MSU) biomedical sciences major attracts a high percentage of students who hope to attend graduate, medical, or professional school; yet, even this highly motivated group of students seemed to have few skills in analyzing and manipulating biological data, particularly at the lower-division level, even after completing a formal statistics course.

The literature suggests that this is not an uncommon occurrence in biology programs. Peterson (2000) describes an introductory biology course where students were required to perform an inquiry-based experiment to examine the relationships between macroinvertebrates in a campus nature preserve. Instructors in that course found that students had to first be taught statistical analysis before they could analyze their own data. This instruction required time beyond what was available in the classroom, and students continued to struggle with differentiating between statistical significance of data and the significance of their findings with respect to the experimental objective. Using several pre- and postproject exam questions, Peterson reported statistically significant improvement in basic knowledge of statistics in only one of two sections that incorporated supplemental statistics instruction.

Maret and Ziemba (1997) , in their investigation of teaching students to use statistics in scientific hypothesis testing, noticed that students, especially lower-division undergraduates, have few opportunities to practice the use of statistics in testing scientific hypotheses. They suggest that students should be taught the use of statistical tests early in their undergraduate careers, and concomitantly be given opportunity to practice applying statistics to test scientific hypotheses. To ensure that biology majors will have a stronger understanding of quantitative and statistical analysis, we have begun reforming our biomedical sciences curriculum to better integrate mathematics and biology, and in particular statistics and biology.

Although the teaching of statistics in the biology classroom is understudied, we can look to lessons learned in statistics pedagogy to reform biology curricula. Modern introductory statistics courses were established in the late 1970s, and they have seen a continual growth in enrollment since that time ( Guidelines for the Assessment and Instructions in Statistics Education [GAISE], 2005 ). The growth in enrollment has resulted in a democratization of the statistics student population: students are coming into statistics with increasingly diverse backgrounds and professional goals. This has meant shifting from teaching a narrowly defined set of professionals-in-training (e.g., scientists) to making statistics relevant and accessible to students with a wide range of career interests and quantitative skills ( Moore, 1997 ; GAISE, 2005 ). Statistics academicians have responded with a number of curricular reforms that are useful to consider.

Moore (1997) , in considering how to best craft introductory statistics courses for modern students who will become citizens and professionals, but (mostly) not professional statisticians, suggests that “the most effective learning takes place when content (what we want students to learn), pedagogy (what we do to help them learn) and technology reinforce each other in a balanced manner.” There are several key points to this statistics reform. One, it is clear that active learning, with components including laboratory exercises, group work, work with class-generated data, and student written and oral presentations, increases student enthusiasm for, and learning of, statistics. Two, statistics instructors indicate that students are more motivated and better understand concepts when real (and especially student-gathered) data sets are used. It is also important to use technology (e.g., statistics software) to emphasize statistical literacy rather than tedious calculations ( Moore, 1997 ; Garfield et al. , 2002 ; GAISE, 2005 ).

From the literature and our own experience, it was clear that to truly integrate statistics and biology and engage students in the use of statistics in their major program, we would need to provide students with formal training in statistics as early as possible, give them the opportunity to practice the use of statistics, and demonstrate the importance of statistical data analysis in biological research. At MSU, students take a centralized introductory statistics course with peers in many different major disciplines. As part of an ongoing curricular reform effort, we shifted the biomedical science major requirements so that our students now take introductory statistics as incoming freshmen, rather than in their junior or senior year. We then designed our introductory biology course, taken by our majors immediately after completion of introductory statistics, to incorporate the best practices of statistics teaching reform.

We designed and piloted an instrument to determine whether reinforcement of statistical learning (by practice in using statistics in biological applications) aids in student retention of material learned in formal statistics course work, and increases student understanding of statistical concepts. Although assessments exist to measure learning in statistics curricula (e.g., the Statistical Reasoning Assessment, Garfield, 2003 ), there is no instrument available for measuring learning of statistics in a biology curriculum. The survey was administered to students at the beginning and end of introductory biology course. We also resurveyed students 1 yr after completion of introductory biology to determine whether students retained their knowledge of basic statistics in a biological context. Finally, we examined the impact of incorporating biology examples into introductory statistics, and we investigated whether there were gender differences in learning within our study cohort.

We sampled two cohorts of biomedical science majors (2005 and 2006 incoming classes) for this study. Biomedical science students take introductory statistics (Statistics 216), taught by the mathematical sciences department, in the first semester of their freshman year, as a prerequisite to biology course work. In fall 2005, a special section of introductory statistics was offered for biology students that emphasized the use of biological examples in teaching the course material. The course was otherwise identical to other sections of introductory statistics offered that semester. Sixteen students who went on to enroll in the spring 2006 introductory biology course (Biology 213) completed this special section. Forty-four additional students enrolled in regular introductory statistics sections in fall 2005. The remaining students in 2005 satisfied their statistics prerequisite at other institutions, took statistics concurrently, or did not satisfy the prerequisite before enrollment in introductory biology. The special statistics section for biology majors was not available to the 2006 cohort. Over two semesters (spring 2006 and 2007), 264 (123 male, 141 female) students in total completed introductory biology.

Instructional Methods

In introductory biology, students were given a brief review of some elements of basic statistics (charting categorical observations, mean/median/mode, analyzing difference between two populations, confidence interval, distinguishing correlation from causation, p value), and data graphing (scatterplot, box plot, and histogram). In the laboratory component of the course, students were provided with an introduction to the Minitab statistical software package (Minitab, State College, PA), and they completed a measurement, data collection, and statistical analysis exercise (the “Shell Lab,” described below) at the beginning of the course. This material was reinforced with a statistics primer, an appendix in the laboratory manual that reviews basic statistical principles and provides a guide to the Minitab program. For the remainder of the semester, students completed six additional inquiry-based lab exercises, requiring them to collect data and perform basic statistical analyses such as comparison of the means of two populations, distribution of measurements in a population, comparison of actual allele frequencies with expected frequencies, and regression analysis of the association between two variables. The laboratory exercises were designed in a similar manner to the “teams and streams” model ( Luckie et al. , 2004 ), in which 2-wk-long laboratory blocks combine short introductory exercises with modules where students design their own experiments.

Students, working in groups of four, performed three activities using a variety of sea shells (spotted arks, ear moon shells, tiger moon shells, clams, Caribbean arks or brown stripe shells; available in bulk from Seashells.com) and peanuts in the shell. This lab is a variation of the “Find Your Peanut” inquiry exercise from Duke University's Center for Inquiry-based Learning ( Budnitz and Guentensberger, 2001 ), and it included three activities. Activity 1 (demonstrates measurement variation) : Students were instructed to “measure the length of 20 peanut shells.” All student groups measured the same set of peanut shells. Data sets for all the groups were compared, and variations in data measurements were discussed. Activity 2 (correlation of quantitative variables, use of scatterplot): Each group received 10 specimens of one type of shell, and they were tasked to determine whether “height” and “width” (self-defined by students) correlated or varied independently. Activity 3 (comparison of two populations): Each group received two sets of 20 shells, from “warm water” and “cold water,” and they had to determine whether there was a statistically significant difference between the two sets. Students were free to choose their measurement strategy to determine “difference.” Data collection and analysis were performed using Minitab statistical software. This exercise trained students in some basic statistical analyses techniques that they then used in the remaining labs for the semester.
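
The course itself used Minitab, but the same analyses can be scripted in a few lines of R; the sketch below uses invented measurements only to show what the correlation analysis (Activity 2) and the two-population comparison (Activity 3) look like.

    ## Illustrative R versions of two Shell Lab analyses (invented measurements;
    ## the course used Minitab).
    shell_height <- c(3.1, 2.8, 3.5, 3.0, 2.9, 3.3, 3.6, 2.7, 3.2, 3.4)   # Activity 2
    shell_width  <- c(2.0, 1.9, 2.3, 2.1, 1.8, 2.2, 2.4, 1.7, 2.1, 2.2)
    plot(shell_height, shell_width)            # scatterplot of the two variables
    cor.test(shell_height, shell_width)        # do "height" and "width" correlate?

    set.seed(9)
    warm <- rnorm(20, mean = 3.2, sd = 0.3)    # Activity 3: mock "warm water" shells
    cold <- rnorm(20, mean = 2.9, sd = 0.3)    # mock "cold water" shells
    t.test(warm, cold)                         # is the difference statistically significant?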

Statistics Survey Design

To determine the level of incoming statistical knowledge for students entering introductory biology, we developed a statistics survey (available as Supplemental Material ) consisting of 11 multiple-choice and short-answer items ( Table 1 ) . The instrument included items at four (knowledge, application, analysis, and evaluation) of the six levels of Bloom's taxonomy of learning objectives ( Bloom, 1956 ). Items were chosen based on statistical concepts both taught in introductory statistics and used in data analysis in the introductory biology labs. We used a multiple-choice (with one short answer) format to allow the survey to be completed as a student activity in the lecture component of the course. Preliminary instrument validation was performed by three introductory biology instructors at MSU and two outside content experts. These initial reviews of the statistics instrument provided useful feedback on question wording and accuracy of answer choices.

Table 1 footnotes: (a) total number of students participating in the survey, n = 186; (b) includes both answer choices C and E.

Delivery of the Survey

The survey was administered on the first day of class to students in introductory biology (“presurvey”). Presurveys were not returned. We asked students to voluntarily self-report class rank, gender, and previous/concurrent statistics course work on the survey. To gauge student learning of statistics by the end of the course, the identical instrument was readministered to the students in a class period during the last week of the course (“postsurvey”). Students were not told ahead of time that the survey would be readministered.

One hundred eighty-six of 264 (70.5%) introductory biology students completed both the pre- and postsurveys and made up the study sample. Of the 186 students, 87 were male and 99 were female. Students were categorized as having completed statistics before the course, taking statistics concurrently, or not satisfying the statistics prerequisite ( Table 2 ). Seventy-eight additional students enrolled in introductory biology did not complete either the presurvey or postsurvey, and they were not included in the study sample. We compared student GPAs of the 186 survey completers and 78 noncompleters to determine whether the study sample was representative of the student population completing the course.

We retested the 2006 cohort 1 yr after completion of their postsurvey, in a sophomore-level required biology course (Biology 215, ecology and evolution). The survey was given in-class during the last week of the semester. This resulted in a sample of 30 students. The sample size was small because a large number of students take this course out of sequence or pass out of it with advanced placement or transfer credit.

Data Analysis

Pre- and postsurveys and demographic data for individual students were matched, and names and student IDs were removed from the data set before analysis. We calculated individual student normalized gains ( g ), where g is the proportion of the actual score gains compared with the maximal gain possible, given the student's initial score; g = (postscore − prescore)/(100% − prescore), for each student in the sample. Average normalized gain, G , defined as the mean of g values in a population, was then calculated for the following student groups: those who had completed a college-level introductory statistics course in a previous semester, those taking statistics concurrently, and students who had no college-level statistics preparation. We also calculated G for students who had completed a biology-themed section of introductory statistics, and we compared survey performance for male and female students. For students retested 1 yr after completing introductory biology, G was calculated using the student's initial presurvey score from the start of spring 2006 and the postsurvey score from the completion of spring 2007, representing a spacing of approximately 16 mo. The assumption of normality for the prescore, postscore, and normalized gain data sets was confirmed by skew and Kurtosis analysis (skew and Kurtosis < 1). All statistical tests for data significance ( t test, analysis of variance [ANOVA], analysis of covariance [ANCOVA]), including skew/Kurtosis calculations and corrections for simultaneous comparisons of groups (Tukey's or Bonferroni) were performed using Minitab version 15.1.1.0.
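
The gain calculation itself is a one-liner; here is a sketch in R (the study used Minitab) with invented percentage scores.

    ## Normalized gain g for each student and average gain G (invented scores, in %).
    pre  <- c(50, 64, 40, 70, 55)
    post <- c(70, 73, 62, 80, 60)
    g <- (post - pre) / (100 - pre)   # proportion of the possible gain that was achieved
    G <- mean(g)                      # average normalized gain for the group
    G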

We developed a survey to measure statistics learning gains in an introductory biology course that emphasized statistical analysis to determine whether statistics knowledge learned in introductory statistics could be either maintained or improved by extra practice and review in a nonstatistics course. The survey tested student recall and understanding of the concepts from introductory statistics that are necessary for basic data analysis in introductory biology. Of the 264 students completing introductory biology, 186 students (70%) who completed both the initial and final surveys were included in the study sample. The mean GPA was 0.35 points higher for study participants (3.36) than for those who did not complete the study (3.01). The difference was statistically significant ( t test, p < 0.005). The mean course grade for nonparticipants was also significantly lower (3.40 for participants, 2.71 for nonparticipants; t test, p < 0.005). These data suggest that the study did not adequately represent students who are “strugglers” in the major. Their exclusion from the study seems to be due to lower attendance rates by struggling students; 98% of students who were excluded from the study sample failed to complete the postsurvey, which was given in class during the last week of classes for the semester.

The statistics survey consisted of questions in four categories: identification of graphs, use of graphical tools, analysis of experimental data, and evaluation ( Table 1 ). The instrument was field tested with 186 biomedical science majors. We gauged question difficulty by the percentage of correct responses on the presurvey. Presurvey correct responses on individual items ranged from nearly 100% (item 2) to 35% (items 8 and 11). Kaplan and Saccuzzo (1997) suggest that instrument multiple-choice questions ideally have initial correct response averages about halfway between random guess and 100%. Because most of the survey items had either four or five answer choices, an initial correct response rate of ∼60% would have been ideal. Thus, some questions that had an 80% or greater initial correct response rate had little utility in gauging learning (e.g., items 2, 3, and 7), although they were useful in demonstrating students' incoming knowledge. Other items (e.g., item 11) were likely too difficult to accurately measure student understanding of that particular concept.

The survey was reliable in measuring an increase in statistics knowledge in introductory biology. We saw a statistically significant improvement of scores on the survey of about 24% (14 percentage points, from 57% correct to 71% correct) between the first and second administrations of the instrument ( t test, p < 0.005). Figure 1 A indicates median pre- and postsurvey scores for students in all groups. All three groups showed statistically significant improvement in their average survey scores between the pre- and postsurvey ( Figure 1 B): 9 percentage points (from 64 to 73%) for students with previous statistics course work, 32 percentage points for those taking statistics concurrently (from 41 to 73%), and 19 points (from 43 to 62%) for those without statistics training. Each improvement was significant within individual groups ( t test, p < 0.005 in all instances). Confidence intervals indicated in Figure 1 B include Bonferroni's correction for simultaneous confidence levels.

Figure 1.

Figure 1. Pre- and postsurvey scores for groups with different levels of statistics course work preparation. Score increases were significant for students who completed statistics prior (Previous) to course enrollment (N = 130; t test, p < 0.005), students enrolled in statistics concurrently (N = 31; t test, p < 0.005), and students with no formal statistics (None) training (N = 25; t test, p < 0.005). (A) Boxplot showing score distribution of pre- and postscores for all groups. Outliers are indicated by asterisk (*). Median scores are indicated by ○. (B) Mean pre- and postscores for all, with 95% Bonferroni's confidence intervals. Mean scores are indicated by ●. Skew < 0.90, Kurtosis < 0.70 for all populations.

The normalized gain ( G ) is an accepted measure used to quantify learning in science courses. It is commonly used in physics and astronomy learning studies ( Hake, 1998 ; Hemenway et al. , 2002 ; Coletta and Phillips, 2005 ). The overall G for our study sample was 0.24 (SD = 0.44). G was calculated for groups with varied statistics preparation ( Figure 2 ). Students who had completed introductory statistics in the prior semester showed modest gains ( G = 0.16, SD = 0.46) on the survey. Both groups of students with no previous statistics at the beginning of the course showed stronger gains than the group with previous statistics. For students who took statistics concurrently, G = 0.53 (SD = 0.26) while for students who took no statistics course at all, G = 0.3 (SD 0.36). Although they posted lower scores on the presurvey, the mean postsurvey score for concurrent students was comparable with that of students who had completed the statistics prerequisite ( Figure 1 B). This suggests that students taking statistics concurrently were on par, by semester's end, with students who took statistics before entering biology. Concurrent students showed the greatest G of all groups ( Figure 2 A). It is not surprising that this group posted the largest gain of all three groups, because these students were receiving concurrent instruction in statistics in both biology and introductory statistics. The gains shown by these students therefore include gains from the biology course and gains from the statistics course.

Figure 2.

Figure 2. (A) Average normalized gains ( G ; ○) by students having completed introductory statistics (Previous, N = 130), taking introductory statistics concurrently (Concurrent, N = 31), and having no formal statistics instruction (None, N = 25). Comparisons via ANOVA, with Tukey 95% simultaneous confidence intervals indicated. (B) G (●) by male and female students in each statistics preparation category. The 95% confidence intervals with Bonferroni's corrections are indicated. Previous statistics: male, N = 63; female, N = 67. Concurrent statistics: male, N = 17; female, N = 14. No statistics: male, N = 19; female, N = 6. Skew < 0.90, Kurtosis < 0.70 for all populations.

Of the students in the study sample, 53.2% were female, which was proportional to the percentage of women in the course over two semesters (264 students; 53.5% female). Similar numbers of males and females were in the different statistics preparation groups, with the exception of students without statistics course work ( Table 2 ). We found no statistically significant difference in the performance of male and female students in this study ( Figure 2 B; G = 0.23 [SD = 0.46] for female students and G = 0.25 [SD = 0.41] for male students [ t test, p = 0.719]). Female students showed slightly higher average course grades (3.45 vs. 3.34) and GPA (3.41 vs. 3.31) than male students. These differences were not significant (ANOVA, p = 0.255 for course grades, p = 0.222 for GPA). The difference in learning gains between male and female students was still not significant even when these slight course grade and GPA differences were accounted for via covariate (ANCOVA) analysis.

We designed individual survey items to track varying aspects of students' statistics knowledge. Survey items 1–3 tested whether students could recognize commonly used graphical tools (a histogram, scatterplot, and stem-and-leaf plot). Students had little difficulty identifying a scatterplot (item 2) even without having completed college statistics, posting 98% correct responses on the initial survey ( Figure 3 ). Stem-and-leaf plots (item 3) were also easily identified, especially by semester's end (99% correct responses on final survey). Students struggled more with the distinction between a bar chart and a histogram (item 1), with 59% initially choosing “bar chart.” The bar chart question is notable both for identifying a point of student confusion, and as a more general caveat for instrument design.

Figure 3.

Figure 3. Percentage of students answering correctly on statistics pre- and postsurvey items. Item number (1–11) is shown in italics at the base of each bar. N = 186 for each item.

The confusion on this question is understandable, because histograms are technically a special case of a bar chart, and they look much alike. However, the distinction between the two is important, because both categorical data (properly displayed on bar charts) and continuous data (properly displayed on histograms) are commonly encountered in biology. Our students are taught that these are two separate types of graphs (the text used in our introductory statistics course ( Moore and McCabe, 2003 ) explicitly distinguishes the two), but on the initial survey, only 36% chose histogram as the correct answer when bar chart was also an answer choice. However, after using histograms in the introductory biology course, the number of students choosing “histogram” as the correct answer on item 1 nearly doubled (from 67 to 127). Thus, although the bar graph and histogram answer choices are both technically correct and the question needs to be refined, the data have indicated to us that further instruction, giving biology students a more thorough understanding of histograms and bar charts, is beneficial.

Three items of the survey (items 4–6) asked students to choose a graph appropriate to a given data set (box plot, histogram, or scatterplot). The items had initial correct response rates between 53 and 75%, and they showed an increase in correct responses between 12 and 26% on the posttest ( Figure 3 ). These questions indicated that students showed gains in their ability to choose the correct graphical tools to appropriately display data.

The last five survey items represented higher (analysis and evaluation) levels of Bloom's taxonomy. Three questions required students to analyze experimental data (items 7–9). Most students (88% on the presurvey) correctly analyzed quartile data (item 7). Modest gains (31%) were seen in students' ability to identify the problem of confusing causation with correlation (item 9). The largest score increase was observed on survey item 8, where students were asked to interpret the meaning of the p value statistic (from 66 to 128 correct responses, a 94% gain).

The lowest overall correct responses on the survey were observed with two evaluation questions, regarding p value and determining a causative relationship between variables (items 10 and 11). The inherent difficulty of this level of thinking was borne out by correspondingly low correct response rates. The question about causative relationships was answered correctly on both the pre- and postsurveys by ∼50% of study participants (item 10). Student confusion regarding p value is not surprising. There are at least 13 different misconceptions of p value that have been documented in the literature across mathematics education, statistics, and psychology disciplines ( Lane-Getaz, 2005 ). Research indicates that 61% of lower-division undergraduates believe a low p value indicates that the null hypothesis is true, and more than half believe a large p value means that data are statistically significant ( Lane-Getaz, 2005 ).

Item 11 also elicited a low initial correct response rate. The question was problematic, because answer choice E was ambiguous and could be interpreted as a correct response in addition to answer choice C; if both answer choices are counted, 43% of students answered item 11 correctly on the posttest (35% on pretest). Given that the two choices represent 40% of the answer choices on this question, this suggests student answers were essentially random on both pre- and postsurveys. A suggested revision for the question is shown as “Revised 11” in the Supplemental Material .

Student responses to item 8 better indicate that supplemental instruction and practice in statistics is improving statistics knowledge of biology students. In this item, students were asked to judge whether there was a statistically significant difference between two data sets with a p value of 0.06, and they were also asked to justify their answer in a short statement. Students were scored as providing a correct answer if the reasoning they provided was based on sound statistical principles. Thus, a student could argue the two populations were different (e.g., if an α of 0.10 was assumed), or they could argue the data were not significant because the p value was above an assumed α of 0.05. On the presurvey, only 35% of students gave reasonable answers, whereas that percentage almost doubled to 69% by the postsurvey. For the class as a whole, it seems that the basic understanding of how a p value is used improved, even if students could not precisely define the p value itself.

We compared learning gains between students who had previously taken a specialized statistics section emphasizing biological examples and students who completed the same introductory statistics course without biological emphasis. The G for students in the biology statistics section was more than double ( G = 0.25) that of nonbiology section statistics students that year ( G = 0.10), and also greater than G of students who completed regular introductory statistics courses in the following year ( Table 3 ). This finding could not be shown to have statistical significance (ANOVA, p = 0.584), but it may warrant further study. These preliminary data are particularly interesting because it suggests that teaching statistics in a manner that ties the material to a student's subject of interest results in better long-term retention of statistical knowledge, and it may increase students' ability to use statistical analysis within their own discipline. To determine whether the difference in G was due to an inherent difference in average ability between the students in the two groups, we compared the mean GPA for the two groups. There was no statistically significant difference between the mean GPA of the biology statistics group and the regular statistics group ( t test, p = 0.557).

Table 3 footnote: (a) differences were not statistically significant (ANOVA with Tukey's correction for multiple pairwise comparisons, p = 0.584; population skew = −0.86, Kurtosis = 0.44).

Students who were surveyed in spring 2006 were surveyed again 1 yr later, in a required sophomore-level biology course (spring 2007). Students who completed both pre- and postsurveys in 2006, and the survey again in 2007 (N = 30), showed a slightly higher G after three semesters ( G = 0.19, SD = 0.40), than after one semester ( G = 0.16, SD = 0.50). These values were not statistically distinguishable ( t test, p = 0.839), suggesting retention or slight improvement in their level of statistics knowledge with progression through the curriculum.

When our department faculty first began to reform the introductory biology curriculum using an inquiry-based learning model, we noted that students had difficulty with quantitative analysis and exhibited anxiety regarding statistical data analysis. Statistics anxiety, well documented in the classroom (e.g., Onwuegbuzie et al. , 1997 ), seems to stem from a more generalized math phobia, fear of failure, and the student's perception that statistics is disconnected from his or her real-world experience ( Pan and Tang, 2005 ). We recognized that to integrate statistics into the biology curriculum, we would need to help our students overcome their own statistics anxiety.

Starting in 2006, we modified our introductory biology course to provide students with supplementary instruction in statistics, followed by the opportunity for students to practice statistical analysis of their own data generated in inquiry-based labs ( Figure 4 ). Work in both statistics (e.g., Magel, 1996 ; Moore, 1997 ; Delucchi, 2006 ) and physics education research (e.g., Hake, 1998 ) suggests that students' analytical skills in a course improve most when students have the opportunity to practice problem solving during the semester. When statistics is emphasized in a biological or medical course context (e.g., Bahn, 1970 ; Peterson, 2000 ), student skills also seem to improve, although little work has been done on quantifying statistics learning within a biology curriculum. We therefore designed an 11-item statistics survey to quantify the effects of the statistics interventions in the biology course.

Figure 4.

Figure 4. Students work in teams to design experiments and to collect and analyze data.

The normalized gain measure ( G ) has been used in high school and college physics courses to measure the effectiveness of a course in promoting conceptual understanding. Hake (1998) reported a G of 0.23 for traditional physics (mechanics) courses using a standardized test of mechanics reasoning, the Force Concept Inventory (FCI). Higher G values (0.48–0.60) on the FCI for courses that emphasized problem-solving skills also have been reported previously ( Coletta and Phillips, 2005 ). Because we were examining the learning of statistics within a biology course, we had expectations of fairly modest gains in increasing the conceptual understanding of statistics. Although a standard, published inventory was not used in this study, a G value of 0.25, similar to G values obtained using published instruments, suggests that the survey instrument was appropriate to the material being tested.

The survey was designed so that students who had completed elementary statistics should have been able to answer a high percentage of the questions correctly without studying. Overall, students scored well enough on the pretest (58% average) to allow improvement, but not so low as to indicate that the survey was unreasonably difficult. All student groups (those who had completed statistics, those taking statistics concurrently, and students with no formal statistics course work) showed gains in statistics knowledge by the end of the semester.

It has been reported that girls and young women trail behind their male peers in mathematics achievement test performance starting at about high school ( Kiefer and Sekaquaptewa, 2007 ). Other evidence suggests that female students learn less than their male peers in college science courses and underperform on science tests (e.g., Hake, 1998 ; McCullough, 2004 ). We were therefore interested in determining whether there were gender differences in initial performance and in learning gains between male and female students on the statistics survey.

No statistically significant differences were found in the performance of male and female students on the survey ( Figure 2 ). Given that female performance is often found to be lagging behind that of male students in science disciplines, it is important that no statistically significant differences in learning, as determined by this survey instrument, were detected. It may be that learning differences disappear as women gain parity in numbers in the classroom. The rate of female participation in both the study and the introductory biology courses (∼53%) was similar to rates seen nationally. In 2001, 57% of biological science bachelor degree recipients were women, although the percentage of women in biology drops off in graduate school and into higher levels of academia ( Sible et al. , 2006 ).

We did observe that women consistently showed slightly lower G than men in all categories, even when both their average course grade and GPA were higher. Recent work indicates that although girls outperform boys in classroom mathematics work, they have until very recently trailed behind boys on standardized achievement tests, perhaps due to “stereotype threat”—the conscious or subconscious influence of negative stereotypes in our culture regarding girls' mathematics competence ( Steele et al. , 2002 ; Kenney-Benson et al. , 2006 ). It may be that stereotype threat somewhat affected women's own assessment of competence during what was perceived as a “surprise math quiz” and caused them, as a group, to slightly underperform compared with the male students.

We made a preliminary observation that a statistics course emphasizing biological examples seemed to contribute to gains in understanding of statistics in a subsequent biology course ( Table 3 ). With funding from a Howard Hughes Medical Institute education grant, we were able to offer, in fall 2005, one section of introductory statistics that heavily emphasized biological examples. The G of the student group that had previously been exposed to statistics in a biological context via this section was 2.5 times greater than the G of students who took an identical course in a section that used few, if any, biology examples.

Interestingly, survey prescores for both groups were similar. It may be that although all students initially have difficulty when presented with statistics in a biology course, it is the students who have had previous exposure to the use of statistics in biology who retain more of this knowledge upon second exposure. Berger et al. (1999) suggest that material that has been learned, and subsequently forgotten, can be reactivated with a minimal corrective intervention. Material so retrieved can then remain accessible to the student for years in the future. In our case, it seems that using statistics in introductory biology may act as a corrective intervention, where the memories that have been lost (in this case, of the use of statistics in biology) are quickly recovered when a student re-encounters the material in the biology classroom. Because gains on the survey were stronger for students completing statistics with a biology emphasis than for students completing general statistics, it may be that teaching statistics in a more integrated manner in the first place is helpful.

It must be noted that our sample size for this analysis was very small, and this phenomenon requires further investigation. Unfortunately, we were unable to institutionalize a biology-themed statistics course at MSU, and we have not been able to confirm this trend. Barriers to institutionalization of the course were twofold. One issue is the department-based organizational structure of the university. No agreement could be reached across the mathematics and biology departments as to which department would pay for and staff the course in future years. The second barrier was more philosophical. Statistics is a service course taught by the mathematical sciences department for majors across campus, including engineering, physics, business, chemistry, and many biological sciences. It was felt that if efforts were made to make special sections for biology, it would lead to a demand for special sections for all majors and would make scheduling and administration of this already complex course (approximately 40 sections and 1600 students per year) unmanageable.

For statistical learning to continue to be useful to students after the completion of introductory statistics, this knowledge needs to be stabilized to allow long-term access. Retesting of our students 1 yr after the completion of introductory biology indicated that they had retained their statistics knowledge at levels at or above where they were after the completion of introductory biology. Research on the nature of memory indicates that repeated exposure to material is useful in long-term retention of knowledge ( Bjork, 1988 ; Bahrick and Hall, 1991a , b ). Bahrick and Hall (2005) show that practice sessions spaced apart are critical for long-term retention of knowledge. This may be because intervals between use of the knowledge means that retrieval failures will occur, which help students identify gaps in knowledge or understanding that can then be rectified. It is possible that students initially experienced retrieval failures upon entering biology (students posted low survey scores at the start of the biology course), followed by relearning during the semester. Such relearning may have allowed students to retain their statistics knowledge over longer time frames.

Beyond looking at overall trends on the survey, analysis of student responses to individual questions has been enlightening. For example, survey results indicated that the concept of p value is difficult for our students, but also that students improved their understanding of p value in the biology course. In contrast, there was almost no improvement on item 10 (an evaluative question on the difference between causation and correlation), indicating that we did not teach this concept well and need to make curricular changes. The essentially random responses on item 11 as administered indicate that the question was poorly crafted and not valuable in assessing student learning. This item needs to be improved in future versions of the instrument, ideally incorporating the most common misconceptions as foils. The results from this initial pilot of the survey will also help to improve the instrument, and a better understanding of the strengths and weaknesses in student understanding of statistics will allow us to better tailor course content to the needs of our students in future semesters.

A final issue of note is the problem in classroom research of capturing strugglers—students who are not fully engaged in the course and less likely to complete the study than their more successful peers. In this study, pre- and postsurvey data were obtained for only 70% of students completing the course, and nonparticipating students had significantly lower GPAs and course grades than participating students. Obtaining data on nonparticipating students is a problem inherent to classroom research; the population is nonrandom and classroom circumstances must be considered in the research design. Possible solutions include attempting to catch students in subsequent class periods or providing the opportunity for students to take posttests at their convenience (e.g., online). However, both of these methods will likely suffer from the same lack of student participation, and additionally add variables (e.g., time on task or advance notice) that may confound the data. Embedding the postsurvey in a final exam is most likely to capture all the students who complete the course. However, even this tactic may not be appropriate, depending on the scope of the instrument or the design of the course, and also it will not capture students who fail to finish the class. Accounting for the influence of struggling students on the course population should be given the fullest consideration in study design; data from such studies will be richer and better indicate how courses can be improved to serve every student who walks into the classroom.

CONCLUSIONS

Students in an introductory biology course showed statistically significant gains in their understanding of statistics when given the means and opportunity to practice data gathering and analysis of biological experiments. Learning gains were measured using a new instrument designed specifically to measure the students' knowledge of statistics in the biology classroom. The work provides a model for strengthening the statistical analysis skills of biology majors.

ACKNOWLEDGMENTS

We are grateful to the students of Biology 213 and 215 for support of the study and positive suggestions for improving the courses. We thank Clarissa Dirks (Evergreen College) and Matthew Cunningham (University of Washington) for project brainstorming and initial review of the statistics instrument; Alex Dimitrov for the initial outline and final review of the Statistics Primer used in the introductory biology sequence; Gwen Jacobs, Cathy Cripps, Steven Kalinowski, and Mark Taper for Biology 213 and 215 course design and help in implementing the study; and Martha Peters and Cali Morrison for aid in completing the study. This research was supported in part by a grant to MSU from the Howard Hughes Medical Institute through the Undergraduate Science Education Program, and it was performed with the review and approval of the Institutional Review Board for the Protection of Human Subjects at MSU Bozeman (exempt status).

  • A'Brook R., Weyers J.D.B. (1996). Teaching of statistics to UK undergraduate biology students in 1995. J. Biol. Educ. 30, 281-288.
  • Bahn A. K. (1970). Goals and methods of evaluating the teaching of biostatistics to medical students. Arch. Environ. Health 21, 559-568.
  • Bahrick H. P., Hall L. K. (1991a). Preventive and corrective maintenance of access to knowledge. Appl. Cogn. Psychol. 5, 1-18.
  • Bahrick H. P., Hall L. K. (1991b). Lifetime maintenance of high school mathematics content. J. Exp. Psychol. Gen. 120, 20-33.
  • Bahrick H. P., Hall L. K. (2005). The importance of retrieval failures to long-term retention: a metacognitive explanation of the spacing effect. J. Mem. Lang. 52, 566-577.
  • Berger S. A., Hall L. K., Bahrick H. P. (1999). Stabilizing access to marginal and submarginal knowledge. J. Exp. Psychol. Appl. 5, 438-447.
  • Bjork R. A. (1988). Retrieval practice and the maintenance of knowledge. In: Practical Aspects of Memory: Current Research and Issues, Volume 1, Memory of Everyday Life, ed. M. M. Gruneberg, P. E. Morris, and R. N. Sykes. New York: Wiley, 396-401.
  • Bloom B. S. (1956). Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain. New York: Longman.
  • Budnitz N., Guentensberger T. (2001). Find Your Peanut. Durham, NC: Center for Inquiry-based Learning, Duke University. www.biology.duke.edu/cibl/exercised/find_your_peanut.htm (accessed 10 July 2007).
  • Coletta V. P., Phillips J. A. (2005). Interpreting FCI scores: normalized gain, preinstruction scores and scientific reasoning ability. Am. J. Phys. 73, 1172-1182.
  • Delucchi M. (2006). The efficacy of collaborative learning groups in an undergraduate statistics course. Coll. Teach. 54, 244-248.
  • Garfield J. B. (2003). Assessing statistical reasoning. Stat. Educ. Res. J. 2, 22-38.
  • Garfield J., Hogg B., Schau C., Whittinghill D. (2002). First courses in statistical science: the status of educational reform efforts. J. Stat. Educ. 10(2). www.amstat.org/publications/jse/v10n2/garfield.html (accessed 12 November 2007).
  • Guidelines for the Assessment and Instruction in Statistics Education (2005). American Statistical Association Guidelines for the Assessment and Instruction in Statistics Education (GAISE) Project, February 2005. www.amstat.org/education/gaise/ (accessed 12 November 2007).
  • Hake R. R. (1998). Interactive-engagement versus traditional methods; a six thousand-student survey of mechanics test data for introductory physics courses. Am. J. Phys. 66, 67-74.
  • Hemenway M. K., Straits W. J., Wilke R. R., Hufnagel B. (2002). Educational research in an introductory astronomy course. Innov. High. Educ. 26, 271-280.
  • Horgan G. W., Elston D. A., Franklin M. F., Glasbey C. A., Hunter E. A., Talbot M., Kempton R. A., McNicol J. W., Wright F. (1999). Teaching statistics to biological research scientists. Statistician 48, 393-400.
  • Kaplan R. M., Saccuzzo D. P. (1997). Psychological Testing: Principles, Application and Issues, 6th ed. Belmont, CA: Wadsworth Publishing.
  • Kenney-Benson G. A., Pomerantz E. M., Ryan A. M., Patrick H. (2006). Sex differences in math performance: the role of children's approach to schoolwork. Dev. Psychol. 42, 11-26.
  • Kiefer A. K., Sekaquaptewa D. (2007). Implicit stereotypes, gender identification and math-related outcomes: a prospective study of female college students. Psychol. Sci. 18, 13-18.
  • Lane-Getaz S. J. (2005). Summary of P-value survey research. 2005 United States Conference on Teaching Statistics (USCOTS), 19–21 May 2005, Columbus, OH: CAUSE, Department of Statistics, The Ohio State University. www.causeweb.org/uscots/uscots05/spotlight/files/R6.pdf
  • Luckie D. B., Maleszweski J. J., Loznak S. D., Krha M. (2004). Infusion of collaborative inquiry throughout a biology curriculum increases student learning: a four-year study of "teams and streams." Adv. Physiol. Educ. 287, 199-209.
  • Magel R. C. (1996). Increasing participation in large introductory statistics classes. Am. Stat. 50, 51-56.
  • Maret T. J., Ziemba R. E. (1997). Statistics and hypothesis testing in biology: teaching students the relationship between statistical tests and scientific hypotheses. J. Coll. Sci. Teach. 26, 283-285.
  • McCullough L. (2004). Gender, context and physics assessment. J. Int. Womens Stud. 5, 20-30.
  • Moore D. S. (1997). New pedagogy and new content: the case of statistics. Int. Stat. Rev. 65, 123-165.
  • Moore D. S., McCabe G. P. (2003). Introduction to the Practice of Statistics, 4th ed. New York: Freeman and Co.
  • National Research Council (2003). BIO 2010: Transforming Undergraduate Education for Future Research Biologists. Committee on Undergraduate Education to Prepare Research Scientists for the 21st Century. Washington, DC: National Academies Press.
  • Onwuegbuzie A. J., DaRos J., Ryan J. (1997). The components of statistics anxiety: a phenomenological study. Focus Learn. Probl. Math. 19, 11-35.
  • Pan W., Tang M. (2005). Students' perceptions on factors of statistics anxiety and instructional strategies. J. Instr. Psychol. 32, 205-214.
  • Peterson C. E. (2000). An experimental project approach to biology. J. Coll. Sci. Teach. 30, 162-165.
  • Sible J. C., Wilhelm D. E., Lederman M. (2006). Teaching cellular and molecular biology for gender equity. CBE Life Sci. Educ. 5, 227-238.
  • Steele C. M., Spencer S. J., Aronson J. (2002). Contending with group image: the psychology of stereotype and social identity threat. In: Advances in Experimental Social Psychology, Vol. 34, ed. M. Zanna. New York: Academic Press, 379-340.

Submitted: 12 July 2007; Revised: 26 November 2007; Accepted: 15 April 2008

© 2008 by The American Society for Cell Biology

Enago Academy

Effective Use of Statistics in Research – Methods and Tools for Data Analysis


Remember that sinking feeling you get when you are asked to analyze your data? Once you have collected all the required raw data, you still need to test your hypothesis statistically. Presenting your numerical data with sound statistics will also help break the stereotype of the biology student who can't do math.

Statistical methods are essential to scientific research. They run through the entire research process, from planning and design through data collection, analysis, interpretation, and reporting of findings. Without statistical analysis, the results of a research project remain meaningless raw data, so sound statistics are necessary to justify research conclusions. In this article, we discuss how statistical methods can help draw meaningful conclusions from biological studies.


Role of Statistics in Biological Research

Statistics is the branch of science concerned with collecting, organizing, and analyzing data so that conclusions drawn from a sample can be extended to the whole population. It helps in designing a study more rigorously and provides the logical framework for evaluating a hypothesis. Biology deals with living organisms and their complex, dynamic pathways, which rarely lend themselves to simple yes-or-no reasoning; statistics describes and explains the patterns in such studies, given the sample sizes used. In short, statistics reveals the trends present in the data a study has collected.

Biological researchers often neglect statistics during research planning and turn to statistical tools only at the end of the experiment. This can produce a complicated set of results that no statistical tool can easily rescue. Considering statistics from the outset lets a researcher approach the study in a stepwise manner:

1. Establishing a Sample Size

A biological experiment usually begins with choosing the samples and deciding on the number of replicates. Basic statistical principles, such as randomization and the law of large numbers, guide these choices. Selecting an adequate sample size drawn at random from a larger pool makes the findings easier to extrapolate and reduces experimental bias and error.
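
As an illustrative sketch only (not from the original article), and assuming a Python environment with the statsmodels package, a prospective sample-size calculation for comparing two group means might look like the following; the effect size, alpha, and power values are hypothetical placeholders:

```python
# Minimal sketch (assumes statsmodels is installed): how many samples per
# group are needed to detect a given effect with a two-sample t-test?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,        # hypothetical standardized effect size (Cohen's d)
    alpha=0.05,             # two-sided significance level
    power=0.8,              # desired probability of detecting the effect
    alternative="two-sided",
)
print(f"~{n_per_group:.0f} samples per group needed")  # roughly 26 here
```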

2. Testing of Hypothesis

When working with a large sample pool, biological researchers must establish whether a conclusion is statistically significant. To do this, a researcher states a hypothesis before examining the distribution of the data. Statistics then helps interpret whether observations cluster near the mean or are spread widely across the distribution, and these patterns form the basis for testing the hypothesis.
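
As a minimal sketch (not part of the original article), and assuming Python with NumPy and SciPy, a two-sample t-test on simulated, purely hypothetical measurements could be run as follows:

```python
# Minimal sketch: do two groups differ in their mean measurement?
# The data below are simulated placeholders, not real observations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=30)
treated = rng.normal(loc=11.5, scale=2.0, size=30)

result = stats.ttest_ind(control, treated)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# A p-value below the chosen alpha (commonly 0.05) is taken as evidence that
# the difference in means is unlikely to be due to chance alone.
```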

3. Data Interpretation Through Analysis

When dealing with large datasets, statistical analysis helps researchers draw sound conclusions from their experiments and observations. Judging a study by eye alone can be misleading; a thorough statistical analysis accounts for the relevant summary measures and the variance in the sample, providing a more detailed interpretation of the data and solid evidence to support the conclusions.

Types of Statistical Research Methods That Aid in Data Analysis


Statistical analysis is the process of examining sample data for patterns or trends that help researchers anticipate outcomes and reach appropriate conclusions. Depending on the question and the data, statistical analyses fall into the following types:

1. Descriptive Analysis

Descriptive statistical analysis organizes and summarizes large datasets into graphs and tables. It involves procedures such as tabulation, measures of central tendency, measures of dispersion or variance, and measures of skewness.
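
For example (a sketch with hypothetical numbers, assuming Python with pandas and SciPy), the common descriptive measures can be computed in a few lines:

```python
# Minimal sketch: central tendency, dispersion, and skewness of one variable.
import pandas as pd
from scipy import stats

weights = pd.Series([2.1, 2.4, 2.2, 2.9, 3.1, 2.5, 2.7, 2.3, 2.6, 2.8])  # hypothetical

print("mean:  ", weights.mean())        # central tendency
print("median:", weights.median())
print("sd:    ", weights.std())         # dispersion (sample SD, n - 1)
print("skew:  ", stats.skew(weights))   # asymmetry of the distribution
```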

2. Inferential Analysis

Inferential statistical analysis extrapolates from data acquired from a small sample to the complete population. It allows conclusions and decisions about the whole population to be made on the basis of sample data, and it is the method of choice for projects that work with small samples but aim to generalize to a large population.
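
As a small illustration (hypothetical values, assuming Python with NumPy and SciPy), a 95% confidence interval for a population mean can be inferred from a small sample:

```python
# Minimal sketch: inferring a population mean from a small sample via a
# t-based 95% confidence interval.
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.4, 4.9, 5.2, 5.0, 5.3, 4.7])  # hypothetical
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```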

3. Predictive Analysis

Predictive analysis uses existing data to forecast future events. It is widely used by marketing companies, insurance organizations, online service providers, data-driven marketing teams, and financial corporations.
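
In a research setting, even a simple fitted model can be used predictively; the sketch below (hypothetical data, assuming Python with NumPy and SciPy) fits a linear regression and forecasts the response at a new predictor value:

```python
# Minimal sketch: fit a linear model to observed data and predict the
# response at an unobserved predictor value (all numbers hypothetical).
import numpy as np
from scipy import stats

dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
growth = np.array([1.0, 1.4, 2.1, 2.4, 3.0, 3.3])

fit = stats.linregress(dose, growth)
new_dose = 6.0
predicted = fit.intercept + fit.slope * new_dose
print(f"predicted growth at dose {new_dose}: {predicted:.2f} "
      f"(R^2 = {fit.rvalue**2:.2f})")
```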

4. Prescriptive Analysis

Prescriptive analysis examines data to determine what should be done next. It is widely used in business analytics to identify the best possible course of action for a given situation. It is closely related to descriptive and predictive analysis but goes further by recommending a choice among the available options.

5. Exploratory Data Analysis

Exploratory data analysis (EDA) is generally the first step in the data analysis process, carried out before any other statistical technique. It focuses on examining patterns in the data to recognize potential relationships, discover unknown associations, inspect missing values, and extract as much insight as possible from the dataset.
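
A typical first pass might look like the following sketch (hypothetical data and column names, assuming Python with pandas): summarize each variable, count missing values, and screen pairwise correlations before committing to any formal test:

```python
# Minimal sketch: an exploratory first look at a small, hypothetical dataset.
import pandas as pd

df = pd.DataFrame({
    "body_length": [1.1, 1.3, None, 1.2, 1.4],   # one value missing
    "brood_size":  [250, 270, 240, None, 300],
    "lifespan":    [18, 20, 17, 19, 22],
})

print(df.describe())    # per-column summary statistics
print(df.isna().sum())  # number of missing values in each column
print(df.corr())        # pairwise correlations among numeric columns
```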

6. Causal Analysis

Causal analysis seeks to understand why things happen the way they do. It helps identify the root cause of a failure, or more generally the underlying reason an outcome occurs; for example, it asks what will happen to one variable if another variable changes.

7. Mechanistic Analysis

Mechanistic analysis is the least common type of statistical analysis. It is used in big data analytics and the biological sciences, and it aims to understand exactly how changes in one variable cause corresponding changes in another, while excluding external influences.

Important Statistical Tools In Research

Researchers in the biological sciences often regard statistical analysis as the most daunting part of completing a project. However, statistical software can help them decide what to do with their data and how to interpret the results, making the process as painless as possible.

1. Statistical Package for Social Science (SPSS)

SPSS is a widely used software package in human behavior research. It can compile descriptive statistics and produce graphical depictions of results, and it includes the option to create scripts that automate analyses or carry out more advanced statistical processing.

2. R Foundation for Statistical Computing

R is used in human behavior research and many other fields. It is a powerful tool, but it has a steep learning curve and requires a certain amount of coding. It also has an active community engaged in building and enhancing the software and its associated packages.

3. MATLAB (The Mathworks)

MATLAB is both an analytical platform and a programming language. Researchers and engineers use it to write their own code tailored to their research question. Although MATLAB can be difficult for novices, it offers great flexibility in terms of what the researcher needs.

4. Microsoft Excel

Although not the most powerful solution for statistical analysis in research, MS Excel offers a wide variety of tools for data visualization and simple statistics. It makes it easy to generate summaries and customizable graphs and figures, and it is the most accessible option for those just starting out with statistics.

5. Statistical Analysis Software (SAS)

SAS is a statistical platform used in business, healthcare, and human behavior research alike. It can carry out advanced analyses and produce publication-quality figures, tables, and charts.

6. GraphPad Prism

GraphPad Prism is a commercial package used primarily by biology researchers, although it is versatile enough for many other fields. Similar to SPSS, GraphPad offers scripting options to automate analyses and carry out complex statistical calculations.

7. Minitab

Minitab offers both basic and advanced statistical tools for data analysis. Similar to GraphPad and SPSS, it can automate analyses through scripting, which requires some command of coding.

Use of Statistical Tools In Research and Data Analysis

Many biological studies generate large datasets from which trends and patterns must be extracted. Statistical tools are essential here because they manage these large data sets and make data processing far more convenient.

Following the stepwise approach described above helps biological researchers present their statistics clearly, develop accurate hypotheses, and choose the correct tools for testing them.

A range of statistical tools can help researchers manage their data and improve the outcome of their work through better interpretation. Which tool to use depends on the research question, your knowledge of statistics, and your experience with coding.

Have you faced challenges while using statistics in research? How did you manage it? Did you use any of the statistical tools to help you with your research data? Do write to us or comment below!



  • Books Received
  • Published: 12 June 1943

Statistical Analysis in Biology

  • J. B. S. HALDANE  

Nature volume 151, pages 654–655 (1943)

DR. MATHER'S book begins with the remarkable statement that "Statistics is the mathematics of experiment". In fact, statistics originated in the spheres of demography and economics, where experiment is difficult, dangerous, and rarely scientific; and among their most successful spheres of application is astronomy, where it is wholly impossible. Perhaps it is for this reason that Dr. Mather completely ignores the construction and interpretation of life-tables, though these are not merely fundamental for human biology, but are assuming an ever-growing importance in animal biology also. Nor is there any suggestion for the treatment of statistics of fertility, such as are given in Salisbury's "The Reproductive Capacity of Plants". Yet any theory of evolution must be based on such data as these.

By Dr. K. Mather. Pp. iv+248. (London: Methuen and Co., Ltd., 1943.) 16s. net.


HALDANE, J. Statistical Analysis in Biology. Nature 151, 654–655 (1943). https://doi.org/10.1038/151654a0

