
Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans. Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test.
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women.
  • Ha: Men are, on average, taller than women.


For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p-value. This means that a difference this large would be unlikely to arise by chance if the null hypothesis were true.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p-value. This means that a difference of the size you measured could plausibly arise by chance alone, even if there were no real difference between the groups.
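The comparison of between-group and within-group variance can be made concrete with the F ratio from a one-way analysis of variance, which is one test built on this idea. The sketch below is only an illustration: the two small datasets are invented, and the function name `f_ratio` is an assumption rather than anything from the article.

```python
from statistics import mean

def f_ratio(groups):
    """Return the one-way ANOVA F ratio: between-group mean square
    divided by within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # How far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # How spread out the observations are inside their own group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Invented data: tight groups that are far apart vs. wide groups that overlap.
separated = [[10, 11, 12], [20, 21, 22]]
overlapping = [[5, 15, 25], [6, 16, 26]]

f_separated = f_ratio(separated)
f_overlapping = f_ratio(overlapping)
print(f_separated, f_overlapping)
```

A large F ratio corresponds to a low p-value and a small one to a high p-value; turning F into a p-value requires the F distribution, which this sketch leaves out.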

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data.

In the height example, a statistical test comparing the two groups would give you:

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see a difference at least this large if the null hypothesis of no difference is true.
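As a rough illustration of these two outputs, the sketch below estimates the difference in mean height and computes a one-sided p-value with an exact permutation test. This is a swapped-in technique chosen because it needs only the standard library; it is not necessarily the test the article has in mind. The height values are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical heights in centimetres, invented for illustration.
men = [178, 182, 175, 180, 185, 177]
women = [165, 170, 162, 168, 172, 166]

# Output 1: the estimated difference in average height.
observed_diff = mean(men) - mean(women)

# Output 2: a p-value. Under H0 the group labels are interchangeable, so we
# enumerate every way of labelling 6 of the 12 people as "men" and count how
# often the difference in means is at least as large as the observed one.
pooled = men + women
n_men = len(men)
n_women = len(pooled) - n_men
total = sum(pooled)

n_splits = 0
count_extreme = 0
for idx in combinations(range(len(pooled)), n_men):
    sum_a = sum(pooled[i] for i in idx)
    diff = sum_a / n_men - (total - sum_a) / n_women
    n_splits += 1
    if diff >= observed_diff - 1e-9:  # small tolerance for float rounding
        count_extreme += 1

p_value = count_extreme / n_splits
print(f"difference = {observed_diff:.2f} cm, one-sided p = {p_value:.5f}")
```

Because every hypothetical man is taller than every hypothetical woman, only the observed labelling reaches the observed difference, so the p-value here is as small as this design allows.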

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p-value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This reduces the risk of incorrectly rejecting the null hypothesis (Type I error).

The results of hypothesis testing will be presented in the results and discussion sections of your research paper, dissertation or thesis.

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p-value). In the discussion, you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis. This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis. But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis.


Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing. The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.


Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. Retrieved March 20, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/


Module 8: Inference for One Proportion

Introduction to Hypothesis Testing

What you’ll learn to do: Given a claim about a population, construct an appropriate set of hypotheses to test and properly interpret P-values and type I / II errors.

Hypothesis testing is part of inference. Given a claim about a population, we will learn to determine the null and alternative hypotheses. We will recognize the logic behind a hypothesis test and how it relates to the P-value as well as recognizing type I and type II errors. These are powerful tools in exploring and understanding data in real life.


  • Concepts in Statistics. Provided by : Open Learning Initiative. Located at : http://oli.cmu.edu . License : CC BY: Attribution
  • Inferential Statistics Decision Making Table. Provided by : Wikimedia Commons: Adapted by Lumen Learning. Located at : https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Inferential_Statistics_Decision_Making_Table.png/120px-Inferential_Statistics_Decision_Making_Table.png . License : CC BY: Attribution


6.6 - Confidence Intervals & Hypothesis Testing

Confidence intervals and hypothesis tests are similar in that they are both inferential methods that rely on an approximated sampling distribution. Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to test a specified hypothesis. Hypothesis testing requires that we have a hypothesized parameter. 

The simulation methods used to construct bootstrap distributions and randomization distributions are similar. One primary difference is a bootstrap distribution is centered on the observed sample statistic while a randomization distribution is centered on the value in the null hypothesis. 

In Lesson 4, we learned confidence intervals contain a range of reasonable estimates of the population parameter. All of the confidence intervals we constructed in this course were two-tailed. These two-tailed confidence intervals go hand-in-hand with the two-tailed hypothesis tests we learned in Lesson 5. The conclusion drawn from a two-tailed confidence interval is usually the same as the conclusion drawn from a two-tailed hypothesis test. In other words, if the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always reject the null hypothesis.

Example: Mean

This example uses the Body Temperature dataset built in to StatKey for constructing a bootstrap confidence interval and conducting a randomization test.

Let's start by constructing a 95% confidence interval using the percentile method in StatKey:

  

The 95% confidence interval for the mean body temperature in the population is [98.044, 98.474].

Now, what if we want to know if there is enough evidence that the mean body temperature is different from 98.6 degrees? We can conduct a hypothesis test. Because 98.6 is not contained within the 95% confidence interval, it is not a reasonable estimate of the population mean. We should expect to have a p value less than 0.05 and to reject the null hypothesis.

\(H_0: \mu=98.6\)

\(H_a: \mu \ne 98.6\)

\(p = 2*0.00080=0.00160\)

\(p \leq 0.05\), reject the null hypothesis

There is evidence that the population mean is different from 98.6 degrees. 
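The two procedures in this example can be sketched without StatKey. The code below is a minimal sketch under stated assumptions: the `temps` values are invented and are not the Body Temperature dataset, and the randomization step (shifting the sample so that its mean equals the null value, then resampling with replacement) is one common way to build a randomization distribution for a single mean, assumed here rather than taken from the lesson.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical body temperatures in degrees Fahrenheit (invented data).
temps = [97.4, 97.8, 98.0, 98.1, 98.2, 98.2, 98.3, 98.4, 98.6, 98.7,
         97.6, 97.9, 98.0, 98.2, 98.3, 98.5, 98.8, 99.0, 97.7, 98.1]
null_mean = 98.6
n_resamples = 5000

def average(values):
    return sum(values) / len(values)

def resample_mean(values):
    """Mean of one resample drawn with replacement, same size as the data."""
    return average(random.choices(values, k=len(values)))

sample_mean = average(temps)

# Bootstrap distribution: centered on the observed sample statistic.
boot_means = sorted(resample_mean(temps) for _ in range(n_resamples))
ci_low = boot_means[int(0.025 * n_resamples)]        # 2.5th percentile
ci_high = boot_means[int(0.975 * n_resamples) - 1]   # 97.5th percentile

# Randomization distribution: centered on the value in the null hypothesis.
shifted = [t + (null_mean - sample_mean) for t in temps]
observed_gap = abs(sample_mean - null_mean)
n_extreme = sum(
    abs(resample_mean(shifted) - null_mean) >= observed_gap
    for _ in range(n_resamples)
)
p_value = n_extreme / n_resamples  # two-tailed

print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}], p = {p_value:.4f}")
```

With these invented values the interval lies below 98.6 and the p-value is small, so the two methods lead to the same conclusion, which mirrors the pattern described for the StatKey output.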

Selecting the Appropriate Procedure

The decision of whether to use a confidence interval or a hypothesis test depends on the research question. If we want to estimate a population parameter, we use a confidence interval. If we are given a specific population parameter (i.e., hypothesized value), and want to determine the likelihood that a population with that parameter would produce a sample as different as our sample, we use a hypothesis test. Below are a few examples of selecting the appropriate procedure. 

Example: Cheese Consumption

Research question: How much cheese (in pounds) does an average American adult consume annually? 

What is the appropriate inferential procedure? 

Cheese consumption, in pounds, is a quantitative variable. We have one group: American adults. We are not given a specific value to test, so the appropriate procedure here is a  confidence interval for a single mean .

Example: Age

Research question:  Is the average age in the population of all STAT 200 students greater than 30 years?

There is one group: STAT 200 students. The variable of interest is age in years, which is quantitative. The research question includes a specific population parameter to test: 30 years. The appropriate procedure is a  hypothesis test for a single mean .

Try it!

For each research question, identify the variables and the parameter of interest, and decide on the appropriate inferential procedure.

Research question:  How strong is the correlation between height (in inches) and weight (in pounds) in American teenagers?

There are two variables of interest: (1) height in inches and (2) weight in pounds. Both are quantitative variables. The parameter of interest is the correlation between these two variables.

We are not given a specific correlation to test. We are being asked to estimate the strength of the correlation. The appropriate procedure here is a  confidence interval for a correlation . 

Research question:  Are the majority of registered voters planning to vote in the next presidential election?

The parameter that is being tested here is a single proportion. We have one group: registered voters. "The majority" would be more than 50%, or p>0.50. This is a specific parameter that we are testing. The appropriate procedure here is a  hypothesis test for a single proportion .

Research question:  On average, are STAT 200 students younger than STAT 500 students?

We have two independent groups: STAT 200 students and STAT 500 students. We are comparing them in terms of average (i.e., mean) age.

If STAT 200 students are younger than STAT 500 students, that translates to \(\mu_{200}<\mu_{500}\) which is an alternative hypothesis. This could also be written as \(\mu_{200}-\mu_{500}<0\), where 0 is a specific population parameter that we are testing. 

The appropriate procedure here is a  hypothesis test for the difference in two means .

Research question:  On average, how much taller are adult male giraffes compared to adult female giraffes?

There are two groups: males and females. The response variable is height, which is quantitative. We are not given a specific parameter to test, instead we are asked to estimate "how much" taller males are than females. The appropriate procedure is a  confidence interval for the difference in two means .

Research question:  Are STAT 500 students more likely than STAT 200 students to be employed full-time?

There are two independent groups: STAT 500 students and STAT 200 students. The response variable is full-time employment status which is categorical with two levels: yes/no.

If STAT 500 students are more likely than STAT 200 students to be employed full-time, that translates to \(p_{500}>p_{200}\) which is an alternative hypothesis. This could also be written as \(p_{500}-p_{200}>0\), where 0 is a specific parameter that we are testing. The appropriate procedure is a  hypothesis test for the difference in two proportions.

Research question: Is there a relationship between outdoor temperature (in Fahrenheit) and coffee sales (in cups per day)?

There are two variables here: (1) temperature in Fahrenheit and (2) cups of coffee sold in a day. Both variables are quantitative. The parameter of interest is the correlation between these two variables.

If there is a relationship between the variables, that means that the correlation is different from zero. This is a specific parameter that we are testing. The appropriate procedure is a  hypothesis test for a correlation . 
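The rule applied in all of these examples comes down to one question: does the research question supply a specific value to test? The helper below is a hypothetical sketch of that rule; the function name and its arguments are assumptions made for illustration.

```python
def choose_procedure(parameter, hypothesized_value=None):
    """Pick the inferential procedure for a research question.

    parameter: description of the parameter of interest, e.g. "a single mean".
    hypothesized_value: the specific value the question asks us to test,
        or None when the question only asks for an estimate.
    """
    # Compare against None explicitly: a hypothesized value of 0 (for example
    # "difference in means = 0") is still a value to test.
    if hypothesized_value is None:
        return f"confidence interval for {parameter}"
    return f"hypothesis test for {parameter}"

# Examples taken from the research questions above.
cheese = choose_procedure("a single mean")                        # estimate only
voters = choose_procedure("a single proportion", 0.50)            # majority: p > 0.50
ages = choose_procedure("the difference in two means", 0)         # mu_200 - mu_500 < 0
giraffes = choose_procedure("the difference in two means")        # "how much taller"
print(cheese, voters, ages, giraffes, sep="\n")
```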


10.26: Hypothesis Test for a Population Mean (5 of 5)


Learning Objectives

  • Interpret the P-value as a conditional probability.

We finish our discussion of the hypothesis test for a population mean with a review of the meaning of the P-value, along with a review of type I and type II errors.

Review of the Meaning of the P-value

At this point, we assume you know how to use a P-value to make a decision in a hypothesis test. The logic is always the same. If we pick a level of significance (α), then we compare the P-value to α.

  • If the P-value ≤ α, reject the null hypothesis. The data supports the alternative hypothesis.
  • If the P-value > α, do not reject the null hypothesis. The data is not strong enough to support the alternative hypothesis.
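The two rules above amount to a single comparison. The function below is a minimal sketch of that comparison; the name `decide` and the returned strings are assumptions made for illustration.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the P-value is at most alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

# Illustrative P-values at the 5% level.
decision_a = decide(0.082)  # above alpha
decision_b = decide(0.04)   # below alpha
decision_c = decide(0.05)   # exactly alpha counts as "P-value <= alpha"
print(decision_a, decision_b, decision_c, sep="\n")
```

Note that the boundary case is decided by the "≤" in the first rule, and that the significance level has to be chosen before looking at the P-value.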

In fact, we find that we treat these as “rules” and apply them without thinking about what the P-value means. So let’s pause here and review the meaning of the P-value, since it is the connection between probability and decision-making in inference.

Birth Weights in a Town

Let’s return to the familiar context of birth weights for babies in a town. Suppose that babies in the town had a mean birth weight of 3,500 grams in 2010. This year, a random sample of 50 babies has a mean weight of about 3,400 grams with a standard deviation of about 500 grams. Here is the distribution of birth weights in the sample.

Dot plot of birth weights, ranging from around 2,000 grams to 4,000 grams.

Obviously, this sample weighs less on average than the population of babies in the town in 2010. A decrease in the town’s mean birth weight could indicate a decline in overall health of the town. But does this sample give strong evidence that the town’s mean birth weight is less than 3,500 grams this year?

We now know how to answer this question with a hypothesis test. Let’s use a significance level of 5%.

Let μ = mean birth weight in the town this year. The null hypothesis says there is “no change from 2010.”

  • H0: μ = 3,500
  • Ha: μ < 3,500

Since the sample is large, we can conduct the T-test without worrying about the shape of the distribution of birth weights for individual babies.

Statistical software tells us the P-value is 0.082 = 8.2%. Since the P-value is greater than 0.05, we fail to reject the null hypothesis.

Our conclusion: This sample does not suggest that the mean birth weight this year is less than 3,500 grams ( P -value = 0.082). The sample from this year has a mean of 3,400 grams, which is 100 grams lower than the mean in 2010. But this difference is not statistically significant. It can be explained by the chance fluctuation we expect to see in random sampling.
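The P-value reported by the software can be reproduced from the summary statistics alone. The sketch below computes the T statistic and the left-tail area of the T-model with 49 degrees of freedom. To stay within the standard library it integrates the t density numerically with Simpson's rule, which is an implementation choice made here, not the method any particular statistical package uses.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry about 0.
    steps must be even."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

# Summary statistics from the birth weight example.
n = 50
sample_mean = 3400.0
sample_sd = 500.0
mu0 = 3500.0  # value stated in the null hypothesis

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error
p_value = t_cdf(t_stat, df=n - 1)  # left tail, because Ha is "less than"

print(f"T = {t_stat:.2f}, P-value = {p_value:.3f}")
```

The result should agree with the values quoted in the text, T of about −1.41 and a P-value of about 0.082, up to rounding.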

What Does the P-Value of 0.082 Tell Us?

A simulation can help us understand the P-value. In a simulation, we assume that the population mean is 3,500 grams. This is the null hypothesis. We assume the null hypothesis is true and select 1,000 random samples from a population with a mean of 3,500 grams. The mean of the sampling distribution is at 3,500 (as predicted by the null hypothesis). We see this in the simulated sampling distribution.

If the mean = 3,500 then 86 out of the 1,000 random samples have a sample mean less than 3,400. This is 0.086 = 8.6%

In the simulation, we can see that about 8.6% of the samples have a mean less than 3,400. Since probability is the relative frequency of an event in the long run, we say there is an 8.6% chance that a random sample of 50 babies has a mean less than 3,400 if the population mean is 3,500. We can see that the corresponding area to the left of T = −1.41 in the T-model (with df = 49) also gives us a good estimate of the probability. This area is the P-value, about 8.2%.

If we generalize this statement, we say the P-value is the probability that random samples have results more extreme than the data if the null hypothesis is true. (By more extreme, we mean farther from the hypothesized value of the parameter, in the direction of the alternative hypothesis.) We can also describe the P-value in terms of T-scores. The P-value is the probability that the test statistic from a random sample has a value more extreme than that associated with the data if the null hypothesis is true.
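A simulation of this kind is easy to sketch. The code below draws 1,000 random samples of 50 birth weights from a population in which the null hypothesis is true and counts how many sample means fall below 3,400 grams. Two assumptions are made here that the text does not state: the population is taken to be normal, and its standard deviation is taken to be 500 grams (borrowed from the sample).

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

pop_mean = 3500.0      # the null hypothesis is assumed true
pop_sd = 500.0         # assumption: population SD equals the sample SD
sample_size = 50
n_samples = 1000
observed_mean = 3400.0

count_below = 0
for _ in range(n_samples):
    sample = [random.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
    if sum(sample) / sample_size < observed_mean:
        count_below += 1

proportion_below = count_below / n_samples
print(f"{count_below} of {n_samples} sample means are below {observed_mean:.0f} "
      f"({proportion_below:.1%})")
```

The proportion varies from run to run and from seed to seed, but it should land in the neighborhood of the 8% figures discussed above, which is the sense in which the P-value is a long-run relative frequency.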

What Does a P-Value Mean?

Do women who smoke run the risk of shorter pregnancy and premature birth? The mean pregnancy length is 266 days. We test the following hypotheses.

  • H 0 : μ = 266
  • H a : μ < 266

Suppose a random sample of 40 women who smoke during their pregnancy have a mean pregnancy length of 260 days with a standard deviation of 21 days. The P-value is 0.04.

What probability does the P-value of 0.04 describe? Label each of the following interpretations as valid or invalid.


Review of Type I and Type II Errors

We know that statistical inference is based on probability, so there is always some chance of making a wrong decision. Recall that there are two types of wrong decisions that can be made in hypothesis testing. When we reject a null hypothesis that is true, we commit a type I error. When we fail to reject a null hypothesis that is false, we commit a type II error.

The following table summarizes the logic behind type I and type II errors.

A table that summarizes the logic behind type I and type II errors. If H0 is true and we reject H0 (accept Ha), this is a type I error. If H0 is true and we fail to reject H0 (not enough evidence to accept Ha), this is a correct decision. If H0 is false (Ha is true) and we reject H0 (accept Ha), this is a correct decision. If H0 is false (Ha is true) and we fail to reject H0 (not enough evidence to accept Ha), this is a type II error.

It is possible to have some influence over the likelihoods of committing these errors. But decreasing the chance of a type I error increases the chance of a type II error. We have to decide which error is more serious for a given situation. Sometimes a type I error is more serious. Other times a type II error is more serious. Sometimes neither is serious.

Recall that if the null hypothesis is true, the probability of committing a type I error is α. Why is this? Well, when we choose a level of significance (α), we are choosing a benchmark for rejecting the null hypothesis. If the null hypothesis is true, then the probability that we will reject a true null hypothesis is α. So the smaller α is, the smaller the probability of a type I error.

It is more complicated to calculate the probability of a type II error. The best way to reduce the probability of a type II error is to increase the sample size. But once the sample size is set, larger values of α will decrease the probability of a type II error (while increasing the probability of a type I error).
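The trade-off described above can be shown with a short calculation. The sketch below is a simplified illustration, not the T-test used elsewhere in this section: it treats the pregnancy-length example as a left-tailed z-test with a known standard deviation of 21 days, and it assumes a hypothetical true mean of 260 days, because the probability of a type II error can only be computed for a specific alternative.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, n, mu0=266.0, mu_true=260.0, sigma=21.0):
    """Probability of failing to reject H0: mu = mu0 in a left-tailed z-test
    when the true mean is mu_true (a hypothetical value)."""
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean falls below this cutoff; by construction
    # the chance of that happening when H0 is true is alpha (type I error).
    critical_mean = mu0 + NormalDist().inv_cdf(alpha) * se
    # Type II error: the sample mean stays at or above the cutoff even though
    # the true mean is mu_true.
    return 1 - NormalDist(mu_true, se).cdf(critical_mean)

beta_strict = type_ii_error(alpha=0.01, n=40)
beta_usual = type_ii_error(alpha=0.05, n=40)
beta_loose = type_ii_error(alpha=0.10, n=40)
beta_large_n = type_ii_error(alpha=0.05, n=160)

print(f"n=40:  alpha=0.01 -> beta={beta_strict:.3f}, "
      f"alpha=0.05 -> beta={beta_usual:.3f}, alpha=0.10 -> beta={beta_loose:.3f}")
print(f"n=160: alpha=0.05 -> beta={beta_large_n:.3f}")
```

Under these assumptions, raising α lowers the type II error probability at a fixed sample size, and increasing the sample size lowers it without changing α.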

General Guidelines for Choosing a Level of Significance

  • If the consequences of a type I error are more serious, choose a small level of significance (α).
  • If the consequences of a type II error are more serious, choose a larger level of significance (α). But remember that the level of significance is the probability of committing a type I error.
  • In general, we pick the largest level of significance that we can tolerate as the chance of a type I error.

Let’s return to the investigation of the impact of smoking on pregnancy length.

Recap of the hypothesis test: The mean human pregnancy length is 266 days. We test H0: μ = 266 against Ha: μ < 266.


Let’s Summarize

In this “Hypothesis Test for a Population Mean,” we looked at the four steps of a hypothesis test as they relate to a claim about a population mean.

Step 1: Determine the hypotheses.

  • The hypotheses are claims about the population mean, µ.
  • The null hypothesis is a hypothesis that the mean equals a specific value, µ0.

Step 2: Collect the data.

Since the hypothesis test is based on probability, random selection or assignment is essential in data production. Additionally, we need to check whether the t-model is a good fit for the sampling distribution of sample means. To use the t-model, the variable must be normally distributed in the population or the sample size must be more than 30. In practice, it is often impossible to verify that the variable is normally distributed in the population. If this is the case and the sample size is not more than 30, researchers often use the t-model if the sample is not strongly skewed and does not have outliers.

Step 3: Assess the evidence.

  • If a t-model is appropriate, determine the t-test statistic for the data’s sample mean.
  • Use the test statistic, together with the alternative hypothesis, to determine the P-value.
  • The P-value is the probability of finding a random sample with a mean at least as extreme as our sample mean, assuming that the null hypothesis is true.
  • As in all hypothesis tests, if the alternative hypothesis is greater than, the P-value is the area to the right of the test statistic. If the alternative hypothesis is less than, the P-value is the area to the left of the test statistic. If the alternative hypothesis is not equal to, the P-value is equal to double the tail area beyond the test statistic.
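The three tail rules in the last bullet can be sketched as one function. The example uses the smoking and pregnancy-length summary statistics from earlier in this section (n = 40, sample mean 260, standard deviation 21, null value 266). The t distribution is evaluated by numerical integration to keep the sketch within the standard library; the function names and the `alternative` labels are assumptions made for illustration.

```python
import math

def t_cdf(x, df, steps=2000):
    """P(T <= x) for Student's t, via Simpson's rule on [0, |x|] (steps even)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return coef * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def p_value(t_stat, df, alternative):
    """Tail area that matches the direction of the alternative hypothesis."""
    if alternative == "less":        # Ha: mu < mu0, area to the left
        return t_cdf(t_stat, df)
    if alternative == "greater":     # Ha: mu > mu0, area to the right
        return 1 - t_cdf(t_stat, df)
    if alternative == "two-sided":   # Ha: mu != mu0, double the tail area
        return 2 * (1 - t_cdf(abs(t_stat), df))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

n, sample_mean, sample_sd, mu0 = 40, 260.0, 21.0, 266.0
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))

p_less = p_value(t_stat, n - 1, "less")
p_greater = p_value(t_stat, n - 1, "greater")
p_two_sided = p_value(t_stat, n - 1, "two-sided")
print(f"T = {t_stat:.3f}: less {p_less:.3f}, greater {p_greater:.3f}, "
      f"two-sided {p_two_sided:.3f}")
```

For the left-tailed alternative used in the smoking example, the result should round to the P-value of 0.04 quoted earlier.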

Step 4: Give the conclusion.

The logic of the hypothesis test is always the same. To state a conclusion about H0, we compare the P-value to the significance level, α.

  • If P ≤ α, we reject H0. We conclude there is significant evidence in favor of Ha.
  • If P > α, we fail to reject H0. We conclude the sample does not provide significant evidence in favor of Ha.
  • We write the conclusion in the context of the research question. Our conclusion is usually a statement about the alternative hypothesis (we accept Ha or fail to accept Ha) and should include the P-value.

Other Hypothesis Testing Notes

  • Remember that the P-value is the probability of seeing a sample mean at least as extreme as the one from the data if the null hypothesis is true. The probability is about the random sample; it is not a “chance” statement about the null or alternative hypothesis.
  • If our test results in rejecting a null hypothesis that is actually true, then it is called a type I error.
  • If our test results in failing to reject a null hypothesis that is actually false, then it is called a type II error.
  • If rejecting a null hypothesis would be very expensive, controversial, or dangerous, then we really want to avoid a type I error. In this case, we would set a strict significance level (a small value of α, such as 0.01).
  • Finally, remember the phrase “garbage in, garbage out.” If the data collection methods are poor, then the results of a hypothesis test are meaningless.



Hypothesis testing, type I and type II errors

Amitav Banerjee, U. B. Chitnis, S. L. Jadhav, J. S. Bhawalkar, S. Chaudhury

Department of Community Medicine, D. Y. Patil Medical College, Pune, India

1 Department of Psychiatry, RINPAS, Kanke, Ranchi, India

Hypothesis testing is an important activity of empirical research and evidence-based medicine. A well worked up hypothesis is half the answer to the research question. For this, both knowledge of the subject derived from extensive review of the literature and working knowledge of basic statistical concepts are desirable. The present paper discusses the methods of working up a good hypothesis and statistical concepts of hypothesis testing.

Karl Popper is probably the most influential philosopher of science in the 20th century (Wulff et al., 1986). Many scientists, even those who do not usually read books on philosophy, are acquainted with the basic principles of his views on science. The popularity of Popper’s philosophy is due partly to the fact that it has been well explained in simple terms by, among others, the Nobel Prize winner Peter Medawar (Medawar, 1969). Popper makes the very important point that empirical scientists (those who stress observations alone as the starting point of research) put the cart in front of the horse when they claim that science proceeds from observation to theory, since there is no such thing as a pure observation which does not depend on theory. Popper states, “… the belief that we can start with pure observation alone, without anything in the nature of a theory, is absurd: As may be illustrated by the story of the man who dedicated his life to natural science, wrote down everything he could observe, and bequeathed his ‘priceless’ collection of observations to the Royal Society to be used as inductive (empirical) evidence.”

STARTING POINT OF RESEARCH: HYPOTHESIS OR OBSERVATION?

The first step in the scientific process is not observation but the generation of a hypothesis which may then be tested critically by observations and experiments. Popper also makes the important claim that the goal of the scientist’s efforts is not the verification but the falsification of the initial hypothesis. It is logically impossible to verify the truth of a general law by repeated observations, but, at least in principle, it is possible to falsify such a law by a single observation. Repeated observations of white swans did not prove that all swans are white, but the observation of a single black swan sufficed to falsify that general statement (Popper, 1976).

CHARACTERISTICS OF A GOOD HYPOTHESIS

A good hypothesis must be based on a good research question. It should be simple, specific and stated in advance (Hulley et al., 2001).

Hypothesis should be simple

A simple hypothesis contains one predictor and one outcome variable, e.g., positive family history of schizophrenia increases the risk of developing the condition in first-degree relatives. Here the single predictor variable is positive family history of schizophrenia and the outcome variable is schizophrenia. A complex hypothesis contains more than one predictor variable or more than one outcome variable, e.g., a positive family history and stressful life events are associated with an increased incidence of Alzheimer’s disease. Here there are 2 predictor variables, i.e., positive family history and stressful life events, and one outcome variable, i.e., Alzheimer’s disease. Complex hypotheses like this cannot be easily tested with a single statistical test and should always be separated into 2 or more simple hypotheses.

Hypothesis should be specific

A specific hypothesis leaves no ambiguity about the subjects and variables, or about how the test of statistical significance will be applied. It uses concise operational definitions that summarize the nature and source of the subjects and the approach to measuring variables, for example: “History of medication with tranquilizers, as measured by review of medical store records and physicians’ prescriptions in the past year, is more common in patients who attempted suicide than in controls hospitalized for other conditions.” This is a long-winded sentence, but it explicitly states the nature of the predictor and outcome variables, how they will be measured and the research hypothesis. Often these details are included in the study proposal rather than stated in the research hypothesis. However, they should be clear in the mind of the investigator while conceptualizing the study.

Hypothesis should be stated in advance

The hypothesis must be stated in writing during the proposal stage. This will help to keep the research effort focused on the primary objective and create a stronger basis for interpreting the study’s results than a hypothesis that emerges from inspecting the data. The habit of post hoc hypothesis testing (common among researchers) amounts to using third-degree methods on the data (data dredging) until they yield at least something significant. This leads to overrating the occasional chance associations in the study.

TYPES OF HYPOTHESES

For the purpose of testing statistical significance, hypotheses are classified by the way they describe the expected difference between the study groups.

Null and alternative hypotheses

The null hypothesis states that there is no association between the predictor and outcome variables in the population (for example, there is no difference between the tranquilizer habits of patients with attempted suicides and those of age- and sex-matched “control” patients hospitalized for other diagnoses). The null hypothesis is the formal basis for testing statistical significance. By starting with the proposition that there is no association, statistical tests can estimate the probability that an observed association could be due to chance.

The proposition that there is an association — that patients with attempted suicides will report different tranquilizer habits from those of the controls — is called the alternative hypothesis. The alternative hypothesis cannot be tested directly; it is accepted by exclusion if the test of statistical significance rejects the null hypothesis.
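As a rough illustration of how a test can estimate the probability that an observed association is due to chance, the sketch below runs a two-sided two-proportion z-test. The function name, the choice of the normal approximation, and the counts used in the usage note are assumptions made for illustration only; they are not data from any study cited here.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test of H0: the two population proportions are equal.

    x1/n1 and x2/n2 are the exposed counts and group sizes (hypothetical
    names). Returns (z statistic, two-sided P value) using the pooled
    normal approximation, which is only reasonable for large samples.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion assumed under H0
    if pooled in (0.0, 1.0):
        raise ValueError("no variation in the outcome; the test is undefined")
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 30 of 100 cases and 15 of 100 controls used tranquilizers.
z, p = two_proportion_z_test(30, 100, 15, 100)
print(f"z = {z:.2f}, P = {p:.3f}")
```

With these made-up counts the P value falls below 0.05, so the null hypothesis of no difference would be rejected at that level; identical proportions in the two groups give z = 0 and P = 1.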

One- and two-tailed alternative hypotheses

A one-tailed (or one-sided) hypothesis specifies the direction of the association between the predictor and outcome variables. The prediction that patients with attempted suicides will have a higher rate of tranquilizer use than control patients is a one-tailed hypothesis. A two-tailed hypothesis states only that an association exists; it does not specify the direction. The prediction that patients with attempted suicides will have a different rate of tranquilizer use (either higher or lower) than control patients is a two-tailed hypothesis. (The word tails refers to the tail ends of the statistical distribution, such as the familiar bell-shaped normal curve, that is used to test a hypothesis. One tail represents a positive effect or association; the other, a negative effect.) A one-tailed hypothesis has the statistical advantage of permitting a smaller sample size than that required by a two-tailed hypothesis. Unfortunately, one-tailed hypotheses are not always appropriate; in fact, some investigators believe that they should never be used. However, they are appropriate when only one direction for the association is important or biologically meaningful. An example is the one-sided hypothesis that a drug has a greater frequency of side effects than a placebo; the possibility that the drug has fewer side effects than the placebo is not worth testing. Whatever strategy is used, it should be stated in advance; otherwise, it would lack statistical rigor. Dredging the data after they have been collected and deciding post hoc to switch to one-tailed hypothesis testing in order to reduce the sample size and P value indicate a lack of scientific integrity.
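The difference between the two kinds of alternative hypothesis can be made concrete with a small sketch that converts a test statistic into a P value. It assumes the statistic follows a standard normal distribution under the null hypothesis; the function name and the `tails` labels are illustrative choices, not established terminology.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tails: str = "two") -> float:
    """P value for a test statistic z that is standard normal under H0.

    tails: "two"   -> two-tailed (association in either direction)
           "upper" -> one-tailed, alternative predicts a positive effect
           "lower" -> one-tailed, alternative predicts a negative effect
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tails == "upper":
        return 1 - std_normal.cdf(z)
    if tails == "lower":
        return std_normal.cdf(z)
    if tails == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tails must be 'two', 'upper' or 'lower'")


z = 1.8  # hypothetical test statistic
print(f"two-tailed P = {p_value_from_z(z, 'two'):.3f}")
print(f"one-tailed P = {p_value_from_z(z, 'upper'):.3f}")
```

For a statistic in the predicted direction, the one-tailed P value is half the two-tailed one. This is why switching to a one-tailed test after seeing the data can push a result under the significance threshold, and why the choice has to be made in advance.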

STATISTICAL PRINCIPLES OF HYPOTHESIS TESTING

A hypothesis (for example, Tamiflu [oseltamivir], drug of choice in H1N1 influenza, is associated with an increased incidence of acute psychotic manifestations) is either true or false in the real world. Because the investigator cannot study all people who are at risk, he must test the hypothesis in a sample of that target population. No matter how many data a researcher collects, he can never absolutely prove (or disprove) his hypothesis. There will always be a need to draw inferences about phenomena in the population from events observed in the sample (Hulley et al., 2001). In some ways, the investigator’s problem is similar to that faced by a judge trying a defendant [ Table 1 ]. The absolute truth of whether the defendant committed the crime cannot be determined. Instead, the judge begins by presuming innocence: the defendant did not commit the crime. The judge must decide whether there is sufficient evidence to reject the presumed innocence of the defendant; the standard is known as beyond a reasonable doubt. A judge can err, however, by convicting a defendant who is innocent, or by failing to convict one who is actually guilty. In similar fashion, the investigator starts by presuming the null hypothesis, or no association between the predictor and outcome variables in the population. Based on the data collected in his sample, the investigator uses statistical tests to determine whether there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis that there is an association in the population. The standard for these tests is known as the level of statistical significance.

Table 1: The analogy between a judge’s decisions and statistical tests

TYPE I (ALSO KNOWN AS ‘α’) AND TYPE II (ALSO KNOWN AS ‘β’) ERRORS

Just like a judge’s conclusion, an investigator’s conclusion may be wrong. Sometimes, by chance alone, a sample is not representative of the population. Thus the results in the sample do not reflect reality in the population, and the random error leads to an erroneous inference. A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population. Although type I and type II errors can never be avoided entirely, the investigator can reduce their likelihood by increasing the sample size (the larger the sample, the lower the likelihood that it will differ substantially from the population).

False-positive and false-negative results can also occur because of bias (observer, instrument, recall, etc.). (Errors due to bias, however, are not referred to as type I and type II errors.) Such errors are troublesome, since they may be difficult to detect and cannot usually be quantified.

EFFECT SIZE

The likelihood that a study will be able to detect an association between a predictor variable and an outcome variable depends, of course, on the actual magnitude of that association in the target population. If it is large (such as a 90% increase in the incidence of psychosis in people who are on Tamiflu), it will be easy to detect in the sample. Conversely, if the size of the association is small (such as a 2% increase in psychosis), it will be difficult to detect in the sample. Unfortunately, the investigator often does not know the actual magnitude of the association; one of the purposes of the study is to estimate it. Instead, the investigator must choose the size of the association that he would like to be able to detect in the sample. This quantity is known as the effect size. Selecting an appropriate effect size is the most difficult aspect of sample size planning. Sometimes, the investigator can use data from other studies or pilot tests to make an informed guess about a reasonable effect size. When there are no data with which to estimate it, he can choose the smallest effect size that would be clinically meaningful, for example, a 10% increase in the incidence of psychosis. Of course, from the public health point of view, even a 1% increase in psychosis incidence would be important. Thus the choice of the effect size is always somewhat arbitrary, and considerations of feasibility are often paramount. When the number of available subjects is limited, the investigator may have to work backward to determine whether the effect size that his study will be able to detect with that number of subjects is reasonable.
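The "work backward" step can be sketched in a few lines. The example below is only one possible formalization: it assumes a two-sided comparison of two group means with equal group sizes, a normal approximation, and an effect size expressed in standard-deviation units (a standardized difference), none of which is specified in the text above. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_group: int, alpha: float = 0.05, beta: float = 0.20) -> float:
    """Smallest standardized difference in means detectable with power 1 - beta.

    Assumes a two-sided z-test comparing two equal-sized groups; the result
    is in units of the (assumed common) standard deviation.
    """
    if n_per_group <= 0:
        raise ValueError("n_per_group must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = std_normal.inv_cdf(1 - beta)        # quantile matching the power
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)


for n in (25, 100, 400):
    print(n, round(min_detectable_effect(n), 2))
```

Under these assumptions, quadrupling the number of subjects per group halves the smallest effect the study can reliably detect, which gives the investigator a quick check on whether the available sample is adequate.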

α, β, AND POWER

After a study is completed, the investigator uses statistical tests to try to reject the null hypothesis in favor of its alternative (much in the same way that a prosecuting attorney tries to convince a judge to reject innocence in favor of guilt). Depending on whether the null hypothesis is true or false in the target population, and assuming that the study is free of bias, 4 situations are possible, as shown in Table 2 below. In 2 of these, the findings in the sample and reality in the population are concordant, and the investigator’s inference will be correct. In the other 2 situations, either a type I (α) or a type II (β) error has been made, and the inference will be incorrect.

Table 2: Truth in the population versus the results in the study sample: The four possibilities
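The four possibilities can be written out as a small lookup, shown here as a sketch. The function and argument names are illustrative; in a real study the truth in the population is of course unknown, so this is a way of organizing the logic rather than something that can be computed from data.

```python
def classify_outcome(null_is_true: bool, null_rejected: bool) -> str:
    """Label one of the four combinations of population truth and study result."""
    if null_is_true and null_rejected:
        return "type I error (false positive)"
    if not null_is_true and not null_rejected:
        return "type II error (false negative)"
    # Remaining cases: a true null that is not rejected,
    # or a false null that is rejected.
    return "correct inference"


for truth in (True, False):
    for rejected in (True, False):
        print(f"H0 true={truth}, H0 rejected={rejected}: {classify_outcome(truth, rejected)}")
```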

The investigator establishes the maximum chance of making type I and type II errors in advance of the study. The probability of committing a type I error (rejecting the null hypothesis when it is actually true) is called α (alpha); another name for it is the level of statistical significance.

If a study of Tamiflu and psychosis is designed with α = 0.05, for example, then the investigator has set 5% as the maximum chance of incorrectly rejecting the null hypothesis (and erroneously inferring that use of Tamiflu and psychosis incidence are associated in the population). This is the level of reasonable doubt that the investigator is willing to accept when he uses statistical tests to analyze the data after the study is completed.
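What α = 0.05 means in practice can be checked with a simulation, sketched below. It repeatedly draws two groups from the same population (so the null hypothesis is true by construction), applies a two-sided z-test, and counts how often the null hypothesis is wrongly rejected. The normally distributed outcome with known standard deviation 1, the group size, the number of trials and the seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulated_type_i_rate(n_per_group: int = 30, trials: int = 2000,
                          alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sqrt(2 / n_per_group)  # known SD of 1 in both groups
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same population: no true association.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


print(simulated_type_i_rate())
```

The printed proportion should land close to 0.05, up to simulation noise: roughly one in twenty such studies reports an association that does not exist in the population.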

The probability of making a type II error (failing to reject the null hypothesis when it is actually false) is called β (beta). The quantity (1 - β) is called power: the probability of observing an effect in the sample if an effect of a specified size or greater exists in the population.

If β is set at 0.10, then the investigator has decided that he is willing to accept a 10% chance of missing an association of a given effect size between Tamiflu and psychosis. This represents a power of 0.90, i.e., a 90% chance of finding an association of that size. For example, suppose that there really would be a 30% increase in psychosis incidence if the entire population took Tamiflu. Then 90 times out of 100, the investigator would observe an effect of that size or larger in his study. This does not mean, however, that the investigator will be absolutely unable to detect a smaller effect; just that he will have less than 90% likelihood of doing so.
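Power can be computed directly once a test and an effect size are fixed. The sketch below does so for a two-sided z-test comparing two equal-sized groups, with the effect size expressed in standard-deviation units; that setup, like the function name, is an assumption made for illustration and differs from the percentage increases used in the Tamiflu example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a two-sided, two-group z-test.

    effect_size is the true difference in means divided by the common
    standard deviation, which is assumed to be known.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)  # expected value of the z statistic
    # Probability that the statistic falls in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


print(round(approx_power(0.5, 64), 2))   # moderate effect
print(round(approx_power(0.25, 64), 2))  # smaller effect, same sample
```

Consistent with the point made above, a smaller true effect is not impossible to detect with the same sample; it is simply detected less often. When the effect size is zero the function returns α, the probability of a type I error.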

Ideally alpha and beta errors would be set at zero, eliminating the possibility of false-positive and false-negative results. In practice they are made as small as possible. Reducing them, however, usually requires increasing the sample size. Sample size planning aims at choosing a sufficient number of subjects to keep alpha and beta at acceptably low levels without making the study unnecessarily expensive or difficult.

Many studies set alpha at 0.05 and beta at 0.20 (a power of 0.80). These are somewhat arbitrary values, and others are sometimes used; the conventional range for alpha is between 0.01 and 0.10; and for beta, between 0.05 and 0.20. In general the investigator should choose a low value of alpha when the research question makes it particularly important to avoid a type I (false-positive) error, and he should choose a low value of beta when it is especially important to avoid a type II error.
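How alpha, beta and effect size combine into a sample size can be sketched with a standard normal-approximation formula for comparing two means, n = 2 × ((z for alpha + z for beta) / effect size)² per group. The formula choice, the standardized effect size and the function name are assumptions for illustration; a real study would use the formula that matches its design and outcome measure.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05, beta: float = 0.20) -> int:
    """Subjects needed per group for a two-sided, two-group z-test.

    effect_size is the difference in means in standard-deviation units.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


print(sample_size_per_group(0.5))             # alpha 0.05, beta 0.20
print(sample_size_per_group(0.5, beta=0.10))  # higher power costs more subjects
print(sample_size_per_group(0.2))             # smaller effect costs far more
```

Lowering either error rate, or aiming at a smaller effect, raises the required number of subjects, which is the trade-off that sample size planning has to balance.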

The null hypothesis acts like a punching bag: It is assumed to be true in order to shadowbox it into false with a statistical test. When the data are analyzed, such tests determine the P value, the probability of obtaining results at least as extreme as those observed by chance alone if the null hypothesis is true. The null hypothesis is rejected in favor of the alternative hypothesis if the P value is less than alpha, the predetermined level of statistical significance (Daniel, 2000). “Nonsignificant” results (those with a P value greater than alpha) do not imply that there is no association in the population; they only mean that the association observed in the sample is small compared with what could have occurred by chance alone. For example, an investigator might find that men with a family history of mental illness were twice as likely to develop schizophrenia as those with no family history, but with a P value of 0.09. This means that even if family history and schizophrenia were not associated in the population, there was a 9% chance of finding an association at least this strong due to random error in the sample. If the investigator had set the significance level at 0.05, he would have to conclude that the association in the sample was “not statistically significant.” It might be tempting for the investigator to change his mind about the level of statistical significance ex post facto and report that the results “showed statistical significance at P < .10”. A better choice would be to report that the “results, although suggestive of an association, did not achieve statistical significance ( P = .09)”. This solution acknowledges that statistical significance is not an “all or none” situation.
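The decision rule itself is simple enough to state as code. The sketch below uses illustrative names and applies the rule to the P value of 0.09 from the example above; the key design point is that alpha is an input fixed before the data are seen, not something adjusted afterwards.

```python
def significance_decision(p_value: float, alpha: float = 0.05) -> str:
    """Apply the pre-specified decision rule: reject H0 only if P < alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# P = 0.09 against the significance level chosen in advance.
print(significance_decision(0.09, alpha=0.05))
```

Calling the function with alpha=0.10 would flip the answer for the same data, which is exactly the ex post facto change that the text warns against.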

Hypothesis testing is the sheet anchor of empirical research and of the rapidly emerging practice of evidence-based medicine. However, empirical research and, ipso facto, hypothesis testing have their limits. The empirical approach to research cannot eliminate uncertainty completely. At best, it can quantify uncertainty. This uncertainty can be of 2 types: type I error (falsely rejecting a null hypothesis) and type II error (falsely failing to reject a null hypothesis). The acceptable magnitudes of type I and type II errors are set in advance and are important for sample size calculations. Another important point to remember is that we cannot ‘prove’ or ‘disprove’ anything by hypothesis testing and statistical tests. We can only knock down or reject the null hypothesis and by default accept the alternative hypothesis. If we fail to reject the null hypothesis, we retain it by default, which does not show that it is true.

Source of Support: Nil

Conflict of Interest: None declared.

  • Daniel W. W. In: Biostatistics. 7th ed. New York: John Wiley and Sons, Inc; 2002. Hypothesis testing; pp. 204–294.
  • Hulley S. B, Cummings S. R, Browner W. S, Grady D, Hearst N, Newman T. B. 2nd ed. Philadelphia: Lippincott Williams and Wilkins; 2001. Getting ready to estimate sample size: Hypothesis and underlying principles. In: Designing Clinical Research-An epidemiologic approach; pp. 51–63.
  • Medawar P. B. Philadelphia: American Philosophical Society; 1969. Induction and intuition in scientific thought.
  • Popper K. Unended Quest. An Intellectual Autobiography. Fontana Collins; p. 42.
  • Wulff H. R, Pedersen S. A, Rosenberg R. Oxford: Blackwell Scientific Publications; Empirism and Realism: A philosophical problem. In: Philosophy of Medicine.
