null hypothesis for regression

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

13.6 Testing the Regression Coefficients

Learning objectives.

Conduct and interpret a hypothesis test on individual regression coefficients.

Previously, we learned that the population model for the multiple regression equation is

[latex]\begin{eqnarray*} y & = & \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k +\epsilon \end{eqnarray*}[/latex]

where [latex]x_1,x_2,\ldots,x_k[/latex] are the independent variables, [latex]\beta_0,\beta_1,\ldots,\beta_k[/latex] are the population parameters of the regression coefficients, and [latex]\epsilon[/latex] is the error variable. In multiple regression, we estimate each population regression coefficient [latex]\beta_i[/latex] with the sample regression coefficient [latex]b_i[/latex].

In the previous section, we learned how to conduct an overall model test to determine if the regression model is valid. If the outcome of the overall model test is that the model is valid, then at least one of the independent variables is related to the dependent variable—in other words, at least one of the regression coefficients [latex]\beta_i[/latex] is not zero. However, the overall model test does not tell us which independent variables are related to the dependent variable. To determine which independent variables are related to the dependent variable, we must test each of the regression coefficients.

Testing the Regression Coefficients

For an individual regression coefficient, we want to test if there is a relationship between the dependent variable [latex]y[/latex] and the independent variable [latex]x_i[/latex].

No Relationship . There is no relationship between the dependent variable [latex]y[/latex] and the independent variable [latex]x_i[/latex]. In this case, the regression coefficient [latex]\beta_i[/latex] is zero. This is the claim for the null hypothesis in an individual regression coefficient test: [latex]H_0: \beta_i=0[/latex].
Relationship. There is a relationship between the dependent variable [latex]y[/latex] and the independent variable [latex]x_i[/latex]. In this case, the regression coefficients [latex]\beta_i[/latex] is not zero. This is the claim for the alternative hypothesis in an individual regression coefficient test: [latex]H_a: \beta_i \neq 0[/latex]. We are not interested if the regression coefficient [latex]\beta_i[/latex] is positive or negative, only that it is not zero. We only need to find out if the regression coefficient is not zero to demonstrate that there is a relationship between the dependent variable and the independent variable. This makes the test on a regression coefficient a two-tailed test.

In order to conduct a hypothesis test on an individual regression coefficient [latex]\beta_i[/latex], we need to use the distribution of the sample regression coefficient [latex]b_i[/latex]:

The mean of the distribution of the sample regression coefficient is the population regression coefficient [latex]\beta_i[/latex].
The standard deviation of the distribution of the sample regression coefficient is [latex]\sigma_{b_i}[/latex]. Because we do not know the population standard deviation we must estimate [latex]\sigma_{b_i}[/latex] with the sample standard deviation [latex]s_{b_i}[/latex].
The distribution of the sample regression coefficient follows a normal distribution.

Steps to Conduct a Hypothesis Test on a Regression Coefficient

[latex]\begin{eqnarray*} H_0: & & \beta_i=0 \\ \\ \end{eqnarray*}[/latex]

[latex]\begin{eqnarray*} H_a: & & \beta_i \neq 0 \\ \\ \end{eqnarray*}[/latex]

Collect the sample information for the test and identify the significance level [latex]\alpha[/latex].

[latex]\begin{eqnarray*}t & = & \frac{b_i-\beta_i}{s_{b_i}} \\ \\ df & = & n-k-1 \\ \\ \end{eqnarray*}[/latex]

The results of the sample data are significant. There is sufficient evidence to conclude that the null hypothesis [latex]H_0[/latex] is an incorrect belief and that the alternative hypothesis [latex]H_a[/latex] is most likely correct.
The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis [latex]H_a[/latex] may be correct.
Write down a concluding sentence specific to the context of the question.

The required [latex]t[/latex]-score and p -value for the test can be found on the regression summary table, which we learned how to generate in Excel in a previous section.

The human resources department at a large company wants to develop a model to predict an employee’s job satisfaction from the number of hours of unpaid work per week the employee does, the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job satisfaction score is out of 10, with higher values indicating greater job satisfaction.

Previously, we found the multiple regression equation to predict the job satisfaction score from the other variables:

[latex]\begin{eqnarray*} \hat{y} & = & 4.7993-0.3818x_1+0.0046x_2+0.0233x_3 \\ \\ \hat{y} & = & \mbox{predicted job satisfaction score} \\ x_1 & = & \mbox{hours of unpaid work per week} \\ x_2 & = & \mbox{age} \\ x_3 & = & \mbox{income (\$1000s)}\end{eqnarray*}[/latex]

At the 5% significance level, test the relationship between the dependent variable “job satisfaction” and the independent variable “hours of unpaid work per week”.

Hypotheses:

[latex]\begin{eqnarray*} H_0: & & \beta_1=0 \\ H_a: & & \beta_1 \neq 0 \end{eqnarray*}[/latex]

The regression summary table generated by Excel is shown below:

The p -value for the test on the hours of unpaid work per week regression coefficient is in the bottom part of the table under the P-value column of the Hours of Unpaid Work per Week row . So the p -value=[latex]0.0082[/latex].

Conclusion:

Because p -value[latex]=0.0082 \lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the 5% significance level there is enough evidence to suggest that there is a relationship between the dependent variable “job satisfaction” and the independent variable “hours of unpaid work per week.”

The null hypothesis [latex]\beta_1=0[/latex] is the claim that the regression coefficient for the independent variable [latex]x_1[/latex] is zero. That is, the null hypothesis is the claim that there is no relationship between the dependent variable and the independent variable “hours of unpaid work per week.”
The alternative hypothesis is the claim that the regression coefficient for the independent variable [latex]x_1[/latex] is not zero. The alternative hypothesis is the claim that there is a relationship between the dependent variable and the independent variable “hours of unpaid work per week.”
When conducting a test on a regression coefficient, make sure to use the correct subscript on [latex]\beta[/latex] to correspond to how the independent variables were defined in the regression model and which independent variable is being tested. Here the subscript on [latex]\beta[/latex] is 1 because the “hours of unpaid work per week” is defined as [latex]x_1[/latex] in the regression model.
The p -value for the tests on the regression coefficients are located in the bottom part of the table under the P-value column heading in the corresponding independent variable row.
Because the alternative hypothesis is a [latex]\neq[/latex], the p -value is the sum of the area in the tails of the [latex]t[/latex]-distribution. This is the value calculated out by Excel in the regression summary table.
The p -value of 0.0082 is a small probability compared to the significance level, and so is unlikely to happen assuming the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words, the regression coefficient [latex]\beta_1[/latex] is not zero, and so there is a relationship between the dependent variable “job satisfaction” and the independent variable “hours of unpaid work per week.” This means that the independent variable “hours of unpaid work per week” is useful in predicting the dependent variable.

At the 5% significance level, test the relationship between the dependent variable “job satisfaction” and the independent variable “age”.

[latex]\begin{eqnarray*} H_0: & & \beta_2=0 \\ H_a: & & \beta_2 \neq 0 \end{eqnarray*}[/latex]

The p -value for the test on the age regression coefficient is in the bottom part of the table under the P-value column of the Age row . So the p -value=[latex]0.8439[/latex].

Because p -value[latex]=0.8439 \gt 0.05=\alpha[/latex], we do not reject the null hypothesis. At the 5% significance level there is not enough evidence to suggest that there is a relationship between the dependent variable “job satisfaction” and the independent variable “age.”

The null hypothesis [latex]\beta_2=0[/latex] is the claim that the regression coefficient for the independent variable [latex]x_2[/latex] is zero. That is, the null hypothesis is the claim that there is no relationship between the dependent variable and the independent variable “age.”
The alternative hypothesis is the claim that the regression coefficient for the independent variable [latex]x_2[/latex] is not zero. The alternative hypothesis is the claim that there is a relationship between the dependent variable and the independent variable “age.”
When conducting a test on a regression coefficient, make sure to use the correct subscript on [latex]\beta[/latex] to correspond to how the independent variables were defined in the regression model and which independent variable is being tested. Here the subscript on [latex]\beta[/latex] is 2 because “age” is defined as [latex]x_2[/latex] in the regression model.
The p -value of 0.8439 is a large probability compared to the significance level, and so is likely to happen assuming the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject the null hypothesis. In other words, the regression coefficient [latex]\beta_2[/latex] is zero, and so there is no relationship between the dependent variable “job satisfaction” and the independent variable “age.” This means that the independent variable “age” is not particularly useful in predicting the dependent variable.

At the 5% significance level, test the relationship between the dependent variable “job satisfaction” and the independent variable “income”.

[latex]\begin{eqnarray*} H_0: & & \beta_3=0 \\ H_a: & & \beta_3 \neq 0 \end{eqnarray*}[/latex]

The p -value for the test on the income regression coefficient is in the bottom part of the table under the P-value column of the Income row . So the p -value=[latex]0.0060[/latex].

Because p -value[latex]=0.0060 \lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the 5% significance level there is enough evidence to suggest that there is a relationship between the dependent variable “job satisfaction” and the independent variable “income.”

The null hypothesis [latex]\beta_3=0[/latex] is the claim that the regression coefficient for the independent variable [latex]x_3[/latex] is zero. That is, the null hypothesis is the claim that there is no relationship between the dependent variable and the independent variable “income.”
The alternative hypothesis is the claim that the regression coefficient for the independent variable [latex]x_3[/latex] is not zero. The alternative hypothesis is the claim that there is a relationship between the dependent variable and the independent variable “income.”
When conducting a test on a regression coefficient, make sure to use the correct subscript on [latex]\beta[/latex] to correspond to how the independent variables were defined in the regression model and which independent variable is being tested. Here the subscript on [latex]\beta[/latex] is 3 because “income” is defined as [latex]x_3[/latex] in the regression model.
The p -value of 0.0060 is a small probability compared to the significance level, and so is unlikely to happen assuming the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words, the regression coefficient [latex]\beta_3[/latex] is not zero, and so there is a relationship between the dependent variable “job satisfaction” and the independent variable “income.” This means that the independent variable “income” is useful in predicting the dependent variable.

Concept Review

The test on a regression coefficient determines if there is a relationship between the dependent variable and the corresponding independent variable. The p -value for the test is the sum of the area in tails of the [latex]t[/latex]-distribution. The p -value can be found on the regression summary table generated by Excel.

The hypothesis test for a regression coefficient is a well established process:

Write down the null and alternative hypotheses in terms of the regression coefficient being tested. The null hypothesis is the claim that there is no relationship between the dependent variable and independent variable. The alternative hypothesis is the claim that there is a relationship between the dependent variable and independent variable.
Collect the sample information for the test and identify the significance level.
The p -value is the sum of the area in the tails of the [latex]t[/latex]-distribution. Use the regression summary table generated by Excel to find the p -value.
Compare the p -value to the significance level and state the outcome of the test.

Understanding the Null Hypothesis for Linear Regression

Linear regression is a technique we can use to understand the relationship between one or more predictor variables and a response variable .

If we only have one predictor variable and one response variable, we can use simple linear regression , which uses the following formula to estimate the relationship between the variables:

ŷ = β 0 + β 1 x

ŷ: The estimated response value.
β 0 : The average value of y when x is zero.
β 1 : The average change in y associated with a one unit increase in x.
x: The value of the predictor variable.

Simple linear regression uses the following null and alternative hypotheses:

H 0 : β 1 = 0
H A : β 1 ≠ 0

The null hypothesis states that the coefficient β 1 is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that β 1 is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple linear regression , which uses the following formula to estimate the relationship between the variables:

ŷ = β 0 + β 1 x 1 + β 2 x 2 + … + β k x k

β 0 : The average value of y when all predictor variables are equal to zero.
β i : The average change in y associated with a one unit increase in x i .
x i : The value of the predictor variable x i .

Multiple linear regression uses the following null and alternative hypotheses:

H 0 : β 1 = β 2 = … = β k = 0
H A : β 1 = β 2 = … = β k ≠ 0

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple linear regression and multiple linear regression models.

Example 1: Simple Linear Regression

Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects data for 20 students and fits a simple linear regression model.

The following screenshot shows the output of the regression model:

Output of simple linear regression in Excel

The fitted simple linear regression model is:

Exam Score = 67.1617 + 5.2503*(hours studied)

To determine if there is a statistically significant relationship between hours studied and exam score, we need to analyze the overall F value of the model and the corresponding p-value:

Overall F-Value: 47.9952
P-value: 0.000

Since this p-value is less than .05, we can reject the null hypothesis. In other words, there is a statistically significant relationship between hours studied and exam score received.

Example 2: Multiple Linear Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict the exam score that students will receive in his class. He collects data for 20 students and fits a multiple linear regression model.

Multiple linear regression output in Excel

The fitted multiple linear regression model is:

Exam Score = 67.67 + 5.56*(hours studied) – 0.60*(prep exams taken)

To determine if there is a jointly statistically significant relationship between the two predictor variables and the response variable, we need to analyze the overall F value of the model and the corresponding p-value:

Overall F-Value: 23.46
P-value: 0.00

Since this p-value is less than .05, we can reject the null hypothesis. In other words, hours studied and prep exams taken have a jointly statistically significant relationship with exam score.

Note: Although the p-value for prep exams taken (p = 0.52) is not significant, prep exams combined with hours studied has a significant relationship with exam score.

Additional Resources

Understanding the F-Test of Overall Significance in Regression How to Read and Interpret a Regression Table How to Report Regression Results How to Perform Simple Linear Regression in Excel How to Perform Multiple Linear Regression in Excel

The Complete Guide: How to Report Regression Results

R vs. r-squared: what’s the difference, related posts, how to normalize data between -1 and 1, vba: how to check if string contains another..., how to interpret f-values in a two-way anova, how to create a vector of ones in..., how to find the mode of a histogram..., how to find quartiles in even and odd..., how to determine if a probability distribution is..., what is a symmetric histogram (definition & examples), how to calculate sxy in statistics (with example), how to calculate sxx in statistics (with example).

Linear regression - Hypothesis testing

by Marco Taboga , PhD

This lecture discusses how to perform tests of hypotheses about the coefficients of a linear regression model estimated by ordinary least squares (OLS).

Table of contents

Normal vs non-normal model

The linear regression model, matrix notation, tests of hypothesis in the normal linear regression model, test of a restriction on a single coefficient (t test), test of a set of linear restrictions (f test), tests based on maximum likelihood procedures (wald, lagrange multiplier, likelihood ratio), tests of hypothesis when the ols estimator is asymptotically normal, test of a restriction on a single coefficient (z test), test of a set of linear restrictions (chi-square test), learn more about regression analysis.

The lecture is divided in two parts:

in the first part, we discuss hypothesis testing in the normal linear regression model , in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors;

in the second part, we show how to carry out hypothesis tests in linear regression analyses where the hypothesis of normality holds only in large samples (i.e., the OLS estimator can be proved to be asymptotically normal).

How to choose which test to carry out after estimating a linear regression model.

We also denote:

We now explain how to derive tests about the coefficients of the normal linear regression model.

It can be proved (see the lecture about the normal linear regression model ) that the assumption of conditional normality implies that:

How the acceptance region is determined depends not only on the desired size of the test , but also on whether the test is:

one-tailed (only one of the two things, i.e., either smaller or larger, is possible).

For more details on how to determine the acceptance region, see the glossary entry on critical values .

The F test is one-tailed .

A critical value in the right tail of the F distribution is chosen so as to achieve the desired size of the test.

Then, the null hypothesis is rejected if the F statistics is larger than the critical value.

In this section we explain how to perform hypothesis tests about the coefficients of a linear regression model when the OLS estimator is asymptotically normal.

As we have shown in the lecture on the properties of the OLS estimator , in several cases (i.e., under different sets of assumptions) it can be proved that:

These two properties are used to derive the asymptotic distribution of the test statistics used in hypothesis testing.

The test can be either one-tailed or two-tailed . The same comments made for the t-test apply here.

Like the F test, also the Chi-square test is usually one-tailed .

The desired size of the test is achieved by appropriately choosing a critical value in the right tail of the Chi-square distribution.

The null is rejected if the Chi-square statistics is larger than the critical value.

Want to learn more about regression analysis? Here are some suggestions:

R squared of a linear regression ;

Gauss-Markov theorem ;

Generalized Least Squares ;

Multicollinearity ;

Dummy variables ;

Selection of linear regression models

Partitioned regression ;

Ridge regression .

How to cite

Please cite as:

Taboga, Marco (2021). "Linear regression - Hypothesis testing", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/linear-regression-hypothesis-testing.

Most of the learning materials found on this website are now available in a traditional textbook format.

F distribution
Beta distribution
Conditional probability
Central Limit Theorem
Binomial distribution
Mean square convergence
Delta method
Almost sure convergence
Mathematical tools
Fundamentals of probability
Probability distributions
Asymptotic theory
Fundamentals of statistics
About Statlect
Cookies, privacy and terms of use
Loss function
Almost sure
Type I error
Precision matrix
Integrable variable
To enhance your privacy,
we removed the social buttons,
but don't forget to share .

9.1 Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 , the — null hypothesis: a statement of no difference between sample means or proportions or no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.

H a —, the alternative hypothesis: a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are reject H 0 if the sample information favors the alternative hypothesis or do not reject H 0 or decline to reject H 0 if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example 9.1

H 0 : No more than 30 percent of the registered voters in Santa Clara County voted in the primary election. p ≤ 30 H a : More than 30 percent of the registered voters in Santa Clara County voted in the primary election. p > 30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25 percent. State the null and alternative hypotheses.

Example 9.2

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are the following: H 0 : μ = 2.0 H a : μ ≠ 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

H 0 : μ __ 66
H a : μ __ 66

Example 9.3

We want to test if college students take fewer than five years to graduate from college, on the average. The null and alternative hypotheses are the following: H 0 : μ ≥ 5 H a : μ < 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

H 0 : μ __ 45
H a : μ __ 45

Example 9.4

An article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third of the students pass. The same article stated that 6.6 percent of U.S. students take advanced placement exams and 4.4 percent pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6 percent. State the null and alternative hypotheses. H 0 : p ≤ 0.066 H a : p > 0.066

On a state driver’s test, about 40 percent pass the test on the first try. We want to test if more than 40 percent pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

H 0 : p __ 0.40
H a : p __ 0.40

Collaborative Exercise

Bring to class a newspaper, some news magazines, and some internet articles. In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.

As an Amazon Associate we earn from qualifying purchases.

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute Texas Education Agency (TEA). The original material is available at: https://www.texasgateway.org/book/tea-statistics . Changes were made to the original material, including updates to art, structure, and other content updates.

Access for free at https://openstax.org/books/statistics/pages/1-introduction

Authors: Barbara Illowsky, Susan Dean
Publisher/website: OpenStax
Book title: Statistics
Publication date: Mar 27, 2020
Location: Houston, Texas
Book URL: https://openstax.org/books/statistics/pages/1-introduction
Section URL: https://openstax.org/books/statistics/pages/9-1-null-and-alternative-hypotheses

© Jan 23, 2024 Texas Education Agency (TEA). The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

Knowledge Base
Simple Linear Regression | An Easy Introduction & Examples

Simple Linear Regression | An Easy Introduction & Examples

Published on February 19, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Simple linear regression is used to estimate the relationship between two quantitative variables . You can use simple linear regression when you want to know:

How strong the relationship is between two variables (e.g., the relationship between rainfall and soil erosion).
The value of the dependent variable at a certain value of the independent variable (e.g., the amount of soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

If you have more than one independent variable, use multiple linear regression instead.

Assumptions of simple linear regression, how to perform a simple linear regression, interpreting the results, presenting the results, can you predict values outside the range of your data, other interesting articles, frequently asked questions about simple linear regression.

Simple linear regression is a parametric test , meaning that it makes certain assumptions about the data. These assumptions are:

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among observations.
Normality : The data follows a normal distribution .

Linear regression makes one additional assumption:

The relationship between the independent and dependent variable is linear : the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.

If your data violate the assumption of independence of observations (e.g., if observations are repeated over time), you may be able to perform a linear mixed-effects model that accounts for the additional structure in the data.

Here's why students love Scribbr's proofreading services

Discover proofreading & editing

Simple linear regression formula

The formula for a simple linear regression is:

$y = {\beta_0} + {\beta_1{X}} + {\epsilon}$

y is the predicted value of the dependent variable ( y ) for any given value of the independent variable ( x ).
B 0 is the intercept , the predicted value of y when the x is 0.
B 1 is the regression coefficient – how much we expect y to change as x increases.
x is the independent variable ( the variable we expect is influencing y ).
e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Linear regression finds the line of best fit line through your data by searching for the regression coefficient (B 1 ) that minimizes the total error (e) of the model.

While you can perform a linear regression by hand , this is a tedious process, so most people use statistical programs to help them quickly analyze the data.

Simple linear regression in R

R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using our income and happiness example.

Dataset for simple linear regression (.csv)

Load the income.data dataset into your R environment, and then run the following command to generate a linear model describing the relationship between income and happiness:

This code takes the data you have collected data = income.data and calculates the effect that the independent variable income has on the dependent variable happiness using the equation for the linear model: lm() .

To learn more, follow our full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function in R:

This function takes the most important parameters from the linear model and puts them into a table, which looks like this:

Simple linear regression summary output in R

This output table first repeats the formula that was used to generate the results (‘Call’), then summarizes the model residuals (‘Residuals’), which give an idea of how well the model fits the real data.

Next is the ‘Coefficients’ table. The first row gives the estimates of the y-intercept, and the second row gives the regression coefficient of the model.

Row 1 of the table is labeled (Intercept) . This is the y-intercept of the regression equation, with a value of 0.20. You can plug this into your regression equation if you want to predict happiness values across the range of income that you have observed:

The next row in the ‘Coefficients’ table is income. This is the row that describes the estimated effect of income on reported happiness:

The Estimate column is the estimated effect , also called the regression coefficient or r 2 value. The number in the table (0.713) tells us that for every one unit increase in income (where one unit of income = 10,000) there is a corresponding 0.71-unit increase in reported happiness (where happiness is a scale of 1 to 10).

The Std. Error column displays the standard error of the estimate. This number shows how much variation there is in our estimate of the relationship between income and happiness.

The t value column displays the test statistic . Unless you specify otherwise, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that our results occurred by chance.

The Pr(>| t |) column shows the p value . This number tells us how likely we are to see the estimated effect of income on happiness if the null hypothesis of no effect were true.

Because the p value is so low ( p < 0.001), we can reject the null hypothesis and conclude that income has a statistically significant effect on happiness.

The last three lines of the model summary are statistics about the model as a whole. The most important thing to notice here is the p value of the model. Here it is significant ( p < 0.001), which means that this model is a good fit for the observed data.

When reporting your results, include the estimated effect (i.e. the regression coefficient), standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what your regression coefficient means:

It can also be helpful to include a graph with your results. For a simple linear regression, you can simply plot the observations on the x and y axis and then include the regression line and regression function:

No! We often say that regression models can be used to predict the value of the dependent variable at certain values of the independent variable. However, this is only true for the range of values where we have actually measured the response.

We can use our income and happiness regression analysis as an example. Between 15,000 and 75,000, we found an r 2 of 0.73 ± 0.0193. But what if we did a second survey of people making between 75,000 and 150,000?

The r 2 for the relationship between income and happiness is now 0.21, or a 0.21-unit increase in reported happiness for every 10,000 increase in income. While the relationship is still statistically significant (p<0.001), the slope is much smaller than before.

What if we hadn’t measured this group, and instead extrapolated the line from the 15–75k incomes to the 70–150k incomes?

You can see that if we simply extrapolated from the 15–75k income data, we would overestimate the happiness of people in the 75–150k income range.

If we instead fit a curve to the data, it seems to fit the actual pattern much better.

It looks as though happiness actually levels off at higher incomes, so we can’t use the same regression line we calculated from our lower-income data to predict happiness at higher levels of income.

Even when you see a strong pattern in your data, you can’t know for certain whether that pattern continues beyond the range of values you have actually measured. Therefore, it’s important to avoid extrapolating beyond what the data actually tell you.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

Chi square test of independence
Statistical power
Descriptive statistics
Degrees of freedom
Pearson correlation
Null hypothesis

Methodology

Double-blind study
Case-control study
Research ethics
Data collection
Hypothesis testing
Structured interviews

Research bias

Hawthorne effect
Unconscious bias
Recall bias
Halo effect
Self-serving bias
Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

measuring the distance of the observed y-values from the predicted y-values at each value of x;
squaring each of these distances;
calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the “Cite this Scribbr article” button to automatically add the citation to our free Citation Generator.

Bevans, R. (2023, June 22). Simple Linear Regression | An Easy Introduction & Examples. Scribbr. Retrieved March 20, 2024, from https://www.scribbr.com/statistics/simple-linear-regression/

Is this article helpful?

Rebecca Bevans

Other students also liked, an introduction to t tests | definitions, formula and examples, multiple linear regression | a quick guide (examples), linear regression in r | a step-by-step guide & examples.

school Campus Bookshelves
menu_book Bookshelves
perm_media Learning Objects
login Login
how_to_reg Request Instructor Account
hub Instructor Commons
Download Page (PDF)
Download Full Book (PDF)
Periodic Table
Physics Constants
Scientific Calculator
Reference & Cite
Tools expand_more
Readability

selected template will load here

This action is not available.

3.3.4: Hypothesis Test for Simple Linear Regression

Last updated
Save as PDF
Page ID 28708

We will now describe a hypothesis test to determine if the regression model is meaningful; in other words, does the value of $X$ in any way help predict the expected value of $Y$?

Simple Linear Regression ANOVA Hypothesis Test

Model Assumptions

The residual errors are random and are normally distributed.
The standard deviation of the residual error does not depend on $X$
A linear relationship exists between $X$ and $Y$
The samples are randomly selected

Test Hypotheses

$H_o$: $X$ and $Y$ are not correlated

$H_a$: $X$ and $Y$ are correlated

$H_o$: $\beta_1$ (slope) = 0

$H_a$: $\beta_1$ (slope) ≠ 0

Test Statistic

$F=\dfrac{M S_{\text {Regression }}}{M S_{\text {Error }}}$

$d f_{\text {num }}=1$

$d f_{\text {den }}=n-2$

Sum of Squares

$S S_{\text {Total }}=\sum(Y-\bar{Y})^{2}$

$S S_{\text {Error }}=\sum(Y-\hat{Y})^{2}$

$S S_{\text {Regression }}=S S_{\text {Total }}-S S_{\text {Error }}$

In simple linear regression, this is equivalent to saying “Are X an Y correlated?”

In reviewing the model, $Y=\beta_{0}+\beta_{1} X+\varepsilon$, as long as the slope ($\beta_{1}$) has any non‐zero value, $X$ will add value in helping predict the expected value of $Y$. However, if there is no correlation between X and Y, the value of the slope ($\beta_{1}$) will be zero. The model we can use is very similar to One Factor ANOVA.

The Results of the test can be summarized in a special ANOVA table:

Example: Rainfall and sales of sunglasses

Design : Is there a significant correlation between rainfall and sales of sunglasses?

Research Hypothese s:

$H_o$: Sales and Rainfall are not correlated $H_o$: 1 (slope) = 0

$H_a$: Sales and Rainfall are correlated $H_a$: 1 (slope) ≠ 0

Type I error would be to reject the Null Hypothesis and $t$ claim that rainfall is correlated with sales of sunglasses, when they are not correlated. The test will be run at a level of significance ($\alpha$) of 5%.

The test statistic from the table will be $\mathrm{F}=\dfrac{\text { MSRegression }}{\text { MSError }}$. The degrees of freedom for the numerator will be 1, and the degrees of freedom for denominator will be 5‐2=3.

Critical Value for $F$ at $\alpha$of 5% with $df_{num}=1$ and $df_{den}=3} is 10.13. Reject \(H_o$ if $F >10.13$. We will also run this test using the $p$‐value method with statistical software, such as Minitab.

Data/Results

$F=341.422 / 12.859=26.551$, which is more than the critical value of 10.13, so Reject $H_o$. Also, the $p$‐value = 0.0142 < 0.05 which also supports rejecting $H_o$.

Sales of Sunglasses and Rainfall are negatively correlated.

Online Degree Explore Bachelor’s & Master’s degrees
MasterTrack™ Earn credit towards a Master’s degree
University Certificates Advance your career with graduate-level learning
Top Courses
Join for Free

Data Collection and Root Cause Analysis

Taught in English

Financial aid available

Gain insight into a topic and learn the fundamentals

Instructors: Skill-Up EdTech Team +1 more

Instructors

Included with Coursera Plus

Recommended experience

Beginner level

None required

What you'll learn

Define data requirements to gather relevant data from the process using appropriate data collection methods.

Calculate baseline process performance metrics based on the collected data.

Analyze data for variations and use data analysis tools and techniques to identify the root causes for the problem or variation.

Skills you'll gain

Lean Six Sigma
Business Process Mapping
DMAIC framework
Root cause analysis
Critical to quality (CTQ)

Details to know

Add to your LinkedIn profile

9 assignments

See how employees at top companies are mastering in-demand skills

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV

Share it on social media and in your performance review

There are 4 modules in this course

The course will equip you with the competencies and essential skills required to excel in the American Society for Quality (ASQ) Certified Six Sigma Yellow Belt (CSSYB) exam and contribute to process improvement programs. This course focuses on various data collection tools and techniques to analyze data, identify the root causes of a problem, and explore the concepts of measurement system analysis (MSA), and hypothesis testing.

By the end of this course, you will be able to: • Define data requirements to gather relevant data from the process using appropriate data collection methods. • Calculate baseline process performance metrics based on the collected data. • Analyze data for variations and use data analysis tools and techniques to identify the root causes for the problem or variation. The course is best suited for entry-level professionals who are new to the world of Six Sigma and wish to improve their professional experience and opportunities. For this course, no prior knowledge is required, however, it is recommended that you complete the first course, Introduction to Lean Six Sigma and Project Identification Methods in the ASQ-Certified Six Sigma Yellow Belt Exam Prep Specialization.

Basic Statistics and Process Performance Measurement

This module introduces you to descriptive statistics, a branch of statistics that involves summarizing and describing the main features of a dataset. It provides tools and techniques to organize, present, and analyze data to gain insights into its central tendencies, variability, and distribution. Descriptive statistics is fundamental in data analysis and is a basis for more advanced statistical methods. You will also be introduced to inferential statistics. The module also explains the various data types and helps you differentiate between qualitative and quantitative data and data coming from internal and external sources. It describes the data collection process. Further, the module delves into the concept of measurement system analysis (MSA) and its components to understand the variations in the measurement process.

What's included

7 videos 2 readings 3 assignments 1 discussion prompt

7 videos • Total 35 minutes

Course Introduction • 3 minutes • Preview module
Descriptive and inferential statistics • 4 minutes
Measures of central tendency • 4 minutes
Measures of variability • 5 minutes
Qualitative and quantitative data • 4 minutes
Data collection plan • 6 minutes
Measurement system analysis (MSA) • 5 minutes

2 readings • Total 14 minutes

Course Overview • 10 minutes
Importance of descriptive statistics in data collection/Qualitative vs. quantitative data • 4 minutes

3 assignments • Total 34 minutes

Practice Quiz: Data description and dispersion measures • 6 minutes
Practice Quiz: Data types and measurement system analysis • 8 minutes
Graded Quiz: Basic Statistics and Process Performance Measurement • 20 minutes

1 discussion prompt • Total 15 minutes

Difference between central tendency and measures of dispersion • 15 minutes

Root Cause Analysis Techniques

This module explains the differences between value-added and non-value-added activities. It makes a case for non-value-added activities that are necessary to enable the smooth running of the organization. The module also discusses how to identify the bottlenecks in a system and suggests ways to eliminate them. Lastly, the module explores the various techniques to conduct a root cause analysis (RCA) for the identified problem in your process or organization. The first one, Pareto analysis, is based on the Pareto principle, which states that approximately 80% of the effects come from 20% of the causes. This analysis helps prioritize potential root causes based on their relative impact. You will also learn how to use the fishbone diagram, also known as the Ishikawa or cause and effect diagram, which visually represents the potential causes contributing to a problem while categorizing the possible causes into specific groups to facilitate the identification of root causes. Additionally, you will learn about the five whys, a simple yet powerful technique involving repeatedly asking “why” to identify the root cause of a problem. It helps to peel the layers of symptoms and surface-level causes to get to the core issue.

5 videos 1 reading 3 assignments 1 discussion prompt

5 videos • Total 23 minutes

Value-added and non-value-added activities • 4 minutes • Preview module
Identify bottlenecks • 4 minutes
RCA tool 1: Pareto principle • 6 minutes
RCA tool 2: Fishbone diagram • 3 minutes
RCA tool 3: Five whys • 5 minutes

1 reading • Total 5 minutes

Value-added and non-value-added activities: A business perspective • 5 minutes

3 assignments • Total 32 minutes

Practice Quiz: Value analysis • 4 minutes
Practice Quiz: RCA tools • 8 minutes
Graded Quiz: Root Cause Analysis Techniques • 20 minutes
Elimination or reduction of non-value-added activities • 15 minutes

Hypothesis Testing and Investigating the Relationship

This module provides a comprehensive overview of hypothesis testing, an essential statistical tool used to assess the validity of claims or hypotheses about populations. You will learn about the hypothesis testing process, its application in real-world scenarios, and how to interpret the hypothesis test results to make better decisions. The module will take you through the different types of hypotheses, types of errors, and the significance of the p-value in hypothesis testing. You will also learn about the principles and applications of correlation and regression techniques. The module discusses the types of correlation and the roles of dependent and independent variables in regression analysis. lt explains the implications of R-squared values in regression analysis. The module also explains simple linear regression and the difference between deterministic and probabilistic models.

8 videos 1 reading 3 assignments 1 discussion prompt

8 videos • Total 44 minutes

Hypothesis and its types • 5 minutes • Preview module
Null hypothesis versus alternative hypothesis • 6 minutes
Type I and Type II errors • 5 minutes
P value to test hypotheses • 4 minutes
Correlation analysis: Types of correlation • 5 minutes
Correlation analysis • 4 minutes
R-squared values • 4 minutes
Regression analysis: Simple linear regression • 7 minutes
Interpretation of R-squared values • 5 minutes

3 assignments • Total 46 minutes

Practice Quiz: Hypothesis testing & analysis • 8 minutes
Practice Quiz: Investigating the relationship • 8 minutes
Graded Quiz: Hypothesis Testing and Investigating the Relationship • 30 minutes
Difference between correlation and determination coefficients • 15 minutes

Peer-Reviewed Assignment

This is a peer-review assignment based on the concepts taught in the Data Collection and Root Cause Analysis course. In this assignment, you will apply your knowledge of hypothesis testing to a real-life scenario.

1 video 2 readings 1 peer review

1 video • Total 3 minutes

Course wrap-up video • 3 minutes • Preview module

2 readings • Total 3 minutes

Congratulations and next steps • 2 minutes
Thanks from the course team • 1 minute

1 peer review • Total 120 minutes

Peer-reviewed assignment: Hypothesis testing • 120 minutes

SkillUp EdTech provides everyone with transformational learning experiences. Built on a solid foundation of the field's latest practices, SkillUp EdTech provides real-world experiences and activities to help prepare you for your new career.

Recommended if you're interested in Data Analysis

SkillUp EdTech

Intro to Lean Six Sigma and Project Identification Methods

Why people choose coursera for their career.

New to Data Analysis? Start here.

Open new doors with Coursera Plus

Unlimited access to 7,000+ world-class courses, hands-on projects, and job-ready certificate programs - all included in your subscription

Advance your career with an online degree

Earn a degree from world-class universities - 100% online

Join over 3,400 global companies that choose Coursera for Business

Upskill your employees to excel in the digital economy

Frequently asked questions

When will i have access to the lectures and assignments.

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.

The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

What will I get if I subscribe to this Specialization?

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

What is the refund policy?

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy Opens in a new tab .

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If fin aid or scholarship is available for your learning program selection, you’ll find a link to apply on the description page.