Linear regression - Hypothesis testing

by Marco Taboga, PhD

This lecture discusses how to perform tests of hypotheses about the coefficients of a linear regression model estimated by ordinary least squares (OLS).

Table of contents

Normal vs non-normal model

  • The linear regression model
  • Matrix notation
  • Tests of hypothesis in the normal linear regression model
  • Test of a restriction on a single coefficient (t test)
  • Test of a set of linear restrictions (F test)
  • Tests based on maximum likelihood procedures (Wald, Lagrange multiplier, likelihood ratio)
  • Tests of hypothesis when the OLS estimator is asymptotically normal
  • Test of a restriction on a single coefficient (z test)
  • Test of a set of linear restrictions (Chi-square test)
  • Learn more about regression analysis

The lecture is divided into two parts:

in the first part, we discuss hypothesis testing in the normal linear regression model, in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors;

in the second part, we show how to carry out hypothesis tests in linear regression analyses where the hypothesis of normality holds only in large samples (i.e., the OLS estimator can be proved to be asymptotically normal).

How to choose which test to carry out after estimating a linear regression model.


We now explain how to derive tests about the coefficients of the normal linear regression model.

It can be proved (see the lecture about the normal linear regression model) that the assumption of conditional normality implies that the OLS estimator has a normal distribution conditional on the matrix of regressors.

How the acceptance region is determined depends not only on the desired size of the test, but also on whether the test is:

two-tailed (both abnormally small and abnormally large values of the test statistic lead to rejection);

one-tailed (only one of the two things, i.e., either smaller or larger, leads to rejection).

For more details on how to determine the acceptance region, see the glossary entry on critical values .


The F test is one-tailed .

A critical value in the right tail of the F distribution is chosen so as to achieve the desired size of the test.

Then, the null hypothesis is rejected if the F statistic is larger than the critical value.
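The one-tailed decision rule just described can be sketched in a few lines; the statistic and critical value below are hypothetical placeholders, not values from the lecture:

```python
# One-tailed F-test decision rule: reject H0 when the F statistic exceeds
# the right-tail critical value chosen for the desired size of the test.
def reject_null(f_statistic: float, critical_value: float) -> bool:
    return f_statistic > critical_value

# Hypothetical values for illustration only.
print(reject_null(5.2, 3.1), reject_null(1.4, 3.1))  # → True False
```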

In this section we explain how to perform hypothesis tests about the coefficients of a linear regression model when the OLS estimator is asymptotically normal.

As we have shown in the lecture on the properties of the OLS estimator, in several cases (i.e., under different sets of assumptions) it can be proved that the OLS estimator is consistent and asymptotically normal.

These two properties are used to derive the asymptotic distribution of the test statistics used in hypothesis testing.

The test can be either one-tailed or two-tailed . The same comments made for the t-test apply here.


Like the F test, the Chi-square test is usually one-tailed.

The desired size of the test is achieved by appropriately choosing a critical value in the right tail of the Chi-square distribution.

The null is rejected if the Chi-square statistic is larger than the critical value.

Want to learn more about regression analysis? Here are some suggestions:

R squared of a linear regression;

Gauss-Markov theorem;

Generalized Least Squares;

Multicollinearity;

Dummy variables;

Selection of linear regression models;

Partitioned regression;

Ridge regression.

How to cite

Please cite as:

Taboga, Marco (2021). "Linear regression - Hypothesis testing", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/linear-regression-hypothesis-testing.

Most of the learning materials found on this website are now available in a traditional textbook format.





6.2 - The General Linear F-Test

The " general linear F-test " involves three basic steps, namely:

  • Define a larger  full model . (By "larger," we mean one with more parameters.)
  • Define a smaller reduced model . (By "smaller," we mean one with fewer parameters.)
  • Use an F- statistic to decide whether or not to reject the smaller reduced model in favor of the larger full model.

As you can see by the wording of the third step, the null hypothesis always pertains to the reduced model, while the alternative hypothesis always pertains to the full model.

The easiest way to learn about the general linear test is to first go back to what we know, namely the simple linear regression model. Once we understand the general linear test for the simple case, we then see that it can be easily extended to the multiple-case model. We take that approach here.

The Full Model

The " full model ", which is also sometimes referred to as the " unrestricted model ," is the model thought to be most appropriate for the data. For simple linear regression, the full model is:

\(y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\)

Here's a plot of a hypothesized full model for a set of data that we worked with previously in this course (student heights and grade point averages):

plot

And, here's another plot of a hypothesized full model that we previously encountered (state latitudes and skin cancer mortalities):

plot

In each plot, the solid line represents what the hypothesized population regression line might look like for the full model. The question we have to answer in each case is "does the full model describe the data well?" Here, we might think that the full model does well in summarizing the trend in the second plot but not the first.

The Reduced Model

The " reduced model ," which is sometimes also referred to as the " restricted model ," is the model described by the null hypothesis \(H_{0}\). For simple linear regression, a common null hypothesis is \(H_{0} : \beta_{1} = 0\). In this case, the reduced model is obtained by "zeroing out" the slope \(\beta_{1}\) that appears in the full model. That is, the reduced model is:

\(y_i=\beta_0+\epsilon_i\)

This reduced model suggests that each response \(y_{i}\) is a function only of some overall mean, \(\beta_{0}\), and some error \(\epsilon_{i}\).

Let's take another look at the plot of student grade point average against height, but this time with a line representing what the hypothesized population regression line might look like for the reduced model:

plot

Not bad — there (fortunately?!) doesn't appear to be a relationship between height and grade point average. And, it appears as if the reduced model might be appropriate in describing the lack of a relationship between heights and grade point averages. What does the reduced model do for the skin cancer mortality example?

plot

It doesn't appear as if the reduced model would do a very good job of summarizing the trend in the population.

The F-Statistic Test

How do we decide if the reduced model or the full model does a better job of describing the trend in the data when it can't be determined by simply looking at a plot? What we need to do is to quantify how much error remains after fitting each of the two models to our data. That is, we take the general linear test approach:

  • Fit the full model to the data: obtain the least squares estimates of \(\beta_{0}\) and \(\beta_{1}\), and determine the error sum of squares, which we denote as " SSE ( F )."
  • Fit the reduced model to the data: obtain the least squares estimate of \(\beta_{0}\), and determine the error sum of squares, which we denote as " SSE ( R )."

Recall that, in general, the error sum of squares is obtained by summing the squared distances between the observed and fitted (estimated) responses:

\(\sum(\text{observed } - \text{ fitted})^2\)

Therefore, since \(y_i\) is the observed response and \(\hat{y}_i\) is the fitted response for the full model :

\(SSE(F)=\sum(y_i-\hat{y}_i)^2\)

And, since \(y_i\) is the observed response and \(\bar{y}\) is the fitted response for the reduced model :

\(SSE(R)=\sum(y_i-\bar{y})^2\)

Let's get a better feel for the general linear F-test approach by applying it to two different datasets. First, let's look at the Height and GPA data . The following plot of grade point averages against heights contains two estimated regression lines — the solid line is the estimated line for the full model, and the dashed line is the estimated line for the reduced model:

plot

As you can see, the estimated lines are almost identical. Calculating the error sum of squares for each model, we obtain:

\(SSE(F)=\sum(y_i-\hat{y}_i)^2=9.7055\)

\(SSE(R)=\sum(y_i-\bar{y})^2=9.7331\)

The two quantities are almost identical. Adding height to the reduced model to obtain the full model reduces the amount of error by only 0.0276 (from 9.7331 to 9.7055). That is, adding height to the model does very little in reducing the variability in grade point averages. In this case, there appears to be no advantage in using the larger full model over the simpler reduced model.

Look what happens when we fit the full and reduced models to the skin cancer mortality and latitude dataset :

plot

Here, there is quite a big difference between the estimated equation for the full model (solid line) and the estimated equation for the reduced model (dashed line). The error sums of squares quantify the substantial difference in the two estimated equations:

\(SSE(F)=\sum(y_i-\hat{y}_i)^2=17173\)

\(SSE(R)=\sum(y_i-\bar{y})^2=53637\)

Adding latitude to the reduced model to obtain the full model reduces the amount of error by 36464 (from 53637 to 17173). That is, adding latitude to the model substantially reduces the variability in skin cancer mortality. In this case, there appears to be a big advantage in using the larger full model over the simpler reduced model.

Where are we going with this general linear test approach? In short:

  • The general linear test involves a comparison between SSE ( R ) and SSE ( F ).
  • If SSE ( F ) is close to SSE ( R ), then the variation around the estimated full model regression function is almost as large as the variation around the estimated reduced model regression function. If that's the case, it makes sense to use the simpler reduced model.
  • On the other hand, if SSE ( F ) and SSE ( R ) differ greatly, then the additional parameter(s) in the full model substantially reduce the variation around the estimated regression function. In this case, it makes sense to go with the larger full model.

How different does SSE ( R ) have to be from SSE ( F ) in order to justify using the larger full model? The general linear F -statistic:

\(F^*=\left( \dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right)\div\left( \dfrac{SSE(F)}{df_F}\right)\)

helps answer this question. The F -statistic intuitively makes sense — it is a function of SSE ( R )- SSE ( F ), the difference in the error between the two models. The degrees of freedom — denoted \(df_{R}\) and \(df_{F}\) — are those associated with the reduced and full model error sum of squares, respectively.
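The formula above is easy to sketch in code; here it is checked against the two examples worked earlier in this section (heights and grade point averages with n = 35, so \(df_R = 34\) and \(df_F = 33\); skin cancer mortality with n = 49, so \(df_R = 48\) and \(df_F = 47\)):

```python
def general_linear_f(sse_r: float, df_r: int, sse_f: float, df_f: int) -> float:
    """General linear F-statistic:
    F* = ((SSE(R) - SSE(F)) / (df_R - df_F)) / (SSE(F) / df_F)."""
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

# Height / grade point average example from the text.
f_gpa = general_linear_f(9.7331, 34, 9.7055, 33)

# Skin cancer mortality / latitude example from the text.
f_skin = general_linear_f(53637, 48, 17173, 47)

print(round(f_gpa, 3), round(f_skin, 1))  # → 0.094 99.8
```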

We use the general linear F -statistic to decide whether or not:

  • to reject the null hypothesis \(H_{0}\colon\) The reduced model
  • in favor of the alternative hypothesis \(H_{A}\colon\) The full model

In general, we reject \(H_{0}\) if F * is large — or equivalently if its associated P -value is small.

The test applied to the simple linear regression model

For simple linear regression, it turns out that the general linear F -test is just the same ANOVA F -test that we learned before. As noted earlier for the simple linear regression case, the full model is:

\(y_i=\beta_0+\beta_1x_{i1}+\epsilon_i\)

and the reduced model is:

\(y_i=\beta_0+\epsilon_i\)

Therefore, the appropriate null and alternative hypotheses are specified either as:

  • \(H_{0} \colon y_i = \beta_{0} + \epsilon_{i}\)
  • \(H_{A} \colon y_i = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i}\)

or, equivalently, as:

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \neq 0 \)

The degrees of freedom associated with the error sum of squares for the reduced model is n -1, and:

\(SSE(R)=\sum(y_i-\bar{y})^2=SSTO\)

The degrees of freedom associated with the error sum of squares for the full model is n -2, and:

\(SSE(F)=\sum(y_i-\hat{y}_i)^2=SSE\)

Now, we can see how the general linear F -statistic just reduces algebraically to the ANOVA F -test that we know:

The general linear F-statistic can be rewritten by substituting:

\(\begin{aligned} &&df_{R} = n - 1\\  &&df_{F} = n - 2\\ &&SSE(R)=SSTO\\&&SSE(F)=SSE\end{aligned}\)

\(F^*=\left( \dfrac{SSTO-SSE}{(n-1)-(n-2)}\right)\div\left( \dfrac{SSE}{(n-2)}\right)=\frac{MSR}{MSE}\)

That is, the general linear F -statistic reduces to the ANOVA F -statistic:

\(F^*=\dfrac{MSR}{MSE}\)

For the student height and grade point average example:

\( F^*=\dfrac{MSR}{MSE}=\dfrac{0.0276/1}{9.7055/33}=\dfrac{0.0276}{0.2941}=0.094\)

For the skin cancer mortality example:

\( F^*=\dfrac{MSR}{MSE}=\dfrac{36464/1}{17173/47}=\dfrac{36464}{365.4}=99.8\)

The P -value is calculated as usual. The P -value answers the question: "what is the probability that we’d get an F* statistic as large as we did if the null hypothesis were true?" The P -value is determined by comparing F * to an F distribution with 1 numerator degree of freedom and n -2 denominator degrees of freedom. For the student height and grade point average example, the P -value is 0.761 (so we fail to reject \(H_{0}\) and we favor the reduced model), while for the skin cancer mortality example, the P -value is 0.000 (so we reject \(H_{0}\) and we favor the full model).

Example 6-2: Alcohol and Muscle Strength

Does alcoholism have an effect on muscle strength? Some researchers (Urbano-Marquez, et al , 1989) who were interested in answering this question collected the following data ( Alcohol Arm data ) on a sample of 50 alcoholic men:

  • x = the total lifetime dose of alcohol (kg per kg of body weight) consumed
  • y = the strength of the deltoid muscle in the man's non-dominant arm

The full model is the model that would summarize a linear relationship between alcohol consumption and arm strength. The reduced model, on the other hand, is the model that claims there is no relationship between alcohol consumption and arm strength.

  • \(H_0 \colon y_i = \beta_0 + \epsilon_i \)
  • \(H_A \colon y_i = \beta_0 + \beta_{1}x_i + \epsilon_i\)

or, equivalently:

  • \(H_0 \colon \beta_1 = 0\)
  • \(H_A \colon \beta_1 \neq 0\)

Upon fitting the reduced model to the data, we obtain:

plot

\(SSE(R)=\sum(y_i-\bar{y})^2=1224.32\)

Note that the reduced model does not appear to summarize the trend in the data very well.

Upon fitting the full model to the data, we obtain:

plot

\(SSE(F)=\sum(y_i-\hat{y}_i)^2=720.27\)

The full model appears to describe the trend in the data better than the reduced model.

The good news is that in the simple linear regression case, we don't have to bother with calculating the general linear F -statistic. Minitab does it for us in the ANOVA table.


Analysis of Variance

As you can see, Minitab calculates and reports both SSE ( F ) — the amount of error associated with the full model — and SSE ( R ) — the amount of error associated with the reduced model. The F -statistic is:

\( F^*=\dfrac{MSR}{MSE}=\dfrac{504.04/1}{720.27/48}=\dfrac{504.04}{15.006}=33.59\)

and its associated P -value is < 0.001 (so we reject \(H_{0}\) and favor the full model). We can conclude that there is a statistically significant linear association between lifetime alcohol consumption and arm strength.
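As a quick check of the arithmetic, the F-statistic for this example can be recomputed directly from SSE(R) and SSE(F) reported above (n = 50, so \(df_R = 49\) and \(df_F = 48\)):

```python
# Alcohol and muscle strength example: recompute F* from the two
# error sums of squares reported in the text.
sse_r, sse_f = 1224.32, 720.27
df_r, df_f = 49, 48

f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
print(round(f_star, 2))  # → 33.59, matching the ANOVA table
```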

This concludes our discussion of our first aside, the general linear F-test. Now, we move on to our second aside, sequential sums of squares.


4.4: Hypothesis Testing

  • David Diez, Christopher Barr, & Mine Çetinkaya-Rundel
  • OpenIntro Statistics


Is the typical US runner getting faster or slower over time? We consider this question in the context of the Cherry Blossom Run, comparing runners in 2006 and 2012. Technological advances in shoes, training, and diet might suggest runners would be faster in 2012. An opposing viewpoint might say that with the average body mass index on the rise, people tend to run slower. In fact, all of these components might be influencing run time.

In addition to considering run times in this section, we consider a topic near and dear to most students: sleep. A recent study found that college students average about 7 hours of sleep per night.15 However, researchers at a rural college are interested in showing that their students sleep longer than seven hours on average. We investigate this topic in Section 4.3.4.

Hypothesis Testing Framework

The average time for all runners who finished the Cherry Blossom Run in 2006 was 93.29 minutes (93 minutes and about 17 seconds). We want to determine if the run10Samp data set provides strong evidence that the participants in 2012 were faster or slower than those runners in 2006, versus the other possibility that there has been no change. 16 We simplify these three options into two competing hypotheses :

  • H 0 : The average 10 mile run time was the same for 2006 and 2012.
  • H A : The average 10 mile run time for 2012 was different than that of 2006.

We call H 0 the null hypothesis and H A the alternative hypothesis.

Null and alternative hypotheses

  • The null hypothesis (H 0 ) often represents either a skeptical perspective or a claim to be tested.
  • The alternative hypothesis (H A ) represents an alternative claim under consideration and is often represented by a range of possible parameter values.

15 theloquitur.com/?p=1161

16 While we could answer this question by examining the entire population data (run10), we only consider the sample data (run10Samp), which is more realistic since we rarely have access to population data.

The null hypothesis often represents a skeptical position or a perspective of no difference. The alternative hypothesis often represents a new perspective, such as the possibility that there has been a change.

Hypothesis testing framework

The skeptic will not reject the null hypothesis (H 0 ), unless the evidence in favor of the alternative hypothesis (H A ) is so strong that she rejects H 0 in favor of H A .

The hypothesis testing framework is a very general tool, and we often use it without a second thought. If a person makes a somewhat unbelievable claim, we are initially skeptical. However, if there is sufficient evidence that supports the claim, we set aside our skepticism and reject the null hypothesis in favor of the alternative. The hallmarks of hypothesis testing are also found in the US court system.

Exercise \(\PageIndex{1}\)

A US court considers two possible claims about a defendant: she is either innocent or guilty. If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative? 17

Jurors examine the evidence to see whether it convincingly shows a defendant is guilty. Even if the jurors leave unconvinced of guilt beyond a reasonable doubt, this does not mean they believe the defendant is innocent. This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as true. Failing to find strong evidence for the alternative hypothesis is not equivalent to accepting the null hypothesis.

18 H 0 : The average cost is $650 per month, \(\mu\) = $650.

In the example with the Cherry Blossom Run, the null hypothesis represents no difference in the average time from 2006 to 2012. The alternative hypothesis represents something new or more interesting: there was a difference, either an increase or a decrease. These hypotheses can be described in mathematical notation using \(\mu_{12}\) as the average run time for 2012:

  • H 0 : \(\mu_{12} = 93.29\)
  • H A : \(\mu_{12} \ne 93.29\)

where 93.29 minutes (93 minutes and about 17 seconds) is the average 10 mile time for all runners in the 2006 Cherry Blossom Run. Using this mathematical notation, the hypotheses can now be evaluated using statistical tools. We call 93.29 the null value since it represents the value of the parameter if the null hypothesis is true. We will use the run10Samp data set to evaluate the hypothesis test.

Testing Hypotheses using Confidence Intervals

We can start the evaluation of the hypothesis setup by comparing 2006 and 2012 run times using a point estimate from the 2012 sample: \(\bar {x}_{12} = 95.61\) minutes. This estimate suggests the average time is actually longer than the 2006 time, 93.29 minutes. However, to evaluate whether this provides strong evidence that there has been a change, we must consider the uncertainty associated with \(\bar {x}_{12}\).

17 The jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person's guilt; in such a case, the jury rejects innocence (the null hypothesis) and concludes the defendant is guilty (alternative hypothesis).

We learned in Section 4.1 that there is fluctuation from one sample to another, and it is very unlikely that the sample mean will be exactly equal to our parameter; we should not expect \(\bar {x}_{12}\) to exactly equal \(\mu_{12}\). Given that \(\bar {x}_{12} = 95.61\), it might still be possible that the population average in 2012 has remained unchanged from 2006. The difference between \(\bar {x}_{12}\) and 93.29 could be due to sampling variation, i.e. the variability associated with the point estimate when we take a random sample.

In Section 4.2, confidence intervals were introduced as a way to find a range of plausible values for the population mean. Based on run10Samp, a 95% confidence interval for the 2012 population mean, \(\mu_{12}\), was calculated as

\[(92.45, 98.77)\]

Because the 2006 mean, 93.29, falls in the range of plausible values, we cannot say the null hypothesis is implausible. That is, we failed to reject the null hypothesis, H 0 .
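The confidence-interval test just described amounts to a containment check; a minimal sketch using the interval and null value from this example:

```python
# 95% confidence interval for the 2012 mean run time, from the text.
lower, upper = 92.45, 98.77
null_value = 93.29  # the 2006 mean, i.e. the value under H0

# Reject H0 only if the null value falls outside the interval.
reject_h0 = not (lower <= null_value <= upper)
print(reject_h0)  # → False: we fail to reject H0
```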

Double negatives can sometimes be used in statistics

In many statistical explanations, we use double negatives. For instance, we might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying it is correct.

Example \(\PageIndex{1}\)

Next consider whether there is strong evidence that the average age of runners has changed from 2006 to 2012 in the Cherry Blossom Run. In 2006, the average age was 36.13 years, and in the 2012 run10Samp data set, the average was 35.05 years with a standard deviation of 8.97 years for 100 runners.

First, set up the hypotheses:

  • H 0 : The average age of runners has not changed from 2006 to 2012, \(\mu_{age} = 36.13.\)
  • H A : The average age of runners has changed from 2006 to 2012, \(\mu_{age} \ne 36.13.\)

We have previously verified conditions for this data set. The normal model may be applied to \(\bar {y}\) and the estimate of SE should be very accurate. Using the sample mean and standard error, we can construct a 95% confidence interval for \(\mu _{age}\) to determine if there is sufficient evidence to reject H 0 :

\[\bar{y} \pm 1.96 \times \dfrac {s}{\sqrt {100}} \rightarrow 35.05 \pm 1.96 \times 0.90 \rightarrow (33.29, 36.81)\]

This confidence interval contains the null value, 36.13. Because 36.13 is not implausible, we cannot reject the null hypothesis. We have not found strong evidence that the average age is different than 36.13 years.
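The interval in this example can be reproduced with a few lines of arithmetic, using the sample values stated above (\(\bar{y} = 35.05\), s = 8.97, n = 100):

```python
import math

# 95% confidence interval for the average runner age in 2012.
n = 100
se = 8.97 / math.sqrt(n)   # ≈ 0.897, rounded to 0.90 in the text
lower = 35.05 - 1.96 * se
upper = 35.05 + 1.96 * se

print(round(lower, 2), round(upper, 2))  # → 33.29 36.81
```

The null value 36.13 lies inside this interval, which is why we cannot reject H 0.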

Exercise \(\PageIndex{2}\)

Colleges frequently provide estimates of student expenses such as housing. A consultant hired by a community college claimed that the average student housing expense was $650 per month. What are the null and alternative hypotheses to test whether this claim is accurate? 18

Figure 4.11: Sample distribution of student housing expense. These data are moderately skewed, roughly determined using the outliers on the right.

H A : The average cost is different than $650 per month, \(\mu \ne\) $650.

19 Applying the normal model requires that certain conditions are met. Because the data are a simple random sample and the sample (presumably) represents no more than 10% of all students at the college, the observations are independent. The sample size is also sufficiently large (n = 75) and the data exhibit only moderate skew. Thus, the normal model may be applied to the sample mean.

Exercise \(\PageIndex{3}\)

The community college decides to collect data to evaluate the $650 per month claim. They take a random sample of 75 students at their school and obtain the data represented in Figure 4.11. Can we apply the normal model to the sample mean? 19

If the court makes a Type 1 Error, this means the defendant is innocent (H 0 true) but wrongly convicted. A Type 2 Error means the court failed to reject H 0 (i.e. failed to convict the person) when she was in fact guilty (H A true).

Example \(\PageIndex{2}\)

The sample mean for student housing is $611.63 and the sample standard deviation is $132.85. Construct a 95% confidence interval for the population mean and evaluate the hypotheses of Exercise 4.22.

The standard error associated with the mean may be estimated using the sample standard deviation divided by the square root of the sample size. Recall that n = 75 students were sampled.

\[ SE = \dfrac {s}{\sqrt {n}} = \dfrac {132.85}{\sqrt {75}} = 15.34\]

You showed in Exercise 4.23 that the normal model may be applied to the sample mean. This ensures a 95% confidence interval may be accurately constructed:

\[\bar {x} \pm z^{\star} SE \rightarrow 611.63 \pm 1.96 \times 15.34 \rightarrow (581.56, 641.70)\]

Because the null value $650 is not in the confidence interval, a true mean of $650 is implausible and we reject the null hypothesis. The data provide statistically significant evidence that the actual average housing expense is less than $650 per month.
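This example, too, can be reproduced directly from the sample values stated above (\(\bar{x} = 611.63\), s = 132.85, n = 75):

```python
import math

# 95% confidence interval for the average monthly housing expense.
n = 75
se = 132.85 / math.sqrt(n)   # ≈ 15.34
lower = 611.63 - 1.96 * se
upper = 611.63 + 1.96 * se

# The null value $650 falls outside the interval, so H0 is rejected.
reject_h0 = not (lower <= 650 <= upper)
print(round(lower, 2), round(upper, 2), reject_h0)
```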

Decision Errors

Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, we can make a wrong decision in statistical hypothesis tests. However, the difference is that we have the tools necessary to quantify how often we make such errors.

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 4.12.

A Type 1 Error is rejecting the null hypothesis when H0 is actually true. A Type 2 Error is failing to reject the null hypothesis when the alternative is actually true.

Exercise 4.25

In a US court, the defendant is either innocent (H 0 ) or guilty (H A ). What does a Type 1 Error represent in this context? What does a Type 2 Error represent? Table 4.12 may be useful.

To lower the Type 1 Error rate, we might raise our standard for conviction from "beyond a reasonable doubt" to "beyond a conceivable doubt" so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.

Exercise 4.26

How could we reduce the Type 1 Error rate in US courts? What influence would this have on the Type 2 Error rate?

To lower the Type 2 Error rate, we want to convict more guilty people. We could lower the standards for conviction from "beyond a reasonable doubt" to "beyond a little doubt". Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.

Exercise 4.27

How could we reduce the Type 2 Error rate in US courts? What influence would this have on the Type 1 Error rate?

(This answers the skeptical-position question in the sleep study discussed below: a skeptic would have no reason to believe that sleep patterns at this school are any different from the sleep patterns at another school.)

Exercises 4.25-4.27 provide an important lesson:

If we reduce how often we make one type of error, we generally make more of the other type.

Hypothesis testing is built around rejecting or failing to reject the null hypothesis. That is, we do not reject H 0 unless we have strong evidence. But what precisely does strong evidence mean? As a general rule of thumb, for those cases where the null hypothesis is actually true, we do not want to incorrectly reject H 0 more than 5% of the time. This corresponds to a significance level of 0.05. We often write the significance level using \(\alpha\) (the Greek letter alpha): \(\alpha = 0.05.\) We discuss the appropriateness of different significance levels in Section 4.3.6.

If we use a 95% confidence interval to test a hypothesis where the null hypothesis is true, we will make an error whenever the point estimate is at least 1.96 standard errors away from the population parameter. This happens about 5% of the time (2.5% in each tail). Similarly, using a 99% confidence interval to evaluate a hypothesis is equivalent to a significance level of \(\alpha = 0.01\).
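The correspondence between confidence levels and significance levels can be checked numerically. This minimal sketch uses Python's standard-library `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# Two-sided tail area beyond 1.96 standard errors: about 0.05
two_tail_95 = 2 * (1 - Z.cdf(1.96))
print(round(two_tail_95, 3))  # 0.05

# Critical value matching a 99% confidence interval (alpha = 0.01)
z_star_99 = Z.inv_cdf(0.995)
print(round(z_star_99, 3))  # 2.576
```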

A confidence interval is, in one sense, simplistic in the world of hypothesis tests. Consider the following two scenarios:

  • The null value (the parameter value under the null hypothesis) is in the 95% confidence interval but just barely, so we would not reject H 0 . However, we might like to somehow say, quantitatively, that it was a close decision.
  • The null value is very far outside of the interval, so we reject H 0 . However, we want to communicate that, not only did we reject the null hypothesis, but it wasn't even close. Such a case is depicted in Figure 4.13.

In Section 4.3.4, we introduce a tool called the p-value that will be helpful in these cases. The p-value method also extends to hypothesis tests where confidence intervals cannot be easily constructed or applied.


Formal Testing using p-Values

The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative. Formally, the p-value is a conditional probability.

definition: p-value

The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true. We typically use a summary statistic of the data, in this chapter the sample mean, to help compute the p-value and evaluate the hypotheses.

A poll by the National Sleep Foundation found that college students average about 7 hours of sleep per night. Researchers at a rural school are interested in showing that students at their school sleep longer than seven hours on average, and they would like to demonstrate this using a sample of students. What would be an appropriate skeptical position for this research?

This is entirely based on the interests of the researchers. Had they been only interested in the opposite case - showing that their students were actually averaging fewer than seven hours of sleep but not interested in showing more than 7 hours - then our setup would have set the alternative as \(\mu < 7\).


We can set up the null hypothesis for this test as a skeptical perspective: the students at this school average 7 hours of sleep per night. The alternative hypothesis takes a new form reflecting the interests of the research: the students average more than 7 hours of sleep. We can write these hypotheses as

  • H 0 : \(\mu\) = 7.
  • H A : \(\mu\) > 7.

Using \(\mu\) > 7 as the alternative is an example of a one-sided hypothesis test. In this investigation, there is no apparent interest in learning whether the mean is less than 7 hours. Earlier we encountered a two-sided hypothesis where we looked for any clear difference, greater than or less than the null value.

Always use a two-sided test unless it was made clear prior to data collection that the test should be one-sided. Switching a two-sided test to a one-sided test after observing the data is dangerous because it can inflate the Type 1 Error rate.

TIP: One-sided and two-sided tests

If the researchers are only interested in showing an increase or a decrease, but not both, use a one-sided test. If the researchers would be interested in any difference from the null value - an increase or decrease - then the test should be two-sided.

TIP: Always write the null hypothesis as an equality

We will find it most useful if we always list the null hypothesis as an equality (e.g. \(\mu = 7\)) while the alternative always uses an inequality (e.g. \(\mu \ne 7\), \(\mu > 7\), or \(\mu < 7\)).

The researchers at the rural school conducted a simple random sample of n = 110 students on campus. They found that these students averaged 7.42 hours of sleep and the standard deviation of the amount of sleep for the students was 1.75 hours. A histogram of the sample is shown in Figure 4.14.

Before we can use a normal model for the sample mean or compute the standard error of the sample mean, we must verify conditions. (1) Because this is a simple random sample from less than 10% of the student body, the observations are independent. (2) The sample size in the sleep study is sufficiently large since it is greater than 30. (3) The data show moderate skew in Figure 4.14 and the presence of a couple of outliers. This skew and the outliers (which are not too extreme) are acceptable for a sample size of n = 110. With these conditions verified, the normal model can be safely applied to \(\bar {x}\) and the estimated standard error will be very accurate.

What is the standard deviation associated with \(\bar {x}\)? That is, estimate the standard error of \(\bar {x}\). 25

25 The standard error can be estimated from the sample standard deviation and the sample size: \(SE_{\bar {x}} = \dfrac {s_x}{\sqrt {n}} = \dfrac {1.75}{\sqrt {110}} = 0.17\).

The hypothesis test will be evaluated using a significance level of \(\alpha = 0.05\). We want to consider the data under the scenario that the null hypothesis is true. In this case, the sample mean is from a distribution that is nearly normal and has mean 7 and standard deviation of about 0.17. Such a distribution is shown in Figure 4.15.


The shaded tail in Figure 4.15 represents the chance of observing such a large mean, conditional on the null hypothesis being true. That is, the shaded tail represents the p-value. We shade all means larger than our sample mean, \(\bar {x} = 7.42\), because they are more favorable to the alternative hypothesis than the observed mean.

We compute the p-value by finding the tail area of this normal distribution, which we learned to do in Section 3.1. First compute the Z score of the sample mean, \(\bar {x} = 7.42\):

\[Z = \dfrac {\bar {x} - \text {null value}}{SE_{\bar {x}}} = \dfrac {7.42 - 7}{0.17} = 2.47\]

Using the normal probability table, the lower unshaded area is found to be 0.993. Thus the shaded area is 1 - 0.993 = 0.007. If the null hypothesis is true, the probability of observing such a large sample mean for a sample of 110 students is only 0.007. That is, if the null hypothesis is true, we would not often see such a large mean.

We evaluate the hypotheses by comparing the p-value to the significance level. Because the p-value is less than the significance level \((p-value = 0.007 < 0.05 = \alpha)\), we reject the null hypothesis. What we observed is so unusual with respect to the null hypothesis that it casts serious doubt on H 0 and provides strong evidence favoring H A .
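The sleep-study calculation above can be reproduced with the Python standard library. The standard error is rounded to 0.17 to match the value used in the text:

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s, null_value = 110, 7.42, 1.75, 7.0
se = round(s / sqrt(n), 2)             # 0.17, the rounded value used in the text
z = (x_bar - null_value) / se          # Z score of the sample mean
p_value = 1 - NormalDist().cdf(z)      # upper-tail area under H0
print(round(z, 2), round(p_value, 3))  # 2.47 0.007
```

Because 0.007 < 0.05, this reproduces the conclusion in the text: reject the null hypothesis.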

p-value as a tool in hypothesis testing

The p-value quantifies how strongly the data favor H A over H 0 . A small p-value (usually < 0.05) corresponds to sufficient evidence to reject H 0 in favor of H A .

TIP: First draw a picture to find the p-value

It is useful to draw a picture of the distribution of \(\bar {x}\) as though H 0 was true (i.e. \(\mu\) equals the null value), and shade the region (or regions) of sample means that are at least as favorable to the alternative hypothesis. These shaded regions represent the p-value.

The ideas below review the process of evaluating hypothesis tests with p-values:

  • The null hypothesis represents a skeptic's position or a position of no difference. We reject this position only if the evidence strongly favors H A .
  • A small p-value means that if the null hypothesis is true, there is a low probability of seeing a point estimate at least as extreme as the one we saw. We interpret this as strong evidence in favor of the alternative.
  • We reject the null hypothesis if the p-value is smaller than the significance level, \(\alpha\), which is usually 0.05. Otherwise, we fail to reject H 0 .
  • We should always state the conclusion of the hypothesis test in plain language so non-statisticians can also understand the results.

The p-value is constructed in such a way that we can directly compare it to the significance level ( \(\alpha\)) to determine whether or not to reject H 0 . This method ensures that the Type 1 Error rate does not exceed the significance level standard.


If the null hypothesis is true, how often should the p-value be less than 0.05?

About 5% of the time. If the null hypothesis is true, then the data only has a 5% chance of being in the 5% of data most favorable to H A .


Exercise 4.31

Suppose we had used a significance level of 0.01 in the sleep study. Would the evidence have been strong enough to reject the null hypothesis? (The p-value was 0.007.) What if the significance level was \(\alpha = 0.001\)? 27

27 We reject the null hypothesis whenever p-value < \(\alpha\). Thus, we would still reject the null hypothesis if \(\alpha = 0.01\) but not if the significance level had been \(\alpha = 0.001\).

Exercise 4.32

Ebay might be interested in showing that buyers on its site tend to pay less than they would for the corresponding new item on Amazon. We'll research this topic for one particular product: a video game called Mario Kart for the Nintendo Wii. During early October 2009, Amazon sold this game for $46.99. Set up an appropriate (one-sided!) hypothesis test to check the claim that Ebay buyers pay less during auctions at this same time. 28

28 The skeptic would say the average is the same on Ebay, and we are interested in showing the average price is lower.

Exercise 4.33

During early October, 2009, 52 Ebay auctions were recorded for Mario Kart.29 The total prices for the auctions are presented using a histogram in Figure 4.17, and we may like to apply the normal model to the sample mean. Check the three conditions required for applying the normal model: (1) independence, (2) at least 30 observations, and (3) the data are not strongly skewed. 30

30 (1) The independence condition is unclear. We will make the assumption that the observations are independent, which we should report with any final results. (2) The sample size is sufficiently large: \(n = 52 \ge 30\). (3) The data distribution is not strongly skewed; it is approximately symmetric.

H 0 : The average auction price on Ebay is equal to (or more than) the price on Amazon. We write only the equality in the statistical notation: \(\mu_{ebay} = 46.99\).

H A : The average price on Ebay is less than the price on Amazon, \(\mu _{ebay} < 46.99\).

29 These data were collected by OpenIntro staff.

Example 4.34

The average sale price of the 52 Ebay auctions for Wii Mario Kart was $44.17 with a standard deviation of $4.15. Does this provide sufficient evidence to reject the null hypothesis in Exercise 4.32? Use a significance level of \(\alpha = 0.01\).

The hypotheses were set up and the conditions were checked in Exercises 4.32 and 4.33. The next step is to find the standard error of the sample mean and produce a sketch to help find the p-value.


Because the alternative hypothesis says we are looking for a smaller mean, we shade the lower tail. We find this shaded area by using the Z score and normal probability table: \(Z = \dfrac {44.17 - 46.99}{0.5755} = -4.90\), which has area less than 0.0002. The area is so small we cannot really see it on the picture. This lower tail area corresponds to the p-value.

Because the p-value is so small - specifically, smaller than \(\alpha = 0.01\) - this provides sufficiently strong evidence to reject the null hypothesis in favor of the alternative. The data provide statistically significant evidence that the average price on Ebay is lower than Amazon's asking price.
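The one-sided Ebay calculation can likewise be sketched in Python with the standard library:

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s, null_value = 52, 44.17, 4.15, 46.99
se = s / sqrt(n)                  # ≈ 0.5755
z = (x_bar - null_value) / se     # ≈ -4.90
p_value = NormalDist().cdf(z)     # lower-tail area under H0
print(round(se, 4), round(z, 2))  # 0.5755 -4.9
```

The resulting p-value is far below 0.0002, matching the text's conclusion.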

Two-sided hypothesis testing with p-values

We now consider how to compute a p-value for a two-sided test. In one-sided tests, we shade the single tail in the direction of the alternative hypothesis. For example, when the alternative had the form \(\mu\) > 7, then the p-value was represented by the upper tail (Figure 4.16). When the alternative was \(\mu\) < 46.99, the p-value was the lower tail (Exercise 4.32). In a two-sided test, we shade two tails since evidence in either direction is favorable to H A .

Exercise 4.35 Earlier we talked about a research group investigating whether the students at their school slept longer than 7 hours each night. Let's consider a second group of researchers who want to evaluate whether the students at their college differ from the norm of 7 hours. Write the null and alternative hypotheses for this investigation. 31

Example 4.36 The second college randomly samples 72 students and finds a mean of \(\bar {x} = 6.83\) hours and a standard deviation of s = 1.8 hours. Does this provide strong evidence against H 0 in Exercise 4.35? Use a significance level of \(\alpha = 0.05\).

First, we must verify assumptions. (1) A simple random sample of less than 10% of the student body means the observations are independent. (2) The sample size is 72, which is greater than 30. (3) Based on the earlier distribution and what we already know about college student sleep habits, the distribution is probably not strongly skewed.

Next we can compute the standard error \((SE_{\bar {x}} = \dfrac {s}{\sqrt {n}} = 0.21)\) of the estimate and create a picture to represent the p-value, shown in Figure 4.18. Both tails are shaded.

31 Because the researchers are interested in any difference, they should use a two-sided setup: H 0 : \(\mu\) = 7, H A : \(\mu \ne 7.\)


An estimate of 7.17 or more provides at least as strong evidence against the null hypothesis and in favor of the alternative as the observed estimate, \(\bar {x} = 6.83\).

We can calculate the tail areas by first finding the lower tail corresponding to \(\bar {x}\):

\[Z = \dfrac {6.83 - 7.00}{0.21} = -0.81 \xrightarrow {table} \text {left tail} = 0.2090\]

Because the normal model is symmetric, the right tail will have the same area as the left tail. The p-value is found as the sum of the two shaded tails:

\[ \text {p-value} = \text {left tail} + \text {right tail} = 2 \times \text {(left tail)} = 0.4180\]

This p-value is relatively large (larger than \(\alpha = 0.05\)), so we should not reject H 0 . That is, if H 0 is true, it would not be very unusual to see a sample mean this far from 7 hours simply due to sampling variation. Thus, we do not have sufficient evidence to conclude that the mean is different than 7 hours.
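The two-sided calculation can be reproduced in Python, again rounding the standard error to 0.21 as in the text:

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s, null_value = 72, 6.83, 1.8, 7.0
se = round(s / sqrt(n), 2)          # 0.21, as rounded in the text
z = (x_bar - null_value) / se       # ≈ -0.81
left_tail = NormalDist().cdf(z)     # ≈ 0.2090
p_value = 2 * left_tail             # both tails, by symmetry
print(round(p_value, 3))  # 0.418
```

Since 0.418 > 0.05, we fail to reject H 0, matching the conclusion above.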

Example 4.37 It is never okay to change two-sided tests to one-sided tests after observing the data. In this example we explore the consequences of ignoring this advice. Using \(\alpha = 0.05\), we show that freely switching from two-sided tests to one-sided tests will cause us to make twice as many Type 1 Errors as intended.

Suppose the sample mean was larger than the null value, \(\mu_0\) (e.g. \(\mu_0\) would represent 7 if H 0 : \(\mu\) = 7). Then if we can flip to a one-sided test, we would use H A : \(\mu > \mu_0\). Now if we obtain any observation with a Z score greater than 1.65, we would reject H 0 . If the null hypothesis is true, we incorrectly reject the null hypothesis about 5% of the time when the sample mean is above the null value, as shown in Figure 4.19.

Suppose the sample mean was smaller than the null value. Then if we change to a one-sided test, we would use H A : \(\mu < \mu_0\). If \(\bar {x}\) had a Z score smaller than -1.65, we would reject H 0 . If the null hypothesis is true, then we would observe such a case about 5% of the time.

By examining these two scenarios, we can determine that we will make a Type 1 Error 5% + 5% = 10% of the time if we are allowed to swap to the "best" one-sided test for the data. This is twice the error rate we prescribed with our significance level: \(\alpha = 0.05\) (!).
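The doubling of the Type 1 Error rate can be checked by simulation. This sketch repeatedly draws a Z score under H 0 and "cheats" by choosing the one-sided alternative after seeing the data's direction:

```python
import random
from statistics import NormalDist

random.seed(42)
z_crit = NormalDist().inv_cdf(0.95)    # 1.645, the one-sided 5% critical value

trials = 100_000
rejections = 0
for _ in range(trials):
    z = random.gauss(0, 1)             # Z score of the point estimate under H0
    # "Cheat": pick the one-sided alternative matching the data's direction,
    # so we reject whenever |z| exceeds the one-sided critical value
    if abs(z) > z_crit:
        rejections += 1

print(rejections / trials)             # close to 0.10, twice the nominal 5%
```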


Caution: One-sided hypotheses are allowed only before seeing data

After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid this temptation. Hypotheses must be set up before observing the data. If they are not, the test must be two-sided.

Choosing a Significance Level

Choosing a significance level for a test is important in many contexts, and the traditional level is 0.05. However, it is often helpful to adjust the significance level based on the application. We may select a level that is smaller or larger than 0.05 depending on the consequences of any conclusions reached from the test.

  • If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g. 0.01). Under this scenario we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence favoring H A before we would reject H 0 .
  • If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g. 0.10). Here we want to be cautious about failing to reject H 0 when the null is actually false. We will discuss this particular case in greater detail in Section 4.6.

Significance levels should reflect consequences of errors

The significance level selected for a test should reflect the consequences associated with Type 1 and Type 2 Errors.

Example 4.38

A car manufacturer is considering a higher quality but more expensive supplier for window parts in its vehicles. They sample a number of parts from their current supplier and also parts from the new supplier. They decide that if the high quality parts will last more than 12% longer, it makes financial sense to switch to this more expensive supplier. Is there good reason to modify the significance level in such a hypothesis test?

The null hypothesis is that the more expensive parts last no more than 12% longer while the alternative is that they do last more than 12% longer. This decision is just one of the many regular factors that have a marginal impact on the car and company. A significance level of 0.05 seems reasonable since neither a Type 1 nor a Type 2 Error should be dangerous or (relatively) much more expensive.

Example 4.39

The same car manufacturer is considering a slightly more expensive supplier for parts related to safety, not windows. If the durability of these safety components is shown to be better than the current supplier, they will switch manufacturers. Is there good reason to modify the significance level in such an evaluation?

The null hypothesis would be that the suppliers' parts are equally reliable. Because safety is involved, the car company should be eager to switch to the slightly more expensive manufacturer (reject H 0 ) even if the evidence of increased safety is only moderately strong. A slightly larger significance level, such as \(\alpha = 0.10\), might be appropriate.

Exercise 4.40

A part inside of a machine is very expensive to replace. However, the machine usually functions properly even if this part is broken, so the part is replaced only if we are extremely certain it is broken based on a series of measurements. Identify appropriate hypotheses for this test (in plain language) and suggest an appropriate significance level. 32


Linear regression hypothesis testing: Concepts, Examples

Simple linear regression model

In machine learning, linear regression is a predictive modeling technique for estimating a continuous response variable as a linear combination of explanatory or predictor variables. When training linear regression models, we rely on hypothesis testing to assess the relationship between the response and predictor variables. Two types of hypothesis tests are performed on a linear regression model: T-tests, which assess individual coefficients, and F-tests, which assess the model as a whole. In other words, t-statistics and f-statistics are the two statistics used to judge whether a meaningful linear relationship exists between the response and predictor variables. As data scientists, it is of utmost importance to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis tests on the regression's response and predictor variables. These concepts are often unclear to many data scientists. In this blog post, we will discuss linear regression and the hypothesis tests based on t-statistics and f-statistics, along with an example to help illustrate how these concepts work.


What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or Univariate linear regression models : These are linear regression models used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y = mX + b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or Multi-variate linear regression models : These are linear regression models used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y = b0 + b1X1 + b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .

The residual of the ith observation is represented as follows, where [latex]Y_i[/latex] is the ith observed value of the response variable and [latex]\hat{Y_i}[/latex] is the predicted value for the ith observation:

[latex]e_i = Y_i - \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + … + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
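The least-squares idea can be illustrated with a short Python sketch on hypothetical data, using the closed-form solution for simple linear regression, which minimizes the RSS defined above:

```python
# Hypothetical data points (x, y) for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form least-squares estimates for Y = m*X + b
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
b = y_mean - m * x_mean

# Residual sum of squares: the quantity least squares minimizes
rss = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
print(round(m, 2), round(b, 2), round(rss, 3))  # 1.99 0.05 0.107
```

Any other choice of slope and intercept for this data would give a larger RSS.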

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for the linear regression? The answer is no. After all, the coefficients are only estimates, and thus there are standard errors associated with each of them. Recall that the standard error represents the error of estimating a population parameter based on sample data, and is used to calculate the confidence interval within which the true value of the parameter is expected to lie. For a sample mean, the standard error is calculated as the standard deviation divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]

Thus, without analyzing aspects such as the standard errors associated with the coefficients, it cannot be claimed that the linear regression coefficients are the most suitable ones; this is where hypothesis testing is needed. Before we get into why hypothesis testing is needed for the linear regression model, let's briefly review what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into the hypothesis testing concepts in relation to the linear regression model, let's train a multi-variate or multiple linear regression model and print the summary output of the model, which will be referred to in the next section.

The data used for creating a multi-linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing)
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
summary(BostonHousing.lm)

Executing the above command will result in the creation of a linear regression model with log(medv) as the response variable and crim, chas, rad, and lstat as the predictor variables. The following represents the details related to the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints the details of the model, including the hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (f-statistics).

[Figure: summary output of the linear regression model in R]

Hypothesis tests & Linear Regression Models

A hypothesis test is a statistical procedure used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps of hypothesis testing with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is that there exists a relationship between the response and predictor variables, represented by non-zero values of the coefficients of the predictor variables in the regression model. This is formulated as the alternate hypothesis. The null hypothesis is thus that there is no relationship between the response and the predictor variables, i.e., that the coefficient of each predictor variable is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0, etc. For each predictor variable, an individual hypothesis test is done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests, each with its own null and alternate hypothesis.
  • Hypothesis formulation for F-test: In addition, a hypothesis test is done on the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist. This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistics for testing hypothesis for linear regression model: The F-test is used to test the null hypothesis that a linear regression model does not exist, i.e., that there is no linear relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. This null hypothesis can also be represented as a1 = a2 = a3 = a4 = a5 = 0. The f-statistic is calculated as a function of the sum of squared residuals for the restricted regression (the linear regression model with only the intercept or bias, i.e., all coefficient values set to zero) and the sum of squared residuals for the unrestricted regression (the full linear regression model). In the above diagram, note the value of the f-statistic as 15.66 against the degrees of freedom as 5 and 194.
  • Evaluate t-statistics against the critical value/region: After calculating the value of the t-statistic for each coefficient, it is time to decide whether to reject or fail to reject the null hypothesis. To make this decision, one needs to set a significance level, also known as the alpha level. A significance level of 0.05 is usually used. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate f-statistics against the critical value/region: The value of the f-statistic and its p-value are evaluated for testing the null hypothesis that the linear regression model representing the response and predictor variables does not exist. If the value of the f-statistic is more than the critical value at the significance level of 0.05, the null hypothesis is rejected. This means that a linear model exists with at least one non-zero coefficient.
  • Draw conclusions: The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for a predictor variable is rejected, the relationship between the response and that predictor variable is statistically significant based on the sample data used for training the model. Similarly, if the f-statistic falls in the critical region and its p-value is less than the alpha level (usually set as 0.05), one can say that a linear regression model exists.
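While the R example above is multivariate, the mechanics of the t- and f-statistics are easiest to see for a simple one-predictor regression. The following Python sketch uses hypothetical data and the standard closed-form formulas; with a single predictor, the f-statistic equals the square of the t-statistic:

```python
from math import sqrt

# Hypothetical data (one predictor), fit with closed-form least squares
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_mean) ** 2 for x in xs)
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
b = y_mean - m * x_mean

rss = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # unrestricted RSS
tss = sum((y - y_mean) ** 2 for y in ys)  # restricted (intercept-only) RSS

# t-statistic for H0: m = 0
sigma2 = rss / (n - 2)        # residual variance estimate
se_m = sqrt(sigma2 / sxx)     # standard error of the slope
t_stat = m / se_m

# f-statistic for H0: all slope coefficients are 0 (one restriction here)
f_stat = ((tss - rss) / 1) / (rss / (n - 2))

print(round(t_stat, 1))  # with one predictor, f_stat == t_stat ** 2
```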

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in the case of a linear regression model are the following:

  • By creating the model, we are making claims about the relationship between the response or dependent variable and one or more predictor or independent variables. One or more tests are needed to justify these claims; such tests are called hypothesis tests.
  • One kind of test is required to test the relationship between response and each of the predictor variables (hence, T-tests)
  • Another kind of test is required to test the linear regression model representation as a whole. This is called F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant. The coefficient of each predictor variable is estimated, and an individual hypothesis test is performed to determine whether the relationship between the response and that particular predictor is statistically significant based on the sample data used for training the model. The t-statistic is used for this test because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to accept or reject the null hypothesis. If the value falls in the critical region, the null hypothesis is rejected, which means there is a statistically significant relationship between the response and that predictor variable. In addition to the t-tests, an F-test is performed to test the null hypothesis that no linear regression model exists, i.e., that all the slope coefficients are zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example .
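As a small illustration of the overall F-test described above, the sketch below uses toy data with a single predictor (in which case the F-statistic is simply the squared t-statistic); the critical value 10.13 is the 5% value of the F(1, 3) distribution:

```python
# Overall F-test sketch on toy single-predictor data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
n, k = len(x), 1                      # sample size, number of predictors
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))   # residual sum of squares
sst = sum((yi - my) ** 2 for yi in y)                            # total sum of squares
ssr = sst - sse                                                  # regression sum of squares
f_stat = (ssr / k) / (sse / (n - k - 1))   # F-statistic for H0: all slopes are 0
f_crit = 10.13                             # F(1, 3) critical value at alpha = 0.05
reject_model = f_stat > f_crit             # reject H0: the linear model is significant
```

The F-statistic is far above the critical value, so the null hypothesis that no linear relationship exists is rejected.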


Linear Hypothesis Tests

Most regression output will include the results of frequentist hypothesis tests comparing each coefficient to 0. However, in many cases, you may be interested in whether a linear combination of the coefficients is 0. For example, in the regression

\(Y = \beta_0 + \beta_1 GoodThing + \beta_2 BadThing + \varepsilon\)

you may be interested to see if \(GoodThing\) and \(BadThing\) (both binary variables) cancel each other out. So you would want to do a test of \(\beta_1 - \beta_2 = 0\).

Alternately, you may want to do a joint significance test of multiple linear hypotheses. For example, you may be interested in whether \(\beta_1\) or \(\beta_2\) is nonzero, and so would want to jointly test the hypotheses \(\beta_1 = 0\) and \(\beta_2 = 0\) rather than testing them one at a time. Note the joint nature of the null here: if either restriction fails, we reject the joint null.
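One standard way to test a single linear combination such as \(\beta_1 - \beta_2 = 0\) is a t-test built from the OLS coefficient covariance matrix. The sketch below uses simulated data with illustrative variable names (not from any real study); the true coefficients are set equal, so the null holds by construction:

```python
import numpy as np

# Hypothetical two-regressor example: test beta1 - beta2 = 0 using the
# covariance matrix of the OLS estimates. Data and names are simulated.
rng = np.random.default_rng(42)
n = 200
good = rng.integers(0, 2, n).astype(float)   # stand-in for GoodThing (binary)
bad = rng.integers(0, 2, n).astype(float)    # stand-in for BadThing (binary)
y = 1.0 + 0.5 * good + 0.5 * bad + rng.normal(0, 1, n)  # equal true effects
X = np.column_stack([np.ones(n), good, bad])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])        # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)            # covariance matrix of OLS estimates
r = np.array([0.0, 1.0, -1.0])               # restriction vector: beta1 - beta2
t_stat = (r @ beta) / np.sqrt(r @ cov @ r)   # compare against t(n - 3) critical value
```

Since the restriction is true in the simulated data, |t_stat| should be small and the null is not rejected.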

Keep in Mind

  • Be sure to carefully interpret the result. If you are doing a joint test, rejection means that at least one of your hypotheses can be rejected, not each of them. And you don’t necessarily know which ones can be rejected!
  • Generally, linear hypothesis tests are performed using F-statistics. However, there are alternate approaches such as likelihood-ratio tests or chi-squared tests. Be sure you know which one you’re getting.
  • Conceptually, what is going on with linear hypothesis tests is that they compare the model you’ve estimated against a more restrictive one that requires your restrictions (hypotheses) to be true. If the test you have in mind is too complex for the software to figure out on its own, you might be able to do it yourself by taking the sum of squared residuals in your original unrestricted model (\(SSR_{UR}\)), estimating the alternate model with the restriction in place (\(SSR_R\)), and then calculating the F-statistic for the joint test using \(F_{q,n-k-1} = ((SSR_R - SSR_{UR})/q)/(SSR_{UR}/(n-k-1))\).
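The restricted-vs-unrestricted F-statistic above can be computed by hand; here is a sketch on simulated data (names and numbers are illustrative) for the joint hypothesis \(\beta_1 = 0\) and \(\beta_2 = 0\):

```python
import numpy as np

# Simulated data: x1 truly matters, x2 does not, so the joint null
# (beta1 = 0 and beta2 = 0) is false and the F-statistic should be large.
rng = np.random.default_rng(0)
n, k = 150, 2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 0.0 * x2 + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

X_ur = np.column_stack([np.ones(n), x1, x2])   # unrestricted model
X_r = np.ones((n, 1))                          # restricted model: intercept only
q = 2                                          # number of restrictions
f_stat = ((ssr(X_r, y) - ssr(X_ur, y)) / q) / (ssr(X_ur, y) / (n - k - 1))
```

The resulting f_stat would then be compared against the F(q, n-k-1) critical value at the chosen significance level.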

Also Consider

  • The process for testing a nonlinear combination of your coefficients, for example testing if \(\beta_1\times\beta_2 = 1\) or \(\sqrt{\beta_1} = .5\), is generally different. See Nonlinear hypothesis tests .

Implementations

Linear hypothesis tests in R can be performed for most regression models using the linearHypothesis() function in the car package. See this guide for more information.

Tests of coefficients in Stata can generally be performed using the built-in test command.

Statology

Statistics Made Easy

7 Best YouTube Channels to Learn Statistics for Free

Statistics is one of the most challenging math topics to master. You can go about learning statistics from textbooks, blogs, courses, and more. But the most effective way, perhaps, is to learn from the right teachers who simplify even intimidating stats concepts and help you understand and apply what you’ve learned.

YouTube is one of the platforms where you can find many such educators who create super helpful content on statistics. We’ve compiled a list of YouTube channels with rock solid content on stats and math for data science. 

Most of these channels are run by educators who have an advanced degree in math and statistics and teach university-level stats courses. So you’re definitely in for a rewarding learning experience. Let’s begin.

1. StatQuest with Josh Starmer

Josh Starmer is a renowned educator who runs the YouTube channel StatQuest with Josh Starmer , known for making stats and data science concepts super accessible. 

So whether you want to learn statistics for data science, machine learning algorithms, or deep learning, this channel has got you covered. The videos on this channel are engaging and have easy-to-follow visuals and fun examples. The Statistics Fundamentals playlist on this channel has 60 videos covering the following essential statistics and probability concepts:

2. Dr Nic’s Maths and Stats

Dr Nic’s Maths and Stats is another great YouTube channel to learn math, statistics, and all things Excel. Dr Nic has decades of experience teaching math and stats, and the channel has a ton of helpful and engaging stats content.

From basic statistics to statistical inference, there are dedicated playlists on the channel for the following topics:

3. Zedstatistics

Zedstatistics is yet another good YouTube channel that teaches statistics. On this channel, you can find engaging stats lectures.

From the basics of descriptive statistics to slightly more involved topics in hypothesis testing and statistical inference, this channel has helpful content on essential stats concepts. Currently the channel has dedicated playlists for the following topics: 

4. Dr. Stats-A-Lot

Dr. Stats-A-Lot is a stats channel run by Mark Ledbetter, and has a ton of content on R programming and statistics. The lecture videos go into the topics in great detail.

So whether you’re interested in exploring statistics or you want to learn stats for school, you’ll probably find everything you need to learn. Currently, the channel has the following playlists:

5. MarinStatsLectures – R Programming & Statistics

MarinStatsLectures is a YouTube channel run by Prof. Mike Marin and has loads of content on statistics, R programming, and statistics with R. The lectures on this channel are part of courses taught at the University of British Columbia for Master’s and Ph.D. students in statistics. And you can learn them all for free on this channel.

For all the statistics concepts, there are R programming tutorial components, too. Here are some of the playlists on this channel:

6. Khan Academy

Khan Academy is known for its large suite of high-quality math classes. You’ve probably already used Khan Academy as a companion for your high school or college math. And we’ll look at what the Statistics playlist on the channel covers. 

If you want to get up and running with statistics fundamentals or want to brush up stats concepts ahead of interviews, you’ll find the statistics playlist helpful. The topics covered are as follows:

7. Brandon Foltz

Brandon Foltz ’s YouTube channel has high-quality math and statistics content. If you want to learn introductory statistics in depth, you can check out the stats playlists on this channel.

There are about 21 playlists covering essential stats topics such as:

Wrapping Up

And that’s a wrap. I hope you found this compilation of YouTube channels to learn stats helpful. Learning statistics is super useful if you ever want to get into the data field. If you’re looking for other resources (books and courses) to learn statistics, here are a couple of articles you’ll find helpful: 

Happy learning!


Bala Priya is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


  • Open access
  • Published: 24 May 2024

Rosace: a robust deep mutational scanning analysis framework employing position and mean-variance shrinkage

  • Jingyou Rao 1 ,
  • Ruiqi Xin 2   na1 ,
  • Christian Macdonald 3   na1 ,
  • Matthew K. Howard 3 , 4 , 5 ,
  • Gabriella O. Estevam 3 , 4 ,
  • Sook Wah Yee 3 ,
  • Mingsen Wang 6 ,
  • James S. Fraser 3 , 7 ,
  • Willow Coyote-Maestas 3 , 7 &
  • Harold Pimentel   ORCID: orcid.org/0000-0001-8556-2499 1 , 8 , 9  

Genome Biology volume 25, Article number: 138 (2024)


Deep mutational scanning (DMS) measures the effects of thousands of genetic variants in a protein simultaneously. The small sample size renders classical statistical methods ineffective. For example, p-values cannot be correctly calibrated when treating variants independently. We propose Rosace, a Bayesian framework for analyzing growth-based DMS data. Rosace leverages amino acid position information to increase power and control the false discovery rate by sharing information across parameters via shrinkage. We also developed Rosette for simulating the distributional properties of DMS. We show that Rosace is robust to the violation of model assumptions and is more powerful than existing tools.

Understanding how protein function is encoded at the residue level is a central challenge in modern protein science. Mutations can cause diseases and drive evolution through perturbing protein function in a myriad of ways, such as by altering its conformational ensemble and stability or its interaction with ligands and binding partners. In these contexts, mutations may result in a loss of function, gain of function, or a neutral phenotype (i.e., no discernable effects). Mutations also often exert effects across multiple phenotypes, and these perturbations can ultimately propagate to alter complex processes in cell biology and physiology. Reverse genetics approaches offer a powerful handle for researchers to investigate biology via introducing mutations and observing the resulting phenotypic changes.

Deep mutational scanning (DMS) is a technique for systematically determining the effect of a large library of mutations individually on a phenotype of interest by performing pooled assays and measuring the relative effects of each variant (Fig.  1 A) [ 1 , 2 , 3 ]. It has improved clinical variant interpretation [ 4 ] and provided insights into the biophysical modeling and mechanistic models of genetic variants [ 5 ]. Taking enzymes as an example, these phenotypes could include catalytic activity [ 6 ] or stability [ 7 , 8 ]. For a transcription factor, the phenotype could be DNA binding specificity or transcriptional activity [ 9 ]. The relevant phenotype for a membrane transporter might be folding and trafficking or substrate transport [ 10 ]. These phenotypes are often captured by growth-based [ 7 , 10 , 11 , 12 , 13 , 14 , 15 , 16 ], binding-based [ 9 , 17 , 18 ], or fluorescence-based assays [ 8 , 10 , 19 ]. Those experiments are inherently differently designed and merit separate analysis frameworks. In growth-based assays, the relative growth rates of cells are of interest. In a binding-based assay, the selection probabilities are of interest. In fluorescence-based assays, changes to the distribution of reporter gene expression are measured. In this paper, we focus solely on growth-based screens.

Figure 1: Deep mutational scanning and overview of Rosace framework. A Each amino acid of the selected protein sequence is mutated to another mutant in deep mutational scanning. B Cells carrying different variants are grown in the same pool under selection pressure. At each time point, cells are sequenced to output the count table. Replications can be produced either pre-transfection or post-transfection. C Rosace is an R package that accepts input from the raw sequencing count table and outputs the posterior distribution of functional score

In a growth-based DMS experiment, we grow a pool of cells carrying different variants under a selective pressure linked to gene function. At set intervals, we sequence the cells to identify each variant’s frequency in the pool. The change in the frequency over the course of the experiment, from initial frequencies to subsequent measurements, serves as a metric of the variant’s functional effects (Fig.  1 B). The functional score is often computed for each variant in the DMS screen and compared against those of synonymous mutations or wild-type cells to display the relative functional change of the protein caused by the mutation. Thus, reliable inference of functional scores is crucial to understanding both individual mutations and at which residue location variants tend to have significant functional effects.
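A naive version of this functional score (not the Rosace estimator itself, which is described later) can be computed as the per-variant slope of log frequency against time by simple least squares; a minimal sketch, with hypothetical counts:

```python
import numpy as np

def naive_scores(counts, times):
    """Naive growth-based functional scores: per-variant OLS slope of
    log frequency over time.

    counts: array of shape (variants, timepoints) with raw sequencing counts
    times:  sequence of timepoint labels (e.g., rounds of selection)
    """
    freq = counts / counts.sum(axis=0, keepdims=True)   # frequency at each timepoint
    logf = np.log(freq)                                 # log frequencies
    t = np.asarray(times, dtype=float)
    t_c = t - t.mean()                                  # centered time
    # OLS slope for each variant: sum(t_c * logf) / sum(t_c^2)
    return (logf * t_c).sum(axis=1) / (t_c ** 2).sum()
```

For example, a variant whose counts double every round while a second variant stays flat receives a positive score, and the flat variant (whose relative frequency shrinks) a negative one.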

The main challenge of functional score inference is that even under the simplest model, there are at least two estimators required for each mutation (mean and variance of functional change), and in practice, it is rare to have more than three replicates. As a result, it has been posited that under naïve estimators that have been commonly employed, there are likely issues with the false discovery rate and the statistical power of detecting mutations that significantly change the function of the protein [ 20 ]. Regardless, incorporating domain-specific assumptions is required to make inference tractable with few samples and thousands of parameters.

To alleviate the small-sample-size inference problem in DMS, four commonly used methods have been developed: dms_tools [ 21 ], Enrich2 [ 18 ], DiMSum [ 20 ], and EMPIRIC [ 22 ]. dms_tools uses Bayesian inference for reliable inference. However, rather than giving a score to each variant, dms_tools generates a score for each amino acid at each position, assuming linear addition of multiple mutation effects and ignoring epistasis coupling. Thus, dms_tools is not directly comparable to other methods and is excluded from our benchmarking analysis. Enrich2 simplifies the variance estimator by assuming that counts are Poisson-distributed (the variance being equal to the mean) and combines the replicates using a random-effect model. DiMSum , however, argues that the assumption in Enrich2 is not enough to control type-I error. As a result, DiMSum builds upon Enrich2 and includes additional variance terms to model the over-dispersion of sequencing counts. However, as presented in Faure et al. 2020 [ 20 ], this ratio-based method only applies to DMS screens with one round of selection, while many DMS screens have more than two rounds of selection (i.e., sampling at multiple time points) [ 10 , 11 , 23 ]. Alternatively, EMPIRIC fits a Bayesian model that infers each variant separately with non-informative uniform priors on all parameters and thus does not shrink the estimates to robustly correct the variance in estimates due to the small sample size. Further, the model does not accommodate multiple replicates. In addition, mutscan [ 24 ], a recently developed R package for DMS analysis, employs two established statistical models, edgeR and limma-voom . However, these two methods were originally designed for RNA-seq data, and the data generation process for DMS is very different. One of the key differences is consistency among replicates. In RNA-seq, gene expression is relatively consistent across replicates under the same condition, while in DMS, counts of variants can vary considerably, since the a priori representation in the initial variant library can be vastly inconsistent among replicates.

While these methods provide reasonable regularization of the score’s variance, additional information can further improve the prior. One solution is incorporating residue position information. It has been noted that amino acids in particular regions have an oversized effect on the protein’s function, and other frameworks have incorporated positions for various purposes. In the form of hidden Markov models (HMMs) and position-specific scoring matrices (PSSMs), this is the basis for the sensitive detection of homology in protein sequences [ 25 ]. These results directly imply that variants at the same position likely share some similarities in their behavior and thus that incorporating local information into modeling might produce more robust inferences. However, no existing methods have incorporated residue position information into their models yet.

To overcome these limitations, we present Rosace , the first growth-based DMS method that incorporates local positional information to increase inference performance. Rosace implements a hierarchical model that parameterizes each variant’s effect as a function of the positional effect, thus providing a way to incorporate both position-specific information and shrinkage into the model. Additionally, we developed Rosette , a simulation framework that attempts to simulate several properties of DMS such as bimodality, similarities in behavior across similar substitutions, and the overdispersion of counts. Compared to previous simulation frameworks such as the one in Enrich2 , Rosette uses parameters directly inferred from the specific input experiment and generates counts that reflect the true level of noise in the real experiment. We use Rosette to simulate several screening modalities and show that our inference method, Rosace , exhibits higher power and controls the false discovery rate (FDR) better on average than existing methods. Importantly, Rosace and Rosette are not two views of the same model— Rosette is based on a set of assumptions that are different from or even opposite to those of Rosace . Rosace ’s ability to accommodate data generated under different assumptions shows its robustness. Finally, we run Rosace on real datasets and it shows a much lower FDR than existing methods while maintaining similar power on experimentally validated positive controls.

Overview of Rosace  framework

Rosace is a Bayesian framework for analyzing growth-based deep mutational scanning data, producing variant-level estimates from sequencing counts. The full (position-aware) method requires as input the raw sequencing counts and the position labels of variants. It outputs the posterior distribution of variants’ functional scores, which can be further evaluated to conduct hypothesis testing, plotting, and other downstream analyses (Fig.  1 C). If the position label is hard to acquire with heuristics, for example, in the case of random multiple-mutation data, position-unaware Rosace model can be run without position label input. Rosace is available as an R package. To generate the input of Rosace from sequencing reads, we share a Snakemake workflow dubbed Dumpling for short-read-based experiments in the GitHub repository described in the “ Methods ” section. Additionally, Rosace supports input count data processed from Enrich2 [ 18 ] for other protocols such as barcoded sequencing libraries.

Rosace  hierarchical model with positional information and score shrinkage

Here, we begin by motivating the use of positional information. Next, we describe the intuition of how we use the positional information. Finally, we describe the remaining dimensions of shrinkage which assist in robust estimates with few experiment replicates.

A variant is herein defined as the amino acid identity at a position in a protein, where that identity may differ from the wild-type sequence. In this context, synonymous, missense, nonsense, and indel variants are all considered and can be processed by Rosace (see the “ Methods ” section for details). The sequence position of a variant p ( v ) provides information on the functional effects to the protein from the variant. We define the position-level functional score \(\phi _{p(v)}\) as the mean functional score of all variants on a given position.

To motivate the use of positional information, we take the posterior distribution of the position-level functional score estimated from a real DMS experiment, a cytotoxicity-based growth screen of a human transporter, OCT1 (Fig.  2 A). In this experiment, variants with decreased activity are expected to increase in abundance, as they lose the ability to import a cytotoxic substrate during selection, and variants with increased activity will decrease in abundance similarly. We observe that most position-level score estimates \(\widehat{\phi }_{p(v)}\) significantly deviate from the mean, implying that position has material idiosyncratic variation and thus carries information about the protein’s functional architecture.

Figure 2: Rosace shares information at the same position to inform variant effects. A Smoothed position-specific score (sliding window = 5) across positions from OCT1 cytotoxicity screen. Red dotted lines at score = 0 (neutral position). B A conceptual view of the Rosace generative model. Each position has an overall effect, from which variant effects are conferred. Note the prior is wide enough to allow effects that do not follow the mean. Wild-type score distribution is assumed to be at 0. C Plate model representation of Rosace . See the “ Methods ” section for the description of parameters

To incorporate the positional information into our model, we introduce a position-specific score \(\phi _{p(v)}\) where p ( v ) maps variant v to its amino acid position. The variant-specific score \(\beta _v\) is regularized and controlled by the value of \(\phi _{p(v)}\) . To illustrate the point, we conceptually categorize position into three types: positively selected ( \(\phi _{p(v)} \gg 0\) ), (nearly) neutral ( \(\phi _{p(v)} \approx 0\) ), and negatively selected ( \(\phi _{p(v)} \ll 0\) ) (Fig.  2 B). Variants in a positively selected position tend to have scores centered around the positive mean estimate of \(\phi _{p(v)}\) , and vice versa for the negatively selected position. Variants in a neutral position tend to be statistically non-significant as the region might not be important to the measured phenotype.

Regularization of the score’s variance is achieved mainly by sharing information across variants within the position and asserting weakly informative priors on the parameters (Fig.  2 C). Functional scores of the variants within the position are drawn from the same set of parameters \(\phi _{p(v)}\) and \(\sigma _{p(v)}\) . The error term \(\epsilon _{g(v)}\) in the linear regression on normalized counts is also shared in the mean count group (see the “ Methods ” section) to prevent biased estimation of the error and incorporate mean-variance relationship commonly modeled in RNA-seq [ 26 , 27 ]. Importantly, while we use the position information to center the prior, the prior is weak enough to allow variants at a position to deviate from the mean. For example, we show that the nonsense variants indeed deviate from the positional mean (Additional file 1: Fig. S3). The variant-level intercept \(b_v\) is given a strong prior with a tight distribution centered at 0 to prevent over-fitting.
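Conceptually, this position-based pooling can be illustrated by simulating from a simplified version of the hierarchy, where each variant score is drawn around its position-level score. This is an illustrative sketch only, not the authors' implementation; the sizes and prior scales below are assumptions:

```python
import numpy as np

# Simplified position-hierarchical generative sketch:
# phi_p is the position-level score; each variant score beta_v is drawn
# around the phi of its position. Sizes/scales are illustrative assumptions.
rng = np.random.default_rng(0)
P, V_per = 10, 19                                  # positions, variants per position
phi = rng.normal(0.0, 1.0, size=P)                 # position-level scores
sigma = np.abs(rng.normal(0.0, 0.5, size=P))       # position-level spread
beta = rng.normal(phi.repeat(V_per), sigma.repeat(V_per))  # variant-level scores
pos_means = beta.reshape(P, V_per).mean(axis=1)    # empirical per-position means
```

In such a simulation the empirical position means track the underlying phi values closely, while individual variants can still deviate from their positional mean, mirroring the behavior described above.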

Rosace  performance on various datasets

To test the performance of Rosace , we ran Rosace along with Enrich2 , mutscan (both limma-voom and edgeR ), DiMSum , and simple linear regression (the naïve method) on the OCT1 cytotoxicity screen. DiMSum cannot analyze data with three selection rounds, so we ran DiMSum with only the first two time points. The data is pre-processed with wild-type normalization for all methods. The analysis is done on all subsets of the three replicates ( \(\{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \{1,2,3\}\) ).

While we do not have a set of true negative control variants, we assume most synonymous mutations would not change the phenotype, and thus, we use synonymous mutations as a proxy for negative controls. We compute the percentage of significant synonymous mutations called by the hypothesis testing as one representation of the false discovery rate (FDR). The variants are ranked based on the hypothesis testing statistics from the method ( p -value for frequentist methods and local false sign rate [ 28 ], or lfsr , for Bayesian methods). In an ideal scenario with no noise, the line of ranked variants by FDR is flat at 0 and slowly rises after all true variants with effect are called. Rosace has a very flat segment among the top 25% of the ranked variants compared to DiMSum , Enrich2 , and the naïve method and keeps the FDR lower than mutscan(limma) and mutscan(edgeR) until the end (Fig.  3 A). Importantly, we note that the Rosace curve moves only slightly from 1 replicate to 3 replicates, while the other methods shift more, implying that the change in the number of synonymous mutations called is minor for Rosace , despite having fewer replicates (Fig.  3 A).

Figure 3: False discovery rate and sensitivity on OCT1 cytotoxicity data. A Percent of synonymous mutations called (false discovery rate) versus ranked variants by hypothesis testing. The left panel is from taking the mean of analysis of the three individual replicates. Ideally, the line would be flat at 0 until all the variants with true effects are discovered. B Number of validated variants called (in total 10) versus number of replicates. If only 1 or 2 replicates are used, we iterate through all possible combinations. For example, the three points for Rosace on 2 replicates use Replicate \(\{1, 2\}\) , \(\{1, 3\}\) , and \(\{2, 3\}\) respectively. (DiMSum can only process two time points, and thus is disadvantaged in experiments such as OCT1)

While lower FDR may result in lower power in the method, we show that Rosace is consistently powerful in detecting the OCT1-positive control variants. Yee et al. [ 10 ] conducted lower-throughput radioligand uptake experiments in HEK293T cells and validated 10 variants that have a loss-of-function or gain-of-function phenotype. We use the number of validated variants to approximate the power of the method. As shown in Fig.  3 B, Rosace has comparable power to Enrich2 , mutscan(limma) , and mutscan(edgeR) regardless of the number of replicates, while the naïve method is unable to detect anything in the case of one replicate. Rosace calls significantly fewer synonymous mutations than every other method while maintaining high power, showing that Rosace is robust in real data.

In OCT1, loss of function leads to enrichment rather than depletion, which is relatively uncommon. To complement findings on OCT1, we conducted a similar analysis on the kinase MET data [ 11 ] (3 replicates, 3 selection rounds), whose loss of function leads to depletion. Applied to this dataset, Rosace and its position-unaware version have comparable power to Enrich2 , mutscan(limma) , and mutscan(edgeR) with any number of replicates used, and the naïve method remains less powerful than other methods, especially with one replicate only. Consistent with OCT1, Rosace again calls fewer synonymous mutations and better controls the false discovery rate. The results are visualized in the Supplementary Figures (Additional file 1: Figs. S12-15).

To test Rosace performance on diverse datasets, we also ran all methods on the CARD11 data [ 14 ] (5 replicates, 1 selection round), the MSH2 data [ 12 ] (3 replicates, 1 selection round), the BRCA1 data [ 13 ] (2 replicates, 2 selection rounds), and the BRCA1-RING data [ 23 ] (6 replicates, 5 selection rounds) (Table S1). In addition to those human protein datasets, we also applied Rosace to a bacterial protein, Cohesin [ 29 ] (1 replicate, 1 selection round) (Table S1). We use the pathogenic and benign variants in ClinVar [ 30 ], EVE [ 31 ], and AlphaMissense [ 32 ] to provide proxies for positive and negative control variants. Rosace consistently shows high sensitivity in detecting the positive control variants across these datasets while controlling the false discovery rate (Additional file 1: Figs. S5-S11). Noting that the number of clinically verified variants is limited and those identified in the prediction models usually have extreme effects, we do not observe a large difference between the methods’ performance.

To alleviate a potential concern that the position-level shrinkage given by Rosace is too large, we plot the functional scores calculated by Rosace against those by Enrich2 across several DMS datasets (Additional file 1: Figs. S2-4). We find that the synonymous variants’ functional scores are similar in magnitude to those of other variants, so synonymous variants are not shrunken too strongly to zero. We also find that stop codon and indel variants have consistently significant effect scores, implying that position-level shrinkage is not so strong that those variants’ effects are neutralized. This result implies that the position prior benefits the model mainly through a more stable standard error estimate enabling improved prioritization as a function of local false sign rate or other posterior ranking criteria that are a function of the variance.

Rosette : DMS data simulation that matches marginal distributions from real DMS data

To further benchmark the performance of Rosace and related methods, we propose a new simulation framework called Rosette , which generates DMS data from parameters inferred directly from a real experiment and can thus mimic the overall structure of most growth-based DMS screens (Fig.  4 A).

figure 4

Rosette simulation framework preserves the overall structure of growth-based DMS screens. The plots show the result of using OCT1 data as input. A Rosette generates summary statistics from real data and simulates the sequencing count. B Generative model for Rosette simulation. C The distribution of real and predicted functional scores is similar. D , E Five summary statistics are needed for Rosette

Intuitively, if we construct a simulation that closely follows the assumptions of our model, our model should have outstanding performance. To facilitate a fair comparison with other methods, the simulation presented here is not aligned with the assumptions made in Rosace . In fact, the central assumption that variant position carries information is violated by construction to showcase the robustness of Rosace .

To clarify the terminology used throughout this paper, “mutant” refers to the substitution, insertion, or deletion of amino acids. A position-mutant pair is considered a variant. Mutants are categorized into mutant groups by hierarchical clustering or by predefined criteria (our model uses the former, which is expected to align with the biophysical properties of amino acids). Variants are grouped in two ways: (1) by their functional change to the protein, namely neutral, loss-of-function (LOF), or gain-of-function (GOF), referred to as “variant groups,” and (2) by the mean of the raw sequencing counts across replicates, referred to as “variant mean groups.”

Rosette calculates two summary statistics from the raw sequencing counts (dispersion of the sequencing count \(\eta\) and dispersion of the variant library \(\eta _0\) ) (Fig.  4 D) and three others from the score estimates (the proportion of each mutant group \(\varvec{p}\) , the functional score’s distribution of each variant group \(\varvec{\theta }\) , and the weight of each variant group \(\varvec{\alpha }\) ) (Fig.  4 E). Since we are only learning the distribution of the scores instead of the functional characteristics of individual variants, the score estimates can be naïve (e.g., simple linear regression) or more complicated (e.g.,  Rosace ).

The dispersion of the sequencing counts \(\eta\) measures how much variability in variant representation arises over the entire experimental procedure, during both cell culture and sequencing. As \(\eta\) goes to infinity, the sequencing count converges to the expected true cell count (no over-dispersion); a small \(\eta\) indicates over-dispersed sequencing counts. In an ideal experiment with no over-dispersion, the proportion of synonymous mutations should be invariant over time because they cause no functional change. However, in the real data we observe large variability in the proportions of synonymous mutations across selection rounds, which we attribute to over-dispersion and which cannot be explained by the simple multinomial distribution used in existing simulation frameworks (Additional file 1: Fig. S1). Indeed, all methods, including the naïve method, achieve near-perfect performance in the Enrich2 simulations, with correlation scores greater than 0.99 (Additional file 1: Fig. S27). Therefore, we model the sequencing step with a Dirichlet-Multinomial distribution that includes \(\eta\) as the dispersion parameter.
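The over-dispersion behavior of a Dirichlet-Multinomial draw can be sketched in a few lines (the variant count, read depth, and \(\eta\) values below are illustrative, not taken from the datasets):

```python
import numpy as np

def dirichlet_multinomial(n_reads, proportions, eta, rng):
    # Draw pool proportions from Dirichlet(eta * proportions), then
    # sequence n_reads from a multinomial over those noisy proportions.
    noisy = rng.dirichlet(eta * proportions)
    return rng.multinomial(n_reads, noisy)

rng = np.random.default_rng(0)
p = np.full(100, 1 / 100)  # 100 variants with equal expected share (illustrative)

# Small eta -> strong over-dispersion; very large eta -> essentially multinomial.
small_eta = np.array([dirichlet_multinomial(100_000, p, 50.0, rng) for _ in range(200)])
large_eta = np.array([dirichlet_multinomial(100_000, p, 1e6, rng) for _ in range(200)])
```

Comparing the per-variant count variances of `small_eta` and `large_eta` shows the extra variability that a plain multinomial cannot produce.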

The dispersion of variant library \(\eta _0\) measures how much variability already exists in variant representation before the cell selection. Theoretically, each variant would have around the same number of cells at the initial time point. However, due to the imbalance during the variant library generation process and the cell culture of the initial population that might already be under selection, we sometimes see a wide dispersion of counts across variants. To estimate this dispersion, we fit a Dirichlet-Multinomial distribution under the assumption that the variants in the cell pool at the initial time point should have equal proportions.

The distribution and the structure of the underlying true functional scores across variants are controlled by the remaining summary statistics. We make a few assumptions here. First, the functional score distribution of mutants differs across positions (rows in the heatmap, Fig.  4 A), but within a mutant group, the mutants are independent and identically distributed (or exchangeable). We estimate the mutant groups by hierarchical clustering with distance defined by the empirical Jensen-Shannon divergence and record their proportions \(\hat{\varvec{p}}\) . Second, each variant belongs to either the neutral hypothesis (score close to 0, similar to synonymous mutations) or the alternative hypothesis (away from 0, different from synonymous mutations). The number of variant groups can be 1 to 3 (neutral, GOF, and LOF), based on the number of modes in the marginal functional score distribution, and the variants within a variant group are exchangeable. We estimate the boundaries between variant groups by Gaussian mixture clustering and fit the distribution parameters \(\hat{\varvec{\theta }}\) . Finally, we assume that the positions are independent. While this is a simplifying assumption, modeling the relationship between positions would require additional assumptions about the functional regions of the protein. We therefore treat the positions as exchangeable and model the proportion of variant group identities (neutral, GOF, LOF) in each mutant group by a Dirichlet distribution with parameter \(\hat{\varvec{\alpha }}\) .

To simulate the sequencing count from the summary statistics, we use a generative model that mimics the experiment process and is completely different from the Rosace inference model for fair benchmarking. We first draw the functional score of each variant \(\beta _v\) from the structure described in the summary statistics and the ones in the neutral group are set to be 0. Then, we map the functional score to its latent functional parameters: the cell growth rate in the growth screen. Next, we generate the cell count at a particular time point \(N_{v,t,r}\) by the cell count at the previous time point \(N_{v,t-1,r}\) and the latent functional parameters. Finally, the sequencing count is generated from a Dirichlet-Multinomial distribution with the summarized dispersion parameter and the cell count.

The simulation results show that the simulated functional score distribution is comparable to the real experimental data (Fig.  4 C). We also demonstrate that the simulation is not particularly favorable to models containing positional information such as Rosace . From Fig.  4 E, we observe that in the simulation, the position-level scores are not as widely spread as in the real data. In addition, the positions with extreme scores (very positive scores in the OCT1 dataset) have reduced standard deviation in the real data, but not in the simulation (Additional file 1: Figs. S18d, S19d, S20d). As a result, we would expect the performance of Rosace to be better on real data than in the simulation.

Testing Rosace  false discovery control with Rosette  simulation

To test the performance of Rosace , we generate simulated data using Rosette from two distinctive growth-based assays: the transporter OCT1 data, where LOF variants are positively selected [ 10 ], and the kinase MET data, where LOF variants are negatively selected [ 11 ]. We further include the results for a saturation genome editing dataset, CARD11 [ 14 ], in Additional file 1: Figs. S17-23. The OCT1 DMS screen measures the impact of variants on the uptake of the cytotoxic drug SM73 mediated by the transporter OCT1. If a mutation decreases the transporter's activity, the cells in the pool import less substrate and thus die more slowly than wild-type cells or those with synonymous mutations, so the LOF variants are positively selected. In the MET DMS screen, the kinase drives proliferation and cell growth in the BA/F3 mammalian cell line upon IL-3 (interleukin-3) withdrawal. If the variant protein fails to function, the cells die faster than the wild-type cells, so the LOF variants are negatively selected. Both datasets have a clear separation of two modes in the functional score distribution (neutral and LOF) (Additional file 1: Figs. S18a, S19a). We benchmark Rosace against Enrich2 , mutscan(edgeR) , mutscan(limma) , and the naïve method in scenarios using 1 or all 3 replicates and 1 or all 3 selection rounds. DiMSum is benchmarked only when there is one round of selection because it is not designed to handle multiple rounds. Each scenario is repeated 10 times. All methods show similar correlations with the latent growth rates (Additional file 1: Fig. S21), and thus, for benchmarking purposes, we focus on hypothesis testing.

We compare methods from a variant-ranking point of view, counting the number of false discoveries for any given number of variants called LOF. This is because Rosace is a Bayesian framework that uses lfsr instead of p -values as the metric for variant selection, and it is hard to translate lfsr to FDR for a hard threshold. Variants are ranked by adjusted p -values or lfsr (ascending). Methods that perform well rank the truly LOF variants in the simulation ahead of non-LOF variants. In an ideal scenario with no noise, the curve of FDR against rank would stay flat at 0 and rise only after all LOF variants are called. The results in Fig.  5 show that even though the position assumption is violated in the Rosette simulation, Rosace is robust enough to maintain a relatively low FDR in all simulation conditions.
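The ranked false discovery curve described above can be computed with a short helper (a sketch; `fdr_at_rank` and its inputs are hypothetical names, not part of the Rosace package):

```python
import numpy as np

def fdr_at_rank(stat, is_true_lof):
    """stat: adjusted p-values or lfsr (smaller = more significant);
    is_true_lof: boolean array of ground-truth LOF labels."""
    order = np.argsort(stat)                  # most significant first
    neutral = ~np.asarray(is_true_lof)[order]
    ranks = np.arange(1, len(order) + 1)
    # FDR at rank k = share of neutral variants among the top-k calls
    return np.cumsum(neutral) / ranks
```

A flat stretch of zeros at the start of the returned curve corresponds to the ideal behavior described in the text.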

figure 5

Benchmark of false discovery control on Rosette simulation. Variants are ranked by hypothesis testing (adjusted p-values or lfsr ). The false discovery rate at each rank is computed as the proportion of neutral variants when all variants up to the rank cutoff are called significant. R is the number of replicates and T is the number of selection rounds. MET data is used for negative selection and OCT1 data for positive selection. Ideally, the line would be flat at 0 until the rank where all variants with true effects are discovered. (DiMSum can only process two time points and thus is disadvantaged in experiments with more than two time points, or one selection round)

Testing Rosace  power with Rosette  simulation

Next, we investigate the sensitivity of the benchmarked methods at different FDR or lfsr cutoffs. It is important to keep in mind that Rosace uses the raw lfsr from the sampling result while all other methods use the Benjamini-Hochberg procedure to control the false discovery rate. As a result, the cutoff for Rosace is on a different scale.

Rosace is the only method that displays high sensitivity in all conditions with a low false discovery rate. In the case of one selection round and three replicates ( \(T = 1\) and \(R = 3\) ), mutscan(edgeR) and mutscan(limma) do not have the power to detect any significant variants with the FDR threshold at 0.1. The same scenario occurs with DiMSum at negative selection and the naïve method at \(T = 3\) and \(R = 1\) (Fig.  6 ). The naïve method in general has very low power, while Enrich2 has a very inflated FDR.

figure 6

Benchmark of sensitivity versus FDR. The upper row is simulated from a modified version of Rosette simulation to favor position-informed models. The bottom row is the results from standard Rosette . Circles, triangles, squares, and crosses represent LOF variant selection at adjusted p-values or lfsr of 0.001, 0.01, 0.05, and 0.10, respectively. Variants with the opposite sign of selection are then excluded. Ideally, for all methods besides Rosace , each symbol would lie directly above the corresponding symbol on the x-axis indicating true FDR. For Rosace , lfsr has no direct translation to FDR so the cutoff represented by the shape is theoretically on a different scale. (DiMSum can only process two time points, and thus is disadvantaged in experiments with more than two time points, or one selection round)

We benchmark Rosace on both the standard Rosette simulation, which inherently violates the position assumption, and a modified version of Rosette that favors the position-informed model. We show that model misspecification does increase the false discovery rate of Rosace , but Rosace is robust enough to outperform all other methods (except for DiMSum with \(T = 1\) and \(R = 3\) and positive selection) even when the position assumption is strongly violated (Fig.  6 ).

One of Rosace ’s contributions is accounting for positional information in DMS analysis. The model assumes the prior information that variants on the same position have similar functional effects, resulting in higher sensitivity and better FDR. Furthermore, Rosace is also capable of incorporating other types of prior information on the similarity of variants.

Despite the value of positional information in statistical inference as demonstrated in this paper, it is unclear how multiple random mutations should be position-labeled. In this case, simple position heuristics are often unsatisfying, and one might argue that a position scalar should not cluster the variants in random mutagenesis experiments with large-scale in-frame insertion and deletion, such as those on viruses. These types of experiments are not the focus of this paper, but are still very important and require careful future research.

Another critique of Rosace concerns the extent of bias we introduce into the score inference through the position prior. While it is certainly possible to introduce a large bias, Rosace was developed to be a robust model that delivers near-unbiased inference even when its assumptions are not precisely met or are outright violated. We demonstrate the robustness of Rosace through our data simulation framework, Rosette . The generative procedures of Rosette explicitly violate the prior assumptions made by Rosace , but even on Rosette ’s data, Rosace learns important information. We also show on real data that the position-level shrinkage is not strong, further demonstrating the robustness of Rosace .

The development of DMS simulation frameworks such as Rosette can also drive experimental design. For example, to select the best number of time points and replicates with regard to the trade-off between statistical robustness and costs of the experiment, an experimentalist can conduct a pilot experiment and use its data to infer summary statistics through Rosette . Rosette will then generate simulations close to a real experiment. Experimentalists can find the optimal tool for data analysis given an experimental design by applying candidate tools to the simulation data. Similarly, given a data analysis framework, experimentalists can choose from multiple experiment designs by using Rosace to simulate all those experiments and observe if any designs have enough power to detect most of the LOF or GOF variants with a low false discovery rate.

This paper applies our tool only to growth screens, one of several functional phenotyping methods enabled by DMS techniques. Another possibility is the binding experiment, where a portion of cells is selected at each time point. In this case, the expectation of the functional score computed by Rosace is a log transformation of the variant’s selection proportion [ 18 ], and one could potentially use Rosace for DMS analysis as in Enrich2 . A third method is fluorescence-activated cell sorting (FACS-seq): a branch of the literature uses binned FACS-seq screens to sort variant libraries based on protein phenotypes. Since the experiment has multiple bins, one can potentially capture distributional changes in molecular properties beyond mean shifts [ 8 , 10 , 19 , 33 ]. Although of different design, FACS-seq-based screens can also be analyzed using a framework similar to Rosace . Building such frameworks incorporating prior information for experiments beyond growth screens enables the community to exploit a wider range of experimental data.

As the function of a protein is rarely one-dimensional, one can measure multiple phenotypes of a variant in a set of experiments [ 10 , 16 , 34 ]. For example, the OCT1 data mentioned earlier [ 10 ] measures both the transporter surface expression from a FACS-seq screen and drug cytotoxicity with a growth screen. Multi-phenotype DMS experiments also call for analysis frameworks to accommodate multidimensional outcomes by modeling the interaction or the correlation of phenotypes of each variant. One successful attempt models the causal biophysical mechanism of protein folding and binding [ 35 ], and there are many more protein properties other than those two. A unifying framework for the multi-phenotype analysis remains unsolved and challenging. One needs to account for different experimental designs to directly compare scores between phenotypes, and carefully select inferred features most relevant to the scientific questions, requiring both efforts from the experimental and computational side. Nevertheless, we believe that the multi-phenotype analysis will eventually guide us to develop better mechanistic or probabilistic models for how mutations drive proteins in evolution, how they lead to malfunction and diseases, and how to better engineer new proteins.

Conclusions

We present Rosace , a Bayesian framework for analyzing growth-based deep mutational scanning data. In addition, we develop Rosette , a simulation framework that recapitulates the properties of actual DMS experiments, but relies on an orthogonal data generation process from Rosace . From both simulation and real data analysis, we show that Rosace has better FDR control and higher sensitivity compared to existing methods and that it provides reliable estimates for downstream analyses.

Pipeline: raw read to sequencing count

To facilitate the broader adoption of the Rosace framework for DMS experiments, we have developed a sequencing pipeline for short-read-based experiments using Snakemake, which we dub Dumpling [ 36 ]. This pipeline handles directly sequenced single-variant libraries containing synonymous, missense, nonsense, and multi-length indel mutations, going from raw reads to final scores and quality control metrics. Raw sequencing data in the form of fastq files is first obtained as demultiplexed paired-end files. The user then defines the experimental architecture in a csv file specifying the conditions, replicates, and time points corresponding to each file, which is parsed along with a configuration file. The reads are filtered for quality and contaminants using BBDuk, and the paired reads are then error-corrected using BBMerge. The cleaned reads are mapped onto the reference sequence using BBMap [ 37 ]. Variants in the resulting SAM file are called and counted using the AnalyzeSaturationMutagenesis tool in GATK v4 [ 38 ], which provides a direct count of the number of times each distinct genotype is detected in the experiment. We generate various QC metrics throughout the process and combine them using MultiQC for an easy-to-read final overview [ 39 ].

Due to the degeneracy of indel alignments, the genotyping of codon-level deletions sometimes does not hew to the reading frame because of leftwise alignment. Additionally, due to errors in oligo synthesis, assembly, in vivo passaging, or sequencing, some genotypes that were not designed as part of the library may be introduced. A fundamental assumption of DMS is the independence of individual variants, so to reduce noise and eliminate error, our pipeline removes variants that were not part of the planned design and renames variants to be consistent at the amino acid level before exporting the variant counts in a format suitable for Rosace .

Pre-processing of sequencing count

In a growth DMS screen with V variants, we define v to be the variant index. A function p ( v ) maps the variant v to its position label. T indicates the number of selection rounds and index t is an integer ranging from 0 to T . A total of R replicates are measured, with r as the replicate index. We denote \(c_{v,t,r}\) the raw sequencing count of cells with variant v at time point t in replicate r .

In addition, “mutant” refers to substitution with one of the 20 amino acids, insertion of an amino acid, or deletion. Thus, a variant is uniquely identified by its mutant and the position where the mutant occurs ( p ( v )).

The default pre-processing pipeline of Rosace includes four steps: variant filtering, count imputation, count normalization, and replicate integration. First, variants with more than 50% missing count data are filtered out in each replicate. Then, variants with fewer missing counts (less than 50%) are imputed using K-nearest-neighbor averaging ( K = 10) or filled with 0. Next, imputed raw counts are log-transformed with an added pseudo-count of 1/2 and normalized by the wild-type cells or by the sum of sequencing counts of synonymous mutations. This step, proposed in Enrich2 , makes the computed functional score of wild-type cells approximately 0. Additionally, the counts for each variant before selection are aligned to 0 to simplify the prior specification of the intercept.
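A minimal sketch of the normalization and alignment steps for one replicate (filtering and KNN imputation omitted for brevity; `normalize_counts` is a hypothetical helper, not the Rosace implementation):

```python
import numpy as np

def normalize_counts(counts, syn_mask, pseudo=0.5):
    """counts: (V, T+1) imputed raw counts for one replicate;
    syn_mask: boolean mask marking synonymous variants."""
    logc = np.log(counts + pseudo)
    # Normalize by the summed synonymous counts so wild-type scores sit near 0
    syn_ref = np.log(counts[syn_mask].sum(axis=0) + pseudo)
    norm = logc - syn_ref
    # Align each variant's pre-selection (t = 0) value to 0
    return norm - norm[:, [0]]
```

After this transformation, a variant that grows at the same rate as the synonymous pool stays near 0 across time points.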

Previous papers suggest other methods, such as total-count normalization, when the wild-type is incorrectly estimated or subject to high levels of error [ 18 , 20 ]; we include this in Rosace as an option. Finally, replicates in the same experiment are joined together as input to the hierarchical model. If a variant drops out in some but not all replicates, Rosace imputes the missing replicate data with the mean of the other replicates.

Rosace : hierarchical model and functional score inference

Rosace assumes that the aligned counts are generated by a time-dependent linear function. Let \(\beta _v\) be the functional score (the slope), \(b_v\) the intercept, and \(\epsilon _{g(v)}\) the error term. The core of Rosace is the linear regression

$$y_{v,t} = b_v + \beta _v \, t + \epsilon _{g(v)},$$

where \(y_{v,t}\) is the pre-processed count of variant v at time point t and g ( v ) maps the variant v to its mean group—the grouping method is explained below.

p ( v ) is the function that maps a variant v to its amino acid position. If the information of variants’ mutation types is given, Rosace will assign synonymous variants to many artificial “control” positions. The number of synonymous variants per control position is determined by the maximum number of non-synonymous variants per position. Assigning synonymous variants to control positions incorporates the extra information while not giving too strong a shrinkage to synonymous variants (Additional file 1: Figs. S2-S4). In addition, we regroup positions with fewer than 10 variants together to avoid having too few variants in a position. For example, if the DMS screen has fewer than 10 mutants per position, adjacent positions will be grouped to form one position label. Also, the position of a continuous indel variant is labeled as a mutation of the leftmost amino acid residue (e.g., an insertion between positions 99 and 100 is labeled as position 99 and a deletion of positions 100 through 110 is labeled as position 100).
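The regrouping of sparse positions can be sketched as a greedy merge of adjacent positions until each label holds at least 10 variants (a simplification; `merge_sparse_positions` is hypothetical and omits the synonymous control-position logic):

```python
from collections import Counter

def merge_sparse_positions(variant_positions, min_per_label=10):
    """variant_positions: one integer position per variant.
    Returns a dict mapping each position to its merged position label."""
    counts = Counter(variant_positions)
    mapping, label, size, last_label = {}, None, 0, None
    for pos in sorted(counts):
        if label is None:
            label, size = pos, 0     # start a new merged label
        mapping[pos] = label
        size += counts[pos]
        if size >= min_per_label:    # label is full; close it
            last_label, label = label, None
    if label is not None and last_label is not None:
        # fold a leftover tail into the previous complete label
        for pos, lab in mapping.items():
            if lab == label:
                mapping[pos] = last_label
    return mapping
```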

We assume that the variants at the same position are more likely to share similar functional effects. Thus, we build the layer above \(\beta _v\) using position-level parameters \(\phi _{p(v)}\) and \(\sigma _{p(v)}\) .

The mean and precision parameters are given weakly informative normal priors, and the variance parameters are given weakly informative inverse-gamma priors.

We further cluster variants into mean groups of size 25 based on their mean counts across time points and replicates. The mapping between a variant and its mean group is denoted g ( v ). We thus model the mean-variance relationship by assuming that variants with lower mean counts have larger error terms in the linear regression, and vice versa.

Stan [ 40 ] is used in Rosace for Bayesian inference over our model. We use the default inference method, the No-U-Turn sampler (NUTS), a variant of the Hamiltonian Monte Carlo (HMC) algorithm. Compared with other widely used Monte Carlo samplers, such as the Metropolis-Hastings algorithm, HMC has reduced correlation between successive samples, so fewer samples are needed to reach a similar level of accuracy [ 41 ]. NUTS further improves HMC by automatically determining the number of steps in each iteration of HMC sampling to sample more efficiently from the posterior [ 42 ].

Both the lower bound on the number of mutants per position \(|\{v|p(v)=i\}|\) (default 10) and the size of the variant mean groups \(g_p\) (default 25) can be changed.

Rosette : the OCT1 and MET datasets

We use the following datasets as input to the Rosette simulation: the OCT1 dataset by Yee et al. [ 10 ] as an example of positive selection and the MET dataset by Estevam et al. [ 11 ] as an example of negative selection. Specifically, we use replicate 2 of the cytotoxicity selection screen in the OCT1 dataset for both the score distribution and the raw count dispersion. For the MET dataset, we select the experiment with IL-3 withdrawal under the wild-type genetic background (without exon 14 skipping). Raw counts are extracted from replicate 1, but the scores are calculated from all three replicates because of frequent dropouts at the initial time point.

The sequencing reads and the resulting sequencing counts are processed in the default pipeline described in the previous method sections. Scores are then computed using simple linear regression (the naïve method). The naïve method is used as the Rosette input because we are trying to learn the global distribution of the scores instead of identifying individual variants and, while uncalibrated, naïve estimates are unbiased.
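The naïve method amounts to an ordinary least-squares slope per variant of the normalized log counts against time, which can be written compactly (a sketch with a hypothetical function name):

```python
import numpy as np

def naive_scores(norm_counts):
    """norm_counts: (V, T+1) normalized log counts; returns per-variant slopes."""
    t = np.arange(norm_counts.shape[1])
    # Closed-form OLS slope: center time and counts, then project
    t_c = t - t.mean()
    y_c = norm_counts - norm_counts.mean(axis=1, keepdims=True)
    return (y_c @ t_c) / (t_c @ t_c)
```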

Rosette : summary statistics from real data

Summary statistics inferred by Rosette can be categorized into two types: one for the dispersion of sequencing counts and the other for the dispersion of score distribution.

First, we estimate the dispersion \(\eta\) of the sequencing count. We assume the sequencing count at time point 0 reflects the true variant library before selection. Since the functional scores of synonymous variants are approximately 0, the proportion of synonymous mutations in the population should remain approximately the same after selection. Let the set of indices of synonymous mutations be \(\textbf{v}_s = \{v_{s1}, v_{s2}, \dots \}\) . The count of each synonymous mutation at time point t is \(\textbf{c}_{\textbf{v}_s, t} = (c_{v_{s1}, t}, c_{v_{s2}, t}, \dots )\) . The model we use to fit \(\eta\) is thus

from which we find the maximum likelihood estimation \(\hat{\eta }\) .
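The maximum likelihood estimation of the dispersion can be sketched by optimizing the Dirichlet-Multinomial log-likelihood over \(\eta\) (a sketch assuming the Dirichlet parameter is \(\eta\) times the time-0 proportions; function names are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def dirmult_loglik(counts, alpha):
    # Dirichlet-Multinomial log-likelihood, dropping the multinomial
    # coefficient (constant with respect to the dispersion parameter)
    n, a0 = counts.sum(), alpha.sum()
    return gammaln(a0) - gammaln(n + a0) + np.sum(
        gammaln(counts + alpha) - gammaln(alpha))

def estimate_eta(c_ref, c_t):
    """c_ref: counts defining the expected proportions (e.g., time 0);
    c_t: counts whose dispersion around those proportions we measure."""
    p0 = c_ref / c_ref.sum()
    nll = lambda log_eta: -dirmult_loglik(c_t, np.exp(log_eta) * p0)
    res = minimize_scalar(nll, bounds=(-5.0, 15.0), method="bounded")
    return float(np.exp(res.x))
```

Counts drawn from a plain multinomial push the estimate toward very large \(\eta\) (no over-dispersion), while over-dispersed counts yield a small \(\eta\).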

Dispersion of the initial variant library \(\eta _0\) is estimated similarly by fitting a Dirichlet-Multinomial distribution on the sequencing counts of the initial time point assuming that in an ideal experiment, the proportion of each variant in the library should be the same. Similar to above, the indices of all mutations are \(\textbf{v} = \{1, 2, \dots , V\}\) , and the count of each mutation at time point 0 is \(\textbf{c}_{\textbf{v}, 0} = (c_{1, 0}, c_{2, 0}, \dots , c_{V, 0})\) . From the following model

we can again find the maximum likelihood estimate of the variant library dispersion \(\hat{\eta }_0\) . Notice that \(\hat{\eta }_0\) is usually much smaller than \(\hat{\eta }\) (i.e., more overdispersed) because \(\hat{\eta }_0\) captures both the dispersion of the variant library and that of the sequencing step.

To characterize the distribution of functional scores, we first cluster mutants into groups, as mutants often have different properties and exert different influences on protein function. We compute the empirical Jensen-Shannon divergence (JSD) to measure the distance between two mutants, using bins of width 0.1 to estimate the empirical probability density function. Ideally, a clustering scheme should produce a grouping that reflects inherent, position-independent properties of an amino acid. We are thus more concerned with the general shape of the distribution than with the similarity between paired observations, which leads us to prefer JSD over Euclidean distance as the clustering metric. To cluster mutants into four mutant groups \(g_{m} = \{1, 2, 3, 4\}\) , we use hierarchical clustering (the “hclust” function with the complete linkage method in R), and we record the proportions \(\hat{\varvec{p}}\) so that any number of mutants can be simulated (the number of mutant groups can also be changed). The underlying assumption is that mutants in each mutant group are very similar and can be treated as interchangeable. We define \(f_1(v)\) as the function that maps a variant to its corresponding mutant group \(g_{m}\) .
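The JSD-based hierarchical clustering can be sketched as follows (in Python rather than R's “hclust,” with hypothetical names; the shared bin range and pseudo-count are illustrative):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_mutants(scores_by_mutant, n_groups, bin_width=0.1):
    """scores_by_mutant: dict mutant name -> array of scores across positions."""
    lo = min(s.min() for s in scores_by_mutant.values())
    hi = max(s.max() for s in scores_by_mutant.values())
    bins = np.arange(lo, hi + 2 * bin_width, bin_width)
    names = list(scores_by_mutant)
    # Histogram each mutant's score distribution on a common grid
    hists = [np.histogram(scores_by_mutant[m], bins=bins)[0] + 1e-9
             for m in names]
    # Pairwise JS divergence (scipy returns the JS distance, its square root)
    d = np.array([[jensenshannon(a, b) ** 2 for b in hists] for a in hists])
    Z = linkage(squareform(d, checks=False), method="complete")
    return dict(zip(names, fcluster(Z, t=n_groups, criterion="maxclust")))
```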

Then, we cluster the variants into different variant groups. In our examples, the score distribution is not unimodal but bimodal. The OCT1 screen has a LOF mode on the right (positive selection) and the MET screen has a LOF mode on the left (negative selection). While it is possible to observe both GOF and LOF variants, in our datasets GOF variants are so rare that they do not form a mode in the mixture distribution, resulting in a bimodal distribution. To cluster the non-synonymous variants into groups \(g_{v}\) , we use a Gaussian mixture model with two components to decide the cutoff between the groups, and we then fit a Gaussian distribution to each variant group to learn its distributional parameters. The synonymous variants form their own group, labeled as control. Let \(f_2(v)\) denote the function that maps a variant to its corresponding variant group \(g_{v}\) . The simulation results show that even synonymous mutations with scores close to 0 can appear to have large negative effects due to random dropout. Thus, we set the effect of the control and neutral groups to a constant 0 and still observe a distribution similar to that seen in the real data. For each variant, we have one of the models below, depending on whether the variant results in LOF or has no effect:

We use \(\widehat{\varvec{\theta }}\) to denote the collection of estimated distributional parameters for all variant groups.
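The two-component Gaussian mixture step can be sketched with a plain EM loop (a self-contained stand-in for standard GMM software; not the exact Rosette routine):

```python
import numpy as np

def gmm2_em(x, iters=200):
    """Two-component 1-D Gaussian mixture fit by EM, a sketch of the
    neutral-vs-LOF variant grouping."""
    mu = np.array([x.min(), x.max()], dtype=float)  # spread-out initialization
    sd = np.full(2, x.std())
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and standard deviations
        nk = resp.sum(axis=0)
        w, mu = nk / len(x), (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return mu, sd, w, resp.argmax(axis=1)
```

With a bimodal score distribution, the fitted component boundary provides the cutoff between the neutral and LOF variant groups.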

Finally, we define the number of variants in each variant group at each position

For each position p , we can thus find the count of variants belonging to any mutant-variant group \(\varvec{o}_{p} \in \textbf{N}^{\Vert g_m \Vert \Vert g_v \Vert }\) . Treating each position as an observation, we fit a Dirichlet distribution to characterize the distribution of variant group identities among mutants at any position:
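Although the Dirichlet fit itself is not shown here, a method-of-moments sketch over per-position variant-group proportions conveys the idea (hypothetical names; Rosette 's actual routine may use a maximum likelihood fit on the counts instead):

```python
import numpy as np

def fit_dirichlet_moments(props):
    """props: (P, K) per-position variant-group proportions, rows sum to 1.
    Method-of-moments estimate of the Dirichlet parameter vector."""
    m, v = props.mean(axis=0), props.var(axis=0)
    # Dirichlet marginal variance: var_k = m_k (1 - m_k) / (a0 + 1)
    a0 = np.median(m * (1 - m) / np.maximum(v, 1e-12) - 1.0)
    return a0 * m
```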

The final summary statistics are \(\hat{\eta }\) , \(\hat{\eta }_0\) , \(\hat{\varvec{p}}\) , \(\hat{\varvec{\theta }}\) , and \(\hat{\varvec{\alpha }}\) . We also need T , the number of selection rounds, to map \(\beta _v\) to the latent functional parameter \(\mu _v\) in growth screens.

Rosette : data generative model

We simulate the same number of mutants M , positions P , and variants V ( \(M \times P\) ) as in the real experiment. The important hyperparameters to specify are the average number of reads per variant D (100, also referred to as the sequencing depth), the initial cell population count \(P_0\) (200 V ), and the wild-type doubling rate \(\delta\) between time points ( \(-2\) or 2). One also needs to specify the number of replicates R and selection rounds T .

The simulation largely consists of two major steps: (1) generating latent growth rates \(\mu _v\) and (2) generating cell counts \(N_{v,t,r}\) and sequencing counts \(c_{v,t,r}\) .

In step 1, the mutant-group and variant-group labels of each variant are generated first. Specifically, we assign each mutant to a mutant group \(g_m\) according to the proportions \(\hat{\varvec{p}}\) and then assign each variant to a variant group \(g_v\) by drawing \(\varvec{o}_p\) from the Dirichlet distribution with parameter \(\hat{\varvec{\alpha }}\) (Eq. 10). Using \(\hat{\varvec{\theta }}\), we randomly generate \(\beta _v\) for each variant based on its \(g_v\) (Eq. 8). The mapping between \(\beta _v\) and \(\mu _v\) requires an understanding of the generative model, so it is defined after we present the cell growth model.

In step 2, the starting cell population \(N_{v,r,0}\) is drawn from a Dirichlet-Multinomial distribution using \(\hat{\eta }_0\) and we assume that replicates are biological replicates:

where \(P_0\) is the total cell population. The cells grow exponentially, and we determine the cell count with a Poisson distribution

where \(\Delta t\) is the pseudo-passing time. It differs from index t and will be defined in the next paragraph. Similar to how we define \(\textbf{c}_{\textbf{v}, t, r}\) , we define the true cell count of each variant at time point t and replicate r to be \(\textbf{N}_{\textbf{v}, t, r} = (N_{1, t, r}, \dots , N_{V, t, r})\) . The sequencing count for each variant is

where D is the sequencing depth per variant. Empirically, we can set the input \(\hat{\eta }\) and \(\hat{\eta }_0\) slightly higher than the estimated summary statistics, because the estimated values encompass all the noise in the experiment, while the true values represent only the noise from the sequencing step.
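Step 2 can be sketched end-to-end in a few lines. This is a simplified sketch under stated assumptions: equal starting proportions, a single replicate, a fixed pseudo-passing time, and Dirichlet-Multinomial noise at both the seeding and sequencing steps; the exact parameterization in Rosette differs.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 100, 3                      # variants and selection rounds (toy sizes)
P0, D = 200 * V, 100               # total starting cells, reads per variant
eta0, eta = 50.0, 50.0             # hypothetical Dirichlet concentration parameters
mu = rng.normal(0.0, 0.5, V)       # latent growth rate per variant (illustrative)
dt = np.log(2)                     # pseudo-passing time (delta = 1, mu_wt = 1 here)

# Starting cells: Dirichlet-Multinomial around equal proportions
p0 = rng.dirichlet(np.full(V, eta0))
N = rng.multinomial(P0, p0).astype(float)

counts = []
for t in range(T + 1):
    # Sequencing counts: Dirichlet-Multinomial around current cell proportions
    p_seq = rng.dirichlet(eta * N / N.sum() + 1e-9)
    counts.append(rng.multinomial(D * V, p_seq))
    # Exponential growth with Poisson noise to the next time point
    N = rng.poisson(N * np.exp(mu * dt)).astype(float)
counts = np.array(counts)          # shape (T + 1, V): one row per time point
```

Each row of `counts` plays the role of \(\textbf{c}_{\textbf{v}, t, r}\) for one time point of one replicate.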

To find the mapping between \(\beta _v\) and \(\mu _v\) , we define \(\delta\) to be the wild-type doubling rate and naturally compute \(\Delta t:= \frac{\delta \log 2}{\mu _{wt}}\) , the pseudo-passing time in each round. Then we can compute the expectation of \(\beta _v\) with the linear regression model. For simplicity, we omit the replicate index r and assume r is fixed in the next set of equations.

The final mapping between simulated \(\beta _v\) and \(\mu _v\) is then described in the following

with \(\mu _{wt}\) set to be \(\text {sgn}(\delta )\) .
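The role of \(\Delta t\) in this mapping can be illustrated numerically. The sketch below only demonstrates the stated relation \(\Delta t = \delta \log 2 / \mu _{wt}\) with \(\mu _{wt} = \text {sgn}(\delta )\), and why a linear regression of log counts on the round index recovers a slope proportional to \(\mu _v\) under noiseless exponential growth; the numbers are illustrative.

```python
import numpy as np

# Pseudo-passing time: Delta_t = delta * log(2) / mu_wt, with mu_wt = sgn(delta),
# which simplifies to |delta| * log(2).
delta = 2.0
mu_wt = np.sign(delta)
dt = delta * np.log(2) / mu_wt

# Under noiseless exponential growth N_t = N_0 * exp(mu_v * dt * t), regressing
# log N_t on the round index t recovers the slope mu_v * dt, which is why the
# expectation of beta_v can be computed with the linear regression model.
mu_v, N0, T = 0.4, 1000.0, 3
t = np.arange(T + 1)
logN = np.log(N0 * np.exp(mu_v * dt * t))
slope = np.polyfit(t, logN, 1)[0]  # equals mu_v * dt up to floating-point error
```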

Modified Rosette that favors position-informed models

In the original, position-agnostic version of Rosette , a \(\Vert g_m \Vert \Vert g_v \Vert\)-dimensional vector is drawn from the same Dirichlet distribution for each position. The vector can be regarded as a quota for each mutant-variant group, and variants at each position are assigned their mutant-variant group according to this quota. As a result, variants from all variant groups (neutral, LOF, and GOF) can coexist at a single position, which violates the assumption in Rosace that variants at one position have similar functional effects (strong LOF and GOF variants are very unlikely to occur at the same position). To show that Rosace can indeed take advantage of position information when it exists in the data, we create a modified version of Rosette in which variants at one position can belong to only one variant group. Specifically, a position can have either neutral, LOF, or GOF variants, but not a mixture of variant groups.

Benchmarking

The naïve method (simple linear regression) is run with the “lm” function in R on the processed data. For each variant, normalized counts are regressed against time, and raw two-sided p -values are computed from the t -statistics returned by “lm”. The p -values are then adjusted with the Benjamini-Hochberg procedure.
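A Python analogue of this naïve baseline is sketched below (the paper uses R's “lm”; `scipy.stats.linregress` and the hand-rolled BH helper are stand-ins, and the toy data are hypothetical).

```python
import numpy as np
from scipy import stats

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(p)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value down
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

# Per-variant regression of normalized counts against time (toy data)
time = np.array([0.0, 1.0, 2.0, 3.0])
rng = np.random.default_rng(0)
Y = rng.normal(0, 1, (50, 4))      # 50 variants x 4 time points
pvals = np.array([stats.linregress(time, y).pvalue for y in Y])
padj = bh_adjust(pvals)
```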

For Enrich2 , we use the built-in variant filtering and wild-type (“wt”) normalization. All analyses use a random-effect model as presented in the paper. When there is more than one selection round, we use weighted linear regression. Otherwise, a simple ratio test is performed. The resulting p -values are adjusted using the Benjamini-Hochberg Procedure.

DiMSum requires variants to be labeled with DNA sequences, so we generate dummy sequences. It is applied with the default settings to all simulations with one selection round. The z -statistics are computed as the variant’s mean estimate divided by its estimated standard deviation, and the adjusted p -value is computed from the z -score with the Benjamini-Hochberg procedure. DiMSum only processes data with one selection round (two time points) and thus may be disadvantaged when analyzing datasets with multiple selection rounds.
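The z-statistic step described above amounts to the following (a generic sketch, not DiMSum's internal code; the example estimates and standard errors are hypothetical). The resulting p-values would then be BH-adjusted as for the other methods.

```python
import numpy as np
from scipy import stats

# z-statistic: a variant's mean effect estimate over its estimated standard
# deviation, then a two-sided p-value from the standard normal distribution.
mean_est = np.array([0.1, -1.5, 2.2, 0.05])   # hypothetical mean estimates
se_est = np.array([0.5, 0.4, 0.6, 0.5])       # hypothetical standard deviations
z = mean_est / se_est
p = 2 * stats.norm.sf(np.abs(z))              # two-sided p-values
```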

mutscan is an end-to-end pipeline that requires sequencing reads as input. In contrast, Rosette only generates sequencing counts, which can be calculated from sequencing reads but cannot be used to recover them. To facilitate benchmarking, we use a SummarizedExperiment object to feed the Rosette output to the mutscan function “calculateRelativeFC,” which does take sequencing counts as input. We benchmark both mutscan(edgeR) and mutscan(limma) with the default normalization and hyperparameters provided by the function. We use the “logFC_shrunk” and “FDR” columns of the mutscan(edgeR) output and the “logFC” and “adj.P.Val” columns of the mutscan(limma) output.

We run Rosace with the position information of variants and the labeling of synonymous mutations. Because Rosace is a Bayesian framework, it does not compute an FDR like the frequentist methods above; all Rosace power/FDR calculations are instead done under the Bayesian local false sign rate ( lfsr ) setting [ 28 ]. Consequently, in the simulation we present the rank-FDR curve and the FDR-sensitivity curve as metrics instead of setting hard thresholds on FDR and lfsr. In the real-data benchmarking, both the FDR and lfsr thresholds are set to 0.05.
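For intuition, the lfsr of a variant can be estimated from posterior draws of its effect as the posterior probability that the reported sign is wrong. The sketch below is a generic Monte Carlo estimator under that definition, with hypothetical posterior samples; it is not Rosace's implementation.

```python
import numpy as np

def lfsr(draws):
    """Local false sign rate from posterior draws of an effect:
    min(P(effect >= 0), P(effect <= 0)) estimated by Monte Carlo."""
    draws = np.asarray(draws)
    return min((draws >= 0).mean(), (draws <= 0).mean())

rng = np.random.default_rng(0)
strong = rng.normal(-2.0, 0.1, 4000)   # clearly negative effect -> tiny lfsr
weak = rng.normal(0.05, 1.0, 4000)     # sign is uncertain -> lfsr near 0.5
```

Thresholding lfsr at 0.05 then plays a role analogous to an FDR cutoff for the frequentist methods.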

Rosace without the position label is denoted as Rosace (nopos) in Additional file 1: Figs. S5–S15, S19–S23, and S25. It removes the position layer in Fig.  2 C and keeps only the variant and replicate layers. The test statistics and model evaluation are presented identically to those of the full Rosace model.

Availability of data and materials

Rosace is implemented as an R package and is distributed on GitHub ( https://github.com/pimentellab/rosace ), under the MIT open-source license. The package also includes functions for Rosette simulation. An archived version of Rosace is available on Zenodo [ 43 ].

The integrated sequencing pipeline for short-read-based experiments is available on GitHub ( https://github.com/odcambc/dumpling ).

Scripts and pre-processed public datasets used to perform data analysis and generate figures for the paper are uploaded on GitHub as well ( https://github.com/roserao/rosace-paper-script ).

The protein datasets we used are as follows: OCT1 [ 10 ], MET [ 11 ], CARD11 [ 14 ], MSH2 [ 12 ], BRCA1 [ 13 ], BRCA1-RING [ 23 ], and Cohesin [ 29 ]. OCT1 and MET are available on NIH NCBI BioProject with accession codes PRJNA980726 and PRJNA993160 . CARD11, BRCA1, and Cohesin are available as supplementary files to their respective publications. MSH2 is available on Gene Expression Omnibus with accession code GSE162130 . BRCA1-RING is available on MaveDB with accession code mavedb:00000003-a-1 .

The benchmarking datasets are EVE [ 31 ] ( evemodel.org ), ClinVar [ 30 ] ( gnomad.broadinstitute.org ), and AlphaMissense [ 32 ] ( alphamissense.hegelab.org ).

Fowler DM, Stephany JJ, Fields S. Measuring the activity of protein variants on a large scale using deep mutational scanning. Nat Protoc. 2014;9(9):2267–84. https://doi.org/10.1038/nprot.2014.153 .

Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nature Methods. 2014;11(8):801–7. https://doi.org/10.1038/nmeth.3027 .

Araya CL, Fowler DM. Deep mutational scanning: assessing protein function on a massive scale. Trends Biotechnol. 2011;29(9):435–42. https://doi.org/10.1016/j.tibtech.2011.04.003 .

Tabet D, Parikh V, Mali P, Roth FP, Claussnitzer M. Scalable functional assays for the interpretation of human genetic variation. Annu Rev Genet. 2022;56(1):441–65. https://doi.org/10.1146/annurev-genet-072920-032107 .

Stein A, Fowler DM, Hartmann-Petersen R, Lindorff-Larsen K. Biophysical and mechanistic models for disease-causing protein variants. Trends Biochem Sci. 2019;44(7):575–88. https://doi.org/10.1016/j.tibs.2019.01.003 .

Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proc Natl Acad Sci USA. 2015;112:7159–64. https://doi.org/10.1073/PNAS.1422285112 .

Chen JZ, Fowler DM, Tokuriki N. Comprehensive exploration of the translocation, stability and substrate recognition requirements in vim-2 lactamase. eLife. 2020;9:1–31.

Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat Genet. 2018;50(6):874–82. https://doi.org/10.1038/s41588-018-0122-z .

Leander M, Liu Z, Cui Q, Raman S. Deep mutational scanning and machine learning reveal structural and molecular rules governing allosteric hotspots in homologous proteins. eLife. 2022;11. https://doi.org/10.7554/ELIFE.79932 .

Yee SW, Macdonald C, Mitrovic D, Zhou X, Koleske ML, Yang J, et al. The full spectrum of OCT1 (SLC22A1) mutations bridges transporter biophysics to drug pharmacogenomics. bioRxiv. 2023. https://doi.org/10.1101/2023.06.06.543963 .

Estevam GO, Linossi EM, Macdonald CB, Espinoza CA, Michaud JM, Coyote-Maestas W, et al. Conserved regulatory motifs in the juxtamembrane domain and kinase N-lobe revealed through deep mutational scanning of the MET receptor tyrosine kinase domain. eLife. 2023. https://doi.org/10.7554/elife.91619.1 .

Jia X, Burugula BB, Chen V, Lemons RM, Jayakody S, Maksutova M, et al. Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. Am J Hum Genet. 2021;108:163–75. https://doi.org/10.1016/J.AJHG.2020.12.003 .

Findlay GM, Daza RM, Martin B, Zhang MD, Leith AP, Gasperini M, et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature. 2018;562(7726):217–22. https://doi.org/10.1038/s41586-018-0461-z .

Meitlis I, Allenspach EJ, Bauman BM, Phan IQ, Dabbah G, Schmitt EG, et al. Multiplexed functional assessment of genetic variants in CARD11. Am J Hum Genet. 2020;107:1029–43. https://doi.org/10.1016/J.AJHG.2020.10.015 .

Flynn JM, Rossouw A, Cote-Hammarlof P, Fragata I, Mavor D, Hollins C III, et al. Comprehensive fitness maps of Hsp90 show widespread environmental dependence. eLife. 2020;9:e53810. https://doi.org/10.7554/eLife.53810 .

Steinberg B, Ostermeier M. Shifting fitness and epistatic landscapes reflect trade-offs along an evolutionary pathway. J Mol Biol. 2016;428(13):2730–43. https://doi.org/10.1016/j.jmb.2016.04.033 .

Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, et al. High-resolution mapping of protein sequence-function relationships. Nat Methods. 2010;7(9):741–6. https://doi.org/10.1038/nmeth.1492 .

Rubin AF, Gelman H, Lucas N, Bajjalieh SM, Papenfuss AT, Speed TP, et al. A statistical framework for analyzing deep mutational scanning data. Genome Biol. 2017;18:1–15. https://doi.org/10.1186/S13059-017-1272-5/FIGURES/7 .

Coyote-Maestas W, Nedrud D, He Y, Schmidt D. Determinants of trafficking, conduction, and disease within a K + channel revealed through multiparametric deep mutational scanning. eLife. 2022;11:e76903. https://doi.org/10.7554/eLife.76903 .

Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: An error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biol. 2020;21:1–23. https://doi.org/10.1186/S13059-020-02091-3/TABLES/2 .

Bloom JD. Software for the analysis and visualization of deep mutational scanning data. BMC Bioinformatics. 2015;16:1–13. https://doi.org/10.1186/S12859-015-0590-4/FIGURES/6 .

Bank C, Hietpas RT, Wong A, Bolon DN, Jensen JD. A Bayesian MCMC approach to assess the complete distribution of fitness effects of new mutations: Uncovering the potential for adaptive walks in challenging environments. Genetics. 2014;196:841–52. https://doi.org/10.1534/GENETICS.113.156190/-/DC1 .

Starita LM, Young DL, Islam M, Kitzman JO, Gullingsrud J, Hause RJ, et al. Massively parallel functional analysis of BRCA1 RING domain variants. Genetics. 2015;200(2):413–22. https://doi.org/10.1534/genetics.115.175802 .

Soneson C, Bendel AM, Diss G, Stadler MB. mutscan-a flexible R package for efficient end-to-end analysis of multiplexed assays of variant effect data. Genome Biol. 2023;12(24):1–22. https://doi.org/10.1186/S13059-023-02967-0/FIGURES/6 .

Eddy SR. Accelerated Profile HMM Searches. PLOS Comput Biol. 2011;7(10):1–16. https://doi.org/10.1371/journal.pcbi.1002195 .

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26(1):139–40. https://doi.org/10.1093/bioinformatics/btp616 .

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:1–21.

Stephens M. False discovery rates: a new deal. Biostatistics. 2017;18:275–94. https://doi.org/10.1093/BIOSTATISTICS/KXW041 .

Kowalsky CA, Whitehead TA. Determination of binding affinity upon mutation for type I dockerin-cohesin complexes from Clostridium thermocellum and Clostridium cellulolyticum using deep sequencing. Proteins Struct Funct Bioinforma. 2016;84(12):1914–28.

Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062–7.

Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599(7883):91–5.

Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664):eadg7492.

Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KHD, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell. 2020;182:1295-1310.e20. https://doi.org/10.1016/J.CELL.2020.08.012 .

Stiffler M, Hekstra D, Ranganathan R. Evolvability as a function of purifying selection in TEM-1 beta-lactamase. Cell. 2015;160(5):882–92. https://doi.org/10.1016/j.cell.2015.01.035 .

Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604(7904):175–83. https://doi.org/10.1038/s41586-022-04586-4 .

Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Research. 2021;10:33.  https://f1000research.com/articles/10-33/v2 .

Bushnell B. BBTools software package. 2014. https://sourceforge.net/projects/bbmap . Accessed 11 June 2021.

Van der Auwera GA, O’Connor BD. Genomics in the cloud: using Docker, GATK, and WDL in Terra. Sebastopol: O’Reilly Media; 2020.

Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–8. https://doi.org/10.1093/bioinformatics/btw354 .

Stan Development Team. RStan: the R interface to Stan. 2023. R package version 2.21.8. https://mc-stan.org/ . Accessed 22 May 2024.

Betancourt M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434. 2017.  https://arxiv.org/abs/1701.02434 .

Hoffman MD, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J Mach Learn Res. 2014;15(47):1593–623.

Rao J. pimentellab/rosace. 2023. Zenodo. https://doi.org/10.5281/zenodo.10814911 .

Review history

The review history is available as Additional file 2.

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Author information

Ruiqi Xin and Christian Macdonald contributed equally to this work.

Authors and Affiliations

Department of Computer Science, UCLA, Los Angeles, CA, USA

Jingyou Rao & Harold Pimentel

Computational and Systems Biology Interdepartmental Program, UCLA, Los Angeles, CA, USA

Department of Bioengineering and Therapeutic Sciences, UCSF, San Francisco, CA, USA

Christian Macdonald, Matthew K. Howard, Gabriella O. Estevam, Sook Wah Yee, James S. Fraser & Willow Coyote-Maestas

Tetrad Graduate Program, UCSF, San Francisco, CA, USA

Matthew K. Howard & Gabriella O. Estevam

Department of Pharmaceutical Chemistry, UCSF, San Francisco, CA, USA

Matthew K. Howard

Department of Mathematics, Baruch College, CUNY, New York, NY, USA

Mingsen Wang

Quantitative Biosciences Institute, UCSF, San Francisco, CA, USA

James S. Fraser & Willow Coyote-Maestas

Department of Computational Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA

Harold Pimentel

Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA

Contributions

JR, CM, WCM, and HP jointly conceived the project. JR and HP developed the statistical model and the simulation framework. JR, MW, and RX wrote the software and its support. JR performed the data analysis and benchmarking. CM wrote the sequencing pipeline. SWY and CM performed the OCT1 experiment and GOE performed the MET experiment. JR and HP wrote the manuscript with input from MW, CM, WCM, MH, and JSF. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Willow Coyote-Maestas or Harold Pimentel .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

JSF has consulted for Octant Bio, a company that develops multiplexed assays of variant effects. The other authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: Supplementary figures and tables.

Additional file 2: Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Rao, J., Xin, R., Macdonald, C. et al. Rosace : a robust deep mutational scanning analysis framework employing position and mean-variance shrinkage. Genome Biol 25 , 138 (2024). https://doi.org/10.1186/s13059-024-03279-7

Received : 31 October 2023

Accepted : 14 May 2024

Published : 24 May 2024

DOI : https://doi.org/10.1186/s13059-024-03279-7


Genome Biology

ISSN: 1474-760X
