Introduction to Econometrics with R

7.3 Joint Hypothesis Testing Using the F-Statistic

The estimated model is

\[ \widehat{TestScore} = \underset{(15.21)}{649.58} -\underset{(0.48)}{0.29} \times STR - \underset{(0.04)}{0.66} \times english + \underset{(1.41)}{3.87} \times expenditure. \]

Now, can we reject the hypothesis that the coefficient on \(STR\) and the coefficient on \(expenditure\) are both zero? To answer this, we have to resort to joint hypothesis tests. A joint hypothesis imposes restrictions on multiple regression coefficients. This is different from conducting individual \(t\)-tests, where a restriction is imposed on a single coefficient. Chapter 7.2 of the book explains why testing hypotheses about the model coefficients one at a time is different from testing them jointly.

The homoskedasticity-only \(F\) -Statistic is given by

\[ F = \frac{(SSR_{\text{restricted}} - SSR_{\text{unrestricted}})/q}{SSR_{\text{unrestricted}} / (n-k-1)} \]

with \(SSR_{restricted}\) being the sum of squared residuals from the restricted regression, i.e., the regression where we impose the restriction. \(SSR_{unrestricted}\) is the sum of squared residuals from the full model, \(q\) is the number of restrictions under the null and \(k\) is the number of regressors in the unrestricted regression.

It is fairly easy to conduct \(F\)-tests in R. We can use the function linearHypothesis() contained in the package car.
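The code below is a minimal sketch of this test. The object and variable names (model, CASchools, score, STR, english, expenditure) are assumptions about how the regression was set up, not necessarily the book's exact code.

```r
# Minimal sketch: homoskedasticity-only F-test with car::linearHypothesis()
# (object and variable names are assumptions, not the book's exact code)
library(car)

# assumed fitted model: test score regressed on STR, english and expenditure
model <- lm(score ~ STR + english + expenditure, data = CASchools)

# jointly test H0: coefficient on STR = 0 and coefficient on expenditure = 0
linearHypothesis(model, c("STR = 0", "expenditure = 0"))
```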

The output reveals that the \(F\) -statistic for this joint hypothesis test is about \(8.01\) and the corresponding \(p\) -value is \(0.0004\) . Thus, we can reject the null hypothesis that both coefficients are zero at any level of significance commonly used in practice.

A heteroskedasticity-robust version of this \(F\) -test (which leads to the same conclusion) can be conducted as follows:
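A sketch of the robust version, assuming the same model object as above; it passes a heteroskedasticity-consistent covariance matrix from the sandwich package through the vcov. argument of linearHypothesis().

```r
# Sketch: heteroskedasticity-robust F-test (assumes `model` from above)
library(sandwich)  # provides vcovHC()

linearHypothesis(model, c("STR = 0", "expenditure = 0"),
                 vcov. = vcovHC(model, type = "HC1"))
```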

The standard output of a model summary also reports an \(F\) -statistic and the corresponding \(p\) -value. The null hypothesis belonging to this \(F\) -test is that all of the population coefficients in the model except for the intercept are zero, so the hypotheses are \[H_0: \beta_1=0, \ \beta_2 =0, \ \beta_3 =0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \ \text{for at least one} \ j=1,2,3.\]

This is also called the overall regression \(F\) -statistic and the null hypothesis is obviously different from testing if only \(\beta_1\) and \(\beta_3\) are zero.

We now check whether the \(F\) -statistic belonging to the \(p\) -value listed in the model’s summary coincides with the result reported by linearHypothesis() .
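One way to make this comparison, again as a sketch assuming the model object from above:

```r
# Sketch: overall F-statistic from summary() versus linearHypothesis()
summary(model)$fstatistic   # named vector: value, numdf, dendf

linearHypothesis(model, c("STR = 0", "english = 0", "expenditure = 0"))
```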

The entry value is the overall \(F\)-statistic and it equals the result of linearHypothesis(). The \(F\)-test rejects the null hypothesis that the model has no power in explaining test scores. It is important to know that the \(F\)-statistic reported by summary() is not robust to heteroskedasticity.

AnalystPrep
Joint Hypotheses Testing

In simple regression, the intercept represents the expected value of the dependent variable when the independent variable is zero; in multiple regression, it is the expected value of the dependent variable when all independent variables are zero. The slope coefficients are interpreted as in simple regression, except that each now measures the effect of its variable holding the other independent variables constant.

Tests for single coefficients in multiple regression are similar to those in simple regression, including one-sided tests. The default test is against zero, but you can test against other hypothesized values.

In some cases, you might want to test a subset of variables jointly in multiple regression, comparing models with and without specific variables. This involves a joint hypothesis test in which you restrict some coefficients to zero. To test a single slope against a hypothesized value other than zero, you can either:

  • modify the hypothesized parameter value, \(b_j\), in the test statistic, or
  • compare \(b_j\) with the confidence interval boundaries derived from the regression coefficient output.

At times, we may want to collectively test a subset of variables within a multiple regression. To illustrate this concept and set the stage, let’s say we aim to compare the regression outcomes for a portfolio’s excess returns using Fama and French’s three-factor model (MKTRF, SMB, HML) with those using their five-factor model (MKTRF, SMB, HML, RMW, CMA). Given that both models share three factors (MKTRF, SMB, HML), the comparison revolves around assessing the necessity of the two additional variables: the return difference between the most profitable and least profitable firms (RMW) and the return difference between firms with the most conservative and most aggressive investment strategies (CMA). The primary goal in determining the superior model lies in achieving simplicity by identifying the most effective independent variables in explaining variations in the dependent variable.

Now, let’s contemplate a more comprehensive model:

$$Y_i= b_0+b_1 X_{1i}+b_2 X_{2i}+b_3 X_{3i}+b_4 X_{4i}+b_5 X_{5i}+\varepsilon_i$$

The model above has five independent variables and is referred to as the unrestricted model. Sometimes we may want to test whether \(X_4\) and \(X_5\) together make no significant contribution to explaining the dependent variable, i.e., whether \(b_4=b_5=0\). We compare the full (unrestricted) model to: $$Y_i= b_0+b_1 X_{1i}+b_2 X_{2i}+b_3 X_{3i}+\varepsilon_i$$ This is referred to as the restricted model because it excludes \(X_4\) and \(X_5\), which has the effect of restricting the slope coefficients on \(X_4\) and \(X_5\) to equal 0. These models are also termed nested models because the restricted model is contained within the unrestricted model. This model comparison entails a null hypothesis that encompasses a combined restriction on two coefficients, namely, \(H_0:b_4=b_5=0\) against \(H_A:\) at least one of \(b_4, b_5 \neq 0\).

We employ a statistic to compare nested models, pitting the unrestricted model against a restricted version with some slope coefficients set to zero. This statistic assesses how the joint restriction affects the restricted model's ability to explain the dependent variable compared to the unrestricted model. We test the influence of the omitted variables jointly using an F-distributed test statistic. $$F=\frac{(\text{Sum of squares error restricted model}-\text{Sum of squares error unrestricted model})/q}{(\text{Sum of squares error unrestricted model})/(n-k-1)}$$

\(q\)= Number of restrictions

When we want to compare the unrestricted model to this restricted model, \(q = 2\) because we are testing the null hypothesis \(b_4=b_5=0\). The F-statistic has \(q\) and \(n-k-1\) degrees of freedom. In summary, the unrestricted model includes a larger set of explanatory variables, whereas the restricted model has \(q\) fewer independent variables because the slope coefficients of the excluded variables are forced to be zero.

Why not just conduct hypothesis tests for each individual variable and make conclusions based on that data? In many cases of multiple regression with financial variables, there’s typically some level of correlation among the variables. As a result, there may be shared explanatory power that isn’t fully accounted for when testing individual slopes.

Table 1: Partial ANOVA Results for Models Using Three and Five Factors $$ \begin{array}{c|c|c|c|c} \textbf{Source}& \textbf{Factors} & \textbf{Residual} & \textbf{Mean }&\textbf{Degrees} \\ & \textbf{} & \textbf{sum of squares}& \textbf{squared error}&\textbf{of freedom }\\ \hline \text{Restricted} & 1,2,3 & 66.9825 & 1.4565 & 44 \\ \hline \text{Unrestricted} & 1,2,3,4,5 & 58.7232 & 1.3012 & 42\\ \hline\end{array} $$

Test of Hypothesis for factors 4 and 5 at 1% Level of significance

Step 1: State the hypothesis.             \(H_0:b_4=b_5=0\) vs. \(H_a:\) at least one of \(b_4, b_5 \neq 0\)

Step 2: Identify the appropriate test statistic. $$F=\frac{(\text{Sum of squares error restricted model}-\text{Sum of squares error unrestricted model})/q}{(\text{Sum of squares error unrestricted model})/(n-k-1)}$$

Step 3: Specify the level of significance.

            \(α\)=1% (one-tail, right side)

Step 4: State the decision rule.

            Critical F-value = 5.149. Reject the null if the calculated F-statistic exceeds 5.149.

Step 5: Calculate the test statistic. $$F=\frac{(66.9825-58.7232)/2}{58.7232/42}=\frac{4.1297}{1.3982}=2.9536$$

Step 6: Make a decision.

            Fail to reject the null hypothesis since the calculated F-statistic (2.9536) is less than the critical value (5.149).
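A short sketch reproducing this calculation in R (the numbers are taken from Table 1; qf() returns the critical value at the 1% significance level):

```r
# Sketch: nested-model F-test using the values from Table 1
sse_restricted   <- 66.9825
sse_unrestricted <- 58.7232
q <- 2          # two restrictions: b4 = b5 = 0
df_denom <- 42  # n - k - 1 for the unrestricted model

F_stat <- ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / df_denom)
F_crit <- qf(0.99, df1 = q, df2 = df_denom)

c(F_stat = F_stat, F_crit = F_crit)  # 2.95 < 5.15, so fail to reject H0
```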

Hypothesis testing involves testing an assumption regarding a population parameter. The null hypothesis is the statement to be tested; it is assumed to be true unless there is sufficient evidence against it. We reject the null hypothesis in the presence of enough evidence against it and accept the alternative hypothesis.

Hypothesis testing is performed on the estimated slope coefficients to establish if the independent variables explain the variation in the dependent variable.

The t-statistic for testing the significance of the individual coefficients in a multiple regression model is calculated using the formula below:

$$ t=\frac{\widehat{b_j}-b_{H0}}{S_{\widehat{b_j}}} $$

\(\widehat{b_j}\) = Estimated regression coefficient.

\(b_{H0}\) = Hypothesized value.

\(S_{\widehat{b_j}}\) = The standard error of the estimated coefficient.

It is important to note that the test statistic has \(n-k-1\) degrees of freedom, where \(k\) is the number of independent variables and the 1 accounts for the intercept term.

A t-test tests the null hypothesis that the regression coefficient equals some hypothesized value against the alternative hypothesis that it does not.

$$ H_0:b_j=v\ vs\ H_a:b_j\neq v $$

\(v\) = Hypothesized value.
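A minimal sketch of this calculation in R (b_hat, se_b, v, n and k are assumed inputs, not values from the text):

```r
# Sketch: t-test of a single coefficient against a hypothesized value v
t_stat <- (b_hat - v) / se_b
p_val  <- 2 * pt(abs(t_stat), df = n - k - 1, lower.tail = FALSE)  # two-sided p-value
```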

The F-test determines whether all the independent variables help explain the dependent variable. It is a test of regression’s overall significance. The F-test involves testing the null hypothesis that all the slope coefficients in the regression are jointly equal to zero against the alternative hypothesis that at least one slope coefficient is not equal to 0.

i.e.: \(H_0: b_1 = b_2 = … = b_k = 0\) versus \(H_a\): at least one \(b_j\neq 0\)

We must understand that we cannot use a series of individual t-tests to determine whether all slope coefficients are jointly zero. This is because individual tests do not account for the correlations among the independent variables.

The following inputs are required to determine the test statistic for the null hypothesis.

  • The total number of observations, \(n\).
  • The total number of slope coefficients to be estimated, \(k+1\), where \(k\) = number of slope coefficients.
  • The residual sum of squares (SSE), which is the unexplained variation .
  • The regression sum of squares (RSS), which is the explained variation .

The F-statistic (which is a one-tailed test ) is computed as:

$$ F=\frac{\left(\frac{RSS}{k}\right)}{\left(\frac{SSE}{n- \left(k + 1\right)}\right)}=\frac{\text{Mean Regression sum of squares (MSR)}}{\text{Mean squared error(MSE)}} $$

  • \(RSS\) = Regression sum of squares.
  • \(SSE\) = Sum of squared errors.
  • \(n\) = Number of observations.

\(k\) = Number of independent variables.

A large value of \(F\) implies that the regression model explains variation in the dependent variable. On the other hand, \(F\) will be close to zero when the independent variables explain little of the variation in the dependent variable.

The F-test is denoted as \(F_{k,\,n-(k+1)}\). The test has \(k\) degrees of freedom in the numerator and \(n-(k+1)\) degrees of freedom in the denominator.

Decision Rule

We reject the null hypothesis at a given significance level, \(\alpha\), if the calculated value of \(F\) is greater than the upper critical value of the one-tailed \(F\) distribution with the specified degrees of freedom.

I.e., reject \(H_0\) if \(F-\text{statistic}> F_c (\text{critical value})\). Graphically, we see the following:


Example: Calculating and Interpreting the F-statistic

The following ANOVA table presents the output of the multiple regression analysis of the price of the US Dollar index on the inflation rate and real interest rate.

$$ \textbf{ANOVA} \\ \begin{array}{c|c|c|c|c|c} & \text{df} & \text{SS} & \text{MS} & \text{F} & \text{Significance F} \\ \hline \text{Regression} & 2 & 432.2520 & 216.1260 & 7.5405 & 0.0179 \\ \hline \text{Residual} & 7 & 200.6349 & 28.6621 & & \\ \hline \text{Total} & 9 & 632.8869 & & & \end{array} $$

Test the null hypothesis that all independent variables are equal to zero at the 5% significance level.

We test the null hypothesis:

\(H_0: \beta_1= \beta_2 = 0\) versus \(H_a:\) at least one \(b_j\neq 0\)

with the following variables:

Number of slope coefficients: \((k) = 2\).

Degrees of freedom in the denominator:  \(n-(k+1) = 10-(2+1) = 7\).

Regression sum of squares: \(RSS = 432.2520\).

Sum of squared errors: \(SSE = 200.6349\).

$$ F=\frac{\left(\frac{RSS}{k}\right)}{\left(\frac{SSE}{n- \left(k + 1\right)}\right)}=\frac{\frac{432.2520}{2}}{\frac{200.6349}{7}}=7.5405 $$

For \(\alpha= 0.05\), the critical value of \(F\) with \(k = 2\) and \((n – k – 1) = 7\) degrees of freedom, \(F_{0.05, 2, 7}\), is approximately 4.737. Since the calculated F-statistic of 7.5405 exceeds this critical value, we reject the null hypothesis.


Additionally, you will notice that from the ANOVA table, the column “Significance F” reports a p-value of 0.0179, which is less than 0.05. The p-value implies that the smallest level of significance at which the null hypothesis can be rejected is 0.0179.
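As a sketch, the F-statistic, critical value, and p-value can all be reproduced from the ANOVA table entries:

```r
# Sketch: overall F-test from the ANOVA table values
rss <- 432.2520   # regression sum of squares
sse <- 200.6349   # residual sum of squares
k <- 2; n <- 10

F_stat <- (rss / k) / (sse / (n - k - 1))                 # about 7.54
F_crit <- qf(0.95, df1 = k, df2 = n - k - 1)              # about 4.74
p_val  <- pf(F_stat, k, n - k - 1, lower.tail = FALSE)    # about 0.018

c(F_stat = F_stat, F_crit = F_crit, p_value = p_val)
```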

Analyzing Multiple Regression Models for Model Fit

$$ \begin{array}{l|l} \textbf{Statistic} & \textbf{Assessing criteria} \\ \hline {\text{Adjusted } R^2} & \text{It is better if it is higher.} \\ \hline \text{Akaike’s information criterion (AIC)} & \text{It is better if it is lower.} \\ \hline {\text{Schwarz’s Bayesian information} \\ \text{criterion (BIC)}} & \text{A lower number is better.} \\ \hline {\text{An analysis of slope coefficients using} \\ \text{the t-statistic}} & {\text{The calculated t-statistic falls} \\ \text{outside the critical value(s) for the} \\ \text{selected significance level.}} \\ \hline {\text{Test of slope coefficients using the} \\ \text{F-test}} & {\text{The calculated F-statistic exceeds} \\ \text{the critical value for the selected} \\ \text{significance level.}} \end{array} $$

Several models that explain the same dependent variable are evaluated using Akaike’s information criterion (AIC). Often, it can be calculated from the information in the regression output, but most regression software includes it as part of the output.

$$ AIC=n\ ln\left[\frac{\text{Sum of squares error}}{n}\right]+2(k+1) $$

\(n\) = Sample size.

\(2\left(k+1\right)\) = Penalty term; the model is penalized as more independent variables are included.

Models with the same dependent variables can be compared using Schwarz’s Bayesian Information Criterion (BIC).

$$ BIC=n \ ln\left[\frac{\text{Sum of squares error}}{n}\right]+ln(n)(k+1) $$

  • Models with more parameters incur a larger penalty under BIC, so BIC prefers models with fewer parameters. This is because \(ln(n)\) exceeds 2 for all but very small sample sizes. A worked sketch of the AIC and BIC calculations follows below.
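A sketch of both criteria computed from the sum of squared errors; the sample size n = 48 is implied by the degrees of freedom in Table 1 and is otherwise an assumption.

```r
# Sketch: AIC and BIC from the sum of squared errors (formulas as given above)
ic_from_sse <- function(sse, n, k) {
  aic <- n * log(sse / n) + 2 * (k + 1)
  bic <- n * log(sse / n) + log(n) * (k + 1)
  c(AIC = aic, BIC = bic)
}

ic_from_sse(sse = 66.9825, n = 48, k = 3)  # three-factor (restricted) model
ic_from_sse(sse = 58.7232, n = 48, k = 5)  # five-factor (unrestricted) model
```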
Question

Which of the following statements is most accurate?

  A. The best-fitting model is the regression model with the highest adjusted \(R^2\) and low BIC and AIC.
  B. The best-fitting model is the regression model with the lowest adjusted \(R^2\) and high BIC and AIC.
  C. The best-fitting model is the regression model with both high adjusted \(R^2\) and high BIC and AIC.

Solution

The correct answer is A. A regression model with a high adjusted \(R^2\) and a low AIC and BIC will generally be the best fit.

B and C are incorrect. The best-fitting regression model generally has a high adjusted \(R^2\) and a low AIC and BIC.


MBA 8350: Course Companion for Analyzing and Leveraging Data

8.5 Joint Hypothesis Tests

8.5.1 Simple versus Joint Tests

We have already considered all there is to know about simple hypothesis tests.

\[H_0: \beta = 0 \quad \text{versus} \quad H_1: \beta \neq 0\]

With the established (one-sided or two-sided) hypotheses, we were able to calculate a p-value and conclude. There is nothing more to it than that.

A simple hypothesis test follows the same constraints as how we interpret single coefficients: all else equal . In particular, when we conduct a simple hypothesis test, we must calculate a test statistic under the null while assuming that all other coefficients are unchanged. This might be fine under some circumstances, but what if we want to test the population values of multiple regression coefficients at the same time? Doing this requires going from simple hypothesis tests to joint hypothesis tests.

Joint hypothesis tests consider a stated null involving multiple PRF coefficients simultaneously. Consider the following general PRF:

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i\]

A simple hypothesis test such as

\[H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0\]

is conducted under the assumption that \(\beta_2\) and \(\beta_3\) are left to be whatever the data says they should be. In other words, a simple hypothesis test can only address a value for one coefficient at a time while being silent on all others.

A joint hypothesis states a null hypothesis that considers multiple PRF coefficients simultaneously. The statement in the null hypothesis can become quite sophisticated and test some very interesting statements.

For example, we can test if all population coefficients are equal to zero - which explicitly states that none of the independent variables are important.

\[H_0: \beta_1 = \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0,\; \beta_2 \neq 0,\; \text{or} \; \beta_3 \neq 0\]

We don’t have to be so extreme and test that just two of the three coefficients are simultaneously zero.

\[H_0: \beta_1 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0\; \text{or} \; \beta_3 \neq 0\]

If we have a specific theory in mind, we could also test if PRF coefficients are simultaneously equal to specific (nonzero) numbers.

\[H_0: \beta_1 = 1 \; \text{and} \; \beta_3 = 4 \quad \text{versus} \quad H_1: \beta_1 \neq 1\; \text{or} \; \beta_3 \neq 4\]

Finally, we can test if PRF coefficients behave according to some relative measures. Instead of stating in the null that coefficients are equal to some specific number, we can state that they are equal (or opposite) to each other or they behave according to some mathematical condition.

\[H_0: \beta_1 = -\beta_3 \quad \text{versus} \quad H_1: \beta_1 \neq -\beta_3\]

\[H_0: \beta_1 + \beta_3 = 1 \quad \text{versus} \quad H_1: \beta_1 + \beta_3 \neq 1\]

\[H_0: \beta_1 + 5\beta_3 = 3 \quad \text{versus} \quad H_1: \beta_1 + 5\beta_3 \neq 3\]

As long as you can state a hypothesis involving multiple PRF coefficients in a linear expression, then we can test the hypothesis using a joint test. There are an infinite number of possibilities, so it is best to give you a couple of concrete examples to establish just how powerful these tests can be.

Application

One chapter of my PhD dissertation concluded with a single joint hypothesis test. The topic I was researching was the Bank-Lending Channel of Monetary Policy Transmission , which is a bunch of jargon dealing with how banks respond to changes in monetary policy established by the Federal Reserve. A paper from 1992 written by Ben Bernanke and Alan Blinder established that aggregate bank lending volume responded to changes in monetary policy (identified as movements in the Federal Funds Rate). 15 A simplified version of their model (below) considers the movement in bank lending as the dependent variable and the movement in the Fed Funds Rate (FFR) as the independent variable.

\[L_i = \beta_0 + \beta_1 FFR_i + \varepsilon_i\]

While this is a simplification of the model actually estimated, you can see that \(\beta_1\) will concisely capture the change in bank lending given an increase in the Fed Funds Rate.

\[\beta_1 = \frac{\Delta L_i}{\Delta FFR_i}\]

Since an increase in the Federal Funds Rate indicates a tightening of monetary policy, the authors proposed a simple hypothesis test to show that an increase in the FFR delivers a decrease in bank lending.

\[H_0:\beta_1 \geq 0 \quad \text{versus} \quad H_1:\beta_1 < 0\]

Their 1992 paper rejects the null hypothesis above, which gave them empirical evidence that bank lending responds to monetary policy changes. The bank lending channel was established!

My dissertation tested an implicit assumption of their model: symmetry .

The interpretation of the slope of this regression works for both increases and decreases in the Fed Funds Rate. Assuming that \(\beta_1 <0\) , a one-unit increase in the FFR will deliver an expected decline of \(\beta_1\) units of lending on average. However, it also states that a one-unit decrease in the FFR will deliver an expected increase of \(\beta_1\) units of lending on average. This symmetry is baked into the model. The only way we can explicitly test this assumption is to extend the model and perform a joint hypothesis test.

Suppose we separated the FFR variable into increases in the interest rate and decreases in the interest rate.

\[FFR_i^+ = FFR_i \;\text{ if }\; FFR_i >0 \quad \text{(zero otherwise)}\] \[FFR_i^- = FFR_i \;\text{ if }\; FFR_i <0 \quad \text{(zero otherwise)}\]

If we were to put both of these variables into a similar regression, then we could separate the change in lending from increases and decreases in the interest rate.

\[L_i = \beta_0 + \beta_1 FFR_i^+ + \beta_2 FFR_i^- + \varepsilon_i\]

\[\beta_1 = \frac{\Delta L_i}{\Delta FFR_i^+}, \quad \beta_2 = \frac{\Delta L_i}{\Delta FFR_i^-}\]

Notice that both \(\beta_1\) and \(\beta_2\) are still hypothesized to be negative numbers. However, the first model imposed the assumption that they were the same negative number while this model allows them to be different. We can therefore test the hypothesis that they are the same number by performing the following joint hypothesis:

\[H_0: \beta_1=\beta_2 \quad \text{versus} \quad H_1: \beta_1 \neq \beta_2\]

In case you were curious, the null hypothesis gets rejected, and this provides evidence that the bank lending channel is indeed asymmetric. This implies that banks respond more to monetary tightenings than to monetary expansions, which should make sense given the low levels of bank lending after the global recession of 2008 despite interest rates being at all-time lows.

Conducting a Joint Hypothesis Test

A joint hypothesis test involves four steps:

  1. Estimate an unrestricted model.
  2. Impose the null hypothesis and estimate a restricted model.
  3. Construct a test statistic under the null.
  4. Determine a p-value and conclude.

1. Estimate an Unrestricted Model

An analysis begins with a regression model that can adequately capture what you are setting out to uncover. In general terms, this is a model that doesn’t impose any serious assumptions on the way the world works so you can adequately test these assumptions. Suppose we have a hypothesis that two independent variables impact a dependent variable by the same quantitative degree. In that case, we need a model that does not impose this hypothesis.

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i\]

The model above allows for the two independent variables to impact the dependent variable in whatever way the data sees fit. Since there is no imposition of the hypothesis on the model, or no restriction that the hypothesis be obeyed, then this model is called the unrestricted model.

2. Estimate a Restricted Model

A restricted model involves both the unrestricted model and the null hypothesis. If we wanted to test if the two slope hypotheses were the same, then our joint hypothesis is just like the one in the previous example:

\[H_0:\beta_1=\beta_2 \quad \text{versus} \quad H_1:\beta_1 \neq \beta_2\]

With the null hypothesis established, we now need to construct a restricted model which results from imposing the null hypothesis on the unrestricted model. In particular, starting with the unrestricted model and substituting the null, we get the following:

\[Y_i = \beta_0 + \beta_2 X_{1i} + \beta_2 X_{2i} + \varepsilon_i\]

\[Y_i = \beta_0 + \beta_2 (X_{1i} + X_{2i}) + \varepsilon_i\]

\[Y_i = \beta_0 + \beta_2 \tilde{X}_{i} + \varepsilon_i \quad \text{where} \quad \tilde{X}_{i} = X_{1i} + X_{2i}\]

Imposing the null hypothesis restricts the two slope coefficients to be identical. If we construct the new variable \(\tilde{X}_i\) according to how the model dictates, then we can use the new variable to estimate the restricted model.
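A sketch of both estimations in R, assuming a data frame named df with columns y, x1, and x2 (all names are illustrative):

```r
# Sketch: unrestricted and restricted models (df, y, x1, x2 are assumed names)
unrestricted <- lm(y ~ x1 + x2, data = df)

df$x_tilde <- df$x1 + df$x2            # impose the null: beta1 = beta2
restricted <- lm(y ~ x_tilde, data = df)
```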

3. Construct a test statistic under the null

Now that we have our unrestricted and restricted models estimated, the only two things we need from them are the \(R^2\) values from each. We will denote the \(R^2\) from the unrestricted model as the unrestricted \(R^2\) or \(R^2_u\) , and the \(R^2\) from the restricted model as the restricted \(R^2\) or \(R^2_r\) .

These two pieces of information are used with two degrees of freedom measures to construct a test statistic under the null - which is conceptually similar to how we perform simple hypothesis tests. However, while simple hypothesis tests are performed assuming a Student’s t distribution, joint hypothesis tests are performed assuming an entirely new distribution: An F distribution.

Roughly speaking, an F distribution arises from taking the square of a t distribution. Since simple hypothesis tests deal with t distributions, and the joint hypothesis deals with \(R^2\) values, you get the general idea. An F-statistic under the null is given by

\[F=\frac{(R^2_u - R^2_r)/m}{(1-R^2_u)/(n-k-1)} \sim F_{m,\;n-k-1}\]

\(R^2_u\) is the unrestricted \(R^2\) - the \(R^2\) from the unrestricted model.

\(R^2_r\) is the restricted \(R^2\) - the \(R^2\) from the restricted model.

\(m\) is the numerator degrees of freedom - the number of restrictions imposed on the restricted model. In other words, count up the number of equal signs in the null hypothesis.

\(n-k-1\) is the denominator degrees of freedom - this is the degrees of freedom for a simple hypothesis test performed on the unrestricted model.

In simple hypothesis tests, we constructed a t-statistic that is presumably drawn from a t-distribution. We are essentially doing the same thing here by constructing a F-statistic that is presumably drawn from a F-distribution.
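Continuing the sketch from above, the F-statistic and its p-value can be computed directly from the two \(R^2\) values:

```r
# Sketch: F-statistic under the null from the unrestricted and restricted R-squared
r2_u <- summary(unrestricted)$r.squared
r2_r <- summary(restricted)$r.squared

m <- 1                                  # one equal sign in the null hypothesis
n <- nrow(df); k <- 2                   # from the unrestricted model

F_stat <- ((r2_u - r2_r) / m) / ((1 - r2_u) / (n - k - 1))
p_val  <- pf(F_stat, df1 = m, df2 = n - k - 1, lower.tail = FALSE)
```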


The F-distribution has a few conceptual properties we should discuss.

An F statistic is restricted to be non-negative.

This should make sense because the expressions in both the numerator and denominator of our F-statistic calculation are both going to be non-negative. The numerator is always going to be non-negative because \(R^2_u \geq R^2_r\) . In other words, the unrestricted model will always explain more or at least as much of the variation in the dependent variable as the restricted model does. When the two models explain the same amount of variation, then the \(R^2\) values are the same and the numerator is zero. When the two models explain different amounts of variation, then this means that the restriction prevents the model from explaining as much of the variation in the dependent variable it otherwise would when not being restricted.

The Rejection Region is Always in the Right Tail

If we have \(R^2_u = R^2_r\) , then this implies that the restricted model and the unrestricted model are explaining the same amount of variation in the dependent variable. Think hard about what this is saying. If both models have the same \(R^2\) , then they are essentially the same model . One model is unrestricted meaning it can choose any values for coefficients it sees fit. The other model is restricted meaning we are forcing it to follow whatever is specified in the null. If these two models are the same, then the restriction doesn’t matter . In other words, the model is choosing the values under the null whether or not we are imposing the null. If that is the case, then the f-statistic will be equal to or close to zero.

If we have \(R^2_u > R^2_r\) , then this implies that the restriction imposed by the null hypothesis is hampering the model from explaining as much of the volatility in the dependent variable than it otherwise would have. The more \(R^2_u > R^2_r\) , the more \(F>0\) . Once this F-statistic under the null becomes large enough, we reject the null. This means that the difference between the unrestricted and restricted models is so large that we have evidence to state that the null hypothesis is simply not going on in the data. This implies that the rejection region is always in the right tail, and the p-value is always calculated from the right as well.

4. Determine a P-value and Conclude

Again, we establish a significance level \(\alpha\) as we would with any hypothesis test. This sets an acceptable probability of a Type I error and breaks the distribution into a rejection region and a non-rejection region.

For example, suppose you set \(\alpha = 0.05\) and have \(m=2\) and \(n-k-1 = 100\) . This means that the non-rejection region will take up 95% of the area of the F-distribution with 2 and 100 degrees of freedom.

If an F-statistic is greater than 3.09 then we can reject the null of the joint hypothesis with at least 95% confidence.
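The critical value quoted above comes from the F quantile function; as a sketch:

```r
# Sketch: 5% critical value of an F distribution with 2 and 100 degrees of freedom
qf(0.95, df1 = 2, df2 = 100)   # approximately 3.09
```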


As in any hypothesis test, we can also calculate a p-value. This will deliver the maximum confidence level at which we can reject the null.
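A sketch of the p-value calculation (F_stat is an assumed, previously computed F-statistic):

```r
# Sketch: pf() returns the left-tail probability, i.e. 1 - p for this right-tailed test
pf(F_stat, df1 = 2, df2 = 100)        # gives 1 - p, as noted below
1 - pf(F_stat, df1 = 2, df2 = 100)    # the p-value itself
```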

Notice that since the probability is calculated from the left by default (like the other commands), we can use the above code to automatically calculate \(1-p\) .

8.5.2 Applications

Let's consider two applications. The first application is not terribly interesting, but it illustrates a joint hypothesis test that is always provided to you, free of charge, with any set of regression results. The second application is more involved and delivers the true importance of joint tests.

Application 1: A wage application

This is the same scenario we considered for the dummy variable section, only without gender as a variable.

Suppose you are a consultant hired by a firm to help determine the underlying features of the current wage structure for their employees. You want to understand why some wage rates are different from others. Let our dependent variable be wage (the hourly wage of an individual employee) and the independent variables be given by…

educ be the total years of education of an individual employee

exper be the total years of experience an individual employee had prior to starting with the company

tenure is the number of years an employee has been working with the firm.

The resulting PRF is given by…

\[wage_i=\beta_0+\beta_1educ_i+\beta_2exper_i+\beta_3tenure_i+\varepsilon_i\]

Suppose we wanted to test that none of these independent variables help explain movements in wages, so the resulting joint hypothesis would be

\[H_0: \beta_1 = \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0, \; \beta_2 \neq 0, \; \text{or} \; \beta_3 \neq 0\]

The unrestricted model is one where each of the coefficients can be whatever number the data wants them to be.
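As a sketch, the unrestricted model can be estimated as below. The wage1 data frame from the wooldridge package is an assumption about the data source, chosen because it matches the variables and sample size described here.

```r
# Sketch: unrestricted wage regression (wage1 from the wooldridge package is assumed)
library(wooldridge)
data("wage1")

unrestricted <- lm(wage ~ educ + exper + tenure, data = wage1)
summary(unrestricted)$r.squared   # roughly 0.30
```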

Our unrestricted model can explain roughly 30% of the variation in wages.

The next step is to estimate the restricted model - the model with the null hypothesis imposed. In this case you will notice that setting all slope coefficients to zero results in a rather strange looking model:

\[wage_i=\beta_0+\varepsilon_i\]

This model contains no independent variables. If you were to estimate this model, then the intercept term would return the average wage in the data, and the error term would simply capture each wage observation's deviation from that average. Since it is impossible for the deterministic component of this model to explain any of the variation in wages, the restricted \(R^2\) is zero by definition. Note that this is a special case arising from what the restricted model looks like. There will be more interesting cases where the restricted \(R^2\) must be determined by estimating a restricted model.

Now that we have the restricted and unrestricted \(R^2\) , we need the degrees of freedom to calculate an F-statistic under the null. The numerator degrees of freedom \((m)\) denotes how many restrictions we placed on the restricted model. Since the null hypothesis sets all three slope coefficients to zero, we consider this to be 3 restrictions. The denominator degrees of freedom \((n-k-1)\) is taken directly from the unrestricted model. Since \(n=526\) and we originally had 3 independent variables ( \(k=3\) ), the denominator degrees of freedom is \(n-k-1=522\) . We can now calculate our F statistic under the null as well as our p-value.
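A sketch of that calculation, continuing from the unrestricted model estimated above:

```r
# Sketch: F-statistic and p-value for H0: beta1 = beta2 = beta3 = 0
r2_u <- summary(unrestricted)$r.squared
r2_r <- 0                      # intercept-only restricted model explains nothing

m <- 3                         # three restrictions
n <- 526; k <- 3               # unrestricted model dimensions

F_stat <- ((r2_u - r2_r) / m) / ((1 - r2_u) / (n - k - 1))
p_val  <- pf(F_stat, df1 = m, df2 = n - k - 1, lower.tail = FALSE)

c(F_stat = F_stat, p_value = p_val)   # F is large, p-value essentially zero
```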

Note that since our F-statistic is far from 0, we can reject the null with approximately 100% confidence (i.e. the p-value is essentially zero).

What can we conclude from this?

Since we rejected the null hypothesis, we have statistical evidence that the alternative hypothesis is true. However, take a look at what the alternative hypothesis actually says. It says that at least one of the population coefficients is statistically different from zero. It doesn't say which ones. It doesn't say how many. That's it.

Is there a short cut?

Remember that all regression results provide the simple hypothesis that each slope coefficient is equal to zero.

\[H_0: \beta=0 \quad \text{versus} \quad H_1: \beta \neq 0\]

All regression results also provide the joint hypothesis that all slope coefficients are equal to zero. You can see the result at the bottom of the summary page. The last line delivers the same F-statistic we calculated above as well as a p-value that is essentially zero.

Note that while this uninteresting joint hypothesis test is done by default, other joint tests require a bit more work.

Application 2: Constant Returns to Scale

Suppose you have data on the Gross Domestic Product (GDP) of a country as well as observations on two aggregate inputs of production: the nation's capital stock (K) and aggregate labor supply (L). One popular regression to run in growth economics is to see if a nation's aggregate production function possesses constant returns to scale. If it does, then scaling up a nation's inputs by a particular percentage yields the exact same percentage increase in output (i.e., doubling the inputs doubles the output). This has implications for how large an economy should be, but we won't get into those details now.

The PRF is given by

\[lnGDP_i = \beta_0 + \beta_K \;lnK_i + \beta_L \;lnL_i + \varepsilon_i\]

\(lnGDP_i\) is an observation of total output

\(lnK_i\) is an observation of total capital stock

\(lnL_i\) is an observation of total labor stock.

These variables are actually in logs , but we will ignore that for now.

If we are testing for constant returns to scale, then we want to show that increasing all of the inputs by a certain amount will result in the same increase in output. Technical issues aside, this results in the following null hypothesis for a joint test:

\[H_0: \beta_K + \beta_L = 1 \quad \text{versus} \quad H_1: \beta_K + \beta_L \neq 1\]

We now have all we need to test for CRS:
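As a sketch, assuming the data are in a data frame named growth with columns lnGDP, lnK and lnL (the names and data source are assumptions):

```r
# Sketch: unrestricted aggregate production regression (growth, lnGDP, lnK, lnL assumed)
unrestricted <- lm(lnGDP ~ lnK + lnL, data = growth)
summary(unrestricted)$r.squared   # around 0.96
```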

The unrestricted model can explain around 96% of the variation in the dependent variable. For us to determine how much the restricted model can explain, we first need to see exactly what the restriction does to our model. Starting from the unrestricted model, imposing the restriction delivers the following:

\[lnGDP_i = \beta_0 + \beta_K \; lnK_i + \beta_L \; lnL_i + \varepsilon_i\] \[lnGDP_i = \beta_0 + (1 - \beta_L) \; lnK_i + \beta_L \; lnL_i + \varepsilon_i\]

\[(lnGDP_i - lnK_i) = \beta_0 + \beta_L \; (lnL_i - lnK_i) + \varepsilon_i\] \[\tilde{Y}_i = \beta_0 + \beta_L \; \tilde{X}_i + \varepsilon_i\] where \[\tilde{Y}_i=lnGDP_i - lnK_i \quad \text{and} \quad \tilde{X}_i=lnL_i - lnK_i\]

Notice how these derivations deliver exactly how the variables of the model need to be transformed and what the restricted model needs to be estimated.

The restricted model can explain roughly 94% of the variation in the dependent variable. To see if this reduction in \(R^2\) is enough to reject the null hypothesis, we need to calculate an F-statistic. The numerator degrees of freedom is \(m=1\) because there is technically only one restriction in the null. The denominator degrees of freedom uses \(n=24\) and \(k=2\) .
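A sketch of the remaining arithmetic, using the rounded \(R^2\) values reported in the text:

```r
# Sketch: restricted model and F-test of the CRS restriction
# restricted <- lm(I(lnGDP - lnK) ~ I(lnL - lnK), data = growth)

r2_u <- 0.96; r2_r <- 0.94     # rounded values from the text
m <- 1                         # one restriction: beta_K + beta_L = 1
n <- 24; k <- 2

F_stat <- ((r2_u - r2_r) / m) / ((1 - r2_u) / (n - k - 1))            # about 10.5
p_val  <- pf(F_stat, df1 = m, df2 = n - k - 1, lower.tail = FALSE)    # about 0.004
```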

As in the previous application, we received a very high F-statistic and a very low p-value. This means we reject the hypothesis that this country has an aggregate production function that exhibits constant returns to scale with slightly over 99.5% confidence.

Bernanke, B., & Blinder, A. (1992). The Federal Funds Rate and the Channels of Monetary Transmission. The American Economic Review , 82(4), 901-921. ↩︎

Introductory Econometrics

Chapter 17: Joint Hypothesis Testing

Chapter 16 shows how to test a hypothesis about a single slope parameter in a regression equation. This chapter explains how to test hypotheses about more than one of the parameters in a multiple regression model. Simultaneous multiple parameter hypothesis testing generally requires constructing a test statistic that measures the difference in fit between two versions of the same model.

An Example of a Test Involving More than One Parameter

One of the central tasks in economics is explaining savings behavior. National savings rates vary considerably across countries, and the United States has been at the low end in recent decades. Most studies of savings behavior by economists look at strictly economic determinants of savings. Differences in national savings rates, however, seem to reflect more than just differences in the economic environment. In a study of individual savings behavior, Carroll et al. (1999) examined the hypothesis that cultural factors play a role. Specifically, they asked the question: Does national origin help to explain differences in savings rates across a group of immigrants to the United States? Using 1980 and 1990 U.S. Census data on immigrants from 16 countries and on native-born Americans, Carroll et al. estimated a model similar to the following:(1)

For reasons that will become obvious, we call this the unrestricted model. The dependent variable is the household savings rate. Age and Education measure, respectively, the age and education of the household head (both in years). The error term reflects omitted variables that affect savings rates as well as the influence of luck. The subscript h indexes households. A series of 16 dummy variables indicate the national origin of the immigrants; for example, China_h = 1 if both husband and wife in household h were Chinese immigrants.(2) Suppose that the value of the coefficient multiplying China is 0.12. This would indicate that, with other factors controlled, immigrants of Chinese origin have a savings rate 12 percentage points higher than the base case (which in this regression consists of people who were born in the United States).
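A schematic version of this unrestricted model, consistent with the description above (the variable names and exact set of controls are illustrative, not the authors' specification), is

\[ SavingRate_h = \beta_0 + \beta_1 Age_h + \beta_2 Education_h + \delta_1 China_h + \cdots + \delta_{16} Country16_h + \varepsilon_h. \]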

If there are no cultural effects on savings, then all the coefficients multiplying the dummy variables for national origin ought to be equal to each other. In other words, if culture does not matter, national origin ought not to affect savings rates ceteris paribus. This is a null hypothesis involving 16 parameters and 16 equal signs:

The alternative hypothesis simply negates the null hypothesis, meaning that immigrants from at least one country have different savings rates than immigrants from other countries:

Now, if the null hypothesis is true, then an alternative, simpler model describes the data generation process:

Relative to the original model, the one above is a restricted model. We can test the null hypothesis with a new test statistic, the F-statistic, which essentially measures the difference between the fit of the original and restricted models above. The test is known as an F-test. The F-statistic will not have a normal distribution. Under the often-made assumption that the error terms are normally distributed, when the null is true, the test statistic follows an F distribution, which accounts for the name of the statistic. We will need to learn about the F- and the related chi-square distributions in order to calculate the P-value for the F-test.

F-Test Basics

The F-distribution is named after Ronald A. Fisher, a leading statistician of the first half of the twentieth century. This chapter demonstrates that the F distribution is a ratio of two chi-square random variables and that, as the number of observations increases, the F-distribution comes to resemble the chi-square distribution. Karl Pearson popularized the chi-square distribution beginning in 1900.

The Whole Model F-Test (discussed in Section 17.2) is commonly used as a test of the overall significance of the included independent variables in a regression model. In fact, it is so often used that Excel’s LINEST function and most other statistical software report this statistic. We will show that there are many other F-tests that facilitate tests of a variety of competing models. The idea that there are competing models opens the door to a difficult question: How do we decide which model is the right one? One way to answer this question is with an F-test. At first glance, one might consider measures of fit such as R2 or the sum of squared residuals (SSR) as a guide. But these statistics have a serious weakness – as you include additional independent variables, the R2 and SSR are guaranteed (practically speaking) to improve. Thus, naive reliance on these measures of fit leads to kitchen sink regression – that is, we throw in as many variables as we can find (the proverbial kitchen sink) in an effort to optimize the fit.

The problem with kitchen sink regression is that, for a particular sample, it will yield a higher R2 or lower SSR than a regression with fewer X variables, but the true model may be the one with the smaller number of X variables. This will be shown via a concrete example in Section 17.5. The F-test provides a way to discriminate between alternative models. It recognizes that there will be differences in measures of fit when one model is compared with another, but it requires that the loss of fit be substantial enough to reject the reduced model.

Organization

In general, the F-test can be used to test any restriction on the parameters in the equation. The idea of a restricted regression is fundamental to the logic of the F-test, and thus it is discussed in detail in the next section. Because the F-distribution is actually the ratio of two chi-square (χ2) distributed random variables (divided by their respective degrees of freedom), Section 17.3 explains the chi-square distribution and points out that, when the errors are normally distributed, the sum of squared residuals is a random variable with a chi-square distribution. Section 17.4 demonstrates that the ratio of two chi-square distributed random variables is an F-distributed random variable. The remaining sections of this chapter put the F-statistic into practice. Section 17.5 does so in the context of Galileo’s model of acceleration, whereas Section 17.6 considers an example involving food stamps. We use the food stamp example to show that, when the restriction involves a single equals sign, one can rewrite the original model to make it possible to employ a t-test instead of an F-test. The t- and F-tests yield equivalent results in such cases. We apply the F-test to a real-world example in Section 17.7. Finally, Section 17.8 discusses multicollinearity and the distinction between confidence intervals for a single parameter and confidence regions for multiple parameters.

1. Their actual model is, not surprisingly, substantially more complicated.

2. There were 17 countries of origin in the study, including 900 households selected at random from the United States. Only married couples from the same country of origin were included in the sample. Other restrictions were that the household head must have been older than 35 and younger than 50 in 1980.

Excel Workbooks

ChiSquareDist.xls CorrelatedEstimates.xls FDist.xls FDistEarningsFn.xls FDistFoodStamps.xls FDistGalileo.xls MyMonteCarlo.xls NoInterceptBug.xls

PrepNuggets

Joint hypothesis test


A joint hypothesis test is an F-test used to evaluate nested models, which consist of a full or unrestricted model and a restricted model. The F-statistic is calculated using the restricted/unrestricted formula shown earlier. The null hypothesis is that all coefficients of the excluded variables are equal to zero, and the alternative is that at least one of the excluded coefficients is not equal to zero. We reject the null hypothesis if the F-statistic is greater than the critical value. In essence, this is a combined test for the coefficients of all excluded variables.

The overall F-test is a special case of the joint hypothesis test in which the restricted model excludes all independent variables. Mathematically, the F-statistic simplifies to MSR over MSE for this special case.


Efficient Markets Hypothesis: Joint Hypothesis

Important paper: Fama (1970)

An efficient market will always “fully reflect” available information, but in order to determine how the market should “fully reflect” this information, we need to determine investors’ risk preferences. Therefore, any test of the EMH is a test of both market efficiency and investors’ risk preferences. For this reason, the EMH, by itself, is not a well-defined and empirically refutable hypothesis. Sewell (2006)

"First, any test of efficiency must assume an equilibrium model that defines normal security returns. If efficiency is rejected, this could be because the market is truly inefficient or because an incorrect equilibrium model has been assumed. This joint hypothesis problem means that market efficiency as such can never be rejected." Campbell, Lo and MacKinlay (1997), page 24

"...any test of the EMH is a joint test of an equilibrium returns model and rational expectations (RE)." Cuthbertson (1996)

"The notion of market efficiency is not a well-posed and empirically refutable hypothesis. To make it operational, one must specify additional structure, e.g., investors’ preferences, information structure, etc. But then a test of market efficiency becomes a test of several auxiliary hypotheses as well, and a rejection of such a joint hypothesis tells us little about which aspect of the joint hypothesis is inconsistent with the data." Lo (2000) in Cootner (1964), page x

"One of the reasons for this state of affairs is the fact that the EMH, by itself, is not a well-defined and empirically refutable hypothesis. To make it operational, one must specify additional structure, e.g. investors' preferences, information structure. But then a test of the EMH becomes a test of several auxiliary hypotheses as well, and a rejection of such a joint hypothesis tells us little about which aspect of the joint hypothesis is inconsistent with the data. Are stock prices too volatile because markets are inefficient, or is it due to risk aversion, or dividend smoothing? All three inferences are consistent with the data. Moreover, new statistical tests designed to distinguish among them will no doubt require auxiliary hypotheses of their own which, in turn, may be questioned." Lo in Lo (1997), page xvii

"For the CAPM or the multifactor APT to be true, markets must be efficient." "Asset-pricing models need the EMT. However, the notion of an efficient market is not affected by whether any particular asset-pricing theory is true. If investors preferred stocks with a high unsystematic risk, that would be fine: as long as all information was immediately reflected in prices, the EMT theory would be true." Lofthouse (2001), page 91

"One of the reasons for this state of affairs is the fact that the Efficient Markets Hypothesis, by itself, is not a well-defined and empirically refutable hypothesis. To make it operational, one must specify additional structure, e.g., investor’ preferences, information structure, business conditions, etc. But then a test of the Efficient Markets Hypothesis becomes a test of several auxiliary hypotheses as well, and a rejection of such a joint hypothesis tells us little about which aspect of the joint hypothesis is inconsistent with the data. Are stock prices too volatile because markets are inefficient, or is it due to risk aversion, or dividend smoothing? All three inferences are consistent with the data. Moreover, new statistical tests designed to distinguish among them will no doubt require auxiliary hypotheses of their own which, in turn, may be questioned." Lo and MacKinlay (1999), pages 6-7


The Joint Null Criterion for Multiple Hypothesis Tests

Jeffrey T. Leek

Johns Hopkins Bloomberg School of Public Health

John D. Storey

Princeton University

Abstract

Simultaneously performing many hypothesis tests is a problem commonly encountered in high-dimensional biology. In this setting, a large set of p-values is calculated from many related features measured simultaneously. Classical statistics provides a criterion for defining what a “correct” p-value is when performing a single hypothesis test. We show here that even when each p-value is marginally correct under this single hypothesis criterion, it may be the case that the joint behavior of the entire set of p-values is problematic. On the other hand, there are cases where each p-value is marginally incorrect, yet the joint distribution of the set of p-values is satisfactory. Here, we propose a criterion defining a well behaved set of simultaneously calculated p-values that provides precise control of common error rates and we introduce diagnostic procedures for assessing whether the criterion is satisfied with simulations. Multiple testing p-values that satisfy our new criterion avoid potentially large study specific errors, but also satisfy the usual assumptions for strong control of false discovery rates and family-wise error rates. We utilize the new criterion and proposed diagnostics to investigate two common issues in high-dimensional multiple testing for genomics: dependent multiple hypothesis tests and pooled versus test-specific null distributions.

1. Introduction

Simultaneously performing thousands or more hypothesis tests is one of the main data analytic procedures applied in high-dimensional biology ( Storey and Tibshirani, 2003 ). In hypothesis testing, a test statistic is formed based on the observed data and then it is compared to a null distribution to form a p-value. A fundamental property of a statistical hypothesis test is that correctly formed p-values follow the Uniform(0,1) distribution for continuous data when the null hypothesis is true and simple. (We hereafter abbreviate this distribution by U(0,1).) This property allows for precise, unbiased evaluation of error rates and statistical evidence in favor of the alternative. Until now there has been no analogous criterion when performing thousands to millions of tests simultaneously.
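As a small illustration of this property (not from the paper), the following R sketch simulates repeated two-sample t-tests under a true null hypothesis and checks that the resulting p-values look uniform:

```r
# Sketch: p-values from a correctly specified test are U(0,1) under the null
set.seed(1)
n_sim <- 5000
pvals <- replicate(n_sim, {
  x <- rnorm(20)        # group 1; the null of equal means is true
  y <- rnorm(20)        # group 2
  t.test(x, y)$p.value
})
hist(pvals, breaks = 20, main = "Null p-values", xlab = "p-value")
ks.test(pvals, "punif") # should not reject uniformity
```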

Just as with a single hypothesis test, the behavior under true null hypotheses is the primary consideration in defining well behaved p-values. However, when performing multiple tests, the situation is more complicated for several reasons: (1) among the entire set of hypothesis tests, a subset are true null hypotheses and the remaining subset are true alternative hypotheses, and the behavior of the p-values may depend on this configuration; (2) the data from each true null hypothesis may follow a different null distribution; (3) the data across hypothesis tests may be dependent; and (4) the entire set of p-values is typically utilized to make a decision about significance, some of which will come from true alternative hypotheses. Because of this, it is not possible to simply extrapolate the definition of a correct p-value in a single hypothesis test to that of multiple hypothesis tests. We provide two key examples to illustrate this point in the following section, both of which are commonly encountered in high-dimensional biology applications.

The first major point of this paper is that the joint distribution of the true null p-values is a highly informative property to consider, whereas verifying that each null p-value has a marginal U(0,1) distribution is not as directly informative. We propose a new criterion for null p-values from multiple hypothesis tests that guarantees a well behaved joint distribution, called the Joint Null Criterion (JNC). The criterion is that the ordered null p-values are equivalent in distribution to the corresponding order statistics of a sample of the same size from independent U(0,1) distributions. We show that multiple testing p-values that satisfy our new criterion can be used to more precisely estimate error rates and rank tests for significance. We illustrate with simple examples how this criterion avoids potentially unacceptable levels of inter-study variation that is possible even for multiple testing procedures that guarantee strong control.

The second major point of this paper is that new diagnostics are needed to objectively compare various approaches to multiple testing, specifically those that evaluate properties beyond control of expected error rate estimates over repeated studies. These new diagnostics should also be concerned with potentially large study specific effects that manifest over repeated studies in terms of the variance of realized error rates (e.g., the false discovery proportion) and the variance of error rate estimates. This has been recognized as particularly problematic in the case of dependent hypothesis tests, where unacceptable levels of variability in error rate estimates may be obtained even though the false discovery rate may be controlled ( Owen, 2005 ). The need for this type of diagnostic is illustrated in an example presented in the next section, where the analysis of gene expression utilizing three different approaches yields drastically different answers. We propose Bayesian and frequentist diagnostic procedures that provide an unbiased standard for null p-values from multiple testing procedures for complex data. When applied to these methods, the reasons for their differing answers are made clearer.

We apply our diagnostics to the null p-values from multiple simulated studies, to capture the potential for study specific errors. We use the diagnostics to evaluate methods in two major areas of current research in multiple testing: testing multiple dependent hypotheses and pooled versus test-specific null distributions. Surprisingly, some popular multiple testing procedures do not produce p-values with a well behaved joint null distribution, leading directly to imprecise estimates of common error rates such as the false discovery rate.

2. Motivating Examples

Here we motivate the need for the JNC and diagnostic tests by providing two general examples and a real data example from a gene expression study. The first general example describes a situation where every p-value has a U(0,1) distribution marginally over repeated studies. However, the joint distribution of study-specific sets of null p-values deviate strongly from that of independent U(0,1) components. The second general example illustrates the opposite scenario: here none of the p-values has a U(0,1) distribution marginally, but the set of study-specific null p-values appear to have a joint distribution equivalent to independent U(0,1) components of the same size. Together, these examples suggest the need for a gold standard for evaluating multiple testing procedures in practice. Finally, we show that different methods for addressing multiple testing dependence give dramatically different results for the same microarray analysis. This indicates that an objective criterion is needed for evaluating such methods in realistic simulations where the correct answer is known.

2.1. Problematic Joint Distribution from Correct Marginal Distributions

In this example, the goal is to test each feature for a mean difference between two groups of equal size. The first 300 features are simulated to have a true mean difference. There is also a second, randomized unmodeled binary variable that affects the data. Features 200–700 are simulated to have a mean difference between the groups defined by the unmodeled variable. The exact model and parameters for this simulation are detailed in Section 5. The result of performing these tests is a 1,000 × 100 matrix of p-values, where the p-values for a single study appear in columns and the p-values for a single test across repeated studies appear in rows.
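A rough R sketch of this kind of simulation follows; the exact model and parameters are those of Section 5 of the paper, so the sample size and effect sizes used here are illustrative assumptions only.

```r
# Sketch: 1,000 tests x 100 studies with a randomized, unmodeled binary variable
set.seed(1)
m <- 1000; n <- 20; B <- 100
grp  <- rep(c(0, 1), each = n / 2)               # primary group variable
pmat <- matrix(NA, m, B)                         # p-values: tests in rows, studies in columns
for (b in seq_len(B)) {
  z <- rbinom(n, 1, 0.5)                         # randomized, unmodeled binary variable
  X <- matrix(rnorm(m * n), m, n)
  X[1:300, ]   <- X[1:300, ]   + outer(rep(1, 300), grp)  # true group effect, tests 1-300
  X[200:700, ] <- X[200:700, ] + outer(rep(1, 501), z)    # unmodeled effect, tests 200-700
  pmat[, b] <- apply(X, 1, function(x) t.test(x[grp == 1], x[grp == 0])$p.value)
}
hist(pmat[500, ])        # one null test across studies: roughly U(0,1) marginally
hist(pmat[301:700, 1])   # null p-values within one study: the joint behavior can deviate
```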

Using these p-values we examine both their marginal distributions as well as the joint distribution of the null p-values. First, we look at a single p-value affected by the unmodeled variable over the 100 repeated studies. The top two histograms in Figure 1 show the behavior of two specific p-values over the 100 simulated studies. Marginally, each is U(0,1) distributed as would be expected. The randomization of the unmodeled variable results in correct marginal distributions for each null p-value.

Figure 1. Uniform marginal p-value distribution, but JNC violating joint distribution. Each histogram in the top panel shows the p-value distribution of a single hypothesis test across 100 simulated studies; for each, the marginal distribution is approximately U(0,1) even though each test was subject to a randomized unmodeled variable. Each histogram in the bottom panel shows a sample from the joint distribution of the set of null p-values from a specific realized study. Here, the p-values deviate from the distribution of an i.i.d. U(0,1) sample, depending on the correlation between the randomized unmodeled variable and the group.

Next we consider the null p-values from tests 301–700 for a single study, which is a sample from their joint distribution. The bottom two histograms in Figure 1 show two such examples. In one case the null p-values appear smaller than expected from an i.i.d. U(0,1) sample, in the other case they appear to be larger than expected. This is because in the first study, the unmodeled variable is correlated with the group difference and the signal from the unmodeled variable is detected by the test between groups. In the second study, the unmodeled variable is uncorrelated with the group difference and a consistent source of noise is added to the data, resulting in null p-values that are too large. The result is that each null p-value is U(0,1) marginally, but the joint distribution deviates strongly from a sample of i.i.d. U(0,1) random variables.

Of particular interest is the lower left histogram of Figure 1, which shows only the null p-values from a single simulated study with dependence. The p-values appear to follow the usual pattern of differential expression, with some p-values near zero (ostensibly corresponding to differentially expressed genes) and some p-values that appear to be drawn from a U(0,1) distribution (ostensibly the null genes). However, in this example all of the genes are true nulls, so ideally their joint distribution would reflect the U(0,1). Inspection of this histogram would lead to the mistaken conclusion that the method had performed accurately and true differential expression had been detected. This strongly motivates the need for new diagnostic tools that consider the joint behavior of null p-values.

2.2. Well Behaved Joint Distribution from Incorrect Marginal Distributions

The second general example also consists of 1,000 tests for mean differences between two groups. The first 300 features are again simulated to have a mean difference between groups. We simulated each feature to have a different variance. The test statistic is a modified t-statistic with a shrinkage constant added to the denominator: \[ t = \frac{\bar{x}_{i1} - \bar{x}_{i0}}{s_i + a_0}, \] where \(\bar{x}_{ij}\) is the mean for feature \(i\) and group \(j\), \(s_i\) is the standard error of \(\bar{x}_{i1} - \bar{x}_{i0}\), and \(a_0\) is a single fixed shrinkage constant for all tests. This type of shrinkage statistic is common in the field of multiple testing, where \(a_0\) is frequently estimated from the distribution of the \(s_i\) ( Tusher, Tibshirani, and Chu, 2001 , Cui, Hwang, Qiu, Blades, and Churchill, 2005 , Efron, Tibshirani, Storey, and Tusher, 2001 ). The null statistics are calculated via the bootstrap and pooled across features ( Storey and Tibshirani, 2003 ). The top row of Figure 2 shows the distribution of two specific p-values across the 100 studies. In this case, since the standard errors vary across tests, the impact \(a_0\) has on the test's null distribution depends on the relative size of \(a_0\) to the \(s_i\).
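The following R sketch illustrates the shrunken statistic and the pooling of null statistics across features. It is an illustrative assumption of how such a procedure might be coded, not the authors' implementation; here the null bootstrap resamples group-centered residuals within each feature.

```r
# Sketch: shrunken t-statistic with null statistics pooled across features
shrunken_t <- function(X, grp, a0) {
  x1 <- rowMeans(X[, grp == 1]); x0 <- rowMeans(X[, grp == 0])
  n1 <- sum(grp == 1); n0 <- sum(grp == 0)
  sp2 <- (apply(X[, grp == 1], 1, var) * (n1 - 1) +
          apply(X[, grp == 0], 1, var) * (n0 - 1)) / (n1 + n0 - 2)
  se  <- sqrt(sp2 * (1 / n1 + 1 / n0))
  (x1 - x0) / (se + a0)                           # a0: single fixed shrinkage constant
}

pooled_null_pvalues <- function(X, grp, a0, nboot = 100) {
  obs <- shrunken_t(X, grp, a0)
  R <- X                                          # residuals after removing group means
  R[, grp == 1] <- X[, grp == 1] - rowMeans(X[, grp == 1])
  R[, grp == 0] <- X[, grp == 0] - rowMeans(X[, grp == 0])
  null_pool <- replicate(nboot, {
    Rb <- t(apply(R, 1, sample, replace = TRUE))  # bootstrap residuals within each feature
    shrunken_t(Rb, grp, a0)
  })
  # each feature's |observed statistic| is compared to the pooled null statistics of all features
  sapply(abs(obs), function(s) mean(abs(null_pool) >= s))
}
```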

Figure 2. Non-uniform marginal p-value distribution, but JNC satisfying joint distribution. Each histogram in the top panel shows the p-value distribution of a single hypothesis test using a shrunken t-statistic and pooled null statistics across 100 simulated studies. It can be seen in each that the marginal distribution deviates from U(0,1). Each histogram in the bottom panel shows a sample from the joint distribution of the set of null p-values from two specific realizations of the study. Here, the p-values satisfy the JNC, since pooling the null statistics accounts for the distribution of variances across tests.

When the null statistics are pooled, there are individual tests whose p-value follows an incorrect marginal distribution across repeated studies. The reason is that the bootstrap null statistics are pooled across 1,000 different null distributions. The bottom row of Figure 2 shows a sample from the joint distribution of the null p-values for specific studies. The joint distribution behaves like an i.i.d. U(0,1) sample because pooling the bootstrap null statistics captures the overall impact of different variances on the joint distribution of the test statistics coming from true null hypotheses.

2.3. Microarray Significance Analysis

Idaghdour, Storey, Jadallah, and Gibson (2008) performed a study of 46 desert nomadic, mountain agrarian, and coastal urban Moroccan Amazigh individuals to identify differentially expressed genes across geographic populations. Due to the heterogeneity of these groups and the observational nature of the study, there is likely to be latent structure present in the data, leading to multiple testing dependence. This can be easily verified by examining the residual data after regressing out the variables of interest ( Idaghdour et al., 2008 ). As an example we present two differential expression analyses in Figure 3 : agrarian versus desert nomadic, and desert nomadic versus village.

Figure 3. P-value histograms from the differential expression analysis comparing agrarian versus desert nomadic (top row) and desert nomadic versus village (bottom row). For each comparison, three different analysis strategies are used: a standard F-statistic significance (first column), a surrogate variable adjusted approach (second column), and an empirical null adjusted approach (third column). The last two are methods for adjusting for multiple testing dependence. Both comparisons show wildly different results depending on the analysis technique used.

We perform each analysis in three ways: (1) a simple F-test for comparing group means, (2) a surrogate variable adjusted analysis ( Leek and Storey, 2007 ), and (3) an empirical null ( Efron, 2004 , 2007 ) adjusted analysis. These last two approaches are different methods for adjusting for multiple testing dependence and latent structure in microarray data. Figure 3 shows that each analysis strategy results in a very different distribution for the resulting p-values. Idaghdour et al. (2008) found coherent and reproducible biology among the various comparisons only when applying the surrogate variable analysis technique. However, how do we know in a more general sense which, if any, of these analysis strategies is more well behaved since they give such different results? This question motivates a need for a criterion and diagnostic test for evaluating the operating characteristics of multiple testing procedures, where the criterion is applied to realistically simulated data where the correct answers are known.

3. A Criterion for the Joint Null Distribution

The examples from the previous section illustrate that it is possible for p-values from multiple tests to have proper marginal distributions, but together form a problematic joint distribution. It is also possible to have a well behaved joint distribution, but composed of p-values with incorrect marginal distributions. In practice only a single study is performed and the statistical significance is assessed from the entire set of p-values from that study. The real data example shows that different methods may yield notably different p-value distributions in a given study.

Thus, utilizing a procedure that produces a well behaved joint distribution of null p-values is critical to reduce both deviation from uniformity of p-values within a study and large variance in statistical significance across studies. A single hypothesis test p-value is correctly specified if its distribution is U(0,1) under the null ( Lehmann, 1997 ). In other words, \(p\) is correctly specified if, for \(\alpha \in (0, 1)\), \(\Pr(p < \alpha) = \Pr(U < \alpha) = \alpha\), where \(U \sim U(0,1)\). We would like to ensure that the null p-values from a given experiment have a joint distribution that is stochastically equivalent to an independent sample from the U(0,1) distribution of the same size. Based on this intuition, we propose the following criterion for the joint null p-value distribution.

Definition (Joint Null Criterion, JNC). Suppose that \(m\) hypothesis tests are performed where tests \(1, 2, \ldots, m_0\) are true nulls and \(m_0+1, \ldots, m\) are true alternatives. Let \(p_i\) be the p-value for test \(i\) and let \(p_{(n_i)}\) be the order statistic corresponding to \(p_i\) among all p-values, so that \(n_i = \#\{p_j \leq p_i\}\). The set of null p-values satisfies the Joint Null Criterion if and only if the joint distribution of \(p_{(n_i)}\), \(i = 1, \ldots, m_0\), is equal to the joint distribution of \(p^*_{(n^*_i)}\), \(i = 1, \ldots, m_0\), where \(p^*_1, \ldots, p^*_{m_0}\) are an i.i.d. sample from the U(0,1) distribution and \(p^*_i \overset{a.s.}{=} p_i\) for \(i = m_0+1, \ldots, m\).

Remark 1. If all the p-values correspond to true nulls, the JNC is equivalent to saying that the ordered p-values have the same distribution as the order statistics from an i.i.d. sample of size m from the U(0,1) distribution.

Intuitively, when the JNC is satisfied and a large number of hypothesis tests is performed, the set of null p-values from these tests should appear to be equivalent to an i.i.d. sample from the U(0,1) distribution when plotted together in a histogram or quantile-quantile plot. Figure 4 illustrates the conceptual difference between the JNC and the univariate criterion. The p-values from multiple tests for a single study appear in columns and the p-values from a single test across studies appear in rows. The standard univariate criterion is concerned with the behavior of single p-values across multiple studies, represented as rows in Figure 4. In contrast, the JNC is concerned with the joint distribution of the set of study-specific p-values, represented by columns in Figure 4. When only a single test is performed, each column has only a single p-value, so the JNC is simply the standard single test criterion.
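As a quick visual check of the JNC in simulated data (a sketch; `pmat` refers to a matrix of p-values with tests in rows and studies in columns, as in the earlier simulation sketch), one can compare a study's ordered null p-values to the corresponding uniform quantiles:

```r
# Sketch: compare one study's ordered null p-values to U(0,1) order statistics
jnc_qqplot <- function(null_p) {
  m0 <- length(null_p)
  q  <- (seq_len(m0) - 0.5) / m0                # expected uniform quantiles
  plot(q, sort(null_p), pch = 16, cex = 0.4,
       xlab = "U(0,1) quantiles", ylab = "ordered null p-values")
  abline(0, 1, col = "red")                     # points near this line are consistent with the JNC
}
# e.g. jnc_qqplot(pmat[301:700, 1])             # null tests from one simulated study
```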

Figure 4. An illustration of the Joint Null Criterion. The p-values from multiple tests for a single study appear in columns and the p-value from a single test across replicated studies compose each row. The JNC evaluates the joint distribution of the set of null p-values, whereas the single test criterion is concerned with the distribution of a single p-value across replicated studies.

Remark 2. In the case that the null hypotheses are composite, the distributional equality in the above criterion can be replaced with a stochastic ordering of the two distributions.

Remark 3. The JNC is not equivalent to the trivial case where the null p-values are each marginally U(0,1) and they are jointly independent. Let \(U_{(1)} \leq U_{(2)} \leq \cdots \leq U_{(m_0)}\) be the order statistics from an i.i.d. sample of size \(m_0\) from the U(0,1) distribution. Set \(p_i = U_{(i)}\) for \(i = 1, \ldots, m_0\). It then follows that the null p-values are highly dependent (since \(p_i < p_j\) for all \(i < j\)), none are marginally U(0,1), but their joint distribution is valid. Example 2 from Section 2 provides another scenario where the JNC is not equivalent to the trivial case.

Remark 4. The JNC is not a necessary condition for the control of the false discovery rate, as it has been shown that the false discovery rate may be controlled for certain types of dependence that may violate the JNC ( Benjamini and Yekutieli, 2001 , Storey, Taylor, and Siegmund, 2004 ).

The JNC places a condition on the joint behavior of the set of null p-values. This joint behavior is critical, since error estimates and significance calculations are performed on the set of p-values from a single study (e.g., false discovery rate estimates, Storey (2002)). To make this concrete, consider the examples from the previous section. In Example 1, the joint distribution of the null p-values is much more variable than a sample from the U(0,1) distribution, resulting in unreliable error rate estimates and significance calculations ( Owen, 2005 ). The joint p-values in this example fail to meet the JNC. In Example 2, the joint distribution of the p-values satisfies the JNC, resulting in well behaved error rate estimates and significance calculations, even though the marginal behavior of each p-value is not U(0,1).

When the JNC is met, then estimation of experiment-wide error rates and significance cutoffs behaves similarly to the well behaved situation where the true null p-values are i.i.d. U(0,1). Lemma 1 makes these ideas concrete (see Supplementary Information for the proof).

Lemma 1. Suppose that \(p_1, p_2, \ldots, p_m\) are \(m\) p-values resulting from \(m\) hypothesis tests; without loss of generality, suppose that \(p_1, \ldots, p_{m_0}\) correspond to true null hypotheses and \(p_{m_0+1}, \ldots, p_m\) to true alternative hypotheses. If (1) the JNC is satisfied for \(p_1, \ldots, p_{m_0}\) and (2) the conditional distribution \(\{p_{(n_i)}\}_{i=m_0+1}^{m} \mid \{p_{(n_i)}\}_{i=1}^{m_0}\) is equal to the conditional distribution \(\{p^*_{(n_i)}\}_{i=m_0+1}^{m} \mid \{p^*_{(n_i)}\}_{i=1}^{m_0}\), then any multiple hypothesis testing procedure based on the order statistics \(p_{(1)}, \ldots, p_{(m)}\) has properties equivalent to those in the case where the true null hypotheses' p-values are i.i.d. Uniform(0,1).

Corollary. When conditions (1) and (2) of Lemma 1 are satisfied, the multiple testing procedures of Shaffer (1995) , Benjamini and Hochberg (1995) , Storey et al. (2004) provide strong control of the false discovery rate. Furthermore, the controlling and estimation properties of any multiple testing procedure requiring the null p-values to be i.i.d. Uniform(0,1) continue to hold true when the JNC is satisfied.

The Joint Null Criterion is related to two well-known concepts in multiple testing, the marginal determine joint (MDJ) condition ( Xu and Hsu, 2007 , Calian, Li, and Hsu, 2008 ) and the joint null domination (jtNDT) condition ( Dudoit and van der Laan, 2008 ). The MDJ is a condition on the observations, which is sufficient to guarantee a permutation distribution is the same as the true distribution ( Calian et al., 2008 ). Meanwhile, the jtNDT condition is concerned with Type I errors being stochastically greater under the test statistics' null distribution than under their true distribution. From this, Dudoit and van der Laan (2008) show that two main types of null distributions for test statistics can be constructed that satisfy this null domination property. The difference between these criteria and the JNC is that the JNC focuses not just on Type I error control, but also on controlling the study-to-study variability in Type I errors.

4. Statistical Methods for Evaluating the Joint Null Criterion

Several new multiple testing statistics for the analysis of gene expression data have recently been proposed and evaluated in the literature ( Tusher et al., 2001 , Newton, Noueiry, Sarkar, and Ahlquist, 2004 , Storey, 2007 ). A standard evaluation of the accuracy of a new procedure is to apply it to simulated data and determine whether a particular error rate, such as the false discovery rate, is conservatively biased at specific thresholds, typically 5% and 10%. The JNC suggests a need for methods to evaluate the joint distribution of null p-values from multiple testing procedures. We propose a three step approach for evaluating whether the joint distribution of null p-values satisfies the JNC.

  • Simulate multiple high-dimensional data sets from a common data generating mechanism that captures the expected cross study variation in signal and noise, and includes any dependence or latent structure that may be present.
  • Apply the method(s) in question to each study individually to produce a set of p-values for each study.
  • Compare the set of null p-values from each specific study to the U(0,1) distribution, and quantify differences between the two distributions across all studies.

The first two steps of our approach involve simulating data and applying the method in question to generate p-values, which we carry out in the next section in the context of multiple testing dependence and pooling null distributions across tests. When the joint null distribution can be characterized directly ( Huang, Xu, Calian, and Hsu, 2006 ), analytic evaluation of the JNC may be possible. A key component of evaluating the JNC is the ability to simulate from a realistic joint distribution for the observed data. Application of these diagnostic criteria requires careful examination of the potential properties, artifacts, and sources of dependence that exist in high-dimensional data. In the remainder of the current section, we propose methods for the third step: summarizing and evaluating null p-values relative to the U(0,1) distribution.

We propose one non-parametric approach based on the Kolmogorov-Smirnov (KS) test and a second approach based on a Bayesian posterior probability for the joint distribution. When applying these diagnostics to evaluate multiple testing procedures that produce a small number of observed p-values (m < 100), the asymptotic properties of the KS test may not hold. For these scenarios, the Bayesian diagnostic may be more appropriate. In the more general case, when a large number of tests are performed, both diagnostics are appropriate.

4.1. Double Kolmogorov-Smirnov Test

In this step we start with m p-values from B simulated studies, \(p_{1j}, \ldots, p_{mj}\), \(j = 1, \ldots, B\). Assume that the first \(m_0\) p-values correspond to the null tests and the last \(m - m_0\) correspond to the alternative tests. To directly compare the behavior of the p-values from any study to the U(0,1) distribution, we consider the study-specific empirical distribution function, defined for study \(j\) as \[ F_{m_0 j}(x) = \frac{1}{m_0} \sum_{i=1}^{m_0} 1(p_{ij} < x). \] The empirical distribution is an estimate of the unknown true distribution of the null p-values \(F_j(x)\). If the null p-values are U(0,1) distributed then \(F_{m_0 j}(x)\) will be close to the U(0,1) distribution function, \(F(x) = x\). In practice, none of the empirical distribution functions will exactly match the U(0,1) distribution due to random variation.

One approach to determine if the p-values are “close enough” to the U(0,1) distribution is to perform a KS test ( Shorack and Wellner, 1986 ) using the statistic \( D_{m_0 j} = \sup_x | F_{m_0 j}(x) - x | \) (see also Supplementary Figure S1). Based on this statistic we can calculate a KS test p-value for each simulated study. Under the null hypothesis the KS tests’ p-values will also be U(0,1) distributed. We can then calculate a second KS test statistic based on the empirical distribution of the first stage KS test p-values. If the original test-specific null p-values are U(0,1) distributed, then this double KS test p-value will be large and if not then it will be small. Repeating the KS test across a range of simulated data sets permits us to quantify variation around the U(0,1) distribution. Replication also reduces the potential for getting lucky and picking a single simulated study where the method in question excels.
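A compact R sketch of this two-stage diagnostic (assuming a matrix of null p-values with tests in rows and simulated studies in columns):

```r
# Sketch: double Kolmogorov-Smirnov diagnostic
double_ks <- function(null_pmat) {
  # first stage: one KS test against U(0,1) per simulated study
  ks_p <- apply(null_pmat, 2, function(p) ks.test(p, "punif")$p.value)
  # second stage: KS test on the study-level KS p-values
  ks.test(ks_p, "punif")$p.value                # large values are consistent with the JNC
}
# e.g. double_ks(pmat[301:700, ])
```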

Note that it is possible to consider metrics less stringent than the supremum norm on which the KS test is based. There are a variety of ways in which a metric based on \( | F_{m_0 j}(x) - x | \) over the range \(0 \leq x \leq 1\) can be calculated.

4.2. Bayesian Posterior Probability

A second approach we propose for evaluating the joint distribution of the null p-values is to estimate the posterior probability that the JNC holds given the sets of m p-values across the B simulated studies. To calculate this posterior probability, we assume that the observed null p-values are drawn from a flexible class of distributions. For example, we assume the null p-values are a sample from a Beta( α, β ) distribution, where ( α, β ) ∈ [0, A ] × [0, B ]. Supplementary Figure S2 shows examples of the density functions for a range of values of ( α, β ). The Beta family is used because Beta distributions closely mimic the behavior of non-null p-values observed in practice ( Pounds and Morris, 2003 ). For example, if α = 1 and β > 1 then the corresponding Beta density function is strictly decreasing between 0 and 1, which is typical of the distribution of p-values from differentially expressed genes in a microarray experiment.

Our approach assigns prior probability 1/2 that the p-values are jointly U(0,1) (i.e., the JNC holds), equivalent to a Beta distribution with \(\alpha = \beta = 1\), and prior probability 1/2 that the p-values follow a Beta distribution where either \(\alpha \neq 1\) or \(\beta \neq 1\). We write \(\{p_{ij}\}\) as shorthand for the entire set of simulated null p-values, \(\{p_{ij};\ i = 1, \ldots, m_0,\ j = 1, \ldots, B\}\). From Bayes' Theorem we can calculate the posterior probability that the JNC holds as follows:

The first component is calculated as:

The second component can be calculated by integrating over the other values of ( α , β ):
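(The displayed formulas are sketched here as a hedged reconstruction from the surrounding description; the paper's exact notation may differ.) With prior mass 1/2 on the uniform Beta(1,1) model, Bayes' Theorem gives

\[ \Pr\left(\text{JNC} \mid \{p_{ij}\}\right) = \frac{\tfrac{1}{2}\Pr\left(\{p_{ij}\} \mid \alpha = \beta = 1\right)}{\tfrac{1}{2}\Pr\left(\{p_{ij}\} \mid \alpha = \beta = 1\right) + \tfrac{1}{2}\Pr\left(\{p_{ij}\} \mid (\alpha, \beta) \neq (1,1)\right)}. \]

The first component is the likelihood of the null p-values under the U(0,1) model, whose density is identically one:

\[ \Pr\left(\{p_{ij}\} \mid \alpha = \beta = 1\right) = \prod_{j=1}^{B} \prod_{i=1}^{m_0} 1 = 1. \]

The second component integrates the Beta likelihood against the prior:

\[ \Pr\left(\{p_{ij}\} \mid (\alpha, \beta) \neq (1,1)\right) = \int\!\!\int \prod_{j=1}^{B} \prod_{i=1}^{m_0} f(p_{ij}; \alpha, \beta)\, \pi_0(\alpha, \beta)\, d\alpha\, d\beta, \]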

where π 0 ( α , β ) is the prior distribution for specific values of ( α , β ). In the examples that follow, we utilize independent U(0,1) priors on both α and β , but more informative prior choices could be used to emphasize specific potential alternatives. For example, weighting the prior toward values with α < 1 and β > 1 would emphasize distributions that are stochastically smaller than the U(0,1) distribution and typically occur under the alternative.

5. Applications of the Joint Null Criterion

We apply the proposed JNC and diagnostic tests to assess the behavior of methods for two important challenges in multiple hypothesis testing: (1) addressing multiple testing dependence and (2) determining the validity of pooled null distributions. Methods have been developed for both of these issues in multiple testing, but there has not been a standard approach for evaluating whether the resulting significance measures have desirable variability properties.

5.1. Multiple Testing Dependence

Multiple testing dependence is a common problem in the analysis of high-dimensional data such as those obtained from genomics ( Leek and Storey, 2007 ) or imaging experiments ( Schwartzman, Dougherty, and Taylor, 2008 ). Multiple testing dependence has frequently been defined as a type of stochastic dependence among p-values or one-dimensional test-statistics when performing multiple tests ( Yekutieli and Benjamini, 1999 , Benjamini and Yekutieli, 2001 , Efron, 2004 , 2007 ). More recently, the root source of this type of dependence has been identified and addressed as dependence among the data for the tests ( Leek and Storey, 2008 ). It has also been shown that regardless of the dependence structure, dependence in the feature level data can always be parameterized by a low dimensional set of variables (or factors) called a dependence kernel ( Leek and Storey, 2008 ).

Three different approaches for addressing multiple testing dependence are: surrogate variable analysis ( Leek and Storey, 2007 , 2008 ), residual factor analysis for multiple testing dependence ( Friguet, Kloareg, and Causer, 2009 ), and the empirical null ( Efron, 2004 ) as applied to multiple testing dependence ( Efron, 2007 ). Surrogate variable analysis is an approach that performs a supervised factor analysis of the data during the modeling process, before one dimensional summaries such as p-values have been calculated. Residual factor analysis for multiple testing dependence is a reformulation of this approach where the estimated factors are required to be orthogonal to the class variable. The empirical null distribution is calculated based on the observed values of the test statistics. The basic idea is to estimate a null distribution based on the “null part” of the observed distribution where the null statistics are assumed to lie. We note that the empirical null method as a general approach ( Efron, 2004 , 2007 ) has not been subjected to simulations where the correct answer is known, so its accuracy and general operating characteristics are heretofore unexplored.

It is often the case that the data for multiple tests from high-throughput experiments are dependent. One example of this type of dependence which is common in both microarray and imaging experiments is dependence due to latent or unmodeled factors ( Leek and Storey, 2007 , 2008 ). To mimic this type of dependence in our simulated data, we generate the observations for test i from the model \(x_i = b_{0i} + b_{1i} y + b_{2i} z + \varepsilon_i\), where \(z\) is a second latent variable that affects the data for multiple tests and each \(z_j\) is Bernoulli with probability 0.5. Under this model we let \(b_{1i} \neq 0\) for \(i = 1, \ldots, 500\) and \(b_{1i} = 0\) for \(i = 501, \ldots, 1000\) as before, but \(b_{2i} \neq 0\) for \(i = 300, \ldots, 800\) and \(b_{2i} = 0\) otherwise. We first test the null hypothesis that \(b_{1i} = 0\) including the variable \(z\), even though in general it will not be known to the researcher. In Figure 5a the quantile-quantile plots for the null p-values indicate that the p-values approximately follow the U(0,1) distribution. Correspondingly, the double KS p-value is 0.446 and the median posterior probability of the JNC holding (25th–75th percentile) is 0.967 (0.928, 0.978).
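A rough R sketch of this dependent simulation and of the analysis that adjusts for the true latent variable follows; the effect sizes and the sample size are illustrative assumptions.

```r
# Sketch: dependence induced by a latent binary variable z
set.seed(1)
m <- 1000; n <- 20
y <- rep(c(0, 1), each = n / 2)                   # primary variable
z <- rbinom(n, 1, 0.5)                            # latent variable, unknown in practice
b1 <- c(rep(1, 500), rep(0, 500))                 # true signal for tests 1-500
b2 <- numeric(m); b2[300:800] <- 1                # latent signal for tests 300-800
X  <- b1 %o% y + b2 %o% z + matrix(rnorm(m * n), m, n)

# test H0: b1 = 0, adjusting for the true latent variable z
p_adj <- apply(X, 1, function(x) summary(lm(x ~ y + z))$coefficients["y", 4])
# the same test ignoring z (unadjusted analysis)
p_raw <- apply(X, 1, function(x) summary(lm(x ~ y))$coefficients["y", 4])
qqplot((1:500 - 0.5) / 500, p_adj[501:1000],
       xlab = "U(0,1) quantiles", ylab = "null p-values"); abline(0, 1)
```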

Figure 5. Quantile-quantile plots of the joint distribution of null p-values from 100 simulated studies when the hypothesis tests are dependent. Results when utilizing: a. the true latent variable adjustment, b. surrogate variable analysis, c. empirical null adjustment, and d. residual factor analysis.

Next we apply each of the methods for addressing dependence based on the default R code provided by the authors. The surrogate variable analysis ( Leek and Storey, 2007 , 2008 ) and residual factor analysis for multiple testing dependence ( Friguet et al., 2009 ) methods result in additional covariates that are included in the model when testing b 1 i = 0. The empirical null approach adjusts the p-values directly based on the observed test statistics. Figure 5 shows the quantile-quantile plots for the adjusted null p-values using each of these methods and Table 1 gives the resulting double KS p-values and posterior probabilities of the JNC holding.

Table 1. The posterior probability distribution and the double KS test p-value assessing whether the JNC holds for each method adjusting for multiple testing dependence. Correctly Adjusted = adjusted for the true underlying latent variable, SV = surrogate variable analysis, EN = empirical null, and RF = residual factor analysis.

The surrogate variable adjusted p-values ( Figure 5b ) behave for the most part like the correctly adjusted p-values in Figure 5a , with the exception of a small number of cases, where the unmodeled variable is nearly perfectly correlated with the group difference. The resulting posterior probability estimates are consistently near 1; however, the double KS p-value is sensitive to the small number of outlying observations.

The empirical null adjustment shows a strong conservative bias, which results in a loss of power ( Figure 5c ). The reason appears to be that the estimated empirical null is often too wide due to the extreme statistics from the dependence structure. Since the one-dimensional summary statistics conflate signal and noise, it is generally impossible to estimate the null distribution well in the case of dependent data. It has been recommended that the empirical null be employed only when the proportion of truly null hypotheses is greater than 0.90, potentially because of this behavior. Under this assumption, the null p-values are somewhat closer to U(0,1), but still show strong deviations in many cases ( Table 1 ). This indicates the empirical null may be appropriate in limited scenarios when only a small number of tests are truly alternative, such as in genome-wide association studies as originally suggested by Devlin and Roeder (1999) – but not for typical microarray, sequencing, or brain imaging studies.

The residual factor analysis adjusted null p-values, where the factors are required to be orthogonal to the group difference, show strong anti-conservative bias ( Figure 5d ). The reason is that the orthogonally estimated factors do not account for potential confounding between the tested variable and the unmodeled variable. However, when the unmodeled variable is nearly orthogonal to the group variable by chance, this approach behaves reasonably well and so the 75th percentile of the posterior probability estimates is 0.810.

Supplementary Figures S3 and S4 show the estimates of the FDR and the proportion of true nulls calculated for the same simulated studies. Again, the estimates using the correct model and surrogate variable analysis perform similarly, while the empirical null estimates are conservatively biased and the residual factor analysis p-values are anti-conservatively biased. For comparison purposes, Supplementary Figure S5 shows the behavior of the unadjusted p-values and their corresponding false discovery rate estimates. It can be seen that since surrogate variables analysis satisfies the JNC, it produces false discovery rate estimates with a variance and a conservative bias close to the correct adjustment. However, the empirical null adjustment and residual factor analysis produce substantially biased estimates. The unadjusted analysis produces estimates with a similar expected value to the correct adjustment, although the variances are very large.

Another way to view this analysis is to consider the sensitivity and specificity of each approach. The ROC curves for each of the four proposed methods are shown in Supplementary Figure S6 . The approaches that pass the JNC criteria - the correctly adjusted analysis and the surrogate variable adjusted analysis - have similarly high AUC values, while the approaches that do not pass the JNC (residual factor analysis and empirical null) have much lower AUC values. This suggests that another property of the JNC is increased sensitivity and specificity of multiple testing procedures.

5.2. Pooled Null Distributions

A second challenge encountered in large-scale multiple testing in genomics is in determining whether it is valid to form an averaged (called “pooled”) null distribution across multiple tests. Bootstrap and permutation null distributions are common for high-dimensional data, where parametric assumptions may be difficult to verify. It is often computationally expensive to generate enough null statistics to make test-specific empirical p-values at a fine enough resolution. (This requires at least as many resampling iterations as there are tests.) One proposed solution is to pool the resampling based null statistics across tests when forming p-values or estimating other error rates ( Tusher et al., 2001 , Storey and Tibshirani, 2003 ). By pooling the null statistics, fewer bootstrap or permutation samples are required to achieve a fixed level of precision in estimating the null distribution. The underlying assumption here is that averaging across all tests’ null distributions yields a valid overall null distribution. This approach has been criticized based on the fact that each p-value’s marginal null distribution may not be U(0,1) ( Dudoit, Shaffer, and Boldrick, 2003 ). However, the JNC allows for this criticism to be reconsidered by considering the joint distribution of pooled p-values.

Consider the simulated data from the previous subsection, where \(x_i = b_{0i} + b_{1i} y + \varepsilon_i\). Suppose that \(b_{1i} \neq 0\) for \(i = 1, \ldots, 500\), \(b_{1i} = 0\) for \(i = 501, \ldots, 1000\), and \(\mathrm{Var}(\varepsilon_{ij}) \sim \mathrm{InverseGamma}(10, 9)\). Suppose \(y_j = 1\) for \(j = 1, 2, \ldots, n/2\) and \(y_j = 0\) for \(j = n/2 + 1, \ldots, n\). We can apply the t-statistic to quantify the difference between the two groups for each test. We compute p-values in one of two ways. First we permute the labels of the samples and recalculate null statistics based on the permuted labels. The p-value is the proportion of permutation statistics that is larger in absolute value than the observed statistic.

A second approach to calculating the null statistics is with the bootstrap. To calculate bootstrap null statistics, we fit the model \(x_i = b_{0i} + b_{1i} y + \varepsilon_i\) by least squares and calculate residuals \(r_i = x_i - \hat{b}_{0i} - \hat{b}_{1i} y\). We calculate a null model fit using the model \(x_i = b^0_{0i} + \varepsilon_i\), sample with replacement from the residuals \(r_i\) to obtain bootstrapped residuals \(r_i^*\), rescale the bootstrapped residuals to have the same variance as the original residuals, and add the bootstrapped residuals to the null model fit to obtain null data \(x_i^* = \hat{b}^0_{0i} + r_i^*\). The p-value is the proportion of bootstrap statistics that is larger in absolute value than the observed statistic. This is the bootstrap approach employed in Storey, Dai, and Leek (2007).
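The rescaled residual bootstrap just described can be sketched in R as follows; this is an illustrative reading of the description above, with `grp` playing the role of y.

```r
# Sketch: rescaled residual bootstrap to generate null data for each test
bootstrap_null_data <- function(X, grp) {
  fit1  <- t(apply(X, 1, function(x) fitted(lm(x ~ grp))))  # unrestricted (full model) fit
  r     <- X - fit1                                         # full-model residuals
  fit0  <- matrix(rowMeans(X), nrow(X), ncol(X))            # null model fit: intercept only
  rstar <- t(apply(r, 1, sample, replace = TRUE))           # resample residuals within each test
  s     <- apply(r, 1, sd) / apply(rstar, 1, sd)            # rescale to the original residual spread
  fit0 + rstar * s                                          # null data x* = b0_hat + rescaled r*
}
```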

We considered two approaches to forming resampling based p-values: (1) a pooled null distribution, where the resampling based null statistics from all tests are used in calculating the p-value for test i, and (2) a test-specific null distribution, where only the resampling based null statistics from test i are used in calculating the p-value for test i. Table 2 shows the results of these analyses for all four scenarios with the number of resampling iterations set to B = 200. The pooled null outperforms the marginal null because the marginal null is granular, due to the relatively small number of resampling iterations. The pooling strategy is effective because the t-statistic is a pivotal quantity, so its distribution does not depend on unknown parameters. In this case, the test-specific null distribution can reasonably be approximated by the joint null distribution that comes from pooling all of the null statistics.
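Given a vector of observed statistics and a matrix of resampling null statistics (the names `obs` and `nullmat` are assumptions for this sketch), the two ways of forming p-values differ only in which null statistics each test is compared against:

```r
# Sketch: pooled versus test-specific resampling p-values
# obs: observed statistics (length m); nullmat: m x B matrix of null statistics
resampling_pvalues <- function(obs, nullmat) {
  pooled   <- sapply(abs(obs), function(s) mean(abs(nullmat) >= s))        # all tests' nulls
  specific <- sapply(seq_along(obs),
                     function(i) mean(abs(nullmat[i, ]) >= abs(obs[i])))   # test i's nulls only
  data.frame(pooled = pooled, specific = specific)
}
```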

Table 2. The posterior probability distribution and the double KS test p-value assessing whether the JNC holds for each type of permutation or bootstrap analysis.

Many statistics developed for high-dimensional testing that borrow information across tests are not pivotal. Examples of non-pivotal statistics include those from SAM ( Tusher et al., 2001 ), the optimal discovery procedure ( Storey et al., 2007 ), variance shrinkage ( Cui et al., 2005 ), empirical Bayes methods ( Efron et al., 2001 ), limma ( Smyth, 2004 ), and Bayes methods ( Gottardo, Pannuci, Kuske, and Brettin, 2003 ). As an example, to illustrate the behavior of non-pivotal statistics under the four types of null distributions we focus on the optimal discovery procedure (ODP) statistics. The ODP is an extension of the Neyman-Pearson paradigm to tests of multiple hypotheses ( Storey, 2007 ). If \(m\) tests are being performed, of which \(m_0\) are null, the ODP statistic for the data \(x_i\) for test \(i\) is given by \[ S_{\text{ODP}}(x_i) = \frac{\sum_{j=m_0+1}^{m} f_{1j}(x_i)}{\sum_{j=1}^{m_0} f_{0j}(x_i)}, \] where \(f_{1j}\) is the density under the alternative and \(f_{0j}\) is the density under the null for test \(j\). When testing for differences between group A and group B, an estimate of the ODP test statistic can be formed using the Normal probability density function, \(\phi(\cdot\,; \mu, \sigma^2)\):

The ODP statistic is based on the estimates of the mean and variance for each test under the null hypothesis model restrictions, \((\hat{\mu}_{0i}, \hat{\sigma}^2_{0i})\), and unrestricted, \((\hat{\mu}_{Ai}, \hat{\sigma}^2_{Ai}, \hat{\mu}_{Bi}, \hat{\sigma}^2_{Bi})\). The data for each test \(x_j\) are substituted into the density estimated from each of the other tests. Like variance shrinkage, empirical Bayes, and Bayesian statistics, the ODP statistic is not pivotal since the distribution of the statistic depends on the parameters for all of the tests being performed.
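A rough R sketch of such an estimated ODP statistic for a two-group comparison is given below. It is a simplified illustration, not the implementation of Storey et al. (2007): for clarity the numerator sums the unrestricted (alternative) density estimates over all tests rather than only the true alternatives, no weighting is used, and the computation is O(m^2), so it is slow for large m.

```r
# Sketch: estimated ODP statistic using Normal densities (simplified)
odp_stat <- function(X, grp) {
  m <- nrow(X)
  muA <- rowMeans(X[, grp == 1]); s2A <- apply(X[, grp == 1], 1, var)  # unrestricted, group A
  muB <- rowMeans(X[, grp == 0]); s2B <- apply(X[, grp == 0], 1, var)  # unrestricted, group B
  mu0 <- rowMeans(X);             s20 <- apply(X, 1, var)              # restricted (null) fit
  sapply(seq_len(m), function(i) {
    x <- X[i, ]
    # likelihood of test i's data under every test's fitted alternative and null densities
    f1 <- sapply(seq_len(m), function(j)
      prod(dnorm(x[grp == 1], muA[j], sqrt(s2A[j]))) *
      prod(dnorm(x[grp == 0], muB[j], sqrt(s2B[j]))))
    f0 <- sapply(seq_len(m), function(j) prod(dnorm(x, mu0[j], sqrt(s20[j]))))
    sum(f1) / sum(f0)
  })
}
```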

We used the ODP statistics instead of the t-statistics under the four types of null distributions; the results appear in Table 2. With a non-pivotal statistic, pooling the permutation statistics results in non-uniform null p-values. The variance of the permuted data for the truly alternative tests is much larger than the variance for the null tests, resulting in bias. The test-specific null works reasonably well under permutation, since the null statistics for the alternative tests are not compared to the observed statistics for the null tests. The bootstrap corrects the bias, since the residuals are resampled under the alternative and adjusted to have the same residual variance as the original data. The bootstrap test-specific null distribution yields granular p-values, causing the Bayesian diagnostic to be unfavorable but yielding a favorable result from the double KS test. The pooled bootstrap null distribution meets the JNC in terms of both diagnostic criteria. These results suggest that non-pivotal high-dimensional statistics that employ permutations for calculating null statistics may result in non-uniform p-values when the null statistics are pooled, but those that employ variance adjusted bootstrap pooled distributions meet the JNC. It should be noted that Storey et al. (2007) prescribe using the pooled bootstrap null distribution as implemented here; the permutation null distribution is not advocated.

Our results suggest that the double KS test may be somewhat sensitive to outliers, suggesting that it may be most useful when strict adherence to the JNC is required from a multiple testing procedure. Meanwhile, the Bayesian approach is sensitive to granular p-value distributions commonly encountered with permutation tests using a small sample, suggesting it may be more appropriate for evaluating parametric tests or high-dimensional procedures that pool null statistics.

6. Discussion

Biological data sets are rapidly growing in size and the field of multiple testing is experiencing a coordinated burst of activity. Existing criteria for evaluating these procedures were developed in the context of single hypothesis testing. Here we have proposed a new criterion based on evaluating the joint distribution of the null statistics or p-values. Our criterion is more stringent than requiring strong control of specific error rates, but flexible enough to deal with the type of multiple testing procedures encountered in practice. When the Joint Null Criterion is met, we have shown that standard error rates can be precisely and accurately controlled. We have proposed frequentist and Bayesian diagnostics for evaluating whether the Joint Null Criterion has been satisfied in simulated examples. Although these diagnostics cannot be applied to real data examples, they can be a useful tool for diagnosing multiple testing procedures when they are proposed and evaluated on simulated data. Here we focused on two common problems in multiple testing that arise in genomics; however, our criterion and diagnostic tests can be used to evaluate any multiple testing procedure to ensure that its p-values satisfy the JNC and result in precise error rate estimates.

Supplementary Materials

Supplemental material for the manuscript.

Contributor Information

Jeffrey T. Leek, Johns Hopkins Bloomberg School of Public Health.

John D. Storey, Princeton University.

  • Benjamini Y, Hochberg Y. “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” J Roy Stat Soc B. 1995; 57:289–300.
  • Benjamini Y, Yekutieli D. “The control of the false discovery rate in multiple testing under dependency,” Ann Stat. 2001; 29:1165–88.
  • Calian V, Li D, Hsu J. “Partitioning to uncover conditions for permutation test to control multiple testing error rate,” Biometrical Journal. 2008; 50:756–766. doi: 10.1002/bimj.200710471.
  • Cui X, Hwang JTG, Qiu J, Blades NJ, Churchill GA. “Improved statistical tests for differential gene expression by shrinking variance components estimates,” Biostatistics. 2005; 6:59–75. doi: 10.1093/biostatistics/kxh018.
  • Devlin B, Roeder K. “Genomic control for association studies,” Biometrics. 1999; 55:997–1004. doi: 10.1111/j.0006-341X.1999.00997.x.
  • Dudoit S, Shaffer JP, Boldrick JC. “Multiple hypothesis testing in microarray experiments,” Statistical Science. 2003; 18:71–103. doi: 10.1214/ss/1056397487.
  • Dudoit S, van der Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer; 2008.
  • Efron B. “Large-scale simultaneous hypothesis testing: The choice of a null hypothesis,” J Am Stat Assoc. 2004; 99:96–104. doi: 10.1198/016214504000000089.
  • Efron B. “Correlation and large-scale simultaneous significance testing,” J Am Stat Assoc. 2007; 102:93–103. doi: 10.1198/016214506000001211.
  • Efron B, Tibshirani R, Storey JD, Tusher V. “Empirical Bayes analysis of a microarray experiment,” Journal of Computational Biology. 2001; 96:1151–60.
  • Friguet C, Kloareg M, Causer D. “A factor model approach to multiple testing under dependence,” Journal of the American Statistical Association. 2009. To appear.
  • Gottardo R, Pannuci JA, Kuske CR, Brettin T. “Statistical analysis of microarray data: a Bayesian approach,” Biostatistics. 2003; 4:597–620. doi: 10.1093/biostatistics/4.4.597.
  • Huang Y, Xu H, Calian V, Hsu J. “To permute or not to permute,” Bioinformatics. 2006; 22:2244–2248. doi: 10.1093/bioinformatics/btl383.
  • Idaghdour Y, Storey JD, Jadallah S, Gibson G. “A genome-wide gene expression signature of lifestyle in peripheral blood of Moroccan Amazighs,” PLoS Genetics. 2008; 4:e1000052. doi: 10.1371/journal.pgen.1000052.
  • Leek JT, Storey JD. “Capturing heterogeneity in gene expression studies by surrogate variable analysis,” PLoS Genetics. 2007; 3:e161. doi: 10.1371/journal.pgen.0030161.
  • Leek JT, Storey JD. “A general framework for multiple testing dependence,” Proc Natl Acad Sci USA. 2008; 105:18718–18723. doi: 10.1073/pnas.0808709105.
  • Lehmann EL. Testing Statistical Hypotheses. Springer; 1997.
  • Newton MA, Noueiry A, Sarkar D, Ahlquist P. “Detecting differential gene expression with a semiparametric hierarchical mixture method,” Biostatistics. 2004; 5:155–76. doi: 10.1093/biostatistics/5.2.155.
  • Owen A. “Variance of the number of false discoveries,” J Roy Stat Soc B. 2005; 67:411–26. doi: 10.1111/j.1467-9868.2005.00509.x.
  • Pounds S, Morris SW. “Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values,” Bioinformatics. 2003; 19:1236–1242. doi: 10.1093/bioinformatics/btg148.
  • Schwartzman A, Dougherty RF, Taylor J. “False discovery rate analysis of brain diffusion direction maps,” Ann Appl Stat. 2008; 2:153–175. doi: 10.1214/07-AOAS133.
  • Shaffer JP. “Multiple hypothesis testing,” Annu Rev Psychol. 1995; 46:561–84. doi: 10.1146/annurev.ps.46.020195.003021.
  • Shorack GR, Wellner JA. Empirical Processes with Applications to Statistics. Wiley; 1986.
  • Smyth GK. “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments,” Statistical Applications in Genetics and Molecular Biology. 2004; 1:3.
  • Storey JD. “A direct approach to false discovery rates,” J Roy Stat Soc B. 2002; 64:479–98.
  • Storey JD. “The optimal discovery procedure: A new approach to simultaneous significance testing,” J Roy Stat Soc B. 2007; 69:347–68. doi: 10.1111/j.1467-9868.2007.005592.x.
  • Storey JD, Dai JY, Leek JT. “The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments,” Biostatistics. 2007; 8:414–32. doi: 10.1093/biostatistics/kxl019.
  • Storey JD, Taylor JE, Siegmund D. “Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach,” J Roy Stat Soc B. 2004; 66:187–205. doi: 10.1111/j.1467-9868.2004.00439.x.
  • Storey JD, Tibshirani R. “Statistical significance for genome-wide studies,” Proc Natl Acad Sci USA. 2003; 100:9440–9445. doi: 10.1073/pnas.1530509100.
  • Tusher VG, Tibshirani R, Chu G. “Significance analysis of microarrays applied to the ionizing radiation response,” Proc Natl Acad Sci USA. 2001; 98:5116–21. doi: 10.1073/pnas.091062498.
  • Xu H, Hsu J. “Using the partitioning principle to control the generalized family error rate,” Biometrical Journal. 2007; 49:52–67. doi: 10.1002/bimj.200610307.
  • Yekutieli D, Benjamini Y. “Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics,” J Statist Plan Inf. 1999; 82:171–96. doi: 10.1016/S0378-3758(99)00041-5.


Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.


For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p-value. This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p-value. This means it is likely that any difference you measure between groups is due to chance.
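
To make this concrete, here is a small R sketch (R being the language used elsewhere in this document) with simulated data; the group labels and all numbers are invented purely for illustration:

    set.seed(42)

    # Two simulated groups with clearly different means: little overlap between groups,
    # so the between-group variance dominates the within-group variance
    separated <- data.frame(
      value = c(rnorm(30, mean = 10, sd = 2), rnorm(30, mean = 20, sd = 2)),
      group = rep(c("A", "B"), each = 30)
    )
    summary(aov(value ~ group, data = separated))    # large F statistic, very small p-value

    # The same setup with identical group means: differences are mostly within-group noise
    overlapping <- data.frame(
      value = c(rnorm(30, mean = 10, sd = 2), rnorm(30, mean = 10, sd = 2)),
      group = rep(c("A", "B"), each = 30)
    )
    summary(aov(value ~ group, data = overlapping))  # F near 1, p-value typically large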

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data.

In the height example, a statistical test comparing the two group means (such as a two-sample t test) will give you:

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true (see the R sketch below).
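
A minimal R sketch of this step for the height example, using simulated heights; all numbers are made up for illustration:

    set.seed(1)
    men   <- rnorm(50, mean = 178, sd = 7)   # hypothetical heights in cm
    women <- rnorm(50, mean = 165, sd = 7)

    # One-sided two-sample t test; Ha is that men are, on average, taller than women
    height_test <- t.test(men, women, alternative = "greater")

    height_test$estimate                                        # the two sample means
    unname(height_test$estimate[1] - height_test$estimate[2])   # estimated difference (men minus women)
    height_test$p.value                                         # p-value under the null of no difference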

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p-value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis (Type I error).
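
Continuing the sketch above, the decision step is simply a comparison of the p-value with the predetermined significance level:

    alpha <- 0.05                          # or 0.01 for a more conservative test
    if (height_test$p.value < alpha) {
      "reject the null hypothesis"
    } else {
      "fail to reject the null hypothesis"
    }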


The results of hypothesis testing will be presented in the results and discussion sections of your research paper, dissertation or thesis.

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p-value). In the discussion, you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis. This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis. But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

Statistics

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing. The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. https://www.scribbr.com/statistics/hypothesis-testing/

Hypothesis testing in lab and on the slopes

By JUSTINA MIAO | February 20, 2024

CHAVAL BRASIL / CC BY-NC-ND 2.0

Miao shares her discovery of the parallels between two of her interests, research and skiing.

As I stood at the top of a ski slope in a terrain park, I looked down upon the 20-foot jump that my friends and I wanted to hit. One critical question arose in my head: How fast should we hit the jump? 

For those of you who didn’t grow up surrounded by snow or haven’t tried out park skiing, hitting a jump in the park has two important caveats — you should never overshoot or undershoot. Hitting it too fast will lead to overshooting, which can send you flying past the landing and result in knee pain or even an anterior cruciate ligament (ACL) injury. Hitting it too slowly will lead to undershooting, which means landing too early and slamming onto the “knuckle.” This can hurt a lot, trust me. Determined to find the sweet spot and make it back home without any knee pain, I asked the skier next to me.

“Just send it at a medium-ish pace,” he replied.

The rising scientist in me knew that the answer wasn’t so simple. From my experiences in research, I knew that I shouldn’t leave the fate of my knees up to chance. Instead, I should understand the science that justifies the optimal takeoff speed for a safe landing.

How did I find an answer to this question? Let me share my research experience at Hopkins with you.

In the summer after freshman year, I joined Xinzhong Dong’s lab , where we have been investigating Mas-related G-protein coupled receptors . With no prior wet lab experience, I was really intimidated by the thought of entering into a space dedicated to scientific research. 

However, after spending around half a year in the lab, I realized that research is not that daunting. The research process can actually be found in everyday life — including skiing. 

Research is precise. To generate reliable and accurate data, each step must be carefully controlled to isolate the effects you aim to measure. Furthermore, the process should be reproducible, yielding the same results each time. The need for precision in scientific research leaves little room for human error, demanding meticulousness and patience.

How did I apply this to the slopes? Through observing other skiers and reflecting on my past experiences, I hypothesized that speed was important for a successful landing. I quantified it as the number of turns I made before hitting the jump, which is easy to measure across multiple trials. After observing other skiers, I refined my hypothesis: the optimal speed to hit the jump would come from two small turns before fully sending it to the takeoff.

Research can be high risk sometimes. The samples you are working with can be the accumulated result of months of work. One slip or addition of the wrong reagent means that all of that work might need to be redone. Therefore, it is always important to be careful when running experiments, because scientific resources are limited and valuable.

Nevertheless, taking risks can still be crucial in research. Like my favorite catchphrase in skiing — “Full send or nothing” — once you think you have sorted everything out, go for it. It can be daunting to pipette your first reagents into cell samples, and running an experiment independently is definitely not a small feat. But if you never take the first step, you will never be able to perform an experiment or make any new findings!

Similar to research, skiing also requires taking risks. After refining my hypothesis, it was time for me to test whether two small turns preceding the jump would result in a successful landing. Although I was risking the possibility of hurting my knees, I would never know the feasibility of my hypothesis until I experimented with it myself. 

Research happens in a collaborative environment that encourages collective efforts. In fact, the crucial point of research is to build on and contribute to an existing pool of literature. Researchers communicate with each other all the time through conferences, journal publishing or just a casual lunchroom conversation. Researchers must elaborate on scientific findings and articulate their implications in a way that is accessible to a broad audience. After all, research leads to new information and spreads these findings to expand human knowledge, and none of that can be achieved by a single person.

The ski park is also a collaborative space like the research community. After successfully landing my jump off two small turns, I knew that my hypothesis-testing process could help my friends hit that jump. Therefore, I shared my experience to help them find the right speed to hit the jump.

Through my research experience at Hopkins, I have not only debunked my initial hesitations about scientific research but also gained a greater appreciation for scientists’ tremendous contribution to expanding our current understanding of the world. I also learned that the processes and skills I developed in research are omnipresent throughout our everyday lives. So perhaps my thoughts and reflection can inspire you to also get involved in research or to employ what you have learned in your research experience in your everyday life! 

Research on the Record spotlights undergraduate students involved in STEM research at Hopkins. The goal of the column is to share reflections on the highs and lows that Hopkins students experience in their contributions to undergraduate research. If you are an undergraduate researcher interested in being profiled, reach out to [email protected] .



COMMENTS

  1. 7.3 Joint Hypothesis Testing using the F-Statistic

    A joint hypothesis imposes restrictions on multiple regression coefficients. This is different from conducting individual t t -tests where a restriction is imposed on a single coefficient. Chapter 7.2 of the book explains why testing hypotheses about the model coefficients one at a time is different from testing them jointly.

  2. Joint Hypotheses Testing

    Joint Hypotheses Testing, 17 Dec 2022. The intercept in simple regression represents the expected value of the dependent variable when the independent variable is zero, while in multiple regression it is the expected value when all independent variables are zero.

  3. 8.5 Joint Hypothesis Tests

    Joint hypothesis tests consider a stated null involving multiple PRF coefficients simultaneously. Consider the following general PRF: \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i\). A simple hypothesis test such as \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\) …

  4. PDF Multiple Hypothesis Testing: The F-test

    Hypotheses involving multiple regression coefficients require a different test statistic and a different null distribution. We call the test statistic F0 and its null distribution the F-distribution, after R.A. Fisher (we call the whole test an F-test, similar to the t-test).

  5. PDF Joint hypotheses

    Joint hypotheses. The null and alternative hypotheses can usually be interpreted as a restricted model and an unrestricted model. Note that if the unrestricted model “fits” significantly better than the restricted model, we should reject the null.

  6. Introductory Econometrics Chapter 17: F Tests

    Chapter 17: Joint Hypothesis Testing Chapter 16 shows how to test a hypothesis about a single slope parameter in a regression equation. This chapter explains how to test hypotheses about more than one of the parameters in a multiple regression model.

  7. Joint hypothesis problem

    The joint hypothesis problem is the problem that testing for market efficiency is difficult, or even impossible. Any attempts to test for market (in)efficiency must involve asset pricing models so that there are expected returns to compare to real returns.

  8. Joint hypothesis tests

    Joint hypothesis tests in regression: setting up and interpreting the F test. This video screencast was created with Doceri on an iPad.

  9. Joint hypothesis test

    A joint hypothesis test is an F-test to evaluate nested models, which consist of a full (unrestricted) model and a restricted model. The F-statistic is calculated from the two models’ sums of squared residuals (see the R sketch at the end of this list).

  10. Joint Hypothesis Testing (Chapter 17)

    Introduction Chapter 16 shows how to test a hypothesis about a single slope parameter in a regression equation. This chapter explains how to test hypotheses about more than one of the parameters in a multiple regression model.

  11. PDF 7 Joint Hypothesis Tests

    A joint hypothesis specifies a value (imposes a restriction) for two or more coefficients. Use q to denote the number of restrictions (q = 2 for the first example, q = 3 for the second example). F-tests can be used for model selection: which variables should we leave out of the model? If variables are insignificant, we might want to drop them from the model.

  12. Joint Hypothesis Testing and Gatekeeping Procedures for Stud ...

    Joint hypothesis testing and gatekeeping procedures are shown to substantially improve the efficiency and interpretation of randomized and nonrandomized studies having multiple outcomes of interest. Comparative efficacy or effectiveness studies frequently have more than one primary outcome, and often several secondary.

  13. Testing EMH: The Joint Hypothesis Problem

    Testing EMH: The Joint Hypothesis Problem Hypotheses cannot be proven. They can only be disproved. As Taleb reminds us, even with hundreds of thousands of white swan sightings and no black swan sightings, it was never possible to prove the statement "all swans are white."

  14. PDF ACE 564 Spring 2006

    Joint Hypothesis Tests: F-Test Approach. The correct approach to testing a joint hypothesis is based on a general version of the F-test. The approach can accommodate any linear hypothesis or set of linear hypotheses, and some of the joint tests can also be conducted using “simple” t-tests.

  15. When to Combine Hypotheses and Adjust for Multiple Tests

    A joint hypothesis test is indicated. The following guideline presents another heuristic to distinguish the need for joint versus separate tests. Guideline 2: If a conclusion would follow from a single hypothesis fully developed, tested, and reported in isolation from other hypotheses, then a single hypothesis test is warranted.

  16. PDF Econ 422

    Hypothesis Tests and Confidence Intervals for a Single Coefficient in Multiple Regression. \((\hat{\beta}_1 - E(\hat{\beta}_1)) / \sqrt{\mathrm{var}(\hat{\beta}_1)}\) is approximately distributed N(0,1) (CLT). Thus hypotheses on \(\beta_1\) can be tested using the usual t-statistic, and confidence intervals are constructed as \(\{\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)\}\). So too for \(\beta_2, \ldots, \beta_k\).

  17. PDF Econometrics

    Joint Hypothesis Testing. For joint hypothesis testing, we use the F-test. Under the null hypothesis, in large samples, the F-statistic has a sampling distribution of \(F_{q,\infty}\), where q is the number of coefficients that you are testing. If the F-statistic is bigger than the critical value, or the p-value is smaller than the chosen significance level, we reject the null hypothesis.

  18. Efficient Markets Hypothesis: Joint Hypothesis

    Efficient Markets Hypothesis: Joint Hypothesis Important paper: Fama (1970) An efficient market will always "fully reflect" available information, but in order to determine how the market should "fully reflect" this information, we need to determine investors' risk preferences.

  19. The Joint Null Criterion for Multiple Hypothesis Tests

    Applications of the Joint Null Criterion: we apply the proposed JNC and diagnostic tests to assess the behavior of methods for two important challenges in multiple hypothesis testing: (1) addressing multiple testing dependence and (2) determining the validity of pooled null distributions.

  20. Hypothesis Testing

    Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories. There are 5 main steps in hypothesis testing:

  21. inference

    Joint hypothesis testing: how to set up the restricted model for equality of more than two coefficients? Say I am running the following regression: \(Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 W + \text{other controls} + \text{error}\)

  22. PDF test

    Description test performs Wald tests of simple and composite linear hypotheses about the parameters of the most recently fit model. test supports svy estimators (see [SVY] svy estimation), carrying out an adjusted Wald test by default in such cases. test can be used with svy estimation results, see [SVY] svy postestimation.

  23. That's not a two-sided test! It's two one-sided tests!

    A non-directional claim often implies two tests of a non-directional joint null hypothesis, and it therefore requires an alpha adjustment to compensate for multiple (dual) testing. In contrast, a directional claim often implies a single test of a directional null hypothesis, in which case it does not require an alpha adjustment.

  24. Hypothesis testing in lab and on the slopes

    Hypothesis testing in lab and on the slopes. Miao shares her discovery of the parallels between two of her interests, research and skiing. As I stood at the top of a ski slope in a terrain park, I looked down upon the 20-foot jump that my friends and I wanted to hit. One critical question arose in my head: How fast should we hit the jump?
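
Drawing several of the snippets above together (for example items 9, 11 and 17), here is a minimal R sketch of a joint F-test of two zero restrictions. The data are simulated placeholders, none of the variable names come from the cited sources, and the car package is assumed to be installed:

    library(car)   # provides linearHypothesis()

    # Simulated placeholder data in which x2 and x3 truly have zero coefficients
    set.seed(123)
    n   <- 200
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
    dat$y <- 1 + 0.5 * dat$x1 + rnorm(n)

    # Unrestricted (full) model and restricted model with the null (beta2 = beta3 = 0) imposed
    unrestricted <- lm(y ~ x1 + x2 + x3, data = dat)
    restricted   <- lm(y ~ x1, data = dat)

    # Homoskedasticity-only F-test by comparing the nested models
    anova(restricted, unrestricted)

    # The same joint test of q = 2 restrictions via linearHypothesis()
    linearHypothesis(unrestricted, c("x2 = 0", "x3 = 0"))

    # Large-sample 5% critical value: F with (q, Inf) degrees of freedom, about 3.00
    qf(0.95, df1 = 2, df2 = Inf)

Both calls report the same F-statistic and p-value; since the null is true in the simulated data, the test will usually fail to reject at the 5% level.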