Statology

Understanding the Null Hypothesis for Logistic Regression

Logistic regression is a type of regression model we can use to understand the relationship between one or more predictor variables and a response variable when the response variable is binary.

If we only have one predictor variable and one response variable, we can use simple logistic regression , which uses the following formula to estimate the relationship between the variables:

log[p(X) / (1-p(X))] = β₀ + β₁X

The formula on the right side of the equation predicts the log odds of the response variable taking on a value of 1.

Simple logistic regression uses the following null and alternative hypotheses:

  • H₀: β₁ = 0
  • Hₐ: β₁ ≠ 0

The null hypothesis states that the coefficient β₁ is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that β₁ is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple logistic regression , which uses the following formula to estimate the relationship between the variables:

log[p(X) / (1-p(X))] = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

Multiple logistic regression uses the following null and alternative hypotheses:

  • H₀: β₁ = β₂ = … = βₖ = 0
  • Hₐ: at least one βⱼ ≠ 0

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple logistic regression and multiple logistic regression models.

Example 1: Simple Logistic Regression

Suppose a professor would like to use the number of hours studied to predict whether students will pass the final exam in his class. He collects data for 20 students and fits a simple logistic regression model.

We can use the following code in R to fit a simple logistic regression model:
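
For instance, a minimal sketch of such a fit, assuming a hypothetical data frame df with an hours predictor and a binary pass outcome (the names and values here are made up for illustration):

```r
# Hypothetical data: hours studied and a binary pass/fail outcome for 20 students
df <- data.frame(
  hours = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11),
  pass  = c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1)
)

# Fit a simple logistic regression of pass on hours
model <- glm(pass ~ hours, data = df, family = "binomial")
summary(model)
```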

To determine if there is a statistically significant relationship between hours studied and passing the exam, we need to analyze the overall Chi-Square value of the model and the corresponding p-value.

We can use the following formula to calculate the overall Chi-Square value of the model:

X² = Null deviance – Residual deviance, which is compared to a Chi-Square distribution with (Null df – Residual df) degrees of freedom.
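
Assuming a fitted glm object such as model above, the statistic and its p-value can be computed from the stored deviances (a sketch; with made-up data the result will not match the value quoted below):

```r
# Drop in deviance from the null (intercept-only) model
chi_sq <- model$null.deviance - model$deviance

# Difference in degrees of freedom
df_chi <- model$df.null - model$df.residual

# Upper-tail p-value from the chi-square distribution
pchisq(chi_sq, df = df_chi, lower.tail = FALSE)
```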

The p-value turns out to be 0.2717286 .

Since this p-value is not less than .05, we fail to reject the null hypothesis. In other words, there is not a statistically significant relationship between hours studied and whether or not a student passes the exam.

Example 2: Multiple Logistic Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict whether students will pass the final exam in his class. He collects data for 20 students and fits a multiple logistic regression model.

We can use the following code in R to fit a multiple logistic regression model:
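
For instance, a sketch along the same lines, with a hypothetical prep_exams predictor added (all names and values illustrative):

```r
# Hypothetical data with two predictors and a binary outcome
df2 <- data.frame(
  hours      = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12),
  prep_exams = c(1, 3, 3, 5, 2, 2, 1, 1, 0, 1, 0, 2, 1, 1, 0, 3, 0, 0, 1, 0),
  pass       = c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1)
)

# Fit the multiple logistic regression model
model2 <- glm(pass ~ hours + prep_exams, data = df2, family = "binomial")

# Overall Chi-Square test of the model
pchisq(model2$null.deviance - model2$deviance,
       df = model2$df.null - model2$df.residual,
       lower.tail = FALSE)
```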

The p-value for the overall Chi-Square statistic of the model turns out to be 0.01971255 .

Since this p-value is less than .05, we reject the null hypothesis. In other words, there is a statistically significant relationship between the combination of hours studied and prep exams taken and whether a student passes the final exam.

Additional Resources

The following tutorials offer additional information about logistic regression:

  • Introduction to Logistic Regression
  • How to Report Logistic Regression Results
  • Logistic Regression vs. Linear Regression: The Key Differences


12.1 - Logistic Regression

Logistic regression models a relationship between predictor variables and a categorical response variable. For example, we could use logistic regression to model the relationship between various measurements of a manufactured specimen (such as dimensions and chemical composition) to predict if a crack greater than 10 mils will occur (a binary variable: either yes or no). Logistic regression helps us estimate a probability of falling into a certain level of the categorical response given a set of predictors. We can choose from three types of logistic regression, depending on the nature of the categorical response variable:

Binary Logistic Regression :

Used when the response is binary (i.e., it has two possible outcomes). The cracking example given above would utilize binary logistic regression. Other examples of binary responses could include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure.

Nominal Logistic Regression :

Used when there are three or more categories with no natural ordering to the levels. Examples of nominal responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange).

Ordinal Logistic Regression :

Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal. Examples of ordinal responses could be how students rate the effectiveness of a college course (e.g., good, medium, poor), levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical).

Particular issues with modelling a categorical response variable include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the response is bounded between 0 and 1). We will investigate ways of dealing with these in the binary logistic regression setting here. Nominal and ordinal logistic regression are not considered in this course.

The multiple binary logistic regression model is the following:

\[\begin{align}\label{logmod} \pi(\textbf{X})&=\frac{\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})}{1+\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})}\notag \\ & =\frac{\exp(\textbf{X}\beta)}{1+\exp(\textbf{X}\beta)}\\ & =\frac{1}{1+\exp(-\textbf{X}\beta)}, \end{align}\]

where here \(\pi\) denotes a probability and not the irrational number 3.14....

  • \(\pi\) is the probability that an observation is in a specified category of the binary Y variable, generally called the "success probability."
  • Notice that the model describes the probability of an event happening as a function of X variables. For instance, it might provide estimates of the probability that an older person has heart disease.
  • The numerator \(\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})\) must be positive, because it is a power of a positive value ( e ).
  • The denominator of the model is (1 + numerator), so the answer will always be less than 1.
  • With one X variable, the theoretical model for \(\pi\) has an elongated "S" shape (or sigmoidal shape) with asymptotes at 0 and 1, although in sample estimates we may not see this "S" shape if the range of the X variable is limited.

For a sample of size n , the likelihood for a binary logistic regression is given by:

\[\begin{align*} L(\beta;\textbf{y},\textbf{X})&=\prod_{i=1}^{n}\pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}\\ & =\prod_{i=1}^{n}\biggl(\frac{\exp(\textbf{X}_{i}\beta)}{1+\exp(\textbf{X}_{i}\beta)}\biggr)^{y_{i}}\biggl(\frac{1}{1+\exp(\textbf{X}_{i}\beta)}\biggr)^{1-y_{i}}. \end{align*}\]

This yields the log likelihood:

\[\begin{align*} \ell(\beta)&=\sum_{i=1}^{n}[y_{i}\log(\pi_{i})+(1-y_{i})\log(1-\pi_{i})]\\ & =\sum_{i=1}^{n}[y_{i}\textbf{X}_{i}\beta-\log(1+\exp(\textbf{X}_{i}\beta))]. \end{align*}\]

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, $\hat{\beta}$.

To illustrate, consider data published on n = 27 leukemia patients. The data ( leukemia_remission.txt ) has a response variable of whether leukemia remission occurred (REMISS), which is given by a 1.

The predictor variables are cellularity of the marrow clot section (CELL), smear differential percentage of blasts (SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL), percentage labeling index of the bone marrow leukemia cells (LI), absolute number of blasts in the peripheral blood (BLAST), and the highest temperature prior to start of treatment (TEMP).

The following output shows the estimated logistic regression equation and associated significance tests:

  • Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model.
  • Select "REMISS" for the Response (the response event for remission is 1 for this data).
  • Select all the predictors as Continuous predictors.
  • Click Options and choose Deviance or Pearson residuals for diagnostic plots.
  • Click Graphs and select "Residuals versus order."
  • Click Results and change "Display of results" to "Expanded tables."
  • Click Storage and select "Coefficients."

Coefficients

Term        Coef  SE Coef           95% CI  Z-Value  P-Value     VIF
Constant    64.3     75.0  ( -82.7, 211.2)     0.86    0.391
CELL        30.8     52.1  ( -71.4, 133.0)     0.59    0.554   62.46
SMEAR       24.7     61.5  ( -95.9, 145.3)     0.40    0.688  434.42
INFIL      -25.0     65.3  (-152.9, 103.0)    -0.38    0.702  471.10
LI          4.36     2.66  ( -0.85,  9.57)     1.64    0.101    4.43
BLAST      -0.01     2.27  ( -4.45,  4.43)    -0.01    0.996    4.18
TEMP      -100.2     77.8  (-252.6,  52.2)    -1.29    0.198    3.01

The Wald test is the test of significance for individual regression coefficients in logistic regression (recall that we use t -tests in linear regression). For maximum likelihood estimates, the ratio

\[\begin{equation*} Z=\frac{\hat{\beta}_{i}}{\textrm{s.e.}(\hat{\beta}_{i})} \end{equation*}\]

can be used to test $H_{0}: \beta_{i}=0$. The standard normal curve is used to determine the $p$-value of the test. Furthermore, confidence intervals can be constructed as

\[\begin{equation*} \hat{\beta}_{i}\pm z_{1-\alpha/2}\textrm{s.e.}(\hat{\beta}_{i}). \end{equation*}\]

Estimates of the regression coefficients, $\hat{\beta}$, are given in the Coefficients table in the column labeled "Coef." This table also gives coefficient p -values based on Wald tests. The index of the bone marrow leukemia cells (LI) has the smallest p -value and so appears to be closest to a significant predictor of remission occurring. After looking at various subsets of the data, we find that a good model is one which only includes the labeling index as a predictor:

Coefficients

Term       Coef  SE Coef          95% CI  Z-Value  P-Value   VIF
Constant  -3.78     1.38  (-6.48, -1.08)    -2.74    0.006
LI         2.90     1.19  ( 0.57,  5.22)     2.44    0.015  1.00

Regression Equation

P(1) = exp(Y')/(1 + exp(Y'))

Y' = -3.78 + 2.90 LI

Since we only have a single predictor in this model we can create a Binary Fitted Line Plot to visualize the sigmoidal shape of the fitted logistic regression curve:

Binary fitted line plot

Odds, Log Odds, and Odds Ratio

There are algebraically equivalent ways to write the logistic regression model:

The first is

\[\begin{equation}\label{logmod1} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k}), \end{equation}\]

which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event are π / (1 - π), where π is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then its odds are 0.80 / (1 - 0.80) = 4, or 4:1.

The second is

\[\begin{equation}\label{logmod2} \log\biggl(\frac{\pi}{1-\pi}\biggr)=\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k}, \end{equation}\]

which states that the (natural) logarithm of the odds is a linear function of the X variables (and is often called the log odds ). This is also referred to as the logit transformation of the probability of success,  \(\pi\).

The odds ratio (which we will write as $\theta$) between the odds for two sets of predictors (say $\textbf{X}_{(1)}$ and $\textbf{X}_{(2)}$) is given by

\[\begin{equation*} \theta=\frac{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(1)}}}{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}\]

For binary logistic regression, the odds of success are:

\[\begin{equation*} \frac{\pi}{1-\pi}=\exp(\textbf{X}\beta). \end{equation*}\]

By plugging this into the formula for $\theta$ above and setting $\textbf{X}_{(1)}$ equal to $\textbf{X}_{(2)}$ except in one position (i.e., only one predictor differs by one unit), we can determine the relationship between that predictor and the response. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for higher levels of a continuous predictor (or for the indicated level of a factor). In particular, the odds increase multiplicatively by $\exp(\beta_{j})$ for every one-unit increase in $\textbf{X}_{j}$. If the odds ratio is less than 1, then the odds of success are less for higher levels of a continuous predictor (or for the indicated level of a factor). Values farther from 1 represent stronger degrees of association.

For example, when there is just a single predictor, \(X\), the odds of success are:

\[\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_0+\beta_1X). \end{equation*}\]

If we increase \(X\) by one unit, the odds ratio is

\[\begin{equation*} \theta=\frac{\exp(\beta_0+\beta_1(X+1))}{\exp(\beta_0+\beta_1X)}=\exp(\beta_1). \end{equation*}\]

To illustrate, the relevant output from the leukemia example is:

Odds Ratios for Continuous Predictors

    Odds Ratio              95% CI
LI     18.1245  (1.7703, 185.5617)

The regression parameter estimate for LI is $2.89726$, so the odds ratio for LI is calculated as $\exp(2.89726)=18.1245$. The 95% confidence interval is calculated as $\exp(2.89726\pm z_{0.975}*1.19)$, where $z_{0.975}=1.960$ is the $97.5^{\textrm{th}}$ percentile from the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of leukemia remission are multiplied by 18.1245. However, since LI appears to fall between 0 and 2, it may make more sense to say that for every 0.1 unit increase in LI, the estimated odds of remission are multiplied by $\exp(2.89726\times 0.1)=1.336$. Then

  • At LI=0.9, the estimated odds of leukemia remission is $\exp\{-3.77714+2.89726*0.9\}=0.310$.
  • At LI=0.8, the estimated odds of leukemia remission is $\exp\{-3.77714+2.89726*0.8\}=0.232$.
  • The resulting odds ratio is $\frac{0.310}{0.232}=1.336$, which is the ratio of the odds of remission when LI=0.9 compared to the odds when LI=0.8.

Notice that $1.336\times 0.232=0.310$, which demonstrates the multiplicative effect by $\exp(0.1\hat{\beta_{1}})$ on the odds.
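
The arithmetic above is easy to verify in R; this short sketch simply re-does the calculations from the reported estimate and standard error (rounded inputs, so results are approximate):

```r
b1 <- 2.89726   # estimated coefficient for LI
se <- 1.19      # reported standard error

exp(b1)                                  # odds ratio for a 1-unit increase, about 18.12
exp(b1 + c(-1, 1) * qnorm(0.975) * se)   # approximate 95% CI for the odds ratio
exp(0.1 * b1)                            # multiplicative effect of a 0.1-unit increase, about 1.336

# Odds at LI = 0.9 and LI = 0.8, and their ratio
exp(-3.77714 + b1 * 0.9) / exp(-3.77714 + b1 * 0.8)   # about 1.336
```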

Likelihood Ratio (or Deviance) Test

The  likelihood ratio test  is used to test the null hypothesis that any subset of the $\beta$'s is equal to 0. The number of $\beta$'s in the full model is k +1 , while the number of $\beta$'s in the reduced model is r +1 . (Remember the reduced model is the model that results when the $\beta$'s in the null hypothesis are set to 0.) Thus, the number of $\beta$'s being tested in the null hypothesis is \((k+1)-(r+1)=k-r\). Then the likelihood ratio test statistic is given by:

\[\begin{equation*} \Lambda^{*}=-2(\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})), \end{equation*}\]

where $\ell(\hat{\beta})$ is the log likelihood of the fitted (full) model and $\ell(\hat{\beta}^{(0)})$ is the log likelihood of the (reduced) model specified by the null hypothesis evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a $\chi^{2}$ distribution with \(k-r\) degrees of freedom. Statistical software often presents results for this test in terms of "deviance," which is defined as \(-2\) times log-likelihood. The notation used for the test statistic is typically $G^2$ = deviance (reduced) – deviance (full).

This test procedure is analogous to the general linear F test procedure for multiple linear regression. However, note that when testing a single coefficient, the Wald test and likelihood ratio test will not in general give identical results.

To illustrate, the relevant software output from the leukemia example is:

Deviance Table

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1    8.299     8.299        8.30    0.004
LI           1    8.299     8.299        8.30    0.004
Error       25   26.073     1.043
Total       26   34.372

Since there is only a single predictor for this example, this table simply provides information on the likelihood ratio test for LI ( p -value of 0.004), which is similar but not identical to the earlier Wald test result ( p -value of 0.015). The Deviance Table includes the following:

  • The null (reduced) model in this case has no predictors, so the fitted probabilities are simply the sample proportion of successes, \(9/27=0.333333\). The log-likelihood for the null model is \(\ell(\hat{\beta}^{(0)})=-17.1859\), so the deviance for the null model is \(-2\times-17.1859=34.372\), which is shown in the "Total" row in the Deviance Table.
  • The log-likelihood for the fitted (full) model is \(\ell(\hat{\beta})=-13.0365\), so the deviance for the fitted model is \(-2\times-13.0365=26.073\), which is shown in the "Error" row in the Deviance Table.
  • The likelihood ratio test statistic is therefore \(\Lambda^{*}=-2(-17.1859-(-13.0365))=8.299\), which is the same as \(G^2=34.372-26.073=8.299\).
  • The p -value comes from a $\chi^{2}$ distribution with \(2-1=1\) degrees of freedom.

When using the likelihood ratio (or deviance) test for more than one regression coefficient, we can first fit the "full" model to find deviance (full), which is shown in the "Error" row in the resulting full model Deviance Table. Then fit the "reduced" model (corresponding to the model that results if the null hypothesis is true) to find deviance (reduced), which is shown in the "Error" row in the resulting reduced model Deviance Table. For example, the relevant Deviance Tables for the Disease Outbreak example on pages 581-582 of Applied Linear Regression Models (4th ed) by Kutner et al are:

Full model:

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   9   28.322   3.14686       28.32    0.001
Error       88   93.996   1.06813
Total       97  122.318

Reduced model:

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   4   21.263    5.3159       21.26    0.000
Error       93  101.054    1.0866
Total       97  122.318

Here the full model includes four single-factor predictor terms and five two-factor interaction terms, while the reduced model excludes the interaction terms. The test statistic for testing the interaction terms is \(G^2 = 101.054-93.996 = 7.058\), which is compared to a chi-square distribution with \(10-5=5\) degrees of freedom to find the p -value = 0.216 > 0.05 (meaning the interaction terms are not significant at a 5% significance level).

Alternatively, select the corresponding predictor terms last in the full model and request the software to output Sequential (Type I) Deviances. Then add the corresponding Sequential Deviances in the resulting Deviance Table to calculate \(G^2\). For example, the relevant Deviance Table for the Disease Outbreak example is:

Source           DF  Seq Dev  Seq Mean  Chi-Square  P-Value
Regression        9   28.322    3.1469       28.32    0.001
  Age             1    7.405    7.4050        7.40    0.007
  Middle          1    1.804    1.8040        1.80    0.179
  Lower           1    1.606    1.6064        1.61    0.205
  Sector          1   10.448   10.4481       10.45    0.001
  Age*Middle      1    4.570    4.5697        4.57    0.033
  Age*Lower       1    1.015    1.0152        1.02    0.314
  Age*Sector      1    1.120    1.1202        1.12    0.290
  Middle*Sector   1    0.000    0.0001        0.00    0.993
  Lower*Sector    1    0.353    0.3531        0.35    0.552
Error            88   93.996    1.0681
Total            97  122.318

The test statistic for testing the interaction terms is \(G^2 = 4.570+1.015+1.120+0.000+0.353 = 7.058\), the same as in the first calculation.

Goodness-of-Fit Tests

Overall performance of the fitted model can be measured by several different goodness-of-fit tests. Two tests that require replicated data (multiple observations with the same values for all the predictors) are the Pearson chi-square goodness-of-fit test and the deviance goodness-of-fit test (analogous to the multiple linear regression lack-of-fit F-test). Both of these tests have statistics that are approximately chi-square distributed with c - k - 1 degrees of freedom, where c is the number of distinct combinations of the predictor variables. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit.

By contrast, the Hosmer-Lemeshow goodness-of-fit test is useful for unreplicated datasets or for datasets that contain just a few replicated observations. For this test the observations are grouped based on their estimated probabilities. The resulting test statistic is approximately chi-square distributed with c - 2 degrees of freedom, where c is the number of groups (generally chosen to be between 5 and 10, depending on the sample size).

Goodness-of-Fit Tests

Test             DF  Chi-Square  P-Value
Deviance         25       26.07    0.404
Pearson          25       23.93    0.523
Hosmer-Lemeshow   7        6.87    0.442

Since there is no replicated data for this example, the deviance and Pearson goodness-of-fit tests are invalid, so the first two rows of this table should be ignored. However, the Hosmer-Lemeshow test does not require replicated data so we can interpret its high p -value as indicating no evidence of lack-of-fit.

The calculation of R 2 used in linear regression does not extend directly to logistic regression. One version of R 2 used in logistic regression is defined as

\[\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})-\ell_{S}(\beta)}, \end{equation*}\]

where $\ell(\hat{\beta_{0}})$ is the log likelihood of the model when only the intercept is included and $\ell_{S}(\beta)$ is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This R 2 does go from 0 to 1 with 1 being a perfect fit. With unreplicated data, $\ell_{S}(\beta)=0$, so the formula simplifies to:

\[\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}=1-\frac{\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}. \end{equation*}\]

Model Summary

Deviance   Deviance
    R-Sq  R-Sq(adj)    AIC
  24.14%     21.23%  30.07

Recall from above that \(\ell(\hat{\beta})=-13.0365\) and \(\ell(\hat{\beta}^{(0)})=-17.1859\), so:

\[\begin{equation*} R^{2}=1-\frac{-13.0365}{-17.1859}=0.2414. \end{equation*}\]

Note that we can obtain the same result by simply using deviances instead of log-likelihoods since the $-2$ factor cancels out:

\[\begin{equation*} R^{2}=1-\frac{26.073}{34.372}=0.2414. \end{equation*}\]
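
In R, assuming the model had been fit with glm() and stored in an object called fit (a hypothetical name), the same quantity can be obtained directly from the stored deviances or log-likelihoods:

```r
# Deviance R-squared from the fitted and null deviances
1 - fit$deviance / fit$null.deviance

# Equivalent calculation from log-likelihoods (null model refit with intercept only)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(update(fit, . ~ 1)))
```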

Raw Residual

The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is

\[\begin{equation*} r_{i}=y_{i}-\hat{\pi}_{i}. \end{equation*}\]

Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is

\[\begin{equation*} p_{i}=\frac{r_{i}}{\sqrt{\hat{\pi}_{i}(1-\hat{\pi}_{i})}}. \end{equation*}\]

Deviance Residuals

Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is

\[\begin{equation*} d_{i}=\pm\sqrt{2\biggl[y_{i}\log\biggl(\frac{y_{i}}{\hat{\pi}_{i}}\biggr)+(1-y_{i})\log\biggl(\frac{1-y_{i}}{1-\hat{\pi}_{i}}\biggr)\biggr]}. \end{equation*}\]
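
For a model fit in R with glm() (again assuming a fitted object called fit), all three kinds of residuals are available through residuals():

```r
raw_res      <- residuals(fit, type = "response")  # y_i minus the fitted probability
pearson_res  <- residuals(fit, type = "pearson")   # raw residual / sqrt(pi_hat * (1 - pi_hat))
deviance_res <- residuals(fit, type = "deviance")  # signed square root of each deviance contribution

# The squared deviance residuals sum to the residual deviance
sum(deviance_res^2)
deviance(fit)
```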

Here are the plots of the Pearson residuals and deviance residuals for the leukemia example. There are no alarming patterns in these plots to suggest a major problem with the model.

residual plots for leukemia data

The hat matrix serves a similar purpose as in the case of linear regression – to measure the influence of each observation on the overall fit of the model – but the interpretation is not as clear due to its more complicated form. The hat values (leverages) are given by

\[\begin{equation*} h_{i,i}=\hat{\pi}_{i}(1-\hat{\pi}_{i})\textbf{x}_{i}^{\textrm{T}}(\textbf{X}^{\textrm{T}}\textbf{W}\textbf{X})^{-1}\textbf{x}_{i}, \end{equation*}\]

where W is an $n\times n$ diagonal matrix with the values of $\hat{\pi}_{i}(1-\hat{\pi}_{i})$ for $i=1 ,\ldots,n$ on the diagonal. As before, we should investigate any observations with $h_{i,i}>3p/n$ or, failing this, any observations with $h_{i,i}>2p/n$ and very isolated .

Studentized Residuals

We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by

\[\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}} \end{equation*}\]

and the Studentized deviance residuals are given by

\[\begin{equation*} sd_{i}=\frac{d_{i}}{\sqrt{1-h_{i, i}}}. \end{equation*}\]

Cook's Distances

An extension of Cook's distance for logistic regression measures the overall change in fitted logits due to deleting the $i^{\textrm{th}}$ observation. It is defined by:

\[\begin{equation*} \textrm{C}_{i}=\frac{p_{i}^{2}h _{i,i}}{(k+1)(1-h_{i,i})^{2}}. \end{equation*}\]

Fits and Diagnostics for Unusual Observations

     Observed
Obs  Probability    Fit  SE Fit          95% CI   Resid  Std Resid  Del Resid        HI
  8        0.000  0.849   0.139  (0.403, 0.979)  -1.945      -2.11      -2.19  0.149840

Obs  Cook's D     DFITS
  8      0.58  -1.08011  R

R  Large residual

The residuals in this output are deviance residuals, so observation 8 has a deviance residual of \(-1.945\), a studentized deviance residual of \(-2.19\), a leverage (h) of \(0.149840\), and a Cook's distance (C) of 0.58.


Introduction to Regression Analysis in R

Chapter 19 Inference in Logistic Regression


19.1 Maximum Likelihood

For estimating \(\beta\)’s in the logistic regression model \[logit(p_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_kx_{ik},\] we can’t minimize the residual sum of squares as was done in linear regression. Instead, we use a statistical technique called maximum likelihood.

To demonstrate the idea of maximum likelihood, we first consider examples of flipping a coin.

Example 19.1 Suppose that we have a (possibly biased) coin that has probability \(p\) of landing heads. We flip it twice. What is the probability that we get two heads?

Consider the random variable \(Y_i\) such that \(Y_i= 1\) if the \(i\) th coin flip is heads, and \(Y_i = 0\) if the \(i\) th coin flip is tails. Clearly, \(P(Y_i=1) = p\) and \(P(Y_i=0) = 1- p\) . More generally, we can write \(P(Y_i = y) = p^y(1-p)^{(1-y)}\) . To determine the probability of getting two heads in two flips, we need to compute \(P(Y_1=1 \text{ and }Y_2 = 1)\) . Since the flips are independent, we have \(P(Y_1=1 \text{ and }Y_2 = 1) = P(Y_1 = 1)P(Y_2 =1)= p*p = p^2\) .

Using the same logic, we find the probability of obtaining two tails to be \((1-p)^2\).

Lastly, we could calculate the probability of obtaining 1 heads and 1 tails (from two coin flips) as \(p(1-p) + (1-p)p = 2p(1-p)\). Notice that we sum two values here, corresponding to different orderings: heads then tails or tails then heads. Both occurrences lead to 1 heads and 1 tails in total.

Example 19.2 Suppose again we have a biased coin, that has probability \(p\) of landing heads. We flip it 100 times. What is the probability of getting 64 heads?

\[\begin{equation} P(\text{64 heads in 100 flips}) = \text{constant} \times p^{64}(1-p)^{100 - 64} \tag{19.1} \end{equation}\] In this equation, the constant term accounts for the possible orderings.

19.1.1 Likelihood function

Now consider reversing the question of Example 19.2. Suppose we flipped a coin 100 times and observed 64 heads and 36 tails. What would be our best guess of the probability of landing heads for this coin? We can calculate this by considering the likelihood:

\[L(p) = \text{constant} \times p^{64}(1-p)^{100 - 64}\]

This likelihood function is very similar to (19.1) , but is written as a function of the parameter \(p\) rather than the random variable \(Y\) . The likelihood function indicates how likely the data are if the probability of heads is \(p\) . This value depends on the data (in this case, 64 heads) and the probability of heads ( \(p\) ).

In maximum likelihood, we seek to find the value of \(p\) that gives the largest possible value of \(L(p)\) . We call this value the maximum likelihood estimate of \(p\) and denote it as \(\hat p\) .


Figure 19.1: Likelihood function \(L(p)\) when 64 heads are observed out of 100 coin flips.

19.1.2 Maximum Likelihood in Logistic Regression

We use this approach to calculate \(\hat\beta_j\) ’s in logistic regression. The likelihood function for \(n\) independent binary random variables can be written: \[L(p_1, \dots, p_n) \propto \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\]

Important differences from the coin flip example are that now \(p_i\) is different for each observation and \(p_i\) depends on \(\beta\)’s. Taking this into account, we can write the likelihood function for logistic regression as: \[L(\beta_0, \beta_1, \dots, \beta_k) = L(\boldsymbol{\beta}) = \prod_{i=1}^n p_i(\boldsymbol{\beta})^{y_i}(1-p_i(\boldsymbol{\beta}))^{1-y_i}\] The goal of maximum likelihood is to find the values of \(\boldsymbol\beta\) that maximize \(L(\boldsymbol\beta)\). Our data have the highest probability of occurring when \(\boldsymbol\beta\) takes these values (compared to other values of \(\boldsymbol\beta\)).

Unfortunately, there is no simple closed-form solution for finding \(\hat{\boldsymbol\beta}\). Instead, we use an iterative procedure called Iteratively Reweighted Least Squares (IRLS). This is done automatically by the glm() function in R, so we will skip over the details of the procedure.

If you want to know the value of the likelihood function for a logistic regression model, use the logLik() function on the fitted model object. This will return the logarithm of the likelihood. Alternatively, the summary output for glm() provides the deviance , which is \(-2\) times the logarithm of the likelihood.
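
A small self-contained sketch (with simulated data, since no particular data set is assumed here) showing glm(), logLik(), and the deviance:

```r
# Simulate a small binary-outcome data set
set.seed(1)
mydata <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
mydata$y <- rbinom(100, 1, plogis(-0.5 + 0.8 * mydata$x1))

# Fit the logistic regression
fit <- glm(y ~ x1 + x2, data = mydata, family = binomial)

logLik(fit)                   # log of the maximized likelihood
-2 * as.numeric(logLik(fit))  # equals the residual deviance for binary data
deviance(fit)
```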

19.2 Hypothesis Testing for \(\beta\) ’s

Like with linear regression, a common inferential question in logistic regression is whether a \(\beta_j\) is different from zero. This corresponds to there being a difference in the log odds of the outcome among observations that differ in the value of the predictor variable \(x_j\).

There are three possible tests of \(H_0: \beta_j = 0\) vs.  \(H_A: \beta_j \ne 0\) in logistic regression:

  • Likelihood Ratio Test
  • Wald Test
  • Score Test

In linear regression, all three are equivalent. In logistic regression (and other GLM’s), they are not equivalent.

19.2.1 Likelihood Ratio Test (LRT)

The LRT asks the question: Are the data significantly more likely when \(\beta_j = \hat\beta_j\) than when \(\beta_j = 0\)? To do this, it compares the values of the log-likelihood for models with and without \(\beta_j\). The test statistic is: \[\begin{align*} \text{LR Statistic } \Lambda &= -2 \log \frac{L(\widehat{reduced})}{L(\widehat{full})}\\ &= -2 \log L(\widehat{reduced}) + 2 \log L(\widehat{full}) \end{align*}\]

\(\Lambda\) follows a \(\chi^2_{r}\) distribution when \(H_0\) is true and \(n\) is large ( \(r\) is the number of variables set to zero, in this case \(=1\) ). We reject the null hypothesis when \(\Lambda > \chi^2_{r}(\alpha)\) . This means that larger values of \(\Lambda\) lead to rejecting \(H_0\) . Conceptually, if \(\beta_j\) greatly improves the model fit, then \(L(\widehat{full})\) is much bigger than \(L(\widehat{reduced})\) . This makes \(\frac{L(\widehat{reduced})}{L(\widehat{full})} \approx 0\) and thus \(\Lambda\) large.

A key advantage of the LRT is that the test doesn’t depend upon the model parameterization. We obtain the same answer testing (1) \(H_0: \beta_j = 0\) vs. \(H_A: \beta_j \ne 0\) as we would testing (2) \(H_0: \exp(\beta_j) = 1\) vs. \(H_A: \exp(\beta_j) \ne 1\). A second advantage is that the LRT easily extends to testing multiple parameters at once.

Although the LRT requires fitting the model twice (once with all variables and once with the variables being tested held out), this is trivially fast for most modern computers.

19.2.2 LRT in R

To perform the LRT in R, use the anova(reduced, full, test = "LRT") command. Here, reduced is the glm object for the reduced model and full is the glm object for the full model.

Example 19.3 Is there a relationship between smoking status and CHD among US men with the same age, EKG status, and systolic blood pressure (SBP)?

To answer this, we fit two models: one with age, EKG status, SBP, and smoking status as predictors, and another with only the first three.
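
For instance, a sketch of the comparison, using placeholder names for the data frame (heart) and its columns (chd, age, ekg, sbp, smoke):

```r
# Full model includes smoking status; reduced model drops it
full    <- glm(chd ~ age + ekg + sbp + smoke, data = heart, family = binomial)
reduced <- glm(chd ~ age + ekg + sbp,         data = heart, family = binomial)

# Likelihood ratio test of H0: beta_smoke = 0
anova(reduced, full, test = "LRT")
```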

We have strong evidence to reject the null hypothesis that smoking is not related to CHD in men, when adjusting for age, EKG status, and systolic blood pressure (\(p = 0.0036\)).


19.2.4 Wald Test

A Wald test is, on the surface, the same type of test used in linear regression. The idea behind a Wald test is to calculate how many standard errors \(\hat\beta_j\) is from zero and compare that value to the standard normal (\(Z\)) distribution.

\[\begin{align*} \text{Wald Statistic } W &= \frac{\hat\beta_j - 0}{se(\hat\beta_j)} \end{align*}\]

\(W\) follows a \(N(0, 1)\) distribution when \(H_0\) is true and \(n\) is large. We reject \(H_0: \beta_j = 0\) when \(|W| > z_{1-\alpha/2}\) or when \(W^2 > \chi^2_1(\alpha)\). That is, larger values of \(W\) lead to rejecting \(H_0\).

Generally, an LRT is preferred to a Wald test, since the Wald test has several drawbacks. A Wald test does depend upon the model parameterization: \[\frac{\hat\beta - 0}{se(\hat\beta)} \ne \frac{\exp(\hat\beta) - 1}{se(\exp(\hat\beta))}\] Wald tests can also have low power when the truth is far from \(H_0\), and they are based on a normal approximation that is less reliable in small samples. The primary advantage of a Wald test is that it is easy to compute and is often provided by default in most statistical programs.

R can calculate the Wald test for you:
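
The Wald z-values and p-values are part of the standard summary() output for a glm fit; for example, using the fit object from above:

```r
# Coefficient table: estimates, standard errors, Wald z-values, and p-values
summary(fit)$coefficients

# Squaring a z-value gives the equivalent chi-square (1 df) Wald statistic
summary(fit)$coefficients[, "z value"]^2
```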

19.2.5 Score Test

The score test relies on the fact that the slope of the log-likelihood function is 0 when \(\beta = \hat\beta\).

The idea is to evaluate the slope of the log-likelihood for the “reduced” model (which does not include \(\beta_1\)) and see if it is “significantly” steep. The score test is also called the Rao test. The test statistic, \(S\), follows a \(\chi^2_{r}\) distribution when \(H_0\) is true and \(n\) is large (\(r\) is the number of variables set to zero, in this case \(=1\)). The null hypothesis is rejected when \(S > \chi^2_{r}(\alpha)\).

An advantage of the score test is that it only requires fitting the reduced model. This provides computational advantages in some complex situations (generally not an issue for logistic regression). Like the LRT, the score test doesn’t depend upon the model parameterization.

Calculate the score test using anova() with test = "Rao":
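
Using the same reduced and full objects as in the LRT example above:

```r
# Rao score test comparing the reduced and full models
anova(reduced, full, test = "Rao")
```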

19.3 Interval Estimation

There are two ways for computing confidence intervals in logistic regression. Both are based on inverting testing approaches.

19.3.1 Wald Confidence Intervals

Consider the Wald hypothesis test:

\[W = \frac{{\hat\beta_j} - {\beta_j^0}}{{se(\hat\beta_j)}}\] If \(|W| \ge z_{1 - \alpha/2}\) , then we would reject the null hypothesis \(H_0 : \beta_j = \beta^0_j\) at the \(\alpha\) level.

Reverse the formula for \(W\) to get:

\[{\hat\beta_j} - z_{1 - \alpha/2}{se(\hat\beta_j)} \le {\beta_j^0} \le {\hat\beta_j} + z_{1 - \alpha/2}{se(\hat\beta_j)}\]

Thus, a \(100\times (1-\alpha) \%\) Wald confidence interval for \(\beta_j\) is:

\[\left(\hat\beta_j - z_{1 - \alpha/2}se(\hat\beta_j), \hat\beta_j + z_{1 - \alpha/2}se(\hat\beta_j)\right)\]

19.3.2 Profile Confidence Intervals

Wald confidence intervals have a simple formula, but don’t always work well–especially in small sample sizes (which is also when Wald Tests are not as good). Profile Confidence Intervals “reverse” a LRT similar to how a Wald CI “reverses” a Wald Hypothesis test.

  • Profile confidence intervals are usually better to use than Wald CI’s.
  • Interpretation is the same for both.
  • This is also what tidy() will use when conf.int=TRUE

To get a confidence interval for an odds ratio, exponentiate the endpoints of the confidence interval for \(\beta_j\):
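
In R, confint() on a glm fit gives profile likelihood intervals (confint.default() gives Wald intervals), and exponentiating yields intervals for odds ratios; for example, with the fit object from earlier:

```r
confint(fit)           # profile likelihood CIs (on older R versions this method comes from MASS)
confint.default(fit)   # Wald confidence intervals, for comparison
exp(confint(fit))      # confidence intervals for the odds ratios
```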

19.4 Generalized Linear Models (GLMs)

Logistic regression is one example of a generalized linear model (GLM). GLMs have three pieces: a random component specifying the distribution of the response (Bernoulli, for logistic regression), a linear predictor \(X\beta\), and a link function connecting the mean of the response to the linear predictor (the logit, for logistic regression).

Another common GLM is Poisson regression (“log-linear” models).

The value of the constant is a binomial coefficient, but its exact value is not important for our needs here. ↩︎

This follows from the first derivative of a function always being zero at a local extremum. ↩︎

An Introduction to Data Analysis

15.2 Logistic Regression

Suppose \(y \in \{0,1\}^n\) is an \(n\) -placed vector of binary outcomes, and \(X\) a predictor matrix for a linear regression model. A Bayesian logistic regression model has the following form:

\[ \begin{align*} \beta, \sigma & \sim \text{some prior} \\ \xi & = X \beta && \text{[linear predictor]} \\ \eta_i & = \text{logistic}(\xi_i) && \text{[predictor of central tendency]} \\ y_i & \sim \text{Bernoulli}(\eta_i) && \text{[likelihood]} \\ \end{align*} \] The logistic function used as a link function is a function in \(\mathbb{R} \rightarrow [0;1]\) , i.e., from the reals to the unit interval. It is defined as:

\[\text{logistic}(\xi_i) = (1 + \exp(-\xi_i))^{-1}\] Its shape (a sigmoid, or S-shaped curve) is this:

[Figure: the logistic (sigmoid) curve]

We use the Simon task data as an example application. So far we only tested the first of two hypotheses about the Simon task data, namely the hypothesis relating to reaction times. The second hypothesis which arose in the context of the Simon task refers to the accuracy of answers, i.e., the proportion of “correct” choices:

\[ \text{Accuracy}_{\text{correct},\ \text{congruent}} > \text{Accuracy}_{\text{correct},\ \text{incongruent}} \] Notice that correctness is a binary categorical variable. Therefore, we use logistic regression to test this hypothesis.

Here is how to set up a logistic regression model with brms. The only thing that is new here is that we specify explicitly the likelihood function and the (inverse!) link function. This is done using the syntax family = bernoulli(link = "logit"). For later hypothesis testing we also use proper priors and take samples from the prior as well.
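
For instance, a sketch of such a call, using placeholder names for the data frame (simon) and its columns (correctness, condition), and an illustrative prior rather than any particular published choice:

```r
library(brms)

fit_accuracy <- brm(
  formula      = correctness ~ condition,
  data         = simon,                          # assumed data frame with the Simon task data
  family       = bernoulli(link = "logit"),      # likelihood and (inverse) link function
  prior        = prior(student_t(1, 0, 2.5), class = "b"),  # illustrative proper prior
  sample_prior = "yes"                           # also draw samples from the prior
)

summary(fit_accuracy)
```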

The Bayesian summary statistics of the posterior samples of values for regression coefficients are:

What do these specific numerical estimates for coefficients mean? The mean estimate for the linear predictor \(\xi_\text{cong}\) for the “congruent” condition is roughly 3.204. The mean estimate for the linear predictor \(\xi_\text{inc}\) for the “incongruent” condition is roughly 3.204 - 0.726, so roughly 2.478. The predictors of central tendency corresponding to these linear predictors are:

\[ \begin{align*} \eta_\text{cong} & = \text{logistic}(3.204) \approx 0.961 \\ \eta_\text{incon} & = \text{logistic}(2.478) \approx 0.923 \end{align*} \]

These central estimates for the latent proportion of “correct” answers in each condition tightly match the empirically observed proportion of “correct” answers in the data:

Testing hypotheses for a logistic regression model works exactly the same way as for a standard regression model. And so, we find very strong support for hypothesis 2, suggesting that (given the model and data) there is reason to believe that the accuracy in incongruent trials is lower than in congruent trials.

Notice that the logit function is the inverse of the logistic function. ↩︎

Companion to BER 642: Advanced Regression Methods

Chapter 10 Binary Logistic Regression

10.1 Introduction

Logistic regression is a technique used when the dependent variable is categorical (or nominal). Examples: 1) consumers make a decision to buy or not to buy, 2) a product may pass or fail quality control, 3) there are good or poor credit risks, and 4) an employee may be promoted or not.

Binary logistic regression - determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories.

Since the dependent variable is dichotomous, we cannot predict a numerical value for it using logistic regression, so the usual least-squares criterion of minimizing error around a line of best fit is inappropriate (it is impossible to calculate such deviations meaningfully with a binary outcome).

Instead, logistic regression employs binomial probability theory, in which there are only two values to predict: the probability (p) that an outcome is 1 rather than 0, i.e., that the event/person belongs to one group rather than the other.

Logistic regression forms a best fitting equation or function using the maximum likelihood (ML) method, which maximizes the probability of classifying the observed data into the appropriate category given the regression coefficients.

Like multiple regression, logistic regression provides a coefficient ‘b’, which measures each independent variable’s partial contribution to variations in the dependent variable.

The goal is to correctly predict the category of outcome for individual cases using the most parsimonious model.

To accomplish this goal, a model (i.e. an equation) is created that includes all predictor variables that are useful in predicting the response variable.

10.2 The Purpose of Binary Logistic Regression

  • The logistic regression predicts group membership

Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio.

Logistic regression determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories.

  • The logistic regression also provides the relationships and strengths among the variables

Assumptions of (Binary) Logistic Regression

Logistic regression does not assume a linear relationship between the dependent and independent variables.

  • Logistic regression assumes linearity of independent variables and log odds of dependent variable.

The independent variables need not be interval, nor normally distributed, nor linearly related, nor of equal variance within each group

  • Homoscedasticity is not required. The error terms (residuals) do not need to be normally distributed.

The dependent variable in logistic regression is not measured on an interval or ratio scale.

  • The dependent variable must be dichotomous (2 categories) for binary logistic regression.

The categories (groups) as a dependent variable must be mutually exclusive and exhaustive; a case can only be in one group and every case must be a member of one of the groups.

Larger samples are needed than for linear regression because maximum likelihood coefficient estimates are large-sample estimates. A minimum of 50 cases per predictor is recommended (Field, 2013).

Hosmer, Lemeshow, and Sturdivant (2013) suggest a minimum sample of 10 observations per independent variable in the model, but caution that 20 observations per variable should be sought if possible.

Leblanc and Fitzgerald (2000) suggest a minimum of 30 observations per independent variable.

10.3 Log Transformation

The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality.

  • Log transformations and square-root transformations move skewed distributions closer to normality, so what we are about to do is common.

This log transformation of the probabilities creates a link with the normal regression equation. The log distribution (or logistic transformation of p) is also called the logit of p, or logit(p).

In logistic regression, a logistic transformation of the odds (referred to as the logit) serves as the dependent variable:

\[\log (o d d s)=\operatorname{logit}(P)=\ln \left(\frac{P}{1-P}\right)\] If we take the above dependent variable and add a regression equation for the independent variables, we get a logistic regression:

\[\operatorname{logit}(p)=a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\] As in least-squares regression, the relationship between logit(P) and X is assumed to be linear.

10.4 Equation

\[P=\frac{\exp \left(a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\right)}{1+\exp \left(a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\right)}\] In the equation above:

P = the probability that a case is in a particular category,

exp = the exponential function (with base e ≈ 2.72),

a = the constant (or intercept) of the equation and,

b = the coefficient (or slope) of the predictor variables.

10.5 Hypothesis Test

In logistic regression, hypotheses are of interest:

the null hypothesis, under which all the coefficients in the regression equation take the value zero, and

the alternative hypothesis, which states that the model currently under consideration is accurate and differs significantly from the null of zero, i.e., it predicts the outcome significantly better than the chance (random) prediction level of the null hypothesis.

10.6 Likelihood Ratio Test for Nested Models

The likelihood ratio test is based on -2LL (the deviance). It tests the significance of the difference between the -2LL for the researcher’s model with predictors (called the model chi-square) and the -2LL for the baseline model with only a constant in it.

Significance at the .05 level or lower means the researcher’s model with the predictors is significantly different from the one with the constant only (all ‘b’ coefficients being zero). It measures the improvement in fit that the explanatory variables make compared to the null model.

Chi square is used to assess significance of this ratio.

10.7 R Lab: Running Binary Logistic Regression Model

10.7.1 Data Explanation (data set: class.sav)

A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, affect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.

This dataset has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise.

There are three predictor variables: GRE, GPA, and rank. We will treat the variables GRE and GPA as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

10.7.2 Explore the data

This dataset has a binary response (outcome, dependent) variable called admit. There are three predictor variables: gre, gpa and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. We can get basic descriptives for the entire data set by using summary. To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.

Before we run a binary logistic regression, we need to check the two-way contingency table of the categorical outcome and predictors. We want to make sure there are no cells with zero counts.


10.7.3 Running a Logistic Regression Model
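
For instance, a sketch of the fit, assuming the admissions data described above sit in a data frame called mydata (an assumed name) with columns admit, gre, gpa, and rank:

```r
# Treat rank as a categorical predictor
mydata$rank <- factor(mydata$rank)

# Fit the binary logistic regression model
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
```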

In the output above, the first thing we see is the call; this is R reminding us what the model we ran was, what options we specified, etc.

Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

How to do the interpretation?

For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.

For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.

The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.

Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC. Later we show an example of how you can use these values to help assess model fit.

Why are the coefficient values for rank (B) different from the SPSS outputs? In R, glm() automatically uses rank 1 as the reference group, whereas in our SPSS example we set rank 4 as the reference group.

We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. In the call to wald.test below, b supplies the coefficients, Sigma supplies the variance-covariance matrix of the coefficient estimates, and Terms tells R which terms in the model are to be tested; in this case, terms 4, 5, and 6 are the three terms for the levels of rank.
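
A sketch of that call (assuming the aod package is installed and mylogit is the fit from above):

```r
library(aod)

# Overall Wald test of the three rank terms (coefficients 4, 5, and 6 in the model)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
```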

The chi-squared test statistic of 20.9, with three degrees of freedom is associated with a p-value of 0.00011 indicating that the overall effect of rank is statistically significant.

We can also test additional hypotheses about the differences in the coefficients for the different levels of rank. Below we test that the coefficient for rank=2 is equal to the coefficient for rank=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
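
A sketch of those two lines, continuing with the same mylogit object:

```r
# Contrast comparing the coefficients for rank=2 and rank=3 (the 4th and 5th terms)
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
```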

The chi-squared test statistic of 5.5 with 1 degree of freedom is associated with a p-value of 0.019, indicating that the difference between the coefficient for rank=2 and the coefficient for rank=3 is statistically significant.

You can also exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and it is part of mylogit (coef(mylogit)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
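
A sketch of those steps:

```r
# Exponentiated coefficients (odds ratios)
exp(coef(mylogit))

# Odds ratios with 95% profile confidence intervals, bound into one table
exp(cbind(OR = coef(mylogit), confint(mylogit)))
```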

Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.

For more information on interpreting odds ratios see our FAQ page: How do I interpret odds ratios in logistic regression? Link:

Note that while R produces it, the odds ratio for the intercept is not generally interpreted.

You can also use predicted probabilities to help you understand the model. Predicted probabilities can be computed for both categorical and continuous predictor variables. In order to create predicted probabilities, we first need to create a new data frame with the values we want the independent variables to take on.

We will start by calculating the predicted probability of admission at each value of rank, holding gre and gpa at their means.

These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create the predicted probabilities. The first line of code below is quite compact, we will break it apart to discuss what various components do. The newdata1$rankP tells R that we want to create a new variable in the dataset (data frame) newdata1 called rankP, the rest of the command tells R that the values of rankP should be predictions made using the predict( ) function. The options within the parentheses tell R that the predictions should be based on the analysis mylogit with values of the predictor variables coming from newdata1 and that the type of prediction is a predicted probability (type=“response”). The second line of the code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.
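
A sketch of the data frame construction and prediction step described above, assuming the same mydata and mylogit objects:

```r
# Hold gre and gpa at their means and let rank take each of its four values
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))

# Predicted probability of admission for each row of newdata1
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
```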

In the above output we see that the predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1), and 0.18 for students from the lowest ranked institutions (rank=4), holding gre and gpa at their means.

Now we are going to do something that does not exist in our SPSS section.

The code to generate the predicted probabilities (the first line below) is the same as before, except we are also going to ask for standard errors so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.
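A sketch of that computation, assuming a data frame newdata2 that varies gre over its observed range within each level of rank while holding gpa at its mean:

# predictions on the link (log-odds) scale, with standard errors
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se.fit = TRUE))
# back-transform the fit and the 95% confidence limits to the probability scale
newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - 1.96 * se.fit)
  UL <- plogis(fit + 1.96 * se.fit)
})
head(newdata3)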

It can also be helpful to use graphs of predicted probabilities to understand and/or present the model. We will use the ggplot2 package for graphing.
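For example, building on the newdata3 sketch above, a ribbon-and-line plot of the predicted probabilities might look like this:

library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +  # confidence band for each rank
  geom_line(aes(colour = rank))                                       # predicted probability curve for each rank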

We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The output produced by summary(mylogit) includes indices of fit (shown below the coefficients), including the null and residual deviances and the AIC. One measure of model fit is the significance of the overall model. This test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance of the model with predictors and that of the null model. The test statistic is distributed chi-squared with degrees of freedom equal to the difference in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model). To find the difference in deviance for the two models (i.e., the test statistic) we can use the commands below.
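A sketch, using the components stored in the fitted glm object mylogit:

with(mylogit, null.deviance - deviance)   # chi-square test statistic
with(mylogit, df.null - df.residual)      # degrees of freedom
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))  # p-value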

10.8 Things to consider

Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases (a small cell), the model may become unstable or it might not run at all.

Separation or quasi-separation (also called perfect prediction): a condition in which the outcome does not vary at some levels of the independent variables. See our FAQ page, What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them?, for information on models with perfect prediction.

Sample size: Both logit and probit models require more cases than OLS regression because they use maximum likelihood estimation techniques. It is sometimes possible to estimate models for binary outcomes in datasets with only a small number of cases using exact logistic regression. It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a logit model.

Pseudo-R-squared: Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression; however, none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds, see Long and Freese (2006) or our FAQ page, What are pseudo R-squareds?

Diagnostics: The diagnostics for logistic regression are different from those for OLS regression. For a discussion of model diagnostics for logistic regression, see Hosmer and Lemeshow (2000, Chapter 5). Note that diagnostics done for logistic regression are similar to those done for probit regression.

10.9 Supplementary Learning Materials

Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley & Sons, NY.

Burns, R. P. & Burns R. (2008). Business research methods & statistics using SPSS. SAGE Publications.

Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles, CA: Sage Publications.

Statistics LibreTexts

5.7: Multiple Logistic Regression

  • John H. McDonald
  • University of Delaware

Learning Objectives

  • To use multiple logistic regression when you have one nominal variable and two or more measurement variables, and you want to know how the measurement variables affect the nominal variable. You can use it to predict probabilities of the dependent nominal variable, or if you're careful, you can use it for suggestions about which independent variables have a major effect on the dependent variable.

When to use it

Use multiple logistic regression when you have one nominal and two or more measurement variables. The nominal variable is the dependent (\(Y\)) variable; you are studying the effect that the independent (\(X\)) variables have on the probability of obtaining a particular value of the dependent variable. For example, you might want to know the effect that blood pressure, age, and weight have on the probability that a person will have a heart attack in the next year.

Heart attack vs. no heart attack is a binomial nominal variable; it only has two values. You can perform multinomial multiple logistic regression, where the nominal variable has more than two values, but I'm going to limit myself to binary multiple logistic regression, which is far more common.

The measurement variables are the independent (\(X\)) variables; you think they may have an effect on the dependent variable. While the examples I'll use here only have measurement variables as the independent variables, it is possible to use nominal variables as independent variables in a multiple logistic regression; see the explanation on the multiple linear regression page.

Epidemiologists use multiple logistic regression a lot, because they are concerned with dependent variables such as alive vs. dead or diseased vs. healthy, and they are studying people and can't do well-controlled experiments, so they have a lot of independent variables. If you are an epidemiologist, you're going to have to learn a lot more about multiple logistic regression than I can teach you here. If you're not an epidemiologist, you might occasionally need to understand the results of someone else's multiple logistic regression, and hopefully this handbook can help you with that. If you need to do multiple logistic regression for your own research, you should learn more than is on this page.

The goal of a multiple logistic regression is to find an equation that best predicts the probability of a value of the \(Y\) variable as a function of the \(X\) variables. You can then measure the independent variables on a new individual and estimate the probability of it having a particular value of the dependent variable. You can also use multiple logistic regression to understand the functional relationship between the independent variables and the dependent variable, to try to understand what might cause the probability of the dependent variable to change. However, you need to be very careful. Please read the multiple regression page for an introduction to the issues involved and the potential problems with trying to infer causes; almost all of the caveats there apply to multiple logistic regression, as well.

As an example of multiple logistic regression, in the 1800s, many people tried to bring their favorite bird species to New Zealand, release them, and hope that they become established in nature. (We now realize that this is very bad for the native species, so if you were thinking about trying this, please don't.) Veltman et al. (1996) wanted to know what determined the success or failure of these introduced species. They determined the presence or absence of \(79\) species of birds in New Zealand that had been artificially introduced (the dependent variable) and \(14\) independent variables, including number of releases, number of individuals released, migration (scored as \(1\) for sedentary, \(2\) for mixed, \(3\) for migratory), body length, etc. Multiple logistic regression suggested that number of releases, number of individuals released, and migration had the biggest influence on the probability of a species being successfully introduced to New Zealand, and the logistic regression equation could be used to predict the probability of success of a new introduction. While hopefully no one will deliberately introduce more exotic bird species to new territories, this logistic regression could help understand what will determine the success of accidental introductions or the introduction of endangered species to areas of their native range where they had been eliminated.

Null hypothesis

The main null hypothesis of a multiple logistic regression is that there is no relationship between the \(X\) variables and the \(Y\) variable; in other words, the \(Y\) values you predict from your multiple logistic regression equation are no closer to the actual \(Y\) values than you would expect by chance. As you are doing a multiple logistic regression, you'll also test a null hypothesis for each \(X\) variable, that adding that \(X\) variable to the multiple logistic regression does not improve the fit of the equation any more than expected by chance. While you will get \(P\) values for these null hypotheses, you should use them as a guide to building a multiple logistic regression equation; you should not use the \(P\) values as a test of biological null hypotheses about whether a particular \(X\) variable causes variation in \(Y\).

How it works

Multiple logistic regression finds the equation that best predicts the value of the \(Y\) variable for the values of the \(X\) variables. The \(Y\) variable is the probability of obtaining a particular value of the nominal variable. For the bird example, the values of the nominal variable are "species present" and "species absent." The \(Y\) variable used in logistic regression would then be the probability of an introduced species being present in New Zealand. This probability could take values from \(0\) to \(1\). The limited range of this probability would present problems if used directly in a regression, so the odds, \(Y/(1-Y)\), is used instead. (If the probability of a successful introduction is \(0.25\), the odds of having that species are \(0.25/(1-0.25)=1/3\). In gambling terms, this would be expressed as "\(3\) to \(1\) odds against having that species in New Zealand.") Taking the natural log of the odds makes the variable more suitable for a regression, so the result of a multiple logistic regression is an equation that looks like this:

\[\ln \left [ \frac{Y}{1-Y} \right ]=a+b_1X_1+b_2X_2+b_3X_3+...\]

You find the slopes (\(b_1,\; b_2\), etc.) and intercept (\(a\)) of the best-fitting equation in a multiple logistic regression using the maximum-likelihood method, rather than the least-squares method used for multiple linear regression. Maximum likelihood is a computer-intensive technique; the basic idea is that it finds the values of the parameters under which you would be most likely to get the observed results.

You might want to have a measure of how well the equation fits the data, similar to the \(R^2\) of multiple linear regression. However, statisticians do not agree on the best measure of fit for multiple logistic regression. Some use deviance, \(D\), for which smaller numbers represent better fit, and some use one of several pseudo-\(R^2\) values, for which larger numbers represent better fit.

Using nominal variables in a multiple logistic regression

You can use nominal variables as independent variables in multiple logistic regression; for example, Veltman et al. (1996) included upland use (frequent vs. infrequent) as one of their independent variables in their study of birds introduced to New Zealand. See the discussion on the multiple linear regression page about how to do this.

Selecting variables in multiple logistic regression

Whether the purpose of a multiple logistic regression is prediction or understanding functional relationships, you'll usually want to decide which variables are important and which are unimportant. In the bird example, if your purpose was prediction it would be useful to know that your prediction would be almost as good if you measured only three variables and didn't have to measure more difficult variables such as range and weight. If your purpose was understanding possible causes, knowing that certain variables did not explain much of the variation in introduction success could suggest that they are probably not important causes of the variation in success.

The procedures for choosing variables are basically the same as for multiple linear regression: you can use an objective method (forward selection, backward elimination, or stepwise), or you can use a careful examination of the data and understanding of the biology to subjectively choose the best variables. The main difference is that instead of using the change of \(R^2\) to measure the difference in fit between an equation with or without a particular variable, you use the change in likelihood. Otherwise, everything about choosing variables for multiple linear regression applies to multiple logistic regression as well, including the warnings about how easy it is to get misleading results.

Assumptions

Multiple logistic regression assumes that the observations are independent. For example, if you were studying the presence or absence of an infectious disease and had subjects who were in close contact, the observations might not be independent; if one person had the disease, people near them (who might be similar in occupation, socioeconomic status, age, etc.) would be likely to have the disease. Careful sampling design can take care of this.

Multiple logistic regression also assumes that the natural log of the odds ratio and the measurement variables have a linear relationship. It can be hard to see whether this assumption is violated, but if you have biological or statistical reasons to expect a non-linear relationship between one of the measurement variables and the log of the odds ratio, you may want to try data transformations.

Multiple logistic regression does not assume that the measurement variables are normally distributed.

Some obese people get gastric bypass surgery to lose weight, and some of them die as a result of the surgery. Benotti et al. (2014) wanted to know whether they could predict who was at a higher risk of dying from one particular kind of surgery, Roux-en-Y gastric bypass surgery. They obtained records on \(81,751\) patients who had had Roux-en-Y surgery, of which \(123\) died within \(30\) days. They did multiple logistic regression, with alive vs. dead after \(30\) days as the dependent variable, and \(6\) demographic variables (gender, age, race, body mass index, insurance type, and employment status) and \(30\) health variables (blood pressure, diabetes, tobacco use, etc.) as the independent variables. Manually choosing the variables to add to their logistic model, they identified six that contribute to risk of dying from Roux-en-Y surgery: body mass index, age, gender, pulmonary hypertension, congestive heart failure, and liver disease.

Benotti et al. (2014) did not provide their multiple logistic equation, perhaps because they thought it would be too confusing for surgeons to understand. Instead, they developed a simplified version (one point for every decade over \(40\), \(1\) point for every \(10\) BMI units over \(40\), \(1\) point for male, \(1\) point for congestive heart failure, \(1\) point for liver disease, and \(2\) points for pulmonary hypertension). Using this RYGB Risk Score they could predict that a \(43\)-year-old woman with a BMI of \(46\) and no heart, lung or liver problems would have an \(0.03\%\) chance of dying within \(30\) days, while a \(62\)-year-old man with a BMI of \(52\) and pulmonary hypertension would have a \(1.4\%\) chance.

Graphing the results

Graphs aren't very useful for showing the results of multiple logistic regression; instead, people usually just show a table of the independent variables, with their \(P\) values and perhaps the regression coefficients.

Similar tests

If the dependent variable is a measurement variable, you should do multiple linear regression.

There are numerous other techniques you can use when you have one nominal and three or more measurement variables, but I don't know enough about them to list them, much less explain them.

How to do multiple logistic regression

Spreadsheet.

I haven't written a spreadsheet to do multiple logistic regression.

There's a very nice web page for multiple logistic regression. It will not do automatic selection of variables; if you want to construct a logistic model with fewer independent variables, you'll have to pick the variables yourself.

Salvatore Mangiafico's \(R\) Companion has a sample R program for multiple logistic regression.

You use PROC LOGISTIC to do multiple logistic regression in SAS. Here is an example using the data on bird introductions to New Zealand.

DATA birds; INPUT species $ status $ length mass range migr insect diet clutch broods wood upland water release indiv; DATALINES; Cyg_olor 1 1520 9600 1.21 1 12 2 6 1 0 0 1 6 29 Cyg_atra 1 1250 5000 0.56 1 0 1 6 1 0 0 1 10 85 Cer_nova 1 870 3360 0.07 1 0 1 4 1 0 0 1 3 8 Ans_caer 0 720 2517 1.1 3 12 2 3.8 1 0 0 1 1 10 Ans_anse 0 820 3170 3.45 3 0 1 5.9 1 0 0 1 2 7 Bra_cana 1 770 4390 2.96 2 0 1 5.9 1 0 0 1 10 60 Bra_sand 0 50 1930 0.01 1 0 1 4 2 0 0 0 1 2 Alo_aegy 0 680 2040 2.71 1 . 2 8.5 1 0 0 1 1 8 Ana_plat 1 570 1020 9.01 2 6 2 12.6 1 0 0 1 17 1539 Ana_acut 0 580 910 7.9 3 6 2 8.3 1 0 0 1 3 102 Ana_pene 0 480 590 4.33 3 0 1 8.7 1 0 0 1 5 32 Aix_spon 0 470 539 1.04 3 12 2 13.5 2 1 0 1 5 10 Ayt_feri 0 450 940 2.17 3 12 2 9.5 1 0 0 1 3 9 Ayt_fuli 0 435 684 4.81 3 12 2 10.1 1 0 0 1 2 5 Ore_pict 0 275 230 0.31 1 3 1 9.5 1 1 1 0 9 398 Lop_cali 1 256 162 0.24 1 3 1 14.2 2 0 0 0 15 1420 Col_virg 1 230 170 0.77 1 3 1 13.7 1 0 0 0 17 1156 Ale_grae 1 330 501 2.23 1 3 1 15.5 1 0 1 0 15 362 Ale_rufa 0 330 439 0.22 1 3 2 11.2 2 0 0 0 2 20 Per_perd 0 300 386 2.4 1 3 1 14.6 1 0 1 0 24 676 Cot_pect 0 182 95 0.33 3 . 2 7.5 1 0 0 0 3 . Cot_aust 1 180 95 0.69 2 12 2 11 1 0 0 1 11 601 Lop_nyct 0 800 1150 0.28 1 12 2 5 1 1 1 0 4 6 Pha_colc 1 710 850 1.25 1 12 2 11.8 1 1 0 0 27 244 Syr_reev 0 750 949 0.2 1 12 2 9.5 1 1 1 0 2 9 Tet_tetr 0 470 900 4.17 1 3 1 7.9 1 1 1 0 2 13 Lag_lago 0 390 517 7.29 1 0 1 7.5 1 1 1 0 2 4 Ped_phas 0 440 815 1.83 1 3 1 12.3 1 1 0 0 1 22 Tym_cupi 0 435 770 0.26 1 4 1 12 1 0 0 0 3 57 Van_vane 0 300 226 3.93 2 12 3 3.8 1 0 0 0 8 124 Plu_squa 0 285 318 1.67 3 12 3 4 1 0 0 1 2 3 Pte_alch 0 350 225 1.21 2 0 1 2.5 2 0 0 0 1 8 Pha_chal 0 320 350 0.6 1 12 2 2 2 1 0 0 8 42 Ocy_loph 0 330 205 0.76 1 0 1 2 7 1 0 1 4 23 Leu_mela 0 372 . 0.07 1 12 2 2 1 1 0 0 6 34 Ath_noct 1 220 176 4.84 1 12 3 3.6 1 1 0 0 7 221 Tyt_alba 0 340 298 8.9 2 0 3 5.7 2 1 0 0 1 7 Dac_nova 1 460 382 0.34 1 12 3 2 1 1 0 0 7 21 Lul_arbo 0 150 32.1 1.78 2 4 2 3.9 2 1 0 0 1 5 Ala_arve 1 185 38.9 5.19 2 12 2 3.7 3 0 0 0 11 391 Pru_modu 1 145 20.5 1.95 2 12 2 3.4 2 1 0 0 14 245 Eri_rebe 0 140 15.8 2.31 2 12 2 5 2 1 0 0 11 123 Lus_mega 0 161 19.4 1.88 3 12 2 4.7 2 1 0 0 4 7 Tur_meru 1 255 82.6 3.3 2 12 2 3.8 3 1 0 0 16 596 Tur_phil 1 230 67.3 4.84 2 12 2 4.7 2 1 0 0 12 343 Syl_comm 0 140 12.8 3.39 3 12 2 4.6 2 1 0 0 1 2 Syl_atri 0 142 17.5 2.43 2 5 2 4.6 1 1 0 0 1 5 Man_mela 0 180 . 0.04 1 12 3 1.9 5 1 0 0 1 2 Man_mela 0 265 59 0.25 1 12 2 2.6 . 1 0 0 1 80 Gra_cyan 0 275 128 0.83 1 12 3 3 2 1 0 1 1 . Gym_tibi 1 400 380 0.82 1 12 3 4 1 1 0 0 15 448 Cor_mone 0 335 203 3.4 2 12 2 4.5 1 1 0 0 2 3 Cor_frug 1 400 425 3.73 1 12 2 3.6 1 1 0 0 10 182 Stu_vulg 1 222 79.8 3.33 2 6 2 4.8 2 1 0 0 14 653 Acr_tris 1 230 111.3 0.56 1 12 2 3.7 1 1 0 0 5 88 Pas_dome 1 149 28.8 6.5 1 6 2 3.9 3 1 0 0 12 416 Pas_mont 0 133 22 6.8 1 6 2 4.7 3 1 0 0 3 14 Aeg_temp 0 120 . 0.17 1 6 2 4.7 3 1 0 0 3 14 Emb_gutt 0 120 19 0.15 1 4 1 5 3 0 0 0 4 112 Poe_gutt 0 100 12.4 0.75 1 4 1 4.7 3 0 0 0 1 12 Lon_punc 0 110 13.5 1.06 1 0 1 5 3 0 0 0 1 8 Lon_cast 0 100 . 0.13 1 4 1 5 . 0 0 1 4 45 Pad_oryz 0 160 . 0.09 1 0 1 5 . 0 0 0 2 6 Fri_coel 1 160 23.5 2.61 2 12 2 4.9 2 1 0 0 17 449 Fri_mont 0 146 21.4 3.09 3 10 2 6 . 1 0 0 7 121 Car_chlo 1 147 29 2.09 2 7 2 4.8 2 1 0 0 6 65 Car_spin 0 117 12 2.09 3 3 1 4 2 1 0 0 3 54 Car_card 1 120 15.5 2.85 2 4 1 4.4 3 1 0 0 14 626 Aca_flam 1 115 11.5 5.54 2 6 1 5 2 1 0 0 10 607 Aca_flavi 0 133 17 1.67 2 0 1 5 3 0 1 0 3 61 Aca_cann 0 136 18.5 2.52 2 6 1 4.7 2 1 0 0 12 209 Pyr_pyrr 0 142 23.5 3.57 1 4 1 4 3 1 0 0 2 . 
Emb_citr 1 160 28.2 4.11 2 8 2 3.3 3 1 0 0 14 656 Emb_hort 0 163 21.6 2.75 3 12 2 5 1 0 0 0 1 6 Emb_cirl 1 160 23.6 0.62 1 12 2 3.5 2 1 0 0 3 29 Emb_scho 0 150 20.7 5.42 1 12 2 5.1 2 0 0 1 2 9 Pir_rubr 0 170 31 0.55 3 12 2 4 . 1 0 0 1 2 Age_phoe 0 210 36.9 2 2 8 2 3.7 1 0 0 1 1 2 Stu_negl 0 225 106.5 1.2 2 12 2 4.8 2 0 0 0 1 2
;
PROC LOGISTIC DATA=birds DESCENDING;
  MODEL status=length mass range migr insect diet clutch broods wood upland water release indiv / SELECTION=STEPWISE SLENTRY=0.15 SLSTAY=0.15;
RUN;

In the MODEL statement, the dependent variable is to the left of the equals sign, and all the independent variables are to the right. SELECTION determines which variable selection method is used; choices include FORWARD, BACKWARD, STEPWISE, and several others. You can omit the SELECTION parameter if you want to see the logistic regression model that includes all the independent variables. SLENTRY is the significance level for entering a variable into the model, if you're using FORWARD or STEPWISE selection; in this example, a variable must have a \(P\) value less than \(0.15\) to be entered into the regression model. SLSTAY is the significance level for removing a variable in BACKWARD or STEPWISE selection; in this example, a variable with a \(P\) value greater than \(0.15\) will be removed from the model.

Summary of Stepwise Selection

Step  Effect Entered  Effect Removed  DF  Number In  Score Chi-Square  Wald Chi-Square  Pr > ChiSq
1     release                         1   1          28.4339                            <.0001
2     upland                          1   2           5.6871                            0.0171
3     migr                            1   3           5.3284                            0.0210

The summary shows that "release" was added to the model first, yielding a \(P\) value less than \(0.0001\). Next, "upland" was added, with a \(P\) value of \(0.0171\). Next, "migr" was added, with a \(P\) value of \(0.0210\). SLSTAY was set to \(0.15\), not \(0.05\), because you might want to include a variable in a predictive model even if it's not quite significant. However, none of the other variables have a \(P\) value less than \(0.15\), and removing any of the variables caused a decrease in fit big enough that \(P\) was less than \(0.15\), so the stepwise process is done.

Analysis of Maximum Likelihood Estimates

Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept  1   -0.4653   1.1226           0.1718          0.6785
migr       1   -1.6057   0.7982           4.0464          0.0443
upland     1   -6.2721   2.5739           5.9380          0.0148
release    1    0.4247   0.1040          16.6807          <.0001

The "parameter estimates" are the partial regression coefficients; they show that the model is:

\[\ln \left [ \frac{Y}{1-Y} \right ]=-0.4653-1.6057(migration)-6.2721(upland)+0.4247(release)\]

Power analysis

You need to have several times as many observations as you have independent variables, otherwise you can get "overfitting"—it could look like every independent variable is important, even if they're not. A frequently seen rule of thumb is that you should have at least \(10\) to \(20\) times as many observations as you have independent variables. I don't know how to do a more detailed power analysis for multiple logistic regression.

Benotti, P., G.C. Wood, D.A. Winegar, A.T. Petrick, C.D. Still, G. Argyropoulos, and G.S. Gerhard. 2014. Risk factors associated with mortality after Roux-en-Y gastric bypass surgery. Annals of Surgery 259: 123-130.

Veltman, C.J., S. Nee, and M.J. Crawley. 1996. Correlates of introduction success in exotic New Zealand birds. American Naturalist 147: 542-557.

Advanced Statistics using R


Logistic Regression

Logistic regression is widely used in social and behavioral research for analyzing binary (dichotomous) outcome data. In logistic regression, the outcome can take only two values, 0 and 1. Some examples that can utilize logistic regression are given below.

  • Whether a Democratic or Republican president is elected can depend on factors such as economic status, the amount of money spent on the campaign, and the gender and income of the voters.
  • Whether an assistant professor can be tenured may be predicted from the number of publications and teaching performance in the first three years.
  • Whether or not someone has a heart attack may be related to age, gender and living habits.
  • Whether a student is admitted may be predicted by her/his high school GPA, SAT score, and quality of recommendation letters.

We use an example to illustrate how to conduct logistic regression in R.

In this example, the aim is to predict whether a woman is in compliance with mammography screening recommendations from four predictors, one reflecting medical input and three reflecting a woman's psychological status with regard to screening.

  • Outcome y: whether a woman is in compliance with mammography screening recommendations (1: in compliance; 0: not in compliance)
  • x1: whether she has received a recommendation for screening from a physician;
  • x2: her knowledge about breast cancer and mammography screening;
  • x3: her perception of benefit of such a screening;
  • x4: her perception of the barriers to being screened.

Basic ideas

With a binary outcome, linear regression no longer works. Simply speaking, the predictors can take any value but the outcome cannot, so a linear regression cannot predict the outcome well. To deal with this problem, we instead model the probability of observing an outcome of 1, that is, $p = \Pr(y=1)$. In the mammography example, this is the probability that a woman is in compliance with the screening recommendation.

Directly modeling the probability already works better than predicting the 1/0 outcome. A remaining problem is that the probability is bounded between 0 and 1, while the predicted values from a linear model generally are not. To deal with this, we apply the transformation

\[ \eta = \log\frac{p}{1-p}.\]

After the transformation, $\eta$ can take any value, from $-\infty$ when $p=0$ to $\infty$ when $p=1$. This transformation is called the logit transformation, denoted by $\text{logit}(p)$. Note that $p_{i}/(1-p_{i})$ is called the odds, which is simply the ratio of the probabilities of the two possible outcomes. For example, if the probability that a woman is in compliance is 0.8, then her odds are 0.8/(1-0.8)=4. Clearly, when the two outcomes are equally probable, the odds equal 1, and if the odds are greater than 1, the probability of observing the outcome 1 is higher than 0.5. With this transformation, $\eta$ can be modeled directly.

Therefore, the logistic regression is

\[ \mbox{logit}(p_{i})=\log(\frac{p_{i}}{1-p_{i}})=\eta_i=\beta_{0}+\beta_{1}x_{1i}+\ldots+\beta_{k}x_{ki} \]

where $p_i = \Pr(y_i = 1)$. Unlike regular linear regression, no residual term is used in the model.

Why is this?

For a variable $y$ with two and only two outcome values, it is often assumed it follows a Bernoulli or binomial  distribution with the probability $p$ for the outcome 1 and probability $1-p$ for 0. The density function is

\[ p^y (1-p)^{1-y}. \] 

Note that when $y=1$, $p^y (1-p)^{1-y} = p$ exactly.

Furthermore, we assume there is a continuous variable $y^*$ underlying the observed binary variable. If the continuous variable takes a value larger than a certain threshold, we observe a 1; otherwise, we observe a 0. For logistic regression, we assume the continuous variable follows a logistic distribution, whose density function is

\[ \frac{e^{-y^*}}{(1+e^{-y^*})^{2}} .\]

The probability of observing a 1 can then be calculated from the corresponding cumulative distribution function:

\[ p = \frac{1}{1 + e^{-y^*}},\]

which transforms to 

\[ \log\frac{p}{1-p} = y^*.\]

For $y^*$, since it is a continuous variable, it can be predicted as in a regular regression model.

Fitting a logistic regression model in R

In R, the model can be estimated using the glm() function; logistic regression is one example of a generalized linear model (GLM). The analysis of the mammography data is given below.
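A minimal sketch of that call, assuming the mammography data are in a data frame named mamm with the variables y, x1, x2, x3, and x4 described above:

# fit the logistic regression of compliance on the four predictors
m1 <- glm(y ~ x1 + x2 + x3 + x4, family = binomial(link = "logit"), data = mamm)
summary(m1)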

  • glm uses the same model formula as the linear regression model.
  • family = specifies the distribution of the outcome variable. For binary data, the binomial distribution is used.
  • link = specifies the transformation method. Here, the logit transformation is used.
  • The output includes the regression coefficients and their z-statistics and p-values.
  • The dispersion parameter is related to the variance of the response variable.

Interpret the results

We first focus on how to interpret the parameter estimates from the analysis. For the intercept, when all the predictors take the value 0, we have

\[ \beta_0 = \log(\frac{p}{1-p}), \]

which is the log odds that the observed outcome is 1.

We now look at the coefficient for each predictor. For the mammography example, let's assume $x_2$, $x_3$, and $x_4$ are the same and look at $x_1$ only. If a woman has received a recommendation ($x_1=1$), then the log odds is

\[ \log(\frac{p}{1-p})|(x_1=1)=\beta_{0}+\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]

If a woman has not received a recommendation ($x_1=0$), then the log odds is

\[\log(\frac{p}{1-p})|(x_1=0)=\beta_{0}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]

The difference is

\[\log(\frac{p}{1-p})|(x_1=1)-\log(\frac{p}{1-p})|(x_1=0)=\beta_{1}.\]

Therefore, the logistic regression coefficient for a predictor is the difference in the log odds when the predictor increases by 1 unit, holding the other predictors unchanged.

The above equation is equivalent to

\[\log\left(\frac{\frac{p(x_1=1)}{1-p(x_1=1)}}{\frac{p(x_1=0)}{1-p(x_1=0)}}\right)=\beta_{1}.\]

More descriptively, we have

\[\log\left(\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}\right)=\beta_{1}.\]

Therefore, the regression coefficient is the log odds ratio. By a simple transformation, we have

\[\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}=\exp(\beta_{1})\]

\[\mbox{ODDS(received recommendation)} = \exp(\beta_{1})*\mbox{ODDS(not received recommendation)}.\]

Therefore, the exponential of a regression coefficient is the odds ratio. For this example, $\exp(\beta_{1})=\exp(1.7731)=5.9$. Thus, the odds of compliance with screening for women who received a recommendation are about 5.9 times the odds for those who did not receive a recommendation.

For continuous predictors, the regression coefficients can be interpreted in the same way. For example, we may say that if high school GPA increases by one unit, the odds that a student is admitted increase about 6 times, given the other variables stay the same.

Although the output does not directly show odds ratio, they can be calculated easily in R as shown below.
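For instance, reusing the m1 fit sketched above:

exp(coef(m1))   # exponentiated coefficients are the odds ratios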

Using odds ratios, we can interpret the parameters as follows.

  • For x1, if a woman receives a screening recommendation, the odds that she is in compliance with screening are about 5.9 times the odds of a woman who does not receive a recommendation, given x2, x3, and x4 the same. Alternatively (and perhaps more intuitively), if a woman receives a screening recommendation, the odds that she is in compliance with screening increase about 4.9 times (5.889 - 1 = 4.889 ≈ 4.9), given the other variables the same.
  • For x2, if a woman has one unit more knowledge on breast cancer and mammography screening, the odds for her to be in compliance with screening decreases 58.1% (.419-1=-58.1%, negative number means decrease), keeping other variables constant.
  • For x3, if a woman's perception about the benefit increases one unit, the odds for her to be in compliance with screening increases 81% (1.81-1=81%, positive number means increase), keeping other variables constant.
  • For x4, if a woman's perception about the barriers increases one unit, the odds for her to be in compliance with screening decreases 14.2% (.858-1=-14.2%, negative number means decrease), keeping other variables constant.

Statistical inference for logistic regression

Statistical inference for logistic regression is very similar to statistical inference for simple linear regression. We can (1) conduct significance testing for each parameter, (2) test the overall model, and (3) test a subset of predictors.

Test a single coefficient (z-test and confidence interval)

For each regression coefficient of the predictors, we can use a z-test (note: not a t-test). The output gives the z-values and the corresponding p-values. For x1 and x3, the coefficients are significant at the alpha level 0.05, but for x2 and x4 they are not. Note that some software outputs a Wald statistic for testing significance. The Wald statistic is the square of the z-statistic, so the Wald test gives the same conclusion as the z-test.

We can also conduct the hypothesis testing by constructing confidence intervals. With the model, the function confint() can be used to obtain the confidence interval. Since one is often interested in odds ratio, its confidence interval can also be obtained. 
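A sketch, again using the m1 fit from earlier:

confint(m1)        # profile-likelihood confidence intervals for the coefficients
exp(confint(m1))   # confidence intervals for the odds ratios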

Note that if the CI for odds ratio includes 1, it means nonsignificance. If it does not include 1, the coefficient is significant. This is because for the original coefficient, we compare the CI with 0. For odds ratio, exp(0)=1.

If we were reporting the results in terms of the odds and its CI, we could say, “The odds of in compliance to screening increases by a factor of 5.9 if receiving screening recommendation (z=3.66, P = 0.0002; 95% CI = 2.38 to 16.23) given everything else the same.”

Test the overall model

For the linear regression, we evaluate the overall model fit by looking at the variance explained by all the predictors. For the logistic regression, we cannot calculate a variance. However, we can define and evaluate the deviance instead. For a model without any predictor, we can calculate a null deviance, which is similar to variance for the normal outcome variable. After including the predictors, we have the residual deviance. The difference between the null deviance and the residual deviance tells how much the predictors help predict the outcome. If the difference is significant, then overall, the predictors are significant statistically.

The difference, or decrease, in deviance after including the predictors follows a chi-square ($\chi^{2}$) distribution. The chi-square distribution is widely used in statistical inference and is closely related to the F distribution: the ratio of two independent chi-square variables, each divided by its degrees of freedom, follows an F distribution, and as the denominator degrees of freedom of an F distribution goes to infinity, the numerator degrees of freedom times the F statistic approaches a chi-square distribution.

There are two ways to conduct the test. From the output, we can find the null and residual deviances and the corresponding degrees of freedom, and then calculate the differences. For the mammography example, the difference between the null deviance and the residual deviance is 203.32 - 155.48 = 47.84, and the difference in degrees of freedom is 163 - 159 = 4. The p-value can then be calculated from a chi-square distribution with 4 degrees of freedom. Because the p-value is smaller than 0.05, the overall model is significant.
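Using the numbers from the output, the p-value can be computed directly (a sketch):

pchisq(203.32 - 155.48, df = 163 - 159, lower.tail = FALSE)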

The test can also be conducted more simply in another way. We first fit a model without any predictors and another model with all the predictors. Then, we can use anova() to get the difference in deviance and the chi-square test result.
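A sketch of this approach, reusing the data frame and the m1 fit from earlier:

m0 <- glm(y ~ 1, family = binomial, data = mamm)   # intercept-only (null) model
anova(m0, m1, test = "Chisq")                      # chi-square test of the overall model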

Test a subset of predictors

We can also test the significance of a subset of predictors, for example, whether x3 and x4 are significant above and beyond x1 and x2. This can also be done using the chi-square test based on the difference in deviance. In this case, we compare a model with all predictors and a model without x3 and x4 to see if the change in deviance is significant. In this example, the p-value is 0.002, indicating the change is significant. Therefore, x3 and x4 are statistically significant above and beyond x1 and x2.
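A sketch of that comparison:

m12 <- glm(y ~ x1 + x2, family = binomial, data = mamm)   # model without x3 and x4
anova(m12, m1, test = "Chisq")                            # test of x3 and x4 above and beyond x1 and x2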

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.

Logistic Regression and Survival Analysis


Logistic Regression in R


To perform logistic regression in R, you need to use the glm() function. Here, glm stands for "generalized linear model." Suppose we want to run the above logistic regression model in R; we use the following command:

> summary( glm( vomiting ~ age, family = binomial(link = logit) ) )

glm(formula = vomiting ~ age, family = binomial(link = logit))

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-1.0671  -1.0174  -0.9365   1.3395   1.9196 

Coefficients:

             Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.141729   0.106206  -1.334    0.182   

age         -0.015437   0.003965  -3.893 9.89e-05 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1452.3   on 1093  degrees of freedom

Residual deviance: 1433.9 on 1092  degrees of freedom

AIC: 1437.9

Number of Fisher Scoring iterations: 4

To get the significance for the overall model we use the following command:

> 1-pchisq(1452.3-1433.9, 1093-1092)

 [1] 1.79058e-05

The input to this test is:

  • deviance of the "null" model minus deviance of the current model (this difference in deviance is the likelihood ratio test statistic)
  • degrees of freedom of the null model minus df of current model

This is analogous to the global F test for the overall significance of the model that comes automatically when we run the lm() command. This is testing the null hypothesis that the model is no better (in terms of likelihood) than a model fit with only the intercept term, i.e. that all beta terms are 0.

Thus the logistic model for these data is:

ln[ odds(vomiting) ] = -0.14 - 0.02*age

This means that for a one-unit increase in age there is a 0.02 decrease in the log odds of vomiting. This can be translated to e^(-0.02) = 0.98. Groups of people in an age group one unit higher than a reference group have, on average, 0.98 times the odds of vomiting.

How do we test the association between vomiting and age?

  • H 0 : There is no association between vomiting and age (the odds ratio is equal to 1).
  • H a : There is an association between vomiting and age (the odds ratio is not equal to 1).

When testing the null hypothesis that there is no association between vomiting and age we reject the null hypothesis at the 0.05 alpha level ( z = -3.89, p-value = 9.89e-05).

On average, the odds of vomiting is 0.98 times that of identical subjects in an age group one unit smaller.

Finally, when we are looking at whether we should include a particular variable in our model (maybe it's a confounder), we can include it based on the "10% rule": if our estimate of interest changes by more than 10% when we include the new covariate in the model, then we keep that new covariate in our model. When we do this in logistic regression, we compare the exponential of the betas, not the untransformed betas themselves!

Creative Commons license Attribution Non-commercial


A reminder of how hypothesis tests work

Two hypothesis tests are offered by Prism for assessing how well a model fits the entered data. Like other hypothesis-based tests that you’ve likely encountered, these two tests start by defining a null hypothesis (H0). Each test then calculates a statistic and a corresponding P value. The calculated P value represents the probability of obtaining a test statistic as large as the one calculated (or larger) if the null hypothesis were true. These tests also require that a threshold be set for how small a P value must be to make a decision about whether to reject the null hypothesis. This admittedly arbitrary threshold is typically set to 0.05 and is also referred to as the alpha (α) level. If the obtained P value is smaller than α, then we reject the null hypothesis.

Before reading about each of the tests below, it's also important to understand what is meant by the terms "specified model" and "intercept-only model". The specified model is simply the model that was fit to the data (the one that you defined on the Model tab of the analysis). It includes the effects of the predictor variables on the probability of observing a success. The intercept-only model is a model that assumes that the contributions from the predictor variables are zero. In other words, the intercept-only model assumes that none of the independent variables help predict the outcome.

Hosmer-Lemeshow (HL) Test

The Hosmer-Lemeshow test is a classic hypothesis test for logistic regression. The null hypothesis is that the specified model is correct (that it fits well). The way the test works is by first sorting the observations by their predicted probability, and splitting them into 10 groups of equal numbers of observations (N). For each group, the average predicted probability is calculated, and that average is multiplied by N to get an expected number of 1s for that group (and in turn the expected number of 0s for that group). It then calculates a Pearson goodness of fit statistic using the observed and expected numbers of 0s and 1s, and summing across all of the groups. It uses a chi-squared distribution to then calculate a P value.
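Prism does this internally, but as an illustration of the procedure described above, a rough R sketch (not Prism's code) for a fitted glm object could look like this:

hosmer_lemeshow <- function(fit, g = 10) {
  p <- fitted(fit)                                  # predicted probabilities
  y <- fit$y                                        # observed 0/1 outcomes
  # split observations into g groups by predicted probability
  grp <- cut(p, quantile(p, seq(0, 1, length.out = g + 1)), include.lowest = TRUE)
  obs1 <- tapply(y, grp, sum)                       # observed 1s per group
  exp1 <- tapply(p, grp, sum)                       # expected 1s per group
  n <- tapply(y, grp, length)                       # group sizes
  chisq <- sum((obs1 - exp1)^2 / exp1 + ((n - obs1) - (n - exp1))^2 / (n - exp1))
  c(chisq = chisq, df = g - 2, p.value = pchisq(chisq, g - 2, lower.tail = FALSE))
}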

As mentioned, the null hypothesis is that the specified model fits well, so contrary to many tests, a small P value indicates a poor fit of the model to the data. Another way to think about this is that a small P value indicates that there is more deviation from the expected number of 0s and 1s (given 10 bins) than you'd expect by chance. Thus, there may be some additional factor, interaction, or transform that is missing from the model.

This test has received criticism for the arbitrary number of bins (10), since it has been shown that changing this number can influence the result of the test. The test is included as an option in Prism so you can compare results obtained in Prism with results calculated elsewhere, even though this test is not recommended.

Log likelihood ratio test (LRT)

The log likelihood ratio test is also a classic test that compares how well the model selected fits compared to the intercept only model. In this case, the null hypothesis is that the intercept-only model fits best, so a small P value here indicates that you would reject this null hypothesis (or that the specified model outperforms the intercept only model). As the name implies, this test uses the log likelihood of the specified model and the intercept-only model to calculate the associated statistic and P value. Although this test specifically looks at the defined model and an intercept-only model, it is the same test that is provided on the Compare tab of the multiple logistic regression parameters dialog to compare any two nested models.

The likelihood ratio tells you how much more likely the data were to be generated by the fit model than by the intercept-only model. If the independent variables do a good job of predicting the outcome, the likelihood ratio will be high and the corresponding P value will be small.
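As an illustration of the computation (outside of Prism), the test statistic is twice the difference in log likelihoods; a generic R sketch with placeholder model names full_fit and null_fit:

lr <- as.numeric(2 * (logLik(full_fit) - logLik(null_fit)))        # likelihood ratio statistic
df <- attr(logLik(full_fit), "df") - attr(logLik(null_fit), "df")  # difference in number of parameters
pchisq(lr, df, lower.tail = FALSE)                                 # P value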

Note the different meaning of the P values

A small P value has opposite meanings with the two tests:

• A small P value from the Hosmer-Lemeshow test means that the specified model doesn't do a good  job of predicting the data. Consider whether you need additional independent variables or interactions included in the model.

• A small P value from the likelihood ratio test means that the intercept-only model doesn't do a good job of predicting the data. The specified independent variables and interactions improve the fit of the model to the data.

© 1995- 2019 GraphPad Software, LLC. All rights reserved.

Intro To Data Science Part 5: Linear And Logistic Regression

So you’ve found a trove of data. And you’ve found a couple of correlations. What do you do with it all when you’re trying to build a sports betting model? It’s time to apply a linear or logistic regression.

First, let’s take a quick look back at how we got here:

  • In Part 1 of this series, we introduced the course and laid out what you’ll need to follow along.
  • In Part 2, we talked about where to find data .
  • Part 3 was when we looked at some basic ways to find correlations in your data .
  • And most recently in Part 4 we started to test out our hypothesis . We came into this exercise thinking that there might be a strong relationship between a team’s hard-hit rate and its runs scored. 

Now it’s time to take the information we’ve gathered and use it to build a model around our hypothesis.

What Is a Sports Betting Model?

At the most fundamental level, a model yields a prediction based on the data and variables we used to create it.

To build your model, you need a dependent variable, or “response variable.” In this case, it’s runs scored. The independent variables are the ones we think explain the dependent variable. We’re going to start with a team’s hard-hit rate on the season, and then we’ll add in hard-hit rate over the last 10 games and see if streaks matter.

When you’re building a model, you don’t normally want to throw all possible independent variables into a blender and see what you get. It can lead to some messy, inconclusive answers. There are already enough blind alleys in sports betting without making more for yourself.

If you’re just starting out in modeling, a better approach is to simplify. Use one or a small handful of variables, then analyze the results. You can always iterate after the fact.

Checking the Results

We’ll talk about analyzing your results in a later article, but there are a number of ways to do it. The most commonly used is R-squared, which measures how much of the dependent variable’s variance is explained by the model. 

R-squared exists on a scale of 0 to 1. An R-squared of 0 means your model explains none of the variance of the thing you’re trying to predict, while an R-squared of 0.5 means your model explains 50 percent of that variance.

Linear vs. Logistic

Once the dependent and independent variables are in place, the next step is to apply a regression.

There are two bread-and-butter regressions: linear and logistic regression. Linear and logistic regressions are usually the first tools most statisticians use when analyzing data.

(In fact, most machine learning algorithms are just fancy regressions. Don’t tell ChatGPT we said that. We don’t want to be on the hook after the singularity happens.)

These two methods are easy to use, and your results with linear and logistic regressions can help point you to which more advanced techniques or refinements are appropriate for your model.

Linear regressions are usually used for continuous outcome variables, and logistic regressions are better for binary outcomes. A continuous variable is something like “How many runs will be scored in this game?” A binary variable is a yes/no proposition, like “Will this game go over the total?”

If you think of your data on a plot where one axis is your independent variable (in our case, hard-hit rate) and the other is your dependent variable (runs scored), a linear regression will find the most efficient line through all of those data points.

Here’s a very simplified representation of a linear regression. You have a line that shows the best fit between your variables. It shows us the relationship between the x-axis (the independent variable) and the y-axis (the dependent variable). In this case, as the value of the x-axis increases, we can expect the value of the y-axis to increase as well.

Planning Out the Variables

We’ll do both kinds of regressions by the end, and we want to use the same variables in each. The data we’ve already gathered has the information we need on runs scored, runs allowed, and hard-hit rate. Runs scored is our dependent variable; hard-hit rate (along with the opponent’s runs allowed) is our independent variable. First, we’ll run a regression for the home team, then the away team.

Once that’s all in place, we’ll introduce a second variable: hard hit over a team’s last 10 games. We’ll use this to see if our hypothesis about hot/cold streaks has any merit.

Starting Your Analysis

If you’re working in Excel, you can use built-in linear regression functions to analyze your data. This is where we have to step in and cajole you into taking up R, though. It’s a much faster way to process large data sets. And we promise, even if you’ve never done any coding before, it’s not as painful as it looks. Pinky swear. Re-read the previous installments in this series where we explain all the code line by line.

If you want to stick with Excel, though, you can do all the work in a spreadsheet. To perform a linear regression, use the Data Analysis function under the Data tab and select “Regression.” Use our dependent variable (home or away runs scored) as the Y range and your independent variable (home or away hard-hit rate) as the X range.

You can create a summary and a plot that will allow you to analyze how well the variables explain what’s happening. 

We can also create a line fit plot to see the relationship between our variables.

It’s also possible to perform logistic regressions in Excel. But again, if you’re going to be pulling seasons’ worth of game data, it could end up being a bit of a beast.

OK. Cajoling over. If you choose to stay in Excel, at least keep reading to learn about how some of the pieces fit together. It will be helpful in the articles to come.

Extra Credit

If you’re doing the work in R, here’s the code: 

  • Part 5 code

This requires loading in a .csv file that you already have if you’ve completed the previous modules. But if you can’t find it buried in a nest of folders, or you just want to cut to the chase, here’s the file for download. Just make sure you put it in the working directory you set for the project using the setwd() command, e.g. setwd("C:/ExampleDirectory") (R paths use forward slashes or doubled backslashes).

  • Data After Article 4

We’ve also made these files available on GitHub, if you’re nerding out over there.

Let’s take a look inside all that code. We’re using the numbering from the standalone code, so if you’re using the full ride, start instead from line 716.

Here we’re loading in all our data from previous installments into a data table, called “dat.” It’s a shorthand way of calling all the data in our .csv file every time we use “dat” in a different command.
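A sketch of what that step looks like (the actual code is in the downloadable Part 5 file; the file name here is an assumption):

library(data.table)
dat <- fread("data_after_article_4.csv")   # load the game data into a data table called "dat"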

By “fitting” a model, we’re looking for the best way for the model to describe how the independent variable relates to the dependent variable. The “lm” function here is the “linear model” command. We’re telling the program to build a linear regression where home_score is our dependent variable. The tilde (“~” symbol) separates the dependent variable from the two independent variables we’re choosing from the data table: the home team’s hard-hit in previous games, and the away team’s runs allowed.
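A sketch of that fit; the column names are assumptions, not necessarily the ones used in the downloadable code:

# home score modeled from the home team's hard-hit rate and the away team's runs allowed
fit1 <- lm(home_score ~ home_hard_hit + away_runs_allowed, data = dat)
summary(fit1)   # coefficients, p-values, and R-squared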

You can use the “summary” command in the console to call up the model’s coefficients. We’ll get into detail about what they all mean in the next article, but broadly, this allows us to analyze how well our model explains the variance in our dependent variable. 

Among other analyses, you’ll find the model’s R-squared numbers in this summary.

Now we’re using our model in fit1 to predict what we think the home team will score. “dat$home_score_pred1” creates a column in our data frame where we’ll put our predicted score. 

The “predict” function tells the model to interact with the independent variables from our data. Which, again, in “fit1” are home team hard-hit rate, and away team runs allowed.
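A sketch of that step, with the assumed names from above:

dat$home_score_pred1 <- predict(fit1, newdata = dat)   # predicted home score for each game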

Lines 33-41

We’re creating two plots here to help us analyze our data.

This looks at the model’s errors (or “residuals”). Errors are just the difference between the predicted results and the observed results. If we predicted four runs and the actual result was five, that’s an error. It’s where messy real life deviates from the model. 

We want to know if our errors are normally distributed. The first method, in line 33, is a Q-Q plot. We want to know that most of our distribution falls along the line. Looks good here.

We’ll also plot out the density of the errors. Here, we can see that it’s possibly a normal distribution, but it’s definitely left-skewed. 
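A sketch of those two diagnostic plots, using the assumed names from above:

res1 <- dat$home_score - dat$home_score_pred1   # errors: observed minus predicted
qqnorm(res1); qqline(res1)                      # Q-Q plot of the errors
plot(density(res1))                             # density of the errors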

In the next article, we’ll look at how to address this.

Lines 47-50

Now we can add in more variables. In this case, we want to build “fit2” off of “fit1” and simply add in the home team hard-hit over the last 10 games. This is how we test the hypothesis that a team full of guys who’ve been swinging the hot bat will put up more runs overall. 
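A sketch, building fit2 off of fit1 (the last-10-games column name is an assumption):

fit2 <- update(fit1, . ~ . + home_hard_hit_last10)   # add home hard-hit rate over the last 10 games
summary(fit2)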

Lines 53-70

This should start to look familiar. The section here repeats some of our previous steps.

First, on Line 56 we’re adding predictions back into the data set after adding a third independent variable.

Then, Lines 60-70 repeat all of the steps we did to analyze the home score, but do it instead for the away score.
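In sketch form, with the same assumed naming pattern:

dat$home_score_pred2 <- predict(fit2, newdata = dat)   # updated home-score predictions
# ...the same fit/predict/diagnostic steps are then repeated with away_score as the dependent variable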

Lines 75-81

It’s all coming together. We’re using six independent variables – our three each for home and away – and building a linear regression for the full game total. 

Once we have it all together, we run “predict” again to analyze our residuals. This is also the command you would use in season to get the daily outputs once you have a fully fleshed out and functioning model. 
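A sketch of the full-total model under the same assumed column names:

fit3 <- lm(total_runs ~ home_hard_hit + home_hard_hit_last10 + away_runs_allowed +
             away_hard_hit + away_hard_hit_last10 + home_runs_allowed, data = dat)
dat$total_pred <- predict(fit3, newdata = dat)   # predicted total runs for each game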

Lines 90-97

Now it’s time to apply a logistic regression to attempt to predict whether a game will go over or under the total. 

The command “glm” creates a generalized linear model with our dataset, which is the preferred command for logistic regressions. By specifying “family=binomial” we’re asking for an either/or output as it relates to our dependent variable.

“Tot_over” is our dependent variable – the game total. For our independent variables, we’re testing home and away runs against, and home hard-hit over the last 10 games.

In the final line, we’re again using the “predict” function, this time to predict the probability a game will go over or under the total based on our independent variables.
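A sketch of that logistic regression, again with assumed column names:

# tot_over is 1 if the game went over the total, 0 otherwise
fit4 <- glm(tot_over ~ home_runs_allowed + away_runs_allowed + home_hard_hit_last10,
            family = binomial, data = dat)
dat$over_prob <- predict(fit4, newdata = dat, type = "response")   # predicted probability of the over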

Lines 101-102

Finally, in these last two lines we’re taking the probabilities generated by our predictions in fit4 and converting them to odds so we can compare our model’s price against the current prices at sportsbooks.

The functions called here to convert probabilities to odds (and vice versa) were created in Article 4. If you’re running the code for all articles, those functions are already stored in your R Studio session. If you don’t have them loaded, paste in Lines 176-211 from the Article 4 code before these final lines and run it again.
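
If you’d rather not dig out the Article 4 code right now, the conversion itself is short. This is a hypothetical stand-in (not the Article 4 function) using the American-odds convention, applied to the probability column sketched above:

prob_to_american <- function(p) {
  # favorites (p >= .5) get negative prices, underdogs positive
  ifelse(p >= 0.5, -100 * p / (1 - p), 100 * (1 - p) / p)
}

dat$over_odds <- prob_to_american(dat$over_prob)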

Expand it Out

There was a lot going on here this time, but this is the fun stuff. Now we’ve got game data. We’ve got betting odds. We’ve got a hypothesis and the means to test it. And we’ve got game scores we’re actually predicting. What we’ve got here is a working model. 

And if you have your own hypotheses, hopefully you’re starting to think about ways you can take a concept like linear regression and apply it to different variables.

Maybe you want to add in slugging percentage or weighted runs created plus. Or possibly you want to go the other direction and analyze pitchers instead of lineups.

If you really want to get in the weeds, you could pull pitch-type data from a team and compare it to an opponent’s strengths against different pitches. The Red Sox staff isn’t throwing many fastballs this year. Do you think the Orioles are going to have a field day against sinker/slider types who eschew the heater? Now you have some techniques to build a model to test out your hypothesis.

Of course, is the model we’ve built here any good? That’s a different question altogether.

What’s next

Once we have a model in place, we need to figure out if we can trust it. We’ll talk a little bit about how to put a model through its paces and evaluate its performance. 

One thing we’re going to want to look at is that our model, as it currently exists, shows a significant p-value for home team hard-hit rate, but a much less significant one for away team hard-hit rate. We’ll need to think through what that means. Is it logical based on our data, or is something else going on?

What came first

  • Part 1: The Road to Sports Betting Models
  • Part 2: Sports Betting Data Basics
  • Part 3: Data and Basic Correlation
  • Part 4: Testing Your Hypothesis

Have questions or want to talk to other bettors on your modeling journey? Come on over to the free Discord and hop into the discussion.
