
P-values of variables #224

Open
nauanchik opened this issue Dec 12, 2022 · 7 comments
@nauanchik

Dear Guillermo,

Thank you for the library, it really does help at my work.

However, I wonder how I can get the p-values of all explanatory variables once the logistic regression has been fitted and the binning process is complete. I'd like an output similar to the statsmodels library.

@guillermo-navas-palencia guillermo-navas-palencia added the enhancement New feature or request label Dec 12, 2022
@guillermo-navas-palencia
Owner

Hi @nauanchik,

This feature would be an interesting addition to the scorecard class. Would you be interested in implementing it and submitting a PR?

@nauanchik
Author

Unfortunately, I am too unskilled in coding to help you add the feature to the library. As for the PR, I always recommend this library to colleagues at my workplace because it saves an enormous amount of time.

@guillermo-navas-palencia
Owner

Ok, no worries. I will find the time to implement it.

@jnsofini
Contributor

@guillermo-navas-palencia If you are still looking for some support, I can work on this feature of getting p-values. Let me know and I can start looking into it.

@guillermo-navas-palencia
Owner

Thanks @jnsofini. That would be great!

@detrin

detrin commented Jul 9, 2023

This would be a great enhancement.

@jnsofini
Contributor

@guillermo-navas-palencia I thought about implementing this directly in the code. However, I read about why p-values are not included in the scikit-learn library, and I now think implementing them here would not be a wise decision, for the following reasons.

  1. Many other estimators, such as decision-tree-based classifiers, have no p-values, so supporting them only for logistic regression would be inconsistent.
  2. It is not clear how to calculate p-values when regularization (l1 or l2) is used, as it is by default in scikit-learn. The Fisher-information formula assumes an unpenalized fit, so the results could mislead users who rely on regularization.
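
The second point can be checked numerically: an L2 penalty shrinks the coefficients toward zero, so Wald statistics computed from the plug-in Fisher-information formula no longer reflect an unpenalized fit. A minimal sketch with synthetic data (scikit-learn only; the `C` values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Large C approximates an unpenalized fit; small C applies a strong L2 penalty.
coef_unpen = LogisticRegression(C=1e9, max_iter=1000).fit(X, y).coef_[0]
coef_pen = LogisticRegression(C=0.01, max_iter=1000).fit(X, y).coef_[0]

# The penalized coefficients are shrunk, so z = coef / sigma shrinks too,
# and the resulting "p-values" would not match unpenalized sampling theory.
print(np.linalg.norm(coef_pen), "<", np.linalg.norm(coef_unpen))
```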

As a result, I am considering putting my results as a tutorial on the Optbinning page. Please let me know what your thoughts are.

Here is the code. I can use it to build a scorecard with Optbinning and then provide summary stats of the p-values of the coefficients.

from sklearn.linear_model import LogisticRegression
import scipy.stats as stat
import numpy as np
import pandas as pd

class LogisticRegressionPValues(LogisticRegression):
    """Logistic regression model with p-value computation for coefficients and z-score statistics.
    
    This class extends scikit-learn's LogisticRegression to include the computation of p-values 
    and z-scores for the coefficients after fitting the logistic regression model.
    """
    
    def fit(self, X, y, **kwargs):
        """
        Fit the logistic regression model and compute p-values and z-scores for the coefficients.
        
        Parameters:
            X (array-like or sparse matrix): Training data.
            y (array-like): Target values.
            **kwargs: Additional keyword arguments to pass to the base LogisticRegression.fit().
        
        Returns:
            self: The fitted estimator, following the scikit-learn convention.
        """
        super().fit(X, y, **kwargs)
        self.p_values, self.z_scores = self.get_pvalues(X)
        return self

    def get_pvalues(self, X):
        """
        Compute the p-values and z-scores for the fitted model.
        
        Parameters:
            X (array-like or sparse matrix): Training data.
            
        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        return self.get_stats(self.decision_function(X), X, self.coef_[0])

    @staticmethod
    def get_stats(decision_boundary, X, coef):
        """
        Compute the p-values and z-scores for the fitted model.
        
        Parameters:
            decision_boundary (array-like): Decision function values for the training data.
            X (array-like or sparse matrix): Training data.
            coef (array-like): Model coefficients.
            
        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        cramer_rao = LogisticRegressionPValues.fisher_matrix(decision_boundary, X)
        sigma_estimates = np.sqrt(np.diagonal(cramer_rao))
        z_scores = coef / sigma_estimates  # Z-score for each model coefficient
        p_values = [stat.norm.sf(abs(z)) * 2 for z in z_scores]  # Two-tailed test for p-values

        return p_values, z_scores

    @staticmethod
    def fisher_matrix(decision_boundary, X):
        """
        Compute the Fisher Information Matrix for the logistic regression model.
        
        Parameters:
            decision_boundary (array-like): Decision function values for the training data.
            X (array-like or sparse matrix): Training data.
            
        Returns:
            cramer_rao (array-like): Inverse of the Fisher information matrix
                (Cramer-Rao bound). Note that X does not include an intercept
                column, so the standard errors are approximate when
                ``fit_intercept=True``.
        """
        denom = 2.0 * (1.0 + np.cosh(decision_boundary))  # 1 / (p * (1 - p)) per sample
        fisher_matrix = np.dot((X / denom[:, None]).T, X)  # Fisher Information Matrix: X^T diag(w) X
        cramer_rao = np.linalg.inv(fisher_matrix)  # Inverse Information Matrix

        return cramer_rao

    def z_statistics(self):
        """
        Return a DataFrame containing the coefficient, z-score, and p-value for each feature.
        
        Note: ``feature_names_in_`` is only set when ``fit`` receives input with
        feature names (e.g. a pandas DataFrame).
        
        Returns:
            pd.DataFrame: One row per feature.
                Columns: ["Feature", "Coef", "z-score", "p-values"]
        """
        return pd.DataFrame(
            zip(self.feature_names_in_, self.coef_[0], self.z_scores, self.p_values),
            columns=["Feature", "Coef", "z-score", "p-values"]
        )
