
P-values of variables #224

Open
nauanchik opened this issue Dec 12, 2022 · 7 comments
@nauanchik

Dear Guillermo,

Thank you for the library, it really does help at my work.

However, I wonder how I can get the p-values of all explanatory variables once the logistic regression has been fitted and the binning process is complete. I'd like an output similar to the statsmodels library.

@guillermo-navas-palencia guillermo-navas-palencia added the enhancement New feature or request label Dec 12, 2022
@guillermo-navas-palencia
Owner

Hi @nauanchik,

This feature would be an interesting addition to the scorecard class. Would you be interested in implementing it and submitting a PR?

@nauanchik
Author

Unfortunately, I am too unskilled in coding to help you add the feature to the library. As for the PR, I always recommend this library to colleagues at my workplace because it saves an enormous amount of time.

@guillermo-navas-palencia
Owner

Ok, no worries. I will find the time to implement it.

@jnsofini
Contributor

@guillermo-navas-palencia If you are still looking for some support, I can work on this feature of getting p-values. Let me know and I can start looking into it.

@guillermo-navas-palencia
Owner

Thanks @jnsofini. That would be great!

@detrin

detrin commented Jul 9, 2023

This would be a great enhancement.

@jnsofini
Contributor

@guillermo-navas-palencia I thought about implementing this directly in the code. However, I read about why p-values are not included in the scikit-learn library, and I now think implementing them here would not be a wise decision, for the following reasons.

  1. Many other estimators, such as decision-tree-based classifiers, have no p-values, so supporting them only for logistic regression would be inconsistent.
  2. It is not clear how to calculate p-values when regularization (l1 or l2) is used, as it is by default in scikit-learn. The Fisher-information formula assumes an unpenalized fit, so the results could mislead users who rely on regularization.
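
The second point can be checked numerically: an L2 penalty shrinks the coefficients toward zero, so Wald statistics computed from the plug-in Fisher-information formula no longer reflect an unpenalized fit. A minimal sketch with synthetic data (scikit-learn only; the `C` values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Large C approximates an unpenalized fit; small C applies a strong L2 penalty.
coef_unpen = LogisticRegression(C=1e9, max_iter=1000).fit(X, y).coef_[0]
coef_pen = LogisticRegression(C=0.01, max_iter=1000).fit(X, y).coef_[0]

# The penalized coefficients are shrunk, so z = coef / sigma shrinks too,
# and the resulting "p-values" would not match unpenalized sampling theory.
print(np.linalg.norm(coef_pen), "<", np.linalg.norm(coef_unpen))
```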

As a result, I am considering putting my results as a tutorial on the Optbinning page. Please let me know what your thoughts are.

Here is the code. I can use it to build a scorecard with Optbinning and then provide summary stats of the p-values of the coefficients.

from sklearn.linear_model import LogisticRegression
import scipy.stats as stat
import numpy as np
import pandas as pd

class LogisticRegressionPValues(LogisticRegression):
    """Logistic regression model with p-value computation for coefficients and z-score statistics.
    
    This class extends scikit-learn's LogisticRegression to include the computation of p-values 
    and z-scores for the coefficients after fitting the logistic regression model.
    """
    
    def fit(self, X, y, **kwargs):
        """
        Fit the logistic regression model and compute p-values and z-scores for the coefficients.
        
        Parameters:
            X (array-like or sparse matrix): Training data.
            y (array-like): Target values.
            **kwargs: Additional keyword arguments to pass to the base LogisticRegression.fit().
        
        Returns:
            self: The fitted estimator, following the scikit-learn convention.
        """
        super().fit(X, y, **kwargs)
        self.p_values, self.z_scores = self.get_pvalues(X)
        return self

    def get_pvalues(self, X):
        """
        Compute the p-values and z-scores for the fitted model.
        
        Parameters:
            X (array-like or sparse matrix): Training data.
            
        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        return self.get_stats(self.decision_function(X), X, self.coef_[0])

    @staticmethod
    def get_stats(decision_boundary, X, coef):
        """
        Compute the p-values and z-scores for the fitted model.
        
        Parameters:
            decision_boundary (array-like): Decision function values for the training data.
            X (array-like or sparse matrix): Training data.
            coef (array-like): Model coefficients.
            
        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        cramer_rao = LogisticRegressionPValues.fisher_matrix(decision_boundary, X)
        sigma_estimates = np.sqrt(np.diagonal(cramer_rao))
        z_scores = coef / sigma_estimates  # Z-score for each model coefficient
        p_values = [stat.norm.sf(abs(z)) * 2 for z in z_scores]  # Two-tailed test for p-values

        return p_values, z_scores

    @staticmethod
    def fisher_matrix(decision_boundary, X):
        """
        Compute the Fisher Information Matrix for the logistic regression model.
        
        Parameters:
            decision_boundary (array-like): Decision function values for the training data.
            X (array-like or sparse matrix): Training data.
            
        Returns:
            cramer_rao (array-like): Inverse of the Fisher information matrix
                (Cramer-Rao bound). Note that X does not include an intercept
                column, so the standard errors are approximate when
                ``fit_intercept=True``.
        """
        denom = 2.0 * (1.0 + np.cosh(decision_boundary))  # 1 / (p * (1 - p)) per sample
        fisher_matrix = np.dot((X / denom[:, None]).T, X)  # Fisher Information Matrix: X^T diag(w) X
        cramer_rao = np.linalg.inv(fisher_matrix)  # Inverse Information Matrix

        return cramer_rao

    def z_statistics(self):
        """
        Return a DataFrame containing the coefficient, z-score, and p-value for each feature.
        
        Note: ``feature_names_in_`` is only set when ``fit`` receives input with
        feature names (e.g. a pandas DataFrame).
        
        Returns:
            pd.DataFrame: One row per feature.
                Columns: ["Feature", "Coef", "z-score", "p-values"]
        """
        return pd.DataFrame(
            zip(self.feature_names_in_, self.coef_[0], self.z_scores, self.p_values),
            columns=["Feature", "Coef", "z-score", "p-values"]
        )
