Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support mann_whitney_u_test aggregate function #45513

Open
jixxiong opened this issue May 13, 2024 · 2 comments
Open

support mann_whitney_u_test aggregate function #45513

jixxiong opened this issue May 13, 2024 · 2 comments

Comments

@jixxiong
Copy link

jixxiong commented May 13, 2024

Feature request

Is your feature request related to a problem? Please describe.

Please refer to Whitney_U_test-wikipedia.

Describe the solution you'd like

The Mann-Whitney U test is a non-parametric test that can be used to determine if two populations were selected from the same distribution. The Whitney U (Wilcoxon-Mann-Whitney) test is a non-parametric test used to compare two independent samples. The calculation of the test statistic involves the following steps:

Combine all observations from both groups into a single dataset.
Rank the combined dataset from smallest to largest, assigning the lowest rank to the smallest observation and so on.
Sum the ranks of the observations in one group (group A). This sum is denoted as ( $R_A$ ).
Calculate the sum of the ranks for the other group (group B), denoted as ( $R_B$ ). The sum of all ranks is ( $N(N+1)/2$ ), where ( $N$ ) is the total number of observations. Thus, ( $R_B = N(N+1)/2 - R_A$ ).
Calculate the test statistic ( $U$ ) for each group using the following formulas:
For group A: [ $U_A = R_A - \frac{n_A(n_A + 1)}{2}$ ]

For group B: [ $U_B = R_B - \frac{n_B(n_B + 1)}{2}$ ]

Where ( $n_A$ ) and ( $n_B$ ) are the sample sizes of group A and group B, respectively.

The smaller value of ( $U_A$ ) and ( $U_B$ ) is the test statistic ( U ). If this value is less than or equal to the critical value from the U distribution table, the null hypothesis is rejected, indicating a significant difference between the two groups. Otherwise, the null hypothesis is not rejected.
By calculating the test statistic and comparing it to the critical values, the Whitney U test determines whether there is a statistically significant difference between the two independent samples, without assuming normality.

In Starrocks, the MANN_WHITNEY_U_TEST agg function is expected to perform the Mann-Whitney rank test on samples derived from two populations and return the Mann-Whitney U statistic as well as p-value associated with the test.

MANN_WHITNEY_U_TEST (sample_data, sample_treatment[, alternative[, continuity_correction]])

The MANN_WHITNEY_U_TEST function takes a column sample_data and column sample_treatment as input, where sample_data should be of numeric type and sample_treatment should be of boolean type, with false representing the first group and true representing the second group. Two optional parameters are: alternative to specify the alternative hypothesis (can be 'two-sided', 'less', or 'greater'), and continuity_correction to specify whether to apply a continuity correction (default is True).

The MANN_WHITNEY_U_TEST function should return the Mann-Whitney U statistic as well as p-value associated with the test in a json array of two elements.

This function should ignores NULLs.

Here are some examples.

create table testing_data (
    id int, 
    score int, 
    treatment boolean
)
properties(
    "replication_num" = "1"
);

insert into testing_data values 
    (1, 80, false), 
    (2, 100, false), 
    (3, NULL, false), 
    (4, 60, true), 
    (5, 70, true), 
    (6, 85, true);

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment) FROM testing_data;

Result:

+---------------------------------------+
| mann_whitney_u_test(score, treatment) |
+---------------------------------------+
| [5, 0.38647623077123283]              |
+---------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'less') FROM testing_data;

Result:

+-----------------------------------------------+
| mann_whitney_u_test(score, treatment, 'less') |
+-----------------------------------------------+
| [5, 0.9255426634106172]                       |
+-----------------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'two-sided', 0) FROM testing_data;

Result:

+-------------------------------------------------------+
| mann_whitney_u_test(score, treatment, 'two-sided', 0) |
+-------------------------------------------------------+
| [5, 0.2482130789899235]                               |
+-------------------------------------------------------+

Describe alternatives you've considered

Additional context

@wangsimo0
Copy link
Contributor

Thanks @jixxiong !, could you pls add more informtaion so we can understand this function better?
Scenario: please add your scenario as detailed as possible, for example: get the closest timestamped data or the earliest timestamped data in grouped statistics, or to create a candlestick chart of a stock.
How to use it? the input parameter, out put parameter, their data type and how you want to deal with specific scenario like nulls...
and also, If you have used it in other systems, we are more than glad to see you share your experience :)

@jixxiong
Copy link
Author

jixxiong commented May 20, 2024

Thanks @jixxiong !, could you pls add more informtaion so we can understand this function better? Scenario: please add your scenario as detailed as possible, for example: get the closest timestamped data or the earliest timestamped data in grouped statistics, or to create a candlestick chart of a stock. How to use it? the input parameter, out put parameter, their data type and how you want to deal with specific scenario like nulls... and also, If you have used it in other systems, we are more than glad to see you share your experience :)

Thanks for your reply.

In Starrocks, the MANN_WHITNEY_U_TEST agg function is expected to perform the Mann-Whitney rank test on samples derived from two populations and return the Mann-Whitney U statistic as well as p-value associated with the test.

MANN_WHITNEY_U_TEST (sample_data, sample_treatment[, alternative[, continuity_correction]])

The MANN_WHITNEY_U_TEST function takes a column sample_data and column sample_treatment as input, where sample_data should be of numeric type and sample_treatment should be of boolean type, with false representing the first group and true representing the second group. Two optional parameters are: alternative to specify the alternative hypothesis (can be 'two-sided', 'less', or 'greater'), and continuity_correction to specify whether to apply a continuity correction (default is True).

The MANN_WHITNEY_U_TEST function should return the Mann-Whitney U statistic as well as p-value associated with the test in a json array of two elements.

This function should ignores NULLs.

Here are some examples.

create table testing_data (
    id int, 
    score int, 
    treatment boolean
)
properties(
    "replication_num" = "1"
);

insert into testing_data values 
    (1, 80, false), 
    (2, 100, false), 
    (3, NULL, false), 
    (4, 60, true), 
    (5, 70, true), 
    (6, 85, true);

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment) FROM testing_data;

Result:

+---------------------------------------+
| mann_whitney_u_test(score, treatment) |
+---------------------------------------+
| [5, 0.38647623077123283]              |
+---------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'less') FROM testing_data;

Result:

+-----------------------------------------------+
| mann_whitney_u_test(score, treatment, 'less') |
+-----------------------------------------------+
| [5, 0.9255426634106172]                       |
+-----------------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'two-sided', 0) FROM testing_data;

Result:

+-------------------------------------------------------+
| mann_whitney_u_test(score, treatment, 'two-sided', 0) |
+-------------------------------------------------------+
| [5, 0.2482130789899235]                               |
+-------------------------------------------------------+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants