Fix heights update in weighted_extended_p_square #59

Open
wants to merge 1 commit into
base: develop

@adimajo adimajo commented Apr 15, 2024

weighted_extended_p_square.hpp implements a weighted version of the extended p-square algorithm. P-square is an online quantile estimator, meaning that it does not require storing all samples; the extended variant estimates several quantiles at once; and the weighted variant attaches a weight to each incoming sample.

This algorithm works by updating estimates of these quantiles and additional "markers" (min, max values and all mid-points, i.e. all quantiles lying between two requested quantiles).

Unfortunately, the heights (i.e. quantile estimates) update rule does not properly take into account weights and does not differ from the unweighted case.

  • The update is currently applied only if the discrepancy between desired and actual positions is above 1, even though positions are expressed in the "weights" scale, where they can be arbitrarily small or large.
  • The update itself currently takes into account only the sign of the discrepancy, whereas the discrepancy has to be weighted.

This implementation is correct in the unweighted case, but it makes the approach work poorly in situations where the weights lie far from 1 on average. When all weights are set to 1 it obviously matches the unweighted case, and one can extrapolate that it stays close to it for weights within about an order of magnitude of 1.
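
To make the scale dependence concrete, here is a minimal sketch. It assumes the trigger has the classic p-square form (move a marker only when it is at least one position off and has room to move); the function name, the marker values and the constant-weight scaling are illustrative assumptions, not code taken from weighted_extended_p_square.hpp or from this PR.

```python
def trigger_fires(desired, actual, left, right):
    """Classic p-square trigger (assumed form): adjust a marker only if its
    actual position is off by at least 1 and the neighbouring marker on that
    side is more than 1 position away."""
    d = desired - actual
    return (d >= 1 and right - actual > 1) or (d <= -1 and left - actual < -1)


# Hypothetical marker state in sample-count units (all weights equal to 1):
# desired position, actual position, and the actual positions of both neighbours.
desired, actual, left, right = 52.5, 50.0, 40.0, 60.0

for w in (0.001, 1.0, 1000.0):
    # With a constant weight w, every position is an accumulated weight,
    # so the whole configuration is simply multiplied by w.
    d = w * (desired - actual)
    fires = trigger_fires(w * desired, w * actual, w * left, w * right)
    # A sign-only update moves the marker by 1, whatever the size of d.
    covered = 1.0 / abs(d) if fires else 0.0
    print(f"w={w:g}  discrepancy={d:g}  fires={fires}  "
          f"fraction of discrepancy covered by a unit step={covered:g}")
```

Under these assumptions the same marker configuration never triggers an update for w = 0.001, because the discrepancy stays far below 1, while for w = 1000 the update triggers but a unit step covers only a tiny fraction of the discrepancy.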

This is counter-intuitive at best and arguably unsatisfactory: it is reasonable to expect the "weighted" equivalent of an unweighted algorithm to yield similar results when presented with similar data and the same weight for every sample.

Provided programs MWE1.{cpp,py} implement this idea:

  • Instantiate an accumulator_set of type weighted_extended_p_square_quantile and give it quantiles to track {0.001, 0.2, 0.5, 0.8, 0.999}
  • For weight in {0.0001, 0.001, 0.01, 0.1, 1., 10., 100., 1000., 10000.}
    • Do 10000 times:
      • Draw a sample from the uniform distribution U(0, 1) and push it with the current weight
      • Estimate quantiles {0.1, 0.35, 0.65, 0.9}
  • Plot the estimates against the truth (estimates should converge to true values reasonably fast since linear interpolation is correct with U(0,1) if quantile estimates are correct).
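
The shape of this consistency test can be sketched in pure Python. This is not MWE1.py and it does not call Boost: an exact, offline weighted quantile (which stores and sorts all samples) stands in for the accumulator, only to show the "weight-invariance" behaviour a correct estimator should exhibit. The names weighted_quantile and run_consistency_check and the seed are assumptions made for the illustration.

```python
import random

PROBS = (0.1, 0.35, 0.65, 0.9)
WEIGHTS = (0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0)


def weighted_quantile(samples, weights, p):
    """Exact weighted quantile: the smallest sample whose cumulative
    weight reaches a share p of the total weight."""
    pairs = sorted(zip(samples, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for x, w in pairs:
        cumulative += w
        if cumulative >= p * total:
            return x
    return pairs[-1][0]


def run_consistency_check(weight, n=10000, probs=PROBS, seed=0):
    """Draw n samples from U(0, 1), give each the same weight, and
    return the estimated quantiles for the requested probabilities."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n)]
    weights = [weight] * n
    return [weighted_quantile(samples, weights, p) for p in probs]


for weight in WEIGHTS:
    estimates = run_consistency_check(weight)
    # For U(0, 1) the true quantile at probability p is p itself.
    worst = max(abs(e - p) for e, p in zip(estimates, PROBS))
    print(f"weight={weight:g}  estimates={[round(e, 3) for e in estimates]}  "
          f"max abs error={worst:.4f}")
```

With this reference estimator the estimates stay close to the true quantiles for every weight, because multiplying all weights by a constant leaves each sample's share of the total weight unchanged; that is the behaviour the plots compare the accumulator against.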

They produce the following plot with the current implementation:
[Plot: MWE1_current]

As can be seen, the results depend heavily on the chosen weight (small to large from left to right) and are unsatisfactory for very small or very large weights, breaking the desirable "weight-invariance" property.

Applying the proposed modifications to the heights update rule and rerunning the proposed consistency test results in a satisfactory plot:
[Plot: MWE1_fixed]

Notes:

  • Program MWE1 can be compiled e.g. via: g++ -I$BOOST_INCLUDE_PATH MWE1.cpp -o MWE1
  • Data is generated via MWE1 > data1.csv
  • Plots are generated via python3 MWE1.py
  • MWE1.py requires matplotlib and pandas

The heights update rule was not updated to take into account weights.
* The update rule is done only if the discrepancy between desired and actual positions is above 1 when positions are in the "weights" scale (which can be arbitrarily small)
* The update rule itself only takes into account the sign of the discrepancy when it has to be weighted
A simulation (see PR) shows that, when pushing observations from the same distribution with varying weights, the estimate is not consistent: it is well off its true value for w << 1 and w >> 1.