[ENH] Add partial dataset of NAB, Bayes Online for Anomaly Detection, and testing example notebook #6335

Draft
wants to merge 2 commits into
base: main

@duydl (Contributor) commented Apr 26, 2024

Reference Issues/PRs

#6167, #3214

What does this implement/fix? Explain your changes.

Introduces a partial dataset from the Numenta Anomaly Benchmark (NAB) into sktime and implements online anomaly detection algorithms on that dataset.

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
    See here for full badge reference
  • Optionally, for added estimators: I've added myself, and possibly others, to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@duydl (Contributor, Author) commented Apr 26, 2024

Bayes CPD for Anomaly:

Initialization

Initialize self.max_length_probs[0, 0] to 1:
$$P(r_0 = 0) = 1$$
This probability matrix will store the probabilities for each potential run length $r$.

Sequential Update for Each Data Point $x_t$

The algorithm processes and updates probabilities for all potential run lengths for each new data point.

2.1. Predictive Probability for Each Run Length:
$$P(x_t \mid r_{t-1}, x_{1:t-1})$$
Computed by observation_likelihood: the likelihood of observing $x_t$ given the data model parameters for a specific run length $r_{t-1}$.
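A minimal sketch of how the per-run-length predictive probability could be computed. The Student-t form is inferred from the StudentTDistribution class name in the diff; the Normal-Gamma hyperparameters (mu, kappa, alpha, beta) and the function names below are assumptions for illustration, not the signature of observation_likelihood in this PR.

```python
import math


def student_t_pdf(x, df, loc, scale):
    """Density of a location-scale Student-t distribution at x."""
    z = (x - loc) / scale
    log_norm = (
        math.lgamma((df + 1.0) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
    )
    return math.exp(log_norm - (df + 1.0) / 2.0 * math.log1p(z * z / df))


def predictive_probs(x, params):
    """Predictive probability of x under each run length.

    params is a list with one (mu, kappa, alpha, beta) tuple per run length
    (hypothetical Normal-Gamma hyperparameters). The posterior predictive of
    a Normal-Gamma model is Student-t with df = 2 * alpha, location mu and
    squared scale beta * (kappa + 1) / (alpha * kappa).
    """
    out = []
    for mu, kappa, alpha, beta in params:
        scale = math.sqrt(beta * (kappa + 1.0) / (alpha * kappa))
        out.append(student_t_pdf(x, 2.0 * alpha, mu, scale))
    return out


# Two run lengths: a fresh prior, and a run that has already seen data near 5.
params = [(0.0, 1.0, 1.0, 1.0), (5.0, 4.0, 2.5, 2.0)]
probs = predictive_probs(4.8, params)
```

Here the second run length assigns the higher probability to the observation 4.8, because its location parameter is close to it.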

2.2. Update Run Length Probabilities:
$$P(r_t = r_{t-1} + 1 \mid x_{1:t}) \propto (1 - H(r_{t-1})) \times P(x_t \mid r_{t-1}, x_{1:t-1}) \times P(r_{t-1} \mid x_{1:t-1})$$
$H(r)$ is the hazard function, i.e. the probability of a change point at run length $r$. The factor $1 - H(r_{t-1})$ is the probability of no change point, in which case the run length grows by one. The value is unnormalized until step 2.4.

2.3. Probability of a Change Point:
$$P(r_t = 0 \mid x_{1:t}) \propto \sum_{r=0}^{\text{max\_run\_length}} H(r) \times P(x_t \mid r, x_{1:t-1}) \times P(r \mid x_{1:t-1})$$
This step sums the probabilities across all previous run lengths, weighted by the hazard function, to compute the (unnormalized) probability that a change point occurred at $x_t$.
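Steps 2.2 and 2.3 can be sketched together as one unnormalized update. The function names and the constant hazard with expected run length lam are assumptions for illustration; the PR's actual hazard function is not shown in this thread.

```python
def constant_hazard(r, lam=100.0):
    """Hypothetical hazard: the same change point probability 1 / lam at every run length."""
    return 1.0 / lam


def update_run_lengths(prev_probs, pred_probs, hazard):
    """Return unnormalized [P(r_t = 0), P(r_t = 1), ...].

    prev_probs[r] is P(r_{t-1} = r | x_{1:t-1}) and pred_probs[r] is
    P(x_t | r_{t-1} = r, x_{1:t-1}).
    """
    n = len(prev_probs)
    # Step 2.2: no change point, so run length r grows to r + 1.
    growth = [(1.0 - hazard(r)) * pred_probs[r] * prev_probs[r] for r in range(n)]
    # Step 2.3: change point, so every run length collapses to 0.
    change = sum(hazard(r) * pred_probs[r] * prev_probs[r] for r in range(n))
    return [change] + growth


unnormalized = update_run_lengths([0.2, 0.8], [0.5, 0.1], constant_hazard)
```

The entries of the result sum to the evidence term, the sum over r of pred_probs[r] * prev_probs[r], which is the quantity divided out during normalization.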

2.4. Normalization and Transfer:

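$$P(r_t = i \mid x_{1:t}) = \frac{\tilde{P}(r_t = i \mid x_{1:t})}{\sum_{j=0}^{\text{max\_run\_length} + 1} \tilde{P}(r_t = j \mid x_{1:t})}$$
Here $\tilde{P}$ denotes the unnormalized values from steps 2.2 and 2.3. After updating, the probabilities are normalized and transferred from column [:, 1] back to column [:, 0].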
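A small sketch of the normalization and transfer step, using a plain list of [previous, current] rows in place of the two-column matrix. Resetting the current column to zero after the transfer is an assumption about the bookkeeping, and normalize_and_transfer is a hypothetical helper name.

```python
def normalize_and_transfer(probs):
    """Normalize the current column and move it into the previous column.

    probs is a list of [previous, current] pairs, one row per run length,
    standing in for the columns [:, 0] and [:, 1] of the matrix.
    """
    total = sum(row[1] for row in probs)
    if total <= 0.0:
        raise ValueError("current run-length probabilities sum to zero")
    for row in probs:
        row[0] = row[1] / total  # normalized value becomes the new "previous"
        row[1] = 0.0             # assumed reset before the next data point
    return probs


matrix = [[1.0, 0.0018], [0.0, 0.099], [0.0, 0.0792]]
normalize_and_transfer(matrix)
```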

Iterating Over All Data Points

The algorithm repeats these steps for each new data point, continuously updating the probability distribution over potential run lengths and adapting to new evidence as it comes in.
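The whole loop can be sketched end to end as below. To keep the example short, it swaps in a Gaussian model with known noise variance and a conjugate normal prior on the mean (not the Student-t model of the PR), uses a constant hazard, and does not truncate at max_run_length; all names and default values are assumptions for illustration.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def run_length_posteriors(data, hazard=0.05, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Return the normalized run-length distribution after each observation."""
    probs = [1.0]  # initialization: P(r_0 = 0) = 1
    means, variances = [prior_mean], [prior_var]  # posterior over the mean, per run length
    history = []
    for x in data:
        # 2.1 predictive probability of x under each run length
        pred = [normal_pdf(x, m, v + noise_var) for m, v in zip(means, variances)]
        # 2.2 growth: no change point, run length r becomes r + 1
        growth = [(1.0 - hazard) * p * q for p, q in zip(pred, probs)]
        # 2.3 change point: all mass that collapses to run length 0
        change = sum(hazard * p * q for p, q in zip(pred, probs))
        # 2.4 normalization
        unnormalized = [change] + growth
        total = sum(unnormalized)
        probs = [u / total for u in unnormalized]
        # Parameter update: run length 0 restarts from the prior,
        # every other run length absorbs x into its posterior.
        new_means, new_vars = [prior_mean], [prior_var]
        for m, v in zip(means, variances):
            post_var = 1.0 / (1.0 / v + 1.0 / noise_var)
            new_means.append(post_var * (m / v + x / noise_var))
            new_vars.append(post_var)
        means, variances = new_means, new_vars
        history.append(probs)
    return history


def most_likely_run_length(probs):
    return max(range(len(probs)), key=probs.__getitem__)


data = [0.0] * 20 + [10.0] * 10  # level shift after 20 points
history = run_length_posteriors(data)
before = most_likely_run_length(history[19])
after = most_likely_run_length(history[20])
```

With a constant hazard, the normalized probability of run length 0 always equals the hazard itself, so it carries no information on its own. In this sketch a change shows up instead as a sudden drop in the most likely run length, here between before and after.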

from sktime.annotation.base import BaseSeriesAnnotator


class StudentTDistribution:
Collaborator

the distribution exists already in sktime.proba.t, no?

At first glance, I would have modelled this as a distribution fitter, inheriting from the "parameter estimator" template in param_est. However, it does not entirely fit the interface, so we can leave it as is for now.
