Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Refactor base series annotator #6265

Open
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

Alex-JG3
Copy link
Contributor

@Alex-JG3 Alex-JG3 commented Apr 6, 2024

Reference Issues/PRs

See #3214 .

What does this implement/fix? Explain your changes.

Refactors the BaseSeriesAnnotator class:

  • Removes the fmt and label attributes.
  • Adds the learning_type and task attributes.
  • Adds default predict and transform method.
  • Adds default predict_points and predict_segments` methods.
  • Adds methods for converting between dense and sparse output formats.
  • Adds default methods for predict and transform.

Does your contribution introduce a new dependency? If yes, which one?

No.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • Optionally, for added estimators: I've added myself and possibly to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@Alex-JG3 Alex-JG3 marked this pull request as draft April 6, 2024 09:21
"""

_tags = {"python_dependencies": "pyod"}

def __init__(self, estimator, fmt="dense", labels="indicator"):
def __init__(self, estimator, learning_type="unsupervised"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't pyod always unsupervised?

@fkiraly
Copy link
Collaborator

fkiraly commented Apr 6, 2024

As discussed in the last developer meeting, the approach using tags is showcased here: #6271

This allows to keep all "property"-like attributes in one, inspectable place.

@fkiraly fkiraly added module:annotation enhancement Adding new functionality API design API design & software architecture labels Apr 6, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Apr 11, 2024

PS: if you migrate to tags, you need to merge the parts of #6271 as well that add tags to the registry.

@fkiraly
Copy link
Collaborator

fkiraly commented May 18, 2024

I am currently reviewing and will write a longer review.

Regarding the last question, would a solution not listed clear things up: transforming the last interval to [4, 6) ?

@Alex-JG3
Copy link
Contributor Author

I am currently reviewing and will write a longer review.

Regarding the last question, would a solution not listed clear things up: transforming the last interval to [4, 6) ?

For that example yes but I think it becomes less clear if we have an index that is not equally spaces. For example, taking the example from before and making it into a proper pandas series we get,

0 1
1 1
2 2
3 2
4 3
5 3

the interval of the final segment could be [4, 6) and that would be sensible. But suppose the index is not equally spaces.

0.1 1
1.1 1
2.2 2
2.3 2
4.1 3
4.3 3

For example, I don’t think an interval of [4.1, 5.1) would be sensible for the third segment. For completeness, the sparse format of the series above would in the current implementation would be,

[0.1, 2.2) 1
[2.2, 4.1) 2
[4.1, 4.3) 3

fkiraly
fkiraly previously approved these changes May 20, 2024
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

I think this will be the foundation of finally sorting this module out.
I also like how this removes boilerplate from the concrete classes.

The one worry I have is about removal of arguments and that the changes here are not deprecation safe.
The module is clearly marked as "experimental", so we could do that, I wonder though whether we can do sth, like leaving the old args, and raising warnings that they will disappear, instead.

@fkiraly
Copy link
Collaborator

fkiraly commented May 20, 2024

For that example yes but I think it becomes less clear if we have an index that is not equally spaces.

I see. How about making the last interval run over by the last difference, then?
In the unequally spaced example, [4.1, 4.5) then. Bit of a hack, but it is a heuristic that seems to agree with a good solution in the regularly spaced case.

I would also like to summon @VascoSch92 who might have a useful opinion here, given both sequences and sets are involved at the same time - hope you enjoy this 😁

The greater question is coming up with mutually inverse functions from a value representation to a segment representation, as above. This is not possible if the target index is not known (only partial inverses), but the function assume we know X.index - see above, discussion starts here in a self-contained form: #6265 (comment)

@VascoSch92
Copy link

I would also like to summon @VascoSch92 who might have a useful opinion here, given both sequences and sets are involved at the same time - hope you enjoy this 😁
:-)

Just some thoughts from my side:

I think the solution of @Alex-JG3 is good. What are you doing here is to represent in an intelligent way the values of a column/serie, i.e., you want to group interval of contiguous indexes where the column/serie is constant.

After that if you want to consider closed/open/right-closed/... interval, it is just taste, but it should be written clearly in the documentation or/and in the docstring.

The choice of taking left-closed and right-open interval ( for example [4.1, 4.5) ) seems legit, as it is also what Python do with lists and slicing, i.e., a list start from index 0 and ends at index length-1.

For that example yes but I think it becomes less clear if we have an index that is not equally spaces.
I see your point but

  • the index of a df is always monotonically increasing (is correct? I'm not sure now)
  • you said that you consider interval of contiguous indices with the same value.
    Therefore, it should be clear.

@fkiraly
Copy link
Collaborator

fkiraly commented May 20, 2024

the index of a df is always monotonically increasing (is correct? I'm not sure now)

Not for pandas.DataFrame in general, but we can enforce this in our output assumptions (should be tested, let's make a note of this!)

@Alex-JG3
Copy link
Contributor Author

the index of a df is always monotonically increasing (is correct? I'm not sure now)

No the index is not currently always monotonically increasing but it should be. In fact I think some of the new methods will break if index is not monotonically increasing. We need to enforce this.

I have added deprecation warnings in clasp and the pyod annotator. I think those are the only two classes that are affected by deprecation. Currently, the deprecation message feels very brief:

The fmt argument is going to be removed.

Is this sufficient? Do we want to point a specific version when these arguments will be depracted?

@fkiraly
Copy link
Collaborator

fkiraly commented May 20, 2024

Is this sufficient? Do we want to point a specific version when these arguments will be depracted?

I would say sth like

f"Warning from {type(self).__name__}: fmt argument will be removed in 0.31.0. For behaviour equivalent to current fmt=a usage, do X; for fmt=b usage, do Y",

using warn from sktime.utils.warning.

@Alex-JG3
Copy link
Contributor Author

I have rewritten the deprecation warning. I have also removed the deprecation warning for labels for the pyod adapter since the changes to BaseSeriesAnnotator do not deal with scores properly yet. This will come in a future PR.

If everyone else is happy are we ready to merge?

@Alex-JG3
Copy link
Contributor Author

Could someone please rerun the CI/CD pipeline? Looks like the docs jobs has failed due to running out of time to build. I don't think this as a result of changes introduced in this PR.

@fkiraly
Copy link
Collaborator

fkiraly commented May 22, 2024

Sure - rerunning

@fkiraly
Copy link
Collaborator

fkiraly commented May 22, 2024

Restarted - I also updated the deprecation messages and added release manager notes.

Question, it seems we want to remove the labels param in PyODAnnotator too - should there be a deprecation warning as well?

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are ready to merge for release with 0.30.0.

@Alex-JG3, it would be great if you could check the updated warnings and release manager comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design API design & software architecture enhancement Adding new functionality module:annotation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants