[BUG] long test collection time and test timeouts #6344

Open
fkiraly opened this issue Apr 26, 2024 · 26 comments
Labels
bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@fkiraly
Collaborator

fkiraly commented Apr 26, 2024

Update: as @yarnabrina noted, there is an issue with long test times and long test collection times. This can be reproduced locally.

python -m pytest -n0 --collect-only used to take <1min, now it is at 10min.

Using py-spy record -o profile.svg -- python -m pytest -n0 --collect-only for profiling now.

Related: #4900


VARMAX seems to be sporadically timing out with >10min durations.

Unsure what is causing this - previous failures were reported for some other tests, see #2997, #3176.

For testing, the skips in #6345 need to be reverted.

@fkiraly fkiraly added bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting labels Apr 26, 2024
fkiraly added a commit that referenced this issue Apr 26, 2024
Skips tests for the `VARMAX` forecaster until #6344 is resolved.
@fkiraly fkiraly changed the title [BUG] VARMAX test timeouts [BUG] long test collection time and test timeouts Apr 28, 2024
@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

@yarnabrina, I think we should also add a test, or similar guard, enforcing a 2-minute upper bound on test collection time.
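A minimal sketch of what such a guard could look like, assuming it runs as a separate CI step rather than inside the test suite itself (the 120-second bound, function name and paths are illustrative only):

```python
import subprocess
import time


def check_collection_time(max_seconds: float = 120.0) -> None:
    """Fail if collecting the sktime test suite takes longer than max_seconds."""
    start = time.monotonic()
    # collect only, single worker, quiet output
    subprocess.run(
        ["python", "-m", "pytest", "sktime", "-n0", "--collect-only", "-q"],
        check=True,
        capture_output=True,
    )
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        raise RuntimeError(f"collection took {elapsed:.0f}s, bound is {max_seconds:.0f}s")


if __name__ == "__main__":
    check_collection_time()
```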

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

Result of test collection for classification, using

py-spy record -o profile_classification.svg -- python -m pytest sktime/classification -n0 --collect-only - 2 minutes (!)

[attached flamegraph: profile_classification]

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

according to this, coverage report generation takes 80 seconds? That is odd, why would a coverage report be generated, if we only collect?

@yarnabrina
Collaborator

Can we get the profiling for all components, with and without sktime/tests?

Question: where/how are you generating these plots, and how do I read it? I'm not familiar with py-spy?

@yarnabrina
Collaborator

yarnabrina commented Apr 28, 2024

according to this, coverage report generation takes 80 seconds? That is odd, why would a coverage report be generated, if we only collect?

I think/guess pytest-cov is enabled by default?

https://github.com/sktime/sktime/blob/main/setup.cfg#L17-L19
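For reference, those lines add the coverage flags to pytest's default options - roughly the following (paraphrased; the exact flags are also quoted further down in this thread):

```
[tool:pytest]
addopts =
    --cov sktime
    --cov-report xml
    --cov-report html
```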

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

Question: where/how are you generating these plots

I pasted the exact command - you need to run it from sktime root, in an env that has py-spy installed.

and how do I read it? I'm not familiar with py-spy?

The horizontal bars represent time spent in function calls; lines in files are also given. So you are looking for public functions or similar, to see how much time is spent in them. The further down you go, the more deeply the function calls are nested.

In the SVG produced, viewed in a browser, you can also zoom into one of the bars - click on it to magnify that subtree.

Can we get the profiling for all components, with and without sktime/tests?

I think that will not be as useful as investigating a component where times are short, such as classification. This is because the times there are already unexpectedly high - they should be in the range of tens of seconds. My hunch is that identifying the issue there may also shed light on the other modules (the times there are just longer, so one diagnostic iteration takes more time).

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

I think/guess pytest-cov is enabled by default?

yes, in setup.cfg.

I removed the lines to check - and collecting classification tests still takes 2 minutes. That is very odd.
[attached flamegraph: profile_classification_no_cov]

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

Part of this seems to be import chains and import coupling:
#6354

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

Pinging @tarpas, @art049 - in case you have any ideas on what to do here?

The timeouts started occurring around a month or so ago.

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

Collection runtimes of historical versions for the classification module:

  • 0.22.0 - 12 sec for 301 tests
  • 0.24.0 - 11 sec for 294 tests
  • 0.25.0 - 38 sec for 355 tests
  • 0.26.0 - 38 sec for 355 tests
  • 0.27.0 - 56 sec for 363 tests

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

I think I understand the issue - it is run_test_for_class. The logic deciding which tests to run has been expanded over the versions.

If we replace run_test_for_class with only _required_deps_present, the time for the classification module shrinks to 5 secs, for 352 tests at the current version.

For the entirety of sktime, we get to 110 secs with _required_deps_present, and 22 secs without - that is probably the current floor for optimizing run_test_for_class.

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

I think caching indeed greatly cuts the time: #6357
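The gist of the caching approach in #6357, as a hedged sketch - illustrative names only, assuming the switch utilities depend just on the estimator class, not the actual sktime internals:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def _required_deps_present(cls) -> bool:
    """Soft-dependency check for an estimator class, computed once per class."""
    # placeholder for the expensive check against installed packages / pyproject
    return True


def run_test_for_class(cls) -> bool:
    """Decide at collection time whether tests for cls should run."""
    # each cached helper is evaluated once per class rather than once per collected test
    return _required_deps_present(cls)  # plus further cached conditions (diff checks, etc.)
```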

@fkiraly
Collaborator Author

fkiraly commented Apr 28, 2024

Bumps in collection time are explained by logic added to run_test_for_class:

  • 0.22.0 - 12 sec for 301 tests
  • 0.24.0 - 11 sec for 294 tests
    • in 0.24.2, parent classes were included in diff check
  • 0.25.0 - 38 sec for 355 tests
  • 0.26.0 - 38 sec for 355 tests
    • in 0.26.1, the condition on dep change in pyproject was added
  • 0.27.0 - 56 sec for 363 tests

Caching gets this down to 5 sec

@yarnabrina
Collaborator

Can you please tell me which PR/changes I should apply locally to test your final changes and get timings?

I am also interested to know what the timings are for you, after these dependency and caching changes, when you include sktime/tests along with sktime/classification, as that's what added the big time increase for me.

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

(not a reply to the above, just a note following CI runs. The reply is in the next post.)

Timeouts occur for the following classes:

  • AutoARIMA
  • SARIMAX
  • VARMAX
  • StatsModelsARIMA
  • StatsForecastAutoARIMA
  • BATS
  • TBATS
  • ShapeletLearningClassifierTslearn

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

Can you please tell me which PR/changes I should apply locally to test your final changes and get timings?

This PR: #6357

I am also interested to know what the timings are for you, after these dependency and caching changes, when you include sktime/tests along with sktime/classification, as that's what added the big time increase for me.

All tests in sktime, including the above, take 22 sec to collect with this PR. I think it can be further reduced using the other speedup PR, but I would not add them to 0.29.0.

@yarnabrina
Collaborator

All tests in sktime, including the above, take 22 sec to collect with this PR.

Sorry, while I do see some improvements, I cannot reproduce anything close to this.

modified `setup.cfg`
[aliases]
test = pytest

[tool.isort]
profile = "black"
multi_line_output = 3

[tool:pytest]
# ignore certain folders and pytest warnings
addopts =
    --ignore build_tools
    --ignore examples
    --ignore docs
filterwarnings =
    ignore::UserWarning
    ignore:numpy.dtype size changed
    ignore:numpy.ufunc size changed

[flake8]
# Default flake8 3.5 ignored flags
ignore = E121, E123, E126, E226, E24, E704, W503, W504
# inline with Black code formatter
max-line-length = 88
exclude =
    sktime/_contrib/*
extend-ignore =
    # See https://github.com/PyCQA/pycodestyle/issues/373
    E203

[metadata]
description_file = README.md
long_description_content_type = text/markdown

[check-manifest]
ignore =
    .binder/**
    .all-contributorsrc
    .coveragerc
    examples/**
    build_tools/**
    __check_build/**
    docs/**
    Makefile
    CODEOWNERS
    CONTRIBUTING.md
    *.yaml
    *.yml

[isort]
profile = black

[pydocstyle]
convention = numpy
match = (?!test_).*\.py

[bdist_wheel]
universal = false

[sdist]
formats = gztar
`pip freeze` before running below commands
adagio==0.2.4
aiohttp==3.9.5
aiosignal==1.3.1
alembic==1.13.1
antlr4-python3-runtime==4.11.1
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
autoflake==2.3.1
certifi==2024.2.2
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
colorlog==6.8.2
coreforecast==0.0.8
distlib==0.3.8
exceptiongroup==1.2.1
filelock==3.13.4
frozenlist==1.4.1
fs==2.4.16
fsspec==2024.3.1
fugue==0.8.7
fugue-sql-antlr==0.2.0
greenlet==3.0.3
identify==2.5.35
idna==3.7
iniconfig==2.0.0
Jinja2==3.1.3
joblib==1.4.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
lightning-utilities==0.11.2
llvmlite==0.42.0
Mako==1.3.3
MarkupSafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
networkx==3.3
neuralforecast==1.7.1
nodeenv==1.8.0
numba==0.59.1
numpy==1.26.4
optuna==3.6.1
packaging==24.0
pandas==2.2.2
patsy==0.5.6
platformdirs==4.2.0
pluggy==1.4.0
pre-commit==3.7.0
protobuf==5.26.1
pyarrow==15.0.2
pyflakes==3.2.0
pytest==8.1.1
python-dateutil==2.9.0.post0
pytorch-lightning==2.2.2
pytz==2024.1
pyupgrade==3.15.2
PyYAML==6.0.1
qpd==0.4.4
ray==2.11.0
referencing==0.34.0
requests==2.31.0
rpds-py==0.18.0
scikit-base==0.7.7
scikit-learn==1.4.2
scipy==1.13.0
six==1.16.0
-e git+https://github.com/yarnabrina/sktime-fork.git@2bb0c1bb454f2e7851611b12f6425a4f19841d17#egg=sktime
SQLAlchemy==2.0.29
sqlglot==23.10.0
statsforecast==1.7.4
statsmodels==0.14.2
sympy==1.12
tensorboardX==2.6.2.2
threadpoolctl==3.4.0
tokenize-rt==5.2.0
tomli==2.0.1
torch==2.2.2+cpu
torchmetrics==1.3.2
tqdm==4.66.2
triad==0.9.6
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
utilsforecast==0.1.5
virtualenv==20.25.3
yarl==1.9.4

branch main (commit 2bb0c1bb45)

component forecasting (command: python -m pytest sktime/forecasting --co)

25.88 seconds (43669 tests)

component forecasting with tests framework (command: python -m pytest sktime/forecasting sktime/tests --co)

322.94 seconds (79603 tests)

component classification (command: python -m pytest sktime/classification --co)

7.15 seconds (568 tests)

component classification with tests framework (command: python -m pytest sktime/classification sktime/tests --co)

299.09 seconds (36502 tests)

branch cache-test-switch (commit c9e842c558)

component forecasting (command: python -m pytest sktime/forecasting --co)

12.09 seconds (40400 tests)
failed with error ERROR sktime/forecasting/model_evaluation/tests/test_evaluate.py - TypeError: unhashable type: 'CRPS'

component forecasting with tests framework (command: python -m pytest sktime/forecasting sktime/tests --co)

239.78 seconds (79603 tests)
failed with error ERROR sktime/forecasting/model_evaluation/tests/test_evaluate.py - TypeError: unhashable type: 'CRPS'

component classification (command: python -m pytest sktime/classification --co)

4.53 seconds (568 tests)

component classification with tests framework (command: python -m pytest sktime/classification sktime/tests --co)

224.60 seconds (36502 tests)

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

@yarnabrina, I think we have different dependencies, so the number of tests differs?

The CRPS failure is a bug that got fixed in a newer commit - objects are not hashable, so we need to get the class.
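To illustrate the failure mode and the fix (a hedged sketch, not the exact sktime code): lru_cache needs hashable arguments, and some estimator instances such as CRPS() are not hashable, so the cached helpers should be keyed on the class rather than the instance:

```python
def _to_class(obj):
    """Normalize an estimator instance or class to the class, for use as a cache key."""
    return obj if isinstance(obj, type) else type(obj)


# previously, passing an unhashable instance straight into the lru_cache-d helpers
# raised "TypeError: unhashable type"; normalizing to the class first avoids this
```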

My timings on branch cache-test-switch (9e842c558ea4e4163582e1be3e50bcafb68f1b6)

python -m pytest sktime/forecasting --co - 9 secs, 5108 tests
python -m pytest sktime/forecasting sktime/tests --co - 20 secs, 6575 tests
python -m pytest sktime/classification --co - 6 secs, 246 tests
python -m pytest sktime/classification sktime/tests --co - 17 secs, 1713 tests

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

I should also note that my timings significantly increase if I run with pytest-cov present and the "coverage" flags enabled, i.e.,

    --cov sktime
    --cov-report xml
    --cov-report html

Then, timings are:

python -m pytest sktime/forecasting --co - 32 secs, 5108 tests
python -m pytest sktime/forecasting sktime/tests --co - 44 secs, 6575 tests
python -m pytest sktime/classification --co - 25 secs, 246 tests
python -m pytest sktime/classification sktime/tests --co - 51 secs, 1713 tests

I've noted the huge bump in time from coverage logic already here, in the profiler logs: #6344 (comment)

What are your timings with pytest-cov off?

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

... maybe we should also turn off pytest-cov in the CI tests, and run it only on test-all?

Coverage numbers are off anyway, due to incremental/differential testing - see #5090

@yarnabrina
Collaborator

I shared my setup.cfg and pip freeze above; all my reported times are without coverage. It's not even installed in my environment.

I am really shocked to see the difference in the number of tests and timings, especially the number of tests. I think I only have statsmodels, pmdarima and statsforecast installed, maybe neuralforecast as well. Are just these increasing it from 6575 to 79603? That seems a bit too much.

I definitely don't have any classification dependencies (not intentionally at least), and I have way more tests (36502/568) than you (1713/246). Does this match your expectations?

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

I shared my setup.cfg

Where is your setup.cfg? Cannot find it, sorry.

Have you turned off incremental testing? That might explain the diff in tests.

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

Timings and number of tests with --only_changed_modules False, cache-test-switch branch:

  • python -m pytest sktime/forecasting sktime/tests --co, coverage turned on: 75190 tests, 455 secs
  • python -m pytest sktime/forecasting sktime/tests --co, coverage turned off: 75190 tests, 240 secs

@yarnabrina
Collaborator

modified setup.cfg
pip freeze before running below commands

Please click on the arrows - I made those collapsible, as they are big files.

And yes, I believe I have the per-OS subsetting, the only-run-changed-modules flag, etc. turned off. Otherwise it would not be a full benchmark, I think?

That explains the difference in the number of tests, I guess, so that's a relief. I have an unrelated question: why do you see any tests at all for your branch? You haven't modified forecasting or classification, right - shouldn't it be 0 then? Or are there some exceptions to the incremental-only flag?

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

why do you see even any tests for your branch? You haven't modified forecasting or classification right, shouldn't it be 0 then?

The flag impacts only tests from the TestAll... classes, and tests decorated with run_test_for_class pytest skips. All other tests are still collected and run.
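For context, that decorator pattern looks roughly like this (a hedged sketch; the import path and example estimator are for illustration only):

```python
import pytest

from sktime.classification.dummy import DummyClassifier
from sktime.tests.test_switch import run_test_for_class


@pytest.mark.skipif(
    not run_test_for_class(DummyClassifier),
    reason="run only if the class or its dependencies changed in this PR",
)
def test_dummy_classifier_behaviour():
    """Estimator-specific test, skipped by the incremental-testing flag when unaffected."""
    ...
```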

@fkiraly
Collaborator Author

fkiraly commented Apr 29, 2024

Please click on the arrows,

Ah, thanks for explaining. Did not see these at first.

The large difference is explained by your turning off differential/incremental testing.

fkiraly added a commit that referenced this issue Apr 30, 2024
…ilities (#6357)

Towards #6344.

Speeds up test collection by `lru_cache`-ing all test switch utilities
called from `run_test_for_class`.

This gets test collection time for the classification module to 5
seconds, compared to historical test times:

* 0.22.0 - 12 sec for 301 tests
* 0.24.0 - 11 sec for 294 tests
    * in 0.24.2, parent classes were included in diff check
* 0.25.0 - 38 sec for 355 tests
* 0.26.0 - 38 sec for 355 tests
    * in 0.26.1, the condition on dep change in `pyproject` was added
* 0.27.0 - 56 sec for 363 tests
* 0.28.1 - 72 sec for 246 tests
* this PR: 5 sec for 246 tests

(timings from `python -m pytest sktime/classification -n0
--collect-only`)

For all of `sktime`, the test collection time is down to 40sec for me
locally, from >20 minutes.
fkiraly added a commit that referenced this issue Apr 30, 2024
This PR adds temporary skips for estimators involved in timeouts, see
#6344, until the problem is fully diagnosed and addressed.
fkiraly added a commit that referenced this issue May 1, 2024
Also skips tests for `pmdarima` `ARIMA`, in relation to #6344
fkiraly added a commit that referenced this issue May 1, 2024
Removes test step from the release action - temporary to allow quick
release.

Issue #6344 causes mac-13 to run longer than 6 hours and then time out -
there are no known failures, just too long runtimes at the moment.

To be reverted.