Cache pom.xml content of non-SNAPSHOT versions to improve Maven cache efficiency #6572

Closed
Lucas-C opened this issue Jun 24, 2020 · 7 comments
Assignees
Labels
datasource:maven priority-3-medium Default priority, "should be done" but isn't prioritised ahead of others type:feature Feature (new functionality)

Comments

@Lucas-C
Contributor

Lucas-C commented Jun 24, 2020

What would you like Renovate to be able to do?
We have around 350 to 450 -SNAPSHOT versions of a dozen internal artifacts,
and it takes Renovate 2 hours to analyze a single git repository containing a pom.xml with dependencies on those artifacts.
Hence, we would like Renovate's cache mechanism to be improved to limit the numerous pom.xml downloads and parsing.

Describe the solution you'd like
Based on the datasource/maven/index.ts source code, it looks like Renovate currently retrieves maven-metadata.xml files to list all existing versions of an artifact. It also uses a cache with a TTL of 10 min.

We suggest improving the caching mechanism to avoid unnecessary HTTP calls.
This could be done by using the <lastUpdated> field of maven-metadata.xml to download pom.xml files only if they were updated since Renovate last downloaded and cached them.
The idea would be to cache the pom.xml content of all non-SNAPSHOT versions.
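
To illustrate the version listing described above, here is a minimal sketch that extracts the `<versions>` entries and the `<lastUpdated>` value from a maven-metadata.xml string and keeps only the non-SNAPSHOT versions. The names `parseMavenMetadata` and `nonSnapshotVersions` are hypothetical and this is not Renovate's code; a regex is used only to keep the example dependency-free, and a real implementation would use an XML parser.

```typescript
// Hypothetical sketch, not Renovate's implementation.
interface ParsedMetadata {
  versions: string[];
  lastUpdated: string | undefined; // e.g. "20200624101500"
}

function parseMavenMetadata(xml: string): ParsedMetadata {
  // Only look inside <versions>…</versions> so that other <version>
  // elements elsewhere in the file are not picked up.
  const blockMatch = /<versions>([\s\S]*?)<\/versions>/.exec(xml);
  const block = blockMatch ? blockMatch[1] : "";

  const versions: string[] = [];
  const versionRe = /<version>\s*([^<\s]+)\s*<\/version>/g;
  let match: RegExpExecArray | null;
  while ((match = versionRe.exec(block)) !== null) {
    versions.push(match[1]);
  }

  const updated = /<lastUpdated>\s*(\d+)\s*<\/lastUpdated>/.exec(xml);
  return { versions, lastUpdated: updated ? updated[1] : undefined };
}

// Release versions are the candidates for long-lived caching.
function nonSnapshotVersions(versions: string[]): string[] {
  return versions.filter((v) => !/-SNAPSHOT$/.test(v));
}
```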

Describe alternatives you've considered
Our Renovate instance uses JFrog Artifactory as its main registry.
We considered using its zone / repository system to limit the number of artifacts listed,
or even simply purging old -SNAPSHOT versions.
This may be a valid workaround for us in the short term, but I thought it may still be worth suggesting this feature.

@rarkins rarkins added datasource:maven type:feature Feature (new functionality) priority-2-high Bugs impacting wide number of users or very important features labels Jun 24, 2020
@rarkins
Collaborator

rarkins commented Jun 24, 2020

I would be very happy to receive any PR that helps improve this experience for yourself and others. If so, I recommend we discuss the choices of approaches here before you start coding, just to make sure the final implementation is acceptable.

First of all I'd like to make sure I understand what is causing so many requests. Is it because you have a lot of dependencies, or because certain dependencies have a lot of available versions? I.e. if we have an O(n) problem, I want to understand which n it is.

I'd also like you to elaborate a bit more about your caching idea, as I'm not sure I understand it.

FYI I also created #6573 as I think that's a separate approach that may help too.

@Lucas-C
Contributor Author

Lucas-C commented Jun 24, 2020

First of all I'd like to make sure I understand what is causing so many requests. Is it because you have a lot of dependencies, or because certain dependencies have a lot of available versions?

It's because certain dependencies have a lot of available versions.
Also, some of them are SNAPSHOT versions, meaning they can be overwritten, unlike release versions.

On second thought, the <lastUpdated> field won't help much with caching,
as it describes the maven-metadata.xml itself and says nothing about the "last update time" of the artifacts' pom.xml files.

Currently, caching happens in the getVersionsFromMetadata function, and only the maven-metadata.xml content is cached, for 10 min (correct me if I'm wrong).

To improve caching, I suggest introducing a pom.xml content cache for non-SNAPSHOT versions, with a much longer TTL (several days, ideally configurable).
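
As a rough illustration of that suggestion, here is a minimal sketch of a TTL cache that stores pom.xml content for release versions only. `PomCache`, `getPom`, and the `download` callback are hypothetical names invented for this example, not Renovate APIs. The download is modelled as a synchronous callback to keep the sketch short; in practice it would be an asynchronous HTTP request.

```typescript
// Hypothetical sketch, not Renovate's implementation.
interface PomCacheEntry {
  content: string;
  expiresAt: number; // epoch milliseconds
}

class PomCache {
  private entries = new Map<string, PomCacheEntry>();
  private ttlMs: number;

  // ttlMs is assumed to be configurable, e.g. several days.
  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.content;
  }

  set(key: string, version: string, content: string, now: number): void {
    // SNAPSHOT versions can be overwritten in the registry, so never cache them.
    if (/-SNAPSHOT$/.test(version)) return;
    this.entries.set(key, { content, expiresAt: now + this.ttlMs });
  }
}

// Fetch through the cache; `download` stands in for the real HTTP call.
function getPom(
  cache: PomCache,
  artifact: string,
  version: string,
  download: (artifact: string, version: string) => string,
  now: number
): string {
  const key = `${artifact}:${version}`;
  const cached = cache.get(key, now);
  if (cached !== undefined) return cached;
  const content = download(artifact, version);
  cache.set(key, version, content, now);
  return content;
}
```

With this shape, repeated lookups of a release version within the TTL trigger a single download, while every lookup of a -SNAPSHOT version still hits the registry.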

I'm going to rename this issue accordingly.

@Lucas-C Lucas-C changed the title from "Use <lastUpdated> field of maven-metadata.xml to improve cache efficiency" to "Cache pom.xml content of non-SNAPSHOT versions to improve Maven cache efficiency" Jun 24, 2020
rarkins added a commit that referenced this issue Jun 26, 2020
If defined in env, this will bypass pom checks and instead rely on the metadata alone.

Related: #6591, #6572
@rarkins
Collaborator

rarkins commented Jun 26, 2020

I added an experimental workaround in 5b24943

If you define RENOVATE_EXPERIMENTAL_NO_MAVEN_POM_CHECK in your env then it should bypass the pom checks. Let me know if this works well for you.
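
For readers wondering what such a bypass amounts to, here is a minimal sketch of the idea. It is not the code from the referenced commit: `selectVersions` and `pomExists` are hypothetical names, and treating any non-empty value as "defined" is an assumption of this sketch.

```typescript
// Hypothetical sketch, not the code from the referenced commit.
type PomExists = (version: string) => boolean;

function selectVersions(
  metadataVersions: string[],
  pomExists: PomExists,
  env: Record<string, string | undefined>
): string[] {
  // Assumption: any non-empty value counts as "defined".
  if (env.RENOVATE_EXPERIMENTAL_NO_MAVEN_POM_CHECK) {
    // Bypass: trust maven-metadata.xml and skip the per-version pom checks.
    return metadataVersions;
  }
  // Default: keep only the versions whose pom.xml can be found.
  return metadataVersions.filter((v) => pomExists(v));
}
```

A caller would pass `process.env` as the `env` argument. The trade-off is that a version listed in the metadata but missing from the registry is no longer filtered out.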

@Lucas-C
Contributor Author

Lucas-C commented Jun 26, 2020

Awesome!
I won't be able to test this in the next few days as I'm on holiday, but thank you!

@Lucas-C
Contributor Author

Lucas-C commented Jul 9, 2020

This really improved our processing time!

[Image: renovate-duration-per-repo]

This graph displays our processing time per repo (we extract this data after execution by parsing the resulting --log-file).
Vertical scale: one grid line = 500 s.

@rarkins
Collaborator

rarkins commented Jul 10, 2020

@zharinov let's work out how to address this. First of all, should the metadata file always be "correct", and the only way it's not correct is due to some type of mistake in the registry's data?

Even if it's only due to a mistake, we then need to evaluate if it happens often enough that we should protect against it anyway (like we have tried to do). Or if for example we think it's ok to require users to fix it themselves with allowedVersions restrictions (some of which we could put into our default presets if it's public packages).

Finally, if we still need to protect against it, we need to work out the best approach. One example is "lazy" verification as suggested in #6591. This would mean a significant refactoring of our lookup/evaluate logic so we'd do something like this instead:

  • getReleases() returns non-validated results
  • Our lookup processing would do filtering to reduce the list substantially (e.g. only newer, only stable, etc)
  • We'd then need to call a new datasource method for "validation" of the filtered versions only, which could possibly remove some more
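
The three steps above can be sketched roughly as follows. All names here (`LazyDatasource`, `validate`, `lazyLookup`) are hypothetical and do not correspond to existing Renovate interfaces. Validation is modelled as a synchronous callback for brevity, whereas a real check (for example, requesting the pom.xml) would be asynchronous.

```typescript
// Hypothetical sketch of the "lazy" verification flow; not a Renovate API.
interface Release {
  version: string;
}

interface LazyDatasource {
  // Step 1: returns releases straight from metadata, without validation.
  getReleases(): Release[];
  // Step 3: the expensive per-version check, e.g. "does the pom.xml exist?"
  validate(release: Release): boolean;
}

function lazyLookup(
  datasource: LazyDatasource,
  keep: (release: Release) => boolean
): Release[] {
  // Step 2: cheap filtering first (only newer, only stable, etc.).
  const candidates = datasource.getReleases().filter(keep);
  // Step 3: validate only what survived the filter.
  return candidates.filter((release) => datasource.validate(release));
}
```

The point of the ordering is that the number of expensive checks scales with the filtered list rather than with every version in maven-metadata.xml.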

@rarkins rarkins added priority-3-medium Default priority, "should be done" but isn't prioritised ahead of others and removed priority-2-high Bugs impacting wide number of users or very important features labels Aug 4, 2020
@rarkins rarkins added the status:requirements Full requirements are not yet known, so implementation should not be started label Jan 12, 2021
@rarkins rarkins added status:ready and removed status:requirements Full requirements are not yet known, so implementation should not be started labels Oct 1, 2023
@zharinov
Collaborator

Seems like this is solved

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2023
3 participants