ensure SLO for server availability #713

akosyakov · 2021-10-26T07:33:45Z

Please consider implementing request duration and failure rate metrics for the OpenVSX server to ensure availability.
In our experience, RED (rate, errors, duration) metrics are a good fit for this. @amvanbaren suggested using spring-metrics to collect data for Prometheus.

At Gitpod we rely on OpenVSX server responsiveness while users are starting workspaces. If a request to OpenVSX fails, the workspace is mostly unusable, since the VS Code frontend times out after 1 min. We have been working on an SLO of 99% extension availability and built a caching proxy which allows us to serve 70%-90% of requests for 3 days while OpenVSX is down.

But that is not enough to achieve the goal. We need to ensure that an issue gets recognised and addressed in OpenVSX itself before users notice it. In the past this was not the case: https://www.eclipsestatus.io/ usually did not get updated until some Gitpod user pinged us and we then reached out to @eclipsewebmaster. Usually we already have a full-blown incident by that point. Unfortunately, it is tricky for us to figure out from the proxy whether there is a real issue upstream, since we are not the only client and a request failure can be caused by the proxy itself. The OpenVSX server looks like the proper place to address the issue.
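To make the request concrete, here is a minimal sketch of how RED metrics over a time window could be summarised and checked against the 99% target. It is not OpenVSX code: `Request`, `red_summary`, and `meets_slo` are hypothetical names, and counting only 5xx responses as failures is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    duration_s: float  # time spent serving the request, in seconds


def red_summary(requests, window_s):
    """Summarise rate, error ratio and p99 duration for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_s": 0.0}
    # Assumption: only server-side failures (5xx) count against availability.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 99th percentile, computed with integer arithmetic.
    rank = max(1, -(-99 * total // 100))
    return {
        "rate": total / window_s,        # requests per second
        "error_ratio": errors / total,   # share of failed requests
        "p99_s": durations[rank - 1],    # 99th percentile duration
    }


def meets_slo(summary, target=0.99):
    """True if the share of successful requests reaches the target."""
    return 1.0 - summary["error_ratio"] >= target


# Hypothetical window: 100 requests in 50 seconds, 2 of them timing out.
window = [Request(200, 0.1)] * 98 + [Request(503, 60.0)] * 2
summary = red_summary(window, window_s=50)
print(summary, meets_slo(summary))
```

An alert on `meets_slo` turning false, evaluated on the server side, is the kind of signal that would let an incident be noticed before users report it.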

eclipsewebmaster · 2021-11-29T16:48:03Z

I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.

akosyakov · 2021-11-30T12:54:34Z

> I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.

The performance improvements are great. But this issue is about the Eclipse team being able to recognise that an incident is happening. In the past that was never the case. It is even all right for us if it takes a day or two to resolve an incident, but it should be noticed before users notice it.

amvanbaren · 2023-01-31T10:41:02Z

Related PR: eclipse/openvsx#667

akosyakov mentioned this issue Nov 29, 2021

OpenVSX is slow #757

Closed

amvanbaren mentioned this issue Aug 27, 2022

Add observability on endpoints eclipse/openvsx#514

Closed

kineticsquid added priority:medium priority:high and removed priority:medium labels Jan 12, 2023
