ensure SLO for server availability #713

Open
akosyakov opened this issue Oct 26, 2021 · 3 comments

Comments

@akosyakov
Contributor

akosyakov commented Oct 26, 2021

Please consider implementing request duration and failure rate metrics for the OpenVSX server to ensure availability.
In our experience, RED metrics are a good fit for this. @amvanbaren suggested using spring-metrics to collect data for Prometheus.
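
To make the request concrete, here is a minimal sketch of what RED (Rate, Errors, Duration) bookkeeping amounts to. It is plain Python with an in-memory store, not the spring-metrics or Prometheus API; the `RedMetrics` class, its method names, and the endpoint label are all hypothetical and only illustrate the three numbers we would like the server to expose.

```python
import math
import time
from collections import defaultdict


class RedMetrics:
    """Hypothetical in-memory RED recorder: Rate, Errors, Duration per endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)    # total requests per endpoint (rate)
        self.errors = defaultdict(int)      # failed requests per endpoint (errors)
        self.durations = defaultdict(list)  # durations in seconds (duration)

    def observe(self, endpoint, handler):
        """Run handler(), recording its duration and whether it raised."""
        start = time.perf_counter()
        self.requests[endpoint] += 1
        try:
            return handler()
        except Exception:
            self.errors[endpoint] += 1
            raise
        finally:
            # Record the duration for successful and failed requests alike.
            self.durations[endpoint].append(time.perf_counter() - start)

    def failure_rate(self, endpoint):
        """Fraction of requests to this endpoint that raised."""
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def p99_duration(self, endpoint):
        """99th percentile duration, using the nearest-rank method."""
        samples = sorted(self.durations[endpoint])
        if not samples:
            return 0.0
        rank = max(1, math.ceil(0.99 * len(samples)))
        return samples[rank - 1]
```

In a real deployment a metrics library would export these as counters and histograms for scraping; the sketch only shows which quantities matter.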

At Gitpod we rely on OpenVSX server responsiveness while users are starting workspaces. If a request to OpenVSX fails, the workspace is mostly unusable, since the VS Code frontend times out after 1 minute. We have been working towards an SLO of 99% extension availability and built a caching proxy that allows us to serve 70%-90% of requests for 3 days while OpenVSX is down.
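
For readers unfamiliar with the idea, the caching behaviour described above can be sketched roughly as follows. This is not the actual proxy code; `StaleOnErrorCache`, the `fetch` callable, and the injectable `clock` are hypothetical names, and only the 3-day limit comes from the description above.

```python
import time

# Serve cached responses for up to 3 days while the upstream is failing.
STALE_LIMIT_SECONDS = 3 * 24 * 60 * 60


class StaleOnErrorCache:
    """Hypothetical cache that falls back to a stored response on upstream errors."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch    # callable(key) -> response; may raise
        self._clock = clock    # injectable so the age check can be tested
        self._entries = {}     # key -> (stored_at, response)

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            entry = self._entries.get(key)
            # Fall back only if we have a copy that is not older than the limit.
            if entry is not None and self._clock() - entry[0] <= STALE_LIMIT_SECONDS:
                return entry[1]
            raise
        # Upstream answered: refresh the stored copy and its timestamp.
        self._entries[key] = (self._clock(), value)
        return value
```

Requests for keys that were never cached still fail during an outage, which is one reason a cache of this kind cannot cover 100% of requests.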

That is not enough to achieve the goal, though. We need to ensure that an issue gets recognised and addressed in OpenVSX itself before users notice it. In the past that was not the case: https://www.eclipsestatus.io/ usually did not get updated until some Gitpod user pinged us and we reached out to @eclipsewebmaster. By that point we usually already have a full-blown incident. Unfortunately, it is tricky for us to figure out from the proxy whether there is a real issue upstream, since we are not the only client and a request failure can be caused by the proxy itself. The OpenVSX server looks like the proper place to address this.

@eclipsewebmaster
Contributor

I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.

@akosyakov
Contributor Author

> I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.

The performance improvements are great, but this issue is about the Eclipse team being able to recognise that an incident is happening. In the past that was never the case. It is even all right for us if it takes a day or two to resolve an incident, but it should be noticed by the team before users notice it.
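
As an illustration of what "noticed before users notice" could mean in practice, here is a rough sketch of a server-side check that fires when the recent failure rate exceeds what an availability target allows. The 99% target is borrowed from our own SLO purely as an example; the `AvailabilityAlert` class, the 5-minute window, and the minimum request count are assumptions, not an existing OpenVSX or Prometheus feature.

```python
from collections import deque


class AvailabilityAlert:
    """Hypothetical sliding-window check of failure rate against a target."""

    def __init__(self, target=0.99, window_seconds=300, min_requests=20):
        self.target = target                  # e.g. 0.99 means 1% failures allowed
        self.window_seconds = window_seconds  # how far back to look
        self.min_requests = min_requests      # avoid alerting on a handful of requests
        self._events = deque()                # (timestamp, ok), oldest first

    def record(self, timestamp, ok):
        """Record one request outcome observed at the given time (seconds)."""
        self._events.append((timestamp, ok))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the look-back window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def should_alert(self, now):
        """True if the failure rate within the window exceeds 1 - target."""
        self._expire(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > 1.0 - self.target
```

A check like this, fed from the server's own request metrics, would not be confused by failures that originate in a client-side proxy.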

@amvanbaren
Contributor

Related PR: eclipse/openvsx#667

Projects
Status: In Progress