(mirrors.jenkins.io/http://mirrors.jenkins-ci.org/) Sunset the legacy "mirrorbrain" service in favor of get.jenkins.io #2888

Closed
9 tasks done
dduportal opened this issue Apr 15, 2022 · 25 comments

dduportal commented Apr 15, 2022

Service(s)

Update center, Other

Summary

What Happened

For the past 4 weeks, the infra team has been receiving the following PagerDuty alert multiple times a day: Weird Response time https://updates.jenkins-ci.org.

The alerts are triggered by a threshold in the datadog metrics collection for this service: https://github.com/jenkins-infra/docker-datadog/blob/main/conf.d/http_check.d/jenkins.yaml#L137-L148.

As shown in the screenshots, when the alert is triggered the average HTTP response time has, most of the time, risen above 10s.
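A minimal sketch of how such a response-time check could be reproduced by hand. The helper names `is_slow` and `probe` are hypothetical, and the 10 s default comes from the figure quoted above; the real datadog threshold may differ.

```shell
# Hypothetical helper: is_slow SECONDS [THRESHOLD]
# Succeeds (exit code 0) when SECONDS is above THRESHOLD (default: 10).
is_slow() {
  awk -v t="$1" -v max="${2:-10}" 'BEGIN { exit !(t + 0 > max + 0) }'
}

# Hypothetical helper: prints the total time (in seconds) of one HTTP request.
# curl's --write-out '%{time_total}' reports the elapsed time of the transfer.
probe() {
  curl --silent --output /dev/null --max-time 30 --write-out '%{time_total}' "$1"
}

# Offline demonstration with two made-up measurements.
for t in 3.2 12.5; do
  if is_slow "$t"; then echo "${t}s: slow"; else echo "${t}s: ok"; fi
done

# Live usage (needs network access, so it is left commented out):
#   t="$(probe https://updates.jenkins.io)" && is_slow "$t" && echo "slow: ${t}s"
```

Keeping the comparison in its own function makes it possible to test the threshold logic without touching the network.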

[Screenshots: 2022-04-15 at 11:19:59 and 2022-04-15 at 11:19:35]

Most of the time, the alert resolves itself as the response time decreases. Sometimes, the person on duty (@MarkEWaite or I) has to SSH to the machine pkg.origin.jenkins.io and restart the Apache server (rebooting the machine would be the last option).

Root cause

The (legacy) service referred to as mirrorbrain (hosting the services mirrors.jenkins.io and mirrors.jenkins-ci.org), which is also hosted on this VM, is causing peaks of CPU usage that slow down the other service, updates.jenkins.io.

Puppet configuration audit trail:

Proposal

Let's sunset the legacy service mirrorbrain in favor of the current get.jenkins.io modern mirror service based on mirrorbits!

Rationale:

  • mirrorbits defaults to HTTPS, while mirrorbrain only supports plain old HTTP
  • Why maintain 2 different mirror systems? End users do not benefit from this
  • mirrorbits can scale horizontally and efficiently (redis database, hosted in Kubernetes) and is updated regularly and automatically

In order to NOT break end-user installations, the domains mirrors.jenkins.io and mirrors.jenkins-ci.org should be CNAMEs to the new mirrorbits system.

Known usages of the legacy mirror system

To Do List

  • Add ingresses for the domains mirrors.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)
  • Communicate to end users:
    • Write a blog post on jenkins.io to communicate about the change
    • message on mailing lists jenkins-infra and jenkins-dev
    • message on the jenkinsci twitter account
    • message on IRC jenkins-infra and Gitter jenkins/jenkins
    • message on community.jenkins.io
  • Once the deadline is reached: update the (existing!) DNS records in Azure (either manually or in jenkins-infra/azure if DNS records have been imported) to CNAME to the public DNS associated with the ingresses
  • Update the Puppet repository to remove the mirrorbrain profiles
  • Clean up the former Apache vhosts + postgresql (+ any resource from the mirrorbrain profile) from the VM
@dduportal
Contributor Author

Ping @daniel-beck @MarkEWaite @olblak @lemeurherve @timja @halkeye @jnord @jglick for information, review and advice (in case I forgot anything)

@halkeye
Member

halkeye commented Apr 15, 2022

Code in the GitHub organization jenkinsci (pipeline, scripts, docs) - https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code:

evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways

Jenkins users that are not able to use HTTPS

Are they still able to? Or will we be killing that access path?

@olblak
Member

olblak commented Apr 19, 2022

"Add ingresses for the domains mirror.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)"

What do you think about just deprecating this DNS record? Officially it's not used anymore. Or use it as the k8s cluster fallback: you would cleanly deploy mirrorbits on that machine pkg.jenkins.io, so if something goes wrong with the k8s cluster, you still have it working.

Btw, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime

@dduportal
Contributor Author

evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways

Thanks for the tip! It confirms that what we did in #2040 was correct. For information, https://github.com/jenkins-infra/evergreen is marked as an "archived" repository

Jenkins users that are not able to use HTTPS
Are they still able to? Or will we be killing that access path?

They are still able to, and we'll kill this access path, as the change implies forcing a redirect to HTTPS.

If mirrors.jenkins.io or mirrors.jenkins-ci.org is used to download any file (war, plugin, or package), then it is only over HTTP (there is no vhost for these domains at all, no certificates, and it defaults to https://pkg.origin.jenkins.io/ - with an expected TLS security alert for domain mismatch).

What do you think about just deprecating this DNS record?

Thanks for the tip! You know that I like deleting things ;) But it might be a bit too harsh to kill this domain. Using a CNAME to get.jenkins.io would allow a smooth transition. Once we have tracked as many usages as we can (such as code in the jenkinsci GH org) and switched them to get.jenkins.io, we can track access for 2-3 months to see what usage remains, and maybe decide to kill it at that time.

Btw, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime

Good reminder! That will be the next subject. The current get.jenkins.io, which is Kubernetes cluster wide, is still more available than mirrorbrain on its lone VM. I don't know about response time though. So once mirrorbrain is killed, we'll check the fallback solution for DRS of the Kubernetes cluster.

@dduportal
Contributor Author

Opened the PR jenkins-infra/pipeline-library#374 in the shared library + notified with an email on the dev mailing list https://groups.google.com/g/jenkinsci-dev/c/anTCx9Q6mLI

@dduportal
Contributor Author

Thanks @MarkEWaite and @timja for jenkinsci/jep#386 on this area!

dduportal added a commit to dduportal/plugin-compat-tester that referenced this issue May 5, 2022
Ref. jenkins-infra/helpdesk#2888

This change also uses long flags for `curl` and shows if an error happens during the download (easier to diagnose)
@dduportal
Contributor Author

Another PR on the PCT: jenkinsci/plugin-compat-tester#363

@dduportal
Contributor Author

Other references found on the github.com/jenkinsci organization are not worth the changes (README or deprecated projects such as evergreen)

dduportal added a commit to dduportal/kubernetes-management that referenced this issue May 5, 2022
…elpdesk#2888

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue May 5, 2022
…elpdesk#2888 (#2329)

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

As per @MarkEWaite messages in the #jenkins-infra IRC channel:

Been receiving alerts that updates.jenkins.io is slow to respond. The pkg.jenkins.io top output shows postgres heavily loaded. Stopping and restarting Apache in hopes that it reduces load
Disc use on the /dev/xvda1 disc is at 87%. Vacuumed the logs from using 4 GB to using 1 GB and it didn't change the disc use percentage at all. We may need to expand the disc on that machine or remove more services

Opening maintenance window on status.jenkins.io: jenkins-infra/status#157

@dduportal
Contributor Author

Resized the root volume from 1000 to 1200 GB:

  • Took a snapshot of the disk as an AMI with today's date (in case something goes wrong)
  • Stopped the instance
  • Increased the EBS root volume size to 1200 GB
  • Restarted the instance

The file system was automatically resized:

$ df -hT / # Right after reboot
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     ext4  1.2T  811G  323G  72% /

@dduportal
Contributor Author

Failed to change the instance type:

Today, we are using an m4.2xlarge VM (ref. https://aws.amazon.com/ec2/instance-types/). This instance type features 8 vCPUs on 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) or 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors. Its rate is $0.40 per hour (~$295 per month).

$ cat /proc/cpuinfo  | grep Xeon | sort | uniq
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
$ grep -c processor /proc/cpuinfo
8

The idea was to try to migrate to a new instance size that would benefit from:

  • Better CPU: new generation of Xeon or AMD EPYC (increase peak and clock performances, new instruction set, better core management)
  • Increase network bandwidth
  • Decrease costs

Check the following table to compare instance types, with the following rules:

  • Same amount of vCPU
  • At least 16 GB of memory (currently 32 GB, but only 6 to 10 are used)
  • Only "General Purpose" or "Compute Optimized" families, as this VM is bound to network and CPU (I/O and memory are negligible)
| Instance Type | CPU Family | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | Hourly Rate (on-demand) |
|---|---|---|---|---|---|---|
| m4.2xlarge (Current) | Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz | 8 | 32 | Up to 10 Gbps | 1,000 Mbps | $0.40 |
| m5.2xlarge | 3.1 GHz Intel Xeon® Platinum 8175M | 8 | 32 | Up to 10 Gbps | Up to 4,750 Mbps | $0.384 |
| c6i.2xlarge | 3.5 GHz 3rd generation Intel Xeon | 8 | 16 | Up to 12.5 Gbps | Up to 10,000 Mbps | $0.34 |
| m5a.2xlarge | AMD EPYC 7000 series 2.5 GHz | 8 | 32 | Up to 10 Gbps | Up to 2,880 Mbps | $0.344 |
| c6i.2xlarge | 3.5 GHz 3rd generation Intel Xeon | 8 | 16 | Up to 12.5 Gbps | Up to 6,600 Mbps | $0.34 |
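To compare the hourly rates on a monthly basis, here is a rough sketch. The function name `monthly_cost` is hypothetical, and the 730 hours per month (365 × 24 / 12) is an assumption; the "~295 $" figure quoted earlier apparently uses a slightly different month length.

```shell
# Hypothetical helper: monthly_cost HOURLY_RATE [HOURS_PER_MONTH]
# Assumption: 730 hours per month unless another value is given.
monthly_cost() {
  awk -v rate="$1" -v hours="${2:-730}" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# On-demand hourly rates copied from the comparison table.
while read -r name rate; do
  printf '%-12s %8s USD/month\n' "$name" "$(monthly_cost "$rate")"
done <<'EOF'
m4.2xlarge 0.40
m5.2xlarge 0.384
m5a.2xlarge 0.344
c6i.2xlarge 0.34
EOF
```

Under that assumption, moving from m4.2xlarge to any of the candidates would save between roughly 12 and 44 USD per month.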

Alas, every attempt to change the instance type ended with the error message "configuration not documented" when starting the instance.

Enabling the "Enhanced Networking Adapter" did not change anything (but it is enabled now):

$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[]
$ aws ec2 modify-instance-attribute --instance-id i-e0968e19 --ena-support --region us-east-1
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[
  true
]

Let's keep this instance size for now: the AMI snapshot could be used to try creating a new instance, but our effort is better spent on #2649

@dduportal
Contributor Author

While looking for a short-term workaround for the high CPU usage on this machine, I stumbled across the following error message in the Apache error logs:

AH00632: failed to prepare SQL statements: ERROR:  relation "pfx2asn" does not exist\nLINE 1: ...EPARE asn_dbd_1 (varchar) AS SELECT pfx, asn FROM pfx2asn WH...\n

This error is related to the mirrorbrain installation:

  • Missing table in the PgSQL database
  • This table is related to the mod_asn
  • Should be created during the mirrorbrain installation, along with the Ubuntu APT package postgresql-*-ip4r

But this machine is a mess: there were 3 different PostgreSQL server installations, each one on a different port:

  • postgresql-9.3, port 5432, used by mirrorbrain. The "production" instance
  • postgresql-9.5, port 5433, with a copy of the database from 2021. Smells like an attempted update, or an incomplete Puppet run when @MarkEWaite and I ensured that this Ubuntu was fully 18.04.
  • postgresql-10, port 5433, but stopped (conflict with postgresql-9.3), which is the default version for Ubuntu 18.04, installed with the apt-get dist-upgrade operations.

Since this VM has not been managed by Puppet for some time, the following operations were done manually:

  • Fully migrated the PostgreSQL instance to PostgreSQL 10, to allow installation of the only available ip4r PostgreSQL package, postgresql-10-ip4r
# Ensure postgresql 10 is installed properly
$ apt-get -y install postgresql-10
$ dpkg --get-selections | grep postgresql # Sanity check

# Migrate the current 9.3 cluster named `main` to version 10 with the same name
$ pg_lsclusters
$ pg_renamecluster 10 main main_ver10
$ pg_lsclusters # Sanity check
$ systemctl stop postgresql@9.3-main.service 
$ pg_upgradecluster 9.3 main # Restarts the instance once done
$ pg_lsclusters # Sanity check
  • Cleaned up old postgresql versions
## Cleanup
$ pg_dropcluster --stop 9.3 main
$ pg_dropcluster --stop 10 main_ver10
$ pg_dropcluster --stop 9.5 main
$ apt-get remove --purge postgresql-9.3 postgresql-client-9.3 postgresql-9.5 postgresql-client-9.5
$ dpkg --get-selections | grep postgresql # Sanity check
  • Installed and configured ip4r in the database (as it was missing)
# Ensure ip4r is installed properly
$ apt-get -y install postgresql-contrib postgresql-10-ip4r

# Create extension in the pgsql instance, as Pg superuser
$ su - postgres
$ psql # Top-level
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q
$ psql --dbname=jenkins_mirrorbrain_db # On the mirrorbrain database
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q

# Load the ASN script, now that the primitive type `iprange` is provided by the ip4r extension
$ psql --host=localhost --username=jenkins_mirrorbrain --password --dbname=jenkins_mirrorbrain_db --file=/usr/share/doc/libapache2-mod-asn/asn.sql
password: <redacted>

# Ensure everything is loaded and available
$ apt update && apt-get dist-upgrade && apt-get autoremove --purge && update-grub && reboot
  • Ensured that the error message no longer appears in the Apache logs:
$ tail -f /var/log/apache2/*log

@dduportal
Contributor Author

Another error in the Apache log, but no solution for now:

[Sat May 07 10:47:48.548369 2022] [mpm_event:error] [pid 1651:tid 140147096673216] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

Sounds related to https://www.claudiokuenzler.com/blog/948/apache-2.4-mpm-event-bug-freezing-up-scoreboard-full-after-reload (yes, we are using MPM event, and /server-status shows a lot of Apache threads staying in the G state for a long time).
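A small sketch to quantify how many workers are stuck in that state, assuming mod_status is reachable and its machine-readable output (`?auto`) is used. The function name `count_graceful` and the sample scoreboard string are hypothetical.

```shell
# Hypothetical helper: reads mod_status "?auto" output on stdin and prints
# the number of workers in the "G" (gracefully finishing) state.
count_graceful() {
  awk -F': ' '/^Scoreboard:/ { print gsub(/G/, "G", $2) }'
}

# Offline demonstration with a made-up scoreboard.
printf 'BusyWorkers: 4\nIdleWorkers: 3\nScoreboard: __GGW_G..\n' | count_graceful

# Live usage on the VM (assumes /server-status is served on localhost):
#   curl --silent 'http://localhost/server-status?auto' | count_graceful
```

Running this periodically would show whether the number of "G" workers keeps growing after each Apache reload, which is the symptom described in the linked article.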

To help with this investigation, I installed sysstat to collect finer-grained metrics:

$ apt-get update -q && apt-get install -y sysstat
$ vi /etc/default/sysstat # changed `ENABLED` to `true`
$ vi /etc/cron.d/sysstat # changed to collection every 2 min
$ systemctl enable sysstat
$ systemctl start sysstat

It appears that there are peaks of CPU on %system when the slowness appears:

10:00:01 AM     all      8.87      0.00      3.10      0.06      0.12     87.86
10:02:01 AM     all     28.04      0.00      4.75      0.27      0.14     66.79
10:04:01 AM     all     26.21      0.00      4.87      0.15      0.21     68.55
10:06:01 AM     all     33.32      0.00     12.11      0.13      1.68     52.77
10:08:01 AM     all     30.51      0.00     11.64      0.08      1.68     56.08
10:10:01 AM     all     27.46      0.00     13.96      0.05      1.72     56.81
10:12:01 AM     all     30.69      0.00     13.89      0.11      1.66     53.66
10:14:01 AM     all     30.90      0.00     11.48      0.11      1.69     55.82
10:16:01 AM     all     27.94      0.00     13.86      0.08      1.71     56.41
10:18:01 AM     all     29.40      0.00     14.48      0.07      1.66     54.39
10:20:01 AM     all     27.84      0.00     13.03      0.06      1.72     57.35
10:22:01 AM     all     23.33      0.00      4.35      0.14      0.23     71.96
10:24:01 AM     all     21.31      0.00      3.50      0.06      0.11     75.01
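The peaks can be extracted from such output automatically. This is only a sketch: it assumes the default `sar -u` column order (%user, %nice, %system, %iowait, %steal, %idle), since the header line is not shown above, and the function name `high_system` and the threshold of 10 are hypothetical choices.

```shell
# Hypothetical helper: reads sar CPU lines on stdin and prints the timestamp
# and %system value of every "all" sample whose %system exceeds the threshold.
# Assumed fields: time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle.
high_system() {
  awk -v max="${1:-10}" '$3 == "all" && $6 + 0 > max + 0 { print $1, $2, $6 }'
}

# Three samples copied from the output above.
high_system 10 <<'EOF'
10:04:01 AM     all     26.21      0.00      4.87      0.15      0.21     68.55
10:06:01 AM     all     33.32      0.00     12.11      0.13      1.68     52.77
10:22:01 AM     all     23.33      0.00      4.35      0.14      0.23     71.96
EOF
```

On the full sample, this selects the 10:06 to 10:20 window, where %system stays above 11 while %user barely changes.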

We might check the configuration history:

  • When @olblak and I decreased the instance size from 16 to 8 vCPUs, we might have failed to update the MPM worker threads configuration
    • Check and fine-tune the current configuration for 8 vCPUs?
    • Go back to 16 vCPUs (but on a new instance generation)?
  • Maybe MPM event is not the best solution with Apache 2.4: consider switching to MPM prefork

@dduportal
Contributor Author

  • DNS record mirrors.jenkins.io changed from IN A 52.202.51.185 to CNAME get.jenkins.io. (TTL 1 min) today at ~08:10am UTC

@dduportal
Contributor Author

  • Outage on updates.jenkins.io as a consequence of this change: the DNS record updates.jenkins.io was a CNAME to mirrors.jenkins.io (reported by a user around 08:47am in IRC / the Gitter channel jenkins/jenkins)
    • DNS record updates.jenkins.io changed to IN A 52.202.51.185 around 09:00am UTC, and its TTL was changed from 1 hour to 1 minute
    • It was an unplanned side effect. We should have checked this DNS record before. Expect 1 hour for DNS caches to update. Until then, updates.jenkins.io is considered to be in full outage (because requests are redirected to the Kubernetes cluster until DNS caches expire)
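A pre-flight check along these lines could have caught the dependency before the record was changed. This is only a sketch: the function name `cname_dependents` and the simplified "name type value" export format are assumptions, and the third sample record is made up.

```shell
# Hypothetical helper: cname_dependents TARGET
# Reads records on stdin, one per line as "name type value" (a simplified,
# assumed zone export), and prints every name that is a CNAME to TARGET.
cname_dependents() {
  awk -v target="${1%.}" '
    toupper($2) == "CNAME" {
      value = $3
      sub(/\.$/, "", value)   # ignore the trailing dot of absolute names
      if (value == target) print $1
    }'
}

# The first two records come from this thread; the last one is hypothetical.
cname_dependents mirrors.jenkins.io <<'EOF'
mirrors.jenkins.io A 52.202.51.185
updates.jenkins.io CNAME mirrors.jenkins.io.
example.jenkins.io CNAME get.jenkins.io.
EOF
```

Any name printed here would follow the target to its new destination, so it needs its own record before the target is switched.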

@dduportal
Contributor Author

Starting maintenance on the VM:

  • Checking logs of the service mirrors.jenkins.io to be sure
  • Snapshotting the VM for backup
  • Stop postgresql and mirrorbrain service, wait 1 hour and clean it up if no error

In parallel, jenkins-infra/status#166 was opened to prepare puppet so we can put this machine under automatic puppet management again.

@dduportal
Contributor Author

Ran the following command on the VM (after snapshotting and backing up the postgres data):

apt-get remove --purge postgresql-10 postgresql-10-ip4r postgresql-client-10 postgresql-client-common postgresql-common postgresql-contrib mirmon mirrorbrain mirrorbrain-scanner mirrorbrain-tools
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  formencode-i18n libalgorithm-c3-perl libauthen-sasl-perl libb-hooks-endofscope-perl libclass-c3-perl libclass-c3-xs-perl libclass-data-inheritable-perl
  libclass-inspector-perl libclass-method-modifiers-perl libclass-singleton-perl libconfig-inifiles-perl libdata-dump-perl libdata-optlist-perl
  libdatetime-locale-perl libdatetime-perl libdatetime-timezone-perl libdbd-pg-perl libdbi-perl libdevel-caller-perl libdevel-lexalias-perl
  libdevel-stacktrace-perl libdigest-md4-perl libencode-locale-perl libeval-closure-perl libexception-class-perl libfile-listing-perl libfile-sharedir-perl
  libfont-afm-perl libhtml-form-perl libhtml-format-perl libhtml-parser-perl libhtml-tagset-perl libhtml-tree-perl libhttp-cookies-perl libhttp-daemon-perl
  libhttp-date-perl libhttp-message-perl libhttp-negotiate-perl libio-html-perl libio-socket-inet6-perl libio-socket-ssl-perl liblwp-mediatypes-perl
  liblwp-protocol-https-perl libmailtools-perl libmodule-implementation-perl libmro-compat-perl libnamespace-autoclean-perl libnamespace-clean-perl
  libnet-http-perl libnet-smtp-ssl-perl libnet-ssleay-perl libpackage-stash-perl libpackage-stash-xs-perl libpadwalker-perl libparams-util-perl
  libparams-validationcompiler-perl libreadonly-perl libref-util-perl libref-util-xs-perl librole-tiny-perl libsocket6-perl libspecio-perl
  libsub-exporter-perl libsub-exporter-progressive-perl libsub-identify-perl libsub-install-perl libsub-quote-perl libtry-tiny-perl liburi-perl
  libvariable-magic-perl libwww-perl libwww-robotrules-perl perl-openssl-defaults python-cmdln python-dnspython python-formencode python-mb
  python-pkg-resources python-pydispatch python-sqlobject python3-dnspython python3-formencode python3-pydispatch python3-sqlobject sqlobject-admin
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  mirmon* mirrorbrain* mirrorbrain-scanner* mirrorbrain-tools* postgresql-10* postgresql-10-ip4r* postgresql-client-10* postgresql-client-common*
  postgresql-common* postgresql-contrib*
0 upgraded, 0 newly installed, 10 to remove and 3 not upgraded.
After this operation, 20.9 MB disk space will be freed.

followed by the autoremove.

Also manually removed all the Apache vhost configurations (after backing them up) related to the domains mirrors.jenkins or get.jenkins.io.

Still some apache config to clean up

@dduportal
Contributor Author

  • Cleaned up any remnant of mirrorbrain / mirrors on the VM
  • Re-enabled Puppet management with a new agent name pkg (with the whole Puppet certificate regeneration)
  • Ran a dry run of the Puppet agent, and backed up all files touched
  • Puppet apply ran successfully
  • VM rebooted with puppet enabled

@dduportal
Contributor Author

jenkins-infra/status#167

@dduportal
Contributor Author

Just in case: backups of the apache2 etc and var directories are in /root if anything breaks, and there is a snapshot of the VM root volume in AWS

@dduportal
Contributor Author

Summary of the past days:

A lot of people helped, and I'm really glad for it!

Next step:

  • Check whether HTTP to HTTPS redirection is needed and, if so, whether it is enforced for the pkg and update services
  • Ensure that nothing else is broken

@dduportal
Contributor Author

Yet another incident due to this issue: #2960

@dduportal
Contributor Author

Closing as the incidents seem to be gone (all of them).

lemeurherve pushed a commit to lemeurherve/pipeline-library that referenced this issue Jun 1, 2022
jenkins-infra#374)

* feat(infra) switch to the new mirror system in HTTPS - jenkins-infra/helpdesk#2888

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>

* cleanup(runAth) remove unused mirror variable

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>

* chore(README) typos

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal added a commit to jenkins-infra/mirror-scripts that referenced this issue Jun 1, 2022
smerle33 pushed a commit to smerle33/pipeline-library that referenced this issue Jan 16, 2024
jenkins-infra#374)

* feat(infra) switch to the new mirror system in HTTPS - jenkins-infra/helpdesk#2888

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>

* cleanup(runAth) remove unused mirror variable

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>

* chore(README) typos

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>