CI often fails with "Could not resolve host: github.com" #549

Open
eu9ene opened this issue Apr 30, 2024 · 12 comments
Labels
blocker bug Something isn't working taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Apr 30, 2024

If it's expected that it can fail we should add retries to all our steps:

[vcs 2024-04-30T01:09:53.281Z] fatal: unable to access 'https://github.com/mozilla/firefox-translations-training/': Could not resolve host: github.com

Here is an example task: https://firefox-ci-tc.services.mozilla.com/tasks/JdDA-zYDQnG4166Zbsqq6w
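
As a rough illustration of the retry idea, here is a minimal sketch of a wrapper that re-runs a failing command with exponential backoff. It assumes Python 3 and only the standard library; `run_with_retries` and its parameters are hypothetical names for this example, not something the pipeline or Taskcluster provides.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff when it exits non-zero."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"{cmd!r} failed after {attempts} attempts: {result.stderr.strip()}"
    )
```

A step such as the clone could then be wrapped as `run_with_retries(["git", "clone", url, dest])`, where `url` and `dest` stand in for whatever the step already uses. The `sleep` parameter is injectable only so the backoff can be tested without waiting.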

@eu9ene eu9ene added bug Something isn't working taskcluster Issues related to the Taskcluster implementation of the training pipeline labels Apr 30, 2024
@bhearsum
Collaborator

bhearsum commented May 1, 2024

I'm asking around other projects to see if they're seeing this as well.

@bhearsum
Collaborator

bhearsum commented May 2, 2024

Haven't seen reports of this elsewhere.

@eu9ene - have you seen this on GPU workers only? Or also on the CPU workers?

@eu9ene
Collaborator Author

eu9ene commented May 2, 2024

Haven't seen reports of this elsewhere.

@eu9ene - have you seen this on GPU workers only? Or also on the CPU workers?

I'm not sure but I feel like I've been seeing this in random places.

@gregtatum
Member

@bhearsum
Collaborator

Thanks; so it seems very unlikely to be related to specific worker images.

@aerickson - I don't suppose you have any idea what's going on here?

@eu9ene eu9ene mentioned this issue May 15, 2024
@aerickson
Member

@bhearsum I'm not sure what's going on. Translations GPU workers on GCP should use GCP's DNS servers (provided by DHCP) and the Snakepit workers use our internal Infoblox servers (configured in dnsmasq). It seems like a network blip, or perhaps the DNS server was overloaded for a second?

I haven't heard about any GitHub outages around DNS. I've never really heard of DNS outages (it's a pretty resilient service/protocol).

If we find a concentrated event or location let me know and I'll dig in some more.

@bhearsum
Collaborator

We do see these fairly often - I would say maybe on 5-10% of the tasks run. I'll try to collect some data to help us analyze this better.

@bhearsum
Collaborator

Here's failures by worker group:

defaultdict(<class 'int'>,
            {'us-central1': 7,
             'us-central1-a': 9,
             'us-central1-b': 7,
             'us-central1-c': 7,
             'us-central1-f': 9,
             'us-west1': 5,
             'us-west1-a': 5,
             'us-west1-b': 11})

And here's timestamps when we hit the failures:

['2024-01-19T21:20:34.616Z',
 '2024-02-02T20:02:49.997Z',
 '2024-02-22T16:07:57.737Z',
 '2024-02-26T23:13:59.353Z',
 '2024-02-28T00:36:35.015Z',
 '2024-02-28T17:48:06.906Z',
 '2024-02-29T15:03:54.666Z',
 '2024-03-06T14:32:56.066Z',
 '2024-03-06T18:10:27.186Z',
 '2024-03-21T15:03:54.430Z',
 '2024-03-21T20:29:42.584Z',
 '2024-03-26T13:29:20.434Z',
 '2024-03-30T22:11:50.010Z',
 '2024-04-01T14:10:30.644Z',
 '2024-04-01T15:18:22.619Z',
 '2024-04-09T10:47:13.166Z',
 '2024-04-18T12:41:01.952Z',
 '2024-04-22T19:58:27.752Z',
 '2024-04-23T19:46:39.366Z',
 '2024-04-24T17:57:31.559Z',
 '2024-04-25T07:33:55.024Z',
 '2024-04-29T17:55:15.937Z',
 '2024-04-30T00:38:45.847Z',
 '2024-04-30T01:08:46.769Z',
 '2024-04-30T14:06:18.417Z',
 '2024-04-30T18:56:31.177Z',
 '2024-05-01T23:21:11.540Z',
 '2024-05-02T00:22:43.542Z',
 '2024-05-06T14:40:00.719Z',
 '2024-05-06T16:41:19.051Z',
 '2024-05-07T00:01:30.432Z',
 '2024-05-07T21:23:47.145Z',
 '2024-05-07T21:36:54.706Z',
 '2024-05-09T19:01:17.356Z',
 '2024-05-09T22:10:58.358Z',
 '2024-05-09T22:11:47.850Z',
 '2024-05-10T18:42:00.163Z',
 '2024-05-10T22:14:23.709Z',
 '2024-05-13T23:01:23.656Z',
 '2024-05-13T23:24:46.266Z',
 '2024-05-14T14:49:27.875Z',
 '2024-05-14T15:08:43.843Z',
 '2024-05-14T23:18:39.852Z',
 '2024-05-14T23:43:02.154Z',
 '2024-05-15T01:01:44.678Z',
 '2024-05-15T14:30:49.732Z',
 '2024-05-15T23:20:08.322Z',
 '2024-05-16T21:56:58.285Z',
 '2024-05-17T11:38:30.465Z',
 '2024-05-17T14:02:32.250Z',
 '2024-05-17T16:18:16.690Z',
 '2024-05-17T17:51:07.523Z',
 '2024-05-17T20:37:44.360Z',
 '2024-05-17T20:49:01.813Z',
 '2024-05-17T20:49:18.471Z',
 '2024-05-17T21:36:37.947Z',
 '2024-05-17T22:05:41.152Z',
 '2024-05-17T22:06:42.275Z',
 '2024-05-17T23:11:32.583Z',
 '2024-05-20T15:29:29.312Z']

And by worker image:

defaultdict(<class 'int'>, {'gpu': 58, 'cpu': 2})
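
For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of the tallying step. It assumes each failure has already been collected into a dict with hypothetical `worker_group`, `image`, and `timestamp` keys; how the records are fetched from Taskcluster is not shown, and the per-month grouping is an extra added to make the trend over time easier to see.

```python
from collections import Counter
from datetime import datetime


def summarize(failures):
    """Count failures per worker group, per image type, and per calendar month."""
    by_group = Counter(f["worker_group"] for f in failures)
    by_image = Counter(f["image"] for f in failures)
    by_month = Counter(
        datetime.strptime(f["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ").strftime("%Y-%m")
        for f in failures
    )
    return by_group, by_image, by_month


# Illustrative records only: the timestamps appear in the list above, but the
# worker_group/image attached to each one is made up for this example.
sample = [
    {"worker_group": "us-west1-b", "image": "gpu", "timestamp": "2024-01-19T21:20:34.616Z"},
    {"worker_group": "us-central1-a", "image": "cpu", "timestamp": "2024-04-30T01:08:46.769Z"},
    {"worker_group": "us-west1-b", "image": "gpu", "timestamp": "2024-05-17T20:49:01.813Z"},
    {"worker_group": "us-central1-f", "image": "gpu", "timestamp": "2024-05-17T20:49:18.471Z"},
]

by_group, by_image, by_month = summarize(sample)
print(by_group, by_image, by_month, sep="\n")
```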

Clearly the most notable part here is that we're seeing more issues on GPU images. And within that, we seem to have gotten more beginning in late April/early May. We added a test pool with a new image on April 22nd (that I was running a lot to test things) in https://phabricator.services.mozilla.com/D208202. That image went to production on May 8th in https://phabricator.services.mozilla.com/D209840.

mozilla-platform-ops/monopacker#140 was the PR related to this image, but I don't know how deterministic the other parts are? Eg: could we have picked up a change to a system package that is now causing problems?

@aerickson - do you have any thoughts? If we still have the old image, maybe we could poke around and compare the new to old one? (I'd be happy to do this if you want.)

@eu9ene
Collaborator Author

eu9ene commented May 22, 2024

Ok, it happens every single pipeline run for me now and does not restart. Marking as blocker.

@eu9ene eu9ene added the blocker label May 22, 2024
bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue May 22, 2024
…DNS failures

This should help make mozilla#549 less painful. I suggest we back it out once we get to the bottom of that.
@bhearsum
Collaborator

Poking at this a bit on an interactive instance, too. I sort of reproduced it (although the lookup seems to have been retried successfully):

ubuntu@translations-1-b-linux-v100-gpu-ecnlewoqrtmrltrxrl8o3w:~/tasks/task_171642353933996$ host github.com
;; communications error to 127.0.0.53#53: timed out
github.com has address 140.82.116.3
github.com mail is handled by 5 alt2.aspmx.l.google.com.
github.com mail is handled by 1 aspmx.l.google.com.
github.com mail is handled by 5 alt1.aspmx.l.google.com.
github.com mail is handled by 10 alt3.aspmx.l.google.com.
github.com mail is handled by 10 alt4.aspmx.l.google.com.

Looking at the machine configuration, I see that it uses the standard systemd-resolved setup that we expect on Ubuntu:

ubuntu@translations-1-b-linux-v100-gpu-ecnlewoqrtmrltrxrl8o3w:~/tasks/task_171642353933996$ cat /etc/resolv.conf
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search c.fxci-production-level1-workers.internal google.internal
ubuntu@translations-1-b-linux-v100-gpu-ecnlewoqrtmrltrxrl8o3w:~/tasks/task_171642353933996$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (ens5)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 169.254.169.254
       DNS Servers: 169.254.169.254
        DNS Domain: c.fxci-production-level1-workers.internal google.internal

It claims to be timing out talking to 127.0.0.53 - but I don't know if that means it really couldn't talk to the local resolver, or if that's just the local resolver passing along a failure from the upstream. It seems more likely that it's the latter, but I can't say that with any certainty.

The upstream server is a reserved address, and I'm guessing it's something internal to GCP? I'm really not sure, to be honest - that's quite out of my depth.
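
One way to narrow down whether the timeout comes from the local stub or from the upstream would be to query each server directly and compare. The following is a minimal sketch, assuming Python 3 with only the standard library; `build_query` and `probe` are hypothetical helper names, and the two addresses are the ones shown in the `resolvectl status` output above (169.254.169.254 is the link-local address GCP uses for its metadata server, which also answers DNS for instances).

```python
import socket
import struct
import time


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (RFC 1035) for name; qtype 1 is an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Question: encoded name, terminating zero byte, qtype, class IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(server, name="github.com", timeout=2.0):
    """Send one UDP query to server and report the rcode and latency, or a timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return {"server": server, "status": "timeout"}
        except OSError as exc:
            return {"server": server, "status": f"error: {exc}"}
        elapsed = time.monotonic() - start
        if len(reply) < 4:
            return {"server": server, "status": "malformed reply"}
        # The low 4 bits of the flags word are the response code (0 = NOERROR).
        rcode = struct.unpack("!H", reply[2:4])[0] & 0x000F
        return {"server": server, "status": "ok", "rcode": rcode, "seconds": round(elapsed, 3)}
```

Calling `probe("127.0.0.53")` and `probe("169.254.169.254")` in a loop on an affected worker would show whether timeouts appear on both paths or only through the stub. This sends a plain query without EDNS, so it does not exactly match what systemd-resolved sends upstream.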

I looked through syslogs and found nothing of note, just messages like this every time I performed a lookup:

May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Received dns UDP packet of size 28, ifindex=0, ttl=64, fragsize=0
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Got DNS stub UDP query packet for id 63009
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Looking up RR for github.com IN A.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Cache miss for github.com IN A
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Firing regular transaction 18081 for <github.com IN A> scope dns on ens5/* (validate=yes).
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Received dns UDP packet of size 28, ifindex=0, ttl=64, fragsize=0
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Got DNS stub UDP query packet for id 7454
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Looking up RR for github.com IN MX.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Cache miss for github.com IN MX
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Firing regular transaction 16268 for <github.com IN MX> scope dns on ens5/* (validate=yes).
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Using feature level UDP+EDNS0 for transaction 16268.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Using DNS server 169.254.169.254 for transaction 16268.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Announcing packet size 1432 in egress EDNS(0) packet.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Emitting UDP, link MTU is 1460, socket MTU is 0, minimal MTU is 40
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Sending query packet with id 16268 of size 39.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Processing query...
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Received dns UDP packet of size 154, ifindex=2, ttl=0, fragsize=0
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Processing incoming packet of size 154 on transaction 16268 (rcode=SUCCESS).
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Added positive unauthenticated non-confidential cache entry for github.com IN MX 3598s on ens5/INET/169.254.169.254
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: message repeated 4 times: [ Added positive unauthenticated non-confidential cache entry for github.com IN MX 3598s on ens5/INET/169.254.169.254]
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Regular transaction 16268 for <github.com IN MX> on scope dns on ens5/* now complete with <success> from network (unsigned; non-confidential).
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Sending response packet with id 7454 on interface 1/AF_INET of size 143.
May 23 00:42:26 translations-1-b-linux-v100-gpu-efqfgjymq4cpec8nldnk-q systemd-resolved[445]: Freeing transaction 16268.

bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue May 23, 2024
…DNS failures

This should help make mozilla#549 less painful. I suggest we back it out once we get to the bottom of that.
@bhearsum
Collaborator

I was looking through worker logs of a worker that had a dns issue in production and found other things of interest.

In the task we had:

[vcs 2024-05-23T19:33:53.574Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs']
[vcs 2024-05-23T19:33:53.582Z] Cloning into '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs'...
[vcs 2024-05-23T19:34:13.971Z] fatal: unable to access 'https://github.com/mozilla/firefox-translations-training/': Could not resolve host: github.com

And in the syslogs I found:

May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A.
2024-05-23 20:34:41.006
May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com IN A
2024-05-23 20:34:41.006
May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 65143 for <github.com IN A> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:41.006
May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN AAAA.
2024-05-23 20:34:41.006
May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com IN AAAA
2024-05-23 20:34:41.006
May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 23655 for <github.com IN AAAA> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Added positive unauthenticated non-confidential cache entry for github.com IN A 60s on ens5/INET/169.254.169.254
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Regular transaction 65143 for <github.com IN A> on scope dns on ens5/* now complete with <success> from network (unsigned; non-confidential).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Not caching negative entry for: github.com IN AAAA, cache mode set to no-negative
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Regular transaction 23655 for <github.com IN AAAA> on scope dns on ens5/* now complete with <success> from network (unsigned; non-confidential).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A.
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Positive cache hit for github.com IN A
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Regular transaction 7141 for <github.com IN A> on scope dns on ens5/* now complete with <success> from cache (unsigned; non-confidential).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN AAAA.
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com IN AAAA
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 60167 for <github.com IN AAAA> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Not caching negative entry for: github.com IN AAAA, cache mode set to no-negative
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Regular transaction 60167 for <github.com IN AAAA> on scope dns on ens5/* now complete with <success> from network (unsigned; non-confidential).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com.c.fxci-production-level1-workers.internal IN A.
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com.c.fxci-production-level1-workers.internal IN A
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 48751 for <github.com.c.fxci-production-level1-workers.internal IN A> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com.c.fxci-production-level1-workers.internal IN AAAA.
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com.c.fxci-production-level1-workers.internal IN AAAA
2024-05-23 20:34:45.041
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 26112 for <github.com.c.fxci-production-level1-workers.internal IN AAAA> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com.c.fxci-production-level1-workers.internal IN A
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 48751 for <github.com.c.fxci-production-level1-workers.internal IN A> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com.c.fxci-production-level1-workers.internal IN AAAA
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Firing regular transaction 26112 for <github.com.c.fxci-production-level1-workers.internal IN AAAA> scope dns on ens5/* (validate=yes).
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Not caching negative entry for: github.com.c.fxci-production-level1-workers.internal IN A, cache mode set to no-negative
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Regular transaction 48751 for <github.com.c.fxci-production-level1-workers.internal IN A> on scope dns on ens5/* now complete with <rcode-failure> from network (unsigned; non-confidential).
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Not caching negative entry for: github.com.c.fxci-production-level1-workers.internal IN AAAA, cache mode set to no-negative
2024-05-23 20:34:48.739
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Regular transaction 26112 for <github.com.c.fxci-production-level1-workers.internal IN AAAA> on scope dns on ens5/* now complete with <rcode-failure> from network (unsigned; non-confidential).
2024-05-23 20:34:52.098
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com.c.fxci-production-level1-workers.internal IN A.
2024-05-23 20:34:52.098
May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Cache miss for github.com.c.fxci-production-level1-workers.internal IN A
2024-05-23 20:34:52.098

(For whatever reason, there seems to be a timestamp discrepancy between the task log and the system logs, for example we have messages like May 23 19:34:23 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw start-worker[572]: 2024/05/23 19:33:51 Environment: []string that show a 32 second difference.)

Within a very short time we see:

  • A lookup for github.com succeeds
  • A lookup for github.com.c.fxci-production-level1-workers.internal fails

Maybe that second lookup is expected, maybe it's not - I'm really not sure. The github.com one succeeds, the .internal one does not (of course). I searched through all the logs and there are no failures to resolve the real github.com - the only failures are for .internal domains.

I'm still not sure what to make of this, just dropping more info at the moment.
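
For context on where a name like github.com.c.fxci-production-level1-workers.internal can come from: the `search` line in the resolv.conf shown earlier lists that domain, and a resolver that follows resolv.conf(5) appends search domains to names it could not resolve as given. Below is a simplified sketch of that candidate ordering, assuming the default `ndots` of 1; `candidate_names` is a hypothetical helper, and the logs do not establish that this is what triggered the lookups here.

```python
def candidate_names(name, search, ndots=1):
    """Return the order in which a resolv.conf-style resolver would try name."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [name] + expanded
    return expanded + [name]


search = ["c.fxci-production-level1-workers.internal", "google.internal"]
print(candidate_names("github.com", search))
# ['github.com', 'github.com.c.fxci-production-level1-workers.internal', 'github.com.google.internal']
```

In this model the later candidates are only tried when the earlier ones do not produce an answer, so seeing the .internal lookups at all may be a hint that some query for the plain name came back empty or failed.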

bhearsum added a commit that referenced this issue May 23, 2024
…DNS failures (#624)

This should help make #549 less painful. I suggest we back it out once we get to the bottom of that.
@bhearsum
Collaborator

bhearsum commented May 23, 2024

Another interesting thing is that we see 12 attempts to look up the A record for github.com in the span of 12 seconds, while there were 3 attempts to clone over the span of ~40 seconds (the last one succeeding):

"May 23 19:34:32 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:33 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:37 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:38 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:39 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:41 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:41 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:41 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:42 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
"May 23 19:34:44 translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw systemd-resolved[448]: Looking up RR for github.com IN A."
[vcs 2024-05-23T19:33:53.574Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs']
[vcs 2024-05-23T19:33:53.582Z] Cloning into '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs'...
[vcs 2024-05-23T19:34:13.971Z] fatal: unable to access 'https://github.com/mozilla/firefox-translations-training/': Could not resolve host: github.com
[vcs 2024-05-23T19:34:15.976Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs']
[vcs 2024-05-23T19:34:15.978Z] Cloning into '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs'...
[vcs 2024-05-23T19:34:33.081Z] fatal: unable to access 'https://github.com/mozilla/firefox-translations-training/': Could not resolve host: github.com
[vcs 2024-05-23T19:34:37.087Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs']
[vcs 2024-05-23T19:34:37.089Z] Cloning into '/home/ubuntu/tasks/task_171649283082993/checkouts/vcs'...
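
A count like the one above can be pulled out of exported syslog lines with a short script. This is a minimal sketch, assuming the lines look like the systemd-resolved debug messages quoted above (optionally wrapped in double quotes); `count_lookups` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g.: May 23 19:34:32 <host> systemd-resolved[448]: Looking up RR for github.com IN A.
LOOKUP = re.compile(
    r'^"?(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd-resolved\[\d+\]: '
    r"Looking up RR for (?P<name>\S+) IN (?P<rtype>\w+)\."
)


def count_lookups(lines, name="github.com", rtype="A"):
    """Count lookups of (name, rtype) per syslog timestamp (one-second resolution)."""
    counts = Counter()
    for line in lines:
        match = LOOKUP.match(line)
        if match and match["name"] == name and match["rtype"] == rtype:
            counts[match["stamp"]] += 1
    return counts


host = "translations-1-b-linux-v100-gpu-sqdnmhpqsrk--fyf5bflrw"
sample = [
    f'"May 23 19:34:32 {host} systemd-resolved[448]: Looking up RR for github.com IN A."',
    f'"May 23 19:34:32 {host} systemd-resolved[448]: Looking up RR for github.com IN AAAA."',
    f'"May 23 19:34:33 {host} systemd-resolved[448]: Looking up RR for github.com IN A."',
    f'"May 23 19:34:33 {host} systemd-resolved[448]: Looking up RR for github.com IN A."',
    f'"May 23 19:34:33 {host} systemd-resolved[448]: Looking up RR for github.com IN A."',
]
counts = count_lookups(sample)
print(sum(counts.values()), dict(counts))
```

Lining these per-second counts up against the `[vcs ...]` timestamps would need the roughly 32 second offset between the two logs to be taken into account.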
