A race between two serial logging pumps #1160

Open
kkm000 opened this issue Mar 22, 2020 · 4 comments

@kkm000
Contributor

kkm000 commented Mar 22, 2020

Sorry if this has been logged before, but I could not find a similar issue, and it has bothered me for a while.

This is the loop pumping the serial port log to the log bucket. It is always on, and that's all fine (except that it loses the tail of the log if the machine shuts down too quickly, but that's a GCE design defect and a whole other story--ideally, GCE should keep the log available for another 15-30 seconds after the instance stops). I do not remember off the top of my head where exactly, but I've seen in some of the workflows in this repo that a machine is even restarted with the sole goal of extracting its logs. I am talking about this loop:

    func periodicFlush(f func()) {
        go func() {
            for {
                time.Sleep(5 * time.Second)
                f()
            }
        }()
    }

The real race inside Daisy happens between this pump and the one in WaitForInstancesSignal (WFIS). The latter may react so quickly and issue the stop to the instance that a solid trailing chunk of the log disappears, because the first pump is napping between flushes when the instance goes away. And you'd probably be much less than surprised that I go after these logs only when a build fails, and the tail, the part that was probably the most crucial for diagnostics, is exactly what's missing. I'm talking about this loop:

    for {
        select {
        case <-s.w.Cancel:
            return nil
        case <-tick:
            resp, err := w.ComputeClient.GetSerialPortOutput(project, zone, name, so.Port, start)
            if err != nil {
                status, sErr := w.ComputeClient.InstanceStatus(project, zone, name)
                if sErr != nil {
                    err = fmt.Errorf("%v, error getting InstanceStatus: %v", err, sErr)
                } else {
                    err = fmt.Errorf("%v, InstanceStatus: %q", err, status)
                }
                // Wait until machine restarts to evaluate SerialOutput.
                if status == "TERMINATED" || status == "STOPPED" || status == "STOPPING" {
                    continue
                }
                // Retry up to 3 times in a row on any error if we successfully got InstanceStatus.
                if errs < 3 {
                    errs++
                    continue
                }
                return Errf("WaitForInstancesSignal: instance %q: error getting serial port: %v", name, err)
            }
            start = resp.Next
            for _, ln := range strings.Split(resp.Contents, "\n") {
                if so.StatusMatch != "" {
                    if i := strings.Index(ln, so.StatusMatch); i != -1 {
                        w.LogStepInfo(s.name, "WaitForInstancesSignal", "Instance %q: StatusMatch found: %q", name, strings.TrimSpace(ln[i:]))
                        extractOutputValue(w, ln)
                    }
                }
                if len(so.FailureMatch) > 0 {
                    for _, failureMatch := range so.FailureMatch {
                        if i := strings.Index(ln, failureMatch); i != -1 {
                            errMsg := strings.TrimSpace(ln[i:])
                            format := "WaitForInstancesSignal FailureMatch found for %q: %q"
                            return newErr(errMsg, fmt.Errorf(format, name, errMsg))
                        }
                    }
                }
                if so.SuccessMatch != "" {
                    if i := strings.Index(ln, so.SuccessMatch); i != -1 {
                        w.LogStepInfo(s.name, "WaitForInstancesSignal", "Instance %q: SuccessMatch found %q", name, strings.TrimSpace(ln[i:]))
                        return nil
                    }
                }
            }
            errs = 0
        }
    }

I understand this is not an easy problem to solve, and I have a solid chunk of boilerplate in every build script to work around these issues with varying degrees of success. Two sleeps, essentially: one before logging the magic incantations for the WFIS step to match, and another after it but before the poweroff, to give Daisy a break to suck out the remainder of the log--there is a surge of logging noise when the instance is stopping. Getting the timeout right to avoid false negatives is closer to magic than engineering, tho :)

If only it were somehow possible to keep a single pump, so that what went to the bucket and/or Stackdriver were exactly the same chunk of log that WFIS had seen and acted upon, that would already solve half of the issue. In other words, if WFIS could somehow tap into the existing log stream pumping service inside Daisy instead of rolling a second one of its own. Do you think something like this is doable?
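
To make the idea concrete, here is a minimal sketch (all names are mine, not Daisy's actual API) of what such a tap could look like: a single reader pulls each serial-output chunk once and publishes it to every subscriber, so the archiver and the WFIS matcher always see exactly the same bytes.

    package logtee

    import "sync"

    // logTee is a hypothetical fan-out point for the single serial-log pump.
    type logTee struct {
        mu   sync.Mutex
        subs []chan string
    }

    // Subscribe registers a consumer (the GCS/Stackdriver archiver, the WFIS
    // matcher, ...); every chunk published afterwards is delivered to it.
    func (t *logTee) Subscribe() <-chan string {
        t.mu.Lock()
        defer t.mu.Unlock()
        ch := make(chan string, 16)
        t.subs = append(t.subs, ch)
        return ch
    }

    // Publish is called by the one and only serial-port reader for each chunk.
    func (t *logTee) Publish(chunk string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        for _, ch := range t.subs {
            // Blocks once a consumer's buffer fills, so a lagging consumer
            // slows the pump down instead of losing data.
            ch <- chunk
        }
    }

With something along these lines, the stop issued by WFIS could not outrun the archiver, because both consume the same stream.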

@EricEdens
Contributor

Thanks for the thorough analysis! We're pricked by this thorn as well, and while we've talked about solutions, we haven't implemented any yet. I'll start those conversations again, though, and keep this issue updated.

@EricEdens self-assigned this Mar 22, 2020
@kkm000
Contributor Author

kkm000 commented Mar 23, 2020

@EricEdens, it just suddenly occurred to me. Last year GCE introduced Guest attributes, and I now use them to report the initial status of a deployment (I am running compute clusters, and machines built from my single image specialize on first boot as one of 4 types of nodes: login, controller, NFS server, and a preemptible compute node). The deployment script waits for a signal from the first three, permanent types of nodes, indicating that each has successfully figured out and assumed its required sub-identity. Maybe this can be a solution independent of log polling?

No question about the existing mechanism of matching a magic word in the log--existing workflows should be supported--but as the "recommended" new type of signaling next to the "legacy" one, I believe guest attributes may be useful.
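
For what it's worth, the guest side of such a signal is just a PUT to the documented guest-attributes endpoint of the metadata server (the instance must have guest attributes enabled via the enable-guest-attributes metadata key). A minimal Go sketch, equivalent to the usual curl one-liner; the "daisy"/"status" namespace and key below are made up for illustration:

    package guestsignal

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // setGuestAttribute writes a guest attribute from inside the VM by PUTting
    // to the metadata server, e.g. setGuestAttribute("daisy", "status", "SUCCESS").
    // The namespace and key are whatever the workflow and the guest agree on.
    func setGuestAttribute(namespace, key, value string) error {
        url := fmt.Sprintf(
            "http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/%s/%s",
            namespace, key)
        req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(value))
        if err != nil {
            return err
        }
        req.Header.Set("Metadata-Flavor", "Google")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("metadata server returned %s", resp.Status)
        }
        return nil
    }

The waiting side would then poll the value through the Compute API's instances.getGuestAttributes method instead of scraping the serial log.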

@kkm000
Contributor Author

kkm000 commented Mar 23, 2020

BTW, I am not really a Go programmer--I have written only two production pieces of Go code in my life: one a self-hosted Web server back in my Google days, for the team to gather all the interesting ad stats on a single page from multiple backends, and another that pulled and decrypted the machine SSH keys from KMS for my cloud clusters so that all their machines would have the same identity key (dropped it since in favor of the new Secrets API, much easier to use and audit). But you can use me to document the new feature if you decide to use the Guest attrs (it would involve a curl incantation to signal success/failure). Please @-mention me if you think I can be of help. These go to my inbox and even bleep on the phone.

@EricEdens
Contributor

Hi @kkm000. The guest attributes idea is interesting, but would still require coordination with the logging mechanism to ensure that all serial logs have been archived prior to terminating the worker instance. So, I'd recommend filing a feature request for guest attributes, since I think they're orthogonal to the problem in this issue. (Unless I'm missing something!)

To distill the current issue, these steps need to be serialized:

  1. Fetch the latest serial console logs
  2. Archive them
  3. Process rules in WaitForInstancesSignal

I'm still exploring different options to achieve this. It sounds like you have a mitigation on your end, is that right?
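
As a sketch of that ordering only (fetchSerial, archiveChunk, and processSignals are placeholder names for the three steps above, not existing Daisy functions):

    package serialpump

    import "time"

    // pumpUntilSignal runs the three steps strictly in order for every chunk,
    // so the instance is never acted upon before the chunk that triggered the
    // action has been archived.
    func pumpUntilSignal(cancel <-chan struct{}, tick <-chan time.Time,
        fetchSerial func(start int64) (chunk string, next int64, err error),
        archiveChunk func(chunk string) error,
        processSignals func(chunk string) (done bool)) error {

        var start int64
        for {
            select {
            case <-cancel:
                return nil
            case <-tick:
                chunk, next, err := fetchSerial(start) // 1. fetch the latest serial console logs
                if err != nil {
                    return err
                }
                start = next
                if err := archiveChunk(chunk); err != nil { // 2. archive them
                    return err
                }
                if processSignals(chunk) { // 3. only now apply the WaitForInstancesSignal rules
                    return nil
                }
            }
        }
    }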
