A race between two serial logging pumps #1160

Open
kkm000 opened this issue Mar 22, 2020 · 4 comments

@kkm000
Contributor

kkm000 commented Mar 22, 2020

Sorry if this has been logged before, but I could not find a similar issue, and it has bothered me for a while.

This is the loop pumping the serial port log to the log bucket. It is always on, and that's all fine (except that it loses the tail of the log if the machine shuts down too quickly, but that's a GCE design defect and a whole other story--ideally, GCE should keep the log available for another 15-30 seconds after the instance stops). I do not remember off the top of my head where exactly, but I've seen in some of the workflows in this repo that a machine is even restarted with the sole goal of extracting its logs. I am talking about this loop:

    func periodicFlush(f func()) {
        go func() {
            for {
                time.Sleep(5 * time.Second)
                f()
            }
        }()
    }

The real race inside Daisy happens between this pump and the one in WaitForInstancesSignal (WFIS). The latter may react so quickly and issue the stop to the instance that a solid trailing chunk of the log disappears, because the first pump is napping between flushes when the instance goes away. And you'd probably be much less than surprised that I go after these logs only when a build fails, and the tail, the part that was probably the most crucial for diagnostics, is exactly what's missing. I'm talking about this loop:

    for {
        select {
        case <-s.w.Cancel:
            return nil
        case <-tick:
            resp, err := w.ComputeClient.GetSerialPortOutput(project, zone, name, so.Port, start)
            if err != nil {
                status, sErr := w.ComputeClient.InstanceStatus(project, zone, name)
                if sErr != nil {
                    err = fmt.Errorf("%v, error getting InstanceStatus: %v", err, sErr)
                } else {
                    err = fmt.Errorf("%v, InstanceStatus: %q", err, status)
                }
                // Wait until machine restarts to evaluate SerialOutput.
                if status == "TERMINATED" || status == "STOPPED" || status == "STOPPING" {
                    continue
                }
                // Retry up to 3 times in a row on any error if we successfully got InstanceStatus.
                if errs < 3 {
                    errs++
                    continue
                }
                return Errf("WaitForInstancesSignal: instance %q: error getting serial port: %v", name, err)
            }
            start = resp.Next
            for _, ln := range strings.Split(resp.Contents, "\n") {
                if so.StatusMatch != "" {
                    if i := strings.Index(ln, so.StatusMatch); i != -1 {
                        w.LogStepInfo(s.name, "WaitForInstancesSignal", "Instance %q: StatusMatch found: %q", name, strings.TrimSpace(ln[i:]))
                        extractOutputValue(w, ln)
                    }
                }
                if len(so.FailureMatch) > 0 {
                    for _, failureMatch := range so.FailureMatch {
                        if i := strings.Index(ln, failureMatch); i != -1 {
                            errMsg := strings.TrimSpace(ln[i:])
                            format := "WaitForInstancesSignal FailureMatch found for %q: %q"
                            return newErr(errMsg, fmt.Errorf(format, name, errMsg))
                        }
                    }
                }
                if so.SuccessMatch != "" {
                    if i := strings.Index(ln, so.SuccessMatch); i != -1 {
                        w.LogStepInfo(s.name, "WaitForInstancesSignal", "Instance %q: SuccessMatch found %q", name, strings.TrimSpace(ln[i:]))
                        return nil
                    }
                }
            }
            errs = 0
        }
    }

I understand this is not an easy problem to solve, and I have a solid chunk of boilerplate in every build script to work around these issues with varying degrees of success. Two sleeps, essentially: one before logging the magic incantations for the WFIS step to match, and another after it but before the poweroff, to give Daisy a break to suck out the remainder of the log--there is a surge of logging noise when the instance is stopping. Getting the timeout right to avoid false negatives is closer to magic than engineering, tho :)

If only it were somehow possible to keep a single pump, so that what went to the bucket and/or Stackdriver were exactly the same chunk of log that WFIS had seen and acted upon, that would already solve half of the issue. In other words, if WFIS could somehow tap into the existing log stream pumping service inside Daisy instead of rolling a second one of its own. Do you think something like this is doable?
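
To make the idea concrete, here is a minimal sketch (all names are mine, not Daisy's actual API) of what such a tap could look like: a single reader pulls each serial-output chunk once and publishes it to every subscriber, so the archiver and the WFIS matcher always see exactly the same bytes.

    package logtee

    import "sync"

    // logTee is a hypothetical fan-out point for the single serial-log pump.
    type logTee struct {
        mu   sync.Mutex
        subs []chan string
    }

    // Subscribe registers a consumer (the GCS/Stackdriver archiver, the WFIS
    // matcher, ...); every chunk published afterwards is delivered to it.
    func (t *logTee) Subscribe() <-chan string {
        t.mu.Lock()
        defer t.mu.Unlock()
        ch := make(chan string, 16)
        t.subs = append(t.subs, ch)
        return ch
    }

    // Publish is called by the one and only serial-port reader for each chunk.
    func (t *logTee) Publish(chunk string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        for _, ch := range t.subs {
            // Blocks once a consumer's buffer fills, so a lagging consumer
            // slows the pump down instead of losing data.
            ch <- chunk
        }
    }

With something along these lines, the stop issued by WFIS could not outrun the archiver, because both consume the same stream.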

@EricEdens
Contributor

Thanks for the thorough analysis! We're pricked by this thorn as well, and while we've talked about solutions, we haven't implemented any yet. I'll start those conversations again, though, and keep this issue updated.

@EricEdens self-assigned this Mar 22, 2020
@kkm000
Contributor Author

kkm000 commented Mar 23, 2020

@EricEdens, it just suddenly occurred to me. Last year GCE introduced Guest attributes, and I now use them to report the initial status of a deployment (I am running compute clusters, and machines built from my single image specialize on first boot as one of 4 types of nodes: login, controller, NFS server, and a preemptible compute node). The deployment script waits for a signal from the first three, permanent types of nodes, indicating that each has successfully figured out and assumed its required sub-identity. Maybe this can be a solution independent of log polling?

No question about the existing mechanism of matching a magic word in the log--existing workflows should be supported--but as the "recommended" new type of signaling next to the "legacy" one, I believe guest attributes may be useful.
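
For what it's worth, the guest side of such a signal is just a PUT to the documented guest-attributes endpoint of the metadata server (the instance must have guest attributes enabled via the enable-guest-attributes metadata key). A minimal Go sketch, equivalent to the usual curl one-liner; the "daisy"/"status" namespace and key below are made up for illustration:

    package guestsignal

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // setGuestAttribute writes a guest attribute from inside the VM by PUTting
    // to the metadata server, e.g. setGuestAttribute("daisy", "status", "SUCCESS").
    // The namespace and key are whatever the workflow and the guest agree on.
    func setGuestAttribute(namespace, key, value string) error {
        url := fmt.Sprintf(
            "http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/%s/%s",
            namespace, key)
        req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(value))
        if err != nil {
            return err
        }
        req.Header.Set("Metadata-Flavor", "Google")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("metadata server returned %s", resp.Status)
        }
        return nil
    }

The waiting side would then poll the value through the Compute API's instances.getGuestAttributes method instead of scraping the serial log.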

@kkm000
Contributor Author

kkm000 commented Mar 23, 2020

BTW, I am not really a Go programmer--I have written only two production pieces of Go code in my life: one a self-hosted Web server back in my Google days, for the team to gather all the interesting ad stats on a single page from multiple backends, and another that pulled and decrypted the machine SSH keys from KMS for my cloud clusters so that all their machines would have the same identity key (dropped it since in favor of the new Secrets API, much easier to use and audit). But you can use me to document the new feature if you decide to use the Guest attrs (it would involve a curl incantation to signal success/failure). Please @-mention me if you think I can be of help. These go to my inbox and even bleep on the phone.

@EricEdens
Contributor

Hi @kkm000. The guest attributes idea is interesting, but would still require coordination with the logging mechanism to ensure that all serial logs have been archived prior to terminating the worker instance. So, I'd recommend filing a feature request for guest attributes, since I think they're orthogonal to the problem in this issue. (Unless I'm missing something!)

To distill the current issue, these steps need to be serialized:

  1. Fetch the latest serial console logs
  2. Archive them
  3. Process rules in WaitForInstancesSignal

I'm still exploring different options to achieve this. It sounds like you have a mitigation on your end, is that right?
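
As a sketch of that ordering only (fetchSerial, archiveChunk, and processSignals are placeholder names for the three steps above, not existing Daisy functions):

    package serialpump

    import "time"

    // pumpUntilSignal runs the three steps strictly in order for every chunk,
    // so the instance is never acted upon before the chunk that triggered the
    // action has been archived.
    func pumpUntilSignal(cancel <-chan struct{}, tick <-chan time.Time,
        fetchSerial func(start int64) (chunk string, next int64, err error),
        archiveChunk func(chunk string) error,
        processSignals func(chunk string) (done bool)) error {

        var start int64
        for {
            select {
            case <-cancel:
                return nil
            case <-tick:
                chunk, next, err := fetchSerial(start) // 1. fetch the latest serial console logs
                if err != nil {
                    return err
                }
                start = next
                if err := archiveChunk(chunk); err != nil { // 2. archive them
                    return err
                }
                if processSignals(chunk) { // 3. only now apply the WaitForInstancesSignal rules
                    return nil
                }
            }
        }
    }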
