[pkg/stanza] Fileconsumer TestFlushPeriodEOF test failing on Windows #32715

Open · crobert-1 opened this issue Apr 26, 2024 · 16 comments

@crobert-1
Member

Component(s)

pkg/stanza

Describe the issue you're reporting

Failing CI/CD links:
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/8849403200/job/24301322178
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/8850888073/job/24306184699?pr=32529

Failure output:

=== Failed
=== FAIL: fileconsumer/internal/reader TestFlushPeriodEOF (3.01s)
    sink.go:114: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/emittest/sink.go:114
        	            				D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/reader/reader_test.go:209
        	Error:      	timeout: expected: 2, actual: 1
        	Test:       	TestFlushPeriodEOF

=== FAIL: fileconsumer/internal/reader TestFlushPeriodEOF (re-run 1) (3.02s)
    sink.go:114: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/emittest/sink.go:114
        	            				D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/reader/reader_test.go:209
        	Error:      	timeout: expected: 2, actual: 1
        	Test:       	TestFlushPeriodEOF

DONE 2 runs, 3289 tests, 13 skipped, 2 failures in 112.066s

This has been failing pretty consistently (but not quite every time) for the last few days; I haven't been able to determine why it recently started failing more frequently.

crobert-1 added the needs triage label Apr 26, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Member

Could this be related to #32100? @OverOrion @ChrsMark

@OverOrion
Contributor

OverOrion commented Apr 29, 2024

I think it is highly likely to be related. I'll try to have a look at it this week; not sure if I can before the SIG, though.

EDIT: maybe it has something to do with using \r\n instead of \n?

@ChrsMark
Member

> maybe it has something to do with using \r\n instead of \n?

That's likely :) (Windows 🙈).
Maybe using the fmt library to add the newline (i.e. fmt.Sprintln) instead of adding it explicitly could help here?
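
As a rough sketch of that suggestion (not the actual test code; the token value is illustrative), fmt.Sprintln terminates the string with a single \n regardless of GOOS, so the bytes a test writes are identical on Linux and Windows:

package main

import "fmt"

func main() {
	token := "b" // hypothetical token, as in the failing test's expectation

	// Hand-built separator: the result is whatever literal the author typed.
	manual := token + "\n"

	// fmt.Sprintln always appends exactly one "\n", independent of the OS,
	// so there is no way for a "\r\n" to sneak into the written content.
	viaSprintln := fmt.Sprintln(token)

	fmt.Printf("%q vs %q\n", manual, viaSprintln) // "b\n" vs "b\n"
}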

@ChrsMark
Member

So for some reason, on Windows the second token is not collected if we don't append a newline (\n) after it.
That's why we get the error expected: 2, actual: 1: we were reading only 1 token instead of 2.

It seems to be fixed by #32734.
I'm trying to think, though, whether this error indicates an actual issue in the implementation with how we check for EOFs.
@djaglowski @OverOrion would that be possible?
Do we have other tests already that would ensure the behavior on Windows?
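
As an illustration only (the fileconsumer reader wraps its own split functions, so this uses the standard library's bufio.ScanLines as a stand-in): while the file may still grow, the split function is called with atEOF=false, and a trailing token with no newline produces nothing, leaving the flush period as the only way to get it out.

package main

import (
	"bufio"
	"fmt"
)

func main() {
	// atEOF=false models "the file may still grow": a trailing "b" without a
	// newline yields no token, so the reader must rely on the flush period.
	adv, tok, _ := bufio.ScanLines([]byte("b"), false)
	fmt.Println(adv, tok) // 0 [] -> nothing emitted yet

	// With the trailing newline the token comes out immediately, matching the
	// observation above that appending "\n" makes the second token show up.
	adv, tok, _ = bufio.ScanLines([]byte("b\n"), false)
	fmt.Println(adv, string(tok)) // 2 "b"
}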

djaglowski added a commit that referenced this issue May 8, 2024

See #32715

This also adds a bit more debugging info for other tests which fail on
the same expectation, since it's not very obvious what was expected vs
actually found.
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024

@OverOrion
Contributor

> It seems to be fixed by #32734. I'm trying to think, though, whether this error indicates an actual issue in the implementation with how we check for EOFs. @djaglowski @OverOrion would that be possible? Do we have other tests already that would ensure the behavior on Windows?

When I checked the related code, nothing stood out EOF/SplitFunc-wise.

Maybe in the future it would be worth investigating all the skipped tests on Windows; hopefully they could be (re)enabled.

I think this issue could be closed since #32946 got merged, @crobert-1.

@ChrsMark
Member

I still think that we might have an issue on Windows 🤔. To my mind, the test clearly proves that on Windows we don't manage the "hanging" situation properly (see #32734 (comment)), right?
So even if we skip the flaky test for now, it may be worth investigating this further.

@djaglowski
Member

Yes, we should leave this open until the issue is fixed on Windows. Skipping is just a hack so we don't get in everyone else's way.

@pjanotti
Contributor

I took a look at this, and it seems to be a frequent issue when running tests on Windows: timing and synchronization. From my debugging, the test code relies on the 5ns flush period to ensure that the read of the second token, the b, is flushed. That, in turn, depends on the flush code check if time.Since(s.LastDataChange) > period succeeding the first time the test execution hits it. However, on Windows it doesn't succeed the first time the code reaches that check, and then there are no other chances to flush the second token.

You can see the timing differences with the following code:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Count how many times the loop runs until 5 nanoseconds have passed.
	start := time.Now()
	count := 0
	for {
		count++
		if time.Since(start) > 5*time.Nanosecond {
			break
		}
	}
	end := time.Now()
	fmt.Println("GOOS:", runtime.GOOS)
	fmt.Println("Count:", count)
	fmt.Println("Duration:", end.Sub(start))
}
Sample output from separate runs on each platform:

GOOS: windows
Count: 37838
Duration: 532µs
GOOS: darwin
Count: 1
Duration: 304ns
GOOS: linux
Count: 1
Duration: 300ns

@djaglowski
Member

Thanks for looking into and explaining this, @pjanotti. The same problem is likely behind many of the flaky tests we've previously disabled on Windows.

I spent a little time reading up on this and found that the core Golang packages have some strategies for dealing with this in their own tests and benchmarks. Notably, I found this PR which appears to implement a solution, but the latest source has moved away from it for reasons I didn't find.

I think it would be quite a benefit to the repo if we could figure out a reasonable solution like this and use it whenever timing is involved in tests.
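
One possible shape for such a test-side helper (the name and package are hypothetical, not taken from the Go PR or from this repo): spin until the wall clock has observably advanced past the period before asserting, so a coarse Windows timer cannot make time.Since report zero at the moment the assertion runs.

package timinghelp // hypothetical test helper package

import (
	"runtime"
	"time"
)

// WaitForElapsed blocks until time.Since(start) is strictly greater than
// period. On platforms with coarse timer resolution (Windows showed roughly
// 0.5ms in the experiment above), this guarantees that a later
// "time.Since(start) > period" check in the code under test cannot fail
// simply because the clock has not ticked yet.
func WaitForElapsed(start time.Time, period time.Duration) {
	for time.Since(start) <= period {
		runtime.Gosched() // yield to other goroutines instead of busy-spinning
	}
}

A test could then call WaitForElapsed(startOfSecondToken, flushPeriod) before expecting the second token, instead of assuming that 5ns have passed by the time the flush check runs (the variable names here are illustrative).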

@ChrsMark
Member

Maybe something like https://github.com/golang/go/blob/5881ae742fc7eaa9b7d61b4ba37598c42aaa4753/src/testing/testing_windows.go#L47-L70 could do the trick here?

However, is my assumption correct that the time precision issue can hit a real use case as well, not only the tests?

I wonder if, instead of if time.Since(s.LastDataChange) > period, we could define and use a highPrecisionTimeSince to solve the timing issue in the first place, instead of just for the tests?

golang/go#67066 also looks related.
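
A minimal sketch of what that could look like, assuming the drop-in shape suggested above (this is not the existing flush code; the package name is made up):

package flushtimer // hypothetical package

import "time"

// highPrecisionTimeSince is a drop-in for time.Since in the flush check
// (time.Since(s.LastDataChange) > period). This portable fallback just defers
// to time.Since; the idea would be for a Windows-only file (//go:build windows)
// to provide a variant backed by a higher-resolution clock source, similar to
// what Go's own testing_windows.go does for its benchmarks.
func highPrecisionTimeSince(t time.Time) time.Duration {
	return time.Since(t)
}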

@djaglowski
Member

@ChrsMark, I looked briefly at the most recent testing/testing_windows.go and found that the underlying implementation relies on CGO. We've avoided taking a dependency on CGO but I don't have a clear understanding of whether or not we would have to in this case. (This isn't very clear direction but I want to call it out as something to consider with any implementation.)

jlg-io pushed a commit to jlg-io/opentelemetry-collector-contrib that referenced this issue May 14, 2024

@ChrsMark
Member

ChrsMark commented May 16, 2024

In any case this issue looks to be a valid one. Shall we remove the needs triage label?

djaglowski removed the needs triage label May 16, 2024