
[flaky] TestMetricsExport is super flaky #1672

Open

vagababov opened this issue Aug 31, 2020 · 17 comments · Fixed by #1689 or #1957
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@vagababov
Contributor

metrics/resource_view_test.go - TestMetricsExport flakes most of the time.

/assign @jjzeng-seattle @evankanderson

here's a good example: https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/1666/pull-knative-pkg-unit-tests/1300475037081407489

@vagababov vagababov transferred this issue from knative/serving Aug 31, 2020
@mattmoor
Member

cc @yanweiguo (community oncall)

@yanweiguo
Contributor

I ran the tests hundreds of times with my own cluster and could only reproduce the timeout error once. I guess I have to send out a PR to run the CI/CD to debug.
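
(For anyone else trying to reproduce locally: a stress run of just this test, without a cluster, is one way to shake the flake out. The count is arbitrary and the package path assumes the root of knative.dev/pkg.)

    go test -race -count=500 -run TestMetricsExport ./metrics/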

@evankanderson
Member

/reopen

I still see issues, though with a different signature. Collecting signatures here.

=== CONT  TestMetricsExport
    resource_view_test.go:334: Created exporter at localhost:12345
    logger.go:130: 2020-09-27T02:15:16.884Z	INFO	metrics/exporter.go:155	Flushing the existing exporter before setting up the new exporter.
    logger.go:130: 2020-09-27T02:15:16.940Z	ERROR	websocket/connection.go:138	Websocket connection could not be established	{"error": "dial tcp: lookup somewhere.not.exist on 10.7.240.10:53: no such host"}
    logger.go:130: 2020-09-27T02:15:16.975Z	INFO	metrics/opencensus_exporter.go:56	Created OpenCensus exporter with config:	{"config": {}}
    logger.go:130: 2020-09-27T02:15:16.975Z	INFO	metrics/exporter.go:168	Successfully updated the metrics exporter; old config: &{knative.dev/serving testComponent prometheus 1000000000 <nil> <nil>  false 19090 false   {   false}}; new config &{knative.dev/serving testComponent opencensus 1000000000 <nil> <nil> localhost:12345 false 0 false   {   false}}

and then:

    resource_view_test.go:370: Timeout reading input
    resource_view_test.go:376: Unexpected OpenCensus exports (-want +got):
          []metrics.metricExtract(Inverse(Sort, []string{
          	"knative.dev/serving/testComponent/global_export_counts<>:2",
          	"knative.dev/serving/testComponent/resource_global_export_count<>:2",
          	`knative.dev/serving/testComponent/testing/value<project="p1",rev`...,
        - 	`knative.dev/serving/testComponent/testing/value<project="p1",revision="r2">:1`,
          }))
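
(For context on reading that diff: the test gathers exported metrics until a deadline and then compares them order-insensitively, so a single export that never arrives shows up as both a "Timeout reading input" error and a one-line `-want` entry. Roughly this pattern, with hypothetical helper names rather than the actual resource_view_test.go code:)

    package metricstest

    import (
        "testing"
        "time"

        "github.com/google/go-cmp/cmp"
        "github.com/google/go-cmp/cmp/cmpopts"
    )

    // checkExports is a rough sketch (hypothetical names, not the real test):
    // collect exported metric strings from a channel until we have as many as
    // we expect or a deadline fires, then diff order-insensitively.
    func checkExports(t *testing.T, got <-chan string, want []string) {
        t.Helper()
        var exports []string
        deadline := time.After(10 * time.Second)
    collect:
        for len(exports) < len(want) {
            select {
            case e := <-got:
                exports = append(exports, e)
            case <-deadline:
                t.Error("Timeout reading input")
                break collect // report whatever arrived so far
            }
        }
        sorted := cmpopts.SortSlices(func(a, b string) bool { return a < b })
        if diff := cmp.Diff(want, exports, sorted); diff != "" {
            t.Errorf("Unexpected OpenCensus exports (-want +got):\n%s", diff)
        }
    }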

@knative-prow-robot
Contributor

@evankanderson: Reopened this issue.

In response to this:

/reopen


@evankanderson
Member

Seeing interleaved logging from different tests, I'm slightly suspicious that we're seeing side effects of the global monitoring singleton.

Unfortunately, it seems like it's hard to adjust our current prow test infrastructure to run these separately; let me look into doing it via GitHub Actions.
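
(To make the suspicion concrete, a rough sketch of the pattern, with hypothetical names rather than the actual knative.dev/pkg code: OpenCensus exporter registration is process-wide, so every test in the package shares whatever exporter was installed last.)

    package metricstest

    import (
        "sync"

        "go.opencensus.io/stats/view"
    )

    // Hypothetical sketch of a process-wide exporter singleton. Registration is
    // global in OpenCensus, so two tests in the same process share it: whichever
    // test calls swapExporter last wins, and metrics recorded by the other test
    // get flushed through the "wrong" exporter (or dropped), which would explain
    // the interleaved logs above.
    var (
        curExporterMu sync.Mutex
        curExporter   view.Exporter
    )

    func swapExporter(e view.Exporter) {
        curExporterMu.Lock()
        defer curExporterMu.Unlock()
        if curExporter != nil {
            view.UnregisterExporter(curExporter)
        }
        view.RegisterExporter(e) // global: visible to every test in the process
        curExporter = e
    }

If that is what's happening, the cleanest fix is probably to stop the affected tests from running in parallel, or to inject the exporter instead of swapping the global.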

@evankanderson
Member

Update: I've managed to reproduce this about 1 in 50 runs when running all the tests under the e2e script.

It looks like somehow the default exporter is sometimes trying to export to the default localhost:55678 address rather than the address in the config. I'm still trying to figure out why this happens.

I've also found a small bug in the Prometheus exporter where it won't necessarily re-create the exporter if the port changes. Since that seems to happen rarely in the current scenarios, I'm going to roll that in with the other change.
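
(For that second bug, a rough sketch of the kind of guard the fix needs; the struct and function names are hypothetical, not the actual knative.dev/pkg code:)

    package metricstest

    // metricsConfig is a hypothetical stand-in for the real exporter config.
    type metricsConfig struct {
        backend        string
        prometheusPort int
    }

    // isNewExporterRequired sketches the missing check: reusing the existing
    // Prometheus exporter is only safe when both the backend and the port are
    // unchanged; a port change has to tear down the old exporter (and its HTTP
    // listener) and build a new one on the new port.
    func isNewExporterRequired(old, cur *metricsConfig) bool {
        if old == nil || old.backend != cur.backend {
            return true
        }
        return cur.backend == "prometheus" && old.prometheusPort != cur.prometheusPort
    }

So a change from port 19090 to 9090 with the backend still set to prometheus would return true and force a rebuild.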

@evankanderson
Member

I've fixed a few bugs, but I'm still seeing one case where the default meter seems to have lost track of all of the metrics associated with it.

@vagababov
Contributor Author

@evankanderson
any updates on this?

@skonto
Contributor

skonto commented Feb 9, 2021

@evankanderson I still see this issue here: #2005 (comment)

@dprotaso
Member

@knative-prow-robot
Contributor

@dprotaso: Reopened this issue.

In response to this:

/reopen

TestMetricsExport/OpenCensus

https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2189/pull-knative-pkg-unit-tests/1415384970213462016


@github-actions
Contributor

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2021
@benmoss
Member

benmoss commented Oct 26, 2021

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2021
@github-actions
Contributor

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2022
@pierDipi
Member

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2022
@github-actions
Contributor

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@dprotaso
Member

/lifecycle frozen

@knative-prow knative-prow bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 26, 2022
@dprotaso dprotaso changed the title TestMetricsExport is super flaky [flaky] TestMetricsExport is super flaky Jul 27, 2022