Enable view=test mode in wpt.fyi #190

Open · jcscottiii wants to merge 3 commits into master

Conversation

@jcscottiii (Contributor)

Enable view=test mode in wpt.fyi

@mathiasbynens

I’m really excited about this proposal, as it solves the use case described in web-platform-tests/wpt.fyi#3740 without affecting the default wpt.fyi view.

@jgraham (Contributor) commented Apr 8, 2024

This feels extremely sketchy to me in terms of the incentive structure it creates.

In particular, having entire tests show up as FAIL if one subtest is failing feels misleading, and is likely to incentivise people to either remove/not submit tests that have a single failing subtest, or arbitrarily split out the parts of the test that they don't pass into a separate test.

To some extent we already do this for Interop, of course, but it's a massive pain and really something that I'd like to change if possible. And Interop is already better because it gives partial credit for cases where some subtests are passing and others are failing.
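To make the contrast concrete, here is a minimal sketch of the two tallying models being discussed (illustrative only, with made-up data; this is not wpt.fyi's actual scoring code):

```python
# Illustrative only: made-up data, not wpt.fyi's actual scoring code.
# Each test maps to a list of subtest statuses.
results = {
    "a.html": ["PASS", "PASS", "PASS"],
    "b.html": ["PASS", "PASS", "FAIL"],  # a single failing subtest
    "c.html": ["FAIL", "FAIL", "FAIL"],
}

# view=test-style rollup: a test counts as passing only if every subtest passes.
fully_passing = sum(
    all(s == "PASS" for s in subtests) for subtests in results.values()
)

# Interop-style partial credit: each test contributes its subtest pass rate.
interop_score = sum(
    sum(s == "PASS" for s in subtests) / len(subtests)
    for subtests in results.values()
) / len(results)

print(fully_passing)             # 1 of 3 tests pass under the rollup
print(round(interop_score, 2))   # 0.56 with partial credit
```

Under the rollup, b.html counts the same as c.html even though it fails only one of its three subtests; the partial-credit model keeps that distinction.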

@foolip (Member) commented Apr 8, 2024

I like this simplified view, I'm often trying to understand the big picture in terms of tests that need attention. Currently, I have to use a status:!ok status:!pass query and look at the "Showing 172 tests" text at the top while navigating around to get an idea of where there are the most tests with failures.
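(For reference, that kind of query can also be put straight into the results URL via the q= parameter; the link below is illustrative, and the exact run labels and encoding may differ:)

```
https://wpt.fyi/results/?label=master&label=experimental&aligned&q=status%3A%21ok%20status%3A%21pass
```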

It would resolve web-platform-tests/wpt.fyi#421, and also cover testharness.js failures. This view shows it nicely, I think:
https://staging.wpt.fyi/results/infrastructure/expected-fail?label=master&label=experimental&aligned&view=test

This will necessarily hide some information when there are multiple types of failures, but drilling down would still reveal it.

@jgraham I agree this simplification comes with downsides, but since the view can be useful I'd rather have it than not. I'd also like to have the interop-style calculation be possible for any set of tests. There's discussion in #114, and IIRC the sticking point was percentages.

How about if we also added a view that's like view=interop but without the percentages? Then the results could be understood given your choice of tallying, while leaving the default unchanged.

@jcscottiii (Contributor, Author)

We met about this today; @jgraham will comment on this RFC.

@jgraham (Contributor) commented Apr 9, 2024

A concern I have here is that the original motivation is specifically "we want to track a target at the test level", and I worry that's going to cause problems. For example, what happens if we reach the end of the quarter and another contributor adds extra subtests to the existing tests, which makes it look like the goal has been failed? In some cases that could be adding one subtest to many top-level tests (historical examples of this kind of thing would be idlharness tests that check where on the prototype chain attributes end up, or WebDriver classic tests for the interaction of user prompt handling with all other features). I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right. I don't think any of these is a good option.

For @foolip's infrastructure test example, I think that's markedly worse than the current default view, even though it's nearly the best case, where each test has one subtest (not quite true, because window-onload-test.html has 6 subtests, and one is passing in all browsers, which is itself somewhat unexpected in an expected-fail directory; the "test" view totally hides that fact).

If we want a way to find any test that has subtest failures, I think we should consider whether we can solve it with search options rather than a new view. I also think there could be mileage in adding more sorting options to wpt.fyi, e.g. being able to sort by subtest pass rate would enable identifying tests with lots of subtest failures, in case that's ever useful.
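As a rough sketch of that sorting idea (illustrative only; the data shape and file names are invented for the example, this is not wpt.fyi code):

```python
# Sketch only: invented results shape, not wpt.fyi's actual data model.
def subtest_pass_rate(subtests):
    """Fraction of subtests with a PASS status; tests with no subtests sort last."""
    if not subtests:
        return 1.0
    return sum(s == "PASS" for s in subtests) / len(subtests)

results = {
    "idlharness.https.any.html": ["PASS"] * 40 + ["FAIL"] * 10,
    "user-prompts/accept.py": ["PASS", "FAIL", "FAIL"],
    "window-onload-test.html": ["PASS"] * 5 + ["FAIL"],
}

# Lowest pass rate first, i.e. the tests with the most subtest failures on top.
for test, subtests in sorted(results.items(), key=lambda kv: subtest_pass_rate(kv[1])):
    print(f"{subtest_pass_rate(subtests):.2f}  {test}")
```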

@mathiasbynens

> I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right. I don't think any of these is a good option.

For what it’s worth, as the original requester of this feature, I don't think these are the only possible outcomes. The scenario of subtests being added has already happened several times with WebDriver BiDi tests in the past, and it was never a problem for us — if anything, it’s a good thing when a gap in test coverage is identified and addressed by adding more subtests. We do want to track our implementation progress against a set of tests over a long period of time, but that doesn’t mean there’s no room for nuance in our tracking.

@jgraham (Contributor) commented Apr 10, 2024

> We do want to track our implementation progress against a set of tests over a long period of time, but that doesn’t mean there’s no room for nuance in our tracking.

My concern with this proposal is exactly that it removes nuance from the dashboard; a view where any subtest failure marks the entire test as FAIL is more unstable than one that shows the subtests individually.

I also agree there are use cases where you think you probably should pass all the tests for a given feature and want to have some way to find out when new tests are added that you don't pass. But I don't think this view is necessary or sufficient for that use case; for example with that goal, I care about a new test being added to a directory just the same as I care about a subtest, so labeling individual tests to include in the tracked set doesn't really work; you either need to be confident that you will never have a reason to ignore a test in some directory, or you need to label exclusions. But labeling exclusions falls down because we can't label subtests. Also, once features are in maintenance mode you really want push-notifications for changes rather than having to poll a dashboard, because people will stop polling the dashboard once they're mostly working on something else.

@foolip (Member) commented Apr 10, 2024

> I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right.

I think it's a small risk, but yes, there is a risk that some people will be annoyed. If good changes start getting reverted to meet a metric, I think the wpt core team should step in and say no, don't do that. We can document this risk and the expected response in this RFC, or even preemptively put something in our code of conduct or general READMEs to point to.

> For @foolip's infrastructure test example, I think that's markedly worse than the current default view

I find it helpful to see more information about the kind of failure elevated one level. But I'm not claiming it is a perfect view, better than view=subtest in every situation. It's just a useful view that I think I'll often use.

@jgraham Would documenting how to understand each view directly on wpt.fyi address your concern? Even the default view=subtest is quite difficult to understand, so I think this would be good.

@past (Member) commented Apr 10, 2024

I understand the concern about the incentive structure, and I think it is valid. A key point to remember is that incentives are affected by changing the defaults; optional views, flags, search queries, or configurations only serve narrow use cases to get a particular job done. I am of the opinion that we should try to accommodate as many of these use cases as possible, so that WPT remains a core workflow for people and they don't have to migrate to other solutions.

@jgraham (Contributor) left a review comment

I would still strongly discourage people from using this view. I think there's a meaningful difference between "we fail one subtest" and "we fail all the subtests" which this obscures.

In particular for a multi-participant project I think it's a bad idea to try to build metrics on the total count of fully passing tests, because other participants are free to add new failing subtests at any time. That could significantly change the scoring of any metric based on this view.

However, on the understanding that this view is not visible in the UI, and that the people using it fully understand the risks and are aware of the appropriate actions to take when the tests change (notably, that such changes are likely, and that it's not appropriate to ask for correct changes to be reverted or refactored because of their effect on this view or on derived metrics), I think this is OK to add.

Review thread on rfcs/wptfyi_enable_view_eq_test.md (outdated, resolved)
Co-authored-by: jgraham <james@hoppipolla.co.uk>
@past (Member) commented May 14, 2024

> it's not appropriate to ask for correct changes to be reverted or refactored due to their effect on the test view or derived metrics

I want to underscore this point; I agree with it 100%. We are trying to make a great conformance suite for the web platform here, not play games with the metrics.
