Enable view=test mode in wpt.fyi #190

Open · jcscottiii wants to merge 3 commits into master

Conversation

@jcscottiii (Contributor)

Enable view=test mode in wpt.fyi

@mathiasbynens

I’m really excited about this proposal, as it solves the use case described in web-platform-tests/wpt.fyi#3740 without affecting the default wpt.fyi view.

@jgraham (Contributor) commented Apr 8, 2024

This feels extremely sketchy to me in terms of the incentive structure it creates.

In particular, having entire tests show up as FAIL if one subtest is failing feels misleading, and is likely to incentivise people to either remove/not submit tests that have a single failing subtest, or arbitrarily split out the parts of the test that they don't pass into a separate test.

To some extent we already do this for Interop, of course, but it's a massive pain and really something that I'd like to change if possible. And Interop is already better because it gives partial credit for cases where some subtests are passing and others are failing.
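To make the contrast concrete, here is a minimal sketch of the two tallying models being discussed (illustrative only, with made-up data; this is not wpt.fyi's actual scoring code):

```python
# Illustrative only: made-up data, not wpt.fyi's actual scoring code.
# Each test maps to a list of subtest statuses.
results = {
    "a.html": ["PASS", "PASS", "PASS"],
    "b.html": ["PASS", "PASS", "FAIL"],  # a single failing subtest
    "c.html": ["FAIL", "FAIL", "FAIL"],
}

# view=test-style rollup: a test counts as passing only if every subtest passes.
fully_passing = sum(
    all(s == "PASS" for s in subtests) for subtests in results.values()
)

# Interop-style partial credit: each test contributes its subtest pass rate.
interop_score = sum(
    sum(s == "PASS" for s in subtests) / len(subtests)
    for subtests in results.values()
) / len(results)

print(fully_passing)             # 1 of 3 tests pass under the rollup
print(round(interop_score, 2))   # 0.56 with partial credit
```

Under the rollup, b.html counts the same as c.html even though it fails only one of its three subtests; the partial-credit model keeps that distinction.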

@foolip (Member) commented Apr 8, 2024

I like this simplified view, I'm often trying to understand the big picture in terms of tests that need attention. Currently, I have to use a status:!ok status:!pass query and look at the "Showing 172 tests" text at the top while navigating around to get an idea of where there are the most tests with failures.
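(For reference, that kind of query can also be put straight into the results URL via the q= parameter; the link below is illustrative, and the exact run labels and encoding may differ:)

```
https://wpt.fyi/results/?label=master&label=experimental&aligned&q=status%3A%21ok%20status%3A%21pass
```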

It would resolve web-platform-tests/wpt.fyi#421, and also cover testharness.js failures. This view shows it nicely, I think:
https://staging.wpt.fyi/results/infrastructure/expected-fail?label=master&label=experimental&aligned&view=test

This will necessarily hide some information when there are multiple types of failures, but drilling down would still reveal it.

@jgraham I agree this simplification comes with downsides, but since the view can be useful I'd rather have it than not. I'd also like to have the interop-style calculation be possible for any set of tests. There's discussion in #114, and IIRC the sticking point was percentages.

How about if we also added a view that's like view=interop but without the percentages? Then the results could be understood given your choice of tallying, while leaving the default unchanged.

@jcscottiii (Contributor, Author)

We met about this today; @jgraham will comment on this RFC.

@jgraham (Contributor) commented Apr 9, 2024

A concern I have here is that the original motivation is specifically "we want to track a target at the test level", and I worry that's going to cause problems. For example, what happens if we reach the end of the quarter and another contributor adds extra subtests to the existing tests, which makes it look like the goal has been failed? In some cases that could be adding one subtest to many top-level tests (historical examples of this kind of thing would be idlharness tests that check where on the prototype chain attributes end up, or WebDriver classic tests for the interaction of user prompt handling with all other features). I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right. I don't think any of these is a good option.

For @foolip's infrastructure test example, I think that's markedly worse than the current default view, even though it's nearly the best case, where each test has one subtest (not quite true, because window-onload-test.html has 6 subtests, and one is passing in all browsers, which is itself somewhat unexpected in an expected-fail directory; the "test" view totally hides that fact).

If we want a way to find any test that has subtest failures, I think we should consider whether we can solve it with search options rather than a new view. I also think there could be mileage in adding more sorting options to wpt.fyi, e.g. being able to sort by subtest pass rate would enable identifying tests with lots of subtest failures, in case that's ever useful.
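As a rough sketch of that sorting idea (illustrative only; the data shape and file names are invented for the example, this is not wpt.fyi code):

```python
# Sketch only: invented results shape, not wpt.fyi's actual data model.
def subtest_pass_rate(subtests):
    """Fraction of subtests with a PASS status; tests with no subtests sort last."""
    if not subtests:
        return 1.0
    return sum(s == "PASS" for s in subtests) / len(subtests)

results = {
    "idlharness.https.any.html": ["PASS"] * 40 + ["FAIL"] * 10,
    "user-prompts/accept.py": ["PASS", "FAIL", "FAIL"],
    "window-onload-test.html": ["PASS"] * 5 + ["FAIL"],
}

# Lowest pass rate first, i.e. the tests with the most subtest failures on top.
for test, subtests in sorted(results.items(), key=lambda kv: subtest_pass_rate(kv[1])):
    print(f"{subtest_pass_rate(subtests):.2f}  {test}")
```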

@mathiasbynens

> I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right. I don't think any of these is a good option.

For what it’s worth, as the original requester of this feature, I don't think these are the only possible outcomes. The scenario of subtests being added has already happened several times with WebDriver BiDi tests in the past, and it was never a problem for us — if anything, it’s a good thing when a gap in test coverage is identified and addressed by adding more subtests. We do want to track our implementation progress against a set of tests over a long period of time, but that doesn’t mean there’s no room for nuance in our tracking.

@jgraham (Contributor) commented Apr 10, 2024

> We do want to track our implementation progress against a set of tests over a long period of time, but that doesn’t mean there’s no room for nuance in our tracking.

My concern with this proposal is exactly that it removes nuance from the dashboard; a view where any subtest failure marks the entire test as FAIL is more unstable than one that shows the subtests individually.

I also agree there are use cases where you think you probably should pass all the tests for a given feature and want to have some way to find out when new tests are added that you don't pass. But I don't think this view is necessary or sufficient for that use case; for example with that goal, I care about a new test being added to a directory just the same as I care about a subtest, so labeling individual tests to include in the tracked set doesn't really work; you either need to be confident that you will never have a reason to ignore a test in some directory, or you need to label exclusions. But labeling exclusions falls down because we can't label subtests. Also, once features are in maintenance mode you really want push-notifications for changes rather than having to poll a dashboard, because people will stop polling the dashboard once they're mostly working on something else.

@foolip (Member) commented Apr 10, 2024

> I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right.

I think it's a small risk, but yes, there is a risk that some people will be annoyed. If good changes start getting reverted to meet a metric, I think the wpt core team should step in and say no, don't do that. We can document this risk and the expected response in this RFC, or even preemptively put something in our code of conduct or general READMEs to point to.

> For @foolip's infrastructure test example, I think that's markedly worse than the current default view

I find it helpful to see more information about the kind of failure elevated one level. But I'm not claiming it is a perfect view, better than view=subtest in every situation. It's just a useful view that I think I'll often use.

@jgraham Would documenting how to understand each view directly on wpt.fyi address your concern? Even the default view=subtest is quite difficult to understand, so I think this would be good.

@past (Member) commented Apr 10, 2024

I understand the concern about the incentive structure, and I think it is valid. A key point to remember is that incentives are affected by changing the defaults; optional views, flags, search queries, or configurations only serve narrow use cases to get a particular job done. I am of the opinion that we should try to accommodate as many of these use cases as possible, so that WPT remains a core workflow for people and they don't have to migrate to other solutions.

@jgraham (Contributor) left a review comment

I would still strongly discourage people from using this view. I think there's a meaningful difference between "we fail one subtest" and "we fail all the subtests" which this obscures.

In particular for a multi-participant project I think it's a bad idea to try to build metrics on the total count of fully passing tests, because other participants are free to add new failing subtests at any time. That could significantly change the scoring of any metric based on this view.

However, on the understanding that this view is not visible in the UI, and that the people using it fully understand the risks and are aware of the appropriate actions to take when the tests change (notably, that such changes are likely, and that it's not appropriate to ask for correct changes to be reverted or refactored because of their effect on this view or on derived metrics), I think this is OK to add.

Review thread on rfcs/wptfyi_enable_view_eq_test.md (outdated, resolved)
Co-authored-by: jgraham <james@hoppipolla.co.uk>
@past (Member) commented May 14, 2024

> it's not appropriate to ask for correct changes to be reverted or refactored due to their effect on the test view or derived metrics

I want to underscore this point; I agree with it 100%. We are trying to make a great conformance suite for the web platform here, not play games with the metrics.
