mt76: work around dropped TX status reporting #3261
Draft
These fixes were developed for MT7603 and MT7612. We originally discovered that TX status reporting mishandles cases where action frames are continuously sent over a mesh link while no actual data traffic is transmitted over that link.
This leads to a client "sticking" to the AP even after it has left the coverage area.
In the process, we discovered that this patchset might(!) also help mitigate issues seen on MT7916-based platforms. Therefore, I think it is a good idea to open this PR as a draft to serve as a starting point for discussion.
I've rebased this patchset on current master; the CI runners are build-testing it now. We tested the fixes together with the mt76 patches based on the v2023.2.1 release.
MT7612 affected node
First example
Dashboard: https://stats.darmstadt.freifunk.net/d/000000021/router-meshviewer-export?var-node=3894edf5ff9b&orgId=1&from=1708931806122&to=1711420462467
This node is the one on which we originally diagnosed the issue. You can see devices accumulating on the OWE VAP of the 5 GHz radio, as the driver incorrectly marks all action frames sent to those devices as acked. They are only dropped after 24 h, when rekeying takes place.
You can also observe a sticky client on the 2.4 GHz radio (MT7603). It does not matter whether this happens on the OWE or the unencrypted VAP. In this case, the node pollutes the airtime and is unavailable to other clients.
Second example
I have no access to this node, but it shows the same symptoms:
Dashboard: https://stats.darmstadt.freifunk.net/d/000000021/router-meshviewer-export?var-node=3894edf70c46&orgId=1&from=1713058330554&to=1715097473774
MT7986 Nodes
We've seen cases where MT7986 and MT7981 radios experience degraded performance. Coincidentally, this patchset seems to mitigate these issues as well (we only have a few days of testing behind this hypothesis).
Dashboard: https://stats.darmstadt.freifunk.net/d/000000021/router-meshviewer-export?var-node=c87f54231ad8&orgId=1&from=1713058547177&to=1715642058103