
Improve bandwidth estimation and adaptive switching #4825

Merged
merged 5 commits into master from feature/estimate-ttfb-fix-part-stats-improve-abr on Jan 18, 2023

Conversation

robwalch
Collaborator

@robwalch robwalch commented Aug 4, 2022

This PR will...

Improve bandwidth estimation and adaptive switching with smaller segments and higher TTFB

Why is this Pull Request needed?

Estimating time-to-first-byte separately from the time used to estimate bandwidth allows us to more accurately predict the time it takes to load segments, especially those with durations close to the average round-trip time of a request.
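
To make the mechanics concrete, here is a minimal sketch of the idea (a simplified, hypothetical helper; not the actual estimator code):

function bandwidthSample(
  bytesLoaded: number,
  timeLoadingMs: number,
  ttfbMs: number
): number {
  // Only the time spent actually streaming bytes counts toward bandwidth;
  // clamp to 1ms so a response that was all round trip does not divide by zero.
  const streamingMs = Math.max(timeLoadingMs - ttfbMs, 1);
  return (8 * bytesLoaded * 1000) / streamingMs; // bits per second
}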

There were also several issues with loading stats, combined versus main video buffer observation, and in-flight BW being calculated before any bytes were loaded, all of which contributed to bad emergency down-switches and BWE corruption in _abandonRulesCheck. Thanks to @Pri12890 for pointing many of these out.

Are there any points in the code the reviewer needs to double check?

There are two new public methods on the player instance that I have left undocumented:

  1. get mainForwardBufferInfo(): BufferInfo | null. Allows the ABR controller to access the stream controller's media buffer. Since the ABR controller only deals with main variant fragment traffic, this lets it compare that activity to the buffer it appends to, rather than the combined buffer, which could be stalled if alt-audio does not keep ahead.
  2. get ttfbEstimate(): number. Similar to hls.bandwidthEstimate, hls.ttfbEstimate provides the latest time-to-first-byte estimate. (A usage sketch of both follows this list.)
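
A hypothetical usage sketch of both getters (only the two property names come from this PR; everything else is illustrative):

import Hls from 'hls.js';

const hls = new Hls();
// ... attach media and load a source ...

// Forward buffer info for the main variant only, not the combined buffer:
const bufferInfo = hls.mainForwardBufferInfo;
if (bufferInfo) {
  console.log(`main forward buffer length: ${bufferInfo.len}s`);
}

// Latest TTFB estimate, alongside the existing bandwidth estimate:
console.log(`BWE: ${hls.bandwidthEstimate} bps, TTFB: ${hls.ttfbEstimate} ms`);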

The TTFB sampling uses a single EWMA instance with the same slow half-life as that used for bandwidth. The default estimate is 100 ms and is not configurable. Samples are weighted on a curve that favors shorter values, so that the occasional request RTT hiccup does not have as much impact on the estimate.
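
Evaluating that curve (the same formula that appears in the diff below) for a few sample TTFB values:

// weight = sqrt(2) * e^(-seconds^2 / 2): near-maximum weight under ~1s,
// rapidly approaching zero for multi-second outliers.
const ttfbWeight = (ttfbMs: number): number =>
  Math.sqrt(2) * Math.exp(-((ttfbMs / 1000) ** 2) / 2);

ttfbWeight(100); // ~1.41 (typical request, near full weight)
ttfbWeight(1000); // ~0.86
ttfbWeight(3000); // ~0.016 (outlier hiccup, barely moves the estimate)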

This PR will remain open for at least 1-2 minor releases, and the changes will not ship in v1.3. This is to give contributors time to review and test these changes and provide feedback.

Resolves issues:

Fixes #3578 (special thanks to @Oleksandr0xB for submitting #4283)
Fixes #3563 and Closes #3595 (special thanks to @kanongil for early testing and feedback on the Low-Latency HLS implementation)
Fixes #5094
Closes #4291 (_abandonRulesCheck governs whether fragment loading is completed or aborted based on timeouts, not active throughput)

Checklist

  • changes have been done against master branch, and PR does not conflict
  • new unit / functional tests have been added (whenever applicable)
  • API or design changes are documented in API.md

@robwalch robwalch added this to the 1.4.0 milestone Aug 4, 2022
@kanongil
Contributor

kanongil commented Aug 4, 2022

It is great that you are trying to tackle the bandwidth estimation issues!

While it makes sense to use request latency as part of the effective bandwidth calculation, this averaging will fail with #3988. With preload hint fetching, the TTFB is not correlated with the latency, but with the request start time relative to the latest index update and part duration, which can both change from part to part. As such, I would prefer an alternative approach that doesn't require this averaging.

FYI, it's possible to get more accurate transfer timing for a single request using the Resource Timing API, though it is a bit cumbersome to extract and requires server opt-in using the Timing-Allow-Origin header.
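
For reference, a sketch of reading single-request timing via the Resource Timing API (fragmentUrl is a hypothetical placeholder; cross-origin entries report zeros unless the server sends Timing-Allow-Origin):

declare const fragmentUrl: string; // hypothetical: URL of a loaded fragment

const entries = performance.getEntriesByName(
  fragmentUrl
) as PerformanceResourceTiming[];
const entry = entries[entries.length - 1];
// responseStart is 0 for cross-origin resources without Timing-Allow-Origin
if (entry && entry.responseStart > 0) {
  const ttfb = entry.responseStart - entry.requestStart;
  const transfer = entry.responseEnd - entry.responseStart;
  console.log(`TTFB ${ttfb.toFixed(1)}ms, transfer ${transfer.toFixed(1)}ms`);
}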

@robwalch robwalch force-pushed the feature/estimate-ttfb-fix-part-stats-improve-abr branch from f4090fa to bb5b3e7 Compare August 4, 2022 16:58
@robwalch
Collaborator Author

robwalch commented Aug 4, 2022

Thanks @kanongil,

I added a note to #3988: "do not sample TTFB for blocked part-hint responses". The _abandonRulesCheck will need to change as well for part hints.

fragLevelNextLoadedDelay = loadRate
  ? (duration * levelNextBitrate) / (8 * 0.8 * loadRate)
  : duration / bwEstimate + ttfbEstimate;

For the else branch, are we missing (duration * levelNextBitrate)? It looks like only duration is used by mistake.

Collaborator Author

@robwalch robwalch Aug 4, 2022


Thanks for the find.

Collaborator Author

@robwalch robwalch Aug 4, 2022


Fixed in af0843a

const bwEstimate: number = this.bwEstimator.getEstimate();
// if estimated load time of new segment is completely unreasonable, ignore and do not emergency switch down
if (fragLevelNextLoadedDelay > duration * 10) {

Not able to imagine how we would run into this in general; moreover, it is smaller than fragLoadedDelay.

Collaborator Author


We can end up with very, very low loadRate calculations. When this happens it's usually because of a very problematic request or a huge drop. If emergency down-switching is not going to prevent a stall, then it's not worth aborting here. Wait for more data and/or let the normal ABR cycle recover.
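
A worked example with hypothetical numbers showing how the guard trips:

// A nearly-stalled request produces an absurd delay estimate:
const duration = 4; // fragment duration, seconds
const levelNextBitrate = 800_000; // bps of the candidate down-switch level
const loadRate = 2_000; // bytes/s measured on the stalled request
const fragLevelNextLoadedDelay =
  (duration * levelNextBitrate) / (8 * 0.8 * loadRate); // = 250s
// 250 > duration * 10 (40s): aborting and emergency down-switching cannot
// prevent a stall anyway, so bail out and let the normal ABR cycle recover.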


If we decide to keep fragLevelNextLoadedDelay in ms, we should convert the RHS to ms as well. I think duration is in seconds, right?

Collaborator Author


duration is in seconds

const processingMs =
  stats.parsing.end -
  stats.loading.start -
  Math.min(
    stats.loading.first - stats.loading.start,
    this.bwEstimator.getEstimateTTFB()
  );

Yeah, this change is a good improvement. Does it work well for Parts as well?

Collaborator Author


It appears to work well with the sample LL-HLS test stream which uses two 1s parts per segment.

@robwalch robwalch force-pushed the feature/estimate-ttfb-fix-part-stats-improve-abr branch from 93cb148 to af0843a Compare August 4, 2022 23:00
);
} else {
// If there has been no loading progress, sample TTFB
this.bwEstimator.sampleTTFB(timeLoading);
Contributor


Why sample TTFB when no bytes have been delivered? Won't this undershoot?

Collaborator Author

@robwalch robwalch Aug 5, 2022


At this point timeLoading is expected to be greater than the TTFB estimate (we should have exited above if it is not). The idea is to increase the estimate so that the penalty is not just a down-switch but also impacts future estimates.

// reset forced auto level value so that next level will be selected
this._nextAutoLevel = -1;

this.bwEstimator.sampleTTFB(stats.loading.first - stats.loading.start);
Contributor


Maybe move this up, before the this.ignoreFragment() gating? TTFB should still be relevant for e.g. an init segment.

Collaborator Author


Done in fe62282. Keeping it restricted to the main variant only. We would not want alt-audio or subtitles to impact the behavior here, which will remain main-variant-only for this change-set.

hls.nextLoadLevel = nextLoadLevel;
if (loadedFirstByte) {
// If there has been loading progress, sample bandwidth using loading time offset by minimum TTFB time
this.bwEstimator.sample(


We can sample TTFB here too?

if (loadedFirstByte) {
// If there has been loading progress, sample bandwidth using loading time offset by minimum TTFB time
this.bwEstimator.sample(
timeLoading - Math.min(ttfbEstimate, ttfb),

@Pri12890 Pri12890 Aug 10, 2022


timeLoading - ttfb might be more accurate for BWE. Agreed, a high TTFB is not healthy, but it doesn't reflect the user's true network condition either. High TTFB values are largely outliers, but those outliers can really bring fastEstimate down, and catching back up to the true BW can take a while (because of the small duration, the weight will be smaller), resulting in the user playing at a low resolution unnecessarily. Moreover, how can a lower BWE help with a high TTFB, even if it only lasts for some time? That high TTFB will hold true for lower-resolution segments as well (in theory), so a track switch will not help?
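
To make the trade-off concrete with hypothetical numbers (the PR's expression is quoted above):

// An outlier round trip on an otherwise healthy connection:
const timeLoading = 900; // ms total so far
const ttfb = 600; // ms to first byte (an outlier)
const ttfbEstimate = 100; // ms, the typical estimate
timeLoading - ttfb; // 300ms: the suggestion, fully excuses the outlier
timeLoading - Math.min(ttfbEstimate, ttfb); // 800ms: the PR's choice
// The PR only credits the typical TTFB, so the outlier round trip still
// lowers the bandwidth sample instead of being excluded from it entirely.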

stats.loading.start -
Math.min(
stats.loading.first - stats.loading.start,
this.bwEstimator.getEstimateTTFB()


Same comment as for abandonRules: too conservative a BWE.

Comment on lines +298 to +299
const processingMs =
  stats.parsing.end -
  stats.loading.start -
  Math.min(
    stats.loading.first - stats.loading.start,
    this.bwEstimator.getEstimateTTFB()
  );
Contributor


Hello

I encountered the bandwidth estimation problem with LL-HLS streams myself, but was never able to find a reliable solution. So thank you, I appreciate your work on this PR, and I'm happy if I can help you.

After playing around a little bit, I felt that this solution is working pretty well already. When I wanted to compare the most recent release of hls.js with this PR, I noticed a problem with the bandwidth graph on the demo page.

It appears to me that the demo graph is calculating the bandwidth values separately, and therefore also needs to be updated.

demo/main.js lines 553 - 556
Math.round( (8 * data.stats.total) / (data.stats.buffering.end - data.stats.loading.start) ),

Any thoughts on that?

Thanks Thomas
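
One possible adjustment, sketched here for comparison only (#4904 is the actual demo fix and may differ):

// Exclude the round trip from the denominator, mirroring the estimator change.
// The `stats` shape here follows the hls.js fragment loading stats.
function demoBandwidthKbps(stats: {
  total: number;
  loading: { first: number };
  buffering: { end: number };
}): number {
  // loading.first instead of loading.start, so TTFB is not counted
  return Math.round(
    (8 * stats.total) / (stats.buffering.end - stats.loading.first)
  );
}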

Contributor


@robwalch would it help if I create a PR to fix the demo?

Contributor


FYI #4904

Collaborator Author



@silltho can you rebase your PR now that this is merged? Thank you!

Contributor

@shacharz shacharz left a comment


There were a few unit conversion bugs along the way; it might be worth creating a few sanity unit tests to make sure this works as expected.
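
A sanity-test sketch along those lines (hypothetical; the import path, constructor signature, and default values are assumptions to be checked against the codebase):

import { expect } from 'chai';
import EwmaBandWidthEstimator from '../../src/utils/ewma-bandwidth-estimator';

it('keeps TTFB samples and estimates in milliseconds', () => {
  const estimator = new EwmaBandWidthEstimator(15, 4, 500000);
  expect(estimator.getEstimateTTFB()).to.equal(100); // default estimate, ms
  estimator.sampleTTFB(200);
  // A single 200ms sample should raise the estimate without exceeding it:
  expect(estimator.getEstimateTTFB()).to.be.within(100, 200);
});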

@@ -157,13 +157,17 @@ class AbrController implements ComponentAPI {
const expectedLen =
stats.total ||
Math.max(stats.loaded, Math.round((duration * level.maxBitrate) / 8));
let timeStreaming = timeLoading - ttfb;
if (timeStreaming < 1 && loadedFirstByte) {
timeStreaming = Math.min(timeLoading, (stats.loaded * 8) / bwEstimate);
Contributor


If we're at the point where the network is so slow (only the first buffer was emitted by the network stack, and we're in the abandonRulesCheck), then probably something happened to the network. Why should we ignore this effect and rely on bwEstimate?

suggestion:
timeStreaming = timeLoading;

Collaborator Author

@robwalch robwalch Sep 9, 2022


suggestion: timeStreaming = timeLoading;

We're trying to remove roundtrip latency (timeLoading - ttfb) from the equation so that is not what we want.

In this case, almost all of the time was spent waiting on the round trip (timeStreaming < 1). It hasn't been long enough to know whether throughput is affected yet, so we can continue assuming that only round-trip latency was impacted. As long as it's been more than one millisecond since we received bytes, we can calculate a load rate based on actual timeStreaming.

// (longer times have less weight with expected input under 1 second)
const seconds = ttfb / 1000;
const weight = Math.sqrt(2) * Math.exp(-Math.pow(seconds, 2) / 2);
this.ttfb_.sample(weight, Math.max(ttfb, 5));
Contributor


Is it on purpose that the weight is a function of seconds, while the value's unit is milliseconds?

Collaborator Author

@robwalch robwalch Sep 9, 2022


weight is not measured in units of time. The weight curve's input is in seconds because we expect the majority of round-trip times to be < 1s, and those samples have greater impact on our weighted average. If we have an outlier where the server could not respond for a couple of seconds, its weight, and therefore its impact on this estimate, will be lower.

// (longer times have less weight with expected input under 1 second)
const seconds = ttfb / 1000;
const weight = Math.sqrt(2) * Math.exp(-Math.pow(seconds, 2) / 2);
this.ttfb_.sample(weight, Math.max(ttfb, 5));
Contributor


Out of curiosity, why a minimum sample value of 5 ms?

Collaborator Author


This should probably be addressed by not sampling small values in abandoned/incomplete request cases. I chose this magic number because, based on testing with real traffic, it seemed unlikely we would get lower values that were realistic or reliable.

@@ -183,7 +187,7 @@ class AbrController implements ComponentAPI {
const levelNextBitrate = levels[nextLoadLevel].maxBitrate;
fragLevelNextLoadedDelay = loadRate
? (duration * levelNextBitrate) / (8 * 0.8 * loadRate)
: duration / bwEstimate + ttfbEstimate;
: (duration * levelNextBitrate) / bwEstimate + ttfbEstimate;
Contributor


Why are we conservative with loadRate, while not conservative with bwEstimate?
loadRate is falsy when we haven't loaded the first bytes; I'd argue bwEstimate is over-promising in this case.

Collaborator Author


Why are we conservative with loadRate, while not conservative with bwEstimate?

Changed this to always add ttfbEstimate rather than augment the load rate or bwe. There will always be a round-trip cost when switching down to load a new segment, so this should provide a better estimate in either case. 846f618

loadRate is falsy when we haven't loaded the first bytes; I'd argue bwEstimate is over-promising in this case.

bwEstimate is the best estimate we have. It is more likely that the ttfbEstimate has not been met if a fragment request needs to be aborted before any data has been received.
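
A worked example of the changed fallback with hypothetical numbers:

// No bytes loaded yet (loadRate is 0), so estimate from BWE plus a round trip:
const duration = 4; // s
const levelNextBitrate = 1_600_000; // bps of the next level down
const bwEstimate = 4_000_000; // bps
const ttfbEstimate = 0.15; // s (ttfbEstimate was converted to seconds later in this PR)
const fragLevelNextLoadedDelay =
  (duration * levelNextBitrate) / bwEstimate + ttfbEstimate; // 1.6 + 0.15 = 1.75s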

robwalch added a commit that referenced this pull request Sep 20, 2022
robwalch added a commit that referenced this pull request Sep 20, 2022
robwalch added a commit that referenced this pull request Sep 21, 2022
@robwalch robwalch added this to Low-Latency Improvements in Release Planning and Backlog Sep 30, 2022
matteocontrini pushed a commit to matteocontrini/hls.js that referenced this pull request Oct 31, 2022
@robwalch robwalch force-pushed the feature/estimate-ttfb-fix-part-stats-improve-abr branch from fe62282 to b4b3bfd Compare November 28, 2022 22:50
@robwalch robwalch force-pushed the feature/estimate-ttfb-fix-part-stats-improve-abr branch from 846f618 to 6f19423 Compare January 9, 2023 03:37
@robwalch robwalch force-pushed the feature/estimate-ttfb-fix-part-stats-improve-abr branch from 6f19423 to 1672028 Compare January 12, 2023 18:41
@robwalch robwalch merged commit 0f14a22 into master Jan 18, 2023
@robwalch robwalch deleted the feature/estimate-ttfb-fix-part-stats-improve-abr branch January 18, 2023 18:54
@robwalch robwalch restored the feature/estimate-ttfb-fix-part-stats-improve-abr branch January 18, 2023 19:28
robwalch added a commit that referenced this pull request Jan 26, 2023
* Improve bandwidth estimation and adaptive switching with smaller segments and higher TTFB
Fixes #3578 (special thanks to @Oleksandr0xB for submitting #4283)
Fixes #3563 and Closes #3595 (special thanks to @kanongil)

* Load rate and loaded delay calculation fixes

* Convert ttfbEstimate to seconds

* Include main variant init segments in TTFB sampling

* Use ttfb estimate in abandon rules down-switch timing
@robwalch robwalch deleted the feature/estimate-ttfb-fix-part-stats-improve-abr branch March 22, 2023 21:29