
Weird audio issues with high audio per packet settings #6415

Open

bkacjios opened this issue May 10, 2024 · 11 comments
Labels: audio, bug (A bug (error) in the software), client

Comments

@bkacjios

Description

Hello, I am the author of lua-mumble, a module that lets you create a Mumble bot with the Lua scripting language.

I've been updating it to support the new protobuf UDP packets introduced in 1.5 and have been running into a weird issue. I have a test bot configured to send 60 ms of audio per packet at a 96000 bit/s bitrate, outputting stereo audio. The server has a max bitrate of 558000.

I noticed that I was sometimes having weird audio issues with the bot that I couldn't explain. Sometimes I would start the bot and the audio would be stuttering from the start; other times I would start it and it would sound perfect. The only way I was able to reproduce it consistently was by setting the bot's audio per packet to 60 ms and muting/unmuting my client.

(video attachment: 2024-05-10.09-01-27.mp4)

I even managed to get this to happen with the bot not playing any music at all: instead I had it loop back my own audio so I could hear myself through the bot. The stuttering issues were still present. It almost seems like my client's decoder is trying to decode the stream in some other mode, but honestly I'm not too sure what is going on here.

Would anyone perhaps know what is going on here? I'm still unsure whether this is an issue with the way I am encoding audio or with how Mumble is decoding it.

Steps to reproduce

Have a client start transmitting audio with 60 ms of audio per packet.
Mute and unmute your client. The transmitting audio will now be stuttering.

Mumble version

1.5.628

Mumble component

Client

OS

Windows

Reproducible?

Yes

Additional information

No response

Relevant log output

No response

Screenshots

No response

@bkacjios added the bug (A bug (error) in the software) and triage (This issue is waiting to be triaged by one of the project members) labels on May 10, 2024
@davidebeatrici
Member

I have encountered a similar issue when working on a project based on libmumble.

Turns out the Mumble client struggles with packets that contain more than 10ms (480 samples at 48000 Hz) of audio.

Try to send 480 samples (per channel): the issue should not be present.
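
Something along these lines (a rough sketch using the plain libopus C API; "pcm" and "sendAudioPacket" are placeholder names, not actual Mumble or lua-mumble code):

#include <opus/opus.h>

// Encode exactly 10 ms (480 samples per channel at 48000 Hz) per packet,
// which is what the current Mumble client handles reliably.
const opus_int32 sampleRate = 48000;
const int channels  = 2;
const int frameSize = sampleRate / 100; // 480 samples per channel = 10 ms

int err = 0;
OpusEncoder *enc = opus_encoder_create(sampleRate, channels, OPUS_APPLICATION_VOIP, &err);

// "pcm" is assumed to hold frameSize * channels interleaved 16-bit samples.
opus_int16 pcm[480 * 2] = { 0 };
unsigned char packet[4000];

opus_int32 len = opus_encode(enc, pcm, frameSize, packet, sizeof(packet));
if (len > 0) {
    // sendAudioPacket(packet, len); // hypothetical: one 10 ms packet per send
}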

@bkacjios
Author

Good to know. I was kinda going crazy thinking I was doing something wrong somewhere. Yeah, setting it to 10 ms seems to work fine. 20 is mostly okay too, but 40 and 60 are basically unusable. I usually like to use at least 20 ms, since I find it keeps the audio clarity stable. With 10 ms I do notice little hiccups and stutters, though nowhere near as bad as this.

@davidebeatrici added the client and audio labels and removed the triage (This issue is waiting to be triaged by one of the project members) label on May 11, 2024
@davidebeatrici
Member

Thank you for confirming.

This is definitely something that has to be fixed in the client. Ideally it should handle all frame sizes that are supported by libopus.

#define SAMPLE_RATE 48000

// Standard microphone sample rate (samples/s)
static const unsigned int iSampleRate = SAMPLE_RATE;
/// Based on the sample rate, 48,000 samples/s = 48 samples/ms.
/// For each 10 ms, this yields 480 samples. This corresponds numerically with the calculation:
/// iFrameSize = 48000 / 100 = 480 samples, allowing a consistent 10ms of audio data per frame.
static const int iFrameSize = SAMPLE_RATE / 100;

The iSampleRate and iFrameSize variables exist because the code itself doesn't treat the sample rate and frame size as hardcoded.

Unfortunately, an option to change them was never added, and as a result problems like this one went unnoticed until third-party clients (e.g. bots) started popping up.

@bkacjios
Author

So wait, I'm a little confused. I thought this setting in the client adjusts it? Are you saying the client just assumes all incoming audio is 10 ms?

(screenshot: the client's audio-per-packet setting)

I made my bot mimic this functionality. I was defaulting it to 20 ms, like Mumble does on a fresh install, but I can adjust it via a method call.

https://github.com/bkacjios/lua-mumble/blob/54c2272e2f0ba06bacf87c2fdc21441d14505140/mumble/mumble.h#L81
https://github.com/bkacjios/lua-mumble/blob/54c2272e2f0ba06bacf87c2fdc21441d14505140/mumble/audio.c#L196

@davidebeatrici
Member

The setting itself adjusts the number of 10ms chunks to send per packet, not the actual number of audio frames:

/// Encoded audio rate in bit/s
int iAudioQuality;
bool bAllowLowDelay;
/// Number of 10ms audio "frames" per packet (!= frames in packet)
int iAudioFrames;

void AudioInput::setMaxBandwidth(int bitspersec) {
    if (bitspersec == Global::get().iMaxBandwidth)
        return;

    int frames;
    int bitrate;
    bool allowLowDelay;
    adjustBandwidth(bitspersec, bitrate, frames, allowLowDelay);

    Global::get().iMaxBandwidth = bitspersec;

    if (bitspersec != -1) {
        if ((bitrate != Global::get().s.iQuality) || (frames != Global::get().s.iFramesPerPacket))
            Global::get().mw->msgBox(
                tr("Server maximum network bandwidth is only %1 kbit/s. Audio quality auto-adjusted to %2 "
                   "kbit/s (%3 ms)")
                    .arg(bitspersec / 1000)
                    .arg(bitrate / 1000)
                    .arg(frames * 10));
    }

    AudioInputPtr ai = Global::get().ai;
    if (ai) {
        Global::get().iAudioBandwidth = getNetworkBandwidth(bitrate, frames);
        ai->iAudioQuality  = bitrate;
        ai->iAudioFrames   = frames;
        ai->bAllowLowDelay = allowLowDelay;
        return;
    }

    ai.reset();

    Audio::stopInput();
    Audio::startInput();
}

// Encode via Opus
encoded = false;

opusBuffer.insert(opusBuffer.end(), psSource, psSource + iFrameSize);
++iBufferedFrames;

if (!bIsSpeech || iBufferedFrames >= iAudioFrames) {
    if (iBufferedFrames < iAudioFrames) {
        // Stuff frame to framesize if speech ends and we don't have enough audio
        // this way we are guaranteed to have a valid framecount and won't cause
        // a codec configuration switch by suddenly using a wildly different
        // framecount per packet.
        const int missingFrames = iAudioFrames - iBufferedFrames;
        opusBuffer.insert(opusBuffer.end(), static_cast< std::size_t >(iFrameSize * missingFrames), 0);
        iBufferedFrames += missingFrames;
        iFrameCounter += missingFrames;
    }

    Q_ASSERT(iBufferedFrames == iAudioFrames);

    len = encodeOpusFrame(&opusBuffer[0], iBufferedFrames * iFrameSize, buffer);
    opusBuffer.clear();
    if (len <= 0) {
        iBitrate = 0;
        qWarning() << "encodeOpusFrame failed" << iBufferedFrames << iFrameSize << len;
        iBufferedFrames = 0; // These are lost. Make sure not to mess up our sequence counter next flushCheck.
        return;
    }
    encoded = true;
}

if (encoded) {
    flushCheck(QByteArray(reinterpret_cast< char * >(&buffer[0]), len), !bIsSpeech, voiceTargetID);
}

if (!bIsSpeech)
    iBitrate = 0;

Another issue in the code above is that it doesn't clamp the number of chunks to match the value of iAudioFrames.

Putting that aside, let's see what is going on in the audio output section:

AudioOutputSpeech::AudioOutputSpeech(ClientUser *user, unsigned int freq, Mumble::Protocol::AudioCodec codec,
                                     unsigned int systemMaxBufferSize)
    : iMixerFreq(freq), m_codec(codec), p(user) {
    int err;

    opusState = nullptr;

    bHasTerminator = false;
    bStereo        = false;

    iSampleRate = SAMPLE_RATE;

    // opus's "frame" means something different from the normal audio term "frame"
    // normally, a frame means a bundle of only one sample from each channel,
    // e.g. for a stereo stream, ...[LR]LRLRLR.... where the bracket indicates a frame
    // in opus terms, a frame means samples that span a period of time, which can be either stereo or mono
    // e.g. ...[LRLR....LRLR].... or ...[MMMM....MMMM].... for a mono stream
    // opus supports frames with: 2.5, 5, 10, 20, 40 or 60 ms of audio data.
    // sample rate / 100 means 10ms mono audio data per frame.
    iFrameSizePerChannel = iFrameSize = iSampleRate / 100; // for mono stream

    assert(m_codec == Mumble::Protocol::AudioCodec::Opus);

    // Always pretend Stereo mode is true by default, since opus will convert a mono stream to stereo.
    // https://tools.ietf.org/html/rfc6716#section-2.1.2
    bStereo   = true;
    opusState = opus_decoder_create(static_cast< int >(iSampleRate), bStereo ? 2 : 1, nullptr);
    opus_decoder_ctl(opusState,
                     OPUS_SET_PHASE_INVERSION_DISABLED(1)); // Disable phase inversion for better mono downmix.

    // iAudioBufferSize: size (in units of float) of the buffer used to store decoded pcm data.
    // For opus, the maximum frame size of a packet is 60ms.
    iAudioBufferSize = iSampleRate * 60 / 1000; // = SampleRate * 60ms = 48000Hz * 0.06s = 2880, ~12KB

    // iBufferSize: size of the buffer to store the resampled audio data.
    // Note that the number of samples in each opus packet can be different from the number of samples the system
    // requests from us each time (this is known as the system's audio buffer size).
    // For example, the maximum size of an opus packet can be 60ms, but the system's audio buffer size is typically
    // ~5ms on my laptop.
    // Whenever the system's audio callback is called, we have two choices:
    // 1. Decode a new opus packet. Then we need a buffer to store unused samples (which don't fit in the system's
    //    buffer),
    // 2. Use unused samples from the buffer (remaining from the last decoded frame).
    // How large should this buffer be? Consider the case in which the remaining samples in the buffer can not fill
    // the system's audio buffer. In that case, we need to decode a new opus packet. In the worst case, the buffer
    // size needed is
    //   60ms of new decoded audio data + system's buffer size - 1.
    iOutputSize = static_cast< unsigned int >(
        ceilf(static_cast< float >(iAudioBufferSize * iMixerFreq) / static_cast< float >(iSampleRate)));
    iBufferSize = iOutputSize + systemMaxBufferSize; // -1 has been rounded up

    if (bStereo) {
        iAudioBufferSize *= 2;
        iOutputSize *= 2;
        iBufferSize *= 2;
        iFrameSize *= 2;
    }
    pfBuffer = new float[iBufferSize];

    srs              = nullptr;
    fResamplerBuffer = nullptr;
    if (iMixerFreq != iSampleRate) {
        srs              = speex_resampler_init(bStereo ? 2 : 1, iSampleRate, iMixerFreq, 3, &err);
        fResamplerBuffer = new float[iAudioBufferSize];
    }

    iBufferOffset = iBufferFilled = iLastConsume = 0;
    bLastAlive    = true;

    iMissCount    = 0;
    iMissedFrames = 0;

    m_audioContext = Mumble::Protocol::AudioContext::INVALID;

    jbJitter   = jitter_buffer_init(static_cast< int >(iFrameSize));
    int margin = Global::get().s.iJitterBufferSize * static_cast< int >(iFrameSize);
    jitter_buffer_ctl(jbJitter, JITTER_BUFFER_SET_MARGIN, &margin);

    // We are configuring our jitter buffer to use a custom deleter function. This prevents the buffer from
    // copying the stored data into the buffer itself and also from releasing the memory of it. Instead it
    // will now call this "deleter" function.
    // This allows us to manage our own (global) storage for our audio data. With that, we can reuse the same
    // memory regions in order to avoid frequent memory allocations and deallocations.
    // Also this is the basis for our trick of only storing indices instead of proper data
    // pointers in the buffer.
    jitter_buffer_ctl(jbJitter, JITTER_BUFFER_SET_DESTROY_CALLBACK,
                      reinterpret_cast< void * >(&AudioOutputSpeech::invalidateAudioOutputCache));

    fFadeIn  = new float[iFrameSizePerChannel];
    fFadeOut = new float[iFrameSizePerChannel];

    float mul = static_cast< float >(M_PI / (2.0 * static_cast< double >(iFrameSizePerChannel)));
    for (unsigned int i = 0; i < iFrameSizePerChannel; ++i)
        fFadeIn[i] = fFadeOut[iFrameSizePerChannel - i - 1] = sinf(static_cast< float >(i) * mul);
}

As you can see, the code assumes that incoming frames are always 10ms. This is wrong because (at least in our case) the Opus encoder always produces packets that contain a single encoded frame.

Basically, there is no concept of "chunks" in encoded packets, regardless of the client's audio input settings.
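
For reference, libopus can report what a packet actually contains, so a decoder does not have to assume 10 ms. A rough sketch using the standard libopus packet-inspection functions (this is not the client's current code):

#include <opus/opus.h>

// Query the real layout of a received Opus packet instead of assuming 10 ms.
void inspectPacket(const unsigned char *packet, opus_int32 len) {
    const opus_int32 sampleRate = 48000;

    int frames          = opus_packet_get_nb_frames(packet, len);
    int samplesPerFrame = opus_packet_get_samples_per_frame(packet, sampleRate);
    int totalSamples    = frames * samplesPerFrame; // per channel

    // E.g. a 60 ms packet may report 1 frame of 2880 samples (SILK) or
    // 3 frames of 960 samples (CELT), depending on how it was encoded.
    int durationMs = totalSamples * 1000 / sampleRate;
    (void) durationMs;
}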

Finally, just to add some more confusion to the mix, and possibly to clarify it: xiph/opus#315

@bkacjios
Author

Ahhh.. I understand now. My bot has a timer that determines how often it should send audio data. So if I set the bot's audio packet size to 60 ms, it's encoding 2880 samples into one audio packet every 60 ms. I'm guessing I should send 6 individual packets of 480 encoded samples each in one go?

@davidebeatrici
Member

Yup!
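
Something like this (a rough sketch; "enc", "pcm", and "sendPacket" are placeholders, not lua-mumble's actual API):

// Instead of encoding one 60 ms packet, slice the buffered PCM into
// 10 ms chunks and encode/send each one as its own packet.
const int sampleRate = 48000;
const int channels   = 2;
const int frameSize  = sampleRate / 100; // 480 samples per channel
const int chunks     = 6;                // 6 x 10 ms = 60 ms

// "pcm" is assumed to hold 60 ms of interleaved stereo samples (2880 * 2 values).
for (int i = 0; i < chunks; ++i) {
    const opus_int16 *chunk = pcm + i * frameSize * channels;
    unsigned char packet[4000];

    opus_int32 len = opus_encode(enc, chunk, frameSize, packet, sizeof(packet));
    if (len > 0)
        sendPacket(packet, len); // hypothetical: wraps the data in MumbleUDP.Audio and sends it
}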

@bkacjios
Author

Am I missing something?

When my bot receives audio from me speaking, it isn't getting 10 ms chunks if I adjust my client to a 60 ms delay. Doesn't this go against what you said? Shouldn't I still be receiving 6 individual 240-byte packets every 0.06 seconds?

OnUserStartSpeaking     mumble.user [137]["Bkacjios"]
received 960 bytes in 0.059990 seconds
[MUMBLE - TRACE] RECEIVED MumbleUDP.Audio
received 960 bytes in 0.059892 seconds
[MUMBLE - TRACE] RECEIVED MumbleUDP.Audio
received 960 bytes in 0.060152 seconds
OnUserStopSpeaking      mumble.user [137]["Bkacjios"]

Compared to the results when I have it set to 10 ms.

[MUMBLE - TRACE] RECEIVED MumbleUDP.Audio
OnUserStartSpeaking     mumble.user [137]["Bkacjios"]
received 240 bytes in 0.009980 seconds
[MUMBLE - TRACE] RECEIVED MumbleUDP.Audio
received 240 bytes in 0.009977 seconds
[MUMBLE - TRACE] RECEIVED MumbleUDP.Audio
received 240 bytes in 0.009891 seconds
OnUserStopSpeaking      mumble.user [137]["Bkacjios"]

@davidebeatrici
Member

Are you referring to the size of the encoded packet or the raw audio data?

@bkacjios
Author

Yeah, this was the encoded packet I receive from a speaking client.

I'm starting to think we had a misunderstanding here.

Turns out the Mumble client struggles with packets that contain more than 10ms (480 samples at 48000 Hz) of audio.

Try to send 480 samples (per channel): the issue should not be present.

My bot resamples all audio it plays to 48000 Hz, since that's what Mumble expects. I always encode 480 samples per 10 ms. The issue I am having is with sending more than 10 ms per packet (replicating the audio-per-packet setting of the official client).

If I set it to 20, I will encode (480 * 2) bytes of PCM data and send that over. Is that not correct?

@davidebeatrici
Member

If I set it to 20, I will encode (480 * 2) bytes of PCM data and send that over. Is that not correct?

With libopus you can encode either 16-bit (2 bytes) signed integer or 32-bit (4 bytes) float samples, by calling opus_encode() or opus_encode_float(), respectively.

20ms (0.02s) of mono 16 bit audio data at 48000Hz would be:

(20 / 1000) * (16 / 8) * 48000 = 1920 bytes

Multiply the result by the number of channels (3840 bytes for stereo).
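
Or, as a quick self-check in code (pcmBytes is a hypothetical helper, not a Mumble or libopus function):

// Raw PCM size for a given duration, assuming 16-bit samples at 48000 Hz.
int pcmBytes(int durationMs, int channels) {
    const int sampleRate     = 48000;
    const int bytesPerSample = 2; // 16-bit signed integer samples

    return (durationMs * sampleRate / 1000) * bytesPerSample * channels;
}

// pcmBytes(20, 1) == 1920 (mono), pcmBytes(20, 2) == 3840 (stereo).
// Note that 480 samples (10 ms mono) is 960 bytes of PCM, not 480 bytes.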
