Implementing low-latency shared/exclusive mode audio output/duplex

Audio output and duplex are actually quite tricky, and even libraries like RtAudio get it wrong in their ALSA backends. If you're writing an app that needs low-latency audio without glitches, the proper implementation architecture differs between apps talking to pull-mode (well-designed, low-latency) mixing daemons and apps talking to hardware directly. (I hear push-mode mixing daemons are incompatible with low latency; I discuss this at the end.) This is my best understanding of the problem right now.

Prior art

There are some previous resources on implementing ALSA duplex, but I find them to be unclear and/or incomplete:

ALSA terminology

These are some background terms which are helpful to understand before writing an audio backend.

Sample: one amplitude in a discrete-time signal, or the time interval between an ADC generating or DAC playing adjacent samples.

Frame: one sample of time, or one sample across all audio channels.

Period: Every time the hardware record/play point advances by this many frames, the app is woken up to read or generate audio. In most ALSA apps, the hardware period determines the chunks of audio read, generated, or written.

However, you can read and write arbitrarily sized chunks of audio anyway, and query the exact point where the hardware is writing or playing audio at any time, even between periods. For example, PulseAudio's and PipeWire's ALSA backends ignore/disable periods altogether, and instead fetch and play audio based on a variable-interval OS timer loosely synchronized with the hardware's write and play points.

Batch device: Represented by SNDRV_PCM_INFO_BATCH in the Linux kernel. I'm not exactly sure what it means. https://mailman.alsa-project.org/pipermail/alsa-devel/2014-March/073816.html says it's a device where audio can only be transferred in period-sized chunks. https://mailman.alsa-project.org/pipermail/alsa-devel/2015-June/094037.html is too complicated for me to understand.

EDIT 2023-07-10: https://mailman.alsa-project.org/pipermail/alsa-devel/2015-June/093750.html says that SNDRV_PCM_INFO_BLOCK_TRANSFER means that data gets transferred and cached in period-sized chunks (so you must send a full period of audio before it can start playing), while SNDRV_PCM_INFO_BATCH means that the amount of consumed data is reported in period-sized chunks (usually).

Quantum: PipeWire's app-facing equivalent to ALSA/JACK periods.

Buffer size: the total amount of audio that an input ALSA device can buffer for an app to read, or that an app can buffer for an output ALSA device to play. Always at least 2 periods long.

Available frames: The number of frames (channel-independent samples) of audio readable/buffered (for input streams) or writable (for output streams).

"Buffered" frames: For input devices, this matches available (readable) frames. For output devices, this equals the buffer size minus available (writable) frames.

hw devices, plugins, etc: See https://www.volkerschatz.com/noise/alsa.html.
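
To make "available" vs. "buffered" concrete, here is a small sketch for an output stream (assuming an already-configured handle and a known buffer size; snd_pcm_avail() is the hardware-synced query):

```c
#include <alsa/asoundlib.h>

// Sketch: for an output stream, "buffered" frames (queued but not yet played)
// equal the buffer size minus the available (writable) frames. For an input
// stream, snd_pcm_avail() already returns the buffered (readable) frames.
static snd_pcm_sframes_t output_buffered_frames(snd_pcm_t *pcm,
                                                snd_pcm_uframes_t buffer_size) {
    snd_pcm_sframes_t avail = snd_pcm_avail(pcm); // writable frames, synced with hardware
    if (avail < 0)
        return avail; // e.g. -EPIPE on xrun; the caller should recover
    return (snd_pcm_sframes_t)buffer_size - avail;
}
```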

Minimum achievable input/output/duplex latency

The minimum achievable audio latency at a given period size is achieved by having 2 periods of total capture/playback buffering between the hardware and an app (like JACK2, PipeWire, or a well-designed ALSA app).

For duplex streams, the total round-trip (microphone-to-speaker) latency is N periods (the maximum amount of audio buffered in the output buffer). N is always ≥ 2 and almost always an integer.

For capture and duplex streams, there are 0 to 1 periods of capture (microphone-to-screen) latency (since microphone input can occur at any time, but is always processed at period boundaries).

For playback and duplex streams, there are N-1 to N periods of playback (keyboard-to-speaker) latency (since keyboard input can occur at any point, waits up to 1 period before sound is generated, and the sound takes N-1 periods to reach the speakers).

These values only include delay caused by audio buffers, and exclude extra latency in the input stack, display stack, sound drivers, resamplers, or ADC/DAC.
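
As a concrete example: at a 48 kHz sample rate with 128-frame periods and N = 2, one period is 128 / 48000 ≈ 2.7 ms, so the duplex round trip is about 5.3 ms, capture latency ranges from 0 to about 2.7 ms, and playback latency from about 2.7 to 5.3 ms (buffering only, per the caveat above).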

Note that this article doesn't cover the advantages of extra buffering, like smoothing over hitches, or JACK2 async mode ensuring that an app that stalls won't cause the system audio and all apps to xrun. I have not studied JACK2 async mode though.

Avoid blocking writes (both exclusive and shared, output only)

If your app generates one output period of audio at a time and you want to minimize keypress-to-audio latency, then regardless of whether your app outputs to hardware devices or pull-mode daemons, it should never rely on blocking writes to act as output backpressure. Instead it should wait until 1 period of audio is writable, then generate 1 period of audio and nonblocking-write it. (This does not apply to duplex apps, since waiting for available input data effectively acts as output throttling.)

If your app generates audio before performing blocking writes for throttling, you will generate a new period of audio as soon as the previous period of audio is written (a full period of real time before a new period of audio is writable). This audio gets buffered for an extra period (while snd_pcm_writei() blocks) before reaching the speakers, so external (eg. keyboard) input takes a period longer to be audible.

(Note that avoiding blocking writes isn't necessarily beneficial if you don't generate and play audio in chunks synchronized with output periods.)

Issue: RtAudio relies on blocking snd_pcm_writei in pure-output streams. This adds 1 period of keyboard-to-speaker latency to output streams. (It also relies on blocking snd_pcm_writei for duplex streams, but this is essentially harmless: RtAudio first blocks on snd_pcm_readi, and by the time that call returns, if the input and output streams are synchronized, snd_pcm_writei is effectively a nonblocking write.)

ALSA: blocking reads/writes vs. snd_pcm_wait() vs. poll()

Making a blocking call to snd_pcm_readi() before generating sound is basically fine and does not add latency relative to nonblocking reads (snd_pcm_sw_params_set_avail_min(1 period) during setup, and calling snd_pcm_wait() before every read).

On the other hand, generating sound then making a blocking call to snd_pcm_writei() (in output-only streams) adds a full period of keyboard-to-speaker latency relative to nonblocking writes (snd_pcm_sw_params_set_avail_min(unused_buffer_size + 1 period) during setup, and calling snd_pcm_wait() before generating and writing audio).
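
As a concrete (and hedged) sketch of the nonblocking approach, assuming interleaved float output, a handle opened with SND_PCM_NONBLOCK, and a placeholder generate_audio() callback of my own:

```c
#include <alsa/asoundlib.h>

// Placeholder for the app's synth/mixer: fill `frames` frames of interleaved audio.
void generate_audio(float *out, snd_pcm_uframes_t frames);

// Sketch of the nonblocking output pattern described above. `pcm` is a playback
// handle opened with SND_PCM_NONBLOCK and configured for interleaved float;
// `buffer_size`, `period_size`, and `n_periods` (how many periods you choose to
// keep buffered) are in frames; `scratch` holds period_size * channels samples.
static int run_output_loop(snd_pcm_t *pcm,
                           snd_pcm_uframes_t buffer_size,
                           snd_pcm_uframes_t period_size,
                           unsigned n_periods,
                           float *scratch) {
    snd_pcm_uframes_t unused = buffer_size - n_periods * period_size;

    // Wake up once there is room for one more period within our N-period budget,
    // and auto-start playback once N periods have been written.
    snd_pcm_sw_params_t *sw;
    snd_pcm_sw_params_alloca(&sw);
    snd_pcm_sw_params_current(pcm, sw);
    snd_pcm_sw_params_set_avail_min(pcm, sw, unused + period_size);
    snd_pcm_sw_params_set_start_threshold(pcm, sw, n_periods * period_size);
    snd_pcm_sw_params(pcm, sw);

    for (;;) {
        // Block until at least avail_min frames are writable (or an error occurs).
        int err = snd_pcm_wait(pcm, -1);
        if (err < 0)
            return err; // xrun/suspend; real code would snd_pcm_recover()

        generate_audio(scratch, period_size);              // produce exactly one period
        snd_pcm_sframes_t written = snd_pcm_writei(pcm, scratch, period_size);
        if (written < 0)
            return (int)written;                           // likewise, recover in real code
    }
}
```

The key point is that each period is generated only after the device reports room for it, not a full period earlier.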

poll() has the same latency as snd_pcm_wait() and is more difficult to set up. The advantage is that you can pass in an extra file descriptor, allowing the main thread to interrupt the audio thread if poll()/snd_pcm_wait() is stuck waiting on a stalled ALSA device. (I'm not sure if stalled ALSA devices are common, but I've seen stalled shared-mode WASAPI happen.)
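
A sketch of the poll() variant with an extra wake-up descriptor (cancel_fd could be an eventfd or the read end of a pipe; the function and its return convention are my own illustration):

```c
#include <alsa/asoundlib.h>
#include <poll.h>
#include <errno.h>

// Sketch: wait until the PCM is ready to read/write a period, but also wake up
// if the main thread signals `cancel_fd`. Returns 1 if cancelled, 0 if the PCM
// is ready, and a negative error code otherwise.
static int wait_pcm_or_cancel(snd_pcm_t *pcm, int cancel_fd) {
    int n = snd_pcm_poll_descriptors_count(pcm);
    struct pollfd pfds[n + 1];
    snd_pcm_poll_descriptors(pcm, pfds, n);
    pfds[n].fd = cancel_fd;
    pfds[n].events = POLLIN;

    for (;;) {
        if (poll(pfds, n + 1, -1) < 0)
            return -errno;            // real code would retry on EINTR
        if (pfds[n].revents & POLLIN)
            return 1;                 // woken by the main thread: stop/cancel

        unsigned short revents = 0;
        snd_pcm_poll_descriptors_revents(pcm, pfds, n, &revents);
        if (revents & POLLERR)
            return -EPIPE;            // likely xrun/suspend; recover in real code
        if (revents & (POLLIN | POLLOUT))
            return 0;                 // PCM ready to read/write
        // otherwise: spurious wakeup, poll again
    }
}
```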

Avoid buffering shared output streams (output and duplex)

Most apps use shared-mode streams, since exclusive-mode streams take up an entire audio device, preventing other apps from playing sound. Shared-mode streams generally communicate with a userspace audio daemon (or, in ALSA dmix, the first app opening a device [1]), which is responsible for mixing audio from various programs and feeding it into hardware sound buffers, and ideally even routing audio from app to app.

If an app needs an output-only or duplex shared-mode stream, and must avoid unnecessary output latency, it should not buffer output audio itself (or generate audio before performing a blocking write, discussed above). Instead it should wait for the daemon to request output audio (and optionally provide input audio), then generate output audio and send it to the daemon. This minimizes output latency, and in the case of duplex streams, enables zero-latency chaining between apps in an audio graph! To achieve this, the pull-mode mixing daemon (for example JACK2 or PipeWire) requests audio from the first app, and synchronously passes it to later apps within the same period of real-world time. Sending audio through two apps in series adds zero latency compared to sending audio through one app. The downside is that if you chain too many apps, JACK2 can't finish ticking all the apps in a single period, and fails to output audio to the speakers in time, resulting in an audio glitch or xrun.

Issue: Any ALSA app talking to pulseaudio-alsa or pipewire-alsa (and possibly any PulseAudio app talking to pipewire-pulse) will perform extra buffering. Hopefully RtAudio, PortAudio, etc. will all add PipeWire backends someday (SDL2 already has one: https://www.phoronix.com/scan.php?page=news_item&px=SDL2-Lands-PipeWire-Audio).

As a result, for the remainder of the article, I will be focusing on using ALSA to talk to hardware devices.

Buffer 1-2 periods in exclusive output streams (output and duplex)

It is useful for some apps to open hardware devices directly (such that no other app can output or even receive audio), using exclusive-mode APIs like ALSA. These apps include audio daemons like PipeWire and JACK2 (which mix audio output from multiple shared-mode apps), or DAWs (which occupy an entire audio device for low-latency low-overhead audio recording and playback).

Apps which open hardware in exclusive mode must handle output timing in real-world time themselves. They must read input audio as the hardware writes it into buffers, and send output audio to the buffers ahead of the hardware playing it back.

In well-designed duplex apps that talk to hardware, such as JACK2 talking to ALSA, the general approach is:

Then loop:

Implementing exclusive-mode duplex like JACK2

JACK2's ALSA backend, and this guide, assume the input and output device in a duplex pair share the same underlying sample clock and never go out of sync. Calling snd_pcm_link() on two streams is supposed to succeed if and only if they share the same sample clock, buffer size and period count, etc. (the exact criteria are undocumented, and I haven't read the kernel source yet). If it succeeds, it not only starts and stops the streams together, but is supposed to synchronize the input's write pointer and the output's read pointer.
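
As a hedged sketch of the linking step (the function and fallback comments are illustrative; hw/sw params setup for both handles is assumed to have happened already, with identical rate, period size, and buffer size):

```c
#include <alsa/asoundlib.h>

// Sketch: try to link an already-configured capture/playback pair so they
// start, stop, and (ideally) stay sample-synchronized together.
static int try_link(snd_pcm_t *capture, snd_pcm_t *playback) {
    int err = snd_pcm_link(capture, playback);
    if (err < 0) {
        // Linking failed (e.g. the devices are on different cards/clocks).
        // The streams can still be started back-to-back, but they may drift
        // and eventually need resampling or drop/insert compensation.
        return err;
    }
    return 0; // starting either handle will now start both
}
```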

PipeWire supports rate-matching resampling (link), but (like timer-based scheduling) it introduces a great deal of complexity (heuristic clock skew estimation, resampling latency compensation), which I have not studied, is out of scope for opening a simple duplex stream, and actively detracts from learning the fundamentals.

Note that unused_buffer_size > 0 is also incidental complexity, and not essential to understanding the concepts. Normally buffer_size = N periods.

On ALSA, you can implement full duplex period-based audio by:

And in the audio loop:
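
Only as a hedged sketch (this is not JACK2's actual code; process_period(), the S16 interleaved format, and the explicit-start assumption are mine, and xrun recovery is omitted):

```c
#include <alsa/asoundlib.h>
#include <stdlib.h>

// Illustrative placeholder: consume one period of interleaved input and
// produce one period of interleaved output.
void process_period(const short *in, short *out,
                    snd_pcm_uframes_t frames, unsigned channels);

// Sketch of a period-based duplex loop for a linked capture/playback pair,
// keeping n_periods of output buffering. Assumes both handles are open in
// blocking mode, configured identically (SND_PCM_FORMAT_S16_LE, interleaved),
// linked with snd_pcm_link(), and that the playback start_threshold was set
// above the buffer size so the silence writes below don't auto-start it.
static int run_duplex(snd_pcm_t *capture, snd_pcm_t *playback,
                      snd_pcm_uframes_t period_size, unsigned channels,
                      unsigned n_periods) {
    short *in_buf  = calloc(period_size * channels, sizeof(short));
    short *out_buf = calloc(period_size * channels, sizeof(short));

    // Prefill the output with N periods of silence, so the speakers have
    // something to play while the first real period of input is captured.
    for (unsigned i = 0; i < n_periods; i++)
        snd_pcm_writei(playback, out_buf, period_size);

    // Starting one handle of a linked pair starts both.
    snd_pcm_start(capture);

    for (;;) {
        // Block until one period of input has been captured; because the
        // streams run off the same clock, one period of output space has
        // opened up by the same point in time.
        snd_pcm_sframes_t got = snd_pcm_readi(capture, in_buf, period_size);
        if (got < 0)
            break; // xrun/suspend; real code would snd_pcm_recover() both streams

        process_period(in_buf, out_buf, period_size, channels);

        snd_pcm_sframes_t put = snd_pcm_writei(playback, out_buf, period_size);
        if (put < 0)
            break; // likewise
    }

    free(in_buf);
    free(out_buf);
    return 0;
}
```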

RtAudio gets duplex wrong, can have xruns and glitches

Issue: RtAudio opens and polls an ALSA duplex stream (in this case, duplex.cpp with extra debug prints added, opening my motherboard's hw device) by:

Then loop:

Fixing RtAudio output and duplex

To resolve this for duplex streams, the easiest approach is to change stream starting:

This approach fails for output-only streams. To resolve the issue in both duplex and output streams, you must:

I haven't looked into how RtAudio stops ALSA streams (with or without snd_pcm_link()), then starts them again, or what happens if you stop and restart them quickly enough that the buffers haven't fully drained yet.

(optional) Replacing blocking reads/writes with cancellable polling

RtAudio needs to use polling to avoid extra latency in output-only streams. Should it be used for duplex and input-only streams as well? Is it worth adding an extra pollfd for cancelling blocking writes (possibly replacing the condvar)?

I don't know how to refactor RtAudio to allow cancelling a blocked snd_pcm_readi/writei. Maybe pthread cancellation is sufficient; I don't know. If not, one JACK2- and cpal-inspired approach is:

Unfortunately this requires a pile of refactoring for relatively little gain.

Is RtAudio's current approach appropriate for low-latency pipewire-alsa?

Update: No.

pipewire-alsa in its current form (774ade146) is wholly unsuitable for low-latency audio.

I use jack_iodelay to measure signal latency, by using Helvum (a PipeWire graph editor) to route jack_iodelay's output (which generates audio) through other nodes (which should pass-through audio with a delay) and back into its input (which measures audio and determines latency). When jack_iodelay is routed through hardware alone, it reports the usual 2 periods/quantums of latency. When I start RtAudio's ALSA duplex app with period matched to the PipeWire quantum (which should add only 1 period of latency since snd_pcm_link() fails), and route jack_iodelay through hardware and duplex in series, jack_iodelay reports a whopping 7 periods of latency. My guess is that pipewire-alsa adds a full 2 periods of buffering to both its input and output streams. I'm not sure if I have the motivation to understand and fix it.

Earlier:

RtAudio doesn't write silence to the output of a duplex stream before starting the streams, and only writes to the output stream once one period of data arrives at the input stream. This is unambiguously wrong for hw device streams. Is it the best way to achieve zero-latency ALSA passthrough when using the pipewire-alsa plugin? I don't know if it works or if the output stream xruns, I don't know if this is contractually guaranteed to work, and I'd have to test it and read the pipewire-alsa source (link).

Is it possible to achieve low-latency output-only ALSA, perhaps by waiting until the buffer is entirely empty (snd_pcm_sw_params_set_avail_min())? Again I don't know, and I'd have to test.

Push-mode audio loses the battle before it's even fought

I hear push-mode mixing daemons like PulseAudio (or possibly WASAPI) are fundamentally bad designs, incompatible with low-latency or consistent-latency audio output.

https://superpowered.com/androidaudiopathlatency (discussion) is a horror story. In fact, I read elsewhere that pre-AAudio Android duplex loopback latency was different on every run; I can no longer recall the source, but it's entirely consistent with the user application doing its own ring buffering, or with input and output streams being started separately rather than started and run in sync at the driver level (as snd_pcm_link() does).

Note that Android audio may have improved since then, see AAudio and https://android-developers.googleblog.com/2021/03/an-update-on-androids-audio-latency.html.

An update on push-mode audio

Update 2023-07-10: Push-mode audio can provide fairly low latencies, even if you don't generate audio in the application in fixed-size callbacks. This still requires managing the queue length in a closed-loop fashion. You can rely on backpressure from the audio output to throttle audio generation, though getting the same latency as JACK requires a zero-length output queue where pushing audio doesn't return to the program until the new audio block starts playing.

Alternatively, you can rely on a non-audio timer to control the rate at which audio is generated (which is necessary to sync audio playback with the rate of online video streaming or display playback), and adjust the rate at which audio is resampled (if audio is generated faster than it is played, causing the queue length to increase, you need to increase the rate of audio playback). This becomes more difficult if audio is generated and/or consumed in large chunks relative to the desired time that audio is queued for, and you cannot accurately measure the progress within the chunks (like in ALSA batch mode?). Unfortunately dynamic resampling is easy to get wrong, causing the pitch of audio to drift over time (VLC, Dolphin Emulator).
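
For illustration only (the function, gain, and clamp values below are mine, not taken from any particular player), a proportional controller on the queue length might look like:

```c
// Sketch: pick a resampling ratio based on how far the output queue is from
// its target fill level. If more audio is queued than intended, consume source
// frames slightly faster to drain the queue, and vice versa. The gain must be
// small, or the pitch will audibly wobble.
static double resample_ratio(long queued_frames, long target_frames) {
    const double kp = 0.005;                        // illustrative gain
    double error = (double)(queued_frames - target_frames) / (double)target_frames;
    double ratio = 1.0 + kp * error;                // >1.0: consume input faster
    if (ratio < 0.98) ratio = 0.98;                 // clamp to avoid audible pitch jumps
    if (ratio > 1.02) ratio = 1.02;
    return ratio;
}
```

In practice you would also want to smooth the measured queue length before feeding it into the controller, since instantaneous readings can jump around by up to a chunk.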


The largest problem with push-mode queueing for low-latency audio is that it does not compose. If you generate audio before checking whether there is room to accept it, then perform a blocking push until there is room, the audio necessarily gets delayed by the length of your sleep before it enters the queue, possibly by some additional time in the queue before it starts playing, and then by the length of the audio segment before it finishes playing. To achieve minimal latency, only one application can be doing this at a time, whether it's the audio daemon (like Windows's WASAPI audiodg.exe, Linux's PulseAudio/JACK/PipeWire, or presumably macOS's Core Audio) or an application using exclusive-mode audio. The audio daemon must then request audio from applications to fill up its own audio queue (and the applications must only generate audio once room is available in the system), before mixing the received audio into the sound queued for the audio device.

There is a comparable distinction in video latency, between an app exclusively controlling a display, versus push/pull-mode video (with and without latency). A major difference is that on VRR displays, an app can actually display a new frame of video as soon as it's rendered, whereas in audio playback, playing a new buffer as soon as it's ready will result in severe audio skipping if you start a new buffer before or after the old one is finished playing. I am less familiar with video compositing and latency, and may write about it in a separate post once I have worked with it more.

Footnotes

1

Source code of dmix, where the first process appears to create a shared memory segment and open the hw device.