User Controllable Audio Channels for Streaming

Jun 26, 2022

I sometimes watch Twitch streams and I frequently want to mute either the commentators or the players. This is also how I feel when I watch sports - I don’t think I’ve ever wanted to hear the pundits say “you know, now more than ever they need to focus on scoring the most points”.

For video streamers it’s easy with tools like OBS to choose which audio channel to stream at any given time, but as a listener I don’t have the same level of control. Only occasionally does the commentary feel like a meaningful value add, and when it does I don’t want to have to mentally tune out other voices and sounds.

Exposing the layers of audio channels to the user and allowing them to turn specific ones on and off would be a great feature. It leans into the core idea of general purpose computing that the user should have control over what runs on their device; just as I can customize my browser to display text at different sizes and choose what JavaScript does or does not run, I could have control over what parts of an audio stream I listen to.

I see two ways this could be implemented. The first would be at the platform and app level. Twitch could offer advanced features to streamers and users that expose the different audio layers, and let a user determine what they hear. I don’t have a good sense of the complexity of such a feature. I think a naive implementation, one that doesn’t need client-side rendering, would require generating different streams of audio (one with commentary, one without) and sending a different stream depending on the user’s settings.
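As a rough illustration of that naive server-side approach, here is a minimal Python sketch. Everything in it is an assumption made for the example: the layer names, the function names `mix_layers` and `premixed_variants`, and the representation of audio as lists of float samples in the range -1.0 to 1.0. It is not how Twitch or OBS actually handle audio.

```python
from typing import Dict, List, Set

Samples = List[float]


def mix_layers(layers: Dict[str, Samples], enabled: Set[str]) -> Samples:
    """Sum the enabled layers sample by sample, clamping to [-1.0, 1.0]."""
    length = max((len(s) for s in layers.values()), default=0)
    mixed = [0.0] * length
    for name, samples in layers.items():
        if name not in enabled:
            continue  # the listener has muted this layer
        for i, sample in enumerate(samples):
            mixed[i] += sample
    return [max(-1.0, min(1.0, s)) for s in mixed]


def premixed_variants(
    layers: Dict[str, Samples], variants: Dict[str, Set[str]]
) -> Dict[str, Samples]:
    """Build one pre-mixed stream per variant (hypothetical labels).

    The server would then send whichever variant matches the user's settings.
    """
    return {label: mix_layers(layers, enabled) for label, enabled in variants.items()}


# Hypothetical layers: game sound and commentary.
layers = {"game": [0.5, 0.25], "commentary": [0.25, 0.5]}
streams = premixed_variants(
    layers,
    {"full": {"game", "commentary"}, "no_commentary": {"game"}},
)
```

The cost of this design is that the number of pre-mixed streams grows with every combination of layers on offer, which is one reason a client-side mix of separate layers could end up more practical.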

A second way would be to use deep learning to isolate and remove specific voices and sounds from an audio channel. This would be a pure end-user tool, not requiring any buy-in from the audio provider. DL audio manipulation has gotten very good, and there are already non-streaming versions of this. A tool that takes in a streaming audio channel, delays the stream for several seconds while removing a specific set of voices, and then streams the new audio seems feasible with current tech, though it would likely have some annoying failures (e.g. at times removing the wrong audio).
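The buffering half of such a tool can be sketched in a few lines of Python. This is only a sketch under stated assumptions: `delayed_filter` and `attenuate` are hypothetical names, audio arrives as a sequence of fixed-size chunks, and `separate` is a placeholder for a real source-separation model, which is not implemented here. The stand-in just halves the volume so the example runs.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

Chunk = List[float]


def delayed_filter(
    chunks: Iterable[Chunk],
    separate: Callable[[List[Chunk]], Chunk],
    delay_chunks: int,
) -> Iterator[Chunk]:
    """Emit cleaned chunks, running `delay_chunks` behind the live input.

    `separate` receives the oldest buffered chunk plus the lookahead that
    follows it, and returns the cleaned version of that oldest chunk.
    """
    window: Deque[Chunk] = deque()
    for chunk in chunks:
        window.append(chunk)
        if len(window) > delay_chunks:
            yield separate(list(window))
            window.popleft()
    # Input has ended: flush whatever is still buffered.
    while window:
        yield separate(list(window))
        window.popleft()


def attenuate(context: List[Chunk]) -> Chunk:
    """Placeholder for a separation model (assumption): halve the oldest chunk."""
    return [sample * 0.5 for sample in context[0]]


cleaned = list(delayed_filter([[1.0], [2.0], [3.0]], attenuate, delay_chunks=2))
```

The delay is what gives the model context after the moment it is cleaning, at the price of the listener running a few seconds behind the live stream.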

Ben Goldhaber's Newsletter
