How We Built Local Audio Streaming in Android


February 5, 2023 · 8 min read


One of our customers is building a social audio rooms app on Android where hosts go live on talk shows. Recently, they came up with a feature request. They wanted to convert a few of their rooms into "radio stations", where the host of the room would stream local music from their device, and also talk simultaneously so that listeners would be able to hear the voice of the host as well as the music together.

This meant we needed the ability to mix audio captured from the device’s microphone with music playing on the device itself. This was a challenging feature to build with native WebRTC on mobile because, unlike WebRTC on the web, WebRTC on mobile does not expose any API to add custom audio tracks.

To support this, we made significant changes to WebRTC and our Android SDK. We will go over those changes in this post: an overview of how WebRTC captures audio, the changes we made to the WebRTC source, and how to mix two audio streams programmatically.

To make things easier to explain, let's break down the problem statement into 3 parts.

  1. How to capture audio from the device mic
  2. How to capture local audio from the device
  3. How to mix two audio streams

How WebRTC Audio Streaming Works

Let’s solve each of these problems step by step.

Capture audio from the device mic

To find a solution, we first need to understand the flow inside WebRTC that captures audio from the device’s mic. So let’s dive in!

Things start off when we create an instance of WebRTC’s PeerConnectionFactory. It internally creates an instance of JavaAudioDeviceModule, which is an implementation of WebRTC’s AudioDeviceModule interface.

JavaAudioDeviceModule

The JavaAudioDeviceModule is the main class responsible for setting parameters for audio capture (input) and playout (output). As sketched below, it lets you:

  • Set the preferred input device
  • Enable/disable noise suppression
  • Configure the input/output sampling rate
  • Configure mono/stereo channels for input and output
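
For reference, these knobs are exposed through the stock JavaAudioDeviceModule builder. The snippet below is a minimal sketch using method names from the public WebRTC Android SDK; the exact set of builder methods can vary between WebRTC versions.

// Minimal sketch: configuring the stock JavaAudioDeviceModule.
// Method names are from the public WebRTC Android SDK and may differ
// slightly between versions.
val audioDeviceModule = JavaAudioDeviceModule.builder(applicationContext)
    .setSampleRate(48000)                // input/output sampling rate
    .setUseStereoInput(false)            // mono input
    .setUseStereoOutput(false)           // mono output
    .setUseHardwareNoiseSuppressor(true) // enable noise suppression
    .createAudioDeviceModule()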

It also creates instances of WebRtcAudioRecord (for input) and WebRtcAudioTrack (for output), which are the classes actually responsible for capturing audio from the mic and playing audio out.

Let’s look deeper into WebRtcAudioRecord, since that is where the path to our solution lies.

WebRtcAudioRecord

This class uses Android’s AudioRecord to communicate with the device to open and capture an audio stream from the mic.

When WebRTC starts to record an audio stream from the system’s mic, a Java thread is spawned with the responsibility of:

  1. Continuously reading audio from the input hardware into a direct byte buffer
  2. Sending the number of bytes read, along with the buffer containing the actual audio bytes, to the native layer of WebRTC, where it is encoded, packetised, and streamed over the network (see the sketch after this list)
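
In simplified form, the recording thread inside WebRtcAudioRecord looks roughly like the sketch below. This is our own condensed version of the loop in WebRTC's Java source; the name of the native call is illustrative.

// Simplified sketch of WebRtcAudioRecord's recording thread
while (keepAlive) {
    // 1. Read audio from the mic into the direct byte buffer
    val bytesRead = audioRecord.read(byteBuffer, byteBuffer.capacity())
    if (bytesRead > 0) {
        // This is the point where we will hook in our callback later,
        // so we can modify the buffer before it leaves the Java layer.

        // 2. Hand the buffer over to WebRTC's native layer, where it is
        //    encoded, packetised and streamed over the network
        nativeDataIsRecorded(nativeAudioRecord, bytesRead)
    }
}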

Now that we have looked at the relevant classes of WebRTC, we need a way to get hold of these bytes so that we can modify them before they are sent over the network.

To achieve this we need to make changes to WebRTC's Java layer.

  1. Create a callback interface that is invoked as soon as data is read from the device's microphone, carrying the ByteBuffer that was read:
public interface AudioBufferCallback {
    void onBuffer(ByteBuffer buffer, int bytesRead);
}
  2. Modify the creation logic in JavaAudioDeviceModule so that we can provide the above callback instance when we create the PeerConnectionFactory:
// Create webrtcAudioRecord
webrtcAudioRecord = WebRtcAudioRecord(
    ...
)

// Set the callback to webrtcAudioRecord
webrtcAudioRecord.setBufferCallback(
    object : AudioBufferCallback {
        override fun onBuffer(micByteBuffer: ByteBuffer?,  micBytesRead: Int) {
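            // Read local audio here and mix it into micByteBuffer
            // (covered in the next sections)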
      
        }
    })

// Create instance of JavaAudioDeviceModule & set webrtcAudioRecord to it
val customAudioDeviceModule = JavaAudioDeviceModule(
    ..
    webrtcAudioRecord,
    ..
)

// set the above JavaAudioDeviceModule instance while creating PeerConnectionFactory instance
val peerConnectionFactory = PeerConnectionFactory.builder()
    ..
    .setAudioDeviceModule(customAudioDeviceModule)
    .createPeerConnectionFactory()

With these changes, we now have the captured microphone audio as a byte buffer whenever onBuffer gets called.

Now let’s move on to the next problem.

Capture local audio from the device

So now that we have the bytes captured from the device's mic, our next goal is to get the audio playing on the local device so that we can mix the two together.

To capture system audio, we use Android’s AudioRecord class with an AudioPlaybackCaptureConfiguration, which lets us record the audio output of other apps on the device. Since this relies on Android’s MediaProjection API, it only works on Android 10 and above.
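
For completeness, this is roughly how the MediaProjection instance (the mMediaProjection used below) is obtained, using the standard MediaProjectionManager APIs. The request code and field names here are just for this sketch, and on recent Android versions the capture must also be tied to a foreground service of type mediaProjection.

// Minimal sketch, inside an Activity: ask the user for permission to
// capture playback from other apps and keep the resulting MediaProjection.
private val REQUEST_MEDIA_PROJECTION = 1001 // arbitrary request code
private var mMediaProjection: MediaProjection? = null

private fun requestMediaProjection() {
    val projectionManager =
        getSystemService(Context.MEDIA_PROJECTION_SERVICE) as MediaProjectionManager
    // Shows the system dialog asking the user to allow capture
    startActivityForResult(projectionManager.createScreenCaptureIntent(), REQUEST_MEDIA_PROJECTION)
}

override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
    super.onActivityResult(requestCode, resultCode, data)
    if (requestCode == REQUEST_MEDIA_PROJECTION && resultCode == Activity.RESULT_OK && data != null) {
        val projectionManager =
            getSystemService(Context.MEDIA_PROJECTION_SERVICE) as MediaProjectionManager
        mMediaProjection = projectionManager.getMediaProjection(resultCode, data)
    }
}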

To use AudioRecord to capture other apps’ audio, we first create an instance of it and set its format and playback capture config:

// Sets the format of the audio data to be captured.
val format = AudioFormat.Builder()
      .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
      .setSampleRate(48000) // Most devices operate at 48kHz sampling rate
      .setChannelMask(AudioFormat.CHANNEL_IN_MONO) // input channel mask, since this is used with AudioRecord
      .build()

// Sets the audioRecord to record audio played by other apps
val playbackConfig = AudioPlaybackCaptureConfiguration.Builder(mMediaProjection)
      .addMatchingUsage(AudioAttributes.USAGE_MEDIA)
      .addMatchingUsage(AudioAttributes.USAGE_UNKNOWN)
      .addMatchingUsage(AudioAttributes.USAGE_GAME)
      .build()

// Create the audio Record instance 
val mAudioRecord = AudioRecord.Builder()
      .setAudioFormat(format)
      .setAudioPlaybackCaptureConfig(playbackConfig)
      .build()

// Starts recording from the AudioRecord instance
mAudioRecord.startRecording()

Once the AudioRecord is created, the next step is to read the audio played by other apps. But the question is when, and how much, to read at a time.

This is where the callback we created in the first step comes into the picture.

The onBuffer callback of the AudioBufferCallback interface is called whenever WebRTC's audio capture thread reads audio bytes from the microphone. This callback gives us two values:

  1. micByteBuffer - the ByteBuffer containing the bytes WebRTC read from the device's microphone
  2. micBytesRead - an Int telling us how many bytes WebRTC read from the mic

So now we know when to read the audio of other apps using the AudioRecord instance created above: whenever we receive the callback. We also know how much to read: the value of micBytesRead received in the callback.

We read from the audio record into a ByteBuffer like so:

// Allocate a new direct byte buffer, the same size as the mic read
val localAudioByteBuffer: ByteBuffer = ByteBuffer
    .allocateDirect(micBytesRead)
    .order(ByteOrder.nativeOrder())

// Read audio data from the AudioRecord into the direct buffer
mAudioRecord.read(localAudioByteBuffer, micBytesRead, AudioRecord.READ_BLOCKING)

This gives us the audio bytes from other apps on the device in the newly created localAudioByteBuffer. Note that mixing the two streams sample by sample (next section) only works if they share the same sample rate, channel count, and 16-bit PCM encoding, which is why the AudioRecord above is configured with the same 48 kHz, mono, 16-bit format used for the mic capture.

Mix two audio streams

Now that we have captured audio from both sources – local and mic – we need to mix them into a combined stream that can be sent to the other peers connected via WebRTC.

Since the audio is 16-bit PCM, each sample spans two bytes. So the first step is to convert each ByteBuffer into a ShortArray (one Short per sample); the two arrays can then be added together sample by sample.
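
A minimal conversion helper could look like the one below. toShortArray is our own hypothetical name, and we assume the buffers hold 16-bit PCM samples in the device's native byte order (little-endian on Android).

// Hypothetical helper: copies the first `bytesRead` bytes of a 16-bit PCM
// ByteBuffer into a ShortArray (one Short per sample), without disturbing
// the original buffer's position.
fun ByteBuffer.toShortArray(bytesRead: Int): ShortArray {
    val samples = ShortArray(bytesRead / 2)
    val view = this.duplicate()
    view.order(ByteOrder.nativeOrder())
    view.rewind()
    view.asShortBuffer().get(samples, 0, samples.size)
    return samples
}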

// a1 and a2 are the ShortArrays that need to be combined
// a1Limit and a2Limit are the number of valid samples in each array
fun addBuffers(
    a1: ShortArray,
    a1Limit: Int,
    a2: ShortArray,
    a2Limit: Int
): ByteArray {
    val size = Math.max(a1Limit, a2Limit)
    if (size <= 0) return ByteArray(0)
    // Each 16-bit sample becomes two bytes in the output
    val result = ByteArray(size * 2)
    for (i in 0 until size) {
        // If one stream has run out of samples, pass the other through;
        // otherwise add the two samples together
        var sum: Int = when {
            i >= a1Limit -> a2[i].toInt()
            i >= a2Limit -> a1[i].toInt()
            else -> a1[i].toInt() + a2[i].toInt()
        }
        // Clamp to the 16-bit range to avoid overflow distortion
        if (sum > Short.MAX_VALUE) sum = Short.MAX_VALUE.toInt()
        if (sum < Short.MIN_VALUE) sum = Short.MIN_VALUE.toInt()
        // Write the mixed sample back as two little-endian bytes
        val byteIndex = i * 2
        result[byteIndex] = (sum and 0xff).toByte()
        result[byteIndex + 1] = (sum shr 8 and 0xff).toByte()
    }
    return result
}

The above method returns a ByteArray, which needs to be converted back into a ByteBuffer before being handed back to WebRTC.
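
Putting it all together inside the onBuffer callback, the glue looks roughly like this. This is a sketch: toShortArray is the hypothetical helper from above, and mixedBytes / combinedByteBuffer are illustrative names (null-checks omitted for brevity).

// Inside onBuffer(micByteBuffer, micBytesRead), after localAudioByteBuffer has been read
val micSamples = micByteBuffer.toShortArray(micBytesRead)
val localSamples = localAudioByteBuffer.toShortArray(micBytesRead)

// Mix the two streams sample by sample
val mixedBytes = addBuffers(micSamples, micSamples.size, localSamples, localSamples.size)

// Wrap the mixed bytes back into a ByteBuffer for WebRTC
val combinedByteBuffer: ByteBuffer = ByteBuffer.wrap(mixedBytes)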

Once the mixing is done, all we need to do is clear the microphone ByteBuffer captured by WebRTC and replace its contents with the combined buffer from above:

// Clear the mic buffer and write the mixed audio into it
micByteBuffer.clear()
micByteBuffer.put(combinedByteBuffer)

With this, the audio from the microphone and the audio playing on the local device are streamed together as a single audio stream. The same approach can also be used to replace the mic audio entirely with the local device audio, if there is a use case for that.


If you're building with 100ms and want to add this feature to your application, check out our docs to enable local audio sharing in Android. Do share your feedback with us on Discord – we'd love to know what you build with 100ms.
