January 4, 2022
Virtual backgrounds have become a necessity in the video conferencing world. They let us replace our natural background with an image or a video, and we can also upload custom images to use as the background.
In this blog, we are going to implement a virtual background in Android with WebRTC using ML Kit selfie segmentation.
This feature works best with uniform lighting conditions in the background and requires a high-performance Android device for a smooth user experience.
By the end of this blog, you can expect the virtual background feature to look like this.
Add the dependencies for the ML Kit Android libraries to the module's app-level Gradle file, which is usually app/build.gradle:
dependencies {
    implementation 'com.google.mlkit:segmentation-selfie:16.0.0-beta3'
}
Add the dependency for libyuv:
dependencies {
    implementation 'io.github.zncmn.libyuv:core:0.0.7'
}
libyuv is an open-source project that includes YUV scaling and conversion functionality.
In WebRTC, a media stream is made up of MediaStreamTrack objects representing the various audio or video tracks. The most challenging part was getting the VideoFrame out of WebRTC for every frame so it could be processed. After a lot of research, we found that we can use the setVideoProcessor API available on VideoSource. It has a few callbacks:
// Gives us the VideoFrame going into WebRTC, for every frame
fun onFrameCaptured(inputVideoFrame: VideoFrame?)

// Gives us the sink that we will use to send the updated VideoFrame back to WebRTC
fun setSink(sink: VideoSink?)
This is how we can set a VideoProcessor on a VideoSource (source in the code snippet below is a VideoSource):
source.setVideoProcessor(object : VideoProcessor {
override fun onCapturerStarted(p0: Boolean) {
}
override fun onCapturerStopped() {
}
override fun onFrameCaptured(inputVideoFrame: VideoFrame?) {
//Do processing with inputVideoFrame here
}
override fun setSink(sink: VideoSink?) {
//set sink here to send updated videoFrame back to WebRTC
}
})
If we set a VideoProcessor on the VideoSource, we need to call the onFrame callback on the VideoSink for every frame; otherwise, we will get a black screen on our device.
// Here, frame is the updated VideoFrame we get after ML processing on the input VideoFrame
sink.onFrame(frame)
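Putting these pieces together, the wiring looks roughly like this (a minimal sketch; processFrame is a hypothetical helper standing in for the segmentation and compositing described below):

source.setVideoProcessor(object : VideoProcessor {
    // Sink handed to us by WebRTC; processed frames must be pushed into it
    private var localSink: VideoSink? = null

    override fun onCapturerStarted(success: Boolean) {}
    override fun onCapturerStopped() {}

    override fun setSink(sink: VideoSink?) {
        localSink = sink
    }

    override fun onFrameCaptured(inputVideoFrame: VideoFrame?) {
        val input = inputVideoFrame ?: return
        // processFrame (hypothetical) runs segmentation + background compositing
        val processedFrame = processFrame(input)
        localSink?.onFrame(processedFrame)
    }
})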
To perform segmentation on an image, ML Kit needs an InputImage object, which can be created from a bitmap, a ByteBuffer, a media.Image, a byte array, or a file on the device.
Here, we convert inputVideoFrame into a bitmap using the libyuv library.
YuvFrame: it copies the Y, U, and V planes from the VideoFrame buffer into a byte array, which we then convert to an ARGB_8888 Bitmap.
yuvFrame = YuvFrame(
inputVideoFrame,
YuvFrame.PROCESSING_NONE,
inputVideoFrame.timestampNs
)
inputFrameBitmap = yuvFrame.bitmap
Now we create an InputImage from inputFrameBitmap:
val mlImage = InputImage.fromBitmap(inputFrameBitmap, 0)
We then create an instance of Segmenter.
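A minimal sketch of creating such a Segmenter, assuming ML Kit's SelfieSegmenterOptions in stream mode (suited for video, where results from previous frames are used to smooth the mask):

import com.google.mlkit.vision.segmentation.Segmentation
import com.google.mlkit.vision.segmentation.Segmenter
import com.google.mlkit.vision.segmentation.selfie.SelfieSegmenterOptions

// STREAM_MODE is intended for live video and returns lower-latency, smoother masks
val options = SelfieSegmenterOptions.Builder()
    .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE)
    .build()

val segmenter: Segmenter = Segmentation.getClient(options)

We then pass mlImage to the segmenter and read the segmentation mask from the result: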
segmenter.process(mlImage)
.addOnSuccessListener { segmentationMask ->
val mask = segmentationMask.buffer
val maskWidth = segmentationMask.width
val maskHeight = segmentationMask.height
mask.rewind()
val arr: IntArray = maskColorsFromByteBuffer(mask, maskWidth, maskHeight)
val segmentedBitmap = Bitmap.createBitmap(
arr, maskWidth, maskHeight, Bitmap.Config.ARGB_8888
)
//segmentedBitmap is the person segmented from background
}
.addOnFailureListener { exception ->
HMSLogger.e("App", "${exception.message}")
}
.addOnCompleteListener {
}
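maskColorsFromByteBuffer is not part of ML Kit; one possible implementation (a sketch, assuming a 0.5 confidence threshold, an opaque-white foreground, and a transparent background) reads the per-pixel float confidence from the mask buffer:

import android.graphics.Color
import java.nio.ByteBuffer

// Sketch of a helper: ML Kit's mask buffer holds one float confidence per pixel
// (probability that the pixel belongs to the person). We turn it into ARGB pixels,
// opaque where the person is and fully transparent elsewhere.
private fun maskColorsFromByteBuffer(mask: ByteBuffer, maskWidth: Int, maskHeight: Int): IntArray {
    val colors = IntArray(maskWidth * maskHeight)
    for (i in 0 until maskWidth * maskHeight) {
        val confidence = mask.float
        colors[i] = if (confidence > 0.5f) {
            Color.argb(255, 255, 255, 255) // foreground (person)
        } else {
            Color.TRANSPARENT              // background
        }
    }
    return colors
}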
We used PorterDuff transfer modes to draw the segmented output together with the user-supplied background image on a Canvas (using the Canvas APIs).
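A minimal sketch of that compositing step, assuming backgroundBitmap is the user's background image already scaled to the frame size and that the mask and frame dimensions match:

import android.graphics.Bitmap
import android.graphics.Canvas
import android.graphics.Paint
import android.graphics.PorterDuff
import android.graphics.PorterDuffXfermode

// Combine the mask, the camera frame, and the background with PorterDuff modes
val outputBitmap = Bitmap.createBitmap(
    inputFrameBitmap.width, inputFrameBitmap.height, Bitmap.Config.ARGB_8888
)
val canvas = Canvas(outputBitmap)
val paint = Paint(Paint.ANTI_ALIAS_FLAG)

// 1. Draw the mask: opaque where the person is, transparent elsewhere
canvas.drawBitmap(segmentedBitmap, 0f, 0f, null)

// 2. SRC_IN keeps the camera pixels only inside the opaque (person) region
paint.xfermode = PorterDuffXfermode(PorterDuff.Mode.SRC_IN)
canvas.drawBitmap(inputFrameBitmap, 0f, 0f, paint)

// 3. DST_OVER fills the remaining transparent region with the background image
paint.xfermode = PorterDuffXfermode(PorterDuff.Mode.DST_OVER)
canvas.drawBitmap(backgroundBitmap, 0f, 0f, paint)
paint.xfermode = null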
After this, we get outputBitmap from the Canvas, which we use to create the updated VideoFrame:
surfTextureHelper?.handler?.post {
    // Upload outputBitmap into the OpenGL texture backing inputBuffer
    // (a texture buffer created on the SurfaceTextureHelper's GL thread; setup not shown here)
    GLES20.glTexParameteri(
        GLES20.GL_TEXTURE_2D,
        GLES20.GL_TEXTURE_MIN_FILTER,
        GLES20.GL_NEAREST
    )
    GLES20.glTexParameteri(
        GLES20.GL_TEXTURE_2D,
        GLES20.GL_TEXTURE_MAG_FILTER,
        GLES20.GL_NEAREST
    )
    GLUtils.texImage2D(GLES20.GL_TEXTURE_2D, 0, outputBitmap, 0)

    // Convert the texture buffer to an I420 buffer and wrap it in a VideoFrame
    val i420Buf = yuvConverter.convert(inputBuffer)
    val outputVideoFrame = VideoFrame(i420Buf, 180, frameTs) // 180 is the frame rotation degree we are using
}
This will replace the input video feed with the supplied background on both the local and remote ends:
sink.onFrame(outputVideoFrame)
The whole pipeline takes 40-50 ms on average at 360p resolution, as measured on a OnePlus 6.
Most of the processing time is spent converting the input VideoFrame to a YuvFrame. Since the scene doesn't change much from one frame to the next, there is no need to perform this conversion on every frame; the previously converted YuvFrame can be reused for processing, which improves both performance and user experience.
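For illustration, a minimal sketch of that reuse inside onFrameCaptured (frameCounter and cachedYuvFrame are illustrative names, and refreshing every third frame is just an example interval):

private var frameCounter = 0
private var cachedYuvFrame: YuvFrame? = null

override fun onFrameCaptured(inputVideoFrame: VideoFrame?) {
    val frame = inputVideoFrame ?: return

    // Refresh the expensive VideoFrame -> YuvFrame conversion only every few frames
    if (cachedYuvFrame == null || frameCounter % 3 == 0) {
        cachedYuvFrame = YuvFrame(frame, YuvFrame.PROCESSING_NONE, frame.timestampNs)
    }
    frameCounter++

    val inputFrameBitmap = cachedYuvFrame?.bitmap
    // ...continue with segmentation and compositing as described above
}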