
If you're working on a camera app, you might also be interested in video post-processing.

Even though you might be tempted to use external libraries (like ffmpeg), Android already offers powerful APIs that enable video processing.

Using Android's built-in APIs can give you some additional benefits. Oftentimes you'll be able to use device-specific hardware optimizations. GPL-style licenses might also not always fit your project needs. Additionally, using external libraries for encoding/decoding video might require you to pay patent fees.

Recently I've been playing a bit more with Android's video processing APIs. In my previous post I showed how to create a video from multiple consecutive images. In this post I would like to extend that sample app so that it can also add an audio track to the generated video file.

This post tries to use mostly the newer APIs. You should be able to use the code from this post with minSdkVersion >= 22.

The information in this post should help you implement some popular video editing features, like voice-over or audio format conversion.

You can find the code for this post on GitHub.

Overview

As mentioned at the beginning, in my previous post I created a video from multiple images. The video was packed into an MPEG-4 container with H.264 encoding and contained no audio track.

In this post I'm assuming that you already have a video in the same format and want to add audio to it. If this assumption doesn't suit your requirements, I still suggest continuing to read, because I'll show some techniques that you can use to convert your media file to the preferred format (even though I focus on converting audio). However, you need to make sure that your format is one of the formats supported on Android.

The main class responsible for adding the audio track to our time-lapse video file will be the MediaMuxer. MediaMuxer can add multiple tracks (video/audio/subtitles) into one output video file.

The tracks for MediaMuxer should contain compressed video and audio frames in the right format.

We can use MediaExtractor to get the compressed audio and video frames out of the input files.

In the ideal case, our input audio would already be AAC-encoded and packed inside an .m4a container. In that case we would extract the frames from the input audio and video files and mux them into the final MPEG-4 output file using MediaMuxer without any additional work.

However, MediaMuxer is strict about which audio encodings can go into an MPEG-4 file. If you, for example, want to add MP3 audio to our video file, you first need to convert the audio samples into the right format. In this case you also need to use the MediaCodec class for the conversion.

This second option would look something like this
1.) Use MediaExtractor to extract compressed frames from input video file
2.) Give compressed video frames to MediaMuxer
3.) Extract compressed frames from input audio file
4.) Feed compressed frames to MediaCodec decoder
5.) Get raw decompressed frames from decoder and feed them to encoder
6.) Get compressed frames in the right audio format from encoder and give them to MediaMuxer

So I will have this extra conversion step for the input audio file. The video file won't need any conversion in my case, because it was already output from the time-lapse generator in the right video format (MPEG-4).

Extracting Audio & Video Frames With MediaExtractor

Android provides the MediaExtractor class, which allows you to extract compressed chunks of video and audio from various input media file formats. The extracted "chunks" (frames/access units) are already properly formatted, so they can be fed directly to a MediaCodec decoder.

Without MediaExtractor, you would need to parse the metadata from the media file yourself, and you would need to ensure that the chunks you feed to MediaCodec's input buffers have the correct access unit boundaries.

Media files can contain multiple tracks of video, audio, or subtitles. Before you can extract frames from a specific audio or video track, you first need to select the right track. Here's how you can find the audio track inside a media file

val extractor = MediaExtractor()
extractor.setDataSource(inFilePath)

for (i in 0 until extractor.trackCount) {
    val format = extractor.getTrackFormat(i)
    val mime = format.getString(MediaFormat.KEY_MIME)

    if (mime.startsWith("audio/")) {

        extractor.selectTrack(i)
        // Read the frames from this track here
        // ...
    }
}

Once you've selected the right track, you can extract compressed frames from it in the following way

// Initialize extractor and select target track
// ...

val maxChunkSize = 1024 * 1024
val buffer = ByteBuffer.allocate(maxChunkSize)
val bufferInfo = MediaCodec.BufferInfo()

// Extract all frames from selected track
while (true) {
    val chunkSize = extractor.readSampleData(buffer, 0)

    if (chunkSize >= 0) {
        // Process extracted frame here
        // ...

        extractor.advance()

    } else {
        // All frames extracted - we're done
        break
    }
}

MediaExtractor provides two important methods that give you information about the extracted frame: getSampleTime() and getSampleFlags().

getSampleTime() gives you the number of microseconds from the beginning of the track to the start of the current sample. getSampleFlags() gives you the flags that are used by MediaCodec.

These two methods will be important later when we use the frames with MediaMuxer and MediaCodec.

Muxing Frames Into MP4 With MediaMuxer

MediaMuxer is kind of the counterpart to MediaExtractor. It can take various tracks with audio and video and mux them into one final media file.

You initialize the muxer by giving it an output file and an output format. Before you start muxing, you also need to add the tracks that you want in your output file. In my example I got the MediaFormat for each track from MediaExtractor.

Once the muxer is started, you can feed it with compressed frames by calling writeSampleData()

// Init muxer
val muxer = MediaMuxer(outFilePath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
val videoIndex = muxer.addTrack(videoFormat)
val audioIndex = muxer.addTrack(audioFormat)
muxer.start()
...

while (true) {
    // Extract frames (and possibly decode/process/encode)
    // ...

    // Write encoded frames to muxer
    muxer.writeSampleData(videoIndex, buffer, bufferInfo)
}

writeSampleData() takes a MediaCodec.BufferInfo object as its last argument. We can initialize the BufferInfo with information that we get directly from MediaExtractor or MediaCodec.

In my case, I want to take input from an MPEG-4 video file and an AAC/M4A audio file, and mux both inputs into one MPEG-4 output video file. To accomplish that, I created the following mux() method

fun mux(audioFile: String, videoFile: String, outFile: String) {

    // Init extractors which will get encoded frames
    val videoExtractor = MediaExtractor()
    videoExtractor.setDataSource(videoFile)
    videoExtractor.selectTrack(0) // Assuming only one track per file. Adjust code if this is not the case.
    val videoFormat = videoExtractor.getTrackFormat(0)

    val audioExtractor = MediaExtractor()
    audioExtractor.setDataSource(audioFile)
    audioExtractor.selectTrack(0) // Assuming only one track per file. Adjust code if this is not the case.
    val audioFormat = audioExtractor.getTrackFormat(0)

    // Init muxer
    val muxer = MediaMuxer(outFile, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
    val videoIndex = muxer.addTrack(videoFormat)
    val audioIndex = muxer.addTrack(audioFormat)
    muxer.start()

    // Prepare buffer for copying
    val maxChunkSize = 1024 * 1024
    val buffer = ByteBuffer.allocate(maxChunkSize)
    val bufferInfo = MediaCodec.BufferInfo()

    // Copy Video
    while (true) {
        val chunkSize = videoExtractor.readSampleData(buffer, 0)

        if (chunkSize >= 0) {
            bufferInfo.presentationTimeUs = videoExtractor.sampleTime
            bufferInfo.flags = videoExtractor.sampleFlags
            bufferInfo.size = chunkSize

            muxer.writeSampleData(videoIndex, buffer, bufferInfo)

            videoExtractor.advance()

        } else {
            break
        }
    }

    // Copy audio
    while (true) {
        val chunkSize = audioExtractor.readSampleData(buffer, 0)

        if (chunkSize >= 0) {
            bufferInfo.presentationTimeUs = audioExtractor.sampleTime
            bufferInfo.flags = audioExtractor.sampleFlags
            bufferInfo.size = chunkSize

            muxer.writeSampleData(audioIndex, buffer, bufferInfo)
            audioExtractor.advance()
        } else {
            break
        }
    }

    // Cleanup
    muxer.stop()
    muxer.release()

    videoExtractor.release()
    audioExtractor.release()
}

In the example above I assumed there is only one video and one audio track in each input file. If your input file contains multiple tracks (e.g. if you want to take the audio track from a video file), just loop through the tracks and select the right one the same way I showed in the previous section, where I selected the audio track.

The maxChunkSize variable that initializes the buffer for copying was chosen somewhat arbitrarily. You might need a larger buffer for high-resolution video; a smaller buffer might be sufficient when processing only audio frames.
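If you'd rather not guess, the track's MediaFormat often carries a MediaFormat.KEY_MAX_INPUT_SIZE entry that you can use when present. A minimal sketch of that idea (the helper names are mine; `maxInputSize` stands for the value you'd read with format.getInteger(MediaFormat.KEY_MAX_INPUT_SIZE) when format.containsKey(...) is true):

```kotlin
import java.nio.ByteBuffer

// Pick a copy-buffer size: prefer the track's declared max input size when the
// container provides one, otherwise fall back to a generous 1 MiB default.
// maxInputSize models the KEY_MAX_INPUT_SIZE value; null models a format
// that doesn't contain the key.
fun chooseBufferSize(maxInputSize: Int?, fallback: Int = 1024 * 1024): Int =
    if (maxInputSize != null && maxInputSize > 0) maxInputSize else fallback

fun allocateSampleBuffer(maxInputSize: Int?): ByteBuffer =
    ByteBuffer.allocate(chooseBufferSize(maxInputSize))
```
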

Converting to Other Audio Formats

MediaMuxer is fairly strict about what audio format you can mux into the MPEG-4 time-lapse video file. For example, if you tried to mux in an MP3 audio track (assuming that the MPEG-4 container format should support it), you would probably see the following error in your logs when calling MediaMuxer.addTrack()

E/MPEG4Writer: Unsupported mime 'audio/mpeg'

Therefore, if your input audio doesn't have a compatible mime type (like "audio/mp4a-latm"), you first need to convert your audio into the right format/encoding.

For this you can use the MediaCodec class.

In my previous post I used MediaCodec as an encoder to get the compressed frames that could be fed into the muxer. For converting to another audio format, I will use MediaCodec as both decoder and encoder.

I tried to do the decoding and encoding in a similar way as it's done in Android's CTS tests. I wanted to make the method below "copy-pastable", so it's a bit long. Hope you can understand what I'm doing there

fun convert(inFile: String, outFileM4a: String) {

    val extractor = MediaExtractor()
    extractor.setDataSource(inFile)

    // Find Audio Track
    for (i in 0 until extractor.trackCount) {
        val inputFormat = extractor.getTrackFormat(i)
        val mime = inputFormat.getString(MediaFormat.KEY_MIME)

        if (mime.startsWith("audio/")) {

            extractor.selectTrack(i)

            val decoder = MediaCodec.createDecoderByType(mime)
            decoder.configure(inputFormat, null, null, 0)
            decoder.start()

            // Prepare output format for aac/m4a
            val outputFormat = MediaFormat()
            outputFormat.setString(MediaFormat.KEY_MIME, "audio/mp4a-latm")
            outputFormat.setInteger(MediaFormat.KEY_AAC_PROFILE, MediaCodecInfo.CodecProfileLevel.AACObjectLC)
            outputFormat.setInteger(MediaFormat.KEY_SAMPLE_RATE, inputFormat.getInteger(MediaFormat.KEY_SAMPLE_RATE))
            outputFormat.setInteger(MediaFormat.KEY_BIT_RATE,
                if (inputFormat.containsKey(MediaFormat.KEY_BIT_RATE))
                    inputFormat.getInteger(MediaFormat.KEY_BIT_RATE)
                else
                    128 * 1024) // Some extractors don't report a bit rate - fall back to 128 kbit/s
            outputFormat.setInteger(MediaFormat.KEY_CHANNEL_COUNT, inputFormat.getInteger(MediaFormat.KEY_CHANNEL_COUNT))
            outputFormat.setInteger(MediaFormat.KEY_MAX_INPUT_SIZE, 1048576) // Needs to be large enough to avoid BufferOverflowException

            // Init encoder
            val encoder = MediaCodec.createEncoderByType(outputFormat.getString(MediaFormat.KEY_MIME))
            encoder.configure(outputFormat, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
            encoder.start()

            // Init muxer
            val muxer = MediaMuxer(outFileM4a, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)

            var allInputExtracted = false
            var allInputDecoded = false
            var allOutputEncoded = false

            val timeoutUs = 10000L
            val bufferInfo = MediaCodec.BufferInfo()
            var trackIndex = -1

            while (!allOutputEncoded) {

                // Feed input to decoder
                if (!allInputExtracted) {
                    val inBufferId = decoder.dequeueInputBuffer(timeoutUs)
                    if (inBufferId >= 0) {
                        val buffer = decoder.getInputBuffer(inBufferId)
                        val sampleSize = extractor.readSampleData(buffer, 0)

                        if (sampleSize >= 0) {
                            decoder.queueInputBuffer(
                                inBufferId, 0, sampleSize,
                                extractor.sampleTime, extractor.sampleFlags
                            )

                            extractor.advance()
                        } else {
                            decoder.queueInputBuffer(
                                inBufferId, 0, 0,
                                0, MediaCodec.BUFFER_FLAG_END_OF_STREAM
                            )
                            allInputExtracted = true
                        }
                    }
                }

                var encoderOutputAvailable = true
                var decoderOutputAvailable = !allInputDecoded

                while (encoderOutputAvailable || decoderOutputAvailable) {

                    // Drain Encoder & mux first
                    val outBufferId = encoder.dequeueOutputBuffer(bufferInfo, timeoutUs)
                    if (outBufferId >= 0) {

                        val encodedBuffer = encoder.getOutputBuffer(outBufferId)

                        muxer.writeSampleData(trackIndex, encodedBuffer, bufferInfo)

                        encoder.releaseOutputBuffer(outBufferId, false)

                        // Are we finished here?
                        if ((bufferInfo.flags and MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0) {
                            allOutputEncoded = true
                            break
                        }
                    } else if (outBufferId == MediaCodec.INFO_TRY_AGAIN_LATER) {
                        encoderOutputAvailable = false
                    } else if (outBufferId == MediaCodec.INFO_OUTPUT_FORMAT_CHANGED) {
                        trackIndex = muxer.addTrack(encoder.outputFormat)
                        muxer.start()
                    }

                    if (outBufferId != MediaCodec.INFO_TRY_AGAIN_LATER)
                        continue

                    // Get output from decoder and feed it to encoder
                    if (!allInputDecoded) {
                        val outBufferId = decoder.dequeueOutputBuffer(bufferInfo, timeoutUs)
                        if (outBufferId >= 0) {
                            val outBuffer = decoder.getOutputBuffer(outBufferId)

                            // If needed, process decoded data here
                            // ...

                            // We drained the encoder, so there should be input buffer
                            // available. If this is not the case, we get a NullPointerException
                            // when touching inBuffer
                            val inBufferId = encoder.dequeueInputBuffer(timeoutUs)
                            val inBuffer = encoder.getInputBuffer(inBufferId)

                            // Copy buffers - decoder output goes to encoder input
                            inBuffer.put(outBuffer)

                            // Feed encoder
                            encoder.queueInputBuffer(
                                inBufferId, bufferInfo.offset, bufferInfo.size, bufferInfo.presentationTimeUs,
                                bufferInfo.flags)

                            decoder.releaseOutputBuffer(outBufferId, false)

                            // Did we get all output from decoder?
                            if ((bufferInfo.flags and MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0)
                                allInputDecoded = true

                        } else if (outBufferId == MediaCodec.INFO_TRY_AGAIN_LATER) {
                            decoderOutputAvailable = false
                        }
                    }
                }
            }

            // Cleanup
            extractor.release()

            decoder.stop()
            decoder.release()

            encoder.stop()
            encoder.release()

            muxer.stop()
            muxer.release()

            return
        }
    }

    throw IllegalArgumentException("Input file doesn't contain audio track")
}

In the code above I first try to find an audio track inside the provided input audio file.

Once I have the audio track selected, I initialize the encoder, decoder, and muxer.

Note how I initialized the outputFormat variable. I mostly used information from the input format to initialize the mandatory keys of MediaFormat. I just needed to set the right mime type. I also had to adjust KEY_MAX_INPUT_SIZE, because I was getting a BufferOverflowException when feeding data to the encoder (the number I chose might be unnecessarily high, though).

In the loop I always try to drain the output from the encoder and the decoder before feeding new input.

You can do some additional processing of the raw output you get from the decoder before you feed it to the encoder. I'll show how to do this in a later section. Here I only copied the output buffer from the decoder directly to the encoder. According to the docs, the input buffer is already handed out cleared by getInputBuffer(), so you only need to do the copying.

Also note that you might need to experiment with different configurations, and MediaCodec (and the other classes involved) might throw exceptions. You should therefore wrap the code in a try/catch block and make sure you properly clean up and release resources even when an exception is thrown.

The method above should also work with formats other than MP3 (e.g. FLAC). Android supports more formats for decoding than for encoding, so you might be able to convert only one way (e.g. you might be able to convert from Ogg/Vorbis to M4A, but not the other way around).

Adjusting Duration & Time

I wanted to mix some nature sounds that I recorded in a nearby forest into the time-lapse video. The audio file with the nature sounds had a longer duration than the time-lapse video file. When the audio is longer than the video, the muxer makes the final output file as long as the input audio track. The video in the final output file just stops at the last video frame while the audio continues playing.

Therefore, I want to make sure that the audio track duration is only as long as the video duration.

It is fairly simple to accomplish. You only need to check that the sample time you get from MediaExtractor does not exceed your preferred maximum track duration. Once the sample time reaches the maximum duration, you signal the end of stream by manually feeding an empty buffer with BUFFER_FLAG_END_OF_STREAM to the decoder or muxer.

After adjusting the code from the previous section, the audio will finish once maxDurationUs is reached

...
while (!allOutputEncoded) {
    // Feed input to decoder
    if (!allInputExtracted) {
        val inBufferId = decoder.dequeueInputBuffer(timeoutUs)
        if (inBufferId >= 0) {
            val buffer = decoder.getInputBuffer(inBufferId)
            val sampleSize = extractor.readSampleData(buffer, 0)

            if (sampleSize >= 0 && extractor.sampleTime <= maxDurationUs) {
                decoder.queueInputBuffer(
                    inBufferId, 0, sampleSize,
                    extractor.sampleTime, extractor.sampleFlags
                )

                extractor.advance()
            } else {
                decoder.queueInputBuffer(
                    inBufferId, 0, 0,
                    0, MediaCodec.BUFFER_FLAG_END_OF_STREAM
                )
                allInputExtracted = true
            }
        }
        ...

To get the total duration of your input video (which will become the final audio duration), you can do the following with the MediaFormat you got from the MediaExtractor at the beginning

val totalDurationUs = videoFormat.getLong(MediaFormat.KEY_DURATION)

The other case might be that your audio is too short. Here you might add another audio track or simply repeat the previous audio.

I didn't yet have time to test this, but it should be possible to manually adjust BufferInfo.presentationTimeUs before you give it to MediaMuxer.writeSampleData(). You would just add an offset to it, which determines the time at which your track actually starts.

In the same way, you can adjust the presentationTimeUs argument when calling MediaCodec.queueInputBuffer(), in case you want to make this adjustment at the MediaCodec level.

Note that the time units here are always microseconds.
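As a sketch of that idea (untested on a device, and the helper names are mine), the arithmetic for shifting or looping a track boils down to plain microsecond math:

```kotlin
// Shift a sample so its track starts offsetUs into the output file.
// All values are in microseconds, matching MediaExtractor/MediaMuxer.
fun shiftedPtsUs(sampleTimeUs: Long, offsetUs: Long): Long =
    sampleTimeUs + offsetUs

// Presentation time for the n-th repetition of a short audio track that's
// looped to cover a longer video (iteration 0 is the original pass).
fun loopedPtsUs(sampleTimeUs: Long, iteration: Int, trackDurationUs: Long): Long =
    iteration * trackDurationUs + sampleTimeUs
```

You would apply the result either to BufferInfo.presentationTimeUs before writeSampleData(), or to the presentationTimeUs argument of queueInputBuffer().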

Implementing Fade In and Fade Out (Editing Audio)

In the previous sections I only copied the output buffers from the decoder into the encoder. This should be sufficient for a simple format conversion.

However, once you have the raw decoded samples, you might want to do some additional processing before feeding them to the encoder. For example, when you abruptly cut off the audio to match the video's duration, a cracking sound might appear at the end. Therefore I decided to implement simple fade-out and fade-in effects.

The buffers that come out of the decoder contain PCM audio samples. The audio data is channel-interleaved (one sample for each channel). The samples are by default 16-bit signed integers in native byte order. You can also get float samples from the decoder, but then you additionally have to set the KEY_PCM_ENCODING key to ENCODING_PCM_FLOAT in the MediaFormat that you give to the decoder upon configuration. I wasn't sure about float sample support across Android devices, so I decided to work with regular short audio samples.
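One detail to watch when scaling 16-bit samples: a gain factor above 1.0 can push the result outside the Short range, where a plain toShort() wraps around and distorts. A small sketch (the helper name is mine, not from the sample app) that clamps instead:

```kotlin
// Apply a gain factor to interleaved 16-bit PCM samples, clamping the result
// to the Short range to avoid wrap-around distortion on loud samples.
fun applyGain(samples: ShortArray, factor: Double): ShortArray =
    ShortArray(samples.size) { i ->
        (samples[i] * factor)
            .coerceIn(Short.MIN_VALUE.toDouble(), Short.MAX_VALUE.toDouble())
            .toInt()
            .toShort()
    }
```
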

For the fade-in effect I took the decoded samples out of the decoder's output buffer, converted them to shorts, and increased the volume of each sample exponentially within the fadeInDurationMillis period

...
if (bufferInfo.presentationTimeUs < fadeInDurationMillis * 1000) {

    val format = decoder.getOutputFormat(outBufferId)
    val channels = format.getInteger(MediaFormat.KEY_CHANNEL_COUNT)
    val shortSamples = outBuffer.order(ByteOrder.nativeOrder()).asShortBuffer()
    val sampleDurationMillis = 1000.0 / format.getInteger(MediaFormat.KEY_SAMPLE_RATE)
    var elapsedMillis = bufferInfo.presentationTimeUs / 1000.0

    val size = shortSamples.remaining()

    for (i in 0 until size step channels) {
        val progress = elapsedMillis/fadeInDurationMillis
        val factor = progress * progress

        for (c in 0 until channels) {
            var sample = shortSamples.get()

            // Increase volume exponentially
            sample = (sample * factor).toShort()

            // Put into encoder's buffer
            inBuffer.putShort(sample)
        }

        elapsedMillis += sampleDurationMillis
    }
} else {
    // Just copy whole buffer without processing
    inBuffer.put(outBuffer)
}
...

You can see that I loop sample by sample and channel by channel, gradually increasing the volume by an exponential factor that depends on the start time of the given sample.
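The factor computation itself doesn't depend on any Android API, so it can be pulled out into a pure function and unit-tested separately. A sketch (the function name is mine; clamping past the end of the fade window is my addition):

```kotlin
// Quadratic fade-in factor: 0.0 at the start of the fade, 1.0 once
// fadeInDurationMillis has elapsed (clamped beyond the fade window).
fun fadeInFactor(elapsedMillis: Double, fadeInDurationMillis: Double): Double {
    val progress = (elapsedMillis / fadeInDurationMillis).coerceIn(0.0, 1.0)
    return progress * progress
}
```
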

Once I'm past the fade-in duration, I can again just feed the whole output buffer from the decoder directly into the encoder without further processing.

For the fade-out effect I again check BufferInfo.presentationTimeUs to verify whether I'm near the end. Once the fade-out window is reached, I gradually decrease the volume of each sample. Here I decided to use a logarithmic function to calculate the fade-out factor

...
val tillEndMillis = totalDurationMillis - bufferInfo.presentationTimeUs / 1000.0
if (fadeOutDurationMillis >= tillEndMillis) {

    val format = decoder.getOutputFormat(outBufferId)
    val channels = format.getInteger(MediaFormat.KEY_CHANNEL_COUNT)
    val shortSamples = outBuffer.order(ByteOrder.nativeOrder()).asShortBuffer()    
    val sampleDurationMillis = 1000.0 / format.getInteger(MediaFormat.KEY_SAMPLE_RATE)
    var elapsedMillis = fadeOutDurationMillis - tillEndMillis

    val size = shortSamples.remaining()

    for (i in 0 until size step channels) {
        val progress = elapsedMillis / fadeOutDurationMillis
        val factor = ((20.0 * log(progress, 10.0)) / -40.0)
            .coerceIn(0.0, 1.0) // Logarithmic fade out (clamped - log(0) would blow up the first samples)

        for (c in 0 until channels) {
            var sample = shortSamples.get()

            // Fade out
            sample = (sample * factor).toShort()

            // Put processed sample into encoder's buffer
            inBuffer.putShort(sample)
        }

        elapsedMillis += sampleDurationMillis
    }
} else {
    // Just copy whole buffer without processing
    inBuffer.put(outBuffer)
}
...
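The fade-out factor can likewise be isolated into a pure, testable function. Note that the raw formula tends to infinity as progress approaches 0 (log of 0), so I clamp it to [0, 1] in this sketch (the function name and the clamping are mine):

```kotlin
import kotlin.math.log10

// Logarithmic fade-out factor: roughly 1.0 at 1% progress, falling to 0.0 at
// the end of the fade. Clamped because log10(progress) diverges near 0.
fun fadeOutFactor(progress: Double): Double =
    if (progress <= 0.0) 1.0
    else ((20.0 * log10(progress)) / -40.0).coerceIn(0.0, 1.0)
```
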

Conclusion

Android provides APIs for various video processing tasks. You can use the MediaCodec API for decoding and encoding videos efficiently. MediaMuxer allows you to mux audio and video streams into one final video file.

In this post I showed how to add audio to the time-lapse video from previous post. The result looks (and sounds) like this
