Algorithm to mix sound

0 votes
asked Dec 17, 2008 by adam-davis

I have two raw sound streams that I need to add together. For the purposes of this question, we can assume they have the same sample rate and bit depth (say 16-bit samples at a 44.1 kHz sample rate).

Obviously if I just add them together I will overflow and underflow my 16 bit space. If I add them together and divide by two, then the volume of each is halved, which isn't correct sonically - if two people are speaking in a room, their voices don't become quieter by half, and a microphone can pick them both up without hitting the limiter.

  • So what's the correct method to add these sounds together in my software mixer?
  • Am I wrong and the correct method is to lower the volume of each by half?
  • Do I need to add a compressor/limiter or some other processing stage to get the volume and mixing effect I'm trying for?


17 Answers

0 votes
answered Dec 17, 2008 by krusty-ar

If you need to do this right, I would suggest looking at open source software mixer implementations, at least for the theory.

Actually you should probably be using a library.

0 votes
answered Dec 17, 2008 by adam-rosenfield

I'd say just add them together. If you're overflowing your 16 bit PCM space, then the sounds you're using are already incredibly loud to begin with and you should attenuate them. If that would cause them to be too soft by themselves, look for another way of increasing the overall volume output, such as an OS setting or turning the knob on your speakers.

0 votes
answered Dec 17, 2008 by tony-arkles

I think that, so long as the streams are uncorrelated, you shouldn't have too much to worry about; you should be able to get by with clipping. If you're really concerned about distortion at the clip points, a soft limiter would probably work OK.

0 votes
answered Dec 17, 2008 by roddy

You should add them together, but clip the result to the allowable range to prevent over/underflow.

If clipping occurs, you will introduce distortion into the audio, but that's unavoidable. You can use your clipping code to "detect" this condition and report it to the user/operator (the equivalent of the red 'clip' light on a mixer).
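As a rough illustration of add-and-clip with clip detection, here is a minimal sketch in C. The function name mix_clip16 and its signature are made up for this example; it assumes both inputs are signed 16-bit PCM buffers of the same length.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: mixes two 16-bit PCM buffers with saturation.
 * Returns the number of samples that had to be clipped, so the caller
 * can drive a "clip" indicator. */
size_t mix_clip16(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    size_t clipped = 0;
    for (size_t i = 0; i < n; i++) {
        /* Sum in 32 bits so the addition itself cannot overflow. */
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        if (sum > INT16_MAX) {
            sum = INT16_MAX;
            clipped++;
        } else if (sum < INT16_MIN) {
            sum = INT16_MIN;
            clipped++;
        }
        out[i] = (int16_t)sum;
    }
    return clipped;
}
```

A non-zero return value is the software counterpart of the clip light: the caller can show a warning or lower the input gain.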

You could implement a more "proper" compressor/limiter, but without knowing your exact application, it's hard to say if it would be worth it.

If you're doing lots of audio processing, you might want to represent your audio levels as floating-point values, and only go back to the 16-bit space at the end of the process. High-end digital audio systems often work this way.

0 votes
answered Dec 17, 2008 by jon-smock

You're right about adding them together. You could always scan the sum of the two files for peak points, and scale the entire file down if they hit some kind of threshold (or if the average of a sample and its surrounding samples hits a threshold).
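One possible reading of this scan-then-scale idea, sketched in C. The name mix_normalize16 is hypothetical, and the threshold is assumed to be the 16-bit full-scale value; this only works offline, when the whole of both inputs is available up front.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: two-pass mix. Pass 1 finds the peak of the sum,
 * pass 2 scales the whole mix only if that peak exceeds the 16-bit range.
 * Returns the gain that was applied (1.0 means the mix was left alone). */
double mix_normalize16(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    int32_t peak = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        int32_t mag = sum < 0 ? -sum : sum;
        if (mag > peak) {
            peak = mag;
        }
    }

    /* Assumed threshold: 32767, the largest positive 16-bit sample. */
    double gain = (peak > 32767) ? 32767.0 / (double)peak : 1.0;

    for (size_t i = 0; i < n; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        out[i] = (int16_t)(sum * gain);
    }
    return gain;
}
```

Because the same gain is applied to every sample, the mix is never distorted, only made quieter when the peak would otherwise overflow.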

0 votes
answered Dec 17, 2008 by mark-ransom

"Quieter by half" isn't quite correct. Because of the ear's logarithmic response, dividing the samples in half will make it 6-db quieter - certainly noticeable, but not disastrous.

You might want to compromise by multiplying by 0.75. That will make it about 2.5 dB quieter (a factor of about 0.71 gives 3 dB), but will lessen the chance of overflow and also lessen the distortion when it does happen.

0 votes
answered Dec 18, 2008 by mark-heath

Most audio mixing applications will do their mixing with floating point numbers (32 bit is plenty good enough for mixing a small number of streams). Translate the 16 bit samples into floating point numbers with the range -1.0 to 1.0 representing full scale in the 16 bit world. Then sum the samples together - you now have plenty of headroom. Finally, if you end up with any samples whose value goes over full scale, you can either attenuate the whole signal or use hard limiting (clipping values to 1.0).

This will give much better sounding results than adding 16 bit samples together and letting them overflow. Here's a very simple code example showing how you might sum two 16 bit samples together:

short sample1 = ...;
short sample2 = ...;
// convert to floating point, with -1.0 to 1.0 as full scale
float samplef1 = sample1 / 32768.0f;
float samplef2 = sample2 / 32768.0f;
float mixed = samplef1 + samplef2;
// reduce the volume a bit:
mixed *= 0.8f;
// hard clipping
if (mixed > 1.0f) mixed = 1.0f;
if (mixed < -1.0f) mixed = -1.0f;
// scale by 32767 so that +1.0 does not overflow a 16 bit short
short outputSample = (short)(mixed * 32767.0f);

0 votes
answered Dec 5, 2009 by ben-dyer

There is an article about mixing here. I'd be interested to know what others think about this.

0 votes
answered Dec 4, 2011 by glenn-barnett

You can also buy yourself some headroom with an algorithm like y= 1.1x - 0.2x^3 for the curve, and with a cap on the top and bottom. I used this in Hexaphone when the player is playing multiple notes together (up to 6).

float waveshape_distort( float in ) {
  if(in <= -1.25f) {
    return -0.984375f;
  } else if(in >= 1.25f) {
    return 0.984375f;
  } else {
    return 1.1f * in - 0.2f * in * in * in;
  }
}

It's not bullet-proof, but it will let you get up to a level of 1.25, and it smooths the clip to a nice curve. It produces harmonic distortion, which sounds better than clipping and may be desirable in some circumstances.

0 votes
answered Dec 5, 2012 by podperson

I'd prefer to comment on one of the two highly ranked replies but owing to my meager reputation (I assume) I cannot.

The "ticked" answer (add together and clip) is correct, but not if you want to avoid clipping.

The answer with the link starts with a workable voodoo algorithm for two positive signals in [0,1] but then applies some very faulty algebra to derive a completely incorrect algorithm for signed values and 8-bit values. The algorithm also does not scale to three or more inputs (the product of the signals will go down while the sum increases).

So - convert input signals to float, scale them to [0,1] (e.g. A signed 16-bit value would become
float v = ( s + 32767.0 ) / 65536.0 (close enough...))
and then sum them.

To scale the input signals you should probably do some actual work rather than multiply by or subtract a voodoo value. I'd suggest keeping a running average volume and then if it starts to drift high (above 0.25 say) or low (below 0.01 say) start applying a scaling value based on the volume. This essentially becomes an automatic level implementation, and it scales with any number of inputs. Best of all, in most cases it won't mess with your signal at all.
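A very rough sketch of what such an automatic level stage might look like in C. Several things here are assumptions rather than part of the answer: the mix is taken as signed floats around zero (not the [0,1] scaling above), "volume" is measured as an exponential moving average of the absolute sample value, and the AutoLevel struct, its alpha smoothing factor and its max_boost cap are hypothetical. The 0.25 and 0.01 thresholds are the ones suggested above.

```c
/* Hypothetical state for a simple automatic level control. */
typedef struct {
    float avg;        /* running average of |sample| */
    float alpha;      /* smoothing factor in (0, 1]; assumed tuning value */
    float max_boost;  /* assumed cap on the gain applied to quiet signals */
} AutoLevel;

/* Feeds one already-summed sample through the level control and
 * returns the scaled sample. */
float auto_level_process(AutoLevel *st, float mixed)
{
    float mag = mixed < 0.0f ? -mixed : mixed;

    /* Exponential moving average of the signal level. */
    st->avg += st->alpha * (mag - st->avg);

    float gain = 1.0f;
    if (st->avg > 0.25f) {
        /* Level drifting high: pull it back toward 0.25. */
        gain = 0.25f / st->avg;
    } else if (st->avg < 0.01f && st->avg > 0.0f) {
        /* Level drifting low: lift it toward 0.01, within the cap. */
        gain = 0.01f / st->avg;
        if (gain > st->max_boost) {
            gain = st->max_boost;
        }
    }
    return mixed * gain;
}
```

While the average stays between the two thresholds the gain is 1.0 and the signal passes through untouched. A final clip is still needed after this stage, because a sudden loud sample can arrive before the average has caught up.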
