We all know that getting the audio right makes the pictures better, don’t we? Anyone who has been to see a movie that has been created for Object Audio like Dolby Atmos will know that there is something special about it. To figure out what that is, let’s rewind a little and see why we might need it.
Remember the good old days of mono audio on TV? Then you probably have as much grey hair as I do! Mono audio gives a single channel of sound played back through one or more speakers. That’s fine for many genres, but to increase the feeling of being there, in the scene, stereo audio was introduced, so that a left channel and a right channel are transmitted, giving our ears the ability to hear sound moving on either side of us – like we do in real life – while we watch the images.
With the advent of cheaper speakers and cheaper electronics, we moved to surround sound, where we added more speakers to give the impression that sounds could come from in front of and behind us, as well as to either side of us. Today these surround sound systems typically use 5.1 channels. This means that we have five full-bandwidth channels:
- Left Front
- Right Front
- Left Rear
- Right Rear
And one low-bandwidth channel (this is the “.1”):
- Low Frequency Effects (LFE)
This system essentially takes a sound mix from the program creator that is mastered on 5.1 speakers and then maps it onto the 5.1 speakers in your home or in the theater. If you ignore compression in the signal chain, there is almost no processing between the source and the destination listening environment.
What’s changed? Audio processing technology is now very cheap – your cellphone can do more audio processing than dedicated hardware from the 1990s. Also, we have much bigger screens today, and with the rollout of UHD, it is likely that we will be sitting even closer to them. To increase the sense of “being there,” adding a vertical element to the sound can make a dramatic difference – especially for effects like weather, gunshots, birds, etc.
It is impractical to move from a situation where we have six fixed speakers to one where we have hundreds of speakers that position the sound exactly, especially when most of the time there will be little or no sound coming from an individual speaker. Imagine instead a “bed” of audio that is the traditional stereo, or 5.1, mix that you add effects or objects to with an audio stream and some control metadata.
In its simplest form, the UK’s audio description service does just that. Start with a stereo “bed” that is the normal program mix and then add a description for the visually impaired and a control track that adjusts the volume and position (left-right pan) so that a smart decoder can mix the sound together.
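To make the idea concrete, here is a minimal sketch of what such a smart decoder might do with that control track. Everything here is illustrative: the function name, the sample format, and the constant-power pan law are my assumptions, not the actual broadcast specification.

```python
import math

def mix_object_into_bed(bed, obj, gain_db, pan):
    """Mix a mono object (e.g. a description track) into a stereo bed.

    bed:     list of (left, right) sample pairs -- the normal programme mix
    obj:     list of mono samples -- the added description audio
    gain_db: object level from the control track (0.0 = unity gain)
    pan:     position from the control track: -1.0 hard left .. +1.0 hard right
    """
    gain = 10.0 ** (gain_db / 20.0)
    # Constant-power pan law: total energy stays roughly constant
    # as the object moves from left to right.
    theta = (pan + 1.0) * math.pi / 4.0   # 0 .. pi/2
    gl, gr = math.cos(theta), math.sin(theta)
    return [(l + gain * gl * s, r + gain * gr * s)
            for (l, r), s in zip(bed, obj)]
```

The key point is that the bed and the object stay separate all the way to the decoder; the volume and pan are just numbers travelling alongside the audio, so the mix can be adjusted, or the object dropped entirely, at the receiving end.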
A full cinematic system is very similar, but with more objects and more metadata. Each cinema may have a different number and position of speakers depending on budgets. The object sound system provides the “bed” of audio that is mapped onto the speakers in the theater. Each object sound is then mapped to one or more physical speakers at the right time and at the right volume to provide very specific spatial effects for the audience.
Interestingly, the human ear is very good at telling the difference between sound coming from a single point source and sound that is mixed between two speakers. To make sure that we hear certain sounds as being really sharp, some of the metadata forces the sound processor to “snap” the audio to a nearby physical speaker rather than produce a mix that might be more accurate in terms of mathematical position, but sounds more “blurry” to a listener.
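A toy renderer illustrates the difference the snap flag makes. The speaker layout, the coordinates, and the inverse-distance weighting below are all my own placeholder choices (real renderers use proper panning laws), but the snap-versus-pan behaviour is the same idea:

```python
import math

# Hypothetical 2-D room layout: speaker name -> (x, y) position.
SPEAKERS = {
    "L":  (-1.0,  1.0), "R":  (1.0,  1.0),
    "Ls": (-1.0, -1.0), "Rs": (1.0, -1.0),
}

def render_object(pos, snap):
    """Return per-speaker gains for an object at position pos.

    snap=True:  send the whole signal to the nearest speaker,
                giving a sharp point source.
    snap=False: spread the signal across speakers by distance,
                giving a mathematically placed but softer image.
    """
    dists = {name: math.dist(pos, xy) for name, xy in SPEAKERS.items()}
    if snap:
        nearest = min(dists, key=dists.get)
        return {name: (1.0 if name == nearest else 0.0) for name in SPEAKERS}
    # Inverse-distance weighting as a stand-in for a real panner.
    weights = {name: 1.0 / (d + 1e-6) for name, d in dists.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

With snap off, an object near the front-left corner leaks a little into every speaker; with snap on, it collapses entirely into the front-left speaker, which is exactly the “sharper but less precisely positioned” trade-off described above.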
Knowing these basics, you can see how this system might map to a consumer setup where there will be fewer speakers, but by calibrating the room with a built-in microphone, a sound processor could do a good job of mixing the sound between spatial speakers and upward firing speakers to give a pretty good approximation to the 3D sound experience in the cinema.
That’s nice Bruce, but why is this relevant? Currently, we send our content to the listening / viewing environment in a fairly linear way. The mix that is created at the content provider is the mix heard / seen by the viewer. Technologies like IMF are enabling content creators to produce and distribute versions more cheaply. Technologies like object sound with consumer audio processing units allow different objects – like languages, high dynamic range effects (for quiet environments) and low dynamic range effects (for noisy environments) – to be selected and / or mixed at the receiving point. We’re increasingly moving from a linear “here’s my content” world to a component “which bits of the content will give you the best experience” world. It’s a fun time to be in technology.
‘Til next time.