Object Audio Signal Processing Overview


Object-Based Audio: A Signal Processing Overview
Sachin Ghanekar

Agenda
- A brief overview & history of digital audio
- Basic concepts of object audio & how it works
- Signal processing in object-based audio on headphones
- Signal processing in object-based audio on immersive speaker layouts
- Trends and summary

Audio Signals: A snapshot view

Parameter | Values & Ranges | Comment
Bandwidth | 20 Hz to 20 kHz | Audible range for the human ear.
Sound Pressure Level (Signal Energy) | 0 to 120 dB SPL (1e-12 to 1 W/m2) (2e-5 to 20 Pa) | Range of sounds acceptable to the human ear. Note: 1 Pa = 1e-5 bar. Humans can hear sound waves creating pressure changes of less than a billionth of atmospheric pressure.
Loudness Resolution | OR 0.5 dB | Smallest volume-level change that is perceived by the human ear.
Digital Audio Sampling Rates | 32, 44.1, 48 kHz | Oversampled signals up to 192 kHz.
Recording Levels | 120 dB SPL audio input to ADC => 0 dB full-scale digital output | 120 dB SPL for full-range audio; 94 dB SPL for normal-range audio.
Sample Resolution | 16 to 24 bits per sample | 16 bits (low-resolution internet audio) & 24 bits (full-resolution audio).

A Brief History of Audio

[1979 to 1993]: Digital PCM Audio
- Sampled PCM audio signals: LD (1979), CD (1982), DAT (1987), Sony MD (1992)
- MIDI audio (1983)

[1991 to 2003]: MPEG Stereo Audio
- Digital stereo audio compression standards: ISO mpeg1,2 and mp3 (1991 to 1998), mp4-aac (2003)

[1996 to 2007]: Digital Radio/TV Broadcast Audio
- Digital video & stereo audio compression standards
- DTV/ATSC standards for TV broadcast
  Video: 1998 (mp1), 2009 (H264/AVC)

  Audio: mp2 and Dolby AC-3 multi-channel
- Digital radio broadcast: DAB stereo (1995 (mp2), 2006 (aac) till now)

[1997 onwards]: Immersive Multi-Channel Audio (Cinema/Home)
- Dolby/DTS multi-channel immersive audio for home and theater
- VCD (1994, mp2 audio), DVD (1997, ac3/dts 5.1 audio), Blu-ray (2006, dd+/dts-hd audio)
- Dolby/DTS encoded sound tracks for movies, songs & concerts

[1987 onwards]: Audio for Gaming Devices
- MIDI and FM synthesis (synthetic audio sounds): ATARI gaming consoles 1987+, Game Boy, Nintendo gaming gadgets 1989-1996
- CD-audio tracks (natural + synthetic audio) [1994 onwards]: SEGA Saturn, Sony PS 1994 onwards with PCM stereo audio; Microsoft XBOX, Sony PS2 with immersive audio (Dolby/DTS 8-channel compressed audio). Next: XBOX, PS2+ 3D audio experience.
- Immersive, 3D-audio tracks (immersive audio & sounds) [2014 onwards]: XBOX One, Sony PS3 /

[2005 onwards]: Audio for Internet Streaming & Mobile Devices
- Most popular stereo standards (mp3, aac, wma, real)
- Low-power DSPs, low bit-rate stereo audio standards
- Bitrate requirements are 40 to 256 kbps; high-resolution up to 640 kbps.

[Timeline figure: 1979 Digital PCM Audio; 1987 PC & Gaming Audio; 1991 MPEG Stereo Audio; 1995 DTV & DAB Broadcast; 1997 Theater, Cinema Multi-channel Audio; 2005 Internet Streaming & Mobile Audio; 2013 Object-Based Audio]

Basics of Object-Based Audio

What is Object-Based Audio?
Audio which is generated by a stationary or moving object, OR

a class of objects that are grouped together as a collective source of sound.

Some examples of audio objects are:
- A stage artiste
- Chorus
- Cheering crowd at a cricket ground
- Ocean waves, wind blowing
- Birds chirping
- An airplane or a helicopter
- A moving train
- A bullet fired
- A monologue or a dialogue

Examples of Object-Based Audio
An audio scene or sound field is generated by mixing (not just adding) audio signals from multiple objects. Following are a few examples.
- Watching a football match in a stadium with the home crowd (sports TV channel). 3 objects: home crowd as a ring object, commentator as a point object, players-umpire conversations as another object.
- Participating as a player in a field game (computer games). 4+ objects: home crowd as a ring object around you, commentator as a point object, the player's own voice responses as a point object, other players and the umpire as multiple moving objects.
- Being part of a scuba-diver team searching for an underwater treasure (VR).
- Listening to a conversation between different actors & backgrounds in a movie scene (cinema).
- Attending a music concert or a simple Hindustani classical music mehafil (concert).

Basics of Object-Based Audio
A YouTube example of an audio scene with different objects:
http://youtube search "UDK + SuperCollider for real-time sound effect synthesis - demo 6
Observe the video carefully & identify:
- Number of audio objects present in the scene
- Shape of the objects
- Movement of those objects
- Appearance / disappearance of objects
- Properties of the audio signals generated by these objects

Channel-Based Audio vs Object-Based Audio

Content Creation
  Channel-based immersive audio: Each signal track is associated with a specific speaker feed & setup at the listener end. Content is created for a specific listener environment or setup (mobile, home, or theater).
  Object-based audio: Audio-object-based signal tracks are independent of the speaker setup. => The content created is independent of the listener environment or setup (mobile, home, or theater).
Playback at Listener End
  Channel-based immersive audio: At the listener end, the contents (channels) are mapped onto the user speaker setup. Need to use predefined channel mapping to headphones, stereo speakers, 2.1, 5.1, 11.1 etc.
  Object-based audio: At the listener end, the objects are mapped onto the user speaker setup. Objects, based on their positions and movements, are mapped on the fly to the speaker setup.

Content Creation
  Channel-based immersive audio: With the recorded contents or tracks as inputs, each channel track is carefully designed and created at the recording studios, OR at the game developer studios, for creating good immersive effects.
  Object-based audio: Audio objects can simply be identified and encoded as separate tracks. The associated metadata should be carefully designed to capture shape, movement, and appearance/disappearance of the objects, assuming the listener is at the center.
Playback at Listener End
  Channel-based immersive audio: If the content-target speaker setup == user speaker setup, then simple mapping and playback. Else use some good pre-defined maps and delays for rear speakers to create the content.
  Object-based audio: Objects are decoded to create audio signals. Frame by frame, positions of active objects are mapped onto user speakers in the form of gains and delays for these objects. Mix and playback.

Content Creation
  Channel-based immersive audio: Creation is a complex, careful process. The encoding steps and procedure are complex and hence are carried out by skilled, well-trained sound designers.
  Object-based audio: Creating and encoding object audio is a relatively simpler process and can be done without much pre-thinking about user setups & environment. Audio object metadata needs to be carefully associated with it.
Playback at Listener End
  Channel-based immersive audio: Decoders and renderers are fairly simple.
  Object-based audio: Decoders are simple (as simple as channel-based audio). However, the renderers are much more complex. Renderers need to map these objects, with their positions, to speakers on a frame-by-frame basis.

Pros & Cons of Object-Based Audio
PROS
- Richer immersive experience.
- Better user control & available choices for different audio settings & preferences.
- The same content can support a much larger variety of playback speaker setups, from simple headphones to a 22.2 setup.
- Readily maps to gaming & VR requirements, where the user context is NOT defined and varies based on the user's navigation.
CONS
- Decoder + renderer complexity is about 2x to 3x higher. Hence the power consumption during playback of such a stream is higher.

A Basic Summary of Object-Based Audio
A typical object-audio encoded stream contains 8 to 16 encoded audio object tracks.

And each audio-object track has two parts:

Object Audio Track
  Enc-Audio-Obj Frame 1 | Enc-Audio-Obj Frame 2 | ... | Enc-Audio-Obj Frame N
  MetaData-Obj Frame 1 | MetaData-Obj Frame 2 | ... | MetaData-Obj Frame N

Compressed PCM Audio-Data: standard DD or AAC encoded audio signals associated with a specific object.
Meta-Data:
  Stream-level: max number of audio objects present in the scene.
  Frame-level: shape of the object; data related to position and speed of the object; appearance / disappearance of the object.

Revisit Basics of Object-Based Audio
A YouTube example of an audio scene with different objects:
http://youtube search "UDK + SuperCollider for real-time sound effect synthesis - demo 6
Observe the video carefully & identify:
- 4-5 objects of different shapes appear, move and disappear w.r.t. the listener.
- Footsteps, lava pond, whirling wind, flowing stream, dripping water.

Object-Based Audio Stream Decoding & Rendering
Decoding: The basic audio from an object is encoded using standard legacy encoders. Therefore, decoding uses standard mp3, aac, dolby-digital decoding to provide basic audio PCM for the object.
Renderers: The challenges are in rendering the decoded object-based audio PCM contents & using the object's shape/motion metadata to create:
- An immersive audio experience on headphones (VR, gaming, smartphones, and tablets)
- An immersive audio experience on our multi-speaker layouts at homes or theaters

Object-Based Audio Renderer on Headphones
HRTF Model (Head-Related Transfer Function)
[Figure: ear impulse & frequency response for the orientation shown in the picture (90° azimuth, 0° elevation) from the WK SDO set [2,3]. Diagram of spherical coordinate system (Wightman and Kistler, Univ. of Wisconsin, 1989) [1].]
(Note: 5 to 9 msec duration impulse response width @ 16 kHz sample rate)

Object-Based Audio Rendering on Headphones
[Block diagram: decoded audio object PCM data -> gain / delay module -> FIR filter pair (HRTF_Left, HRTF_Right) -> MIXER. The decoded audio object metadata (dist, azi, ele) drives a distance -> gain & delay mapping (R) and an "interpolate & compute HRTF_Left & HRTF_Right" block that reads from an HRTF_L/R filter pool stored for different values of φi & θi.]
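As a rough illustration of the rendering chain in the block diagram (distance -> gain/delay mapping, FIR filter pair, mixer), the following Python sketch renders decoded object PCM to a left/right pair. It is only a sketch under stated assumptions: the function names, the 1/R gain law, the whole-sample delay and the 343 m/s speed of sound are illustrative choices that do not come from the slides, and the impulse responses passed in stand for whatever the HRTF filter pool would supply.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value, not from the slides


def fir_filter(samples, taps):
    """Direct-form FIR convolution, truncated to len(samples) outputs."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k < 0:
                break
            acc += tap * samples[n - k]
        out.append(acc)
    return out


def distance_to_gain_delay(distance_m, sample_rate_hz, ref_distance_m=1.0):
    """Map distance R to an assumed 1/R gain and a delay in whole samples."""
    gain = ref_distance_m / max(distance_m, ref_distance_m)
    delay = int(round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz))
    return gain, delay


def render_object(pcm, distance_m, hrir_left, hrir_right, sample_rate_hz=48000):
    """Gain/delay module followed by the HRTF_Left / HRTF_Right FIR pair."""
    gain, delay = distance_to_gain_delay(distance_m, sample_rate_hz)
    delayed = ([0.0] * delay + [gain * s for s in pcm])[:len(pcm)]
    return fir_filter(delayed, hrir_left), fir_filter(delayed, hrir_right)


def mix(rendered_objects):
    """Sum the (left, right) outputs of all active objects sample by sample."""
    length = len(rendered_objects[0][0])
    left = [sum(obj[0][n] for obj in rendered_objects) for n in range(length)]
    right = [sum(obj[1][n] for obj in rendered_objects) for n in range(length)]
    return left, right
```

In a real renderer the two impulse responses would be re-selected or interpolated whenever the object's angles change, which is where the frame-to-frame smoothing problem comes from.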

Object-Based Audio Rendering on Headphones: Signal Processing Challenges
[Block diagram: decoded audio object PCM data -> gain / delay module -> FIR filter pair (HRTF_Left, HRTF_Right) -> MIXER, with the filters and the distance -> gain & delay mapping driven by the decoded object metadata (dist, azi, ele; i.e. R, φ & θ) and a preset FIR filter pool.]

The parameters R, φ & θ change every frame (20-30 msec)
=> filter coefficients and gain change every frame.
- This may cause glitches and distortions in the outputs.
- Need for techniques to adaptively & smoothly change those coefficients.

There are multiple objects & some appear and disappear after a few frames
- Need for on-the-fly allocation, update and destruction of object PCM + associated metadata memory.
- Need for fade-in / fade-out / mute of output PCM samples.
- Need for a well-designed multi-port PCM mixing module.

Some objects move very rapidly
=> R changes w.r.t. time, and the speed of the object is substantial, causing a Doppler effect on the audio signal (e.g. a fast train passing by).
- Need pitch shifting (variable delay) to be introduced on top of the gain application module.
- Oversampling & interpolation would be required.

VR / computer games related: head / joystick movements change R, φ & θ
- An additional head-tracking or joystick-movement module feeds the user orientation parameters Ru, φu & θu.
- An additional module performs 3-D geometry computations to derive the final object position parameters from the above 2 sets of R, φ & θ.
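One common way to avoid glitches when an object's gain changes from frame to frame, and to fade objects in and out as they appear and disappear, is to ramp the gain linearly across each frame instead of switching it at the frame boundary. The Python sketch below illustrates that idea only; the function names and the linear ramp shape are assumptions for illustration and are not taken from the slides.

```python
def apply_gain_ramp(frame, start_gain, end_gain):
    """Scale a frame with a gain that moves linearly from start_gain toward end_gain."""
    n = len(frame)
    if n == 0:
        return []
    step = (end_gain - start_gain) / n
    return [s * (start_gain + step * i) for i, s in enumerate(frame)]


def render_frames(frames, target_gains, initial_gain=0.0):
    """Apply per-frame target gains, ramping from the previous frame's gain.

    Starting from initial_gain=0.0 gives a fade-in for a newly appearing
    object; a final target gain of 0.0 gives a fade-out.
    """
    out = []
    current = initial_gain
    for frame, target in zip(frames, target_gains):
        out.extend(apply_gain_ramp(frame, current, target))
        current = target
    return out
```

The same cross-fade idea can be applied to filter changes, for example by running the old and new HRTF filters in parallel for one frame and blending their outputs with complementary ramps.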

Object-Based Audio Rendering on Headphones: A Quick Example on YouTube
YouTube: RealSpace 3D v0.9.9 Audio Demo - YouTube (courtesy: http://realspace3daudio.com/demos/)

Object-Based Audio Rendering on Immersive Speaker Layouts
Examples of Immersive Speaker Layouts
[Figures: DTS / DD+ 7.1 speaker layout; Dolby ATMOS 11.1 speaker layouts]

[Figures: DTS-X 7.2.4 speaker layout; DTS Neo:X 11.1 speaker layout; Auro-3D 11.1 speaker layout; Auro-3D 13.1 speaker layout]

Observations on most recent Immersive Speaker Layouts
1. The additional speakers and their positions are fixed as recommended by the home-theatre AVR vendors & content creators.
2. A couple of front wide speakers are added in front to cover the azimuth angle better.
3. About 2 to 3 speakers at higher elevations are also added to cover audio coming from higher elevation angles.
4. In some cases, there are direct overhead speakers fitted in the ceiling (or their effect is virtually created by upward-tilted speakers using ceiling reflections).

Object-Based Audio Renderer on Immersive Speaker Layouts: Two Main Techniques
VBAP (Vector Based Amplitude Panning): mapping object audio to a virtual speaker array.
HOA (Higher Order Ambisonics): creating the desired sound field at the listener's sitting position.

VBAP (Vector Based Amplitude Panning):

- A large array of virtual speaker positions is assumed to surround the listener.
- Audio objects and their motions / positions w.r.t. the listener are mapped onto this larger set of virtual speaker positions.
- The audio signal for each object is mapped onto these virtual speaker positions using the VBAP method.
- The audio associated with the virtual speakers is then mapped to standard user speaker layouts using pre-defined down-mixing matrices & a set of delays.

VBAP-based object rendering on Immersive Speaker Layouts
Vector Base Amplitude Panning [Pulkki 1997]
- 3D-VBAP describes/derives the sound field of an object placed on the unit sphere by means of 3 relevant channel unit vectors.
- These channel position vectors need not be orthogonal to each other; they correspond to the nearest speaker positions (real or virtual).
- When the 3 channel position vectors are orthonormal (e.g. on the x, y, z axes), the 3D-VBAP mapping simplifies to a 1st-order mapping.
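A minimal Python sketch of the 3D-VBAP gain computation for one speaker triplet: the object vector P is written as g1*L1 + g2*L2 + g3*L3, where L1, L2, L3 are the three speaker position vectors, and the gains are found by inverting that relation. Cramer's rule is used here in place of a stored inverse matrix purely to keep the example dependency-free; the function names are hypothetical.

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix whose rows are the vectors a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def vbap_gains(p, l1, l2, l3):
    """Solve p = g1*l1 + g2*l2 + g3*l3 for the gains (g1, g2, g3)."""
    d = det3(l1, l2, l3)
    if abs(d) < 1e-12:
        raise ValueError("speaker vectors are coplanar; no unique gains")
    # Cramer's rule: replace one speaker row at a time with the object vector.
    return (det3(p, l2, l3) / d,
            det3(l1, p, l3) / d,
            det3(l1, l2, p) / d)


# With orthonormal speaker vectors the gains are just the components of p,
# which is the simplified 1st-order case mentioned above.
gains = vbap_gains((0.6, 0.8, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
```

A negative gain indicates that the object direction lies outside the chosen speaker triangle, so a renderer would normally pick the triplet for which all three gains are non-negative.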

VBAP gain computation [Pulkki 1997]

$P = [g_1, g_2, g_3] \times [L_1; L_2; L_3] = g_1 L_1 + g_2 L_2 + g_3 L_3$ = object position & loudness vector,

where $g = [g_1, g_2, g_3]$ is the 1x3 gain vector and $L = [L_1; L_2; L_3]$ is the 3x3 matrix whose rows are the x, y, z coordinates of the virtual speaker positions $L_1, L_2, L_3$. $P$ is the audio object representation vector with direction & amplitude.

$[g_1, g_2, g_3] = P \times L^{-1}$

- Typically, the space around the listener is divided into 80 to 100 valid triangular meshes or regions. The object is mapped into one of the regions.
- The object-audio stream is created by encoding the P values, which are sent as metadata for the object.
- At the renderer, the matrices $L^{-1}$ are pre-computed and stored for the triangular meshes. The gains $g_1, g_2, g_3$ are calculated as $P \times L^{-1}$ for the object.
- The gains are applied to the audio object PCM data to create the audio signals to be played at the virtual speaker positions.

HOA-based object rendering on Immersive Speaker Layouts
Higher Order Ambisonics [Gerzon 1970]
Creates the sound field generated by the audio object(s) as it would be captured by directional microphones located at the listener's position.

[Figures: first-order Ambisonics fields; second-order Ambisonics fields; an Ambisonic microphone]
- HOA channels are encoded; these channels are decoded and then mapped onto any standard user speaker layout such as 5.1, 7.2.4 or 13.1. These mappings are easy and less complex.
- The HOA technique makes it easy to modify the sound field for different user (listener) orientations (required mainly in VR & computer gaming).

Rendering on Headphones using intermediate audio of Immersive Speaker Layouts
In industry, the following technique is typically used (in non-gaming applications) to render object-based content on headphones:
- Decode and render the object-based content for an immersive speaker layout.
- Map the immersive speaker layout audio signals to headphones using binaural rendering.
Advantages are:
- The same content can be decoded for an immersive speaker layout, theater, or headphones.
- Reduced complexity: if the stream carries standard immersive channel configurations as a substream, then this can be done as multi-channel decoding followed by binaural rendering.

Binaural Rendering: Immersive Speakers -> Headphones
Depending upon the fixed φ & θ angles of the front, rear & overhead speakers w.r.t. the left and right ears of the listener, HRTF_Left and HRTF_Right are applied & mixed as below.
  Left Rear Speaker Signal Lsr -> HRTF_LeftEar_Lsr, HRTF_RightEar_Lsr
  Right Rear Speaker Signal Rsr -> HRTF_LeftEar_Rsr, HRTF_RightEar_Rsr
  Left Front Speaker Signal Lf -> HRTF_LeftEar_Lf, HRTF_RightEar_Lf
  Right Front Speaker Signal Rf -> HRTF_LeftEar_Rf, HRTF_RightEar_Rf
  Overhead Speaker Signal Oh -> HRTF_LeftEar_Oh, HRTF_RightEar_Oh
The HRTF_LeftEar outputs are summed (+) into the left headphone signal and the HRTF_RightEar outputs are summed (+) into the right headphone signal.

Let's Listen
YouTube: Hear New York City in 3D Audio (Binaural Audio). Duration of interest: 1.30-6.00 min

Industry Trends for Object-Based Audio
Reference material based on object audio:
- Dolby, DTS and Fraunhofer are introducing their freshly designed audio standard reference IPs, which support object-based audio content for cinema, home theater and digital broadcasts.
- The gaming industry and VR gadget manufacturers are welcoming this trend in their domains.
For listeners, this trend will provide:
- A better immersive experience for the latest movie releases and audio content on their existing home theater setup.
- Streaming content & games on mobile phones / tablets would sound much richer.
- Sports channels & movies will carry more options for user controls to select audio environments for watching their favorite match.

- Upgrading to the latest AVRs and soundbars with wide front and ceiling or overhead speakers (actual or virtual) will provide the best effects.
For DSP solution providers:
- Higher computational burden on audio DSPs to decode and, especially, to perform the rich post-processing needed to render the content.
- The audio DSP gets burdened with 3-D geometry computations; computations related to square root, sine, cosine and matrix operations are required.
- Compute power required to decode this content is going to be 2 to 3x higher, thus battery drain may be higher.

Summary of Object-Based Audio
- Each sound source is captured as an object, which is associated with compressed audio as payload and motion/shape information as metadata.
- For a listener using headphones, objects along with their motions are mapped to the left and right ears from a pool of pre-generated HRTFs (head-related transfer functions).
- For a listener sitting in a living room or cinema theater, objects along with their motions are mapped onto an array of immersive speaker layouts using the VBAP and HOA methods.
- Content creators also use a combination of the above two methods, called binaural rendering, to render the content for creating immersive audio perception on headphones.

Thank you
Q & A Session?

References
[1] http://youtube search "UDK + SuperCollider for real-time sound effect synthesis - demo 6
[2] http://alumnus.caltech.edu/~franko/thesis/Chapter4.html (HRTF related discussions & details)
[3] Headphone simulation of free-field listening. I: Stimulus synthesis, J. Acoust. Soc. Am., 1989, Wightman and Kistler ("SDO" HRTF by Wightman and Kistler from the Department of Psychology and Waisman Center, University of Wisconsin-Madison, provided a basis for HRTF research)
[4] http://youtube search " RealSpace 3D v0.9.9 Audio Demo YouTube. Ref: http://realspace3daudio.com/demos/
[5] http://slab3d.sourceforge.net/ (simple HRTF code to try out simulations of moving audio objects)
[6] Virtual Sound Source Positioning Using VBAP, Ville Pulkki, Journal of the Audio Engineering Society, Vol 45, No 6, June 1997
[7] Spatial Sound Technologies and Psychoacoustics, presentation by V. Pulkki at the IEEE Winter School, 2012, Crete, Greece
[8] An Introduction to Higher Order Ambisonic, Florian Hollerweger, Oct 2008
[9] http://youtube search Hear New York City in 3D Audio (1st short clip)
