It’s almost 2019, and the general state of audio in games resembles where graphics were many years ago. Significant advancements in graphics technology and workflows have been made since then, and today graphics are well ahead of audio in many respects.
Engines offer a plethora of cutting-edge graphics features, advancements in visuals are always in focus, and my personal perception is that audio has been left behind, both from a workflow and a technological standpoint. For most games, the existing audio concepts and their implementations are sufficient, but there’s so much more that can be done.
It would be a shame not to use the available hardware and software support to build new concepts that would further push what can be achieved audio-wise. It would also be a shame not to push audio hardware the same way graphics hardware was pushed more than two decades ago.
Where is audio now in game development?
More or less, the current audio workflow involves a significant amount of offline audio authoring and audio setup effort.
Audio content is designed or recorded, imported, and deployed into the engine’s 3D-enabled audio players, which are positioned to create the illusion of real-world audio emitters. From there, the audio is routed to mixers / sub-mixers, globally mixed into the audio rendering pipeline, and delivered to the audio output devices.
Optionally, various filters can be applied at the various emitting and mixing stages, but in general that’s pretty much it. Of course, some games / engines / components take it further and build various techniques on top of that, like audio occlusion, geometry-based audio reflections, or HRTF, but the audio rendering in most game engines does not offer these features by default.
Also, the relationship between audio and graphics is usually only indirectly defined, created and maintained by the user. We are pretty much left on our own in managing audio (which sometimes is a plus, but for many may not be).
The static approach to audio content
From a physical audio standpoint, we’re still in the non-PBR era, where it’s easy for a designer to produce audio content that is neither versatile nor physically correct.
The content often statically bakes in physical features that conflict with the game’s dynamics, and trying to add dynamism often comes with significant coding and design effort.
An important advancement was made when real-time spatialization in the form of reverb & echo (delay) became available, allowing artists to design dry content instead of baking the wet reverb signal into the audio. However, a lot of other dynamic audio features are still baked in.
The result is content that is strongly tied to a particular state of the game. This can lead to inconsistencies when transitioning to a different game state, where the baked content might manifest incorrectly from a physical standpoint.
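To make the "dry content, wet at run time" idea concrete, here is a minimal sketch of a feedback delay (echo) applied to dry samples at run time instead of being baked into the asset. The function name and parameters are illustrative, not part of any real engine API.

```python
# Minimal feedback-delay (echo) sketch: the "wet" signal is produced at
# run time from dry samples rather than baked into the audio asset.
# All names here are illustrative, not part of any real engine API.

def apply_echo(dry, delay_samples, feedback=0.5, mix=0.5):
    """Return the dry signal plus a decaying echo delayed by `delay_samples`."""
    out = list(dry)
    for i in range(delay_samples, len(out)):
        # Each output sample feeds back a scaled copy of an earlier one.
        out[i] += feedback * out[i - delay_samples]
    # Blend dry and wet according to `mix`.
    return [(1.0 - mix) * d + mix * w for d, w in zip(dry, out)]

# A single impulse produces a train of decaying echoes.
impulse = [1.0] + [0.0] * 9
wet = apply_echo(impulse, delay_samples=3, feedback=0.5, mix=1.0)
```

Because the echo is computed per frame, the same dry asset stays physically plausible in a small room (short delay) and a canyon (long delay), which is exactly what baking the wet signal in cannot do.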
The proposal: Physically-Based and Unified Audio Rendering
To address some of these limitations, I’m proposing a Unified (and possibly Physically-Based) Audio Rendering approach. Since the transition to a unified audio rendering workflow involves the introduction of several fundamental sub-concepts that would represent the building blocks for the proposed system, let’s have a look at them first.
Unified Audio Rendering
The master-concept. In nature, there’s a strong collaboration between sound (or any type of vibration really) and matter. Just as in the case of light where matter emits light waves that travel through various mediums, are absorbed, reflected, refracted, scattered and so on, matter vibrates and emits sound waves.
The produced sound waves propagate through various mediums, are absorbed, reflected, refracted, and even collide with other sound waves. As they travel, the sound waves, which are essentially a mix of sine waves with different frequencies, amplitudes and phases, add up or cancel out against the sine waves of other sound waves, leading to various acoustic changes.
Game engines approximate this whole process using 2D or 3D audio players that "emit" audio right in the “ear” of the player without being subject to all these phenomena. A more realistic model would account for all emitters, all geometry, all traveling mediums and produce a unified and continuous audio stream, just as it happens in nature.
Therefore, the Unified Audio Rendering is the proposed concept in which the produced audio is the global collaboration of several scene elements instead of a simple all-sources mix-up approach, as we're currently used to.
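The wave superposition described above is easy to demonstrate. The sketch below (plain Python, names are my own) mixes sine waves sample-by-sample, the way overlapping pressure waves combine in air: in-phase waves reinforce, opposite-phase waves cancel.

```python
import math

def sine(freq, phase, n, sr=8000):
    """Sample a unit-amplitude sine wave at sample rate `sr`."""
    return [math.sin(2 * math.pi * freq * t / sr + phase) for t in range(n)]

def superpose(*waves):
    """Sum waves sample-by-sample, as overlapping pressure waves do."""
    return [sum(samples) for samples in zip(*waves)]

n = 64
a = sine(440, 0.0, n)
in_phase = superpose(a, sine(440, 0.0, n))          # constructive: doubles amplitude
out_of_phase = superpose(a, sine(440, math.pi, n))  # destructive: cancels to silence
```

A unified renderer would be doing exactly this kind of summation, continuously, for every emitter and every reflection in the scene.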
Unified Audio Rendering System
The system that provides a definition for and implements the Unified Audio Rendering Pipelines, Audio Renderers, Audio Materials and Audio Shaders. The system can be enhanced to support Physically-Based Audio Rendering.
Unified Audio Rendering Pipeline
The audio rendering sub-pipeline implemented by the Unified Audio Rendering System. It handles the data submitted by the unified Audio Renderers and performs the actual audio rendering, which includes Audio Shader execution, sub-mixing, and outputting the resulting audio stream to the engine’s master audio rendering system.
The Audio Renderers would be entities that “render” an Audio Material. Note, however, that this is not actual synthesis or Audio Shader execution, but rather reading Audio Materials and submitting rendering information to the Unified Audio Rendering Pipeline, pretty much the same way a Mesh Renderer submits rendering data to the graphics rendering system.
Advances in Audio Renderer implementations would allow increasingly complex Audio Materials (powered by perhaps more advanced Audio Shaders) to be designed and supported.
Not necessarily physically-based right from the start, but the Audio Material is an important concept that would allow designers to specify audio rendering properties of a certain game entity. Just as with graphical rendering, various Audio Materials powered by Audio Shaders would be used to describe a game entity from an audio standpoint.
As the Audio Renderer implementations get more complex over time, more complex Audio Materials could be supported as well, and eventually we’ll have fully physically-based Audio Materials that would enable designers to define a game entity’s audio properties through its physical properties, as opposed to relying exclusively on pre-baked content as we mostly do today.
Of course, pre-baked Audio Maps (I’m still thinking about how these would be represented) could still be used to refine the physical audio properties, in the same way various graphical maps are used today to specify per-pixel physical material properties.
The Audio Materials essentially represent a shell that holds game-entity-specific data and a reference to an Audio Shader that will use that data. At rendering time, the data is fed into the Audio Shader that "lives" inside the Audio Material, which in turn produces audio data that is further injected into the final (engine's) audio pipeline.
Just as in the case of graphic shaders, the Audio Shaders would be "scripts" that provide a definition of how the specific information chosen in the Audio Material properties should be used when participating in the Unified Audio Rendering.
There could be many types of Audio Shaders, each type being used in a certain way. For instance, some Audio Shaders would be used for audio synthesis (therefore pure emitters), some would be used for audio processing (absorption, reflection, perhaps suitable for surfaces) and some for audio analysis (perhaps generating certain events based on the analysis). The type of Audio Shader would be specifically decided based on the actual game entity that the Audio Material is applied to.
The Audio Shaders would have properties that would enable an artist to control the data the Audio Shader operates with internally. The properties would be visible in the Audio Material inspector.
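The Audio Material / Audio Shader relationship described above can be sketched as a data structure. This is purely illustrative: every class and method name below is hypothetical, and a real implementation would process sample blocks on an audio thread rather than Python lists.

```python
# Illustrative sketch of the Audio Material / Audio Shader relationship:
# the material is a shell holding entity-specific data plus a reference
# to the shader that consumes it. All names here are hypothetical.

class AudioShader:
    """Defines *how* an Audio Material's data is turned into audio."""
    def render(self, properties, block):
        raise NotImplementedError

class GainShader(AudioShader):
    """A trivial 'processing' type shader: scales incoming samples."""
    def render(self, properties, block):
        gain = properties.get("gain", 1.0)
        return [gain * s for s in block]

class AudioMaterial:
    """A shell holding entity-specific data and a shader reference."""
    def __init__(self, shader, **properties):
        self.shader = shader
        self.properties = properties   # exposed in the material inspector

    def render(self, block):
        # At rendering time, the data is fed into the shader that
        # "lives" inside the material.
        return self.shader.render(self.properties, block)

metal = AudioMaterial(GainShader(), gain=0.25)
quiet = metal.render([1.0, -1.0, 0.5])
```

The synthesis and analysis shader types mentioned above would be further subclasses with the same `render` contract, which is what lets one pipeline drive all of them uniformly.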
Physically-Based Audio Rendering
Given the concepts of Audio Materials and Audio Shaders that describe a game entity's audio properties, a physically-based Audio Material would describe a game entity's audio properties by its physical properties as opposed to exclusively using pre-baked data.
For example, in the case of a (visually) metallic surface, we could use an Audio Material that produces metallic collision sounds when colliding with another hard surface and also exhibits audio reflection. However, there could be many ways in which we could design our metallic Audio Material.
The most basic one is to use an Audio Shader that performs a simple rendering of the metallic output using a few pre-rendered “metal hit” samples. This is the equivalent of using a "diffuse-only" material when rendering graphics.
However, a better approach is to use a physically-based Audio Shader that features physically-based properties and produces the output using physically-based synthesis techniques. This would essentially yield physically-based Audio Rendering.
Of course, the physically-based Audio Shaders would most probably require a new, physically-based audio rendering pipeline.
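One well-known physically-based synthesis technique that fits the metallic example is modal synthesis: an impact on a resonant object is approximated as a sum of exponentially damped sinusoids. The sketch below is a toy version; the mode frequencies, amplitudes and decay rates are invented for illustration, not measured from any real material.

```python
import math

# Toy modal synthesis: a struck metal object approximated as a sum of
# exponentially damped sine partials. The mode data below is made up
# purely for illustration.

def modal_hit(modes, n, sr=16000):
    """Render `n` samples of an impact as a sum of damped partials.
    `modes` is a list of (frequency_hz, amplitude, decay_per_second)."""
    out = [0.0] * n
    for freq, amp, decay in modes:
        for t in range(n):
            env = math.exp(-decay * t / sr)        # exponential decay envelope
            out[t] += amp * env * math.sin(2 * math.pi * freq * t / sr)
    return out

# Inharmonic (non-integer-ratio) partials give the characteristic
# metallic ring; wood or stone would use different mode sets.
metal_modes = [(523.0, 1.0, 6.0), (1414.0, 0.6, 9.0), (2761.0, 0.3, 14.0)]
hit = modal_hit(metal_modes, n=256)
```

The appeal for a physically-based Audio Shader is that the mode set could be derived from the entity's physical properties (size, stiffness, damping) and scaled by the collision impulse, so every hit sounds slightly different for free.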
Programmable Hardware Audio Acceleration
Since we’ve brought the Audio Shaders into the discussion, we’ll make a detour here and address the additional hardware support that would help Physically-Based Audio Rendering reach its true potential.
Some of the concepts we’ve depicted so far, like Physically-Based Audio Rendering using Physically-Based Audio Materials, would most probably require a significant CPU budget in order to work in real time, as their Audio Shaders would likely feature complex computations to achieve their goals.
So naturally, just as in the case of graphics, at some point we might turn to hardware acceleration and build specialized hardware that can off-load the CPU from intensive audio-related operations.
Currently, sound cards feature hardware acceleration in the form of various audio effects executed on the sound card’s DSP, wavetables for sample-based synthesis, hardware additive and subtractive synthesis, and hardware mixing with hardware ring buffers.
On the API level, DirectX features DirectSound, an API that exposes some hardware acceleration functionality. Later on, DirectSound was extended by DirectSound3D, which promised to standardize 3D audio. Further, DirectSound was extended by EAX, which allows an application to use hardware-accelerated effects and hardware buffers.
So, although some audio acceleration technologies are featured on some sound cards and supported by some APIs, audio acceleration is still pretty much in the 3dfx Voodoo era. Exciting times ahead.
The hardware audio acceleration could be further extended to support custom programs executed right on an audio device’s DSP (APU?) in order to off-load the CPU. These DSP programs would essentially be Audio Shaders. Who knows, perhaps someday mainstream sound cards will even implement hardware audio ray-tracing. (A-RTX? 😊)
However, until programmable audio sound cards become "a thing", we're pretty much left with the following options to execute the Audio Shaders:
Execute the Audio Shaders on the CPU - The Audio Shaders would have to be super-lightweight from a performance overhead perspective and that may force us to make a compromise between functionality and performance.
Execute the Audio Shaders on the GPU - Perhaps only marginally better than running on the CPU, since most games are GPU-bound, not CPU-bound, but in some cases it might prove a good option.
For sure, the desirable option would be to run the Audio Shaders on dedicated audio hardware. This would yield the least compromise between functionality and performance.
How would all this work in Unity?
Now that we have an overview of the proposed concepts, let’s see how they would actually work in Unity. Of course, there are multiple ways in which these concepts could be approached and materialized in a certain application.
And of course, there are even more ways in which these approaching strategies could be implemented, but let’s address things progressively and see how we can build our way up towards the ideal implementation.
Installation & initial setup
The Unified Audio Rendering System could be either:
A separate package that a user could download and import into the project.
An integrated part of Unity itself, just as the graphics rendering system is.
Provided that the Unified Audio Rendering System is available in the project one way or another, from an initial setup standpoint, the Unified Audio Rendering pipeline or system might be configured to accommodate various project-specific needs.
Participating in the Unified Audio Rendering: Audio Renderers
It should then be as easy as dropping an Audio Renderer component on any game object we wish to be accounted for by the Unified Audio Rendering System, hence participating in the Unified Audio Rendering.
Defining the game objects' audio properties: Audio Materials
The Audio Renderer would then reference one or more Audio Materials which would describe the game object from an audio perspective.
As an automation aid, perhaps upon adding the Audio Renderer, a default Audio Material would be assigned and automatically pre-configured based on the graphic and physics material’s physical properties.
For instance, if your mesh has a metallic material assigned, the Audio Material could inherit some of the graphic material's properties and be automatically pre-configured to be metallic as well. Then the user could further go ahead and configure the assigned Audio Material or create a new one using the Audio Shader of choice. Just an idea.
The Audio Shaders are the workhorses of the entire system, standing at the core of every Audio Material. As mentioned before, the Audio Material essentially represents a shell that holds game-entity-specific data and a reference to an Audio Shader that will use that data when executed. At rendering time, the data is fed into the Audio Shader that "lives" inside the Audio Material.
Some Audio Shaders could be an integrated part of the system, but the user could just as well write their very own custom Audio Shader that handles audio like no other. I’m already thinking of some cool plasma gun audio being dynamically rendered with the help of an FFT-based synthesis Audio Shader. This is where you could get very creative.
How about a footstep audio "system"? There should be no "system" for such a use-case in the first place. It should be as easy as assigning the proper Audio Materials to the various ground surfaces in the scene and assigning a “boots” Audio Material to the character’s “boots”, which would collide with the ground from time to time to simulate walking, running, jumping, falling, etc.
At run-time, the Unified Audio Rendering System would handle collisions between the game objects that participate in the Unified Audio Rendering (the ones that have an Audio Renderer component), fetch their Audio Materials and produce the right collision sounds with the help of the Audio Shaders powering the Audio Materials. Among these collisions, we conveniently have the character’s feet colliding with the ground, producing footstep sounds.
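The collision-driven flow above can be sketched in a few lines. Everything here is hypothetical pseudocode-in-Python: the system registers materials per entity, and on a contact event it looks up both participants and hands them to the shader layer, here stubbed out as a function that merely describes the sound it would render.

```python
# Sketch of the collision-driven footstep flow: on a contact event, the
# system looks up both entities' Audio Materials and asks the shader
# layer for an impact sound. Every name below is hypothetical.

def render_impact(material_a, material_b, impulse):
    """Stand-in for Audio Shader execution: describe the sound the pair
    of materials would produce for a collision of strength `impulse`."""
    return (material_a["surface"], material_b["surface"], round(impulse, 2))

class UnifiedAudioRenderingSystem:
    def __init__(self):
        self.materials = {}                  # entity id -> Audio Material

    def add_renderer(self, entity, material):
        """Equivalent of dropping an Audio Renderer on a game object."""
        self.materials[entity] = material

    def on_collision(self, entity_a, entity_b, impulse):
        # Only entities that participate (i.e. have an Audio Renderer
        # and an Audio Material) produce sound; others are ignored.
        a = self.materials.get(entity_a)
        b = self.materials.get(entity_b)
        if a is None or b is None:
            return None
        return render_impact(a, b, impulse)

system = UnifiedAudioRenderingSystem()
system.add_renderer("ground", {"surface": "gravel"})
system.add_renderer("left_boot", {"surface": "leather_boot"})
step = system.on_collision("left_boot", "ground", impulse=0.8)
```

Note that nothing here is footstep-specific: the same dispatch covers a dropped crate or a ricochet, which is exactly why no dedicated footstep "system" should be needed.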
Shader-wise, the Audio Materials could range from basic sample playback (trivial playback of pre-baked audio assets as we’re already used to) to advanced physically-based Audio Materials powered by complex synthesis Audio Shaders.
Audio Rendering Pipeline support
Of course, more advanced Audio Materials would perhaps require more advanced audio rendering support in the Unified Audio Rendering System / Pipeline.
For example, perhaps we assign a highly reflective Audio Material to all the indoor walls of the scene. Naturally, we expect reverberation to occur. For that, some significant “audio reflection” support would have to be implemented on the Unified Audio Rendering System / Pipeline side, as the Audio Shaders alone would probably not be sufficient (or efficient) to achieve this rather global behavior.
Think of the new HD Render Pipeline: the Lit shader cannot work on the previous rendering pipelines simply because those pipelines don’t know about some of the new features the Lit shader uses. Most probably, advancements in Audio Shader development would have to be constantly backed by advancements in the Unified Audio Rendering System / Pipeline.
Here are some hard requirements for anyone to consider when implementing the proposed system:
Non-disruptiveness - The system should be able to co-exist and operate with the existing audio content and existing audio setups / workflows.
Incremental and partial integrability - It should be possible to adopt the system in a partial or incremental fashion. That is, the system should allow for using only the components that are needed and not force the user to perform a full integration if not necessary. For example, a user could create Audio Materials just for the game entities they need to. Also, the user could use only the basic, pre-defined Audio Shaders without having to write their own custom Audio Shaders. And of course, other components in the scene should work fine with standard pre-rendered audio content played the classic way using Audio Source components. This requirement is strongly related to the non-disruptiveness requirement.
High workflow efficiency - The system should be incorporated, set up and used with minimum effort and maximum efficiency. It sounds generic and utopian, but the user should not have a hard time or spend significant effort when using the system; otherwise we’re defeating the whole purpose of the concept, which is to make things easier.
High performance and low execution overhead - The system should not add significant CPU performance overhead. As the overall performance is driven by multiple factors, some of which are externally defined (there will always be that crazy user-defined Audio Shader that cripples the entire audio rendering pipeline), this is a shared requirement that must be dispatched to all involved sub-components of the system regardless of them being defined by the system or by the user.
I hope this article brought into your perspective some potential paths audio in games could take in order to advance to a more comfortable and capable state. The goal is to empower game developers with new workflows and tools that would enable them to achieve things that were previously impossible or perhaps not yet thought of.
I also hope it got you to think about game audio in a different way, possibly from a (hopefully very exciting) perspective that could prove more likely to generate significant advancements in game audio technology than the perspectives we’re used to.