How a Multi-Threaded Renderer Works


Introduction

Many game engines sport a special kind of renderer called a "multi-threaded renderer," and they have been around for quite some time. When I first started working with game engines, I didn't really understand what a multi-threaded renderer was or how it worked. In this article, I will teach you how they are implemented so that you can better understand how to use them in modern game engines like Unity.

Why Use Threads at All?

Imagine you own a restaurant and have 15 employees, but you only give a single employee work to do. That one employee must interact with customers, wait tables, prepare food, clean the restaurant, and so on, while the other 14 employees just sit around and collect a paycheck every two weeks. Meanwhile, the customers are not very happy because it takes forever to get their food and their check. As a business owner, that would be a waste of time and money. Oddly enough, many software engineers write code in a way similar to this.

Most modern CPUs contain 4 to 8 cores. Each core has:

  • An ALU
    • Handles integer math
  • An FPU
    • Handles floating-point math
  • L1 and L2 caches
    • Provide fast access to memory that has recently been fetched from system memory.

However, many software engineers design their code to run on a single core, which leaves the other 3 to 7 cores doing nothing, and a single core can only compute so fast. In the restaurant example, it would have been better for the owner to figure out how to keep all of his employees busy: some waiting tables, others preparing food, others cleaning. In software engineering, to get the best performance, we need to think about how to break our solutions into jobs that can be handed out to each core. This can be challenging because you need to understand your algorithm's input data dependencies and outputs. Often, you also need to think about how to merge the individual results from each core into something useful, as the small sketch below illustrates.
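
To make that concrete, here is a small, self-contained sketch (mine, not from any particular engine) that splits a summation across the available hardware threads with std::async and then merges the per-core results:

```cpp
#include <algorithm>
#include <cstdio>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Sum a large array by giving each core its own slice and then merging the
// partial results -- the "keep every employee busy" idea from above.
double ParallelSum(const std::vector<double>& values)
{
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    const size_t chunk = (values.size() + cores - 1) / cores;

    std::vector<std::future<double>> partials;
    for (unsigned i = 0; i < cores; ++i)
    {
        const size_t begin = std::min(values.size(), i * chunk);
        const size_t end   = std::min(values.size(), begin + chunk);
        partials.push_back(std::async(std::launch::async, [&values, begin, end] {
            return std::accumulate(values.begin() + begin, values.begin() + end, 0.0);
        }));
    }

    double total = 0.0;
    for (auto& partial : partials)   // merge the per-core results
        total += partial.get();
    return total;
}

int main()
{
    std::vector<double> data(1000000, 1.0);
    std::printf("sum = %f\n", ParallelSum(data));
}
```

The same divide/compute/merge pattern applies to game work such as culling, skinning, or particle updates, although real job systems reuse worker threads instead of spawning tasks per call.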

What is a Multi-threaded Renderer?

A multi-threaded renderer is generally composed of at least two threads. One thread, called the "simulation thread," is responsible for handling gameplay logic, physics, and so on. Based on the updated gameplay state, graphics API commands are queued up to be consumed by a second thread called the "render thread." The render thread typically owns the graphics device/context and is responsible for invoking the native graphics API commands, thereby issuing work to the GPU.
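
To illustrate the shape of that split, here is a minimal sketch (my own, not engine code) in which a "simulation thread" produces placeholder command packets and a "render thread" consumes them from a shared queue. A real engine would replace the printf with native graphics API calls and a more sophisticated queue (see below).

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical stand-in for a recorded graphics command packet.
struct RenderCommand { std::string name; };

std::queue<RenderCommand> g_commands;
std::mutex                g_mutex;
std::condition_variable   g_cv;
bool                      g_simulationDone = false;

// "Simulation thread": updates game state, then queues commands describing
// what the renderer should draw for the frame.
void SimulationThread()
{
    for (int frame = 0; frame < 3; ++frame)
    {
        // ... input, AI, physics would be updated here ...
        std::lock_guard<std::mutex> lock(g_mutex);
        g_commands.push({"DrawFrame " + std::to_string(frame)});
        g_cv.notify_one();
    }
    std::lock_guard<std::mutex> lock(g_mutex);
    g_simulationDone = true;
    g_cv.notify_one();
}

// "Render thread": owns the graphics device/context, drains the queue, and
// would invoke the native graphics API for each command.
void RenderThread()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_cv.wait(lock, [] { return !g_commands.empty() || g_simulationDone; });
        if (g_commands.empty())
            return;                                   // simulation finished and queue drained
        RenderCommand cmd = g_commands.front();
        g_commands.pop();
        lock.unlock();
        std::printf("render thread executes: %s\n", cmd.name.c_str());  // native API call goes here
    }
}

int main()
{
    std::thread sim(SimulationThread);
    std::thread render(RenderThread);
    sim.join();
    render.join();
}
```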

How can a multi-threaded renderer increase performance?

The graphics driver has quite a bit of work to do each time you invoke a native graphics API command. The driver must validate the various parameters you passed into the API, along with the overall graphics state, to avoid crashing the GPU. The driver is also responsible for uploading textures, vertex buffers, and other resources to and from the GPU. Take a look at A Trip Down the Graphics Pipeline for more details on what the graphics driver does. All of this driver work takes time, which means the thread executing the graphics API commands is forced to wait while the driver does it.

However, what you choose to render is often just the result of some changed game state. In other words, you typically handle new input (like game controllers), then update the AI, then physics, then sound, and finally render something that reflects the new game state. Your AI code usually does not need to know what is happening on the GPU; the AI, physics, and overall game state are independent of the renderer and serve only as input to it. It therefore seems like a waste to update some AI logic and then immediately block the simulation thread waiting on the GPU to finish rendering a frame. Instead, it is better to queue up a list of commands for the renderer to execute in parallel with the simulation work. This allows us to start simulating the next frame while waiting for the previous frame to be displayed on the screen.

However, if you aren't careful, the simulation and render threads can quickly get out of sync. Imagine you are playing a first-person shooter. As the player, you depend on the final rendered image as input to your brain to decide which buttons to press next. If the scene is visually very complex, the render thread may need much more time than the simulation thread for a single frame. In that case, the AI would have more time to hunt you down, because the simulation thread would be executing at a faster rate than frames can be rendered to the screen to help you decide how to react. Therefore, some kind of synchronization needs to take place if the simulation thread gets too far ahead of the render thread. Unity will simulate frame N and then render frame N on the render thread while it immediately begins simulating frame N+1. Before going any further, Unity waits for frame N to be completely rendered. Therefore it is important to ensure your rendering algorithms and shaders are optimized to reduce the chance of stalling the simulation thread.
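
Here is a rough sketch of that "at most one frame ahead" rule as described above; this is my own illustration, and the sleeps simply stand in for real simulation and GPU work.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// Allow the simulation to run at most one full frame ahead of the renderer:
// simulate N, kick off rendering N, immediately simulate N+1, then wait for
// frame N to finish rendering before simulating N+2.
std::mutex              g_mutex;
std::condition_variable g_cv;
int g_framesSimulated = 0;   // frames the simulation thread has finished
int g_framesRendered  = 0;   // frames the render thread has finished

void SimulationThread(int frameCount)
{
    for (int frame = 0; frame < frameCount; ++frame)
    {
        {
            // Don't get more than one frame ahead of the renderer.
            std::unique_lock<std::mutex> lock(g_mutex);
            g_cv.wait(lock, [] { return g_framesSimulated - g_framesRendered < 2; });
        }

        std::this_thread::sleep_for(std::chrono::milliseconds(5));   // pretend sim work
        std::printf("simulated frame %d\n", frame);

        std::lock_guard<std::mutex> lock(g_mutex);
        ++g_framesSimulated;                                         // hand the frame to the renderer
        g_cv.notify_all();
    }
}

void RenderThread(int frameCount)
{
    for (int frame = 0; frame < frameCount; ++frame)
    {
        {
            // Wait until the simulation has produced the next frame.
            std::unique_lock<std::mutex> lock(g_mutex);
            g_cv.wait(lock, [] { return g_framesSimulated > g_framesRendered; });
        }

        std::this_thread::sleep_for(std::chrono::milliseconds(20));  // pretend the GPU is slow
        std::printf("rendered  frame %d\n", frame);

        std::lock_guard<std::mutex> lock(g_mutex);
        ++g_framesRendered;                                          // frame is now on screen
        g_cv.notify_all();
    }
}

int main()
{
    std::thread sim(SimulationThread, 4);
    std::thread render(RenderThread, 4);
    sim.join();
    render.join();
}
```

Because the renderer is the slower of the two in this sketch, the simulation ends up stalling each frame, which is exactly the situation the article warns about when rendering work is not optimized.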

How is a multi-threaded renderer implemented?

Typically, a cross-platform game engine that supports multiple rendering APIs (DirectX 11, Vulkan, OpenGL, etc.) will have an abstracted high-level graphics API. This high-level API will look and feel much like the DirectX device context APIs, and its calls are translated into native graphics API calls. It's important to note that in a single-threaded renderer those native calls are executed immediately when invoked. When a multi-threaded renderer is used, however, all native graphics API calls are deferred. The reason is that we are trying to reduce the chance of stalling a CPU core; we want to keep all the CPU cores we have as busy as possible. Whichever core runs the simulation thread can be thought of as the master core. We can then use another slave core to run the native graphics API code. The simulation thread will queue up graphics-related work for the slave core to do, and the slave core will only consume a new task when it has finished its previous one.

The queueing/dequeuing of graphics-related work is typically managed using a data structure called a ring buffer (or circular buffer). Ring buffers are queues implemented using a regular array that loops: when you run out of room in the array, you simply loop back to the first element, so you never need to allocate more memory. Ring buffers are very useful data structures when writing multi-threaded code because they allow you to queue/dequeue objects from different threads in a safe manner; the simulation thread writes to one index in the array while the render thread reads from another. It is also possible to write thread-safe ring buffers that are lock-less. Lock-less ring buffers further increase performance by reducing the chance of one thread waiting for another thread to queue or dequeue work. When a high-level graphics API call is invoked on the simulation thread, a graphics command data packet is queued in the ring buffer. When the render thread has finished executing its previous task, it consumes the next task in the ring buffer by dequeuing and executing it.

Note: The following is a simple multi-threaded renderer skeleton. It's not production quality or optimized, but it is intended to give you an idea of how a multi-threaded renderer could be implemented in a game engine.

Example Code

Download  Sample Project
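
If you just want the core idea without downloading the project, below is a minimal sketch of a lock-less single-producer/single-consumer ring buffer feeding a render thread. This is my own simplified version, not the sample project's code, and it omits the sizing, wrapping, and back-pressure policies a production renderer would need.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>

// Minimal lock-less single-producer/single-consumer ring buffer.
// The simulation thread is the only writer of m_head and the render thread
// is the only writer of m_tail, so no mutex is required.
template <typename T, size_t Capacity>
class RingBuffer
{
public:
    bool Push(const T& item)                        // called only by the simulation thread
    {
        const size_t head = m_head.load(std::memory_order_relaxed);
        const size_t next = (head + 1) % Capacity;
        if (next == m_tail.load(std::memory_order_acquire))
            return false;                           // full; the caller decides whether to wait
        m_items[head] = item;
        m_head.store(next, std::memory_order_release);
        return true;
    }

    bool Pop(T& out)                                // called only by the render thread
    {
        const size_t tail = m_tail.load(std::memory_order_relaxed);
        if (tail == m_head.load(std::memory_order_acquire))
            return false;                           // empty
        out = m_items[tail];
        m_tail.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> m_items{};
    std::atomic<size_t> m_head{0};                  // next slot to write
    std::atomic<size_t> m_tail{0};                  // next slot to read
};

// Hypothetical command packet the simulation thread records instead of
// calling the native graphics API directly.
struct RenderCommand { int type; int param; };

int main()
{
    RingBuffer<RenderCommand, 256> queue;
    std::atomic<bool> done{false};

    // "Render thread": dequeues packets and would issue native API calls.
    std::thread render([&] {
        RenderCommand cmd;
        while (!done.load(std::memory_order_acquire))
            while (queue.Pop(cmd))
                std::printf("execute command %d (param %d)\n", cmd.type, cmd.param);
        while (queue.Pop(cmd))                      // drain anything queued before 'done'
            std::printf("execute command %d (param %d)\n", cmd.type, cmd.param);
    });

    // "Simulation thread" (here, main): records command packets.
    for (int i = 0; i < 10; ++i)
        while (!queue.Push({i, i * 2})) {}          // spin if the buffer is full
    done.store(true, std::memory_order_release);

    render.join();
}
```

Because each index has exactly one writer, the acquire/release pairs on m_head and m_tail are enough to keep the two threads consistent without locks; Kaspar Daugaard's article referenced in the conclusion covers a faster and more complete variant.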

Interacting with Unity’s Multi-threaded Renderer

If you are using Unity, multi-threaded rendering just works out of the box. There isn't really anything you need to do, because all your code will be running on Unity's simulation thread. However, if you are writing a native rendering plugin DLL, you will need to ensure that your graphics API code runs on Unity's render thread, because native rendering plugins share Unity's graphics device and context. Take a look at the following article for more information on how to write a native rendering plugin for Unity. The article contains an example of how to execute DirectX graphics API commands on Unity's render thread via Unity's C# script method IssuePluginEventAndData. When you call IssuePluginEventAndData, Unity essentially queues an "IssuePluginEventAndData" command in its ring buffer on the simulation thread and eventually dequeues the command on the render thread and executes it. If Unity were configured to be single-threaded and you called IssuePluginEventAndData, your plugin callback would be executed immediately.
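
For reference, the native side of that pattern looks roughly like the following. This assumes Unity's native plugin headers (IUnityInterface.h and IUnityGraphics.h, which ship with the editor) are on the include path; the function names, event id, and plugin name are illustrative.

```cpp
// Native plugin side (C++). Unity invokes OnRenderEventAndData on its render
// thread when the C# script issues the plugin event, so it is safe to use
// Unity's shared graphics device/context here.
#include "IUnityInterface.h"
#include "IUnityGraphics.h"

static void UNITY_INTERFACE_API OnRenderEventAndData(int eventId, void* data)
{
    (void)data;  // 'data' is whatever pointer the C# side passed along
    if (eventId == 1)
    {
        // Native graphics API calls (D3D11/D3D12/Vulkan/...) go here.
    }
}

// The C# script retrieves this function pointer and hands it to Unity.
extern "C" UnityRenderingEventAndData UNITY_INTERFACE_EXPORT UNITY_INTERFACE_API
GetRenderEventAndDataFunc()
{
    return OnRenderEventAndData;
}

// C# side (for reference):
//   [DllImport("MyNativePlugin")] static extern System.IntPtr GetRenderEventAndDataFunc();
//   ...
//   commandBuffer.IssuePluginEventAndData(GetRenderEventAndDataFunc(), 1, dataPtr);
```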

Conclusion

If you have made it this far, you should have a better grasp of how Unity's multi-threaded renderer works, which will help if you ever write a native rendering plugin for Unity. For more information on how to write a more optimal ring buffer for multi-threaded rendering, take a look at Kaspar Daugaard's article "Writing a Fast and Versatile SPSC Ring Buffer." That's it for now. Go create something!

