Introduction
A lot of game engines sport a special kind of renderer called a “multi-threaded renderer”. Multi-threaded renderers have been around for quite some time. When I first started working with game engines, I didn’t really understand what a multi-threaded renderer was or how it worked. In this article, I will teach you how they are implemented so that you can better understand how to use them in modern game engines like Unity.
Why Use Threads at All?
Imagine you own a restaurant and have 15 employees. However, you only give a single employee work to do. That one single employee must interact with customers, wait tables, prepare food, clean the restaurant, etc. The other 14 employees just sit around and collect a paycheck every 2 weeks. By the way, the customers are not very happy because it takes forever to get their food and check. For the business owner, that is a waste of time and money. Oddly enough, many software engineers write code in a way similar to this.
Most modern CPUs contain 4 to 8 cores. Each core has:
- An ALU
  - Handles integer math
- An FPU
  - Handles floating-point math
- L1 and L2 caches
  - Provide fast access to memory that has recently been fetched from system memory.
However, many software engineers design their code to run on a single core. That leaves 3 to 7 cores doing nothing, and a single core can only compute information so fast. In the restaurant example, it would have been better for the owner to figure out how to keep all of the employees busy: some waiting tables, others preparing food, and others cleaning. In the software engineering business, to get the best performance, we need to think about how to break our solutions into jobs that can be handed to each core. This can be challenging because you need to understand your algorithm's input data dependencies and outputs. Often, you also need to think about how to merge the individual results from each core into something useful.
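To make that concrete, here is a minimal sketch of splitting work across cores and merging the per-core results. The chunking scheme, the worker count, and the use of a plain sum are illustrative assumptions only; a real engine would use a job system rather than spawning raw threads like this.

// Minimal sketch: split a large sum across hardware threads and merge the results.
// The chunking scheme is illustrative only; a real engine would use a job system.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main()
{
    std::vector<int> data(1000000, 1);
    const unsigned workerCount = std::max(1u, std::thread::hardware_concurrency());

    std::vector<long long> partialSums(workerCount, 0);
    std::vector<std::thread> workers;

    const std::size_t chunk = data.size() / workerCount;
    for (unsigned i = 0; i < workerCount; ++i)
    {
        const std::size_t begin = i * chunk;
        const std::size_t end = (i == workerCount - 1) ? data.size() : begin + chunk;
        // Each worker writes only to its own slot, so no locking is needed here.
        workers.emplace_back([&, i, begin, end] {
            partialSums[i] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }

    for (auto& worker : workers)
        worker.join();

    // Merge the per-core results into a single answer.
    const long long total = std::accumulate(partialSums.begin(), partialSums.end(), 0LL);
    printf("total = %lld\n", total);
    return 0;
}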
What is a Multi-threaded Renderer?
A multi-threaded renderer is generally composed of at least two threads. One thread is called the "simulation thread" and is responsible for handling gameplay logic, physics, etc. Based on the updated gameplay state, graphics API commands are queued up to be consumed by a second thread called the "render thread". The render thread typically owns the graphics device/context and is responsible for invoking the native graphics API commands, thereby issuing work to be done on the GPU.
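Before digging into the details, here is a minimal sketch of that two-thread split. The command names and the mutex-protected queue are illustrative only; the full example later in this article replaces the queue with a ring buffer.

// Minimal sketch of the simulation/render thread split. The simulation thread
// queues commands; the render thread (which would own the graphics device/context)
// dequeues and executes them. The mutex-protected queue is illustrative only.
#include <atomic>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

std::queue<std::function<void()>> gfxCommands; // produced by the simulation thread
std::mutex gfxMutex;
std::atomic<bool> simulationDone{false};

void UpdateSimulation()
{
    // Update gameplay, physics, etc. here, then queue graphics work for the render thread.
    std::lock_guard<std::mutex> lock(gfxMutex);
    gfxCommands.push([] { printf("Clear(0, 0, 0)\n"); });     // stand-in for a native clear call
    gfxCommands.push([] { printf("Draw(triangleList)\n"); }); // stand-in for a native draw call
}

void RunRenderThread()
{
    for (;;)
    {
        std::function<void()> cmd;
        {
            std::lock_guard<std::mutex> lock(gfxMutex);
            if (!gfxCommands.empty())
            {
                cmd = std::move(gfxCommands.front());
                gfxCommands.pop();
            }
            else if (simulationDone)
            {
                return; // queue drained and the simulation thread is done submitting
            }
        }
        if (cmd)
            cmd(); // this is where the native graphics API would be invoked
    }
}

int main()
{
    std::thread simulationThread([] { UpdateSimulation(); simulationDone = true; });
    std::thread renderThread(RunRenderThread);
    simulationThread.join();
    renderThread.join();
    return 0;
}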
How can a multi-threaded renderer increase performance?
The graphics driver has quite a bit of work to do each time you invoke a native graphics API command. The driver must validate the parameters you passed into the API, along with the overall graphics state, to avoid crashing the GPU. The graphics driver is also responsible for uploading textures, vertex buffers, and other resources to and from the GPU. Take a look at A Trip Down the Graphics Pipeline for more details on what the graphics driver does. All of this driver work takes time, which means the graphics driver must block the thread executing the graphics API commands until the commands have been processed. However, what you choose to render is often the result of some changed game state. In other words, you will typically handle new input state from game controllers, then update the AI, physics, and sound, and finally render something that reflects the new game state. Often, your AI code does not need to know what is happening on the GPU; the AI, physics, and overall game state are independent of the renderer and are used as input into it. Therefore, it seems wasteful to update some AI logic and then immediately block the simulation thread to wait on the GPU to finish rendering a frame. Instead, it would be better to queue up a list of commands for the renderer to execute in parallel with the simulation work. This allows us to start simulating the next frame while waiting on the previous frame to be displayed on the screen.
However, if you aren’t careful, the simulation and render threads can quickly get out of sync. Imagine you are playing a first-person shooter. As the player, you depend on the final rendered image as input to your brain to help you decide which buttons to press next. If the scene is visually complex, the render thread may need to spend much more time working than the simulation thread for a single frame. In that case, the AI would have more time to hunt you down because the simulation thread would be executing at a faster rate than frames can be rendered to the screen to help you determine how to react. Therefore, some kind of synchronization needs to take place if the simulation thread gets too far ahead of the render thread. Unity will simulate frame N and render frame N in parallel. Then Unity will immediately simulate frame N+1, but it will wait for frame N to be completely rendered before proceeding any further. Therefore, it is important to ensure your rendering algorithms and shaders are optimized to reduce the chance of stalling the simulation thread.
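A rough sketch of that kind of throttling might look like the following. The one-frame-ahead limit, the sleep-based "work", and all of the names here are illustrative assumptions, not Unity's actual implementation; the point is simply that the simulation thread blocks whenever it gets too far ahead of the render thread.

// Sketch: keep the simulation thread at most one frame ahead of the render thread.
// The one-frame limit and the sleep-based "work" are illustrative assumptions only.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex frameMutex;
std::condition_variable frameCv;
int simulatedFrame = 0;        // last frame the simulation finished
int renderedFrame = 0;         // last frame the renderer finished
const int maxFramesAhead = 1;  // how far the simulation may run ahead

void SimulateFrames(int frameCount)
{
    for (int n = 1; n <= frameCount; ++n)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(5)); // pretend gameplay/physics work
        std::unique_lock<std::mutex> lock(frameMutex);
        // Stall here if the renderer has fallen too far behind.
        frameCv.wait(lock, [&] { return simulatedFrame - renderedFrame < maxFramesAhead; });
        simulatedFrame = n;
        printf("simulated frame %d\n", n);
        frameCv.notify_all();
    }
}

void RenderFrames(int frameCount)
{
    for (int n = 1; n <= frameCount; ++n)
    {
        {
            std::unique_lock<std::mutex> lock(frameMutex);
            frameCv.wait(lock, [&] { return simulatedFrame >= n; }); // wait for frame n's commands
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(20)); // pretend expensive rendering
        {
            std::lock_guard<std::mutex> lock(frameMutex);
            renderedFrame = n;
            printf("rendered  frame %d\n", n);
        }
        frameCv.notify_all();
    }
}

int main()
{
    std::thread simulationThread(SimulateFrames, 5);
    std::thread renderThread(RenderFrames, 5);
    simulationThread.join();
    renderThread.join();
    return 0;
}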
How is a multi-threaded renderer implemented?
Typically, a cross-platform game engine that supports multiple rendering APIs (DirectX 11, Vulkan, OpenGL, etc.) will have some abstracted high-level graphics API. This high-level graphics API will look and feel much like the DirectX device context APIs, and those high-level calls are then translated into native graphics API calls. In a single-threaded renderer, the native graphics API calls are executed immediately when invoked. When a multi-threaded renderer is used, however, all native graphics API calls are deferred. The reason is that we are trying to reduce the chance of stalling a CPU core; we want to keep all the CPU cores we have as busy as possible. Whichever core runs the simulation thread can be thought of as the master core. We can then use another slave core to run the native graphics API code. The simulation thread queues up graphics-related work for the slave core to do. However, the slave core will only consume new tasks when it has finished its previous task.

The queueing/dequeuing of graphics-related work is typically managed using a data structure called a Ring Buffer or Circular Buffer. Ring buffers are queues implemented using a regular array that loops: when you run out of room in the array to store information, you just loop back to the first element, so you never need to allocate more memory. Ring buffers are pretty useful data structures when writing multi-threaded code. They allow you to queue/dequeue objects from different threads in a safe manner, because the simulation thread writes to one index in the array while the render thread reads from another. It is also possible to write thread-safe ring buffers that are lock-less. Lock-less ring buffers further increase performance by reducing the chance of one thread waiting for another thread to queue or dequeue work.

When a high-level graphics API is invoked on the simulation thread, a graphics command data packet is queued in the ring buffer. When the render thread is done executing its previous task, it consumes the next task in the ring buffer by dequeuing it and executing it.

Note: The following is a simple multi-threaded renderer skeleton. It's not production quality or optimized, but it is intended to give you an idea of how a multi-threaded renderer could be implemented in a game engine.
Example Code
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Check out the following links for more information on ring buffers.
// http://www.mathcs.emory.edu/~cheung/Courses/171/Syllabus/8-List/array-queue2.html
// http://wiki.c2.com/?CircularBuffer
// https://preshing.com/20130618/atomic-vs-non-atomic-operations/
// https://www.daugaard.org/blog/writing-a-fast-and-versatile-spsc-ring-buffer/

template <typename T>
class RingBuffer
{
private:
    int maxCount;
    T* buffer;
    atomic<int> readIndex;
    atomic<int> writeIndex;

public:
    RingBuffer() : maxCount(51), buffer(nullptr), readIndex(0), writeIndex(0)
    {
        buffer = new T[maxCount];
        memset(buffer, 0, sizeof(buffer[0]) * maxCount);
    }

    RingBuffer(int count) : maxCount(count + 1), buffer(nullptr), readIndex(0), writeIndex(0)
    {
        buffer = new T[maxCount];
        memset(buffer, 0, sizeof(buffer[0]) * maxCount);
    }

    ~RingBuffer()
    {
        delete[] buffer;
        buffer = nullptr;
    }

    inline void Enqueue(T value)
    {
        // We don't want to overwrite old data if the buffer is full
        // and the writer thread is trying to add more data. In that case,
        // block the writer thread until data has been read/removed from the ring buffer.
        while (IsFull())
        {
            this_thread::sleep_for(500ns);
        }
        buffer[writeIndex] = value;
        writeIndex = (writeIndex + 1) % maxCount;
    }

    inline bool Dequeue(T* outValue)
    {
        if (IsEmpty())
            return false;
        *outValue = buffer[readIndex];
        readIndex = (readIndex + 1) % maxCount;
        return true;
    }

    inline bool IsEmpty()
    {
        return readIndex == writeIndex;
    }

    inline bool IsFull()
    {
        return readIndex == ((writeIndex + 1) % maxCount);
    }

    inline void Clear()
    {
        readIndex = writeIndex = 0;
        memset(buffer, 0, sizeof(buffer[0]) * maxCount);
    }

    inline int GetSize()
    {
        // Number of queued items, accounting for the write index wrapping around.
        return (writeIndex - readIndex + maxCount) % maxCount;
    }

    inline int GetMaxSize()
    {
        return maxCount;
    }
};

struct GfxCmd
{
public:
    // Virtual destructor so derived commands are destroyed correctly
    // when deleted through a GfxCmd pointer.
    virtual ~GfxCmd() {}
    virtual void Invoke() {}
};

struct GfxCmdSetRenderTarget : public GfxCmd
{
public:
    void* resourcePtr;

    GfxCmdSetRenderTarget(void* resource) : resourcePtr(resource) {}

    void Invoke()
    {
        // Invoke the ID3D11DeviceContext::OMSetRenderTargets method here...
        // https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-omsetrendertargets
        printf("%s(%p);\n", name, resourcePtr);
    }

private:
    const char* name = "GfxCmdSetRenderTarget";
};

struct GfxCmdClearRenderTargetView : public GfxCmd
{
public:
    int r, g, b;

    GfxCmdClearRenderTargetView(int _r, int _g, int _b) : r(_r), g(_g), b(_b) {}

    void Invoke()
    {
        // Invoke the ID3D11DeviceContext::ClearRenderTargetView method here...
        // https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-clearrendertargetview
        printf("%s(%d, %d, %d);\n", name, r, g, b);
        // Pretend this command requires the render thread to do a lot of work.
        this_thread::sleep_for(250ms);
    }

private:
    const char* name = "GfxCmdClearRenderTargetView";
};

struct GfxCmdDraw : public GfxCmd
{
public:
    int topology;
    int vertCount;

    GfxCmdDraw(int _topology, int _vertCount) : topology(_topology), vertCount(_vertCount) {}

    void Invoke()
    {
        // Invoke the ID3D11DeviceContext::DrawIndexed method here...
        // https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-drawindexed
        printf("%s(%d, %d);\n", name, topology, vertCount);
    }

private:
    const char* name = "GfxCmdDraw";
};

void UpdateSimulationThread(RingBuffer<GfxCmd*>& gfxCmdList)
{
    // Update gameplay here.
    // Determine what to draw based on the new game state below.
    // The graphics commands are queued up for the render thread,
    // which will execute the native graphics API (i.e. OpenGL/DirectX/Vulkan/etc.) calls.
    gfxCmdList.Enqueue(new GfxCmdSetRenderTarget{ (void*)0x1 });
    gfxCmdList.Enqueue(new GfxCmdClearRenderTargetView{ 255, 0, 245 });
    gfxCmdList.Enqueue(new GfxCmdDraw{ 1, 10 });
}

void UpdateRenderThread(RingBuffer<GfxCmd*>& gfxCmdList)
{
    GfxCmd* gfxCmd = nullptr;
    if (gfxCmdList.Dequeue(&gfxCmd))
    {
        gfxCmd->Invoke();
        delete gfxCmd;
    }
}

void GameLoop()
{
    RingBuffer<GfxCmd*> gfxCmdList(3);
    atomic<bool> quit{false};

    // Run this indefinitely...
    while (true)
    {
        quit = false;
        gfxCmdList.Clear();

        thread simulationThread = thread([&gfxCmdList, &quit] {
            UpdateSimulationThread(gfxCmdList);
            quit = true;
        });

        thread renderThread = thread([&gfxCmdList, &quit] {
            // Continue to read data from the ring buffer until it is both empty
            // and the simulation thread is done submitting new items into the ring buffer.
            while (!(gfxCmdList.IsEmpty() && quit))
            {
                UpdateRenderThread(gfxCmdList);
            }
        });

        // Ensure that both the simulation and render threads have completed their work.
        simulationThread.join();
        renderThread.join();
        cout << "---\n";
    }
}

int main(int argc, char** argv)
{
    GameLoop();
    return 0;
}
Download Sample Project
Interacting with Unity’s Multi-threaded Renderer
If you are using Unity, multi-threaded rendering just works out of the box. There isn’t really anything you need to do because all your code will be running on Unity’s simulation thread. However, if you are writing a native rendering plugin DLL, you will need to ensure that your graphics API code runs on Unity’s render thread. This is because native rendering DLLs share Unity’s graphics device and context. Take a look at the following article for more information on how to write a native rendering plugin for Unity. The article contains an example of how to execute DirectX graphics API commands on Unity’s render thread via Unity’s C# script method IssuePluginEventAndData. When you call IssuePluginEventAndData, Unity essentially queues an “IssuePluginEventAndData” command in its ring buffer on the simulation thread and eventually dequeues the command on the render thread and executes it. If Unity were configured to be single-threaded and you were to call IssuePluginEventAndData, the command would be executed immediately instead.
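To make that concrete, here is a rough sketch of the native side of such a plugin. This assumes Unity's native plugin headers (IUnityInterface.h and IUnityGraphics.h) are on your include path; the OnRenderEvent and GetRenderEventAndDataFunc names are just illustrative and not required by Unity.

// Rough sketch of the native side of a Unity rendering plugin. Assumes Unity's
// native plugin headers are available; the function names are illustrative only.
#include "IUnityInterface.h"
#include "IUnityGraphics.h"
#include <cstdio>

// Unity dequeues this callback on its render thread, so it is safe to touch the
// shared graphics device/context (e.g. issue DirectX calls) from here.
static void UNITY_INTERFACE_API OnRenderEvent(int eventId, void* data)
{
    printf("OnRenderEvent(eventId=%d, data=%p) running on Unity's render thread\n", eventId, data);
}

// C# fetches this function pointer and passes it to IssuePluginEventAndData on the
// simulation thread; Unity queues the command and later invokes it on the render thread.
extern "C" UnityRenderingEventAndData UNITY_INTERFACE_EXPORT UNITY_INTERFACE_API
GetRenderEventAndDataFunc()
{
    return OnRenderEvent;
}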
Conclusion
If you have made it this far, you should have a better grasp of how Unity’s multi-threaded renderer works, which will come in handy if you ever write a native rendering plugin for Unity. For more information on how to write a more optimal ring buffer for multi-threaded rendering, take a look at Kaspar Daugaard’s article “Writing a Fast and Versatile SPSC Ring Buffer”. That’s it for now. Go create something!