Notifications
Article
Physics simulation on GPU with compute shader
Published 5 months ago
145
0
Physics simulation on GPU with compute shader

Introduction
How it looks [gif]

In this tutorial I'll explain how to use compute shader to perform computations on a videocard. Hairs simulation used as example.

You can download example project here (it is a Unity3d project):

https://drive.google.com/open?id=1LTgdVaiOuGuXhb6Mhc_Z1gStP7muxZlO

Who will understand this tutorial? Unity3D users or any C#/C++ programmer. Shader is wrote on HLSL which is a C family language.
Who will benefit from this tutorial? Experienced programmers who want to start using GPU for computations. Unexperienced, but passionate ot diligent programmers migh benefit as well.

Why should I use a videocard for computations? It outperforms CPU by 10-100x times for parallel tasks. Great boost, isn't it? A little supercomputer inside computer! Why not use it sometimes if a task is appropriate?
Is this "great boost" of performance really required? Yes, sometimes CPU performance is a limiting factor. For example, when a program should perform the same set of operations on a huge amounts of data. It's so easy to parallel this kind of task. Sometimes developers simply ignore some algorithms because of their known performance cost. For example, games benefit alot from good physics, but good physics load CPU alot, lowering fps. So, developers use lousy physics or no physics at all. But it can be fixed by simply pushing the calculations to GPU. Cool physics with no CPU cost.
So, we can kinda crash tough tasks with a brute force? Optimization is always a good thing, no matter how fast your processor is. There's no such supercomputer in the world that can't be slowed to the ground with ineffective code.
Why compute shader though? Why not OpenCL or CUDA? CUDA is Nvidia specific and I don't know opencl. In the same time Unity can build in any API: DirectX, OpenGL Core, Metal, Vulkan. So, compute shader might be a good choice. Though, I must note here that if you want your compute shader based program to work on every of those platforms, you have to take into account the differences between those different graphics APIs. For example, Metal won't support more than 256 threads per axis. While DX10 standard supports 1024. Android API won't let you use more than 4 buffers per kernel. While DX10 allows 8, DX11 - even more.
Alright, why physics simulation then? It's a good task for learning: it is performance demanding and well paralleled. Also, it's a useful, relevant, widely demanded kind of task in general. Gamedev people can implement cool physics in games. Students can create experimental models for their projects. Engineers and scientists can use models for their experiments. Casual enthusiasts and hobby programmers can have fun implementing funny looking stuff.
Why is the hairs simulation used? It's simple, but covers the whole range of features I'd like to include in this tutorial.

How to use this tutorial? It's better to download the source project, open it in Unity, play with it, check the code, find yourself curious about what this or that line do. And then just read the tutorial and see how the lines I explain used in the code. I won't explain every single line, only important ones. Besides, there's no complex algorithms at all. On CPU side of the code there are some methods of the compute shader related classes used. And on GPU side (in the shader) some simple math is done. But if there's something not clear for you, ask me in comments, I post this on reddit to communicate with people and help them.

If you have zero experience in the compute shader area I suggest make a pause and go to the easy level tutorial I wrote earlier. That one only covers the basics. After going through that simple one you can get back to this tutorial, and that should make it easier for you. But if you are comfortable with the basics or really confident, then let's get deeper in the subject right now.

Actual tutorial starts here

If you want to build a physics simulation on GPU, you can cut the whole task to these 4 parts:

-math model of the simulated object
-general algorithm
-actual shader code
-cpu side shader and data preparations

Math model

The good thing about videocards is that they can apply a set of instructions to a big amount of data units. We'll use it by implementing hair model as a number of points, and each one will interact with its neighbors by the same rules. So, it can be easily paralleled. The interaction between the points will be spring-alike:

k * (S0-S)^n

where S0 - is an equillibrium distance, and S - current distance.

In reality a hair is not a spring, it has a constant length. So, we need to make our spring hard. It's better to do it by increasing the "n" - the power factor. Because it makes the curve of force steeper in the area of equilibrium. I decided to have n = 2. The meaning of "k" factor we will discuss a bit later.

Besides the spring-alike force, there will be velocity diffusion implemented between the points. Tangential component of relative force models dynamic resistance to stretching. Normal component of the relative velocity models dynamic resistance to flexing. They both would improve hair's dynamic, make it look less spring-alike.

Also, we will have a static flexibility related force. Each point will tend to compensate the bending. The force will be proportional to bending degree and point to the direction of reducing the bending. Neighbor points will recieve oppositely pointed force of the halved intensity, so the sum of all flexibility related forces wouldn't move the whole hair.

Together these interactions simulate a hair nicely. But we will add something else. We want our hairs to interact with solid bodies. Physics siumlations in general use different kinds of objects that interact with each other, for example liquid and solid shapes. But also because I write this tutorial for game developers, and there's no point for us to implement anything that doesn't interact with the CPU side game world. GPU side things must interact with anything on CPU side. So, we will make our hairs interact with scene objects, by putting information about those objects to GPU memory and then reading the forces hairs affected those objects with.

For the sake of simplicity we will only do this for circle 2d colliders from standard unity physics. It can be extended to all other kinds of colliders, but I only want to show the approach, then you will be able to improve it, to make it work with all kinds of data you want. So, only circle colliders in our case.

Hairs will interact with scene objets like this: if hair's point find itself inside a solid body, it will be moved to the closest surface point of the body, and also hair point and the body will exchange impulses. A fraction of point's velosity that points to the body will be substracted from the point's whole velocity, and then will be added to the body's velocity. We will not take into account body's current velocity. It would be more correct to do, but we won't, for simplicity.

Algorithm, shader code and CPU side shader preparation

These three are too related to each other, so we will go through them in parallel.

To describe a point we model our hair with, we will use this struct:

struct hairNode{
float x; // 2d position of the point
float y; //
float vx; // 2d velocity
float vy; //
int dvx; // sum of forces or change of velocity in the current frame
int dvy; //
int dummy1; // these two parameters added to make struct's size divisible by 128 bit
int dummy2; //
}

This struct is declared twice: on GPU and on CPU sides. For convenience. On CPU side we initialize the array of this struct with the initial coordinates of the points. Then we copy this data to GPU. But if we don't need to pass initial values to GPU, we can declare this struct only on GPU side.

About those dummy1 and dummy2 parameters. Reading from GPU side buffer takes some calculations. And if we keep the stride aliquot to 128 bit, it will take less inner core calculations to get access to the data unit - no additional offset calculations will be needed. Therefore nvidia engineers advise to add empty variables to structs to make them fit this condition.

I hope the meaning of other vaules of the point struct are understood. Though, an attentive reader might ask: why is position and velocity have float type, and forces are int? Short answer: because forces are being modified in parallel by different threads. So, we need to use protected writing operation to avoid data loss. But protected data access functions only work for int and uint values. So, we have force as int. I'll describe this later with more details.

We have alot of points we model the hairs with. Number of hairs multiplied by number of points per hair. Alot. This data is stored in GPU as a buffer:

RWStructuredBuffer<hairNode> hairNodesBuffer;

In shader code we conly have to declare buffer name and it's type. While its size is being setup from outside, from CPU side.

Now let's see how shader code is acatually structured. Shader code is made of kernels. They are like methods. But each kernel runs in many threads in parallel. Therefore, for each kernel we need to set the amount of threads it should run in. Number of threads is a three dimentional value. It has x, y and z axes. Which corresponds to GPU architecture. But we don't have to use all three dimensions. For example, if we are doing something with a 1024x1024 texture, we can have 1024x1024 threads (threads per each pixel).

This is how an empty kernel looks like:

#pragma kernel kernelName
[numthreads(8,4,1)]
void kernelName (uint3 id : SV_DispatchThreadID){
// actual code should be here
}

Kernel has "id" parameter. It provides thread index. Which is very convenient, it is being used to access a specific data unit. Remember I mentioned 1024x1024 texture being computed thread-per-pixel by 1024x1024 threads? Pixels would be adressed by the "id" parameter:

texture[id.xy] = float(1, 1, 1, 1);

or (which is the same)

texture[uint2(id.x, id.y)]

By the way. From CPU side a kernel is being called by a method from ComputeShader class, like this:

shaderInstance.Dispatch(kernelIndex, 2, 2, 1);

Those numbers "2, 2, 1" are related to the line that follows shader name declaration in shader:

[numthreads(8,4,1)]

These two triplets of numbers define the amount of threads the kernel should run in. We just have to multiply them axis by axis. In our case: "2, 2, 1" at Dispatch() and "8, 4, 1" at numthreads(). So, the kernel will run in 2 * 8 = 16 by x, and 2 * 4 = 8 by y dimension. 16 * 8 threads.

So, once more. On CPU side we call a kernel like this:

Dispatch(kernelIndex, X, Y, Z);

On GPU side kernel has this directive:

[numthreads(x,y,z)]

And as a result the kernel will run in (X * x) * (Y * y) * (Z * z) threads.

Example: we need to process every pixel of the 256x256 texture, and we want to go thread-per-pixel approach. So, on CPU side we define the number of threads this way:

Dispatch(kernelIndex, 16, 16, 1);

and on shader side:

[numthreads(16,16,1)]

Inside the kernel id.x will have [0, 255] range, same for id.y

And, pixels will be written like this:

texture[id.xy]=float4(1, 1, 1, 1);

it will paint white all 65536 pixels

Hope this part about the number of threads is understood, but if not, you might want to go through that simple tutorial I linked in the introduction to this tutorial.

But if you got the idea, let's go further and see how does the shader code look.

The kernels

Our shader code contains a few kernels, that being called one after another from Update() method on CPU side. I'll explain each kernel's code later, now let's see what role each kernel plays.

calc - tangential and normal components of the force between the points is being calculated. Remember we model the hair as a set of points? these points interact with a spring-alike force, which pulls them. ALso, there's a flexibility related force, that compensate the curves and pushes the points perpendicular to the line between theri neighbors. The forces calculated for each perticle is being stored in its dvx and dvy parameters.

velShare - points exchange fractions of relative speed between each other. Tangential and normal components separately. Tangential component should be exchanged more intensively, so we first extract it and make the exchange with a bigger factor. And then we exchange a smalle fraction of the full velocity. Velocity changes we again store in dvx and dvy variables of each point.

interactionWithColliders - each point interacts with the colliders that we provided the info about from CPU side

calcApply - the forces we have calculated in the previous kernels are modifying points' velocities, and then velocities are used to modify points' positions

visInternodeLines - lines are being drawn between the points, but they are being drawn on a 1024x1024 buffer instead of texture

pixelsToTexture - here we move the pixels from the buffer to the actual texture

clearPixels - all pixel data from the buffer are being removed

clearTexture - the texture is being cleaned

oneThreadAction - this kernel runs in a single thread, it's purpose is to move the whole system of hair to the point user's imput wanted it to move. Instead of teleporting it to the desired point, this thread moves it smoothly. To avoid instability in the system, because forces are a power of two of the distance, and they would be enormous if we jumped the system of hair in one step

How to prepare a shader and run it

Declaring a variable:
ComputeShader _shader;

Initializing the variable by providing a shader file:
_shader = Resources.Load<ComputeShader>("shader");

Setting up the constants we will need on GPU side
// nodesPerHair and nHairs variables are already initialized
_shader.SetInt("nNodsPerHair", nodesPerHair);
_shader.SetInt("nHairs", nHairs);

Declaring the arrays, which will contain points' data. Also declaring the buffer we will use to write data to GPU
hairNode[] hairNodesArray;
ComputeBuffer hairNodesBuffer;

Initializing the buffer and writing data to GPU
// hairNodesArray is already initialized with the points' data
hairNodesBuffer = new ComputeBuffer(hairNodesArray.Length, 4 * 8);
hairNodesBuffer.SetData(hairNodesArray);

For each kernel we are setting the buffers that are used by the kernel, so it would read and write to those buffers
kiCalc = _shader.FindKernel("calc");
_shader.SetBuffer(kiCalc, "hairNodesBuffer", hairNodesBuffer);

When all buffers are set for all kernels, we can finally run the kernels

All kernels are being runned from Update(). We shouldn't call it from FixedUpdate(), it will work very slow, because graphics pipeline is synchronized with Update().

All kernels are being run in this sequence. I now provide the full text of the doShaderStuff() method that is being called from Update()

void doShaderStuff(){
int i, nHairThreadGroups, nNodesThreadGroups;
nHairThreadGroups = (nHairs - 1) / 16 + 1;
nNodesThreadGroups = (nodesPerHair - 1) / 8 + 1;
_shader.SetFloats("pivotDestination", pivotPosition);
circleCollidersBuffer.SetData(circleCollidersArray);
i = 0;
while (i < 40) {
_shader.Dispatch(kiVelShare, nHairThreadGroups, nNodesThreadGroups, 1);
_shader.Dispatch(kiCalc, nHairThreadGroups, nNodesThreadGroups, 1);
_shader.Dispatch(kiInteractionWithColliders, nHairThreadGroups, nNodesThreadGroups, 1);
_shader.Dispatch(kiCalcApply, nHairThreadGroups, nNodesThreadGroups, 1);
_shader.Dispatch(kiOneThreadAction, 1, 1, 1);
i++;
}
circleCollidersBuffer.GetData(circleCollidersArray);
_shader.Dispatch(kiVisInternodeLines, nHairThreadGroups, nNodesThreadGroups, 1);
_shader.Dispatch(kiClearTexture, 32, 32, 1);
_shader.Dispatch(kiPixelsToTexture, 32, 32, 1);
_shader.Dispatch(kiClearPixels, 32, 32, 1);
}

One can notice there's a set of kernels that are being runned 40 times per update. Why? To compensate the small time step and still have the simulation work fast in realtime. Why is time step small? To reduce discretization error, to make the system work more stable. Why is it unstable when big time step causes big discretization error? If time step is big and a big force is applied to a point, it will fly far, to a greater distance than it were from the equilibrium position. So, on the next step the force is stronger, the point flies to even greater distance. And the system goes crazy, points just fly chaotically with increasing amplitude. Smaller step makes all force curve smooth, error is small, and system works correctly.

So, instead of a big step, we perform 40 small steps. This makes computations precise. And that allows to use bigger forces without risking to fall into instable state. Bigger forces are good, they turn a wobbly exploding spaghetty into a nice strong hair.

The data of the points we use to model hair are stored in gpu side memory as a buffer, which is a random access one dimensional array. You remember, each thread has id parameter, that has thread's 2d index. We went the thread-per-point approach, so the index addresses one of the points. id.x is a number of hair, and id.y is point's index, so we access the point this way: hairNodesBuffer[id.x * nNodesPerHar + id.y]

You also remember, that total number of threads a kernel is runned in depends on Dispatch() method parameters on CPU side and [numthreads()] directive parameters in shader code. In our case all kernels that work with the points have [numthreads(16,8,1)] directive. So, Dispatch() method's parameters must be defined the way to provide total amount of threads not less than the amount of points. Here's how we calculate parameters for Dispatch() method:

nHairThreadGroups = (nHairs - 1) / 16 + 1;
nNodesThreadGroups = (nodesPerHair - 1) / 8 + 1;

The relationship between [numthreads()] and Dispatch() comes from GPU architecture. The first one is the number of threads in a thread group. The sedond one is the number of thread groups. Their ration affects how fast till all kernels be executed. If we need 1024 threads by x axis, it's better to have 32 thread groups and 32 threads per group. Better than 1024 threads and one group. To answer the "why" question, I'd have to describe too much about GPU architecture, so, let's just not waste our time on that. For practical reasons there's one advice: try to keep number of threads and number of thread groups about even.

The actual physics code in the shader

Alright, so 40 times per update we run the kernels that handle all physics between the points. Let's have a closer look to each of those kernels. There's nothing fancy, we just need to learn how to deal with buffer data.

"calc" kernel calculates forces between the points. The points are stored sequentially in "hairNodesBuffer" buffer. First point of first hair first, then second point, then all of them till the last point of the first hair, then first point of the second hair. And so on for all points of all hairs. id.x is hair's number, id.y is point's number, so this is how we actuall get the point in shader code:

int nodeIndex, nodeIndex2;
hairNode node, node2;
nodeIndex = id.x * nNodesPerHair + id.y;
nodeIndex2 = nodeIndex + 1;
node = hairNodesBuffer[nodeIndex];
node2 = hairNodesBuffer[nodeIndex2];

nNodesPerHair is a constant we've set on init from GPU side. You can see that buffers data has been copied to local variables node and node2. It's done becasue reading from buffer may take more inner core cycles than reading local variable, so we just save some computational timy by using local variables for data reads. The actual algorithm is the following: for each point (if it's not hair's last) we calculate the force between this point and the next one. And we wirte this force to the current point and reversed force value to the next point.

There's an important thing to note: each thread modifies two points: the current one and the next one. Which means every point is being modified by two parallel threads. This might lead to data lose. Let me explain this. If we use a normal increment like this:

variable += value;

this operation might happen in the same time. First thread will copy data, add value, but before it write the result back, the second thread will copy the data. Then the first thread will write the modified data back. Then the second thread will write the result too. Even though both threads did implement, the result will only have one increment. TO avoid this we must use protected writing. HLSL has some functions for protected writing, they guarantee that every thread's data change will be performced.

But there's a little problem: those protected writing functions only work for int or uint data types. And that is the reason we used int type for dvx and dvy variables in our point data struct. It allows us to let many threads write to those variables, and each thread's contribution will be safely applied. But we have to store float numbers as int, right? To avoid losing precision, we are using multipliers, to project small range float values to the whole range of int data type. We do lose some precision though, but it's negligibly small.

Protected writing looks like this:

InterlockedAdd(hairNodesBuffer[nodeIndex].dvx, (int)(F_TO_I * (dv.x + 2 * dvFlex.x)));
InterlockedAdd(hairNodesBuffer[nodeIndex].dvy, (int)(F_TO_I * (dv.y + 2 * dvFlex.y)));
InterlockedAdd(hairNodesBuffer[nodeIndex2].dvx, (int)(F_TO_I * (-dv.x - dvFlex.x)));
InterlockedAdd(hairNodesBuffer[nodeIndex2].dvy, (int)(F_TO_I * (-dv.y - dvFlex.y)));

F_TO_I is the multiplier I mentioned to project float to int. dvFlex is the un-bending force we calculated. "(int)" cast is used to explicitly cast the result to int, becasue InterlockedAdd() function is reloaded for int and uint, and by default interprets data as uint, which we don't like, because there are negative values used.

"velShare" kernel is similar to the previous one, dvx and dvy of each particle are being modified here, but instead of forces here we calculate velocity exchange between the neighboring points

In "interactionWithColliders" points do not interact with each other. Each point here `checks every collider from colliders buffer. So evem if there's some interaction with a collider, every thread only writes to one and only one point, and therefore we don't need to use protected writing functions. But points should add their impulse to the collider they have interacted with, and there might be many threads that need to write impulse change to the collider data, so we will use protected writing to write impulses to the colliders.

A note: when we project float to int, the greater the multiplier, the better precision will be. But greater precision means less value range. Because there's only 32 bit in te integer, and multiplier decides how many bits work for value range, and how many for precision. For the points we picked an appropriate multiplier, that allows good range and enough precision. But when we write impulses to the collider, we don't know how much range we will need. Hundreds of points might add their impulses to the collider. So we used a much smaller multiplier for this case. We lost precision, but won value range.

After all interactions between the points are calculated, in "calcApply" kernel we add impulse to velocity, and velocity to position. Also, int this kernel first point of each hair is being positioned relatively to the system of all hairs position. Also in this carnel gravity is being added to vertical component of velocity. And finally, velocity is being multiplied by a factor slightly less than 1 to model slowing the hairs down because of friction with air.

A note: in "calcApply" kernel velocity affects position through "sPosRate" factor. It defines simulation's time step size. It is set up on CPU side by a variable named "simulationSpeed". The bigger this parameter is, the faster the model will evolve in time but the bigger will be error. As I mentioned before, big error limits strength of forces between points. So we used small step size, which provided higher precision of calculations, which allowed higher forces, which defined more realistic behavior of simulated model.

Strength of force is defined be "dVelRate" factor. It defines how much interaction between the points will affect their velocities. This factor is big, it's defined on CPU side in the "strengthOfForces" variable.

Now let's get back to doShaderStuff() method on CPU side. After doing the 40 cycles loop block of kernels colliders data is being read from GPU:

circleCollidersBuffer.GetData(circleCollidersArray);

The structure we used to contain collider related data to pass it to GPU has dvx and dvy parameters, which are being used to store the impulses from hair points each collider gets. After we have read colliders data from GPU we use those parameters simply as a force vector to be applied to collider's rigidbody. This force is being added to rigidbody in FixedUpdate(), because physics system is synchronized with it. But the data is being read from GPU in Update() method. Update() and FixedUpdate() are not synchronized with each other, so two Updated() may happen after one FixedUpdate() and vice verse. This means some data about hairs->colliders impulses might be rewritten before it's used. Or same data used twice. But it's not important actually. We could add some measures to fix this little issue, but we didn't, no reason to add extra complexity to this simple example.


A note: GetData() method stalls graphics pipeline, which slows program's flow. There's no asynchronous version of this method in Unity yet. But rumors are it will be implemented in 2018. But for now you shoud know: if you need to get data from GPU, the program will work 20-30% slower. In the same time SetData() method doesn't have this effect and works fast.

Visualization

The rest of the kernel we run in doShaderStuff() method are all about visualization.

Let's see how is the visualization done.
On CPU side we have declared RenderTexture variable (don't forget to set enableRandomWrite = true), and this texture is used as mainTexture for Image component's material. Also, SetTexture() method is called for the kernel that writes pixels to this texture, to bind shader side texture and CPU side texture. This is what we have done on CPU side:

RenderTexture renderTexture;
renderTexture = new RenderTexture(1024, 1024, 32);
renderTexture.enableRandomWrite = true;
renderTexture.Create();
GameObject.Find("canvas/image").GetComponent<UnityEngine.UI.Image>().material.mainTexture = renderTexture;
_shader.SetTexture(kiPixelsToTexture, "renderTexture", renderTexture);

On GPU side there's RWTexture2D<float4> variable declared, and rendering kernel writes pixels to this texture:

RWTexture2D<float4> renderTexture;

And this is the kernel used to clear texture before rendering the next frame.

#pragma kernel clearTexture
[numthreads(32,32,1)]
void clearTexture (uint3 id : SV_DispatchThreadID){
renderTexture[id.xy] = float4(0, 0, 0, 0);
}

On CPU side this kernel is being runned like this:

_shader.Dispatch(kiClearTexture, 32, 32, 1);

We can see this gives us 1024x1024 threads, thread per pixel. Which again is convenient, so we simply use id.xy to reference thread's pixel.

How exactly are the hairs being drawn? They are nothing but lines betwen the points. I decided to make the hair semitransparent, so if they intersect, the color would be brighter. This again requires protected writing, because two threads can write to the same pixel. Also, our texture has float4 data type: RWTexture2D<float4>, so instead of writing to it, I had to use a buffer of int, and perform protected writing to it.

After "visInternodeLines" kernel wrote all lines, we just copy that int buffer's data to the actual texture in another kernel. In this visualization method no colors have been used, just shades of white. But there wouldn't be any troubles with the color, we could use int4 as buffer's type and save color, or color could be encoded in one int as four 8 bit colors.

Byt the wey this whole render method with render texture doesn't work on Macs, and I couldn't get the answer to the "why" question on official unity forums. There are other render methods, for example, hair points could control mesh vertexes. But I haven't learned these methods yet.

After "pixelsToTexture" kernel have modified the texture, we see the hair on the screen.

And this completes our tutorial. I explained all code involved into GPU computation from our little example. And you can see it's alot of information. If you plan to experiment further in this area, I recommend to write a simple GPU computing program to test your fresh knowledge. Would be useful. Just write a shader with only one kernel, that would replace each number from a buffer with its power of two. On CPU side prepare the buffer and write its data to GPU. Then run the kernel, get the result from GPU and check if it worked as intended. If you manage to create this little program, the tutorial did its best for you and you can go further.

Vasya
Programmer, game designer - Programmer
2
Comments