Psychopath Renderer

a slightly psychotic path tracer

Drawbacks of Geometry Caches

At the end of my post about ray reordering I alluded to some problems I ran into using geometry caches and ray reordering. In this post I'm going to talk about the problem I ran into with geometry caching.

Geometry caching in and of itself isn't particularly problematic. But it can become a problem if you're trying to share the cache between multiple threads. The issue is that with an LRU cache, even just reading from the cache is a write operation, because you have to move the read item to the front of the cache. And that means locking. And locking means thread contention.

I think I was actually reasonably smart in the way I wrote the cache. The cache treats the cached data as read-only, so it only needs to be locked long enough to move the accessed item to the front and then fetch a shared pointer to the data. That is a very short period of time, certainly far less than the tracing done with the geometry afterwards.
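To make that concrete, here's a rough sketch of the locking pattern I'm describing (simplified, and not Psychopath's actual code; the types are illustrative):

```cpp
// A shared LRU geometry cache: the mutex is held only long enough to bump
// the accessed entry to the front of the LRU list and grab a shared_ptr.
// The geometry itself is treated as read-only, so tracing against it
// afterwards needs no lock at all.
#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>

template <typename Key, typename Geo>
class SharedLruCache {
public:
    std::shared_ptr<const Geo> get(const Key& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end())
            return nullptr;  // Miss: the caller dices and calls put().
        // Even a read mutates the cache: move the entry to the front.
        lru_.splice(lru_.begin(), lru_, it->second);
        return it->second->second;
    }

    void put(const Key& key, std::shared_ptr<const Geo> geo) {
        std::lock_guard<std::mutex> lock(mutex_);
        lru_.emplace_front(key, std::move(geo));
        map_[key] = lru_.begin();
        // (Eviction of the least-recently-used entry when over capacity
        // is omitted for brevity.)
    }

private:
    using Entry = std::pair<Key, std::shared_ptr<const Geo>>;
    std::mutex mutex_;
    std::list<Entry> lru_;
    std::unordered_map<Key, typename std::list<Entry>::iterator> map_;
};
```

Note that every get() still takes the lock, however briefly. That's the thread contention problem in a nutshell.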

The result of that cleverness was that running on eight cores only sped up rendering by about 5-6x. That's nothing to sneeze at, considering that the cache was being accessed for every ray-object intersection test. But it was a clear indication that it wouldn't scale well with too many more cores. And on eight cores, you really want close to an 8x speed-up if you can get it.

I was able to improve the situation further by better exploiting ray reordering. Instead of accessing the cache for every ray-object test like I was before, I just accessed it once for an entire batch of rays being tested against an object. This gave a 7x speed-up over a single-core render on most of my test scenes. Again, quite good. But still, how many more cores would it scale well with?

But even worse, on one of my test scenes it was still only about a 6x speed-up. The reason, as it turned out, was that this scene was much more complex, with lots of very small objects. When the objects are smaller, fewer rays are queued against each individual object, and therefore cache access isn't amortized over as large a batch of rays. And, in theory, with smaller and smaller objects that problem could get arbitrarily bad.

So I wanted to push it even further. To do this, I thought of two basic approaches:

  1. Eliminate the locking by giving each thread its own (smaller) thread-local cache.
  2. Eliminate the cache entirely.

In the end, I decided to take the latter approach. That might sound extreme, but in the end the geometry cache wasn't actually giving that much of a performance boost. Ray reordering without the cache was really only about 5-10% slower on most scenes, and removing the cache resulted in enough of a speed-up on eight cores to make up for that (though I don't recall the exact number... I don't have my 8-core machine on hand as I'm writing this). Moreover, removing the cache simplified the code quite a bit and eliminated the cache's memory footprint. And, most importantly, it did away with the problem of many small objects entirely, ensuring good scaling independent of scene geometry.

To be totally honest, I'm actually mixing up the development timeline a bit: I didn't disable the cache in committed code until after I'd moved away from ray reordering as well, which I'll talk about in a later post. But nevertheless, in the end I don't think a shared geometry cache is a scalable approach. I may revisit the idea of thread-local caches at some point. But for now, I don't think Psychopath needs them.

Ray Reordering

In my last rendering architecture post I talked about using geometry caching to ray trace micropolygons. But Psychopath didn't stay there.

It wasn't long after making the first micropolygon renders that I stumbled across the paper "Two-Level Ray Tracing with Reordering for Highly Complex Scenes" by Hanika et al. Instead of using a geometry cache to ray trace diced surfaces, Hanika et al. used something called "ray reordering".

With standard ray tracing you trace one ray at a time, doing all of its intersection tests against all relevant parts of the scene before moving on to the next ray. But with ray reordering you instead queue up many rays at once and attempt (as best you can) to explicitly change the order of the ray tests, with the goal of testing many rays against the same part of the scene at once.

There are many different ways to go about this, but typically it involves some mechanism that lets you pause a ray's traversal through the scene. If you have such a mechanism, you can then pause a ray just before it's tested against an object in the scene. Once the ray is paused, you can move on to other rays in the hopes that some of them will also be paused before that same object. When you have enough rays paused before an object, you can unpause all of them together and do the ray tests for that object all at once.
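Here's a highly simplified sketch of the pause-and-queue idea (this is just to illustrate the concept; the real scheme in the Hanika et al. paper is more involved):

```cpp
// Rays arriving at an object get parked in that object's queue instead of
// being tested immediately.  When the queue is full, the whole batch is
// tested against the object at once.
#include <cstddef>
#include <vector>

struct Ray { /* origin, direction, nearest hit so far, etc. */ };

struct Object {
    std::vector<Ray*> queued;  // Rays paused just before this object.

    void intersect_batch(std::vector<Ray*>& rays) {
        // Test every queued ray against this object's geometry at once
        // (details omitted).
        (void)rays;
    }
};

// Called during traversal when `ray` reaches `object`: pause the ray, and
// flush the queue only once enough rays have accumulated.
void enqueue_ray(Object& object, Ray& ray, std::size_t batch_size) {
    object.queued.push_back(&ray);
    if (object.queued.size() >= batch_size) {
        object.intersect_batch(object.queued);
        object.queued.clear();  // The rays resume traversal from here.
    }
}
```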

Doing things this way definitely involves some overhead, but the benefit is that—especially if you queue a large number of rays—you end up accessing scene data in a much more coherent way. And that allows you to do some really cool things.

Classically, ray reordering has been used to ray trace scenes that are larger than available RAM: by explicitly trying to access the scene in a coherent way, you can drastically reduce paging to/from disk.

But really, ray reordering can be applied to any situation where accessing scene elements is expensive. It allows you to distribute the cost of that access among many rays. And in the case of Hanika et al., they applied this to dicing surfaces into microgeometry. If you re-dice a surface for every single ray test, that's obviously going to be prohibitively slow. But if you re-dice a surface for a whole batch of rays at a time, then the amortized cost can be quite reasonable.
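A sketch of why the amortization works (names like dice_surface() are stand-ins, not Psychopath's real API):

```cpp
// The expensive dicing step runs once per flushed batch, so its cost is
// shared across every queued ray instead of being paid per ray test.
#include <vector>

struct Ray {};
struct DicedGeo {};

DicedGeo dice_surface() { return {}; }    // Expensive.
void intersect(Ray&, const DicedGeo&) {}  // Cheap, per ray.

void flush_batch(std::vector<Ray>& batch) {
    const DicedGeo geo = dice_surface();  // Paid once for the whole batch...
    for (Ray& ray : batch)
        intersect(ray, geo);  // ...and amortized over all of these tests.
}
```

With a batch of 1000 rays, the per-ray share of the dicing cost is a thousandth of what it would be if each ray re-diced the surface itself.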

In a sense, this is just doing the reverse of what a geometry cache does. A geometry cache saves up geometry to (hopefully) be tested against multiple rays. Ray reordering saves up rays to be tested as a (hopefully) large batch against some geometry.

For the second iteration of Psychopath, I did both. I kept the geometry cache, but I also implemented ray reordering as described in the Hanika et al. paper. And the results were interesting.

It turned out that geometry caching and ray reordering complemented each other nicely, picking up the slack for each other in the areas where each was weak. The ray reordering prevented the hard performance cliff that would happen when the geometry cache was too small (essentially helping to access the cache in a more coherent way). And the geometry cache helped reduce the amount of dicing when the ray reordering couldn't extract enough coherence from the ray batch.

So all-in-all it seemed like this was a great combination. But it turned out there were some pretty severe drawbacks that I didn't realize until I got further along in development. The drawbacks weren't so much from doing both in combination, but rather were individual drawbacks of each approach.

But I'll talk about that in another post.

Glossy Fixed! And the GTR Microfacet BRDF

So, I've fixed the obvious bug in the glossy BRDF from my last post.

GTR Glossy

The bug was in the Fresnel calculation. I was doing a dot product with one of the vectors unnormalized when it was supposed to be normalized. This caused the dot product to be greater than one in some areas, which made the result of the whole equation negative. Fun!
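For illustration, here's roughly what the fixed calculation looks like, using Schlick's approximation as an example (a sketch with made-up vector names, not Psychopath's actual code):

```cpp
// Schlick's Fresnel approximation.  The fix: normalize before the dot
// product (and clamp, for good measure) so cos_theta stays in [0, 1].
#include <algorithm>
#include <cmath>

struct Vec3 { double x, y, z; };

double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Vec3 normalize(Vec3 v) {
    const double len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

double schlick_fresnel(Vec3 w, Vec3 half_vec, double f0) {
    const double cos_theta =
        std::min(1.0, std::max(0.0, dot(normalize(w), normalize(half_vec))));
    return f0 + (1.0 - f0) * std::pow(1.0 - cos_theta, 5.0);
}
```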

The glossy BRDF still isn't 100% there yet. I'm still struggling to figure out what I'm doing wrong with one part of the equation (for now I'm just leaving that part out, and it still looks believable—when I include it, things explode with brightness, so I must be doing something wrong). But that will be for another post.

But still! Looks cool!

Now that I have it more-or-less working, I want to talk briefly about what makes this particular glossy BRDF cool.

As I mentioned before, this BRDF uses the GTR (or Generalized Trowbridge-Reitz) microfacet distribution. Like most glossy BRDFs, it has a parameter for how blurry the reflections are. But unlike most glossy BRDFs, it also has a parameter for the shape of the blurriness. And when I say "shape" I don't mean like stars or squares or circles. Rather, I mean this:

GTR Glossy Tails

This image has exactly the same parameters as the one above, except for the shape parameter. Notice that the blurry reflections are still about the same blurriness, and the central hot-spots of the highlights are still roughly the same size (give or take); they just have a wider trailing blur around them. Basically, the shape parameter makes the reflection more or less "foggy".

It's really cool, because it lets you convincingly achieve a much wider range of materials. Incidentally, when the shape parameter is set to 2.0, the BRDF is identical to GGX.
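For the curious, here's a sketch of the GTR distribution term itself, as presented in Disney's course notes (just the D term, not the full BRDF, and glossing over the special-case normalization constants):

```cpp
// The GTR normal distribution function.  `alpha` controls blurriness and
// `gamma` is the tail-shape parameter; gamma == 2.0 reduces this exactly to
// the GGX distribution.  The normalization constant below assumes
// gamma != 1 and alpha != 1 (each of those needs its own special case).
#include <cmath>

double d_gtr(double cos_theta_h, double alpha, double gamma) {
    const double pi = 3.14159265358979323846;
    const double a2 = alpha * alpha;

    // Normalization constant (for gamma != 1, alpha != 1).
    const double c = ((gamma - 1.0) * (a2 - 1.0))
                     / (pi * (1.0 - std::pow(a2, 1.0 - gamma)));

    // alpha^2 * cos^2(theta_h) + sin^2(theta_h), rewritten via cos^2 alone.
    const double denom = 1.0 + (a2 - 1.0) * cos_theta_h * cos_theta_h;

    return c / std::pow(denom, gamma);
}
```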

Glossy Render

Everything still uses a single shader, but I've now implemented a much better glossy BRDF.

GTR Glossy

It's based on the "Generalized Trowbridge-Reitz" or GTR microfacet distribution presented in Disney's Principled BRDF paper. GTR is pretty much identical to GGX, except that it has an additional parameter to tweak the tail of the specular highlight.

There are still some bugs in the implementation (see, for example, the black spot on the lip of the teacup). And I'm still struggling to figure out certain things about microfacet BRDFs in general. But it's more-or-less working, with proper importance sampling and everything.

I'm especially pleased to see it working in the context of the curved surfaces, since they're rendered using micro-geometry. I'm not seeing any artifacts originating from that.

First Phong Render

So, this has been a long time coming: Psychopath's first non-Lambert-shaded render.

First Phong

The BSDF itself is totally throw-away code (the importance sampling isn't good at all, hence the noise). But it's done via the BSDF system I've put in place. And it works! So, yay!

There's still lots more work to do to get a proper shader system working. You may notice that all of the surfaces still use the same hard-coded BSDF, for example.

But nevertheless! I'm excited.

Although it does feel kind of silly to be so excited about this, given that Phong shading is such a basic thing. For most renderers this is like "Hello World". But for Psychopath, with its weird-ass ray tracing kernel, it's taken a while to get here.

Low Discrepancy Sequences

As a follow-up to my post about random numbers, I'd like to ramble a bit about low discrepancy sequences.

Low discrepancy sequences (or LDS's as I'll refer to them) are deterministic sequences of points in an n-dimensional space that attempt to be as well distributed and evenly spaced as possible. That's quite a mouthful, but the high-level idea is pretty simple. For a more complete introduction to LDS's I recommend reading chapter 7 of Physically Based Rendering: From Theory to Implementation by Matt Pharr and Greg Humphreys (it's an amazing book, by the way, which you should totally buy if you're at all interested in 3D rendering).

But, suffice it to say, LDS's are really useful. In many contexts they can be used as a drop-in replacement for random numbers to reduce variance. In my case, with a unidirectional path tracer like Psychopath, if I plug in an LDS where I would otherwise use random numbers, I get some nice reductions in noise.

Random numbers @ 16spp:

Low discrepancy sequence @ 16spp:

Notice the significant reduction in variance in the LDS render compared to the random one. But also notice the patterned appearance of the variance in the LDS render, especially around the edges of shadows. This is an artifact of using the LDS directly.

Another drawback of using LDS's directly is that you have to decide ahead of time how many samples you're taking. A lot of the time this doesn't matter. But if you're doing progressive rendering or adaptive sampling that can be a real limitation.

Fortunately, there is a single solution that addresses both of these problems: decorrelate the samples between pixels. The idea is that each pixel still draws from a single continuous LDS sequence, but the pixels are each randomized with respect to each other. That way you still have good convergence within each pixel, but each pixel is random with respect to all other pixels.

One popular way to do this is called "Cranley Patterson Rotation". The idea behind Cranley Patterson Rotation is to randomly generate a vector, and use that vector to offset the values of all the LDS points. To keep the resulting values within the interval [0,1), you apply the offset modulo 1.0, so that values greater than 1 wrap back around to start from zero again.
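In code, it looks something like this (a minimal sketch; the radical inverse here is a plain unscrambled Halton-style sequence just to keep the example self-contained, not the sequence Psychopath uses):

```cpp
// Cranley Patterson Rotation on a simple LDS.
#include <cmath>

// The i-th point of the base-`base` van der Corput / Halton sequence.
double radical_inverse(unsigned i, unsigned base) {
    double f = 1.0, result = 0.0;
    while (i > 0) {
        f /= base;
        result += f * (i % base);
        i /= base;
    }
    return result;
}

// One dimension of one sample, rotated by this pixel's random offset.
double cp_rotated_sample(unsigned index, unsigned base, double pixel_offset) {
    const double x = radical_inverse(index, base) + pixel_offset;
    return x - std::floor(x);  // Modulo 1.0: wrap back into [0, 1).
}
```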

If you do Cranley Patterson Rotation using the same vector for all pixels, you don't gain anything. But if you use a different random vector for each pixel then you get a noisy non-patterned image that still has less noise and converges faster than a truly randomly sampled image.

But Psychopath doesn't use Cranley Patterson Rotation. It uses a different technique to accomplish essentially the same thing.

Some LDS sequences are endless. You can keep drawing points from them forever, and they will never run out. Such LDS's have an interesting property: any contiguous subset of the sequence is also an LDS. A cool implication of this is that you can start drawing samples from anywhere in the sequence, and it's just as good as if you started at the beginning.

Psychopath exploits that by giving each pixel a random offset into the LDS. So, for example, the first pixel might have an offset of 42, meaning that it starts drawing samples at the 42nd sample. The second pixel might have an offset of 518,364,163, meaning that it starts drawing samples at the 518,364,163rd sample. Both pixels will get equally good samples, but they are both totally random with respect to each other.
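Sketched in the same toy setting as the Cranley Patterson example above (reusing radical_inverse(), which again is just a stand-in for the Faure-permuted Halton sequence Psychopath actually uses):

```cpp
// Each pixel gets one random offset into the endless sequence, and then
// simply draws consecutive samples starting from that offset.
#include <random>

struct PixelSampler {
    unsigned offset;  // This pixel's random offset into the sequence.

    // The i-th sample for this pixel, in the dimension given by `base`.
    double sample(unsigned i, unsigned base) const {
        return radical_inverse(offset + i, base);
    }
};

// Usage: seed one offset per pixel, e.g.
//   std::mt19937 rng(pixel_id);
//   PixelSampler sampler{rng()};
//   double x = sampler.sample(0, 2);  // First sample, first dimension.
```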

The end result is this (@16spp):

It's slightly higher variance than using the LDS sequence directly (but still much lower variance than random sampling), but it has the same convergence properties and it avoids the pattern artifacts. It also allows us to draw endless samples for each pixel without worrying about the number of samples being drawn for other pixels, which is a big win if you want to do e.g. adaptive sampling.

The main benefit of this technique over Cranley Patterson Rotation is reduced computation: with this technique you only need to generate a single random value per pixel, and do a single integer addition per sample. With Cranley Patterson Rotation you have to generate an entire vector of random values per pixel, and do as many floating point additions and modulo operations per sample.

The main drawback compared to Cranley Patterson Rotation is that this only works with endless LDS's. But in practice that's not a significant restriction, because most practical LDS's (at least that I'm aware of) are endless.

Finally, credit and a huge thanks must go to Leonhard Grünschloß for the LDS code I'm using. Psychopath uses his Faure-permuted Halton sequence, which is available on his site. I am extremely grateful to him for making his LDS code freely available, because I am, frankly, not smart enough to implement anything more complex than a simple Halton sequence.