
Psychopath Renderer

a slightly psychotic path tracer

2014-08-02

Drawbacks of Ray Reordering

Now that I've covered the drawbacks I ran into with the geometry cache, I'd like to talk about the issue I ran into with ray reordering.

Ray reordering worked extremely well in pretty much every way. It was actually quite fast, it wasn't too complex or difficult to implement, and I was even able to do a pretty nice QBVH implementation thanks to the awesome paper Stackless Multi-BVH Traversal for CPU, MIC and GPU Ray Tracing by Áfra et al.

But the big drawback of ray reordering is that you have to keep ray state small. You have to be able to pause a ray's traversal, which means storing its current traversal state so that you can resume tracing later. And you are typically tracing thousands or even millions of rays at a time, so that state data can't be large.

With a typical BVH this is relatively straightforward because it's a tree structure. You never have to store where you came from in the tree because it's implicit. No matter where you are in the BVH, you can just go to the parent node, and then the parent's parent node, and so on.

Or put into different terms, it's a bit like writing an algorithm for walking a maze. There is a classic maze-solving algorithm that doesn't require any state to be stored other than the current position: all you do is follow the wall on your right. And this works great as long as the maze is simply connected. By analogy, that's more or less how Psychopath handled things, and it allowed ray state to be very small.
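To make the "implicit path" idea concrete, here's a toy sketch (purely illustrative Python, nothing like Psychopath's actual code): if every node knows its parent, then the path from any node back to the root is recoverable on demand, so a paused ray only needs to remember the single node it stopped at.

```python
# A minimal sketch of why tree traversal state can stay tiny:
# with parent links, the path back to the root is implicit,
# so pausing a ray requires storing only its current node.

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def path_to_root(node):
    """Recover a node's ancestry without any stored stack."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

# Build root -> a -> b; "pausing at b" needs only b itself.
root = Node()
a = Node(root)
b = Node(a)
assert path_to_root(b) == [b, a, root]
```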

Then I decided to introduce instancing (and full, arbitrarily deep, hierarchical instancing at that). When I did this, everything that allowed ray reordering to work came crashing down.

The fundamental problem is that when you introduce instancing you change the BVH from being a tree structure to being a DAG structure. Many nodes will then have more than one parent. Or to use the maze analogy again, it's like introducing loops and overpasses into the maze. When you do that, suddenly you have to start recording additional information about your traversal to do it correctly. And worse, that information can grow to be arbitrarily large.
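Here's a tiny sketch of that failure mode (hypothetical Python, with made-up instance names): once two instances share a child, the child alone no longer tells you how you got there, so a paused ray must carry an explicit stack of instance entries, and that stack grows with the instancing depth.

```python
# A sketch of why instancing breaks the tiny-state trick: a shared
# subtree has multiple "parents" (one per instance), so traversal
# must record which chain of instances it came through.

def leaf_paths(node, stack=()):
    """Enumerate root-to-leaf paths through a DAG of instances.
    Each path is exactly the state a paused ray would have to carry."""
    stack = stack + (node["name"],)
    kids = list(node.get("children", []))
    if "child" in node:
        kids.append(node["child"])
    if not kids:
        return [stack]
    paths = []
    for k in kids:
        paths += leaf_paths(k, stack)
    return paths

shared_leaf = {"name": "leaf"}          # one object, instanced twice
scene = {"name": "root", "children": [
    {"name": "instance_a", "child": shared_leaf},
    {"name": "instance_b", "child": shared_leaf},
]}

# Same leaf, two distinct traversal states -- and with hierarchical
# instancing the tuples can grow arbitrarily long.
assert leaf_paths(scene) == [("root", "instance_a", "leaf"),
                             ("root", "instance_b", "leaf")]
```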

When tracing thousands or millions of rays at a time, having unbounded storage for each ray is really not feasible. It also introduces memory allocation into the traversal process, which isn't ideal.

To get around this I considered limiting instancing to being a single level deep. That would put simple bounds on the amount of state information. But I really, really wanted hierarchical instancing. As a production feature I think it's really useful (imagine instancing leaves on a tree, and then instancing trees in a forest).

So I banged my head against the wall for a week or so, trying to figure out if there was a way I could reasonably make hierarchical instancing work with ray reordering. In practice, people aren't actually going to make arbitrarily deep instancing hierarchies, so I considered exploiting that in some way. But ultimately all the solutions I came up with seemed overly complex and/or brittle. I wanted things to Just Work™.

In the end, I abandoned ray reordering for something called breadth-first ray tracing. I had already been using breadth-first ray tracing in Psychopath's surface splitting code, so I had a reasonably good handle on it. And it had some really cool properties all its own. But I'll talk about that in another post.

2014-08-01

Drawbacks of Geometry Caches

At the end of my post about ray reordering I alluded to some problems I ran into with using geometry caches and ray reordering. In this post I'm going to talk about the problem I ran into with geometry caching.

Geometry caching in and of itself isn't particularly problematic. But it can become a problem if you're trying to share it between multiple threads. The issue is that with an LRU cache even just reading from the cache is a write operation, because you have to move the read item to the front of the cache. And that means locking. And locking means thread contention.

I think I was actually reasonably smart in the way I wrote the cache. The cache treats the cached data as read-only, so it only needs to be locked long enough to move the accessed item to the front and then fetch a shared pointer to the data. That is a very short period of time, certainly far less than the tracing done with the geometry afterwards.
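The pattern looks something like this toy sketch (an assumed structure in Python, not Psychopath's actual C++ cache): even a lookup takes the lock, but only long enough to reorder the entry and hand back a reference to the immutable geometry, so the actual tracing happens outside the lock.

```python
import threading
from collections import OrderedDict

# A sketch of a shared LRU geometry cache: reads mutate LRU order,
# so even get() must lock -- but only briefly.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:                    # even a *read* locks...
            if key not in self.items:
                return None
            self.items.move_to_end(key)    # ...because it reorders
            return self.items[key]         # return a reference; the
                                           # tracing happens unlocked

    def put(self, key, value):
        with self.lock:
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict LRU entry

cache = LRUCache(2)
cache.put("a", "mesh_a")
cache.put("b", "mesh_b")
cache.get("a")            # touching "a" makes "b" least recent
cache.put("c", "mesh_c")  # evicts "b", not "a"
assert cache.get("b") is None
assert cache.get("a") == "mesh_a"
```

The short critical section keeps contention low, but it's still a single lock that every thread hits on every access, which is exactly what limits scaling below.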

The result of that cleverness was that rendering with eight cores sped up rendering by about 5-6x. That's nothing to sneeze at considering that the cache was being accessed for every ray-object intersection test. But it was a clear indication that it wouldn't scale well with too many more cores. And on eight cores, you really want close to an 8x speed-up if you can get it.
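As a rough back-of-envelope (my own estimate here, and lock contention isn't strictly a serial fraction, so treat this as intuition only): modeling the contended portion with Amdahl's law, a 5.5x speedup on 8 cores implies roughly 6-7% effectively-serial work, which caps scaling badly as core counts grow.

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
# where s is the serial (here: contended) fraction.

def serial_fraction(speedup, cores):
    """Solve Amdahl's law for s given an observed speedup."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)

def amdahl_speedup(s, cores):
    return 1.0 / (s + (1.0 - s) / cores)

s = serial_fraction(5.5, 8)   # the ~5-6x on 8 cores described above
assert 0.05 < s < 0.08        # roughly 6-7% serialized

# The same fraction on a hypothetical 32-core machine:
assert amdahl_speedup(s, 32) < 12   # nowhere near 32x
```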

I was able to improve the situation further by better exploiting ray reordering. Instead of accessing the cache for every ray-object test like I was before, I just accessed it once for an entire batch of rays being tested against an object. This gave a 7x speed-up over a single-core render on most of my test scenes. Again, quite good. But still, how many more cores would it scale well with?

But even worse, on one of my test scenes it was still only about a 6x speed-up. The reason, as it turned out, was that the scene was much more complex, with lots of very small objects. When the objects are smaller, fewer rays are queued against each individual object, and therefore cache access isn't amortized over as large a batch of rays. And, in theory, with smaller and smaller objects that problem could get arbitrarily bad.

So I wanted to push it even further. To do this, I thought of two basic approaches:

  1. Eliminate the locking by giving each thread its own (smaller) thread-local cache.
  2. Eliminate the cache entirely.

In the end, I decided to take the latter approach. That might sound extreme, but in the end the geometry cache wasn't actually giving that much of a performance boost. Ray reordering on its own was really only about 5-10% slower on most scenes, and removing the cache resulted in enough of a speed up on eight cores to make up for that (though I don't recall the exact number... I don't have my 8-core machine on hand as I'm writing this). Moreover, removing the cache simplified the code quite a bit, and removed its memory footprint. And, most importantly, it completely eliminated the problem of many small objects, assuring good scaling independent of scene geometry.

To be totally honest, I'm actually mixing up the development timeline a bit: I didn't disable the cache in committed code until after I'd moved away from ray reordering as well, which I'll talk about in a later post. But nevertheless, in the end I don't think a shared geometry cache is a scalable approach. I may revisit the idea of thread-local caches at some point. But for now, I don't think Psychopath needs them.

2014-07-28

Ray Reordering

In my last rendering architecture post I talked about using geometry caching to ray trace micropolygons. But Psychopath didn't stay there.

It wasn't long after making the first micropolygon renders that I stumbled across the paper Two-Level Ray Tracing with Reordering for Highly Complex Scenes by Hanika et al. Instead of using a geometry cache to ray trace diced surfaces, Hanika et al. used something called "ray reordering".

With standard ray tracing you trace one ray at a time, doing all of its intersection tests against all relevant parts of the scene before moving on to the next ray. But with ray reordering you instead queue up many rays at once and attempt (as best you can) to explicitly change the order of the ray tests, with the goal of testing many rays against the same part of the scene at once.

There are many different ways to go about this, but typically it involves some mechanism that lets you pause a ray's traversal through the scene. If you have such a mechanism, you can then pause a ray just before it's tested against an object in the scene. Once the ray is paused, you can move on to other rays in the hopes that some of them will also be paused before that same object. When you have enough rays paused before an object, you can unpause all of them together and do the ray tests for that object all at once.
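The mechanics can be sketched with a toy example (hypothetical Python, vastly simpler than a real traversal, with made-up object names): rays that reach an object are parked in a per-object queue, and the object is only intersected once its queue is large enough, so all those tests share a single access to its geometry.

```python
from collections import defaultdict

BATCH_SIZE = 3
queues = defaultdict(list)   # object id -> rays paused before it
tested = []                  # (object, batch) pairs, in flush order

def pause_ray(obj, ray):
    """Park a ray at an object; flush when enough have accumulated."""
    queues[obj].append(ray)
    if len(queues[obj]) >= BATCH_SIZE:
        flush(obj)

def flush(obj):
    batch = queues.pop(obj, [])
    if batch:
        # In a real renderer, this is where the object's geometry
        # would be fetched (or diced) once for the whole batch.
        tested.append((obj, batch))

for ray in range(7):
    pause_ray("teapot" if ray % 2 == 0 else "bunny", ray)
for obj in list(queues):     # end of traversal: flush stragglers
    flush(obj)

assert tested == [("teapot", [0, 2, 4]),
                  ("bunny", [1, 3, 5]),
                  ("teapot", [6])]
```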

Doing things this way definitely involves some overhead, but the benefit is that—especially if you queue a large number of rays—you end up accessing scene data in a much more coherent way. And that allows you to do some really cool things.

Classically, ray reordering has been used to ray trace scenes that are larger than available RAM: by explicitly trying to access the scene in a coherent way, you can drastically reduce paging to/from disk.

But really, ray reordering can be applied to any situation where accessing scene elements is expensive. It allows you to distribute the cost of that access among many rays. And in the case of Hanika et al., they applied this to dicing surfaces into microgeometry. If you re-dice a surface for every single ray test, that's obviously going to be prohibitively slow. But if you re-dice a surface for a whole batch of rays at a time, then the amortized cost can be quite reasonable.
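The amortization is simple arithmetic (illustrative numbers, not measurements from the paper or from Psychopath): if dicing dominates the per-ray test cost, the per-ray cost collapses as the batch grows.

```python
# Per-ray cost of testing against a surface that must be re-diced
# per batch: dicing is paid once, tests are paid per ray.

def per_ray_cost(dice_cost, test_cost, batch_size):
    return dice_cost / batch_size + test_cost

# Say dicing costs 1000 units and a single ray test costs 1 unit:
assert per_ray_cost(1000, 1, 1) == 1001    # re-dice per ray: hopeless
assert per_ray_cost(1000, 1, 1000) == 2    # amortized: quite reasonable
```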

In a sense, this is just doing the reverse of what a geometry cache does. A geometry cache saves up geometry to (hopefully) be tested against multiple rays. Ray reordering saves up rays to be tested as a (hopefully) large batch against some geometry.

For the second iteration of Psychopath, I did both. I kept the geometry cache, but I also implemented ray reordering as described in the Hanika et al. paper. And the results were interesting.

It turned out that geometry caching and ray reordering complemented each other nicely, picking up the slack for each other in the areas where each was weak. The ray reordering prevented the hard performance cliff that would happen when the geometry cache was too small (essentially helping to access the cache in a more coherent way). And the geometry cache helped reduce the amount of dicing when the ray reordering couldn't extract enough coherence from the ray batch.

So all-in-all it seemed like this was a great combination. But it turned out there were some pretty severe drawbacks that I didn't realize until I got further along in development. The drawbacks weren't so much from doing both in combination, but rather were individual drawbacks of each approach.

But I'll talk about that in another post.

2014-07-07

Glossy Fixed! And the GTR Microfacet BRDF

So, I've fixed the obvious bug in the glossy BRDF from my last post.

GTR Glossy

The bug was in the Fresnel calculation. I was doing a dot product with one of the vectors unnormalized, when it was supposed to be normalized. This caused the dot product to be greater than one in some areas, which in turn made the result of the whole equation negative. Fun!
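Here's a sketch of the failure mode using Schlick's Fresnel approximation (the post doesn't say which Fresnel form is used, so Schlick is assumed here purely for illustration): the (1 - cosθ)^5 term is only well-behaved for cosθ in [0, 1], which requires unit vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def schlick(f0, cos_theta):
    """Schlick's Fresnel approximation (assumes cos_theta in [0, 1])."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

n = (0.0, 0.0, 1.0)
h = (0.0, 0.0, 2.0)  # un-normalized half-vector, as in the bug

buggy = schlick(0.04, dot(n, h))             # "cos theta" = 2.0
fixed = schlick(0.04, dot(n, normalize(h)))  # cos theta = 1.0

assert buggy < 0.0    # negative reflectance: the bug in action
assert fixed == 0.04  # at normal incidence, Schlick returns f0
```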

The glossy BRDF still isn't 100% there yet. I'm still struggling to figure out what I'm doing wrong with one part of the equation: when I include it, things explode with brightness, so for now I'm just leaving that part out, and it still looks believable. But that will be for another post.

But still! Looks cool!

Now that I have it more-or-less working, I want to talk briefly about what makes this particular glossy BRDF cool.

As I mentioned before, this BRDF uses the GTR (or Generalized Trowbridge-Reitz) microfacet distribution. Like most glossy BRDFs, it has a parameter for how blurry the reflections are. But unlike most glossy BRDFs, it also has a parameter for the shape of the blurriness. And when I say "shape" I don't mean like stars or squares or circles. Rather, I mean this:

GTR Glossy Tails

This image has exactly the same parameters as the one above, except for the shape parameter. Notice that the reflections are still about equally blurry, and the central hot-spots of the highlights are still roughly the same size (give or take); they just have a wider trailing blur around them. Basically, the shape parameter makes the reflection more or less "foggy".

It's really cool, because it lets you convincingly achieve a much wider range of materials. Incidentally, when the shape parameter is set to 2.0, the BRDF is identical to GGX.
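That equivalence is easy to check numerically. Here's a sketch of the two distributions following Burley's "Physically Based Shading at Disney" course notes (my transcription, not Psychopath's code): GTR's denominator exponent gamma is the shape parameter, and at gamma = 2, with the matching normalization constant, it reduces exactly to GGX.

```python
import math

def d_gtr(cos_theta, alpha, gamma, c):
    """GTR microfacet distribution: c / (a^2 cos^2 + sin^2)^gamma."""
    t = alpha**2 * cos_theta**2 + (1.0 - cos_theta**2)
    return c / t**gamma

def d_ggx(cos_theta, alpha):
    """GGX distribution in its usual form."""
    t = cos_theta**2 * (alpha**2 - 1.0) + 1.0
    return alpha**2 / (math.pi * t**2)

alpha = 0.3
c2 = alpha**2 / math.pi  # GTR normalization constant at gamma = 2

# The two denominators are algebraically identical:
#   a^2 cos^2 + sin^2  ==  cos^2 (a^2 - 1) + 1
for cos_theta in (0.1, 0.5, 0.9, 1.0):
    assert abs(d_gtr(cos_theta, alpha, 2.0, c2)
               - d_ggx(cos_theta, alpha)) < 1e-12
```

Lower gamma values (with their own normalization constants) are what produce the wide, "foggy" tails shown above.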