Psychopath Renderer

a slightly psychotic path tracer

Breadth First Ray Tracing

Phew! I'm almost finished with my "catching up to the present" retrospective architectural posts. I may sprinkle a few more retrospective posts here and there, but this is the last for this series. This post will expand upon my first and second ray reordering posts and talk about breadth first ray tracing, which is what eventually replaced ray reordering in Psychopath.

First, let me say that breadth first ray tracing is really cool. Like, really cool. Much like ray reordering, it accesses scene data coherently for a batch of rays. But unlike ray reordering it has strong guarantees about those access patterns. Specifically, each instance of a scene element is guaranteed to be accessed at most once for an entire batch of rays. And, perhaps surprisingly, breadth first ray tracing is also much simpler to implement than ray reordering.

So how does it work?

The core idea behind breadth first ray tracing is to trace an entire batch of rays together. This is subtly different from ray reordering where, although you also have a batch of rays, you are tracing each ray separately—you just use pausing to switch between them at strategic times. But in breadth first ray tracing you really trace all of the rays together, even down to traversing the acceleration structure of the scene.

The most fundamental operation of breadth first ray tracing is something called "ray filtering". Imagine that you have a batch of rays and an object's bounding box. You can test the whole batch of rays against the bounding box together (either with SIMD if it's wide enough, or sequentially), and then create a new sub-batch out of the rays that hit it. Once you have that sub-batch, you can then proceed to test only that sub-batch against the object's geometry.

That process of creating a new batch from the rays that hit something is called "ray filtering". And in breadth first ray tracing you do ray filtering at every step of traversing the scene's acceleration structure.

There are multiple ways to implement ray filtering. Some approaches use partitioning to rearrange the rays into two adjacent groups in memory, storing pointers to the sub-lists. Other approaches actually create new lists of pointers to the individual rays. But the basic principle is the same regardless of how you do it (albeit with different memory behavior).

In Psychopath I'm taking the partitioning approach, because it has lower memory requirements and accesses the rays in memory more coherently. Scanning through a linear array of rays while doing the ray tests is a lot faster than chasing pointers and jumping all over the place in memory, especially with a large number of rays.

You might object that there is still the matter of partitioning, which scans the rays again, potentially incoherently. But, in fact, partitioning algorithms also generally access arrays of data quite coherently. And, moreover, although most papers I've read split the ray testing and partitioning into separate steps, you can easily roll them into a single step.

A good partitioning algorithm only executes its predicate once for each item in the array being partitioned. C++'s std::partition, for example, makes this guarantee for random-access containers. And this means that the ray test itself can be used as the predicate.

Pseudo-code for the ray testing and partitioning rolled into one might, then, look something like this:

bbox = get_bounding_box()
rays = get_ray_batch()
filtered_rays = partition(rays, bbox.test_ray)

Where partition is a function that partitions a list (in this case rays) based on a unary predicate (in this case bbox.test_ray) and returns a reference to the part of the list that tested true.

This is almost frighteningly simple.
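
And it maps almost directly to real code. Here's a minimal C++ sketch of the same idea, using std::partition with the ray/box test as the predicate (the Ray and BBox types here are simplified stand-ins, not Psychopath's actual ones):

#include <algorithm>
#include <array>
#include <utility>
#include <vector>

struct Ray {
    std::array<float, 3> o;      // origin
    std::array<float, 3> inv_d;  // precomputed 1.0 / direction
    float max_t;                 // current maximum hit distance
};

struct BBox {
    std::array<float, 3> min, max;

    // Standard slab test: returns true if the ray overlaps the box.
    bool test_ray(const Ray& r) const {
        float t_near = 0.0f;
        float t_far = r.max_t;
        for (int i = 0; i < 3; ++i) {
            float t0 = (min[i] - r.o[i]) * r.inv_d[i];
            float t1 = (max[i] - r.o[i]) * r.inv_d[i];
            if (t0 > t1) std::swap(t0, t1);
            t_near = std::max(t_near, t0);
            t_far = std::min(t_far, t1);
        }
        return t_near <= t_far;
    }
};

using RayIter = std::vector<Ray>::iterator;

// Ray filtering: test and partition in a single pass.  The rays that hit
// the box end up at the front of the range, and [begin, mid) is the new
// sub-batch, where mid is the returned iterator.
RayIter filter_rays(RayIter begin, RayIter end, const BBox& bbox) {
    // std::partition evaluates the predicate exactly once per ray, so the
    // ray/box test itself can be the predicate.
    return std::partition(begin, end,
                          [&](const Ray& r) { return bbox.test_ray(r); });
}

Note that the sub-batch is described entirely by a pair of iterators into the original ray array, which is what makes the partitioning approach so light on memory.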

To incorporate this into a BVH traversal algorithm you need a stack for the ray batch references, which is fairly straight-forward. Pseudo-code for the entire BVH traversal algorithm might look something like this:

node_stack.push(get_root_bvh_node())
rays_stack.push(get_ray_batch())

while node_stack is not empty:
    node = node_stack.pop()
    rays = rays_stack.pop()
    filtered_rays = partition(rays, node.bbox.test_ray)
    if filtered_rays is not empty:
        if node is leaf:
            node.data.test_rays(filtered_rays)
        else:
            node_stack.push(node.child1)
            node_stack.push(node.child2)
            rays_stack.push(filtered_rays)
            rays_stack.push(filtered_rays)

With node_stack and rays_stack being the stacks for BVH node references and ray batch references, respectively.

This is really simple code, bordering on elegant. Especially when compared to ray reordering where you have to store ray state, pause traversal, un-pause traversal, store which rays are about to hit which nodes, etc.

Looking at this code, it also starts to become clear why breadth first ray tracing can do instancing: we can use stacks during traversal. The node stack in particular lets us keep track of where we've come from and what we need to do next. And because that stack is shared by the entire ray batch, it takes up only a tiny amount of space even for fairly deep stacks. With ray reordering we would need a separate stack for each ray, which would be huge in aggregate.

And it's not just the node stack that is helpful. With breadth first ray tracing we can make any number of shared stacks for any kind of data we might want as long as that data is the same for each ray. So, for example, we can have a stack of transform matrices representing the current transform space of the rays, which is also necessary for instancing.

So that's breadth first ray tracing in a nutshell.

There are, of course, drawbacks to breadth first ray tracing as well. The three biggest drawbacks I've found so far are:

  1. The overhead of creating/partitioning/whatever the ray lists slows things down.
  2. Because the rays traverse the acceleration structure together, they have to traverse it in the same order. That, in turn, means we end up doing a lot of unnecessary ray tests that could be avoided with individual traversal orders. Specifically, we miss out on optimizations like always testing against the closest node first—the closest node for one ray often isn't the closest node for another.
  3. Implementing typical SIMD optimizations like QBVH's etc. becomes a lot trickier and doesn't provide as much benefit because of drawback #1 above.

Point #1 isn't an enormous bottleneck on its own, especially when ray testing is rolled directly into the partitioning step. But the severity depends on how large the ray data structures are: the larger they are, the more costly partitioning becomes. Unfortunately, rays are fairly fat data structures in Psychopath. That's definitely something I want to improve at some point if I can.

Point #2 can be addressed somewhat by first splitting ray batches into separate directionally-ordered batches. But then you also lose some of the traversal coherence, so it's a trade-off.
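
For example, one way to do that split (a hypothetical sketch, not necessarily what Psychopath does) is to partition the rays by direction octant, so that the rays within each sub-batch at least agree on which child of a node is "near" and which is "far":

#include <algorithm>
#include <array>
#include <vector>

struct Ray {
    std::array<float, 3> o;  // origin
    std::array<float, 3> d;  // direction
};

using RayIter = std::vector<Ray>::iterator;

// Split a batch into eight direction octants by partitioning on the sign
// of each direction component in turn.  Each pair of adjacent boundaries
// [bounds[k], bounds[k+1]) is then a directionally coherent sub-batch.
std::array<RayIter, 9> split_by_octant(std::vector<Ray>& rays) {
    std::array<RayIter, 9> bounds;
    bounds[0] = rays.begin();
    bounds[8] = rays.end();

    // Partition on the x sign, then y within each half, then z within
    // each quarter.
    bounds[4] = std::partition(bounds[0], bounds[8],
        [](const Ray& r) { return r.d[0] >= 0.0f; });
    for (int i = 0; i < 2; ++i)
        bounds[i * 4 + 2] = std::partition(bounds[i * 4], bounds[i * 4 + 4],
            [](const Ray& r) { return r.d[1] >= 0.0f; });
    for (int i = 0; i < 4; ++i)
        bounds[i * 2 + 1] = std::partition(bounds[i * 2], bounds[i * 2 + 2],
            [](const Ray& r) { return r.d[2] >= 0.0f; });

    return bounds;
}

The trade-off mentioned above shows up immediately: the eight sub-batches are smaller than the original batch, so each node's bounding box can now get touched up to eight times instead of once.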

Point #3 depends on what you're doing. Most breadth first ray tracing papers I've seen exploit SIMD instructions by testing multiple rays against single bounding boxes, which actually appears to work very well based on those papers. But unfortunately such techniques are not (as far as I know) compatible with efficient motion blur implementations. The fundamental problem is that with motion blur you're essentially testing each ray against a "different" bounding box anyway since it's moving over time.

On the flip side, QBVH's work very well with motion blur, but they're difficult to put into a breadth-first context. It's not impossible, though. I've implemented a breadth-first QBVH traverser successfully by using a small bit stack embedded within each ray. But, on average, it doesn't reduce the partitioning cost and makes the ray data structure larger. Thankfully, it still speeds things up, just not as much as QBVH's with individual ray traversal.

On the whole, though, I'm definitely happy with breadth first ray tracing. It's relatively simple to implement, it gives extremely strong guarantees about scene data access patterns, and it allows for full hierarchical instancing. Definitely a good fit for what I'm trying to do with Psychopath. So unless something better comes along, I expect this will remain a core part of Psychopath's architecture for the foreseeable future.

Drawbacks of Ray Reordering

Now that I've covered the drawbacks I ran into with the geometry cache, I'd like to talk about the issue I ran into with ray reordering.

Ray reordering worked extremely well in pretty much every way. It was actually quite fast, it wasn't too complex or difficult to implement, and I was even able to do a pretty nice QBVH implementation thanks to the awesome paper Stackless Multi-BVH Traversal for CPU, MIC and GPU Ray Tracing by Áfra et al.

But the big drawback of ray reordering is that you have to keep ray state small. You have to be able to pause a ray's traversal, which means storing its current traversal state so that you can resume tracing later. And you are typically tracing thousands or even millions of rays at a time, so that state data can't be large.

With a typical BVH this is relatively straight-forward because it's a tree structure. You never have to store where you came from in the tree because it's implicit. No matter where you are in the BVH, you can just go to the parent node, and then the parent's parent node, and so on.

Or put into different terms, it's a bit like writing an algorithm for walking a maze. There is a classic maze solving algorithm that doesn't require any state to be stored other than the current position: all you do is follow the wall on your right. And this works great as long as the maze is simply connected. By analogy, that's more-or-less how Psychopath handled things, and it allowed ray state to be very small.

Then I decided to introduce instancing (and full, arbitrarily deep, hierarchical instancing at that). When I did this, everything that allowed ray reordering to work came crashing down.

The fundamental problem is that when you introduce instancing you change the BVH from being a tree structure to being a DAG structure. Many nodes will then have more than one parent. Or to use the maze analogy again, it's like introducing loops and overpasses into the maze. When you do that, suddenly you have to start recording additional information about your traversal to do it correctly. And worse, that information can grow to be arbitrarily large.

When tracing thousands or millions of rays at a time, having unbounded storage for each ray is really not feasible. It also introduces memory allocation into the traversal process, which isn't ideal.

To get around this I considered limiting instancing to being a single level deep. That would put simple bounds on the amount of state information. But I really, really wanted hierarchical instancing. As a production feature I think it's really useful (imagine instancing leaves on a tree, and then instancing trees in a forest).

So I banged my head against the wall for a week or so, trying to figure out if there was a way I could reasonably make hierarchical instancing work with ray reordering. In practice, people aren't actually going to make arbitrarily deep instancing hierarchies, so I considered exploiting that in some way. But ultimately all the solutions I came up with seemed overly complex and/or brittle. I wanted things to Just Work™.

In the end, I abandoned ray reordering for something called breadth first ray tracing. I had already been using breadth first ray tracing in Psychopath's surface splitting code, so I had a reasonably good handle on it. And it had some really cool properties all its own. But I'll talk about that in another post.

Drawbacks of Geometry Caches

At the end of my post about ray reordering I alluded to some problems I ran into with using geometry caches and ray reordering. In this post I'm going to talk about the problem I ran into with geometry caching.

Geometry caching in and of itself isn't particularly problematic. But it can become a problem if you're trying to share a cache between multiple threads. The issue is that with an LRU cache, even just reading from the cache is a write operation, because you have to move the read item to the front of the cache. And that means locking. And locking means thread contention.

I think I was actually reasonably smart in the way I wrote the cache. The cache treats the cached data as read-only, so it only needs to be locked long enough to move the accessed item to the front and then fetch a shared pointer to the data. That is a very short period of time, certainly far shorter than the time spent tracing against the geometry afterwards.
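
A sketch of that kind of shared LRU access pattern might look something like this (hypothetical types and names, purely illustrative rather than Psychopath's actual cache code):

#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// The cached geometry itself is treated as read-only, so the lock only
// covers the LRU bookkeeping: move the entry to the front of the list and
// hand back a shared pointer to the data.
template <typename Key, typename Value>
class SharedLruCache {
public:
    std::shared_ptr<const Value> lookup(const Key& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end())
            return nullptr;
        // Even a read is a write: the accessed entry moves to the front.
        lru_.splice(lru_.begin(), lru_, it->second);
        return it->second->second;
    }

    // Insertion and eviction omitted; they take the same lock.

private:
    using Entry = std::pair<Key, std::shared_ptr<const Value>>;
    std::mutex mutex_;
    std::list<Entry> lru_;
    std::unordered_map<Key, typename std::list<Entry>::iterator> map_;
};

Even with the critical section that small, every access still funnels through the same mutex, which is where the contention comes from.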

The result of that cleverness was that rendering on eight cores was about 5-6x faster than on one. That's nothing to sneeze at considering that the cache was being accessed for every ray-object intersection test. But it was a clear indication that it wouldn't scale well with many more cores. And on eight cores, you really want close to an 8x speed-up if you can get it.

I was able to improve the situation further by better exploiting ray reordering. Instead of accessing the cache for every ray-object test like I was before, I just accessed it once for an entire batch of rays being tested against an object. This gave a 7x speed-up over a single-core render on most of my test scenes. Again, quite good. But still, how many more cores would it scale well with?

But even worse, on one of my test scenes it was still only about a 6x speed-up. The reason, it turned out, was that the scene was much more complex, with lots of very small objects. When the objects are smaller, fewer rays are queued against each individual object, so the cache access isn't amortized over as large a batch of rays. And, in theory, with smaller and smaller objects that problem could get arbitrarily bad.

So I wanted to push it even further. To do this, I thought of two basic approaches:

  1. Eliminate the locking by giving each thread its own (smaller) thread-local cache.
  2. Eliminate the cache entirely.

In the end, I decided to take the latter approach. That might sound extreme, but the geometry cache wasn't actually giving that much of a performance boost. Ray reordering on its own was really only about 5-10% slower on most scenes, and removing the cache resulted in enough of a speed-up on eight cores to make up for that (though I don't recall the exact number... I don't have my 8-core machine on hand as I'm writing this). Moreover, removing the cache simplified the code quite a bit and removed its memory footprint. And, most importantly, it completely eliminated the problem of many small objects, ensuring good scaling independent of scene geometry.

To be totally honest, I'm actually mixing up the development timeline a bit: I didn't disable the cache in committed code until after I'd moved away from ray reordering as well, which I'll talk about in a later post. But nevertheless, in the end I don't think a shared geometry cache is a scalable approach. I may revisit the idea of thread-local caches at some point. But for now, I don't think Psychopath needs them.

Ray Reordering

In my last rendering architecture post I talked about using geometry caching to ray trace micropolygons. But Psychopath didn't stay there.

It wasn't long after making the first micropolygon renders that I stumbled across the paper Two-Level Ray Tracing with Reordering for Highly Complex Scenes by Hanika et al. Instead of using a geometry cache to ray trace diced surfaces, Hanika et al. used something called "ray reordering".

With standard ray tracing you trace one ray at a time, doing all of its intersection tests against all relevant parts of the scene before moving on to the next ray. But with ray reordering you instead queue up many rays at once and attempt (as best you can) to explicitly change the order of the ray tests, with the goal of testing many rays against the same part of the scene at once.

There are many different ways to go about this, but typically it involves some mechanism that lets you pause a ray's traversal through the scene. If you have such a mechanism, you can then pause a ray just before it's tested against an object in the scene. Once the ray is paused, you can move on to other rays in the hopes that some of them will also be paused before that same object. When you have enough rays paused before an object, you can unpause all of them together and do the ray tests for that object all at once.
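
As a very rough sketch of that queueing idea (hypothetical names and structure; a real implementation also has to store each ray's paused traversal state so it can resume afterwards):

#include <cstddef>
#include <vector>

struct Ray {
    // Origin, direction, hit info, and the paused traversal state needed
    // to resume this ray after the batched test.
};

struct Object {
    // Dice (if needed) and test a whole group of rays at once, so the
    // expensive per-object work is amortized over the group.
    void test_rays(std::vector<Ray>& batch,
                   const std::vector<std::size_t>& ray_indices);
};

struct ObjectQueue {
    Object* object = nullptr;
    std::vector<std::size_t> paused_rays;  // indices into the ray batch
};

const std::size_t FLUSH_THRESHOLD = 1024;  // arbitrary illustrative value

void flush(ObjectQueue& q, std::vector<Ray>& batch) {
    q.object->test_rays(batch, q.paused_rays);
    q.paused_rays.clear();
}

// Called when a ray's traversal reaches an object: instead of testing it
// right away, pause it in that object's queue and move on to other rays.
void pause_ray(ObjectQueue& q, std::vector<Ray>& batch, std::size_t ray_index) {
    q.paused_rays.push_back(ray_index);
    if (q.paused_rays.size() >= FLUSH_THRESHOLD)
        flush(q, batch);
}

(And of course any queues that never reach the threshold have to be flushed at the end regardless.)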

Doing things this way definitely involves some overhead, but the benefit is that—especially if you queue a large number of rays—you end up accessing scene data in a much more coherent way. And that allows you to do some really cool things.

Classically, ray reordering has been used to ray trace scenes that are larger than available RAM: by explicitly trying to access the scene in a coherent way, you can drastically reduce paging to/from disk.

But really, ray reordering can be applied to any situation where accessing scene elements is expensive. It allows you to distribute the cost of that access among many rays. And in the case of Hanika et al., they applied this to dicing surfaces into microgeometry. If you re-dice a surface for every single ray test, that's obviously going to be prohibitively slow. But if you re-dice a surface for a whole batch of rays at a time, then the amortized cost can be quite reasonable.

In a sense, this is just doing the reverse of what a geometry cache does. A geometry cache saves up geometry to (hopefully) be tested against multiple rays. Ray reordering saves up rays to be tested as a (hopefully) large batch against some geometry.

For the second iteration of Psychopath, I did both. I kept the geometry cache, but I also implemented ray reordering as described in the Hanika et al. paper. And the results were interesting.

It turned out that geometry caching and ray reordering complemented each other nicely, picking up the slack for each other in the areas where each was weak. The ray reordering prevented the hard performance cliff that would happen when the geometry cache was too small (essentially helping to access the cache in a more coherent way). And the geometry cache helped reduce the amount of dicing when the ray reordering couldn't extract enough coherence from the ray batch.

So all-in-all it seemed like this was a great combination. But it turned out there were some pretty severe drawbacks that I didn't realize until I got further along in development. The drawbacks weren't so much from doing both in combination, but rather were individual drawbacks of each approach.

But I'll talk about that in another post.

Glossy Fixed! And the GTR Microfacet BRDF

So, I've fixed the obvious bug in the glossy BRDF from my last post.

GTR Glossy

The bug was in the fresnel calculation. I was doing a dot product with one of the vectors unnormalized when it was supposed to be normalized. This caused the dot product to be greater than one in some areas, which in turn made the result of the whole equation negative. Fun!
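
Just to illustrate the failure mode, here's a tiny sketch using Schlick's approximation as a stand-in (the exact fresnel formulation isn't the point, and this isn't a copy of Psychopath's code):

#include <algorithm>
#include <cmath>

// cos_theta is expected to be the dot product of two *normalized* vectors.
// If one of them isn't normalized, cos_theta can exceed 1.0, (1 - cos_theta)
// goes negative, and the odd power flips the sign of the whole term.
float schlick_fresnel(float f0, float cos_theta) {
    float c = std::min(std::max(cos_theta, 0.0f), 1.0f);  // clamp defensively
    return f0 + (1.0f - f0) * std::pow(1.0f - c, 5.0f);
}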

The glossy BRDF still isn't 100% there yet. I'm still struggling to figure out what I'm doing wrong with one part of the equation (for now I'm just leaving that part out, and it still looks believable—when I include it, things explode with brightness, so I must be doing something wrong). But that will be for another post.

But still! Looks cool!

Now that I have it more-or-less working, I want to talk briefly about what makes this particular glossy BRDF cool.

As I mentioned before, this BRDF uses the GTR (or Generalized Trowbridge-Reitz) microfacet distribution. Like most glossy BRDF's, it has a parameter for how blurry the reflections are. But unlike most glossy BRDF's, it also has a parameter for the shape of the blurriness. And when I say "shape" I don't mean like stars or squares or circles. Rather, I mean this:

GTR Glossy Tails

This image has exactly the same parameters as the one above, except for the shape parameter. Notice that the reflections are still about as blurry, and the central hot-spots of the highlights are still roughly the same size (give or take); they just have a wider trailing blur around them. Basically, the shape parameter makes the reflection more or less "foggy".

It's really cool, because it lets you convincingly achieve a much wider range of materials. Incidentally, when the shape parameter is set to 2.0, the BRDF is identical to GGX.
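
For reference, here's a small sketch of the GTR distribution as I understand it from the Disney course notes (illustrative only, and not a copy of Psychopath's implementation):

#include <cmath>

// GTR ("Generalized Trowbridge-Reitz") microfacet distribution.
//   roughness: controls how blurry the reflections are (alpha)
//   gamma:     the "shape"/tail parameter; gamma = 2.0 reduces to GGX
//   nh:        cosine between the surface normal and the half-vector
//              (assumed normalized and non-negative)
float gtr_d(float nh, float roughness, float gamma) {
    const float pi = 3.14159265358979f;
    float a2 = roughness * roughness;
    float denom = 1.0f + (a2 - 1.0f) * nh * nh;  // = a2*cos^2 + sin^2

    // Normalization constant so the distribution integrates to one,
    // special-casing alpha == 1 and gamma == 1 where the general formula
    // degenerates.
    float c;
    if (std::fabs(a2 - 1.0f) < 1e-6f)
        c = 1.0f / pi;
    else if (gamma == 1.0f)
        c = (a2 - 1.0f) / (pi * std::log(a2));
    else
        c = ((gamma - 1.0f) * (a2 - 1.0f)) / (pi * (1.0f - std::pow(a2, 1.0f - gamma)));

    return c / std::pow(denom, gamma);
}

With gamma set to 2.0 the constant works out to a2 / pi and the whole thing reduces to the familiar GGX distribution, which matches the note above.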

Glossy Render

Everything still uses a single shader, but I've now implemented a much better glossy BRDF.

GTR Glossy

It's based on the "Generalized Trowbridge-Reitz" or GTR microfacet distribution presented in Disney's Principled BRDF paper. GTR is pretty much identical to GGX, except that it has an additional parameter to tweak the tail of the specular highlight.

There are still some bugs in the implementation (see, for example, the black spot on the lip of the teacup). And I'm still struggling to figure out certain things about microfacet BRDF's in general. But it's more-or-less working, with proper importance sampling and everything.

I'm especially pleased to see it working in the context of the curved surfaces, since they're rendered using micro-geometry. I'm not seeing any artifacts originating from that.