Breadth First Ray Tracing

Phew! I'm almost finished with my "catching up to the present" retrospective architectural posts. I may sprinkle a few more retrospective posts here and there, but this is the last for this series. This post will expand upon my first and second ray reordering posts and talk about breadth first ray tracing, which is what eventually replaced ray reordering in Psychopath.

First, let me say that breadth first ray tracing is really cool. Like, really cool. Much like ray reordering, it accesses scene data coherently for a batch of rays. But unlike ray reordering it has strong guarantees about those access patterns. Specifically, each instance of a scene element is guaranteed to be accessed at most once for an entire batch of rays. And, perhaps surprisingly, breadth first ray tracing is also much simpler to implement than ray reordering.

So how does it work?

The core idea behind breadth first ray tracing is to trace an entire batch of rays together. This is subtly different from ray reordering where, although you also have a batch of rays, you are tracing each ray separately—you just use pausing to switch between them at strategic times. But in breadth first ray tracing you really trace all of the rays together, even down to traversing the acceleration structure of the scene.

The most fundamental operation of breadth first ray tracing is something called "ray filtering". Imagine that you have a batch of rays and an object's bounding box. You can test the whole batch of rays against the bounding box together (either with SIMD if it's wide enough, or sequentially), and then create a new sub-batch out of the rays that hit it. Once you have that sub-batch, you can then proceed to test only that sub-batch against the object's geometry.

That process of creating a new batch from the rays that hit something is called "ray filtering". And in breadth first ray tracing you do ray filtering at every step of traversing the scene's acceleration structure.

There are multiple ways to implement ray filtering. Some approaches use partitioning to rearrange the rays into two adjacent groups in memory, storing pointers to the sub-lists. Other approaches actually create new lists of pointers to the individual rays. But the basic principle is the same regardless of how you do it (albeit with different memory behavior).

In Psychopath I'm taking the partitioning approach, because it has lower memory requirements and accesses the rays in memory more coherently. Scanning through a linear array of rays while doing the ray tests is a lot faster than chasing pointers and jumping all over the place in memory, especially with a large number of rays.

You might object that there is still the matter of partitioning, which scans the rays again, potentially incoherently. But, in fact, partitioning algorithms also generally access arrays of data quite coherently. And, moreover, although most papers I've read split the ray testing and partitioning into separate steps, you can easily roll them into a single step.

A good partitioning algorithm only executes its predicate once for each item in the array being partitioned. C++'s std::partition, for example, makes this guarantee for random-access containers. And this means that the ray test itself can be used as the predicate.

Pseudo-code for the ray testing and partitioning rolled into one might, then, look something like this:

bbox = get_bounding_box()
rays = get_ray_batch()
filtered_rays = partition(rays, bbox.test_ray)

Where partition is a function that partitions a list (in this case rays) based on a unary predicate (in this case bbox.test_ray) and returns a reference to the part of the list that tested true.

This is almost frighteningly simple.

To incorporate this into a BVH traversal algorithm you need a stack for the ray batch references, which is fairly straight-forward. Pseudo code for the entire BVH traversal algorithm might look something like this:

node_stack.push(get_root_bvh_node())
rays_stack.push(get_ray_batch())

while node_stack is not empty:
    node = node_stack.pop()
    rays = rays_stack.pop()
    filtered_rays = partition(rays, node.bbox.test_ray())
    if filtered_rays is not empty:
        if node is leaf:
            node.data.test_rays(rays)
        else:
            node_stack.push(node.child1)
            node_stack.push(node.child2)
            rays_stack.push(filtered_rays)
            rays_stack.push(filtered_rays)

With node_stack and rays_stack being the stacks for BVH node references and ray batch references, respectively.

This is really simple code, bordering on elegant. Especially when compared to ray reordering where you have to store ray state, pause traversal, un-pause traversal, store which rays are about to hit which nodes, etc.

Looking at this code, it also starts to become clear why breadth first ray tracing can do instancing: we can use stacks during traversal. The node stack in particular lets us keep track of where we've come from and what we need to do next. And because that stack is shared by the entire ray batch, it takes up only a tiny amount of space even for fairly deep stacks. With ray reordering we would need a separate stack for each ray, which would be huge in aggregate.

And it's not just the node stack that is helpful. With breadth first ray tracing we can make any number of shared stacks for any kind of data we might want as long as that data is the same for each ray. So, for example, we can have a stack of transform matrices representing the current transform space of the rays, which is also necessary for instancing.

So that's breadth first ray tracing in a nut shell.

There are, of course, drawbacks to breadth first ray tracing as well. The three biggest drawbacks I've found so far are:

  1. The overhead of creating/partitioning/whatever the rays lists slows things down.
  2. Because the rays traverse the acceleration structure together, they have to traverse it in the same order. That, in turn, means we end up doing a lot of unnecessary ray tests that could be avoided with individual traversal orders. Specifically, we miss out on optimizations like always testing against the closest node first—the closest node for one ray often isn't the closest node for another.
  3. Implementing typical SIMD optimizations like QBVH's etc. becomes a lot trickier and doesn't provide as much benefit because of drawback #1 above.

Point #1 isn't an enormous bottleneck on its own, especially when ray testing is rolled directly into the partitioning step. But the severity depends on how large the ray data structures are: the larger they are, the more costly partitioning becomes. Unfortunately, rays are fairly fat data structures in Psychopath. That's definitely something I want to improve at some point if I can.

Point #2 can be addressed somewhat by first splitting ray batches into separate directionally-ordered batches. But then you also lose some of the traversal coherence, so it's a trade-off.

Point #3 depends on what you're doing. Most breadth first ray tracing papers I've seen exploit SIMD instructions by testing multiple rays against single bounding boxes, which actually appears to work very well based on those papers. But unfortunately such techniques are not (as far as I know) compatible with efficient motion blur implementations. The fundamental problem is that with motion blur you're essentially testing each ray against a "different" bounding box anyway since it's moving over time.

On the flip side, QBVH's work very well with motion blur, but they're difficult to put into a breadth-first context. It's not impossible, though. I've implemented a breadth-first QBVH traverser successfully by using a small bit stack embedded within each ray. But, on average, it doesn't reduce the partitioning cost and makes the ray data structure larger. Thankfully, it still speeds things up, just not as much as QBVH's with individual ray traversal.

On the whole, though, I'm definitely happy with breadth first ray tracing. It's relatively simple to implement, it gives extremely strong guarantees about scene data access patterns, and it allows for full hierarchical instancing. Definitely a good fit for what I'm trying to do with Psychopath. So unless something better comes along, I expect this will remain a core part of Psychopath's architecture for the foreseeable future.