Psychopath Renderer

a slightly psychotic path tracer

Subdivision Surfaces

I recently implemented Catmull-Clark subdivision surfaces in Psychopath.

One of my goals with Psychopath is for all curved surfaces to actually render as curved surfaces rather than as tessellated triangle meshes. To accomplish this for subdivision surfaces I am leveraging OpenSubdiv to pre-process the cage mesh into b-spline patches. Psychopath then converts the b-spline patches into bezier patches which it already knows how to render.
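
For the curious, the b-spline-to-bezier step is just the standard change of basis for a uniform cubic spline. A minimal sketch of that conversion for a single 4x4 patch might look something like the following (assuming a simple Vec3 type); this isn't Psychopath's actual code, just an illustration of the math.

// Sketch of the standard uniform-cubic B-spline -> Bezier basis change,
// applied to a 4x4 patch. Vec3 is a stand-in type; the math is the point.
#include <array>

struct Vec3 { float x, y, z; };
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Convert one cubic B-spline segment (4 control points) to Bezier form.
std::array<Vec3, 4> bspline_to_bezier(const std::array<Vec3, 4>& p) {
    return {
        (p[0] + p[1] * 4.0f + p[2]) * (1.0f / 6.0f),
        (p[1] * 2.0f + p[2]) * (1.0f / 3.0f),
        (p[1] + p[2] * 2.0f) * (1.0f / 3.0f),
        (p[1] + p[2] * 4.0f + p[3]) * (1.0f / 6.0f),
    };
}

// For a bicubic patch, apply the 1D conversion to each row, then each column.
using Patch = std::array<std::array<Vec3, 4>, 4>;
Patch bspline_patch_to_bezier(const Patch& in) {
    Patch tmp, out;
    for (int r = 0; r < 4; ++r)
        tmp[r] = bspline_to_bezier(in[r]);
    for (int c = 0; c < 4; ++c) {
        std::array<Vec3, 4> col = {tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c]};
        col = bspline_to_bezier(col);
        for (int r = 0; r < 4; ++r) out[r][c] = col[r];
    }
    return out;
}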

The end result is this:

Pontiac GTO '67

(Model by 'thecali', available on BlendSwap—please note that the version in the render above has some parts missing, as some bits were modeled with curves which Psychopath can't render yet.)

In other words, I can finally render interesting models! Yay! Most things are modeled with Catmull-Clark subdivision surfaces these days, so this is a big step.

The only down-side so far is that OpenSubdiv seems to be a bit on the slow side when converting the mesh to b-spline patches—it takes around 45 seconds for this model. It's possible I'm not using the library optimally. It's certainly something I'll need to investigate. But I'm quite pleased for now!

Spectral Rendering

Up to this point Psychopath has represented color with RGB values. This is a convenient way to handle color in a renderer because most color input to the renderer is going to be in some kind of RGB format (textures, hand-specified colors, etc.) and most color output the renderer produces is going to be in some kind of RGB format as well (final rendered images). So it makes sense that if your input is RGB, and your output is RGB, probably everything in-between should be RGB as well.

But RGB isn't how real light works. Real light is composed of a spectrum of wavelengths, much the same way that sound is. RGB is a bit like sampling audio at only three frequencies.

Even so, rendering with RGB usually works fine because human color vision is also more-or-less RGB. But there are spectral effects whose causes aren't visible to the naked eye even though their results are. Fluorescent lighting is a good example: the effect it has on colors can't be accurately reproduced in RGB without also tweaking the colors of the materials. Prisms and rainbows are another example.

Admittedly, these examples are largely corner cases that can be faked, and I'm designing Psychopath with animation and VFX in mind—both of which are cases where artistic control is more important than strict accuracy. But nevertheless, being able to render spectral effects is cool! And it might come in handy when trying to match scene lighting for a VFX shot in a fluorescent-lit office building.

So I decided to make Psychopath a spectral renderer.

There are two broad approaches to spectral rendering: binning, and Monte Carlo spectral sampling.

Binning is basically RGB on crack. You can think of RGB color as three spectral samples: one in the reddish part of the spectrum, one in the greenish part, and one in the blueish part. Binning formalizes and generalizes this concept. Instead of three vaguely defined samples, you have N very specific ranges or "bins" over the visible spectrum. Anywhere from 9 to 30 bins is common with this approach.

Monte Carlo spectral sampling takes a completely different approach. Instead of dividing up the spectrum into predefined regions, each light path gets a randomly assigned wavelength. Then as more and more paths are traced, more and more points on the spectrum get sampled, and the image converges to the correct colors in much the same way as everything else in a Monte Carlo path tracer.

Both approaches have their pros and cons.

The biggest pro of the binning approach is that colors still have a clear canonical representation within the renderer: every color in the renderer is specified by a set of N values. Simple, easy, obvious.

In the Monte Carlo approach, colors don't have a single canonical way to be represented. However, that is also one of the pros of the Monte Carlo approach: you can have many different ways to represent color, and as long as they can all be sampled spectrally, they can all coexist effortlessly within the same renderer. Monte Carlo sampling is also more accurate than binning. For example, if you rendered a scene with a prism, the resulting image from a binning approach would only have as many colors as there are bins. But a Monte Carlo approach would, when converged, reveal all the colors of the rainbow in a smooth gradient. Finally, the Monte Carlo approach takes up less memory (only one value instead of 9-30), and might be faster depending on a variety of factors.

In the end, I decided to go with the Monte Carlo approach for Psychopath. The combination of being more accurate and taking less memory was very attractive. It also strikes me as more elegant: since I'm already solving the rest of the rendering equation with (Quasi-) Monte Carlo, why not just extend that into the color spectrum dimension as well?

However, the Monte Carlo approach has one big down side: color noise. Below is a spectral render of a solid medium-gray background with one sample per pixel.

One sample spectral render

As you can see, the resulting render is not a good representation of the underlying color in the scene. Instead of gray, we get rainbow. Even at 16 samples per pixel, the color noise is still quite visible:

Sixteen sample spectral render

With this approach it takes a lot of samples to make the color noise go away. And really, we aren't using our ray tracing very efficiently here: most visual phenomena in the real world are not significantly spectral in nature, and yet we're taking the time to trace an entire light path for each single spectral sample.

There is a clever solution to this problem in the paper "Hero Wavelength Spectral Sampling" by Wilkie et al. The basic idea is simple: instead of just a single wavelength per light path, have several. One of the wavelengths is randomly chosen (the "hero" wavelength), and the others are evenly spaced apart from that to cover the visible spectrum.
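
The wavelength generation itself is tiny. A rough sketch (the constants and names here are illustrative, not Psychopath's actual ones) might look like this:

// Sketch of hero wavelength generation: one random hero wavelength, with the
// remaining samples rotated evenly across the visible range.
#include <array>
#include <cmath>

constexpr float WAVELENGTH_MIN = 380.0f;  // nm (illustrative bounds)
constexpr float WAVELENGTH_MAX = 700.0f;  // nm
constexpr int   SPECTRAL_COUNT = 4;       // wavelengths per light path

// 'u' is a uniform random number in [0, 1) for this light path.
std::array<float, SPECTRAL_COUNT> hero_wavelengths(float u) {
    const float range = WAVELENGTH_MAX - WAVELENGTH_MIN;
    const float hero = WAVELENGTH_MIN + u * range;  // the randomly chosen hero

    std::array<float, SPECTRAL_COUNT> lambdas;
    for (int j = 0; j < SPECTRAL_COUNT; ++j) {
        // The other wavelengths are evenly spaced rotations of the hero.
        float offset = hero - WAVELENGTH_MIN + (j * range) / SPECTRAL_COUNT;
        lambdas[j] = WAVELENGTH_MIN + std::fmod(offset, range);
    }
    return lambdas;
}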

The result of this method with four wavelengths per light path at one sample per pixel looks like this for the gray background:

One sample hero spectral render

And looks like this at sixteen samples per pixel:

Sixteen sample hero spectral render

The difference is enormous. Instead of unrecognizable color at one sample per pixel, we can clearly make out the intended color, albeit with noise. And at sixteen samples per pixel the color noise is all but gone.

The hero wavelength paper is very much worth a read, because it also covers how to properly handle cases with strong spectrally-dependent effects. But even just this simple idea makes a big difference, and makes Monte Carlo spectral sampling practical.

One last practical matter: given that artists aren't going to manually specify an entire spectrum for every color they use, the renderer needs a way to convert RGB input into a matching light spectrum automatically. There are a few approaches out there to do this, but the one I'm using in Psychopath is from the recent paper "Physically Meaningful Rendering Using Tristimulus Colours" by Meng et al. The awesome thing about the algorithm they develop in the paper is that it accurately covers (almost) the entire XYZ color space. This is significant because that means it can be used for converting colors from any color space: just convert to XYZ first. This is not the case for any of the other approaches I'm aware of.
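
The nice part is that the "convert to XYZ first" step is just a matrix multiply. For linear sRGB/Rec.709 input with a D65 white point it looks roughly like this; the XYZ-to-spectrum lookup from the Meng et al. paper is the interesting part, and isn't reproduced here.

// Sketch of converting linear sRGB (D65) to XYZ with the standard matrix.
// The resulting XYZ color would then be fed to the spectral lookup.
struct XYZ { float x, y, z; };

XYZ linear_srgb_to_xyz(float r, float g, float b) {
    return {
        0.4124564f * r + 0.3575761f * g + 0.1804375f * b,
        0.2126729f * r + 0.7151522f * g + 0.0721750f * b,
        0.0193339f * r + 0.1191920f * g + 0.9503041f * b,
    };
}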

Basic Shading System

Psychopath now has a basic shading system! Yay!

As you can see in this artistically-awful image, different objects can now have different shaders assigned to them:

Multiple Materials

This was actually fairly easy to get up and running. It just required some design work. I did indeed go with shading being done at intersection time, as described in my last post.

This basic shading system does not use OSL, and doesn't even support texturing of any kind. Each shader just produces a single fixed BSDF. But building this basic version of a shading system helped me work out the machinery behind how shaders get assigned to objects and how they are processed during rendering.

One of the cool features of this shading system is that shaders are assigned hierarchically. Any node in the scene graph (more-or-less) can have a shader assigned to it, and all nodes under such a node will inherit that shader—unless they themselves have been assigned a different shader. This makes it easy to assign a single shader to the entire scene, for example.
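
Resolving that inheritance is a simple scene-graph walk. A minimal sketch (with hypothetical Node/Shader types, not Psychopath's actual ones) might look like this:

// Sketch of hierarchical shader inheritance: each node either has its own
// shader or inherits the nearest assigned shader from an ancestor.
#include <vector>

struct Shader;

struct Node {
    const Shader* shader = nullptr;  // null means "inherit from parent"
    std::vector<Node*> children;
};

// Walk the graph, passing down the nearest assigned shader.
void resolve_shaders(Node* node, const Shader* inherited) {
    const Shader* effective = node->shader ? node->shader : inherited;
    node->shader = effective;  // every node ends up with a concrete shader
    for (Node* child : node->children)
        resolve_shaders(child, effective);
}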

Designing a Shading System

One of Psychopath's current limitations is that every surface has to have the same shader. This is obviously not workable if it's going to be used as a "real" renderer, so I need to design a shading system that allows different shaders to be assigned to different objects.

Eventually I want to use Open Shading Language for shading in Psychopath, but as a first pass I'm going to do something a little simpler to get my feet wet. I've never written a shading system, after all.

But even so, since Psychopath traces many rays at once there are actually some interesting choices to make here which will carry over to how Psychopath utilizes OSL. There are two broad approaches I can take:

  1. Do the shading when a ray hits something. This would require storing the result of the shading in the intersection structure, which would be a BSDF closure.

  2. Do the shading after all of the ray intersection tests have been completed. This would require storing all information necessary to do the shading in the intersection structure, which would be things such as a reference to the shader and any required differential geometry data.

The biggest benefit of the first approach is that it keeps the architecture a bit simpler. Displacement shaders will have to be evaluated during ray traversal anyway, so the first approach would keep all of the shading code in the same general part of Psychopath.

The biggest benefit of the second approach is that the shading evaluations can be reordered to maximize e.g. coherent texture access.

I've been going back-and-forth trying to decide which approach to take, but I think I've finally settled on the first approach. The first approach gives me more flexibility in how much data I store, and since Psychopath is using breadth first ray tracing there are already strong guarantees about accessing shaders and their associated data in a coherent way.

The main problem with the second approach is that the amount of data that needs to be stored is unbounded. If a mesh has e.g. multiple UV sets, then Psychopath has to store intersection data about all of those UV sets to be able to do the shading later. And I'd also like Psychopath to be able to support arbitrary geometry data for use in shaders, which would exacerbate the problem. It won't be possible to predict ahead of time how much data needs to be stored.

With the first approach, on the other hand, the amount of data can be bounded. There is still the possibility (even likelihood) of multiple mixed closures. But because they are closures, I have more flexibility. I can take a Monte Carlo approach and, after shading, select a subset of them to be stored along with the appropriate weighting information. The downside to that, of course, would be increased noise. But the max number of closures to store could be a render setting, which would allow the user to decide what to do with that memory/noise tradeoff.
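
One way that kind of stochastic closure selection might look (hypothetical types, and just a sketch of the idea, not Psychopath's actual code): pick a single closure proportionally to its weight, then re-weight it so the expected value stays the same.

// Sketch of stochastically keeping one closure out of many, chosen
// proportionally to its weight and re-weighted to keep the estimate unbiased.
#include <vector>

struct Closure {
    float weight;  // relative contribution of this closure to the BSDF
    // ... closure parameters ...
};

// 'u' is a uniform random number in [0, 1).
Closure select_closure(const std::vector<Closure>& closures, float u) {
    float total = 0.0f;
    for (const Closure& c : closures)
        total += c.weight;

    // Pick a closure with probability weight / total.
    float pick = u * total;
    for (const Closure& c : closures) {
        if (pick < c.weight) {
            Closure kept = c;
            kept.weight = total;  // compensate: weight / (weight / total) = total
            return kept;
        }
        pick -= c.weight;
    }
    return closures.back();  // numerical fallback
}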

The biggest problem with the first approach, however, is that the shader will need to be evaluated at every ray hit during traversal, rather than just the final hit. And especially for non-trivial shaders in a complex scene, that could have a non-negligible performance impact.

But in the end, I think it's a reasonable tradeoff. I would rather keep memory usage predictably bounded, as one of Psychopath's goals is to be able to render very large scenes.

Breadth First Ray Tracing

Phew! I'm almost finished with my "catching up to the present" retrospective architectural posts. I may sprinkle a few more retrospective posts here and there, but this is the last for this series. This post will expand upon my first and second ray reordering posts and talk about breadth first ray tracing, which is what eventually replaced ray reordering in Psychopath.

First, let me say that breadth first ray tracing is really cool. Like, really cool. Much like ray reordering, it accesses scene data coherently for a batch of rays. But unlike ray reordering it has strong guarantees about those access patterns. Specifically, each instance of a scene element is guaranteed to be accessed at most once for an entire batch of rays. And, perhaps surprisingly, breadth first ray tracing is also much simpler to implement than ray reordering.

So how does it work?

The core idea behind breadth first ray tracing is to trace an entire batch of rays together. This is subtly different from ray reordering where, although you also have a batch of rays, you are tracing each ray separately—you just use pausing to switch between them at strategic times. But in breadth first ray tracing you really trace all of the rays together, even down to traversing the acceleration structure of the scene.

The most fundamental operation of breadth first ray tracing is something called "ray filtering". Imagine that you have a batch of rays and an object's bounding box. You can test the whole batch of rays against the bounding box together (either with SIMD if it's wide enough, or sequentially), and then create a new sub-batch out of the rays that hit it. Once you have that sub-batch, you can then proceed to test only that sub-batch against the object's geometry.

That process of creating a new batch from the rays that hit something is called "ray filtering". And in breadth first ray tracing you do ray filtering at every step of traversing the scene's acceleration structure.

There are multiple ways to implement ray filtering. Some approaches use partitioning to rearrange the rays into two adjacent groups in memory, storing pointers to the sub-lists. Other approaches actually create new lists of pointers to the individual rays. But the basic principle is the same regardless of how you do it (albeit with different memory behavior).

In Psychopath I'm taking the partitioning approach, because it has lower memory requirements and accesses the rays in memory more coherently. Scanning through a linear array of rays while doing the ray tests is a lot faster than chasing pointers and jumping all over the place in memory, especially with a large number of rays.

You might object that there is still the matter of partitioning, which scans the rays again, potentially incoherently. But, in fact, partitioning algorithms also generally access arrays of data quite coherently. And, moreover, although most papers I've read split the ray testing and partitioning into separate steps, you can easily roll them into a single step.

A good partitioning algorithm only executes its predicate once for each item in the array being partitioned. C++'s std::partition, for example, makes this guarantee for random-access containers. And this means that the ray test itself can be used as the predicate.

Pseudo-code for the ray testing and partitioning rolled into one might, then, look something like this:

bbox = get_bounding_box()
rays = get_ray_batch()
filtered_rays = partition(rays, bbox.test_ray)

Where partition is a function that partitions a list (in this case rays) based on a unary predicate (in this case bbox.test_ray) and returns a reference to the part of the list that tested true.

This is almost frighteningly simple.
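
In actual C++ it might look roughly like this, using std::partition with the ray test as the predicate. The Ray and BBox types here are just stand-ins, not Psychopath's actual interfaces.

// Sketch of ray filtering via std::partition, with the ray-vs-bbox test used
// directly as the predicate.
#include <algorithm>
#include <vector>

struct Ray  { /* origin, direction, etc. */ };
struct BBox {
    // Placeholder; a real implementation would do a slab test here.
    bool test_ray(const Ray&) const { return true; }
};

// Returns the end of the sub-range [rays.begin(), end) containing the hits.
std::vector<Ray>::iterator filter_rays(std::vector<Ray>& rays, const BBox& bbox) {
    return std::partition(rays.begin(), rays.end(),
                          [&](const Ray& r) { return bbox.test_ray(r); });
}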

To incorporate this into a BVH traversal algorithm you need a stack for the ray batch references, which is fairly straight-forward. Pseudo code for the entire BVH traversal algorithm might look something like this:

node_stack.push(get_root_bvh_node())
rays_stack.push(get_ray_batch())

while node_stack is not empty:
    node = node_stack.pop()
    rays = rays_stack.pop()
    filtered_rays = partition(rays, node.bbox.test_ray)
    if filtered_rays is not empty:
        if node is leaf:
            node.data.test_rays(filtered_rays)
        else:
            node_stack.push(node.child1)
            node_stack.push(node.child2)
            rays_stack.push(filtered_rays)
            rays_stack.push(filtered_rays)

With node_stack and rays_stack being the stacks for BVH node references and ray batch references, respectively.

This is really simple code, bordering on elegant. Especially when compared to ray reordering where you have to store ray state, pause traversal, un-pause traversal, store which rays are about to hit which nodes, etc.

Looking at this code, it also starts to become clear why breadth first ray tracing can do instancing: we can use stacks during traversal. The node stack in particular lets us keep track of where we've come from and what we need to do next. And because that stack is shared by the entire ray batch, it takes up only a tiny amount of space even for fairly deep stacks. With ray reordering we would need a separate stack for each ray, which would be huge in aggregate.

And it's not just the node stack that is helpful. With breadth first ray tracing we can make any number of shared stacks for any kind of data we might want as long as that data is the same for each ray. So, for example, we can have a stack of transform matrices representing the current transform space of the rays, which is also necessary for instancing.
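
A rough sketch of what that might look like, with a transform stack running in parallel with the node stack (hypothetical types and accessors, not Psychopath's actual ones):

// Sketch of a shared transform stack alongside the node stack: entering an
// instance composes its transform once for the whole ray batch.
#include <vector>

struct Transform {
    // 4x4 matrix omitted; composition is all this sketch needs.
    Transform operator*(const Transform&) const { return {}; }
    static Transform identity() { return {}; }
};

struct Node {
    bool is_instance = false;
    const Node* instanced_child = nullptr;  // set when is_instance is true
    Transform instance_transform;
    const Node* child1 = nullptr;
    const Node* child2 = nullptr;
};

void traverse(const Node* root) {
    std::vector<const Node*> node_stack{root};
    std::vector<Transform> xform_stack{Transform::identity()};

    while (!node_stack.empty()) {
        const Node* node = node_stack.back(); node_stack.pop_back();
        Transform xform = xform_stack.back(); xform_stack.pop_back();

        if (node->is_instance) {
            // One push for the whole batch: compose the instance's transform
            // onto the current space and descend into the instanced sub-tree.
            node_stack.push_back(node->instanced_child);
            xform_stack.push_back(xform * node->instance_transform);
            continue;
        }

        // ... bbox test / ray filtering happens here, with the rays
        //     interpreted in the space described by 'xform' ...

        if (node->child1 && node->child2) {
            node_stack.push_back(node->child1);
            node_stack.push_back(node->child2);
            xform_stack.push_back(xform);
            xform_stack.push_back(xform);
        }
    }
}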

So that's breadth first ray tracing in a nutshell.

There are, of course, drawbacks to breadth first ray tracing as well. The three biggest drawbacks I've found so far are:

  1. The overhead of creating/partitioning/whatever the ray lists slows things down.
  2. Because the rays traverse the acceleration structure together, they have to traverse it in the same order. That, in turn, means we end up doing a lot of unnecessary ray tests that could be avoided with individual traversal orders. Specifically, we miss out on optimizations like always testing against the closest node first—the closest node for one ray often isn't the closest node for another.
  3. Implementing typical SIMD optimizations like QBVHs becomes a lot trickier and doesn't provide as much benefit because of drawback #1 above.

Point #1 isn't an enormous bottleneck on its own, especially when ray testing is rolled directly into the partitioning step. But the severity depends on how large the ray data structures are: the larger they are, the more costly partitioning becomes. Unfortunately, rays are fairly fat data structures in Psychopath. That's definitely something I want to improve at some point if I can.

Point #2 can be addressed somewhat by first splitting ray batches into separate directionally-ordered batches. But then you also lose some of the traversal coherence, so it's a trade-off.

Point #3 depends on what you're doing. Most breadth first ray tracing papers I've seen exploit SIMD instructions by testing multiple rays against single bounding boxes, which actually appears to work very well based on those papers. But unfortunately such techniques are not (as far as I know) compatible with efficient motion blur implementations. The fundamental problem is that with motion blur you're essentially testing each ray against a "different" bounding box anyway since it's moving over time.
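
A common way to handle motion blur bounds is to store a bounding box at the shutter's start and end and interpolate them at each ray's time, which makes it clear why each ray effectively sees its own box. A sketch (types are stand-ins):

// Sketch of why motion blur fights multi-ray-vs-one-box SIMD: the box a ray
// is tested against depends on that ray's time.
struct Vec3 { float x, y, z; };
struct BBox { Vec3 min, max; };

inline float lerp(float a, float b, float t) { return a + (b - a) * t; }
inline Vec3 lerp(const Vec3& a, const Vec3& b, float t) {
    return {lerp(a.x, b.x, t), lerp(a.y, b.y, t), lerp(a.z, b.z, t)};
}

// Bounding box of a moving object, sampled at the ray's time in [0, 1].
BBox bbox_at_time(const BBox& start, const BBox& end, float ray_time) {
    return {lerp(start.min, end.min, ray_time), lerp(start.max, end.max, ray_time)};
}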

On the flip side, QBVHs work very well with motion blur, but they're difficult to put into a breadth-first context. It's not impossible, though. I've implemented a breadth-first QBVH traverser successfully by using a small bit stack embedded within each ray. But, on average, it doesn't reduce the partitioning cost and makes the ray data structure larger. Thankfully, it still speeds things up, just not as much as QBVHs with individual ray traversal.

On the whole, though, I'm definitely happy with breadth first ray tracing. It's relatively simple to implement, it gives extremely strong guarantees about scene data access patterns, and it allows for full hierarchical instancing. Definitely a good fit for what I'm trying to do with Psychopath. So unless something better comes along, I expect this will remain a core part of Psychopath's architecture for the foreseeable future.

Drawbacks of Ray Reordering

Now that I've covered the drawbacks I ran into with the geometry cache, I'd like to talk about the issue I ran into with ray reordering.

Ray reordering worked extremely well in pretty much every way. It was actually quite fast, it wasn't too complex or difficult to implement, and I was even able to do a pretty nice QBVH implementation thanks to the awesome paper "Stackless Multi-BVH Traversal for CPU, MIC and GPU Ray Tracing" by Áfra et al.

But the big drawback of ray reordering is that you have to keep ray state small. You have to be able to pause a ray's traversal, which means storing its current traversal state so that you can resume tracing later. And you are typically tracing thousands or even millions of rays at a time, so that state data can't be large.

With a typical BVH this is relatively straight-forward because it's a tree structure. You never have to store where you came from in the tree because it's implicit. No matter where you are in the BVH, you can just go to the parent node, and then the parent's parent node, and so-on.

Or put into different terms, it's a bit like writing an algorithm for walking a maze. There is a classic maze solving algorithm that doesn't require any state to be stored other than the current position: all you do is follow the wall on your right. And this works great as long as the maze is simply connected. By analogy, that's more-or-less how Psychopath handled things, and it allowed ray state to be very small.

Then I decided to introduce instancing (and full, arbitrarily deep, hierarchical instancing at that). When I did this, everything that allowed ray reordering to work came crashing down.

The fundamental problem is that when you introduce instancing you change the BVH from being a tree structure to being a DAG structure. Many nodes will then have more than one parent. Or to use the maze analogy again, it's like introducing loops and overpasses into the maze. When you do that, suddenly you have to start recording additional information about your traversal to do it correctly. And worse, that information can grow to be arbitrarily large.

When tracing thousands or millions of rays at a time, having unbounded storage for each ray is really not feasible. It also introduces memory allocation into the traversal process, which isn't ideal.

To get around this I considered limiting instancing to being a single level deep. That would put simple bounds on the amount of state information. But I really, really wanted hierarchical instancing. As a production feature I think it's really useful (imagine instancing leaves on a tree, and then instancing trees in a forest).

So I banged my head against the wall for a week or so, trying to figure out if there was a way I could reasonably make hierarchical instancing work with ray reordering. In practice, people aren't actually going to make arbitrarily deep instancing hierarchies, so I considered exploiting that in some way. But ultimately all the solutions I came up with seemed overly complex and/or brittle. I wanted things to Just Work™.

In the end, I abandoned ray reordering for something called breadth first ray tracing. I had already been using breadth first ray tracing in Psychopath's surface splitting code, so I had a reasonably good handle on it. And it had some really cool properties all its own. But I'll talk about that in another post.