
Psychopath Renderer

a slightly psychotic path tracer

2021-01-02

Sobol Sampling - Take 2

Shortly after I wrote my last post about Sobol sampling, I got in contact with Brent Burley, who created the hash-based Owen scrambling used in the paper Progressive Multi-Jittered Sample Sequences. He was extremely kind, and spared some time to chat with me about his Sobol sampler. He has also now published a paper about it:

Practical Hash-based Owen Scrambling, by Brent Burley

Unsurprisingly, it goes beyond what I had figured out on my own, and improves upon existing Sobol samplers in multiple ways. It's really an amazing paper, and I highly recommend giving it a read.

As it says in the title, it presents an efficient hash-based Owen scrambling technique. But it also presents a way to efficiently shuffle the order of the points in the sequence using the same hashing approach.1 The shuffling allows you to create as many statistically uncorrelated Sobol sequences as you like, which is crazy useful. For example, as described in the paper, you can pad a Sobol sequence with another Sobol sequence.

The combination of these two things--the shuffling and the Owen scrambling--creates a very powerful, very high-quality sampler. And it's efficient enough to be computed on-the-fly while rendering.

My Sobol sampler is now essentially the same as what's described in Burley's paper, so you can just read the paper to know what I'm doing. But my implementation does depart in one small way, which I believe is a tiny but worthwhile contribution, and is what the rest of this post is about.

The LK hash

The hash function used in Brent Burley's paper is taken from Stratified Sampling for Stochastic Transparency by Laine et al., and Burley refers to it variously as the "Laine-Karras permutation" and the "Laine-Karras hash". For brevity, I'll be calling it the "LK hash". Here is the pseudo code:

function lk_hash(x, seed) {
    x += seed
    x ^= x * 0x6c50b47c
    x ^= x * 0xb82f1e52
    x ^= x * 0xc7afe638
    x ^= x * 0x8d22f6e6
    return x
}

The critical property of this function is that, unlike a normal hash, it only avalanches upward, from lower bits to higher bits. In other words, it's a hash function where each bit only affects bits higher than itself.

This is the opposite of Owen scrambling, which is equivalent to a hash that only avalanches downward. However, it's easy to turn it into an Owen scrambling function: just reverse the order of the bits both before and after it:

function owen_scramble(x, seed) {
    x = reverse_bits(x)
    x = lk_hash(x, seed)
    x = reverse_bits(x)
    return x
}

Burley also uses the LK hash for shuffling the order of the Sobol points. So functions with this upward-only-avalanche property are really useful: they can be used both for shuffling and for Owen scrambling.
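
As a side note, here's a rough sketch (in Rust) of how the shuffling and the scrambling fit together, in the spirit of Burley's paper. This isn't Burley's exact code or mine: sobol_u32 is an assumed helper that returns the raw, unscrambled 32-bit Sobol value for a given index and dimension, and the way the per-dimension seed is derived here is purely illustrative.

fn lk_hash(mut x: u32, seed: u32) -> u32 {
    // Wrapping 32-bit arithmetic, matching the pseudo code above.
    x = x.wrapping_add(seed);
    x ^= x.wrapping_mul(0x6c50b47c);
    x ^= x.wrapping_mul(0xb82f1e52);
    x ^= x.wrapping_mul(0xc7afe638);
    x ^= x.wrapping_mul(0x8d22f6e6);
    x
}

fn owen_scramble(x: u32, seed: u32) -> u32 {
    // Bit-reverse before and after to flip the avalanche direction.
    lk_hash(x.reverse_bits(), seed).reverse_bits()
}

fn shuffled_scrambled_sobol(index: u32, dim: u32, seed: u32) -> f32 {
    // Shuffle the sample order by Owen-scrambling the index itself, which
    // maps each power-of-two prefix of the sequence onto some power-of-two
    // aligned block, keeping the prefixes well distributed.
    let shuffled_index = owen_scramble(index, seed);

    // Compute the raw Sobol value for this dimension, then Owen-scramble it
    // with a different seed per dimension. (The seed derivation is just
    // illustrative; any source of decorrelated per-dimension seeds works.)
    let dim_seed = seed ^ dim.wrapping_mul(0x9e3779b9);
    let x = owen_scramble(sobol_u32(shuffled_index, dim), dim_seed);

    // Map to [0, 1), keeping the top 24 bits so the result stays below 1.0.
    (x >> 8) as f32 * (1.0 / (1u32 << 24) as f32)
}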

The only problem is that this particular function isn't especially high-quality.2 It seems to be good enough in practice for rendering, so it may not actually be worth improving. But it still bugs me. Let's take a look.

Here is a fully Owen scrambled sequence as ground-truth, computed using a much slower technique (from left-to-right are 64, 256, 1024, and 4096 points from the sequence, respectively):

full Owen scramble

And here is the bit-reversed LK hash:

Laine-Karras scramble

It looks pretty good, but the higher sample counts are visibly different. In particular, at 1024 samples there is an obvious structure to it. And at 4096 samples the "texture" of the samples is different (though you may need to click to view the full-size images to see it clearly).

Making a better hash function

I first tried to improve this by optimizing the constants, using more permutation rounds, etc. That's where I was when I wrote my previous Sobol post. And although it did help some, all of the results still had obvious issues, especially when looking at other dimension pairs.3

After getting in touch with Brent Burley, however, I was inspired to take another crack at it. It was at that point that I tried different permutation operations. The first real success was adding x += x << 1 to each round:

function lk_improved_v1(x, seed) {
    x += seed
    x ^= x * 0x6c50b47c
    x += x << 1
    x ^= x * 0xb82f1e52
    x += x << 1
    x ^= x * 0xc7afe638
    x += x << 1
    x ^= x * 0x8d22f6e6
    x += x << 1
    return x
}

The result is this:

Laine-Karras improved v1

This is visibly much better than the original LK hash, especially at 1024 samples. The main problem is that it's also slower. Mind you, it's still plenty fast. But I really wanted to make a hash that works better and is at least as fast as the original LK hash.

The astute among you (more astute than I was at the time) may notice something a little fishy about the new operation I added. I originally came up with it by specifically trying to make something that only avalanches upward. But it also turns out to be the same as just multiplying x by 3.

When I finally noticed that, and after a bit more thought, I realized that multiplying by odd numbers is also a valid operation in a hash like this. So, all together, these are the permutations I'm currently aware of that are valid for constructing a hash like this:

  1. x ^= constant
  2. x += constant
  3. x *= odd_constant
  4. x ^= x * even_constant

You can mix and match these to try to create better hashes. (What they all have in common is that they never let a bit affect bits below itself: xoring a constant is bit-local, additions only carry upward, and multiplications only add left-shifted copies of x onto itself.) There are likely even more valid permutations, but I don't know what they might be.

Playing around with those permutations, I came up with this hash:

function lk_improved_v2(x, seed) {
    x += seed
    x ^= 0xdc967795
    x *= 0x97b754b7
    x ^= 0x866350b1
    x *= 0x9e3779cd
    return x
}

And here are the results:

Laine-Karras improved v2

It's not quite as good as the previous one, but it's close. And it's still much better than the original LK hash. Moreover, it's faster than both of them: this version only takes 5 CPU operations, compared to the 9 and 17 ops of the other two.

This version is what I'm now using in Psychopath. It works at least as well as the original LK hash, and is a bit faster.

In practice, I don't notice any difference in render quality between any of these different hashes, or even between them and full Owen scrambling for that matter. So... it probably doesn't really matter. But it makes me feel better that this is a bit closer to a full Owen scramble.

Final thoughts

I certainly don't think I've come up with the best possible hashes. There's clearly still room for improvement.

Moreover, a quantitative analysis of hash quality would be really useful here. I did measure upward avalanche with some bespoke throw-away testing code, and the results suggested that these new hashes do improve things. But my testing wasn't careful or thorough, and there could easily have been awful bugs in my testing code. So I don't actually trust the results.
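
To give an idea of what I mean by measuring upward avalanche, below is a sketch (in Rust) of the general approach: build an "avalanche matrix" by flipping each input bit over many random inputs and recording how often each output bit flips in response. For a good upward-only-avalanche hash, output bits below the flipped input bit should never change, and output bits above it should change about half the time. This is just an illustration of the idea, not my actual test code.

fn avalanche_matrix(hash: impl Fn(u32) -> u32, num_samples: u32) -> [[f64; 32]; 32] {
    // hash is the function under test (e.g. one of the hashes above, with
    // the seed baked in).
    let mut flip_counts = [[0u32; 32]; 32];

    // A cheap LCG for generating test inputs, to keep this self-contained.
    let mut rng_state = 0x12345678u32;
    let mut rand_u32 = move || {
        rng_state = rng_state.wrapping_mul(747796405).wrapping_add(2891336453);
        rng_state
    };

    for _ in 0..num_samples {
        let x = rand_u32();
        let hx = hash(x);
        for in_bit in 0..32 {
            // Flip one input bit and see which output bits change.
            let diff = hx ^ hash(x ^ (1u32 << in_bit));
            for out_bit in 0..32 {
                flip_counts[in_bit][out_bit] += (diff >> out_bit) & 1;
            }
        }
    }

    // Convert counts to flip probabilities. For an upward-only-avalanche
    // hash, entry [i][j] should be roughly 0.5 for j > i and exactly 0.0 for
    // j < i (bit positions counted from the least significant bit).
    let mut probabilities = [[0.0f64; 32]; 32];
    for i in 0..32 {
        for j in 0..32 {
            probabilities[i][j] = flip_counts[i][j] as f64 / num_samples as f64;
        }
    }
    probabilities
}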

I hope that this post can serve as a starting point for other people to develop even better LK-style hashes. And if someone does further work on this, I'd love to hear about it! But for the time being, I'm quite satisfied with what I've arrived at.

For the curious, here is my implementation.


Footnotes

  1. In my previous post I said that you shouldn't offset into Sobol sequences, because it makes Sobol sampling slow. And yet that's more-or-less what shuffling does, just in a much more sophisticated way. Indeed, shuffling will also make the Sobol sampler slower in the same way that offsetting does. However, since writing my last post I've discovered a couple of ways to mitigate that performance impact. Moreover, it's trivial to evaluate multiple Sobol dimensions at once with SIMD, which my sampler now does. So, it's quite performant now.

  2. Burley notes in his paper that measuring hash quality is a complex endeavor, and that he uses the original LK hash due to the testing and claims of the original authors. It's also important to note that in practice the LK hash seems to perform very well in rendering applications. So I think Burley's choice to use the hash as-is in his paper was a good one.

    However, looking at the actual points it generates, it obviously has room for improvement. Moreover, in the Laine-Karras paper I saw no indication that they developed the hash in any kind of rigorous way. Even further, they say this about the constants used in their hash:

    The multiplication constants are the ones used in generating the images in this paper, but they carry no particular significance beyond that.

    So my strong suspicion is that they essentially just eye-balled it until it worked well enough for their particular use-case.

  3. For brevity, the images in this post only show the first two Sobol dimensions. In my testing, however, I looked at over 50 different dimension pairs, which revealed similar differences between the LK hash and the full Owen scramble. The improved hashes in this post fix those differences about as well as the differences in the first two dimensions. Which is to say, it's notably better, but certainly still has room for improvement.

2020-09-22

Storing Color Compactly

One of the tricky things about a Manuka-style rendering architecture is that you need to store your shading information really compactly, since it will be stored at every micropolygon vertex.

Storing surface normals seems like a fairly solved problem at this point, and I recommend the paper Survey of Efficient Representations for Independent Unit Vectors by Cigolle et al. if you haven't already read it.

Color, on the other hand, still seems like a bit of an open problem. For something like Psychopath, ideally I want a color format that meets these criteria:

  • Compact (ideally 32 bits or less).
  • Fast to decode (I want to use this at runtime while rendering, potentially decoding on every ray hit).
  • Covers the entire gamut of human-visible colors.
  • Has sufficient dynamic range to capture day-to-day natural phenomena (e.g. the sun) without clipping.
  • Has sufficient precision to avoid visible banding.

I took a few detours, but I believe I've come up with a format that meets all of these criteria. And I call it FLuv32.

It's based on the 32-bit variant of LogLuv, which is a color format designed by Greg Ward of Radiance fame. LogLuv ticks almost all of the boxes: it's 32 bits, it covers the full gamut of visible colors, it has a massive dynamic range of 127 stops, and it has precision that exceeds the capabilities of human vision. It's a super cool format, and I strongly recommend reading the original paper.

However, the one box that it doesn't tick is performance. Since it uses a log encoding for luminance, decoding it requires an exp2() call, which is slower than I want.1

FLuv32 is essentially the same as LogLuv, but replaces the log encoding with floating point (hence the "F" in the name) which can be decoded with just a few fast operations. To achieve the same precision with floating point, I dropped the sign bit, foregoing negative luminance capabilities. But aside from that and a couple minor tweaks, it's the same as LogLuv. It has the same precision, the same massive dynamic range, and also fits in 32 bits. It's just a lot faster.

The FLuv32 color format

For anyone wanting to use FLuv32 themselves, I have a reference implementation in Rust here, and below is a description of the format.

Like LogLuv, FLuv32 uses an absolute color space, and as such takes CIE XYZ triplets as input. Also like LogLuv, it uses the Y channel directly as luminance, and computes u' and v' with the following CIELUV formulas:

u' = 4X / (X + 15Y + 3Z)
v' = 9Y / (X + 15Y + 3Z)

FLuv32 stores Y, u', and v' in the following bit layout (from most to least significant bit):

  • 7 bits: Y exponent (bias 42)
  • 9 bits: Y mantissa (implied leading 1, for 10 bits precision)
  • 8 bits: u'
  • 8 bits: v'

To store u' and v' in 8 bits each, their values need to be scaled to a [0, 255] range. u' is in the interval [0, 0.62], and v' is a tad smaller than that, so the original LogLuv uses a scale of 410 for both. FLuv32 departs from that very slightly, and instead uses the following scales:

u'-scale = 817 / 2
v'-scale = 1235 / 3

The motivation for this change is to be able to exactly represent E, the chromaticity of an equal-energy spectrum. (For E, X = Y = Z, which gives u' = 4/19 and v' = 9/19, and these scales map those to the exact integer values 86 and 195, respectively.) This is not perceptually meaningful, which is presumably why the original LogLuv didn't bother. But in the context of a spectral renderer this may be useful, and it's trivial to accommodate.

The luminance, Y, is stored with 7 bits of exponent and 9 bits of mantissa. The exponent has a bias of 42, and the mantissa follows the standard convention of having an implicit leading one, giving it 10 bits of precision. The minimum exponent indicates a value of zero (denormal numbers are not supported). The maximum exponent is given no special treatment (no infinities, no NaN).

The exponent's bias of 42 may seem surprising, since typically a 7-bit exponent would be biased by 63. The reasoning for this is two-fold:

  • The dynamic range is already monstrously excessive for practical imaging applications, so to some extent it doesn't matter.
  • It seems more useful to be able to represent the luminance of supernovae in physical units (which exceed 2^64 cd/m^2) than ludicrously dark values (and even so, 2^-42 is already ludicrously dark).

42 was chosen by putting two thirds of the exponent range above 1.0, and one third below.2
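
To make the layout concrete, here's a rough decode sketch in Rust. This is not the reference implementation linked above, and it glosses over edge cases (e.g. v' == 0) and exact rounding behavior; it's just meant to illustrate the bit layout and scales described in this section.

fn fluv32_decode(packed: u32) -> (f32, f32, f32) {
    // Unpack the fields: [ 7 bits Y exponent | 9 bits Y mantissa | 8 bits u' | 8 bits v' ].
    let y_exp = packed >> 25;
    let y_mant = (packed >> 16) & 0x1ff;
    let u_quant = (packed >> 8) & 0xff;
    let v_quant = packed & 0xff;

    // Rebuild luminance directly as an f32 bit pattern: re-bias the exponent
    // from 42 to the IEEE-754 bias of 127, and shift the 9-bit mantissa up to
    // fill the top of the 23-bit f32 mantissa. The minimum exponent means zero.
    let y = if y_exp == 0 {
        0.0
    } else {
        f32::from_bits(((y_exp + (127 - 42)) << 23) | (y_mant << 14))
    };

    // Dequantize chromaticity using the scales from above.
    let u = u_quant as f32 * (2.0 / 817.0);
    let v = v_quant as f32 * (3.0 / 1235.0);

    // Standard CIELUV relations to get from (Y, u', v') back to XYZ.
    let x = y * (9.0 * u) / (4.0 * v);
    let z = y * (12.0 - 3.0 * u - 20.0 * v) / (4.0 * v);
    (x, y, z)
}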

Other formats

I also explored some other formats before arriving at FLuv32.

The obvious first one is simply using three 16-bit floats for RGB values. Surprisingly, although this provides more than enough precision, it doesn't provide enough dynamic range. Standard IEEE half-floats only have 5 bits of exponent, giving only 32 stops. That's not bad, but it's not quite enough to cover the whole range from dark shadows to the brightness of the sun. Moreover, three half-floats take up 48 bits, which is more storage than necessary.

After that, I considered RGBE, which is actually a predecessor to LogLuv, and is used in the Radiance HDR image format. In its typical formulation, 8 bits each are given to R, G, B, and the shared exponent. This format is very fast to decode, but unfortunately doesn't have the precision needed to avoid visible banding.

I implemented two other RGBE variants to try to improve things:

  • 32-bits with a 9-9-9-5 split. While this improved precision, it also unfortunately reduced dynamic range too much.
  • 40-bits with an 11-11-11-7 split. This provides more than enough precision and range, at the expense of slightly increased storage requirements. I think this is actually a solidly good format if RGB is a requirement.

Before stumbling on LogLuv, I also explored my own home-grown luma-chroma format based on YCbCr. It didn't pan out, largely because if you don't know the precise RGB colorspace you're working from, you can't accurately separate chroma from luminance, which is critical for storing chroma in fewer bits. You could, of course, just choose a specific RGB colorspace (e.g. ACES2065-1), but even then your chroma don't end up being perceptually uniform, which also thwarts low-bit-count storage.

All-in-all, I would say if you want to store HDR RGB specifically, go with the 40-bit RGBE format, which should easily accommodate any RGB color space with plenty of precision. And if you just want to store HDR color more generally, then both LogLuv and FLuv32 are great formats.

Shower thought

It occurred to me that in a spectral renderer you might not even need to do a full decode from FLuv32 back to XYZ. You have to spectrally upsample all your colors anyway, and usually that's done via some kind of chroma-oriented lookup table. So I'm wondering now if it would be reasonable to build a lookup table that's just directly indexed by the quantized u' and v' coordinates.

I haven't fully explored that yet, but decoding just the luminance from FLuv32 is super fast. On CPU, on my machine, the full XYZ decode is about 3ns, whereas the Y decode is about 1ns. So that would be a significant savings. Even more so if you would just end up recomputing chroma from XYZ anyway for a spectral lookup.
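
For reference, under the same layout assumptions as the decode sketch above, a luminance-only decode might look something like this:

fn fluv32_decode_y(packed: u32) -> f32 {
    // Ignore the chroma bits entirely and rebuild Y from the top 16 bits,
    // using the same exponent re-biasing trick as the full decode sketch.
    let y_exp = packed >> 25;
    if y_exp == 0 {
        0.0
    } else {
        f32::from_bits(((y_exp + (127 - 42)) << 23) | (((packed >> 16) & 0x1ff) << 14))
    }
}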


Footnotes

  1. Arguably you could use a fast approximate exp2 to overcome performance limitations. However, the accuracy would need to be good enough to avoid introducing visible error, otherwise the whole point of the format is somewhat subverted.
  2. That it also happens to be the answer to everything is pure coincidence.

2020-04-20

Light Trees and The Many Lights Problem

This post will walk you through a particular approach for efficiently importance sampling huge numbers of lights in a scene. I call the approach "Light Trees".

This post has been a long time coming. I actually wrote about this technique (tersely) back in 2014 on ompf2.com. But I never got around to writing about it properly on this blog. So I'm fixing that now!

Before I get started, I want to make it clear that I am not the first person to come up with this idea. In particular, the Manuka guys have apparently been doing something like this since the inception of their renderer. And there are several other people that apparently came up with pretty much the same technique around the same time as me.

Really, the core approach is super straightforward. And the fact that I (of all people) independently came up with it is a strong indication that it's pretty obvious. After all, I'm an animator professionally, not a software developer.

So my goal in this post is not to claim any credit for the idea. Quite the contrary—given how many people have independently come up with it, I don't think anyone can reasonably claim credit for it.

Rather, the goal of this post is to clarify and distill the approach, since the papers I've seen about it describe more detailed or complex variants. I discuss those papers and their contributions at the end of this post.

The Many Lights Problem

The problem we're trying to solve with light trees is sometimes called the "many lights" problem. If you're already familiar with this problem and with importance sampling, then please feel free to skip to the "Light Trees" section. These first two sections are just filling in that background.

In a typical forward path tracer, when you want to calculate lighting at a point on a surface, you generally do two things1:

  1. Sample the light(s) in the scene to calculate direct lighting.
  2. Shoot one or more rays out in random directions to calculate indirect lighting.

The many lights problem shows up in that first step.

Calculating direct lighting is really easy if you only have one light: just shoot a shadow ray at the light, and if the ray reaches it, calculate the lighting. But what if there's more than one light? In that case, there are two broad approaches:

  1. Sample all lights.
  2. Pick one light at random.

The first approach—sampling all lights—works really well:

three lights full sampling

The second approach—sampling one light at random—also works well enough, but results in some noise due to the random selection:

three lights random sampling

Both approaches are fine. The former takes longer to render because it needs to do more calculations per shading point, but produces less noise. The latter renders faster but produces more noise, which takes more samples to get rid of. Which approach is better depends on the particular scene.

The many lights problem starts to show up when there are... eh... many lights. A large number of lights. For example, I have a test scene that contains roughly 8000 lights. Rendering this scene by sampling all lights looks like this2:

city scape full sampling

It doesn't look too bad. But it took 3 hours to render at only 16 samples per pixel, which is about 11 minutes per sample. That's because we have to spawn 8000 shadow rays on every ray hit, and potentially do as many lighting calculations. Imagine the render times on an actual production scene with even more lights, and at the sample rates needed to converge to a sufficiently noise-free image.

You might think that randomly selecting lights will save us. And, indeed, that brings our render time down to 6 seconds. But the result is so noisy that it's unusable:

city scape random sampling

That's because the probability of selecting a light that matters to the shading point is so low. Even if we crank our samples into the thousands, the render would still be somewhat noisy.

So it seems like we're stuck, then. Either our render times are intractable, or our renders are unusably noisy. And that, in a nutshell, is the many lights problem.

Importance Sampling

The answer to this problem is something called importance sampling3.

With approach #2 (random selection) we have a lot of wiggle room in how we select our lights, and the idea behind importance sampling is to use that wiggle room to sample important things with higher probability.

As an example, let's make a scene with two lights, with one of the lights much brighter than the other. If we render it with equal probability of selecting either light, it looks like this:

two lights equal sampling

But if we use importance sampling to select the brighter light with higher probability, it looks like this:

two lights importance sampling

So is that it, then? We just select the brighter lights more often, and we're good to go?

Unfortunately, no. You can get a hint of why that won't work by noticing that the area under the dimmer light actually gets more noisy in the importance sampled image.

The problem is that even though one light is brighter than the other, that doesn't mean it actually contributes more light to the point we're shading. For shading points that are sufficiently close to the dim light, the dim light is effectively the brighter one.

For a simple scene like the one above, that's not a big problem. But for something like the city scape scene it's a huge problem. First of all, all the lights in that scene are about the same brightness, so even with brightness-based importance sampling we would just be selecting them equally anyway. But also, even if some of the lights were brighter, what are the chances that any given light has much effect on any given shading point? Very low.

So what we actually want is to somehow sample the lights in proportion to the brightness they contribute to the current shading point. If we do that, things look way better:

two lights light tree sampling

But... how do we do that?

Light Trees

(To summarize for those that skipped to this section: our goal is to select a light from all lights in the scene according to their contributions to the current shading point.)

The naive way to do this would be to calculate the shading from all the lights, and then select a light based on those results. But that doesn't get us anywhere because then we're just sampling all the lights anyway.

So instead we need some efficient way to approximate that. And that's exactly what light trees let us do.

A light tree is a pretty straightforward data structure: it's just a BVH of the lights in the scene, where each node in the BVH acts as an approximation of the lights under it. For each node to be an effective approximation, we also need to store some extra data in it.

The simplest kind of light tree would simply store the total energy of all the lights under each node. You could then treat each node as a light with the same size as the spatial bounds of the node, and emitting the same energy as the lights under it.

I'm going to use that simple light tree variant to illustrate the technique. But I'll discuss some possible extensions in a bit.
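
To make that concrete, here's a small sketch of what a node in that simple variant might store (in Rust). The layout and field names are purely illustrative, not how Psychopath actually stores its tree:

struct LightTreeNode {
    // Spatial bounds of all the lights under this node.
    bounds_min: [f32; 3],
    bounds_max: [f32; 3],

    // Total energy emitted by all the lights under this node.
    energy: f32,

    // Interior nodes have two children; leaf nodes instead reference a light.
    children: Option<Box<[LightTreeNode; 2]>>,
    light_index: Option<u32>,
}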

Okay, so we have our light tree as described above, and we have our shading point on a surface. How do we select a light?

Starting from the root of the tree, we:

  1. Calculate the lighting (or an approximation thereof) from each of the child nodes.
  2. Take those lighting results and select which child to traverse into based on them. For example, if the left node contributes twice as much light, then we'll be twice as likely to traverse into it.
  3. If the selected child is a leaf, return the actual light under it. Otherwise, go back to step 1 with the selected child.

So that's the basic idea. But there are a couple of further details:

  • We also need to return the probability of selecting that light. This is because the calling code needs to weight that light's contribution based on that probability.
  • We need a good "approximate lighting" function to use in step 1. Incidentally, I like to call this the "weighting" function, because we only use it to determine the weighted probability of traversing into each node.

Calculating the light-selection probability is pretty easy: we just combine the probabilities of all the child nodes we traverse into, which is just multiplying them all together.

Coming up with a good weighting function is trickier. The ideal weighting function would exactly calculate the lighting from all the individual lights under the node. But if we could do that efficiently then we wouldn't need the light tree in the first place! So instead we need something that approximates that well. I'm going to punt on this for now, but we'll come back to it later. For the moment, just assume we have a reasonable weighting function.

Given all of those things, our light selection process is the following pseudo code, where tree_root is the root node of the light tree, shading_point is the data of the point we're shading, and n is a random number in the range [0, 1):

def light_tree_select(tree_root, shading_point, n):
    node = tree_root   # Updated as we traverse.
    selection_p = 1.0  # Selection probability, also updated as we go.
    
    loop:
        if is_leaf(node):
            return (node.light, selection_p)

        # Child node weights, or "approximate lighting".
        child_1_w = weight(node.child_1, shading_point)
        child_2_w = weight(node.child_2, shading_point)

        # Normalized probabilities of traversing into either child.
        child_1_p = child_1_w / (child_1_w + child_2_w)
        child_2_p = 1.0 - child_1_p

        # Select the child for the next loop.
        if n < child_1_p:
            node = node.child_1
            selection_p *= child_1_p # Accumulate probability.
            n /= child_1_p           # Scale random number.
        else:
            node = node.child_2
            selection_p *= child_2_p        # Accumulate probability.
            n = (n - child_1_p) / child_2_p # Scale random number.

And that's it! Well, modulo some floating point corner cases. But that's the entire thing. Really.

One thing I didn't mention yet: you need to scale the random number on each traversal step. That's what lets us get away with using just one number, instead of needing a new random number every time we choose a child. I'm feeling too lazy to explain how it works in detail, but you can think of it as the tree whittling down the space of the random number as it traverses deeper and deeper.

Anyway, if we implement that pseudo code, we get this beautiful result at 16 samples per pixel:

city scape basic light tree 16spp

This particular render uses a pretty simple weighting function: it assumes the shading point is lambert, treats the node as a spherical light, and just uses the analytic formula for lambert shading from a spherical light4.

That isn't an especially good weighting function for anything other than diffuse surfaces. But it works surprisingly well in many cases. It's definitely not sufficient for a robust implementation, however.
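
Just to make the role of the weighting function concrete, here's a deliberately crude stand-in (in Rust), reusing the illustrative LightTreeNode layout sketched earlier in this post. To be clear, this is not the formula used for the render above (that one uses the analytic spherical-light formula from Snyder's paper; see footnote 4). This sketch only captures the general energy / cosine / inverse-square behavior, and all the names are illustrative.

fn weight(node: &LightTreeNode, pos: [f32; 3], normal: [f32; 3]) -> f32 {
    // Treat the node as a sphere around its bounds.
    let mut center = [0.0f32; 3];
    let mut radius2 = 0.0f32;
    for i in 0..3 {
        center[i] = (node.bounds_min[i] + node.bounds_max[i]) * 0.5;
        let half = (node.bounds_max[i] - node.bounds_min[i]) * 0.5;
        radius2 += half * half;
    }

    let d = [center[0] - pos[0], center[1] - pos[1], center[2] - pos[2]];

    // Clamp the squared distance so it can't collapse to zero when the
    // shading point is inside the node's bounds.
    let dist2 = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).max(radius2);
    let dist = dist2.sqrt();

    // Crude Lambert term, clamped to a small positive value so nodes behind
    // the surface still get some small chance of being selected.
    let cos_theta = ((normal[0] * d[0] + normal[1] * d[1] + normal[2] * d[2]) / dist).max(0.05);

    node.energy * cos_theta / dist2
}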

Potential Extensions and Improvements

Starting from that base, there are a bunch of potential improvements we could explore. For example:

  • A better weighting function that handles, among other things, a wider variety of materials well. This is more-or-less necessary for a robust implementation.
  • Storing more information than just total energy in the nodes, and using that for better weighting. For example, storing the approximate directionality of lights in cases such as spot lights and some area lights.
  • Alternative tree constructions. For example, using something other than the surface area heuristic to cluster lights, or using higher-arity trees.
  • Accounting for shadowing (somehow...).
  • etc.

The core idea—using a hierarchy to approximate light importance—is always the same. But the potential for variety within that is huge.

As a quick example, this render:

city scape better light tree 16spp

...is also 16 samples per pixel, but (if you flip between it and the last render) it's clearly far less noisy. This was achieved by playing with some of the things above.

Further Reading

There are two published papers I'm aware of that describe variations on this light tree approach, both of which are very much worth reading:

Importance Sampling of Many Lights With Adaptive Tree Splitting is, in my opinion, the more significant paper of the two. It covers a large breadth of things including:

  • A novel tree construction approach that goes beyond the SAH, accounting for factors unique to light trees.
  • A way to account for the directionality of lights in the tree and to use that information to improve weighting calculations.
  • A novel "tree splitting" approach for returning multiple lights with reduced variance.

The main downside is that their weighting functions don't work well without their tree splitting approach, and can produce very high variance on their own. (In fact, the Stochastic Lightcuts paper makes comparisons to it without the tree splitting, which highlights how poorly it performs in that case.)

Stochastic Lightcuts proposes the same technique as both this post and my 2014 post on ompf2, but frames it as an extension to lightcuts. I personally find that framing confusing, and it actually took me a couple readings to realize that it is, in fact, the same technique. But some people might find that framing easier to understand, especially if they're already familiar with lightcuts.

However, I believe the significant contribution of Stochastic Lightcuts is the weighting function they propose, which is based on the lightcuts error bounding function, and accounts for different surface materials and lights with directional components. Moreover, unlike the weighting function from the Adaptive Tree Splitting paper, it works really well with the standard traversal approach—no tree splitting required. I would definitely recommend this paper as the baseline for any real implementation of light trees, and then explore improvements from there.


Footnotes

  1. This is a simplification, of course. But it's sufficient for illustrating the problem.

  2. If you compare to later renders in the post, you'll notice that it doesn't quite match. That's because this one was rendered in Cycles, but the others are rendered in Psychopath. Psychopath doesn't support sampling all lights, and I didn't feel like implementing a whole feature just for illustration purposes.

  3. For a more thorough introduction to importance sampling, I strongly recommend the amazing PBRT book, which is available for free online. (But is also very much worth buying!)

  4. See the paper "Area Light Sources for Real-Time Graphics" by John M. Snyder for the appropriate formula.

2020-04-14

Building a Sobol Sampler

(For an update on Psychopath's Sobol sampler, please also see the next post in this series: Sobol Sampling - Take 2.)

Up until recently I was using Leonhard Grünschloß's Faure-permuted Halton sampler as my main sampler in Psychopath. I even ported it to Rust when I made the switch from C++. And it works really well.

I've also had a Rust port of Grünschloß's Sobol sampler in Psychopath's git repo for a while, and from time to time I played with it. But it never seemed notably better than the Halton sampler in terms of variance, and it was significantly slower at generating samples. This confused me, because Sobol seems like the go-to low discrepancy sequence for 3d rendering, assuming you want to use low discrepancy sequences at all.

However, I recently stumbled across some new (to me) information, which ultimately sent me down a rabbit hole implementing my own Sobol sampler. I'd like to share that journey.

Incidentally, if you want to look at or use my Sobol sampler, you can grab it here. It's MIT licensed. (Note from the future: this link is to the code as of the writing of this post. See the above-linked second post for a significantly improved version.)

Don't Offset Sobol Samples

The first two things I learned are about what not to do. And the title of this section actually refers to both of those things simultaneously:

  1. Cranley Patterson Rotation.
  2. Randomly offsetting your starting index into the sequence for each pixel.

These are two different strategies for decorrelating samples between pixels. The first offsets where the samples are in space, and the second offsets where you are in the sequence. Both are bad with Sobol sequences, but for different reasons.

Cranley Patterson Rotation is bad because it increases variance. Not just apparent variance, but actual variance. It's still not entirely clear to me why that's the case—it feels counter-intuitive. But I first found out about it in the slides of a talk by Heitz, et al. about their screen-space blue noise sampler. It's also been borne out in some experiments of my own since then.

So the first lesson is: don't use Cranley Patterson Rotation. It makes your variance worse.

Randomly offsetting what index you start at for each pixel is bad for a completely different reason: it's slow. Due to the way computing Sobol samples works, later samples in the sequence take longer to compute. This is what was making it slow for me. I already used this technique with the Halton sampler, to great success. So I assumed I could just apply it to Sobol as well. And you can. It works correctly. It just ends up being comparatively slow.

So the second lesson is: don't offset into your Sobol sequence (at least not by much).

The Right Way to Decorrelate

So if you can't do Cranley Patterson Rotation, and you can't offset into your sequence, how can you decorrelate your pixels with a Sobol sampler?

The answer is: scrambling.

There are two ways (that I'm aware of) to scramble a Sobol sequence while maintaining its good properties:

  1. Random Digit Scrambling
  2. Owen Scrambling

Random digit scrambling amounts to just xoring the samples with a random integer, using a different random integer for each dimension. The PBRT book covers this in more detail, but it turns out that doing this actually preserves all of the good properties of the Sobol sequence. So it's a very efficient way to decorrelate your pixels.
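
As a sketch, the whole thing is just an xor (here in Rust, with sobol_u32 standing in for a function that returns the raw 32-bit Sobol value for a given index and dimension):

fn rds_sobol(index: u32, dim: u32, scramble: u32) -> f32 {
    // Random digit scrambling in base 2: xor the raw Sobol value with a
    // random constant chosen once per dimension (and per pixel).
    let x = sobol_u32(index, dim) ^ scramble;

    // Map to [0, 1), keeping the top 24 bits so the result stays below 1.0.
    (x >> 8) as f32 * (1.0 / (1u32 << 24) as f32)
}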

I gave that a try, and it immediately made the Sobol sampler comparable to the Halton sampler in terms of performance. So it's definitely a reasonable way to do things.

But then, due to a semi-unrelated discussion with the author of Raygon, I ended up re-reading the paper Progressive Multi-Jittered Sample Sequences by Christensen et al. The paper itself is about a new family of samplers, but in it they also compare their new family to a bunch of other samplers, including the Sobol sequence. And not just the plain Sobol sequence, but also the rotated, random-digit scrambled, and Owen-scrambled Sobol sequences.

The crazy thing that I had no idea about, and which they discuss at length in the paper, is that Owen scrambled Sobol actually converges faster than plain Sobol. That just seems nuts to me. Basically, Owen scrambling is the anti-Cranley-Patterson: it's a way of decorrelating your pixels while actually reducing variance. So even if you're not trying to decorrelate your pixels, it's still a good thing to do.

(As an interesting aside: random digit scrambling is a strict subset of Owen scrambling. Or, flipping that around, Owen scrambling is a generalization of random digit scrambling. But what's nice about that is you can use the same proof to show that both approaches preserve the good properties of Sobol sequences.)

Implementing Owen Scrambling Efficiently

The trouble with Owen scrambling is that it's slow. If you do some searches online, you'll find a few papers out there about implementing it. But most of them are really obtuse, and none of them are really trying to improve performance. Except, if I recall correctly, there's one paper about using the raw number crunching power of GPUs to try to make it faster. But that's not really that helpful, since you probably want to use your GPU for other number crunching.

So it basically seems like Owen-scrambling is too slow to be practical for 3d rendering.

Except! There's a single sentence in the "Progressive Multi-Jittered Sample Sequences" paper that almost seems like an afterthought. In fact, I almost missed it:

We do Owen scrambling efficiently with hashing – similar in spirit to Laine and Karras [LK11], but with a better hash function provided by Brent Burley.

"LK11" refers to the paper Stratified sampling for stochastic transparency by Laine et al. The really weird thing is that this paper isn't even about Sobol sampling. But they, too, have a one-liner that's really important:

This produces the same result as performing an Owen scramble [...]

What they're referring to here is a kind of hashing operation. They use it to decorrelate a standard base-2 radical inverse sequence, but the same approach applies just as well to Sobol sequences.

But you can't just use any hash function. In fact, you have to use a really specific, weird, special kind of hash function that would be a very poor choice for any application that you would typically use hashing for.

Here's the basic idea: the key insight Laine et al. provide is that Owen scrambling is equivalent to a hash function that only avalanches downward, from higher bits to lower bits. In other words, a hash function where each bit only affects bits lower than itself. This is a super cool insight! And it means that if we can design a high-quality, efficient hash function with this property, then we have efficient Owen scrambling as well.

It turns out, though, that developing a high-quality hash function that meets these criteria is pretty challenging. For example, the hash function they present in the paper, while fast, is pretty poor quality. Nevertheless, I decided to take a crack at it myself.

So far, I've stuck to the general approach from Laine et al., but worked on optimizing the constants and tweaking a few things. I don't think I've gained any real insights beyond their paper, but I have nevertheless made some improvements through those tweaks. At this point, I have a hash function that I would call "okay" for the purposes of Owen scrambling. But there is very clearly still a lot of room for improvement.

I'm extremely curious what the implementation from "Progressive Multi-Jittered Sample Sequences" looks like. And I'm especially curious if there are any interesting insights behind it regarding how to construct this bizarre kind of hash function.

My Sobol Sampler

Again, you can grab my Sobol sampler here if you like.

It includes implementations of all the (non-slow) decorrelation approaches I mentioned in this post:

  • Cranley-Patterson rotation
  • Random Digit Scrambling
  • And hash-based Owen Scrambling

It also includes, of course, plain non-scrambled Sobol.

If you do make use of this somewhere, I strongly recommend using the Owen-scrambled version. But it can be fun to play with the others as well for comparison.

However, one thing I noticed in my testing is that Cranley-Patterson rotation seems to do a better job of decorrelating pixels. The two scrambling approaches seem to exhibit some correlation artifacts in the form of clusters of mildly higher-variance splotches in certain scenes. It's mild, and Owen scrambling still seems to win out variance-wise over Cranley-Patterson. But still. This is definitely a work in progress, so be warned.

Regardless, I've had a lot of fun with this, and I've learned a lot. Pretty much everything I talked about in this post was like black magic to me before diving head-first into all of it. I definitely recommend looking into these topics yourself if this post was at all interesting to you.