
Psychopath Renderer

a slightly psychotic path tracer

2021-01-30

Building a Better LK Hash

(This post has been updated as of 2021-05-08 with important changes due to an issue identified by Matt Pharr. The majority of the post is essentially unchanged, but additional text discussing the issue, along with a new hash that addresses it, has been added near the end of the post. If you just want the hash itself, skip down to the section "The final hash".)

At the end of my last post about Sobol sampling I said that although there was room for improvement, I was "quite satisfied" with the LK hash I had come up with for use with the techniques in Brent Burley's Owen scrambling paper. Well, apparently that satisfaction didn't last long, because I decided to have another go at it. And in the process I've not only come up with a much better hash, but I've also learned a bunch of things that I'd like to share.

Owen scrambling: what is it, really?

Before getting into the work I've done, let's revisit Owen scrambling.

Let's say we have a bunch of Sobol points on the unit square:

Points on the unit square

To perform a single-axis Owen scramble we do the following recursive procedure: divide the current segment into two halves, randomly choose to swap those halves or not, and then recurse into each half and repeat.

Let's walk through a quick example: we start by deciding randomly whether to swap the left and right halves or not. In this case, let's say we do swap them:

Points on the unit square, swapped 1

Then for each half we decide randomly whether to swap its sub-halves. Let's say we do for the left, but we don't for the right:

Points on the unit square, swapped 2

Then for each of those segments, we decide randomly whether to swap their left and right halves:

Points on the unit square, swapped 3

And so on, all the way to infinity (or to the precision of your numeric representation).

To scramble on the vertical axis as well, we just do the same random procedure vertically. And you can extend it to 3, 4, or a million dimensions by just repeating the process in all dimensions. Here are the same points, fully scrambled on both axes:

Points on the unit square, fully scrambled

And that's Owen scrambling.1

Pretty simple, right? And remarkably, if performed on Sobol points, it not only preserves their great stratification properties, but actually improves their quality.

Thinking in bits.

The above explanation is, in my opinion, the most intuitive way to think about Owen scrambling. However, to build a hash that behaves the same way, we'll want to frame the process in terms of binary bits. So... here we go:

If we represent each point's coordinates with unsigned binary integers, with their range spanning the full extent of the square:

Integer spanning the unit square

...then we can think of the scrambling process as making random bit-flipping decisions. For example, flipping the highest (most significant) bit of a coordinate for all points is identical to swapping the left and right halves of the square on that coordinate's axis.
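
In code, that highest-bit flip is trivial. Here's a tiny Rust sketch of it (my own illustration, assuming 32-bit integer coordinates):

// Flipping the most significant bit of a 32-bit coordinate moves each point
// from the left half to the right half (and vice versa) while preserving its
// offset within that half -- i.e. it swaps the two halves of the square.
fn swap_halves(coord: u32) -> u32 {
    coord ^ 0x8000_0000
}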

One way we can visualize bit-flipping-based Owen scrambling is with what I'll be calling a "scramble tree". Let's take a four-bit binary number, 1101, as an example:

A 4-bit Owen scramble tree, processing a 4-bit number

Each node in the scramble tree represents a bit flipping decision, with green = flip and white = don't flip.

The top node tells us whether to flip the highest bit or not. Then we move down the tree, going left if the highest bit was 0, and right if it was 1.2 The node we traverse into then tells us whether we flip the second-highest bit or not. Then we traverse left or right depending on whether that bit was 0 or 1. And so on.

If this scramble tree is generated randomly, and a different tree is used for each dimension, then it's equivalent to the more visual process I outlined earlier.

Implementing scramble trees in practice.

We could use scramble trees as an actual data structure to perform Owen scrambling. However, in practice the trees are too large to be practical: a 32-bit scramble tree is at minimum about 500MB. So instead we want something that behaves identically to a scramble tree, but without the storage requirement.

One way to accomplish that is to consider how we fill in the nodes with random values. We could use an RNG, of course, and just fill in one node after another. But there is another source of randomness we can use instead. Let's take a look at the tree again:

path through a scramble tree

(I've notated the tree with 1's and 0's this time, indicating which input bit values correspond to which traversal paths.)

What we can see in this diagram is that each node has a unique "address", so to speak, which is simply the string of 1's and 0's that lead to it starting from the root. Utilizing those unique addresses, we can fill in the nodes with a good hash function: instead of using an RNG, we simply hash each node's address to get its random value.

What's great about that is it means we can drop the explicit tree altogether: since we can produce each node's value on demand with nothing more than its address, and since the address of the node we want is always just the string of bits higher than the bit we're making the flip decision for, we can just directly compute the node's value when needed without any tree traversal.

With this approach we still have a scramble tree, but it's implicit in the hash function we choose rather than being explicitly stored in memory.

Writing a function to do Owen scrambling this way is relatively straightforward, and in pseudo code might look something like this:

function owen_scramble(x):
    for bit in bits of x:
        if hash(bits higher than bit) is odd:
            flip bit
    return x

It simply loops through all the bits and flips each one based on a hash of all the bits higher than it.

This is awesome, but it's not quite enough for our needs. This function uses a single fixed hash function as its source of randomness, and is therefore equivalent to a single fixed scramble tree. However, we need a different scramble tree for each dimension, and if we want to decorrelate pixels with Owen scrambling then we even need different trees for each pixel. That's a lot of trees!

Fortunately, a small change lets us accomplish this quite easily. If we use a seedable hash function instead, then we can change the hash function (and thus the tree) by just passing a different seed value:3

function owen_scramble(x, seed):
    for bit in bits of x:
        if seedable_hash(bits higher than bit, seed) is odd:
            flip bit
    return x

And that's it! This function, if implemented properly and with a sufficiently good seedable hash function4, is a fully capable Owen scrambling function.
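
In case concrete code helps, here's a minimal Rust sketch of that reference function. It's an illustration rather than my actual implementation: it leans on the standard library's DefaultHasher purely for convenience (as mentioned in footnote 4, I use SipHash-1-3 keyed with the seed in practice), and it uses the output-bit traversal convention from footnote 2.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Per-bit Owen scramble of a 32-bit value.  Slow, but useful as ground truth.
fn owen_scramble_slow(mut x: u32, seed: u32) -> u32 {
    // Work from the highest bit down to the lowest, flipping each bit based
    // on a hash of the bits above it.
    for bit in (0..32u32).rev() {
        // The node "address": the string of bits above `bit`, plus how many
        // of them there are so that e.g. "0" and "00" hash differently
        // (see footnote 4).
        let bit_count = 31 - bit;
        let high_bits = if bit == 31 { 0 } else { x >> (bit + 1) };

        let mut hasher = DefaultHasher::new();
        (seed, bit_count, high_bits).hash(&mut hasher);
        if hasher.finish() & 1 == 1 {
            x ^= 1u32 << bit;
        }
    }
    x
}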

If you're curious, my actual implementation is here.

Owen hashing.

The line of reasoning I outlined above to arrive at this Owen scramble function focuses on scramble trees. And that's a really useful framing, because it's the framing that connects it to Owen scrambling. But if we shift our perspective a bit, there's an alternative framing that's also really useful: hashing. In other words, our Owen scrambling function doesn't just use a hash function, it actually is a hash function as well.

Specifically, in keeping with the insight from the Laine-Karras paper, it's a special kind of hash function that only allows bits to affect bits lower than themselves.

Since our goal is to create precisely such a function, that means we've already succeeded. Yay! And for many applications, using the above function is exactly what you want to do, because it is certainly high quality. However, for rendering it's a bit too slow. It executes a full hash for each bit, and that's quite a bit of CPU time.

Nevertheless, it serves as an excellent reference--a "ground truth" of sorts--that we can compare to while designing our LK-style hash.

One way we could make that comparison is to just eyeball the resulting point sets, as I did in my previous post. We could also approach it with tools typical of point set analysis, such as comparing the frequency content of point sets via fast Fourier transform.

But another way we can approach it is to take the "hash function" framing and evaluate our scrambling functions quantitatively as hash functions.

Evaluating hash functions.

One of the primary measures of hash function quality is avalanche. Avalanche is the extent to which each input bit affects each output bit. For a proper hash with good avalanche, all input bits should affect all output bits with close to equal probability.

Measuring avalanche is pretty straightforward: we take an input number and its resulting hash output, and then we flip each of the input number's bits one at a time to see which output bits flip with them. For a single input number, and a proper hash function, a graph of that looks like this:

single number avalanche graph

The vertical axis is which input bit we flipped, and the horizontal axis is which output bits flipped with it. A white pixel means the output bit flipped, a black pixel means it didn't.

If we do this same process with a whole boatload of input numbers, and average the results together, we get an avalanche graph:

multi-number avalanche graph

All pixels are a medium gray (modulo sampling noise). That means that, probabilistically, all input bits have a 50% chance of flipping any given output bit. This is exactly what you want from a proper hash.

We can also visualize the avalanche bias, which is simply the difference from that ideal 50% gray result. Black pixels then represent a 50% flipping probability, and anything lighter represents a deviation from that ideal:

multi-number avalanche bias graph

As you would expect from a good hash function, it's just black. No significant bias.
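
For reference, the measurement itself is simple enough to sketch out. Something like the following Rust is roughly what I mean (the xorshift RNG and the exact bookkeeping are just illustrative assumptions): each cell of the result is how far input bit i's chance of flipping output bit j deviates from the ideal 50%.

// Rough sketch of measuring avalanche bias for a 32-bit hash.
fn avalanche_bias(hash: impl Fn(u32) -> u32, samples: u32) -> [[f64; 32]; 32] {
    let mut flip_counts = [[0u32; 32]; 32];
    let mut rng = 0x12345678u32; // Arbitrary non-zero xorshift32 state.

    for _ in 0..samples {
        // Generate a random input (xorshift32, purely for illustration).
        rng ^= rng << 13;
        rng ^= rng >> 17;
        rng ^= rng << 5;
        let input = rng;
        let output = hash(input);

        for i in 0..32 {
            // Flip input bit i and record which output bits flipped with it.
            let flipped = hash(input ^ (1u32 << i)) ^ output;
            for j in 0..32 {
                flip_counts[i][j] += (flipped >> j) & 1;
            }
        }
    }

    // Convert flip counts into deviation-from-50% probabilities.
    let mut bias = [[0.0f64; 32]; 32];
    for i in 0..32 {
        for j in 0..32 {
            bias[i][j] = (flip_counts[i][j] as f64 / samples as f64 - 0.5).abs();
        }
    }
    bias
}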

Owen scrambling bias.

So what does our reference Owen scrambling implementation look like? Well, we have to be a little careful when computing its avalanche bias. Remember that our Owen hash is seedable, so it actually represents a set of hash functions. So what we actually want is to compute bias graphs for a whole boatload of different seeds, and average them together. That will give us the average bias.

If we do that, it looks like this:

Owen scramble avalanche bias graph

This is really interesting! If we were trying to build a proper hash function, this would be an awful result. There's tons of bias. Fortunately, we're not actually trying to build a hash function per se; we're just exploiting hashing techniques to do Owen scrambling. And this is simply what proper Owen scrambling looks like when treated as a hash.

So this gives us a target to hit--a ground-truth result. There are a few interesting things to notice about it:

  1. The entire lower-right half of the graph is pure white. This is what we want, and represents a critical property of Owen hashing: bits should only affect bits lower than themselves.

  2. There is a single additional white pixel in the upper right. This is also expected: that pixel indicates the effect that the highest bit has on the second-highest bit. If you revisit the scramble tree, you can see that every possible tree configuration either causes the highest bit to always change the flip decision of the second-highest bit or never change it. You cannot create a tree with less than 100% bias between those two bits.

  3. The upper-right corner has some gray that fades off as it moves left. This actually has the same cause as the extra white pixel: a limited set of possible tree configurations. This gray gradient is simply the average bias of that limited set. In fact, this precise gradient5 is an indication (though not proof) that the seeding is working properly, and producing all tree configurations with roughly equal probability.

How do the LK hashes compare?

So now that we have a ground-truth, we can compare our previous hashes. But we do have to address one thing first: as I mentioned in my previous posts, LK-style hashes actually do the opposite of an Owen scramble in the sense that instead of mixing from high bits to low bits, they mix from low to high.

So for convenience of comparison, here is the Owen scramble graph again, flipped to match LK-style hashes (this is what a "perfect" LK hash should look like):

reverse-bit Owen scramble avalanche bias graph

And here's the original Laine-Karras hash:

LK hash avalanche bias graph

And here's the hash from the end of my previous post:

v2 avalanche bias graph

So... my "improved" hash from the previous post actually looks quite a bit worse than the original Laine-Karras hash, which is embarrassing. So... uh, yeah, don't use that one. (In my defense, the point sets it generates actually do have a more consistently "okay" FFT, but still.)

The original Laine-Karras hash has a curious line of gray along the diagonal, but otherwise looks pretty good overall.

Both hashes suffer from having a very uneven/noisy gray gradient, and actually have less apparent bias in many pixels compared to the reference Owen scramble. Again, if we were developing a proper hash function, less bias would be good. But in our case that actually suggests that there are scramble trees they're not generating regardless of seeding. In other words, it means that the seeding process is biased.

Measuring seeding bias.

Seeing the issues with the gray gradient made me want a way to visualize that seeding bias more directly. The avalanche bias graph only gives an indirect indication of that issue.

What I came up with is based on this:

seed = random()

input1 = random()
input2 = random()
output1 = lk_hash(input1, seed)
output2 = lk_hash(input2, seed)

y = reverse_bits(input1 ^ input2)
x = reverse_bits(output1 ^ output2)

...and essentially making a 2d histogram of the resulting (x, y) points. The idea behind this is to reveal node relationships in the scramble tree that don't meaningfully change with the seed. Or, in other words, it provides some insight into what tree configurations the seeding process can produce.

In practice, a full histogram of this would be impractically large--it would be a 4294967296 x 4294967296 pixel image. So although the above code snippet is the idea, in practice I did some further manipulations to focus on the bits corresponding to the top of the scramble tree.

Both the base approach as well as the additional manipulations I did mean that this is far from perfect. It absolutely misses things. But it nevertheless provides more insight into our hash functions, which is good!

Anyway, here's what it looks like for the proper (bit-reversed) Owen scramble function:

Owen scramble dual-graph

Here's the original LK hash:

LK hash dual-graph

And here's my "improved" hash from the previous post:

v2 dual-graph

So it turns out both hashes are pretty bad. At least, in the sense that they are strongly biased away from particular tree configurations.6

Developing better LK-style hashes.

Since we now have some objective--even if incomplete--quality measures, some avenues for hash design have opened up. In particular, we can automate or semi-automate the search for better hashes by trying to maximize those quality measures.

I took inspiration from Hash Function Prospector, and wrote some code to randomly generate hashes from valid operations, and keep the hashes that most closely matched the measures of an ideal LK-style hash.

My hope was that this, on its own, would find some really great hashes. And although it certainly did find better hashes, in practice it required some intervention and guidance to find really good ones. In particular, I found that the following three-stage process tended to work well:

  1. Use a fully randomized search to look for sequences of operations that seem to work well.
  2. Take the best-performing of those op sequences, and do a search that only changes their constants.
  3. For the very best hashes that come out of that, compare them in the visual inspector that Brent Burley provided in the supplemental material of his Owen hashing paper, focusing in particular on visually comparing their FFTs across a variety of seeds and sample counts.

The code I hacked together to do these evaluations and searches can be found here. It also includes a version of Brent Burley's visual inspector, with the hashes from this post included (including the full Owen scramble and the two hashes below).

The first results, and a problem.

The highest quality hash that initially came out of this process was the following:

(Note: please don't use this hash! It has an issue, discussed and fixed below.)

x *= 0x788aeeed
x ^= x * 0x41506a02
x += seed
x *= seed | 1
x ^= x * 0x7483dc64

high-quality hash dual-graph

The quantitative measures, while not perfect, appear very good. And visual inspection of both the resulting point sets and their FFTs was (as far as I could tell) essentially indistinguishable from a full Owen scramble. It's also no more expensive than the original Laine-Karras hash.

However, Matt Pharr identified a significant problem with this hash when testing it in PBRT. In short, if you run the following code which buckets outputs based on their lowest 8 bits:

any_constant = 123
buckets = [0, 0, 0, 0, ...]
loop for as long as you like:
    seed = random()
    buckets[owen_hash(any_constant, seed) & 0xff] += 1

...almost half of the buckets end up empty, and there is significant bias even in the buckets that aren't. Since you need to reverse the bits for Owen scrambling, those lowest bits become the highest bits, and it can end up being a real problem.7

After investigating a bit, I discovered that the root cause of the issue was this part of the hash:

x += seed
x *= seed | 1

The addition and the immediately following multiplication by the same seed value interfere with each other.

This is unfortunate because this sequence is otherwise remarkably good at producing Owen-scrambling-like behavior with a seed. There is no other sequence I could find that even comes close. Even just changing the order of the addition/multiply, or inserting other operations between them, significantly degrades the quality of the hash by the other measures in this post.

Fortunately, it turns out that the fix is easy: use two independent random seed values.

x += seed_1
x *= seed_2 | 1

This not only fixes the issue Matt found, but actually improves the other measures as well.

The downside is that it now requires 64 bits of random seed, rather than just 32 bits. This isn't a huge problem, but I wanted to keep it to 32 bits if I could. Some further investigation revealed this slightly hacky trick, which works really well in practice:

x += seed
x *= (seed >> 16) | 1

The idea here is to decorrelate the lower bits of the two operations. The bucketing issue is still present to some extent in the higher bits with this hack, but even with 2^24 buckets the distribution has very even statistics and only about 100 empty buckets. For 3d rendering, that's more than good enough, especially considering that 32-bit floats only have 24 bits of precision anyway. But if that nevertheless makes you uncomfortable, you can use a full 64 bits of seed instead.

In any case, if you simply drop either of these fixes into the hash above, it resolves the issue.

The final hash.

After figuring this out, I decided to do another round of random hash generation to see if I could get an even better LK-style hash. And it turns out the answer is yes!

So here is the final hash:

x ^= x * 0x3d20adea
x += seed
x *= (seed >> 16) | 1
x ^= x * 0x05526c56
x ^= x * 0x53a22864

final high-quality hash dual-graph

This is much better than the previous results, and is remarkably close to the reference per-bit hash. The FFT in Burley's visual inspector also appears essentially indistinguishable from that of a full Owen hash.

Please note that, just like the original LK hash, this hash needs random seeds to really work well. In Psychopath I pass incremental seeds through a fast 32-bit hash function before feeding them to this hash, and that works really well.
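
For convenience, here's that final hash written out as a complete Rust function, wrapped in the bit reversals that turn its low-to-high mixing into an Owen scramble of the high bits (this is the same reverse-bits wrapper described in the Sobol Sampling - Take 2 post below). The wrapping arithmetic just makes the usual unsigned overflow behavior explicit.

// The final LK-style hash, wrapped so that it Owen-scrambles high-to-low.
fn lk_owen_scramble(mut x: u32, seed: u32) -> u32 {
    x = x.reverse_bits();

    x ^= x.wrapping_mul(0x3d20adea);
    x = x.wrapping_add(seed);
    x = x.wrapping_mul((seed >> 16) | 1);
    x ^= x.wrapping_mul(0x05526c56);
    x ^= x.wrapping_mul(0x53a22864);

    x.reverse_bits()
}

And as noted above, the seed you pass in should itself already be well randomized (e.g. run through a fast hash) rather than a raw incrementing index.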

Wrapping up

One of the things that has made exploring this problem so interesting to me is that it's largely unexplored territory. It shares a lot in common with general hashing, but is nevertheless quite different in both purpose and quality evaluation.

I've taken a stab at developing such evaluation measures in this post, but as is clear from the problem that Matt Pharr found, they aren't complete. I feel reasonably confident that this final hash is good—certainly better than the original LK hash. But it's always possible there are still subtle issues. If anyone does discover issues with it, please contact me!


Footnotes

  1. Actually, this is just Owen scrambling in base 2. Owen scrambling can be performed in arbitrary bases. For example, base 3 would divide each segment into thirds and randomly shuffle those thirds. For scrambling Sobol points, however, we want base-2 Owen scrambling, so that's what I'm focusing on here.

  2. It doesn't really matter whether we traverse based on the input bits or the output bits--both conventions result in an Owen scramble, and both conventions are capable of representing all possible Owen scrambles. I use the input bits here because I think that's easier to follow. But it's worth noting that the visual Owen scramble process I outlined earlier in the post is better represented by the output bit convention.

  3. If your seed is only 32 bits, of course, then it can't possibly provide enough variation to cover all possible 32-bit Owen scramble trees (each one being roughly 500MB large). But in practice this doesn't matter--we just need a good random sampling of Owen scramble trees. And a high-quality seedable hash function will accomplish that just fine.

  4. There are actually some subtle ways you can screw up a real implementation of this, which I (repeatedly) learned the hard way. For example, you need to properly handle flipping the highest bit, which can go wrong if not handled specially thanks to how e.g. bit shifting works on x86 architectures. Also, you need to account for string length when hashing the bits, so that e.g. a single binary zero and a string of two binary zeros gives a different hash result, even though their numeric value is the same. As for choosing a seedable hash function, I went with SipHash-1-3 in my implementation, passing the seed as SipHash's key. But there are almost certainly cheaper hash functions that will work just as well.

  5. You can compute the expected gradient for this analytically using a little combinatorial math. Beyond the highest 16 bits (half the width of the graph), however, it becomes difficult to compute in practice due to the enormous numbers that the factorials produce. But I did compute it out to 16 bits, and it matches very closely.

  6. It's not entirely clear that being biased in this sense is always a bad thing. For example, I wonder if being biased towards generating less biased trees might be a good thing for some applications. But being arbitrarily biased--rather than biased in a carefully engineered way--is probably a bad thing, which is what we're facing here.

  7. Interestingly, if you do both scrambling and shuffling, like I am in Psychopath, the practical impacts of this don't manifest in any obvious way (which is why I didn't notice it). It turns out that shuffling is really good at hiding problems with scrambling. But if you do only scrambling—no shuffling—this problem shows up as subtle but noticeable artifacts at low sample counts.

2021-01-02

Sobol Sampling - Take 2

(For yet another update on Psychopath's Sobol sampler, as well as significantly better LK hashes, please see the next post in this series: Building a Better LK Hash.)

Shortly after I wrote my last post about Sobol sampling, I got in contact with Brent Burley, who created the hash-based Owen scrambling used in the paper Progressive Multi-Jittered Sample Sequences. He was extremely kind, and spared some time to chat with me about his Sobol sampler. He has also now published a paper about it:

Practical Hash-based Owen Scrambling, by Brent Burley

Unsurprisingly, it goes beyond what I had figured out on my own, and improves upon existing Sobol samplers in multiple ways. It's really an amazing paper, and I highly recommend giving it a read.

As it says in the title, it presents an efficient hash-based Owen scrambling technique. But it also presents a way to efficiently shuffle the order of the points in the sequence using the same hashing approach.1 The shuffling allows you to create as many statistically uncorrelated Sobol sequences as you like, which is crazy useful. For example, as described in the paper, you can pad a Sobol sequence with another Sobol sequence.

The combination of these two things--the shuffling and the Owen scrambling--creates a very powerful, very high-quality sampler. And it's efficient enough to be computed on-the-fly while rendering.

My Sobol sampler is now essentially the same as what's described in Burley's paper, so you can just read the paper to know what I'm doing. But my implementation does depart in one small way, which I believe is a tiny but worthwhile contribution, and is what the rest of this post is about.

The LK hash

The hash function used in Brent Burley's paper is taken from Stratified Sampling for Stochastic Transparency by Laine et al., and Burley variously refers to it as both the "Laine-Karras permutation" and the "Laine-Karras hash". For brevity, I'll be calling it the "LK hash". Here is the pseudo code:

function lk_hash(x, seed) {
    x += seed
    x ^= x * 0x6c50b47c
    x ^= x * 0xb82f1e52
    x ^= x * 0xc7afe638
    x ^= x * 0x8d22f6e6
    return x
}

The critical property of this function is that, unlike a normal hash, it only avalanches upward, from lower bits to higher bits. In other words, it's a hash function where each bit only affects bits higher than itself.

This is the opposite of Owen scrambling, which is equivalent to a hash that only avalanches downward. However, it's easy to turn it into an Owen scrambling function: just reverse the order of the bits both before and after it:

function owen_scramble(x, seed) {
    x = reverse_bits(x)
    x = lk_hash(x, seed)
    x = reverse_bits(x)
    return x
}

Burley also uses the LK hash for shuffling the order of the Sobol points. So functions with this upward-only-avalanche property are really useful: they can be used both for shuffling and for Owen scrambling.

The only problem is that this particular function isn't especially high-quality.2 It seems to be good enough in practice for rendering, so it may not actually be worth improving. But it still bugs me. Let's take a look.

Here is a fully Owen scrambled sequence as ground-truth, computed using a much slower technique (from left-to-right are 64, 256, 1024, and 4096 points from the sequence, respectively):

full Owen scramble

And here is the bit-reversed LK hash:

Laine-Karras scramble

It looks pretty good, but the higher sample counts are visibly different. In particular, at 1024 samples there is an obvious structure to it. And at 4096 samples the "texture" of the samples is different (though you may need to click to view the full-size images to see it clearly).

Making a better hash function

I first tried to improve this by optimizing the constants, using more permutation rounds, etc. That's where I was when I wrote my previous Sobol post. And although it did help some, all of the results still had obvious issues, especially when looking at other dimension pairs.3

After getting in touch with Brent Burley, however, I was inspired to take another crack at it. It was at that point that I tried different permutation operations. The first real success was adding x += x << 1 to each round:

function lk_improved_v1(x, seed) {
    x += seed
    x ^= x * 0x6c50b47c
    x += x << 1
    x ^= x * 0xb82f1e52
    x += x << 1
    x ^= x * 0xc7afe638
    x += x << 1
    x ^= x * 0x8d22f6e6
    x += x << 1
    return x
}

The result is this:

Laine-Karras improved v1

This is visibly much better than the original LK hash, especially at 1024 samples. The main problem is that it's also slower. Mind you, it's still plenty fast. But I really wanted to make a hash that works better and is at least as fast as the original LK hash.

The astute among you (more astute than I was at the time) may notice something a little fishy about the new operation I added. I originally came up with it by specifically trying to make something that only avalanches upward. But it also turns out to be the same as just multiplying x by 3.

When I finally noticed that, and after a bit more thought, I realized that multiplying by odd numbers is also a valid operation in a hash like this. So, all together, these are the permutations I'm currently aware of that are valid for constructing a hash like this:

  1. x ^= constant
  2. x += constant
  3. x *= odd_constant
  4. x ^= x * even_constant

You can mix and match these to try to create better hashes. There are likely even more valid permutations, but I don't know what they might be.

Playing around with those permutations, I came up with this hash:

function lk_improved_v2(x, seed) {
    x += seed
    x ^= 0xdc967795
    x *= 0x97b754b7
    x ^= 0x866350b1
    x *= 0x9e3779cd
    return x
}

(UPDATE: please do NOT use this hash. It turns out that it's not very good. Please see the next post in this series for hashes that are actually worth using.)

And here are the results:

Laine-Karras improved v2

It's not quite as good as the previous one, but it's close. And it's still much better than the original LK hash. Moreover, it's faster than both of them: this version only takes 5 CPU operations, compared to the 9 and 17 ops of the other two.

This version is what I'm now using in Psychopath. It works at least as well as the original LK hash, and is a bit faster.

In practice, I don't notice any difference in render quality between any of these different hashes, or even between them and full Owen scrambling for that matter. So... it probably doesn't really matter. But it makes me feel better that this is a bit closer to a full Owen scramble.

Final thoughts

I certainly don't think I've come up with the best possible hashes. There's clearly still room for improvement.

Moreover, a quantitative analysis of hash quality would be really useful here. I did measure upward avalanche with some bespoke throw-away testing code, and the results suggested that these new hashes do improve things. But my testing wasn't careful or thorough, and there could easily have been awful bugs in my testing code. So I don't actually trust the results.

I hope that this post can serve as a starting point for other people to develop even better LK-style hashes. And if someone does further work on this, I'd love to hear about it! But for the time being, I'm quite satisfied with what I've arrived at.

For the curious, here is my implementation.


Footnotes

  1. In my previous post I said that you shouldn't offset into Sobol sequences, because it makes Sobol sampling slow. And yet that's more-or-less what shuffling does, just in a much more sophisticated way. Indeed, shuffling will also make the Sobol sampler slower in the same way that offsetting does. However, since writing my last post I've discovered a couple of ways to mitigate that performance impact. Moreover, it's trivial to evaluate multiple Sobol dimensions at once with SIMD, which my sampler now does. So, it's quite performant now.

  2. Burley notes in his paper that measuring hash quality is a complex endeavor, and that he uses the original LK hash due to the testing and claims of the original authors. It's also important to note that in practice the LK hash seems to perform very well in rendering applications. So I think Burley's choice to use the hash as-is in his paper was a good one.

    However, looking at the actual points it generates, it obviously has room for improvement. Moreover, in the Laine-Karras paper I saw no indication that they developed the hash in any kind of rigorous way. Even further, they say this about the constants used in their hash:

    The multiplication constants are the ones used in generating the images in this paper, but they carry no particular significance beyond that.

    So my strong suspicion is that they essentially just eye-balled it until it worked well enough for their particular use-case.

  3. For brevity, the images in this post only show the first two Sobol dimensions. In my testing, however, I looked at over 50 different dimension pairs, which revealed similar differences between the LK hash and the full Owen scramble. The improved hashes in this post fix those differences about as well as the differences in the first two dimensions. Which is to say, it's notably better, but certainly still has room for improvement.

2020-09-22

Storing Color Compactly

One of the tricky things about a Manuka-style rendering architecture is that you need to store your shading information really compactly, since it will be stored at every micropolygon vertex.

Storing surface normals seems like a fairly solved problem at this point, and I recommend the paper Survey of Efficient Representations for Independent Unit Vectors by Cigolle et al. if you haven't already read it.

Color, on the other hand, still seems like a bit of an open problem. For something like Psychopath, ideally I want a color format that meets these criteria:

  • Compact (ideally 32 bits or less).
  • Fast to decode (I want to use this at runtime while rendering, potentially decoding on every ray hit).
  • Covers the entire gamut of human-visible colors.
  • Has sufficient dynamic range to capture day-to-day natural phenomena (e.g. the sun) without clipping.
  • Has sufficient precision to avoid visible banding.

I took a few detours, but I believe I've come up with a format that meets all of these criteria. And I call it FLuv32.

It's based on the 32-bit variant of LogLuv, which is a color format designed by Greg Ward of Radiance fame. LogLuv ticks almost all of the boxes: it's 32 bits, it covers the full gamut of visible colors, it has a massive dynamic range of 127 stops, and it has precision that exceeds the capabilities of human vision. It's a super cool format, and I strongly recommend reading the original paper.

However, the one box that it doesn't tick is performance. Since it uses a log encoding for luminance, decoding it requires an exp2() call, which is slower than I want.1

FLuv32 is essentially the same as LogLuv, but replaces the log encoding with floating point (hence the "F" in the name) which can be decoded with just a few fast operations. To achieve the same precision with floating point, I dropped the sign bit, foregoing negative luminance capabilities. But aside from that and a couple minor tweaks, it's the same as LogLuv. It has the same precision, the same massive dynamic range, and also fits in 32 bits. It's just a lot faster.

The FLuv32 color format

For anyone wanting to use FLuv32 themselves, I have a reference implementation in Rust here, and below is a description of the format.

Like LogLuv, FLuv32 uses an absolute color space, and as such takes CIE XYZ triplets as input. Also like LogLuv, it uses the Y channel directly as luminance, and computes u' and v' with the following CIELUV formulas:

u' = 4X / (X + 15Y + 3Z)
v' = 9Y / (X + 15Y + 3Z)

FLuv32 stores Y, u', and v' in the following bit layout (from most to least significant bit):

  • 7 bits: Y exponent (bias 42)
  • 9 bits: Y mantissa (implied leading 1, for 10 bits precision)
  • 8 bits: u'
  • 8 bits: v'

To store u' and v' in 8 bits each, their values need to be scaled to a [0, 255] range. u' is in the interval [0, 0.62], and v' is a tad smaller than that, so the original LogLuv uses a scale of 410 for both. FLuv32 departs from that very slightly, and instead uses the following scales:

u'-scale = 817 / 2
v'-scale = 1235 / 3

The motivation for this change is to be able to exactly represent E, the chromaticity of an equal-energy spectrum. This is not perceptually meaningful, which is presumably why the original LogLuv didn't bother. But in the context of a spectral renderer this may be useful, and it's trivial to accommodate.

The luminance, Y, is stored with 7 bits of exponent and 9 bits of mantissa. The exponent has a bias of 42, and the mantissa follows the standard convention of having an implicit leading one, giving it 10 bits of precision. The minimum exponent indicates a value of zero (denormal numbers are not supported). The maximum exponent is given no special treatment (no infinities, no NaN).

The exponent's bias of 42 may seem surprising, since typically a 7-bit exponent would be biased by 63. The reasoning for this is two-fold:

  • The dynamic range is already monstrously excessive for practical imaging applications, so to some extent it doesn't matter.
  • It seems more useful to be able to represent the luminance of supernovae in physical units (which exceed 2^64 cd/m^2) than ludicrously dark values (and even so, 2^-42 is already ludicrously dark).

42 was chosen by putting two thirds of the exponent range above 1.0, and one third below.2
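
Putting the pieces above together, an encoder looks roughly like the following Rust sketch. This is just an illustration of the layout described in this post--not the reference implementation linked above--and it glosses over rounding and edge cases.

// Rough sketch of FLuv32 encoding from CIE XYZ.
fn fluv32_encode(x: f32, y: f32, z: f32) -> u32 {
    const U_SCALE: f32 = 817.0 / 2.0;
    const V_SCALE: f32 = 1235.0 / 3.0;
    const EXP_BIAS: i32 = 42;

    // Chromaticity, from the CIELUV formulas above.
    let d = x + 15.0 * y + 3.0 * z;
    let (u, v) = if d > 0.0 {
        (4.0 * x / d, 9.0 * y / d)
    } else {
        (0.0, 0.0)
    };
    let uq = (u * U_SCALE + 0.5).min(255.0) as u32;
    let vq = (v * V_SCALE + 0.5).min(255.0) as u32;

    // Luminance: 7-bit exponent (bias 42) + 9-bit mantissa (implied leading 1).
    let y_bits = y.max(0.0).to_bits();
    let exp = (y_bits >> 23) as i32 - 127 + EXP_BIAS;
    let (exp7, mant9) = if exp <= 0 {
        (0, 0) // Too small: flush to zero (no denormals).
    } else if exp > 127 {
        (127, 0x1ff) // Too large: clamp to the largest representable value.
    } else {
        (exp as u32, (y_bits >> 14) & 0x1ff) // Top 9 bits of the f32 mantissa.
    };

    (exp7 << 25) | (mant9 << 16) | (uq << 8) | vq
}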

Other formats

I also explored some other formats before arriving at FLuv32.

The obvious first one is simply using three 16-bit floats for RGB values. Surprisingly, although this provides more than enough precision, it doesn't provide enough dynamic range. Standard IEEE half-floats only have 5 bits of exponent, giving only 32 stops. That's not bad, but it's not quite enough to cover the whole range from dark shadows to the brightness of the sun. Moreover, three half-floats take up 48 bits, which is unnecessary.

After that, I considered RGBE, which is actually a predecessor to LogLuv, and is used in the Radiance HDR image format. In its typical formulation, 8 bits each are given to R, G, B, and the shared exponent. This format is very fast to decode, but unfortunately doesn't have the precision needed to avoid visible banding.

I implemented two other RGBE variants to try to improve things:

  • 32-bits with a 9-9-9-5 split. While this improved precision, it also unfortunately reduced dynamic range too much.
  • 40-bits with a 11-11-11-7 split. This provides more than enough precision and range, at the expense of slightly increased storage requirements. I think this is actually a solidly good format if RGB is a requirement.

Before stumbling on LogLuv, I also explored my own home-grown luma-chroma based format, based on YCbCr. It didn't pan out, largely because if you don't know the precise RGB colorspace you're working from, you can't accurately separate chroma from luminance, which is critical for storing chroma in fewer bits. You could, of course, just choose a specific RGB colorspace (e.g. ACES2065-1), but even then your chroma don't end up being perceptually uniform, which also thwarts low-bit-count storage.

All-in-all, I would say if you want to store HDR RGB specifically, go with the 40-bit RGBE format, which should easily accommodate any RGB color space with plenty of precision. And if you just want to store HDR color more generally, then both LogLuv and FLuv32 are great formats.

Shower thought

It occurred to me that in a spectral renderer you might not even need to do a full decode from FLuv32 back to XYZ. You have to spectrally upsample all your colors anyway, and usually that's done via some kind of chroma-oriented lookup table. So I'm wondering now if it would be reasonable to build a lookup table that's just directly indexed by the quantized u' and v' coordinates.

I haven't fully explored that yet, but decoding just the luminance from FLuv32 is super fast. On CPU, on my machine, the full XYZ decode is about 3ns, whereas the Y decode is about 1ns. So that would be a significant savings. Even more so if you would just end up recomputing chroma from XYZ anyway for a spectral lookup.
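
To give a sense of why the Y-only decode is so cheap, here's a sketch of it in Rust (again, matching the layout described above rather than being the reference code): it's just a couple of shifts, a mask, and a re-bias of the exponent.

// Decode only the luminance (Y) from a packed FLuv32 value.
fn fluv32_decode_y(packed: u32) -> f32 {
    let exp7 = packed >> 25;
    if exp7 == 0 {
        return 0.0; // The minimum exponent means zero.
    }
    let mant9 = (packed >> 16) & 0x1ff;

    // Rebuild an f32 directly: re-bias the exponent from 42 to 127 and place
    // the 9 stored mantissa bits at the top of the f32's 23-bit mantissa.
    f32::from_bits(((exp7 + 127 - 42) << 23) | (mant9 << 14))
}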


Footnotes

  1. Arguably you could use a fast approximate exp2 to overcome performance limitations. However, the accuracy would need to be good enough to avoid introducing visible error, otherwise the whole point of the format is somewhat subverted.
  2. That it also happens to be the answer to everything is pure coincidence.

2020-04-20

Light Trees and The Many Lights Problem

This post will walk you through a particular approach for efficiently importance sampling huge numbers of lights in a scene. I call the approach "Light Trees".

This post has been a long time coming. I actually wrote about this technique (tersely) back in 2014 on ompf2.com. But I never got around to writing about it properly on this blog. So I'm fixing that now!

Before I get started, I want to make it clear that I am not the first person to come up with this idea. In particular, the Manuka guys have apparently been doing something like this since the inception of their renderer. And there are several other people that apparently came up with pretty much the same technique around the same time as me.

Really, the core approach is super straightforward. And the fact that I (of all people) independently came up with it is a strong indication that it's pretty obvious. After all, I'm an animator professionally, not a software developer.

So my goal in this post is not to claim any credit for the idea. Quite the contrary—given how many people have independently come up with it, I don't think anyone can reasonably claim credit for it.

Rather, the goal of this post is to clarify and distill the approach, since the papers I've seen about it describe more detailed or complex variants. I discuss those papers and their contributions at the end of this post.

The Many Lights Problem

The problem we're trying to solve with light trees is sometimes called the "many lights" problem. If you're already familiar with this problem and with importance sampling, then please feel free to skip to the "Light Trees" section. These first two sections are just filling in that background.

In a typical forward path tracer, when you want to calculate lighting at a point on a surface, you generally do two things1:

  1. Sample the light(s) in the scene to calculate direct lighting.
  2. Shoot one or more rays out in random directions to calculate indirect lighting.

The many lights problem shows up in that first step.

Calculating direct lighting is really easy if you only have one light: just shoot a shadow ray at the light, and if the ray reaches it, calculate the lighting. But what if there's more than one light? In that case, there are two broad approaches:

  1. Sample all lights.
  2. Pick one light at random.

The first approach—sampling all lights—works really well:

three lights full sampling

The second approach—sampling one light at random—also works well enough, but results in some noise due to the random selection:

three lights random sampling

Both approaches are fine. The former takes longer to render because it needs to do more calculations per shading point, but produces less noise. The latter renders faster but produces more noise, which takes more samples to get rid of. Which approach is better depends on the particular scene.

The many lights problem starts to show up when there are... eh... many lights. A large number of lights. For example, I have a test scene that contains roughly 8000 lights. Rendering this scene by sampling all lights looks like this2:

city scape full sampling

It doesn't look too bad. But it took 3 hours to render at only 16 samples per pixel, which is about 11 minutes per sample. That's because we have to spawn 8000 shadow rays on every ray hit, and potentially do as many lighting calculations. Imagine the render times on an actual production scene with even more lights, and at the sample rates needed to converge to a sufficiently noise-free image.

You might think that randomly selecting lights will save us. And, indeed, that brings our render time down to 6 seconds. But the result is so noisy that it's unusable:

city scape random sampling

That's because the probability of selecting a light that matters to the shading point is so low. Even if we crank our samples into the thousands, the render would still be somewhat noisy.

So it seems like we're stuck, then. Either our render times are intractable, or our renders are unusably noisy. And that, in a nutshell, is the many lights problem.

Importance Sampling

The answer to this problem is something called importance sampling3.

With approach #2 (random selection) we have a lot of wiggle room in how we select our lights, and the idea behind importance sampling is to use that wiggle room to sample important things with higher probability.

As an example, let's make a scene with two lights, with one of the lights much brighter than the other. If we render it with equal probability of selecting either light, it looks like this:

two lights equal sampling

But if we use importance sampling to select the brighter light with higher probability, it looks like this:

two lights importance sampling

So is that it, then? We just select the brighter lights more often, and we're good to go?

Unfortunately, no. You can get a hint of why that won't work by noticing that the area under the dimmer light actually gets more noisy in the importance sampled image.

The problem is that even though one light is brighter than the other, that doesn't mean it actually contributes more light to the point we're shading. For shading points that are sufficiently close to the dim light, the dim light is effectively the brighter one.

For a simple scene like the one above, that's not a big problem. But for something like the city scape scene it's a huge problem. First of all, all the lights in that scene are about the same brightness, so even with brightness-based importance sampling we would just be selecting them equally anyway. But also, even if some of the lights were brighter, what are the chances that any given light has much effect on any given shading point? Very low.

So what we actually want is to somehow sample the lights in proportion to the brightness they contribute to the current shading point. If we do that, things look way better:

two lights light tree sampling

But... how do we do that?

Light Trees

(To summarize for those that skipped to this section: our goal is to select a light from all lights in the scene according to their contributions to the current shading point.)

The naive way to do this would be to calculate the shading from all the lights, and then select a light based on those results. But that doesn't get us anywhere because then we're just sampling all the lights anyway.

So instead we need some efficient way to approximate that. And that's exactly what light trees let us do.

A light tree is a pretty straightforward data structure: it's just a BVH of the lights in the scene, where each node in the BVH acts as an approximation of the lights under it. For it to be an effective approximation, that means we also need to store some extra data at each node.

The simplest kind of light tree would simply store the total energy of all the lights under each node. You could then treat each node as a light with the same size as the spatial bounds of the node, and emitting the same energy as the lights under it.

I'm going to use that simple light tree variant to illustrate the technique. But I'll discuss some possible extensions in a bit.

Okay, so we have our light tree as described above, and we have our shading point on a surface. How do we select a light?

Starting from the root of the tree, we:

  1. Calculate the lighting (or an approximation thereof) from each of the child nodes.
  2. Take those lighting results and select which child to traverse into based on them. For example, if the left node contributes twice as much light, then we'll be twice as likely to traverse into it.
  3. If the selected child is a leaf, return the actual light under it. Otherwise, go back to step 1 with the selected child.

So that's the basic idea. But there are a couple of further details:

  • We also need to return the probability of selecting that light. This is because the calling code needs to weight that light's contribution based on that probability.
  • We need a good "approximate lighting" function to use in step 1. Incidentally, I like to call this the "weighting" function, because we only use it to determine the weighted probability of traversing into each node.

Calculating the light-selection probability is pretty easy: we just combine the probabilities of all the child nodes we traverse into, which is just multiplying them all together.

Coming up with a good weighting function is trickier. The ideal weighting function would exactly calculate the lighting from all the individual lights under the node. But if we could do that efficiently then we wouldn't need the light tree in the first place! So instead we need something that approximates that well. I'm going to punt on this for now, but we'll come back to it later. For the moment, just assume we have a reasonable weighting function.

Given all of those things, our light selection process is the following pseudo code, where tree_root is the root node of the light tree, shading_point is the data of the point we're shading, and n is a random number in the range [0, 1):

def light_tree_select(tree_root, shading_point, n):
    node = tree_root   # Updated as we traverse.
    selection_p = 1.0  # Selection probability, also updated as we go.
    
    while True:
        if is_leaf(node):
            return (node.light, selection_p)

        # Child node weights, or "approximate lighting".
        child_1_w = weight(node.child_1, shading_point)
        child_2_w = weight(node.child_2, shading_point)

        # Normalized probabilities of traversing into either child.
        child_1_p = child_1_w / (child_1_w + child_2_w)
        child_2_p = 1.0 - child_1_p

        # Select the child for the next loop.
        if n < child_1_p:
            node = node.child_1
            selection_p *= child_1_p # Accumulate probability.
            n /= child_1_p           # Scale random number.
        else:
            node = node.child_2
            selection_p *= child_2_p        # Accumulate probability.
            n = (n - child_1_p) / child_2_p # Scale random number.

And that's it! Well, modulo some floating point corner cases. But that's the entire thing. Really.

One thing I didn't mention yet: you need to scale the random number on each traversal step. That's what lets us get away with using just one number, instead of needing a new random number every time we choose a child. I'm feeling too lazy to explain how it works in detail, but the gist is that once n has landed within a child's probability interval, rescaling that interval back to [0, 1) gives you a fresh uniform random number to use at the next level. You can think of it as the tree whittling down the space of the random number as it traverses deeper and deeper.

Anyway, if we implement that pseudo code, we get this beautiful result at 16 samples per pixel:

city scape basic light tree 16spp

This particular render uses a pretty simple weighting function: it assumes the shading point is Lambertian, treats the node as a spherical light, and just uses the analytic formula for Lambert shading from a spherical light4.

That isn't an especially good weighting function for anything other than diffuse surfaces. But it works surprisingly well in many cases. It's definitely not sufficient for a robust implementation, however.
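
For flavor, here's the general shape such a weighting function can take, as a self-contained Rust sketch. To keep it simple it uses a much cruder weight than the spherical-light Lambert formula I described above--just energy times a clamped cosine over squared distance, with the squared distance clamped by the node's bounding radius so that nearby or enclosing nodes don't produce enormous weights--but the role it plays in the traversal is the same.

// Crude node-weighting sketch: treat the node as a point emitter at the
// center of its bounds.
fn node_weight(
    energy: f32,
    node_center: [f32; 3],
    node_radius: f32,
    shading_pos: [f32; 3],
    shading_normal: [f32; 3], // Assumed normalized.
) -> f32 {
    let d = [
        node_center[0] - shading_pos[0],
        node_center[1] - shading_pos[1],
        node_center[2] - shading_pos[2],
    ];
    // Clamp the squared distance so shading points inside or right next to
    // the node's bounds don't blow up the weight.
    let dist2 = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2])
        .max(node_radius * node_radius)
        .max(1e-8);
    let dist = dist2.sqrt();

    // Clamp the cosine to a small positive value so that nodes behind the
    // surface still have some chance of being selected.
    let cos_theta = ((d[0] * shading_normal[0]
        + d[1] * shading_normal[1]
        + d[2] * shading_normal[2])
        / dist)
        .max(0.05);

    energy * cos_theta / dist2
}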

Potential Extensions and Improvements

Starting from that base, there are a bunch of potential improvements we could explore. For example:

  • A better weighting function that handles, among other things, a wider variety of materials well. This is more-or-less necessary for a robust implementation.
  • Storing more information than just total energy in the nodes, and using that for better weighting. For example, storing the approximate directionality of lights in cases such as spot lights and some area lights.
  • Alternative tree constructions. For example, using something other than the surface area heuristic to cluster lights, or using higher-arity trees.
  • Accounting for shadowing (somehow...).
  • etc.

The core idea—using a hierarchy to approximate light importance—is always the same. But the potential for variety within that is huge.

As a quick example, this render:

city scape better light tree 16spp

...is also 16 samples per pixel, but (if you flip between it and the last render) it's clearly far less noisy. This was achieved by playing with some of the things above.

Further Reading

There are two published papers I'm aware of that describe variations on this light tree approach, both of which are very much worth reading:

Importance Sampling of Many Lights With Adaptive Tree Splitting is, in my opinion, the more significant paper of the two. It covers a large breadth of things including:

  • A novel tree construction approach that goes beyond the SAH, accounting for factors unique to light trees.
  • A way to account for the directionality of lights in the tree and to use that information to improve weighting calculations.
  • A novel "tree splitting" approach for returning multiple lights with reduced variance.

The main downside is that their weighting functions don't work well without their tree splitting approach, and can approach very high variance without it. (In fact, the Stochastic Lightcuts paper makes comparisons to it without the tree splitting, which highlights how poorly it performs in that case.)

Stochastic Lightcuts proposes the same technique as both this post and my 2014 post on ompf2, but frames it as an extension to lightcuts. I personally find that framing confusing, and it actually took me a couple readings to realize that it is, in fact, the same technique. But some people might find that framing easier to understand, especially if they're already familiar with lightcuts.

However, I believe the significant contribution of Stochastic Lightcuts is the weighting function they propose, which is based on the lightcuts error bounding function, and accounts for different surface materials and lights with directional components. Moreover, unlike the weighting function from the Adaptive Tree Splitting paper, it works really well with the standard traversal approach—no tree splitting required. I would definitely recommend this paper as the baseline for any real implementation of light trees, and then explore improvements from there.


Footnotes

  1. This is a simplification, of course. But it's sufficient for illustrating the problem.

  2. If you compare to later renders in the post, you'll notice that it doesn't quite match. That's because this one was rendered in Cycles, but the others are rendered in Psychopath. Psychopath doesn't support sampling all lights, and I didn't feel like implementing a whole feature just for illustration purposes.

  3. For a more thorough introduction to importance sampling, I strongly recommend the amazing PBRT book, which is available for free online. (But is also very much worth buying!)

  4. See the paper "Area Light Sources for Real-Time Graphics" by John M. Snyder for the appropriate formula.