
Psychopath Renderer

a slightly psychotic path tracer

2022 - 04 - 23

Blackmagic Design: Color Spaces

(Note 1: if you just want the LUTs and color space data, jump down to the section titled "The Data".)

(Note 2: this is a slightly odd post for this blog. I normally only write about things directly relevant to 3D rendering here, and this post has very little to do with 3D rendering. Although it is relevant to incorporating rendered VFX into footage if you're a Blackmagic user.)

When you're doing VFX work it's important to first get your footage into a linear color representation, with RGB values proportional to the physical light energy that hit the camera sensor.

Color in image and video files, however, is not typically stored that way[1], but rather is stored in a non-linear encoding that allocates disproportionately more values to darker colors. This is a good thing for efficient storage and transmission of color, because it gives more precision to darker values where the human eye is more sensitive to luminance differences. But it's not good for color processing (such as in VFX), where we typically want those values to be proportional to physical light energy.

The functions that transform between non-linear color encodings and linear color are commonly referred to as transfer functions.[2] There are several standard transfer functions for color, with sRGB gamma perhaps being the most widely known. But many camera manufacturers develop custom transfer functions for their cameras.[3]
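
For a concrete sense of what a transfer function looks like, here's a small Rust sketch of the standard sRGB decode (this is just the published sRGB math, included purely as an illustration; it has nothing to do with Blackmagic's curves):

// The standard sRGB transfer function, decoding an encoded value in
// [0.0, 1.0] back to linear. Shown only as an example of the concept.
fn srgb_to_linear(v: f32) -> f32 {
    if v <= 0.04045 {
        v / 12.92
    } else {
        ((v + 0.055) / 1.055).powf(2.4)
    }
}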

If you're doing VFX work, you really want to know what transfer function your camera used when recording its colors. Otherwise you can only guess at how to decode the colors back to linear color.

Because of how important this is for a variety of use cases, pretty much every camera manufacturer documents and publishes their transfer functions. With two exceptions:

  1. Manufacturers of consumer-level cameras such as phones, pocket cameras, etc. (Totally reasonable, because these cameras are not intended for professional or semi-professional use.)
  2. Blackmagic Design.

Even the notoriously proprietary RED fully documents and publishes their transfer function in a freely available white paper. And that's because there's nothing proprietary or secret about transfer functions. They're about as proprietary as knowing which channels of your image data are red, green, and blue. You just need to know it to interpret your image data correctly.

So it's really odd and also very frustrating that Blackmagic hasn't done this basic due diligence. Blackmagic doesn't even provide this information to people who have purchased their cameras.[4]

(Update: after posting this, a couple of people pointed out that Blackmagic does have a white paper that documents the RGBW chromaticities[15] and log transfer function of their Gen 5 color spaces. However, there are two more Gen 5 transfer functions they don't document, and they haven't documented any of their other numerous color spaces. Moreover, it's nearly impossible to find this white paper unless you already know it exists and where it is, because it's buried in the installation of their BRAW viewer. So my overall criticism still stands.)

Gotta Do It Yourself

Since Blackmagic hasn't published this basic information themselves, and since my work recently shot footage on a Blackmagic camera for a show with heavy VFX work, I took it upon myself to do Blackmagic's homework for them and reverse engineer their cameras' color spaces for publication here.

I want to emphasize again that all of this data is basic information that any professional or semi-professional camera manufacturer absolutely should publish about their cameras. This is no more proprietary than information about how to navigate a camera's menu system. It's just "how to use the camera", but for VFX people.

Methodology

There are a few places that already sell 3rd-party LUTs for Blackmagic cameras. Unfortunately, they're all oriented towards achieving "looks" with the footage, which is useless for VFX work. The color transformations we need for VFX are technical, not artistic.

Moreover, the places that sell these LUTs don't document how they make them, so even if they did publish linearization LUTs it would be hard to know if they were correct.

For that reason, I'm documenting here exactly how I created these LUTs. I'm also publishing the software tool I built for this purpose under an open source license so that others can reproduce and verify my work.[5]

Step 1

Although Blackmagic Design doesn't publish any information about their color spaces, they do provide a software tool called Resolve that, among other things, can linearize footage from their cameras.[6] In other words, Resolve has all of Blackmagic's transfer functions built into it, just not in a way that is accessible as data or formulas.

But the great thing about color processing is that you can always compare results before and after. And that's the key idea behind determining Blackmagic's transfer functions.

I wrote a tool called LUT Extractor that generates a very specific OpenEXR image with gray, red, green, and blue gradients from 0.0 to 1.0, each with 2^17 steps. The image looks like this[7]:

test image

I then sent this image to a friend to pipe through Resolve (I wasn't able to install Resolve on my OS), once for each of Blackmagic's transfer functions.[8] We transformed from encoded non-linear color to linear color for the exports, both because that's the transform we ultimately want to know and because the input for that is on a known, bounded interval. Linear color often exceeds the [0.0, 1.0] interval, whereas encoded color doesn't.

(Note: it's important to export at exactly the same resolution as the original image, and to do so as a lossless 32-bit floating point OpenEXR image.)

Step 2

I then piped those processed images back into LUT Extractor, which generates LUTs by mapping the colors from the original image to the processed image.

You can generate LUTs with 2^17 entries with this approach. But since that's overkill, I reduced them to 4096 entries, which is enough for 12-bit color without LUT interpolation.[9]
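
To make the mapping concrete, here's a rough Rust sketch of the idea (not the actual LUT Extractor code): because the source image is a known gradient, each pixel of the processed image directly yields one (input, output) pair of the transfer function:

// A rough sketch of the extraction idea, not the actual LUT Extractor code.
// `processed` holds the gray-gradient pixels after Resolve's transform; pixel
// i of the original gradient had the value i / (n - 1).
fn extract_1d_lut(processed: &[f32]) -> Vec<(f32, f32)> {
    let n = processed.len();
    processed
        .iter()
        .enumerate()
        .map(|(i, &out)| (i as f32 / (n - 1) as f32, out))
        .collect()
}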

I used essentially the same process to determine the RGBW chromaticities of Blackmagic's color gamuts (which they also don't publish). The only difference is that we stayed entirely in linear color and exported to CIE XYZ color space for that.

Step 3

Having LUTs for the transfer functions is already more than enough for most purposes. But there are situations where analytic formulas are better. For example, implementing the transfer functions in software.

To accommodate those use cases I also reverse engineered analytic formulas for a subset of the transfer functions. Specifically, the log footage subset.

Most log transfer functions use essentially the same formula, just with different constants. There are some minor variations of the formula, but they're all basically the same.[10] Blackmagic's log transfer functions are no exception.

With the basic formula in hand, we just need to find the constants that match the target LUT. And it turns out this is pretty straightforward with a bit of iterative optimization.

My first attempt at this was to just throw some common optimization algorithms at the problem. And that worked fine—I got fits that were more than good enough for any practical purpose. But it turns out we can do even better thanks to the nature of the formula.

The standard log footage formula is a piecewise function consisting of a small linear segment near zero and a shifted/scaled logarithm for the rest of it. Both the linear and log parts have only a single free parameter if we keep the end points fixed, and we already know the end points from the first and last entries of the LUT. Moreover, the error metrics for both the linear and log parts have no local minima other than the global minimum.

This means we can use a really simple iterative optimization algorithm that converges very quickly and very precisely (similar to but slightly more complex than a binary search) on each of the two parts separately.
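
As a sketch of the general idea (this isn't the exact search in my tool, but it exploits the same no-local-minima property), a single free parameter with a single-minimum error function can be nailed down with something as simple as a ternary search; `error` here is a hypothetical closure measuring how well a candidate constant fits the target LUT segment:

// Minimize a unimodal (single-minimum) one-parameter error function by
// repeatedly shrinking the bracket [lo, hi]. Illustrative only.
fn minimize_unimodal(error: impl Fn(f64) -> f64, mut lo: f64, mut hi: f64) -> f64 {
    for _ in 0..200 {
        let a = lo + (hi - lo) / 3.0;
        let b = hi - (hi - lo) / 3.0;
        if error(a) < error(b) {
            hi = b;
        } else {
            lo = a;
        }
    }
    (lo + hi) * 0.5
}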

With the exception of the Pocket Film transfer functions[11], the resulting formulas match the LUTs with an average relative error of about 1/10,000th of a percent, and a maximum relative error of less than 1/100th of a percent.

In other words, for all practical purposes, the formulas are exact.

Blackmagic also has several non-log transfer functions. I haven't attempted to determine their formulas[12], but the LUTs are still available.

The Data

So with that out of the way, here are Blackmagic Design's color spaces. For convenience, all names match what they are in Resolve's UI (at the time of posting), minus the "Blackmagic Design" prefix.

Transfer Function LUTs

Download (641 KB): Blackmagic Design - Transfer Function LUTs - 2022-04-23.zip

These are the linearizing LUTs (encoded color -> linear color). If you need the reverse LUTs (linear color -> encoded color) feel free to email me and I'll generate and post those too.

Both .cube and .spi1d formats are included.

Transfer Function Formulas (log only)

These are listed in pseudo code, along with the maximum and average relative error[13] compared to the original values in the corresponding LUT.

ln is the natural logarithm, ^ is exponentiation, and e is the standard mathematical constant e.

All constants were computed in 64-bit precision, and are listed with that full precision.[14]

4K Film:

  • Max Relative Error: 0.0000145
  • Avg Relative Error: 0.0000007
A = 3.4845696382315063
B = 0.035388150275256276
C = 0.0797443784368146
D = 0.2952978430809614
E = 0.781640290185019

linear_to_log(x) =
    if x <= 0.005000044472991669:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.0528111534356503:
        (x - B) / A
    else:
        e^((x - E) / D) - C
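
Since every curve below shares this same shape, here's a sketch of a direct Rust transcription, parameterized by the per-curve constants (the struct name is just for illustration), instantiated with the 4K Film values above:

// A direct transcription of the pseudo code, parameterized by the constants.
// `lin_cutoff` and `log_cutoff` are the breakpoints from the `if` conditions
// of linear_to_log and log_to_linear respectively.
struct BmdLogCurve {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    lin_cutoff: f64,
    log_cutoff: f64,
}

impl BmdLogCurve {
    fn linear_to_log(&self, x: f64) -> f64 {
        if x <= self.lin_cutoff {
            x * self.a + self.b
        } else {
            (x + self.c).ln() * self.d + self.e
        }
    }

    fn log_to_linear(&self, x: f64) -> f64 {
        if x <= self.log_cutoff {
            (x - self.b) / self.a
        } else {
            ((x - self.e) / self.d).exp() - self.c
        }
    }
}

// For example, the 4K Film curve:
const FILM_4K: BmdLogCurve = BmdLogCurve {
    a: 3.4845696382315063,
    b: 0.035388150275256276,
    c: 0.0797443784368146,
    d: 0.2952978430809614,
    e: 0.781640290185019,
    lin_cutoff: 0.005000044472991669,
    log_cutoff: 0.0528111534356503,
};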

4.6K Film Gen3:

  • Max Relative Error: 0.0000336
  • Avg Relative Error: 0.0000007
A = 4.6708570973650385
B = 0.07305940817239664
C = 0.0287284246696045
D = 0.15754052970309015
E = 0.6303838233991069

linear_to_log(x) =
    if x <= 0.00499997387034723:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.09641357161134774:
        (x - B) / A
    else:
        e^((x - E) / D) - C

Broadcast Film Gen 4:

  • Max Relative Error: 0.0000166
  • Avg Relative Error: 0.0000014
A = 5.2212906000378565
B = -0.00007134598996420424
C = 0.03630411093543444
D = 0.21566456116952773
E = 0.7133134738229736

linear_to_log(x) =
    if x <= 0.00500072683168086:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.026038902009648163:
        (x - B) / A
    else:
        e^((x - E) / D) - C

Film:

  • Max Relative Error: 0.0000146
  • Avg Relative Error: 0.0000007
A = 4.969340550061595
B = 0.03538815027497705
C = 0.03251848397268609
D = 0.1864420102390252
E = 0.6723093484094137

linear_to_log(x) =
    if x <= 0.004999977151237935:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.060234739482005174:
        (x - B) / A
    else:
        e^((x - E) / D) - C

Film Gen 5:

  • Max Relative Error: 0.0000098
  • Avg Relative Error: 0.0000007
A = 8.283611088773256
B = 0.09246580021201303
C = 0.0054940711907293955
D = 0.08692875224330131
E = 0.5300133837514731

linear_to_log(x) =
    if x <= 0.004999993693740552:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.13388380341727862:
        (x - B) / A
    else:
        e^((x - E) / D) - C

Pocket 4K Film Gen 4:

  • Max Relative Error[11]: 0.0065348
  • Avg Relative Error: 0.0002885
A = 4.323288448370592
B = 0.07305940818036996
C = 0.03444835397444396
D = 0.1703663112023471
E = 0.6454296550413368

linear_to_log(x) =
    if x <= 0.004958295208669562:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.09449554857962233:
        (x - B) / A
    else:
        e^((x - E) / D) - C

Pocket 6K Film Gen 4:

  • Max Relative Error[11]: 0.0059241
  • Avg Relative Error: 0.0002745
A = 4.724515510884684
B = 0.07305940816299691
C = 0.027941380463157067
D = 0.15545874964938466
E = 0.6272665887366995

linear_to_log(x) =
    if x <= 0.004963316175308281:
        x * A + B
    else:
        ln(x + C) * D + E

log_to_linear(x) =
    if x <= 0.09650867241866573:
        (x - B) / A
    else:
        e^((x - E) / D) - C

Gamut Chromaticities

These are given in CIE 1931 chromaticity coordinates.

(Note: the 4.6K Film Gen 1 chromaticities are missing because it doesn't seem to be a simple chromaticity transform. It's possible we screwed up the export from Resolve. If you need it, feel free to get in touch and we can try again.)

Wide Gamut Gen 4/5:

     x          y
R    0.717722   0.317118
G    0.228041   0.861569
B    0.100584  -0.082045
W    0.312700   0.329000

4K Film Gen 1:

     x          y
R    0.742225   0.285898
G    0.414011   1.303536
B    0.034208  -0.083318
W    0.313544   0.330476

4K Film Gen 3:

     x          y
R    1.062492   0.394762
G    0.368934   0.777492
B    0.095603   0.033224
W    0.313544   0.330476

4.6K Film Gen 3:

     x          y
R    0.860829   0.368869
G    0.328213   0.615591
B    0.078252  -0.023256
W    0.312700   0.329000

Film Gen 1:

     x          y
R    0.917258   0.250238
G    0.283328   1.707231
B    0.085572  -0.070780
W    0.313538   0.330465

Pocket 4K Film Gen 4:

     x          y
R    0.717722   0.317118
G    0.228041   0.861569
B    0.100584  -0.082045
W    0.312700   0.329000

Video Gen 4:

     x          y
R    0.682777   0.318592
G    0.237613   0.813547
B    0.121743  -0.044283
W    0.312700   0.329000

Video Gen 5:

     x          y
R    0.640000   0.330000
G    0.300000   0.600000
B    0.150000   0.060000
W    0.312700   0.329000

Footnotes

  1. There are exceptions, of course. OpenEXR stores color linearly (at least, if you ignore the nature of floating point numbers). And many raw formats are also linear.

  2. "Transfer function" is actually a more general engineering term, but in the context of color this is always what it means.

  3. And this is for good reason: most standardized transfer functions are designed to encode colors for final delivery and display, which has different considerations than encoding colors that may undergo further processing (e.g. in VFX and color grading).

  4. I don't mean to be too critical of Blackmagic here. They make excellent cameras at a shockingly cheap price point for what they're capable of. I don't have anything against Blackmagic except for their extreme laziness on this point. (Well, that and them claiming to have created an "open standard" raw format that they've also failed to document and which is only accessible via a closed-source SDK. It might be a good raw format, but I don't think they know what "open standard" means.)

  5. Which is also valuable because I'm human and make mistakes! Also, this way if Blackmagic comes out with additional color spaces in the future, people aren't dependent on me for figuring them out.

  6. This is great if you can go through Resolve, but there are situations where you may want or need to do your color processing via other tools. It's also important information when archiving footage, where (when possible) you want to include metadata about how to properly interpret the color data, such as chromaticity primaries and the transfer function used for encoding.

  7. This has been converted to PNG for online viewing purposes, but the original is a full 32-bit floating point OpenEXR image and can be easily generated with the linked tool. Also, the funky stuff below the gradients is a 144x144x144 3D RGB cube. I'm not using that part yet, but it will be useful if I ever need to get 3D LUTs in the future.

  8. This turned out to be unexpectedly tricky because Resolve really wants to do additional color processing beyond just color space conversion. Our first attempts resulted in LUTs that had filmic s-curves baked in, etc., which I guess are just defaults in Resolve. That might make sense for color grading, but thwarts actually linearizing footage for VFX purposes. We did eventually figure out how to get Resolve to do what we needed, though.

  9. With LUT interpolation it's likely more than enough for any bit depth. You actually don't need much resolution in non-linear -> linear LUTs to achieve very high precision because of how they typically curve. You do need a lot of resolution for LUTs going the other way, however. This is in fact another reason we exported from Resolve in the non-linear -> linear direction.

  10. This makes sense because there's basically no reason to innovate here, if "innovate" is even a sensical term in this context.

    Having said that, there is room for innovation when designing transfer functions for delivery rather than capture. For example, the Perceptual Quantizer is a transfer function designed to closely approximate the human visual system's response to luminance, which is really useful for storing and broadcasting finished media using minimal data.

    But for transfer functions intended for capture, the existing standard formulas are more-or-less optimal, and just need the constants tweaked for the capture hardware's noise floor and dynamic range.

  11. The less precise fit of these two transfer functions is because Blackmagic didn't quite implement them correctly: the linear and log parts of the curve aren't aligned precisely, so there's a small discontinuity where they're supposed to meet. You can see this yourself if you graph the LUTs: zoom in around x=0.095 and you can see a little blip in both transfer functions.

  12. Although, interestingly, Blackmagic Design Video (sans any "Gen #" postfix) turns out to be identical to the Rec.709 transfer function. I only noticed this while checking graphs of the generated LUTs.

  13. I list relative error rather than absolute error because relative is a better metric for color transfer functions—small changes in dark values are far more important than small changes in bright values. Relative error is computed as: abs(fitted - original) / abs(original)

  14. This is certainly way overkill given the relative error of the fits. But I figured listing them in the full computed precision doesn't hurt.

  15. It's also interesting to note that the white point they document in their paper for Gen 5 doesn't quite match the white point they use in Resolve for Gen 5 (the latter is in my listings). It's not different enough to make a practical difference, though—both are D65, just with different precision. And it would be an easy software update, so this footnote may become outdated pretty quickly.

2021 - 01 - 30

Building a Better LK Hash

(This post has been updated as of 2021-05-08 with important changes due to an issue identified by Matt Pharr. The majority of the post is essentially unchanged, but additional text discussing the issue and providing a new hash with the issue addressed have been added near the end of the post. If you just want the hash itself, skip down to the section "The final hash".)

At the end of my last post about Sobol sampling I said that although there was room for improvement, I was "quite satisfied" with the LK hash I had come up with for use with the techniques in Brent Burley's Owen scrambling paper. Well, apparently that satisfaction didn't last long, because I decided to have another go at it. And in the process I've not only come up with a much better hash, but I've also learned a bunch of things that I'd like to share.

Owen scrambling: what is it, really?

Before getting into the work I've done, let's revisit Owen scrambling.

Let's say we have a bunch of Sobol points on the unit square:

Points on the unit square

To perform a single-axis Owen scramble we do the following recursive procedure: divide the current segment into two halves, randomly choose to swap those halves or not, and then recurse into each half and repeat.

Let's walk through a quick example: we start by deciding randomly whether to swap the left and right halves or not. In this case, let's say we do swap them:

Points on the unit square, swapped 1

Then for each half we decide randomly whether to swap its sub-halves. Let's say we do for the left, but we don't for the right:

Points on the unit square, swapped 2

Then for each of those segments, we decide randomly whether to swap their left and right halves:

Points on the unit square, swapped 3

And so on, all the way to infinity (or to the precision of your numeric representation).

To scramble on the vertical axis as well, we just do the same random procedure vertically. And you can extend it to 3, 4, or a million dimensions by just repeating the process in all dimensions. Here are the same points, fully scrambled on both axes:

Points on the unit square, fully scrambled

And that's Owen scrambling.[1]

Pretty simple, right? And remarkably, if performed on Sobol points, it not only preserves their great stratification properties, but actually improves their quality.

Thinking in bits.

The above explanation is, in my opinion, the most intuitive way to think about Owen scrambling. However, to build a hash that behaves the same way, we'll want to frame the process in terms of binary bits. So... here we go:

If we represent each point's coordinates with unsigned binary integers, with their range spanning the full extent of the square:

Integer spanning the unit square

...then we can think of the scrambling process as making random bit-flipping decisions. For example, flipping the highest (most significant) bit of a coordinate for all points is identical to swapping the left and right halves of the square on that coordinate's axis.

One way we can visualize bit-flipping-based Owen scrambling is with what I'll be calling a "scramble tree". Let's take a four-bit binary number, 1101, as an example:

A 4-bit Owen scramble tree, processing a 4-bit number

Each node in the scramble tree represents a bit flipping decision, with green = flip and white = don't flip.

The top node tells us whether to flip the highest bit or not. Then we move down the tree, going left if the highest bit was 0, and right if it was 1.[2] The node we traverse into then tells us whether we flip the second-highest bit or not. Then we traverse left or right depending on whether that bit was 0 or 1. And so on.

If this scramble tree is generated randomly, and a different tree is used for each dimension, then it's equivalent to the more visual process I outlined earlier.

Implementing scramble trees in practice.

We could use scramble trees as an actual data structure to perform Owen scrambling. However, in practice the trees are too large to be practical: a 32-bit scramble tree is at minimum about 500MB (one flip bit for each of its roughly 2^32 nodes). So instead we want something that behaves identically to a scramble tree, but without the storage requirement.

One way to accomplish that is to consider how we fill in the nodes with random values. We could use an RNG, of course, and just fill in one node after another. But there is another source of randomness we can use instead. Let's take a look at the tree again:

path through a scramble tree

(I've notated the tree with 1's and 0's this time, indicating which input bit values correspond to which traversal paths.)

What we can see in this diagram is that each node has a unique "address", so to speak, which is simply the string of 1's and 0's that lead to it starting from the root. Utilizing those unique addresses, we can fill in the nodes with a good hash function: instead of using an RNG, we simply hash each node's address to get its random value.

What's great about that is it means we can drop the explicit tree altogether: since we can produce each node's value on demand with nothing more than its address, and since the address of the node we want is always just the string of bits higher than the bit we're making the flip decision for, we can just directly compute the node's value when needed without any tree traversal.

With this approach we still have a scramble tree, but it's implicit in the hash function we choose rather than being explicitly stored in memory.

Writing a function to do Owen scrambling this way is relatively straightforward, and in pseudo code might look something like this:

function owen_scramble(x):
    for bit in bits of x:
        if hash(bits higher than bit) is odd:
            flip bit
    return x

It simply loops through all the bits and flips each one based on a hash of all the bits higher than it.

This is awesome, but it's not quite enough for our needs. This function uses a single fixed hash function as its source of randomness, and is therefore equivalent to a single fixed scramble tree. However, we need a different scramble tree for each dimension, and if we want to decorrelate pixels with Owen scrambling then we even need different trees for each pixel. That's a lot of trees!

Fortunately, a small change lets us accomplish this quite easily. If we use a seedable hash function instead, then we can change the hash function (and thus the tree) by just passing a different seed value:[3]

function owen_scramble(x, seed):
    for bit in bits of x:
        if seedable_hash(bits higher than bit, seed) is odd:
            flip bit
    return x

And that's it! This function, if implemented properly and with a sufficiently good seedable hash function[4], is a fully capable Owen scrambling function.

If you're curious, my actual implementation is here.
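
For reference, here's a rough Rust sketch of that per-bit loop. It is not the linked implementation; in particular, `mix3` below is just a stand-in mixer for illustration, where the real thing should use a well-tested seedable hash (I use SipHash-1-3, as mentioned in the footnotes):

// Per-bit Owen scramble. Flip decisions are based on the input bits (either
// convention works; see footnote 2). The node "address" is the string of bits
// above bit i, and including `i` in the hash input keeps prefixes of
// different lengths distinct (see footnote 4).
fn owen_scramble_per_bit(x: u32, seed: u32) -> u32 {
    let mut result = x;
    for i in 0..32u32 {
        let high_bits = if i == 31 { 0 } else { x >> (i + 1) };
        if mix3(high_bits, i, seed) & 1 == 1 {
            result ^= 1u32 << i;
        }
    }
    result
}

// Stand-in mixer, for illustration only.
fn mix3(a: u32, b: u32, c: u32) -> u32 {
    let mut h = a ^ b.wrapping_mul(0x9e3779b9) ^ c.wrapping_mul(0x85ebca6b);
    h ^= h >> 16;
    h = h.wrapping_mul(0x7feb352d);
    h ^= h >> 15;
    h = h.wrapping_mul(0x846ca68b);
    h ^= h >> 16;
    h
}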

Owen hashing.

The line of reasoning I outlined above to arrive at this Owen scramble function focuses on scramble trees. And that's a really useful framing, because it's the framing that connects it to Owen scrambling. But if we shift our perspective a bit, there's an alternative framing that's also really useful: hashing. In other words, our Owen scrambling function doesn't just use a hash function, it actually is a hash function as well.

Specifically, in keeping with the insight from the Laine-Karras paper, it's a special kind of hash function that only allows bits to affect bits lower than themselves.

Since our goal is to create precisely such a function, that means we've already succeeded. Yay! And for many applications, using the above function is exactly what you want to do, because it is certainly high quality. However, for rendering it's a bit too slow. It executes a full hash for each bit, and that's quite a bit of CPU time.

Nevertheless, it serves as an excellent reference--a "ground truth" of sorts--that we can compare to while designing our LK-style hash.

One way we could make that comparison is to just eye-ball the resulting point sets, as I did in my previous post. We could also approach it with tools typical of point set analysis, such as comparing the frequency content of point sets via fast Fourier transform.

But another way we can approach it is to take the "hash function" framing and evaluate our scrambling functions quantitatively as hash functions.

Evaluating hash functions.

One of the primary measures of hash function quality is avalanche. Avalanche is the extent to which each input bit affects each output bit. For a proper hash with good avalanche, all input bits should affect all output bits with close to equal probability.

Measuring avalanche is pretty straightforward: we take an input number and its resulting hash output, and then we flip each of the input number's bits one at a time to see which output bits flip with them. For a single input number, and a proper hash function, a graph of that looks like this:

single number avalanche graph

The vertical axis is which input bit we flipped, and the horizontal axis is which output bits flipped with it. A white pixel means the output bit flipped, a black pixel means it didn't.

If we do this same process with a whole boatload of input numbers, and average the results together, we get an avalanche graph:

multi-number avalanche graph

All pixels are a medium gray (modulo sampling noise). That means that, probabilistically, all input bits have a 50% chance of flipping any given output bit. This is exactly what you want from a proper hash.
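
Here's a rough sketch of that measurement for a 32-bit hash (the toy xorshift RNG is just a stand-in for whatever source of random inputs you prefer):

// Returns a 32x32 table where entry [i][j] is the observed probability that
// flipping input bit i flips output bit j. For a good hash, every entry
// should approach 0.5.
fn avalanche(h: impl Fn(u32) -> u32, n: usize) -> [[f64; 32]; 32] {
    let mut counts = [[0u64; 32]; 32];
    let mut state = 0x12345678u32; // toy xorshift32 state, for illustration
    for _ in 0..n {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        let x = state;
        let hx = h(x);
        for i in 0..32 {
            let diff = hx ^ h(x ^ (1u32 << i));
            for j in 0..32 {
                if (diff >> j) & 1 == 1 {
                    counts[i][j] += 1;
                }
            }
        }
    }
    let mut table = [[0.0f64; 32]; 32];
    for i in 0..32 {
        for j in 0..32 {
            table[i][j] = counts[i][j] as f64 / n as f64;
        }
    }
    table
}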

We can also visualize the avalanche bias, which is simply the difference from that ideal 50% gray result. Black pixels then represent a 50% flipping probability, and anything lighter represents a deviation from that ideal:

multi-number avalanche bias graph

As you would expect from a good hash function, it's just black. No significant bias.

Owen scrambling bias.

So what does our reference Owen scrambling implementation look like? Well, we have to be a little careful when computing its avalanche bias. Remember that our Owen hash is seedable, so it actually represents a set of hash functions. So what we actually want is to compute bias graphs for a whole boatload of different seeds, and average them together. That will give us the average bias.

If we do that, it looks like this:

Owen scramble avalanche bias graph

This is really interesting! If we were trying to build a proper hash function, this would be an awful result. There's tons of bias. Fortunately, we're not actually trying to build a hash function per-se, we're just exploiting hashing techniques to do Owen scrambling. And this is simply what proper Owen scrambling looks like when treated as a hash.

So this gives us a target to hit--a ground-truth result. There are a few interesting things to notice about it:

  1. The entire lower-right half of the graph is pure white. This is what we want, and represents a critical property of Owen hashing: bits should only affect bits lower than themselves.

  2. There is a single additional white pixel in the upper right. This is also expected: that pixel indicates the effect that the highest bit has on the second-highest bit. If you revisit the scramble tree, you can see that every possible tree configuration either causes the highest bit to always change the flip decision of the second-highest bit or never change it. You cannot create a tree with less than 100% bias between those two bits.

  3. The upper-right corner has some gray that fades off as it moves left. This actually has the same cause as the extra white pixel: a limited set of possible tree configurations. This gray gradient is simply the average bias of that limited set. In fact, this precise gradient[5] is an indication (though not proof) that the seeding is working properly, and producing all tree configurations with roughly equal probability.

How do the LK hashes compare?

So now that we have a ground-truth, we can compare our previous hashes. But we do have to address one thing first: as I mentioned in my previous posts, LK-style hashes actually do the opposite of an Owen scramble in the sense that instead of mixing from high bits to low bits, they mix from low to high.

So for convenience of comparison, here is the Owen scramble graph again, flipped to match LK-style hashes (this is what a "perfect" LK hash should look like):

reverse-bit Owen scramble avalanche bias graph

And here's the original Laine-Karras hash:

LK hash avalanche bias graph

And here's the hash from the end of my previous post:

v2 avalanche bias graph

So... my "improved" hash from the previous post actually looks quite a bit worse than the original Laine-Karras hash, which is embarrassing. So... uh, yeah, don't use that one. (In my defense, the point sets it generates actually do have a more consistently "okay" FFT, but still.)

The original Laine-Karras hash has a curious line of gray along the diagonal, but otherwise looks pretty good overall.

Both hashes suffer from having a very uneven/noisy gray gradient, and actually have less apparent bias in many pixels compared to the reference Owen scramble. Again, if we were developing a proper hash function, less bias would be good. But in our case that actually suggests that there are scramble trees they're not generating regardless of seeding. In other words, it means that the seeding process is biased.

Measuring seeding bias.

Seeing the issues with the gray gradient made me want a way to visualize that seeding bias more directly. The avalanche bias graph only gives an indirect indication of that issue.

What I came up with is based on this:

seed = random()

input1 = random()
input2 = random()
output1 = lk_hash(input1, seed)
output2 = lk_hash(input2, seed)

y = reverse_bits(input1 ^ input2)
x = reverse_bits(output1 ^ output2)

...and essentially making a 2d histogram of the resulting (x, y) points. The idea behind this is to reveal node relationships in the scramble tree that don't meaningfully change with the seed. Or, in other words, it provides some insight into what tree configurations the seeding process can produce.

In practice, a full histogram of this would be impractically large--it would be a 4294967296 x 4294967296 pixel image. So although the above code snippet is the idea, in practice I did some further manipulations to focus on the bits corresponding to the top of the scramble tree.

Both the base approach as well as the additional manipulations I did mean that this is far from perfect. It absolutely misses things. But it nevertheless provides more insight into our hash functions, which is good!

Anyway, here's what it looks like for the proper (bit-reversed) Owen scramble function:

Owen scramble dual-graph

Here's the original LK hash:

LK hash dual-graph

And here's my "improved" hash from the previous post:

v2 dual-graph

So it turns out both hashes are pretty bad. At least, in the sense that they are strongly biased away from particular tree configurations.[6]

Developing better LK-style hashes.

Since we now have some objective--even if incomplete--quality measures, some avenues for hash design have opened up. In particular, we can automate or semi-automate the search for better hashes by trying to maximize those quality measures.

I took inspiration from Hash Function Prospector, and wrote some code to randomly generate hashes from valid operations, and keep the hashes that most closely matched the measures of an ideal LK-style hash.

My hope was that this, on its own, would find some really great hashes. And although it certainly did find better hashes, in practice it required some intervention and guidance to find really good ones. In particular, I found that the following three-stage process tended to work well:

  1. Use a fully randomized search to look for sequences of operations that seem to work well.
  2. Take the best-performing of those op sequences, and do a search that only changes their constants.
  3. For the very best hashes that come out of that, compare them in the visual inspector that Brent Burley provided in the supplemental material of his Owen hashing paper. In particular, focusing on visually comparing their FFTs across a variety of seeds and sample counts.

The code I hacked together to do these evaluations and searches can be found here. It also includes a version of Brent Burley's visual inspector, with the hashes from this post included (including the full Owen scramble and the two hashes below).

The first results, and a problem.

The highest quality hash that initially came out of this process was the following:

(Note: please don't use this hash! It has an issue, discussed and fixed below.)

x *= 0x788aeeed
x ^= x * 0x41506a02
x += seed
x *= seed | 1
x ^= x * 0x7483dc64

high-quality hash dual-graph

The quantitative measures, while not perfect, appear very good. And visual inspection of both the resulting point sets and their FFTs was (as far as I could tell) essentially indistinguishable from a full Owen scramble. It's also no more expensive than the original Laine-Karras hash.

However, Matt Pharr identified a significant problem with this hash when testing it in PBRT. In short, if you run the following code which buckets outputs based on their lowest 8 bits:

any_constant = 123
buckets = [0, 0, 0, 0, ...]
loop for as long as you like:
    seed = random()
    buckets[owen_hash(any_constant, seed) & 0xff] += 1

...almost half of the buckets end up empty, and there is significant bias even in the buckets that aren't. Since you need to reverse the bits for Owen scrambling, those lowest bits become the highest bits, and it can end up being a real problem.[7]

After investigating a bit, I discovered that the root cause of the issue was this part of the hash:

x += seed
x *= seed | 1

The addition followed immediately by a multiplication with the same seed interfere with each other.

This is unfortunate because this sequence is otherwise remarkably good at producing Owen-scrambling-like behavior with a seed. There is no other sequence I could find that even comes close. Even just changing the order of the addition/multiply, or inserting other operations between them, significantly degrades the quality of the hash by the other measures in this post.

Fortunately, it turns out that the fix is easy: use two independent random seed values.

x += seed_1
x *= seed_2 | 1

This not only fixes the issue Matt found, but actually improves the other measures as well.

The downside is that it now requires 64 bits of random seed, rather than just 32 bits. This isn't a huge problem, but I wanted to keep it to 32 bits if I could. Some further investigation revealed this slightly hacky trick, which works really well in practice:

x += seed
x *= (seed >> 16) | 1

The idea here is to decorrelate the lower bits of the two operations. The bucketing issue is still present to some extent in the higher bits with this hack, but even with 2^24 buckets the distribution has very even statistics and only about 100 empty buckets. For 3d rendering, that's more than good enough, especially considering that 32-bit floats only have 24 bits of precision anyway. But if that nevertheless makes you uncomfortable, you can use a full 64 bits of seed instead.

In any case, if you simply drop either of these fixes into the hash above, it resolves the issue.

The final hash.

After figuring this out, I decided to do another round of random hash generation to see if I could get an even better LK-style hash. And it turns out the answer is yes!

So here is the final hash:

x ^= x * 0x3d20adea
x += seed
x *= (seed >> 16) | 1
x ^= x * 0x05526c56
x ^= x * 0x53a22864

final high-quality hash dual-graph

This is much better than the previous results, and is remarkably close to the reference per-bit hash. Examining the FFT in Burley's visual inspector also appears essentially indistinguishable from a full Owen hash.

Please note that, just like the original LK hash, this hash needs random seeds to really work well. In Psychopath I pass incremental seeds through a fast 32-bit hash function before feeding them to this hash, and that works really well.
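
For completeness, here's that final hash as Rust, along with the bit reversal used to turn an LK-style hash into an Owen scramble (a sketch; the function names are just for illustration):

// The final LK-style hash above. All arithmetic is wrapping, as is usual for
// this kind of hash.
fn lk_final_hash(mut x: u32, seed: u32) -> u32 {
    x ^= x.wrapping_mul(0x3d20adea);
    x = x.wrapping_add(seed);
    x = x.wrapping_mul((seed >> 16) | 1);
    x ^= x.wrapping_mul(0x05526c56);
    x ^= x.wrapping_mul(0x53a22864);
    x
}

// LK-style hashes mix from low bits to high, so reversing the bits before and
// after gives Owen-scrambling behavior (high bits affecting low bits). The
// seed should itself be well randomized, e.g. by running an incrementing
// index through a separate 32-bit hash first.
fn owen_scramble_fast(x: u32, seed: u32) -> u32 {
    lk_final_hash(x.reverse_bits(), seed).reverse_bits()
}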

Wrapping up

One of the things that has made exploring this problem so interesting to me is that it's largely unexplored territory. It shares a lot in common with general hashing, but is nevertheless quite different in both purpose and quality evaluation.

I've taken a stab at developing such evaluation measures in this post, but as is clear from the problem that Matt Pharr found, they aren't complete. I feel reasonably confident that this final hash is good—certainly better than the original LK hash. But it's always possible there are still subtle issues. If anyone does discover issues with it, please contact me!


Footnotes

  1. Actually, this is just Owen scrambling in base 2. Owen scrambling can be performed in arbitrary bases. For example, base 3 would divide each segment into thirds and randomly shuffle those thirds. For scrambling Sobol points, however, we want base-2 Owen scrambling, so that's what I'm focusing on here.

  2. It doesn't really matter whether we traverse based on the input bits or the output bits--both conventions result in an Owen scramble, and both conventions are capable of representing all possible Owen scrambles. I use the input bits here because I think that's easier to follow. But it's worth noting that the visual Owen scramble process I outlined earlier in the post is better represented by the output bit convention.

  3. If your seed is only 32 bits, of course, then it can't possibly provide enough variation to cover all possible 32-bit Owen scramble trees (each one being roughly 500MB large). But in practice this doesn't matter--we just need a good random sampling of Owen scramble trees. And a high-quality seedable hash function will accomplish that just fine.

  4. There are actually some subtle ways you can screw up a real implementation of this, which I (repeatedly) learned the hard way. For example, you need to properly handle flipping the highest bit, which can go wrong if not handled specially thanks to how e.g. bit shifting works on x86 architectures. Also, you need to account for string length when hashing the bits, so that e.g. a single binary zero and a string of two binary zeros gives a different hash result, even though their numeric value is the same. As for choosing a seedable hash function, I went with SipHash-1-3 in my implementation, passing the seed as SipHash's key. But there are almost certainly cheaper hash functions that will work just as well.

  5. You can compute the expected gradient for this analytically using a little combinatorial math. Beyond the highest 16 bits (half the width of the graph), however, it becomes difficult to compute in practice due to the enormous numbers that the factorials produce. But I did compute it out to 16 bits, and it matches very closely.

  6. It's not entirely clear that being biased in this sense is always a bad thing. For example, I wonder if being biased towards generating less biased trees might be a good thing for some applications. But being arbitrarily biased--rather than biased in a carefully engineered way--is probably a bad thing, which is what we're facing here.

  7. Interestingly, if you do both scrambling and shuffling, like I am in Psychopath, the practical impacts of this don't manifest in any obvious way (which is why I didn't notice it). It turns out that shuffling is really good at hiding problems with scrambling. But if you do only scrambling—no shuffling—this problem shows up as subtle but noticeable artifacts at low sample counts.

2021 - 01 - 02

Sobol Sampling - Take 2

(For yet another update on Psychopath's Sobol sampler, as well as significantly better LK hashes, please see the next post in this series: Building a Better LK Hash.)

Shortly after I wrote my last post about Sobol sampling, I got in contact with Brent Burley, who created the hash-based Owen scrambling used in the paper Progressive Multi-Jittered Sample Sequences. He was extremely kind, and spared some time to chat with me about his Sobol sampler. He has also now published a paper about it:

Practical Hash-based Owen Scrambling, by Brent Burley

Unsurprisingly, it goes beyond what I had figured out on my own, and improves upon existing Sobol samplers in multiple ways. It's really an amazing paper, and I highly recommend giving it a read.

As it says in the title, it presents an efficient hash-based Owen scrambling technique. But it also presents a way to efficiently shuffle the order of the points in the sequence using the same hashing approach.[1] The shuffling allows you to create as many statistically uncorrelated Sobol sequences as you like, which is crazy useful. For example, as described in the paper, you can pad a Sobol sequence with another Sobol sequence.

The combination of these two things--the shuffling and the Owen scrambling--creates a very powerful, very high-quality sampler. And it's efficient enough to be computed on-the-fly while rendering.

My Sobol sampler is now essentially the same as what's described in Burley's paper, so you can just read the paper to know what I'm doing. But my implementation does depart in one small way, which I believe is a tiny but worthwhile contribution, and is what the rest of this post is about.

The LK hash

The hash function used in Brent Burley's paper is taken from Stratified Sampling for Stochastic Transparency by Laine et al., and Burley variously refers to it as both the "Laine-Karras permutation" and the "Laine-Karras hash". For brevity, I'll be calling it the "LK hash". Here is the pseudo code:

function lk_hash(x, seed) {
    x += seed
    x ^= x * 0x6c50b47c
    x ^= x * 0xb82f1e52
    x ^= x * 0xc7afe638
    x ^= x * 0x8d22f6e6
    return x
}

The critical property of this function is that, unlike a normal hash, it only avalanches upward, from lower bits to higher bits. In other words, it's a hash function where each bit only affects bits higher than itself.

This is the opposite of Owen scrambling, which is equivalent to a hash that only avalanches downward. However, it's easy to turn it into an Owen scrambling function: just reverse the order of the bits both before and after it:

function owen_scramble(x, seed) {
    x = reverse_bits(x)
    x = lk_hash(x, seed)
    x = reverse_bits(x)
    return x
}

Burley also uses the LK hash for shuffling the order of the Sobol points. So functions with this upward-only-avalanche property are really useful: they can be used both for shuffling and for Owen scrambling.

The only problem is that this particular function isn't especially high-quality.[2] It seems to be good enough in practice for rendering, so it may not actually be worth improving. But it still bugs me. Let's take a look.

Here is a fully Owen scrambled sequence as ground-truth, computed using a much slower technique (from left-to-right are 64, 256, 1024, and 4096 points from the sequence, respectively):

full Owen scramble

And here is the bit-reversed LK hash:

Laine-Karras scramble

It looks pretty good, but the higher sample counts are visibly different. In particular, at 1024 samples there is an obvious structure to it. And at 4096 samples the "texture" of the samples is different (though you may need to click to view the full-size images to see it clearly).

Making a better hash function

I first tried to improve this by optimizing the constants, using more permutation rounds, etc. That's where I was when I wrote my previous Sobol post. And although it did help some, all of the results still had obvious issues, especially when looking at other dimension pairs.[3]

After getting in touch with Brent Burley, however, I was inspired to take another crack at it. It was at that point that I tried different permutation operations. The first real success was adding x += x << 1 to each round:

function lk_improved_v1(x, seed) {
    x += seed
    x ^= x * 0x6c50b47c
    x += x << 1
    x ^= x * 0xb82f1e52
    x += x << 1
    x ^= x * 0xc7afe638
    x += x << 1
    x ^= x * 0x8d22f6e6
    x += x << 1
    return x
}

The result is this:

Laine-Karras improved v1

This is visibly much better than the original LK hash, especially at 1024 samples. The main problem is that it's also slower. Mind you, it's still plenty fast. But I really wanted to make a hash that works better and is at least as fast as the original LK hash.

The astute among you (more astute than I was at the time) may notice something a little fishy about the new operation I added. I originally came up with it by specifically trying to make something that only avalanches upward. But it also turns out to be the same as just multiplying x by 3.

When I finally noticed that, and after a bit more thought, I realized that multiplying by odd numbers is also a valid operation in a hash like this. So, all together, these are the permutations I'm currently aware of that are valid for constructing a hash like this:

  1. x ^= constant
  2. x += constant
  3. x *= odd_constant
  4. x ^= x * even_constant

You can mix and match these to try to create better hashes. There are likely even more valid permutations, but I don't know what they might be.
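
If you want to sanity-check a candidate construction, the defining property is easy to test directly: flipping input bit i must never change any output bit below i. A quick, non-exhaustive Rust check might look like this:

// Returns false if flipping some input bit ever changes an output bit below
// it. A property check for LK-style hashes; illustrative, not exhaustive.
fn only_avalanches_upward(h: impl Fn(u32) -> u32, trials: u32) -> bool {
    for k in 0..trials {
        // A simple Weyl sequence as a cheap source of varied test inputs.
        let x = k.wrapping_mul(0x9e3779b9);
        let hx = h(x);
        for i in 0..32u32 {
            let diff = hx ^ h(x ^ (1u32 << i));
            let below = (1u32 << i).wrapping_sub(1); // bits strictly below i
            if diff & below != 0 {
                return false;
            }
        }
    }
    true
}

For example, x += x << 1 passes this check (it's the same as multiplying by 3), while something like x ^= x >> 16 fails immediately, since it pushes information downward.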

Playing around with those permutations, I came up with this hash:

function lk_improved_v2(x, seed) {
    x += seed
    x ^= 0xdc967795
    x *= 0x97b754b7
    x ^= 0x866350b1
    x *= 0x9e3779cd
    return x
}

(UPDATE: please do NOT use this hash. It turns out that it's not very good. Please see the next post in this series for hashes that are actually worth using.)

And here are the results:

Laine-Karras improved v2

It's not quite as good as the previous one, but it's close. And it's still much better than the original LK hash. Moreover, it's faster than both of them: this version only takes 5 CPU operations, compared to the 9 and 17 ops of the other two.

This version is what I'm now using in Psychopath. It works at least as well as the original LK hash, and is a bit faster.

In practice, I don't notice any difference in render quality between any of these different hashes, or even between them and full Owen scrambling for that matter. So... it probably doesn't really matter. But it makes me feel better that this is a bit closer to a full Owen scramble.

Final thoughts

I certainly don't think I've come up with the best possible hashes. There's clearly still room for improvement.

Moreover, a quantitative analysis of hash quality would be really useful here. I did measure upward avalanche with some bespoke throw-away testing code, and the results suggested that these new hashes do improve things. But my testing wasn't careful or thorough, and there could easily have been awful bugs in my testing code. So I don't actually trust the results.

I hope that this post can serve as a starting point for other people to develop even better LK-style hashes. And if someone does further work on this, I'd love to hear about it! But for the time being, I'm quite satisfied with what I've arrived at.

For the curious, here is my implementation.


Footnotes

  1. In my previous post I said that you shouldn't offset into Sobol sequences, because it makes Sobol sampling slow. And yet that's more-or-less what shuffling does, just in a much more sophisticated way. Indeed, shuffling will also make the Sobol sampler slower in the same way that offsetting does. However, since writing my last post I've discovered a couple of ways to mitigate that performance impact. Moreover, it's trivial to evaluate multiple Sobol dimensions at once with SIMD, which my sampler now does. So, it's quite performant now.

  2. Burley notes in his paper that measuring hash quality is a complex endeavor, and that he uses the original LK hash due to the testing and claims of the original authors. It's also important to note that in practice the LK hash seems to perform very well in rendering applications. So I think Burley's choice to use the hash as-is in his paper was a good one.

    However, looking at the actual points it generates, it obviously has room for improvement. Moreover, in the Laine-Karras paper I saw no indication that they developed the hash in any kind of rigorous way. Even further, they say this about the constants used in their hash:

    The multiplication constants are the ones used in generating the images in this paper, but they carry no particular significance beyond that.

    So my strong suspicion is that they essentially just eye-balled it until it worked well enough for their particular use-case.

  3. For brevity, the images in this post only show the first two Sobol dimensions. In my testing, however, I looked at over 50 different dimension pairs, which revealed similar differences between the LK hash and the full Owen scramble. The improved hashes in this post fix those differences about as well as the differences in the first two dimensions. Which is to say, it's notably better, but certainly still has room for improvement.

2020 - 09 - 22

Storing Color Compactly

One of the tricky things about a Manuka-style rendering architecture is that you need to store your shading information really compactly, since it will be stored at every micropolygon vertex.

Storing surface normals seems like a fairly solved problem at this point, and I recommend the paper Survey of Efficient Representations for Independent Unit Vectors by Cigolle et al. if you haven't already read it.

Color, on the other hand, still seems like a bit of an open problem. For something like Psychopath, ideally I want a color format that meets these criteria:

  • Compact (ideally 32 bits or less).
  • Fast to decode (I want to use this at runtime while rendering, potentially decoding on every ray hit).
  • Covers the entire gamut of human-visible colors.
  • Has sufficient dynamic range to capture day-to-day natural phenomenon (e.g. the sun) without clipping.
  • Has sufficient precision to avoid visible banding.

I took a few detours, but I believe I've come up with a format that meets all of these criteria. And I call it FLuv32.

It's based on the 32-bit variant of LogLuv, which is a color format designed by Greg Ward of Radiance fame. LogLuv ticks almost all of the boxes: it's 32 bits, it covers the full gamut of visible colors, it has a massive dynamic range of 127 stops, and it has precision that exceeds the capabilities of human vision. It's a super cool format, and I strongly recommend reading the original paper.

However, the one box that it doesn't tick is performance. Since it uses a log encoding for luminance, decoding it requires an exp2() call, which is slower than I want.[1]

FLuv32 is essentially the same as LogLuv, but replaces the log encoding with floating point (hence the "F" in the name) which can be decoded with just a few fast operations. To achieve the same precision with floating point, I dropped the sign bit, foregoing negative luminance capabilities. But aside from that and a couple minor tweaks, it's the same as LogLuv. It has the same precision, the same massive dynamic range, and also fits in 32 bits. It's just a lot faster.

The FLuv32 color format

For anyone wanting to use FLuv32 themselves, I have a reference implementation in Rust here, and below is a description of the format.

Like LogLuv, FLuv32 uses an absolute color space, and as such takes CIE XYZ triplets as input. Also like LogLuv, it uses the Y channel directly as luminance, and computes u' and v' with the following CIELUV formulas:

u' = 4X / (X + 15Y + 3Z)
v' = 9Y / (X + 15Y + 3Z)

FLuv32 stores Y, u', and v' in the following bit layout (from most to least significant bit):

  • 7 bits: Y exponent (bias 42)
  • 9 bits: Y mantissa (implied leading 1, for 10 bits precision)
  • 8 bits: u'
  • 8 bits: v'

To store u' and v' in 8 bits each, their values need to be scaled to a [0, 255] range. u' is in the interval [0, 0.62], and v' is a tad smaller than that, so the original LogLuv uses a scale of 410 for both. FLuv32 departs from that very slightly, and instead uses the following scales:

u'-scale = 817 / 2
v'-scale = 1235 / 3

The motivation for this change is to be able to exactly represent E, the chromaticity of an equal-energy spectrum. This is not perceptually meaningful, which is presumably why the original LogLuv didn't bother. But in the context of a spectral renderer this may be useful, and it's trivial to accommodate.
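
As a small sketch of what that chroma encoding looks like in code (this is not the reference implementation linked above, and the rounding/clamping choices here are just illustrative):

// Compute u' and v' from CIE XYZ and quantize them to 8 bits each, using the
// scales given above. Illustrative sketch only.
fn encode_chroma(x: f32, y: f32, z: f32) -> (u8, u8) {
    let d = x + 15.0 * y + 3.0 * z;
    if d <= 0.0 {
        return (0, 0); // black or degenerate input
    }
    let u_prime = (4.0 * x) / d;
    let v_prime = (9.0 * y) / d;
    let u_q = (u_prime * (817.0 / 2.0)).round().clamp(0.0, 255.0) as u8;
    let v_q = (v_prime * (1235.0 / 3.0)).round().clamp(0.0, 255.0) as u8;
    (u_q, v_q)
}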

The luminance, Y, is stored with 7 bits of exponent and 9 bits of mantissa. The exponent has a bias of 42, and the mantissa follows the standard convention of having an implicit leading one, giving it 10 bits of precision. The minimum exponent indicates a value of zero (denormal numbers are not supported). The maximum exponent is given no special treatment (no infinities, no NaN).

The exponent's bias of 42 may seem surprising, since typically a 7-bit exponent would be biased by 63. The reasoning for this is two-fold:

  • The dynamic range is already monstrously excessive for practical imaging applications, so to some extent it doesn't matter.
  • It seems more useful to be able to represent the luminance of supernovae in physical units (which exceed 2^64 cd/m^2) than ludicrously dark values (and even so, 2^-42 is already ludicrously dark).

42 was chosen by putting two thirds of the exponent range above 1.0, and one third below.[2]

Other formats

I also explored some other formats before arriving at FLuv32.

The obvious first one is simply using three 16-bit floats for RGB values. Surprisingly, although this provides more than enough precision, it doesn't provide enough dynamic range. Standard IEEE half-floats only have 5 bits of exponent, giving only 32 stops. That's not bad, but it's not quite enough to cover the whole range from dark shadows to the brightness of the sun. Moreover, 3 half floats takes up 48 bits, which is unnecessary.

After that, I considered RGBE, which is actually a predecessor to LogLuv, and is used in the Radiance HDR image format. In its typical formulation, 8 bits each are given to R, G, B, and the shared exponent. This format is very fast to decode, but unfortunately doesn't have the precision needed to avoid visible banding.

I implemented two other RGBE variants to try to improve things:

  • 32-bits with a 9-9-9-5 split. While this improved precision, it also unfortunately reduced dynamic range too much.
  • 40-bits with a 11-11-11-7 split. This provides more than enough precision and range, at the expense of slightly increased storage requirements. I think this is actually a solidly good format if RGB is a requirement.

Before stumbling on LogLuv, I also explored my own home-grown luma-chroma based format, based on YCbCr. It didn't pan out, largely because if you don't know the precise RGB colorspace you're working from, you can't accurately separate chroma from luminance, which is critical for storing chroma in fewer bits. You could, of course, just choose a specific RGB colorspace (e.g. ACES2065-1), but even then your chroma don't end up being perceptually uniform, which also thwarts low-bit-count storage.

All-in-all, I would say if you want to store HDR RGB specifically, go with the 40-bit RGBE format, which should easily accommodate any RGB color space with plenty of precision. And if you just want to store HDR color more generally, then both LogLuv and FLuv32 are great formats.

Shower thought

It occurred to me that in a spectral renderer you might not even need to do a full decode from FLuv32 back to XYZ. You have to spectrally upsample all your colors anyway, and usually that's done via some kind of chroma-oriented lookup table. So I'm wondering now if it would be reasonable to build a lookup table that's just directly indexed by the quantized u' and v' coordinates.

I haven't fully explored that yet, but decoding just the luminance from FLuv32 is super fast. On CPU, on my machine, the full XYZ decode is about 3ns, whereas the Y decode is about 1ns. So that would be a significant savings. Even more so if you would just end up recomputing chroma from XYZ anyway for a spectral lookup.


Footnotes

  1. Arguably you could use a fast approximate exp2 to overcome performance limitations. However, the accuracy would need to be good enough to avoid introducing visible error, otherwise the whole point of the format is somewhat subverted.
  2. That it also happens to be the answer to everything is pure coincidence.