At the end of my post about ray reordering I alluded to some problems I ran into with geometry caches and ray reordering. In this post I'm going to talk about the geometry caching problem.
Geometry caching in and of itself isn't particularly problematic. But it can become a problem if you're trying to share a cache between multiple threads. The issue is that with an LRU cache even a read is a write operation, because you have to move the accessed item to the front of the cache. And that means locking. And locking means thread contention.
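To make that concrete, here's a minimal sketch of the kind of cache I mean (not Psychopath's actual code, and eviction is omitted for brevity): even `get()` has to mutate the recency list, so it takes an exclusive lock.

```cpp
#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Minimal LRU cache sketch, just to illustrate the locking problem.
template <typename Key, typename T>
class LRUCache {
    std::list<std::pair<Key, std::shared_ptr<T>>> items_; // front = most recent
    std::unordered_map<Key, typename decltype(items_)::iterator> index_;
    std::mutex mutex_;

public:
    std::shared_ptr<T> get(const Key& key) {
        std::lock_guard<std::mutex> lock(mutex_); // exclusive, even for a "read"
        auto it = index_.find(key);
        if (it == index_.end())
            return nullptr; // cache miss
        // The "read" is a write: move the accessed item to the front.
        items_.splice(items_.begin(), items_, it->second);
        return it->second->second;
    }

    std::shared_ptr<T> insert(const Key& key, std::shared_ptr<T> value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
        // Eviction of least-recently-used items is omitted for brevity.
        return items_.front().second;
    }
};
```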
I think I was actually reasonably smart in the way I wrote the cache. The cache treats the cached data as read-only, so it only needs to be locked long enough to move the accessed item to the front and fetch a shared pointer to the data. That's a very short period of time, certainly far less than the time spent tracing against the geometry afterwards.
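In code, the access pattern looks roughly like this (`ObjectID`, `Geometry`, `load_geometry()`, and `intersect_ray()` are hypothetical stand-ins, reusing the `LRUCache` sketch from above):

```cpp
// Hypothetical caller: the lock is only held inside get()/insert().
void trace_against_object(LRUCache<ObjectID, Geometry>& cache,
                          const ObjectID& object_id, Ray& ray) {
    std::shared_ptr<Geometry> geo = cache.get(object_id);
    if (!geo) {
        // Cache miss: load the geometry and insert it.
        geo = cache.insert(object_id, load_geometry(object_id));
    }
    // No lock is held here; the cached data is treated as read-only, so any
    // number of threads can trace against it concurrently.
    intersect_ray(ray, *geo);
}
```

The shared pointer is also what makes the short lock safe: even if another thread evicts the item immediately afterwards, the geometry stays alive until everyone tracing against it is done.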
The result of that cleverness was that rendering with eight cores was about 5-6x faster than with one. That's nothing to sneeze at, considering that the cache was being accessed for every ray-object intersection test. But it was a clear indication that it wouldn't scale well with many more cores. And on eight cores, you really want close to an 8x speed-up if you can get it.
I was able to improve the situation further by better exploiting ray reordering. Instead of accessing the cache for every ray-object test like before, I just accessed it once for an entire batch of rays being tested against an object.
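Roughly, the change looked like this (same hypothetical names as before):

```cpp
// Before: one locked cache access per ray-object intersection test.
for (Ray& ray : ray_batch)
    intersect_ray(ray, *cache.get(object_id)); // contends on the lock per ray

// After: a single locked cache access (miss handling omitted), amortized over
// the whole batch of rays that ray reordering has queued against this object.
std::shared_ptr<Geometry> geo = cache.get(object_id);
for (Ray& ray : ray_batch)
    intersect_ray(ray, *geo); // no cache or lock traffic per ray
```

This gave a 7x speed-up over a single-core render on most of my test scenes. Again, quite good. But still, how many more cores would it scale well with?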
But even worse, on one of my test scenes it was still only about a 6x speed-up. The reason, as it turned out, was that the scene was much more complex, with lots of very small objects. When the objects are smaller, fewer rays are queued against each individual object, so cache access isn't amortized over as large a batch of rays. And, in theory, with smaller and smaller objects that problem could get arbitrarily bad.
So I wanted to push it even further. To do this, I thought of two basic approaches:
- Eliminate the locking by giving each thread its own (smaller) thread-local cache (sketched after this list).
- Eliminate the cache entirely.
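The first approach might look something like this sketch (hypothetical, and not something Psychopath actually does):

```cpp
// Each thread owns its own smaller cache, so get() never contends with other
// threads (the mutex in the earlier sketch could even be dropped entirely).
thread_local LRUCache<ObjectID, Geometry> local_cache;

std::shared_ptr<Geometry> fetch_geometry(const ObjectID& id) {
    if (auto geo = local_cache.get(id))
        return geo;
    // Miss: each thread may load and store its own copy of the same geometry,
    // multiplying the total memory footprint in the worst case.
    return local_cache.insert(id, load_geometry(id));
}
```

The trade-off is duplication: in the worst case, every thread ends up holding its own copy of the same geometry.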
In the end, I decided to take the second approach and eliminate the cache entirely. That might sound extreme, but the geometry cache wasn't actually giving that much of a performance boost. Ray reordering without the cache was really only about 5-10% slower on most scenes, and removing the cache resulted in enough of a speed-up on eight cores to make up for that (though I don't recall the exact number... I don't have my 8-core machine on hand as I'm writing this). Moreover, removing the cache simplified the code quite a bit and eliminated its memory footprint. And, most importantly, it completely eliminated the problem of many small objects, ensuring good scaling independent of scene geometry.
To be totally honest, I'm mixing up the development timeline a bit: I didn't disable the cache in committed code until after I'd moved away from ray reordering as well, which I'll talk about in a later post. Nevertheless, in the end I don't think a shared geometry cache is a scalable approach. I may revisit the idea of thread-local caches at some point. But for now, I don't think Psychopath needs them.