Compressing PBR material texture sets with sparsity and k-SVD dictionary learning

Introduction

In this blog post, I am going to continue exploration of compressing whole PBR texture sets together (as opposed to compressing each texture from the set separately) and using the fact that those textures are strongly correlated.

In my previous post, I have described how we can “decorrelate” some of the channels using PCA/SVD and how many of the dimensions in such a decorrelated space contain small amount of information, and therefore can be either dropped completely, stored at a lower resolution, or quantized aggressively. This idea is a backbone of almost all successful image compression algorithms – from JPEG to BC/DXT.

This time we will look at exploring some redundancy across both the stored dimensions, as well as the image space using sparse methods and kSVD algorithm (original paper link).

From my personal motivation, I wanted to learn more about those sparse methods, play with implementing them and build some intuition – one of my coworkers and collaborators on my previous team at Google, Michael Elad is one of the key researchers in the field and his presentations inspired and interested me a lot – and as I believe that until you can implement and explain something yourself, you don’t really understand it – I wanted to explore this fascinating field (in a basic, limited form) myself.

First I am going to do present a simplified and intuitive explanation of what “sparsity” and “overcompletness” mean and how their redundancy can be an advantage. Then I will describe how we can use sparse approximation methods for tasks like denoising, super-resolution, and texture compression. Then I will provide a quick explanation of k-SVD algorithm steps and how it works, and finally show some results / evaluation for the task of compressing PBR material texture sets.

This post will come with a colab notebook (soon – I need to clean it up slightly) so you can reproduce all the code and explore those methods yourself.

Sparse and dictionary based methods in real-time rendering?

I tried to look for some other references to dictionary based methods in real time rendering and was able to find only of three resources: a presentation by Manny Ko and Robin Green about kSVD and OMP, a presentation by Manny Ko about Dictionary Learning in Games, and a mention of using Vector Quantization for shadow mask compression by Bo Li. VQ is a method that is more similar to palletizing than full dictionary learning, but still a part of the family of sparse methods. I recommend those presentations, (a lot of what I describe about the theoretical part – and how I describe – overlaps with them) and if you are aware of any more mentions of those IMO underutilized methods in such a context, please leave a comment and I’ll be happy to link.

Vector storage – complete and orthogonal basis, vs overcomplete and sparsely used frames / dictionaries

To explain what are sparse methods, let’s start with a simple, 2 dimensional vector. How can we represent such a vector? “This is obvious, we can just use 2 numbers, x and y!”. Sure, but what do those numbers mean? They correspond to an orthonormal vector basis. x and y represent actually a proportion of position along the x axis, and the y axis. If you have done computer graphics, you know that you can easily change your basis. For example, a vector can be decribed in so called “world space” (with arbitrary, fixed point), “camera / view space”, or for some simple video game AI purpose in space of a computer-controlled character.

Same vector, two different storage basis.

Vector basis might have different properties. Two that are very desired are orthogonality and normality, often combined together as orthonormal basis. Orthonormality gives us trivial way of changing information between two basis back and forth (no matrix inversion required, no scaling/non-unity Jacobians introduced) and some other properties related to closed form products and integrals – it’s beyond scope of this post, but I recommend “Stupid Spherical Harmonics Tricks” from Peter-Pike Sloan for some examples how SH being orthonormal makes it a desired basis for storing spherical signals.

Choice of “the best” basis matters a lot when doing compression – I have described it in my previous blog, but as a quick refresher, “decorrelating” dimensions can give us basis in which information is maximized in each sequential subset of dimensions – meaning that the first dimension has the most information and variability present, the next one the same or smaller amount etc. This is limited to linear relationships and covariance, but lots of real data shows linear redundancy.

Demonstration from my previous post of finding a storage basis in which we maximize information stored in single dimension.

One of the requirements of a space to be a basis is completeness and uniqueness of representation. Uniqueness of the representation means that given a fixed basis, there is only a single way to describe a vector in this space. Completeness means that every vector can be represented without loss of information. In my last post, we looked at the truncated SVD – which is not complete, because we lose some original information and there is no way how we can represent it back.

What is quite interesting is that we can go further than vector basis and go towards overcomplete frames (let’s simplify – a lot – for the purpose of this post and say that frame is just “some generalization” of a basis). This means that we have more vectors in space than the “rank” of the information we want to store. Overcomplete frames can’t be orthogonal or unique (I encourage readers for whom this is a new concept to try to prove it – even if just intuitively). Sounds pretty useless for information compression? Not really, let me give here a visual example:

Example of a set of points/vectors that cannot be represented well by any two orthogonal vectors, but on the other hand three frame vectors describe this set pretty well.

Note that while in the case of overcomplete frame every point can be described in infinitely many ways (just like x + y = 5 has infinitely many solutions of x and y values that satisfy it), we usually are going to look for a sparse representation. We want as many of the components to be zero (or close to zero enough so that we can discard them, accepting some errors), and the input vectors described by a linear combination of less frame vectors than the size of the original basis.

If this description seems vague, looking at the above example – with complete basis, we generally need two numbers to describe the input point set. However, with an overcomplete frame, we can use just a single 2-bit basis index and a float value describing magnitude along it:

Sparse representation in a “proper” overcomplete frame can approximate some input data pretty well.

Yes, we lose some information, and there is an approximation error – but there is always an error in the case of lossy compression – and the error in such a basis is way smaller than from any truncated SVD (keeping a single dimension of orthogonal basis).

Terminology wise – we commonly refer to a frame vector as atom, and collection of atoms as dictionary – and I will use atoms and dictionary mostly throughout this post.

Dense vs sparse representation – a dense representation has a minimal number of dimensions (rank) of the set, but every item represented in it is going to be represented by multiple coefficients. With sparse representation, we might have many more dimensions, but every item is going to be represented by a small subset of those – most coefficients are zero.

Also note – we will come back to this later in the next section – I have used here a 2D example and a 1-sparse representation (single atom to represent input) – but with higher dimensions, we can (and in practice do) use multiple atoms per vector – so if you have 16 atoms, any representation that uses 1-15 of them can be considered sparse. Obviously if the input space is 3D, and your dictionary has 20 atoms, while a 10 atom representation of a vector “could” be seen as sparse, it probably won’t make sense for most practical scenarios.

Sparsity and images – palettes and tiles

Now I admit, this 2D case is not super compelling. However, things start to look much more interesting in some much higher dimensions. Imagine that instead of storing 20 numbers for a 20 dimensional vector, you could store just a few indices and a few numbers – in very high dimensions, sparse representations get more and more interesting. Before going towards high dimensional case, let me just show some interesting connection – what happens if for a single pixel of e.g. RGB image we store only index (not even a weight) from a dictionary? Then we end up with an image palette.

Example images compressed with palettization and their palettes – source: wikipedia, author: Ricardo Cancho Niemietz

Image palettization is an example of a sparse representation method! One of many algorithms to compute image palette it is k-means, which is a more basic version of k-SVD that I’m going to describe later.

Palettization generalizes to many dimensions – if you have a PBR texture set with 20 channels, you can still palettize it – just your entries will be 20 dimensional vectors.

However, we can go much further and increase the dimensionality (hoping for more redundancy in more dimensions) much further. Let’s have a look at this Kodak dataset image and at image tiles:

Notice in colored rectangles how much self similarity there is inside this image!

There is lots of self-similarity in this image – not just within pixel values (those would palettize very well), but among whole image patches! In such a scenario, we can take a whole patch to become our input vector. I have mentioned in the past that thinking beyond “image is 2D matrix / 3D tensor” is one of the most important concepts in numerical methods for image processing, but as a recap, we can take an RGB image and instead of a tensor, represent it by a matrix:

Representing and image or image patch as a matrix.

Or we can go even further, and represent it as a single vector:

Representing and image or image patch as a vector.

So you can turn a 16×16 RGB tile into a single 768 dimensional vector – and with those 16×16 tiles, a 512 x 512 x3 (RGB) image into a 32×32 image of 768 vectors. This takes time to get used to, but if you internalize this concept (and experiment with it yourself), I promise it will be very useful!

What are sparse methods useful for?

Sparse methods were (and to a lesser extent still are) a “hot” research topic because of a simple “philosophical” observation – the world we live in, and the signals we deal with (audio, images, 3D representation), are in many ways sparse and self-similar.

A 100% random image would not be sparse, but images that we capture with cameras or render are in some domains.

A random image is not sparse – there is no structure or self-similarity to it. On the other hand, natural images are – notice lots of self similarities, flat regions, simple geometric constructs like edges, corners, textures.

I will not present any proofs of it (also personally I haven’t seen one that would be both rigorous and not handwavy, while staying intuitive), however let me describe some implications and use-cases of it.

Compression

Very straightforward and the use-case I am going to focus on – if instead of storing full “dense” information of a large, multidimensional vector (like image tile), we can instead store a dictionary, and indices of atoms (plus their weights). If the compressed signal was truly sparse in some domain, this can result in massive compression ratios.

Denoising

What happens when we add white noise to a sparse signal? It’s going to become non-sparse! Therefore by finding best possible sparse approximation of it, we are going to effectively decompose it into sparse and non-sparse component, which is going to be discarded and hopefully, corresponds to noise.

Crude, quickly coded k-SVD based denoising of a very noisy signal. While the output image doesn’t look good (things like blended, overlapping tiles or varying atom count per tile are a must), it is a demonstration that it’s possible to remove most of the noise!

Super-resolution

Principle of self-similarity and sparsity can be used for super-resolution of low-resolution, aliased images. Different “phases” of the sampled aliased signal produce different samples, but if we assume there is an underlying common shared representation, we could find it and reconstruct them.

Aliased triangle and different tiles corresponding to the same original shape.

In this example if we correctly located all the aliased tiles and be certain that single, same high resolution atom corresponds to all of them, then we’d have enough information to remove most of the aliasing and reconstruct the high resolution edge.

Computer vision

Computer vision tasks like object recognition or object classification don’t work very well with hugely dimensional data like images. Almost all computer vision algorithms start by constructing some representation of the image tile – corner, line detection. While in the “old times” those features were hard-coded, nowadays state of the art relies on artificial neural networks with convolutional layers – and the convolutional layers with different filter banks jointly learn the dictionary (edges, corners, higher level patterns), as well as use of this dictionary data (I will mention this connection around the end of this post).

Compressive sensing / compressed sensing

This is a really wild, profound idea that I have learned about just this year and definitely would like to explore more. If we assume that the world and signals we sample are sparse in “some” domain and we know it, then we can capture and transmit much less information (orders of magnitude less!), like random subsamples of the signal. Later we can recompute the missing pieces of the information by forcing sparsity of the reconstruction in this other domain (in which we we know the signal is sparse). This is really interesting and wild area – think about all the potential applications in signal/image processing, or in computer graphics (strongly subsampling expensive raytracing or GI and easily reconstructing the full resolution signal).

Note: Sparsity of simple, orthogonal basis

It is important to emphasize that you don’t need any super fancy overcomplete dictionary to observe signal sparsity. Even simple, orthogonal basis can be sparse – for example Fourier transform, DCT, image gradient domain, decimating wavelets – and many of them are actually very good / close to optimal in “general” (average of different possible images) scenarios. This is a base for many other compression algorithms.

Natural images are usually sparse in Fourier domain.
Left: original image, Right: Only 10% of the largest coefficients in Fourier domain = immediate 10x compression ratio. The output doesn’t look great perceptually (ringing, separate treatment of color channels = color shifts; both can be avoided easily), but this was close to zero consideration or effort!

kSVD for texture and material compression

Later in this post, I am going to describe the k-SVD algorithm – but first, let’s look at use of sparse, dictionary representations and learning for PBR material compression.

I have stated the problem and proposed a simple solution in my previous blog post. As a recap, PBR materials consist of multiple textures representing different physical properties of a surface of a material. Those properties represent surface interaction with light through BRDF models. Different parts of that texture might correspond to different surface types, but usually large parts of the material correspond to the same surface with the same physical properties. I have demonstrated how those properties are globally correlated. Let’s have a look at it again:

An example PBR material, credit cc0textures, Lennart Demes. There are 5 different textures, but each texture corresponds to the same physical location.

After my longish introduction, it might be already clear that we can see lots of self-similarity in this image and very similar local patches. Maybe you can even imagine how those atoms will look like – “here is an edge”, “here is a stone top”, “here is a surface between the stones” that repeat themselves across the image.

It seems that it should be possible to reuse this “global” information and compress it a lot. And indeed, we can use one of sparse representation learning algorithms, kSVD to both find the dictionary, as well as fit the image tiles to it. I will describe the algorithm itself below, but it was used in following way:

  • Split the NxN image into n x n sized tiles after subtracting image mean,
  • Set the dictionary size to contain m atom tiles,
  • Run kSVD to learn the dictionary and allowing for up to k atoms per tile,
  • Recombine the image.

Just for the initial demo (not optimal parameters, I selected the dictionary to contain 8×8 tiles, dictionary sized like 1/32th of the original image, 8 atoms per tile). Afterwards, the texture set will look like this:

Top: reconstruction, middle: reference, bottom: abs error.

I’m presenting it in a low resolution as later I will look more into proper comparisons and the effect of some parameters. From such a first look, I’d say it doesn’t look too bad at all given that per every tile instead of storing 8x8x9 numbers, we store only 9 indices and 9 weights, plus the dictionary is only 1/32th of the orignal image!

But more importantly, how do the atoms / dictionary look like? Note: atoms are normalized across all channels and we have subtracted the image mean; I am showing absolute value of each pixel of each atom:

Dimension-wise slices (visualization) of unsorted atoms from fitting a dictionary to a PBR texture set.

Ok, this looks a bit too messy for my taste. Let’s sort them based on similarity (greedy, naive algorithm: start with the first tile, find most similar to it, then proceed to the next tile until you run out of tiles):

Dimension-wise slices (visualization) of sorted atoms from fitting a dictionary to a PBR texture set.

Important note: from the above 5 differently colored subsets, first 8×8 tile of each of them corresponds to the same atom. They are sliced across different PBR properties for visualization purpose only (I don’t know about you, but I don’t like looking at 9 dimensional images😅).

This is hopefully clearer and also shows that there are some clear “categories” of patches and words in our dictionary: some “flat” patches with barely any texture or normal contribution at all, some strong edges, and some high frequency textured regions. An edge patch will show an edge both in normal-map as well as the AO, while a “textured” patch will show texture and variation in all of the channels. Dictionary learning extended correlation between global color channels to correlation of full local structures. I think this is pretty cool and the first time I saw it running I was pretty excited – we can automatically extract quite “meaningful” (even semantically!) information and how the algorithm can find some self-similarities, repeating patterns and use it to compress the image a lot.

I will come back to analyzing the compression and quality (as it depends on the parameters), but let’s now look at how the k-SVD works.

k-SVD algorithm

I have some bad news and some good news. The bad news are that finding optimal dictionary and atom assignments is NP-Hard problem. â˜č We cannot solve it for any real problems in reasonable time. The good news is that there are some approximate/heuristic algorithms that perform well enough for practical applications!

k-SVD is such an algorithm. It is a greedy, iterative, alternating optimization algorithm that is made of the following steps:

  1. Initialize some dictionary,
  2. Fit atoms to each vector,
  3. Update the dictionary given current assignment
  4. If satisfied with the results, finish. Otherwise, go back to step 2.

This going between steps 2 and 3 is the “alternating” part, and repeating it is “iterative”.

Note that there are many variations of the algorithm and improvements in many follow-up papers, so I will look at some flavor of it. 🙂 But let’s look at each step.

Dictionary initialization

For the dictionary initialization, you can apply many heuristics. You can start with completely random atoms in your dictionary! Or you can take some random subset of your initial vectors / image tiles. I have not investigated this properly (I am sure there are many great papers on initialization that maximizes the convergence speed), but generally taking some random subset seems to work reasonably. Important note is that you’d want your atoms to be “normalized” across all channels – you’ll see why in the next step.

Atom assignment

For the atom assignment, we’re going to use Orthogonal Matching Pursuit (OMP). This part is “greedy” and not necessarily optimal. We start with the original image tile and call it a “residual”. Then we repeat the following:

  1. Find the best matching atom compared to residual. To find this best matching atom according to squared L2 norm, we can simply use dot product. (Think and try to prove why the dot product – cosine similarity of normalized vectors – ordering is equivalent to the squared L2 norm, subject to offset and scale).
  2. Looking at all atoms found so far, do a least squares fit against the original vector.
  3. Update the residual.

We repeat it as many times as the desired atom count, or alternatively until error / residual is under some threshold.

As I mentioned, this is a greedy and not necessarily optimal part. Why? Look at the following scenario where we have a single 4 dimensional vector, 3 atoms, and allow max 2 atoms per fit:

Given our yellow vector to fit, at the first step if we find the most similar atom, it is going to be the orange one – the error is the smallest. Then we would find the green one, and produce some combination that would produce 2 last dimensions too high/too low – a non-negligible error. By comparison, optimal, non-greedy assignment of 2 atoms “red” and “green” would produce zero error!

Greedy algorithms are often reasonable approximations that run fast enough (the decision made is only locally optimal, so we can consider very small subset of solutions), but in many cases are not optimal.

Dictionary update

In this pass, we assume that our atom assignment is “fixed” and we cannot do anything about it – for now. We go even further, as we look at each atom one by one, and assume that all the other atoms are also fixed! Given almost everything fixed, we optimize: given this assignment and all the other atoms being fixed, what is the optimal value for this atom?

So in this example we will try to optimize the “red box” atom, given two “purple box” items using it.

The way we approach it is solving the following: if there was no this atom contribution, how can we minimize the error of all the atoms using it? We look at all errors of every single item using this atom and try to find some single vector that multiplied by different weights will maximize the removal of average of those errors (so matching it).

In a visual form, it will look something like this:

Updating a single atom to approximate all the residuals of multiple items is simply a SVD rank1 decomposition! One vector will be the locally optimal atom, while the orthogonal vector will be a weight vector.

If you are reader of this blog, I hope you recognized that we are looking for a separable, rank 1 approximation of the error matrix! This is where the SVD of k-SVD comes from – we use truncated SVD and compute a rank 1 approximation of the residual. One of separable vectors will be our atom, the other one – updated weights for this atom in each approximated vector.

Note that unlike the previous step, which could lead to unoptimal results, here we have at least a guarantee that each step will improve the average error. It can increase the error of some items, but on average / in total it will always be decreased (or it would have been the same if SVD would found that the initial atom and the weight assignment were already minimizing the residual).

Also note that updating the atoms one by one and updating the weights / residuals at the same time makes it similar to Gauss-Seidel method – resulting in a faster convergence, but at a cost of sacrificed parallelism.

And this is roughly it! We alternate the steps of assigning atoms and updating them until we are happy with the results. There are many nitty-gritty details and follow-up papers, but described like this, kSVD is pretty easy to implement and works pretty well!

How many iterations do you need? This is a good question and probably depends on the problem and those implementation details. For my experiments I was using generally 8-16 iterations.

Different parameters, trade-offs, and results

Note for this section – I am clearly not a compression expert. I am sure there are some compression specific interesting tricks and someone could do a much better analysis – I will not pretend to be competent in those. So please treat it as mostly theoretical analysis from a compression layman.

I will analyze all three main parameters one by one, but first, let’s write down a formula for general compression/storage cost:

Image tiles themselves – k indices, k weights for each of the tiles. Indices need as many bits as log2 of the number of atoms m in the dictionary, and the weights need to be quantized to b bits (which will affect the final precision/error). Then there is the dictionary, which is going to be simply m of n*n sized tiles, each element containing original number of dimensions. So the total storage is going to be:

It’s worth thinking about comparing it to a “break even” ratio, so simply N*N*d. Ignoring the image, dictionary break even is m <= (N/n)^2. Ignoring the dictionary, image break even is:

Seeing those formulas, we can analyze the components one by one.

First of all, we need to make sure that m is small enough so that we actually get any compression (and our dictionary is not simply the input image set 🙂 ). In my experiments I was setting it to 1/32-1/16 of N^2 total tiles of the input for larger tiles, or simply fixed small value for very small tiles.

Then assuming that m * n^2 is small enough, we focus on the image itself. For better compression ratios we would like to have tile size n as large as possible without blowing this earlier constraint – this is a quadratic term. As usually, think about the extreme: a tile 1×1 is the same as no tiling/spatial information at all (still might offer a sparse frame alternative to SVD discussed in my previous post), and large tiles means that our “palette” elements are huge and there are few of them.

Generally, for all the following experiments, I looked only at very significant compression ratios (mentioned dictionary itself being 3-6% size of the input) and visible / relatively large quality loss to be able to compare the error visually.

People more competent in compression will notice lots of opportunities for improvements – like splitting indices from weights, deinterleaving the dictionary atoms, exploiting the uniqueness of atom indices etc. – but again, this is outside of my expertise.

Because of that, I haven’t done any more thorough analysis of the “real” compression ratios, but let’s compare a few results in terms of image quality – both subjective, and objective numbers.

Tile size 1, 3 atoms per pixel, dictionary 32 atoms -> PSNR 34.9

Top: reconstruction, middle: reference, bottom: abs error.
Single pixel element dictionary is not the most fascinating or insightful one. 🙂

This use-case is the simplest, as it is a sparse equivalent of the SVD we have analyzed in my previous post. Instead of dense 6 weights per pixel (to represent 9 features), we can get away with just 3 and just 32 elements in the dictionary, we can a pretty good SNR! Note however the problem (will be visible on those small crops in all of our examples) – tiny green grass islands disappear, as they occupy tiny portion of the data and don’t contribute too much to the error. This is also visible in the dictionary – none of the elements is green. Apart from it, the compressed image looks almost flawless.

Tile size 2, 4 atoms per tile, dictionary 256 atoms -> PSNR 32.16

Top: reconstruction, middle: reference, bottom: abs error.

In this case, we switch to a tiled approach – but with tiny, 2×2 tiles. We have as many weights per tile, as there are pixels per tile – but pixels were originally 8 channel and now we have per atom only a single weight. The PSNR / error are still ok, however the heightmap starts to show some “chunkiness”. In the dictionary, we can slowly start to recognize some patterns – and how some tiles represent only low frequency, while other higher frequency patterns! The exception is the heightmap – it wasn’t correlated too much with the other channels details, which explains errors and lack of high frequency information. A larger dictionary or more atoms per tile would help to decorrelate those.

Tile size 4, 8 atoms per tile, 6% of the image in dictionary -> PSNR 32.76

Top: reconstruction, middle: reference, bottom: abs error.
Note: this is only a subset of the dictionary.

4×4 tiles is an interesting switching point as we can compress dictionary with the hardware BC compression. We’d definitely want that, as the dictionary gets large! Note that we get less atoms per tile than there were pixels in the first place! So it’s not only compressing the channel count, but also less than actual pixel count. The dictionary looks much more interesting – all those tiny lines, edges, textured areas.

Tile size 8, 12 atoms per tile, 6% of the image in dictionary -> PSNR 28.11

Top: reconstruction, middle: reference, bottom: abs error.
Note: This is only a subset of the dictionary.

At this point, the compression got very aggressive (per each 8×8 pixels of the image, we store only 12 weights and indices, that both can be quantized!) – and the tiles are the same size as JPG ones. Still, visually looks pretty good, and the oversmoothing of normals shouldn’t be an issue in practice – but if you look very closely at the height map, you will notice some blocky artifacts. The dictionary itself shows a lot of variety of patches and content type.

Performance – compression and decompression

Why are dictionary based techniques not used more widely? I think the main problem is that the compression step is generally pretty slow – especially the atom fitting step. Per each element and per each atom, we are exhaustively searching for the best match, performing a least squares fit, recomputing the residual, and then searching again. Every operation is relatively fast, but there are many of them! How slow it is? To generate the examples in my post, colab was running in a few minutes per result – ugh. On the other hand, this is a Python implementation, with brute-force linear search. Every single operation in the atom fitting part is trivially parallelizable, and can be accelerated using spatial acceleration structures. I think that it could be made to run a few seconds / up to a minute per a texture set – which is not great, but also not terrible if not part of an interactive workflow, but an asset bake.

How about the decompression performance? Here things are much better. We simply fetch atoms one by one and add their weighted contributions. So similar to SVD based compression, this becomes just a series of multiply-adds and the cost is proportional to the atom count per tile or pixel. We incur a cost of indirection and semi-random access per tile, which can be slow (extra latency), however when decompressing larger tiles, this is very coherent across many neighboring pixels. Also, generally dictionaries are small so should fit in cache. I think it’s not unreasonable to also try to compress multiple similar texture sets using a single dictionary! In such a case when drawing sets of assets, we could be saving a lot on the texture bandwidth and cache efficiency. Specific performance trade-offs depend on the actual tile sizes, and for example whether block compression is used on the dictionary tiles. Generally, my intuition says that it can be made practical (and I’m usually pessimistic) – and Bo Li’s mention of using VQ for shadowmask (de)compression confirms it.

Relationship to neural networks

When you look at the dictionary atoms above, and at visualizations of some common neural networks, there is some similarity.

Neural network filters learned for image classification from a classic paper by Krizhevsky et al.

This resemblance might seem weak, but the network was trained on all kinds of natural images (and for a different task), while our dictionary was fit to a specific texture set. Still, you can observe some similar patterns – appearance of edges, fixed color patches, more highly textured regions. It also resembles some other overcomplete frames like Gabor wavelets. I think of it intuitively that as the network also learns some representation of the image (for the given task), those neural network features are also a overcomplete representation, and however they are not sparse, some regularization techniques can partially sparsify them.

Neural networks have mostly replaced sparse methods as they decoupled the slow training from (relatively) fast interference, however I think there is something beautiful and very “interpretable” about the sparse approaches, and when you have already learned a dictionary and don’t need to iterate, for many tasks they can be attractive and faster.

Conclusions

My post aimed to provide some gentle (not comprehensive, but rather a more intuitive and hopefully inspiring) introduction to sparse and dictionary learning methods and why they can be useful and attractive for some image compression and processing tasks. The only novelty is a proposal of using those for compressing PBR textures (again I want to strongly emphasize the idea of exploiting relationships between the channels – if you compress your PBR textures separately, you are throwing out many bits! 🙂 ), but I hope that you might find some other interesting uses. Compressing animations? Compressing volumetric textures and GI? Filtering of partial GI bakes (where one part of level is baked and we use its atoms to clean up the other one)? Compressing meshes? Let me know what you think!

Posted in Code / Graphics | Tagged , , , , , , , , , , | 4 Comments

Dimensionality reduction for image and texture set compression

Teaser image: Example PBR Texture set compressed with presented technique.Textures credit cc0textures, Lennart Demes.

In this blog post I am going to describe some of my past investigations on reducing the number of channels in textures / texture sets automatically and generally – without assuming anything about texture contents other than correspondence to some physical properties of the same object. I am going to describe the use of Singular Value Decomposition (a topic that I have blogged about before) / Principal Component Analysis for this purpose.

The main use-case I had in mind is for Physically Based Rendering materials with large sets of multiple textures representing different physical material properties (often different sets per different material/object!), but can be extended to more general images and textures – from simple RGB textures, to storing 3D HDR data like irradiance fields.

As usually, I am also going to describe some of the theory behind it, a gentle introduction on a “toy” problems and two examples on simple RGB textures from Kodak dataset, before proceeding to the large material texture set use-case.

Note: There is one practical, significant caveat of this approach (related to GPU block texture compression) and a reason why I haven’t written this post earlier, but let’s not get ahead of ourselves. 🙂

This post comes with three colabs: Part 1 – toy problem, Part 2 – Kodak images, Part 3 – PBR materials.

Intro – problem with storage of physically based rendering materials

I’ll start with a confession – this is a blog post that I was aiming to write… almost 3 years ago. Recently a friend of mine asked me “hey, I cannot find your blog post on materials dimensionality reduction, did you end up writing it?” and the answer is “no”, so now I am going to make up for it. 🙂

The idea for it came when I was still working on God of War and one of the production challenges was disk and memory storage cost for textures and materials. Without going into details on what were the challenges of our specific material system and art workflows, I believe this is a challenge for all contemporary AAA video games – simply physically based rendering type of workflows requires a lot of textures – stressing not only the BluRay disk capacities, but also the gamers are not happy about downloading tens of gigabytes of just patches…

To list a few example different textures used per a single object/material (note: probably no single material used all of them, some of them are rare, but I have seen all in production):

  • Albedo textures,
  • Specularity or specular color textures,
  • Gloss maps,
  • Normal maps,
  • Heightmaps for blending or parallax mapping,
  • Alpha/opacity maps for blending,
  • Subsurface or scattering color,
  • Ambient occlusion maps,
  • Specular cavity maps,
  • Overlay type detail maps.

Uff, that’s a lot! And some of those use more than one texture channel! (like albedo maps that use RGB colors, or normal maps that are usually stored in two or three channels). Furthermore, with increasing target screen resolutions, textures also need to be generally larger. While I am a fan of techniques like “detail mapping” and using masked tilers to avoid the problem of unique large textures everywhere, those cannot cover every case – and the method I am going to describe applies to them as well.

All video games use hardware accelerated texture block compression, and artists manually decide on the target resolution of different maps. It’s time consuming, not very creative, and fighting those disk usage and memory budgets is a constant production struggle – not just for the texture streaming or fitting on a BluRay disk, but also the end users don’t want to download 100s of GBs…

Could we improve that? I think so, but before we go to the material texture set use-case, let’s revisit some ideas that apply to “images” in more general sense.

Texture channels are correlated

Let’s start with a simple observation – all natural images have correlated color channels. “Physical” explanation for it is simple – no matter what is the color of the surface, when it is lit, its color is modulated by the light (in fact by the light spectrum and after interacting with its surface through BRDF, but let’s keep our mental model simple!). This means that all of those color channels will be at least partially correlated, and light intensity / luminance modulates whatever was the underlying physical surface (and vice versa).

What do you mean by “correlated” color channels and how to decorrelate them?

Let’s look at a contrived example – list of points representing pixels in a two color (red and green) channel texture to keep visualization in 2D. In this toy scenario we can observe a high correlation between red and green in the original red/green color space. What does this mean? If the value of green channel brightness is high, it is also very likely that the red channel will is also high. On the following plot, every dot represents a pixel of an image positioned in the red/green space on the left:

Left: scatter plot of points in our example in original, RG space. Right: scatter plot of the same points when “decorrelated” by projecting them onto luminance axis.

On the right plot, I have projected those on the blue line, representing something like luminance (overall brightness).

We can see that from the original signal which was changing a lot on both axis, after projecting onto luminance axis, we end up with significantly less variation and no correlation on the perpendicular “chrominance” (here defined as simple difference of red and green channelvalues) axis. Decorrelated color space means that values of a single channel are not dependent or linearly related to the values of another color channel. (There could be a non-linear relationship, but I am not going to cover it in this post and it becomes significantly more difficult to analyze, requiring iterative or non-linear methods.)

This property is useful for compression, as while the original image needed for example 8 bits for both red and green color channels, in the decorrelated space we can store the other axis with a significantly smaller precision – there is much less variation, so possibly 1-2 bits would suffice, or we could store it in smaller resolution.

This is one of the few reasons why we use formats like YUV for encoding, compressing and sending images and videos. (As a complete offtopic, three other reasons that come to my mind right now are also interesting – human visual perception system being less perceptive to color changes and allowing for more compression of color; chroma varying more smoothly (less high frequencies) in natural images; and finally for now obsolete now historical reasons, where old TVs needed “backwards compatibility” requiring transmitted signal to be possible to view on black&white old TVs ).

Let’s have a look at two images from Kodak dataset in RGB and YUV color spaces. I have deliberately picked one image with many different colors (greenish water, yellow boat, red vests, blue shorts, green and white tshirts), and the other one relatively uniform (warm-green white balance of an image with foliage):

RGB image and Y, U, and V color channels.
RGB image and Y, U, and V color channels.

It’s striking how much of the image content is encoded just in the Y, luminance color channel! You can fully recognize the image from the luminance, while chrominance is much more “ambiguous”, especially on the second picture.

But let’s look at it more methodically and verify – let’s plot the absolute value of the correlation matrices of color channels of those images.

Correlation matrices of two Kodak images / image crops in RGB space. The one with less colors (right) has almost perfectly correlated color channels! The other one (left) still has very strong correlation, but only between some channel pairs.

This plot might deserve some explanation: Value of 1 means fully correlated channels (as expected on the diagonal – variable is fully correlated with itself).

Value of zero means that two color channels don’t have any linear correlation (note – there could be some other correlation, e.g. quadratic!) and the matrices are symmetric, as the red color channel has same correlation with the green as the green with the red one.

First picture, with more colors has significantly less correlation between color channels, while the more uniformly colored one shows very strong correlation.

I claimed that YUV generally decorrelates color channels, but let’s put this to a test:

Absolute value of correlation matrices of two Kodak images / image crops in YUV space. Value of 1 means fully correlated or negatively correlated values (as expected on diagonal – variable is fully correlated with itself). The one with less colors (right) after conversion to YUV has almost no cross channel correlation The other one (left) has some decorrelation, but it is not as strong as the first example.

The YUV color space decorrelated more colorful image to some extent, while the less colorful one got decorrelated almost perfectly.

But let’s also see how much information is in every “new” channel with displaying absolute covariances (one can think of covariance matrix as correlation matrix multiplied by variances):

Covariance matrices of the two YUV converted Kodak images. Covariance is a pretty good measurement of “information” as also shows us how much variation is there in each color channel, and in relationship between them.

In those covariance plots we can verify our earlier intuition/observation that the luminance represents the image content/information pretty well and most of it is concentrated there.

Now, the Y luminance channel is created to roughly correspond to perception of the brightness of the human vision. There are some other color spaces like YCoCg that are designed for more decorrelation (as well as better computational efficiency).

But what if we have some correlation, but not along this single axis? Here’s an example:

Despite strong linear correlation, projecting onto fixed axis only partially decorrelates color channels.

Such a fixed axis doesn’t work very well here – we have somewhat reduced the ranges of variation, but there is still some “wasteful” strong linear correlation…

Is there some transformation that would decorrelate them perfectly while also providing us with the “best” representation for a given subset of components? I am going to describe a few different “angles” that we can approach the problem from.

Line fitting / least squares vs dimensionality reduction?

The problem as pictured above – finding best line fits – is simple. In theory, we could do some least squares linear fit there (similarly to what I described in my post on guided image filters). However, I wanted to strongly emphasize that line fitting is not what we are looking for here. Why so? Let’s have a look at the following plot:

Difference between least squares line fitting and PCA/SVD. The first one seeks smallest sum of squared errors in the original y space, while the latter obtains a line with the smallest sum of squared errors of the projection.

Least squares line fitting will minimize the error and fit for the original y axis (as it is written to minimize the average “residual” error a*x+b – y), while what we are looking for is a projection axis with smallest error when the point is projected on that line! In many cases those lines might be similar, but in some they might be substantially different.

Here’s an example on a single plot (hopefully still “readable”):

Least squares line fitting – blue line – might obtain substantially different results from optimal projection axis – orange line.

For the case of projection, we want to minimize the component perpendicular to the projection axis, while linear line fitting / linear least squares minimize the error measured in the y dimension. Those are very different theoretical (and also practical when the number of dimensions grows) concepts!

In the earlier plot I mentioned name PCA, which stands for Principal Component Analysis. It is a technique for dimensionality reduction – something that we are looking for.

Instead of writing how to implement PCA with covariance matrices etc., (it is quite simple and fun to derive it using Lagrange multipliers, so might write a separate post on it in future) I wanted to look at it from a different perspective – of Singular Value Decomposition, which I blogged about in the past in the context of filter separability

I recommend that post of mine if you have not seen it before – not because of its relevance (or my arrogance), but simply as it describes a very different use of SVD. I think that seeing some very different use-cases of mathematical tools, concepts, and their connections is the best way to understand them, build intuition and potentially lead to novel ideas. As a side note – in linear algebra packages, PCA is usually implemented using SVD solvers.

Representing images as matrices – image doesn’t have to be a width x height matrix!

Before describing how we are going to use SVD here, I wanted to explain how we want to represent N-channel images by matrices.

This is something that was not obvious to me, and even after seeing it, defaulting to thinking about it in different ways took me years. The “problem” is that usually in graphics we immediately think of an image as a 2D matrix – width and height of the image get represented with matrix width and height; simple adding color channels immediately poses a problem, as it requires a 3D tensor. This thinking is very limited, only one of few possible representations and arguably a not very useful one. It took me quite a while to stop thinking about it this way and it is important even for “simple” image filtering (bilateral filter, non-local means etc.).

In linear algebra, matrices represent “any” data. How we use a matrix to represent image data is up to us – but most importantly, up to the task and goal we want to achieve. Representation that we are going to use doesn’t care for spatial positions of pixels – they don’t even have to be placed on an uniform grid! Think of them just as “some samples somewhere”. For such use-cases, we can represent the image as all different pixel positions (or sample indices) on one axis, and color channels on the other axis.

Two equivalent image representations – as a w x h x c 3D tensor, or as a w*h x c 2D matrix.

This type of thinking becomes very important when talking about any kind of processing of image patches, collaborative filtering etc. – we usually “flatten” patch (or whole image) and pack all pixels on one axis. If there is one thing that I think is particularly important and worth internalizing from my post – it is this notion of not thinking of images as 2D matrices / 3D tensors of width x height pixels.

Important note – as long as we remember the original dimensions and sample locations, those two representations are equivalent and we can (and are going to!) transform back and forth between them.

This approach extends to 3D (volumetric textures), 4D (volumetric videos) etc. – we can always pack width * height * depth pixels along one axis.

SVD of image matrices

We can perform the Singular Value Decomposition of any matrix. What do we end up with for the case of our image represented as a w*h x c matrix?

(Note: I tried to be consistent with ordering of matrices and columns / rows, but it’s usually something that I trip over – but in practice it doesn’t matter so much as you can transpose the input matrix and swap the U and V matrices.)

The SVD is going to look like:

Singular value decomposition of our image data. We end up with three separate matrices, U, S, V where U matrix is the same size as the input image, singular value matrix S can be seen as a c x c diagonal matrix (depending on used interpretation there might be more rows, but the rest entries are zero), and a c x c V matrix.

Oh no, instead of a single matrix, we ended up with three different matrices, one of which is as big as the input! How does this help us?

Let’s analyze what those matrices represent and their properties. First of all, matrix U represents all of our pixels in our new color space. We still have C channels, but they have different “meaning”. I used color coding of “some” colors to emphasize that those are not our original colors anymore.

Those new C channels contain different amount of energy / information and have different “importance”. Amount of it is represented in the diagonal S matrix, with decreasing singular values. The first new color channel will represent the largest amount of information about our original image, the next channel less etc. (Note the same used colors as in matrix U).

Finally, there is a matrix V that transforms between the new color channels space, and the old one. Now the most important part – this transformation matrix is orthogonal (and even further, orthonormal). It means that we can think of it as a “rotation” from the old color space, to the new one, and every next dimension provides different information without any “overlap” or correlation. I used here “random” colors, as those represent weights and remappings from the original, RGBA+other channels, to the new vector space.

Now, together matrices T and S inform us about transforming image data into some color space, where channels are orthogonal, and sorted by their “importance”, and where first k components are “optimal” in terms of contained information (see Eckart–Young theorem).

Edit / bonus section: A few readers have commented in a thread on twitter (and on Facebook) that while Eckart-Young is from perspective of linear algebra, usually image compression and stochastic processes analysis use the name Karhunen–Loùve theorem/transform. It is a concept that is new to me (as I have not really worked with neither of those fields) and I have to and will be happy to read more about it. Thanks to everyone for pointing out this connection, always excited to learn some new concepts!

Think of it as rotating input data to a perfectly decorrelated linear space, with dimensions sorted based on their “importance” or contribution to the final matrix / image, and fully invertible – such representation is exactly what we are looking for.

How does it relate to my past post on SVD?

A question that you might wonder is – in my previous post, I described using SVD for analyzing separability and finding approximations of 2D image filters – how does this use of SVD relate to the other?

We have looked in the past at singular components, sorted by their importance, combined with single rows + columns reconstructing more and more of the original filter matrix.

Here we have a similar situation, but our U “rows” are all pixels of the image, every one with single channel, and V “columns” are combinations of the original input channels that transform to this new channel space. Just like before, we can add singular values multiplied by the corresponding vectors one by one, to get more and more accurate representation of the input (no matter whether it’s a filter matrix or a flattened representation of an image). The first singular value and color combination will tell us the most about the image, the next tells a bit less less information etc. Some images (with lots of correlation) have huge disproportion of those singular values, while in some other ones (with almost no correlation) they are going to be decaying very slowly.

This might seem “dry” and is definitely not exhaustive. But let’s try to build some intuition for it by looking at the results of SVD on our “toy” problem. If we run it on toy problem matrix image data as it is, we will immediately encounter a limitation that is easy to address:

Left: original data with a plotted yellow line corresponding to the “luminance” and a green line representing new first “dimension” of the SVD. Center: Same data converted to luminance/chrominance. Right: Same data when transformed by SVD matrix V and S.

The line corresponding to the first singular value passes nicely through the cluster or our data points (unlike simple “luminance”), however clearly doesn’t decorrelate points – we still see some simple linear relationship in the transformed data.

This is a bit disappointing, but there is a simple reason for it – we ran SVD on original data, fitting a line/projection, but also forcing it to go through original zero!

Usually when performing PCA or SVD in the context of dimensionality reduction, we want to remove the mean of all input dimensions, effectively “centering” the data. This way we can “release” the line intercept. Let’s have a look:

Left: original data with a plotted yellow line corresponding to the SVD without mean subtraction, and a green line representing new first “dimension” of the SVD with “centering” of the data before fitting. Center: Uncentered SVD shows some remaining linear correlation and value range skew away from zero. Right: Same data with “centered” SVD shows perfect decorrelation and a significant range reduction of the second dimension, while more information is represented in the first dimension.

As presented, shifting the fit center to the mean of the data, shifts the line fit and achieves perfect “decorrelating” properties. PCA that runs on covariance matrices does this centering implicitly, while in the case of SVD we need to do it manually.

In general, we would probably want to always center our data. For compression-like scenarios it has an additional benefit that we need to compress only the “residual”, so difference between local color, and the average for the whole image, making the compression and decompression easier at virtually no additional cost.

In general, many ML applications go even step further, and “standardize” the data, bringing it to the same range of values by dividing them by their standard deviations. For many ML-specific tasks this makes sense, as for tasks like recognition we don’t really know the importance of ranges of different dimensions / features. For example if someone was writing a model to recognize a basketball player based on their weight, height, and maybe average shot distance, we don’t want to get different answers depending on whether we enter the player’s weight in kilograms vs pounds!

However for tasks like we are aiming for – dimensionality reduction for “compression” we want to preserve the original ranges and meanings, and can assume that they are already scaled based on their importance. We can take it even further and use the “importance” to guide the compression, and I will describe it briefly around the end of my post.

SVD on Kodak images

Equipped with knowledge on how SVD applies to decorrelating image data, let’s try it on the two Kodak pictures that we analyzed using simple YUV transformations.

Let’s start with just visualizing both images in the target SVD color space:

Results of projecting images into a color space resulting from Singular Value Decomposition. Note that it seems to be very similar to the YUV color space!

Results of projection seem to be generally very similar to our original YUV decomposition, with one main difference is that our projected color channels are “sorted”. But we shouldn’t just “eyeball” those, instead we can analyze this decomposition numerically.

Covariance matrices and singular values

Comparing covariance matrices (and square roots of those matrices – just for visualization / “flattening”) shows the real difference between such a pretty good, but “fixed” basis function like YUV, and a decorrelated basis found by SVD:

Absolute value of covariance matrices that I normalized to 1 in the first channel (position 0,0 of each different basis projection) – in addition to regular covariance matrices on the left, I also added a version displaying a square root for dynamic range compression / easier visualization on the right.

As expected, SVD fully decorrelates the color channels. We can also look at how the singular values for those images decay:

Decay of the energy in different channels of YUV conversion and in SVD – in linear and in log scales.

So again, hopefully it’s clear that with SVD we get faster decay of the “energy” contained in different image channels – the first N channels will reconstruct the image better. Let’s look at the visual example.

Visual difference when keeping only two channels

To fully appreciate the difference, let’s look at how the images would look like if we used only two channels – YUV vs SVD and corresponding PSNRs:

Using only two “decorrelated” color channels – YUV vs SVD – visual difference as well as PSNR.

The difference is striking visually, as well as very significant numerically.

This shouldn’t come as a surprise, as SVD finds a reparametrization / channel space that is the “best” projection for thr L2 metric (which is also used for PSNR computations), decorrelates, and sorts this space according to importance.

I wonder why JPEG wouldn’t use such decomposition – I would be very curious to read in the comments! :)I have two guesses of mine: JPEG is an old color format with lots of “legacy” and support in hardware – both encoders, as well as decoders. A single point-wise floating point matrix multiply (where matrix is just 4×3 – 3×3 multiplication and adding back means) seems super cheap to any graphics programmers, but would take a lot of silicon and power to do efficiently in hardware. A second guess is related to the “perceptual” properties of chrominance (and worse human vision sensitivity to it) – SVD doesn’t guarantee us that decomposition will happen in the rough direction of luminance, could be “any” direction.

Edit / addendum: While my remark re JPEG was about the YUV luminance/chrominance decomposition as compared to doing PCA on the color channels, the use of DCT basis for the spatial compression component (instead of mentioned before KLT and eigendecomposition of data directly) is considered “good enough”, and I learned from a reply twitter thread with Per Vognsen and Fabian Giesen about some justifications for its use and another set of connections to stochastic processes.

Now, this use case – decorrelating RGB channels – can be very useful not for standard planar image textures, but also for 3D volumetric colored textures. I know that many folks store and quantize/compress irradiance fields in YUV-like color spaces, and using there SVD will most likely have better quality/compression ratios. This would become even more beneficial for storing sky / ambient occlusion in addition to the irradiance, and one of the further section includes an example AO texture (albeit for the 2D case).

Relationship to block texture compression

If you have worked with GPU “block texture compression” (like DXT / BC color formats) you can probably recognize some of the described concepts – in fact, the simplest BC1 format does a PCA/SVD and finds a single axis, on which colors are projected! I recommend Nathan Reed’s blog post on this topic and a cool visualization.

This is done per a small, 4×4 block and because of that, we can expect much more energy be stored in the single component. This is a visualization of how such BC compression would +/- look like with our toy problem:

The simplest variants of BC compression project colors onto a single line that is found using PCA/SVD approaches and then quantized heavily.

Any blocks that don’t have almost all of information in the first singular value (so are not constant or have more than a single gradient) are going to end up with discoloration and block artifacts – but since we talk about just 4×4 blocks, GPU block compression usually works pretty well.

Block compression will be also pose a problem for the initial use-case and research direction I tried to follow – but more in that later!

What if we had many more channels?! PBR texture sets.

After a lengthy introduction, time to get back to the original purpose of my investigations – PBR texture sets for materials. How would SVD work there? My initial thinking was that if there are many more channels, some of them have to be correlated (as local areas will share physical properties – metal parts might have high specularity and gloss, while diffuse might be rough and more scratched). While such correlations might be only weak, there are many channels, and a lot of potential for also different, non-obvious correlations. Or correlations that are “accidental” and not related to the physical properties. 🙂

Since I don’t have access to any actual game assets anymore, and am not sure about copyright / licensing of many of the scenes included with the engines, I downloaded some texture set from cc0textures by Lennart Demes (such an awesome initiative!). Texture sets for materials there are a bit different than what I was used to in the games I worked on, but are more than enough for demonstration purpose.

I picked one set with some rocks/ground (on GoW, the programmers were joking that half of GoW disk was taken by rock textures, and at some point in the production it wasn’t that far off from true 🙂 ).

The texture set looks like this:

Example PBR texture set created by Lennart Demes that we are going to analyze and “compress” with SVD.

For the grayscale channels I took only single channels, and normals are represented using 3 channels, giving total of 9 channels.

The cool thing is – we don’t really have to “care” or understand what those color channels represent! We could even use 3 color channels to represent the “gray” textures, and the SVD would recognize it automatically (with some singular values ending up exactly zero) but I don’t want to present misleading/too optimistic data.

For visualization, I represented this texture set as horizontally stacked textures, but the first thing we are going to do is to first represent it as 2048 x 1365 x 9 array, and then for purpose of SVD as (2048 * 1365) x 9 one.

If we run the SVD on it (and then reshape it to w x h x 9), we get the following 9 decorrelated new “color”/image channels:

New color channels resulting from the SVD decomposition stacked together. U used data in U matrix multiplied by the S matrix to demonstrate the relative importance of it – feature magnitudes.

Alternatively, if we display it as 3 RGB images:

Same data as above (U times S from the SVD decomposition), but with 9 channels displayed as three RGB textures.

Immediately we can see that information contained in those decays very quickly, last four containing barely any information at all!

We can verify this looking at the plot of singular values:

As seen from the textures earlier, after 5 color channels, we get significant drop off of the information and start observing diminishing returns (“knee” of the cumulative curve).

Let’s have a look at how did SVD remap the input to the output channels and what kind of correlations it found.

In the following plot, I put abs(S.T) (using absolute value in most of my plots, as it’s easier to reason about existence of correlations, not their “direction”):

In this matrix, columns represent input features, and rows represent the output channels/features. Color/magnitude signifies abs() of the contribution. Input channel 0 is AO and contributes to four output features; channels 1,2,3 are albedo and map mostly to first two channels, and a bit to channels 5/6/7 etc.

I marked two input textures – albedo and normals, and as they are a 3-channel representation of the some physical property and we can see how because of this correlation they contribute to very similar output color channels, almost like same “blocks”. We can also observe that for example the last two output features contribute to a little bit of AO, little bit of albedo, and a bit to one channel of normals – comprising some leftover residual error correction in all of the channels.

Now, if we were to preserve only some of the output channels (for data compression), what would be the reconstruction quality? Let’s start with the most “standard” metric, PSNR:

PSNR of the reconsctruction if we take only N channels from the SVD decomposition. Note that even with zero channels, we get non-zero PSNR due to mean-subtraction. Generally PSNRs above ~36dB are considered “ok”. I “clip” above 50dB as such high values don’t make any difference for any perceptual purpose.

I skipped keeping of 9 channels in the plot, as for those we would expect “infinite” PSNR (minus floating point imprecisions / roundoff error).

So generally, for most of the channels, we can get above ~30dB with keeping only 5 data channels – which is what we intuitively expected from looking at the singular values. This is how reconstruction from keeping only those 5 channels looks like:

Top: Original PBR texture set. Bottom: Reconstructed from the first 5 SVD channels. Difference is most visuble on albedo and the roughness.

Main problems I can see is slight albedo discoloration (some rocks lose the desaturated rock scratches) and different frequency content on the last, roughness texture.

Going back to the PSNRs plots, we can see that they are different per channels and e.g. displacement becomes very high quickly, while the color and roughness stay pretty low… Can we “fix” those and increase the importance of e.g. color?

Input channel weighting / importance

I mentioned before that many ML applications would want to “standardize” the data before computing PCA / SVD and make sure range (as expressed by min/max or standard deviations) is the same to make sure we don’t bake in assumption of importance of any features as compared to the others in the analysis.

But what if we wanted to bake our desired importance / weighting into the decomposition? 🙂

Artists do this all the time manually by adjusting the resolutions of different sets of PBR textures to different resolutions; most 3D engine have some reasonable “defaults” there. We can use the same principles to guide our SVD and improve the reconstruction ratio. To do it, we would simply rescale the input channels (after the mean subtraction) by some weights that control the feature importance. For the demonstration, I will rescale the color map to be 200% as important, while reducing the importance of displacement to 75%.

This is how channel contributions and PSNR curves look like before/after:

Channel contributions without and with channel weighting increasing the contribution of albedo (channels 1/2/3) and decreasing the weight of the fourth channel. Notice how with the weighting most of albedo contributions happen in the first singular value, and the 4th channel of displacement gets distributed among few multiple values.
PSNR curves before and after the channel weighting increasing the contribution of albedo (channels 1/2/3) and decreasing the fourth channel. Notice how increase of the reconstruction accuracy of some channels had to lower it for some other ones, similarly global PSNR.

Such simple weighting directly improved some of the metrics we might care about. We had to “sacrifice” PSNR and reconstruction quality in different channels, but this can be a tuneable, conscious decision.

And finally, the visual results for materials reconstructed from just 5 components:

Top: original textures. Middle: SVD decomposition and reconstruction from 5 components without weighting. Bottom: Reconstruction from 5 components that includes increasing the weight of the albedo, while decreasing height map (third displayed texture). We observe the visual error decreasing significantly on the albedo – no discoloration, while increasing significantly for the height map.

Other than some “perceptual” or artist-defined importance, I think that it could be also adjusted automatically – decreasing weight of channels that are reconstructed beyond diminishing returns, and increasing of the ones that get too high error. It could be some semi-interactive process, dynamically adjusting number of channels preserved and the weighting – an area for some potential future research / future work. 🙂

Dropping channels vs smaller channels

So far I have described “dropping channels” – this is really easy to get significant space savings (e.g. 5/9 == almost 50% savings with PSNR in the mid 30s), but isn’t it too aggressive? Maybe we can do the same thing as JPG and similar encoders do – just store the further components at smaller resolutions.

For the purpose of this post and just demonstration, I will do something really “crude” and naive – box filter and subsample (similar to mip-map generation), with nearest neighbor upsampling – we know we can do better, but it was simpler to implement and still demonstrates the point. 🙂

For demonstration I kept the 4 components at the original resolution, then 3 components at half resolution, and the final 2 components at quarter resolution. In total this gives us slightly smaller storage than before (7/16th), but the PSNRs are as follow: Average PSNR 36.96, AO 43.69, Color 41.08, Displacement 29.98. Normal 41.25, Roughness 35.13.

Not too bad for such naive downsampling and interpolation!

When downsampled, images looked +/- the same, so instead here is a zoom-in to just crops (as I expected to see some blocky artifacts):

Top: original textures crops, bottom: crops of textures reconstructed from aggressive decimation of further SVD channels with super-naive bilinear/box filter and NN upsampling.

With the exception of the height map which we have purposefully “downweighted” a lot, the reconstruction looks perceptually very good! Note that this ugly “ringing” and blocky look of it would be significantly better if even simple bilinear interpolation was used.

I’d add that we typically have also a third option – adjusting the quantization and the number of bits we encode each of our channels. This actually makes the most sense (because our channel value ranges get smaller and smaller and we know them directly from the proportions singular values), and is something I’d recommend playing with.

Runtime cost – conversion back to the original channel space

Before I conclude with the biggest limitation and potentially a “deal breaker” of the technique, a quick note on the runtime cost of such decomposition technique.

The decomposed/decorrelated color channels are computed with zero mean and unity standard deviation, so we would generally want to shift them by 0.5 and rescale so that they cover range [0, 1], instead of [-min, max]. Let’s call this a (de)quantization matrix M. So far we have matrices U (pixels x N), S (N x N diagonal), T (N x N, orthogonal), M (we could do N+1 x N to use “homogenous coordinates” and get translations for mean centering for free), input feature weight matrix / vector W, and vector with means A.

We can multiply all of those matrices (other than U which becomes our new image) offline and as a part of precomputation and using homogenous coordinates (N input features + ‘1’ constant in additional channel), we end up with a (N+1 x N) conversion matrix if keeping all channels, or (N x M+1) where M < N if dropping some channels. The full runtime application cost becomes a matrix multiply per pixel – or alternatively, a series of multiply-adds – each reconstructed feature becomes a weighted linear combination of the “compressed” / decorrelated features, so total cost per a single texture channel is M+1 madds.

This is very cheap to do on a GPU without any optimizations (maybe one could use “tensor cores” which are simply glorified mixed-precision matrix-multipliers?), but has a disadvantage that even if we are going to use a single channel (for example heightmap for raymarching or in a depth prepass), we still need to read all the input channels. If you have many passes or use-cases like this, this could elevate the required texture bandwidth significantly, but otherwise could be almost “free” (ALU is generally cheap, especially if doesn’t have any branching that would elevate the used register count).

Note however that it’s not all-or-nothing – if you want you could store some textures as before, and compress and reduce dimensionality of the other ones.

Caveats / limitations – BC texture compression

Time to reveal the reason why I haven’t followed up on this research direction earlier – I imagine that to many of you, it might have been obvious throughout the whole post. This approach is going to generally lose a lot of quality when being compressed with GPU block compression formats.

You remember how I described that the simplest BC formats do some local PCA and then just keep only single principal component? Now, if we purposefully decorrelated our color channels and there is no correlation anymore, such a fit and projection is going to be pretty bad! 😩

Now, it’s not that horrible – after all, we have found global decorrelation for the whole texture. Luckily, this doesn’t mean that there won’t be local (per block) one.

BC formats do this analysis per block (otherwise every image would end up as some form of “colored grayscale”) and there still might be potential for local correlations and not losing all the information with block compression. This is how our color channels to compress look like:

Crops of the colors of analyzed texture transformed to the decorrelated SVD space. Despite global decorrelation, we still observe blocks and areas of constant color or just two colors.

So generally, locally we see mostly flat regions, or combinations of two colors.

This is not any exhaustive test, proof, or analysis, just a single observation, but maybe it won’t be as bad as I thought when I abandoned this idea for a while – but it’s a future research direction! If you have any thoughts, experiments, or would like to brainstorm, let me know in the comments or contact me. 🙂

Summary

In this post we went through quite a few different concepts:

  • Idea of “decorrelating” color channels to maximize information contained in them and remove the redundancy,
  • One alternative possible way how to think about images or image patches for linear algebra applications – as 2D matrices, not 3D tensors,
  • Very brief introduction to use of SVD for computing “optimal” decorrelated channel decompositions (SVD for PCA-like dimensionality reduction),
  • Relationship between the least-squares line fitting and SVD/PCA (they are not the same!),
  • Demonstration of SVD decomposition on a 2-channel, “toy” problem,
  • Relationship of such decomposition to Block Compression formats used on GPUs,
  • Reducing the resolution/quantization/fully dropping “less significant” decorrelated color channels,
  • Comparison of efficacy of SVD as compared to YUV color space conversion that is commonly used for image transmission and compression,
  • Potential application of dimensionality reduction for compressing the full PBR texture sets,
  • And finally, caveats / limitations that come with it.

I hope you found this post at least somewhat interesting / inspiring and encouraging some further experiments with using the multi-channel texture / material cross-channel correlation.

While the scheme I explored here has some limitations for practical GPU applications (compatibility with BC formats), I hope I was able to convince you that we might be leaving some compression/storage benefits unexplored, and it’s worth investigating image channel relationships for dimensionality reduction and potentially some smarter compression schemes.

As a final remark I’ll add that applicability of this technique depends highly on the texture set itself, and how quickly the singular values decay. I have definitely seen more correlated texture sets than the example, but it’s also possible that on some inputs the correlation and technique efficacy might be worse – all depends on your data. But you don’t have to make a decision whether to use it or how many channels to keep manually – by analyzing the singular values as well as the PSNR itself it can be easily deducted from data and the decision automated.

The post comes with three colabs: Part 1 – toy problem, Part 2 – Kodak images, Part 3 – PBR materials which hopefully will encourage verification / reproduction / exploration.

Thanks for reading!

Posted in Code / Graphics | Tagged , , , , , , , , | 9 Comments

“Optimizing” blue noise dithering – backpropagation through Fourier transform and sorting

Introduction

This will be a blog post that is second in an (unanticipated) series on interesting uses of the JAX numpy autodifferentiation library, as well as an extra post in my very old post series on dithering in games and blue noise specifically.

Teaser: we will “optimize” a dithering matrix directly in Fourier domain and backpropagate for gradient descent together with backpropagation through sorting, to obtain specific perceptual, blue-noise dithering properties. The image shows comparison between white, and blue noise dithering.

I am going to describe how we can use auto-differentiation and loss function optimization to perform a precomputation of a dithering pattern directly with a desired Fourier/frequency spectrum and the desired range of covered values by backpropagating through those two (differentiable!) transforms.

While I won’t do any detailed analysis of the pattern properties and don’t claim any “superiority” or comparisons to any of the state of the art (hey, this is a blog, not a paper! 🙂 ), it seems pretty good, highly tweakable, works pretty fast for tileable sizes, and to my knowledge, this idea of generating such patterns directly from spectrum with other (spatial, histogram) loss components is quite fresh.

If you are new to “optimization”, gradient descent, and back-propagation, I recommend at least my previous post just as a starting point, and then if you have some time, reading proper literature on the subject.

This blog post comes with a “scratchpad quality” colab notebook that you can find and play with here. It is quite ugly and unoptimal code, but being as simple as possible and thus a starting point for some other explorations.

Motivation

Motivation for this idea came from one of the main themes that I believe in and pursue in my personal spare time explorations / research:

Offline optimization is heavily underutilized by the computer graphics community and incorporating optimization of textures, patterns, and filters can advance the field much more than just “end-to-end” deep learning alone.

This was my motivation for my past blog post on optimizing separable filters, and this is what guides my interests towards ideas like accelerating content creation or compression using learning based methods.

Given a problem, we might want to simplify it and extract some information from data that is not known in a simple, closed form – this is what machine learning is about. However, if we run the “expensive” learning, optimization and precomputation offline, the runtime can be as fast – or even faster if we simplify the problem – as before and we can still use our domain knowledge to get the best out of it. Some people call it differentiable programming, but I prefer “shallow learning”.

Playing with different ideas for differentiating and optimizing Fourier spectrum for some different problem, I realized – hey, this looks like directly applicable to the good, old blue-noise!

Blue noise for dithering – recap

Let’s get back on-topic! It’s been a while since I looked at blue noise dithering and when writing it, I felt like my blog post mini-series has exhausted this topic to the point of diminishing returns (FWIW I don’t think that blue noise is a silver bullet for many problems; it has some cool properties, but its strength lies in dithering and only extreme low sample counts for sampling).

Before I describe benefits of the blue noise for dithering, have a look at an example dithered gradient from my past blog post:

A linear gradient dithered with some white vs blue noise. Can you guess which one is which? 🙂

Generally – informal definition that is most common in graphics – blue noise is type of noise that in a general sense has no low frequency content. This spectral property is desirable for dithering for the following main reasons:

  • Low frequencies present in the noise cause low frequencies of the error, which are very easily “caught” by the human visual system,
  • With blue noise dithering pattern, we are less likely to “miss” high frequency details of a given range of values. Imagine if low frequency caused “rise” of the pattern in a local spot, while it had high frequencies or information in different range – they would all quantize to a similar value.
  • Finally, with further downstream processing and high resolution displays, high frequency error is much easier to filter out by either image processing, or simply our imperfect eyes, with eye sight or processing reducing the variance of the apparent error.
  • Unlike some predefined patterns like Bayer matrix, blue noise has no visible “artificial” structure, making it both more naturally looking, as well as less prone to amplify some of the artificial frequencies.

All those properties make it a very desireable type of noise / pattern for simple look-up based dithering. (Note for completeness: if you can afford to do error diffusion and optimize the whole image based on the error distribution, it will yield way better results).

“Optimizing” Fourier spectrum

One of the blue noise generation methods that I initially tried for 1D data was “highpass filter and renormalize” by Timothy Lottes. The method is based on a simple, smart idea – highpass filtering followed by renormalizing values for proper range/histogram – and is super efficient to the point of applicability in real-time, but has limited quality potential, as renormalization “destroys” the desired spectral properties. Iterating this method doesn’t give us much – every highpass operation done using finite size convolution has only local support, while the renormalization is global. This imbalance was in my explorations leading to geting stuck in some low quality equilibrium, and after just a few iterations a ping-pong of two exactly same patterns.

If only we knew some global support transform that is able to look directly at spectral content… If only such transform was “tileable”, so the spectra were “circular” and periodic… Hey, this is just plain old Fourier transform!

While in many cases, Fourier being valid on periodic / circular patterns is a nuisance that needs to be addressed; similarly its global support (single coefficient affects the whole image), but here both of those properties work to our advantage! We want our spectral properties to work on toroidal – tiled – domains, as well as we want to optimize such pattern globally.

Furthermore, the Fourier transform is trivially differentiable (every frequency bucket is a sum of exponentials on the input pixels) and we can even directly write a loss function that mixes operations directly on the frequency spectrum (but even a single “pixel” of the spectrum will impact every single pixel of the input image. With such gradient based optimization at every step we can just slightly “nudge” our solution towards desired outcome and maybe won’t get stuck immediately in a low quality equilibrium.

Fourier transform is differentiable and we can backpropagate any loss function defined on the frequency spectrum back to the original spatial domain. Interesting side-effect of the global spatial support is that modifying even single frequency bucket has effect on all of the input samples. Similarly, every input sample contributes to all frequency buckets.

Similarly, if we sort pixels by value (do get ordering that would generate a histogram), we can just push them “higher” and lower and we can back-propagate through sorting.

Similarly, backpropagating loss function defined on sorted values (~a CDF of a histogram) is trivial as it requires only “reindexing” values (remembering which input value index corresponded to given sorted bucket). In this case loss and gradient are more “local” – single value maps back to a single value, but if we optimize for the whole histogram, all values will have some non-zero gradient.

Writing gradient of both of those manually wouldn’t be very hard… But in JAX it is already done for us, we already got both operations implemented as part of the library, and we can JIT the resulting gradient evaluation, so nothing left for us but to design a loss and optimize it!

Fourier domain optimization – whole science fields and prior work

I have to mention here that optimizing frequency spectra, or backpropagating through them is actually very common and it is used all the time in medical imaging, computational imaging, radio / astronomical imaging, and there are types of microscopy and optics based on it. Those are super fascinating fields and it probably takes lifetime to study those properly, and I am complete layman when it comes to those – but before I conclude the section, I’ll link a talk by Katie Bouman on the recent event horizon telescope which also does frequency domain processing (and in a way uses the whole Earth as a such Fourier telescope – no big deal! 🙂 ).

Setting up loss function and optimization tricks

As I mentioned in my previous post, designing a good loss function is the main part of the problem when dealing with optimization-based techniques… (along things like convergence, proofs of optimality, conditioning, and better optimization shemes… but even many “serious” papers don’t really care about those!).

So what would be the loss component of our blue noise function? I came up with the following:

Components of our dithering blue noise loss function:
1. Histogram uniformity. We want some desired histogram shape. For simplicity I used uniform one, but a triangular or Gaussian one might be a better choice for smooth gradients!
2. Frequency content focused on high frequencies, with absence of the low ones.
3. General frequency spectrum uniformity – we don’t want “spikes” similar to e.g. a Bayer pattern.

This is just an example – again, I wanted to focus on the technique, not on the specific results or choices! If you some alternative ideas, please let me know in the comments below 👇.

Again, code for this blog post is available and immediately reproducible here.

There are also many “meta” tuning parameters – what would be our low frequency cutoff point? How do we balance each loss component? What histogram shape we want? The best part – you can try and adjust any of those to taste and to your liking and specific use-case. 🙂

When running optimization on such a compound loss function, I almost immediately encountered getting stuck in some “dreaded” local minima. On the other hand, when optimizing only for the loss sub-components, the optimization was converging nicely. Inspired by the stochastic gradient descent that is commonly known to generalize and converge to better global minima than any batch based / whole dataset optimization, I used this observation to “stochasticize” / randomize the loss function slightly (to help “nudge” the solutions out of the local minima or saddle points).

The idea is simple – scale the loss components by a random factor, different per each gradient update / iteration. Given my “complete” loss function comprised of three sub-components, I scale each one by a random uniform in the range 0-1. This seemed to work very well and help optimization to achieve plausibly looking results without local minima.

As a final step, I run fine-tuning on the histogram only part (could as well just force-normalize, but reused existing code), as without it histograms were “almost” perfect. I think in practice it wouldn’t matter, but why not have a “perfect” unbiased dithering pattern if we can?

Left: Pretty good resulting optimized histogram – result of optimizing for the whole loss – individual component is very close to perfect dithering histogram. Right: A “perfect” histogram after fine tuning (or just renormalization).

Results

Without further ado, after playing for a few tries with different loss components and learning rates, I got a following 64×64 dithering matrix:

And the following spectrum:

One of the components of the loss that is hardest to optimize for is spectrum uniformity. It makes sense to me intuitively – smoothness of the spectrum means that we can propagate values only between ajdacent frequencies, which then affect all pixels and most likely cancel each other out (especially together with other loss components). Most likely a smarter metric (maybe just optimize directly for desired value in buckets? 🙂 ) would yield better results – or better optimizer, or simply running the optimization for longer than a few minutes, but they are not too bad at all! And it’s not 100% clear to me if it is necessary to have perfectly uniform spectrum.

Since those blue noise dithering matrices would be used after tiling for dithering, I tested how they look on 2bit dithering of an RGB image from Kodak dataset:

The difference is striking – which is not surprising at all. It’s even more visible on a 1-but quantized image (though here we can argue that both look unacceptably low quality / “retro” heh):

Still – in the blue noise dithered version you can actually read the text and I think it looks much better! I haven’t compared with any other techniques and not sure if it really matters in practice, especially for something so simple as dithering that has also more expensive, better alternatives (like error diffusion).

Conclusions

I have presented an alternative approach to designing dither matrices – by directly optimizing the energy function on the frequency spectrum, and the value range, and running a gradient descent to optimize for those objectives.

I haven’t really exhausted the topic and there are many interesting follow-up questions:

  • how a triangular or Gaussian noise value PDF affects the optimization?
  • What is the desired frequency cutoff point for dithering?
  • How to balance the loss terms and how to express smoothness more elegantly? Does the smoothness matter?
  • Would better optimization schemes like e.g. momentum based yield even faster convergence?
  • What about extending it to more dimensions? Projection-slice property of Fourier transform acts against us, but maybe we can optimize frequency in 2D, and some other metric in other dimensions?

…and probably many more, that I don’t plan to work more on for now, but I hope my post content and immediately-reproducible Python JAX code that I included will inspire you to some further (and smarter than my brute-force!) experiments.

Post scriptum / digression – optimization vs deep learning – which one is “expensive”?

While writing this post, I contemplated on something quite surprising – a digression on optimization and its relationship with deep learning – something I found counter-intuitive and paradoxically converse views between practitioners and academia: When switching from working in “the industry” on mostly production (and applied research), to working in research groups and interacting with academics, I immediately encountered some opposite views on deep neural networks.

For me, a graphics practitioner, they were cool, higher quality than some previous solutions, but impractical and too slow for most real product applications. At the same time, I have heard from at least some academics that neural networks were super efficient, albeit somewhat sloppy approximations.

What the hell, where does this disconnect come from? Then, reading more about state of the art methods prior to DL revolution, I realized that most of them were using optimization of some energy function on a whole problem and solving a complex, non-linear and huge problem with gradient descent (for example on a non-linear system where every unknown is a pixel of an image!) per every single instance of the problem, and from scratch.

By comparison, artificial neural networks might seem brute-force and wasteful to a practitioner, (per every pixel, hundreds if not thousands/millions of huge matrix multiplies of mostly meaningless weights and “features”, and gigantic waste of memory bandwidth) but the key difference is that the expensive “loss” optimization happens offline. In the runtime/evaluation it is “just” passing data through all the layers and linear algebra 101 arithmetic.

So yeah, our perspectives were very different. 🙂 But I am also sure that if we care about e.g. power consumption, energy efficiency and “operations per pixel”, we can take the idea of optimizing offline much further.

Posted in Code / Graphics | Tagged , , , , , , , , | 2 Comments

Bilinear texture filtering – artifacts, alternatives, and frequency domain analysis

In this post we will look at one of the staples of real-time computer graphics – bilinear texture filtering. To catch your interest, I will start with focusing on something that is often referred to as “bilinear artifacts”, trapezoid/star-shaped artifact of bilinear interpolation – what causes them? I will discuss briefly some common bilinear filtering alternatives and how they fix those, link a few of my favorite papers on (fast) image interpolation, and analyze the frequency response of common cheap filters.

Screenshot from my favorite video game of all time – Deus Ex (2000). See this diamond shape of the pupil? This is what I will refer to as “bilinear artifacts”.

As usual, I will work on the topic in “layers”, starting with basics and “introduction to graphics” level and perspective – analyzing the “spatial” side of the artifact. Then I will (non-exhaustively) the alternatives – bicubic and biquadratic filter, and finally analyze those from a signal processing / EE / Fourier spectrum perspective. I will also comment on relationship between filtering as used in the context of upsampling / interpolation, and shifting images. (Note: I am going to completely ignore here downsampling, decimation, and generally trilinear filtering / mip-mapping.)

Update: Based on a request, I added a link to a colab with plots of the frequency responses here.

What is bilinear filtering and why graphics use it so much?

I assume that most of the audience of my blog post are graphics practitioners or enthusiasts and most of us use bilinear filtering every day, but a quick recap never hurts. Bilinear filtering is a texture (or more generally, signal) interpolation filter that is separable – it is a linear filter applied on the x axis of the image (along the width), and then a second filter applied along the y axis (along the height).

Note on notation: Throughout this post I will use linear/bilinear almost interchangeably due to this property of bilinear filtering being a linear filter applied in the two directions sequentially.

Starting with a single dimension, linear filter is often called a “tent” or “triangle” filter. In the context of input texture interpolation, we can look at it as either “gather” (“when computing an output pixel, which samples contribute to it?”), or “scatter” operation (“for each input sample, what are its contributions to the final result?”). Most of GPU programmers naturally think of “gather” operations, but here this “triangle” part is important, so let’s start with the scatter.

For every input pixel, its contributions to the output signal form a “tent” – a “triangle” of pixel contributions is centered at sample N, with height equal to the sample/pixel N value, and its base spans from the sample (N-1) to (N+1). To get the filtering result we sum all contributing “tents” (there are either 1 or 2 covering each output pixel). All of triangles have the same base length, and all of them overlap with a single other triangle on either side of the top vertex. Here is a diagram that hopefully will help:

(Bi)linear filtering is called triangle or tent filtering as contributions of the input samples to the filtered results are “triangles” or “tents”.

Switching from “scatter” to a more natural (at least to people who write shaders) gather approach – between samples (N+1) and (N) we sum up contributions of all “tents” covering this area, so (N+1) and N, and at exact position of the sample N we get only its contribution.

Formula for this line is simply xn * (t – 1) + xn-1 * t. This means that in 1D each output sample has two contributing input pixels (though sometimes one contribution is zero!), and we linearly interpolate between them. If we repeat this twice per x and y, we end up with four input input samples and four weights (t_x – 1) * (t_y – 1), (t_x ) * (t_y – 1), (t_x – 1) * (t_y ), (t_x) * (t_y). Since the weights are just multiplication of independent x and y weights, we can verify that this filter is separable (if you are still unsure what it means in practice, check out my previous blog post on separable filters).

Bilinear filtering is ubiquitous in graphics and there are a few reasons for it – but the main one is that is super cheap, and furthermore, hardware accelerated; and that it provides significant quality improvement over nearest-neighbor filtering.

All modern GPUs can do bilinear filtering of an input texture in a single hardware instruction. This involves both memory access (together with potentially uncompressing block compressed textures into local cache), as well as the interpolation arithmetic. Sometimes (depending on the architecture and input texture bit depth) the instruction might have slightly higher latency, but unless someone is micro-optimizing or has a very specific workload, the cost of the bilinear filtering can be considered “free” on contemporary hardware.

At the same time, it provides significant quality improvement over “nearest neighbor” filtering (just replicating each sample N times) for low resolution textures.

Nearest neighbor filtering (red) produces discontinuities, while bilinear sampling (blue) produces continuous values between samples.
Bilinear vs nearest neighbor filtering.

Being supported natively in hardware is a huge deal and the (bilinear) texture filtering was one of the main initial features of graphics accelerators (when they were still mostly separate cards, in addition to the actual GPUs).

Personal part of the story – I still remember the day when I saw a few “demo” games on 3dfx Voodoo accelerator on my colleague’s machine (it was a Diamond Monster3D) back in primary school. Those buttery smooth textures and no pixels jumping around made me ask my father over and over again to get us one (at that time we had a single computer at home, and it was my father’s primary work tool) – a dream that came true a bit later, with Voodoo 2.

Side remark – pixels vs filtering, interpolation, reconstruction

One thing that is worth mentioning is that I deliberately don’t want to draw input – or even the output pixels as “little squares” – pixels are not little squares!

Classic tech memo from Alvy Ray Smith elaborates on it, but if you want to to deeper into understanding filtering, I find it extremely important to not make such a “mistake” (I blame “nearest neighbor” interpolation for it).

So what are those values stored in your texture maps and framebuffers? They are infinitely small samples taken at a given location. Between them, there is no signal at all – you can think of your image as “impulses”. Such signal is not very useful on its own, it needs a reconstruction filter that reconstructs a continuous, non discrete representation. Such a reconstruction filter can be your monitor and in this case, pixels could be actually squares (though different LCD monitors have different pixel patterns), or a blurred dot of a diffracted ray in CRT monitor. (This is something that contemporary pixel art gets wrong and Timothy Lottes used to have interesting blog posts about – I think they might have gotten deleted, but his shadertoy showing more correct emulation of a CRT monitor is still there).

For me it is also useful to think of texture filtering / interpolation in a similar way – we are reconstructing a continuous (as in – not discrete/sampled) signal and then resample it, taking measurements / values at new pixel positions from this continuous representation. This idea was one of key components in our handheld mobile super-resolution work at Google, allowing to resample reconstructed signals at any desired magnification level.

Example from published past work that I worked on on how given a continuous reconstruction of a sparse signal (from many locally randomly shifted input frames) can be sampled at different destination grid densities, achieving higher zoom factors with less aliasing and more details.

Bilinear artifacts – spatial perspective – pyramid, and mach bands

Ok, bilinear filtering seems to be “pretty good” and for sure is cheap! So what are those bilinear artifacts?

Again a zoom-in into a screenshot from Deus Ex intro – notice the star-like pupil.

In the post intro I have shown example artifact from a 20 year old video game back when the asset textures were very low resolution, but those happen anytime we interpolate from low resolution buffers. In my Siggraph 2014 talk about volumetric atmospheric effects rendering, I have shown an example of problematic artifacts when upsampling low resolution 3D fog texture:

Volumetric fog bilinear artifacts – notice stair/saw-like look.

In my old talk proposed to solve this problem with additional temporal low-pass effect (temporal jitter + temporal supersampling), because the alternative (bicubic / B-spline sampling – more about it later) was too expensive to apply in real time. Nota bene such a solution is actually pretty good at approximating blurring / filtering for free if you already have a TAA-like framework.

If we get back to our “tent” filter scatter interpretation, it should become easier to see what causes this effect. Separable filtering applies first a 1D triangle filter in one direction, then in another direction, multiplying the weights. Those two 1D ramps multiplied together result in a pyramid – this means that every input pixel will “splat” a small pyramid, and if the ratio of the output to the input resolutions is large and textures have high contrast, then those will become very apparent. Similarly those pyramids on any edge that is not aligned with perfect 45 degrees, will create jaggy, aliased appearance.

Horizontal tent filter followed by a vertical tent filter results in a “pyramid”.

Ok, it’s supposed to be a pyramid, but what’s the deal with this bright star-like lines?

The second “spatial” effect of why bilinear filtering is not great are so-called Mach bands.

This phenomenon is one of the fascinating “artifacts” (or conversely – desired features / capabilities) of human visual system. Human vision cannot be thought of as a “sensor” like in a digital camera or “film” in analog one – it is more like a complicated video processing system, will all sorts of edge detectors, contrast boosting, motion detection, embedded localized tonemapping etc. Mach bands are caused by tendency of HVS to “sharpen” image and emphasize any discontinuity.

Perfectly linear gradient (for example generated by linear interpolation). Do you also see this apparent bright white line on the right side of the gradient?

(Bi)linear interpolation is continuous, however its derivative is not (derivative of a piecewise linear function is a piecewise constant function). In mathematical definition of smoothness and C-continuity we say that it is C0 continuous, but not C1 continuous. I find it fascinating that HVS can detect such discontinuity – detecting features like lines and corners is like differentiation and indeed, most common and basic feature detector in computer vision, Harris Corner Detector analyzes local gradient fields.

To get rid of this effect, we need some filtering and interpolation that has a higher order of continuity. Before I move to some other filters, I have to mention here also a very smart “hack” from Inigo Quilez – by hacking the interpolation coordinates, you can get interpolation weights that are C1 continuous.

Screenshot from: https://www.shadertoy.com/view/XsfGDn, author Inigo Quilez. Using hacked UVs together with hardware bilinear interpolation to get smoother filtering. Be sure to check Inigo’s excellent article.

It’s a really cool hack and works very well with procedural art (where manually computed gradients are used to produce for example normal maps), but produces “rounded rectangles” type of look and aliases in motion, so let’s investigate some other fixes.

Bilinear alternatives – bicubic / biquadratic

The most common alternative considered in graphics is bicubic interpolation. The name itself is in my opinion confusing, as cubic filtering can mean any filtering where we use 4 samples, and filter weights result from evaluating 3rd order polynomials (similarly to linear using 2 samples and a 1st order polynomial). In 2D, we have 4×4 = 16 samples, but we also interpolate separably. However there are many alternative 3rd order polynomials weights that could be used for filtering… This is why I avoid using just the term “bicubic”. Just look at Photoshop image resizing options, does this make any sense!?

Bicubic, or bicubic sharper?  ÂŻ\_(ツ)_/ÂŻ

The most common bicubic filter used in graphics is one that is also called B-Spline bicubic (after reading this post, see this classic paper from Don Mitchell and Arun Netravali that analyzes some other ones – I revisit this paper probably once every two months!). It looks like this:

As you can see, it is way smoother and seems to reconstruct shapes much more faithfully than bilinear! Smoothness and lack of artifacts comes from its higher order continuity – it is designed to have matching and continuous derivatives at each original sample point.

It has a few other cool and useful properties, but my personal favorite one is that because all weights are positive, it can be optimized to use just 4 bilinear samples! This old (but not dated!) article from GPU Gems 2 (and a later blog post from Phill Djonov elaborating on the derivation) describe how to do it with just some extra ALU to compute the look-up UVs. (Similarly in 1D this can be done in just 2 samples – which can be quite useful for low-dimensionality LUTs).

Its biggest disadvantage – it is visibly “soft”/blurry, looks like the whole image got slightly blurred – which is what is happening… Some other cubic interpolators are sharper, but you have to pay a price for that – negative filter weights that can cause halos / ringing, as well as make it impossible to optimize so nicely.

Second alternative that caught my attention and was actually a motivation to finally write something about image resampling was a recent blog post from Thomas Deliot (and an efficient shadertoy implementation from Leonard Ritter) on biquadratic interpolation – continuous interpolation that reconstructs a quadratic function and considers 3 input samples (or 9 in 2d). I won’t take the spotlight from them on how it works, I recommend checking the linked materials, but included an animated comparison – it looks very good, sharper than the bspline bicubic, and is a bit cheaper.

Post update: Won Chun pointed to me a super interesting piece of literature on quadratic filters from Neil Dogson. The link is paywalled, but you will easily find references by googling for “Quadratic interpolation for image resampling”. It derives the quadratic filter with an additional parameter that allows to control filter sharpness similarly to bicubic filters parameters B/C. The one that I used for analysis is the blurriest, “approximating” one. Here is a shadertoy from Won Chun and a corresponding Desmos calculator that proposes also a 2 sample approximation of the (bi)quadratic filter, cool stuff!

Interpolation – shifting reconstructed signal and position dependent filtering

A second perspective that I am going to discuss here is how texture filtering creates different local filters and filter weights depending on the fractional output/input pixel positions relationship.

Let’s consider for now only 1D separable interpolation and the range of 0.0-0.5 (-0.5 to 0.0 or 0.5 to 1.0 can be considered as switching to different sample pair, or symmetric).

With a subpixel offset of 0.0, linear weights will be [1.0, 0.0], and with a subpixel offset of 0.5, they will be [0.5, 0.5]. Those two extremes correspond to a very different image filters! Imagine input filtered with a convolution filter of 1 (this returns just the original one) vs a filter of [0.5, 0.5] – this corresponds to low quality, box blur of the whole image.

To demonstrate this effect, I made a shadertoy that compares bilinear, bicubic bspline, biquadratic, and a sinc filter (more about it later) with dynamically changing offsets.

Same texture bilinearly interpolated. Upper left half of the image includes shift by 0.5 pixels, the other one no (and thus no filtering). I upsampled the resulting image with NN interpolation for magnification.

So with an offset of U or V of 0.5, we are introducing very strong blurring (even more with both shifted – see above), and no blurring at all with no fractional offsets.

Knowing that our effective image filters change based on the pixel “phase” we can have a look at spectral properties and frequency responses of those filters and see what we can learn about those resampling filters.

Bilinear artifacts – variance loss, and frequency perspective

So let’s have a look at filter weights of our 3 different types of filters as they change with pixel phase / fractional offset:

As expected, they are symmetric, and we can verify that they sum to 1. Weights are quite interesting themselves – for example quadratic interpolation with fractional offset of 0.0 has the same filter weights as the linear interpolation with offset of 0.5!

Looking at fractional offset dependent, we can analyze how those weights affect the signal variance, as well as the frequency response (and in effect, frequency content of the filtered images). Variance change resulting from a linear filter is equal to the sum of the squares of all its weights. So let’s plot the effect on input signal variance from those three filters.

Position dependent variance reduction of three different analyzed filters.

With the bilinear filter, the total variance changes (a lot!) with the subpixel position – which will cause apparent contrast loss. Variance and contrast loss due to blending/filtering is an interesting statistical effect, very visible perceptually, and one of my favorite papers on procedural texture synthesis from Eric Heitz and Fabrice Neyret achieves very good results from blending random blended tilings of an input texture mostly “just” by designing a variance preserving blending operator!

Now, moving to the full frequency response of a those filters – for this we can use for example Z-transform commonly used in signal processing (or go directly with complex phasors). This is goes beyond scope of this blog post, (if you are interested in details, let me know in the comments!) but I will present the resulting frequency responses of those 3 different filters:

Frequency responses of linear, quadratic, and bspline at fractional offsets of 0.0, 1/6, 1/3, 1/2. Each plot corresponds to a different fractional coordinate offset.

Looking at those plots we can observe a very different behavior. As we realized above, linear at no fractional offset produces no filtering effect at all (identity response), and both other filters are lowpass filters. Interestingly, no offset is where quadratic interpolation performs the most lowpass filtering, the opposite of both other filters! To be able to reason about the filters in more of isolation, I also plotted each filters frequency response at different phases on a single plot for each:

Frequency responses of linear, quadratic, and bspline at fractional offsets of 0.0, 1/6, 1/3, 1/2. Each plot corresponds to a different filter.

Looking at these plots, we can see that bspline cubic is the most “consistent” between the different fractional offsets, while quadratic is less consistent, but filters out somewhat less of high frequencies (which is part of its sharper look). Linear is the least consistent, going from between a very sharp image, and a very blurry one.

This inconsistency (both in terms of total variance, as well as frequency response – in fact the first can be derived from the latter from Parseval’s theorem) is another look at what we perceive as bilinear artifacts – “pulsing” and inconsistent filtering of the image. Looking at linear image shifted with phase 0.0 and 0.5 is like looking at two very different images. While bicubic bspline over-blurs the input, it is consistent, doesn’t “pulse” with changing offsets and doesn’t create aliased visible patterns. I see the biquadratic filter as kind of a “middle ground” that looks very reasonable and definitely will use it in practice.

In the original eye screenshot texture we can see filtering / variance reduction in action. Pixels close to the interpolated texel origins are sharp (green), some get blurred horizontally (purple), some vertically (yellow), and some both (red). Such an non-uniform effect of variable blurring and variable contrast is yet another take on the bilinear artifacts.

Given how bicubic and similar filters have quite consistent lowpass filtering, some of the blurring effect can be compensated with an additional, sharpening filter. This is what for example my friend Michal Drobot proposed in his talk on TAA for Far Cry 4 (and others have arrived to and used independently) – after resampling, do a global uniform sharpening pass on the whole image to counter some of the blurriness. Given how blurriness is relatively constant, sharpen filter can be even designed to recover the missing input frequencies accurately!

Bilinear / bicubic alternatives – (windowed) sinc

It’s worth noting that resampling filters don’t have to necessarily so heavily blur out the images and their frequency response can be much more high frequency preserving. This is beyond scope of this blog post, but some bicubic filters are actually designed to be slightly sharpening and boosting some of the high frequencies (Catmull-Rom spline – often used in TAA for continuous resampling).

Such built-in sharpening might not be desired, so many filters are designed to be as frequency preserving as possible (windowed sinc like Lanczos). As I mentioned, it is beyond the scope of this blog post, but just for fun, I include example frequency response of a truncated (non windowed!) sinc, as well as final gif comparison. Such a sinc is also featured in my shadertoy.

Unwidowed sinc frequency response at different subpixel offsets – note how there is no dominant lowpass filtering (with exception of last part, close to Nyquist), but we get lots of frequency response ringing and we will observe some oversharpening. Do not use unwindowed sinc in practice!
All filters compared. Note that truncated sinc might be the sharpest, but it looks unacceptably (ugly!) with over-filtering and significantly stronger image contrast. If you’re looking for sharpness, better choice might be some tweaked bicubic filter, or windowed sinc – Lanczos.

For much more complete and comprehensive treatment of filter trade-offs (sharpness, ringing, oversharpening), I again highly recommend this classic paper from Mitchell and Netravali – be sure to check it out.

Conclusions

This post touched on lots of topics and connection between them (I hope that it could be inspiring for you to go deeper into them) – I started with analyzing the most common, simple, ardware accelerated image interpolation filter – bilinear. I discussed its limitations and why it is often replaced (especially when interpolating from very low resolution textures) by other filters like bicubic – which might seem counter-intuitive given that they are much blurrier. I discussed what causes the “bilinear artifacts”, both from the simple spatial filtering perspective, as well as the effect of variance loss / contrast reduction / lowpass filtering. I barely scratched surface of the topic of interpolation, but I had fun writing it (and creating the visualizations), so expect some follow up posts about it in the future!

Posted in Code / Graphics | Tagged , , , , , | 4 Comments

Using JAX, numpy, and optimization techniques to improve separable image filters

By the end of this blog post, we aim to improve the results of my previous blog post of approximating circular bokeh with a sum of 4 separable filters and reducing its artifacts (left) through an offline optimization process to get the results on the right.

In today’s blog post I will look at two topics: how to use JAX (“hyped” new Python ML / autodifferentiation library), and a basic application that is follow-up to my previous blog post on using SVD for low-rank approximations and separable image filters – we will look at “optimizing” the filters to improve the filtered images.

My original motivation was to play a bit with JAX and see if it will be useful for me, and I immediately had this very simple use-case in mind. The blog post is not intended to be a comprehensive guide / tutorial to JAX, nor a complete optimization primer (one can spend decades studying and researching this topic), but a fun example application – hopefully immediately useful, and inspiring for other graphics or image processing folks.

The post is written as separate chapters/sections (problem statement, some basic theory and challenges, practical application, and the results) – feel free to skip ones that seem obvious to you. 🙂

This post comes with a colab that you can run and modify yourself. The code is not very clean, mostly scratchpad quality, but I decided that it’s still good for the others if I share it, no matter how poorly it reflects on my coding habits.

Recap of the problem – separable filters

In the last blog post, we have looked at analyzing if convolutional image filters can be made separable, and if not, finding separable approximations (as a sum of N separable filters). For this we used Singular Value Decomposition and using a low rank approximation by taking first N singular values.

We have found that to approximate a 50×50 circular filter (can be though of as a circular bokeh), one needs ~13 separable passes over the whole image, and even 8 singular components (that have extremely low numerical error) can produce distracting, unpleasant visual artifacts – negative values cause “ringing”, and some “leftover” corner errors.

Unpleasant, unacceptable visual artifacts in an image filtered with a rank 4 approximation of the circular kernel.

I have suggested that this can be solved with optimization, and in this post I will describe most likely the simplest method to do so!

Side note: After my blog post, I had a chat with a colleague of mine, Soufiane Khiat – we used to work together at Ubisoft Montreal – and given his mathematical background much better than mine, he had some cool insights on this problem. One of suggestions was to use singular value thresholding algorithm and some proximal methods and generally it is probably the way to go for the specific problem – but a bit of against “educational” goal of my blog posts (and staying as general as possible). I still recommend reading about this approach if you want to go much deeper into the topic and again – thanks Soufiane for a great discussion.

What is optimization?

Optimization is a loaded term and quite likely I am going to use it in a way you didn’t expect! Most graphics folks understand it as in “low level optimization” (or algorithmic optimization), aka optimizing the computational cost – memory, CPU/GPU usage, total timing spent on computations – and this is how I used to think of term “optimization” for many years.

But this is not the kind of optimization that a mathematician, or most academics think of and not the topic of my blog post!

Optimization that I will use is the one in which you have some function that depends on a set of parameters, and your goal is to find the set of parameters that achieves a certain goal, usually minimum (sometimes maximum, but they can be equivalent in most cases) over this function. This definition is very general – and in theory it even covers also computational performance optimizations (we are looking for a set of computer program instructions that optimizes performance while not diverging from the desired output).

Optimization of arbitrary functions is generally a NP-hard problem (there are no solutions other than exploring every possible value, which is impossible in the case of continuous functions), but under some constraints like convexity, or looking for local minima, it can be made feasible, and relatively robust and fast – and is basis of modern convolutional neural networks, and algorithms like gradient descent.

This post will skip explaining the gradient descent. If it’s a new or vague concept for you, I don’t recommend just trusting me 🙂 – so if you would like to get a good basic, but intuitive and visual understanding, be sure to check this fantastic video from Grant Sanderson on the topic.

What is JAX?

JAX is a new ML library that supports auto-differentiation and just-in-time compilation targeting CPUs, GPUs, and TPUs.

JAX got very popular recently (might be a sampling bias, but seemed to me like half of ML twitter was talking about it), and I am usually very skeptical about such new hype bandwagons. But I was also not very happy with my previous options, so I tried it, and I think it is popular for a few good reasons:

  • It is extremely lean, in the most basic form it is a “replacement” import for numpy.
  • Auto-differentiation “just works” (without any gradient tapes, graphs etc.) over most of numpy and Python constructs like loops, conditionals etc.
  • There are higher level constructs developed on top of it – like NNs and other ML infrastructure, but you can still use the low-level constructs.
  • There is no more weird array reshaping and expressing everything over batches, constantly thinking about shapes and dimensions. If you want to process a batch, just use functions that map your single-example functions over batches.
  • It is actively developed both by open-source, as well as some Google and DeepMind developers, and available right at your fingers with Colab – zero installation needed.

I have played with it for just a few days, but definitely can recommend it and it will become my go-to auto-diff / optimization library for Python.

Optimization 101 – defining a loss function

This might seem obvious, but before we can start optimizing an objective, we have to define it in some way that is understandable for the computer and optimizeable.

What is non-obvious is that coming up with a decent objective function is the biggest challenge of machine learning, IMO a much bigger problem than any particular choice of a technique. (Note: it also has much wider implications than our toy problems; ill-defined objectives can lead to catastrophes in wider social context, e.g. optimizing for engagement time leads to clickbaits, low quality filler content, and fake news).

For our problem – optimizing filters, we know three things that we want to optimize for:

  • Maintain the low rank of the filter,
  • Keep the separable filter similar to the original one,
  • Avoid the visual artifacts.

The first one is the simple – we don’t have much wiggle room there. We have a hard limit on maximum of N separable passes and fixed number of coefficients. But on the other hand, the two other goals are not well defined.

The “similarity” that we are looking for can be anything – average squared error, average absolute error, maximum error, some perceptual/visual similarity… Anything else, or even any combination of the above. Mathematically, we use term of distance, or metric. Convenient and well researched are metrics that are based on p-norm and mathematicians like to operate on squared Euclidean distance (L2 norm), so average squared errors. Average squared error is so often used as it usually has a simple, closed form solutions like linear least squares, linear regression, PCA), but in many cases it might not be the right loss. Defining perceptual similarity is the most difficult and is an open problem in computer vision, with the recent universal approach of using similarity features extracted by neural networks for the purpose of image recognition.

The third one “avoid the visual artifacts” is even more difficult to define, as we are talking about artifacts that are present in the final image, and not about numerical error in the approximated filter. Deciding on components of the loss function to avoid visual artifacts and tuning them is often the most time consuming part of optimization and machine learning.

Left: original filter, right: rank 4 approximation. Notice the negative values (blue color) and some leftover “garbage” values in the corners. Those will contribute to visual artifacts.

Looking at artifacts in the filtered images I think that the two most objectionable ones are: negative values causing ringing, and some “garbage” pixels in the corners of the filter.

Putting it together – target loss function for optimizing separable image filters

Given all that together, I decided on minimizing a loss function that will sum three terms:

  • Squared mean error term,
  • Heavily penalizing any negative elements in the approximated filter,
  • Additional penalty when a zero element of the original filter becomes non-zero after the approximation.

This is a very common approach – summing together multiple different terms with different meanings and finding the parameters that optimize such a sum. Open challenge is tuning the weights of the different components of the loss function, a further section will show the impact of it.

This loss function might not be the best one, but this is what makes such problems fun – often designing a better loss (closer corresponding to our intentions) can lead to significantly better results without changing the algorithm at all! This was one of my huge surprises – so many great papers just propose simple optimization framework together with an improved loss function to advance state of the art significantly.

Also if you think that it is quite “ad-hoc” – yes, it is! Most academic papers (and productionized code…) have such ad-hoc loss functions where each component might have some good motivation, but they end up with a hodge-podge that doesn’t make too much sense when put together, but empirically works (often verified by ablation studies).

In numpy, an example implementation of the loss function might look like:

def loss(target, evaluated_filter):
  l2_term = L2_WEIGHT * np.mean(np.square(evaluated_filter - target))
  non_negative_term = NON_NEGATIVE_WEIGHT * np.mean(-np.minimum(evaluated_filter, 0.0))
  keep_zeros_term = KEEP_ZEROS_WEIGHT * np.mean((target == 0.0) * np.abs(evaluated_filter))
  return l2_term + non_negative_term + keep_zeros_term 

Now that we have a loss function to optimize, we need to find parameters that minimize it. This is where things can become difficult. If we want to have rank 4 approximation of a 50×50 filter, we end up with 4*(50*2) == 400 parameters. If we wanted to do a brute force search in this 400 dimensional space, this could take a very long time! Let’s say we just wanted to evaluate 10 different potential values – this would take 10^400 loss function evaluations – and this is quite a toy problem!

Luckily, we are in a situation where our initial guess obtained through SVD is already kind of ok, and we want to just improve it. This is a classic assumption of “local optimization” and can be achieved through greedily minimizing the error, for example by coordinate descent, or even better gradient descent – going in the direction where the error is decreasing the most. In a general scenario we don’t get any theoretical guarantees of finding the true, best solution, but we are guaranteed that we will get a solution that will be better or at worst the same when compared to the initial one (according to our loss function).

Gradient descent in JAX

How do we compute the gradient of our loss function? For the function that I wrote it is relatively simple and follows calculus 101, but anytime we would change it, we need to re-derive our gradient… Wouldn’t it be great if we could compute it automatically?

This is where various auto-differentiation libraries can help us. Given some function, we can compute its gradient / derivative with regards to some variables completely automatically! This can be achieved either symbolically, or in some cases even numerically if closed-form gradient would be impossible to compute.

In C++ you can use Ceres (I highly recommend it; its templates can be non-obvious at first, but once you understand the basics, it’s really powerful and fast), in Python one of many frameworks, from smaller ones like Autograd to huge ones like Tensorflow or PyTorch. Compared to tf and pt, I wanted something lower level, simpler, and more convenient to use (setting up a graph or gradient tapes is not great 😩 ) – and JAX fills my requirements perfectly.

JAX can be a drop-in replacement to a combo of pure Python and numpy, keeping most of the functions exactly the same! In colab, you can import it either instead of numpy, or in addition to numpy. Here is code that computes our separable filter from list of separable vector pairs, and the loss function.
(note: I will paste some code here, but I highly encourage you to open the colab that accompanies this blog post and explore it yourself.)

import numpy as np
import jax.numpy as jnp

# We just sum the outer tensor products.
# vs is a list of tuples - pairs of separable horizontal and vertical filters.
def model(vs):
  dst = jnp.zeros((FILTER_SIZE, FILTER_SIZE))
  for separable_pass in vs:
    dst += jnp.outer(separable_pass[0], separable_pass[1])
  return dst

# Our loss function. 
def loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight):
  target = model(vs)
  l2_term = l2_weight * jnp.mean(jnp.square(target- REF_SHAPE))
  non_negative_term = non_negativity_weight * jnp.mean(-jnp.minimum(target, 0.0))
  keep_zeros_term = keep_zeros_weight * np.mean((REF_SHAPE == 0.0) * jnp.abs(target))
  return l2_term + non_negative_term + keep_zeros_term 

Note how we mix some simple Python loops and logic in the model function, and numpy-like code (I could have imported jax.numpy as simply np, and you can do it if you want to, but to me such library shadowing feels a bit confusing).

Ok, but so far there is nothing interesting about it; I just kind of rewrote some numpy code as jax.numpy, what’s the big deal?

Now, the “magic” is that you compute the gradient of this function just by writing jax.grad(loss)! By default the gradient is wrt the first function parameter (you can change it if you want). JAX has some limitations and gotchas on what it can compute the gradient of and can require workarounds, but most of them feel quite “natural” (e.g. that PRNG requires explicit state) and I haven’t hit those in practice with my toy example.

Our whole gradient descent step function would look like:

def update_parameters_step(vs, learning_rate, l2_weight, non_negativity_weight, keep_zeros_weight):
  grad_loss = jax.grad(loss)
  grads = grad_loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight)
  return [(param[0] - learning_rate * grad[0], param[1] - learning_rate * grad[1]) for param, grad in zip(vs, grads)]

I was mind-blown how simple it is – no gradient tapes, no graph definitions requires. This is how auto-differentiation should look like from user’s perspective. 🙂 What is even cooler is that the resulting function code can be JIT-compiled to native code for orders of magnitude speed-up! You can place a decorator @jax.jit above your function, or manually create optimized function from jax.jit(function). This makes it really fast and allows you to pick-and-choose what you jit (you can even unroll a few optimization iterations if you want).

Tuning loss function term weights

Time to run an optimization loop on our loss function. I picked a learning_rate of 0.1 and 5000 steps. Those are ad-hoc choices, step count is definitely an overkill, but it just worked and the optimization is pretty fast even on the CPU, so I didn’t bother tweaking them. The whole optimization looks like this:

# Our whole optimization loop.
def optimize_loop(vs, l2_weight, non_negativity_weight, keep_zeros_weight, print_loss):
  NUM_STEPS = 5000
  for n in range(NUM_STEPS):
    vs = update_parameters_step(vs, learning_rate=0.1, l2_weight=l2_weight, non_negativity_weight=non_negativity_weight, keep_zeros_weight=keep_zeros_weight)
    if print_loss and n % 1000 == 0:
      print(loss(vs))
  return vs

Finally we get to the problem of testing different loss function term weights. Luckily, because our problem is small, and jit’d optimization runs very fast, we can actually test it with a few different parameters.

One thing to note is that while we have 3 terms, in fact if we multiply all 3 of them by the same value, we will end up with loss function 3x bigger, but with the same parameters. So we have effectively just 2 degrees of freedom and can keep one of the weights “frozen” – I will set the L2 mean loss as just 1.0 and operate on the weight for the non-negativity and keeping-zeros.

Without further ado, here are our rank 4 separable circular filters after the optimization:

Left to right: keep_zeros_weight of 0.0, 0.3, 0.8.
Top to bottom: non_negativity_weight of 0.0, 0.5, 1.5.

We can notice how those two loss terms affect the results and interact with each other. Non-negativity will very effectively reduce the negative ringing, but won’t address the artifacts in corners of the filter.

The extra penalty of the zero terms becoming non-zero nicely cleans up the corners, but there is still mild ringing around the radius of the filter.

Both of them together reduce all artifacts, but the final shape starts to deviate from our circle, becoming more like a sum of four separate box-filters (which is exactly what is happening)! So it is a trade-off of accuracy, filter shape, and the artifacts. There is no free lunch! 🙂

Results when applied to an real image

Let’s look at the same filters when applied to the same image we have looked at in my previous blog post. Note: I have slightly boosted the gamma function from 7.0 to 8.0 as compared to my previous post to emphasize the visual error.

Left to right: keep_zeros_weight of 0.0, 0.3, 0.8.
Top to bottom: non_negativity_weight of 0.0, 0.5, 1.5.
I highly encourage opening this image in a new tab and zooming in – otherwise the artifacts and differences are hard to notice.

When you zoom in, the difference in artifacts and overall appearance becomes quite obvious. I personally like the most the center picture, and the one below it. Column on the right minimized artifacts the most, especially the bottom-right example, but to me looks too “blocky”.

Can you design a better loss function and parameters for this problem? I am sure you can, feel free to play with the colab! 🙂

Some random ideas that can be fun to evaluate and get better understanding of how loss functions affect the results: penalizing zeros becoming non-zeros more as they get further away from the center, using L1 loss instead of L2 loss, playing with maximum allowed error, optimizing for computational performance by trying to create filter with explicitly zero weights that could get skipped in a shader.

When such simple optimization through gradient descent is going to work?

While I mentioned that optimization is a whole field of mathematics and computer science and beyond scope of any simple blog, in my posts I usually try to give the readers some “intuition” about the topics I write about, and practical tips and tricks, so will mention a few gotchas regarding the optimization.

Gradient descent is so successful and ubiquitous technique (have you heard about this whole field of machine learning through artificial neural networks?!) because of few interesting properties:

First of all, in case of multivariate functions, we don’t need to be strictly convex! There might exist some path in the parameter space that gradient descent can use to “walk in the valleys and around the hills”.

If we are “lucky”, descent can optimize even a non-convex function!

Second related observation is that the more dimensions we have, the easier it is to find some “good” solution (that might not be the total global minimum, but some almost-as good local minimum). Over-completeness and over-parametrization is one of the things that makes very deep networks train well. We are not looking for a single unique perfect solution, but one of many good ones, and the more parameters and dimensions there are, the more (combinatoricly explosive) combinations of parameters can be ok. Think about the example I have visualized above – in 1D we could never “walk around” a locally bad solution, the more dimensions we add, the more paths are there that can lead to improving the results.

It’s one of the areas that are researched in machine learning and called “lottery ticket hypothesis”. I personally found this to initially to be super non-intuitive and was totally mind blown to see how adding fully linear layers to a network can make a huge difference on it converging to good results.

What helps gradient descent a lot is some good initial initialization. This not only increases the chances of convergence, but also the convergence speed. For our filters initializing them with SVD-based low-rank approximation was very helpful – feel free to try if you can get the same results with fully randomized initialization. The worst possible initialization would be something totally-uniform, and with its gradient being same for all variables (or even worse, zero gradient). An example: if optimizing a function that is product of a and b, don’t initialize them both to zero – think how the gradients would look like and what would happen in such case. 🙂

A second practical advice is to keep the learning rate as low as your patience allows for. You avoid the risks of “jumping over” your optimal solution.

Visualization on 1D toy example how wrong learning / optimization rate can cause lack of correct convergence.
Red: too large learning / optimization rate can jump over the global maximum!
Green: Lower learning rate is “safer” for local optimization.

Too high of a learning rate can make your gradients “explode” and jump completely out of reasonable parameter space. Probably everyone who ever played with gradient descent saw loss or parameters becomeing NaNs or infinity at some point and then debugged why. There can be many causes, from a wrong loss function, through bad behaving gradients (derivatives of certain function can get huge; e.g. sqrt around zero), but very often it is too large of learning rate. 🙂 If your optimization is too slow, instead of increasing the optimization rate, try momentum-based or higher order optimization – both are actually trivial to code with JAX, especially for such small use-cases!

Conclusions

In this blog post, we have looked at incorporating some simple offline optimization using numpy and JAX to our programming toolset. We discussed difficulties of designing a “good” loss function and then tuning it, and applied it to our problem of producing good separable image filters. If you do any kind of high performance work, don’t be deterred from machine learning – you don’t have to do it in real time and on the device, but can use it offline. For example of optimizing/computing offline approximations that can make something feasible in real time, check out one of papers that finally got out (I contributed to a longer while ago) – real time stylization through optimizing discrete filters.

My personal final conclusion is that I want more of such simple, yet expressive and powerful libraries like JAX. The less boilerplate the better! 🙂

Posted in Code / Graphics | Tagged , , , , , , , , , , | 5 Comments

Separate your filters! Separability, SVD and low-rank approximation of 2D image processing filters

In this blog post, I explore concepts around separable convolutional image filters: how can we check if a 2D filter (like convolution, blur, sharpening, feature detector) is separable, and how to compute separable approximations to any arbitrary 2D filter represented in a numerical / matrix form. I’m not covering any genuinely new research, but think it’s a really cool, fun, visual, interesting, and very practical topic, while being mostly unknown in the computer graphics community.

Throughout this post, will explore some basic use of Singular Value Decomposition, one of the most common and powerful linear algebra concepts (used in all sub-fields of computer science, from said image processing, through recommendation systems, ML, to ranking web pages in search engines). We will see how it can be used to analyze 2D image processing filters – check if they are separable, approximate the non-separable ones, and will demonstrate how to use those for separable, juicy bokeh.

Note: this post turned out to be a part one of a mini-series, and has a follow-up post here – be sure to check it out!

Post update: Using separable filters for bokeh approxmation is not a new idea – Olli Niemitalo pointed out this paper “Fast Bokeh effects using low-rank linear filters” to me, which doesn’t necessarily feature any more details on the technique, but has some valuable timings/performance/quality comparisons to the stochastic sampling, if you are interested in using it in practice, check it out – along my follow-up post.

Intro / motivation

Separable filters are one of the most useful tools in image processing and they can turn algorithms from “theoretical and too expensive” to practical under the same computational constraints. In some other cases, ability to use a separable filter can be the tipping point that makes some “interactive” (or offline) technique real-time instead.

Let’s say we want to filter an image – sharpen it, blur, maybe detect the edges or other features. Given a 2D image filter of size MxN, computing the filter would require MxN independent, sequential memory accesses (often called them “taps”), accompanied by MxN multiply-add operations. For large filters, this can get easily prohibitively expensive, and we get quadratic scaling with the filter spatial extent… This is where separable filters can come to the rescue.

If a filter is separable, we can decompose such filter into a sequence of two 1D filters in different directions (usually horizontal, and then vertical). Each pass filters with a 1D filter, first with M, and then the second pass with N taps, in total M+N operations. This requires storing the intermediate results – either in memory, or locally (line buffering, tiled local memory optimizations). While you pay the cost of storing the intermediate results and synchronizing the passes, you get linear and not quadratic scaling. Therefore, typically for any filter sizes larger than ~4×4 (depends on the hardware, implementation etc) using separable filters is going to be significantly faster than the naive, non-separable approach.

But how do we find separable filters?

Simplest analytical filters

Two simplest filters that are bread and butter of any image processing are the box filter, and the Gaussian filter. Both of them are separable, but why?

The box filter seems relatively straightforward – there are two ways of thinking about it: a) if you intuitively think about what happens if you just “smear” a value across horizontal, and then vertical axis, you will get a box; b) mathematically, there are two conditions on the value of the filter, there are two conditions on both dimensions that are independent:

f(x, y) = select(abs(x) < extent_x && abs(y) < extent_y), v, 0)
f(x, y) = select(abs(x) < extent_x, v, 0) * select(abs(y) < extent_y, 1, 0) 

This last product is the equation of separability – our 2D function is a product of a horizontal, and a vertical function.

Gaussian is a more interesting one, and it’s not immediately obvious (“what does the normal distribution have to do with separability?”), so let’s write out the maths.

For simplicity, I will use a non-normalized Gaussian function (usually we need to re-normalize after the discretization anyway) and assume standard deviation of 1/sqrt(2) so that it “disappears” from the denominator.

f(x, y) = exp(-sqrt(x^2 + y^2)^2) // sqrt(x^2 + y^2) is distance of filtered pixel
f(x, y) = exp(-(x^2 + y^2))       // from the filter center.
f(x, y) = exp(-x^2 + -y^2)
f(x, y) = exp(-x^2) * exp(-y^2)

The last step follows from the “definition” of exponential (exponential of sum is a product of exponentials), and afterwards we can see that again, our 2D filter is product of two 1D Gaussian filters.

Gaussian filter matrix

“Numerical” / arbitrary filters

We started with two trivial, analytical filters, but what do we do if we have an arbitrary filter given to us in a numerical form – just a matrix full of numbers?

We can plot them, analyze them, but how do we check if given filter is separable? Can we try to separate (maybe at least approximately) any filter?

Can you make this filter separable? Spoiler: yes, it’s just the Gaussian above, but how do we tell?

Linear algebra to the rescue

Let’s rephrase our problem in terms of linear algebra. Given a matrix MxN, we would like to express it as a product of two matrices, Mx1 and 1xN. Let’s re-draw our initial diagram:

There are a few ways of looking at this problem (they are roughly equivalent – but writing them all out should help build some intuition for it):

  • We would like to find a row Mx1 that can be replicated with different multipliers N times giving us original matrix,
  • Another way of saying the above is that we would like our MxN filter matrix rows/columns to be linearly dependent,
  • We would like our source matrix to be a matrix with rank of one,
  • If it’s impossible, then we would like to get a low-rank approximation of the matrix MxN.

To solve this problem, we will use one of the most common and useful linear algebra tools, Singular Value Decomposition.

Proving that it gives “good” results (optimal in least-squarest sense, at least for some norms) is beyond the scope of this post, but if you’re curious, check out the proofs on the wikipedia page.

Singular Value Decomposition

Specifically, the singular value decomposition of an m\times n real or complex matrix \mathbf {M}  is a factorization of the form {\displaystyle \mathbf {U\Sigma V^{*}} }, where \mathbf {U}  is an m\times m real or complex unitary matrix,\mathbf{\Sigma} is an m\times nrectangular diagonal matrix with non-negative real numbers on the diagonal, and \mathbf {V}  is an n\times n real or complex unitary matrix. If \mathbf {M}  is real, \mathbf {U}  and {\displaystyle \mathbf {V} =\mathbf {V^{*}} } are real orthonormal matrices.

Wikipedia

How does SVD help us here? We can look at the final matrix as a sum of many matrix multiplications, each of a single row/column from the U and V matrix, multiplied by the diagonal entry from the matrix E. SVD diagonal matrix E contains entries (called “singular values”) on the diagonal only and they are sorted from the highest value, to the lowest. If we look at the final matrix as a sum of different rank 1 matrices, weighted by their singular values, the singular values correspond to proportion of the “energy”/”information” of the original matrix contents.

SVD decomposes any matrix into three matrices. The middle one is purely diagonal one, and products of the values of the diagonal with columns/rows of the other matrices create multiple rank1 matrices. If we add all of them together, we are going to get our source matrix back.

If our original matrix is separable, then it is rank 1, and we will have only single singular value. Even if they are not zero (but significantly smaller), and we truncate all coefficients except for the first one, we will get a separable approximation of the original filter matrix!

Note: SVD is a concept and type of matrix factorization / decomposition, not an algorithm to compute one. Computing SVD efficiently and numerically stable is generally difficult (requires lots of careful consideration) and I am not going to cover it – but almost every linear algebra package or library has it implemented, and I assume we just use one.

Simple, separable examples

Let’s start with our two simple examples that we know that are separable – box and Gaussian filter. We will be using Python and numpy / matplotlib. This is just a warm-up, so feel free to skip those two trivial cases and jump to the next section.

import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage
plt.style.use('seaborn-whitegrid')

x, y = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
box = np.array(np.logical_and(np.abs(x) < 0.7, np.abs(y) < 0.7),dtype='float64') 
gauss = np.exp(-5 * (x * x + y * y))
plt.matshow(np.hstack((gauss, box)), cmap='plasma')
plt.colorbar()

Let’s see what SVD will do with the box filter matrix and print out all the singular values, and and the U and E row / column associated with the first singular value:

U, E, V = np.linalg.svd(box)
print(E)
print(U[:,0])
print(V[0,:])
[34.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00]
[0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00]
[0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00]

Great, we get only one singular value – which means that it is a rank 1 matrix and fully separable.

On the other hand, the fact that our columns / vectors are negative is a bit peculiar (it is implementation dependent), but the signs negate each other; so we can simply multiply both of them by -1. We can also get rid of the singular value and embed it into our filters, for example bymultiplying both by sqrt of it to make them comparable in value:

[-0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00]
[-0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00]

Apart from cute -0.00s, we get exactly the separable filter we would expect. We can do the same with the Gaussian filter matrix and arrive at the following:

[0.01 0.01 0.01 0.02 0.03 0.04 0.06 0.08 0.10 0.14 0.17 0.22 0.27 0.33
 0.40 0.47 0.55 0.63 0.70 0.78 0.84 0.90 0.95 0.98 1.00 1.00 0.98 0.95
 0.90 0.84 0.78 0.70 0.63 0.55 0.47 0.40 0.33 0.27 0.22 0.17 0.14 0.10
 0.08 0.06 0.04 0.03 0.02 0.01 0.01 0.01]
[0.01 0.01 0.01 0.02 0.03 0.04 0.06 0.08 0.10 0.14 0.17 0.22 0.27 0.33
 0.40 0.47 0.55 0.63 0.70 0.78 0.84 0.90 0.95 0.98 1.00 1.00 0.98 0.95
 0.90 0.84 0.78 0.70 0.63 0.55 0.47 0.40 0.33 0.27 0.22 0.17 0.14 0.10
 0.08 0.06 0.04 0.03 0.02 0.01 0.01 0.01]

Now if we take an outer product (using function np.outer()) of those two vectors, we arrive at filters almost exactly same as the original (+/- the numerical imprecision differences).

Gaussian and its rank1 approximation – unsurprisingly they look exactly the same.

Non-separable filters

“D’oh, you took filters that you know are separable, fed through some complicated linear algebra machinery and verified that yes, they are separable – what a waste of time!” you might think – and you are right, it was a demonstration of the concept, but has so far not too much practical value. So let’s look at a few more interesting filters – ones that we know are not separable, but maybe are almost separable?

I picked four examples: circular, hexagon, exponential/Laplace distribution, and a difference of Gaussians. Circular and hexagon are interesting as they are very useful for bokeh; exponential/Laplace is somewhat similar to long-tailed BRDF like GGX, and can be used for some “glare”/veil type of effects, and the difference of Gaussians (laplacian of Gaussians) is both a commonly used band-pass filter, as well as can be used for a busy, “donut”-type bokeh.

circle = np.array(x*x+y*y < 0.8, dtype='float64')
exponential = np.exp(-3 * np.sqrt(x*x+y*y))

hexagon = np.array(np.logical_and(np.logical_and(np.logical_and(np.logical_and(np.logical_and(np.logical_and(np.logical_and(x+y < 0.9*np.sqrt(2), x+y > -0.9*np.sqrt(2)),x-y < 0.9*np.sqrt(2)),-x+y < 0.9*np.sqrt(2)),x<0.9),x>-0.9),y<0.9),y>-0.9), dtype='float64')

difference_of_gaussians = np.exp(-5 * (x*x+y*y))-np.exp(-6 * (x*x+y*y))
difference_of_gaussians /= np.max(difference_of_gaussians)

plt.matshow(np.vstack((np.hstack((circle, hexagon)), np.hstack((exponential, difference_of_gaussians)))), cmap='plasma')
plt.colorbar()

Now, if we try to do a rank 1 approximation by taking just the first singular value, we end up with results looking like this:

They look terrible, nothing like the original ones! Ok, so now we know they are not separable, which concludes my blog post…

I’m obviously kidding, let’s explore how “good” are those approximations and how to make them better. 🙂

Low~ish rank approximations

I mentioned that singular values represent percentage of “energy” or “information” in the original filter. If the filter is separable, all energy is in the first singular value. If it’s not, then it will be distributed among multiple singular values.

We can plot the magnitude of singular values (I normalized all the plots so they integrate to 1 like a density function and don’t depend on how bright a filter is) to see what is going on. Here are two plots: of the singular values, and “cumulative” (reaching 1 means 100% of the original filter preserved). There are 50 singular values in a 50×50 matrix, but after some point they reach zero, so I truncated the plots.

The second plot is very informative and tells us what we can expect from those low rank approximations.

To reconstruct fully the original filter, for difference of Gaussians we need just two rank 1 matrices (this is not surprising at all! DoG is difference of two Gaussians, separable / rank 1 filters), for Laplacian we get very close with just 2-3 rank 1 matrices. On the other hand, for circular and hexagonal filter we need many more, to reach just 75% accuracy we need 3-4 filters, and 14 to get 100% accurate representation…

How do such higher rank approximations look like? You can think of them as sum of N independent separable filters (independent vertical + horizontal passes). In practice, you would probably implement them as a single horizontal pass that produces N outputs, followed by a vertical one that consumes those, multiplies them with appropriate vertical filters, and produces a single output image.

Note 1: In general and in most cases, the “sharper” (like the immediate 0-1 transition of hex or a circle) or less symmetric a filter is, the more difficult it will be to approximate it. In a way, it is similar to other common decomposition, like the Fourier transform.

Note 2: You might wonder – if a DoG can be expressed as a sum of two trivially separable Gaussians, why SVD rank 1 approximation doesn’t look like one of the Gaussians? The answer is that with just a single Gaussian, the resulting filter will be more different from the source than our separable weird 4-dot pattern. SVD decomposes so that each progressive low rank decomposition is optimal (in L2 sense), in this case it means adding two non-intuitive, “weird” separable patterns together, where the second one contributes 4x less than the first one.

Low-rank approximations in practice

Rank N approximation of an MxM filter would have performance cost of O(2M*N), but additional memory cost of N * original image storage (if not doing any common optimizations like line buffering or local/tiled storage). If we need to approximate 50×50 filter, then even taking 10 components could be “theoretically” worth it; and for much larger filters it would be also worth in practice (and still faster and less problematic than FFT based approaches). One thing worth noting is that in the case of presented filters, we reach the 100% exact accuracy with just 14 components (as opposed to full 50 singular values!).

But let’s look at more practical, 1, 2, 3, 4 component approximations.

def low_rank_approx(m, rank = 1):
  U,E,V = np.linalg.svd(m)
  mn = np.zeros_like(m)
  score = 0.0
  for i in range(rank):
    mn += E[i] * np.outer(U[:,i], V[i,:])
    score += E[i]
  print('Approximation percentage, ', score / np.sum(E))
  return mn

Even with those maximum four components, we see a clear “diminishing returns” effect, and that the differences get smaller between 3 and 4 components as compared to 2 and 3. This is not surprising, since our singular values are sorted descending, so every next approximation will add less of information. Looking at the results, they might be actually not too bad. We see some “blockiness”, and even worse negative values (“ringing”), but how would that perform on real images?

Visual effect of low-rank approximation – boookeh!

To demonstrate how good/bad this works in practice, I will use filters to blur an image (from the Kodak dataset). On LDR images the effects are not very visible, so I will convert it to fake HDR by applying a strong gamma function of 7, doing the filtering, and inverse gamma 1/7 to emulate tonemapping. It’s not the same as HDR, but does the job. 🙂 This will also greatly amplify any of the filter problems/artifacts, which is what we want for the analysis – with approximations we often want to understand not just the average, but also the worst case.

Left: source image, middle: image blurred with a circular kernel and no gamma, right: the same image blurred with the same kernel, but in fake-HDR space to emphasize the filter shape.
def fake_tonemap(x):
  return np.power(np.clip(x, 0, 1), 1.0/7.0)
def fake_inv_tonemap(x):
  return np.power(np.clip(x, 0, 1), 7.0)
def filter(img, x):
  res = np.zeros_like(img)
  for c in range(3):
    # Note that we normalize the filter to preserve the image brightness.
    res[:,:,c] = fake_tonemap(scipy.ndimage.convolve(fake_inv_tonemap(img[:,:,c]), x/np.sum(x)))
  return res
def low_rank_filter(img, x, rank = 1):
  final_res = np.zeros_like(img)
  U, E, V = np.linalg.svd(x / np.sum(x))
  for c in range(3):
    tmd = fake_inv_tonemap(img[:,:,c])
    for i in range(rank):
      final_res[:,:,c] += E[i] * scipy.ndimage.convolve(scipy.ndimage.convolve(tmd, U[:, i, np.newaxis]), np.transpose(V[i,:, np.newaxis]))
    final_res[:,:,c] = fake_tonemap(final_res[:,:,c])
  return final_res

Without further ado, here are the results.

Unsurprisingly, for the difference of Gaussians and exponential distribution, two components are enough to get identical filtered result, or very close to.

For the hexagon and circle, even 4 components show some of the negative “ringing” artifacts (bottom of the image), and one needs 8 to get artifact-free approximation. At the same time, we are using extreme gamma to show those shapes and emphasize the artifacts. Would this matter in practice and would such approximations be useful? As usually, it depends on the context of the filter, image content, further processing etc. I will put together some thoughts / recommendations in the Conclusions section.

A careful reader might ask – why not clamp filters to not be negative? The reason is that low rank approximations will often locally “overshoot” and add more energy in certain regions, so the next rank approximation component needs to remove it. And since we do it separably and accumulate the filtered result (the whole point of low rank approximation!) we don’t know if the sum will be locally negative or not. We could try to find some different, guaranteed non-negative decomposition, but it’s a NP-hard problem and only some approximations / heuristics / optimization-based solutions exist. Note: I actually wrote a blog post on using optimization to solve (or at least minimize) those issues as a follow-up, be sure to check it out! 🙂

Limitation – only horizontal/vertical separability

Before I conclude my post, I wanted to touch two related topics – one being a disadvantage of the SVD-based decomposition, the other one being a beneficial side-effect. Starting with the disadvantage – low-rank approximation in this context is limited to filters represented as matrices and strictly horizontal-vertical separability. This is very limited, as some real, practical, and important applications are not covered by it.

Let’s analyze two use-cases – oriented, orthogonal separability, and non-orthogonal separability. Anisotropic Gaussians are trivially separable with two orthogonal filter – along the principal direction, and perpendicular to it. Similarly, skewed box blur filters are separable, but in this case we need to blur in two non-orthogonal directions. (Note: this separability is the basis of classic technique of fast hexagonal blur developed by John White and Colin Barre-Brisebois).

Such filtering can be done super-fast on the GPU, often almost as fast as the horizontal-vertical separation (depending if using texture sampler or compute shaders / local memory), yet the SVD factorization will not be able to find it… So let’s look at how the low rank horizontal-vertical approximations and accumulation of singular values look like.

From left to right: original, rank1, rank2, rank4, rank8.

Overall those approximations are pretty bad if you consider that they are trivially separable in those different direction. You could significantly improve them by finding the filter covariance principal components and rotate it to be axis aligned, but this is an approximation, requires filter resampling and complicates the whole process… Something to always keep in mind – as automatic linear algebra machinery (or any ML) is not a replacement for careful engineering and using the domain knowledge. 🙂

Bonus – filter denoising

My post is already a bit too long, so I will only hint at an interesting effect / advantage of SFV – how SVD is mostly robust to noise, and can be used for the filter denoising. If we take a pretty noisy Gaussian (Gaussian maximum of 1.0, noise standard deviation of 0.1), and do a simple rank1 decomposition, we get a significantly cleaner representation of it (albeit with horizontal and vertical “streaks”).

This approximation denoising efficacy will depend on the filter type; for example if we take our original circular filter (which is as we saw pretty difficult to approximate), the results are not going to be so great, as more components that would normally add more details, will add more noise back as well.

Approximations of noisy circular filter – original, rank1, rank2, rank4, rank8. The more components we add, the more noise comes back.

Therefore, there exists a point at which higher rank approximations will only add noise – as compared to adding back information/details. This is closely related to the bias-variance trade-off concept in machine learning, and we can see this behavior on the following plot showing normalized sum of squared error of the noisy circular filter next rank approximations (error is computed against ground truth, of non-noisy filter).

When we add next ranks of approximation to the noisy filter, we initially observe the error is decreasing, but then it starts to increase, as we don’t add more information and original structure, but only noise.

It is interesting to see on this plot that the cutoff point in this case is around rank 12-13 – while 14 components were enough to perfectly reconstruct the non-noisy version of the filter.

Left: clear circular filter, middle: circular filter with noise added, right: rank 12 approximation of the noisy circular filter – notice perfect circle shape while significantly less noise than the middle one!

Overall this property of SVD is used in some denoising algorithms (however decomposed matrices are not 2D patches of an image; instead they usually are MxN matrices where M is a flattened image patch, representing all pixels in it as a vector, and N corresponds to N different, often neighboring or similar image patches), which is a super interesting technique, but way outside of the topic of this post.

Conclusions

To conclude, in this blog post we have looked at image filter/kernel separability – its computational benefits, simple and analytical examples, and shown a way of analyzing if a filter can be converted into a separable (rank 1) representation through Singular Value Decomposition.

As we encountered some examples of non-separable filters, we analyzed what are the higher rank approximations, how we can compute them through sum of many separable filters, how do they look like, and if we can apply them for image filtering using an example of bokeh.

I love the scientific and analytical approach, but I’m also an engineer at heart, so I cannot leave such blog post without answering – is it practical? When would someone want to use it?

My take on that is – this is just another tool in your toolbox, but has some immediate, practical uses. Those include analyzing and understanding your data, checking if some non-trivial filters can be separable, if they can be approximated despite the noise / numerical imprecision, and finally – evaluate how good the approximation is.

For non-separable filters, low rank approximations where the rank is higher than one can often be pretty good with just a few components and I think they are practical and usable. For example, our rank 4 approximation of the bokeh filters was not perfect (artifacts), but computationally way cheaper than non-separable filter, it parallelizes perfectly/trivially, and is very simple implementation-wise. In the past I referenced a very smart solution from Olli Niemitalo to approximate circular bokeh using complex phasors – and low-rank approximation in the real domain covered in this blog post is some simpler interesting alternative. Similarly, it might not produce perfect hexagonal blur like the one constructed using multiple skewed box filters, but the cost would be similar, and the implementation is so much simpler that it’s worth giving it a try.

I hope that this post was inspiring for some future explorations, understanding of the problem of separability, and you will be able to find some interesting, practical use-cases!

Posted in Code / Graphics | Tagged , , , , , , , , , , , | 9 Comments

Analyze your own activity data using Google Takeout – music listening stats example

The goal of this post is to show how to download our own data stored and used by internet services to generate personalized stats / charts like below and will show step-by-step how to do it using colab, Python, pandas, and matplotlib.

Disclaimer: I wrote this post as a private person, not representing my employer in any way – and from a 100% outside perspective.

Intro / motivation

When we decide to use one particular internet service over the other, it is based on many factors and compromises – we might love our choices, but still find ourselves missing certain features.

Seeing “year in review” from Spotify shared by friends made me envious. Google Play Music that I personally use doesn’t have such a cool feature and didn’t give me an opportunity to ponder and reflect on the music I listened to in the past year (music is a very important part of my life).

Luckily, the internet is changing and users, governments, as well as internet giants are becoming more and more aware of what it means for a user to “own their data”, request it, and there is a need for increased transparency (and control/agency over data we generate while using those servies). Thanks to those changes (materializing themselves in improved company terms of service, and regulatory acts like GDPR or California Consumer Privacy Act), on most websites and services we can request or simply download the data that those collect to operate.

Google offers such a possibility in a very convenient form (just a few clicks) and I used my own music playing activity data together with Python, pyplot, and pandas to create a personalized “year in review” chart(s) and given how cool and powerful, yet easy it is, decided to share it in a short blog post.

This blog post comes with code in form of a colab and you should be able to directly analyze your own data there, but more importantly I hope it will inspire you to check out what kind of other user data your might have a creative use for!

Downloading the data

The first step is trivial – just visit takeout.google.com.

Gotcha no 1: Amount of different Google services is impressive (if not a bit overwhelming) and it took me a while to realize that the data I am looking for is not under “Google Play Music“, but under “My Activity”. Therefore I suggest to “Deselect all”, and then select “My Activity”. Under it, click on “All activity data included” and filter to just Google Play Music “My Activity” content:

The nextstep here would be to change the data format from HTML to just JSON – which is super easy to load and parse with any data analysis frameworks or environments.

Finally, proceed to download your data as “one-time archive” and a format that works well for you (I picked a zip archive).

I got a “scary” message that it can take up to a few days to get a download link to this archive, but then got it in less than a minute. YMMV, I suspect that it depends on how much data you are requesting at given time.

Downloaded data

Inside the downloaded archive, my data was in the following file:

Takeout\My Activity\Google Play Music\MyActivity.json

Having a quick look at this JSON file, looks like it contains all I need! For every activity/interaction with GPM, there is an entry with title of the activity (e.g. listened to, searched for), description (which is in fact artist name), time, as well as some metadata:

One thing that is interesting is that “subtitles” contain information about “activity” that GPM associated with given time, temperature, and weather outside. This kind of data was introduced between 2019-03-21 (is not included my entries on that date or before) and 2019-03-23 (the first entry with such information). The activity isn’t super accurate, most of them are “Leisure” for me, often even when I was at work or commuting, but some are quite interesting, e.g. it often got “Shopping” right. Anyway, it’s more of a curiosity and I haven’t found great use for it myself, but maybe you will. 🙂

Gotcha no 2: all times described there seem to be in GMT – at least for me. Therefore if you listen to music in any different timezone, you will have to convert it. I simplified my analysis to use pacific time. This results in some incorrect data – e.g. I spent a month this year in Europe and music that I listened to during that time will be marked incorrectly as during some “weird hours”.

Analyzing the data

Ok, finally it’s time for something more interesting – doing data analysis on the downloaded data. I am going to use Google Colaboratory, but feel free to use local Jupyter notebooks or even commandline Python.

Code I used is available here.

Note: this is the first time I used pandas, so probably some pandas experts and data scientists will cringe at my code, but hey, it gets the job done. 🙂

The first step to analyze a local file with a colab is to upload it to the runtime using the tab on the left of the colab window – see the screenshot below. I have uploaded just the MyActivity.json without any of the folder structure. If you use local scripts, this step is unnecessary. as you can specify full local path

Gotcha no 3: Note that uploading files doesn’t associate them with your notebook, account used to open colab or anything like that – just with current session in the current runtime, so anytime you restart your session, you will have to reupload the file(s).

Time to code! We will start with loading some Python libraries and set the style of plots to a bit prettier:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

Then we can read the JSON file as Pandas dataframe – I think of it as of table in SQL, which is probably oversimplification, but worked for my simple use-case. This step worked for me right away and required no data preprocessing:

df = pd.read_json('MyActivity.json')
print(df.columns)
Index(['header', 'title', 'subtitles', 'description', 'time', 'products',
       'titleUrl'],
      dtype='object')
                         title description                      time
0  Listened to Jynweythek Ylow  Aphex Twin  2019-12-31T23:33:55.865Z
1       Listened to Untitled 3  Aphex Twin  2019-12-31T23:26:23.465Z
2                Listened to 4  Aphex Twin  2019-12-31T23:22:46.829Z
3               Listened to #3  Aphex Twin  2019-12-31T23:15:02.427Z
4           Listened to Nanou2  Aphex Twin  2019-12-31T23:11:36.944Z
5  Listened to Alberto Balsalm  Aphex Twin  2019-12-31T23:06:25.999Z
6             Listened to Xtal  Aphex Twin  2019-12-31T23:01:32.538Z
7       Listened to Avril 14th  Aphex Twin  2019-12-31T22:59:27.321Z
8        Listened to Fingerbib  Aphex Twin  2019-12-31T22:55:37.856Z
9              Listened to #17  Aphex Twin  2019-12-31T22:53:32.866Z

This is great! Now a bit of clean up. We want to apply the following:

  • For easier further coding, rename the unhelpful “description” to “artist”,
  • Convert the string containing date and time to pandas datetime for easier operations and filtering,
  • Convert the timezone to the one you want to use as a reference – I used single one (US Pacific Time), but nothing prevents you from using different ones per different entries, e.g. based on your locations if you happen to have them e.g. extracted from other internet services,
  • Filter only “Listened to x/y/z” activities, as I interested in analyzing what I was listening to, and not e.g. searching for.
df.rename(columns = {'description':'artist'}, inplace=True)
df['time'] = df['time'].apply(pd.to_datetime)
df['time'] = df['time'].dt.tz_convert('America/Los_Angeles')
listened_activities = df.loc[df['title'].str.startswith('Listened to')]

Now we have a dataframe containing information about all of our listening history. We can do quite a lot with it, but I didn’t care for specific “tracks” (analyzing this could be useful to someone), so instead I went to a) filter only activities that happened in 2019, and b) group all events of listening to a single artist.

For the step b), let’s start with just annual totals. We group all activities and use size of each group as a new column (total play count).

listened_activities_2019 = listened_activities.loc[(listened_activities['time'] >= '2019-01-01') & (listened_activities['time'] < '2020-01-01')]
artist_totals = listened_activities_2019.groupby('Artist').size().reset_index(name='Play counts')
top = artist_totals.sort_values('Play counts', ascending=False).set_index('Artist')[0:18]
print(top)
                       Play counts
Artist                            
The Chemical Brothers          151
Rancid                         144
AFI                             98
blink-182                       89
Show Me The Body                69
The Libertines                  68
Ladytron                        64
Flying Lotus                    64
Boards Of Canada                62
Operation Ivy                   62
808 State                       62
Punish Yourself                 61
Lone                            60
Mest                            58
Refused                         54
Aphex Twin                      54
Dorian Electra                  53
Martyn                          52

This is exactly what I needed. It might not be surprising (maybe a little bit, I expected to see some other artists there – but they were soon after – I am a “long tail” type of listener), but cool and useful nevertheless.

Now let’s plot it with matplotlib. I am not pasting all the code here as there is a bit of boilerplate to make plots pretties, so feel free to check the colab directly. Here is the resulting plot:

From this starting point, let your own creativity and imagination be your guide. Some ideas that I tried and had fun with:

  • hourly or weekly listening activity histograms,
  • monthly listening totals / histograms,
  • checking whether I tend to listen to different music in the evening versus the morning (I do! Very different genres to wake up vs unwind 🙂 ),
  • weekly time distributions.

All of those operations are very simple and mostly intuitive in Python + pandas – you can extract things like hour, day of the week, month etc. directly from the datetime column and then use simple combinations of conditional statements and/or groupings.

Some of those queries/ideas that I ran for my data are aggregated in a single plot below:

Summary

Downloading and analyzing my own data for Google Play Music was really easy and super fun. 🙂

I have to praise here all of the following:

  • User friendly and simple Google data / activity downloading process, and its data produced in a very readable/parseable format,
  • Google Colab being an immediate, interactive environment that makes it very easy to run code on any platform and share it easily,
  • Pandas as data analysis framework – I was skeptical and initially would have preferred something SQL-like, but it was really easy to learn for such a basic use case (with help of the documentation and Stackoverflow 🙂 ),
  • As always, how powerful numpy and matplotlib (despite its quite ugly looking defaults and tricks required to do simple things like centering entries on label “ticks”) are – but those are already kind of industry standards, so won’t add here more compliments.

I hope this blog post inspired you to download and tinker with your own activity data. Hopefully this will be not just fun, but also informative to understand what is stored – and you can even judge what do you think is really needed for those services to run well. The activity data I downloaded from Google Play Music looks like a reasonable, bare minimum that is useful to drive the recommendations.

To conclude, I think that anyone with even the most basic coding skills can have tons of fun analyzing their own activity data – combining information to create personally meaningful stats, exploring own trends and habits, extending functionalities, or adding (though only for yourself and offline) completely new features for commercial services written by large teams.

Posted in Code / Graphics | Tagged , , , , , , | Leave a comment

Local linear models and guided filtering – an alternative to bilateral filter

Intro

In this blog post I am going to describe an alternative tool for the graphics and image processing programmers’ toolbox – guided filtering.

Guided filtering is a really handy tool that I learned about from my coworkers, and I was surprised that it is not more popular, especially for real time applications.

As in many – if not most – of my blog posts, I am not going to describe original research of mine, but try to educate, explain some ideas conceived and published by others, focusing on practical application, their pros and cons – and hopefully inspire others.

Given the recent advances in real time rendering, and popularity of “stochastic” techniques – both screen-space (SSAO, screen-space reflections), as well as world-space like ray tracing, there will be more need for efficient, robust, detail preserving denoising of different image signal components.  Most commonly used (joint) bilateral filter is just one of them, but there are some as fast (if not faster!) alternatives and I am going to describe one of them.

This blog post comes with code in form of Google Colab and Python / numpy implementation. I find it really useful that anyone can just run it in their browser, start modifying it and playing with it. Highly recommend it over lonely repos that might not even compile! 🙂

Problem statement

I am going to use here a problem of screen-space ambient occlusion. It is an approximation of some of global illumination effects (visibility, shadowing of indirect illumination) using screen-space depth information. I will not go deeply into how it is computed (I have mentioned a robust, fast and simple technique in an older blog post of mine in the context of temporal techniques), but to meet performance constraints, this technique is computed:

  • Usually at lower resolution,
  • Using randomized, stochastic sampling patterns, producing noisy results.

Here is an example of raw, unfiltered SSAO output on a Crytek Sponza scene:

Because of small sample count, the results are very noisy, and there are only a few discrete levels of signal (color banding). This image is also half resolution (quarter pixels) as compared to the full image.

We could just blur it to get rid of the noise with a Gaussian filter, but the effect would be
 well, blurry:

The main problem with this blurriness – apart from loss of any kind of details – is that it doesn’t preserve the original geometric edges, and when applied to a real scene, would creating halos.

The most common solution to it is joint bilateral filtering.

(Joint) bilateral filter

Bilateral filter is image filter that varies sample weights not only based on image-space distance in pixels, but also the similarity between color samples.

In an equation form this could be written as:

y = Sum(w(x, xij) * xij) / Sum(w(x, xij)
w(x, xij) = wx(i,j) * wy(x, xij)

Where wx is spatial weight, and wy is signal similarity weight. Similarity weight can be any kind of kernel, but in practice the most commonly used one is Gaussian exp(-1/sigma^2*distance^2). Note the sigma parameter here – it is something that needs to be provided by the user – or generally tuned.

Bilateral filters are one of the cornerstones of image processing techniques, providing basis for denoising, detail enhancement, tonemapping etc.

Joint bilateral filter is a form of bilateral filter that uses one signal to create weights for filtering of another signal.

It can be used anytime when we have some robust, noise-free signal along the unreliable, noisy one and we expect them to be at least partially correlated. Example and one of the earliest applications is would be denoising low light images using a second picture of the same scene, taken with a flash.

Joint bilateral filtering can be written in an equation form as:

w(x, xij) = wx(i,j) * wy(z, zij)

Note that here the similarity weight ignores the filtered signal x, and uses a different signal z. I will call this signal a guide signal / guide image – even outside of the context of guided filtering.

In the case of SSAO, we can simply use the scene linear depth (distance of a rendered pixel from the camera plane).

It is most commonly used because it comes for “free” in a modern renderer, is noise-free, clearly separates different objects, doesn’t contain details (like textures) that we wouldn’t want to correlate with the filtered signal – is piece-wise linear.

On the other hand, nothing really prevents us from using any other guide signal / information – like albedo, lighting color, geometric normals etc. Using geometric or detail normals is very often used for filtering of the screen space reflections – as reflections are surface orientation dependent and we wouldn’t want to blur different reflected objects. But in this post I will focus on the simplest case and keep using just the depth.

Let’s have a look at simple joint bilateral filter using depth:

Clearly, this is so much better! All the sharp edges are preserved, and the efficacy of denoising seems quite similar.

Bilateral upsampling

Ok, so we got the half resolution signal quite nicely filtered. But it is half resolution
 If we just bilinearly upsample it, the image will just become blurrier.

If we zoom in, we can see it in more details:

Left: 2x point upsampling, Right: 2x bilinear upsampling.

In practice, this results in ugly half-res edges and artifacts and small dark/bright jaggy aliased “halos” around objects.

Luckily, the joint bilateral filtering framework provides us with one solution to this problem.

Remember how joint bilateral filter takes data term into account when computing the weights? We can do exactly the same for the upsampling. (edit: a second publication also from Microsoft research covered this idea the same year!

When upsampling the image, multiply the spatial interpolation weights by data similarity term (between low resolution guide image and the high resolution guide image), renormalize them, and add weighted contributions.

This results in much better image sharpness, edge preservation and other than the computational cost, there are no reasons to not do it.

Left: 2x point upsampling, Middle: 2x bilinear upsampling, Right: 2x joint bilateral upsampling.

I highly recommend this presentation about how to make it practical for things like SSAO, but also other applications that don’t require any bilateral denoising/smoothing, like particle and transparency rendering.

As a side note – performing such bilateral upscaling N times – if we have N different effects applied in different places – might be costly at full resolution. But you can use the bilinear sampler and precompute UV offsets once, ahead of time to approximate bilateral upsampling. I mentioned this idea in an older blog post of mine.

To summarize the described workflow so far:

  1. Create a low resolution representation of some “robust” signal like depth.
  2. Compute low resolution representation of a noisy signal.
  3. Perform joint bilateral filtering passes between the noisy signal and the clean low resolution signal.
  4. Use a low and high resolution representation of the clean signal for joint bilateral upsampling.

In theory, we could do joint bilateral filtering of the low resolution signal directly at higher resolution, but in practice it is not very common – mostly because of the performance cost and option to use separable filtering directly in lower resolution.

Problems with bilateral filtering

Let me get this straight – bilateral filtering is great, works robustly, and it is the recommended approach for many problems. 

But at the same time, as every technique it comes with some shortcomings and it is worth considering alternatives, so I am going to list some problems.

Lack of separability

One of the caveats of the bilateral filter is that following its definition, it is clearly not separable – as weights of each filter sample, even off the x and y axis depend on the similarity with the filter center.

Unlike Gaussian, a pass over x dimension followed by a pass over the y dimension (to reduce the number of samples from N^2 to N) is not mathematically equivalent
 but in practice this is what we do most of the time in real time graphics.

The results are almost indistinguishable:

As expected, there have to be some artifacts:

Left: Full NxN bilateral filter. Right: Separable N+N bilateral filter.

Notice the banding on the top of the image, as well as streaks around the lion’s head.

Still, it’s significantly faster and in practice the presence of artifacts is not considered a showstopper. 

I will just briefly mention an alternative to slightly reduce those artifacts – to run the separable filter N times – so iterating over the image x, y, x, y etc. But below is a diagram of clean signal pattern where it will not work / improve things anyway.

Example pattern of clean signal that separable joint bilateral filtering will ignore completely. Neither horizontal, nor vertical pass can “see” the elements on the diagonal, while a non-separable version would read them directly.

Edit after a comment: Please note that there are many other different ways of accelerating the bilateral filter. They usually involve subsampling in space, either in one sparse pass, or in multiple progressively more dilated ones.

Tuning sensitivity / scale dependence

Bilateral filtering has a necessary tuning parameter “sigma”. This makes it inherently scale dependent. Areas where guide signal difference is multiple sigmas won’t get almost any smoothing, while the areas with larger difference will get over-smoothed. I will show this problem with an example of 1D signal.

Top: clean 1D guide signal, Bottom: Noisy 1D signal to filter.

We need to pick some sigma, and because of the signal variations here, no choice is going to be perfect. Too large sigma is going to cause over-blurring of the small signal (and rounding of the corners), while too small sigma will under-filter some areas.

Top: Too small sigma causes under-filtering of the signal and some staircase/gradient reversal artifacts, bottom: too large sigma over-blurs the corners, and over-smooths the smaller scale details.

Staircase / piecewise constant behavior / splotches

While bilateral filtering behaves perfectly when the guide/clean signal is piece-wise constant, it can produce undesired look when the clean signal is a gradient / linear ramp – or generally piece-wise linear. In such case it will either produce piece-wise-constant look (very often referred to “cartoony”), gradient reversals, or staircase type of artifacts.

Top: clean signal, middle: noisy signal, bottom: joint bilateral filtered signal – notice some local gradient reversals / staircase artifacts.

Performance cost

Finally, the bilateral filter is not that cheap. Each sample weight requires computation of the kernel function (most often exponential), and there is a need for normalization at the end.

Doing bilateral denoising then followed by bilateral upsampling while saves some computations by doing most of the wide-radius filtering work in low resolution, but still shows some redundancy as well – in the end we are computing bilateral weights and renormalizing twice.

Guided filter – local linear models

Some of those shortcomings are solved by the guided filter. While I as always recommend the original paper – and its idea is brilliant, I had some trouble parsing it (IMO the follow-up, fast guided filter is easier to understand), so will describe it with my own words. 

The main idea is simple: given some local samples, find a best-fit linear relationship between the guide and filtered signal. “Best-fit” is vague, so we will use the linear least squares method – both because it has the optimal behavior (maximum likelihood estimator) in the case of normally distributed data, but also because of it having a simple, closed form and computational efficiency.

This is effectively solving a linear regression problem “what is the linear equation that describes relationship in the data with the smallest squared error”.

Before delving into how to do it, let’s have a look at a few examples of why it might be a good idea. If there is no local variation of the guide signal, linear model will revert to just averaging – so will work correctly in the case of the piece-wise constant signals.

At the same time, one of the main and immediate motivations is that such method will perform perfectly in the case of piece-wise linear signal (as opposed to piece-wise constant of the bilateral filter). So in the case of SSAO – it should give better results anytime you have a depth gradient correlated with change in the filtered signal intensity. 

Similarly, it is in a way a machine learning technique and it should discover linear relationships no matter what is the scale of them – it will automatically infer the scaling parameter from the data.

Let’s see it on the two examples from the previous section – shortcomings of the bilateral filter.

Top: noisy signal, middle: joint bilateral filtered signal – notice some local gradient reversals / staircase artifacts, bottom: same signal filtered with a guided filter. Notice the lack of problematic artifacts.

Top: Bilateral filter with too small sigma, middle: bilateral filter with too large sigma, bottom: parameter-free guided filter.

“There is no such thing as a free lunch”, so there are some shortcomings, but I will cover them in a later section. Now, knowing the motivation behind the local linear regression, let’s have a look at how we can do it.

How to compute linear regression efficiently

How to solve linear least squares is one of the most common entry-level position machine learning interview questions. This is so widely described topic that it doesn’t make sense for me to rederive here the normal equations (Inverse(X*X^T)*X^T*b) using the matrix notation form. Instead, I will just point out to the univariate linear regression wikipedia article, that shows how to solve it easily with simple terms like covariance, variance, and mean.

The beauty of it is that you can do it just by accumulating raw (not even centered!) moments. This is not the most numerically robust way to do it, but it is extremely efficient. In pseudo-code, it would be something like:

X = Sum(x)

Y = Sum(y)

X2 = Sum(x*x)

XY = Sum(x*y)

A = (XY - X * Y) / (X2-X*X + reg_constant)

B = Y - beta * X

This is so simple and beautiful – we are looking at just sums, fused multiply-adds and a single division! The sums are not only completely separable, but also can be computed very efficiently using SAT/integral images.

We can also do a weighted linear least squares – and just multiply the moments by weights when accumulating them. I will use Gaussian weights in my examples due to reverting to pleasantly-looking (and not very alias prone) Gaussian filter in the case of constant signals.

Note the reg_constant here. In practice for perfectly flat signals you have to add some regularization to avoid divisions by zero or NaNs (or simply answering an ill-posed question “what is the linear relationship between a signal that doesn’t change at all and a different one?”). Such Tikhonov regularization can also make the model “smoother” and reverting more and more to local mean (like regular smoothing). You can observe the effect in the following animation – as regularization increases, the result looks more and more like a simply blurred signal.

If you use a separable kernel like Gaussian for weighting of the accumulated moments, this means that it can still be done in a separable way. While storing 4 values from 2 signals might seem sub-optimal and increase the memory bandwidth usage (especially those moments as squared variables are precision-hungry…) or the register pressure, total linearity of moment accumulation is suitable for optimizations like using local / group shared memory/cache and avoiding going back to the main memory completely. It can be done pretty efficiently with compute shaders / OpenCL / CUDA.

Results

How does it work in practice? Here is the result of filtering of our sponza SSAO:

And by comparison again bilateral filter:

The results are pretty similar! But there are some subtle (and less subtle) differences – both solutions having their quality pros and cons. 

For better illustration, I include also a version that toggles back and forth between them. Guided filter is the slightly smoother one.

I will describe pros and cons later, but for now let’s have a look at another pretty cool advantage of local linear models.

Upsampling linear models

After we have done all the computations of finding a local linear regression model that describes the relationship between the filtered and guide signal, we can use this model at any scale / resolution.

We can simply interpolate the model coefficients, and apply to a higher (or even lower) resolution clean / guide signal. It can be used to greatly accelerate the image filtering.

This way, there is no need for any additional bilateral upsampling.

In full resolution, just fetch the interpolated model coefficients – like using hardware bilinear sampling – and apply them with a single fused multiply-add operation. Really fast and simple. The quality and the results achieved such way are excellent.

Top: Joint bilateral upsampling of guide filtered image, bottom: bilinearly interpolated local linear model.

While those results are very similar, there are two quality advantages of the linear model upsampling – it can at the same time preserve some details better than joint bilateral upsampling, as well as produce less bilinear artifacts. Both of them come from the fact that it is parameter-free, and discovers the (linear) relationships automatically.

Left: joint bilateral upsampling of the results of the guided filter, right: upsampling of the local linear model coefficients. Notice that both the details are sharper, as well as have less of bilinear artifacts.

On the other hand, in this case there are two tiny 2 pixel-wide artifacts introduced by it – look closely at the right side of the image, curtains and a pillar. What can cause such artifacts? If the low resolution representation lacks some of the values (e.g. a single pixel hole), linear model that is fitted over a smaller range and then extrapolated to those values will most likely be wrong


The linear model upsampling is so powerful that it can lead to many more optimizations of image processing. This post is already very long, so I am only going to reference here two papers (one and two) that cover it well and hopefully will inspire you to experiment with this idea.

Pros and cons

Pro – amazingly simple and efficient to compute

Pretty self-explanatory. In the case of two variables (input-output), local linear regression is trivial to compute, uses only the most basic operations – multiply-adds and can be used with separable convolutions / summations or even SATs.

Pro – works at multiple image scales

The second pro is that local linear regression almost immediately extends to multi-resolution processing with almost arbitrary scaling factor. I have seen it successfully used by my colleagues with resolution compression factors higher than 2x, e.g. 4x or even 8x.

Pro – better detail preservation at small detail scales

Local linear models are relatively parameter-free, and therefore can discover (linear) relationships no matter the guide signal scale. 

Here is an example of this effect:

Left: joint bilateral filter, right: guided filter. 

Notice not only much cleaner image, but better preservation of AO that increases with depth around the top right of the image. Joint bilateral filter is unable to discover and “learn” such relationship. On the other hand, details are gone around the left-bottom part of the image, and I will explain the reason why in a below section called “smudging”.

Pro – no piecewise constant artifacts

I have mentioned it before, but piecewise constant behavior, “chunkiness”, gradient reversal etc. are all quite objectionable perceptual effects of the bilateral filter. Local linear regression fixes those – resulting in a more natural, less processed looking image.

Pro or con – smoother results

Before going into the elephant in the room of over blurring and smudging (covered in the next point), even correct and well behaved local linear models can produce smoother results.

Here is an example:

Left: joint bilateral filter, right: guided filter without any regularization.

Notice how it can be both great (shadow underneath the curtain), as well as bad – overblurred curtain shape.

Why does such oversmoothing happen? Let’s look at a different part of the image.

I have marked with a red line a line along which the depth increases just linearly. Along this line, our signal goes up, and down in oscillatory patterns. What happens if we try to fit a single line to something like this? The answer is simple – oscillations along this line will be treated as noise and smoothed no linear relationship will be discovered, so the linear model will revert to local mean.

Con – “smudging”

This is what I personally find the biggest problem with the guided filter / local linear models and limitation of a linear-only relationship. Notice the blurry kind of halo around here:

Left: joint bilateral filter, right: guided filter.

Why is it happening only around those areas? I think about it this way – a significant discontinuity of signal and a large difference “dominates” the small-scale linear relationships around it.

I have compared here two signals:

Left: flat signal followed by a small signal variation can be fit properly when there are no discontinuities or conflicting gradients, right: presence of signal discontinuity causes the line fitting to completely ignore the small “ramp” on the right.

Overall this is a problem as human visual system is very sensitive to any kind of discontinuities and details suddenly disappearing around objects look like very visible halos


This reminds me of the variance shadow mapping halos/leaking.

Would fitting a second order polynomial, so tiny parabolas help here (another shadow mapping analogy – like in moment shadow mapping)? The answer is yes! It is beyond the scope of this blog post, but if you are interested in this extension, be sure to read till the last section of this blog post. 🙂  

Con – potential to create artifacts

I have mentioned above that using bilateral upsampling can cause some artifacts like:

Those are less common and way less objectionable than jagged edges and broken-looking bilinear interpolation, but getting rid of those might be very hard without either too strong / too over-smoothing regularization, or some tricks that go well beyond the scope of a simple linear regression (additional data weighting, input clamping, stochastization of the input etc.).

Con – regression of multiple variables more costly

Unlike the joint bilateral filter where weight is computed once for all of the channels, if you want to filter an N channel signal using a single channel guide, the computation and storage cost goes up N times! This is because we have to find local linear model that describes the relationship between the guide signal and each of the channels separately
 In the case of RGB images this might be still practical and acceptable (probably not for multi-spectral imaging though 🙂 ), but it might be better to do it only on e.g. luma and upsample and process the chroma with some different technique(s)…

Con – memory requirements

The local storage memory costs of regression are at least twice as large as the bilateral filter for the number of channels. Unfortunately, those are also “hungry” for precision (because of using squared raw moments) and you might have to use twice more bit depth than for your normal filtered and guide signals.

Extending it and taking it further

I hope that after this post I have left you inspired to try – and experiment with – a slightly different tool for various different image processing operations – filtering, denoising, upsampling, mixed resolution processing. It is trivial to prototype and play with, but it goes beyond simple linear correlation of two variables. 

In fact, one can extend it and use: 

  • Any order polynomial least squares regression, including modelling of quadratic, cubic, or any higher order functions,
  • Multivariate modelling, including regression of the output from multiple guide signals.

Both of those require going back to the least squares framework, solving the normal equations, but can still be solved via accumulation of RAW moments of those signals to construct a larger covariance matrix.

Here is a quick, animated comparison in which I added depth squared for the regression.

A sequence of joint bilateral filter, first, and the second order least squares regression. Second order linear regression (parabola fitting) improves the results quite drastically
 but obviously at additional performance cost (7 moments accumulated instead of 4) and some more noise preservation.

Using more input signals comes with an additional cost, but does it always guarantee increased quality? There are two answers to this question. First one – “kind of”, as if there was no linear relationship with this additional variable and no correlation, it should be simply ignored. The second one is “yeah in terms of absolute error, but kind of not, when it comes to perceptual error”.

The problem is that more complex models can “overfit”. This is known in ML as bias-variance trade-off and more variance = more noise remaining after the filtering. We can see it in my toy example – there is clearly somewhat more noise on the left column.

As an example of a paper that uses multivariate regression, is this recent work from Siggraph 2019. It is work solving directly the problem of rendering denoising, and they use multiple input signals like normals, world space positions, world space positions squared etc. They use a lot of tricks to make linear models work over non-overlapping tiles (sparse linear models) – I personally found it very inspiring and highly recommend it!

I will conclude my post here, thanks for reading such a long post – and happy adventures in image filtering!

PS. Be sure to check out and play with Colab code that comes with this blog post.

References

Scalable Ambient Obscurance, Morgan McGuire, Michael Mara, David Luebke

Flash Photography Enhancement via Intrinsic Relighting, Elmar Eisemann, Freedo Durand

Joint Bilateral Upsampling, Johannes Kopf, Michael F. Cohen, Dani Lischinski, Matt Uyttendaele

Image-Based Proxy Accumulation for Real-Time Soft Global Illumination, Peter-Pike Sloan, Naga K. Govindaraju, Derek Nowrouzezahrai, John Snyder

Mixed Resolution Rendering in Skylanders: Superchargers, Padraic Hennessy 

A Low-Memory, Straightforward and Fast Bilateral Filter Through Subsampling in Spatial Domain, Francesco Banterle, Massimiliano Corsini, Paolo Cignoni, Roberto Scopigno

Edge-Avoiding À-Trous Wavelet Transform for fast Global Illumination Filtering, Holger Dammertz, Daniel Sewtz, Johannes Hanika, Hendrik P.A. Lensch

Guided Image Filtering, Kaiming He

Fast Guided Filter, Kaiming He, Jian Sun

Deep Bilateral Learning, Michaël Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, Frédo Durand

Bilateral Guided Upsampling, Jiawen Chen, Andrew Adams, Neal Wadhwa, Samuel W. Hasinoff

Variance Shadow Maps, William Donnelly, Andrew Lauritzen

Moment Shadow Mapping, Christoph Peters, Reinhard Klein 

Blockwise Multi-Order Feature Regression for Real-Time Path Tracing Reconstruction, Koskela M., Immonen K., MÀkitalo M., Foi A., Viitanen T., JÀÀskelÀinen P., Kultala H., and Takala J.

Posted in Code / Graphics | Tagged , , , , , , , , | 3 Comments

How (not) to test graphics algorithms

Intro

Siggraph 2019 is sadly over, but as always I came back super inspired and grateful for meeting many friends. 

Conferences are mostly not about seeing the presentations – but about all the interesting and inspiring discussions, and one of such casual lunch-time chats lead to me writing this blog post.

We chatted quite a lot about how to test automatically graphics features and while I am happy that the games industry starts to introduce various forms of testing, I believe that (in the games industry) many (if not most) people testing graphics features do it wrong.

This is a bold opinion, so I am going to elaborate on it, but first I will share my personal experience.

Personal backstory – frustrated by tests

My story with using tests is not glorious. 

Over my career switching workplaces, I went from teams using zero tests at all (not even trivial build tests! my first game gig binary builds were initially checked in to VCS by a programmer “from time to time”), through simple pre- and post-submit build tests, to full testing culture, but for most of my career and at most of those places, the code itself – the functionality – was not tested.

I encountered some form of “functional” tests quite late in my career – don’t want to mention here the company / team name, as it is not relevant, and I am sure a lot has changed since then. The important thing is – the amount of tests was orders of magnitude more than I have seen before. 

Some of the tests were very reasonable – for example smoke tests on code and content build and some of the actual game levels. They were catching lots of real issues, both in the code, as well as in the checked-in data.

On the other hand, as a graphics engineer, I quickly discovered that I have to deal with numerous “visual tests” for graphics features and they were source of some frustrations and tensions.

The testing framework itself and infrastructure were pretty good, but the team practices and processes around testing graphics features were far from useful – at least to me and in my experience.

They were essentially golden tests for each feature, all testing for exact output image rendered by the game engine. I will describe the basic workflow to test some new features in a new section, naming it Approach A.

It was quite an extreme process, but talking with colleagues from other companies, I learned that many developers use at least partially similar testing strategies!

That past experience caused the immature me to literally hate all kinds of tests (allergic Pavlovian reaction), but after some time and working with teams somewhat more experienced in terms of engineering practices and “code health culture”, I think I got to love well designed tests and I think I can dissect one by one how Approach A could have been improved.

In the next sections, I am going to compare the Approach A with an Approach B – an example of a how a graphics feature could be tested, and then by comparing both approaches analyze what distinguishes a “good” and a “bad” test process for graphics and imaging programmers.

How testing shouldn’t be done – Approach A

  1. Tech artist or a programmer creates a new “level”, placing meshes and materials (that use the new feature) with the main game editing tool.
  2. Virtual camera is placed in the world and a script is set up to take a screenshot after some time. Optionally, the script would toggle some rendering features on / off.
  3. The test is submitted in the data repository (not the code repository!) since it has some data dependencies.
  4. Screenshot (golden file) is stored in some database.
  5. During the testing itself, the screenshot is compared (hash, exact equality) with the output – any single difference = test failure.
  6. Some time later, someone changes something and suddenly, test output changes a few of the pixel values = failure. You can see both the hash difference, as well as gold and test + diff images.
  7. Any test failure after submission = submits are blocked for the whole team, until the tests are “fixed” or “blessed” (golden file updated).

Basically every single point on that list was either ineffective, or simply frustrating.

At least once a week I had to go through +/- thousand of screenshots and “bless” them under peer pressure (I have just blocked other colleagues from submitting) – overall very stressful experience, magnified by other artifacts of the submit/integration process like waiting in line for integration to happen, getting kicked out of the queue on broken tests etc.

At the same time, those tests were not catching many serious, real problems – like some important part of lighting being not normalized, NaNs appearing in BRDF at gloss values of 1, etc.

How it could be done – Approach B

Let’s say you have written a new awesome feature – subsurface skin scattering shader, requested so much by character artists. Both them and you are happy with the prototype results on a test model and you could click submit
 But instead you decide to improve the code a bit, modularize it, and write some tests.

You think about steps of your algorithm, and decide to test some following properties:

  • Does the skin shader work – does it actually diffuse the incoming light?
  • Does the skin shader preserve energy conservation (no matter what is the scatter profile, there should never be energy added)?
  • Does the skin shader respect specified diffusion profile?
  • Does the skin shader produce “reasonable” values – never negative, NaNs, inf.
  • Is the diffusion rotation invariant / isotropic?
  • Are the artist-authored material properties properly packed?
  • Does the diffusion stop at significant depth discontinuities? 
  • How much performance cost does it add?

Ok, knowing +/- what you want to test (this is all just an example), you finally decide to write some tests!

  1. You create a simple C++ (or the language of your choice) file along your code and add it to your build system with some “decoration” that it should be compiled as a part of an existing or a new test suite.
  2. In this file, you add simple macro decorated functors / classes that test behaviors one at a time.
  3. For every behavior, you create synthetic, procedural input from your code.
  4. For every behavior, you verify the output procedurally from your code. Apart from checking the actual behavior, you call some helper function e.g. ValidateOutput that checks for NaNs, inf, negative values.
  5. On test failures, you add code printing as much information as possible – expected vs actual, histogram of differences if checking multiple values, maybe additionally produced image disk dump.
  6. You write a (micro)benchmark that times the newly added pass depending on different sizes of inputs, different proportion of pixels with the feature enabled / disabled etc.

The points 3 and 4 are absolutely crucial and very different from the Approach A, and not very commonly used among colleagues I talked with.

I will make this example more concrete now – let’s say that you wanted to check for energy conservation of the diffusion process.

You would create a synthetic lighting buffer (and potentially a synthetic GBuffer if necessary) by filling it on CPU with zeros and a single pixel in the middle that would be “1” (or any reference value). You call your normal skin code pass, and then fetch the results to the CPU. On the CPU, you analyze the output programmatically – e.g. numerically integrate the pixel values. Such a test could have a few sub cases – testing unit response (no actual diffusion), perfectly flat box-filter like diffusion, and a Gaussian-like profile. Again, the most important part of this workflow is having very deterministic, extremely simple, and procedural inputs/outputs that you can reason about.

Important note: analyzing outputs numerically doesn’t mean that you can’t save the intermediate / final outputs for debugging and inspect them visually as well. On the contrary, I highly encourage having such option! Cases with all zeros or all infinity might be easy to see in a debug printout, but often visual inspection can provide insights on more non-obvious problem patterns (hey, why is this checkerboard pattern there? did I skip some samples?). Furthermore, visual verification of the test inputs / outputs when writing them can verify it they “make sense” and are representative of the expected outcome.

What is the difference between both approaches?

Let’s now dissect differences between the two approaches.

Tests (not) verifying correctness

What is the purpose of a test? There are many, but the most straightforward one is verifying correctness. If your input is some data, and it “looks kind of ok”, is this verifying the correctness?

If you want to test a skin shader, it might seem like a good idea to test it on a mesh of a head, but in such a setting you cannot verify any of the systems design or implementation assumptions  – just whether it “looks right”. This might be important for the user to verify whether the feature is what they asked for, but is useless for other engineers who will be looking at changed results of such tests in the future.

Are you sure that your Gaussian blur is not brightening the image? That its sigma is correct? Will the lighting shader work well with parameters at the end of the input range? Is your subsurface scattering shader really applying the requested diffusion profile? Is data after GBuffer packing/unpacking within the theoretical quantization precision across the range? None of those questions can be answered by eyeballing the output. 

Easiness/difficulty adding new tests

In this category, the problem with the Approach A was obvious – adding a new test involved multiple steps and workflows that were not typical programmer workflow. Launching the editing tool, setting up some geometry, setting up scripts, adding new golden data
 Quite a lot and very discouraging if you want to provide. My personal no1 rule of creating a healthy team culture is to make sure that valued and desired behaviors are “easy” to do (in a perfect world would be easier than the undesired ones). Having to jumping through many hoops to create a simple test doesn’t encourage testing culture.

Adding a new test that would be executed automatically should be just a few lines of code – and it is possible if you have testing process set up like in Approach B.

Tests close to / separated from the code tested

If tests “live far away” from the code tested like in the Approach A, it is hard to correlate one with another. If you are refactoring some feature and need to verify if tests that changed were actually supposed to change or not, it destroys your confidence that should come from using tests…

Furthermore, I believe that tests can serve as a supplemental “documentation” (just like well named functions and variables, well commented code etc) – and I often rely on them to see how a piece of code might be used, what are the assumptions etc. With the Approach B you can open the tests file and learn about the potential use-cases and assumptions that the code author has made.

If tests are completely separated and just test functionality, this advantage is also completely gone
 I might even find and open test scene, but not necessarily know what was set up there and how!

Testing on synthetic vs authored (and binary) data

Tests should be as simple, “atomic” and as isolated as possible (unless you want to do specifically integration tests). Relying on some arbitrary, authored and ad hoc data makes it very difficult to analyze / debug the test and the desired results – see “verifying correctness” above. 

A second problem is that now all your testing relies on your data processing pipelines. If you change your data pipelines even slightly (let’s say introduce subtle compression, quantization or anything), all your tests are going to change! This leads us into the next section…

Unit testing / testing end-to-end

Relying on e.g. data processing pipelines in all of your tests makes reasoning about safety of changes (one of the points of testing) impossible – you will see hundreds of tests changed their value, but among this noise might miss some real problem.

Notice how in the Approach B any changes in your content pipeline will not cause unexpected visual changes.

Testing end-to-end like in the Approach A relies on tens of different systems
 Content build system (previous point), material system, lighting system, camera system, mesh rendering, post processing, even gamma correction! Once after changing from regular z buffer to inverse z, I had to “bless” all the tests – not only unnecessary, but dangerous (I could have missed some legit regression). This is lots of moving pieces, and makes it impossible to correlate simple inputs to the output value. If it changes (suddenly test becomes broken by an “innocent” change) – good luck debugging where it comes from!

(Not) Understanding the breakages

Ok, your input got broken
 Why is that? Even ignoring the above (“testing end-to-end”), are those single pixel differences caused by “quantization noise”, or an inf/nan? Oh, the energy conservation broke – do we now have too much, not enough, or simply wrongly distributed outputs?

Having numerical analysis, histograms of differences, or simply asserts in tests (“assert that the output is always below 1”) like in the Approach B would immediately answer at least some of those questions.

(Lack of) tests documentation

Tests themselves should be documented and commented if possible. I find it much easier to do it through code comments and meaningful function naming (trivial example – e.g. VerifyNoNaNs called from within a test) than through some metadata attached to the test itself and the scene.

Test updates separated from changes / code CLs

Ok, let’s say that you have refactored some system and expect decreased/increased precision of some stages. In the case of Approach A you would submit your CL, and then update the goldens values. In the case of Approach B, you can put it in the same CL (again, change and test relative “locality”), and specifically reason about the changes “ok, I have lowered the precision of quantization by 2 bits, so I expect to change my test epsilons by no more than 4x”.

Relying on GPUs and floating point operations for exact comparisons

This one is a tough one and I don’t have a great answer for.

Doing graphics and work on GPUs, we want to test floating point operations, as well as catch some e.g. driver regressions. 

On the other hand, float point operations are flaky, can depend on the target platform (e.g. presence of SSE vs AVX), some runtime environment flags that change floating point behavior, or a driver version.

I personally think that having a hybrid tests that do all the input creation, packing, output fetching and analysis on the CPU, but execute the actual graphics production code on the GPU is a good middle ground, but as I said – it’s quite tough point, and every approach I have tried had its pros and cons.

If you suffer from lots of noise from driver / testing device changes, (and are sure that your tests are designed in a good way) then consider using a WARP device for DirectX, or excellent SwiftShader for OpenGL.

Test speed

This point might be too much of an implementation detail, so I will keep it short – but in the Approach B tests have almost no dependencies, are extremely minimal and execute in literally milliseconds. Fast tests encourage adding more tests, and testing often during the coding process itself.

When and how to test graphics?

Having described general ideas regarding how features can be tested, one might ask – when it is worth doing it? 

First use case – that I +/- already described here – is testing features when their interface and functionality are more or less defined, and most of the code is written. I want to emphasize that you are not limited to just the simplest single pass and image inputs / outputs.

Nothing prevents you from creating procedurally a simple scene with some e.g. decals, and verify if they get rendered correctly, and all the stencil buffer logic works (every game engine that I worked on and that used stencil buffer – it got broken on some platform for some feature during an unrelated refactor / optimization).

The second use of tests is to guide you and help you when writing the code.

While I think the whole concept of TDD is a classic over-complicated snake oil, it is often worth writing a test for the functionality you are about to add / in the process of adding. For example, writing GBuffer bit packing having a test that verifies that 0 maps to 0, 1 maps to 1 and 0.5 maps to 0.5 can help save you a lot of time. I cannot count instances of bugs when an engine had 127/128 or 255/256 instead of 1.0 because of wrong packing logic. 🙂 

Similarly you can write tests during feature development for any functionality from high level – like material blending, through mid level (what is the z buffer precision? Count all the discrete z values in the range 100m-101m and you have an immediate metric estimating z-fighting in that range!), to low level – I cannot imagine writing fixed-point math code without some hard check tests for under/overflows and verification of rounding.

Third use case that I highly encourage that is testing for performance – (micro)benchmarking. If you set up inputs procedurally, you can track the exact performance / timing of a given feature in isolation. Setting up the inputs/outputs procedurally allows you to control it very precisely and avoid inherent noisiness (and data dependence) of testing of the real scenes. Such benchmark can be used in the optimization process itself (especially with shader/code hot-reloading), but more importantly to track any performance changes and regressions. You want to track, locate, (and stop) any functionality regressions – why would you not want to do the same for performance? 🙂 Tracking it over time and having logs for many months can also help to analyze trends like immediately not obvious regression creep (death by a thousand paper cuts). Or conversely, you might immediately see a performance improvement from a new driver / compiler update – as a technical owner of a feature / system / technology, you should be aware of all of the changes, including the positive ones.

Offtopic – sanitizers and fuzzing

When talking about testing, I couldn’t resist myself from dedicating a tiny section and not mention here two techniques that are simply amazing when it comes to ROI – sanitizers, and fuzz testing.

They are especially effective when used together (given that sanitizers can slow down code to the point when manual testing is not possible/pleasant
), and will catch lots of real problems almost immediately.

I promise that TSan will trigger almost any time you introduce some new parallel code, and that it won’t be a false positive. 🙂 

Summary / caveats

Testing is not “scary” and should never be a burden and a struggle. After all, it is a tool for you – engineer, your future self, and your colleagues.

If testing is a burden, revisit your workflows and processes and figure out which parts can be improved!

I have compared two approaches of testing specific graphics features – Approach A, which is a real workflow that I used to work with, and Approach B, which is how I personally suggest approaching it.

I see not many virtues of the end-to-end / screenshot based approach in the case of testing features, however for the completeness and to be entirely fair, I see some good use cases for golden testing.

One is to have a simple smoke test and watch out for random breakages “somewhere in the pipeline”, when not anticipated and from unexpected sources. The second one is that it does provide a form of integration testing, testing the interaction of multiple systems. The third one is a bit paradoxical – the inherent flakiness and sensitivity of such tests makes then a good candidate to catch some unexpected compiler / toolchain / driver changes.

To combine those use-cases and some real advantages of golden tests, and not get into the problems / frustrations, I would suggest to have – a few (no more than 10-15!) golden “smoke tests”, with lots of features stuck into one scene, and testing the pipeline end-to-end. Expect that they might get changed pretty often, but be very thorough when investigating the differences (easier when four images change, than a thousand
). Finally, use programmatic tools like simple histograms, printouts and design the process of debugging the changes as well.

Finally – and the reason why I wrote this blog post – I hope that your adventures with finding the right testing strategy for you, and the potential productivity boost that comes from good tests will be easier to get for you than they were for me. 🙂

Posted in Code / Graphics | Tagged , , , , , , , | 2 Comments

Checkerboard rendering, rotated anti-aliasing and grid frequencies

This post is inspired by three interesting statements or questions that I heard and are interconnected by non-obvious theory:

First was regarding so called checkerboard rendering (example1 example2) – what is advantage of using it? After all it’s the same as rotating your grid by 45 degrees? Presentations on it claim better PSNR, but why?

Second one was regarding reconstruction filters. Why all classic interpolation kernels “separable”, applied separately on x and y sequentially and not being rotationally symmetrical?

Finally third one is something I often hear as a question – why are MSAA patterns like a 4x quad rotated diagonally?

I am going to try to introduce some theory first and show how it all relates to representable frequencies and aliasing. This post is accompanied by two shadertoys: one and two.

1D Sampling recap

Single dimensional signal sampling is well understood and taught in any signal processing course. In most basic words it is process when you take series of single measurements of some continuous process and every measurement is same distance apart. As long as your source signal has no frequency content higher than half of the sampling frequency (Nyquist-Shannon limit) and you use impractically perfect reconstruction filter, you can reconstruct back the original signal.

ReconstructFilter

Signal containing frequencies Nyquist limit can be reconstructed faithfully under perfect reconstruction filter. Source: Wikipedia.

If your signal contains higher frequencies, they are going to start to alias – manifest themselves as lower frequencies, mixing with existing frequency content and corrupting the signal measurements. In other words, frequencies beyond Nyquist rate are equivalent to lower frequencies under sampling.

459px-CPT-sound-nyquist-thereom-1.5percycle.svg

Two frequencies can alias under inadequate sampling rate. Source: wikipedia.

2D Sampling

This is where things start to be less covered by literature and more interesting. Process of 2D sampling is process of a series of sequentially performing two 1D sampling operations on two different axes (sampling is separable!).

In the first step we sample a bivariate continuous signal into discrete series (for example in X dimension) of continuous 1D signals (in Y direction) and then sample them again in Y dimension to get a grid of discrete samples. Pixels are not little squares (famous article by Alby Ray Smith). They are grids of samples.

However if you now look at consequences of Nyquist-Shannon theorem of two consecutive sampling operations, you will see that representable frequencies form a box:

2d nyquist.jpg

Most readers have probably seen this type of diagram many times – but I marked there something interesting and non-obvious. If you think about a rectangular Nyquist domain, in diagonal direction higher frequencies (sqrt(2) times higher) can be represented!

This means that lines, triangles and generally features aligned with X or Y axes (or near to them) will alias faster than the ones that are diagonal.

Zone plate and Siemens star

We can see this effect easily in any pattern designed to represent multiple frequencies.

Two commonly used ones are a zone plate and Siemens star.

Side note: I have no idea why they are not used more commonly by real time graphics rendering community when talking about anti-aliasing or post-processing. Much easier to test quality of your algorithms on known all-frequency pattern than just eyeballing on arbitrary content.

This is a zone plate variant:

zone.gif

Variant of a zone plate.

And this is a Siemens star variant:

siemens.gif

Variant of Siemens star (animated).

Both are supposed to capture variety of frequencies in a single image.

On both images, when frequencies increase (uniform / isotropic frequency change), on principal X / Y axes you can see aliasing or failed reconstruction artifacts appear much faster! Especially in motion, here magnified the issue on a static image:

siemens_star_crop.png

Cropped and nearest-neighbor magnified center of Siemens star – notice non-isotropic and non radially symmetric behavior.

Main axes exhibit much stronger aliasing and the appearance of “wrong” frequencies earlier.

Frequency spectrum of a grid being a box answers one of questions – most reconstruction and resampling filters are specified only in a single axis and interpolation is done independently. Why? Goal of interpolation and reconstruction is to remove any frequencies that cannot be represented before they alias (and not remove other ones) and any frequency cutoff will happen much more on those axes where aliasing will happen first.

Rotating the grid

Immediate question that arises would be – what would happen when you rotate your pixel grid by some angle, let’s say for simplicity of analysis 45 degree keeping the same sampling density?

Rotated grid frequency.jpg

By just rotating the area of represented spectrum cannot change – so we are guaranteed that we will gain some frequencies and lose some other ones. In this case we can get better axis aligned representation at the trade-off of more aliased diagonals. This would sound like a potentially attractive trade-off in some context (e.g. architecture or urban rendering), but rotated signal cannot be displayed on an axis-aligned grid screen. It has to be resampled and perfect resampling filters are impractical (large spatial support, potential for ringing of aliased content or noise), so we’d need to lose even more resolution and not filter out perfectly.

It’s not a great option and assuming this 1-1 ratio it’s not surprising that it’s not used in practice (at least I haven’t heard of – let me know in comments!).

But we can take advantage of rotating sampling patterns in different contexts!

Checkerboard rendering

Checkerboard rendering is a form of mixed resolution rendering. Goal here is to be able to reconstruct a full resolution (for example full 4K resolution like in PlayStation 4 Pro) signal from undersampled one. No decimation / removing frequencies should take place if possible. Obviously we cannot make up for what was missing in the first place without using other techniques like temporal supersampling (we can only hallucinate it), but at least we’d want to not have aliasing.

Checkerboard rendering tries to answer a question: how can you optimally render the image given just 50% of computational power?

(I know, concept of “optimally” is a whole topic of its own – what does it mean? but let’s assume most detailed and least aliased).

Uniform rescaling

One option is to render in uniformly rescaled resolution (both dimensions scaled by sqrt(2).

resampled.jpg

This means that we get uniform detail loss and aliasing in each direction. Vertical and horizontal directions will be even less represented than in our original grid, while we still have potential for expressing higher spatial resolution of diagonal edges.

Biggest issue with uniform rescaling is a need for very high quality reconstruction / resampling filter. Without it upscaling to native (ex. 4K) resolution can result in even more detail loss and aliasing.

Second problem is that no frequencies get fully represented and all content rendered in this lower resolution and resampled will look worse.

Interlaced rendering

Another option would be to just do form of interlaced rendering – so rendering one of dimensions in half resolution. It is a very attractive option as then reconstruction filter can be only 1D, which means better efficiency. We can even go with a very expensive reconstruction filter in that dimension (like a wide support windowed sinc).

interlaced.jpg

While it is cheap, it is also on its not great for rendering quality (under assumption of no temporal antialiasing / extra supersampling). It means that while in one direction edges will be represented as good as with full resolution rendering, in other directions they will be 2x more aliased and with 2x more detail loss.

On the other non-rescaled direction will not suffer from any aliasing from resampling / reconstruction.

Checkerboarded rendering

Finally, there is an option (with some special hardware sauce of adjusting sample positions in a quad plus some others) of doing checkerboarded rendering, so shifting every other row by half. This results in a rotated grid. How well does it perform?

checkerboard.jpg

Checkerboarded rendering allows us to represent horizontal and vertical edges as well as full native resolution! Under perfect reconstruction there would be no detail loss there. Frequencies suffering the most are the diagonals, they will be 2x undersampled compared to native resolution (but only sqrt(2) / 2 worse than horizontal / vertical).

Checkerboarded rendering needs a high quality reconstruction filter for it to hold, but even with a naive bilinear filling of missing samples it can look ok.

Compared / together

From Fourier domain perspective, checkerboard rendering looks very attractive. How does it perform in practice? I created a very simple shadertoy demo comparing those methods on a zone plate. Note: it uses most crude reconstruction possible (tent/bilinear).

output2.gif

When using this simple bilinear filters in my opinion checkerboard rendering looks the best (obviously after the full resolution) and only aliasing happens around the main diagonal.

Anti-aliasing rotated grid patterns

Another common use of rotated grids for different frequency aliasing is either super-sampling or hardware-accelerated techniques like MSAA. Goal of such variants or anti-aliasing is to sample at higher frequencies, avoiding aliasing (avoid high frequencies folding over as low frequencies) and before display use a reconstruction filter that will remove them prior to resolving onto final sampling grid.

As resampling/filtering (signal decimation) will happen anyway, we are free to chose any sample position.

Assuming 4x supersampling / MSAA two alternatives that we will consider here is just axis-aligned box and 45 degree rotated box and 4 samples.

four x aa grid.jpg

If we sample 4x with an uniform grid, we get uniformly 2x more non-aliased frequencies.

If we decide to use rotated grid 4x pattern, we get 4x larger rotated box. This one is a bit different. We get a lot more horizontal and vertical anti-aliasing and a bit more diagonal anti-aliasing. Near horizontal and near vertical lines will observe a massive improvement (2*sqrt(2)) and diagonals will observe some / medium improvement (sqrt(2)).

four x aa rotated.jpg

Using rotated grid gives us better representation of horizontal/vertical signal frequencies and smaller gain in diagonal ones.

Side note: this matters mostly when using a very cheap decimation filter like a box filter (for a long time commonly used as MSAA resolve…) that aliases a lot of frequencies around Nyquist – on principal directions it’s lower so it is preferred to not have any corrupted / aliased frequencies there.

In hardware (MSAA)

Interestingly, by default hardware 4x MSAA patterns are neither axis-aligned nor 45 degree rotated (for example in DirectX 11). They are not horizontally/vertically symmetric.

IC554624.png

Standard DirectX11 MSAA patterns. Source: MSDN documentation.

Since frequency spectrum will be unevenly rotated, frequency aliasing also will be uneven. So same a frequency rotated let’s say 15 degrees clockwise or counter-clockwise will get different amounts of aliasing. It is an interesting choice and I don’t know if there is a specific reason for it other than my guess “it’s good enough and integer offsets are hardware implementation friendly”.

Edit 05.17.2018: Sebastian Aaltonen wrote in comments the following great explanation of this pattern: N-rooks looks better than 45 degree rotated when you have lots of straight X/Y lines, such as walls/staircases/buildings, especially with box filter. Aligned lines are a common case in games as the camera up vector tends to be locked. When the building edge moves slowly across the pixel in X direction, you get four different blending values with N-rook pattern, while 45 deg rotated grid gives you only 3, and the middle one has 2x weight as the samples overlap. DirectX sampling pattern is as close to 45 degree rotation as you can get, while maintaining even spacing between the samples in both X and Y axis. It is a pretty good compromise for the worst case visible aliasing problem, which is straight lines slightly off X or Y axis.

Here is a shader-toy comparison of those methods (with a naive and suboptimal box filter).

output3.gif

 

The “ROTATE DX” corresponds to rotation like in DirectX documentation. You can see that more towards horizontal direction there are some more aliased frequencies.

I am not sure if this is very clear, but to me the rotation by 45 degree looks cleanest.

Summary

If I was to present just one single main takeaway from this post it would be that signals sampled on a grid can represent frequencies that also correspond to a grid. This means that elements of the image (signals) corresponding to diagonal frequencies will behave differently than horizontal / vertical ones.

This perspective is very important when designing reconstruction filters (separable!), anti-aliasing patterns or looking at mixed resolution rendering.

Second takeaway is less theoretical, more about practice: it is very useful to look at some standardized, all-frequency patterns instead of eye-balling content (especially under some animation / rendering camera). They can be animated as well, but this way you can immediately see effects of choices like anti-aliasing patterns, post-processing techniques etc. on your image. It’s not a replacement for testing and evaluating on real content, but a good starting point and one that might build better intuition and conjunctions why some technique might be better than the other one.

Posted in Code / Graphics | Tagged , , , , , | 4 Comments

Tech and scientific writing – graphics, diagram and graph creation tools

Few days ago, I asked a question on twitter: https://twitter.com/BartWronsk/status/919618905319997440
“What is industry/academic standard for diagrams in tech writing? PowerPoint/GoogleDocs diagrams? Dedicated tool like Visio/Dia? Procedural?
I usually use first, but far from satisfied
For proc tried SVG,JS,Graphviz,even Mathematica, but results are either ugly or *lots* of work…”

I got 26 replies, thanks everyone who contributed!

To not have it lost and hoping that someone else will find it useful, writing down answers in a few categories with some comments. Please comment on this post if I got anything wrong and/or I omitted something.

Note that there are no real “conclusions”, almost everyone used something else! Probably almost anything on the list will work. But as a rule of thumb, probably it’s useful to have each of:

  • Dedicated vector graphics program (I’ll give a try to Inkscape and Photoshop vector graphics),
  • Favorite scientific environment for data exploration and plotting (Python+Matplotlib, Mathematica, Matlab),
  • Some favorite environment for authoring documents (whether LaTeX, mathcha or just Powerpoint or Google Docs),
  • maybe something interactive for web/blog (shadertoy, D3, Mathbox).

Vector graphics software

Adobe Illustrator, Subscription $19.99/month, Mac/Windows

For many years standard / go-to for vector graphics. At the same time quite pricey subscription model and could be an overkill for simple diagrams?

Affinity Designer, $49.99, Mac/Windows

No subscription alternative to Illustrator. Lots of people recommended it, worth having a look!

Inkscape, Opensource/Free, Linux/Mac/Windows

Totally free and open source. Subjectively UI looks much cleaner than other open source graphics software.

Omnigraffle, $49/$99 iOS, $99/$199 Mac

Recommended by many people, UI looks pretty clean and productive, sadly Mac/iOS only.

Xfig, Opensource/Free, Linux/Mac/Windows

Dormant development, quite ugly interface (IMO). But open source and free.

Dedicated diagram / flow chart software

Microsoft Visio, $299 or $18 subscription, Mac/Windows

Professional industry grade diagram software. Quite expensive…

SmartDraw, Subscription $9.95/month, Mac/Windows/Cloud

Flowchart marker; don’t know if it’s useful for other, more complicated graphs and diagrams.

yEd, Free, Linux/Mac/Windows

Graph and flowchart only (?), free and multiplatform.

Dia, Free, Linux/Mac/Windows

Dormant open-source project. I used it few times and think there are better, more modern tools (?).

Lucidchart, both free and paid options

Don’t know much about it, looks like it’s more about flowcharts

draw.io, Free, cloud/web based

UI looks very similar to Google Docs, very low entry barrier. Again more about just simple, graph-like diagrams / flow charts.

Visual Paradigm, $349, Linux/Mac/Windows

Another dedicated UML-like, visual diagram tool.

Other software

Microsoft Powerpoint $109.99, Mac/Windows

My go-to drawing solutions for any presentations so far. Works great, supports animations, but is painfully limited.

Photoshop and vector graphics there, Subscription $9.99 a month, Mac/Windows

I’ve never used PS vector graphics, but it’s software I already pay for, so could be a good idea to try it.

Mathcha, free, cloud

This one is very interesting – equation and mathematical graphics (simple plots, diagrams) online editor. Looks like way easier entry / access WYSIWYG LaTeX editor.

Google Docs and family, Free, cloud based

Using this quite a lot at work (what a surprise!) for drawing simple diagrams and it works pretty well. Not leaving document is a big plus, just like with Powerpoint.

Grapher, part of MacOS

Dedicated tool embedded in MacOS; surprisingly powerful for something included in every install!

Interactive web based

Shadertoy, free

This could be a surprising choice, but check out e.g. this algorithm visualization from Leonard Ritter. Code + visualization in one place with animations and optional interactivity!

Mathbox, free, Javascript library

Quite amazing math visualization library for JS and web. Amazing results, but it’s quite a heavy tool that could be overkill for small diagrams.

Flot, free, Javascript library

Interactive plotting for jQuery. Don’t know much about it (not using JS as much as I’d like for creating web pages).

D3, free, Javascript library

Beautiful data visualizations – but not really for generic graphs.

Desmos calculator, free, online

Nice semi-interactive and powerful online mathematical function plotting tool. Good to not just visualize, but also explain reasoning behind some curve look.

Render diagrams (or second link), free, online

A fresh tool specialized in rendering diagrams and for graphics programmers! Has lots of built in primitives that occur very often in such use case / scenario. Looks like it’s early stage, but very promising, looking forward to its develpment!

Procedural languages and libraries

Asymptote, open source

Dedicated vector graphics language.

Markdeep, open source

“Markdeep is a technology for writing plain text documents that will look good in any web browser (…). It supports diagrams, calendars, equations, and other features as extensions of Markdown syntax.”

Limited diagramming capabilities, but all-in-one document creation solution.

Matplotlib, Python, open source

Go-to Python solution for visualizing data. By default not prettiest, but very powerful and “just works”.

LaTeX with PGF/TikZ, Tex, open source

LaTeX extension/language to define graphs and generally, graphics – programmaticly. Would love to learn it some day and become proficient, but even people who recommended it, warned of steep learning curve.

UMLet Free, has also UI

Free tool to create UML diagrams.

Matlab, $2150, Linux/Mac/Windows

The most expensive option of the list – numerical professional mathematical/engineering package with some graphics options among numerous others.

GNU Octave, open source, all platforms

Free and open source alternative to Matlab! Probably not as powerful for most uses, but for graphs and vector graphics should be comparable. Used it at college for computer vision classes and well, it definitely worked fine for most algebraic problems.

Gnuplot, open source, all platforms

Simple command line tool for plots that I used few times at college. For quick plots very convenient.

Mathematica, $320, Linux/Mac/Windows

Affordable mathematical symbolic and numerical analysis and data exploration software. I wrote about it in the past, used it numerous times for plots in many of my blog posts and always recommend. I also played with graph plotting, though found it a little bit frustrating and not customizable/pretty enough by default.

Graphviz, open source, all platforms

One of those “useful but ugly” pieces of software. Good for quickly creating graphs from code, for anything else no great (?).

Svgwrite for Python, open source, all platforms

Python library for writing directly into SVG. Quite low level, but powerful for specific very “procedural” use cases.

GGplot2 for R, open source, all platforms

I don’t know much about R, but my understanding is that is its matplotlib equivalent. If you use R for data exploration, probably a goto solution.

Editing directly PostScript, open source, all platforms

Very low level and hardcore option; maybe easier with outputting ps directly/proceduraly from code?

 

Posted in Code / Graphics | Tagged , , , , | 3 Comments

Separable disk-like depth of field

This is a short note accompanying shadertoy: https://www.shadertoy.com/view/lsBBWy .

It is direct implementation of “Circularly symmetric convolution and lens blur” by Olli Niemitalo (no innovation on my side, just a toy implementation) and got inspired by Kleber Garcia’s Siggraph 2017 presentation “Circular Separable Convolution Depth of Field”.

Why?

For depth of field effect, Gaussian blurs are seen artistically as “boring”, while hexagonal DoF (popular few years ago) can be subjectively not attractive (artificial, cheap camera qualities). I wrote a bit about bokeh, DoF and some crazy implementation in Witcher 2 in the past. Working on Far Cry 4 and other titles, I used a different approach – scatter as gather “stretched” Poisson bokeh to compute DoF together with motion blur and at the same time.

Circular kernels can be very expensive; prone to noise, ringing etc. One can try to approximate them, like Guerilla Games in Michal Valient’s Killzone presentation, undersample with blur afterwards like Crytek in Tiago Sousa’s presentation, or try to use some separable approaches like Colin BarrĂ©-Brisebois.

Unfortunately, there are no circularly symmetric separable filters other than Gaussian filter in real domain. However, in complex domain, one can find whole family of functions (complex phasors multiplied by Gaussian “bell”) that their magnitude is! This is what Olli Niemitalo’s post describes and introduces some “fitted” functions to approximate disk DoF.

Results – quality

As a proof-of-concept, I implemented it here: https://www.shadertoy.com/view/lsBBWy . It has a version with a single component and a version with two components (“harmonics”).

Version with a single component produces quite strong ringing, but much more “pronounced” bokeh than Gaussian one:

circular_vs_gaussian

Circular bokeh single component approximation vs Gaussian.

Version with two components is going to be twice more expensive memory and ALU-wise, but doesn’t have those artifacts so strong:

circular_two_vs_one

Two component vs single component approximation. Notice less ringing and smaller passband – almost perfect, circular shape.

I personally don’t think this ringing or “donut hole” is unpleasant at all; IMO it resembles some older lenses with “busy bokeh” and is something we simulated in Witcher on purpose:

Witcher 2 artistic “busy bokeh”.

ND7_1568

Real world lens “busy bokeh” with some ringing visible.

Results – implementation / performance

Shader toy implementation I provided is obviously not optimal. Recomputing weights and their sum is very wasteful; most of this can be precomputed and stored in an uniform / constant buffer. Instead I stored most of them in first texture / buffer (“Buffer A”).

If we exclude weights, per every component first (“horizontal”) pass requires a “simplified” real times complex multiply and accumulate (2 madds, one per real and one per imaginary component).

Second (“vertical”) pass requites full complex multiplies and accumulates – however after applying optimization from Olli we can keep real component only – (2 madds).

So assuming that there are N taps in both horizontal and vertical directions, every component needs roughly 4N “full rate”/simple instructions – not bad at all. For two components it’s 8N instructions . Both variation just read normal amounts of memory in the first pass (one fetch per tap), but on 2nd pass have either 2x or 4x more memory bandwidth required.

Similarly larger are the storage costs – every RGB channel is multiplied by 2 and by number of components. So using a single component we’d need 6 floating point channels (that can get negative – complex phasor!), for two components 12, so probably 4 RGBA16F textures.

In general, like with many post effects memory bandwidth is definitely a concern here; this technique requires even more of it and can get pretty expensive, but most blurring can be done in local / groupshared memory in compute shaders; and one can do some optimizations like blurring in YCoCg space and storing chroma components in lower res etc. So in general I definitely think it’s practical and cool approach.

In general I really like this technique, I don’t know if I’d use it in practice (still see lots of benefit in combining DoF and MB), but find it really beautiful and elegant, especially maths behind it. 🙂

References

Circularly symmetric convolution and lens blur“, Olli Niemitalo http://yehar.com/blog/?p=1495

“Circular separable convolution depth of field”, Kleber Garcia http://dl.acm.org/citation.cfm?id=3085022

Killzone: Shadow Fall Demo Postmortem“, Michal Valient https://www.slideshare.net/guerrillagames/killzone-shadow-fall-demo-postmortem

Graphic gems of CryEngine 3“, Tiago Sousa http://www.crytek.com/download/Sousa_Graphics_Gems_CryENGINE3.pdf

Hexagonal Bokeh Blur Revisited“, Colin BarrĂ©-Brisebois https://colinbarrebrisebois.com/2017/04/18/hexagonal-bokeh-blur-revisited/

Posted in Code / Graphics | Tagged , , , , , , , , , , | 2 Comments

Cull that cone! Improved cone/spotlight visibility tests for tiled and clustered lighting

In this blog post, I will present some common spotlight culling techniques and propose one small improvement that is able to correct results of cheap, tiled / clustered culling on the CPU/GPU with almost negligible ALU cost. If you know a lot about tiled light culling, feel free to skip till the last section.

The idea for this post came from the fact that many people talking about light culling talk mostly about spherical point lights. However in many practical applications point lights are actually rare! It is not uncommon to implement even point light rendering and their shadow casting not through cubemaps for shadowmaps, but through 1-6 clipped spot lights (getting the advantage of being able to vary their resolution, cull intersecting geometry separately and compute optimal near/far planes).

Main problem with culling spotlights is that unlike spherical / point lights, many widely used tests cause some overdraw / false positives. With physically-based correct light falloffs and BRDFs, light can get really huge (specular component can be seen very far away) and most common tests are not suited for culling of large primitives. I will have a look at few possible techniques in following sections.

Note: in sample code you will see instructions like sin or cos – obviously you don’t want to call them, but precompute them ahead – this is only for demonstration purposes.

You can find Mathematica notebook that comes with this post here and the pdf version of it here.

Introduction / assumptions

Talking about generic collision tests of spot lights, I am going to approximate them as cones (extra cap is easy to add; in most of those tests it can be skipped though). I will talk about culling them against mini-frustums with focus on tiled lighting – tiled deferred and clustered forward lighting – but they apply in theory to any kind of culling technique. In theory it could also be applied to tiled forward+ for transparent object lighting, however it won’t benefit that much from some of mentioned optimization techniques due to very long frusta. In those cases it is better to use clustered shading anyway! In general, clustered shading is preferable and can be made extremely efficient with some low level hardware tricks, for example as presented by Tiago Sousa and Jean Geffroy.

In most of this post, I will assume that tile frustum is not extremely long, so has relatively uniform dimensions. While it could be considered limiting, in practice it is possible to achieve that assumption even withing just classic tiled deferred by splitting tiles depth bounds based on the depth distribution.

I will also demonstrate ideas here in 2D for simplicity, but they extend one to one to 3 dimensions (unless noted otherwise).

Simple approach – cone as a sphere

First approach is quite trivial. Instead of dealing with a spotlight / cone, we can have a look at its bounding sphere. Obviously taking equivalent pointlight bounding sphere would be extremely wasteful, so instead we want to find an exact bounding sphere of a cone. To find it, we can have a look at just 2 cases: either triangular cone slice is obtuse or not.

coneboundingsphere.gif

Cone slice bounding sphere

In case of obtuse triangle, bounding sphere’s center will be on the cone cap with radius equal to the radius of the cap. In the acute case, we want to find a circumscribed triangle. There is a very easy formula for finding one based on one of triangle angles and length of opposite side, it is A / sin(a). After taking into account half angle and half-diameter and doing some trigonometric reductions, it becomes trivial 1/(2cos(halfAngle)).

float4 boundingSphere(in float3 origin, in float3 forward, in float size, in float angle)
{
	float4 boundingSphere;
	if(angle > PI/4.0f)
	{
		boundingSphere.xyz = origin + cos(angle) * size * forward;
		boundingSphere.w   = sin(angle) * size;
	}
	else
	{
		boundingSphere.xyz = origin + size / (2.0f * cos(angle)) * forward;
		boundingSphere.w   = size / (2.0f * cos(angle));
	}

	return boundingSphere;
}

This can fail in many situations even with a simplest plane vs sphere tests:

cone_boundingsphere_thin_fail.png

cone_boundingsphere_large_fail.png

However in some it will work reasonably well against a single plane. Having precomputed the spheres on the CPU or just ahead of time, it is extremely cheap test to perform.

coneboundingspherevsplane.gif

Unfortunately, this test it suffers from a “corner” problem typical to all primitive vs planes partial swept line tests:

coneboundingspherevsfrustum.gif

Here is a stop frame from this animation, you can see how every plane test succeeds and cone / sphere are not culled by any of them, so the whole bounding sphere cannot be culled:

cone_collision_sphere_frustum_fail.png

Thomas Gareth proposed a simple trick / workaround: precise, Minkowski sum culling of a sphere against bounding box of the frustum.

Here is how it performs in the same case, simple and cheap box test against the sphere gives expected result without false negatives:

coneboundingspherevsbox.gif

Different approach – plane cone tests

While not forgetting about frustum planes test failures, we can try addressing those very thin or very wide cones with better plane vs primitive collision test and see how it’d work in practice.

The main idea is simple – find a closest point on the primitive to the plane and test it against that plane. In 2D this is trivial – the point will be the closest one of 3 triangle corners:

conevsplane.gif

3D case

In 3D it gets a bit more tricky, but same assumptions remain. But this time we either test the cone apex or closest point on the cap rim. Idea one might to do it efficiently comes from Christer Ericson’s book “Real-Time Collision Detection”.

Let me visualize first how the algorithm works:

3d_cone_vs_plane.png

We are trying to find vector V2 that is closest to the plane surface. To do it, we first compute a vector V1 that is perpendicular to both plane normal and the cone “forward” vector. This can be done by performing a simple cross product. We also get a guarantee that this vector will be on the cone cap (since cone cap plane has all vectors perpendicular to cone axis). Then we do yet another cross product with the cone axis – to get a vector perpendicular to both the cone axis (so it will still be on the cone cap) as well as this vector perpendicular to the plane normal. It means it will be a vector that points in direction most similar to the plane normal or in the opposite direction, depending on the order of cross product operations.

It is very important here to not forget a “safe” normalize, since depending on the sine of the initial angle between plane normal and the cone orientation, result can be a very short vector. If it is zero, we can totally ignore it as it means that cone cap is parallel to the plane and we need to only check the apex.

Here we can see how this works in practice for a test that detects an intersection:

3d_cone_vs_plane_collide.png

Finally, we need to also check the cone origin point – since cone cap can be quite far away from the plane, but the cone intersect it anyway:

3d_cone_vs_plane_corner.png

We can code it easily in hlsl for example like this:

bool TestConeVsPlane(in float3 origin, in float3 forward, in float size, in float angle, in float4 testPlane)
{
	const float3 V1 = cross(testPlane.xyz, forward);
	const float3 V2 = cross(V1, forward);

	const float3 capRimPoint = 	origin +
								size * cos(angle) * forward +
								size * sin(angle) * V2;

	return dot(float4(capRimPoint, 1.0f), testPlane) >= 0.0f || dot(float4(origin, 1.0f), testPlane) >= 0.0f;
}

Back to a 2D example

How does it perform? In theory it could be better (we can get much closer than the bounding sphere and still get some culling), but it is still suffering from the same problems like all frustum plane tests:

conevsfrustums.gif

In practice, this is a huge problem. Just have a look in a a little bit contrived scenario of a regular grid:

conevsplanegrid.gif

This looks almost totally useless!

There are few solutions to it. I will mention first a heavy / complicated, but best solution, suggested by Matt Pettineo – rasterizing the cone in tested space and then either simply checking against this rasterized mask, or doing proper, rasterized computation of tile / cluster bounds in 2D. It is not so easy to implement, requires smart tricks to work around lack of conservative rasterization on most available consumer hardware and adds extra passes to the rendering pipeline. I would suggest doing it this way if you have time, but I will propose a bit coarser, but very cheap and simple solution if you don’t.

Flipping the problem

In the idea I propose (I am sure it is not new/revolutionary; however haven’t seen it in any published material related to tiled light culling) I got inspired by both Thomas Gareth and Inigo Quilez. Instead of doing partial separating axis test by testing frustum planes, in case of small frusta against large primitives, we should do the opposite!

Just test the cone against some other primitive.

What other primitive? The easiest one; a sphere – their symmetry makes Minkowski-sum tests quite trivial. Just find a furthest point on the tested primitive towards the sphere center and perform a distance check.

Here is how it would work geometrically:

conevssphere.gif

We can see a vector V that goes from cone origin towards the sphere. Then this vector V is decomposed into perpendicular vectors V1 and V2. V1 is projection of V onto the cone axis that is clipped against cone axis length, while V2 is V minus the projection – clipped against tan(coneAngle) times V1 length.

This way we get a point that is closest to the cone and we can simply check the distance of this point from the sphere center against the radius of the sphere. Note that this is the correct test, giving accurate results in every case – but it is quite expensive (many reciprocals) and has relatively large register pressure (operations on whole 3D vectors).

We can slightly simplify it by not actually decomposing the vector and just comparing few distances – for example as explained by Charles Bloom. Please note that his original test is very simplified and actually doesn’t test the cone range / behind the cone situation, something like:

conevssphere_simplified.gif

To fix it without adding too much extra cost, I’d add some simple distance checks for the distance of the V1 on the cone axis from cone start/end, so the final simplified test would have only one small case for false positive, when both axis and angle/distance check fail (very small area for small sphere radii):

conevssphere_simplified_corrected.gif

Final, corrected version in hlsl might look like that (after multiplying by sin(a)):

bool TestConeVsSphere(in float3 origin, in float3 forward, in float size, in float angle, in float4 testSphere)
{
    const float3 V = testSphere.xyz - origin;
    const float  VlenSq = dot(V, V);
    const float  V1len  = dot(V, forward);
    const float  distanceClosestPoint = cos(angle) * sqrt(VlenSq - V1len*V1len) - V1len * sin(angle);

    const bool angleCull = distanceClosestPoint > testSphere.w;
    const bool frontCull = V1len >  testSphere.w + size;
    const bool backCull  = V1len < -testSphere.w;
return !(angleCull || frontCull || backCull);
}

Is it worth? Let’s have a look at previous grid test (overlapping spheres from tiny boxes):

conevssphere_simplified_grid.gif

I’d say – absolutely yes. 🙂 With frusta that tend to be thinner/longer, the gain won’t be that great (as bounding spheres volume starts to get much larger than frustum volume), but it is very cheap test (cheaper than testing 6 planes!) and still works really well.

conevssphere_simplified_grid_long.gif

Summary

I highly recommend trying sphere vs cone tests, even if the primary primitive for which you are culling is not a sphere at all and it might seem that a sphere will be too conservative. It is also not all or nothing – you can add this relatively cheap test on top of your other, different and existing tests and you are not sacrificing anything (other than some CPU / GPU cycles).

In general, I would also discourage frustum plane tests in probably any case when the tested primitives might be large compared to frusta – you almost always can do better, even with tests that might have seemed more “conservative”. If you don’t have that many lights, you can keep plane tests for some marginal culling extra efficiency in some cases (e.g. cone parallel and close to a very thin frustum). When in doubt – visualize!

And for a general advice, it is often worth looking at the problem from the opposite side / flipping it around. Obviously if you spend too much time on the problem on your own, it often takes talking to your colleagues just to realize that there might be a different solution. Changing domain is extremely common idea in mathematics (integration domain change, variable change, relative coordinate system change etc.), but I tend to often forget about it while solving “easier” CS problems.

References

https://takahiroharada.files.wordpress.com/2015/04/forward_plus.pdf “Forward+: Bringing Deferred Lighting to the Next Level”, Takahiro Harada, Jay McKee, and Jason C.Yang

https://software.intel.com/en-us/articles/deferred-rendering-for-current-and-future-rendering-pipelines “Deferred Rendering for Current and Future Rendering Pipelines”, Andrew Lauritzen

http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf “Clustered Deferred and Forward Shading”, Ola Olsson, Markus Billeter, and Ulf Assarsson

http://advances.realtimerendering.com/s2016/ “The devil is in the details: idTech 666”, Tiago Sousa, Jean Geffroy

http://twvideo01.ubm-us.net/o1/vault/gdc2015/presentations/Thomas_Gareth_Advancements_in_Tile-Based.pdf “Advancements in Tiled-Based Compute Rendering”, Gareth Thomas

http://www.iquilezles.org/www/articles/frustumcorrect/frustumcorrect.htm Inigo Quilez on “correct” frustum culling.

http://realtimecollisiondetection.net/books/rtcd/buy/ Christer Ericson, “Real-time collision detection”

https://mynameismjp.wordpress.com/2016/03/25/bindless-texturing-for-deferred-rendering-and-decals/ “Bindless Texturing for Deferred Rendering and Decals” – Matt Pettineo

http://www.cbloom.com/3d/techdocs/culling.txt “View Culling”, Charles Bloom

Posted in Code / Graphics | Tagged , , , , , , , , , , , , | 9 Comments

Small float formats – R11G11B10F precision

While this post is not yet dithering related, it is in a way a part of my series about dithering. You can check index of all parts here or check the previous part.

I will talk here about use of not very popular / well knows format R11 G11 B10 Float (R11G11B10F) format – its precision, caveats and how to improve it.

I want to note here that this post will not touch on many float subtleties, namely NaNs, denorms and infinities. Lots of GPU shader compilers use fast math anyway (unless asked to do strict IEEE compliance) and ignore them – and programmers have to be double careful when their used values.

You can find Mathematica notebook for this post here and corresponding pdf here.

Update: I updated section about losing dynamic range in denorm range after correction from Tom Forsyth and Peter Pike-Sloan that GPUs are standardized to support denorm on write to small floats.

Problem description

Most common representation of colors in rendering is not integer / natural / rational number representation, but floating point representation. Floating point numbers and their large range are useful for few different reasons, but the most important are:

  • Encoding HDR values and lighting,
  • Need for fractional values when operating on multiple colors, mixing them, filtering with filters with fractional or negative weights,
  • Need for larger precision in darker areas without any gamma encoding,
  • Need for bound relative quantization error (constant upper bound relative to signal magnitude),
  • Fact that floating point numbers are “natural” representation for GPUs (for a long time GPUs didn’t have any integer number support or it was “emulated” using float operations… And still some integer operations are slower than floating point operations).

That said, rendering techniques very rarely store 32bit floating point values even for HDR color – because of both memory storage cost as well as performance. Memory bandwidth and caches are usually most sacred resource and simplistic rule of thumb is “ALU is cheap, memory access is expensive”. Even simplest memory access operations have latencies of hundreds of cycles (at least on AMD GCN). Furthermore, cost increases when texturing unit is used – as filtering operations get more and more expensive and operate with slower rates.

Therefore, rendering programmers usually use smaller float formats as intermediate in-memory storage – 2 most common being RGBA16F (4 16bit half float channels) and R11G11B10F (channels R and G having 11 bit small float and channel B using 10 bit small floats).

Let’s have a look at the difference between those formats and full 32bit IEEE float. If you feel comfortable with float representation, feel free to skip the next section.

Floats – recap

I am assuming here that reader knows how floating values are represented, but as for a reminder – typical floating point value is represented by some bits for:

  • sign – just sign of the number, max single bit value and optional (more later),
  • exponent – some bits that are represented in biased, integer format and describe biased exponent of number of 2 before multiplying with rest of the number,
  • mantissa – some bits representing the fractional part of the number before multiplying by exponent. It is assumed that there is a leading 1, decimal point, so for example mantissa of 01011000 corresponds to number 1.01011000 represented binary (in base of 2).

Therefore final typical number is sign(+/- 1) * 2decoded exponent * 1.mantissa.
There are lots of “special” cases of floats that use special smallest and largest values of exponent (denorms, infinity, NaN, zero), but for the purpose of this post, we will have a look later at only one special case – encoding of zero – it is achieved by putting all exponent and mantissa bits to zero. (note: because sign can be still set, there are two zeros, +0 and -0).

Floating points are a very clever representation with lots of nice properties (for example positive floats interpreted as integers can be sorted or atomically min/maxed! Or that integer zero corresponds to just positive float zero), however come with many problems with precision that are not always the most intuitive. I will be mentioning here only some of them – the ones that are relevant to discussed problem.

Regular and small floats

So far I was trying to stay very generic and not specify any bit numbers, but to use floats in hardware (or software emulation), we need to define them.

Here is a table showing various bit depths of regular 32 bit floats as well as half floats and 11 and 10 bit floats as used by graphics hardware / standards:

Bit depth Sign bit present? Exponent bits Mantissa bits
32 Yes 8 23
16 Yes 5 10
11 No 5 6
10 No 5 5

We can immediately see few interesting observations:

  • 11 and 10 floats do not have sign bit! This decision was probably driven by the fact that they have already quite poor precision for most of uses, so they were designed in graphic APIs only to store color; using a sign bit here would be an extra waste.
  • 16 bit “half” floats and 11 and 10bit floats all have same exponent! This is pretty interesting choice, but it guarantees that they can represent +/- similar range of values. Exponent of 5 guarantees that values can go to 65500 and 65000 (depending on their mantissas), which is pretty large even for HDR lighting (unless using non-biased, absolute exposure values or doing increasing precision trick I will cover later). Exponent can be negative, so we can go to similarly (“one over”) low values.
  • Mantissa suffers the most. The difference is quite crazy – 23 vs. 5 bits in the worst case! We are dropping 18 bits of precision. This is very unfortunate information, as it means that relatively, between numbers that are in similar range (similar exponent), we are losing lots of precision.

Also, because of different bit depths of 11 11 10 float format, problem arises from different mantissa bit depths of blue channel and other channels – it will produce various discolorations and hue shifts – similar to ones that appear often in BC1 block compression (with 565 endpoint bit depths), but not being green/purple, but yellow/blue instead. I will show an example of it later in the post. Obviously, this decision makes sense – 11 11 10 format fits nicely in a single dword and perceptually, human vision is least sensitive to blue channel.

So as we see, we are dropping lots of information by converting 32 bit floats to 16 or 11/10 bits. Furthermore, information loss is not proportional between exponent and mantissa – in every small float case, we lose much more information in the mantissa. This can lead to some quantization and banding errors.

Before analyzing quantization, one thing is worth mentioning – IEEE standard defines few different rounding modes (e.g. to nearest, to zero, to +inf and to -inf). I don’t think they are in any way configurable on GPUs (at least in standard, cross vendor APIs) and I will write rest of the post ignoring this complexity and assuming that simplest rounding is used.

Small float mantissa precision – concrete example

I hope that previous section and looking at some numbers for bit depths shows clearly problem of losing lots of numerical precision of smaller format floating point numbers because of very small mantissa.

First, some numerical example. Let’s take 3 simple, 8 bit integer values and represent them as a float in range 0-1 – common operation for colors.

N[252/255, 8]
0.98823529

N[253/255, 8]
0.99215686

N[254/255, 8]
0.99607843

Let’s try to represent them as floats. Using knowledge about float values and knowing that mantissa always starts with one, we need to multiply them by 2 and exponent will be 2-1.

After multiplication we get:

BaseForm[N[2*252/255, 8], 2]
1.1111100111111001111110100

BaseForm[N[2*253/255, 8], 2]
1.1111101111111011111111000

BaseForm[N[2*254/255, 8], 2]
1.1111110111111101111111100

I highlighted the first 5 bits, why? Recall that 10-bit half float has only 5 bits of mantissa! Therefore 10bit half floats (blue channel of R11 G11 B10F) cannot represent accurately even 3 almost-last 8 bit color values! At the same time, you can see that the next bit actually differs – therefore those 3 numbers will produce 2 different values in 11F and produce wrong coloration of white values.

Small float mantissa precision – visualized

Ok, so we know that small floats cannot represent accurately even simple 8bit luminance! But how bad they really are? I created some Mathematica visualizations (see top of the page for link) – first for the worst case, B10F, so dropping 18 bits of mantissa.

10bit_floatvs8bit_linear.png

Things look ok (or even much better – not surprising given how floats are encoded!) close to zero, but error starts increasing and is almost 4x larger close to one compared to linear 8 bit values quantization error!

This comparison however is quite unfair – we don’t use 8bit linear color because of perceptual sensitivities to darks vs brights (“gamma”) and use sRGB instead, so don’t care as much about those bright areas and decide to encode more information into darker parts. This is how comparison of those 3 methods of encoding look like:

10bit_floatvs8bit_srgb.png

Ok, things are a bit more even. Looks like 10bit float precision is a bit better for values up to linear 0.125, but later get worse. Maximum error is almost 2x larger around 1 for 10 bit floats, not great… This will create visible bands on smooth gradients.

Just for fun, extra visualization, relative error (divided by original value):

8bitsrgb_vs_10float_relative.png

As expected, float value quantization relative error is bounded and has a maximum in ranges corresponding to next exponents (if we don’t count here going lower than minimum normalized float representation), while 8 bit linear or sRGB relative errors increase as we approach zero. Floating point relative error is also represented in “bands” corresponding to next exponents and getting 2x larger between 2 adjacent bands.

We will have a look at how to improve things a bit, but first – a bit more about a second problem.

Small float uneven mantissa length problem

Because R11G11B10 floats have uneven mantissa bit length distribution, they will quantize differently. How bad is it? As always with floats, absolute error depends on the range:

11bit_float_vs_10bit_float.png

The larger the number – the higher the error. In last part of the plot it looks pretty bad:

11bit_float_vs_10bit_float_smallerrange.png

What this different quantization mean in practice? It means that there will be discoloration / wrong saturation of the signal. Let’s have a look at a simple gradient from 0.5 to 0.6.

gradient_quantized.png

This is very bad (if you have a good monitor / viewing conditions). And now imagine that art director that you work with likes contrasty, filmic look with saturation boosted:

gradient_quantized_contrast.png

This looks quite unusable… We will have a look at improving it. In this post by changing the signal dynamic range, in the next post by dithering.

Rescaling will not work

Quite common misconception is that it is enough to multiply a float by large number, encode it and divide after decode. It is not going to work, for example, let’s see quantization error when premultiplying by 16:8bitsrgb_vs_10float_relative_vs_premultiplied.png

Zero difference at all! Why? Let’s think what it means to divide by 16 in float representation. Well, mantissa is not going to change! Only thing is that we will subtract 4 from the exponent. So relative error due to mantissa quantization will be exactly the same. One can try to multiply by a number between 1/2 and 2 and we will see a difference in ranges shifting, but it is going to only shift error to either more white or more dark parts:

8bitsrgb_vs_10float_relative_vs_premultiplied_smaller.png

Error bands only slide left or right.

Improving error by applying some gamma

Let’s have a look here at a different method – that will take advantage of the fact that probably (if image is pre-exposed!) we don’t care about extremely small values, where most precision is positioned (to achieve bound relative precision).

I mentioned in my previous post about dynamic range commonly used workaround for shifting precision precision problems – stretching the dynamic range by taking some power of the signal (smaller or larger). For storing higher precision dark areas of images in integers, we wanted to take lower power for encoding – for example famous gamma 1/2.2. However, in this case we would like to do… the opposite! So taking larger power – to understand why, just look at the original comparison where we introduced sRGB variant:10bit_floatvs8bit_srgb.png

We rescaled blue plot from constantly oscillating in fixed bounds to one that grows. Here with 10bit floats the problem is opposite – we have a function that asymptotically grows too quickly – we want to undo it.

Think a bit about it, it’s quite interesting problem. It has a lot to do with the way floats precision is distributed – it is non-linear, logarithmic distribution that handles large dynamic ranges very well; furthermore, exponential-like signal curve will be represented almost linearly! Therefore to take the most from our floating point representation with low bit depths, we would like to increase dynamic range as much as we can prior to encoding. We can do it by for example squaring the signal or taking larger powers. For the initial 3 floats that I used this requires actually quite large exponent, 3 for given values:

BaseForm[N[2*(252/255)*(252/255)*(252/255), 8], 2]
1.1110111000100100001001000

BaseForm[N[2*(253/255)*(253/255)*(253/255), 8], 2]
1.1111010000001100000101000

BaseForm[N[2*(254/255)*(254/255)*(254/255), 8], 2]
1.1111101000000000000001000

Note how they are different (though first two will round the same way).

Let’s have a look at absolute error with applying gamma 3 (note: this graph assumes correct denorm handling, more below):

8bitsrgb_vs_10float_gamma.png

Our error looks asymptotically smaller than 8bit sRGB error – this could be already quite useful storage base. Our previously banded gradient also looks better, as well as its higher contrast version (though not perfect – recall that contrast kind of redoes the gamma):

Before:

gradient_quantized.png

After:

gradient_quantized_gamma.png

Before with contrast:

gradient_quantized_contrast.png

After with contrast:

gradient_quantized_gamma_contrast.png

There is no free lunch though!

First of all, there is ALU cost. As we do this operation per 3 channels, it can get quite significant! Taking x*x*x is 2 full rate operations, but for example pow(x,1/3) is log2 + exp2 + multiply, so 2 quarter rate + 1 full rate = 9 FR instructions per color channel! Cheapest variant is just squaring and sqrt(x) is a single quarter rate instruction = equivalent of 4 FR instructions.

Secondly, this data is now obviously not filterable / blendable… Blending in this space wold ceate over-brightening. This can be an issue (if you need hw blending or to resample it with bilinear taps) or not (if you can do it all manually / in a compute shader).

Thirdly, this extra precision is achieved by sacrificing the dynamic range. It is +/- equivalent to dividing abs value of exponent by the used gamma. So for example, with gamma 3 our maximum representable value will be around pow(65000,1/3) ~= only 40! Is it HDR enough for your use? If pre-exposing the scene probably yes, but hottest points will be clipped… The squared variant looks much better, as around 250+.

Potential problem with small numbers

Note: this section got slightly rewritten after correction from Tom Forsyth and Peter Pike-Sloan.  My assumptions were pessimistic (denorm flush to zero), but apparently, GPUs in for example DirectX are obliged to handle them correctly. Thanks for noticing that!

Another problem could be in a different part – smallest representable numbers. The same abs of exponent division is applied to smallest representable numbers! Therefore smallest normalized representable number after applying gamma 3 will be 0.03125, which is around 8/255 and if we don’t have denorms or denorms are flushed to zero, this would result in a clip! Without handling denorms, the zoomed-in actual graph of error would look:

10bitgamma_zoom_error.png

As the graph would look:

10bitgamma_clip.png

You could try to fix it by preexposing for example by 4:

10bitgamma_clip_preexposed.png

But not only it’s not perfect, but also you’d start losing it again from the top range. (hottest representable values) Instead of already limiting 40, you’d get only 10! This is probably not enough even for displaying the signal on a HDR TV…

Therefore, if denorms were not handled correctly, I’d rather recommend to stick to gamma 2 with preexposure of 4 and accept the slightly higher quantization errors:

10bitgamma2_clip.png

10bitgamma2_error.png

Fortunately, as I got corrected – this is not the case and we can assume that denorms will be handled – so can use those higher exponents if needed – only thinking about how much dynamic range we are sacrificing in the upper/higher parts.

Before finishing this section, interesting side note: have you ever considered how low is normalized float precision when operating on 16 bit floats? Half floats have same exponent bit depth, so if you apply contrast operations to them, you might be entering denorm range very quickly! Which theoretically could result in clipping.

Untested idea – using YCoCg color space?

Some interesting (?) idea could be trying to use some different color space like YCoCg or similar instead of RGB. In (signed) YCoCg smaller chroma = smaller magnitudes of Co Cg components = more precision. This would help decorrelate color channels and avoid ugly chroma shifts when the color is less saturated (and when those shifts are more visible).

Unfortunately, R11G11B10 has no sign bit available – we would need to store 2 extra sign bits “somewhere” (different surface? lowest bit of mantissa / highest bit of exponent?).

Summary – to use R11G11B10F or not to use?

R11G11B10 and small 11 and 10 bit floats have many limitations, but are also extremely compelling storage format. They halve memory storage and bandwidth requirements compared to RGBA16F, are capable of storing high dynamic range signal and after some numerical tricks also provide precision acceptable in most color encoding scenarios. I use them a lot to non critical signals (ambient buffer, many post effects buffers), but I think that they are practical also for regular color buffers if you don’t need alpha blending or filtering and can tinker with the input data a bit.

Update: I got information from Volga Aksoy and Tom Forsyth that Oculus SDK now supports and recommends outputting into this format, so it is definitely practical. Because of darker / perfect viewing conditions with a HMD, human perception is much more sensitive in darks and R11G11B10F performs better than 8bit sRGB in this lower range.

In the next post I will show how to dither floats and get even better results with almost no perceived banding (trading it for noise).

Bonus – comparison with 10bit sRGB

As a small bonus, simple comparison with 10bit sRGB encoding (no hardware support, but some video out libraries support it to allow for more precise color profile / curves conversions). Two plots show error in full 0-1 range and in 0-0.1 darks range.

final_comparison.png

final_comparison_small.png

We can see that 10bit sRGB is much more superior throughout most of the range, but in very low/dark values 10bit floats are either equivalent or even a bit more superior.

References

https://www.opengl.org/wiki/Small_Float_Formats

http://steve.hollasch.net/cgindex/coding/ieeefloat.html Steve Hollasch, “IEEE Standard 754 Floating Point Numbers”

http://community.wolfram.com/groups/-/m/t/274061 Mathematica help – Convert floating point representation to any scientific notation & back

https://msdn.microsoft.com/en-us/library/windows/desktop/cc308050(v=vs.85).aspx#alpha_11_bit Direct3D 10 Floating point rules

Posted in Code / Graphics | Tagged , , , , , , , , , | 6 Comments

Dithering part three – real world 2D quantization dithering

In previous two parts of this blog post mini-series I described basic uses mentioned blue noise definition, referenced/presented 2 techniques of generating blue noise and one of many general purpose high-frequency low-discrepancy sampling sequences.

In this post, we will look at some more practical example – use of (blue) noise in 2D image dithering for quantization!

You can find Mathematica notebook for this post here and its pdf version here.

Bit quantization in 2D

Finally, we are getting to some practical use case. When you encode your images in 8 bits (typical framebuffer) – you are quantizing. When you encode your GBuffers, render targets and others – you are quantizing. Even typical 8 bit is enough to cause some banding on very smooth, slow gradients – like skies or lens flares.

We will cover here a more extreme example though – extreme 3 bit quantization of a linear gradient.

2dditheringorigsignal.png

2dditheringquantizenodither.png

We call those quantization artifacts – 8 visible bands the “banding” effect.

As we learned in previous parts, we can try to fix it by applying some noise. At first, let’s try regular, white noise:

2dditheringquantizewhitenoise.png

Doesn’t look too good. There are 2 problems:

  • “Clumping” of areas, identical to one we have learned before and we will address it in this post.
  • Still visible “bands” of unchanged values – around center of bands (where banding effect was not contributing to too much error.

2dditheringwhitebands.gif

Error visualized.

Those bands are quite distracting. We could try ti fix them by dithering even more (beyond the error):

2dditheringquantizewhitenoisex2.png

This solves this one problem! However image is too noisy now.

There is a much better solution that was described very well by Mikkel Gjoel.

Using triangular noise distribution fixes those bands without over-noising the image:

2dditheringquantizewhitenoisetriangle.png

Use of triangular noise distribution.

Since this is a well covered topic and it complicates analysis a bit (different distributions), I will not be using this fix for most of this post. So those bands will stay there, but we will still compare some distributions.

Fixed dithering patterns

In previous part we looked at golden ratio sequence. It is well defined and simple for 1D, however doesn’t work / isn’t defined in 2D (if we want it to be uniquely defined).

One of oldest, well known and used 2D dithering patterns is so called Bayer matrix or ordered Bayer. It is defined as a recursive matrix of a simple pattern for level zero:

1 2

3 0

With next levels defined as:

4*I(n-1) + 1  —– 4*I(n-1) + 2

4*I(n-1) + 3  —– 4*I(n-1) + 0

It can be replicated with a simple Mathematica snippet:

Bayer[x_, y_, level_] :=
Mod[Mod[BitShiftRight[x, level], 2] + 1 + 2*Mod[BitShiftRight[y, level], 2],
   4] + If[level == 0, 0, 4*Bayer[x, y, level – 1]]

2dditheringbayer.png

8×8 ordered Bayer matrix

What is interesting (and quite limiting) about Bayer is that due to its recursive nature, signal difference is maximized only in this small 2×2 neighborhood, so larger Bayer matrices add more intermediate steps / values, but don’t contribute much to any visible pattern difference. Therefore most game engines that I have seen used up to 4×4 Bayer pattern with 16 distinctive values.

If you plot a periodogram (frequency spectrum) of it, you will clearly see only 2 single, very high frequency dots!

2dditheringbayerperiodogram.png

2D periodogram – low frequencies are in the middle and high frequencies to the sides.

Obviously signal has some other frequencies, but much lower intensities… Plotting it in log scale fixes it:

2dditheringbayerperiodogramlogscale.png

So on one hand, Bayer matrix has lots of high frequency – would seem perfect for dithering. However presence of strong single frequency bands tends to alias it heavily and produce ugly pattern look.

This is our quantized function:

2dditheringquantizebayer.png

If you have been long enough playing with computers to remember 8 bit or 16 bit color modes and palletized images, this will look very familiar – as lots of algorithms used this matrix. It is very cheap to apply (a single look up from an array or even bit-magic ops few ALU instructions) and has optimum high frequency content. At the same time, it produces this very visible unpleasant patterns. They are much worse for sampling and in temporal domain (next 2 parts of this series), but for now let’s have a look at some better sequence.

Interleaved gradient noise

The next sequence that I this is working extremely well for many dithering-like tasks, is “interleaved gradient noise” by Jorge Jimenez.

Formula is extremely simple!

InterleavedGradientNoise[x_, y_] :=
FractionalPart[52.9829189*FractionalPart[0.06711056*x + 0.00583715*y]]

But the results look great, contain lots of high frequency and produce pleasant, interleaved smooth gradients (be sure to check Jorge’s original presentation and his decomposition of “gradients”):

interleavedgradientnoise.png

What is even more impressive is that such pleasant visual pattern was invented by him as a result of optimization of some “common web knowledge” hacky noise hashing functions!

Unsurprisingly, this pattern has periodogram containing frequencies that correspond to those interleaved gradients + their frequency aliasing (result of frac – similar to frequency aliasing of a saw wave):

2dditheringinterleavedgradientnoiseperiodogram.png

And the 3D plot (spikes corresponding to those frequency):

2ddithering3dplotinterleaved.png

Just like with Bayer, those frequencies will be prone to aliasing and “resonating” with frequencies in source image, but almost zero low frequencies given nice, clean smooth look:

2dditheringquantizeinterleavedgradient.png

Some “hatching” patterns are visible, but they are much more gradient-like (like the name of the function) and therefore less distracting.

Blue noise

Finally, we get again to using a blue noise pre-generated pattern. To recall from previous part, blue noise is loosely defined as a noise function with small low frequency component and uniform coverage of different frequencies. I will use here a pattern that again I generated using my naive implementation of “Blue-noise Dithered Sampling” by Iliyan Georgiev and Marcos Fajardo.

So I generated a simple 64×64 wrapping blue noise-like sequence (a couple hours on an old MacBook):

2dbluenoise.png

It has following periodogram / frequency content:

2dditheringblueperiodogram.png

And in 3D (who doesn’t love visualizations?! 😀 ):

2ddithering3dplot.png

Compared to white noise, it has a big “hole” in the middle, corresponding to low frequencies.

2dwhitenoisevsbluenoise.gif

White noise vs blue noise in 2D

At the same time, it doesn’t have linear frequency increase for higher frequencies, like audio / academic definition of blue noise. I am not sure if it’s because my implementation optimization (only 7×7 neighborhood is analyzed + not enough iterations) or the original paper, but doesn’t seem to impact the results for our use case in a negative way.

Without further ado, results of dithering using 2D blue noise:
2dditheringquantizeblue.png

It is 64×64 pattern, but it is optimized for wrapping around – so border pixels error metric is computed with taking into account pixels on the other side of the pattern. In this gradient, it is repeated 2×2.

And this is how it compared to regular white noise:

2dwhitenoisevsbluenoise.gif

White noise vs blue noise

Because of high frequency contents only, you can’t see this problematic “clumping” effect.

It also means that if we oversample (like with all those new fancy 1080p -> 1440p -> 2160p displays), blur it or apply temporal (one of next parts), it will be more similar to original pattern! So when we filter them with 2-wide Gaussian:

2dditheringwhitevsbluefiltered.png

Left: Gaussian-filtered white noise dithering. Right: Gaussian-filtered blue noise dithering.

And while I said we will not be looking at triangle noise distribution in this post for simplicity, I couldn’t resist the temptation of comparing them:

2dwhitenoisevsbluenoisetriangle.gif

White noise vs blue noise with triangular distribution remapping applied.

I hope that this at least hints at an observation that with a good, well distributed large enough blue noise fixed pattern you might get results maybe not the same quality level of error diffusion dithering, but in that direction and orders of magnitude better than standard white noise.

All four compared

Just some visual comparison of all four techniques:

2dditheringallfourcompared.png

White noise, blue noise, Bayer, interleaved gradient noise

2dditheringalltechniquescompared.gif

And with the triangular remapping:

2dditheringallfourcomparedtriangular.png

2dditheringallfourcomparedtriangular.gif

 

My personal final recommendations and conclusions and here would be:

  • Whenever possible, avoid ordered Bayer! Many game engines and codebases still use it, but it produces very visible unpleasant patterns. I still see it in some currently shipped games!
  • If you cannot spare any memory look-ups but have some ALU, use excellent interleaved gradient noise by Jorge Jimenez. It produces much more pleasant patterns and is extremely cheap with GPU instruction set! However patterns are still noticeable and it can alias.
  • Blue noise is really great noise distribution for many dithering tasks and if you have time to generate it and memory to store it + bandwidth to fetch, it is the way to go.
  • White noise is useful for comparison / ground truth. With pixel index hashing it’s easy to generate, so it’s useful to keep it around.

Summary

In this part of the series, I looked at the topic of quantization of 2D images for the purpose of storing them at limited bit depth. I analyzed looks and effects of white noise, ordered Bayer pattern matrices, interleaved gradient noise and blue noise.

In next part of the series (coming soon), we will have a look at the topic of dithering in more complicated (but also very common) scenario – uniform sampling. It is slightly different, because often requirements are different. For example if you consider rotations, values of 0 and 2pi will “wrap” and be identical – therefore we should adjust our noise distribution generation error metric for this purpose. Also, for most sampling topics we will need to consider more that 1 value of noise.

Blog post mini-series index.

Edit 10/31/2016: Fixed triangular noise remapping to work in -1 – 1 range instead of -0.5-1.5. Special thanks to Romain Guy and Mikkel Gjoel for pointing out the error.

References

http://loopit.dk/banding_in_games.pdf “Banding in games”, Mikkel Gjoel

https://engineering.purdue.edu/~bouman/ece637/notes/pdf/Halftoning.pdf C. A. Bouman: Digital Image Processing – January 12, 2015

B. E. Bayer, “An optimum method for two-level rendition of continuous-tone pictures”, IEEE Int. Conf. Commun., Vol. 1, pp. 26-1 1-26-15 (1973).

http://www.iryoku.com/next-generation-post-processing-in-call-of-duty-advanced-warfare “Next generation post-processing in Call of Duty: Advanced Warfare”, Jorge Jimenez, Siggraph 2014 “Advances in Real Time Rendering in games”

“Blue-noise Dithered Sampling”, Iliyan Georgiev and Marcos Fajardo

https://github.com/bartwronski/BlueNoiseGenerator 

http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/ “Quick And Easy GPU Random Numbers In D3D11”, Nathan Reed

Posted in Code / Graphics | Tagged , , , , , , , , , , , , | 6 Comments

Dithering part two – golden ratio sequence, blue noise and highpass-and-remap

In previous part of the mini-series I covered dithering definition and how dithering changes error characteristics of simple 1D quantization and functions.

In this part I will try to look at what blue noise is, but first wanted to have a look at a number sequence that I used in the previous post and I find very useful.

You can find a Mathematica notebook for golden sequence here and its pdf version here.

For the second part of the post you can find the notebook here and its pdf version here.

Golden ratio sequence

In previous post I used “some” quasi-random function / sequence and mentioned that it’s not perfect, but very useful. The sequence is a sequence made of fractional part of next multiplications of a golden number.

FractionalPart[N*GoldenRatio]

I found idea for using it in a paper “Golden Ratio Sequences For Low-Discrepancy Sampling” by Colas Schretter and Leif Kobbelt.

This is incredibly fascinating sequence, as it seems to distribute next values very well and quite far apart:

golden_sequence_lines.gif

The differences between next elements are in modulus 1:

{0.381966, 0.618034, 0.381966, 0.381966, 0.618034, 0.381966, 0.618034, 0.381966}

So oscillating golden number modulo 1 itself and 2 minus golden number modulo 1. Both numbers are distant enough from zero and one to produce well-distributed sequence where next samples add lots of information.

Edit: Mikkel GjÞl observed that modulo/toroidal minimum distance of 0.618034 is 0.381966, so the sequence becomes:

{0.381966, 0.381966, 0.381966, 0.381966, 0.381966, 0.381966, 0.381966, 0.381966}

I find it a remarkably beautiful property!

Even for small number of “samples” in the sequence, they cover whole 0-1 range very well:

golden_sequence_comparison.gif

Numbers plotted as colors also look “pleasant”:

golden1d.png

If we look at its periodogram:

GoldenNumberSequencePeriodogram.png

We also find some fascinating properties. First of all, energy seems to increase with frequencies. There are visible “spikes” in some frequencies and what is even more interesting is that every next spike happens at a frequency that is golden ratio times higher and it has golden ratio times more energy! I don’t have any explanation for it… so if you are better than me at maths, please contribute with a comment!

This frequency characteristic is extremely useful, however doesn’t satisfy all of our dithering needs. Why? Imagine that our source signal that we are dithering contains same frequencies. Then we would see extra aliasing in those frequencies. Any structure in noise used for dithering can become visible, and can produce undesired aliasing.

I still find this sequence extremely valuable and use it heavily in e.g. temporal techniques (as well as hemispherical sampling). There are other very useful low-discrepancy sequences that are heavily used in rendering – I will not cover those here, instead reference the physically based rendering “bible” – “Physically Based Rendering, Third Edition: From Theory to Implementation” by Matt Pharr, Wenzel Jakob and Greg Humphreys and chapter 7 that authors were kind enough to provide for free!

For now let’s look at blue noise and theoretically “perfect” dithering sequences.

Blue noise

What is blue noise? Wikipedia defines it as:

Blue noise is also called azure noise. Blue noise’s power density increases 3 dB per octave with increasing frequency (density proportional to f ) over a finite frequency range. In computer graphics, the term “blue noise” is sometimes used more loosely as any noise with minimal low frequency components and no concentrated spikes in energy.

And we will use here this more liberal definition (with no strict definition of frequency distribution density increase).

We immediately see that previous golden ratio sequence is not blue noise, as it has lots of visible spikes in spectrum. Perfect blue noise has no spikes and therefore is not prone to aliasing / amplifying those frequencies.

There are many algorithms for generating blue noise, unfortunately many of them heavily patented. We will have a look at 2 relatively simple techniques that can be used to approximate blue noise.

Generating blue noise – highpass and remap

The first technique we will have a look at comes from Timothy Lottes and his AMD GPUOpen blog post “Fine art of film grain”.

The technique is simple, but brilliant – in step one let’s take a noise with undesired frequency spectrum and just reshape it by applying high pass filter.

Unfortunately, arbitrary high pass filter will produce a signal with very uneven histogram and completely different value range than original noise distribution:

afterHighpass.png

After arbitrary highpass operation of random noise originally in 0-1 range.

This is where part 2 of the algorithm comes in. Remapping histogram to force it to be in 0-1 range! Algorithm is simple – sort all elements by value and then remap the value to position in the list.

Effect is much better:

afterHighpassAndRemap.png

Unfortunately, histogram remapping operation also changes the frequency spectrum. This is inevitable, as histogram remapping changes relative value of elements not linearly. Values in middle of the histogram (corresponding to areas that originally had lost of low frequency component) will be changed much more than values in areas with high frequency content. This way part of high-pass filtering effect is lost:

highpass_periodogram_before_after_remap.png

Comparison of histogram before (red) and after (black) the remap, renormalized manually. Note how some low frequency component reappeared.

Still, effect looks pretty good compared to no high pass filtering:

before_after_lowpass.png

Top – regular random noise. Bottom – with high pass and histogram remapping.

Its frequency spectrum also looks promising:

afterHighpassAndRemapPeriodogram.png

However, there is still this trailing low pass component. It doesn’t contain lots of energy, but still can introduce some visible low pass error in dithered image…

What we can do is to re-apply the technique again!

This is what we get:

highPassRemapTwice.gif

Frequency spectrum definitely looks better and whole algorithm is very cheap so we can apply it as many times as we need.

Unfortunately, no matter how many times we will reapply it, it’s impossible to “fix” all possible problematic spots.

I think about it this way – if some area of the image contains only very low frequency, after applying highpass filter, it will get few adjacent values that are very close to zero. After histogram remapping, they will get remapped to again similar, adjacent values.

HighpassAndRemapUnfixable.png

Small part of a sequence with a local minimum that algorithm repeated even 10 times cannot get out of. Notice few areas of almost uniform gray.

It’s possible that using a different high pass filter or adding some noise between iterations or detecting those problematic areas and “fixing” them would help – but it’s beyond scope of this post and the original technique.

What is worth noting is that original algorithm gives sequence that is not perfect, but often “good enough” – it leaves quite bad local spots, but optimizes frequency spectrum globally .Let’s check it in action.

Results

Let’s have a look at our initial, simple 1D dithering for binary quantization:

HighpassAndRemapDitherComparison.png

Rows 1, 3, 5 – original sine function. Row 2 – dithering with regular noise. Row 4 – dithering with golden ratio sequence. Row 6 – dithering with “highpass and remap” blue-noise-like sequence.

We can see that both golden ratio sequence and our highpass and remap are better than regular noise. However it seems like golden ratio sequence performs better here due to less “clumping”. You can see though some frequency “beating” corresponding to peak frequencies there:

HighpassRemapDitherPeriodograms.gif

Black – white noise. Red – golden ratio sequence. Green – highpass and remap noise sequence.

So this is not a perfect technique, but a) very fast b) tweakable and c) way better than any kind of white noise.

Better? Slower blue noise

Ok, what could we do if we wanted some solution that doesn’t contain those local “clumps”? We can have a look at Siggraph 2016 paper “Blue-noise Dithered Sampling” by Iliyan Georgiev and Marcos Fajardo from Solid Angle.

The algorithm is built around the idea of using probabilistic technique of simulated annealing to globally minimize desired error metric (in this case distance between adjacent elements).

I implemented a simple (not exactly simulated annealing; more like a random walk) and pretty slow version supporting 1, 2 and 3D arrays with wrapping: https://github.com/bartwronski/BlueNoiseGenerator/

As usually with probabilistic global optimization techniques, it can take pretty damn long! I was playing a bit with my naive implementation for 3D arrays and on 3yo MacBook after a night running it converge to at best average quality sequence. However, this post is not about the algorithm itself (which is great and quite simple to implement), but about the dithering and noise.

For the purpose of this post, I generated a 2000 elements, 1D sequence using my implementation.

This is a plot of first 64 elements:

generatedbluepiece.png

Looks pretty good! No clumping, pretty good distribution.

Frequency spectrum also looks very good and like desired blue noise (almost linear energy increase with frequency)!

generatedbluefrequencies.png

If we compare it with frequency spectrum of “highpass and remap”, they are not that different; slightly less very low frequencies and much more of desired very high frequencies:

highpassremapvsgenerated.gif

Highpass and remap (black) vs Solid Angle technique (red).

We can see it compared with all other techniques when applied to 1D signal dithering:

all1dtechniquescomparison.png

Every odd row is “ground truth”. Even rows: white noise, golden ratio sequence, highpass and remap and finally generated sequence blue noise.

It seems to me to be perceptually best and most uniform (on par with golden ratio sequence).

We can have a look at frequency spectrum of error of those:

all1dtechniquescomparedfrequencies.gif

Black – white noise. Red – golden ratio sequence. Green – highpass and remap. Yellow – generated sequence.

If we blur resulting image, it starts to look quite close to original simple sine signal:

comparisonall1dblurred.png

If I was to rate them under this constraints / scenario, I would probably use order from best to worst:

  • Golden ratio sequence,
  • Blue noise generated by Solid Angle technique,
  • Blue noise generated by highpass and remap,
  • White noise.

But while it may seem that golden ratio sequence is “best”, we also got lucky here, as our error didn’t alias/”resonate” with frequencies present in this sequence, so it wouldn’t be necessarily best case for any scenario.

Summary

In this part of blog post mini-series I mentioned blue noise definition, referenced/presented 2 techniques of generating blue noise and one of many general purpose high-frequency low-discrepancy sampling sequences. This was all still in 1D domain, so in the next post we will have a look at how those principles can be applied to dithering of a quantization of 2D signal – like an image.

Blog post mini-series index.

References

https://www.graphics.rwth-aachen.de/media/papers/jgt.pdf “Golden Ratio Sequences For Low-Discrepancy Sampling”, Colas Schretter and Leif Kobbelt

“Physically Based Rendering, Third Edition: From Theory to Implementation”, Matt Pharr, Wenzel Jakob and Greg Humphreys.

https://en.wikipedia.org/wiki/Colors_of_noise#Blue_noise 

http://gpuopen.com/vdr-follow-up-fine-art-of-film-grain/ “Fine art of film grain”, Timothy Lottes

“Blue-noise Dithered Sampling”, Iliyan Georgiev and Marcos Fajardo

Posted in Code / Graphics | Tagged , , , , , , , | 10 Comments

Dithering part one – simple quantization

Introduction

First part of this mini-series will focus on more theoretical side of dithering -some history and applying it for 1D signals and to quantization. I will try to do some frequency analysis of errors of quantization and how dithering helps them. It is mostly theoretical, so if you are interested in more practical applications, be sure to check the index and other parts.

You can find Mathematica notebook to reproduce results here and the pdf version here.

What is dithering?

Dithering can be defined as intentional / deliberate adding of some noise to signal to prevent large-scale / low resolution errors that come from quantization or undersampling.

If you have ever worked with either:

  • Audio signals,
  • 90s palletized image file formats.

You must have for sure encountered dithering options that by adding some noise and small-resolution artifacts “magically” improved quality of audio files or saved images.

However, I found on Wikipedia quite an amazing fact about when dithering was first defined and used:


[O]ne of the earliest [applications] of dither came in World War II. Airplane bombers used mechanical computers to perform navigation and bomb trajectory calculations. Curiously, these computers (boxes filled with hundreds of gears and cogs) performed more accurately when flying on board the aircraft, and less well on ground. Engineers realized that the vibration from the aircraft reduced the error from sticky moving parts. Instead of moving in short jerks, they moved more continuously. Small vibrating motors were built into the computers, and their vibration was called dither from the Middle English verb “didderen,” meaning “to tremble.” Today, when you tap a mechanical meter to increase its accuracy, you are applying dither, and modern dictionaries define dither as a highly nervous, confused, or agitated state. In minute quantities, dither successfully makes a digitization system a little more analog in the good sense of the word.

— Ken Pohlmann, Principles of Digital Audio
This is inspiring and interesting historical fact and as I understand it that it works by avoiding bias in computations and resonances by randomly breaking up some mechanical vibration feedback loops.
But history aside, let’s look at the dithering process for 1D signals first, like audio.

Dithering quantization of a constant signal

We will start first with analyzing the most boring possible signal – a constant signal. If you know a bit about audio and audio-related DSP, you might ask – but you promised looking at audio and and audio by definition cannot have a constant term! (furthermore, both audio software and hardware deliberately remove so called DC offset)
That’s true and we will have a look at more complicated functions in a second, but first things first.
Imagine that we are doing a 1 bit quantization of a normalized floating point signal. This means we will be dealing with final binary values, 0 or 1.
If our signal is 0.3, simple rounding without any dithering will be the most boring function ever – just zero!
Error is also constant, 0.3 and therefore average is also 0.3. This means that we introduced quite big bias to our signal and completely lost original signal information.
We can try to dither this signal and have a look at results.
Dithering in this case (used rounding function) is applying just plain, random white noise (random value per every element, producing uniform noise spectrum) and adding random value from range (-0.5, 0.5) to the signal prior to quantization.

quantizedDitheredSignal =
  Round[constantSignalValue + RandomReal[] – 0.5] & /@ Range[sampleCount];

Constant_dither_noise.png
Constant_dither_noise_img.png
It’s difficult to see anything here, just that now result of quantization is some random ones and zeros… With (as hopefully expected) more zeros. It’s not terribly interesting signal on its own, but what is quite interesting is the plot of the error and average error.
Constant_dither_error.png
Ok, we can see that as expected, error is also alternating… but what is quite scary is that error got sometimes bigger (0.7 absolute value)! So our maximum error is worse, pretty unfortunate… however, average noise is:

Mean[ditheredSignalError]
0.013

Much much smaller than original error of 0.3. With sufficiently large amount of samples this error would go towards zero (limit). So error for constant term got much smaller, but let’s have a look at frequency plot of all errors.

spectrum_quantization_noise_comparison.gif

Red plot/spike = frequency spectrum of error when not using dithering (constant, no frequencies). Black – with white noise dithering.

Things are getting more interesting! This shows first major takeaway of this post – dithering distributes quantization error / bias among many frequencies.
We will have a look in the next section how it helps us.

Frequency sensitivity and low-pass filtering

So far we observed that dithering a quantized constant signal:
  • Increased maximal error.
  • Almost zeroed average, mean error.
  • Added constant white noise (full spectral coverage) to the error frequency spectrum, reducing the low-frequency error.
By itself it doesn’t help us too much… However, we are not looking at quantization of any arbitrary mathematical function / signal. We are looking here at signals that will be perceived by humans. Human perception is obviously limited, some examples:
  • Our vision has a limit of acuity. Lots of people are short-sighted and see blurred image of faraway objects without correction glasses.
  • We perceive medium scale of detail much better than very high or very low frequencies (small details of very smooth gradients may be not noticed).
  • Our hearing works in specific frequency range (20Hz -20kHz, but it gets worse with age) and we are most sensitive to middle ranges – 2kHz-5kHz.

Therefore, any error in frequencies closer to upper range of perceived frequency will be much less visible.

Furthermore, our media devices are getting better and better and provide lots of oversampling. In TVs and monitors we have “retina”-style and 4K displays (where it’s impossible to see single pixels), in audio we use at least 44kHz sampling file formats even for cheap speakers that often can’t reproduce more than 5-10kHz.

This means, that we can approximate perceptual look of a signal by low-pass filtering it. Here I did a low pass filtering (padding with zeros on the left -> “ramp up”):

Constant_dither_noise_lowpass.png

Red – desired non-quantized signal. Green – quantized and dithered signal. Blue – low pass filter of that signal.

Signal starts to look much closer to original, unquantized function!
Unfortunately we started to see some low frequencies that are very visible and were not present in the original signal. We will look at fixing it by using blue noise in part 3 of the series, but for now this is how it could look like with some quasi-noise function that has much less lower frequency content:
Constant_dither_noise_golden_lowpass.png
This is possible because our quasi-random sequence has following frequency spectrum:
Constant_dither_noise_golden_spectrum.png
But for now enough looking at simplistic, constant function. Let’s have a look at a sine wave (and if you know Fourier theorem – a building block of any periodic signal!).

Quantizing a sine wave

If we quantize a sine wave with 1 bit quantization, we get a simple… square wave.
sine_quantize.png
Square wave is quite interesting, as it comprises base frequency as well as odd harmonics.
It is interesting property that is used heavily in analog subtractive synthesizers to get hollow/brassy sounding instruments. Subtractive synthesis starts with a complex, harmonically rich sound and filters it by removing some frequencies (with filter parameters varying over time) to shape sounds in desired way.
Square wave frequency spectrum:
sine_quantize_spectrum.png
But in this post we are more interested in quantization errors! Let’s plot the error as well as frequency spectrum of the error:
sine_quantize_error.png
sine_quantize_error_spectrum.png
In this case we are in much better situation – average error is close to zero! Unfortunately, we still have lots of undesired low frequencies, very close to our base frequency (odd multiplies with decreasing magnitudes). This is known as aliasing or dithering noise – frequencies that were not present in the original signal appear and they have pretty large magnitudes.
Even low-pass filtering cannot help this signal much… As error has so many low frequencies:

sine_quantize_lowpass.png

Low-pass filtered quantized sine

sine_quantize_error_lowpass.png

Low-pass filtered quantized sine error

Let’s have a look at how this changes with dithering. At first sight things don’t improve a lot:
sine_quantize_dither.png
However if we display it as an image, it starts to look better:
sine_quantize_dither_img.png
And notice how again, quantization error gets distributed among different frequencies:
spectrum_quantization_noise_comparison_sine.gif
This looks very promising! Especially considering that we can try to filter it now:
sine_quantize_dither_lowpass.png
That’s slightly crooked sine, but looks much closer to original one than the non-dithered one with the exception of a phase shift introduced by asymmetrical filter (I am not going to describe or cover it here; it is fixable simply by applying symmetrical filters):

sine_quantize_dither_lowpass_comparison.png

Red – original sine. Green – low pass filtered undithered signal. Blue – low pass filtered dithered signal.

Plotting both error functions confirms numerically that error is much smaller:

sine_quantize_dither_lowpass_error.png

Red – error of low-pass filtered non-dithered signal. Blue – error of low-pass filterer dithered signal.

Finally, let’s just quickly look at a signal with better dithering function containing primarily high frequencies:

sine_quantize_dither_vs_golden_img.png

Upper image – white noise function. Lower image – a function containing more higher frequencies.

sine_quantize_dither_golden_lowpass_comparison.png

Low-pass filtered version dithered with a better function – almost perfect results if we don’t count filter phase shift!

And finally – all 3 comparisons of error spectra:

spectrum_quantization_noise_golden__comparison_sine.gif

Red – undithered quantized error spectrum. Black – white noise dithered quantized error spectrum. Blue – noise with higher frequencies ditherer error spectrum.

Summary

This is end of part one. Main takeaways are:
  • Dithering distributes quantization error / bias among many different frequencies that depend on the dithering function instead of having them focused in lower frequency area.
  • Human perception of any signal (sound, vision) works best in very specific frequency ranges. Signals are often over-sampled for end of perception spectrum in which perception is almost marginal. For example common audio sampling rates allow reproducing signals that most adults will not be able to hear at all. This makes use of dithering and trying to shift error into this frequency range so attractive because of previous point.
  • Different noise functions produce different spectra of error that can be used knowing which error spectrum is more desired.

In the next part we will have a look at various dithering functions – the one I used here (golden ratio sequence) and blue noise.

Blog post mini-series index.

Posted in Code / Graphics | Tagged , , , , , | 2 Comments

Dithering in games – mini series

This an opening post of mini blog post series about various uses of dithering for quantization and sampling in video games. It is something most of us use intuitively in every day work, so wanted to write down some of those concepts, explain and analyze them in Mathematica.

This post is just a table of contents.

Part one – dithering – definition and simple 1D quantization.

Part two – dithering – golden ratio sequence, white and blue noise for 1D quantization.

Part three – dithering in quantization of 2D images, Bayer matrix, interleaved gradient noise and blue noise.

Update April 2020: I didn’t expect to come back to blue noise a few years later, but I wrote a small additional blog post on alternative method for generating blue noise patterns for dithering by “optimizing” frequency spectrum. You can treat it as part three-point-five. 🙂

You can find some supplemental material for posts here: https://github.com/bartwronski/BlogPostsExtraMaterial/tree/master/DitheringPostSeries

Posted in Code / Graphics | Tagged , , , , , , , | 6 Comments

Short names are short

Intro

This blog post is a counter-argument and response to a post by Bob Nystrom that got very popular few months ago and was often re-shared. I disagree so much that I thought it’s worth sharing my opinion on this topic.
Mine response here is very subjective, and contains opinions about quite a polarizing topic – if you want to read a pure facts / knowledge sharing post, you may as well stop reading it. This post is supposed to be short, fun and a bit provocative.
opinion
Also, I don’t think that single piece of advice “use shorter names” or “use longer names” without any context has any real value, so at the end instead of just ranting I will try to summarize my recommendations.
Having said that and if you want to read some rant, let’s go!

What I agree with

Some personal story here – I used to write terrible, hacked code that now I am ashamed of. Fortunately, working at Ubisoft with colleagues with much higher standards and some discussion with them helped me understand clean code value. Working on one of codebases that they have cleaned up I was amazed how pleasant it can be to work with clear, good quality and well architected code and it made me realize my wrongdoings!
Today I believe that readability and simplicity are the most important features of good code (assuming that it’s correct, performs good enough etc.) to the point that I would always keep clear reference implementation until it’s really necessary to do magic, unclear optimization(s). You write code once, but you and your colleagues read it thousands of times and for years etc.
I definitely agree with Bob Nystrom that names should be:
Clear.
Precise.
However, this is I guess where points that we could agree upon end. I don’t think that short names for variables or functions serve readability at all!

Understandability without context

Imagine that you debug a crash or weird functionality in a piece of code you are not familiar with. You start unwinding a callstack and see a variable named “elements” or looking at author’s example, “strawberries”.
What the hell is that? What are those elements? Is it local, temp array, or a member of debugged class? Wouldn’t tempCollectedStrawberiesToGarnish or m_garnishedStrawberies be more readable?
If you look at a crashdump on some crash aggregation website server and before downloading it and opening in IDE, it gets even more difficult! You will have completely no idea what given code does.
And this is not uncommon scenario – we work in teams and our teammates will debug our code many times. In a way, we are writing our code for them…

Confusion and ambiguity

Second thing – I don’t believe that short names can be precise. Working in the past with some component based game engines and seeing “m_enabled” being checked / set with some extra logic around, I just wanted to face-palm. Fortunately people were not crazy enough to skip the member prefix and just operate on some “enabled” variable in long function – this would lead to even more / extreme confusion!
What does it mean that component is enabled / disabled? I guess that animation component is not updating skeleton / hierarchy (or is it?) and mesh component is probably not rendered, but how can I be sure? Wouldn’t be m_isSkeletonUpdated or m_pauseTick or m_isVisible more readable?
Side note: this point could be an argument against as well general class polymorphism / inheritance and reusing same fields for even slightly different functionalities.

Less context switching

With slightly longer and more verbose names, it is easier to keep all information on one screen and within single mental “context”. The shorter the name, the larger and less obvious context you need to understand its purpose, role, lifetime and type.
If you need to constantly remind yourself of class name or variable type, or even worse need to check some name/type of parent class (again, not a problem if you don’t (ab)use OOP) the less effective and more distracted you are. In long, complex code this context switching can be detrimental to code understanding and make you less focused.

Search-ability

This is probably my biggest problem with short names and a biggest NO. I use grep-like tools all the time and find them better and more robust than any IDE-specific symbol searching. Don’t get me wrong! For example VisualAssistX is an amazing extension and I use it all the time. It’s way faster than IntelliSense, however still can choke on very large code solutions – but this is not the main issue.
The main issue is that every codebase I worked in (and I guess any other serious and large codebase) has many different languages. I work daily with C/C++, HLSL/PSSL, Python, JSON, some form of makefiles and custom data definition languages. To look for some data that can be in either of those places (sometimes in many!) I use good, old “search in files”. I can recommend here a plugin called Entrian Source Search (colleague from Ubisoft, Benjamin Rouveyrol recommended it to me and it completely transformed the way I work!) and it perfectly solves this problem. I can easily look for “*BRDF*” or “SpecularF0” and be sure that I’ll find all HLSL, C++ and data definitions references.
Going back to the main topic – this is where short, ambiguous names completely fail! If you find 1000 references or given name then it could be considered just useless.
Just some examples.
Let’s look for “brilliantly” named variable or function enable – why would anyone need more context?
enable.png
Note that this shows only whole words matching! Hmm, not good… How about “brilliant” short name update?
update.png
Good luck with checking if anyone uses your class “update” function!

Help with refactoring

Related to the previous topic – before starting refactoring, I always rely on code search. It’s really useful to locate it where given variable / function / class is being used, why, can it be removed?
Large codebases are often split into many sub-solutions to be lighter on memory (if you worked in a huge one in Visual Studio you know the pain!) and more useful for daily use / common use scenario. This is where IDE symbol search fails completely and can be an obstacle in any refactoring.
Again – in my personal opinion and experience, using straight grep-like search works much better than any flaky symbol search, and works across many code solutions and languages. Good, uncommon, clear and unique names really help it. I can immediately see when a function or variable is not used anywhere, who uses it, which parts of the pipeline need information about it. So basically – all necessary steps in planning a good refactor!

Risks with refactoring

This point is a mix of previous one and section about confusion / ambiguity. It is a) very easy to misunderstand code with common names b) overload the original term, since it’s short and seems kind-of still accurate. This leads often to even more meaningless and confusing terms.
If you have a longer, more verbose name then you will think twice before changing its type or application – and hopefully rename it and/or reconsider/remove all prior uses.

Self-documenting code

I believe in use of comments do document “non-obvious” parts of an algorithm or some assumptions (much better than offline documents that get sent in emails and lost or are put at some always-outdated wiki pages), but hate code constantly interleaved with short comments – for me because of context switching it increases cognitive load. Code should be self-documenting to some extent and I think it’s possible – as long as you don’t try to remove whole context information from variable names.

When using short names is ok?

Ok, I listed hopefully enough arguments against using short, context-less names – but I sometimes use them myself!
As I said at the beginning – it all depends and “single advice fits all” attitude is usually piece of crap.
My guidelines would be – use rather long, unique and memorable names, giving enough information about the context, unless:
  1. It’s something like iterator in a short loop. I use C-style names like i, j, k quite often and don’t see it as a major problem provided that the loop is really short and simple.
  2. In general, it’s a local variable in a short function with clear use. So think about it this way – if someone else “lands” there accidentally with a debugger, would they require understanding of the system to figure out its purpose?
  3. If they are class or function names, only if they are guaranteed to be local in the scope of a file and not accessible from the outside. If you change such function from static to global, make sure you change the name as well!
  4. It is a part of convention that all programmers in the codebase agreed upon. For example that your physics simulation classes will start with a “P”, not “Physics”.
  5. You use POD types like structs and never write functions inside them – it’s fine to have their names short as you know they relate to the struct type.
  6. Similar to 5 – you use (not abuse!) namespaces and/or static class functions to provide this extra scope and information and always access it with a prefix (no “using” so search-ability is not impacted by it).

Summary

Rant mode over! 🙂 I hope that this provocative post shown some disadvantages of over-simplified pieces of advice like “long names are long” and some benefits of slightly increased verbosity.
Posted in Code / Graphics | Tagged , , , , , | 1 Comment

Image dynamic range

Intro

This post is a second part of my mini-series about dynamic range in games. In this part I would like to talk a bit about dynamic range, contrast/gamma and viewing conditions.

You can find the other post in the series here and it’s about challenges that lay ahead of properly exposing a scene with too large luminance dynamic range – I recommend checking it out! 🙂

This post is accompanied by Mathematica Notebook so you can explore those topics yourself.

.nb Matematica version

PDF version of the notebook

Dynamic range

So, what is dynamic range? It is most simply a ratio between highest possible value a medium can reproduce / represent and lowest one. It is measured usually in literally “contrast ratio” (a proportion like for example 1500:1), decibels (difference in logarithm in base of 10) or “stops” (or simply… bits; as it is difference in logarithms in base of 2). In analog media, it is represented by “signal to noise ratio”, so lowest values are the ones that cannot be distinguished from noise. In a way, it is analogous to digital media, where under lowest representable value is only quantization noise.

This measure is not very useful on its own and without more information about representation itself. Dynamic range will have different implications on analog film (where there is lots of precision in bright and mid parts of the image and dark parts will quickly show grain), digital sensors (total clipping of whites and lots of information compressed in shadows – but mixed with unpleasant, digital noise) and in digital files (zero noise other than quantization artifacts / banding).

I will focus in this post purely on dynamic range of the scene displayed on the screen. Therefore, I will use EV stops = exposure value stops, base2 logarithm (bits!).

Display dynamic range

It would be easy to assume that if we output images in 8 bits, dynamic range of displayed image would be 8 EV stops. This is obviously not true, as information stored there is always (in any modern pipeline and OS) treated with an opto-electrical transfer function – OETF prior to 8bit quantization/encoding that is given for a display medium. Typically used OETF is some gamma operator, for example gamma 2.2 and sRGB (not the same, but more about it later).

Before going there, let’s find a way of analyzing it and displaying dynamic range – first for the simplest case, no OETF, just plain 8bit storage of 8bit values.

Since 0 is “no signal”, I will use 1/256 as lowest representable signal (anything lower than that gets lost in quantization noise) and 1 as highest representable signal. If you can think of a better way – let me know in comments. To help me with that process, I created a Mathematica notebook.

Here is output showing numerical analysis of dynamic range:

dynrange_linear.png

Dynamic range of linear encoding in 8bits. Red line = dynamic range. Blue ticks = output values. Green ticks = stops of exposure. Dots = mappings of input EV stops to output linear values.

Representation of EV stops in linear space is obviously exponential in nature and you can immediately see a problem with such naive encoding – many lower stops of exposure get “squished” in darks, while last 2 stops cover 3/4 of the range! Such linear encoding is extremely wasteful for logarithmic in nature signal and would result in quite large, unpleasant banding after quantization. Such “linear” values are unfortunately not perceptually linear. This is where gamma comes in, but before we proceed to analyze how it affects the dynamic range, let’s gave a look at some operations in EV / logarithmic space.

Exposure in EV space

First operation in EV space is trivial, it’s adjusting the exposure. Quite obviously, addition in logarithmic space is multiplication in linear space.

exp2(x+y) == exp2(x)*exp2(y)

Exposure operation does not modify dynamic range! It just shifts it in the EV scale. Let’s have a look:

exposure.gif

Blue lines scale linearly, while green lines just shift – as expected.

underexposed_linear

Underexposed image – dynamic range covering too high values, shifted too far right.

Gamma in EV space

This is where things start to get interesting as most people don’t have intuition about logarithmic spaces – at least I didn’t have. Gamma is usually defined as a simple power function in linear space. What is interesting though is what happens when you try to convert it to EV space:

gamma(x,y) = pow(x, y)

log2(gamma(exp2(x),y)) == log2(exp2(x*y))

log2(exp2(x*y)) == x*y

Gamma operation becomes simple multiplication! This is a property that is actually used by GPUs to express a power operation through a series of exp2, madd and log2. However, if we stay in EV space, we only need the multiply part of this operation. Multiplication as a linear transform obviously preserves space linearity, just stretching it. Therefore gamma operation is essentially dynamic range compression / expansion operation!

gamma.gif

You can see what it’s doing to original 8 bit dynamic range of 8 stops – it multiplies it by reverse of the gamma exponent.

Contrast

Gamma is a very useful operator, but when it increases dynamic range, it makes all values brighter than before; when decreasing dynamic range, it makes all of them darker. This comes from anchoring at zero (as 0 multiplied by any value will stay zero), which translates to “maximum” one in linear space. It is usually not desired property – when we talk about contrast, we want to make image more or less contrasty without adjusting overall brightness. We can compensate for it though! Just pick another, different anchor value – for example mid grey value.

Contrast is therefore quite similar to gamma – but we usually want to keep some other point fixed when dynamic range gets squished instead of 1 – for example linear middle gray 0.18.

contrast(x, y, midPoint) = pow(x,y) * midPoint / pow(midPoint, y)

contrast.gif

Increasing contrast reduces represented dynamic range while “anchoring” some grey point so it’s value stays unchanged.

How it all applies to local exposure

Hopefully, with this knowledge it’s clear why localized exposure works better to preserve local contrast, saturation and “punchiness” than either contrast or gamma operator – it moves parts of histogram (ones that were too dark / bright), but:

  1. Does it only to the parts of the image / histogram that artist wanted, not affecting others like e.g. existing midtones.
  2. Only shifts values (though movement range can vary!), instead of squishing / rescaling everything like gamma would do.

This property and contrast preservation is highly desired and preserves other properties as well – like saturation and “punchiness”.

gamma_vs_both

Comparison of gamma vs localized adaptive exposure – notice saturation and contrast difference.

EOTF vs OETF

Before talking about output functions, it’s important to distinguish between a function that’s used by display devices to transfer from electric signals (digital or analog), called EOTF – Electro Optical Transfer Function and the inverse of it, OETF, Opto-Electrical Transfer Function.

EOTF came a long way – from CRT displays and their power response to voltage that was seen as perceptually uniform in some viewing conditions.

crt_gamma.gif

CRT response to voltage

While early rendering pipelines completely ignored gamma and happily used its perceptual linearity property and did output values directly to be raised to a gamma power by monitor, when starting to get more “serious” about things like working in linear spaces, we started to use inverse of this function, called OETF and this is what we use to encode signal for display in e.g. sRGB or Rec709 (and many more other curves like PQ).

While we don’t use CRTs anymore, such power encoding was proven to be very useful also on modern displays as it allows to encode more signal and this is why we still use them. There is even standard called BT1886 (it specifies only EOTF) that is designed to emulate gamma power look of old CRTs!

Monitor OETF gamma 2.2 encoding

Equipped with that knowledge, we can get back to monitor 2.2 gamma. As EOTF is 2.2, OETF will be a gamma function of 1/2.2, ~0.45 and hopefully it’s now obvious that encoding from linear space to gamma space, we will have a dynamic range 2.2 * 8 == 17.6.

dynrange_gamma_2_2.png

Now someone would ask – ok, hold on, I thought that monitors use a sRGB function, that is combination of a small, linear part and an exponent of 2.4!

Why I didn’t use precise sRGB? Because this function was designed to deliver average gamma 2.2, but have added linear part in smallest range to avoid numerical precision issues. So in this post instead of sRGB, I will be using gamma 2.2 to make things simpler.

What is worth noting here is that just gamma 2.2 of linear value encoded in 8 bits has huge dynamic range! I mean, potential to encode over 17 stops of exposure / linear values – should be enough? Especially that their distribution also gets better – note how more uniform is coverage of y axis (blue “ticks”) in the upper part. It still is not perfect, as it packs many more EV stops into shadows and puts most of the encoding range to upper values (instead of midtones), but we will fix it in a second.

Does this mean that OETF of gamma 2.2 allow TVs and monitors to display those 17.6 stops of exposure and we don’t need any HDR displays nor localized tonemapping? Unfortunately not. Just think about following question – how much information is really there between 1/256 and 2/256? How much those 2.2 EV stops mean to the observer? Quite possibly they won’t be even noticed! This is because displays have their own dynamic range (often expressed as contrast ratio) and depend on viewing conditions. Presentation by Timothy Lottes explained it very well.

But before I proceed with further analysis of how tonemapping curves change it and what gamma really does, I performed a small experiment with viewing conditions (let you be the judge if results are interesting and worth anything).

Viewing conditions

I wanted to do small experiment – compare an iPhone photo of the original image in my post about localized tonemapping (so the one with no localized tonemapping/exposure/contrast) on my laptop in perfect viewing conditions (isolated room with curtains):

2016-08-21 13.31.05.jpg

Poor dynamic range of the phone doesn’t really capture it, but it looked pretty good – with all the details preserved and lots of information in shadows. Interesting, HDR scene and shadows that look really dark, but full of detail and bright, punchy highlights. In such perfect viewing conditions, I really didn’t need any adjustments and the image and display looked “HDR” by themselves!

Second photo is in average/typical viewing conditions:

2016-08-21 13.29.59.jpg

Whole image was still clear and readable, though what could be not visible here is that darks lost details; Image looks more contrasty (Bartleson-Breneman effect) and Kratos back was unreadable anymore.

Finally, same image taken outside (bright Californian sun around noon, but laptop maxed out on brightness!):

2016-08-21 13.30.31.jpg

Yeah, almost invisible. This is with brightness adjustments (close to perception of very bright outdoor scene):

2016-08-21 13.30.31_c.jpg

Not only image is barely visible and lost all the details, but also is overpowered with reflections of surroundings. Almost no gamma settings would be able to help such dynamic range and brightness.

I am not sure whether this artificial experiment shows it clearly enough, but depending on the viewing conditions image can look great and clear or barely visible at all.

My point is to emphasize what Timothy Lottes shown in his presentation – viewing conditions define the perceived dynamic range.

I tried to model it numerically – imagine that due to bright room and sunlight, you cannot distinguish anything below encoded 0.25. This is how dynamic range of such scene would look like:

poor_viewing_conditions.png

Not good at all, only 4.4 EV stops! And this is one of reasons why we might need some localized tonemapping.

Viewing conditions of EOTF / OETF

Viewing conditions and perception is also the reason for various gamma EOTF curves and their inverse. Rec709 EOTF transfer curve is different than sRGB transfer curve (average gamma of 2.4 vs 2.2) because of that – darker viewing conditions of HDTV require different contrast and reproduction.

Due to mentioned Bartleson-Breneman effect (we perceive more contrast, less dynamic range as surroundings get brighter) living room at night requires different EOTF than one for view conditions web content in standard office space (sRGB). Rec709 EOTF gamma will mean more contrast produced by the output device.

Therefore, using inverse of that function, OETF of Rec709 you can store 2.4*8 == 19.2 stops of exposure and the TV is supposed to display them and conditions are supposed to be good enough. This is obviously not always the case and if you tried to play console games in sunlit living room you know what I mean.

gamma_2_2vs2_4.gif

Gammas 2.2 and 2.4 aren’t only ones used – before standardizing sRGB and Rec709 different software and hardware manufacturers were using different values! You can find historical references to Apple Mac gammas 1.8 or 2.0 or crazy Gameboy extreme gammas of 3.0 or 4.0 (I don’t really understand this choice for Gameboys that are handheld and supposed to work in various conditions – if you know, let me and readers know in comments!).

“Adjust gamma until logo is barely visible”

Varying conditions are main reason for what you see in most games (especially on consoles!):

logo.png

Sometimes it has no name and is called just a “slider”, sometimes gamma, sometimes contrast / brightness (incorrectly!), but it’s essentially way of correcting for imperfect viewing conditions. This is not the same as display gamma! It is an extra gamma that is used to reduce (or increase) the contrast for the viewing conditions.

It is about packing more EV stops into upper and midrange of the histogram (scaling and shifting right), more than expected by display before it turns it into linear values. So a contrast reduction / dynamic range compression operation as in brighter viewing conditions, viewer will perceive more contrast anyway.

It often is “good enough” but what in my opinion games usually do poorly is:

  1. Often user sets it only once, but plays the game during different conditions! Set once, but then play same gam during day/night/night with lights on.
  2. Purpose of it is not communicated well. I don’t expect full lecture on the topic, but at least explanation that it depends on brightness of the room where content is going to be viewed.

Let’s take our original “poor viewing conditions” graph and apply first 0.7 gamma (note: this additional gamma has nothing to do with the display EOTF! This is our “rendering intent” and knowledge about viewing conditions that are not part of the standard) on the linear signal before applying output transform, so:

(x^0.7)^(1/2.2) == x^(0.7/2.2)

poor_viewing_conditions.gif

A little bit better, ~1.42 more stops. 🙂 However, in good viewing conditions (like player of your game launching it at night), this will make image look less contrasty and much worse…

Tonemapping operators

From my analysis so far it doesn’t seem immediately obvious why we need some tonemapping curves / operators – after all we get relatively large dynamic range just with straight linear encoding in proper viewing conditions, right?

There are two problems:

  1. First one is non-uniform distribution of data, biased towards last visible EV stops (values close to white) occupying most of the encoded range. We would probably want to have store most EV stops around middle values and larger range and have upper and lower range squishing many EV stops with smaller precision. So a non-uniform dynamic range scaling with most of precision bit range used for midtones.
  2. Fixed cutoff value of 1.o defining such white point. It is highly impractical with HDR rendering, where while after exposure we can have most of values in midrange, we can also have lots of (sometimes extreme!) dynamic range in bright areas. This is true especially with physically based rendering and very glossy speculars or tiny emissive objects with extreme emitted radiance.

Methods presented so far don’t offer any solution for those very bright objects – we don’t want to reduce contrast or shift the exposure and we don’t want them to completely lose their saturation; we still want to perceive some details there.

proper_exp_linear

Correctly exposed image, but with linear tonemapping curve shows the problem – completely clipped brights!

This is where tonemapping curves come to play and show their usefulness. I will show some examples using good, old, Reinhard curve – mostly for its simplicity. By no means I recommend this curve, there are better alternatives!

There is a curve that probably every engine has implemented as a reference – Hable Uncharted tonemapping curve.

Another alternative is adapting either full ACES workflow – or cheaper (but less accurate; ACES RRT and ODT are not just curves and have some e.g. saturation preserving components!) approximating it like Kris Narkowicz proposed.

Finally, option that I personally find very convenient is nicely designed generic filmic tonemapping operator from Timothy Lottes.

I won’t focus here on them – instead try good, old Reinhard in simplest form – just for simplicity. It’s defined as simply:

tonemap(x) = x/(1+x)

It’s a curve that never reaches white point, so usually a rescaling correction factor is used (as division by tonemap(whitePoint) ).

Let’s have a look at Reinhard dynamic range and its distribution:

reinhard.png

Effect of Reinhard tonemapping with white point defined as 128

This looks great – we not only gained over 7 stops of exposure, have them tightly packed in brights without affecting midtones and darks too much, but also distribution of EV stops in final encoding looks almost perfect!

It also allows to distinguish (to some extent) details in objects that are way brighter than original value of EV 0 (after exposure) – like emissives, glowing particles, fire, bright specular reflections. Anything that is not “hotter” than 128 will get some distinction and representation.

linear_vs_filmic

Comparison of linear and filmic operators. Here filmic tonemapping operator is not Reinhard, so doesn’t wash out shadows, but the effect on highlights is same.

As I mentioned, I don’t recommend this simplest Reinhard for anything other than firefly fighting in post effects or general variance reduction. There are better solutions (already mentioned) and you want to do many more things with your tonemapping – add some perceptual correction for darks contrast/saturation (“toe” of filmic curves), some minor channel crosstalk to prevent total washout of brights etc.

Summary

I hope that this post helped a bit with understanding of the dynamic range, exposure and EV logarithmic space, gamma functions and tonemapping. I have shown some graphs and numbers to help not only “visualize” impact of those operations, but also see how they apply numerically.

Perfect understanding of those concepts is in my opinion absolutely crucial for modern rendering pipelines and any future experiments with HDR displays.

If you want to play with this “dynamic range simulator” yourself, check the referenced Mathematica notebook!

.nb Matematica version

PDF version of the notebook

Edit: I would like to thank Jorge Jimenez for a follow up discussion that allowed me to make some aspects of my post regarding EOTF / OETF hopefully slightly clearer and more understandable.

References

http://renderwonk.com/publications/s2010-color-course/ SIGGRAPH 2010 Course: Color Enhancement and Rendering in Film and Game Production

http://gpuopen.com/gdc16-wrapup-presentations/  “Advanced Techniques and Optimization of HDR Color Pipelines”, Timothy Lottes.

http://filmicgames.com/archives/75 John Hable “Filmic Tonemapping Operators”

https://github.com/ampas/aces-dev Academy Color Encoding Standard

https://knarkowicz.wordpress.com/2016/01/06/aces-filmic-tone-mapping-curve/ ACES filmic tonemapping curve

http://graphicrants.blogspot.com/2013/12/tone-mapping.html Brian Karis “Tone-mapping”

 

Posted in Code / Graphics | Tagged , , , , , , , , | 13 Comments