Study of smoothing filters – Savitzky-Golay filters

Last week I saw Daniel Holden tweeting about Savitzky-Golay filters and their properties (less smoothing than a Gaussian filter) and I got excited… because I have never heard of them before and it’s an opportunity to learn something. When I checked the wikipedia page and it mentioned it being a least squares polynomial fit, yet yielding closed formula linear coefficients, all my spidey senses and interests were firing!

Convolutional, linear coefficients -> means we can analyze their frequency response and reason about their properties.

I’m writing this post mostly as my personal study of those filters – but I’ll also write up my recommendation / opinion. Important note: I believe my use-case is very different than Daniel’s (who gives examples of smoothed curves), so I will analyze it for a different purpose – working with digital sequences like image pixels or audio samples.

(If you want a spoiler – I personally probably won’t use them for audio or image smoothing or denoising, maybe with windowing, but think they are quite an interesting alternative for derivatives/gradients, a topic I have written before).

Savitzky-Golay filters

The idea of Savitzky-Golay filters is simple – for each sample in the filtered sequence, take its direct neighborhood of N neighbors and fit a polynomial to it. Then just evaluate the polynomial at its center (and the center of the neighborhood), point 0, and continue with the next neighborhood. 

Let’s have a look at it step by step.

Given N points, you can always fit a (N-1) degree polynomial to it.

Let’s start with 7 points and a 6th degree polynomial:

Red dots are the samples, and the blue line is sinc interpolation – or more precisely, Whittaker–Shannon interpolation, standard when analyzing bandlimited sequences.

It might be immediately obvious that the polynomial that goes through those points is… not great. It has very large oscillations. 

This is known as Runge’s phenomenon and equispaced points (our original sampling grid) generally don’t fit polynomials to many functions well, especially with higher polynomial degrees. Fitting high degree polynomials to arbitrary functions on equispaced grids is a bad idea…

Fortunately, Savitzky-Golay filtering doesn’t do this.

Instead, they find a lower order polynomial that fits the original sequence the best in least-squares sense – the sum of squared distances between the polynomials sampled at the original points is minimized.

(If you are a reader of my blog, you might have seen me mention least squares in many different contexts, but probably most importantly when discussing dimensionality reduction and PCA)

Since this is a linear least squares problem, it can be solved directly with some basic linear algebra and produces following fits:

On those plots, I have marked the distance from the original point at zero – after all, this is the neighborhood of this point and the polynomial fit value will be evaluated only at this point.

Looking at these plots there’s an interesting and initially non-obvious observation – degrees in pairs (0,1), (2,3), (3,4) have exactly the same center value (at x position zero). Adding a next odd degree to the polynomial can cause “tilt” and changes the oscillations, but doesn’t impact the symmetric component and the center.

For the Savitzky-Golay filters used for filtering signals (not their derivatives – more on it later), it doesn’t matter if you take the degree 4 or 5 – coefficients are going to be the same.

Let’s have a look at all those polynomials on a single plot:

As well as on an animation that increases the degree and interpolates between consequential ones (who doesn’t love animations?):

The main thing I got from this visualization and playing around is how lower orders not only fit the sequence worse, but also have less high frequency oscillations.

As mentioned, Savitzky-Golay filter repeats this on the sequence of “windows”, moving by single point, and by evaluating their centers – obtains the filtered value(s).

Here is a crude demonstration of doing this with 3 neighborhoods of points in the same sequence – marked (-3, 1), (-2, 2), (-1, 3) and fitting parabolas to each.

Savitzky-Golay filters – convolution coefficients

The cool part is that given the closed formula of the linear least squares fit for the polynomial and its evaluation only at the 0 point, you can obtain a convolutional linear filter / impulse response directly. Most signal processing libraries (I used scipy.signal) have such functionality and the wikipedia page has neat derivation.

Plotting the impulse responses for bunch of 7 and 5 point filters:

The degree of 0 – an average of all points – is obviously just a plain box filter.

But the higher degrees start to be a bit more interesting – looking like a mix of lowpass (expected) and bandpass filters (a bit more surprising to me?).

It’s time to do some spectral analysis on those!

Analyzing frequency response – comparison with common filters

If we plot the frequency response of those filters, it’s visually clear why they have smoothing properties, but not oversmoothing:

As compared to a box filter (Savgol filters of degree 0) and a binomial/Gaussian-like [0.25, 0.5, 0.25] filter (my go to everyday generic smoother – be sure to check out my last post!), they preserve a perfectly flat frequency response for more frequencies.

What I immediately don’t like about it, is that, similarly to a box filter, it is going to produce an effect similar to comb filtering, preserving some frequencies and filtering out some others – oscillating frequency response; they are not lowpass filters. So for example a Savgol filter of 7 points and degree 4 is going to leave zero frequencies at 2.5 radians per cycle, while leaving out almost 40% at Nyquist! Leaving out frequencies around Nyquist often produces “boxy”, grid-like appearance.

When filtering images, it’s definitely an undesirable property and one of the reasons I almost never use a box filter.

We’ll see in a second why such a response can be problematic on a real signal example.

On the other hand, preserving some more of the lower frequencies untouched could be a good property if you have for example a purely high frequency (blue) noise to filter out – I’ll check this theory in the next section (I won’t spoil the answer, but before you proceed I recommend thinking why it might or might not be the case).

Here’s another comparison – with IIR style smoothing, classic simple exponential moving average (probably the most common smoothing used in gamedev and graphics due to very low memory requirements and great performance – basically output = lerp(new_value, previous_output, smoothing) ) and different smoothing coefficients.

What is interesting here is that the Savitzky-Golay filter is a bit “opposite” of its behavior – keeps much more of the lower frequencies, decays to zero at some points, while IIR never really reaches a zero.

Gaussian-like binomial filter on the other hand is somewhere in between (and has a nice perfect 0 at Nyquist).

Example comparison on 1D signals

On relatively smooth and smaller filter spatial support signals, there’s not much difference between higher degree savgol and “Gaussian” (while the box filter is pretty bad!):

While on more noisy signals and higher spatial support, they definitely show less rounding of the corners than a binomial filter:

On any kind of discontinuity it also preserves the sharp edge, but…

Looks like the Savitzky-Golay filter causes some overshoots. Yes, it doesn’t smoothen the edge, but to me overshooting over the value of the noise is a very undesirable behavior; at least for images where we deal with all kinds of discontinuities and edges all the time…

Example comparison on images

…So let’s compare it on some example image itself.

Note that I have used it in separable manner (convolution kernel is an outer product of 1D kernels; or alternatively applying 1D in horizontal, and then vertical direction), not doing the regression on a 2D grid.

…On the first glance, I quite “perceptually” like the results for the sharpest Savitzky-Golay filter…

…on the other hand, it makes the existing oversharpening of the image from Kodak dataset “wider”. It also produces a halo around the jacket on the left of the image for the 7,2 filter. It is not great at removing noise, way worse than even a simple 3 point binomial filter. It produces a “boxy pixels” look (result of both remaining high frequencies, as well as being separable / non-isotropic).

I thought they could be great on filtering out some “blue” noise…

…and it turns out not so much… The main reason is this lack of filtering of the highest frequencies:

…behavior around Nyquist – 40% of the noise and signal is preserved there!

Luckily, there is a way to improve behavior of such filters around Nyquist (by compromising some of the preservations of higher frequencies).

Windowing Savitzky-Golay filters

One of the reasons for not-so-great behavior is use of equal weights for all samples. This is equivalent to applying a rectangular window, and causes “ripples” in frequency domain.

This is similar to applying a “perfect” lowpass filter of sinc and truncating it – which produces undesireable artifacts. Common Lanczos resampling improves it by applying an (also sinc!) window function.

Let’s try the same on the Savitzky-Golay filters. I will use a Hann window (squared cosine), no real reason here, other than “I like it” and I found it useful in some different contexts – overlapped window processing (sums to 1 with fully overlapped windows).

Applying a window here is as easy as simply multiplying the weights by the window function and renormalizing them. Quick visualization of the window function and windowed weights:

Note that window function tends to shrink outer samples strongly towards zero and to get similar behavior, you often need to go with a larger spatial support filter (larger window).

How does it affect the frequency response?

It removes (or at least, reduces) some of the high frequency ripples, but at the same time makes the filter more lowpass (reduces its mid frequency preservation).

Here is how it looks on images:

If you look closely, box-shaped artifacts are gone in the first two of the windowed versions, and are reduced in the 9,6 case.

Savitzky-Golay smoothened gradients

So far I wasn’t super impressed with the Savitzky-Golay filters (for my use-cases) without windowing, but then I started to read more about their typical use – producing smoothened signal gradients / derivatives.

I have written about image (and discrete signal in general) gradients a while ago and if you read my post… it’s not a trivial task at all. Lots of techniques with different trade-offs, lots of maths and physics connections, and somehow any solution I picked it felt like a messy compromise.

The idea of Savitzky-Golay gradients is kind of brilliant – to compute the derivative at a given point, just compute the derivative of the fitted polynomial. This is extremely simple and again has a closed formula solution.

Note however that now we definitely care about differences between odd and even polynomial degrees (look at the slope of lines around 0):

The derivative at 0 can have even an opposite sign between degrees of 2 vs 3 and 4 vs 5.

Since we get the convolution coefficients, we can also plot the frequency responses:

That’s definitely interesting and potentially useful with the higher degrees.

And it’s a neat mental concept (differentiating a continuous reconstruction of the discrete signal) and something I definitely might get back to.

Summary – smoothing filter recommendations

Note: this section is entirely subjective!

After playing a bit with the Savitzky-Golay filters, I personally wouldn’t have a lot of use for them for smoothing digital sample sequences (your mileage may vary).

For preserving edges and not rounding corners, it’s probably best to go with some non-linear filtering – from a simple bilateral, anisotropic diffusion, to Wiener filtering in wavelet or frequency domain.

When edge preservation is less important than just generic smoothing, I personally go with Gaussian-like filters (like the mentioned binomial filter) for their simplicity, trivial implementation, very compact spatial support. Or you could use some proper windowed filter tailored for your desired smoothing properties (you might consider windowed Savitzky-Golay).

I generally prefer to not use exponential moving average filters: a) they are kind of the worst when it comes to preserving mid and lower frequencies b) they are causal and always delay your signal / shift the image, unless you run it forwards and then backwards in time. Use them if you “have to” – like the classic use in temporal anti-aliasing where memory is an issue and storing (and fetching / interpolating) more than one past framebuffer becomes prohibitively expensive. There are obviously some other excellent IIR filters though, even for images – like Domain Transform, just not the exponential moving average.

On the other hand, the application of Savitzky-Golay filters to producing smoothened / “denoised” signal gradients was a bit of an eye opener to me and something I’ll definitely consider the next time I play with gradients of noisy signals. The concept of reconstructing a continuous signal in a simple form, and then doing something with the continuous representation is super powerful and I realized it’s something we explicitly used for our multi-frame super-resolution paper as well.

Posted in Code / Graphics | Tagged , , , , , | Leave a comment

Practical Gaussian filtering: Binomial filter and small sigma Gaussians

Gaussian filters are the bread and butter of signal and image filtering. They are isotropic and radially symmetric, filter out high frequencies extremely well, and just look pleasant and smooth.

In this post I will cover two of my favorite small Gaussian (and Gaussian-like) filtering “tricks” and caveats that are not appreciated by textbooks, but are important in practice.

The first one is binomial filters – a generalization of literally the most useful (in my opinion) filter in computer graphics, image processing, and signal processing.

The second one is discretizing small sigma Gaussians – did you ever try to compute a Gaussian of sigma 0.3 and wonder why you get close to just [0.0, 1.0, 0.0]?

You can treat this post as two micro-posts that I didn’t feel like splitting and cluttering too much.

Binomial filter – fast and great approximation to Gaussians

First up in the bag of simple tricks is “binomial filter”, as called by Aubury and Tuk.

This one is in an interesting gray zone – if you do any signal or image filtering on CPUs or DSPs, it’s almost certain you used them (even if you didn’t know under such a name). On the other hand, through almost ~10 years of GPU programming I have not heard of them.

The reason is their design is perfect for fixed point arithmetic (but I’d argue that is actually also neat and useful with floats).

Simplest example – [1, 2, 1] filter

Before describing what “binomial” filters are, let’s have a look at the simplest example of them – a “1 2 1” filter.

[1, 2, 1] refers to multiplier coefficients that then are normalized by dividing by 4 (a power of two -> can be implemented either by a simple bit shift, or super efficiently in hardware!).

In a shader you can use directly a [0.25, 0.5, 0.25] form instead – and for example on AMD hardware those small inverse power of two multipliers can become just instruction modifiers!

This filter is pretty close to a Gaussian filter with a sigma of ~0.85.

Here is the frequency response of both:

For the Gaussian, I used a 5 point Gaussian to prevent excessive truncation -> effective coefficients of [0.029, 0.235, 0.471, 0.235, 0.029].

So while the binomial filter here deviates a bit from the Gaussian in shape, but unlike this sigma of Gaussian, it has a very nice property of reaching a perfect 0.0 at Nyquist. This makes this filter a perfect one for bilinear upsampling.

In practice, this truncation and difference is not going to be visible for most signals / images, here is an example (applied “separably”):

Last panel is the difference enhanced x10. It is the largest on a diagonal line due to how only a “true” Gaussian is isotropic. (“You say Gaussians have failed us in the past, but they were not the real Gaussians!” 😉 )

For quick visualization, here is the frequency response of both with a smooth isoline around 0.2:

Left: [121] filter applied separably, Right: Gaussian with a comparable sigma.

As you can see, Gaussian is perfectly radial and isotropic, while a binomial filter has more “rectangular” response, filtering less of the diagonals.

Still, I highly recommend this filter anytime you want to do some smoothing – a) extremely fast to implement, simple to memorize, b) just 3 taps in a single direction, so can be easily done also as non-separable 9-tap one c) has a zero at Nyquist, making it great for both downsampling, as well as upsampling (removed grid artifacts!).

Binomial filter coefficients

This was the simplest example of the binomial filter, but what does it have to do with the binomial coefficients or distribution? It is the third row from Pascal’s triangle.

You can construct the next Binomial filter by taking the next row from it.

Since we typically want odd-sized filters, I would compute the following row (ok, I have a few of them memorized) of “n over from 0 to 4”:

[1, 4, 6, 4, 1]

The really cool thing about Pascal’s triangle rows is that they always sum to a power of two – this means that here we can normalize simply by dividing by 16!

…or alternatively you can completely ignore the “binomial” part, and look at it like a repeated convolution. If we keep repeating convolving [1, 1] with itself, we get next binomial coefficients:

a = np.array([1, 1])
b = np.array([1])
for _ in range(7):
  print(b, np.sum(b))
  b = np.convolve(a, b)

[1] 1
[1 1] 2
[1 2 1] 4
[1 3 3 1] 8
[1 4 6 4 1] 16
[ 1  5 10 10  5  1] 32
[ 1  6 15 20 15  6  1] 64

At the same time, looking at coefficients of (a + b) ^ n == (a + b)(a + b)…(a + b) and is exactly the same as repeated convolution.

From this repeated convolution we can also see why the coefficients sum is powers of two – we always get 2x more from convolving with [1, 1].

Larger binomial filters

Here the [1, 4, 6, 4, 1] is approximately equivalent to a Gaussian of sigma ~1.03.

The difference starts to become really tiny. Similarly, the next two binomial filters:

Wider and wider binomial filters

Now something that you might have noticed on the previous plot is that I have marked Gaussian sigma squared as 3 and 4. For the next ones, the pattern will continue, and distributions will be closer and closer to Gaussian. Why is that so? I like how the Pascal triangle relates to a Galton board and binomial probability distribution. Intuitively – the more independent “choices” a falling ball can make, the closer final position distribution will resemble the normal one (Central Limit Theorem). Similarly by repeating convolution with a [1, 1] “box filter”, we also get closer to a normal distribution. If you need some more formal proof, wikipedia has you covered.

Again, all of them will sum up to a power of two, making a fixed point implementation very efficient (just widening multiply adds, followed by optional rounding add and a bit shift).

This leads us to a simple formula. If you want a Gaussian with a certain sigma, square it, and compute a binomial of 4*sigma^2 over x.

For example for sigma 2 and 3, almost perfect match:

Binomial filter use – summary

That’s it for this trick!

Admittedly, I rarely had uses for anything above [1, 4, 6, 4, 1], but I think it’s a very useful tool (especially for image processing on a CPU / DSP) and worth knowing the properties of those [1, 2, 1] filters that appear everywhere!

Discretizing small sigma Gaussians

Gaussian (normal) distribution probability distribution is well defined on an infinite, continuous domain – but it never reaches zero. When we want to discretize it for filtering a signal or an image – things become less well defined.

First of all, the fact that it never reaches zero means that we should use “infinite” filters… Luckily, in practice it decays so quickly that using ceil(2-3 sigma) for radius is enough and we don’t need to worry about this part of the discretization.

Another one is more troublesome and ignored by most textbooks or articles on Gaussian – that we point sample a probability density function, which can work fine for any larger filters (sigma above ~0.7), but definitely cause errors and undersampling for small sigmas!

Let’s have a look at an example and what’s going on there.

Gaussian sigma of 0.4…?

If we have a look at Gauss of sigma 0.4 and try to sample the (unnormalized) Gaussian PDF, we end up with:

The values sampled at [-1, 0, 1] end up being: [0.0438, 0.9974, 0.0438].

This is… extremely close to 0.0 for the non-center taps.

At the same time, if you look at all the values at the pixel extent, it doesn’t seem to correspond well to the sampled value:

If you wonder if we should look at the pixel extent and treat it as a little line/square (oh no! Didn’t I read the memo..?), the answer is (as always…) – it depends. It assumes that nearest-neighbor / box would be used as the reconstruction function of the pixel grid. Which is not a bad assumption with contemporary displays actually displaying pixels, and it simplifies a lot of the following math.

On the other hand, value for the center is clearly overestimated under those assumptions:

I want to emphasize here – this problem is +/- unique to small Gaussians. If we look only at mere sigma of 0.8, middle-side samples start to look much closer to proper representative value:

Yes, edges and the center still don’t match, but with increasing radii, this problem becomes minimal even there.

Literature that observes this issue (most don’t!) suggests to ignore it; and they are right – but only assuming that you don’t want to use a sigma of anything below 1.0.

Which sometimes you actually want to use; especially when modelling some physical effects. Modeling small Gaussians is also extremely useful for creating synthetic training data for neural networks, or tasks like deblurring.

Integrating a Gaussian

So how do we compute an average value of Gaussian in a range?

There is good news and bad news. 🙂 

Good news are – this is very well studied, we just need to integrate a Gaussian and we can just use the error function. The bad news… it doesn’t have any nice and closed formula; even in numpy/scipy environment it is treated as a “special” function. (In practice those are often computed by a set of approximations / tables followed by Newton-Raphson or similar corrections to get to desired precision.)

We just evaluate this integral at pixel extent endpoints, and subtract them and divide it by two.

The formula for a Gaussian integral “corrected” this way becomes:

def gauss_small_sigma(x: np.ndarray, sigma: float):
  p1 = scipy.special.erf((x-0.5)/sigma*np.sqrt(0.5))
  p2 = scipy.special.erf((x+0.5)/sigma*np.sqrt(0.5))
  return (p2-p1)/2.0

Is it a big difference? For small sigmas, definitely! For the sigma of 0.4, it becomes [0.1056, 0.7888, 0.1056] instead of [0.0404, 0.9192, 0.0404]!

We can see how good is this value representing the average on the plots:

When we renormalize the coefficients, we get this huge difference.

The difference is also (hopefully) visible with the naked eye:

Left: Point sampled Gaussian with a sigma of 0.4, Right: Gaussian integral for sigma of 0.4.

Verifying the model and some precomputed values

To verify the model, I also plotted a continuous Gaussian frequency response (which is super usefully also a Gaussian, but with an inverse sigma!) and compared it to two small sigma filters, 0.4 and 0.6:

The difference is dramatic for the sigma of 0.4, and much smaller (but definitely still visible; underestimating “real” sigma) for 0.6.

How does it fare in 2D? Luckily, with error function, we can still use a product of two error functions (separability!). However, I expected that it won’t be radially symmetric and fully isotropic anymore (we are integrating its response with rectangles), but luckily, it’s actually not bad and way better than undersampled Gaussian:

Left: Continuous 2D Gaussian with sigma 0.5 frequency response. Middle: Integrated 2D Gaussian with sigma 0.5 frequency response. Right: undersampled Gaussian with sigma 0.5 frequency response.

Plot shows that the integrated Gaussian has very nicely radially symmetric frequency response until we start getting close to Nyquist and very high frequencies (which is expected) – looks reasonable to me, unlike the undersampled Gaussian which suffers from both this problem, diamond-like response in lower frequencies, and being overall far away in the response shape/magnitude from the continuous one.

Finally, I have tabulated some small sigma Gaussian values (for quickly plugging those in if you don’t have time to compute). Here’s a quick cheat sheet:

SigmaUndersampledIntegralMean relative error
0.2 [0., 1., 0.] [0.0062, 0.9876, 0.0062]67%
0.3[0.0038, 0.9923, 0.0038][0.0478, 0.9044, 0.0478]65%
0.4[0.0404, 0.9192, 0.0404][0.1056, 0.7888, 0.1056]47%
0.5[0.1065, 0.787,  0.1065] [0.1577, 0.6845, 0.1577]27%
0.6[0.1664, 0.6672, 0.1664][0.1986, 0.6028, 0.1986]14%
0.7[0.2095, 0.5811, 0.2095] [0.2288, 0.5424, 0.2288]8%
0.8[0.239, 0.522, 0.239] [0.2508, 0.4983, 0.2508]5%
0.9[0.2595, 0.481,  0.2595] [0.267, 0.466, 0.267]3%
1.0[0.2741, 0.4519, 0.2741] [0.279, 0.442, 0.279]2%

To conclude this part, my recommendation is to always consider the integral if you want to use any sigmas below 0.7.

And for a further exercise for the reader, you can think of how this would change under some different reconstruction function (hint: it becomes a weighted integral, or an integral of convolution; for the box / nearest neighbor, the convolution is with a rectangular function).

Posted in Code / Graphics | Tagged , , , , , , , | 2 Comments

Modding Korg MS-20 Mini – PWM, Sync, Osc2 FM

Almost every graphics programmer I know is playing right now with some electronics (virtual high five!), so I won’t be any different. But I decided to make it “practical” and write something that might be useful to others – how I modded a Korg MS-20 Mini.

The most important ones are a PWM mod input – a quick phone video grab here (poor quality audio, demo only, might upload separately recorded MP3s later):

And oscillator sync together with a oscillator 2 FM input:

I am not taking any credit for the invention of the described modifications, but I am going to cover in some more detail as compared to the previous sources.

Korg MS-20

Korg MS-20 mini is a reissue of the legendary synthesizer from Korg which was made in 1978 (!) as a cheap alternative to super expensive Moogs of that era. It has two extremely “dirty”, screeching filters that was made famous throughout the emerging electronic music scene and where distorted, aggressive sound fit perfectly. It has also a very non-linear behavior and weird interactions between the lowpass and highpass filters when their resonance is cranked up, making it a very hands-on type of gear with many cool sounding sweet spots.

I mean, just look at the wikipedia list of its notable users:

From organic “Da Funk” of Daft Punk, through dirty blog house electro of MSTRKRFT, The Presets, and Mr Oizo, broken beats of Aphex Twin, to pulsating rhythms of the Industrial EBM sounds of Belgian and German artists of the 80s – this is an absolute classic, in many genres exceeding the use of Roland Juno’s or Moogs.

It was both super affordable (well you know… most bands bands on that list are not prog rock millionaires who could have afforded Moog or Roland modular systems, or even Minimoogs 🙂 ), and just sounds great on its own.

I tried some emulations and thought they were pretty cool, but nothing extraordinary, but when I visited my friend Andrzej and we jammed a bit on it + some other synths, I knew I need to get the hardware one!

Semi-modular architecture allows for great flexibility and some “weird” patching and exploration of sound design and synthesis (it’s easy to synthesize whole drumkit!), and the External Signal Processor is something extraordinary and not heard of anymore.

There are a few quirks (the way its envelopes work), and notable limitations though as compared to modern synths – but some of those limitations can be “fixed”! The lack of PWM control of the square/pulse wave, lack of oscillator sync, lack of flexible oscillator wave mixing for the first oscillator, and lack of separate frequency modulation of the second oscillator.

Modding MS-20 – sources

Luckily, the synth has been around for such a long time that many have explored extremely easy mods to allow for that and it’s pretty well documented.

I have followed mainly the Korg MS-20 Retro Expansion by Martin Walker, who in turn expanded on the work of Auxren: MS-20 Mini PCB Photos and OSC Sync Mod, MS-20: PWM CV In and Passive 4X Multiplier.

Please visit those links, and check out their diagrams, which I’m not going to re-post out of respect for the original authors.

I used them as my guidance and definitely without them would not have been able to mod it.

That said, Martin Walker’s description had a tiny bug in the text (diagrams and photos are rock solid and greatly valuable!), and Auxren didn’t cover the most exciting (to me! 🙂 ) Osc2 FM mod essential to make the oscillator sync sound like its most common modulated uses.

One of the mods – PWM input jack – is extremely easy. It requires less disassembly, soldering is trivial. Similarly separate oscillator outputs. Those I can recommend to almost everyone.

The other ones are a bit more tricky and involve soldering to the top of tiny surface components.

Required components, tools, and skills

Here’s a list of minimum what you need to have (apologies if I didn’t get all the tool names right, English is not my native language):

  • A set of screwdrivers. Among those you’d probably want something pretty small and short, some unscrewing is inside of the synth with limited space.
  • A driller with a set of metal drills to drill the holes for pots and jacks. Step drills are an option as well if you know how to use them.
  • A set of wrench keys. Surprisingly not as useful as I thought (more on that later).
  • Soldering kit. Hopefully you have a good one already!
  • A decent multimeter and potentially a component tester.
  • Some optional clamps / holders. This makes a lot of steps easier and safer.
  • Safety goggles and gloves for drilling.
  • Some non-electrostatic (if possible) cloth to put components safely on.
  • Small wire cutters.
  • Some sanding paper.

You will need to buy / have / use as components:

  • 5 mono minijack (3.5mm jack) sockets.
  • Two 100k resistors, one 10k resistor. But it wouldn’t hurt just buying some resistor set.
  • A mini toggle switch (on/off, 2 pins are enough).
  • A single diode like 1N4148 or similar – doesn’t really matter.
  • Three 100k mini potentiometers. Martin suggests log taper (type A), but I’d probably suggest two type A for the volume, and one type B linear for the FM mod intensity.
  • Bunch of colored cables.

Finally, I’d say that if you have never soldered before, I would advise against modding a few hundred dollars (even if bought second hand…) synth that has tons of tiny components, and you’d need to solder to some of the surface-mounted extremely tiny and fragile spots.

If anything, stick to the relatively easy PWM mod.

You can very easily burn off the parts of PCB, make tin drops that will connect things that are not supposed to be connected together…

Despite the very small amount of soldering required, this is rather not a starter DIY project – but I won’t say it’s impossible (I don’t want to gatekeep anyone, and learning by doing with some extra “pressure” is also fun). But I’d still recommend first soldering together a guitar pedal or delay effect and practicing soldering in general, knowing what are the potential issues etc.

Opening it up

Before opening any piece of hardware, the most important advice:

Always take photos of everything you are about to disassemble.

Always place all the parts separately and clearly marked.

This will remove quite a lot of the stress of “figuring out how to put it back together”.

The first step (that I did a bit later and it doesn’t matter, but it’s easiest when done first) – take off all the knobs. Large ones come off easily, smaller ones take some delicate pulling from different sides. Don’t use a screwdriver to lift them, they are made of quite delicate plastic and you could damage it!

Important note: unfortunately, the silver jack “screws” are a decoration. I tried taking them off using a wrench only to realize I’m breaking some plastic. 😦 Leave those in place. I was disappointed at the quality of this and the cost cutting measures… Other than that, MS-20 mini is built well and very smart (in my very limited experience).

Unscrew however with a wrench key the plastic black headphone minijack.

I suggest taking off one of the sides (I took off the one closer to the patchbay), and the back screws.

This is how it looks like inside, with its back taken off:

Now, the cool part is that to do just the PWM input mod, you don’t need to disassemble anything anymore! Just solder the cable to the back of the board in the required point (described later) and you’re done!

If you want to go with the other mods, then you need to take off some more screws. Those include some on the other side (you will see which ones), as well as this annoying metal piece that holds everything more stable:

It’s annoying as I later needed to trim it due to my incorrect pot placement, and it is a bit hard to reach, but you should have no problem.

After it and a few screws on the surface, the main board is disassembled, very carefully (watch out for the connected keyboard cables!) put it down:

Finally, you’ll need a wrench key to take off the nuts holding pots onto metal grounding shield:

And voila, Korg MS-20 mini (almost) disassembled!

Soldering wires

Ok, so now for the tricky part – soldering.

On the back it’s almost trivial:

You solder to one of the ground points (black cable), there are plenty of those, to the pulse width pot (purple cable), and to three middle legs of the oscillator wave selection switch.

On the front, things get trickier:

The osc sync requires soldering two cables at two diodes, marked on my picture with red and yellow – D4 and D7. D4 is an annoying one, you need to solder very precisely between some other components.

Virtual mixing ground for inputs is purple cable at IC14 op-amp.

Finally, for the oscillator 2 FM mod, you need to solder to R183 (green cable). This is where Martin’s diagram is correct, but his text (mentioning soldering to the wiper of the potentiometer) is a bit wrong.

The orange cable was there and was not soldered by me. 🙂 

At this point, I suggest double checking every spot, testing it around with a multimeter (with the synth power off and disconnected, at least as the first step) if you didn’t short anything.

Drilling holes and mounting pots

Unfortunately I didn’t take photos from my drilling process.

It’s pretty standard – mark the spots using a ruler and some sharp piece “puncturing” the paint there, and progressively drill with first small, and then larger and larger drills, until your switch/pots/jacks go through without a problem. You can use step drills, I even bought some, but decided to not use them (as I have never used those before).

After each step, clear everything up from scraps, make sure everything is solid mounted, and at the end optionally sand-paper around it.

Now, other than a few scratches (that I have no idea how happened, but those don’t bother me personally too much…) I made a more serious mistake in my hole placement.

The problem is that this placement: 

Is too far to the left by ~1.5cm (more than half an inch). 😦 This made putting it back problematic with this part:

This is not visible on the picture, but inside, it has metal piece that touches the upper part. And for me it was overlapping with the mini switch and mini jack socket… So I had to cut it slightly. I don’t think it’s a problem – everything worked out well, it holds well, but it was stressful, and be sure to verify the dimensions / sizing.

After your pots and jacks are in place, get back to soldering. I suggest starting with connecting all the grounds together – you are going to need a lot of cables there and be smart about their placement to not turn it into a mess.

It’s much easier to do it when you don’t have everything mounted together and other cables “dangling” around. Then solder the diode and the resistors to the legs of the switch / potentiometers.

At the end finally solder the cables from the mainboard – and be careful at this stage to not “pull” anything.

Putting it back

Before you put everything back together, I suggest very carefully, taking care to not short anything and with proper insulation, turn the MS20 on and start testing. Does it work as expected with the basic functionality? Is the new functionality working?

(Internally, it uses 18V max, so it’s not very unsafe for yourself, but still be careful to not fry the synth!)

If it does (hopefully in the first go!), time to start putting everything together. In the meanwhile, at two more steps (after putting the main board in its place, and after screwing everything except for the back) I still tested if everything’s ok.

Final results

Overall it took me a few whole evenings (a few hours each).

Generally I wouldn’t expect this to take less than ~2 days, unless you know exactly what you’re doing and you’re less scared of screwing something up than I was.

It was stressful at times (soldering), but other than a few scratches that you can clearly see on my photos, everything turned out well.

Ok, so this is again how the synth looks like now (oops, didn’t realize it’s so dusty):

It was a super fun adventure and I love how it sounds! 🙂 

I was also impressed with how one puts together such a large, “mechanical” construction – mostly from plastic and cheap parts, but other than faux jack covers, built really well.

Posted in Audio / Music / DSP | Tagged , , , , , , | Leave a comment

Processing aware image filtering: compensating for the upsampling

This post summarizes some thoughts and experiments on “filtering aware image filtering” I’ve been doing for a while.

The core idea is simple – if you have some “fixed” step at the end of the pipeline that you cannot control (for any reason – from performance to something as simple as someone else “owning” it), you can compensate for its shortcomings with a preprocessing step that you control/own.

I will focus here on the upsampling -> if we know the final display medium, or an upsampling function used for your image or texture, you can optimize your downsampling or pre-filtering step to compensate for some of the shortcomings of the upsampling function.

Namely – if we know that the image will be upsampled 2x with a simple bilinear filter, can we optimize the earlier, downsampling operation to provide a better reconstruction?

Related ideas

What I’m proposing is nothing new – others have touched upon those before.

In the image processing domain, I’d recommend two reads. One is “generalized sampling”, a massive framework by Nehab and Hoppe. It’s very impressive set of ideas, but a difficult read (at least for me) and an idea that hasn’t been popularized more, but probably should.

The second one is a blog post by Christian Schüler, introducing compensation for the screen display (and “reconstruction filter”). Christian mentioned this on twitter and I was surprised I have never heard of it (while it makes lots of sense!).

There were also some other graphics folks exploring related ideas of pre-computing/pre-filtering for approximation (or solution to) a more complicated problem using just bilinear sampler. Mirko Salm has demonstrated in a shadertoy a single sample bicubic interpolation, while Giliam de Carpentier invented/discovered a smart scheme for fast Catmul-Rom interpolation.

As usual with anything that touches upon signal processing, there are many more audio related practices (not surprisingly – telecommunication and sending audio signals were solved by clever engineering and mathematical frameworks for many more decades than image processing and digital display that arguably started to crystalize only in the 1980s!). I’m not an expert on those, but have recently read about Dolby A/Dolby B and thought it was very clever (and surprisingly sophisticated!) technology related to strong pre-processing of signals stored on tape for much better quality. Similarly, there’s a concept of emphasis / de-emphasis EQ, used for example for vinyl records that struggle with encoding high magnitude low frequencies.

Edit: Twitter is the best reviewing venue, as I learned about two related publications. One is a research paper from Josiah Manson and Scott Schaefer looking at the same problem, but specifically for mip-mapping and using (expensive) direct least squares solves, and the other one is a post from Charles Bloom wondering about “inverting box sampling”.

Problem with upsampling and downsampling filters

I have written before about image upsampling and downsampling, as well as bilinear filtering.

I recommend those reads as a refresher or a prerequisite, but I’ll do a blazing fast recap. If something seems unclear or rushed, please check my past post on up/downsampling.

I’m going to assume here that we’re using “even” filters – standard convention for most GPU operations.

Downsampling – recap

A “Perfect” downsampling filter according to signal processing would remove all frequencies above the new (downsampled) Nyquist before decimating the signal, while keeping all the frequencies below it unchanged. This is necessary to avoid aliasing.

Its frequency response would look something like this:

If we fail to anti-alias before decimating, we end up with false frequencies and visual artifacts in the downsampled image. In practice, we cannot obtain a perfect downsampling filter – and arguably would not want to. A “perfect” filter from the signal processing perspective has an infinite spatial support, causes ringing, overshooting, and cannot be practically implemented. Instead of it, we typically weigh some of the trade-offs like aliasing, sharpness, ringing and pick a compromise filter based on those. Good choices are some variants of bicubic (efficient to implement, flexible parameters) or Lanczos (windowed sinc) filters. I cannot praise highly enough seminal paper by Mitchell and Netravali about those trade-offs in image resampling.

Usually when talking about image downsampling in computer graphics, the often selected filter is bilinear / box filter (why are those the same for the 2x downsampling with even filters case? see my previous blog post). It is pretty bad in terms of all the metrics, see the frequency response:

It has both aliasing (area to the right of half Nyquist), as well as significant blurring (area between the orange and blue lines to the left of half Nyquist).

Upsampling – recap

Interestingly, a perfect “linear” upsampling filter has exactly the same properties!

Upsampling can be seen as zero insertion between replicated samples, followed by some filtering. Zero-insertion squeezes the frequency content, and “duplicates” it due to aliasing.

When we filter it, we want to remove all the new, unnecessary frequencies – ones that were not present in the source. Note that the nearest neighbor filter is the same as zero-insertion followed by filtering with a [1, 1] convolution (check for yourself – this is very important!).

Nearest neighbor filter is pretty bad and leaves lots of frequencies unchanged.

A classic GPU bilinear upsampling filter is the same as a filter: [0.125, 0.375, 0.375, 0.125] and it removes some of the aliasing, but also is strong over blurring filter:

We could go into how to design a better upsampler, go back to the classic literature, and even explore non-linear, adaptive upsampling.

But instead we’re going to assume we have to deal and live with this rather poor filter. Can we do something about its properties?

Downsampling followed by upsampling

Before I describe the proposed method, let’s have a look at what happens when we apply a poor downsampling filter and follow it by a poor upsampling filter. We will get even more overblurring, while still getting remaining aliasing:

This is not just a theoretical problem. The blurring and loss of frequencies on the left of the plot is brutal!

You can verify it easily looking at a picture as compared to downsampled, and then upsampled version of it:

Left: original picture. Right: downsampled 2x and then upsampled 2x with a bilinear filter.

In motion it gets even worse (some aliasing and “wobbling”):

Compensating for the upsampling filter – direct optimization

Before investigating some more sophisticated filters, let’s start with a simple experiment.

Let’s say we want to store an image at half resolution, but would like it to be as close as possible to the original after upsampling with a bilinear filter.

We can simply directly solve for the best low resolution image:

Unknown – low resolution pixels. Operation – upsampling. Target – original full resolution pixels.  The goal – to have them be as close as possible to each other.

We can solve this directly, as it’s just a linear least squares. Instead I will run an optimization (mostly because of how easy it is to do in Jax! 🙂 ). I have described how one can optimize a filter – optimizing separable filters for desirable shape / properties – and we’re going to use the same technique.

This is obviously super slow (and optimization is done per image!), but has an advantage of being data dependent. We don’t need to worry about aliasing – depending on presence or lack of some frequencies in an area, we might not need to preserve them at all, or not care for the aliasing – and the solution will take this into account. Conversely, some others might be dominating in some areas and more important for preservation. How does that work visually?

Left: original image. Middle: bilinear downsampling and bilinear upsampling. Right: Low resolution optimized for the upsampling.

This looks significantly better and closer – not surprising, bilinear downsampling is pretty bad. But what’s kind of surprising is that with Lanczos3, it is very similar by comparison:

Left: original image. Middle: Lanczos3 downsampling and bilinear upsampling. Right: Low resolution optimized for the upsampling.

It’s less aliased, but the results are very close to using bilinear downsampling (albeit with less aliased edges), that might be surprising, but consider how poor and soft is the bilinear upsampling filter – it just blurs out most of the frequencies.

Optimizing (or directly solving) for the upsampling filter is definitely much better. But it’s not a very practical option. Can we compensate for the upsampling filter flaws with pre-filtering?

Compensating for the upsampling filter with pre-filtering of the lower resolution image

We can try to find a filter that simply “inverts” the upsampling filter frequency response. We would like a combination of those two to become a perfect lowpass filter. Note that in this approach, in general we don’t know anything about how the image was generated, or in our case – the downsampling function. We are just designing a “generic” prefilter to compensate for the effects of upsampling. I went with Lanczos4 for the used examples to get to a very high quality – but I’m not using this knowledge.

We will look for an odd (not phase shifting) filter that concatenated with its mirror and multiplied by the frequency response of the bilinear upsampling produces a response close to an ideal lowpass filter.

We start with an “identity” upsampler with flat frequency response. It combined with the bilinear upsampler, yields the same frequency response:

However, if we run optimization, we get a filter with a response:

The filter coefficients are…

Yes, that’s a super simple unsharp mask! You can safely ignore the extra samples and go with a [-0.175, 1.35, -0.175] filter.

I was very surprised by this result at first. But then I realized it’s not as surprising – as it compensates for the bilinear tent-like weights.

Something rings a bell… sharpening mip-maps… Does it sound familiar? I will come back to this in one of the later sections!

When we evaluate it on the test image, we get:

Left: original image. Middle: Lanczos4 downsampling and bilinear upsampling. Right: Lanczos4 downsampling, sharpening filter, and bilinear upsampling.

However if we compare against our “optimized” image, we can see a pretty large contrast and sharpness difference:

Left: original image. Middle: Lanczos4 downsampling, sharpening filter, and bilinear upsampling. Right: Low resolution optimized for the upsampling (“oracle”, knowing the ground truth).

The main reason for this (apart from local adaptation and being aliasing aware) is that we don’t know anything about downsampling function and the original frequency content.

But we can design them jointly.

Compensating for the upsampling filter – downsample

We can optimize for the frequency response of the whole system – optimize the downsampling filter for the subsequent bilinear upsampling

This is a bit more involved here, as we have to model steps of:

  1. Compute response of the lowpass filter we want to model <- this is the step where we insert variables to optimize. The variables are filter coefficients.
  2. Aliasing of this frequency response due to decimation.
  3. Replication of the spectrum during the zero-insertion.
  4. Applying a fixed, upsampling filter of [0.125, 0.375, 0.375, 0.125].
  5. Computing loss against a perfect frequency response.

I will not describe all the details of steps 2 and 3 here -> I’ll probably write some more about it in the future. It’s called multirate signal processing.

Step 4. is relatively simple – a multiplication of the frequency response.

For step 5., our “perfect” target frequency response would be similar to a perfect response of an upsampling filter – but note that here we also include the effects of downsampling and its aliasing. We also add a loss term to prevent aliasing (try to zero out frequencies above half Nyquist).

For step 1, I decided on an 8 tap, symmetric filter. This gives us effectively just 3 degrees of freedom – as the filter has to normalize to 1.0. Basically, it becomes a form of [a, b, c, 0.5-(a+b+c), 0.5-(a+b+c), c, b, a]. Computing frequency response of symmetric discrete time filters is pretty easy and also signal processing 101, I’ll probably post some colab later.

As typically with optimization, the choice of initialization is crucial. I picked our “poor” bilinear filter, looking for a way to optimize it.

Without further ado, this is the combined frequency response before:

And this is it after:

The green curve looks pretty good! There is small ripple, and some highest frequency loss, but this is expected. There is also some aliasing left, but it’s actually better than with the bilinear filter. 

Let’s compare the effect on the upsampled image:

Left: original image. Middle: Lanczos3 downsampling and bilinear upsampling. Right: Our new custom downsampling and bilinear upsampling.

I looked so far at the “combined” response, but it’s insightful to look again, but focusing on just the downsampling filter:

Notice relatively strong mid-high frequency boost (this bump above 1.0) – this is to “undo” the subsequent too strong lowpass filtering of the upsampling filter. It’s kind of similar to sharpening before!

At the same time, if we compare it to the sharpening solution:

Left: original image. Middle: Lanczos4 downsampling, sharpen filter, and bilinear upsampling. Right: Our new custom downsampling and bilinear upsampling.

We can observe more sharpness and contrast preserved (also some more “ringing”, but it was also present in the original image, so doesn’t bother me as much).

Different coefficient count

We can optimize for different filter sizes. Does this matter in practice? Yes, up to some extent:

Top-left: original. Next ones are 4 taps, 6 taps, 8 taps, 10 taps, 12 taps.

I can see the difference up to 10 taps (which is relatively a lot). But I think the quality is pretty good with 6 or 8 taps, 10 if I there is some more computational budget (more on this later).

If you’re curious, this is how coefficients look like on plots:

Efficient implementation

You might wonder about the efficiency of the implementation. 1D filter of 10 taps means… 100 taps in 2D! Luckily, proposed filters are completely separable. This means you can downsample in 1D, and then in 2D. But 10 samples is still a lot.

Luckily, we can use an old bilinear sampling trick.

When we have two adjacent samples of the same sign, we can combine them together and in the above plots, turn the first filter into a 3-tap one, the second one into 4 taps, third one into an impressive 4, then 6, and also 6.

I’ll describe quickly how it works.

If we have two samples with offsets -1 and -2 and the same weights of 1.0, we can instead take a single bilinear tap in between those – offset of -1.5, and weight of 2.0. Bilinear weighting distributes this evenly between sample contributions.

When the weights are uneven, it still is w_a + w_b total weight, but the offset between those is w_a / (w_a + w_b).

Here are the optimized filters:

[(-2, -0.044), (-0.5, 1.088), (1, -0.044)]
[(-3, -0.185), (-1.846, 0.685), (0.846, 0.685), (2, -0.185)]
[(-3.824, -0.202), (-1.837, 0.702), (0.837, 0.702), (2.824, -0.202)]
[(-5, 0.099), (-3.663, -0.293), (-1.826, 0.694), (0.826, 0.694), (2.663, -0.293), (4, 0.099)]
[(-5.698, 0.115), (-3.652, -0.304), (-1.831, 0.689), (0.83, 0.689), (2.651, -0.304), (4.698, 0.115)]

I think this makes this technique practical – especially with separable filtering.

Sharpen on mip maps?

One thing that occurred to me during those experiments was the relationship between a downsampling filter that performs some strong sharpening, sharpening of the downsampled images, and some discussion that I’ve had and seen many times.

This is not just close, but identical to… a common practice suggested by video game artists at some studios I worked at – if they had an option of manually editing mip-maps, sometimes artists would manually sharpen mip maps in Photoshop. Similarly, they would request such an option to be applied directly in the engine.

Being a junior and excited programmer, I would eagerly accept any feature request like this. 🙂 Later when I got grumpy, I thought it’s completely wrong – messing with trilinear interpolation, as well as being wrong from signal processing perspective.

Turns out, like with many many practices – artists have a great intuition and this solution kind of makes sense.

I would still discourage it and prefer the downsampling solution I proposed in this post. Why? Commonly used box filter is a poor downsampling filter. Applying sharpening onto it enhances any kind of aliasing (as more frequencies will tend to alias as higher ones), so can have a bad effect on some content. Still, I find it fascinating and will keep repeating that artists’ intuition about something being “wrong” is typically right! (Even if some proposed solutions are “hacks” or not perfect – they are typically in the right direction).

Summary

I described a few potential approaches to address having your content processed by “bad” upsampling filtering, like a common bilinear filter: an (expensive) solution inverting the upsampling operation on the pixels (if you know the “ground truth”), simple sharpening of low resolution images, or a “smarter” downsampling filter that can compensate for the effect that worse upsampling might have on content.

I think it’s suitable for mip-maps, offline content generation, but also… image pyramids. This might be topic for another post though.

You might wonder if it wouldn’t be better to just improve the upsampling filter if you can have control over it? I’d say that changing the downsampler is typically faster / more optimal than changing the upsampler. Less pixels need to be computed, it can be done offline, upsampling can use the hardware bilinear sampler directly, and in the case of complicated, multi-stage pipelines, data dependencies can be reduced.

I have a feeling I’ll come back to the topics of downsampling and upsampling. 🙂 

Posted in Code / Graphics | Tagged , , , , , , , | 5 Comments

Comparing images in frequency domain. “Spectral loss” – does it make sense?

Recently, numerous academic papers in the machine learning / computer vision / image processing domains (re)introduce and discuss a “frequency loss function” or “spectral loss” – and while for many it makes sense and nicely improves achieved results, some of them define or use it wrongly.

The basic idea is – instead of comparing pixels of the image, why not compare the images in the frequency domain for a better high frequency preservation?

While some variants of this loss are useful, this particular take and motivation is often wrong.

Unfortunately, current research contains a lot of “try some new idea, produce a few benchmark numbers that seem good, publish, repeat” papers – without going deeper and understanding what’s going on. Beating a benchmark and an intuitive explanation are enough to publish an idea (but most don’t stick around).

Let me try to dispel some myths and go a bit deeper into the so-called “frequency loss”.

Before I dig into details, here’s a rough summary of my recommendations and conclusion of this post:

  1. Squared difference of image Fourier transforms is pointless and comes from a misunderstanding. It’s the same as a standard L2 loss.
  2. Difference of amplitudes of Fourier spectrum can be useful, as it provides shift invariance and is more sensitive to blur than to mild noise. On the other hand, on its own it’s useless (by discarding phase, it doesn’t correspond at all to comparing actual images).
  3. Adding phase differences to the Frequency loss makes it more meaningful image comparisons, but has a “singularity” that makes it not very useful of a metric. It’s a weird, ad hoc remapping…
  4. A hybrid combination of pixel comparisons (like L2) and frequency amplitude loss combines useful properties of all of the above and can be a desirable tool.

Interested in why? Keep on reading!

Loss functions on images

I’ve touched upon loss functions in my previous machine learning oriented posts (I’ll highlight the separable filter optimization and generating blue noise through optimization, where in both I discuss some properties of a good loss), but for a fast recap – in machine learning, loss function is a “cost” that the optimization process tries to minimize.

Loss functions are designed to capture aspects of the process / function that we want to “improve” or solve.

They can be also used in a non-learning scenario – as simple error metrics.

Here we will have a look at reference error metrics for comparing two images.

Reference means that we have two images – “image a” and “image b” and we’re trying to answer – how similar or different are those two? More importantly, we want to boil down the answer to just a single number. Typically one of the images is our “ground truth” or target.

We just produce a single number describing the total error / difference, so we can’t distinguish if this difference comes from one being blurry, noisy, or a different color.

Most common loss function is a L2 loss – average squared difference:

It has some very nice properties for many math problems (like close form solution, or that it corresponds to statistically meaningful estimates, like with assumption of white Gaussian noise). On the other hand, it’s known that it is also not great for comparing images, as it doesn’t seem to capture human perception – it’s not sensitive to blurriness, but very sensitive to small shifts.

A variant of this loss called PSNR is where we add a logarithm transform:

This makes values more “human readable” as it’s easier to tell that PSNR of 30 is better than 25 and by how much, while 0.0001 vs 0.0007 is not so easy to understand. Generally values of PSNR in the range above 30 are considered to be acceptable, above 36+ very good.

In this post, I will be using simple scipy “ascent” image:

To demonstrate the shortcomings of L2 loss, I have generated 3 variations of the above picture.

One is shifted by 1 pixel to the left/up, one is noisy, and one is blurry:

Top left: reference. Top right: image shifted by (1, 1) pixels. Bottom left: image with added noise. Bottom right: blurred image. All of the three distortion have same L2 error and PSNR!

The first two images look completely identical to us (because they are! Just slightly shifted), yet have the same PSNR and L2 error value like a strongly blurry and just mildly noisy one! All of those have a PSNR of ~22.5.

I think on its own this shows why it’s desired to find an alternative to L2 loss / PSNR.

Also it’s worth noting that sensitivity to small shifts is the exactly same phenomenon that causes this “overblurring”. If a few different shifts generate same large error, you can expect their average (aka… blur!) to also generate a similar error:

To combat those shortcomings, many better alternatives were proposed (a very decent recent one is nVidia’s FLIP).

One of the alternatives proposed by researchers is “frequency loss”, which we’ll focus on in this post.

Frequency spectrum

Let’s start with the basics and a recap. Frequency spectrum of a 2D image is a discrete Fourier transform of the image performed to obtain decomposition of the signal into different frequency bands.

I will use here the upper case notation for a Fourier transform:

A 2D Fourier transform can be expressed as such a double sum, or by applying a one dimensional discrete Fourier transform first in one dimension, then in the other one – as multi-dimensional DFT/FFT is separable. Through a 2D DFT, we can obtain a “complex” (as in complex numbers) image of the same size.

Note above that in DFT, every X value is a linear combination of the input pixels x with constant complex weights that depend only on the position – and as such is a linear transform and can be expressed as a gigantic matrix multiply – more on that later.

Every complex number in the DFT image corresponds to a complex phasor; or in simpler terms, amplitude and phase of a sine wave.

If we take a DFT of the image above and visualize complex numbers as red/green color channels, we get an image like this:

This… is not helpful. Much more intuitive is to look at the amplitude of the complex number :

This is how we typically look at periodogram / spectrum images.

Different regions in this representation correspond to different spatial frequencies:

Center corresponds to low frequencies (slowly changing signal), horizontal/vertical axis to pure horizontal/vertical frequencies of increasing period, while corners represent diagonal frequencies.

The analyzed picture has more strong diagonal structures and it’s visible in the frequency spectrum.

Reference image presented again – notice many diagonal edges.

Most “natural” images (pictures representing the real world – like photographs, artwork, drawings) have a “decaying” frequency – which means much more of the lower frequencies than the higher ones. In the image above, there are very large almost uniform (or smoothly varying) regions, and a DFT captures this information well.

But taking the amplitude of the complex numbers, we have lost information – projected a complex number onto a single real one. Such an operation (taking the amplitude) is irreversible and lossy, but we can look at what we discarded – phase, which looks kind of random (yet structured):

Relationship between magnitude/amplitude, phase, and the original complex number can be expressed as . Phase is as important, and we’ll get back to it later.

Comparing periodograms vs comparing images

After such an intro, we can analyze the motivation behind using such a transformation for comparing images.

One of the main reasons for the “frequency loss” (we’ll try to clarify this term in a second) to compare images is (hand-waving emerges) “if you use the L2 loss, it’s not sensitive to small variations of high frequencies and textures, and it’s not sensitive at all to blurriness. By directly comparing frequency spectrum content, we can reproduce the exact frequency spectrum, and help alleviate this and make the loss more sensitive to high frequencies”.

Is this intuition true? It depends. Different papers define frequency loss differently. As a first go, I’ll start with the one that is wrong and I have seen in at least two papers – use of the squared difference of complex numbers in the power spectrum:

Here is how average L2 (average squared difference) looks like compared to “normalized” power spectrum (depending on the implementation, some FFT implementations divide by N, some divide by N^2):

They are the same. PSNR is the same as log transform of this squared pseudo-frequency-loss.

Average squared complex spectrum difference is the same as squared image difference.

Squared complex spectrum difference is pointless

Those papers that use a standard squared difference on DFTs and claim it gives some better properties are simply wrong. If there are any improvements in experiments, it’s a result of poor methodology or some hyperparameter tuning, not having validation set etc. Well… 🙂 

But why is it the same?

I am not great at proofs and math notation / formalism. So for the ones who like more formal treatment, go see Parseval’s theorem and its proofs. You could also try to derive it yourself from expanding the DFT and Euler’s formula. Parseval’s theorem states:

Just consider x and X as not the image pixels and Fourier coefficients, but pixels and Fourier coefficients of differences – and those two are the same.

(Side note: Parseval’s theorem is super useful when you deal with computing statistics and moments like variance of noise that is non-white / spatially correlated. You can directly compute variance from the power spectrum by simply integrating it.)

But this explanation and staring at lines of maths would not be satisfactory to me, I prefer a more intuitive treatment.

So another take is – DFT transform is a unitary transformation. And L2 norm (squared difference) is unitary transformation invariant. Ok, so far this probably doesn’t sound any more intuitive. Let’s take it apart piece by piece.

First – think of unitary transformation as a complex generalization of a rotation matrix. The DFT matrix (a matrix form of the discrete Fourier transform) can be analyzed using the same linear algebra tools and has many of exactly the same properties like a rotation matrix that you can check – verify that it’s normal, orthogonal etc. 🙂 With just one complication – it does the “rotation” in the complex numbers domain.

When I heard about this perspective of DFT in a matrix form and its properties, it took me a while to conceptualize it (and I still probably don’t fully grasp all of its properties). But it’s very useful to think of DFT as a “weird generalization of the rotation” (I hope more mathematically inclined readers don’t get angry at my very loose use of terminology here 🙂 but that’s my intuitive take!).

Having this perspective, we can look at a squared difference – the L2 norm. This is the same as squared Euclidean distance – which doesn’t depend on the rotations of vectors! You can see it on the following diagram:

On the other hand, some other norms like L1 (“Manhattan distance”) definitely depend on those and change, for example:

By rotating a square, you can get L1 distance between two corners from to 2.

We could look at using loss like this. But instead, we can have a look at a better and luckily more common use of spectral loss – comparing the amplitude (instead of full complex numbers).

Spectrum amplitude difference loss

We can look instead at the difference of squared amplitudes:

This representation is more common and has some interesting properties. Let’s use this loss / error metric to compare the three image distortions above:

Ok, now we can see something more interesting happening!

First of all – no, the shifted value is not missing. Shifting an image has an error of zero, so its negative logarithm is at “infinity”. I considered how to show it on this plot, but decided it’s best left out – yes, it is “infinity”.

This time, we are similarly sensitive to blur, but… not sensitive at all to shifts!

The implication that shifting an image by a pixel (or even 100 or 1000 if your “wrap” the image) gives exactly the same frequency response magnitude and zero difference in amplitudes is very important.

This should make some intuitive sense – the image frequency content is the same; but it’s at a different phase.

Spectrum amplitude is shift invariant for pixel-sized shifts and when not resampling!

This can be a very useful property for some super-resolution like tasks and other inverse problems, where we don’t care about small shifts, or might have some problems in for example image alignment stage and we’d like to be robust to those.

We are also less sensitive to noise. In this context, we can say that we are relatively more sensitive to blurriness. This metric has improved quite a lot the score of the noisy picture as compared to the blurry one – which also seems to be doing +/- what we need.

To get them to comparable score with this new metric, we’d need significantly more noise:

With frequency amplitude loss, you have to add more noise to get similar distortion score as compared to blurring.

But let’s have a look at the direct consequences of this phase invariance that makes this loss useless on its own…

Phase information is essential

Ok, now let’s use our amplitude loss on the original picture, and this one:

Pixels of a different image with the same frequency spectrum amplitudes.

The loss value is… 0. According to the frequency amplitude error metric, they are identical.

Again, we are comparing the picture above with: 

Yep, those two pictures have exactly the same spectral content amplitudes, just a different phase. I actually just reset the phase / angle to be all zeros (but you could set it to any random values).

By comparison, standard L2 metric error would indicate a huge error there.

For natural images, the phase is crucial. You cannot just ignore it.

This is in an interesting contrast with audio, where phases don’t matter as much (until you get into non-linearities, add many signals, or have quickly changing spectral content -> transients)

Comparing phase and amplitude?

I saw some papers “sneakily” (not explaining why and what’s going on, nor analyzing!) try to solve it by comparing both amplitude, and a phase and adding them together:

Comparing phases is a bit tedious (as it needs to be “toroidal” comparison, -pi needs to yield the same results and zero loss as pi; I ignored it above), but does it make sense? Kind of?

I propose to look at it this way – comparing a sum of frequency and phase loss is like instead of applying a shortest path between two numbers, summing a path along the diagonal, and the angle difference, which can be considered a wide arc:

Above, La represents the amplitude difference, while Lp represents phase difference. In this drawing, Lp is supposed to be on the circle itself, but I was not able to do it in Slides.

Obviously, relative length of components can be scaled arbitrarily by the coefficient and the effect of phase reduces.

Here’s a visualization of a single frequency with real part of 1, and imaginary of 0 with real/imaginary numbers on the x/y axis, standard L2 loss:

Standard L2 loss.

The L2 loss is “radial” – the further we go away from our reference point (0, 1), the error increases radially. Make sure you understand the above plot before going further.

If we looked only at the frequency amplitude difference, it would look like this:

Frequency amplitude loss.

Hopefully this is also intuitive – all vectors with the same magnitude also form a circle of 0 error – whether (1, 0), (0,1), or ( , ). So we have infinitely many values (corresponding to different phases) that are considered the same and are centered around (0, 0) this time.

And here’s a sum of squared amplitude and angle/phase loss (with the phase slightly downweighted):

Combination of frequency amplitude and phase loss.

So kind of a mix of the above behaviors; but with a nasty singularity around the center – arbitrary small displacements to a (0, 0) vector cause phase to shift and flip arbitrarily…

Imagine a difference between (0, 0.0000001) and (0, -0.0000001) – they have completely opposite phases!

This is a big problem for optimization. In the regions where you get close to zero signal, and the amplitudes are very small, the gradient from the phase is going to be very strong (while irrelevant to what we want to achieve) and discontinuous. And we also lose our original goal – of relying mostly on frequency amplitude similarity of signal.

So does this combined loss make sense? Because of the above reasons, I would recommend against it. I’d argue that papers that propose the above loss and not discuss this behavior are also wrong.

I’ll write my recommendation in the summary, but first let’s have a look at some other frequency spectrum related alternatives that I won’t discuss, but can be an excellent research direction.

Tiled/windowed DFTs/FFTs and wavelets

Generally looking at the whole image and all of the frequencies with global/infinite support (and wrapping!) is rarely desired. Much more interesting mathematical domain was developed for that purpose – all kinds of “wavelets”, localized frequency decomposition. I believe it’s my third blog post where I mention wavelets, but say I don’t know enough about them and I have to defer the reader to literature, so maybe time for me to catch up on their fundamentals… 🙂 

A simpler alternative (that doesn’t require a PhD in applied maths) is to look at “localized” DFTs. Chop up the image into small pieces (typically overlapping tiles), apply some window function to prevent discontinuities and weight the influence by the distance, and analyze / compare locally.

This is a very reasonable approach, used often in all kinds of image processing – but also one of the main audio processing building blocks: Short-time Fourier Transform. If you ever looked at audio spectrograms and spectral methods – you have seen those for sure. This is also a useful tool for comparing and operating on images in different contexts, but that’s a topic for yet another future post.

My recommendations

Ok, so going back to the original question – does it make sense to use frequency loss?

Yes, I think so, but in a more limited capacity.

As a first recommendation, I’d say – amplitude if not enough, but don’t compare the phases. You can get a similar result (without the singularity!) by mixing a sum of the squared frequency amplitude and regular, pixel space square loss, something looking like this (but further tuneable):

This is how this loss shape looks like:

A combination of L2 and frequency amplitude loss. Has desirable properties of both.

With such a formulation, we can compare all the three original image distortion cases:

This way, we can obtain:

  • Slightly more sensitivity to blur than to noise as compared to standard L2,
  • Some shift invariance for minor shifts,
  • Robustness to major shifts – as L2 will dominate,
  • Similarly being robust to completely scrambled phases,
  • Smooth gradient and no singularities.

Note that I’m not comparing it with SSIM, LPIPS, FLIP, or one of numerous other good “perceptual” losses and will not give any recommendation about those – use of loss and error metric is heavily application specific!

But in a machine learning scenario, in the simplest case where you’d like to improve L2 – add some shift invariance, and care about preserving general spectral content  – for example for preserving textures in a GAN training scenario or for computer vision tasks where registration might be imperfect – it’s a useful simple tool to consider.

I would like to acknowledge my colleague Jon Barron with whom I discussed those topics on a few different occasions and who also noticed some of wrong uses of “frequency loss” (as in squared DFT difference) in some recent computer vision papers.

Posted in Code / Graphics | Tagged , , , , , | 3 Comments

On leaving California and the Silicon Valley

Beginning of the year, my wife and I made a final call – to leave California, “as soon as possible”. To be fair, we talked about this for a long time; a few years, but without any concrete date or commitment. But now it was serious, and even the still ongoing pandemic and all related challenges couldn’t stop us. On the first of April, we were already in a hotel in Midtown of NYC, preparing to look for a new apartment that we found two weeks later.

Not California anymore!

I get asked almost daily “why did you do it? Did your Californian dream not work out?”, and some New Yorkers are in shock and somewhat question how a sane person would move out of “perfect, sunny California”.

The Californian dream did work out – after all, we spent almost 7 years there!

Out of which four were in the San Francisco Bay Area – so it wasn’t “suffering”. I am quick to make impulsive/irrational/radical decisions in my life, so if things were bad, we wouldn’t stay there. There are some absolutely beautiful moments and memories that I’d like to keep for the rest of my life. And I could consider “retiring” in Los Angeles at some older age – it was a super cool place to be and live, I loved many things about it (it was way more diverse – in every possible way – than SFBA); even if it didn’t satisfy all my needs.

But there are also many reasons we “finally” decided for the move, reasons why I would not want to spend any more time in the Silicon Valley, and most are nuanced beyond a short.

Therefore, I wrote a personal post on this – detailing some of my thoughts on the Bay Area, California, Silicon Valley (mostly skipping LA, as we left that area over four years ago) – and why they were not an ultimate match for me. If you expect any “hot takes”, then you’ll be probably disappointed. 🙂

Some disclaimers

It’s going to be a personal, and a very subjective take – so feel free to strongly disagree or simply have opposite preferences. 🙂 Many people love living there, many Americans love suburbs, and as always different people can have completely opposite tastes!

It’s a decision that we made together with my wife (who was even stronger advocating for it, as she didn’t enjoy as much some of the things that I find cool in the South Bay), but I’m going to focus on just my perspective (even if we’d agree on 99% of it).

Finally, I am Polish, a Slav, and an European – so consider that my wording might be on a more negative side that most Americans would use. 🙂

Suburbs

Let me start with the obvious – living in California, especially in the Bay Area is for vast majority of people equivalent to living in the suburbs. We used to live for over a year in San Francisco, but it didn’t live up to our expectations (that’s a longer story on its own).

Even Los Angeles, “big city” is de facto a collection of independent, patched suburbs. Ok, but what’s wrong with that?

Flying back above the Bay Area (here probably Santa Clara) – endless suburbs…

It’s super boring! Extremely suburban lifestyle, with a long drive almost anywhere. To me, suburbs are a complete dystopia in many ways – from car culture, lack of community, lack of inspiring places, zero walkability, segregation based on income and other factors, obsession over safety.

If you are a reader from Europe, I need to write a disclaimer – American suburbs are nothing like European. The scale (and consequences of thereof) is unbelievably different! When I was 14, my family moved from a small apartment in the city center to a house in the suburbs of Warsaw. Back then they were “far suburbs”, my parents build a house in the middle of a huge empty field. But I’d still go to a school in the city center, all my friends lived across the city, and I’d always use great public transport to be within an hour to almost any part of the city. I could even walk on foot to any of those other districts (and I often did!) and at college every afternoon in the summer I’d be cycling across the city. By comparison, in Bay Area walking from one suburb town – Mountain View – to an adjacent one, Palo Alto, would take 1.5 uninspiring hours to walk with many intersections on busy highway streets. Even a drive would be ~half an hour! Going to the city (San Francisco) meant either 1.5h train ride (and then switch to SF public transportation, where some parts could be another 1h away), or 1h car ride with next half an hour looking for a parking spot. Back on topic!

Generally the “vibe” I had was that nothing is happening, so as a typical leisure you have a choice of doing outdoors sports, driving for some hike, watching a movie, playing video games, or going to an expensive, but mediocre restaurant. (Ok, I have to admit – it was mediocre in the South Bay / Silicon Valley. SF restaurants were extremely expensive, but great!)

I grew up in cities and love city life in general.

By comparison this is a collage of photos I took all within 5minutes walk of my current home in NYC – hard to not get immediate visual stimulation and inspiration!

I like density, things going on all the time, busy atmosphere, and urban life. I love walking out, seeing some events happening, streets closed, music playing, concerts and festivals happening, and just getting immersed in it and taking part in some of those events. Or going out to a club with friends, but first checking out some pub or a house party, changing the location a few times, and finally going back home on foot and grabbing a late night bite. It’s all about spontaneity, many (sometime overwhelming!) options to pick from, and not having to plan things ahead – and not feel trapped in a routine. Also, often the best things that you remember are all those that you didn’t plan, just happened and surprised you in a new way. As an emigrant willing to experience some of the new culture directly and get immersed in it- Montreal was great, Los Angeles mediocre, South Bay terrible.

Suburban boredom was extremely uninspiring and demotivating, and especially during the pandemic it felt like Groundhog Day.

With everything closed, one would think you’ll be able to just walk around outdoors?

This is how people imagine that suburbs look like (ok, Halloween was great and just how I’d imagine it from American movies):

But this is how most of it looks like: strip malls, parking lots, cars, streets (rainbows optional):

Public transport is super underfunded and not great (most people use cars anyway), and while some claim that owning and driving a car is “liberating”, for me it is the opposite – liberating is a feeling that I can end up in a random part of a city and easily get back home by a mix of public transport, walking, and city bikes (yeah!), with plenty of choices that I can pick based on my mood or a specific need.

Unfortunately, suburbs typically mean zero walkability. You cannot really just walk outside of your house and enjoy it – pedestrian friendly infrastructure is non-existent. Sidewalks that abruptly end, constant high traffic roads that you need to cross. Even if it wasn’t a problem, there are miles where there is nothing around except for people’s houses, some occasional strip malls – and it doesn’t make for an interesting or pleasant walk. In many areas you kind of have to drive for closest corner store type of small grocery. This meme captures the walkability and non-spontaneous, car heavy and sometimes stressful suburban life very well:

May be an image of text that says 'You need eggs for dinner but have none in your kitchen, what do you do? Rural farmer Urban Suburbanite pedestrian Who Child, go see if Bessie has laid more eggs for us. gets rowded warmu the Child,come walk with me to the comner store to pick up some eggs. oesnt taxes Itrying but'

I don’t want to make this blog post just about suburbs, so I’d recommend the Eco Gecko youtube channel, covering various issues with suburbs – and particularly American suburbs – very well. A quote that I heard a few times that resonated with me is that after WW2, driven by economic boom, Americans decided to throw away thousands of years of know-how on architecture, urban planning and the way we always lived, and instead perform a social experiments named “suburbs” and lead a “motorized” lifestyle. I think it has failed.

As noted, those are my preferences. Suburbs have “objective” problems, strip malls and parkings are ugly, cars are not eco friendly etc – but yes, there are people who love suburbs, relative safety (in US it’s a problem – at least a perceived one), having a large house, clean streets, and say that it’s good for family life. I cannot argue with anyone’s preferences and I respect that – different people have different needs. But I’d like to dispel a myth that suburbs are great for kids. It might be true when they are 1-10 years old, later it gets worse. As I mentioned, when I was ~14 I moved to suburbs – and I hated it… I think that for teenagers, it’s bad for their social life and ability to grow up as independent humans. Now being an adult, I think my parents have a very nice house and a beautiful garden and I can appreciate chilling there with a bbq – but back then I felt alone and isolated, and if it wasn’t for my friends in other parts of the city and great public transport, I’d be very miserable. Which teenager wants to be driven around and “helicoptered” by parents and have whole life revolving around school and “activities”?

Lifestyle

Ok, but people live there and are very happy, so what do they do?

This brings me to a difference in “West Coast lifestyle” (it’s definitely a thing).

Most of the people living there like a specific type of lifestyle – in their spare time they enjoy camping, hiking, rock climbing, mountain biking and similar.

And this is cool, while I’m not into any hardcore sports or camping (being a “mieszczuch”, apparently is translated to “city slicker” – not sure if this is accurate translation), I do love beautiful nature and can appreciate a hike.

Californian hikes and nature can be truly awesome, even outside of the famous spots:

But as much as I love hiking, it gets boring to me very quickly – either you go to see the same stuff and type of landscape over and over, or you need to drive for many hours.

Living in the Silicon Valley, the closest hikes were driving 45min one way, then hunting for half an hour for a parking spot (it doesn’t matter if there is vast open space if some percentage of over 7 million people living around tries to find a parking spot by the road… and forget about getting there by any public transport), doing an hour hike, and the driving back. Boom, half of the weekend gone, most of it spent in a car. And quite a lot of them have same, “harsh” and drought affected climate with dusty roads and dried out, yellow vegetation:

Anyway, the vast majority of friends and colleagues had such “outdoor” and active interests and hobbies (I’m not mentioning other classic techie hobbies that are interesting, but solitary like bread making or brewing and similar).

On the other hand, I am much more interested in other things like music, art, culture, museums, street activities, and clubbing. Note again – there is nothing wrong with loving outdoors vs indoor – just a matter of preference.

“Wait a second!” I might hear someone say. Most of it is available “somewhere” around – some museums, events, concerts, restaurants. Sure, but due to distances and things being sparse and scattered around, everything needs to be planned and done on purpose. If you want to drive to a museum, it’s one hour, but then to go to a particular restaurant, it’s another hour. There is no room for randomness, things just happening and you randomly taking part in them. You need to actively research, plan, book/reserve ahead, drive, do a single activity, and come back home. As I write it, I shiver; this lack of spontaneity kind of still terrifies me. There’s “American prepper” stereotype, people who are super organized ahead of time and often over-prepared – for a reason.

Similarly, I was growing up being a part of alternative cultures communities (punk rock, some goth, new wave, later nu-rave, blog house, rave, electronica, house, disco, underground techno) and was missing it. The kind of thing when I’d go to some club even when nothing special is going on – to get to know some new artist, band, or a dj; to see if any of my friends are around, maybe meet new people, maybe just hang around. This is how I made lifelong and best friends – completely randomly and through shared interests and communities and we are still friends up till today, even if those interests changed later. I missed it greatly and it contributed to a feeling of loneliness.

Photo taken in my hometown, Warsaw. The most missed aspect of culture that was non existent in the Bay Area is kind of “pop-up” culture and community gathering spaces. A place where you just randomly end up with a bunch of your friends, meet some new people, discover some new place or a new music artist, and can expect the unexpected.

So this is matter of preference and just something I’m into vs not into, so I consider it “neutral”. On the other hand, there are other things I definitely didn’t like and consider negative about the vibe I got from some (obviously not all) people I met in Silicon Valley. 

Car culture and driving to all places like restaurants, clubs, and bars has also a “dark side”. Quite a lot of people are drunk driving – way more than in Europe. In Poland blood alcohol limit is 0.02%, in California it is 0.08%. I’d see people visibly drunk with slurry speech that go pick up their car from valet and go home. By comparison in Poland among my friends if someone would have a single weak beer, all of the friends would shout at them “wait a few hours, you cannot drive you idiot”.

Side note: There is also the whole “tech bro” “alternative” culture – people into Burning Man, hardcore libertarianism, radical self expression and radical individualism, polyamory, and microdosing (to increase productivity and compete, obviously…) – but none of it is my thing. If you know me, I’m as far “ideologically” from technocratic libertarianism and rand-esque/randian “freedom for rich individuals” (or rather, attempting to slap a “culture” label on the lifestyle of self-obsessed young bros) as possible. Also, I think I haven’t really met in person even a single person (openly) into this stuff, so maybe it’s an exaggerated myth, or maybe simply much more common in just VC/startup/big money circles?

Competition, money and success

Lots of people in the Bay Area are very competitive and success obsessed.

There are jokes about some douches bragging on their phone while in a supermarket checkout line about how many million they will get from their upcoming IPO – those “jokes” are real situations.

Everyone talks about money, success, position in their company (1 person company = opportunity to be a founder, CEO, CFO and CTO, and a VP of engineering – all in one!), entrepreneurship. At corps / companies people think and talk about promos, bonuses, managers about getting more headcount, higher profile project, and claiming “impact” – everything in some unhealthy competition sauce (sometimes up to a direct conflict between 2 or more teams). You are often judged by your peers by your “success” in those metrics.

Such people see collaboration only as a tool of fulfilling your own goals (once they are fulfilled, at best you are a stranger, at worst they’ll backstab you). Teams are built to fulfil goals of their leads who want to “work through others” and even express it openly and unironically (I still shiver when I hear/read this phrase). Lots of “friendships” are more like just networking or purely transactional. It’s kinda disappointing when a “friend” of yours invites you “for socializing and catching up” and then turns it into a job or startup pitch.

I personally find it toxic.

Ok, at least some of the tech “competition” is not-so-serious. 🙂

It’s also a bad place to have a family (even though I’m not planning one now) – teenage suicide rates are very high around Palo Alto due to this competitive vibe and extremely high expectations and peer pressure… I cannot imagine growing up in an environment where “competition” and you competing starts before even you are born or conceived – with parents securing a house in a place with the best schooling district and fighting and filing applications for spots in a daycare.

The competition area is also very specific – there is a lack of cultural diversity, everyone works in tech, works a lot, and talks mostly about tech.

In the Bay Area, tech is everywhere – here food delivery robots mention “hunger”, while there are actually hungry people on the street (more on that later).

I know, I’m a tech person, so I shouldn’t complain… but I actually thrive in more diverse environments; it’s fun when you have tech people meeting art people meeting humanities people meeting entertainment people; 🙂 extraverts mixing with introverts, programmers and office clerks and musicians and theater people and movie people etc. None of it is there, no real diversity of ideas or cultures.

I don’t think we really met anyone not working in tech or general entrepreneurship (other than spouses of tech colleagues) – which is kind of crazy given being there four years (!).

Tech arrogance

This one is even worse. Silicon Valley is full of extremely smart and talented people, who excel in their domain – technology. People who were always the best students at school, are ridiculously intelligent and capable of extreme levels of abstraction. I often felt very stupid surrounded by peers who grew up there and graduated from those prime colleges and grad schools. Super smart, super educated, super ambitious, and super hard working.

Unfortunately for some people who are smart and successful in one domain, one of the side effects is arrogance outside of their domain – “this is a simple problem, I can solve it”.

Isn’t this what the Silicon Valley entrepreneurship is about? 🙂 A “big vision” that is indifferent to any potential challenges of obstacles (ok, I admit – this is actually a mindset/skill I am jealous of; I wish my insecurities about potentially not delivering didn’t go into pessimism). This comes with a tendency to oversimplify things, ignore nitty gritty details, and jump into it with hyper optimism and technocratic belief that everything can be solved just with technology.

Don’t get me started over the whole idea of “disruption” which is often just ignoring and breaking existing regulations, stripping away worker’s rights, and extracting their money at a huge profit (see Uber, Airbnb, electric scooters, and many more).

“Bro, you have just ‘reinvented’ coworkers/roommates/hotel/bus/taxi/wire transfers”

And you can feel this vibe around all the time – it’s absolutely ridiculous, but being surrounded with such a mindset (see my notes above about lack of cultural diversity…), you can start truly losing perspective.

Economy and prices

Finally, a point that every reader who knows anything about California was expecting – everything is super expensive.

Housing prices and issues around it are very well known, and I’m not an expert on this. It’s enough to say that most apartments and houses around cost $1-4mln, and it’s the buyers who compete (again!) and outbid who gets to buy one, with final sales prices 30% above the asking price. Those “mansions” are typically made from cardboard, ugly, badly built, and with outdated, dangerous electrical installation. Why would sellers bother doing anything better, renovating etc. if people are going to pay fortunes for it anyway?

I didn’t buy a house, my rent was high (but not that bad), but this goes further.

On top of housing, also every single “service” is way more expensive than in other parts of the US (or most of the world). In LA for repairing my car (scratch and dent on a side) in a bodyshop with stellar reviews we paid $800 and were very happy with the service; in Bay Area for a much smaller, but similar type of repair (only scratched side, no dent) of the same car I got a quote for over $4K, of which only $400 was material cost (probably inflated as well) – and we decided not to repair, our car is probably worth less than twice that. ¯\_(ツ)_/¯ 

I am privileged and a high earner, so why would I care?

The worst part to me was the heartbreaking wealth disparity and homeless population.

Campers and cars with people living in them are everywhere, including parked at a side of some tech companies campuses / buildings.

A huge topic on its own, so I won’t add anything original or insightful here, people have written essays and books on it. But still – I cannot get over seeing one block with $4mln mansions, and a few blocks away campers and cars with people living there (who are not unemployed! Some of them work as actually work as so called “vendors” – security, cleaning, and kitchen staff – for the big tech…).

And this is the luckier ones, who still own a camper or a car and have a job – there are many homeless living in tents (or under starry sky) on the streets of San Francisco, often people with disabilities, mental health problems, or addictions.

I won’t exaggerate at all saying that many people in the BA care much more for their dogs, than for fellow humans that get dehumanized.

Climate…? And other misc

This one will be a fun and weird one, as many (even Americans!) assume that California is perfect weather.

It’s definitely not the case in San Francisco and generally closer to the coast – damp, cold, windy most of the year. My mom visiting me in the summer with her friend brought mainly shorts and dresses and shirts. They had to spend the first day shopping for a jacket, hoodie, and long trousers.

Ok, San Francisco is an extreme outlier – going more in-land, it is definitely warm and sunny, which comes at a price.

All the last few years the whole summer you get extreme heat waves, wildfires, bad air quality, and power blackouts. Last year it was particularly bad… Climate change has already come and we get to pay the price – desert areas that we made inhabitable will soon stop being inhabitable anymore. Feels truly depressing (and that humanity cannot get its s..t together).

Apocalypse? No, just Californian summer and wildfires. And some “beautiful” suburbs.

Ok, so you have to spend a few weeks inside with AC on and good air filters, but it’s just a small part of the year?

Here again comes the personal preference – I got tired by almost zero rain, no weather, or seasons. I’m not a big fan of cold or long winters, but I missed autumn, rainy days, excitement of spring, hot summer nights (not many people realize that on the West Coast even in the summer it very rarely gets warmer than 15C at night – desert climate)! You get long summer, and autumn-like “winter” with some nice foliage colors around late November / mid December, and you used to get a few rainy days December – February – but not really anymore.

I have to admit that a few weeks of autumn-like “winter” can be very pretty!

Climate is getting worse, droughts and wildfires are making California worse and worse place to live, but there’s another thing that is on the back of your mind – earthquakes. You feel some minor ones once a month (sometimes once a week), but that’s not a problem at all (I sleep well, so even pretty heavy ones with the furniture shaking wouldn’t wake me up). But once every few decades, there is an earthquake that destroys whole cities and with huge casualties. 😦 I’d hope it never happens (especially having some friends who live there), but it will – and living above faults is like living on a ticking timebomb…

From other misc and random things, California is (too) far away from Europe. This is the most personal of all of the above reasons, buy the family gets older and I don’t get to be part of their lives, and I miss some of my friends. It’s a long and expensive trip, involving a connection (though there might be a direct one soon), so any trip needs to be planned and quite long, using plenty of “precious” vacation days.

Some other gripe that might be surprising for non-Americans – lots of technology and infrastructure is pretty bad quality in the Silicon Valley as well! Unless you are lucky to have fiber available, internet connections are way slower than in Europe (and way more expensive), financial infrastructure old school (though this is a problem of the whole US). When I heard of Apple Pay or Venmo, I was surprised – what’s the big deal? We had paypass contactless payments and instant zero-fee wire transfers (instead of Venmoing a friend, just wire them, they’ll get it right away) in Poland since the early 2000s!

Let me finish this with a more “random”and potentially polarizing (because politics) point.

I am a (democratic) socialist, left wing, atheist etc. and I thought California would at least partially fit my vision of politics and my political preferences. But I quickly learned that Democrats, especially Californian are simply arrogant plutocratic political elites and work towards the interests of local lobbies. (On the major political axis defining the world post 2016 – elitist vs populist – many are as elitist as it gets).

It was very clear during the pandemic with way stricter restrictions than rest of the country, Local and state officials threatening people with cutting off their electricity and water for having a house party – while at the same time, same people who imposed those lockdowns were ignoring them, like California governor having a large party with lobbyists at a 3 Michelin star restaurant, mayor of SF also dining out while telling citizens to stay home, or the speaker Nancy Pelosi sneaking to a hairdresser despite all services closures. Different laws and rules for us, the uneducated masses, different rules for our great leaders.

And there is no room for real democracy or public discourse – as it will be shut down with “trust the science and experts!” (whatever that means; science is a method of investigating the world and building a better understanding of it, and has nothing to do with conducting politics or policies).

Things I will miss

I warned you that I’m a pessimistic Eastern European. But I don’t want to end on such a negative note. Very often, life was also great there, I didn’t “suffer”. 🙂 

Here are some things I will definitely miss about California, and even the Bay Area.

I actually already miss some and am nostalgic about it – after all, it was 4 amazing years of my life.

Inspiring legacy

It was cool to work in THE Silicon Valley. You feel the history of technology all around you, see all the landmarks, and random buildings turn out to be of historical significance.

For example Googleplex is the collection of old Silicon Graphics buildings – how cool and inspiring is that for someone in graphics? Simple bike ride around would be seeing all the old and new offices of companies whose products I used since I was a kid. It feels like being immersed in the technology (and all the great things about it that I dreamt of with the 90s technooptimism).

Yeah, when I was a kid and I was reading about the mythical Silicon Valley (that I imagined as some secret labs and factories in the middle of a desert 😀 funny, but might actually become true in a few decades…) and all of the famous people who created computers and modern electronics, I never imagined I’d be given an opportunity to work there, or sometimes even with those people in person (!).

I still find it dream-like and kind of unbelievable – and I do appreciate this once in a lifetime, humbling experience. Being surrounded by it and all the companies plus their history was inspiring in many ways as well.

I’d often think about the future and in some ways I felt like I’m part of it / being a tiny cogwheel contributing to it in some (negligible) way. This felt – and still feels – surreal and amazing.

Career impact

Let me be clear – by moving out, I have severely limited my career growth opportunities. 

And it’s not just about the tech giants, but also all the smaller opportunities, start ups, or even working with some of the academics.

This is a fact – and the only other almost comparable hub would be the Seattle area.

In NYC (or worse, in Europe if I ever wanted to go back…), tech presence is limited and most likely I will have to work remotely for a foreseeable future. Most people will return to offices, and many of the career growth options (I don’t mean just ladder climbing that I just criticized; but also exposure to projects, opportunities for collaboration, being in person and in the same office with all those amazing people) will be available only at headquarters.

Having an opportunity to present your work to the Alphabet CEO in person is not something that happens often – I’ll most likely never have a chance like that again in my life. But it’s possible if you live and work in “the place where it all happens”.

But life is not only a career – yes, I’ll have it a little bit more difficult, but I’ll manage and be just fine 🙂 – and life is so much more beyond that. You cannot “outwork” or buy happiness.

Life is not about narrowly understood “success” (that can be understood only by niche groups of experts) or competition, but fulfilment and happiness (or pursuit of it…).

Suburban pleasures

While I am not a suburban person, there are things I will miss.

Lots of green and foliage. Nice smells of flowers and freshly mowed grass – and omg those. bushes of jasmine and their captivating smell! Able to ride a bike around smaller neighborhoods without worrying too much about the traffic. But most importantly – outdoor swimming pools in community/apartment complexes, and ubiquitous barbecues!

I love lap swimming and have been doing it almost every single day for the past few years and I will miss it (hey, I already miss it).

Similarly, a regular weekend evening grilling on your large terrace when it just starts to cool down after a hot day with a cold drink in your hand was quite amazing.

Weekend (or week!) night pizza making or a barbecue? No problem, you don’t even need to leave your house / apartment.

I love cycling and some of the casual strolls were nice. I guess I’d miss it already (cycling in NYC is fun, but a different experience – a slalom between badly parked cars and unsafe traffic) if not for a bit of pandemic cycling burnout (every morning I was cycling around, but due to limited choice there were ~3-4 routes that I went through each dozens+ of times).

Finally, our neighborhood was full of cats. I love cats (who doesn’t…), but we cannot have one, and stil we got some pandemic stress relief in petting some of our purring neighbors.

We’d get pretty often some nice unexpected guests entering the apartment.

Californian easy going

Finally – Californian optimistic and positive, easy going spirit is super nice and contagious.

Despite me complaining about having a hard time making friends and criticizing the tech culture, I truly liked most of the people I met there.

If you don’t make any troubles (and have money…), life is very easy, comfortable, everyone is smiling, you exchange pleasant small talks, nothing is bothering you.

Some stranger grinning “hi how are you???” (that seems more extreme than in the other parts of the US) might be superficial, but it can make your day.

There’s some certain Californian optimism that I’ll definitely miss. After all, it’s a land of Californian dream. Yes, this is cheesy (or cheugy?), and the photo is as well. But I like it.

Conclusions

There are no conclusions! This was just my take and I went very personal – the most I ever did on this blog. I know some people share of my opinions; some problems of the Silicon Valley are very real, but don’t read too much into my opinions on it. And it was such an important and great part of my life and my career, and I know I’ll miss many things about it. That said, if you get an offer to relocate to the Bay Area – I’d recommend thinking about all those aspects of living that are outside of work and if they’d fit your preferences or not.

Is New York City going to better serve my needs? I certainly hope so, it seems so far (a completely different world!) and I’m optimistic looking into the future, but only time will tell.

Posted in Travel / Photography | 11 Comments

Neural material (de)compression – data-driven nonlinear dimensionality reduction

Proposed neural material decompression (on the right) is similar to the SVD based one (left), but instead of a matrix multiplication uses a tiny, local support and per-texel neural network that can run with very small amounts of per-pixel computations.

In this post I come back to something I didn’t expect coming back to – dimensionality reduction and compression for whole material texture sets (as opposed to single textures) – a significantly underexplored topic.

In one of my past posts I described how the most simple linear correlation can be used to significantly reduce material texture set dimensionality, and in another one ventured further into a fascinating field of dictionary and sparse compression.

This time, I will describe how we can abandon the land of simple linear correlation (that we explored through the use of SVD) and use data-driven techniques (machine learning!) to discover non-linear functions – using differentiable programming / neural networks.

I’m going to cover:

  • Using non-linear decoding on per-pixel data to get better dimensionality reduction than with linear correlation / PCA,
  • Augmenting non-linear decoders with neighborhood information to improve the reconstruction quality,
  • Using non-linear encoders on the SVD data for target platform/performance scaling,
  • Possibility of using learned encoders/decoders for GBuffer data.

Note: There is nothing magic or “neural” (and for sure not any damn AI!) about it, just non-linear relationships, and I absolutely hate that naming.

On the other hand, it does use neural networks, so investors, take notice!

Edit: I have added some example code for this post on github/colab. It’s very messy and not documented, so use at your own responsibility (preferably with a GPU colab instance)!

Inspiration

There have been a few papers recently that have changed the way I think about dimensionality reduction specifically for representing materials and how the design space might look like.

I won’t list all here, but here (and here or here) are some more recent ones that stood out and I remembered off the top of my head. The thing I thought was interesting, different, and original was the concept of “neural deferred rendering” – having compact latent codes that are used for approximating functions like BRDFs (or even final pixel directional radiance) expanded by a tiny neural network, often in the form of a multi-layer perceptron.

What if we used this, but in a semi-offline setting? 

What if we used tiny neural networks to decode materials?

To explain why I thought it could be interesting to bother to use a “wasteful” NN (NNs are always wasteful due to overcompleteness, even if tiny!), let’s start with function approximation and fitting using non-linear relationships.

Linear vs non-linear relationships

In my post about SVD and PCA, I discussed how linear correlation works and how it can be used. If you haven’t read my post, I recommend to pause reading this one, as I will use the same terminology, problem setting, and will go pretty fast.

It’s an extremely powerful relationship; and it works so well because many problems are linear once you “zoom in enough” – the principle behind truncated Taylor series approximations. Local line equations rule the engineering world! (And can be used as an often more desirable replacement for the joint bilateral filter).

So in the case of PCA material compression, we assume there is a linear relationship between the material properties and color channels and use it to decorrelate the input data. Then. we can ignore the components that are less correlated and contribute less, reducing the representation and storage to just a few channels.

Assuming linear correlation of data and finding line fits allows us to reduce the dimensionality – here from 2D to 1D.

Then the output can be computed by a weighted combination of adding them together.

Unfortunately, once we “zoom out” of local approximations (or basically look at most real world data), relationships in the data stop being so “linear”.

Often PCA is not enough. Like in the above example – we discard lots of useful distinction between some points.

But let’s go back and have a look at a different set of simple observations(represented as noisy data points on axis x and y):

Attempting line fitting on quadratically related data fails to discover the distribution shape.

A linear correlation cannot capture much of the distribution shape and is very lossy. We can guess why – because this looks like a good old parabola, a quadratic function.

Line fit is not enough and errors are large both “inside” the set, as well as approaching its ends.

This is what drove the further developments of the field of machine learning (note: PCA is definitely also a machine learning algorithm!) – to discover, explain, and approximate such growingly more complex relationships, with both parametric (where we assume some model of reality and try to find the best parameters), as well as non-parametric (where we don’t formulate the model of reality or data distribution) models.

The goal of Machine Learning is to give reasonable answers where we simply don’t know the answers ourselves – when we don’t have precise models of reality, there are too many dimensions, relationships are way too complex, or we know they might be changing.

In this case, we can use many algorithms, but the hottest one are “obviously” neural networks. 🙂 The next section will try to look at why, but first let’s approximate our data with a small, few neurons in a hidden layer Mult-Layer Perceptron NN with a ReLU activation function (ReLU is a fancy name for max(x, 0) – probably the simplest non-linear function that is actually useful and differentiable almost everywhere!) with 3 hidden neurons:

Three neuron ReLU neural network approximation of the toy data.

Or for example with 4 neurons as:

Four neuron ReLU neural network approximation of the toy data.

If we’re “lucky” (good initialization), we could get a fit like this:

“Lucky” initialization leads to matching the distribution shape closely with a potential for extrapolation.

It’s not “perfect”, but much better than the linear model. SVD, PCA, or line fits in a vanilla form can discover only linear relationships, while networks can discover some other, interesting nonlinear ones.

And this is why we’re going to use them!

Digression – why MLP, ReLU, and neural networks are so “hot”?

A few more words on the selected “architecture” – we’re going to use a most old-school, classic Neural Network algorithm – Multilayer Perceptron.

A 3 input, 3 neurons in a hidden layer, and 2 output neurons Multilayer Perceptron. Source: wikipedia.

When I first heard about “Neural Networks” at college around the mid 00s, in the introductory courses, “neural networks” were almost always MLPs and considered quite outdated. I also learned about Convolutional Networks and implemented one for my final project, but those were considered slow, impractical, and generally worse performing than SVMs and other techniques – isn’t it funny how things change and history turns around? 🙂

Why are neural networks used today everywhere and for almost everything?

There are numerous explanations, from deeply theoretical ones about “universal function approximation”, through hardware based ones (efficient to execute on GPUs, but also HW accelerators being built for those), but for me the most both pragmatic, as well as perhaps a bit cynical is – a circular explanation – because the field is so “hot” that a) actual research went into building great tools and libraries for those, plus b) there is a lot of know-how.

It’s super trivial to use a neural network to solve a problem where you have “some” data, the libraries and tooling are excellent, and over a decade of recent research (based on half a century of fundamental research before that!) went into making them fast and well understood.

MLP with a ReLU activation provides piecewise linear approximations to the actual solutions. In our example, they will never be able to learn a real parabola, but given enough data it can perform very well if we evaluate on the same data.

While it doesn’t make sense to use NNs for data that we know is drawn from a small parabola (because we can use a parametric model – both more accurate, as well as way more efficient), if we had more than three a few dimensions or a few hundred samples, human reasoning, intuition and “eyeballing” starts to break down.

The volume space of possible solutions increases exponentially (“curse of dimensionality”). For the use-case we’ll look at next (neural decompression of textures), we cannot just quickly look at the data and realize “ok, there is a cubic relationship in some latent space between the albedo and glossiness”- so finding those with small MLPs is an attractive alternative.

But it can very easily overfit (fail to discover real, underlying relationship in the data – variance bias trade-off); for example increasing the hidden layer neuron count to 256, we can get:

With 256 neurons in the hidden layer, network overfits and completely fails to discover a meaningful or correct approximation of the “real” distribution.

Luckily, for our use-case – it doesn’t matter. For compressing fixed data, the more over-fitting, the better! 🙂 

Non-linear relationships in textures

With the SVD compressed example, we were looking at N input channels (stored in textures), with K (9 in the example we analyzed) output channels, with each output channel given by x_0 * w_0 + … + x_n-1 w_n-1, so a simple linear relationship.

Let’s have a look again at N input channels, K output channels, but in between them a single hidden layer of 64 neurons with a ReLU function:

Instead of a SVD matrix multiplication, we are going to use a simple single hidden layer network.

This time, we perform a matrix multiplication (or a series of multiply-adds) for every neuron in the hidden layer and use those intermediate, locally computed values as our “temporary” input channels that get fed to a final matrix multiplication producing the output.

This is not a convolutional network. The data still stays “local”, we read same amount of information per channel. During network training, we are training both the network, as well as the input compressed texture channels.

During training, we optimize the contents of the texture (“latent code“), as well as a tiny decoder network that operates per pixel.

If those terms don’t mean much to you, unfortunately I won’t go much deeper into how to train a network, what optimizers and learning rates to use (I used ADAM with learning rate of 0.01) etc. – there are people who are much more competent and explain it much better. I learned from the Deep Learning book and can recommend it, a clear exposition to both theory and some practice, but it was a while ago, so maybe better resources are available now.

I did however write about optimizing latent codes! About optimizing SVD filters with Jax, optimizing blue noise patterns, and about finding latent codes for dictionary learning. In this case, texture contents are simply a set of variables optimized during network training, also through gradient descent / backpropagation. The whole process is not optimized at all and takes a couple of minutes (but I’m sure could be made much faster).

Without further ado, here’s the numerical quality comparison, number of channels vs PSNR:

Comparison of PSNR of SVD, optimal linear encoding with non-linear encoding optimized jointly with a tiny NN decoder.

When I first saw the numbers and behavior for the very low channel count, I was pretty impressed. The quantitative improvement is significant, over a few db. In the case of 4 channels, it turns it from useless (PSNR below 30dB), to pretty good – over 36dB!

Here’s a visual comparison:

A non-linear decoding of four channel data is way better than linear SVD on four channels not just numerically, but also immediately and perceptually visible even when zoomed out – see the effect on the normal map. Textures / material credit cc0textures, Lennart Demes.

That’s the case where you don’t even need gifs and switching back and forth – the difference is immediately obvious on the normal map as well as the height map (middle). It is comparable (almost exactly the same PSNR as well) to the 5 channels linear encoding.

This is the power of non-linearity – it can express approximations of relationships between the normal map channels like x^2 + y^2 + z^2 = 1, which is impossible to express through a single linear equation.

How do the latent codes look like? Surprisingly similar to PCA data and reasonably uncorrelated, like:

Latent space looks reasonably close to the PCA of the input data.

Efficient implementation

Before I go further, how fast / slow would that be in practice? 

I have not measured the performance, but the most straightforward implementation is to compute the per-pixel 64 values, storing them in either registers or LDS through a series of 64×4 MADDs, and then do another 64×9 MADDs to compute the final channels, totaling 832 multiply add ops.

For intermediate channel (64):
  For input channel (1-5):
    Intermediate += Input * weight
    Intermediate = max(0, Intermediate)
For output channel (9):
  For intermediate channel (64):
    Output += Intermediate * weight

This sounds expensive, however all “tricks” of optimizing NNs apply – I’m pretty sure you can use “tensor cores” on nVidia hardware, and very aggressive quantization – not just half floats, but also e.g. 8 or 4 bit smaller types for faster throughput / vectorization.

There’s also a possibility of decompressing this into some intermediate storage / cache like virtual textures, but I’m not sure if this would be beneficial? There’s an idea floating around for at least half a decade about texel shaders and how they could be the solution to this problem, butI’ve been far away from GPU architecture for the past 4 years to not have an opinion on that.

Would this work under bilinear interpolation?

A question that came to my mind was – does this non-linear decoding of data break under linear interpolation (or any linear combination) of the decoded inputs?

I have performed the most trivial test – upscale encoded data bilinearly by 4x, decode, downsample, and compare to the original decoded. It seems to be very close, but this is an empirical experiment on only one example. I am almost sure it’s going to work well on more materials – the reason is how tiny is the network as compared to the number of texels, with not much room to overfit.

To explain this intuition, here is again example from above – where small parameter count and undefitting (left) behaves well between the input points, while the overfit case (right) would behave “badly” and non-linearly under linear interpolation:

Small networks (left) have no room for overfitting, while larger (right) don’t behave well / smoothly / monotonicly between the data points.

What if we looked at neighbors / larger neighborhoods?

When you look at material textures, one thing that comes to mind is that per channel correlations might be actually less obvious than another type of correlations – spatial ones.

What I mean here is that some features depend not only on the contents of the single pixel, but much more on where it is and what it represents.

For example – a normal map might represent the derivative / gradient of the height map, and cavity or AO map darker spots be surrounded by heightmap slopes. 

Would looking at pixels of target resolution image, as well as the larger context of blurred / lower mip level neighborhood give non-linear decoder better information?

We could use either authored features like gradients, or just general convolution networks and let the network learn those non-linearities, but that could be too slow.

I had another idea that is much simpler/cheaper and seems to work quite ok – why not just look at the blurred version of the texel neighborhood? In the extreme close to zero cost approximation – if we looked at one of the next levels in the mip chain which is already built?

So as an input of the network, I add simulated reading from 2 mip levels down (by downsampling the image 4x, and then upsampling back).

This seems to improve the results quite significantly in this my favorite (strong compression, good PSNR results) 2-5 channel regime:

Adding lower mip levels as the input to the small decoder network allows to improve the PSNR significantly.

Maybe not as spectacular, but now together with those two changes we’re talking about improving PSNR for storing just 3 channels and reconstructing 9 from 25dB to over 35.5dB, this is a massive improvement!

Why not go ahead with full global encoding and coordinate networks?

Success of recent techniques in neural rendering and computer vision where the input are just spatial coordinates (like brilliant NeRF), begs a question – why not use same approach for compressing materials / textures? My understanding is that a fully trained NeRF actually uses more data than in the input (expansion instead of compression!), but even if we went with a hybrid solution (some latent code + a mixed coordinate and regular input network), for the decoding use on GPUs we’d generally don’t want to have a representation with “global support” (where every piece of information describes the behavior / decoded data in all locations).

If you need to access information about the whole texture to encode every single pixel, the data is not going to fit in the cache… And it doesn’t make sense. I think of it this way – if you were rendering an open world, to render a small object in one place would you want be required to always access information about a different object 3mile / 5km away?

Local support is way more efficient (as you need to read less data that fits in the local cache), and even techniques like JPEG that use global support basis like DCT only per a local, small tile.

Neural decompressor of linear SVD data

This is an idea that occurred to me when playing with adding those spatial relationships. Can a “neural” decompressor learn some meaningful non-linear decoding of linearly encoded data? We could use a learned decoder on fast, linearly encoded data.

This would have two significant benefits:

  1. The compression / encoding would be significantly faster.
  2. Depending on the speed / performance (or memory availability if caching) of the platform, one could pick either fast, linear decompression, or a neural network to decompress.
  3. Artists could iterate on linear, SVD data and have a guarantee that the end result will be higher quality on higher end platforms.
Depending on available computational budget, you could pick between either linear, single matrix-multiply decoding, or a non-linear one using NNs!

In this set-up, we just freeze the latent code computed through SVD and optimize only the network. The network has access to the SVD latent code and two of the following mip levels.

Comparison of linear vs non-linear decoding of the linearly compressed, SVD data.

The results are not as impressive be before, but still getting ~4dB extra from exactly same input data is an interesting and most likely applicable result?

GBuffer storage?

As a final remark, in some cases, small NNs could be used for decoding data stored in GBuffer in the context of deferred shading. If many materials could be stored using the same representation, then decoding could happen just before shading.

This is not as bizarre an idea as it might seem – while compression would most likely be more modest, I see a different benefit – many game engines that use deferred shading have hand-written hard-coded spaghetti code logic for packing different material types and variations with many properties, with equally complicated decoding functions.

When I worked on GoW, I spent weeks on “optimal” packing schemes that were compromising the needs of different artists and different types of materials (anisotropic specular, double specular lobes, thin transmission, retroreflective specular, diffusion with different profiles…).

…and there were different types of encoding/decoding functions like perceptually linearizing roughness, and then converting it to a different representation for the actual BRDF evaluation. Lots of work that was interesting (a fun challenge!) at times, but also laborious, and sometimes heartbreaking (when you have to pick between using more memory and the whole game losing 10% performance, and telling an artist that a feature you wrote for them and they loved will not be supported anymore).

It could be an amazing time saver and offloading for the programmer to rely on automatic decoding instead of such imperfect heuristics.

Summary

In this post, I returned to some ideas related to compressing whole material sets together – and relying on their correlation.

This time, we focused on the assumption of the existence of non-linear relationships in data – and used tiny neural networks to learn this relationship for efficient representation learning and dimensionality reduction. 

My post is far from exhaustive or applicable – it’s more of an idea outline / sketch, and there are a lot of details and gaps to fill – but the idea definitely works! – and provides real, measurable benefit – on the limited example I tried it on. Would I try moving in this direction in production? I don’t know, maybe I’d play more with compressing material sets if I was under dire need for reducing material sizes, but otherwise I’d probably spend some more time following the GBuffer compression automation angle (knowing it might be tricky to make practical – how do you obtain all the data? how do you eveluate).

If there’s a single outcome outside of me again toying with material sets, I hope it somewhat demystified using neural networks for simple data driven tasks:

  • It’s all about discovering, non-linear relationships in lots of complicated, multi-dimensional data,
  • Despite limitations (like ReLU can provide only a piecewise linear function approximation) they can indeed discover those in practice,
  • You don’t need to run a gigantic end-to-end network that has millions of parameters to see benefits of them,
  • Small, shallow networks can be useful and fast (832 multiply-adds on limited precision data is most likely practical)!

Edit: I have added some example code for this post on github/colab. It’s very messy and not documented, so use at your own responsibility (preferably with a GPU colab instance)!

Posted in Code / Graphics | Tagged , , , , , , , , , , | 5 Comments

Superfast void-and-cluster Blue Noise in Python (Numpy/Jax)

This is a super short blog post to accompany this Colab notebook.

It’s not an official part of my dithering / Blue Noise post series, but thematically fits it well and be sure to check it out for some motivation why we’re looking at blue noise dither masks!

It is inspired by a tweet exchange with Alan Wolfe (who has a great write-up about the original, more complex version of void and cluster and a lot of blog posts about blue noise and its advantages and cool mathematical properties) and Kleber Garcia who has recently released “Noice”, a fast, general tool for creating various kinds of 2D and 3D noise patterns.

The main idea is – one can write a very simple and very fast version of void-and-cluster algorithm that takes ~50 lines of code including comments!

How did I do it? Again in Jax. 🙂

Algorithm

The original void and cluster algorithm comprises 3 phases – initial random points and their reordering, removing clusters, and filling voids.

I didn’t understand why three phases are necessary, the paper didn’t explain them, so I went to code a much simpler version with just initialization and a single loop:

1. Initialize the pattern to a few random seed pixels. This is necessary as the algorithm is fully deterministic otherwise, so without seeding it with randomness, it would produce a regular grid.

2. Repeat until all pixel set:

  1. Find an empty pixel with the smallest energy.

  2. Set this pixel to the index of the added point.

  3. Add the energy contribution of this pixel to the accumulated LUT.

  4. Repeat until all pixels are set.

Initial random points

Nothing too sophisticated there, so I decided to use a jittered grid – it prevents most clumping and just intuitively seems ok.

Together with those random pixels, I also create a precomputed energy mask.

Centered energy mask for pattern sized (16×16)
Uncentered energy mask for pattern sized (16×16)

It is a toroidally wrapped Gaussian bell with a small twist / hack – in the added point center I place float “infinity”. The white pixel in the above plots is this “infinity”. This way I guarantee that this point will never have smaller energy than any unoccupied one. 🙂  This simplifies the algorithm a lot – no need for any bookkeeping, all the information is stored in the energy map!

This mask will be added as a contribution of every single point, so will accumulate over time, with our tiny infinities 😉 filling in all pixels one by one.

Actual loop

For the main loop of the algorithm, I look at the current energy mask, that looks for example like this: 

And use numpy/jax “argmin” – this function literally returns what we need – index of the pixel with the smallest energy! Beforehand I convert the array to 1D (so that argmin works), get an 1D index, but then easily convert it to 2D coordinates using division and modulo by single dimension size. It’s worth noting that operations like “flatten” are free in numpy – they just readjust the array strides and the shape information.

I add this pixel with an increased counter to the final dither mask, and also take our precomputed energy table and “rotate it” according to pixel coordinates, and add to the current energy map.

After the update it will look like this (notice how nicely we found the void):

After this update step we continue the loop until all pixels are set.

Results

The results look pretty good:

I think this is as good as the original void-and-cluster algorithm.

The performance is 3.27s to generate a 128×128 texture on a free GPU Colab instance (first call might be slower due to jit compilation) – I think also pretty good!

If there is anything I have missed or any bug, please let me know in the comments!

Meanwhile enjoy the colab here, and check out my dithering and blue noise blog posts here.

Posted in Code / Graphics | Tagged , , , , , | 1 Comment

Computing gradients on grids of pixels and voxels – forward, central, and… diagonal differences

In this post, I will focus on gradients of image signals defined on grids in computer graphics and image processing. Specifically, gradients / derivatives of images, height fields, distance fields, when they are represented as discrete, uniform grids of pixels or voxels.

I’ll start with the very basics – what do we typically mean by gradients (as it’s not always “standardized”), what are they used for, what are the ypical methods (forward or central differences), their cons and problems, and then proceed to discuss an interesting alternative with very nice properties – diagonal gradients.

My post will conclude with advice on how to use them in practice in a simple useful scheme, how to extend it with a little bit of computations to a super useful concept of a structure tensor that can characterize dominating direction of any gradient field, and finish with some signal processing fun – frequency domain analysis of forward and central differences.

Image gradients

What are image gradients or derivatives? How do you define gradients on a grid?

It’s not very well defined problem on discrete signals, but the most common and useful way to think about it is inspired by signal processing and the idea of sampling:

Assuming there was some continuous signal that got discretized, what would be the partial derivative with regards to the spatial dimension of this continuous signal at gradient evaluation points?

This interpretation is useful both for computer graphics (where we might have discretized descriptions of continuous surfaces; like voxel fields, heightmaps, or distance fields), as well as in image processing (where we often assume that images are “natural images” that got photographed).

Continuous gradients and derivatives used for things like normals of procedural SDFs are also interesting, but a different story. I will not cover those here, and instead recommend you check out this cool post by Inigo Quilez with lots of practical tricks (re-reading it I learned about the tetrahedron trick) and advice. I am sure that no matter your level of familiarity with the topic, you will learn something new.

In my post, I will focus on computer graphics, image processing, and basic signal processing takes on the problem. There are two much deeper connections that I haven’t personally worked too much with. So I leave it as something that I wish I can expand my knowledge in the future, but also encourage my readers to explore it:

Further reading one: The first connection is with Partial Differential Equations and their discretization. Solving PDEs and solving discretized PDEs is something that many specialized scientific domains deal with, and computing gradients is an inherent part of numerical discretized PDE solutions. I don’t know too much about those, but I’m sure literature covers this in much detail.

Further reading two: The second connection is wavelets, filter banks, and the frequency precision / localization trade-off. This is something used in communication theory, electrical engineering, radar systems, audio systems, and many more. While I read and am familiar with some theory, I haven’t found too many practical uses of wavelets (other than the simplest Gabor ones or in use for image pyramids) in my work, so again I’ll just recommend you some more specialized reading. 

Applications

Ok, why would we want to compute gradients? Two common uses:

Compute surface normals – when we have something like a scalar field – whether distance field describing an underlying implicit surface, or simply a height field, we might want to compute its gradients to compute normals of the surface, or of the heightfield for normal mapping, evaluating BRDFs and lighting etc. Maybe even for physical simulation, collision, or animation on terrain!

Left: an example heightmap image, right: partial derivatives (exaggerated for demonstration) that can be used for generating normal maps for lighting, collision, and many other uses in a 3D environment.

Find edges, discontinuities, corners, and image features – the other application that I will focus much more on is simply finding image features like edges, corners, regions of “detail” or texture; areas where signal changes and becomes more “interesting”. The human visual system is actually built from many edge detectors and local differentiators and can be thought of as a multi-scale spatiotemporal gradient analyzer! This is biologically motivated – when signal changes in space or time, it means it’s a potential point of interest or a threat.

The bonus use-case of just computing the gradients is finding the orientation and description of those features. This is bread and butter of any image processing, but also very common in computer graphics – from morphological anti-aliasing, selectively super-sampling only edges, or special effects. I will go back to it in the description of local features in the Structure Tensor section, but for now let’s use the most common application – just using the gradient vector magnitude, used for example in the Sobel operator: .

Gradient magnitude can be used for simple edge detection in an image, but also many more (described later in the post!).

Test signals

I will be testing gradients on three test images:

  • Siemens star – a test pattern useful to test different “angles” and different frequencies,
  • A simple box – has corners that can immediately show problems,
  • 1 0 1 0 stripe pattern – frequencies exactly at Nyquist.
Test patterns / images

The Siemens star was rendered at 16x resolution and downsampled with a sharp antialiasing filter (windowed sinc). There is some aliasing in the center, but it’s ok for our use-case (we want to see a sharp signal change there and then detect it).

Forward differences

The first, most intuitive approach is computing forward difference.

This is an approximation of by computing f(x+1) – f(x).

What I love about this solution is that it is both intuitive, naive, as well as theoretically motivated! I don’t want to rephrase the wikipedia and most of readers of my blog don’t care so much for formal proofs, so feel free to have a look there, or in specialized courses (side note: 2010s+ are amazing, with so many best professors sharing their course slides openly on the internet…). It’s basically built around the Taylor expansion of the f(x+1) – f(x).

This difference operator is the same as convolving the image with a [-1, 1] filter.

It’s easy, works reasonably well, is useful. But there are two bigger problems.

Problem 1 – even-shift

If you have read my previous blog post – on bilinear down/upsampling – one of challenges might be visible right away. Convolving with an even-sized filter, shifts the whole image by a half a pixel.

It’s easily visible as well:

Images vs image gradient magnitude. Notice the shift to left/top.

Depending on the application, this can be a problem or not. “A solution” is simple – undo it with another even sized filter. Perfect solution would be resampling, but resampling by a half pixel is surprisingly challenging (see my another blog post), so instead we can blur it with a symmetric filter. Blurring the gradient magnitude fixes it (at the cost of blur):

Blurring with an even sized filter fixes the even sized shift. But one other problem might have became much more visible right now.

Note: You might wonder, what would happen if we would blur just the partial differences instead? We will see in a second.

But let’s focus on another, more serious problem.

Problem 2 – different gradient positions for partial differences

Second problem is more severe; what happens if we compute multiple partial derivatives; like df/dx and df/dy this way?

Partial derivatives are approximated at different positions!

Notice how gradient being defined between pixels means that two partial forward derivatives are defined at different points! This is a big problem.

Now let’s have a look at the same plot again, this time focusing on “symmetry”. This shows why it’s a real, not just theoretical issue. If we look at gradients of a Siemens star and the box, we can see some asymmetry:

Notice left-top gradients being stronger for Siemens star, and producing a “hole” on the box corner. Oops.

This is definitely not good and will cause problems in image processing or computing normals, but it’s even worse if we look at the gradient magnitude of a simple “square” in the center of the image – notice what happens to the image corners, one corner is cut off, and another one 2x too intense!

This is a big problem for any real application. We need to look at better approximations.

Central differences

I mentioned that “blurring” partial derivatives can “recenter” them, right? What if I told you that it is also numerically more accurate? This is the so-called central difference.

It is evaluated by . The division by two is important and intuitively can be understood as dividing by the distance between the pixels, or the differential dx, where dx is 2x larger.

When forward difference was an approximation accurate to the O(h), this one is more accurate, to O(h^2). I won’t explain those terms here, but the wikipedia has some basic intro, and in numerical methods literature you can find a more detailed explanation.

It is also equivalent to “blurring” the forward difference with a [0.5, 0.5] filter! [0.5, 0.5] o [-1, 1] leads to [-0.5, 0, 0.5]. I was initially very surprised by this connection – of a more accurate, theoretically motivated Taylor expansion, and just “some random ad-hoc practical blur”. This is even sized blur, so this also fixes the half pixel shift problem and centers both partial derivatives correctly:

Central difference leads to perfectly symmetric (though not isotropic) results!

Problem – missing center…

However, there is another problem. Notice how the gradient estimate on the pixel doesn’t take pixel’s value into account at all! This means that patterns like 0 1 0 1 0 1 will have… zero gradient

This is how both methods compare on all gradients, including the “striped” pattern:

Top: analyzed images, center: forward difference, bottom: central difference. Notice how both stripes show no gradient, as well as the center of the Siemens star seems to be missing!

The central difference predicts zero gradient for any 1 0 1 0-like, highest frequency (Nyquist) sequence! On Siemens star it demonstrates as missing / zero gradients just outside of the center.

We know that we have strong gradients there, and strong patterns/edges, so this is clearly not right.

This can and will lead to some serious problems. If we look at the Siemens star, we can notice how it is nicely symmetric, but misses gradient information in the center

Sobel filter

It’s worth mentioning here what is the commonly used Sobel filter. Sobel filter is nothing more or less than central difference that is blurred (with a binomial filter approximation of Gaussian) in the direction perpendicular to evaluated gradient, so something like:

The practical motivation for it in Sobel filter is a) filtering out some noise and tendency to be sensitive to single pixel gradients that are not very visible, and more importantly b) its blurring introduces more “isotropic” behavior (you will see below how “diagonals” and “corners” are closer magnitude to the horizontal/vertical edges).

Sobel operator would compare to the central difference (comparing those due to the largest similarity):

Does the isotropic and fully rotationally symmetric behavior matter?

Mentioning isotropic or anisotropic response on grids is always weird to me. Why? Because the Nyquist frequencies are not isotropic or radial, they form a box. This means that we can represent higher frequencies on the diagonals than the principal axes. I have covered in an older post of mine how this property can be used in checkerboard rendering by rotating the sample grid (used for example in many Playstation 4 Pro to achieve 4K rendering with spatiotemporal upsampling). But any analysis becomes “weird” and I often see discussions of whether separable sinc is “the best” filter for images, as opposed to rotationally symmetric jinc.

I don’t have an easy answer for how important it is for the image gradients, and the question is “it depends”. Do you care about detecting higher frequency diagonal gradient as stronger gradients or not? What’s the use-case? I haven’t thought about it in the context of normal mapping, as this could lead to “weird” curvatures, but I’m not sure. I might come back to it out of personal curiosity at some point, so please let me know in the comments what are your thoughts and/or experiences.

Image Laplacians

Edit: Jeremy Cowles suggested an interesting topic to add here and an area of common confusion. Sobel operator can be sometimes used interchangeably / confused with Laplace operator. Laplace operator can be discretized like:

Both can be used for detecting the edges, but it’s very important to distinguish the two. Laplace operator is a scalar value corresponding to the sum of second order derivatives. It measures a very different physical property and has a different meaning – it measures smoothness of the signal. This means that for example a linear gradient will have a Laplacian of exactly zero! It can be very useful in image processing and computer graphics, and sometimes even better than the gradient; but it’s important to think what you are measuring and why!

Laplace operator is signed, but its absolute value on our test signals looks like:

Absolute value of the image laplacian – laplacian magnitude.

So despite being centered it doesn’t have problem of vanishing 0 1 0 1 0 sequence – great! On the other hand notice how it’s almost zero outside of the Siemens star center, and stronger in the center and on the corners – very different semantic meaning and practical applications.

Larger stencils?

Notice that as you can use the central difference for increase of the accuracy of the forward difference / higher order approximation, you can go even further and further, adding more taps to your derivative-approximating filter!

They can help with disappearing high frequency gradients, but your gradients become less and less localized in space. This is the connection to wavelets, audio, STFT and other areas that I mentioned.

I also like to think of it intuitively as half pixel resampling of the [-1, 1] filter. 🙂 For example if you were to resample [-1, 1] with a half pixel shifting larger sinc filter, you’d get similar behavior:

Left: [1, -1] filter shifted with a windowed sinc. Right: Least-squares optimal 9 tap derivative.

Diagonal differences

With all this background, it’s finally time for the main suggestion of the post.

This is something I have learned from my coworkers who worked on the BLADE project (approximating image operations with discrete learned filters), and then we used in our work that materialized as the Handheld Multiframe Super-Resolution paper and Super-Res Zoom Pixel 3 project that I had a pleasure of leading.

I get a lot of questions about this part of the paper in some emails as well, so I think it’s not a very commonly known “trick”. If you know where it might have originated and domains that use it more often, please let me know.

Edit: Don Williamson has noticed on twitter that it’s known as Robert’s Cross and was developed in 1963!

The idea here is to compute gradients on the diagonals. For example in 2D (note that it extends to higher dimensions!):

We can see that with the diagonal difference, both partial derivative approximations are centered in the same point. We compute those just like forward differences, but need to compensate for larger distance, sqrt(2), size of the diagonal of the pixels, so and .

If you use gradients like this you can “rotate it” for normals. For just gradient magnitude you don’t need to do anything special, it’s computed the same way (length of the vector)!

Gradient computed like this doesn’t have missing high frequency gradients problems; nor asymmetry problems:

Diagonal gradients offer an easy solution to two problems of both central, and forward differences.

It has the problem of pixel shift, and some anisotropy, but the pixel shift can be easily fixed.

Diagonal differences – fixing shift in a 3×3 pixel window

In a 3×3 window, we can compute our gradients like this:

How to compute symmetric and not phase shifting diagonal gradients in a 3×3, 9 pixel window.

This means sum all the squared gradients, and divide by 4 (number of gradient pairs) – all very efficient operations. Then you just take the square root and get the gradient magnitude.

This fixes the problem of phase shift and IMO is the best of all alternatives discussed so far:

Diagonal differences detect high frequencies properly, don’t have a pixel shift problem, and while they show perfect isotropy, in practice it is often “good enough”.

Some blurring is on the one hand reducing the “localization”, but on the other hand it can be desired and improve robustness to noise and single pixel gradients.

But why stop there?

With just a few lines of code (and some extra computation…) we can extract even more information from it!

Structure Tensor and the Harris corner detector

We have a “bag” or gradient values for a bunch of pixels… What if we tried to find a least squares solution to “what is the dominant gradient direction in the neighborhood”?

I am not great at math derivations, but here’s an elegant explanation of the structure tensor. Structure tensor is a Gram matrix of the gradients (vector of “stacked” gradients multiplied by its transpose).

Lots of difficult words, but in essence it’s a very simple 2×2 matrix, where entries are averages of dx^2, dy^2 on the diagonal, and dx*dy on the off diagonal:

Now if we do eigendecomposition of this matrix (which is always possible for a Gram matrix – symmetric, positive semidefinite), we get the following simple closed forms for the eigenvalues:

You might want to normalize the vectors; in the case of the off-diagonal Gram matrix entry of zero, the eigenvectors are simply coordinate system principal axes (1, 0) and (0, 1) (be sure to “sort” them in that case though , same for underdetermined systems, where the ordering doesn’t matter.

Larger eigenvalue and associated eigenvector describe the direction and magnitude of the gradient in dominant direction; it gives us both its orientation and how strong the gradient is in that specific direction.

The second, smaller eigenvalue and perpendicular, second eigenvector gives us information about the gradient in the perpendicular direction.

Together this is really cool, as it allows us to distinguish edges, corners, regions with texture, as well as orientation of edges. In our Handheld Multiframe Super-Resolution paper we used two “derived” properties – strength (simply the square root of the structure tensor dominant eigenvalue) describing how strong is the gradient locally, and coherence computed as . (Note: the original version of the paper had a mistake in the definition of the coherence, a result of late edits for the camera ready version of the paper that we missed. This value has to be between 0 and 1).

Coherence is a bit more uncommon; it describes how much stronger is the dominant direction over the perpendicular one.

By analyzing the eigenvector (any of them is fine, since they are perpendicular), we can get information about the dominant gradient orientation.

In the case of our test signals that are comprised of simple frequencies, those look as follow (please ignore the behavior on borders/boundaries between images in my quick and dirty implementation):

Row 1: Analyzed signals, Row 2: Gradient strength in the dominant direction, Row 3: Gradient orientation angle remapped to 0-1, Row 4: Gradient coherence – most of gradients are very coherent in all of the presented examples, with an exception of zero signal region (coherence undefined -> can be zero), and some corners.

Here’s some other example on an actual image, with a remapping often used for displaying vectors – hue defined by the angle, and saturation by the gradient strength:

Left: Processed image. Right: Color representation of the gradient vectors, where angle describes hue in HSV color space, and the saturation comes from the gradient strength

Such analysis is important for both edge-aware, non-linear image processing, as well as in computer vision. Structure tensor is a part of Harris corner detector, which used to be probably the most common computer vision question prior to the whole deep learning revolution. 🙂 (Still, if you’re looking for a job in the field even today, I suggest not skipping on such fundamentals as interviewers can still ask for classic approaches to verify how much of the background you understand!)

Fixing rotation

One thing that is worth mentioning is how to “fix” the rotation of structure tensor from diagonal gradients. If you compute the angle through arctan, this is not something you’d care about (just add value to your angle), but in scenarios where you need higher computational efficiency, you don’t want to be doing any inverse trigonometry, and this matters a lot. Luckily, rotating Gram or covariance matrices is super easy! Formula for rotating a covariance matrix is simply . Constructing a rotation matrix by 45 degree from angle formula, we get first:

And then:

(Note: worth double checking my math. But it makes sense that having exactly the same 4 diagonal elements cancels them and suggests a vertical only or diagonal only gradient!)

Frequency analysis

Before I conclude the post, I couldn’t resist looking at comparing the forward differences and central differences from the perspective of signal processing.

It’s probably a bit unorthodox and most people don’t look at it this way, but I couldn’t stop thinking about the disappearing high frequencies; ideal differentiator of continuous signals should have “infinite” frequency response as frequencies go to infinity!

Obviously frequencies can go only up to Nyquist in the discrete domain. But let’s have a look at the frequency response of the forward difference convolution filter and the central difference:

The central difference convolution filter has zero frequency response at Nyquist and seems to match the frequency response worse than a simple forward difference filter with increasing frequencies. This is the “cost” we pay for compensating for the phase shift.

But given this frequency response, it makes sense that a sequence of 1 0 1 0 completely “disappears”, as it is exactly at Nyquist, where the response is zero!

Finally, if we look at higher order / more filter tap gradient approximations, we can see that it definitely improves for most mid-high frequencies, but the behavior when approaching Nyquist is still a problem:

Ok, with 9 taps we get better and better behavior and closer to the differentiator line.

At the same time, the convergence is slow and we won’t be able to recover the exact Nyquist:

This is due to “perfect” resampling having frequency response of 0 at Nyquist. Whether this is a problem or not depends on your use-case. In both photography, as well as computer graphics we get get sequences occurring at perfect Nyquist (from rasterization or sampling photographic signals). Anyway, I highly recommend the diagonal filters which don’t have this issue!

Conclusions

In my post, I only skimmed the surface of the deep topic of grid (image, voxel) gradients, but covered a bunch of related topics and connections:

  1. I discussed differences between the forward and central differences, how to address some of their problems, and some reasons for the classic Sobel operator.
  2. I presented a neat “trick” of using diagonal gradients that fixes some of the shortcomings of the alternatives and I’d suggest it to you as the go-to one. Try it out with other signal sources like voxels, or for computing normal maps!
  3. I mentioned how to robustly compute centered gradients on a 3×3 grid using the diagonal method.
  4. I briefly touched on the concept of structure tensor, a least squares solution for finding dominant gradient direction on a grid, and how its eigenanalysis provides many more extensions / useful descriptors over just gradients.
  5. Finally, I played with analyzing frequency response of discrete difference operators and how they compare to theoretical perfect differentiator in signal processing.

As always, I hope that at least some of those points are going to be practically useful in your work, inspire you in your own research, or to do further reading and maybe correct me or engage in discussion in comments on on twitter. 🙂

Posted in Code / Graphics | Tagged , , , , , | 6 Comments

Bilinear down/upsampling, aligning pixel grids, and that infamous GPU half pixel offset

See this ugly pixel shift when upsampling a downsampled image? My post describes where it can come from and how to avoid those!

It’s been more than two decades of me using bilinear texture filtering, a few months since I’ve written about bilinear resampling, but only two days since I discovered a bug of mine related to it. 😅 Similarly, just last week a colleague asked for a very fast implementation of bilinear on a CPU and it caused a series of questions “which kind of bilinear?”.

So I figured it’s an opportunity for another short blog post – on bilinear filtering, but in context of down/upsampling. We will touch here on GPU half pixel offsets, aligning pixel grids, a bug / confusion in Tensorflow, deeper signal processing analysis of what’s going on during bilinear operations, and analysis of the magic of the famous “magic kernel”.

I highly recommend my previous post as a primer on the topic, as I’ll use some of the tools and terminology from there, but it’s not strictly required. Let’s go!

Edit: I wrote a follow-up post to this one, about designing downsampling filters to compensate for bilinear filtering.

Bilinear confusion

The term bilinear upsampling and downsampling is used a lot, but what does it mean? 

One of the few ideas I’d like to convey in this post is that bilinear upsampling / downsampling doesn’t have a single meaning or a consensus around this term use. Which is kind of surprising for a bread and butter type of image processing operation that is used all the time!

It’s also surprisingly hard to get it right even by image processing professionals, and a source of long standing bugs and confusion in top libraries (and I know of some actual production bugs caused by this Tensorflow inconsistency)!

Edit: there’s a blog post titled “How Tensorflow’s tf.image.resize stole 60 days of my life” and it’s describing same issue. I know of some of my colleagues that spent months on fixing it in Tensorflow 2 – imagine effort of fixing incorrect uses and “fixing” already trained models that were trained around this bug…

Image for post
Image credit/source: Oleksandr Savsunenko

Some parts of it like phase shifting are so tricky that a famous blog post of “magic kernel” comes up every few years and again, experts re(read) it a few times to figure out what’s going on there, while the author simply rediscovered the bilinear! (Important note: I don’t want to pick on the author, far from it, as he is a super smart and knowledgeable person, and willingness to share insights is always respect worthy. “Magic kernel” is just an example of why it’s so hard and confusing to talk about “bilinear”. I also respect how he amended and improved the post multiple times. But there is no “magic kernel”.)

So let’s have a look at what’s the problem. I will focus here exclusively on 2x up/downsampling and hope that some thought framework I propose and use here will be beneficial for you to also look at and analyze different (and non-integer factors).

Because of bilinear separability, I will again abuse the notation and call “bilinear” a filter when applied to 1D signals and generally a lot of my analysis will be in 1D.

Bilinear downsampling and upsampling

What do we mean by bilinear upsampling?

Let’s start with the most simple explanation, without the nitty gritty: it is creating a larger resolution image where every sample is created from bilinear filtering of a smaller resolution image.

For the bilinear downsampling, things get a bit muddy. It is using a bilinear filter to prevent signal aliasing when decimating the input image – ugh, lots of technical terms. I will circle back to it, but first address the first common confusion.

Is this box or bilinear downsampling? Two ways of addressing it

When downsampling images by 2, we every often use terms box filter and bilinear filter interchangeably. And both can be correct. How so?

Let’s have a look at the following diagram: 

(Bi)linear vs box downsampling give us the same effective weights. Black dots represent pixel centers, upper row is the target/low resolution texture, and the bottom row the source, higher resolution one. Blue lines represents discretized weights of the kernel.

We can see that a 2 tap box filter is the same as a 2 tap bilinear filter. The reason for it is that in this case, both filters are centered between the pixels. After discretizing them (evaluating filter weights at sample points), there is no difference, as we no longer know what was the formula to generate them, and how the filter kernel looked outside of the evaluation points.

The most typical way of doing bilinear downsampling is the same as box downsampling. Using those two names for 2x downsampling interchangeably is both correct! (Side note: Things diverge when taking about more than 2x downsampling. This might be a good topic for another blog post.) For 1D signals it means averaging every two elements together, for 2D images averaging 4 elements to produce a single one.

You might have noticed something that I implicitly assumed there – pixel centers there were shifted by half a pixel, and the edges/corners were aligned.

There is “another way” of doing bilinear downsampling, like this:

A second take on bilinear downsampling – this time with pixel centers (black dots) aligned. Again the source image / signal is on the bottom, target signal on the top.

This one definitely and clearly is also a linear tent, and it doesn’t shift pixel centers. The resulting filter weights of [0.25 0.5 0.25] are also called a [1 2 1] filter, or the simplest case of a binomial filter, a very reasonable approximation to a Gaussian filter. (To understand why, see what happens to the binomial distribution as the trial count goes to infinity!). It’s probably the filter I use the most in my work, but I digress. 🙂

Why this second method is not used that much? This is by design and a reason for half texel shifts in GPU coordinates / samplers, and you might have noticed the problem – the last texel of high resolution array gets discarded. But let’s not get ahead of ourselves, first we can have a look at the relationship with upsampling.

Two ways of bilinear upsampling – which one is “proper”?

If you were to design a bilinear upsampling algorithm, there are a few ways to address it.

Let me start with a “naive” one that can have problems. We can take every original pixel, and between them just place averages of the other ones.

Naive bilinear upsampling when pixel centers are aligned. Some pixels receive a copy of the source (green line), the other ones (alternating) a blend between two neighbors.

Is it bilinear / tent? Yes, it’s a tent filter on zero-inserted image (more on it later). It has an unusual property; some pixels get blurred, some pixels stay “sharp” (original copied).

But more importantly, if you do box/bilinear downsampling as described above, and then upsample an image, it will be shifted:

Using box downsampling, and then copy / interpolate upsampling shifts the image by half a pixel. This is a wrong way to do it!

Or rather – it will not correct for the half pixel shift created by downsampling.

It will work however with downsampling using the second method. The second method interpolates every single output pixel; all are interpolated:

When done properly, bilinear down/upsample doesn’t shift the image.

This another way of doing bilinear upsampling that might first feel initially unintuitive: every pixel is 0.75 of one pixel, and 0.25 of another one, alternating “to the left” and “to the right”. This is exactly what a GPU does when you upsample a texture by 2x:

There are two simple explanations for those “alternating” weights. The first, easiest one is just looking at the “tents” in this scheme:

If we draw interpolation “tents”, we can see that the lower resolution image samples are alternating on the either side of the high resolution sample.

I’ll have a look at the second interpretation of this filter – it’s [0.125 0.375 0.375 0.125] in disguise 🕵️‍♀️, but first with this intro, I think it’s time to make the main claim / statement: we need to be careful to use same reference coordinate frames when discussing images of different resolutions.

Be careful about phase shifts

Your upsampling operations should be aware of what downsampling operations are and how they define the pixel grid offset, and the other way around!

Even / odd filters

One important thing to internalize is that signal filters can have odd or even number of samples. If we have an even number of samples, such a filter doesn’t have a “center”, so it has to shift the whole signal by a half pixel in either direction. By comparison, symmetric odd filters can shift specific frequencies, but don’t shift the whole signal:

Odd length filters can stay “centered”, while even length filters shift the signal/image by half a pixel.

If you know signal processing, those are the type I and II linear phase filters.

Why shifts matter

Here’s a visual demonstration of why it matters. A Kodak dataset image processed with different sequences, first starting with box downsampling:

Using box / tent even downsampling followed by either even, or odd upsampling.

And now with [1 2 1] tent odd downsampling:

Using tent odd downsampling followed by either even, or odd upsampling.

If there is a single lesson from my post, I would like it to be this one: Both “takes” on the bilinear up/downsampling above can be the valid and correct ones, you simply need to pick the proper one for your use-case and the convention used throughout your code/frameworks/libraries; always use a consistent coordinate convention for the downsampling and upsampling. When you see term “bilinear”, always double check what it means! Because of it, I actually like to reimplement those and be sure that I’m consistent…

That said, I’d argue that the “box” bilinear downsampling and the “alternating weights” are better for average use-case. The first reason might be somewhat subjective / minor (because bilinear down/upsampling is inherently low quality and I don’t recommend using it when the quality matters more than simplicity / performance). If we visually inspect the upsampling operation, we can see more leftover aliasing (just look at the diagonal edges) in the odd/odd combo:

Two types of upsampling/downsampling can prevent image shifting, but produce differently looking and differently aliased images.

The second reason, IMO a more important one is how easily they align images. And this is why GPU sampling has this “infamous” half a pixel offset.

That half pixel offset!

Ok, so my favorite part starts – half pixel offsets! Source of pain, frustration, misunderstanding, but also a super reasonable and robust way of representing texture and pixel coordinates. If you started graphics programming relatively recently (DX10+ era) or are not a graphics programmer – this might be not a big deal for you. But basically, with older graphics APIs framebuffer coordinates didn’t have a half texel offset, while the texture sampler expected it, so you had to add it manually. Sometimes people added it in the vertex shader, sometimes in the pixel shader, sometimes setting up uniforms on the CPU… a complete mess; it was a source of endless bugs found almost every day, especially on video games shipping on multiple platforms / APIs!

What do we mean by half pixel offset?

If you have a 1D texture of size 4, what are your pixel/texel coordinates?

They can be [0, 1, 2, 3]. But GPUs use a convention of half pixel offsets, so they end up being [0.5, 1.5, 2.5, 3.5]. This translates to UVs, or “normalized” coordinates [0.5/4, 1.5/4, 2.5/4, 3.5/4], which spans a range of [0.5/width, 1 – 0.5/width].

This representation seems counterintuitive at first, but what it provides us is a guarantee and convention that the image corners are placed at [0 and 1] normalized, or [0, width] unnormalized.

This is really good for resampling images and operating on images with different resolutions.

Let’s compare the two on the following diagrams:

Half pixel offset convention aligns pixel grids perfectly, by aligning their corners/edges.
No offset convention aligns the first pixel center perfectly – and in the case of 2x scaling, also every other pixel. But images “overlap” outside of the 0,1 range and are not symmetric!

While the half a pixel align pixel corners, the other way of down/upsampling comes from aligning the first pixel centers in the image.

Now, let’s have a look at how we compute the bilinear upsampling weights in the half a pixel shift convention:

This convention makes it amazingly simple and obvious where the weights come from – and how simple the computation is once we align the grid corners. I personally use it as well even in APIs outside of GPU shader realm – everything is easier. If adding and removing 0.5 adds performance cost, then can be removed at microoptimizations stage, but usually doesn’t matter that much.

Reasonable default?

Half a pixel offset for pixel centers used in GPU convention for both pixels and texels is a reasonable default for any image processing code dealing with images of different resolutions.

This is expecially important when to dealing with textures of different resolutions and for example mip maps of non power of 2 textures. A texture with 9 texels instead of 4? No problem:

A texture with 9 texels aligns easily and perfectly with a one with 4 texels. Very useful for graphics operations, where you want to abstract the texture resolutions away.

It makes sure that grids are aligned, and the up/downsampling operations “just work”. To get box/bilinear downsampling, you can just take a single bilinear tap of the source texture, the same with the upsampling.

So trivial to use it that when you start graphics programming, you rarely think about it. Which is a double edge sword – both great for an easy entry point for beginners, but also a source of confusion once you start getting deeper into it and analyzing what’s going on or do things like fractional or nearest neighbor downsampling (or e.g. create a non-interpolable depth map pyramid…).

Even if there were no other reasons, this is why I’d recommend treating phase shifting box downsample and the [0.25 0.75] / [0.75 0.25] upsamplers as your default when talking about bilinear as well.

Bonus advantage: having texel coordinates shifted by 0.5 means that if you want to get an integer coordinate – for example for texelFetch instruction – you don’t need to round. Floor / truncation (which in some settings can be a cheaper operation) gives you the closest pixel integer coordinate to index!

Note: Tensorflow got it wrong. The “align_corners” parameter aligns… centers of the corner pixels??? This is a really bad and weird naming plus design choice, where upsampling a [0.0 1.0] by factor of 2 produces [0, 1/3, 2/3, 1], which is something completely unexpected and different from either of the conventions I described here.

Signal processing – bilinear upsampling

I love writing about signal processing and analyzing signals also in the frequency domain, so let me explain here how you can model bilinear up/downsampling in the EE / signal processing framework.

Upsampling usually is represented as two operations: 1. Zero insertion and 2. Post filtering.

If you never heard of this way of looking at it (especially the zero insertion), it’s most likely because in practice nobody in practice (at least in graphics or image processing) implements it like this, it would be super wasteful to do it in such a sequence. 🙂

Zero insertion 

Zero insertion is an interesting, counter-intuitive operation. You insert zeros between each element (often multiplying the original ones by 2x to preserve the constant/average energy in the signal; or we can fold this multiplication in our filter later) and get 2x more samples, but they are not very “useful”. You have an image consisting of mostly “holes”…

In 2D zero insertion causes every 2×2 quad contain one pixel and three zeros.

I think that looking at it in 1D might be more insightful:

1D zero insertion – notice high frequency oscillations.

From this plot, we can immediately see that with zero insertion, there are many high frequencies that were not there! All of those zeros create lots of high frequency coming from alternating and “oscillating” between the original signal, and zero. Filters that are “dilated” and have zeros in between coefficients (like a-trous / dilated convolution) are called comb filters – because they resemble a comb teeth!

Let’s look at it from the spectral analysis. Zero insertion duplicates the frequency spectrum:

Upsampling by zero insertion duplicates the frequency spectrum.

Every frequency of the original signal is duplicated, but we know that there were no frequencies like this present in the smaller resolution image; it wasn’t possible to represent anything above its Nyquist! To fix that, we need to filter them out after this operation with a low pass filter:

To get properly looking image, we’d want to remove high frequencies from zero insertion by lowpass filtering.

I have shown some remainder frequency content on purpose, as it’s generally hard to do “perfect” lowpass filtering (and it’s also questionable if we’d want this – ringing problems etc).

Here is how progressively filtered 1D signal looks like, notice high frequencies and “combs” disappearing:

Notice how progressively more blurring causes upsampled signal lose the wrong high frequency comb teeth and it converges to 2x higher resolution original one!

Here’s an animation of blurring/filtering on the 2D image and how there it also causes this zero-inserted image to become more and more like just properly upsampled:

Blurring zero inserted image converges to upsampled one!

Looks like image blending, but it’s just blending filters – imo it’s pretty cool. 😎

Nearest neighbor -> box filter!

Obviously, the choice of the blur (or technically – lowpass) filter matters – a lot. Some interesting connection: what if we convolve this zero-inserted signal with a symmetric [0.5, 0.5] (or 1,1 if we didn’t multiply the signal by 2 when inserting zeros) filter?

Convolving image with a [1, 1] filter is the same as nearest neighbor filter!

The interesting part here is that we kind of “reinvented” the nearest neighbor filter! After a second of though, this should be intuitive; a sample that is zero gets contributions from the single non-zero neighbor, which is like a copy, while the sample that is non-zero is surrounded by two zeros, and they don’t affect it.

We can see on the spectral / Fourier plot where the nearest neighbor hard edges and post-aliasing comes from (red part of the plot):

The nearest neighbor upsampling is also shifting the signal (because it is even number of samples) and will work well to undo the box downsampling filter, which fits the common intuition of replicating samples being the “reverse” of box filtering and causing no shift problem.

Bilinear upsampling take one – direct odd filter

Let’s have a look at how the strategy of “keep one sample, interpolate between” can be represented in this framework.

It’s equivalent to filtering our zero-upsampled image with a [0.25 0.5 0.25] filter.

The problem is that in such setup, if we multiply the weights two (to keep average signal the same) and then by zeros (where the signal is zero), we get alternating [0.0 1.0 0.0] and [0.5 0.0 0.5] filters, with very different frequency response and variance reduction… I’ll reference you here again to my previous blog post on it, but basically you get alternating 1.0 and 0.5 of original signal variance (sum of effective weights squared).

Bilinear upsampling take two – two even filters

The second approach of alternating weights of [0.25 0.75] can be seen as simply: nearest neighbor upsampling – a filter of [0.5 0.5], and then [0.25 0.5 0.25] filtering!

This sequence of two convolutions gives us an effective kernel of [0.125 0.375 0.375 0.125] on the zero inserted image, so if we multiply it by 2 simply alternating [0.25 0.0 0.75 0.0] and [0.0 0.75 0.0 0.25]. Corners aligned bilinear upsampling (standard bilinear upsampling on the GPU) is exactly the same as the “magic kernel”! 🙂 This is also this second, more complicated explanation of bilinear 0.25 0.75 weights I promised.

Advantage of it is that with the effective weight of [0.25 0.75] and [0.75 0.25] (ignoring zeros) on alternating pixels, they have the same amount of filtering and variance reduction of 0.625 – very important!

This is how the combined frequency response compares to the previous one: 

Two ways of bilinear upsampling both leave aliasing (everything after the dashed line half Nyquist), as well as blur the signal (everything before it).

So as expected, more blurring, less aliasing, consistent behavior between pixels.

Neither is perfect, but the even one will generally cause you less “problems”.

Signal processing – bilinear downsampling

By comparison, downsampling process should be a bit more familiar to readers who have done some computer graphics or image processing and know of aliasing in this context.

Downsampling consists of two steps in opposite order: 1. Filtering the signal. 2. Decimating the signal by discarding every other sample.

The ordering and step no 1 is important, as the second step, decimating is equivalent to (re)sampling. If we don’t filter the signal spectrum above frequencies representible in the new resolution, we are going to end up with aliasing, folding back of frequencies above previous half Nyquist:

When decimating, original signal frequencies will alias, appearing as wrong ones after decimation. To prevent aliasing, you generally want to prefilter the image with a strong antialiasing – lowpass – filter.

This is the aliasing the nearest-neighbor (no filtering) image downsampling causes:

Aliasing manifests as wrong frequencies; notice on the bottom plot how end of the spectrum looks like 2x smaller frequency than before decimation.

Bilinear downsampling take one – even bilinear filter

First antialiasing filter we’d want to analyze would be our old friend “linear in box disguise”, [0.5, 0.5] filter. It is definitely imperfect, and we can see both blurring, and some leftover aliasing:

The Graphics community realized this a while ago – when doing a series of downsamples for post-processing, for example bloom / glare; the default box/tent/bilinear filters are pretty bad in such case. Even small aliasing like this can be really bad when it gets “blown” to the whole screen, and especially in motion. It was even a large chunk of Siggraph presentations, like this excellent one from my friend Jorge Jimenez.

I also had a personal stab at addressing it early in my career, and even described the idea – weird cross filter (because it was fast on the GPU) – please don’t do it, it’s a bad idea and very outdated! 🙂 

Bilinear downsampling take two – odd bilinear filter

By comparison the odd bilinear filter (that doesn’t shift the phase) looks like a little different trade-off:


Less aliasing, more blurring. It might be better for many cases, but the trade-offs from breaking the half-pixel / corners aligned convention are IMO unacceptable. And it’s also more costly (not possible to do a single tap 2x downsampling).

To get better results -> you’ll need more samples, some of them with negative lobes. And you can design an even filter with more samples too, for example even Lanczos:

It’s possible to design better downsampling filters. This is just an example, as it’s an art and craft of its own (on top of the hard science). 🙂

Side note – different trade-offs for up/downsampling?

One interesting thing that has occurred to me on a few occasions is that the trade-offs for low pass filtering for upsampling and downsampling are different. If you use a “perfect” upsampling lowpass filter, you will end up with nasty ringing.

This is typically not the case for downsampling. So you can opt for a sharper filter when downsampling, and a less sharp for upsampling, and this is what Photoshop suggests as well:

Photoshop also suggests smoother/more blurry upsampling filter, while a sharper (closer to “perfect”) lowpass filter, because ringing / halos tend to not be as much of a problem there as in the case of upsampling.

Conclusions

I hope that my blog post helped to clarify some common confusions coming from using the same, very broad terms to represent some different operations.

A few of main takeaways that I’d like to emphasize would be:

  1. There are a few ways of doing bilinear upsampling and downsampling. Make sure that whatever you use uses the same convention and doesn’t shift your image after down/upsampling.
  2. Half pixel center offset is a very convenient convention. It ensures that image borders and corners are aligned. It is default on the GPU and happens automatically. When working on the CPU/DSP, it’s worth using the same convention.
  3. Different ways of upsampling/downsampling have different frequency response, and different aliasing, sometimes varying on alternating pixels. If you care about it (and you should!), look more closely into which operation you choose and optimal performance/aliasing/smoothing tradeoffs.

I wish more programmers were aware of those challenges and we’d never again again hit bugs due to inconsistent coordinate and phase shifts between different operations or libraries… I also with we could never see those “triangular” or jagged aliasing artifacts in images, but bilinear upsampling is so cheap and useful, that instead we should be just simply aware of potential problems and proactively address them.

To finish this section, I would again encourage you to read my previous blog post on some alternatives to bilinear sampling.

PS. What was my bug that I mentioned at beginning of the post? Oh, it was simple “off by one” – in numpy when convolving with np.signal.convolve1d and 2d I assumed wrong “direction” of the convolution of even filters. Subtle bug, but it was shifting everything by one pixel after sequence of downsamples and upsamples. Oops. 😅

Posted in Code / Graphics | Tagged , , , , , , , | 9 Comments

Is this a branch?

Let’s try a new format – “shorts”; small blog posts where I elaborate on ideas that I’d discuss at my twitter, but they either come back over and over, or the short form doesn’t convey all the nuances.

I often see advice about avoiding “if” statements in the code (especially GPU shaders) at all costs. The worst of all is when this happens in glsl and programmers replace some simple if statements with (subjectively) horrible step instruction which IMO completely destroys all readability, or even worse – sequences of mixes, maxes, clips and extra ALU.

After all, those ifs are branches, and branches are bad, right?

TLDR: No, definitely not – on all accounts.

Why do programmers want to avoid branches?

Let’s first address why programmers would like to avoid branches. There are some good reasons for that.

Scalar CPU

On the CPU, “branches” in a non-SIMD code might seem “innocent”, it’s just a conditional jump to some other code. Instead of executing an instruction a next address, we jump to instruction b at some other address. Unfortunately, there are many complicated things going on under the hood: All modern CPUs are heavily pipelined, superscalar, and out-of order.

All modern CPUs are superscalar and deeply pipelined. Many instructions are processed and executed at the same time. Source: wikipedia

To extract maximum efficiency, they use a model of speculative execution – a CPU “gambles” and “guesses” results of a branch ahead of time, not knowing if it will be taken or not, and starts executing those instructions. If it’s right – great! However if it’s not, then the CPU needs to stall, cancel all the guesses, go back to the branch, and execute the other instructions. Oops.

Luckily, CPUs are usually very good at this “guessing” (they actually analyze past branch patterns and take them into account!). On the other hand, without going into too much detail – each branch consumes some of the branch predictor unit resources. One extra branch on some short, innocent code can make results of the branch prediction worse, and slow down other code (where branch predictor might be necessary for good performance!).

SIMD and the GPU

When considering SIMD or the GPU, the situation is even worse.

Simplifying – single program and same instruction stream executes on multiple “lanes” of the same data, processing 4/8/16/32/64 elements at the same time.

If you want to take a branch when some of the elements take one path, some another one, you have two main options: either abandon vectorization and scalarize code (very bad option; usually means that it’s not worth vectorizing given piece of code), or execute both code paths – first the first one, store the results somewhere, execute the second one, and “combine” the results.

This is a general (over)simplification, and example GPUs have dedicated hardware to help do it efficiently and without wasting resources (hardware execution masks etc), but generally means extra cost of managing those masks, storing results somewhere, double execution costs and finally – branches can serve as barriers, preventing some latency hiding (see my old blog post on it).

Combined

Given those two, it might seem that avoiding branches seems reasonable, and the common advice of avoiding ifs makes sense. But whether it makes sense to avoid branch or not is more complicated (more on it later), and the second one (whether an if is a branch) is not the case.

Are if statements branches?

Let’s clear the worst misconception.

When you write an “if” in your code, the compiler won’t necessarily generate a branch with a jump.

I will focus now on CPUs, mostly because how easy it is to test it with Matt Godbolt’s amazing compiler explorer. 🙂 (Plus how there are two mainstream architectures, unlike many different shader ISAs)

Let’s have a look at the following simple integer function.

int is_this_a_branch(int v, int w) {
    if (v < 10) {
        return v + w; 
    } 
    else {
        return 2 * v - 2 * w;
    }
}

I tested it with 3 x64 compilers (GCC x64 trunk, Clang x64 trunk, MSVC 19.28) and one ARM (armv8-a clang), and only MSVC generates an actual branch there.

clang              
        mov     eax, edi
        sub     eax, esi
        add     esi, edi
        add     eax, eax
        cmp     edi, 10
        cmovl   eax, esi
        ret
gcc
        mov     eax, edi
        lea     edx, [rdi+rsi]
        sub     eax, esi
        add     eax, eax
        cmp     edi, 9
        cmovle  eax, edx
        ret
msvc        
        cmp     ecx, 10
        jge     SHORT $LN2@is_this_a_ BRANCH!!!
   BRANCH!
        lea     eax, DWORD PTR [rcx+rdx]
        ret     0
$LN2@is_this_a_:
        sub     ecx, edx
        lea     eax, DWORD PTR [rcx+rcx]
        ret     0

arm clang
        sub     w9, w0, w1
        add     w8, w1, w0
        lsl     w9, w9, #1
        cmp     w0, #10                         // =10
        csel    w0, w8, w9, lt

When using floats:

float is_this_a_branch2(float v, float w) {
    if (v < 10.0f) {
        return v + w; 
    } 
    else {
        return 2.0f * v - 2.0f * w;
    }
}

Now 2 out of 4 compilers generate a branch:

clang           
        movaps  xmm2, xmm0
        addss   xmm2, xmm1
        movaps  xmm3, xmm0
        addss   xmm3, xmm0
        addss   xmm1, xmm1
        cmpltss xmm0, dword ptr [rip + .LCPI1_0]
        subss   xmm3, xmm1
        andps   xmm2, xmm0
        andnps  xmm0, xmm3
        orps    xmm0, xmm2
        ret

gcc
        movss   xmm2, DWORD PTR .LC0[rip]
        comiss  xmm2, xmm0
        jbe     .L10  BRANCH!!!

        addss   xmm0, xmm1
        ret
.L10:
        addss   xmm1, xmm1
        addss   xmm0, xmm0
        subss   xmm0, xmm1
        ret

msvc
        movss   xmm2, DWORD PTR __real@41200000
        comiss  xmm2, xmm0
        jbe     SHORT $LN2@is_this_a_  BRANCH!!!
        addss   xmm0, xmm1
        ret     0
$LN2@is_this_a_:
        addss   xmm0, xmm0
        addss   xmm1, xmm1
        subss   xmm0, xmm1
        ret     0

arm clang
        fadd    s2, s0, s1
        fadd    s3, s0, s0
        fadd    s1, s1, s1
        fmov    s4, #10.00000000
        fsub    s1, s3, s1
        fcmp    s0, s4
        fcsel   s0, s2, s1, mi
        ret

What are those conditionals and selects?

This is basically CPUs’ and compilers’ way of “serializing” both code paths. If it considers overhead of a branch larger than the amount of work it can save, it makes sense to compute both code paths, and use a single instruction to select which one is the actual result.

This obviously “depends” on the use case and whether it is even “legal” for a compiler to do such a transformation (like “side effects”; or if a compiler executed some reads of memory, it could be unsafe and lead to segmentation fault). 

Does writing ternary conditional help avoid branches?

No.

This is a common advice, but it’s generally not true. If you look at the above example rewritten slightly:

float is_this_a_branch3(float v, float w) {
    return v < 10.0f ? (v + w) : (2.0f * v - 2.0f * w);
}

The generated code is identical (!).

It might be possible (?) that some compilers do generate branchless code for like this, but giving this as a general advice makes no sense. Whether you write an actual if, or a ternary expression, it’s still a high level if statement, just expressed differently.

What about the GPU?

On the GPU, it’s the same; the compiler will generate conditional selects when appropriate.

I used Tim Jones Shader Playground and AMD ISA (mostly because I know it relatively better than the others), and the unfortunate step function:

#version 460

layout (location = 0) out vec4 fragColor;
layout (location = 0) in float inColor;

void main()
{
#if 1
	fragColor = vec4(step(1.0, inColor));
#else
    float x;
    if (inColor < 1.0)
        x = 1.0;
   	else
        x = 0.0;
   	fragColor = vec4(x);
#endif
}

See for yourself, both versions generate identical assembly:

  s_mov_b32     m0, s3
  s_nop         0x0000
  v_interp_p1_f32  v0, v0, attr0.x
  v_interp_p2_f32  v0, v1, attr0.x
  v_cmp_gt_f32  vcc, 1.0, v0
  v_cndmask_b32  v0, 1.0, 0, vcc
  v_cvt_pkrtz_f16_f32  v0, v0, v0
  exp           mrt0, v0, v0, v0, v0 done compr vm

When using fastmath and not obeying strict IEEE float specification (which is standard for compiling shaders for real time graphics), compilers can even generate other instructions like min or max from your if statements. But YMMV and be sure to check the assembly.

So generally – using “step” just to avoid if statements is pointless.

If you like it for your coding style, then it’s obviously up to you, use as much as you like just for this reason. But please consider programmers who don’t have as much shader or GLSL experience and will be always puzzled by it. 🙂

Are branches always to be avoided?

No.

This is way beyond my aimed “short” format, but branches are not necessarily “bad”.

They can be actually an awesome optimization, even on the GPU!

If you start to use a lot of conditionals

Lots of advice on how branches are bad and have to be avoided comes from a prehistoric era of old CPUs and GPUs or old compilers; an advice that is passed without re(verifying) – which is understandable, I am not re-checking my whole knowledge or intuition every few months either. 🙂

But the CPU/GPU/compiler architects improved designs significantly and adapted them to general, highly branchy code, and whether there is a penalty or not depends on way too many details to list or give advice.

Rule of thumb: if a branch is generally coherent, it makes sense to use it when it allows to save on memory bandwidth or cache utilization. On GPUs it is usually worth adding branches if you can avoid texture fetches this way with a high probability and coherence. And it’s usually better to have some conditional masks and selects around simple/short sequences of arithmetic operations.

Final advice

  1. Don’t assume that an “if” statement will generate a branch. For most of simple “if” statements, the compiler can and will use conditional selects. Compilers can do many powerful transformations, sometimes ones that you wouldn’t even consider.
  2. Unless writing performance critical code, optimize for code readability. Don’t use obscure instructions and weird arithmetic.
  3. If you care for performance of a hot section, always read the resulting assembly and verify what is the compiler output.
  4. Don’t assume that a branch is bad. Profile your code – both in microbenchmarks, as well as in surrounding context (effects on cache, memory bandwidth, and branch predictor).
Posted in Code / Graphics | Tagged , , , , , , | 6 Comments

Converting wavetables to Ableton Operator AMS waves

This blog post comes with Ableton Operator AMS “wavetables” here.

In Ableton’s FM synth you can use different types of oscillator waves as your operators (both carriers as well as modulators), as well as draw custom ones:

What is not very well known is that you can save those as AMS files (and obviously also load them). The idea for this small project and a blog post came from a video about Ableton Live tips and tricks from Cymatics where I learned about this possibility.

If Operator can write/read such files, this this means that it’s also possible to easily create one with code, and it’s possible to use the harmonic information of arbitrary wavetables!

Note: This post is going to be slightly less technical than most of mine – as it is targeted towards electronic musicians using such software and not software engineers.

Here you can download some Operator wavetables from the harmonics of Native Instruments Massive wavetables I found in an online wavetable pack. I’ve tested those AMS files in Operator in Ableton Live 10 Suite.

To use them, just drag a file and drop it onto the Oscillator tab in Ableton Operator.

This way you can use any wavetable harmonic information as your FM synthesis operator oscillators – for both carriers and modulators!

I also include a Python script that I used to convert those (note: it’s quite naive and hacky, so some programming knowledge and debugging might be required; when programming at home I go with least effort), so you use it yourself on some other wavetables and convert them to AMS files.

Disclaimer / legal note: I don’t know the legal status of those Massive wavetables that I downloaded and converted nor can verify how legit they are (sound pretty accurate to me). IANAL, but I don’t think that sound data that can be extracted / generated by playing Massive by legal owners is copyrightable in any way, but anyway, not going to share those.

The files I uploaded are simply text descriptions of harmonics, lack phase information (more about it below) and don’t use any of the wavetables directly. The format looks like below:

AMS 1
Generate MultiCycleFM
BaseFreq 261.625549
RootKey 60
SampleRate 44100
Channels 1
BitsPerSample 32
Volume Auto
Sine 1 0.73
Sine 2 0.38
Sine 3 0.30
Sine 4 0.32
Sine 5 0.51
Sine 6 1.00
Sine 7 0.23
Sine 8 0.09

Why not just use some Wavetable synthesizer (like NI Massive, Ableton Wavetable, Xfer Serum etc.)?

You definitely can and should, they are excellent soft synths! This is just a fun experiment and an alternative.

However using those in Operator gives you:

  • Very low CPU usage (in my experience most optimized of Ableton’s synths),
  • Simplicity and immediacy of Operator,
  • Possibility to tweak or change existing Operator FM patches by just swapping the oscillator waves,
  • Opportunity to use wavetables as modulators or carriers and possibility of doing weird FM modulations of wavetables in different configurations including self-feedback,
  • A quick inspiration point – open Operator and just start loading those user waves for different weird sounds and play with modulating them one by another. A subtle organ chord modulated by a formant growl? Why not!

At the same time you don’t get wavetable interpolation (you can approximate it with fading between different operators, but it’s definitely not the same) and generally don’t get other benefits of those dedicated wavetable synths.

So it’s a new tool (or a toy), but not a replacement.

How additive synthesis works

While the Operator is FM synth, every of the Oscillators uses additive synthesis to generate its waveforms.

Additive synthesis adds together different harmonics (multiplies of the base frequency) to approximate more complicated sounds.

If we look at a simple square wave:

According to (oversimplified) Fourier theorem, any repeating signal can be also represented as a sum of sine and cosine waves, for example for the square wave:

1 sin(w) + 1/3 sin(3w) + 1/5 sin(5w) + …

This is known as harmonic series and amount of harmonics (those 1, 1/3, 1/5 etc) multipliers and the type of harmonics (notice that all of those 1w, 3w, 5w are odd numbers! This is typical of the square waves and their “oboe”-like sound and as opposed to saw waves with both odd and even harmonics) represent the timbre of the sound.

Additive synthesis is simply summing up different harmonics represented as sines of increasing frequency multiplied by a precomputed amplitude.

Square wave might not be the best example because of so-called Gibbs phenomenon (those nasty oscillations at discontinuities / jumps of signals), but the reconstruction still looks (and will sound) similar.

Additive synthesis uses this concept: Instead of storing and representing waves faithfully, you can store the amplitude of harmonics.

This allows for very efficient data compression! Many waveforms can be represented very well with just a few numbers. Limitation to 64 harmonics might seem relatively large, but on the other hand this is the same as 6 octaves – so from a C0 note you can still get a C5 tone, not bad for most practical uses.

As I mentioned, the representation in Ableton’s Operator doesn’t use the wave phase (relative shift of those sine waves as compared one to another). Without using the exact phase (let’s assume just random phase for fun), the square wave reconstruction could look like this:

With random (or simply wrong) phase, our signal doesn’t “look” like a square anymore… But will still sound like one!

Not like a square at all! But… It will sound generally exactly like a square wave. So does it matter? No. Yes. Kind of. More on that later!

How to get harmonics from any wavetable?

Ok, so let’s say we have a wavetable with 2048 samples, how can we get harmonics for it? The answer is – Discrete Fourier Transform, and in general, process known as spectral decomposition.

It takes our 2048 sample signal and produces 2048 complex numbers representing shifted sine/cosine waves. This is beyond scope of this post, but for “real” signals (like audio samples) and ignoring the phase information you can extract out of is simply 1023 amplitudes of different harmonics and an information about “constant term” (average value, known in audio also as DC term).

Let’s take some simple wavetable (recording of a TB-303 wave with it’s famous diode ladder filter applied):

When we run it through FFT and take the amplitude of the complex numbers, we get the information about the frequency and harmonic content:

From this plot, translating it to the format used by Ableton is very straightforward. Literally take the harmonic index (1, 2, 3, ….), and the amplitude value. And this is all, the amplitude is the harmonic content for given harmonic index! Unsurprisingly this can be coded up in a ~dozen lines of code. 🙂

In this case, we are going to discard the phase information, but here’s how the phase looks like for the curious:

Here’s a comparison of the original (left) with the reconstructed:

Left: Original wave. Right: reconstruction from additive harmonics with the phase information missing.

Notice that the waves are definitely different by discarding the phase. Interesting characteristic of this reconstructed wave is that it is (and has to be!) “symmetrical” due to use of sine waves and no phase information.

But if you listen to a wav file, then again, they should sound +/- identical.

Does the phase information matter?

No. Yes. Kind of.

Hmm, the answer is complicated. 🙂 

Generally when talking about the timbre of continuous tones (and not transients or changing signals), the human ear cannot tell the phase difference.

If we did, then if when we move around (which changes the phase), sounds would change their timbre drastically. Similarly, for example every single EQ (including the linear phase EQs) changes the phase of sounds in addition to amplitude.

But the human hearing does use the phase information, just in a different way – the difference between the phase of signals between two ears is super useful for stereo localization of sounds.

Generally it would seem that the phase doesn’t matter so much?

It doesn’t matter if we are talking about a) linear processes and b) single signal sources.

If you have two copies of a signal, even subtle phase shifts can introduce some cancellations and phase shifts (this is +/- how phaser effects work). In this case, the effect is subtle (as it is very signal dependent), but notice how the phase cancellation reduces some of the harmonics:

Additionally, when introducing non-linearities like distortion or saturation, interactions of different harmonics and different “local” amplitude can result in different distortion, let’s have a look again at the waves with added lines corresponding to some clipping points:

When soft clipping against the black lines, saturated signals will produce different harmonics.

For example after feeding through a “saturator” (I used a hyperbolic tangent function) the waves will be very different:

Left: Original wave fed through saturator. Right: additive harmonics reconstruction fed through saturator.

Notice the asymmetric soft clipping (rounding of the peaks) of the original vs symmetric on the reconstruction.

It results in different harmonics introduced through saturation:

It’s not that one is more saturated than the other, just that the harmonics are different. This goes into symmetric / asymmetric clipping, tube vs transistor odd vs even harmonics etc. A very deep topic on which I’m not an expert.

This can and will sound different, hear for yourself.

Note: those phase differences affecting distortion so much is one of the reasons why it’s hard to get properly sounding emulations of hardware, for example on 303 emulations done through simple analog synths with saw/square waves and a generic filter after a distortion will sound not exactly like the squelchy acid we love. 

So generally for raw, unprocessed sounds the phase often can be ignored, but for further processing it can be necessary and YMMV. 

Conclusions

Despite the lack of phase encoding or wavetable interpolation, I still think using additive synthesis like in Operator for emulating more complex wavetable sounds is a cool and inspiring tool and a fun toy to play with. Next time looking for some weird tones that could be a starting point for a track, why not have fun with modulating different weird additive oscillators? 

Bonus usage

Here’s a weird bonus: you can use those AMS files inside Ableton’s Simpler/Sampler as the single cycle waveforms, just drag and drop those. It doesn’t make any sense to me though. 🙂  

Side note: a bug in Ableton?

On another note, I think I found a bug in Ableton (when trying to figure out the format used by AMS files). When I save a slightly modified square wave and then load it, I get completely different results:

Notice the wave shape look on the second screen, it’s completely wrong (the loaded one also sounds horrible…). 

I think I know where the bug comes from – the AMS files contain the absolute value of amplitude of a given harmonic, while the default waves visual representation (and AMS saved values) have 1/harmonic index normalization. Kinda weird – I can imagine why they would include this normalization from the user’s perspective, but not sure why the exporter/importer loop is wrong.

So don’t worry if your square waves look differently than Ableton default ones.

Posted in Audio / Music / DSP | Tagged , , , , , , , | 1 Comment

Why are video games graphics (still) a challenge? Productionizing rendering algorithms

Intro

This post will cover challenges and aspects of production to consider when creating new rendering / graphics techniques and algorithms – especially in the context of applied research for real time rendering. I will base this on my personal experiences, working on Witcher 2, Assassin’s Creed 4: Black Flag, Far Cry 4, and God of War.

Many of those challenges are easily ignored – they are real problems in production, but not necessarily there only if you only read about those techniques, or if you work on pure research, writing papers, or create tech demos.

I have seen statements like “why is this brilliant research technique X not used in production?” both from gamers, but also from my colleagues with academic background. And there are always some good reasons!

The post is also inspired by my joke tweet from a while ago about appearing smart and mature as a graphics programmer by “dismissing” most of the rendering techniques – that they are not going to work on foliage. 🙂 And yes, I will come back to vegetation rendering a few times in this post.

I tend to think of this topic as well when I hear discussions about how “photogrammetry, raytracing, neural rendering, [insert any other new how topic] will be a universal answer to rendering and replace everything!”. Spoiler alert: not gonna happen (soon).

Target audience

Target audience of my post are:

  • Students in computer graphics and applied researchers,
  • Rendering engineers, especially ones earlier in their career – who haven’t built their intuition yet,
  • Tech artists and art leads / producers,
  • Technical directors and decision makers without background in graphics,
  • Hardware engineers and architects working on anything GPU or graphics related (and curious what makes it complicated to use new HW features),
  • People who are excited and care about game graphics (or real time graphics in general) and would like to understand a bit “how sausages are made”. Some concepts might be too technical and too much jargon, but then feel free to skip those.

Note that I didn’t place “pure” academic researchers in the above list – as I don’t think that pure research should be considering too many obstacles. Role of the fundamental research is inspiration and creating theory that can be later productionized by people who are experts in productionization.

But if you are a pure researcher and somehow got here, I’ll be happy if you’re interested in what kinds of problems might be on the long way from idea or paper to a product (and why most new genuinely good research will never find its place in products).

How to interpret the guide

Very important note – none of the “obstacles” I am going to describe are deal breakers.

Far from it – most successful tech that became state of the art violates many of those constraints! It simply means that those are challenges that will need to be overcome in some way – manual workarounds, feature exclusivity, ignoring the problems, or applying them only in specific cases.

I am going to describe first the use-cases – potential uses of the technology and how those impact potential requirements and constraints.

Use case

The categories of “use cases” deserve some explanation and description of “severity” of their constraints.

Tech demo 

Tech demo is the easiest category. If your whole product is a demonstration of a given technique (whether for benchmarking, showing off some new research, artistic demoscene), most of the considerations go away.

You can actually retrofit everything: from the demo concept, art itself, camera trajectory to show off the technology the best and avoid any problems.

The main considerations will be around performance (a choppy tech demo can be seen as a tech failure) and working very closely with artists able to show it off.

The rest? Hack away, write one-off code – just don’t have expectations that turning a tech demo into a production ready feature is simple or easy (it’s more like the 99% of work remaining).

Special level / one-off

The next level “up” in the difficulty is creating some special features that are one-off. It can be some visual special effect happening in a single scene, game intro, or a single level that is different from the other ones. In this case, a feature doesn’t need to be very “robust”, and often replaces many others.

An example could be lighting in the jungle levels in Assassin’s Creed 4: Black Flag that I worked on. 

Source: Assassin’s Creed 4: Black Flag promo art. Notice the caustics-like lightshafts that were key rendering feature in jungle levels – and allowed us to save a lot on the shadows rendering!

Jungles were “cut off” from the rest of the open world by special streaming corridors and we completely replaced the whole lighting in them! Instead of relying on tree shadow maps and global lighting, we created fake “caustics” that looked much softer and played very nicely with our volumetric lighting / atmospherics system. They not only looked better, but also were much faster – obviously worked only because of those special conditions.

Cinematics

A slightly more demanding type of feature is cinematic-only one. Cinematics are a bit different as they can be very strictly controlled by cinematic artists and most of their aspects like lighting, character placement, or animations are faked – just like in regular cinema! Cinematics often feature fast camera cuts (useful to hide any transitions/popping) and have more computational budget due to more predictable nature (and even in 60fps console games rendered in 30fps).

Witcher 2 cinematic featuring higher character LODs, nice realistic large radius bokeh and custom lighting – but notice how few objects to render are there!

Regular rendering feature

“Regular” features – lighting, particles, geometry rendering – are the hardest category. They need to be either very easy to implement (most of the obstacles / problems solved easily), provide huge benefits surpassing state of the art by far to facilitate the adoption pain, or have some very excited team pushing for it (never underestimate the drive of engineers or artists that really want to see something happen!).

Most of my post will focus on those. 

Key / differentiating feature

Paradoxically, if something is a key or differentiating feature, this can alleviate many of the difficulties. Let’s take VR – there stereo, performance (low latency), and perfect AA with no smudging (so rather forget about TAA), are THE features and absolutely core to the experience. This means that you can completely ignore for example rendering foliage or some animations that would look uncannily – as being immersed and the VR experience of being there are much more important!

Feature compatibility

Let’s have a look at compatibility of a newly developed feature with some other common “features” (the distinction between “features”, and the next large section “pipeline” is fuzzy).

Features are not the most important of challenges – arguably the category I’m going to cover at the end (the “process”) is. But those are fun and easy to look at! 

Dense geometry

Dense geometry like foliage – a “feature” that inspired this post – is an enemy of most rendering algorithms.

The main problem is that with very dense geometry (lots of overlapping and small triangles), many “optimizations” and assumptions become impossible.

Early Z and occlusion culling? Very hard. 

Decoupling surfaces from volumes? Very hard.

Storing information per unique surface parametrization? Close to impossible.

Amount of vertices to animate and pixels to shade? Huge, shaders need simplification!

Witcher 2 foliage – that density! Still IMO one of the best forests in any game.

Dense geometry through which you can see (like grass blades) is incompatible with many techniques, for example lightmapping (storing a precomputed lighting per every grass blade texel would be very costly).

If a game has a tree here and there or is placed in a city, this might not be a problem. But for any “natural” environment, a big chunk of the productionization of any feature is going to be combining it to coexist well with foliage.

Alpha testing

Alpha testing is an extension of the above, as it disables even more hardware features / optimizations.

Alpha testing is a technique, when a pixel evaluates “alpha” value from a texture or pixel shader computations, and based on some fixed threshold, doesn’t render/write it.

It is much faster than alpha blending, but for example disables early z writes (early z tests are ok), and requires raytracing hit shaders and reading a texture to decide if a texel was opaque or not.

It also makes antialiasing very challenging (forget about regular MSAA, you have to emulate alpha to coverage…).

For a description and great visual explanation of some problems, see this blog post of Ben Golus.

Animation – skeletal

Most animators work with “skeletal animations”. Creating rigs, skinning meshes, animating skeletons. When you create a new technique for rendering meshes that relies on some heavy precomputations, would animators be able to “deform” it? Would they be able to plug it into a complicated animation blending system? How does it fit in their workflow?

Note that it can also mean rigid deformations, like a rotating wheel – it’s much cheaper to render complicated objects as a skinned whole, than splitting them.

And animation is a must, cannot be an afterthought in any commercial project.

Witcher 2 trebuchets were not people, but also had “skeletons” and “bones” and were using skeletal animations!

Animation – procedural and non-rigid

The next category of animations are “procedural” and non-rigid. Procedural animations are useful for any animation that is “endless”, relatively simple, and shouldn’t loop too visibly. The most common example is leaf shimmer and branch movement.

See this video of middleware Speedtree movement – all movement there is described by some mathematical formulas, not animated “by hand” and looks fantastic and plausible.

A good rendering technique that is applicable on elements like foliage (again!) needs to support the option of displacing it in any arbitrary fashion from simple shimmer to large bends – otherwise the world will look “dead”.

Non-rigid animation, modifying vertex positions, or even streaming whole vertex buffers causes headaches for the raytracing – as it requires readjusting the spatial acceleration structures (essential for RT) every single frame, which is impractical.

Animation – textures

Yet another type of animation is animating textures on surfaces of objects. This is not just for emulating neons or TVs, but also for example raindrops and rain ripples. Technical and FX artists have lots of amazing tricks there, from just sliding and scaling UVs, having flowmaps, to directly animating “flipbook” textures.

Assassin’s Creed 4 rain was based on displacing normals with a procedurally generated rain heightmap and texture. See my Digital Dragons talk on rain rendering there.

Dynamic viewpoint

Many techniques work well assuming a relatively fixed camera viewpoint. From artists tricks and hacks, to techniques like impostor rendering.

Some rendering acceleration techniques optimize for a semi-constrained view (like Project Seurat that my colleagues worked on). Degrees of camera freedom are something to be considered when adapting any technique. A billboard-based tree can look ok from a distance, but if you get closer, or can see the scene from a higher viewpoint, it will break completely.

Also – think of early photogrammetry that was capturing the specular reflections as textures, which look absolutely wrong when you change the viewpoint even slightly.

Dynamic lighting

How dynamic is the lighting? Is there a dynamic time of day? Weather system? Can the user turn on/off lights? Do special effects cast lights? Are there dynamic shadow casting objects?

The answer is going to be “yes” to many of those questions for most of the contemporary real-time rendering products like games; especially with a tendency of creating larger, living worlds.

This doesn’t necessarily preclude precomputed/baked solutions (like our precomputed GI solution for AC4 that supported dynamic time of day!) but needs extra considerations.

There are still many new publications coming out that describe new approaches and solutions to those problems (combining precomputations of the light transport, and the dynamic lighting).

Shadows

Shadows, another never ending topic and an unsolved problem. Most games still use a mixture of precomputed shadows and shadow maps, some start to experiment with raytraced shadows.

Anytime you want to insert a new type of object to be rendered, you need to consider: is it going to be able to receive shadows from other objects? Is it going to cast shadows on other objects?

For particles or volumetrics, the answer might not be easy (as partial transmittance is not supported by shadow maps), but also “simple” techniques like mesh tessellation or parallax occlusion mapping are going to create a mismatch between shadow caster and the receiver, potentially causing shadowing artifacts!

A slide from my GDC 2014 talk showing parallax occlusion mapping in AC4 – notice complete lack of shadows there. 🙂

Dynamic environment

Finally, if the environment can be dynamic (through game story mandated changes, destruction, or in the extreme case through user content creation), any techniques relying on offline precomputation become challenging.

On God of War the Caldera Lake and the bridge, moving levels of water and bridge rotations were one of the biggest concerns throughout the whole production from the perspective of lighting / global illumination systems that rely on precomputations. There was no “general” solution, it all relied on manual work and streaming / managing the loaded data…

God of War Caldera Lake bridge. Based on what happens there throughout the story, it was quite challenging to manage its lighting changes! Source: Sony promo art.

Transparents and particles

Finally, there are transparent objects and particles – a very special and different category than anything else.

They don’t write depth or motion vectors (more on it later), require back-to-front sorting, need many evaluations of costly computations and texture samplers per final output piels, usually are lit in a simpler way, need special handling of shadows… They are also very expensive due to overdraw – a single output pixel can be many evaluations of alpha blended particles’ pixel shaders.

Person-years of work (on most projects I worked on there were N dedicated FX artists and at least one dedicated FX and particle rendering engineer) that cannot be just discarded or ignored by a newly introduced technique.

Pipeline compatibility

The above were some product features and requirements to consider. But what about some deeper, more low-level implications? Rendering of a single frame requires tens of discrete passes that are designed to work with each other, tens of intermediate outputs and buffers. Let’s look at parts of the rendering pipeline to consider.

Depth testing and writing

I mentioned above some challenges with alpha testing, alpha blending, sorting, and particles. But it gets even more challenging when you realize that many features require precise Z buffer values in the depth buffer for every object!

That creamy depth of field effect and many other camera effects? Most require a depth buffer. Old school depth fog? Required depth buffer!

I love bokeh on this screenshot and the work I’ve done on it with artists on Witcher 2, but notice the errors: Flame in front of the dragon is sharp, while the plane in front of the background is blurred. This is a bug as it should have the same DoF, but the depth read for the bokeh effect is the depth of the solid objects and not particles!

Ok, this seems like a deal breaker, as we know that alpha blended objects don’t write those and there cannot be a single depth value corresponding to them… But games are a mastery of smoke and mirrors and faking. 🙂 If you really care you can create depth proxies by hand, sort and reorder some things manually, alpha blend depth (wrong but can look “ok”), tweak depth of field until artifacts are not distracting… Lots of manual work and puzzles “which compromise is less bad”!

Motion vectors

A related pipeline component is writing of the motion vectors. While depth is essential for many mentioned depth based effects, proper occlusions, screen-sapce refractions, fog etc, the motion vectors are used for “only” two things: motion blur and temporal antialiasing (see below).

Motion blur seems like “just an effect”, but having small amounts of it is essential to reduce the “strobing” effect and generally cheap feel (see movie 24fps half shutter motion blur).

(Some bragging / Christmas time nostalgia: thinking about motion blur made me feel nostalgic – motion blur was the first feature I got interviewed about by the press – I still feel the pride the 23 year old me living in Eastern Europe and that didn’t even finish the college felt!)

Producing accurate motion vectors is not trivial – for every pixel, you need to estimate where this part of the object was in the previous frame. This can mean re-computing some animation data again, storing it for the previous frame (extra used memory), or can be too difficult / impossible (dealing with occlusions, shadows, or texture animations).

On AC4 we have implemented it for almost everything – with an exception of the ocean and ignored the TAA and motion blur problems on it…

Temporal antialiasing

Temporal antialiasing… one of the biggest achievements of the rendering engines in the last few years, but also one of the biggest sources of problems and artifacts. Not going to discuss here if there are alternatives to it or if it’s a good idea or not, but it’s here to stay for a while – not just for the antialiasing, but also temporal distribution of samples in Monte Carlo techniques, supersampling, stochastic rendering, and allowing for slow things to become possible or higher quality in real-time. 

It can be a problem for many new rendering techniques – not only because of the need for pretty good motion vectors, but also its nature can lead to some smearing artifacts, or even cancel out visual effects like sparkly snow (glints, sparkles, muzzle flash etc).

Sparkly snow was one of the most requested features by artists on God of War, but I was not able to create a robust one, as temporal AA was removing it all treating it like temporal flicker / aliasing…

Deferred shading / lighting

Majority of the engines use deferred shading (yes, there are many exceptions and they look and perform fantastic). It has many desirable properties and “lifts” / decouples parts of the rendering, simplifies things like decals, can provide a sweet reduction of the overdraw…

An example of a G-Buffer filled with geometrical data of a scene in OpenGL
An example GBuffer representation – textures representing different material and object properties. Lighting is applied later to those. Source: learnopengl

But having a “bottleneck” in form of the GBuffer and lighting not having access to any other data can be very limiting.

New shading model? New look-up table? New data precomputed / prebaked and stored in vertices or textures? Need to squeeze those into GBuffer!

Modern GPUs handle branching and divergence with ease (provided there is “some” coherency), but it can complicate the pipeline and lead to “exclusive” features.

For example on God of War you couldn’t use subsurface scattering at the same time as the cloth/retroreflective specular model due to them (re)using same slots in the GBuffer. The GBuffer itself was very memory heavy anyway and there were many more mutually exclusive features – I had many possible combinations written out in my notebook (IIRC there were 6 bits just for encoding the material type + its features) and “negotiating” those compromises between different artists who all had different uses and needs.

Any new feature meant cutting out something else, or further compromises…

Camera cuts / no camera cuts

In most cinematics, there are heavy camera cuts all the time to show different characters, different perspectives of the scene, or maybe action happening in parallel. This is a tool widely used in all kinds cinema (maybe except for Dogme 95 or Birdman 😉 ), and if a cinematic artist wants to use it, it needs to be supported.

But what happens when the camera cuts with all the per-pixel temporal accumulation history for TAA/temporal supersampling? How about all the texture streaming that suddenly sees a completely new view? View-dependent amortized rendering like shadowmap caching? All solvable, but also a lot of work, or might introduce unacceptable delay / popping / prohibitive cost of the new frame. A colleague of mine also noted that this is a problem for physics or animations – often when the camera cuts and animators adjusted some positions, you see physical objects “settling in”, for example a necklace move on a character. Another immersion breaker that requires another series of custom “hacks”.

Conversely, lack of camera cuts (like in God of War) is also difficult, especially for cinematics lighting, good animations, transitions etc. In any case – both need to be solved/accounted for. Even worth adding a flag “camera was cut” in the renderer.

Frustum culling

Games generally don’t render what you don’t see in the camera frustum – which is obviously desirable, as why would you waste cycles on it?

Horizon: Zero Dawn gif showing frustum culling. Lots of gamedevs were a little amused that the gaming media and gamers “discovered” games do frustum culling (it was always there).

Now, it gets more complicated! Why would you animate objects that are not visible? Not doing so can save a lot of CPU time.

However things being visible in the main view is only part of the story – there are reflections, shadows… I cannot count in how many games I have seen the shadows of the off-screen characters in the “T-pose” (default pose when a character is not animated). Raytracing that can “touch” any objects poses a new challenge here!

Character in a T-Pose. I am sure you have seen it as a glitch way too many times! Worth paying attention to shadows and reflections as well. 🙂

Occlusion culling

Occlusion culling is the next step after frustum culling. You don’t want to render things outside of the camera? Sure. But if in front of the camera there is a huge building, you also don’t want to render the whole city behind it!

Robust occlusion culling (together with the LOD, streaming etc. below) is in many ways an unsolved problem – all existing solutions require compromises, precomputation, or extremely complex pipelines.

In a way an occlusion culling system is a new rendering feature that has to go through all the steps that I list in this post! 🙂 But given its complexity and general fragility of many solutions – yet another aspect to consider, we don’t want to make it even more difficult.

Level of detail

Any technique that requires some computational budget to render and memory usage needs to “scale” properly with distance. When the rendered object is 100m away and occupies a few pixels, you don’t want it to eat 100MB of memory or its rendering/update take even a millisecond.

Enter “level of detail” – simplified representation of objects, meshes, textures (ok, this one is automatic in form of mip-mapping, but you still need to stream those properly and separately), computations. 

Source: wikipedia.

Unfortunately while the level of detail for meshes or textures is relatively straightforward (though in open world games it is not enough and gets replaced by custom long-distance rendering tech), it’s often hard to solve it for other effects.

How should the global illumination scale with distance? Volumetric effects? Lighting? Particles?

All hard problems that are often solved in very custom ways, through faking, hacking, and manual labor of artists…

Loading, LOD switching and streaming

Level of detail needs to be streamed. Ok, an object far away takes a few pixels on the screen, we can use just tens of vertices and maybe no textures at all – in total maybe a few kilobytes, all good. But then the camera starts to move closer, and we need to load the material, textures, meshes… All of this needs systems and solutions coded up. What happens when the camera just teleports? Streaming system needs to handle it.

Furthermore, switching between those LODs has to be as seamless as possible. Most visual glitches in games with some textures missing / blurry, things flickering, popping, appearing are caused by the streaming, loading, and the LOD switching systems!

Open world

I put the “open world” as a separate item, but it’s a collection of many constraints and requirements that together is different in a qualitative way – it’s not just a robust streaming system, but everything needs to be designed for streaming in open world games. It’s not just good culling – but a great culling system working with AI, animations and many other systems. Open world and large scale rendering is a huge pipeline constraint on almost everything, and has to be considered as such.

When I joined the AC4 team at Ubisoft and saw all the “open world” tech for streaming, long distance rendering and similar, or the long range shadows elements for Far Cry, I was amazed with how specialized (and smart) work was needed to fit those on consoles.

Budget

Finally, the budget… I’ll tackle the “production” budget below, but hope this one is self-explanatory. If a new technique needs some memory or CPU/GPU cycles, they need to be allocated.

One common misconception is that 30ms is “real time technique”. It can be, but only if this is literally the only thing rendered, and only in a 30fps game.

For VR you might have to deal with budgets of sub 10ms to render 2 eye views for the whole frame! This means that a lighting technique that costs 1ms might already way be too expensive!

Similarly with RAM memory – all the costs need to be considered holistically and usually games already use all available memory. Thus any new component means sacrificing something else. Is a feature X worth reducing texture streaming budget and risking texture streaming glitches?

Finally, another “silent” budget is disk space. This isn’t discussed very often, but with modern video game pipelines we are at a point where games hardly fit on Blu Ray disks (it was a bigger challenge on God of War than fitting in memory or the performance!).

Similarly, patches that take 30GBs – it’s kind of ridiculous and usually a sign of an asset packaging system that didn’t consider patching properly… But anyway, a new precomputed GI solution that uses “only” maybe 50MB per level chunk “suddenly” scales to 15GB on disk for the whole 30h game and that most likely will not be acceptable!

The process

While I left the process for the end (as might be least interesting to some audience), challenges listed here are the most important.

Many “alternative” approaches to rendering are already practical in many ways (like signed distance fields), but the lack of tooling, pipelines, and production experience is a complete blocker for all but for special examples that for example rely on user created content (see brilliant Dreams or Claybook).

Artistic control

It’s easy to get excited with some amazing results of procedural rendering (been there for many decades), photogrammetry, or generally 3D content capture and re-rendering – the results look like eye-candy and might seem like removing a huge chunk of production costs.

However artists need full control and agency in the process of content making.

Capturing a chair with a neural implicit representation is easy, but can you change the color of the seat, make some part of it longer, and maybe change the paint to varnished? Or when the art director decides that they want a baroque style chair, are you going to look for one?

Generally most artists are skeptical of procedural generation and capture (other than to quickly fill huge areas of levels that are not important to them) as they don’t have those controls.

On Assassin’s Creed 4, one of my teammates worked on procedural sky rendering (it was a collaboration with some other teams, and also IIRC had an evaluation of some middleware). It looked amazing, but the art director (who was also a professional painter) was not able to get exactly the look he wanted, and we went with very old-school, static (but pretty!) skyboxes.

Developing a procedural rendering system that will produce sky exactly like this painterly Assassin’s Creed 4 concept art is going to be challenging! Souce: Ubisoft promo art.

Tools

Every tool or technique that is creative / artistic, needs tooling to express the control, place it on levels, allow for interaction. “Tools” can be anything from editing CSV files to dedicated WYSIWYG full editors.

Even the simplest new techniques need some tools / control (even if it is as simple as on/off) and it needs to be written.

The next level of complexity is for example a new BRDF or material model – it will have to be integrated in a material editor, textures processed through platform pipeline and baking, compressed to the right format, shader itself generated and combined with other features… All while not breaking the existing functionality.

Finally, if you propose something to replace mesh representation you need to consider that suddenly the whole ecosystem that artists have used – from content authoring like Z Brush, 3D Studio Max and Maya, animation tools and exporters like Motion Builder or dedicated mocap software, to finally all engine plugins, importers/exporters. This is like redoing work of the decades of the whole industry. Doable? Sure. Reasonable? It depends on your budget and the benefits it might bring.

I have a theory about the popularity of engines like Unity and Unreal Engine – it didn’t come from any novel features (though there are many!), but from the mature and stable, expressive, and well known tooling (see below).

Education and experience

Some video game artists have been around for a few decades and built lots of knowledge, know-how and experience. Proposing something that doesn’t build on it can be a “waste” of this existing experience. To break old habits, benefits have to be huge, or become standard across the whole industry (Physically Based Rendering took a few years to popularize).

If you want to change some paradigms, are you able to spend time educating people using those in their daily work, and would they believe it is truly worth it?

Support of the others

It took me a while and some hard lessons to realize that the “emotional” side of collaboration is more important than the technical one.

If your colleagues like artists or other programmers are excited about something – everything will be much easier. If the artists are not willing to change their workflow, or don’t believe that something will help them, the whole feature or project can fail.

I have learned it myself and only in hindsight and hard way: early in my career I was mostly on teams with people similarly excited and similarly junior as me; so later on and on some other team the first time I proposed something that my colleagues didn’t care for, I didn’t understand the “resistance”, proceeded anyway, and the feature has failed (simply was not used).

Good “internal marketing”, hype, and building meaningful relationships full of trust in each other’s competence and intentions are super important and this is why it’s hard to ship something if you are an “outsider”, consultant, or a remote collaborator.

Testing

I’ve written a post about automatic testing of graphics features, and it goes without saying that productionizing anything takes many months of testing – by the researcher/programmer themselves, by artists, QA.

Anything that is not tested in practice can be assumed to be broken, or not working at all.

Debuggers and profilers

Every feature that requires some input data needs to provide means of validation and debugging behaviors and this input. Visualizations of intermediate results, profilers showing how much time is spent on different components, integration with other debugging tools.

Most graphics programmers I know truly enjoy writing those (as they both look cool, and pay off the time investment very quickly), but it has to be planned, costs time and needs to be maintained.

Developing a new technique, try to think “if things go wrong for some reason, how myself or someone less technical will be able to tell why and fix the problem? What kind of tools are needed for that?”

Edge cases

Ah, the edge cases. Things that are supposed to be “rare” or even negligible, but are guaranteed to happen during actual, large scale production. Non-watertight or broken topology meshes? Guaranteed. Zero-area triangles? Guaranteed. Flipped normals? Guaranteed. Materials with roughness of exactly zero? Sure. Black or fully white albedo? Yes. This one level where the walls have to be so thin that the GI leaks? Yep.

Production is full of those – sometimes to the extent that what is considered a pathologically bad input in a research paper, represents most of in-game assets.

Many junior graphics programmers who implement some papers in the first months of their careers, find out that described techniques are useless on the “real” data – hitting an edge case on literally the first tried asset. This doesn’t encourage future experimentations (and to be fair – could be avoided if authors were more honest in the “limitations” sections of publications and actually explored it more).

Practitioner mindset is to think immediately “How can this break? How can I trigger this division by zero? What happens if x/y/z?” and generally preparing to spend most of the time handling and robustifying those edge cases.

Production budget

Finally, something obvious – the production budget. Everything added/changed/removed from production has a monetary cost – of time, software licenses, QA, modifying existing content, shifting the schedule. Factoring it in is difficult, especially from an engineer position (limited insight into full costs and their breakdown), but tends to come naturally with some experience.

Anecdote from Witcher 2the sprite bokeh was something I implemented during a weekend ~2 weeks before gold master. I was super excited about it, and so were the environment and cinematic artists. We committed to shipping it and between us spent the next two weeks crunching to re-work the DoF settings in all existing cinematics (obviously many didn’t need to be tweaked, only some). Glad that we were super inexperienced and didn’t know not to do stuff like this – as the feature was super cool, and I cannot imagine it happening in any of my later and more mature companies. 🙂 

Summary

To conclude, going from research to product in the case of graphics algorithms is quite a journey!

A basic implementation for a tech demo is easy, but the next steps are not:

  • One needs to make sure the new algorithm can coexist with and/or support all the other existing product features,
  • It has to fit well into an existing rendering pipeline, or the pipeline needs to be modified,
  • All the aspects of the production process from tooling, through education to budgeting need to include it.

Severity of the constraints varies – from ones easy to satisfy, easy to workaround or hack, through more challenging ones (sometimes having to heavily compromise and drop them), but it’s worth thinking about the way ahead in the process and proper planning.

As I mentioned multiple times – I don’t want to discourage anyone from breaking or ignoring all those guidelines and “common knowledge”.

The bold and often “irrational” approaches are the ones that sometimes turn revolutionary, or at least are memorable. Similarly, pure researchers shouldn’t let some potential problems hinder their innovation.

But at the same time, if people representing both the “practical” and “theoretical” ends of the research and technology progress understand each other, it can lead to much healthier and easier “technology transfer” and more useful/practical innovation, driven by needs, and not the hype. 🙂

PS. if you’re curious about some almost 10 year old “war stories” about rendering of the foliage in The Witcher 2 – I tried to capture those below:

Supplement – case study – foliage

I started with a joke about how foliage rendering breaks everything else, but let’s have a more serious analysis into how much work and consideration went into rendering of Witcher 2’s foliage.

Around the time I joined CD Projekt Red in The Witcher 2 pre production, there was no “foliage system” or special solution for rendering vegetation. Artists were placing foliage as normal meshes – grass blade one after another on a level. Every mesh was a component and object inside an entity object, and generally was very heavy and using “general” data structures, same no matter if for rocks, vegetation, or the player character. Throughout the production a lot had to change!

Speedtree importer and wind

One of key artists working on vegetation was using Speedtree middleware a lot – but just as an editor, we were not using their renderer.

Speedtree supports amazingly looking wind, and artists wanted similar behavior in game. One of my first graphics related tasks was adding importing of some wind data from Speedtree into the game, and translating some of the behavior into our vertex shaders.

This was mostly tooling work, and took a few weeks – I needed to modify the tools code, add storage for this wind data, different vertex shader type, material baking code, support wind on non-speedtree assets. All for just supporting something that existed and wasn’t any kind of innovative/research thing.

But it looked great and I confirmed to myself that I was right dreaming of becoming a graphics programmer – it was so rewarding!

Translucency and normals

The next category of tasks with foliage that I remember working on was not implementing anything new, but debugging. We had a general “fake translucency” solution (emulating scattering through inverse boosted Lambertian term). But it relies on the normals pointing away, while our engine supported two sided normals (a mesh face normal flipped depending on which side it is rendered from) and some assets were using it. Similarly, artists were “hacking” normals to point upwards like the terrain to avoid some harsh lighting and smooth everything out, and such meshes didn’t show translucency, or it looked like some wrong, weird specular.

Fixing importers, mesh editor, and generally debugging problems with artists, and educating them (and myself!) took quite a bit of time.

To alpha test, or to draw small triangles?

As we were entering more performance critical milestones, foliage was more and more of a performance problem. Alpha testing is costly and causes overdraw, while drawing small triangles causes so-called quad overdraw. Optimizing the overdraw of foliage was a never ending story – I remember when we started profiling some levels, there was a single asset that took 8ms to render on powerful PCs. We did many iterations and experiments to make it run fast. In the end, the solution was hybrid – many single grass blades, but leaves were organized as alpha tested “leaf cards”.

Level layer meshes

One of the first important optimizations – just for the part of the CPU cost, and reducing loading times – was to “rip” all meshes that were simple enough (no animations other than wind, no control by gameplay) into a separate, much smaller data representation.

This task was something I worked on with lots of mentorship from a senior programmer, and was going back to my engine programming roots (data management, memory, pipeline). 

While it didn’t address all the foliage needs, it was a first step that allowed to process tens of thousands of such objects (instead of hundreds), and artists realized they needed better tooling to populate a large world.

Automatic impostor system

A feature that my much more senior colleague worked on – adding the option of automatic impostor generation. After an user-defined distance, mesh would turn into an auto-rotating billboard – with different baked views from different sides.

While I didn’t implement it, later was improving and debugging it quite a lot (the programmer who implemented it was the most senior person on the team overwhelmed with tasks and generally often worked by doing initial implementation and then handling it off to more junior people – and it was a great learning and mentoring opportunity) on things like inpainting alpha tested regions for correct mip-mapping, or how to optimize backing w.r.t. the GBuffer structure (to avoid extra GBuffer packing/unpacking).

Foliage editor

The next foliage specific work was again around tooling!

Our senior programmer spent a few heavy weeks on making a dedicated foliage editor, with brushes, layers, editors. Lots of mundane work, figuring out UX issues, finding best data representations… Things that need to happen, but almost no paper will describe.

The result was quite amazing though, and during the demo, my colleague showed the artists how he was able to create a forest in 5minutes and right away, our game got populated by them with great looking foliage and vegetation (in my biased opinion, the best looking in games around that time).

Foliage culling and instancing

With lots of manual optimization work, we were nearing the shipping date of Witcher 2 on PC and starting to work on the X360 version. A newly hired programmer, now my good friend, was looking into performance, especially on the 360 of the occlusion culling (an old school “hipster” algorithm that you probably never heard of, a beam tree). Turns out that building a recursive BSP from hundreds of occluders every frame and processing thousands of meshes against it is not a great idea, especially on an in-order PowerPC CPU of X360.

My colleague spent quite a lot of time on making the foliage system a real “system” (and not just a container for meshes) with hierarchical representation, better culling, and hardware instancing (which he had experience with on X360, unlike anyone else in the team). It was many weeks of work as well, and absolutely critical work for shipping on the Xbox.

Beyond Witcher 2 – terrain, procedural vegetation

This doesn’t end the story of vegetation in Witcher games.

For Witcher 3 (note: I barely worked on it; I was moved to the Cyberpunk 2077 team) my colleagues went into integration of the terrain and foliage systems with streaming (necessary for an open world game!), while our awesome technical artist coded procedural vegetation generation inspired by physical processes (effects of altitude, the amount of sun, water proximity), and “game of life” that were used for “seeding” the world with plausibly looking plants and forests.

I am sure there were many other things that happened. But now imagine having to redo all this work because of some new technique – again, doable, but is it worth it? 🙂 

Posted in Code / Graphics | Tagged , , , , , , , , | 4 Comments

Compressing PBR material texture sets with sparsity and k-SVD dictionary learning

Introduction

In this blog post, I am going to continue exploration of compressing whole PBR texture sets together (as opposed to compressing each texture from the set separately) and using the fact that those textures are strongly correlated.

In my previous post, I have described how we can “decorrelate” some of the channels using PCA/SVD and how many of the dimensions in such a decorrelated space contain small amount of information, and therefore can be either dropped completely, stored at a lower resolution, or quantized aggressively. This idea is a backbone of almost all successful image compression algorithms – from JPEG to BC/DXT.

This time we will look at exploring some redundancy across both the stored dimensions, as well as the image space using sparse methods and kSVD algorithm (original paper link).

From my personal motivation, I wanted to learn more about those sparse methods, play with implementing them and build some intuition – one of my coworkers and collaborators on my previous team at Google, Michael Elad is one of the key researchers in the field and his presentations inspired and interested me a lot – and as I believe that until you can implement and explain something yourself, you don’t really understand it – I wanted to explore this fascinating field (in a basic, limited form) myself.

First I am going to do present a simplified and intuitive explanation of what “sparsity” and “overcompletness” mean and how their redundancy can be an advantage. Then I will describe how we can use sparse approximation methods for tasks like denoising, super-resolution, and texture compression. Then I will provide a quick explanation of k-SVD algorithm steps and how it works, and finally show some results / evaluation for the task of compressing PBR material texture sets.

This post will come with a colab notebook (soon – I need to clean it up slightly) so you can reproduce all the code and explore those methods yourself.

Sparse and dictionary based methods in real-time rendering?

I tried to look for some other references to dictionary based methods in real time rendering and was able to find only of three resources: a presentation by Manny Ko and Robin Green about kSVD and OMP, a presentation by Manny Ko about Dictionary Learning in Games, and a mention of using Vector Quantization for shadow mask compression by Bo Li. VQ is a method that is more similar to palletizing than full dictionary learning, but still a part of the family of sparse methods. I recommend those presentations, (a lot of what I describe about the theoretical part – and how I describe – overlaps with them) and if you are aware of any more mentions of those IMO underutilized methods in such a context, please leave a comment and I’ll be happy to link.

Vector storage – complete and orthogonal basis, vs overcomplete and sparsely used frames / dictionaries

To explain what are sparse methods, let’s start with a simple, 2 dimensional vector. How can we represent such a vector? “This is obvious, we can just use 2 numbers, x and y!”. Sure, but what do those numbers mean? They correspond to an orthonormal vector basis. x and y represent actually a proportion of position along the x axis, and the y axis. If you have done computer graphics, you know that you can easily change your basis. For example, a vector can be decribed in so called “world space” (with arbitrary, fixed point), “camera / view space”, or for some simple video game AI purpose in space of a computer-controlled character.

Same vector, two different storage basis.

Vector basis might have different properties. Two that are very desired are orthogonality and normality, often combined together as orthonormal basis. Orthonormality gives us trivial way of changing information between two basis back and forth (no matrix inversion required, no scaling/non-unity Jacobians introduced) and some other properties related to closed form products and integrals – it’s beyond scope of this post, but I recommend “Stupid Spherical Harmonics Tricks” from Peter-Pike Sloan for some examples how SH being orthonormal makes it a desired basis for storing spherical signals.

Choice of “the best” basis matters a lot when doing compression – I have described it in my previous blog, but as a quick refresher, “decorrelating” dimensions can give us basis in which information is maximized in each sequential subset of dimensions – meaning that the first dimension has the most information and variability present, the next one the same or smaller amount etc. This is limited to linear relationships and covariance, but lots of real data shows linear redundancy.

Demonstration from my previous post of finding a storage basis in which we maximize information stored in single dimension.

One of the requirements of a space to be a basis is completeness and uniqueness of representation. Uniqueness of the representation means that given a fixed basis, there is only a single way to describe a vector in this space. Completeness means that every vector can be represented without loss of information. In my last post, we looked at the truncated SVD – which is not complete, because we lose some original information and there is no way how we can represent it back.

What is quite interesting is that we can go further than vector basis and go towards overcomplete frames (let’s simplify – a lot – for the purpose of this post and say that frame is just “some generalization” of a basis). This means that we have more vectors in space than the “rank” of the information we want to store. Overcomplete frames can’t be orthogonal or unique (I encourage readers for whom this is a new concept to try to prove it – even if just intuitively). Sounds pretty useless for information compression? Not really, let me give here a visual example:

Example of a set of points/vectors that cannot be represented well by any two orthogonal vectors, but on the other hand three frame vectors describe this set pretty well.

Note that while in the case of overcomplete frame every point can be described in infinitely many ways (just like x + y = 5 has infinitely many solutions of x and y values that satisfy it), we usually are going to look for a sparse representation. We want as many of the components to be zero (or close to zero enough so that we can discard them, accepting some errors), and the input vectors described by a linear combination of less frame vectors than the size of the original basis.

If this description seems vague, looking at the above example – with complete basis, we generally need two numbers to describe the input point set. However, with an overcomplete frame, we can use just a single 2-bit basis index and a float value describing magnitude along it:

Sparse representation in a “proper” overcomplete frame can approximate some input data pretty well.

Yes, we lose some information, and there is an approximation error – but there is always an error in the case of lossy compression – and the error in such a basis is way smaller than from any truncated SVD (keeping a single dimension of orthogonal basis).

Terminology wise – we commonly refer to a frame vector as atom, and collection of atoms as dictionary – and I will use atoms and dictionary mostly throughout this post.

Dense vs sparse representation – a dense representation has a minimal number of dimensions (rank) of the set, but every item represented in it is going to be represented by multiple coefficients. With sparse representation, we might have many more dimensions, but every item is going to be represented by a small subset of those – most coefficients are zero.

Also note – we will come back to this later in the next section – I have used here a 2D example and a 1-sparse representation (single atom to represent input) – but with higher dimensions, we can (and in practice do) use multiple atoms per vector – so if you have 16 atoms, any representation that uses 1-15 of them can be considered sparse. Obviously if the input space is 3D, and your dictionary has 20 atoms, while a 10 atom representation of a vector “could” be seen as sparse, it probably won’t make sense for most practical scenarios.

Sparsity and images – palettes and tiles

Now I admit, this 2D case is not super compelling. However, things start to look much more interesting in some much higher dimensions. Imagine that instead of storing 20 numbers for a 20 dimensional vector, you could store just a few indices and a few numbers – in very high dimensions, sparse representations get more and more interesting. Before going towards high dimensional case, let me just show some interesting connection – what happens if for a single pixel of e.g. RGB image we store only index (not even a weight) from a dictionary? Then we end up with an image palette.

Example images compressed with palettization and their palettes – source: wikipedia, author: Ricardo Cancho Niemietz

Image palettization is an example of a sparse representation method! One of many algorithms to compute image palette it is k-means, which is a more basic version of k-SVD that I’m going to describe later.

Palettization generalizes to many dimensions – if you have a PBR texture set with 20 channels, you can still palettize it – just your entries will be 20 dimensional vectors.

However, we can go much further and increase the dimensionality (hoping for more redundancy in more dimensions) much further. Let’s have a look at this Kodak dataset image and at image tiles:

Notice in colored rectangles how much self similarity there is inside this image!

There is lots of self-similarity in this image – not just within pixel values (those would palettize very well), but among whole image patches! In such a scenario, we can take a whole patch to become our input vector. I have mentioned in the past that thinking beyond “image is 2D matrix / 3D tensor” is one of the most important concepts in numerical methods for image processing, but as a recap, we can take an RGB image and instead of a tensor, represent it by a matrix:

Representing and image or image patch as a matrix.

Or we can go even further, and represent it as a single vector:

Representing and image or image patch as a vector.

So you can turn a 16×16 RGB tile into a single 768 dimensional vector – and with those 16×16 tiles, a 512 x 512 x3 (RGB) image into a 32×32 image of 768 vectors. This takes time to get used to, but if you internalize this concept (and experiment with it yourself), I promise it will be very useful!

What are sparse methods useful for?

Sparse methods were (and to a lesser extent still are) a “hot” research topic because of a simple “philosophical” observation – the world we live in, and the signals we deal with (audio, images, 3D representation), are in many ways sparse and self-similar.

A 100% random image would not be sparse, but images that we capture with cameras or render are in some domains.

A random image is not sparse – there is no structure or self-similarity to it. On the other hand, natural images are – notice lots of self similarities, flat regions, simple geometric constructs like edges, corners, textures.

I will not present any proofs of it (also personally I haven’t seen one that would be both rigorous and not handwavy, while staying intuitive), however let me describe some implications and use-cases of it.

Compression

Very straightforward and the use-case I am going to focus on – if instead of storing full “dense” information of a large, multidimensional vector (like image tile), we can instead store a dictionary, and indices of atoms (plus their weights). If the compressed signal was truly sparse in some domain, this can result in massive compression ratios.

Denoising

What happens when we add white noise to a sparse signal? It’s going to become non-sparse! Therefore by finding best possible sparse approximation of it, we are going to effectively decompose it into sparse and non-sparse component, which is going to be discarded and hopefully, corresponds to noise.

Crude, quickly coded k-SVD based denoising of a very noisy signal. While the output image doesn’t look good (things like blended, overlapping tiles or varying atom count per tile are a must), it is a demonstration that it’s possible to remove most of the noise!

Super-resolution

Principle of self-similarity and sparsity can be used for super-resolution of low-resolution, aliased images. Different “phases” of the sampled aliased signal produce different samples, but if we assume there is an underlying common shared representation, we could find it and reconstruct them.

Aliased triangle and different tiles corresponding to the same original shape.

In this example if we correctly located all the aliased tiles and be certain that single, same high resolution atom corresponds to all of them, then we’d have enough information to remove most of the aliasing and reconstruct the high resolution edge.

Computer vision

Computer vision tasks like object recognition or object classification don’t work very well with hugely dimensional data like images. Almost all computer vision algorithms start by constructing some representation of the image tile – corner, line detection. While in the “old times” those features were hard-coded, nowadays state of the art relies on artificial neural networks with convolutional layers – and the convolutional layers with different filter banks jointly learn the dictionary (edges, corners, higher level patterns), as well as use of this dictionary data (I will mention this connection around the end of this post).

Compressive sensing / compressed sensing

This is a really wild, profound idea that I have learned about just this year and definitely would like to explore more. If we assume that the world and signals we sample are sparse in “some” domain and we know it, then we can capture and transmit much less information (orders of magnitude less!), like random subsamples of the signal. Later we can recompute the missing pieces of the information by forcing sparsity of the reconstruction in this other domain (in which we we know the signal is sparse). This is really interesting and wild area – think about all the potential applications in signal/image processing, or in computer graphics (strongly subsampling expensive raytracing or GI and easily reconstructing the full resolution signal).

Note: Sparsity of simple, orthogonal basis

It is important to emphasize that you don’t need any super fancy overcomplete dictionary to observe signal sparsity. Even simple, orthogonal basis can be sparse – for example Fourier transform, DCT, image gradient domain, decimating wavelets – and many of them are actually very good / close to optimal in “general” (average of different possible images) scenarios. This is a base for many other compression algorithms.

Natural images are usually sparse in Fourier domain.
Left: original image, Right: Only 10% of the largest coefficients in Fourier domain = immediate 10x compression ratio. The output doesn’t look great perceptually (ringing, separate treatment of color channels = color shifts; both can be avoided easily), but this was close to zero consideration or effort!

kSVD for texture and material compression

Later in this post, I am going to describe the k-SVD algorithm – but first, let’s look at use of sparse, dictionary representations and learning for PBR material compression.

I have stated the problem and proposed a simple solution in my previous blog post. As a recap, PBR materials consist of multiple textures representing different physical properties of a surface of a material. Those properties represent surface interaction with light through BRDF models. Different parts of that texture might correspond to different surface types, but usually large parts of the material correspond to the same surface with the same physical properties. I have demonstrated how those properties are globally correlated. Let’s have a look at it again:

An example PBR material, credit cc0textures, Lennart Demes. There are 5 different textures, but each texture corresponds to the same physical location.

After my longish introduction, it might be already clear that we can see lots of self-similarity in this image and very similar local patches. Maybe you can even imagine how those atoms will look like – “here is an edge”, “here is a stone top”, “here is a surface between the stones” that repeat themselves across the image.

It seems that it should be possible to reuse this “global” information and compress it a lot. And indeed, we can use one of sparse representation learning algorithms, kSVD to both find the dictionary, as well as fit the image tiles to it. I will describe the algorithm itself below, but it was used in following way:

  • Split the NxN image into n x n sized tiles after subtracting image mean,
  • Set the dictionary size to contain m atom tiles,
  • Run kSVD to learn the dictionary and allowing for up to k atoms per tile,
  • Recombine the image.

Just for the initial demo (not optimal parameters, I selected the dictionary to contain 8×8 tiles, dictionary sized like 1/32th of the original image, 8 atoms per tile). Afterwards, the texture set will look like this:

Top: reconstruction, middle: reference, bottom: abs error.

I’m presenting it in a low resolution as later I will look more into proper comparisons and the effect of some parameters. From such a first look, I’d say it doesn’t look too bad at all given that per every tile instead of storing 8x8x9 numbers, we store only 9 indices and 9 weights, plus the dictionary is only 1/32th of the orignal image!

But more importantly, how do the atoms / dictionary look like? Note: atoms are normalized across all channels and we have subtracted the image mean; I am showing absolute value of each pixel of each atom:

Dimension-wise slices (visualization) of unsorted atoms from fitting a dictionary to a PBR texture set.

Ok, this looks a bit too messy for my taste. Let’s sort them based on similarity (greedy, naive algorithm: start with the first tile, find most similar to it, then proceed to the next tile until you run out of tiles):

Dimension-wise slices (visualization) of sorted atoms from fitting a dictionary to a PBR texture set.

Important note: from the above 5 differently colored subsets, first 8×8 tile of each of them corresponds to the same atom. They are sliced across different PBR properties for visualization purpose only (I don’t know about you, but I don’t like looking at 9 dimensional images😅).

This is hopefully clearer and also shows that there are some clear “categories” of patches and words in our dictionary: some “flat” patches with barely any texture or normal contribution at all, some strong edges, and some high frequency textured regions. An edge patch will show an edge both in normal-map as well as the AO, while a “textured” patch will show texture and variation in all of the channels. Dictionary learning extended correlation between global color channels to correlation of full local structures. I think this is pretty cool and the first time I saw it running I was pretty excited – we can automatically extract quite “meaningful” (even semantically!) information and how the algorithm can find some self-similarities, repeating patterns and use it to compress the image a lot.

I will come back to analyzing the compression and quality (as it depends on the parameters), but let’s now look at how the k-SVD works.

k-SVD algorithm

I have some bad news and some good news. The bad news are that finding optimal dictionary and atom assignments is NP-Hard problem. ☹ We cannot solve it for any real problems in reasonable time. The good news is that there are some approximate/heuristic algorithms that perform well enough for practical applications!

k-SVD is such an algorithm. It is a greedy, iterative, alternating optimization algorithm that is made of the following steps:

  1. Initialize some dictionary,
  2. Fit atoms to each vector,
  3. Update the dictionary given current assignment
  4. If satisfied with the results, finish. Otherwise, go back to step 2.

This going between steps 2 and 3 is the “alternating” part, and repeating it is “iterative”.

Note that there are many variations of the algorithm and improvements in many follow-up papers, so I will look at some flavor of it. 🙂 But let’s look at each step.

Dictionary initialization

For the dictionary initialization, you can apply many heuristics. You can start with completely random atoms in your dictionary! Or you can take some random subset of your initial vectors / image tiles. I have not investigated this properly (I am sure there are many great papers on initialization that maximizes the convergence speed), but generally taking some random subset seems to work reasonably. Important note is that you’d want your atoms to be “normalized” across all channels – you’ll see why in the next step.

Atom assignment

For the atom assignment, we’re going to use Orthogonal Matching Pursuit (OMP). This part is “greedy” and not necessarily optimal. We start with the original image tile and call it a “residual”. Then we repeat the following:

  1. Find the best matching atom compared to residual. To find this best matching atom according to squared L2 norm, we can simply use dot product. (Think and try to prove why the dot product – cosine similarity of normalized vectors – ordering is equivalent to the squared L2 norm, subject to offset and scale).
  2. Looking at all atoms found so far, do a least squares fit against the original vector.
  3. Update the residual.

We repeat it as many times as the desired atom count, or alternatively until error / residual is under some threshold.

As I mentioned, this is a greedy and not necessarily optimal part. Why? Look at the following scenario where we have a single 4 dimensional vector, 3 atoms, and allow max 2 atoms per fit:

Given our yellow vector to fit, at the first step if we find the most similar atom, it is going to be the orange one – the error is the smallest. Then we would find the green one, and produce some combination that would produce 2 last dimensions too high/too low – a non-negligible error. By comparison, optimal, non-greedy assignment of 2 atoms “red” and “green” would produce zero error!

Greedy algorithms are often reasonable approximations that run fast enough (the decision made is only locally optimal, so we can consider very small subset of solutions), but in many cases are not optimal.

Dictionary update

In this pass, we assume that our atom assignment is “fixed” and we cannot do anything about it – for now. We go even further, as we look at each atom one by one, and assume that all the other atoms are also fixed! Given almost everything fixed, we optimize: given this assignment and all the other atoms being fixed, what is the optimal value for this atom?

So in this example we will try to optimize the “red box” atom, given two “purple box” items using it.

The way we approach it is solving the following: if there was no this atom contribution, how can we minimize the error of all the atoms using it? We look at all errors of every single item using this atom and try to find some single vector that multiplied by different weights will maximize the removal of average of those errors (so matching it).

In a visual form, it will look something like this:

Updating a single atom to approximate all the residuals of multiple items is simply a SVD rank1 decomposition! One vector will be the locally optimal atom, while the orthogonal vector will be a weight vector.

If you are reader of this blog, I hope you recognized that we are looking for a separable, rank 1 approximation of the error matrix! This is where the SVD of k-SVD comes from – we use truncated SVD and compute a rank 1 approximation of the residual. One of separable vectors will be our atom, the other one – updated weights for this atom in each approximated vector.

Note that unlike the previous step, which could lead to unoptimal results, here we have at least a guarantee that each step will improve the average error. It can increase the error of some items, but on average / in total it will always be decreased (or it would have been the same if SVD would found that the initial atom and the weight assignment were already minimizing the residual).

Also note that updating the atoms one by one and updating the weights / residuals at the same time makes it similar to Gauss-Seidel method – resulting in a faster convergence, but at a cost of sacrificed parallelism.

And this is roughly it! We alternate the steps of assigning atoms and updating them until we are happy with the results. There are many nitty-gritty details and follow-up papers, but described like this, kSVD is pretty easy to implement and works pretty well!

How many iterations do you need? This is a good question and probably depends on the problem and those implementation details. For my experiments I was using generally 8-16 iterations.

Different parameters, trade-offs, and results

Note for this section – I am clearly not a compression expert. I am sure there are some compression specific interesting tricks and someone could do a much better analysis – I will not pretend to be competent in those. So please treat it as mostly theoretical analysis from a compression layman.

I will analyze all three main parameters one by one, but first, let’s write down a formula for general compression/storage cost:

Image tiles themselves – k indices, k weights for each of the tiles. Indices need as many bits as log2 of the number of atoms m in the dictionary, and the weights need to be quantized to b bits (which will affect the final precision/error). Then there is the dictionary, which is going to be simply m of n*n sized tiles, each element containing original number of dimensions. So the total storage is going to be:

It’s worth thinking about comparing it to a “break even” ratio, so simply N*N*d. Ignoring the image, dictionary break even is m <= (N/n)^2. Ignoring the dictionary, image break even is:

Seeing those formulas, we can analyze the components one by one.

First of all, we need to make sure that m is small enough so that we actually get any compression (and our dictionary is not simply the input image set 🙂 ). In my experiments I was setting it to 1/32-1/16 of N^2 total tiles of the input for larger tiles, or simply fixed small value for very small tiles.

Then assuming that m * n^2 is small enough, we focus on the image itself. For better compression ratios we would like to have tile size n as large as possible without blowing this earlier constraint – this is a quadratic term. As usually, think about the extreme: a tile 1×1 is the same as no tiling/spatial information at all (still might offer a sparse frame alternative to SVD discussed in my previous post), and large tiles means that our “palette” elements are huge and there are few of them.

Generally, for all the following experiments, I looked only at very significant compression ratios (mentioned dictionary itself being 3-6% size of the input) and visible / relatively large quality loss to be able to compare the error visually.

People more competent in compression will notice lots of opportunities for improvements – like splitting indices from weights, deinterleaving the dictionary atoms, exploiting the uniqueness of atom indices etc. – but again, this is outside of my expertise.

Because of that, I haven’t done any more thorough analysis of the “real” compression ratios, but let’s compare a few results in terms of image quality – both subjective, and objective numbers.

Tile size 1, 3 atoms per pixel, dictionary 32 atoms -> PSNR 34.9

Top: reconstruction, middle: reference, bottom: abs error.
Single pixel element dictionary is not the most fascinating or insightful one. 🙂

This use-case is the simplest, as it is a sparse equivalent of the SVD we have analyzed in my previous post. Instead of dense 6 weights per pixel (to represent 9 features), we can get away with just 3 and just 32 elements in the dictionary, we can a pretty good SNR! Note however the problem (will be visible on those small crops in all of our examples) – tiny green grass islands disappear, as they occupy tiny portion of the data and don’t contribute too much to the error. This is also visible in the dictionary – none of the elements is green. Apart from it, the compressed image looks almost flawless.

Tile size 2, 4 atoms per tile, dictionary 256 atoms -> PSNR 32.16

Top: reconstruction, middle: reference, bottom: abs error.

In this case, we switch to a tiled approach – but with tiny, 2×2 tiles. We have as many weights per tile, as there are pixels per tile – but pixels were originally 8 channel and now we have per atom only a single weight. The PSNR / error are still ok, however the heightmap starts to show some “chunkiness”. In the dictionary, we can slowly start to recognize some patterns – and how some tiles represent only low frequency, while other higher frequency patterns! The exception is the heightmap – it wasn’t correlated too much with the other channels details, which explains errors and lack of high frequency information. A larger dictionary or more atoms per tile would help to decorrelate those.

Tile size 4, 8 atoms per tile, 6% of the image in dictionary -> PSNR 32.76

Top: reconstruction, middle: reference, bottom: abs error.
Note: this is only a subset of the dictionary.

4×4 tiles is an interesting switching point as we can compress dictionary with the hardware BC compression. We’d definitely want that, as the dictionary gets large! Note that we get less atoms per tile than there were pixels in the first place! So it’s not only compressing the channel count, but also less than actual pixel count. The dictionary looks much more interesting – all those tiny lines, edges, textured areas.

Tile size 8, 12 atoms per tile, 6% of the image in dictionary -> PSNR 28.11

Top: reconstruction, middle: reference, bottom: abs error.
Note: This is only a subset of the dictionary.

At this point, the compression got very aggressive (per each 8×8 pixels of the image, we store only 12 weights and indices, that both can be quantized!) – and the tiles are the same size as JPG ones. Still, visually looks pretty good, and the oversmoothing of normals shouldn’t be an issue in practice – but if you look very closely at the height map, you will notice some blocky artifacts. The dictionary itself shows a lot of variety of patches and content type.

Performance – compression and decompression

Why are dictionary based techniques not used more widely? I think the main problem is that the compression step is generally pretty slow – especially the atom fitting step. Per each element and per each atom, we are exhaustively searching for the best match, performing a least squares fit, recomputing the residual, and then searching again. Every operation is relatively fast, but there are many of them! How slow it is? To generate the examples in my post, colab was running in a few minutes per result – ugh. On the other hand, this is a Python implementation, with brute-force linear search. Every single operation in the atom fitting part is trivially parallelizable, and can be accelerated using spatial acceleration structures. I think that it could be made to run a few seconds / up to a minute per a texture set – which is not great, but also not terrible if not part of an interactive workflow, but an asset bake.

How about the decompression performance? Here things are much better. We simply fetch atoms one by one and add their weighted contributions. So similar to SVD based compression, this becomes just a series of multiply-adds and the cost is proportional to the atom count per tile or pixel. We incur a cost of indirection and semi-random access per tile, which can be slow (extra latency), however when decompressing larger tiles, this is very coherent across many neighboring pixels. Also, generally dictionaries are small so should fit in cache. I think it’s not unreasonable to also try to compress multiple similar texture sets using a single dictionary! In such a case when drawing sets of assets, we could be saving a lot on the texture bandwidth and cache efficiency. Specific performance trade-offs depend on the actual tile sizes, and for example whether block compression is used on the dictionary tiles. Generally, my intuition says that it can be made practical (and I’m usually pessimistic) – and Bo Li’s mention of using VQ for shadowmask (de)compression confirms it.

Relationship to neural networks

When you look at the dictionary atoms above, and at visualizations of some common neural networks, there is some similarity.

Neural network filters learned for image classification from a classic paper by Krizhevsky et al.

This resemblance might seem weak, but the network was trained on all kinds of natural images (and for a different task), while our dictionary was fit to a specific texture set. Still, you can observe some similar patterns – appearance of edges, fixed color patches, more highly textured regions. It also resembles some other overcomplete frames like Gabor wavelets. I think of it intuitively that as the network also learns some representation of the image (for the given task), those neural network features are also a overcomplete representation, and however they are not sparse, some regularization techniques can partially sparsify them.

Neural networks have mostly replaced sparse methods as they decoupled the slow training from (relatively) fast interference, however I think there is something beautiful and very “interpretable” about the sparse approaches, and when you have already learned a dictionary and don’t need to iterate, for many tasks they can be attractive and faster.

Conclusions

My post aimed to provide some gentle (not comprehensive, but rather a more intuitive and hopefully inspiring) introduction to sparse and dictionary learning methods and why they can be useful and attractive for some image compression and processing tasks. The only novelty is a proposal of using those for compressing PBR textures (again I want to strongly emphasize the idea of exploiting relationships between the channels – if you compress your PBR textures separately, you are throwing out many bits! 🙂 ), but I hope that you might find some other interesting uses. Compressing animations? Compressing volumetric textures and GI? Filtering of partial GI bakes (where one part of level is baked and we use its atoms to clean up the other one)? Compressing meshes? Let me know what you think!

Posted in Code / Graphics | Tagged , , , , , , , , , , | 6 Comments

Dimensionality reduction for image and texture set compression

Teaser image: Example PBR Texture set compressed with presented technique.Textures credit cc0textures, Lennart Demes.

In this blog post I am going to describe some of my past investigations on reducing the number of channels in textures / texture sets automatically and generally – without assuming anything about texture contents other than correspondence to some physical properties of the same object. I am going to describe the use of Singular Value Decomposition (a topic that I have blogged about before) / Principal Component Analysis for this purpose.

The main use-case I had in mind is for Physically Based Rendering materials with large sets of multiple textures representing different physical material properties (often different sets per different material/object!), but can be extended to more general images and textures – from simple RGB textures, to storing 3D HDR data like irradiance fields.

As usually, I am also going to describe some of the theory behind it, a gentle introduction on a “toy” problems and two examples on simple RGB textures from Kodak dataset, before proceeding to the large material texture set use-case.

Note: There is one practical, significant caveat of this approach (related to GPU block texture compression) and a reason why I haven’t written this post earlier, but let’s not get ahead of ourselves. 🙂

This post comes with three colabs: Part 1 – toy problem, Part 2 – Kodak images, Part 3 – PBR materials.

Intro – problem with storage of physically based rendering materials

I’ll start with a confession – this is a blog post that I was aiming to write… almost 3 years ago. Recently a friend of mine asked me “hey, I cannot find your blog post on materials dimensionality reduction, did you end up writing it?” and the answer is “no”, so now I am going to make up for it. 🙂

The idea for it came when I was still working on God of War and one of the production challenges was disk and memory storage cost for textures and materials. Without going into details on what were the challenges of our specific material system and art workflows, I believe this is a challenge for all contemporary AAA video games – simply physically based rendering type of workflows requires a lot of textures – stressing not only the BluRay disk capacities, but also the gamers are not happy about downloading tens of gigabytes of just patches…

To list a few example different textures used per a single object/material (note: probably no single material used all of them, some of them are rare, but I have seen all in production):

  • Albedo textures,
  • Specularity or specular color textures,
  • Gloss maps,
  • Normal maps,
  • Heightmaps for blending or parallax mapping,
  • Alpha/opacity maps for blending,
  • Subsurface or scattering color,
  • Ambient occlusion maps,
  • Specular cavity maps,
  • Overlay type detail maps.

Uff, that’s a lot! And some of those use more than one texture channel! (like albedo maps that use RGB colors, or normal maps that are usually stored in two or three channels). Furthermore, with increasing target screen resolutions, textures also need to be generally larger. While I am a fan of techniques like “detail mapping” and using masked tilers to avoid the problem of unique large textures everywhere, those cannot cover every case – and the method I am going to describe applies to them as well.

All video games use hardware accelerated texture block compression, and artists manually decide on the target resolution of different maps. It’s time consuming, not very creative, and fighting those disk usage and memory budgets is a constant production struggle – not just for the texture streaming or fitting on a BluRay disk, but also the end users don’t want to download 100s of GBs…

Could we improve that? I think so, but before we go to the material texture set use-case, let’s revisit some ideas that apply to “images” in more general sense.

Texture channels are correlated

Let’s start with a simple observation – all natural images have correlated color channels. “Physical” explanation for it is simple – no matter what is the color of the surface, when it is lit, its color is modulated by the light (in fact by the light spectrum and after interacting with its surface through BRDF, but let’s keep our mental model simple!). This means that all of those color channels will be at least partially correlated, and light intensity / luminance modulates whatever was the underlying physical surface (and vice versa).

What do you mean by “correlated” color channels and how to decorrelate them?

Let’s look at a contrived example – list of points representing pixels in a two color (red and green) channel texture to keep visualization in 2D. In this toy scenario we can observe a high correlation between red and green in the original red/green color space. What does this mean? If the value of green channel brightness is high, it is also very likely that the red channel will is also high. On the following plot, every dot represents a pixel of an image positioned in the red/green space on the left:

Left: scatter plot of points in our example in original, RG space. Right: scatter plot of the same points when “decorrelated” by projecting them onto luminance axis.

On the right plot, I have projected those on the blue line, representing something like luminance (overall brightness).

We can see that from the original signal which was changing a lot on both axis, after projecting onto luminance axis, we end up with significantly less variation and no correlation on the perpendicular “chrominance” (here defined as simple difference of red and green channelvalues) axis. Decorrelated color space means that values of a single channel are not dependent or linearly related to the values of another color channel. (There could be a non-linear relationship, but I am not going to cover it in this post and it becomes significantly more difficult to analyze, requiring iterative or non-linear methods.)

This property is useful for compression, as while the original image needed for example 8 bits for both red and green color channels, in the decorrelated space we can store the other axis with a significantly smaller precision – there is much less variation, so possibly 1-2 bits would suffice, or we could store it in smaller resolution.

This is one of the few reasons why we use formats like YUV for encoding, compressing and sending images and videos. (As a complete offtopic, three other reasons that come to my mind right now are also interesting – human visual perception system being less perceptive to color changes and allowing for more compression of color; chroma varying more smoothly (less high frequencies) in natural images; and finally for now obsolete now historical reasons, where old TVs needed “backwards compatibility” requiring transmitted signal to be possible to view on black&white old TVs ).

Let’s have a look at two images from Kodak dataset in RGB and YUV color spaces. I have deliberately picked one image with many different colors (greenish water, yellow boat, red vests, blue shorts, green and white tshirts), and the other one relatively uniform (warm-green white balance of an image with foliage):

RGB image and Y, U, and V color channels.
RGB image and Y, U, and V color channels.

It’s striking how much of the image content is encoded just in the Y, luminance color channel! You can fully recognize the image from the luminance, while chrominance is much more “ambiguous”, especially on the second picture.

But let’s look at it more methodically and verify – let’s plot the absolute value of the correlation matrices of color channels of those images.

Correlation matrices of two Kodak images / image crops in RGB space. The one with less colors (right) has almost perfectly correlated color channels! The other one (left) still has very strong correlation, but only between some channel pairs.

This plot might deserve some explanation: Value of 1 means fully correlated channels (as expected on the diagonal – variable is fully correlated with itself).

Value of zero means that two color channels don’t have any linear correlation (note – there could be some other correlation, e.g. quadratic!) and the matrices are symmetric, as the red color channel has same correlation with the green as the green with the red one.

First picture, with more colors has significantly less correlation between color channels, while the more uniformly colored one shows very strong correlation.

I claimed that YUV generally decorrelates color channels, but let’s put this to a test:

Absolute value of correlation matrices of two Kodak images / image crops in YUV space. Value of 1 means fully correlated or negatively correlated values (as expected on diagonal – variable is fully correlated with itself). The one with less colors (right) after conversion to YUV has almost no cross channel correlation The other one (left) has some decorrelation, but it is not as strong as the first example.

The YUV color space decorrelated more colorful image to some extent, while the less colorful one got decorrelated almost perfectly.

But let’s also see how much information is in every “new” channel with displaying absolute covariances (one can think of covariance matrix as correlation matrix multiplied by variances):

Covariance matrices of the two YUV converted Kodak images. Covariance is a pretty good measurement of “information” as also shows us how much variation is there in each color channel, and in relationship between them.

In those covariance plots we can verify our earlier intuition/observation that the luminance represents the image content/information pretty well and most of it is concentrated there.

Now, the Y luminance channel is created to roughly correspond to perception of the brightness of the human vision. There are some other color spaces like YCoCg that are designed for more decorrelation (as well as better computational efficiency).

But what if we have some correlation, but not along this single axis? Here’s an example:

Despite strong linear correlation, projecting onto fixed axis only partially decorrelates color channels.

Such a fixed axis doesn’t work very well here – we have somewhat reduced the ranges of variation, but there is still some “wasteful” strong linear correlation…

Is there some transformation that would decorrelate them perfectly while also providing us with the “best” representation for a given subset of components? I am going to describe a few different “angles” that we can approach the problem from.

Line fitting / least squares vs dimensionality reduction?

The problem as pictured above – finding best line fits – is simple. In theory, we could do some least squares linear fit there (similarly to what I described in my post on guided image filters). However, I wanted to strongly emphasize that line fitting is not what we are looking for here. Why so? Let’s have a look at the following plot:

Difference between least squares line fitting and PCA/SVD. The first one seeks smallest sum of squared errors in the original y space, while the latter obtains a line with the smallest sum of squared errors of the projection.

Least squares line fitting will minimize the error and fit for the original y axis (as it is written to minimize the average “residual” error a*x+b – y), while what we are looking for is a projection axis with smallest error when the point is projected on that line! In many cases those lines might be similar, but in some they might be substantially different.

Here’s an example on a single plot (hopefully still “readable”):

Least squares line fitting – blue line – might obtain substantially different results from optimal projection axis – orange line.

For the case of projection, we want to minimize the component perpendicular to the projection axis, while linear line fitting / linear least squares minimize the error measured in the y dimension. Those are very different theoretical (and also practical when the number of dimensions grows) concepts!

In the earlier plot I mentioned name PCA, which stands for Principal Component Analysis. It is a technique for dimensionality reduction – something that we are looking for.

Instead of writing how to implement PCA with covariance matrices etc., (it is quite simple and fun to derive it using Lagrange multipliers, so might write a separate post on it in future) I wanted to look at it from a different perspective – of Singular Value Decomposition, which I blogged about in the past in the context of filter separability

I recommend that post of mine if you have not seen it before – not because of its relevance (or my arrogance), but simply as it describes a very different use of SVD. I think that seeing some very different use-cases of mathematical tools, concepts, and their connections is the best way to understand them, build intuition and potentially lead to novel ideas. As a side note – in linear algebra packages, PCA is usually implemented using SVD solvers.

Representing images as matrices – image doesn’t have to be a width x height matrix!

Before describing how we are going to use SVD here, I wanted to explain how we want to represent N-channel images by matrices.

This is something that was not obvious to me, and even after seeing it, defaulting to thinking about it in different ways took me years. The “problem” is that usually in graphics we immediately think of an image as a 2D matrix – width and height of the image get represented with matrix width and height; simple adding color channels immediately poses a problem, as it requires a 3D tensor. This thinking is very limited, only one of few possible representations and arguably a not very useful one. It took me quite a while to stop thinking about it this way and it is important even for “simple” image filtering (bilateral filter, non-local means etc.).

In linear algebra, matrices represent “any” data. How we use a matrix to represent image data is up to us – but most importantly, up to the task and goal we want to achieve. Representation that we are going to use doesn’t care for spatial positions of pixels – they don’t even have to be placed on an uniform grid! Think of them just as “some samples somewhere”. For such use-cases, we can represent the image as all different pixel positions (or sample indices) on one axis, and color channels on the other axis.

Two equivalent image representations – as a w x h x c 3D tensor, or as a w*h x c 2D matrix.

This type of thinking becomes very important when talking about any kind of processing of image patches, collaborative filtering etc. – we usually “flatten” patch (or whole image) and pack all pixels on one axis. If there is one thing that I think is particularly important and worth internalizing from my post – it is this notion of not thinking of images as 2D matrices / 3D tensors of width x height pixels.

Important note – as long as we remember the original dimensions and sample locations, those two representations are equivalent and we can (and are going to!) transform back and forth between them.

This approach extends to 3D (volumetric textures), 4D (volumetric videos) etc. – we can always pack width * height * depth pixels along one axis.

SVD of image matrices

We can perform the Singular Value Decomposition of any matrix. What do we end up with for the case of our image represented as a w*h x c matrix?

(Note: I tried to be consistent with ordering of matrices and columns / rows, but it’s usually something that I trip over – but in practice it doesn’t matter so much as you can transpose the input matrix and swap the U and V matrices.)

The SVD is going to look like:

Singular value decomposition of our image data. We end up with three separate matrices, U, S, V where U matrix is the same size as the input image, singular value matrix S can be seen as a c x c diagonal matrix (depending on used interpretation there might be more rows, but the rest entries are zero), and a c x c V matrix.

Oh no, instead of a single matrix, we ended up with three different matrices, one of which is as big as the input! How does this help us?

Let’s analyze what those matrices represent and their properties. First of all, matrix U represents all of our pixels in our new color space. We still have C channels, but they have different “meaning”. I used color coding of “some” colors to emphasize that those are not our original colors anymore.

Those new C channels contain different amount of energy / information and have different “importance”. Amount of it is represented in the diagonal S matrix, with decreasing singular values. The first new color channel will represent the largest amount of information about our original image, the next channel less etc. (Note the same used colors as in matrix U).

Finally, there is a matrix V that transforms between the new color channels space, and the old one. Now the most important part – this transformation matrix is orthogonal (and even further, orthonormal). It means that we can think of it as a “rotation” from the old color space, to the new one, and every next dimension provides different information without any “overlap” or correlation. I used here “random” colors, as those represent weights and remappings from the original, RGBA+other channels, to the new vector space.

Now, together matrices T and S inform us about transforming image data into some color space, where channels are orthogonal, and sorted by their “importance”, and where first k components are “optimal” in terms of contained information (see Eckart–Young theorem).

Edit / bonus section: A few readers have commented in a thread on twitter (and on Facebook) that while Eckart-Young is from perspective of linear algebra, usually image compression and stochastic processes analysis use the name Karhunen–Loève theorem/transform. It is a concept that is new to me (as I have not really worked with neither of those fields) and I have to and will be happy to read more about it. Thanks to everyone for pointing out this connection, always excited to learn some new concepts!

Think of it as rotating input data to a perfectly decorrelated linear space, with dimensions sorted based on their “importance” or contribution to the final matrix / image, and fully invertible – such representation is exactly what we are looking for.

How does it relate to my past post on SVD?

A question that you might wonder is – in my previous post, I described using SVD for analyzing separability and finding approximations of 2D image filters – how does this use of SVD relate to the other?

We have looked in the past at singular components, sorted by their importance, combined with single rows + columns reconstructing more and more of the original filter matrix.

Here we have a similar situation, but our U “rows” are all pixels of the image, every one with single channel, and V “columns” are combinations of the original input channels that transform to this new channel space. Just like before, we can add singular values multiplied by the corresponding vectors one by one, to get more and more accurate representation of the input (no matter whether it’s a filter matrix or a flattened representation of an image). The first singular value and color combination will tell us the most about the image, the next tells a bit less less information etc. Some images (with lots of correlation) have huge disproportion of those singular values, while in some other ones (with almost no correlation) they are going to be decaying very slowly.

This might seem “dry” and is definitely not exhaustive. But let’s try to build some intuition for it by looking at the results of SVD on our “toy” problem. If we run it on toy problem matrix image data as it is, we will immediately encounter a limitation that is easy to address:

Left: original data with a plotted yellow line corresponding to the “luminance” and a green line representing new first “dimension” of the SVD. Center: Same data converted to luminance/chrominance. Right: Same data when transformed by SVD matrix V and S.

The line corresponding to the first singular value passes nicely through the cluster or our data points (unlike simple “luminance”), however clearly doesn’t decorrelate points – we still see some simple linear relationship in the transformed data.

This is a bit disappointing, but there is a simple reason for it – we ran SVD on original data, fitting a line/projection, but also forcing it to go through original zero!

Usually when performing PCA or SVD in the context of dimensionality reduction, we want to remove the mean of all input dimensions, effectively “centering” the data. This way we can “release” the line intercept. Let’s have a look:

Left: original data with a plotted yellow line corresponding to the SVD without mean subtraction, and a green line representing new first “dimension” of the SVD with “centering” of the data before fitting. Center: Uncentered SVD shows some remaining linear correlation and value range skew away from zero. Right: Same data with “centered” SVD shows perfect decorrelation and a significant range reduction of the second dimension, while more information is represented in the first dimension.

As presented, shifting the fit center to the mean of the data, shifts the line fit and achieves perfect “decorrelating” properties. PCA that runs on covariance matrices does this centering implicitly, while in the case of SVD we need to do it manually.

In general, we would probably want to always center our data. For compression-like scenarios it has an additional benefit that we need to compress only the “residual”, so difference between local color, and the average for the whole image, making the compression and decompression easier at virtually no additional cost.

In general, many ML applications go even step further, and “standardize” the data, bringing it to the same range of values by dividing them by their standard deviations. For many ML-specific tasks this makes sense, as for tasks like recognition we don’t really know the importance of ranges of different dimensions / features. For example if someone was writing a model to recognize a basketball player based on their weight, height, and maybe average shot distance, we don’t want to get different answers depending on whether we enter the player’s weight in kilograms vs pounds!

However for tasks like we are aiming for – dimensionality reduction for “compression” we want to preserve the original ranges and meanings, and can assume that they are already scaled based on their importance. We can take it even further and use the “importance” to guide the compression, and I will describe it briefly around the end of my post.

SVD on Kodak images

Equipped with knowledge on how SVD applies to decorrelating image data, let’s try it on the two Kodak pictures that we analyzed using simple YUV transformations.

Let’s start with just visualizing both images in the target SVD color space:

Results of projecting images into a color space resulting from Singular Value Decomposition. Note that it seems to be very similar to the YUV color space!

Results of projection seem to be generally very similar to our original YUV decomposition, with one main difference is that our projected color channels are “sorted”. But we shouldn’t just “eyeball” those, instead we can analyze this decomposition numerically.

Covariance matrices and singular values

Comparing covariance matrices (and square roots of those matrices – just for visualization / “flattening”) shows the real difference between such a pretty good, but “fixed” basis function like YUV, and a decorrelated basis found by SVD:

Absolute value of covariance matrices that I normalized to 1 in the first channel (position 0,0 of each different basis projection) – in addition to regular covariance matrices on the left, I also added a version displaying a square root for dynamic range compression / easier visualization on the right.

As expected, SVD fully decorrelates the color channels. We can also look at how the singular values for those images decay:

Decay of the energy in different channels of YUV conversion and in SVD – in linear and in log scales.

So again, hopefully it’s clear that with SVD we get faster decay of the “energy” contained in different image channels – the first N channels will reconstruct the image better. Let’s look at the visual example.

Visual difference when keeping only two channels

To fully appreciate the difference, let’s look at how the images would look like if we used only two channels – YUV vs SVD and corresponding PSNRs:

Using only two “decorrelated” color channels – YUV vs SVD – visual difference as well as PSNR.

The difference is striking visually, as well as very significant numerically.

This shouldn’t come as a surprise, as SVD finds a reparametrization / channel space that is the “best” projection for thr L2 metric (which is also used for PSNR computations), decorrelates, and sorts this space according to importance.

I wonder why JPEG wouldn’t use such decomposition – I would be very curious to read in the comments! :)I have two guesses of mine: JPEG is an old color format with lots of “legacy” and support in hardware – both encoders, as well as decoders. A single point-wise floating point matrix multiply (where matrix is just 4×3 – 3×3 multiplication and adding back means) seems super cheap to any graphics programmers, but would take a lot of silicon and power to do efficiently in hardware. A second guess is related to the “perceptual” properties of chrominance (and worse human vision sensitivity to it) – SVD doesn’t guarantee us that decomposition will happen in the rough direction of luminance, could be “any” direction.

Edit / addendum: While my remark re JPEG was about the YUV luminance/chrominance decomposition as compared to doing PCA on the color channels, the use of DCT basis for the spatial compression component (instead of mentioned before KLT and eigendecomposition of data directly) is considered “good enough”, and I learned from a reply twitter thread with Per Vognsen and Fabian Giesen about some justifications for its use and another set of connections to stochastic processes.

Now, this use case – decorrelating RGB channels – can be very useful not for standard planar image textures, but also for 3D volumetric colored textures. I know that many folks store and quantize/compress irradiance fields in YUV-like color spaces, and using there SVD will most likely have better quality/compression ratios. This would become even more beneficial for storing sky / ambient occlusion in addition to the irradiance, and one of the further section includes an example AO texture (albeit for the 2D case).

Relationship to block texture compression

If you have worked with GPU “block texture compression” (like DXT / BC color formats) you can probably recognize some of the described concepts – in fact, the simplest BC1 format does a PCA/SVD and finds a single axis, on which colors are projected! I recommend Nathan Reed’s blog post on this topic and a cool visualization.

This is done per a small, 4×4 block and because of that, we can expect much more energy be stored in the single component. This is a visualization of how such BC compression would +/- look like with our toy problem:

The simplest variants of BC compression project colors onto a single line that is found using PCA/SVD approaches and then quantized heavily.

Any blocks that don’t have almost all of information in the first singular value (so are not constant or have more than a single gradient) are going to end up with discoloration and block artifacts – but since we talk about just 4×4 blocks, GPU block compression usually works pretty well.

Block compression will be also pose a problem for the initial use-case and research direction I tried to follow – but more in that later!

What if we had many more channels?! PBR texture sets.

After a lengthy introduction, time to get back to the original purpose of my investigations – PBR texture sets for materials. How would SVD work there? My initial thinking was that if there are many more channels, some of them have to be correlated (as local areas will share physical properties – metal parts might have high specularity and gloss, while diffuse might be rough and more scratched). While such correlations might be only weak, there are many channels, and a lot of potential for also different, non-obvious correlations. Or correlations that are “accidental” and not related to the physical properties. 🙂

Since I don’t have access to any actual game assets anymore, and am not sure about copyright / licensing of many of the scenes included with the engines, I downloaded some texture set from cc0textures by Lennart Demes (such an awesome initiative!). Texture sets for materials there are a bit different than what I was used to in the games I worked on, but are more than enough for demonstration purpose.

I picked one set with some rocks/ground (on GoW, the programmers were joking that half of GoW disk was taken by rock textures, and at some point in the production it wasn’t that far off from true 🙂 ).

The texture set looks like this:

Example PBR texture set created by Lennart Demes that we are going to analyze and “compress” with SVD.

For the grayscale channels I took only single channels, and normals are represented using 3 channels, giving total of 9 channels.

The cool thing is – we don’t really have to “care” or understand what those color channels represent! We could even use 3 color channels to represent the “gray” textures, and the SVD would recognize it automatically (with some singular values ending up exactly zero) but I don’t want to present misleading/too optimistic data.

For visualization, I represented this texture set as horizontally stacked textures, but the first thing we are going to do is to first represent it as 2048 x 1365 x 9 array, and then for purpose of SVD as (2048 * 1365) x 9 one.

If we run the SVD on it (and then reshape it to w x h x 9), we get the following 9 decorrelated new “color”/image channels:

New color channels resulting from the SVD decomposition stacked together. U used data in U matrix multiplied by the S matrix to demonstrate the relative importance of it – feature magnitudes.

Alternatively, if we display it as 3 RGB images:

Same data as above (U times S from the SVD decomposition), but with 9 channels displayed as three RGB textures.

Immediately we can see that information contained in those decays very quickly, last four containing barely any information at all!

We can verify this looking at the plot of singular values:

As seen from the textures earlier, after 5 color channels, we get significant drop off of the information and start observing diminishing returns (“knee” of the cumulative curve).

Let’s have a look at how did SVD remap the input to the output channels and what kind of correlations it found.

In the following plot, I put abs(S.T) (using absolute value in most of my plots, as it’s easier to reason about existence of correlations, not their “direction”):

In this matrix, columns represent input features, and rows represent the output channels/features. Color/magnitude signifies abs() of the contribution. Input channel 0 is AO and contributes to four output features; channels 1,2,3 are albedo and map mostly to first two channels, and a bit to channels 5/6/7 etc.

I marked two input textures – albedo and normals, and as they are a 3-channel representation of the some physical property and we can see how because of this correlation they contribute to very similar output color channels, almost like same “blocks”. We can also observe that for example the last two output features contribute to a little bit of AO, little bit of albedo, and a bit to one channel of normals – comprising some leftover residual error correction in all of the channels.

Now, if we were to preserve only some of the output channels (for data compression), what would be the reconstruction quality? Let’s start with the most “standard” metric, PSNR:

PSNR of the reconsctruction if we take only N channels from the SVD decomposition. Note that even with zero channels, we get non-zero PSNR due to mean-subtraction. Generally PSNRs above ~36dB are considered “ok”. I “clip” above 50dB as such high values don’t make any difference for any perceptual purpose.

I skipped keeping of 9 channels in the plot, as for those we would expect “infinite” PSNR (minus floating point imprecisions / roundoff error).

So generally, for most of the channels, we can get above ~30dB with keeping only 5 data channels – which is what we intuitively expected from looking at the singular values. This is how reconstruction from keeping only those 5 channels looks like:

Top: Original PBR texture set. Bottom: Reconstructed from the first 5 SVD channels. Difference is most visuble on albedo and the roughness.

Main problems I can see is slight albedo discoloration (some rocks lose the desaturated rock scratches) and different frequency content on the last, roughness texture.

Going back to the PSNRs plots, we can see that they are different per channels and e.g. displacement becomes very high quickly, while the color and roughness stay pretty low… Can we “fix” those and increase the importance of e.g. color?

Input channel weighting / importance

I mentioned before that many ML applications would want to “standardize” the data before computing PCA / SVD and make sure range (as expressed by min/max or standard deviations) is the same to make sure we don’t bake in assumption of importance of any features as compared to the others in the analysis.

But what if we wanted to bake our desired importance / weighting into the decomposition? 🙂

Artists do this all the time manually by adjusting the resolutions of different sets of PBR textures to different resolutions; most 3D engine have some reasonable “defaults” there. We can use the same principles to guide our SVD and improve the reconstruction ratio. To do it, we would simply rescale the input channels (after the mean subtraction) by some weights that control the feature importance. For the demonstration, I will rescale the color map to be 200% as important, while reducing the importance of displacement to 75%.

This is how channel contributions and PSNR curves look like before/after:

Channel contributions without and with channel weighting increasing the contribution of albedo (channels 1/2/3) and decreasing the weight of the fourth channel. Notice how with the weighting most of albedo contributions happen in the first singular value, and the 4th channel of displacement gets distributed among few multiple values.
PSNR curves before and after the channel weighting increasing the contribution of albedo (channels 1/2/3) and decreasing the fourth channel. Notice how increase of the reconstruction accuracy of some channels had to lower it for some other ones, similarly global PSNR.

Such simple weighting directly improved some of the metrics we might care about. We had to “sacrifice” PSNR and reconstruction quality in different channels, but this can be a tuneable, conscious decision.

And finally, the visual results for materials reconstructed from just 5 components:

Top: original textures. Middle: SVD decomposition and reconstruction from 5 components without weighting. Bottom: Reconstruction from 5 components that includes increasing the weight of the albedo, while decreasing height map (third displayed texture). We observe the visual error decreasing significantly on the albedo – no discoloration, while increasing significantly for the height map.

Other than some “perceptual” or artist-defined importance, I think that it could be also adjusted automatically – decreasing weight of channels that are reconstructed beyond diminishing returns, and increasing of the ones that get too high error. It could be some semi-interactive process, dynamically adjusting number of channels preserved and the weighting – an area for some potential future research / future work. 🙂

Dropping channels vs smaller channels

So far I have described “dropping channels” – this is really easy to get significant space savings (e.g. 5/9 == almost 50% savings with PSNR in the mid 30s), but isn’t it too aggressive? Maybe we can do the same thing as JPG and similar encoders do – just store the further components at smaller resolutions.

For the purpose of this post and just demonstration, I will do something really “crude” and naive – box filter and subsample (similar to mip-map generation), with nearest neighbor upsampling – we know we can do better, but it was simpler to implement and still demonstrates the point. 🙂

For demonstration I kept the 4 components at the original resolution, then 3 components at half resolution, and the final 2 components at quarter resolution. In total this gives us slightly smaller storage than before (7/16th), but the PSNRs are as follow: Average PSNR 36.96, AO 43.69, Color 41.08, Displacement 29.98. Normal 41.25, Roughness 35.13.

Not too bad for such naive downsampling and interpolation!

When downsampled, images looked +/- the same, so instead here is a zoom-in to just crops (as I expected to see some blocky artifacts):

Top: original textures crops, bottom: crops of textures reconstructed from aggressive decimation of further SVD channels with super-naive bilinear/box filter and NN upsampling.

With the exception of the height map which we have purposefully “downweighted” a lot, the reconstruction looks perceptually very good! Note that this ugly “ringing” and blocky look of it would be significantly better if even simple bilinear interpolation was used.

I’d add that we typically have also a third option – adjusting the quantization and the number of bits we encode each of our channels. This actually makes the most sense (because our channel value ranges get smaller and smaller and we know them directly from the proportions singular values), and is something I’d recommend playing with.

Runtime cost – conversion back to the original channel space

Before I conclude with the biggest limitation and potentially a “deal breaker” of the technique, a quick note on the runtime cost of such decomposition technique.

The decomposed/decorrelated color channels are computed with zero mean and unity standard deviation, so we would generally want to shift them by 0.5 and rescale so that they cover range [0, 1], instead of [-min, max]. Let’s call this a (de)quantization matrix M. So far we have matrices U (pixels x N), S (N x N diagonal), T (N x N, orthogonal), M (we could do N+1 x N to use “homogenous coordinates” and get translations for mean centering for free), input feature weight matrix / vector W, and vector with means A.

We can multiply all of those matrices (other than U which becomes our new image) offline and as a part of precomputation and using homogenous coordinates (N input features + ‘1’ constant in additional channel), we end up with a (N+1 x N) conversion matrix if keeping all channels, or (N x M+1) where M < N if dropping some channels. The full runtime application cost becomes a matrix multiply per pixel – or alternatively, a series of multiply-adds – each reconstructed feature becomes a weighted linear combination of the “compressed” / decorrelated features, so total cost per a single texture channel is M+1 madds.

This is very cheap to do on a GPU without any optimizations (maybe one could use “tensor cores” which are simply glorified mixed-precision matrix-multipliers?), but has a disadvantage that even if we are going to use a single channel (for example heightmap for raymarching or in a depth prepass), we still need to read all the input channels. If you have many passes or use-cases like this, this could elevate the required texture bandwidth significantly, but otherwise could be almost “free” (ALU is generally cheap, especially if doesn’t have any branching that would elevate the used register count).

Note however that it’s not all-or-nothing – if you want you could store some textures as before, and compress and reduce dimensionality of the other ones.

Caveats / limitations – BC texture compression

Time to reveal the reason why I haven’t followed up on this research direction earlier – I imagine that to many of you, it might have been obvious throughout the whole post. This approach is going to generally lose a lot of quality when being compressed with GPU block compression formats.

You remember how I described that the simplest BC formats do some local PCA and then just keep only single principal component? Now, if we purposefully decorrelated our color channels and there is no correlation anymore, such a fit and projection is going to be pretty bad! 😦

Now, it’s not that horrible – after all, we have found global decorrelation for the whole texture. Luckily, this doesn’t mean that there won’t be local (per block) one.

BC formats do this analysis per block (otherwise every image would end up as some form of “colored grayscale”) and there still might be potential for local correlations and not losing all the information with block compression. This is how our color channels to compress look like:

Crops of the colors of analyzed texture transformed to the decorrelated SVD space. Despite global decorrelation, we still observe blocks and areas of constant color or just two colors.

So generally, locally we see mostly flat regions, or combinations of two colors.

This is not any exhaustive test, proof, or analysis, just a single observation, but maybe it won’t be as bad as I thought when I abandoned this idea for a while – but it’s a future research direction! If you have any thoughts, experiments, or would like to brainstorm, let me know in the comments or contact me. 🙂

Summary

In this post we went through quite a few different concepts:

  • Idea of “decorrelating” color channels to maximize information contained in them and remove the redundancy,
  • One alternative possible way how to think about images or image patches for linear algebra applications – as 2D matrices, not 3D tensors,
  • Very brief introduction to use of SVD for computing “optimal” decorrelated channel decompositions (SVD for PCA-like dimensionality reduction),
  • Relationship between the least-squares line fitting and SVD/PCA (they are not the same!),
  • Demonstration of SVD decomposition on a 2-channel, “toy” problem,
  • Relationship of such decomposition to Block Compression formats used on GPUs,
  • Reducing the resolution/quantization/fully dropping “less significant” decorrelated color channels,
  • Comparison of efficacy of SVD as compared to YUV color space conversion that is commonly used for image transmission and compression,
  • Potential application of dimensionality reduction for compressing the full PBR texture sets,
  • And finally, caveats / limitations that come with it.

I hope you found this post at least somewhat interesting / inspiring and encouraging some further experiments with using the multi-channel texture / material cross-channel correlation.

While the scheme I explored here has some limitations for practical GPU applications (compatibility with BC formats), I hope I was able to convince you that we might be leaving some compression/storage benefits unexplored, and it’s worth investigating image channel relationships for dimensionality reduction and potentially some smarter compression schemes.

As a final remark I’ll add that applicability of this technique depends highly on the texture set itself, and how quickly the singular values decay. I have definitely seen more correlated texture sets than the example, but it’s also possible that on some inputs the correlation and technique efficacy might be worse – all depends on your data. But you don’t have to make a decision whether to use it or how many channels to keep manually – by analyzing the singular values as well as the PSNR itself it can be easily deducted from data and the decision automated.

The post comes with three colabs: Part 1 – toy problem, Part 2 – Kodak images, Part 3 – PBR materials which hopefully will encourage verification / reproduction / exploration.

Thanks for reading!

Posted in Code / Graphics | Tagged , , , , , , , , | 11 Comments

“Optimizing” blue noise dithering – backpropagation through Fourier transform and sorting

Introduction

This will be a blog post that is second in an (unanticipated) series on interesting uses of the JAX numpy autodifferentiation library, as well as an extra post in my very old post series on dithering in games and blue noise specifically.

Teaser: we will “optimize” a dithering matrix directly in Fourier domain and backpropagate for gradient descent together with backpropagation through sorting, to obtain specific perceptual, blue-noise dithering properties. The image shows comparison between white, and blue noise dithering.

I am going to describe how we can use auto-differentiation and loss function optimization to perform a precomputation of a dithering pattern directly with a desired Fourier/frequency spectrum and the desired range of covered values by backpropagating through those two (differentiable!) transforms.

While I won’t do any detailed analysis of the pattern properties and don’t claim any “superiority” or comparisons to any of the state of the art (hey, this is a blog, not a paper! 🙂 ), it seems pretty good, highly tweakable, works pretty fast for tileable sizes, and to my knowledge, this idea of generating such patterns directly from spectrum with other (spatial, histogram) loss components is quite fresh.

If you are new to “optimization”, gradient descent, and back-propagation, I recommend at least my previous post just as a starting point, and then if you have some time, reading proper literature on the subject.

This blog post comes with a “scratchpad quality” colab notebook that you can find and play with here. It is quite ugly and unoptimal code, but being as simple as possible and thus a starting point for some other explorations.

Motivation

Motivation for this idea came from one of the main themes that I believe in and pursue in my personal spare time explorations / research:

Offline optimization is heavily underutilized by the computer graphics community and incorporating optimization of textures, patterns, and filters can advance the field much more than just “end-to-end” deep learning alone.

This was my motivation for my past blog post on optimizing separable filters, and this is what guides my interests towards ideas like accelerating content creation or compression using learning based methods.

Given a problem, we might want to simplify it and extract some information from data that is not known in a simple, closed form – this is what machine learning is about. However, if we run the “expensive” learning, optimization and precomputation offline, the runtime can be as fast – or even faster if we simplify the problem – as before and we can still use our domain knowledge to get the best out of it. Some people call it differentiable programming, but I prefer “shallow learning”.

Playing with different ideas for differentiating and optimizing Fourier spectrum for some different problem, I realized – hey, this looks like directly applicable to the good, old blue-noise!

Blue noise for dithering – recap

Let’s get back on-topic! It’s been a while since I looked at blue noise dithering and when writing it, I felt like my blog post mini-series has exhausted this topic to the point of diminishing returns (FWIW I don’t think that blue noise is a silver bullet for many problems; it has some cool properties, but its strength lies in dithering and only extreme low sample counts for sampling).

Before I describe benefits of the blue noise for dithering, have a look at an example dithered gradient from my past blog post:

A linear gradient dithered with some white vs blue noise. Can you guess which one is which? 🙂

Generally – informal definition that is most common in graphics – blue noise is type of noise that in a general sense has no low frequency content. This spectral property is desirable for dithering for the following main reasons:

  • Low frequencies present in the noise cause low frequencies of the error, which are very easily “caught” by the human visual system,
  • With blue noise dithering pattern, we are less likely to “miss” high frequency details of a given range of values. Imagine if low frequency caused “rise” of the pattern in a local spot, while it had high frequencies or information in different range – they would all quantize to a similar value.
  • Finally, with further downstream processing and high resolution displays, high frequency error is much easier to filter out by either image processing, or simply our imperfect eyes, with eye sight or processing reducing the variance of the apparent error.
  • Unlike some predefined patterns like Bayer matrix, blue noise has no visible “artificial” structure, making it both more naturally looking, as well as less prone to amplify some of the artificial frequencies.

All those properties make it a very desireable type of noise / pattern for simple look-up based dithering. (Note for completeness: if you can afford to do error diffusion and optimize the whole image based on the error distribution, it will yield way better results).

“Optimizing” Fourier spectrum

One of the blue noise generation methods that I initially tried for 1D data was “highpass filter and renormalize” by Timothy Lottes. The method is based on a simple, smart idea – highpass filtering followed by renormalizing values for proper range/histogram – and is super efficient to the point of applicability in real-time, but has limited quality potential, as renormalization “destroys” the desired spectral properties. Iterating this method doesn’t give us much – every highpass operation done using finite size convolution has only local support, while the renormalization is global. This imbalance was in my explorations leading to geting stuck in some low quality equilibrium, and after just a few iterations a ping-pong of two exactly same patterns.

If only we knew some global support transform that is able to look directly at spectral content… If only such transform was “tileable”, so the spectra were “circular” and periodic… Hey, this is just plain old Fourier transform!

While in many cases, Fourier being valid on periodic / circular patterns is a nuisance that needs to be addressed; similarly its global support (single coefficient affects the whole image), but here both of those properties work to our advantage! We want our spectral properties to work on toroidal – tiled – domains, as well as we want to optimize such pattern globally.

Furthermore, the Fourier transform is trivially differentiable (every frequency bucket is a sum of exponentials on the input pixels) and we can even directly write a loss function that mixes operations directly on the frequency spectrum (but even a single “pixel” of the spectrum will impact every single pixel of the input image. With such gradient based optimization at every step we can just slightly “nudge” our solution towards desired outcome and maybe won’t get stuck immediately in a low quality equilibrium.

Fourier transform is differentiable and we can backpropagate any loss function defined on the frequency spectrum back to the original spatial domain. Interesting side-effect of the global spatial support is that modifying even single frequency bucket has effect on all of the input samples. Similarly, every input sample contributes to all frequency buckets.

Similarly, if we sort pixels by value (do get ordering that would generate a histogram), we can just push them “higher” and lower and we can back-propagate through sorting.

Similarly, backpropagating loss function defined on sorted values (~a CDF of a histogram) is trivial as it requires only “reindexing” values (remembering which input value index corresponded to given sorted bucket). In this case loss and gradient are more “local” – single value maps back to a single value, but if we optimize for the whole histogram, all values will have some non-zero gradient.

Writing gradient of both of those manually wouldn’t be very hard… But in JAX it is already done for us, we already got both operations implemented as part of the library, and we can JIT the resulting gradient evaluation, so nothing left for us but to design a loss and optimize it!

Fourier domain optimization – whole science fields and prior work

I have to mention here that optimizing frequency spectra, or backpropagating through them is actually very common and it is used all the time in medical imaging, computational imaging, radio / astronomical imaging, and there are types of microscopy and optics based on it. Those are super fascinating fields and it probably takes lifetime to study those properly, and I am complete layman when it comes to those – but before I conclude the section, I’ll link a talk by Katie Bouman on the recent event horizon telescope which also does frequency domain processing (and in a way uses the whole Earth as a such Fourier telescope – no big deal! 🙂 ).

Setting up loss function and optimization tricks

As I mentioned in my previous post, designing a good loss function is the main part of the problem when dealing with optimization-based techniques… (along things like convergence, proofs of optimality, conditioning, and better optimization shemes… but even many “serious” papers don’t really care about those!).

So what would be the loss component of our blue noise function? I came up with the following:

Components of our dithering blue noise loss function:
1. Histogram uniformity. We want some desired histogram shape. For simplicity I used uniform one, but a triangular or Gaussian one might be a better choice for smooth gradients!
2. Frequency content focused on high frequencies, with absence of the low ones.
3. General frequency spectrum uniformity – we don’t want “spikes” similar to e.g. a Bayer pattern.

This is just an example – again, I wanted to focus on the technique, not on the specific results or choices! If you some alternative ideas, please let me know in the comments below 👇.

Again, code for this blog post is available and immediately reproducible here.

There are also many “meta” tuning parameters – what would be our low frequency cutoff point? How do we balance each loss component? What histogram shape we want? The best part – you can try and adjust any of those to taste and to your liking and specific use-case. 🙂

When running optimization on such a compound loss function, I almost immediately encountered getting stuck in some “dreaded” local minima. On the other hand, when optimizing only for the loss sub-components, the optimization was converging nicely. Inspired by the stochastic gradient descent that is commonly known to generalize and converge to better global minima than any batch based / whole dataset optimization, I used this observation to “stochasticize” / randomize the loss function slightly (to help “nudge” the solutions out of the local minima or saddle points).

The idea is simple – scale the loss components by a random factor, different per each gradient update / iteration. Given my “complete” loss function comprised of three sub-components, I scale each one by a random uniform in the range 0-1. This seemed to work very well and help optimization to achieve plausibly looking results without local minima.

As a final step, I run fine-tuning on the histogram only part (could as well just force-normalize, but reused existing code), as without it histograms were “almost” perfect. I think in practice it wouldn’t matter, but why not have a “perfect” unbiased dithering pattern if we can?

Left: Pretty good resulting optimized histogram – result of optimizing for the whole loss – individual component is very close to perfect dithering histogram. Right: A “perfect” histogram after fine tuning (or just renormalization).

Results

Without further ado, after playing for a few tries with different loss components and learning rates, I got a following 64×64 dithering matrix:

And the following spectrum:

One of the components of the loss that is hardest to optimize for is spectrum uniformity. It makes sense to me intuitively – smoothness of the spectrum means that we can propagate values only between ajdacent frequencies, which then affect all pixels and most likely cancel each other out (especially together with other loss components). Most likely a smarter metric (maybe just optimize directly for desired value in buckets? 🙂 ) would yield better results – or better optimizer, or simply running the optimization for longer than a few minutes, but they are not too bad at all! And it’s not 100% clear to me if it is necessary to have perfectly uniform spectrum.

Since those blue noise dithering matrices would be used after tiling for dithering, I tested how they look on 2bit dithering of an RGB image from Kodak dataset:

The difference is striking – which is not surprising at all. It’s even more visible on a 1-but quantized image (though here we can argue that both look unacceptably low quality / “retro” heh):

Still – in the blue noise dithered version you can actually read the text and I think it looks much better! I haven’t compared with any other techniques and not sure if it really matters in practice, especially for something so simple as dithering that has also more expensive, better alternatives (like error diffusion).

Conclusions

I have presented an alternative approach to designing dither matrices – by directly optimizing the energy function on the frequency spectrum, and the value range, and running a gradient descent to optimize for those objectives.

I haven’t really exhausted the topic and there are many interesting follow-up questions:

  • how a triangular or Gaussian noise value PDF affects the optimization?
  • What is the desired frequency cutoff point for dithering?
  • How to balance the loss terms and how to express smoothness more elegantly? Does the smoothness matter?
  • Would better optimization schemes like e.g. momentum based yield even faster convergence?
  • What about extending it to more dimensions? Projection-slice property of Fourier transform acts against us, but maybe we can optimize frequency in 2D, and some other metric in other dimensions?

…and probably many more, that I don’t plan to work more on for now, but I hope my post content and immediately-reproducible Python JAX code that I included will inspire you to some further (and smarter than my brute-force!) experiments.

Post scriptum / digression – optimization vs deep learning – which one is “expensive”?

While writing this post, I contemplated on something quite surprising – a digression on optimization and its relationship with deep learning – something I found counter-intuitive and paradoxically converse views between practitioners and academia: When switching from working in “the industry” on mostly production (and applied research), to working in research groups and interacting with academics, I immediately encountered some opposite views on deep neural networks.

For me, a graphics practitioner, they were cool, higher quality than some previous solutions, but impractical and too slow for most real product applications. At the same time, I have heard from at least some academics that neural networks were super efficient, albeit somewhat sloppy approximations.

What the hell, where does this disconnect come from? Then, reading more about state of the art methods prior to DL revolution, I realized that most of them were using optimization of some energy function on a whole problem and solving a complex, non-linear and huge problem with gradient descent (for example on a non-linear system where every unknown is a pixel of an image!) per every single instance of the problem, and from scratch.

By comparison, artificial neural networks might seem brute-force and wasteful to a practitioner, (per every pixel, hundreds if not thousands/millions of huge matrix multiplies of mostly meaningless weights and “features”, and gigantic waste of memory bandwidth) but the key difference is that the expensive “loss” optimization happens offline. In the runtime/evaluation it is “just” passing data through all the layers and linear algebra 101 arithmetic.

So yeah, our perspectives were very different. 🙂 But I am also sure that if we care about e.g. power consumption, energy efficiency and “operations per pixel”, we can take the idea of optimizing offline much further.

Posted in Code / Graphics | Tagged , , , , , , , , | 4 Comments

Bilinear texture filtering – artifacts, alternatives, and frequency domain analysis

In this post we will look at one of the staples of real-time computer graphics – bilinear texture filtering. To catch your interest, I will start with focusing on something that is often referred to as “bilinear artifacts”, trapezoid/star-shaped artifact of bilinear interpolation – what causes them? I will discuss briefly some common bilinear filtering alternatives and how they fix those, link a few of my favorite papers on (fast) image interpolation, and analyze the frequency response of common cheap filters.

Screenshot from my favorite video game of all time – Deus Ex (2000). See this diamond shape of the pupil? This is what I will refer to as “bilinear artifacts”.

As usual, I will work on the topic in “layers”, starting with basics and “introduction to graphics” level and perspective – analyzing the “spatial” side of the artifact. Then I will (non-exhaustively) the alternatives – bicubic and biquadratic filter, and finally analyze those from a signal processing / EE / Fourier spectrum perspective. I will also comment on relationship between filtering as used in the context of upsampling / interpolation, and shifting images. (Note: I am going to completely ignore here downsampling, decimation, and generally trilinear filtering / mip-mapping.)

Update: Based on a request, I added a link to a colab with plots of the frequency responses here.

What is bilinear filtering and why graphics use it so much?

I assume that most of the audience of my blog post are graphics practitioners or enthusiasts and most of us use bilinear filtering every day, but a quick recap never hurts. Bilinear filtering is a texture (or more generally, signal) interpolation filter that is separable – it is a linear filter applied on the x axis of the image (along the width), and then a second filter applied along the y axis (along the height).

Note on notation: Throughout this post I will use linear/bilinear almost interchangeably due to this property of bilinear filtering being a linear filter applied in the two directions sequentially.

Starting with a single dimension, linear filter is often called a “tent” or “triangle” filter. In the context of input texture interpolation, we can look at it as either “gather” (“when computing an output pixel, which samples contribute to it?”), or “scatter” operation (“for each input sample, what are its contributions to the final result?”). Most of GPU programmers naturally think of “gather” operations, but here this “triangle” part is important, so let’s start with the scatter.

For every input pixel, its contributions to the output signal form a “tent” – a “triangle” of pixel contributions is centered at sample N, with height equal to the sample/pixel N value, and its base spans from the sample (N-1) to (N+1). To get the filtering result we sum all contributing “tents” (there are either 1 or 2 covering each output pixel). All of triangles have the same base length, and all of them overlap with a single other triangle on either side of the top vertex. Here is a diagram that hopefully will help:

(Bi)linear filtering is called triangle or tent filtering as contributions of the input samples to the filtered results are “triangles” or “tents”.

Switching from “scatter” to a more natural (at least to people who write shaders) gather approach – between samples (N+1) and (N) we sum up contributions of all “tents” covering this area, so (N+1) and N, and at exact position of the sample N we get only its contribution.

Formula for this line is simply xn * (t – 1) + xn-1 * t. This means that in 1D each output sample has two contributing input pixels (though sometimes one contribution is zero!), and we linearly interpolate between them. If we repeat this twice per x and y, we end up with four input input samples and four weights (t_x – 1) * (t_y – 1), (t_x ) * (t_y – 1), (t_x – 1) * (t_y ), (t_x) * (t_y). Since the weights are just multiplication of independent x and y weights, we can verify that this filter is separable (if you are still unsure what it means in practice, check out my previous blog post on separable filters).

Bilinear filtering is ubiquitous in graphics and there are a few reasons for it – but the main one is that is super cheap, and furthermore, hardware accelerated; and that it provides significant quality improvement over nearest-neighbor filtering.

All modern GPUs can do bilinear filtering of an input texture in a single hardware instruction. This involves both memory access (together with potentially uncompressing block compressed textures into local cache), as well as the interpolation arithmetic. Sometimes (depending on the architecture and input texture bit depth) the instruction might have slightly higher latency, but unless someone is micro-optimizing or has a very specific workload, the cost of the bilinear filtering can be considered “free” on contemporary hardware.

At the same time, it provides significant quality improvement over “nearest neighbor” filtering (just replicating each sample N times) for low resolution textures.

Nearest neighbor filtering (red) produces discontinuities, while bilinear sampling (blue) produces continuous values between samples.
Bilinear vs nearest neighbor filtering.

Being supported natively in hardware is a huge deal and the (bilinear) texture filtering was one of the main initial features of graphics accelerators (when they were still mostly separate cards, in addition to the actual GPUs).

Personal part of the story – I still remember the day when I saw a few “demo” games on 3dfx Voodoo accelerator on my colleague’s machine (it was a Diamond Monster3D) back in primary school. Those buttery smooth textures and no pixels jumping around made me ask my father over and over again to get us one (at that time we had a single computer at home, and it was my father’s primary work tool) – a dream that came true a bit later, with Voodoo 2.

Side remark – pixels vs filtering, interpolation, reconstruction

One thing that is worth mentioning is that I deliberately don’t want to draw input – or even the output pixels as “little squares” – pixels are not little squares!

Classic tech memo from Alvy Ray Smith elaborates on it, but if you want to to deeper into understanding filtering, I find it extremely important to not make such a “mistake” (I blame “nearest neighbor” interpolation for it).

So what are those values stored in your texture maps and framebuffers? They are infinitely small samples taken at a given location. Between them, there is no signal at all – you can think of your image as “impulses”. Such signal is not very useful on its own, it needs a reconstruction filter that reconstructs a continuous, non discrete representation. Such a reconstruction filter can be your monitor and in this case, pixels could be actually squares (though different LCD monitors have different pixel patterns), or a blurred dot of a diffracted ray in CRT monitor. (This is something that contemporary pixel art gets wrong and Timothy Lottes used to have interesting blog posts about – I think they might have gotten deleted, but his shadertoy showing more correct emulation of a CRT monitor is still there).

For me it is also useful to think of texture filtering / interpolation in a similar way – we are reconstructing a continuous (as in – not discrete/sampled) signal and then resample it, taking measurements / values at new pixel positions from this continuous representation. This idea was one of key components in our handheld mobile super-resolution work at Google, allowing to resample reconstructed signals at any desired magnification level.

Example from published past work that I worked on on how given a continuous reconstruction of a sparse signal (from many locally randomly shifted input frames) can be sampled at different destination grid densities, achieving higher zoom factors with less aliasing and more details.

Bilinear artifacts – spatial perspective – pyramid, and mach bands

Ok, bilinear filtering seems to be “pretty good” and for sure is cheap! So what are those bilinear artifacts?

Again a zoom-in into a screenshot from Deus Ex intro – notice the star-like pupil.

In the post intro I have shown example artifact from a 20 year old video game back when the asset textures were very low resolution, but those happen anytime we interpolate from low resolution buffers. In my Siggraph 2014 talk about volumetric atmospheric effects rendering, I have shown an example of problematic artifacts when upsampling low resolution 3D fog texture:

Volumetric fog bilinear artifacts – notice stair/saw-like look.

In my old talk proposed to solve this problem with additional temporal low-pass effect (temporal jitter + temporal supersampling), because the alternative (bicubic / B-spline sampling – more about it later) was too expensive to apply in real time. Nota bene such a solution is actually pretty good at approximating blurring / filtering for free if you already have a TAA-like framework.

If we get back to our “tent” filter scatter interpretation, it should become easier to see what causes this effect. Separable filtering applies first a 1D triangle filter in one direction, then in another direction, multiplying the weights. Those two 1D ramps multiplied together result in a pyramid – this means that every input pixel will “splat” a small pyramid, and if the ratio of the output to the input resolutions is large and textures have high contrast, then those will become very apparent. Similarly those pyramids on any edge that is not aligned with perfect 45 degrees, will create jaggy, aliased appearance.

Horizontal tent filter followed by a vertical tent filter results in a “pyramid”.

Ok, it’s supposed to be a pyramid, but what’s the deal with this bright star-like lines?

The second “spatial” effect of why bilinear filtering is not great are so-called Mach bands.

This phenomenon is one of the fascinating “artifacts” (or conversely – desired features / capabilities) of human visual system. Human vision cannot be thought of as a “sensor” like in a digital camera or “film” in analog one – it is more like a complicated video processing system, will all sorts of edge detectors, contrast boosting, motion detection, embedded localized tonemapping etc. Mach bands are caused by tendency of HVS to “sharpen” image and emphasize any discontinuity.

Perfectly linear gradient (for example generated by linear interpolation). Do you also see this apparent bright white line on the right side of the gradient?

(Bi)linear interpolation is continuous, however its derivative is not (derivative of a piecewise linear function is a piecewise constant function). In mathematical definition of smoothness and C-continuity we say that it is C0 continuous, but not C1 continuous. I find it fascinating that HVS can detect such discontinuity – detecting features like lines and corners is like differentiation and indeed, most common and basic feature detector in computer vision, Harris Corner Detector analyzes local gradient fields.

To get rid of this effect, we need some filtering and interpolation that has a higher order of continuity. Before I move to some other filters, I have to mention here also a very smart “hack” from Inigo Quilez – by hacking the interpolation coordinates, you can get interpolation weights that are C1 continuous.

Screenshot from: https://www.shadertoy.com/view/XsfGDn, author Inigo Quilez. Using hacked UVs together with hardware bilinear interpolation to get smoother filtering. Be sure to check Inigo’s excellent article.

It’s a really cool hack and works very well with procedural art (where manually computed gradients are used to produce for example normal maps), but produces “rounded rectangles” type of look and aliases in motion, so let’s investigate some other fixes.

Bilinear alternatives – bicubic / biquadratic

The most common alternative considered in graphics is bicubic interpolation. The name itself is in my opinion confusing, as cubic filtering can mean any filtering where we use 4 samples, and filter weights result from evaluating 3rd order polynomials (similarly to linear using 2 samples and a 1st order polynomial). In 2D, we have 4×4 = 16 samples, but we also interpolate separably. However there are many alternative 3rd order polynomials weights that could be used for filtering… This is why I avoid using just the term “bicubic”. Just look at Photoshop image resizing options, does this make any sense!?

Bicubic, or bicubic sharper?  ¯\_(ツ)_/¯

The most common bicubic filter used in graphics is one that is also called B-Spline bicubic (after reading this post, see this classic paper from Don Mitchell and Arun Netravali that analyzes some other ones – I revisit this paper probably once every two months!). It looks like this:

As you can see, it is way smoother and seems to reconstruct shapes much more faithfully than bilinear! Smoothness and lack of artifacts comes from its higher order continuity – it is designed to have matching and continuous derivatives at each original sample point.

It has a few other cool and useful properties, but my personal favorite one is that because all weights are positive, it can be optimized to use just 4 bilinear samples! This old (but not dated!) article from GPU Gems 2 (and a later blog post from Phill Djonov elaborating on the derivation) describe how to do it with just some extra ALU to compute the look-up UVs. (Similarly in 1D this can be done in just 2 samples – which can be quite useful for low-dimensionality LUTs).

Its biggest disadvantage – it is visibly “soft”/blurry, looks like the whole image got slightly blurred – which is what is happening… Some other cubic interpolators are sharper, but you have to pay a price for that – negative filter weights that can cause halos / ringing, as well as make it impossible to optimize so nicely.

Second alternative that caught my attention and was actually a motivation to finally write something about image resampling was a recent blog post from Thomas Deliot (and an efficient shadertoy implementation from Leonard Ritter) on biquadratic interpolation – continuous interpolation that reconstructs a quadratic function and considers 3 input samples (or 9 in 2d). I won’t take the spotlight from them on how it works, I recommend checking the linked materials, but included an animated comparison – it looks very good, sharper than the bspline bicubic, and is a bit cheaper.

Post update: Won Chun pointed to me a super interesting piece of literature on quadratic filters from Neil Dogson. The link is paywalled, but you will easily find references by googling for “Quadratic interpolation for image resampling”. It derives the quadratic filter with an additional parameter that allows to control filter sharpness similarly to bicubic filters parameters B/C. The one that I used for analysis is the blurriest, “approximating” one. Here is a shadertoy from Won Chun and a corresponding Desmos calculator that proposes also a 2 sample approximation of the (bi)quadratic filter, cool stuff!

Interpolation – shifting reconstructed signal and position dependent filtering

A second perspective that I am going to discuss here is how texture filtering creates different local filters and filter weights depending on the fractional output/input pixel positions relationship.

Let’s consider for now only 1D separable interpolation and the range of 0.0-0.5 (-0.5 to 0.0 or 0.5 to 1.0 can be considered as switching to different sample pair, or symmetric).

With a subpixel offset of 0.0, linear weights will be [1.0, 0.0], and with a subpixel offset of 0.5, they will be [0.5, 0.5]. Those two extremes correspond to a very different image filters! Imagine input filtered with a convolution filter of 1 (this returns just the original one) vs a filter of [0.5, 0.5] – this corresponds to low quality, box blur of the whole image.

To demonstrate this effect, I made a shadertoy that compares bilinear, bicubic bspline, biquadratic, and a sinc filter (more about it later) with dynamically changing offsets.

Same texture bilinearly interpolated. Upper left half of the image includes shift by 0.5 pixels, the other one no (and thus no filtering). I upsampled the resulting image with NN interpolation for magnification.

So with an offset of U or V of 0.5, we are introducing very strong blurring (even more with both shifted – see above), and no blurring at all with no fractional offsets.

Knowing that our effective image filters change based on the pixel “phase” we can have a look at spectral properties and frequency responses of those filters and see what we can learn about those resampling filters.

Bilinear artifacts – variance loss, and frequency perspective

So let’s have a look at filter weights of our 3 different types of filters as they change with pixel phase / fractional offset:

As expected, they are symmetric, and we can verify that they sum to 1. Weights are quite interesting themselves – for example quadratic interpolation with fractional offset of 0.0 has the same filter weights as the linear interpolation with offset of 0.5!

Looking at fractional offset dependent, we can analyze how those weights affect the signal variance, as well as the frequency response (and in effect, frequency content of the filtered images). Variance change resulting from a linear filter is equal to the sum of the squares of all its weights. So let’s plot the effect on input signal variance from those three filters.

Position dependent variance reduction of three different analyzed filters.

With the bilinear filter, the total variance changes (a lot!) with the subpixel position – which will cause apparent contrast loss. Variance and contrast loss due to blending/filtering is an interesting statistical effect, very visible perceptually, and one of my favorite papers on procedural texture synthesis from Eric Heitz and Fabrice Neyret achieves very good results from blending random blended tilings of an input texture mostly “just” by designing a variance preserving blending operator!

Now, moving to the full frequency response of a those filters – for this we can use for example Z-transform commonly used in signal processing (or go directly with complex phasors). This is goes beyond scope of this blog post, (if you are interested in details, let me know in the comments!) but I will present the resulting frequency responses of those 3 different filters:

Frequency responses of linear, quadratic, and bspline at fractional offsets of 0.0, 1/6, 1/3, 1/2. Each plot corresponds to a different fractional coordinate offset.

Looking at those plots we can observe a very different behavior. As we realized above, linear at no fractional offset produces no filtering effect at all (identity response), and both other filters are lowpass filters. Interestingly, no offset is where quadratic interpolation performs the most lowpass filtering, the opposite of both other filters! To be able to reason about the filters in more of isolation, I also plotted each filters frequency response at different phases on a single plot for each:

Frequency responses of linear, quadratic, and bspline at fractional offsets of 0.0, 1/6, 1/3, 1/2. Each plot corresponds to a different filter.

Looking at these plots, we can see that bspline cubic is the most “consistent” between the different fractional offsets, while quadratic is less consistent, but filters out somewhat less of high frequencies (which is part of its sharper look). Linear is the least consistent, going from between a very sharp image, and a very blurry one.

This inconsistency (both in terms of total variance, as well as frequency response – in fact the first can be derived from the latter from Parseval’s theorem) is another look at what we perceive as bilinear artifacts – “pulsing” and inconsistent filtering of the image. Looking at linear image shifted with phase 0.0 and 0.5 is like looking at two very different images. While bicubic bspline over-blurs the input, it is consistent, doesn’t “pulse” with changing offsets and doesn’t create aliased visible patterns. I see the biquadratic filter as kind of a “middle ground” that looks very reasonable and definitely will use it in practice.

In the original eye screenshot texture we can see filtering / variance reduction in action. Pixels close to the interpolated texel origins are sharp (green), some get blurred horizontally (purple), some vertically (yellow), and some both (red). Such an non-uniform effect of variable blurring and variable contrast is yet another take on the bilinear artifacts.

Given how bicubic and similar filters have quite consistent lowpass filtering, some of the blurring effect can be compensated with an additional, sharpening filter. This is what for example my friend Michal Drobot proposed in his talk on TAA for Far Cry 4 (and others have arrived to and used independently) – after resampling, do a global uniform sharpening pass on the whole image to counter some of the blurriness. Given how blurriness is relatively constant, sharpen filter can be even designed to recover the missing input frequencies accurately!

Bilinear / bicubic alternatives – (windowed) sinc

It’s worth noting that resampling filters don’t have to necessarily so heavily blur out the images and their frequency response can be much more high frequency preserving. This is beyond scope of this blog post, but some bicubic filters are actually designed to be slightly sharpening and boosting some of the high frequencies (Catmull-Rom spline – often used in TAA for continuous resampling).

Such built-in sharpening might not be desired, so many filters are designed to be as frequency preserving as possible (windowed sinc like Lanczos). As I mentioned, it is beyond the scope of this blog post, but just for fun, I include example frequency response of a truncated (non windowed!) sinc, as well as final gif comparison. Such a sinc is also featured in my shadertoy.

Unwidowed sinc frequency response at different subpixel offsets – note how there is no dominant lowpass filtering (with exception of last part, close to Nyquist), but we get lots of frequency response ringing and we will observe some oversharpening. Do not use unwindowed sinc in practice!
All filters compared. Note that truncated sinc might be the sharpest, but it looks unacceptably (ugly!) with over-filtering and significantly stronger image contrast. If you’re looking for sharpness, better choice might be some tweaked bicubic filter, or windowed sinc – Lanczos.

For much more complete and comprehensive treatment of filter trade-offs (sharpness, ringing, oversharpening), I again highly recommend this classic paper from Mitchell and Netravali – be sure to check it out.

Conclusions

This post touched on lots of topics and connection between them (I hope that it could be inspiring for you to go deeper into them) – I started with analyzing the most common, simple, ardware accelerated image interpolation filter – bilinear. I discussed its limitations and why it is often replaced (especially when interpolating from very low resolution textures) by other filters like bicubic – which might seem counter-intuitive given that they are much blurrier. I discussed what causes the “bilinear artifacts”, both from the simple spatial filtering perspective, as well as the effect of variance loss / contrast reduction / lowpass filtering. I barely scratched surface of the topic of interpolation, but I had fun writing it (and creating the visualizations), so expect some follow up posts about it in the future!

Posted in Code / Graphics | Tagged , , , , , | 11 Comments

Using JAX, numpy, and optimization techniques to improve separable image filters

By the end of this blog post, we aim to improve the results of my previous blog post of approximating circular bokeh with a sum of 4 separable filters and reducing its artifacts (left) through an offline optimization process to get the results on the right.

In today’s blog post I will look at two topics: how to use JAX (“hyped” new Python ML / autodifferentiation library), and a basic application that is follow-up to my previous blog post on using SVD for low-rank approximations and separable image filters – we will look at “optimizing” the filters to improve the filtered images.

My original motivation was to play a bit with JAX and see if it will be useful for me, and I immediately had this very simple use-case in mind. The blog post is not intended to be a comprehensive guide / tutorial to JAX, nor a complete optimization primer (one can spend decades studying and researching this topic), but a fun example application – hopefully immediately useful, and inspiring for other graphics or image processing folks.

The post is written as separate chapters/sections (problem statement, some basic theory and challenges, practical application, and the results) – feel free to skip ones that seem obvious to you. 🙂

This post comes with a colab that you can run and modify yourself. The code is not very clean, mostly scratchpad quality, but I decided that it’s still good for the others if I share it, no matter how poorly it reflects on my coding habits.

Recap of the problem – separable filters

In the last blog post, we have looked at analyzing if convolutional image filters can be made separable, and if not, finding separable approximations (as a sum of N separable filters). For this we used Singular Value Decomposition and using a low rank approximation by taking first N singular values.

We have found that to approximate a 50×50 circular filter (can be though of as a circular bokeh), one needs ~13 separable passes over the whole image, and even 8 singular components (that have extremely low numerical error) can produce distracting, unpleasant visual artifacts – negative values cause “ringing”, and some “leftover” corner errors.

Unpleasant, unacceptable visual artifacts in an image filtered with a rank 4 approximation of the circular kernel.

I have suggested that this can be solved with optimization, and in this post I will describe most likely the simplest method to do so!

Side note: After my blog post, I had a chat with a colleague of mine, Soufiane Khiat – we used to work together at Ubisoft Montreal – and given his mathematical background much better than mine, he had some cool insights on this problem. One of suggestions was to use singular value thresholding algorithm and some proximal methods and generally it is probably the way to go for the specific problem – but a bit of against “educational” goal of my blog posts (and staying as general as possible). I still recommend reading about this approach if you want to go much deeper into the topic and again – thanks Soufiane for a great discussion.

What is optimization?

Optimization is a loaded term and quite likely I am going to use it in a way you didn’t expect! Most graphics folks understand it as in “low level optimization” (or algorithmic optimization), aka optimizing the computational cost – memory, CPU/GPU usage, total timing spent on computations – and this is how I used to think of term “optimization” for many years.

But this is not the kind of optimization that a mathematician, or most academics think of and not the topic of my blog post!

Optimization that I will use is the one in which you have some function that depends on a set of parameters, and your goal is to find the set of parameters that achieves a certain goal, usually minimum (sometimes maximum, but they can be equivalent in most cases) over this function. This definition is very general – and in theory it even covers also computational performance optimizations (we are looking for a set of computer program instructions that optimizes performance while not diverging from the desired output).

Optimization of arbitrary functions is generally a NP-hard problem (there are no solutions other than exploring every possible value, which is impossible in the case of continuous functions), but under some constraints like convexity, or looking for local minima, it can be made feasible, and relatively robust and fast – and is basis of modern convolutional neural networks, and algorithms like gradient descent.

This post will skip explaining the gradient descent. If it’s a new or vague concept for you, I don’t recommend just trusting me 🙂 – so if you would like to get a good basic, but intuitive and visual understanding, be sure to check this fantastic video from Grant Sanderson on the topic.

What is JAX?

JAX is a new ML library that supports auto-differentiation and just-in-time compilation targeting CPUs, GPUs, and TPUs.

JAX got very popular recently (might be a sampling bias, but seemed to me like half of ML twitter was talking about it), and I am usually very skeptical about such new hype bandwagons. But I was also not very happy with my previous options, so I tried it, and I think it is popular for a few good reasons:

  • It is extremely lean, in the most basic form it is a “replacement” import for numpy.
  • Auto-differentiation “just works” (without any gradient tapes, graphs etc.) over most of numpy and Python constructs like loops, conditionals etc.
  • There are higher level constructs developed on top of it – like NNs and other ML infrastructure, but you can still use the low-level constructs.
  • There is no more weird array reshaping and expressing everything over batches, constantly thinking about shapes and dimensions. If you want to process a batch, just use functions that map your single-example functions over batches.
  • It is actively developed both by open-source, as well as some Google and DeepMind developers, and available right at your fingers with Colab – zero installation needed.

I have played with it for just a few days, but definitely can recommend it and it will become my go-to auto-diff / optimization library for Python.

Optimization 101 – defining a loss function

This might seem obvious, but before we can start optimizing an objective, we have to define it in some way that is understandable for the computer and optimizeable.

What is non-obvious is that coming up with a decent objective function is the biggest challenge of machine learning, IMO a much bigger problem than any particular choice of a technique. (Note: it also has much wider implications than our toy problems; ill-defined objectives can lead to catastrophes in wider social context, e.g. optimizing for engagement time leads to clickbaits, low quality filler content, and fake news).

For our problem – optimizing filters, we know three things that we want to optimize for:

  • Maintain the low rank of the filter,
  • Keep the separable filter similar to the original one,
  • Avoid the visual artifacts.

The first one is the simple – we don’t have much wiggle room there. We have a hard limit on maximum of N separable passes and fixed number of coefficients. But on the other hand, the two other goals are not well defined.

The “similarity” that we are looking for can be anything – average squared error, average absolute error, maximum error, some perceptual/visual similarity… Anything else, or even any combination of the above. Mathematically, we use term of distance, or metric. Convenient and well researched are metrics that are based on p-norm and mathematicians like to operate on squared Euclidean distance (L2 norm), so average squared errors. Average squared error is so often used as it usually has a simple, closed form solutions like linear least squares, linear regression, PCA), but in many cases it might not be the right loss. Defining perceptual similarity is the most difficult and is an open problem in computer vision, with the recent universal approach of using similarity features extracted by neural networks for the purpose of image recognition.

The third one “avoid the visual artifacts” is even more difficult to define, as we are talking about artifacts that are present in the final image, and not about numerical error in the approximated filter. Deciding on components of the loss function to avoid visual artifacts and tuning them is often the most time consuming part of optimization and machine learning.

Left: original filter, right: rank 4 approximation. Notice the negative values (blue color) and some leftover “garbage” values in the corners. Those will contribute to visual artifacts.

Looking at artifacts in the filtered images I think that the two most objectionable ones are: negative values causing ringing, and some “garbage” pixels in the corners of the filter.

Putting it together – target loss function for optimizing separable image filters

Given all that together, I decided on minimizing a loss function that will sum three terms:

  • Squared mean error term,
  • Heavily penalizing any negative elements in the approximated filter,
  • Additional penalty when a zero element of the original filter becomes non-zero after the approximation.

This is a very common approach – summing together multiple different terms with different meanings and finding the parameters that optimize such a sum. Open challenge is tuning the weights of the different components of the loss function, a further section will show the impact of it.

This loss function might not be the best one, but this is what makes such problems fun – often designing a better loss (closer corresponding to our intentions) can lead to significantly better results without changing the algorithm at all! This was one of my huge surprises – so many great papers just propose simple optimization framework together with an improved loss function to advance state of the art significantly.

Also if you think that it is quite “ad-hoc” – yes, it is! Most academic papers (and productionized code…) have such ad-hoc loss functions where each component might have some good motivation, but they end up with a hodge-podge that doesn’t make too much sense when put together, but empirically works (often verified by ablation studies).

In numpy, an example implementation of the loss function might look like:

def loss(target, evaluated_filter):
  l2_term = L2_WEIGHT * np.mean(np.square(evaluated_filter - target))
  non_negative_term = NON_NEGATIVE_WEIGHT * np.mean(-np.minimum(evaluated_filter, 0.0))
  keep_zeros_term = KEEP_ZEROS_WEIGHT * np.mean((target == 0.0) * np.abs(evaluated_filter))
  return l2_term + non_negative_term + keep_zeros_term 

Now that we have a loss function to optimize, we need to find parameters that minimize it. This is where things can become difficult. If we want to have rank 4 approximation of a 50×50 filter, we end up with 4*(50*2) == 400 parameters. If we wanted to do a brute force search in this 400 dimensional space, this could take a very long time! Let’s say we just wanted to evaluate 10 different potential values – this would take 10^400 loss function evaluations – and this is quite a toy problem!

Luckily, we are in a situation where our initial guess obtained through SVD is already kind of ok, and we want to just improve it. This is a classic assumption of “local optimization” and can be achieved through greedily minimizing the error, for example by coordinate descent, or even better gradient descent – going in the direction where the error is decreasing the most. In a general scenario we don’t get any theoretical guarantees of finding the true, best solution, but we are guaranteed that we will get a solution that will be better or at worst the same when compared to the initial one (according to our loss function).

Gradient descent in JAX

How do we compute the gradient of our loss function? For the function that I wrote it is relatively simple and follows calculus 101, but anytime we would change it, we need to re-derive our gradient… Wouldn’t it be great if we could compute it automatically?

This is where various auto-differentiation libraries can help us. Given some function, we can compute its gradient / derivative with regards to some variables completely automatically! This can be achieved either symbolically, or in some cases even numerically if closed-form gradient would be impossible to compute.

In C++ you can use Ceres (I highly recommend it; its templates can be non-obvious at first, but once you understand the basics, it’s really powerful and fast), in Python one of many frameworks, from smaller ones like Autograd to huge ones like Tensorflow or PyTorch. Compared to tf and pt, I wanted something lower level, simpler, and more convenient to use (setting up a graph or gradient tapes is not great 😦 ) – and JAX fills my requirements perfectly.

JAX can be a drop-in replacement to a combo of pure Python and numpy, keeping most of the functions exactly the same! In colab, you can import it either instead of numpy, or in addition to numpy. Here is code that computes our separable filter from list of separable vector pairs, and the loss function.
(note: I will paste some code here, but I highly encourage you to open the colab that accompanies this blog post and explore it yourself.)

import numpy as np
import jax.numpy as jnp

# We just sum the outer tensor products.
# vs is a list of tuples - pairs of separable horizontal and vertical filters.
def model(vs):
  dst = jnp.zeros((FILTER_SIZE, FILTER_SIZE))
  for separable_pass in vs:
    dst += jnp.outer(separable_pass[0], separable_pass[1])
  return dst

# Our loss function. 
def loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight):
  target = model(vs)
  l2_term = l2_weight * jnp.mean(jnp.square(target- REF_SHAPE))
  non_negative_term = non_negativity_weight * jnp.mean(-jnp.minimum(target, 0.0))
  keep_zeros_term = keep_zeros_weight * np.mean((REF_SHAPE == 0.0) * jnp.abs(target))
  return l2_term + non_negative_term + keep_zeros_term 

Note how we mix some simple Python loops and logic in the model function, and numpy-like code (I could have imported jax.numpy as simply np, and you can do it if you want to, but to me such library shadowing feels a bit confusing).

Ok, but so far there is nothing interesting about it; I just kind of rewrote some numpy code as jax.numpy, what’s the big deal?

Now, the “magic” is that you compute the gradient of this function just by writing jax.grad(loss)! By default the gradient is wrt the first function parameter (you can change it if you want). JAX has some limitations and gotchas on what it can compute the gradient of and can require workarounds, but most of them feel quite “natural” (e.g. that PRNG requires explicit state) and I haven’t hit those in practice with my toy example.

Our whole gradient descent step function would look like:

def update_parameters_step(vs, learning_rate, l2_weight, non_negativity_weight, keep_zeros_weight):
  grad_loss = jax.grad(loss)
  grads = grad_loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight)
  return [(param[0] - learning_rate * grad[0], param[1] - learning_rate * grad[1]) for param, grad in zip(vs, grads)]

I was mind-blown how simple it is – no gradient tapes, no graph definitions requires. This is how auto-differentiation should look like from user’s perspective. 🙂 What is even cooler is that the resulting function code can be JIT-compiled to native code for orders of magnitude speed-up! You can place a decorator @jax.jit above your function, or manually create optimized function from jax.jit(function). This makes it really fast and allows you to pick-and-choose what you jit (you can even unroll a few optimization iterations if you want).

Tuning loss function term weights

Time to run an optimization loop on our loss function. I picked a learning_rate of 0.1 and 5000 steps. Those are ad-hoc choices, step count is definitely an overkill, but it just worked and the optimization is pretty fast even on the CPU, so I didn’t bother tweaking them. The whole optimization looks like this:

# Our whole optimization loop.
def optimize_loop(vs, l2_weight, non_negativity_weight, keep_zeros_weight, print_loss):
  NUM_STEPS = 5000
  for n in range(NUM_STEPS):
    vs = update_parameters_step(vs, learning_rate=0.1, l2_weight=l2_weight, non_negativity_weight=non_negativity_weight, keep_zeros_weight=keep_zeros_weight)
    if print_loss and n % 1000 == 0:
      print(loss(vs))
  return vs

Finally we get to the problem of testing different loss function term weights. Luckily, because our problem is small, and jit’d optimization runs very fast, we can actually test it with a few different parameters.

One thing to note is that while we have 3 terms, in fact if we multiply all 3 of them by the same value, we will end up with loss function 3x bigger, but with the same parameters. So we have effectively just 2 degrees of freedom and can keep one of the weights “frozen” – I will set the L2 mean loss as just 1.0 and operate on the weight for the non-negativity and keeping-zeros.

Without further ado, here are our rank 4 separable circular filters after the optimization:

Left to right: keep_zeros_weight of 0.0, 0.3, 0.8.
Top to bottom: non_negativity_weight of 0.0, 0.5, 1.5.

We can notice how those two loss terms affect the results and interact with each other. Non-negativity will very effectively reduce the negative ringing, but won’t address the artifacts in corners of the filter.

The extra penalty of the zero terms becoming non-zero nicely cleans up the corners, but there is still mild ringing around the radius of the filter.

Both of them together reduce all artifacts, but the final shape starts to deviate from our circle, becoming more like a sum of four separate box-filters (which is exactly what is happening)! So it is a trade-off of accuracy, filter shape, and the artifacts. There is no free lunch! 🙂

Results when applied to an real image

Let’s look at the same filters when applied to the same image we have looked at in my previous blog post. Note: I have slightly boosted the gamma function from 7.0 to 8.0 as compared to my previous post to emphasize the visual error.

Left to right: keep_zeros_weight of 0.0, 0.3, 0.8.
Top to bottom: non_negativity_weight of 0.0, 0.5, 1.5.
I highly encourage opening this image in a new tab and zooming in – otherwise the artifacts and differences are hard to notice.

When you zoom in, the difference in artifacts and overall appearance becomes quite obvious. I personally like the most the center picture, and the one below it. Column on the right minimized artifacts the most, especially the bottom-right example, but to me looks too “blocky”.

Can you design a better loss function and parameters for this problem? I am sure you can, feel free to play with the colab! 🙂

Some random ideas that can be fun to evaluate and get better understanding of how loss functions affect the results: penalizing zeros becoming non-zeros more as they get further away from the center, using L1 loss instead of L2 loss, playing with maximum allowed error, optimizing for computational performance by trying to create filter with explicitly zero weights that could get skipped in a shader.

When such simple optimization through gradient descent is going to work?

While I mentioned that optimization is a whole field of mathematics and computer science and beyond scope of any simple blog, in my posts I usually try to give the readers some “intuition” about the topics I write about, and practical tips and tricks, so will mention a few gotchas regarding the optimization.

Gradient descent is so successful and ubiquitous technique (have you heard about this whole field of machine learning through artificial neural networks?!) because of few interesting properties:

First of all, in case of multivariate functions, we don’t need to be strictly convex! There might exist some path in the parameter space that gradient descent can use to “walk in the valleys and around the hills”.

If we are “lucky”, descent can optimize even a non-convex function!

Second related observation is that the more dimensions we have, the easier it is to find some “good” solution (that might not be the total global minimum, but some almost-as good local minimum). Over-completeness and over-parametrization is one of the things that makes very deep networks train well. We are not looking for a single unique perfect solution, but one of many good ones, and the more parameters and dimensions there are, the more (combinatoricly explosive) combinations of parameters can be ok. Think about the example I have visualized above – in 1D we could never “walk around” a locally bad solution, the more dimensions we add, the more paths are there that can lead to improving the results.

It’s one of the areas that are researched in machine learning and called “lottery ticket hypothesis”. I personally found this to initially to be super non-intuitive and was totally mind blown to see how adding fully linear layers to a network can make a huge difference on it converging to good results.

What helps gradient descent a lot is some good initial initialization. This not only increases the chances of convergence, but also the convergence speed. For our filters initializing them with SVD-based low-rank approximation was very helpful – feel free to try if you can get the same results with fully randomized initialization. The worst possible initialization would be something totally-uniform, and with its gradient being same for all variables (or even worse, zero gradient). An example: if optimizing a function that is product of a and b, don’t initialize them both to zero – think how the gradients would look like and what would happen in such case. 🙂

A second practical advice is to keep the learning rate as low as your patience allows for. You avoid the risks of “jumping over” your optimal solution.

Visualization on 1D toy example how wrong learning / optimization rate can cause lack of correct convergence.
Red: too large learning / optimization rate can jump over the global maximum!
Green: Lower learning rate is “safer” for local optimization.

Too high of a learning rate can make your gradients “explode” and jump completely out of reasonable parameter space. Probably everyone who ever played with gradient descent saw loss or parameters becomeing NaNs or infinity at some point and then debugged why. There can be many causes, from a wrong loss function, through bad behaving gradients (derivatives of certain function can get huge; e.g. sqrt around zero), but very often it is too large of learning rate. 🙂 If your optimization is too slow, instead of increasing the optimization rate, try momentum-based or higher order optimization – both are actually trivial to code with JAX, especially for such small use-cases!

Conclusions

In this blog post, we have looked at incorporating some simple offline optimization using numpy and JAX to our programming toolset. We discussed difficulties of designing a “good” loss function and then tuning it, and applied it to our problem of producing good separable image filters. If you do any kind of high performance work, don’t be deterred from machine learning – you don’t have to do it in real time and on the device, but can use it offline. For example of optimizing/computing offline approximations that can make something feasible in real time, check out one of papers that finally got out (I contributed to a longer while ago) – real time stylization through optimizing discrete filters.

My personal final conclusion is that I want more of such simple, yet expressive and powerful libraries like JAX. The less boilerplate the better! 🙂

Posted in Code / Graphics | Tagged , , , , , , , , , , , | 9 Comments

Separate your filters! Separability, SVD and low-rank approximation of 2D image processing filters

In this blog post, I explore concepts around separable convolutional image filters: how can we check if a 2D filter (like convolution, blur, sharpening, feature detector) is separable, and how to compute separable approximations to any arbitrary 2D filter represented in a numerical / matrix form. I’m not covering any genuinely new research, but think it’s a really cool, fun, visual, interesting, and very practical topic, while being mostly unknown in the computer graphics community.

Throughout this post, will explore some basic use of Singular Value Decomposition, one of the most common and powerful linear algebra concepts (used in all sub-fields of computer science, from said image processing, through recommendation systems, ML, to ranking web pages in search engines). We will see how it can be used to analyze 2D image processing filters – check if they are separable, approximate the non-separable ones, and will demonstrate how to use those for separable, juicy bokeh.

Note: this post turned out to be a part one of a mini-series, and has a follow-up post here – be sure to check it out!

Post update: Using separable filters for bokeh approxmation is not a new idea – Olli Niemitalo pointed out this paper “Fast Bokeh effects using low-rank linear filters” to me, which doesn’t necessarily feature any more details on the technique, but has some valuable timings/performance/quality comparisons to the stochastic sampling, if you are interested in using it in practice, check it out – along my follow-up post.

Intro / motivation

Separable filters are one of the most useful tools in image processing and they can turn algorithms from “theoretical and too expensive” to practical under the same computational constraints. In some other cases, ability to use a separable filter can be the tipping point that makes some “interactive” (or offline) technique real-time instead.

Let’s say we want to filter an image – sharpen it, blur, maybe detect the edges or other features. Given a 2D image filter of size MxN, computing the filter would require MxN independent, sequential memory accesses (often called them “taps”), accompanied by MxN multiply-add operations. For large filters, this can get easily prohibitively expensive, and we get quadratic scaling with the filter spatial extent… This is where separable filters can come to the rescue.

If a filter is separable, we can decompose such filter into a sequence of two 1D filters in different directions (usually horizontal, and then vertical). Each pass filters with a 1D filter, first with M, and then the second pass with N taps, in total M+N operations. This requires storing the intermediate results – either in memory, or locally (line buffering, tiled local memory optimizations). While you pay the cost of storing the intermediate results and synchronizing the passes, you get linear and not quadratic scaling. Therefore, typically for any filter sizes larger than ~4×4 (depends on the hardware, implementation etc) using separable filters is going to be significantly faster than the naive, non-separable approach.

But how do we find separable filters?

Simplest analytical filters

Two simplest filters that are bread and butter of any image processing are the box filter, and the Gaussian filter. Both of them are separable, but why?

The box filter seems relatively straightforward – there are two ways of thinking about it: a) if you intuitively think about what happens if you just “smear” a value across horizontal, and then vertical axis, you will get a box; b) mathematically, there are two conditions on the value of the filter, there are two conditions on both dimensions that are independent:

f(x, y) = select(abs(x) < extent_x && abs(y) < extent_y), v, 0)
f(x, y) = select(abs(x) < extent_x, v, 0) * select(abs(y) < extent_y, 1, 0) 

This last product is the equation of separability – our 2D function is a product of a horizontal, and a vertical function.

Gaussian is a more interesting one, and it’s not immediately obvious (“what does the normal distribution have to do with separability?”), so let’s write out the maths.

For simplicity, I will use a non-normalized Gaussian function (usually we need to re-normalize after the discretization anyway) and assume standard deviation of 1/sqrt(2) so that it “disappears” from the denominator.

f(x, y) = exp(-sqrt(x^2 + y^2)^2) // sqrt(x^2 + y^2) is distance of filtered pixel
f(x, y) = exp(-(x^2 + y^2))       // from the filter center.
f(x, y) = exp(-x^2 + -y^2)
f(x, y) = exp(-x^2) * exp(-y^2)

The last step follows from the “definition” of exponential (exponential of sum is a product of exponentials), and afterwards we can see that again, our 2D filter is product of two 1D Gaussian filters.

Gaussian filter matrix

“Numerical” / arbitrary filters

We started with two trivial, analytical filters, but what do we do if we have an arbitrary filter given to us in a numerical form – just a matrix full of numbers?

We can plot them, analyze them, but how do we check if given filter is separable? Can we try to separate (maybe at least approximately) any filter?

Can you make this filter separable? Spoiler: yes, it’s just the Gaussian above, but how do we tell?

Linear algebra to the rescue

Let’s rephrase our problem in terms of linear algebra. Given a matrix MxN, we would like to express it as a product of two matrices, Mx1 and 1xN. Let’s re-draw our initial diagram:

There are a few ways of looking at this problem (they are roughly equivalent – but writing them all out should help build some intuition for it):

  • We would like to find a row Mx1 that can be replicated with different multipliers N times giving us original matrix,
  • Another way of saying the above is that we would like our MxN filter matrix rows/columns to be linearly dependent,
  • We would like our source matrix to be a matrix with rank of one,
  • If it’s impossible, then we would like to get a low-rank approximation of the matrix MxN.

To solve this problem, we will use one of the most common and useful linear algebra tools, Singular Value Decomposition.

Proving that it gives “good” results (optimal in least-squarest sense, at least for some norms) is beyond the scope of this post, but if you’re curious, check out the proofs on the wikipedia page.

Singular Value Decomposition

Specifically, the singular value decomposition of an m\times n real or complex matrix \mathbf {M}  is a factorization of the form {\displaystyle \mathbf {U\Sigma V^{*}} }, where \mathbf {U}  is an m\times m real or complex unitary matrix,\mathbf{\Sigma} is an m\times nrectangular diagonal matrix with non-negative real numbers on the diagonal, and \mathbf {V}  is an n\times n real or complex unitary matrix. If \mathbf {M}  is real, \mathbf {U}  and {\displaystyle \mathbf {V} =\mathbf {V^{*}} } are real orthonormal matrices.

Wikipedia

How does SVD help us here? We can look at the final matrix as a sum of many matrix multiplications, each of a single row/column from the U and V matrix, multiplied by the diagonal entry from the matrix E. SVD diagonal matrix E contains entries (called “singular values”) on the diagonal only and they are sorted from the highest value, to the lowest. If we look at the final matrix as a sum of different rank 1 matrices, weighted by their singular values, the singular values correspond to proportion of the “energy”/”information” of the original matrix contents.

SVD decomposes any matrix into three matrices. The middle one is purely diagonal one, and products of the values of the diagonal with columns/rows of the other matrices create multiple rank1 matrices. If we add all of them together, we are going to get our source matrix back.

If our original matrix is separable, then it is rank 1, and we will have only single singular value. Even if they are not zero (but significantly smaller), and we truncate all coefficients except for the first one, we will get a separable approximation of the original filter matrix!

Note: SVD is a concept and type of matrix factorization / decomposition, not an algorithm to compute one. Computing SVD efficiently and numerically stable is generally difficult (requires lots of careful consideration) and I am not going to cover it – but almost every linear algebra package or library has it implemented, and I assume we just use one.

Simple, separable examples

Let’s start with our two simple examples that we know that are separable – box and Gaussian filter. We will be using Python and numpy / matplotlib. This is just a warm-up, so feel free to skip those two trivial cases and jump to the next section.

import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage
plt.style.use('seaborn-whitegrid')

x, y = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
box = np.array(np.logical_and(np.abs(x) < 0.7, np.abs(y) < 0.7),dtype='float64') 
gauss = np.exp(-5 * (x * x + y * y))
plt.matshow(np.hstack((gauss, box)), cmap='plasma')
plt.colorbar()

Let’s see what SVD will do with the box filter matrix and print out all the singular values, and and the U and E row / column associated with the first singular value:

U, E, V = np.linalg.svd(box)
print(E)
print(U[:,0])
print(V[0,:])
[34.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00]
[0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00]
[0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
 -0.17 -0.17 -0.17 -0.17 -0.17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00]

Great, we get only one singular value – which means that it is a rank 1 matrix and fully separable.

On the other hand, the fact that our columns / vectors are negative is a bit peculiar (it is implementation dependent), but the signs negate each other; so we can simply multiply both of them by -1. We can also get rid of the singular value and embed it into our filters, for example bymultiplying both by sqrt of it to make them comparable in value:

[-0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00]
[-0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 1.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00]

Apart from cute -0.00s, we get exactly the separable filter we would expect. We can do the same with the Gaussian filter matrix and arrive at the following:

[0.01 0.01 0.01 0.02 0.03 0.04 0.06 0.08 0.10 0.14 0.17 0.22 0.27 0.33
 0.40 0.47 0.55 0.63 0.70 0.78 0.84 0.90 0.95 0.98 1.00 1.00 0.98 0.95
 0.90 0.84 0.78 0.70 0.63 0.55 0.47 0.40 0.33 0.27 0.22 0.17 0.14 0.10
 0.08 0.06 0.04 0.03 0.02 0.01 0.01 0.01]
[0.01 0.01 0.01 0.02 0.03 0.04 0.06 0.08 0.10 0.14 0.17 0.22 0.27 0.33
 0.40 0.47 0.55 0.63 0.70 0.78 0.84 0.90 0.95 0.98 1.00 1.00 0.98 0.95
 0.90 0.84 0.78 0.70 0.63 0.55 0.47 0.40 0.33 0.27 0.22 0.17 0.14 0.10
 0.08 0.06 0.04 0.03 0.02 0.01 0.01 0.01]

Now if we take an outer product (using function np.outer()) of those two vectors, we arrive at filters almost exactly same as the original (+/- the numerical imprecision differences).

Gaussian and its rank1 approximation – unsurprisingly they look exactly the same.

Non-separable filters

“D’oh, you took filters that you know are separable, fed through some complicated linear algebra machinery and verified that yes, they are separable – what a waste of time!” you might think – and you are right, it was a demonstration of the concept, but has so far not too much practical value. So let’s look at a few more interesting filters – ones that we know are not separable, but maybe are almost separable?

I picked four examples: circular, hexagon, exponential/Laplace distribution, and a difference of Gaussians. Circular and hexagon are interesting as they are very useful for bokeh; exponential/Laplace is somewhat similar to long-tailed BRDF like GGX, and can be used for some “glare”/veil type of effects, and the difference of Gaussians (laplacian of Gaussians) is both a commonly used band-pass filter, as well as can be used for a busy, “donut”-type bokeh.

circle = np.array(x*x+y*y < 0.8, dtype='float64')
exponential = np.exp(-3 * np.sqrt(x*x+y*y))

hexagon = np.array(np.logical_and(np.logical_and(np.logical_and(np.logical_and(np.logical_and(np.logical_and(np.logical_and(x+y < 0.9*np.sqrt(2), x+y > -0.9*np.sqrt(2)),x-y < 0.9*np.sqrt(2)),-x+y < 0.9*np.sqrt(2)),x<0.9),x>-0.9),y<0.9),y>-0.9), dtype='float64')

difference_of_gaussians = np.exp(-5 * (x*x+y*y))-np.exp(-6 * (x*x+y*y))
difference_of_gaussians /= np.max(difference_of_gaussians)

plt.matshow(np.vstack((np.hstack((circle, hexagon)), np.hstack((exponential, difference_of_gaussians)))), cmap='plasma')
plt.colorbar()

Now, if we try to do a rank 1 approximation by taking just the first singular value, we end up with results looking like this:

They look terrible, nothing like the original ones! Ok, so now we know they are not separable, which concludes my blog post…

I’m obviously kidding, let’s explore how “good” are those approximations and how to make them better. 🙂

Low~ish rank approximations

I mentioned that singular values represent percentage of “energy” or “information” in the original filter. If the filter is separable, all energy is in the first singular value. If it’s not, then it will be distributed among multiple singular values.

We can plot the magnitude of singular values (I normalized all the plots so they integrate to 1 like a density function and don’t depend on how bright a filter is) to see what is going on. Here are two plots: of the singular values, and “cumulative” (reaching 1 means 100% of the original filter preserved). There are 50 singular values in a 50×50 matrix, but after some point they reach zero, so I truncated the plots.

The second plot is very informative and tells us what we can expect from those low rank approximations.

To reconstruct fully the original filter, for difference of Gaussians we need just two rank 1 matrices (this is not surprising at all! DoG is difference of two Gaussians, separable / rank 1 filters), for Laplacian we get very close with just 2-3 rank 1 matrices. On the other hand, for circular and hexagonal filter we need many more, to reach just 75% accuracy we need 3-4 filters, and 14 to get 100% accurate representation…

How do such higher rank approximations look like? You can think of them as sum of N independent separable filters (independent vertical + horizontal passes). In practice, you would probably implement them as a single horizontal pass that produces N outputs, followed by a vertical one that consumes those, multiplies them with appropriate vertical filters, and produces a single output image.

Note 1: In general and in most cases, the “sharper” (like the immediate 0-1 transition of hex or a circle) or less symmetric a filter is, the more difficult it will be to approximate it. In a way, it is similar to other common decomposition, like the Fourier transform.

Note 2: You might wonder – if a DoG can be expressed as a sum of two trivially separable Gaussians, why SVD rank 1 approximation doesn’t look like one of the Gaussians? The answer is that with just a single Gaussian, the resulting filter will be more different from the source than our separable weird 4-dot pattern. SVD decomposes so that each progressive low rank decomposition is optimal (in L2 sense), in this case it means adding two non-intuitive, “weird” separable patterns together, where the second one contributes 4x less than the first one.

Low-rank approximations in practice

Rank N approximation of an MxM filter would have performance cost of O(2M*N), but additional memory cost of N * original image storage (if not doing any common optimizations like line buffering or local/tiled storage). If we need to approximate 50×50 filter, then even taking 10 components could be “theoretically” worth it; and for much larger filters it would be also worth in practice (and still faster and less problematic than FFT based approaches). One thing worth noting is that in the case of presented filters, we reach the 100% exact accuracy with just 14 components (as opposed to full 50 singular values!).

But let’s look at more practical, 1, 2, 3, 4 component approximations.

def low_rank_approx(m, rank = 1):
  U,E,V = np.linalg.svd(m)
  mn = np.zeros_like(m)
  score = 0.0
  for i in range(rank):
    mn += E[i] * np.outer(U[:,i], V[i,:])
    score += E[i]
  print('Approximation percentage, ', score / np.sum(E))
  return mn

Even with those maximum four components, we see a clear “diminishing returns” effect, and that the differences get smaller between 3 and 4 components as compared to 2 and 3. This is not surprising, since our singular values are sorted descending, so every next approximation will add less of information. Looking at the results, they might be actually not too bad. We see some “blockiness”, and even worse negative values (“ringing”), but how would that perform on real images?

Visual effect of low-rank approximation – boookeh!

To demonstrate how good/bad this works in practice, I will use filters to blur an image (from the Kodak dataset). On LDR images the effects are not very visible, so I will convert it to fake HDR by applying a strong gamma function of 7, doing the filtering, and inverse gamma 1/7 to emulate tonemapping. It’s not the same as HDR, but does the job. 🙂 This will also greatly amplify any of the filter problems/artifacts, which is what we want for the analysis – with approximations we often want to understand not just the average, but also the worst case.

Left: source image, middle: image blurred with a circular kernel and no gamma, right: the same image blurred with the same kernel, but in fake-HDR space to emphasize the filter shape.
def fake_tonemap(x):
  return np.power(np.clip(x, 0, 1), 1.0/7.0)
def fake_inv_tonemap(x):
  return np.power(np.clip(x, 0, 1), 7.0)
def filter(img, x):
  res = np.zeros_like(img)
  for c in range(3):
    # Note that we normalize the filter to preserve the image brightness.
    res[:,:,c] = fake_tonemap(scipy.ndimage.convolve(fake_inv_tonemap(img[:,:,c]), x/np.sum(x)))
  return res
def low_rank_filter(img, x, rank = 1):
  final_res = np.zeros_like(img)
  U, E, V = np.linalg.svd(x / np.sum(x))
  for c in range(3):
    tmd = fake_inv_tonemap(img[:,:,c])
    for i in range(rank):
      final_res[:,:,c] += E[i] * scipy.ndimage.convolve(scipy.ndimage.convolve(tmd, U[:, i, np.newaxis]), np.transpose(V[i,:, np.newaxis]))
    final_res[:,:,c] = fake_tonemap(final_res[:,:,c])
  return final_res

Without further ado, here are the results.

Unsurprisingly, for the difference of Gaussians and exponential distribution, two components are enough to get identical filtered result, or very close to.

For the hexagon and circle, even 4 components show some of the negative “ringing” artifacts (bottom of the image), and one needs 8 to get artifact-free approximation. At the same time, we are using extreme gamma to show those shapes and emphasize the artifacts. Would this matter in practice and would such approximations be useful? As usually, it depends on the context of the filter, image content, further processing etc. I will put together some thoughts / recommendations in the Conclusions section.

A careful reader might ask – why not clamp filters to not be negative? The reason is that low rank approximations will often locally “overshoot” and add more energy in certain regions, so the next rank approximation component needs to remove it. And since we do it separably and accumulate the filtered result (the whole point of low rank approximation!) we don’t know if the sum will be locally negative or not. We could try to find some different, guaranteed non-negative decomposition, but it’s a NP-hard problem and only some approximations / heuristics / optimization-based solutions exist. Note: I actually wrote a blog post on using optimization to solve (or at least minimize) those issues as a follow-up, be sure to check it out! 🙂

Limitation – only horizontal/vertical separability

Before I conclude my post, I wanted to touch two related topics – one being a disadvantage of the SVD-based decomposition, the other one being a beneficial side-effect. Starting with the disadvantage – low-rank approximation in this context is limited to filters represented as matrices and strictly horizontal-vertical separability. This is very limited, as some real, practical, and important applications are not covered by it.

Let’s analyze two use-cases – oriented, orthogonal separability, and non-orthogonal separability. Anisotropic Gaussians are trivially separable with two orthogonal filter – along the principal direction, and perpendicular to it. Similarly, skewed box blur filters are separable, but in this case we need to blur in two non-orthogonal directions. (Note: this separability is the basis of classic technique of fast hexagonal blur developed by John White and Colin Barre-Brisebois).

Such filtering can be done super-fast on the GPU, often almost as fast as the horizontal-vertical separation (depending if using texture sampler or compute shaders / local memory), yet the SVD factorization will not be able to find it… So let’s look at how the low rank horizontal-vertical approximations and accumulation of singular values look like.

From left to right: original, rank1, rank2, rank4, rank8.

Overall those approximations are pretty bad if you consider that they are trivially separable in those different direction. You could significantly improve them by finding the filter covariance principal components and rotate it to be axis aligned, but this is an approximation, requires filter resampling and complicates the whole process… Something to always keep in mind – as automatic linear algebra machinery (or any ML) is not a replacement for careful engineering and using the domain knowledge. 🙂

Bonus – filter denoising

My post is already a bit too long, so I will only hint at an interesting effect / advantage of SFV – how SVD is mostly robust to noise, and can be used for the filter denoising. If we take a pretty noisy Gaussian (Gaussian maximum of 1.0, noise standard deviation of 0.1), and do a simple rank1 decomposition, we get a significantly cleaner representation of it (albeit with horizontal and vertical “streaks”).

This approximation denoising efficacy will depend on the filter type; for example if we take our original circular filter (which is as we saw pretty difficult to approximate), the results are not going to be so great, as more components that would normally add more details, will add more noise back as well.

Approximations of noisy circular filter – original, rank1, rank2, rank4, rank8. The more components we add, the more noise comes back.

Therefore, there exists a point at which higher rank approximations will only add noise – as compared to adding back information/details. This is closely related to the bias-variance trade-off concept in machine learning, and we can see this behavior on the following plot showing normalized sum of squared error of the noisy circular filter next rank approximations (error is computed against ground truth, of non-noisy filter).

When we add next ranks of approximation to the noisy filter, we initially observe the error is decreasing, but then it starts to increase, as we don’t add more information and original structure, but only noise.

It is interesting to see on this plot that the cutoff point in this case is around rank 12-13 – while 14 components were enough to perfectly reconstruct the non-noisy version of the filter.

Left: clear circular filter, middle: circular filter with noise added, right: rank 12 approximation of the noisy circular filter – notice perfect circle shape while significantly less noise than the middle one!

Overall this property of SVD is used in some denoising algorithms (however decomposed matrices are not 2D patches of an image; instead they usually are MxN matrices where M is a flattened image patch, representing all pixels in it as a vector, and N corresponds to N different, often neighboring or similar image patches), which is a super interesting technique, but way outside of the topic of this post.

Conclusions

To conclude, in this blog post we have looked at image filter/kernel separability – its computational benefits, simple and analytical examples, and shown a way of analyzing if a filter can be converted into a separable (rank 1) representation through Singular Value Decomposition.

As we encountered some examples of non-separable filters, we analyzed what are the higher rank approximations, how we can compute them through sum of many separable filters, how do they look like, and if we can apply them for image filtering using an example of bokeh.

I love the scientific and analytical approach, but I’m also an engineer at heart, so I cannot leave such blog post without answering – is it practical? When would someone want to use it?

My take on that is – this is just another tool in your toolbox, but has some immediate, practical uses. Those include analyzing and understanding your data, checking if some non-trivial filters can be separable, if they can be approximated despite the noise / numerical imprecision, and finally – evaluate how good the approximation is.

For non-separable filters, low rank approximations where the rank is higher than one can often be pretty good with just a few components and I think they are practical and usable. For example, our rank 4 approximation of the bokeh filters was not perfect (artifacts), but computationally way cheaper than non-separable filter, it parallelizes perfectly/trivially, and is very simple implementation-wise. In the past I referenced a very smart solution from Olli Niemitalo to approximate circular bokeh using complex phasors – and low-rank approximation in the real domain covered in this blog post is some simpler interesting alternative. Similarly, it might not produce perfect hexagonal blur like the one constructed using multiple skewed box filters, but the cost would be similar, and the implementation is so much simpler that it’s worth giving it a try.

I hope that this post was inspiring for some future explorations, understanding of the problem of separability, and you will be able to find some interesting, practical use-cases!

Posted in Code / Graphics | Tagged , , , , , , , , , , , | 9 Comments

Analyze your own activity data using Google Takeout – music listening stats example

The goal of this post is to show how to download our own data stored and used by internet services to generate personalized stats / charts like below and will show step-by-step how to do it using colab, Python, pandas, and matplotlib.

Disclaimer: I wrote this post as a private person, not representing my employer in any way – and from a 100% outside perspective.

Intro / motivation

When we decide to use one particular internet service over the other, it is based on many factors and compromises – we might love our choices, but still find ourselves missing certain features.

Seeing “year in review” from Spotify shared by friends made me envious. Google Play Music that I personally use doesn’t have such a cool feature and didn’t give me an opportunity to ponder and reflect on the music I listened to in the past year (music is a very important part of my life).

Luckily, the internet is changing and users, governments, as well as internet giants are becoming more and more aware of what it means for a user to “own their data”, request it, and there is a need for increased transparency (and control/agency over data we generate while using those servies). Thanks to those changes (materializing themselves in improved company terms of service, and regulatory acts like GDPR or California Consumer Privacy Act), on most websites and services we can request or simply download the data that those collect to operate.

Google offers such a possibility in a very convenient form (just a few clicks) and I used my own music playing activity data together with Python, pyplot, and pandas to create a personalized “year in review” chart(s) and given how cool and powerful, yet easy it is, decided to share it in a short blog post.

This blog post comes with code in form of a colab and you should be able to directly analyze your own data there, but more importantly I hope it will inspire you to check out what kind of other user data your might have a creative use for!

Downloading the data

The first step is trivial – just visit takeout.google.com.

Gotcha no 1: Amount of different Google services is impressive (if not a bit overwhelming) and it took me a while to realize that the data I am looking for is not under “Google Play Music“, but under “My Activity”. Therefore I suggest to “Deselect all”, and then select “My Activity”. Under it, click on “All activity data included” and filter to just Google Play Music “My Activity” content:

The nextstep here would be to change the data format from HTML to just JSON – which is super easy to load and parse with any data analysis frameworks or environments.

Finally, proceed to download your data as “one-time archive” and a format that works well for you (I picked a zip archive).

I got a “scary” message that it can take up to a few days to get a download link to this archive, but then got it in less than a minute. YMMV, I suspect that it depends on how much data you are requesting at given time.

Downloaded data

Inside the downloaded archive, my data was in the following file:

Takeout\My Activity\Google Play Music\MyActivity.json

Having a quick look at this JSON file, looks like it contains all I need! For every activity/interaction with GPM, there is an entry with title of the activity (e.g. listened to, searched for), description (which is in fact artist name), time, as well as some metadata:

One thing that is interesting is that “subtitles” contain information about “activity” that GPM associated with given time, temperature, and weather outside. This kind of data was introduced between 2019-03-21 (is not included my entries on that date or before) and 2019-03-23 (the first entry with such information). The activity isn’t super accurate, most of them are “Leisure” for me, often even when I was at work or commuting, but some are quite interesting, e.g. it often got “Shopping” right. Anyway, it’s more of a curiosity and I haven’t found great use for it myself, but maybe you will. 🙂

Gotcha no 2: all times described there seem to be in GMT – at least for me. Therefore if you listen to music in any different timezone, you will have to convert it. I simplified my analysis to use pacific time. This results in some incorrect data – e.g. I spent a month this year in Europe and music that I listened to during that time will be marked incorrectly as during some “weird hours”.

Analyzing the data

Ok, finally it’s time for something more interesting – doing data analysis on the downloaded data. I am going to use Google Colaboratory, but feel free to use local Jupyter notebooks or even commandline Python.

Code I used is available here.

Note: this is the first time I used pandas, so probably some pandas experts and data scientists will cringe at my code, but hey, it gets the job done. 🙂

The first step to analyze a local file with a colab is to upload it to the runtime using the tab on the left of the colab window – see the screenshot below. I have uploaded just the MyActivity.json without any of the folder structure. If you use local scripts, this step is unnecessary. as you can specify full local path

Gotcha no 3: Note that uploading files doesn’t associate them with your notebook, account used to open colab or anything like that – just with current session in the current runtime, so anytime you restart your session, you will have to reupload the file(s).

Time to code! We will start with loading some Python libraries and set the style of plots to a bit prettier:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

Then we can read the JSON file as Pandas dataframe – I think of it as of table in SQL, which is probably oversimplification, but worked for my simple use-case. This step worked for me right away and required no data preprocessing:

df = pd.read_json('MyActivity.json')
print(df.columns)
Index(['header', 'title', 'subtitles', 'description', 'time', 'products',
       'titleUrl'],
      dtype='object')
                         title description                      time
0  Listened to Jynweythek Ylow  Aphex Twin  2019-12-31T23:33:55.865Z
1       Listened to Untitled 3  Aphex Twin  2019-12-31T23:26:23.465Z
2                Listened to 4  Aphex Twin  2019-12-31T23:22:46.829Z
3               Listened to #3  Aphex Twin  2019-12-31T23:15:02.427Z
4           Listened to Nanou2  Aphex Twin  2019-12-31T23:11:36.944Z
5  Listened to Alberto Balsalm  Aphex Twin  2019-12-31T23:06:25.999Z
6             Listened to Xtal  Aphex Twin  2019-12-31T23:01:32.538Z
7       Listened to Avril 14th  Aphex Twin  2019-12-31T22:59:27.321Z
8        Listened to Fingerbib  Aphex Twin  2019-12-31T22:55:37.856Z
9              Listened to #17  Aphex Twin  2019-12-31T22:53:32.866Z

This is great! Now a bit of clean up. We want to apply the following:

  • For easier further coding, rename the unhelpful “description” to “artist”,
  • Convert the string containing date and time to pandas datetime for easier operations and filtering,
  • Convert the timezone to the one you want to use as a reference – I used single one (US Pacific Time), but nothing prevents you from using different ones per different entries, e.g. based on your locations if you happen to have them e.g. extracted from other internet services,
  • Filter only “Listened to x/y/z” activities, as I interested in analyzing what I was listening to, and not e.g. searching for.
df.rename(columns = {'description':'artist'}, inplace=True)
df['time'] = df['time'].apply(pd.to_datetime)
df['time'] = df['time'].dt.tz_convert('America/Los_Angeles')
listened_activities = df.loc[df['title'].str.startswith('Listened to')]

Now we have a dataframe containing information about all of our listening history. We can do quite a lot with it, but I didn’t care for specific “tracks” (analyzing this could be useful to someone), so instead I went to a) filter only activities that happened in 2019, and b) group all events of listening to a single artist.

For the step b), let’s start with just annual totals. We group all activities and use size of each group as a new column (total play count).

listened_activities_2019 = listened_activities.loc[(listened_activities['time'] >= '2019-01-01') & (listened_activities['time'] < '2020-01-01')]
artist_totals = listened_activities_2019.groupby('Artist').size().reset_index(name='Play counts')
top = artist_totals.sort_values('Play counts', ascending=False).set_index('Artist')[0:18]
print(top)
                       Play counts
Artist                            
The Chemical Brothers          151
Rancid                         144
AFI                             98
blink-182                       89
Show Me The Body                69
The Libertines                  68
Ladytron                        64
Flying Lotus                    64
Boards Of Canada                62
Operation Ivy                   62
808 State                       62
Punish Yourself                 61
Lone                            60
Mest                            58
Refused                         54
Aphex Twin                      54
Dorian Electra                  53
Martyn                          52

This is exactly what I needed. It might not be surprising (maybe a little bit, I expected to see some other artists there – but they were soon after – I am a “long tail” type of listener), but cool and useful nevertheless.

Now let’s plot it with matplotlib. I am not pasting all the code here as there is a bit of boilerplate to make plots pretties, so feel free to check the colab directly. Here is the resulting plot:

From this starting point, let your own creativity and imagination be your guide. Some ideas that I tried and had fun with:

  • hourly or weekly listening activity histograms,
  • monthly listening totals / histograms,
  • checking whether I tend to listen to different music in the evening versus the morning (I do! Very different genres to wake up vs unwind 🙂 ),
  • weekly time distributions.

All of those operations are very simple and mostly intuitive in Python + pandas – you can extract things like hour, day of the week, month etc. directly from the datetime column and then use simple combinations of conditional statements and/or groupings.

Some of those queries/ideas that I ran for my data are aggregated in a single plot below:

Summary

Downloading and analyzing my own data for Google Play Music was really easy and super fun. 🙂

I have to praise here all of the following:

  • User friendly and simple Google data / activity downloading process, and its data produced in a very readable/parseable format,
  • Google Colab being an immediate, interactive environment that makes it very easy to run code on any platform and share it easily,
  • Pandas as data analysis framework – I was skeptical and initially would have preferred something SQL-like, but it was really easy to learn for such a basic use case (with help of the documentation and Stackoverflow 🙂 ),
  • As always, how powerful numpy and matplotlib (despite its quite ugly looking defaults and tricks required to do simple things like centering entries on label “ticks”) are – but those are already kind of industry standards, so won’t add here more compliments.

I hope this blog post inspired you to download and tinker with your own activity data. Hopefully this will be not just fun, but also informative to understand what is stored – and you can even judge what do you think is really needed for those services to run well. The activity data I downloaded from Google Play Music looks like a reasonable, bare minimum that is useful to drive the recommendations.

To conclude, I think that anyone with even the most basic coding skills can have tons of fun analyzing their own activity data – combining information to create personally meaningful stats, exploring own trends and habits, extending functionalities, or adding (though only for yourself and offline) completely new features for commercial services written by large teams.

Posted in Code / Graphics | Tagged , , , , , , | Leave a comment