In this shorter post, I will describe a 2X downsampling filter that I propose as a “safe default” for GPU image processing. It’s been an omission on my side that I have not proposed any specific filter despite writing so much about the topic and desired properties. 🙂
I originally didn’t plan to write this post, but I was surprised by the warm response and popularity of me posting a shadertoy with this filter.
I also got asked some questions about derivation, so this post will focus on it.
TLDR: I propose a 8 bilinear tap downsampling filter that is optimized for:
- Good anti-aliasing properties,
- Sharper preservation of mid-frequencies than bilinear,
- Computational cost and performance – only 8 taps on the GPU.
I believe that given its properties, you should use this filter (or something similar) anytime you can afford 8 samples when downsampling images by 2x on a GPU.
This filter behaves way better than a bilinear one under motion, aliases less, is sharper, and is not that computationally heavy.
You can find the filter shadertoy here.
Here is a simple synthetic data gif comparison:
If for some reason you cannot open the shadertoy, here is a video recording:
In this post, I will assume you are familiar with the goals of a downsampling filter.
I highly recommend at least skimming two posts of mine on this topic if you haven’t, or as a refresher:
I will assume we are using “GPU coordinates”, so a “half texel offset” and pixel grids aligned with pixel corners (and not centers) – which means that the described filter is going to be even-sized.
Perfect 1D 8 tap filter
A “perfect” downsampling filter would preserve all frequencies below Nyquist with a perfectly flat, unity response, and remove all frequencies above Nyquist. Note that a perfect filter from the signal processing POV is both unrealistic, and potentially undesirable (ringing, infinite spatial support). There are different schools of designing lowpass filters (depending on the desired stop band or pass band, transition zone, ripple etc), but let’s start with a simple 1D filter that is optimal in the simple least squares sense as compared to a perfect downsampler.
How do we do that? We can either solve a linear least squares system of equations on the desired frequency response (from a sampled Discrete-Time Frequency Response of a filter), or do a gradient descent using a cost function. I did the latter due to its simplicity in Jax (see a post of mine for example optimizing separable filters or about blue noise dithering through optimization).
If we do this for 8 samples, we end up with a frequency response like this:
We can see that it is both almost always sharper than bilinear under Nyquist, as well as removes much less of the high frequencies. There is a “ripple” and some frequencies will be boosted / sharpened, but in general it’s ok for downsampling. Such ripples are problematic for continuous resampling filters – can cause some image frequencies to “explode”, but even creating 10 mip maps, we shouldn’t get any problems – as each levels “boost” will translate to different original frequencies, so they won’t accumulate.
Moving to 2D
Let’s take just an outer product of 1D downsampler and produce an initial 2D downsampler:
It is pretty good (other than slight sharpening), but we wouldn’t want to use a 8×8, so 64 tap filter (it might be ok for “separable” downsampling, which is often a good idea on CPU or DSP, but not necessarily on the GPU).
So let’s have a look at the filter weights:
We can see that many corner weights have magnitudes close to 0.0. Here are all the weights with magnitude below 0.02:
Why 0.02? This is just an ad-hoc number, magnitude ~10x lower than the highest weights. But when setting such a threshold, we can remove half of the filter taps! There is another reason I have selected those samples – bilinear sampling optimization. But we will come back to it in a second.
So we can just get rid of those weights? If we zero them out and normalize the filter, we get the following response:
Some of the overshoots have disappeared! This is nice. What is not great though, is that we started to let through a bit more frequencies on diagonals – which kind of makes sense given the removal of diagonal weights. We could address that a bit by including more samples, but this would increase the total filter cost to 12 bilinear taps instead of 8 – for a minor quality difference.
But can slightly improve / optimize this filter further. We can either directly do a least-squares solve to minimize the squared error compared to a perfect downsampler, or just do a gradient descent towards it (while keeping the zeroed coefficients zero). The difference is not huge, only a few percent of difference on least squares error, but there is no reason not to do it:
Now time for the ultimate performance optimization trick (as 32 tap filter would be prohibitively expensive for the majority of applications).
Bilinear tap optimization
Let’s have a look at our filter:
What is very convenient about this filter is having groups of 2×2 pixels that have the same sign:
When we have a group of 2×2 pixels with the same sign, we can try to approximate them with a single bilinear tap! This is a super common “trick” in GPU image processing. In a way, anytime you take a single tap 2×2 box filter, you are using this “trick” – but it’s common to explicitly approximate e.g. Gaussians this way.
If we take a sample that is offset by dx, dy and has a weight w, we end up with effective weights:
(1-dx)*(1-dy)*w, dx*(1-dy)*w, (1-dx)*dy*w, dx*dy*w.
Given initial weights a, b, c, d, we can optimize our coefficients dx, dy, w to minimize the error. Why will there be an error? Because we have one less degree of freedom in the (dx, dy, w) as compared to the (a, b, c, d).
This is either a simple convex non-linear optimization, or… a low rank approximation! Yes, this is a cool and surprising connection – we are basically trying to make a separable, rank 1 matrix. 2×2 matrix is at most rank 2, so rank 1 approximation in many cases will be good. This is worth thinking about – which 2×2 matrices will be well approximated? Something like [[0.25, 0.25], [0.25, 0.25]] is expressed exactly with just a single bilinear tap, while [[1, 0], [0, 1]] will yield a pretty bad error.
Let’s just use a simple numpy non-linear optimize:
def maximize_4tap(f): def effective_c(x): dx, dy, w = np.clip(x, 0, 1), np.clip(x, 0, 1), x return w * np.array([[(1-dx)*(1-dy), (1-dx)*dy], [dx*(1-dy), dx*dy]]) def loss(x): return np.sum(np.square(effective_c(x) - f)) res = scipy.optimize.minimize(loss, [0.5, 0.5, np.sum(f)]) print(f, effective_c(res['x'])) return res['x']
Note that this method can work also when some of the weights are of differing sign – but the approximation will not be very good. Anyway, in the case of the described filter, the approximation is close to perfect. 🙂
We arrive at the final result:
vec3 col = vec3(0.0); col += 0.37487566 * texture(iChannel0, uv + vec2(-0.75777,-0.75777)*invPixelSize).xyz; col += 0.37487566 * texture(iChannel0, uv + vec2(0.75777,-0.75777)*invPixelSize).xyz; col += 0.37487566 * texture(iChannel0, uv + vec2(0.75777,0.75777)*invPixelSize).xyz; col += 0.37487566 * texture(iChannel0, uv + vec2(-0.75777,0.75777)*invPixelSize).xyz; col += -0.12487566 * texture(iChannel0, uv + vec2(-2.907,0.0)*invPixelSize).xyz; col += -0.12487566 * texture(iChannel0, uv + vec2(2.907,0.0)*invPixelSize).xyz; col += -0.12487566 * texture(iChannel0, uv + vec2(0.0,-2.907)*invPixelSize).xyz; col += -0.12487566 * texture(iChannel0, uv + vec2(0.0,2.907)*invPixelSize).xyz;
Only 8 bilinear samples! Important note: I assume here that UV is centered between 4 high resolution pixels – this is the UV coordinate you would use for bilinear / box interpolation.
This was a short post – mostly describing some derivation steps. But from a practical point of view, I highly recommend the proposed filter as a drop-in replacement to things like creating image pyramids for image processing or post-processing effects.
This matters for low resolution processing (for saving performance), or when we want to create a Gaussian-like pyramid (though with Gaussian pyramids we very often use more heavily lowpass filters and don’t care as much about preserving exactly below Nyquist).
If you look at the stability of the proposed filter in motion, you can see it is purely superior to box downsample – and not sacrificing any sharpness.
Bonus – use on mip maps?
I got asked an interesting question on twitter – would I recommend this filter for mip-map creation?
The answer is – it depends. For sure it is better than a bilinear/box filter. If you create mip-maps on the fly / in runtime, like from procedural textures, then the answer is – probably yes. But counter-intuitively, a filter that does some more sharpening or even lets through some aliasing might be actually pretty good and desireable for mipmaps! I wrote how when we downsample images, we should be aware of how we’re going to use them – including upsample or bilinear sampling, and might want to change the frequency response to compensate for it. Mip-mapping is one of those cases where designing a slightly sharpening filter might be good – as we are going to bilinearly upsample the image for trilinear interpolation or display later.