Tech and scientific writing – graphics, diagram and graph creation tools

Few days ago, I asked a question on twitter: https://twitter.com/BartWronsk/status/919618905319997440
“What is industry/academic standard for diagrams in tech writing? PowerPoint/GoogleDocs diagrams? Dedicated tool like Visio/Dia? Procedural?
I usually use first, but far from satisfied…For proc tried SVG,JS,Graphviz,even Mathematica, but results are either ugly or *lots* of work…”

I got 26 replies, thanks everyone who contributed!

To not have it lost and hoping that someone else will find it useful, writing down answers in a few categories with some comments. Please comment on this post if I got anything wrong and/or I omitted something.

Note that there are no real “conclusions”, almost everyone used something else! Probably almost anything on the list will work. But as a rule of thumb, probably it’s useful to have each of:

  • Dedicated vector graphics program (I’ll give a try to Inkscape and Photoshop vector graphics),
  • Favorite scientific environment for data exploration and plotting (Python+Matplotlib, Mathematica, Matlab),
  • Some favorite environment for authoring documents (whether LaTeX, mathcha or just Powerpoint or Google Docs),
  • maybe something interactive for web/blog (shadertoy, D3, Mathbox).

Vector graphics software

Adobe Illustrator, Subscription $19.99/month, Mac/Windows

For many years standard / go-to for vector graphics. At the same time quite pricey subscription model and could be an overkill for simple diagrams?

Affinity Designer, $49.99, Mac/Windows

No subscription alternative to Illustrator. Lots of people recommended it, worth having a look!

Inkscape, Opensource/Free, Linux/Mac/Windows

Totally free and open source. Subjectively UI looks much cleaner than other open source graphics software.

Omnigraffle, $49/$99 iOS, $99/$199 Mac

Recommended by many people, UI looks pretty clean and productive, sadly Mac/iOS only.

Xfig, Opensource/Free, Linux/Mac/Windows

Dormant development, quite ugly interface (IMO). But open source and free.

Dedicated diagram / flow chart software

Microsoft Visio, $299 or $18 subscription, Mac/Windows

Professional industry grade diagram software. Quite expensive…

SmartDraw, Subscription $9.95/month, Mac/Windows/Cloud

Flowchart marker; don’t know if it’s useful for other, more complicated graphs and diagrams.

yEd, Free, Linux/Mac/Windows

Graph and flowchart only (?), free and multiplatform.

Dia, Free, Linux/Mac/Windows

Dormant open-source project. I used it few times and think there are better, more modern tools (?).

Lucidchart, both free and paid options

Don’t know much about it, looks like it’s more about flowcharts

draw.io, Free, cloud/web based

UI looks very similar to Google Docs, very low entry barrier. Again more about just simple, graph-like diagrams / flow charts.

Visual Paradigm, $349, Linux/Mac/Windows

Another dedicated UML-like, visual diagram tool.

Other software

Microsoft Powerpoint $109.99, Mac/Windows

My go-to drawing solutions for any presentations so far. Works great, supports animations, but is painfully limited.

Photoshop and vector graphics there, Subscription $9.99 a month, Mac/Windows

I’ve never used PS vector graphics, but it’s software I already pay for, so could be a good idea to try it.

Mathcha, free, cloud

This one is very interesting – equation and mathematical graphics (simple plots, diagrams) online editor. Looks like way easier entry / access WYSIWYG LaTeX editor.

Google Docs and family, Free, cloud based

Using this quite a lot at work (what a surprise!) for drawing simple diagrams and it works pretty well. Not leaving document is a big plus, just like with Powerpoint.

Grapher, part of MacOS

Dedicated tool embedded in MacOS; surprisingly powerful for something included in every install!

Interactive web based

Shadertoy, free

This could be a surprising choice, but check out e.g. this algorithm visualization from Leonard Ritter. Code + visualization in one place with animations and optional interactivity!

Mathbox, free, Javascript library

Quite amazing math visualization library for JS and web. Amazing results, but it’s quite a heavy tool that could be overkill for small diagrams.

Flot, free, Javascript library

Interactive plotting for jQuery. Don’t know much about it (not using JS as much as I’d like for creating web pages).

D3, free, Javascript library

Beautiful data visualizations – but not really for generic graphs.

Desmos calculator, free, online

Nice semi-interactive and powerful online mathematical function plotting tool. Good to not just visualize, but also explain reasoning behind some curve look.

Procedural languages and libraries

Asymptote, open source

Dedicated vector graphics language.

Markdeep, open source

“Markdeep is a technology for writing plain text documents that will look good in any web browser (…). It supports diagrams, calendars, equations, and other features as extensions of Markdown syntax.”

Limited diagramming capabilities, but all-in-one document creation solution.

Matplotlib, Python, open source

Go-to Python solution for visualizing data. By default not prettiest, but very powerful and “just works”.

LaTeX with PGF/TikZ, Tex, open source

LaTeX extension/language to define graphs and generally, graphics – programmaticly. Would love to learn it some day and become proficient, but even people who recommended it, warned of steep learning curve.

UMLet Free, has also UI

Free tool to create UML diagrams.

Matlab, $2150, Linux/Mac/Windows

The most expensive option of the list – numerical professional mathematical/engineering package with some graphics options among numerous others.

GNU Octave, open source, all platforms

Free and open source alternative to Matlab! Probably not as powerful for most uses, but for graphs and vector graphics should be comparable. Used it at college for computer vision classes and well, it definitely worked fine for most algebraic problems.

Gnuplot, open source, all platforms

Simple command line tool for plots that I used few times at college. For quick plots very convenient.

Mathematica, $320, Linux/Mac/Windows

Affordable mathematical symbolic and numerical analysis and data exploration software. I wrote about it in the past, used it numerous times for plots in many of my blog posts and always recommend. I also played with graph plotting, though found it a little bit frustrating and not customizable/pretty enough by default.

Graphviz, open source, all platforms

One of those “useful but ugly” pieces of software. Good for quickly creating graphs from code, for anything else no great (?).

Svgwrite for Python, open source, all platforms

Python library for writing directly into SVG. Quite low level, but powerful for specific very “procedural” use cases.

GGplot2 for R, open source, all platforms

I don’t know much about R, but my understanding is that is its matplotlib equivalent. If you use R for data exploration, probably a goto solution.

Editing directly PostScript, open source, all platforms

Very low level and hardcore option; maybe easier with outputting ps directly/proceduraly from code?

 

Posted in Code / Graphics | Tagged , , , , | 3 Comments

Separable disk-like depth of field

This is a short note accompanying shadertoy: https://www.shadertoy.com/view/lsBBWy .

It is direct implementation of “Circularly symmetric convolution and lens blur” by Olli Niemitalo (no innovation on my side, just a toy implementation) and got inspired by Kleber Garcia’s Siggraph 2017 presentation “Circular Separable Convolution Depth of Field”.

Why?

For depth of field effect, Gaussian blurs are seen artistically as “boring”, while hexagonal DoF (popular few years ago) can be subjectively not attractive (artificial, cheap camera qualities). I wrote a bit about bokeh, DoF and some crazy implementation in Witcher 2 in the past. Working on Far Cry 4 and other titles, I used a different approach – scatter as gather “stretched” Poisson bokeh to compute DoF together with motion blur and at the same time.

Circular kernels can be very expensive; prone to noise, ringing etc. One can try to approximate them, like Guerilla Games in Michal Valient’s Killzone presentation, undersample with blur afterwards like Crytek in Tiago Sousa’s presentation, or try to use some separable approaches like Colin Barré-Brisebois.

Unfortunately, there are no circularly symmetric separable filters other than Gaussian filter in real domain. However, in complex domain, one can find whole family of functions (complex phasors multiplied by Gaussian “bell”) that their magnitude is! This is what Olli Niemitalo’s post describes and introduces some “fitted” functions to approximate disk DoF.

Results – quality

As a proof-of-concept, I implemented it here: https://www.shadertoy.com/view/lsBBWy . It has a version with a single component and a version with two components (“harmonics”).

Version with a single component produces quite strong ringing, but much more “pronounced” bokeh than Gaussian one:

circular_vs_gaussian

Circular bokeh single component approximation vs Gaussian.

Version with two components is going to be twice more expensive memory and ALU-wise, but doesn’t have those artifacts so strong:

circular_two_vs_one

Two component vs single component approximation. Notice less ringing and smaller passband – almost perfect, circular shape.

I personally don’t think this ringing or “donut hole” is unpleasant at all; IMO it resembles some older lenses with “busy bokeh” and is something we simulated in Witcher on purpose:

Witcher 2 artistic “busy bokeh”.

ND7_1568

Real world lens “busy bokeh” with some ringing visible.

Results – implementation / performance

Shader toy implementation I provided is obviously not optimal. Recomputing weights and their sum is very wasteful; most of this can be precomputed and stored in an uniform / constant buffer. Instead I stored most of them in first texture / buffer (“Buffer A”).

If we exclude weights, per every component first (“horizontal”) pass requires a “simplified” real times complex multiply and accumulate (2 madds, one per real and one per imaginary component).

Second (“vertical”) pass requites full complex multiplies and accumulates – however after applying optimization from Olli we can keep real component only – (2 madds).

So assuming that there are N taps in both horizontal and vertical directions, every component needs roughly 4N “full rate”/simple instructions – not bad at all. For two components it’s 8N instructions . Both variation just read normal amounts of memory in the first pass (one fetch per tap), but on 2nd pass have either 2x or 4x more memory bandwidth required.

Similarly larger are the storage costs – every RGB channel is multiplied by 2 and by number of components. So using a single component we’d need 6 floating point channels (that can get negative – complex phasor!), for two components 12, so probably 4 RGBA16F textures.

In general, like with many post effects memory bandwidth is definitely a concern here; this technique requires even more of it and can get pretty expensive, but most blurring can be done in local / groupshared memory in compute shaders; and one can do some optimizations like blurring in YCoCg space and storing chroma components in lower res etc. So in general I definitely think it’s practical and cool approach.

In general I really like this technique, I don’t know if I’d use it in practice (still see lots of benefit in combining DoF and MB), but find it really beautiful and elegant, especially maths behind it. 🙂

References

Circularly symmetric convolution and lens blur“, Olli Niemitalo http://yehar.com/blog/?p=1495

“Circular separable convolution depth of field”, Kleber Garcia http://dl.acm.org/citation.cfm?id=3085022

Killzone: Shadow Fall Demo Postmortem“, Michal Valient https://www.slideshare.net/guerrillagames/killzone-shadow-fall-demo-postmortem

Graphic gems of CryEngine 3“, Tiago Sousa http://www.crytek.com/download/Sousa_Graphics_Gems_CryENGINE3.pdf

Hexagonal Bokeh Blur Revisited“, Colin Barré-Brisebois https://colinbarrebrisebois.com/2017/04/18/hexagonal-bokeh-blur-revisited/

Posted in Code / Graphics | Tagged , , , , , , , , , , | 1 Comment

Cull that cone! Improved cone/spotlight visibility tests for tiled and clustered lighting

In this blog post, I will present some common spotlight culling techniques and propose one small improvement that is able to correct results of cheap, tiled / clustered culling on the CPU/GPU with almost negligible ALU cost. If you know a lot about tiled light culling, feel free to skip till the last section.

The idea for this post came from the fact that many people talking about light culling talk mostly about spherical point lights. However in many practical applications point lights are actually rare! It is not uncommon to implement even point light rendering and their shadow casting not through cubemaps for shadowmaps, but through 1-6 clipped spot lights (getting the advantage of being able to vary their resolution, cull intersecting geometry separately and compute optimal near/far planes).

Main problem with culling spotlights is that unlike spherical / point lights, many widely used tests cause some overdraw / false positives. With physically-based correct light falloffs and BRDFs, light can get really huge (specular component can be seen very far away) and most common tests are not suited for culling of large primitives. I will have a look at few possible techniques in following sections.

Note: in sample code you will see instructions like sin or cos – obviously you don’t want to call them, but precompute them ahead – this is only for demonstration purposes.

You can find Mathematica notebook that comes with this post here and the pdf version of it here.

Introduction / assumptions

Talking about generic collision tests of spot lights, I am going to approximate them as cones (extra cap is easy to add; in most of those tests it can be skipped though). I will talk about culling them against mini-frustums with focus on tiled lighting – tiled deferred and clustered forward lighting – but they apply in theory to any kind of culling technique. In theory it could also be applied to tiled forward+ for transparent object lighting, however it won’t benefit that much from some of mentioned optimization techniques due to very long frusta. In those cases it is better to use clustered shading anyway! In general, clustered shading is preferable and can be made extremely efficient with some low level hardware tricks, for example as presented by Tiago Sousa and Jean Geffroy.

In most of this post, I will assume that tile frustum is not extremely long, so has relatively uniform dimensions. While it could be considered limiting, in practice it is possible to achieve that assumption even withing just classic tiled deferred by splitting tiles depth bounds based on the depth distribution.

I will also demonstrate ideas here in 2D for simplicity, but they extend one to one to 3 dimensions (unless noted otherwise).

Simple approach – cone as a sphere

First approach is quite trivial. Instead of dealing with a spotlight / cone, we can have a look at its bounding sphere. Obviously taking equivalent pointlight bounding sphere would be extremely wasteful, so instead we want to find an exact bounding sphere of a cone. To find it, we can have a look at just 2 cases: either triangular cone slice is obtuse or not.

coneboundingsphere.gif

Cone slice bounding sphere

In case of obtuse triangle, bounding sphere’s center will be on the cone cap with radius equal to the radius of the cap. In the acute case, we want to find a circumscribed triangle. There is a very easy formula for finding one based on one of triangle angles and length of opposite side, it is A / sin(a). After taking into account half angle and half-diameter and doing some trigonometric reductions, it becomes trivial 1/(2cos(halfAngle)).

float4 boundingSphere(in float3 origin, in float3 forward, in float size, in float angle)
{
	float4 boundingSphere;
	if(angle > PI/4.0f)
	{
		boundingSphere.xyz = origin + cos(angle) * size * forward;
		boundingSphere.w   = sin(angle) * size;
	}
	else
	{
		boundingSphere.xyz = origin + size / (2.0f * cos(angle)) * forward;
		boundingSphere.w   = size / (2.0f * cos(angle));
	}

	return boundingSphere;
}

This can fail in many situations even with a simplest plane vs sphere tests:

cone_boundingsphere_thin_fail.png

cone_boundingsphere_large_fail.png

However in some it will work reasonably well against a single plane. Having precomputed the spheres on the CPU or just ahead of time, it is extremely cheap test to perform.

coneboundingspherevsplane.gif

Unfortunately, this test it suffers from a “corner” problem typical to all primitive vs planes partial swept line tests:

coneboundingspherevsfrustum.gif

Here is a stop frame from this animation, you can see how every plane test succeeds and cone / sphere are not culled by any of them, so the whole bounding sphere cannot be culled:

cone_collision_sphere_frustum_fail.png

Thomas Gareth proposed a simple trick / workaround: precise, Minkowski sum culling of a sphere against bounding box of the frustum.

Here is how it performs in the same case, simple and cheap box test against the sphere gives expected result without false negatives:

coneboundingspherevsbox.gif

Different approach – plane cone tests

While not forgetting about frustum planes test failures, we can try addressing those very thin or very wide cones with better plane vs primitive collision test and see how it’d work in practice.

The main idea is simple – find a closest point on the primitive to the plane and test it against that plane. In 2D this is trivial – the point will be the closest one of 3 triangle corners:

conevsplane.gif

3D case

In 3D it gets a bit more tricky, but same assumptions remain. But this time we either test the cone apex or closest point on the cap rim. Idea one might to do it efficiently comes from Christer Ericson’s book “Real-Time Collision Detection”.

Let me visualize first how the algorithm works:

3d_cone_vs_plane.png

We are trying to find vector V2 that is closest to the plane surface. To do it, we first compute a vector V1 that is perpendicular to both plane normal and the cone “forward” vector. This can be done by performing a simple cross product. We also get a guarantee that this vector will be on the cone cap (since cone cap plane has all vectors perpendicular to cone axis). Then we do yet another cross product with the cone axis – to get a vector perpendicular to both the cone axis (so it will still be on the cone cap) as well as this vector perpendicular to the plane normal. It means it will be a vector that points in direction most similar to the plane normal or in the opposite direction, depending on the order of cross product operations.

It is very important here to not forget a “safe” normalize, since depending on the sine of the initial angle between plane normal and the cone orientation, result can be a very short vector. If it is zero, we can totally ignore it as it means that cone cap is parallel to the plane and we need to only check the apex.

Here we can see how this works in practice for a test that detects an intersection:

3d_cone_vs_plane_collide.png

Finally, we need to also check the cone origin point – since cone cap can be quite far away from the plane, but the cone intersect it anyway:

3d_cone_vs_plane_corner.png

We can code it easily in hlsl for example like this:

bool TestConeVsPlane(in float3 origin, in float3 forward, in float size, in float angle, in float4 testPlane)
{
	const float3 V1 = cross(testPlane.xyz, forward);
	const float3 V2 = cross(V1, forward);

	const float3 capRimPoint = 	origin +
								size * cos(angle) * forward +
								size * sin(angle) * V2;

	return dot(float4(capRimPoint, 1.0f), testPlane) >= 0.0f || dot(float4(origin, 1.0f), testPlane) >= 0.0f;
}

Back to a 2D example

How does it perform? In theory it could be better (we can get much closer than the bounding sphere and still get some culling), but it is still suffering from the same problems like all frustum plane tests:

conevsfrustums.gif

In practice, this is a huge problem. Just have a look in a a little bit contrived scenario of a regular grid:

conevsplanegrid.gif

This looks almost totally useless!

There are few solutions to it. I will mention first a heavy / complicated, but best solution, suggested by Matt Pettineo – rasterizing the cone in tested space and then either simply checking against this rasterized mask, or doing proper, rasterized computation of tile / cluster bounds in 2D. It is not so easy to implement, requires smart tricks to work around lack of conservative rasterization on most available consumer hardware and adds extra passes to the rendering pipeline. I would suggest doing it this way if you have time, but I will propose a bit coarser, but very cheap and simple solution if you don’t.

Flipping the problem

In the idea I propose (I am sure it is not new/revolutionary; however haven’t seen it in any published material related to tiled light culling) I got inspired by both Thomas Gareth and Inigo Quilez. Instead of doing partial separating axis test by testing frustum planes, in case of small frusta against large primitives, we should do the opposite!

Just test the cone against some other primitive.

What other primitive? The easiest one; a sphere – their symmetry makes Minkowski-sum tests quite trivial. Just find a furthest point on the tested primitive towards the sphere center and perform a distance check.

Here is how it would work geometrically:

conevssphere.gif

We can see a vector V that goes from cone origin towards the sphere. Then this vector V is decomposed into perpendicular vectors V1 and V2. V1 is projection of V onto the cone axis that is clipped against cone axis length, while V2 is V minus the projection – clipped against tan(coneAngle) times V1 length.

This way we get a point that is closest to the cone and we can simply check the distance of this point from the sphere center against the radius of the sphere. Note that this is the correct test, giving accurate results in every case – but it is quite expensive (many reciprocals) and has relatively large register pressure (operations on whole 3D vectors).

We can slightly simplify it by not actually decomposing the vector and just comparing few distances – for example as explained by Charles Bloom. Please note that his original test is very simplified and actually doesn’t test the cone range / behind the cone situation, something like:

conevssphere_simplified.gif

To fix it without adding too much extra cost, I’d add some simple distance checks for the distance of the V1 on the cone axis from cone start/end, so the final simplified test would have only one small case for false positive, when both axis and angle/distance check fail (very small area for small sphere radii):

conevssphere_simplified_corrected.gif

Final, corrected version in hlsl might look like that (after multiplying by sin(a)):

bool TestConeVsSphere(in float3 origin, in float3 forward, in float size, in float angle, in float4 testSphere)
{
    const float3 V = testSphere.xyz - origin;
    const float  VlenSq = dot(V, V);
    const float  V1len  = dot(V, forward);
    const float  distanceClosestPoint = cos(angle) * sqrt(VlenSq - V1len*V1len) - V1len * sin(angle);

    const bool angleCull = distanceClosestPoint > testSphere.w;
    const bool frontCull = V1len >  testSphere.w + size;
    const bool backCull  = V1len < -testSphere.w;
return !(angleCull || frontCull || backCull);
}

Is it worth? Let’s have a look at previous grid test (overlapping spheres from tiny boxes):

conevssphere_simplified_grid.gif

I’d say – absolutely yes. 🙂 With frusta that tend to be thinner/longer, the gain won’t be that great (as bounding spheres volume starts to get much larger than frustum volume), but it is very cheap test (cheaper than testing 6 planes!) and still works really well.

conevssphere_simplified_grid_long.gif

Summary

I highly recommend trying sphere vs cone tests, even if the primary primitive for which you are culling is not a sphere at all and it might seem that a sphere will be too conservative. It is also not all or nothing – you can add this relatively cheap test on top of your other, different and existing tests and you are not sacrificing anything (other than some CPU / GPU cycles).

In general, I would also discourage frustum plane tests in probably any case when the tested primitives might be large compared to frusta – you almost always can do better, even with tests that might have seemed more “conservative”. If you don’t have that many lights, you can keep plane tests for some marginal culling extra efficiency in some cases (e.g. cone parallel and close to a very thin frustum). When in doubt – visualize!

And for a general advice, it is often worth looking at the problem from the opposite side / flipping it around. Obviously if you spend too much time on the problem on your own, it often takes talking to your colleagues just to realize that there might be a different solution. Changing domain is extremely common idea in mathematics (integration domain change, variable change, relative coordinate system change etc.), but I tend to often forget about it while solving “easier” CS problems.

References

https://takahiroharada.files.wordpress.com/2015/04/forward_plus.pdf “Forward+: Bringing Deferred Lighting to the Next Level”, Takahiro Harada, Jay McKee, and Jason C.Yang

https://software.intel.com/en-us/articles/deferred-rendering-for-current-and-future-rendering-pipelines “Deferred Rendering for Current and Future Rendering Pipelines”, Andrew Lauritzen

http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf “Clustered Deferred and Forward Shading”, Ola Olsson, Markus Billeter, and Ulf Assarsson

http://advances.realtimerendering.com/s2016/ “The devil is in the details: idTech 666”, Tiago Sousa, Jean Geffroy

http://twvideo01.ubm-us.net/o1/vault/gdc2015/presentations/Thomas_Gareth_Advancements_in_Tile-Based.pdf “Advancements in Tiled-Based Compute Rendering”, Gareth Thomas

http://www.iquilezles.org/www/articles/frustumcorrect/frustumcorrect.htm Inigo Quilez on “correct” frustum culling.

http://realtimecollisiondetection.net/books/rtcd/buy/ Christer Ericson, “Real-time collision detection”

https://mynameismjp.wordpress.com/2016/03/25/bindless-texturing-for-deferred-rendering-and-decals/ “Bindless Texturing for Deferred Rendering and Decals” – Matt Pettineo

http://www.cbloom.com/3d/techdocs/culling.txt “View Culling”, Charles Bloom

Posted in Code / Graphics | Tagged , , , , , , , , , , , , | 4 Comments

Small float formats – R11G11B10F precision

While this post is not yet dithering related, it is in a way a part of my series about dithering. You can check index of all parts here or check the previous part.

I will talk here about use of not very popular / well knows format R11 G11 B10 Float (R11G11B10F) format – its precision, caveats and how to improve it.

I want to note here that this post will not touch on many float subtleties, namely NaNs, denorms and infinities. Lots of GPU shader compilers use fast math anyway (unless asked to do strict IEEE compliance) and ignore them – and programmers have to be double careful when their used values.

You can find Mathematica notebook for this post here and corresponding pdf here.

Update: I updated section about losing dynamic range in denorm range after correction from Tom Forsyth and Peter Pike-Sloan that GPUs are standardized to support denorm on write to small floats.

Problem description

Most common representation of colors in rendering is not integer / natural / rational number representation, but floating point representation. Floating point numbers and their large range are useful for few different reasons, but the most important are:

  • Encoding HDR values and lighting,
  • Need for fractional values when operating on multiple colors, mixing them, filtering with filters with fractional or negative weights,
  • Need for larger precision in darker areas without any gamma encoding,
  • Need for bound relative quantization error (constant upper bound relative to signal magnitude),
  • Fact that floating point numbers are “natural” representation for GPUs (for a long time GPUs didn’t have any integer number support or it was “emulated” using float operations… And still some integer operations are slower than floating point operations).

That said, rendering techniques very rarely store 32bit floating point values even for HDR color – because of both memory storage cost as well as performance. Memory bandwidth and caches are usually most sacred resource and simplistic rule of thumb is “ALU is cheap, memory access is expensive”. Even simplest memory access operations have latencies of hundreds of cycles (at least on AMD GCN). Furthermore, cost increases when texturing unit is used – as filtering operations get more and more expensive and operate with slower rates.

Therefore, rendering programmers usually use smaller float formats as intermediate in-memory storage – 2 most common being RGBA16F (4 16bit half float channels) and R11G11B10F (channels R and G having 11 bit small float and channel B using 10 bit small floats).

Let’s have a look at the difference between those formats and full 32bit IEEE float. If you feel comfortable with float representation, feel free to skip the next section.

Floats – recap

I am assuming here that reader knows how floating values are represented, but as for a reminder – typical floating point value is represented by some bits for:

  • sign – just sign of the number, max single bit value and optional (more later),
  • exponent – some bits that are represented in biased, integer format and describe biased exponent of number of 2 before multiplying with rest of the number,
  • mantissa – some bits representing the fractional part of the number before multiplying by exponent. It is assumed that there is a leading 1, decimal point, so for example mantissa of 01011000 corresponds to number 1.01011000 represented binary (in base of 2).

Therefore final typical number is sign(+/- 1) * 2decoded exponent * 1.mantissa.
There are lots of “special” cases of floats that use special smallest and largest values of exponent (denorms, infinity, NaN, zero), but for the purpose of this post, we will have a look later at only one special case – encoding of zero – it is achieved by putting all exponent and mantissa bits to zero. (note: because sign can be still set, there are two zeros, +0 and -0).

Floating points are a very clever representation with lots of nice properties (for example positive floats interpreted as integers can be sorted or atomically min/maxed! Or that integer zero corresponds to just positive float zero), however come with many problems with precision that are not always the most intuitive. I will be mentioning here only some of them – the ones that are relevant to discussed problem.

Regular and small floats

So far I was trying to stay very generic and not specify any bit numbers, but to use floats in hardware (or software emulation), we need to define them.

Here is a table showing various bit depths of regular 32 bit floats as well as half floats and 11 and 10 bit floats as used by graphics hardware / standards:

Bit depth Sign bit present? Exponent bits Mantissa bits
32 Yes 8 23
16 Yes 5 10
11 No 5 6
10 No 5 5

We can immediately see few interesting observations:

  • 11 and 10 floats do not have sign bit! This decision was probably driven by the fact that they have already quite poor precision for most of uses, so they were designed in graphic APIs only to store color; using a sign bit here would be an extra waste.
  • 16 bit “half” floats and 11 and 10bit floats all have same exponent! This is pretty interesting choice, but it guarantees that they can represent +/- similar range of values. Exponent of 5 guarantees that values can go to 65500 and 65000 (depending on their mantissas), which is pretty large even for HDR lighting (unless using non-biased, absolute exposure values or doing increasing precision trick I will cover later). Exponent can be negative, so we can go to similarly (“one over”) low values.
  • Mantissa suffers the most. The difference is quite crazy – 23 vs. 5 bits in the worst case! We are dropping 18 bits of precision. This is very unfortunate information, as it means that relatively, between numbers that are in similar range (similar exponent), we are losing lots of precision.

Also, because of different bit depths of 11 11 10 float format, problem arises from different mantissa bit depths of blue channel and other channels – it will produce various discolorations and hue shifts – similar to ones that appear often in BC1 block compression (with 565 endpoint bit depths), but not being green/purple, but yellow/blue instead. I will show an example of it later in the post. Obviously, this decision makes sense – 11 11 10 format fits nicely in a single dword and perceptually, human vision is least sensitive to blue channel.

So as we see, we are dropping lots of information by converting 32 bit floats to 16 or 11/10 bits. Furthermore, information loss is not proportional between exponent and mantissa – in every small float case, we lose much more information in the mantissa. This can lead to some quantization and banding errors.

Before analyzing quantization, one thing is worth mentioning – IEEE standard defines few different rounding modes (e.g. to nearest, to zero, to +inf and to -inf). I don’t think they are in any way configurable on GPUs (at least in standard, cross vendor APIs) and I will write rest of the post ignoring this complexity and assuming that simplest rounding is used.

Small float mantissa precision – concrete example

I hope that previous section and looking at some numbers for bit depths shows clearly problem of losing lots of numerical precision of smaller format floating point numbers because of very small mantissa.

First, some numerical example. Let’s take 3 simple, 8 bit integer values and represent them as a float in range 0-1 – common operation for colors.

N[252/255, 8]
0.98823529

N[253/255, 8]
0.99215686

N[254/255, 8]
0.99607843

Let’s try to represent them as floats. Using knowledge about float values and knowing that mantissa always starts with one, we need to multiply them by 2 and exponent will be 2-1.

After multiplication we get:

BaseForm[N[2*252/255, 8], 2]
1.1111100111111001111110100

BaseForm[N[2*253/255, 8], 2]
1.1111101111111011111111000

BaseForm[N[2*254/255, 8], 2]
1.1111110111111101111111100

I highlighted the first 5 bits, why? Recall that 10-bit half float has only 5 bits of mantissa! Therefore 10bit half floats (blue channel of R11 G11 B10F) cannot represent accurately even 3 almost-last 8 bit color values! At the same time, you can see that the next bit actually differs – therefore those 3 numbers will produce 2 different values in 11F and produce wrong coloration of white values.

Small float mantissa precision – visualized

Ok, so we know that small floats cannot represent accurately even simple 8bit luminance! But how bad they really are? I created some Mathematica visualizations (see top of the page for link) – first for the worst case, B10F, so dropping 18 bits of mantissa.

10bit_floatvs8bit_linear.png

Things look ok (or even much better – not surprising given how floats are encoded!) close to zero, but error starts increasing and is almost 4x larger close to one compared to linear 8 bit values quantization error!

This comparison however is quite unfair – we don’t use 8bit linear color because of perceptual sensitivities to darks vs brights (“gamma”) and use sRGB instead, so don’t care as much about those bright areas and decide to encode more information into darker parts. This is how comparison of those 3 methods of encoding look like:

10bit_floatvs8bit_srgb.png

Ok, things are a bit more even. Looks like 10bit float precision is a bit better for values up to linear 0.125, but later get worse. Maximum error is almost 2x larger around 1 for 10 bit floats, not great… This will create visible bands on smooth gradients.

Just for fun, extra visualization, relative error (divided by original value):

8bitsrgb_vs_10float_relative.png

As expected, float value quantization relative error is bounded and has a maximum in ranges corresponding to next exponents (if we don’t count here going lower than minimum normalized float representation), while 8 bit linear or sRGB relative errors increase as we approach zero. Floating point relative error is also represented in “bands” corresponding to next exponents and getting 2x larger between 2 adjacent bands.

We will have a look at how to improve things a bit, but first – a bit more about a second problem.

Small float uneven mantissa length problem

Because R11G11B10 floats have uneven mantissa bit length distribution, they will quantize differently. How bad is it? As always with floats, absolute error depends on the range:

11bit_float_vs_10bit_float.png

The larger the number – the higher the error. In last part of the plot it looks pretty bad:

11bit_float_vs_10bit_float_smallerrange.png

What this different quantization mean in practice? It means that there will be discoloration / wrong saturation of the signal. Let’s have a look at a simple gradient from 0.5 to 0.6.

gradient_quantized.png

This is very bad (if you have a good monitor / viewing conditions). And now imagine that art director that you work with likes contrasty, filmic look with saturation boosted:

gradient_quantized_contrast.png

This looks quite unusable… We will have a look at improving it. In this post by changing the signal dynamic range, in the next post by dithering.

Rescaling will not work

Quite common misconception is that it is enough to multiply a float by large number, encode it and divide after decode. It is not going to work, for example, let’s see quantization error when premultiplying by 16:8bitsrgb_vs_10float_relative_vs_premultiplied.png

Zero difference at all! Why? Let’s think what it means to divide by 16 in float representation. Well, mantissa is not going to change! Only thing is that we will subtract 4 from the exponent. So relative error due to mantissa quantization will be exactly the same. One can try to multiply by a number between 1/2 and 2 and we will see a difference in ranges shifting, but it is going to only shift error to either more white or more dark parts:

8bitsrgb_vs_10float_relative_vs_premultiplied_smaller.png

Error bands only slide left or right.

Improving error by applying some gamma

Let’s have a look here at a different method – that will take advantage of the fact that probably (if image is pre-exposed!) we don’t care about extremely small values, where most precision is positioned (to achieve bound relative precision).

I mentioned in my previous post about dynamic range commonly used workaround for shifting precision precision problems – stretching the dynamic range by taking some power of the signal (smaller or larger). For storing higher precision dark areas of images in integers, we wanted to take lower power for encoding – for example famous gamma 1/2.2. However, in this case we would like to do… the opposite! So taking larger power – to understand why, just look at the original comparison where we introduced sRGB variant:10bit_floatvs8bit_srgb.png

We rescaled blue plot from constantly oscillating in fixed bounds to one that grows. Here with 10bit floats the problem is opposite – we have a function that asymptotically grows too quickly – we want to undo it.

Think a bit about it, it’s quite interesting problem. It has a lot to do with the way floats precision is distributed – it is non-linear, logarithmic distribution that handles large dynamic ranges very well; furthermore, exponential-like signal curve will be represented almost linearly! Therefore to take the most from our floating point representation with low bit depths, we would like to increase dynamic range as much as we can prior to encoding. We can do it by for example squaring the signal or taking larger powers. For the initial 3 floats that I used this requires actually quite large exponent, 3 for given values:

BaseForm[N[2*(252/255)*(252/255)*(252/255), 8], 2]
1.1110111000100100001001000

BaseForm[N[2*(253/255)*(253/255)*(253/255), 8], 2]
1.1111010000001100000101000

BaseForm[N[2*(254/255)*(254/255)*(254/255), 8], 2]
1.1111101000000000000001000

Note how they are different (though first two will round the same way).

Let’s have a look at absolute error with applying gamma 3 (note: this graph assumes correct denorm handling, more below):

8bitsrgb_vs_10float_gamma.png

Our error looks asymptotically smaller than 8bit sRGB error – this could be already quite useful storage base. Our previously banded gradient also looks better, as well as its higher contrast version (though not perfect – recall that contrast kind of redoes the gamma):

Before:

gradient_quantized.png

After:

gradient_quantized_gamma.png

Before with contrast:

gradient_quantized_contrast.png

After with contrast:

gradient_quantized_gamma_contrast.png

There is no free lunch though!

First of all, there is ALU cost. As we do this operation per 3 channels, it can get quite significant! Taking x*x*x is 2 full rate operations, but for example pow(x,1/3) is log2 + exp2 + multiply, so 2 quarter rate + 1 full rate = 9 FR instructions per color channel! Cheapest variant is just squaring and sqrt(x) is a single quarter rate instruction = equivalent of 4 FR instructions.

Secondly, this data is now obviously not filterable / blendable… Blending in this space wold ceate over-brightening. This can be an issue (if you need hw blending or to resample it with bilinear taps) or not (if you can do it all manually / in a compute shader).

Thirdly, this extra precision is achieved by sacrificing the dynamic range. It is +/- equivalent to dividing abs value of exponent by the used gamma. So for example, with gamma 3 our maximum representable value will be around pow(65000,1/3) ~= only 40! Is it HDR enough for your use? If pre-exposing the scene probably yes, but hottest points will be clipped… The squared variant looks much better, as around 250+.

Potential problem with small numbers

Note: this section got slightly rewritten after correction from Tom Forsyth and Peter Pike-Sloan.  My assumptions were pessimistic (denorm flush to zero), but apparently, GPUs in for example DirectX are obliged to handle them correctly. Thanks for noticing that!

Another problem could be in a different part – smallest representable numbers. The same abs of exponent division is applied to smallest representable numbers! Therefore smallest normalized representable number after applying gamma 3 will be 0.03125, which is around 8/255 and if we don’t have denorms or denorms are flushed to zero, this would result in a clip! Without handling denorms, the zoomed-in actual graph of error would look:

10bitgamma_zoom_error.png

As the graph would look:

10bitgamma_clip.png

You could try to fix it by preexposing for example by 4:

10bitgamma_clip_preexposed.png

But not only it’s not perfect, but also you’d start losing it again from the top range. (hottest representable values) Instead of already limiting 40, you’d get only 10! This is probably not enough even for displaying the signal on a HDR TV…

Therefore, if denorms were not handled correctly, I’d rather recommend to stick to gamma 2 with preexposure of 4 and accept the slightly higher quantization errors:

10bitgamma2_clip.png

10bitgamma2_error.png

Fortunately, as I got corrected – this is not the case and we can assume that denorms will be handled – so can use those higher exponents if needed – only thinking about how much dynamic range we are sacrificing in the upper/higher parts.

Before finishing this section, interesting side note: have you ever considered how low is normalized float precision when operating on 16 bit floats? Half floats have same exponent bit depth, so if you apply contrast operations to them, you might be entering denorm range very quickly! Which theoretically could result in clipping.

Untested idea – using YCoCg color space?

Some interesting (?) idea could be trying to use some different color space like YCoCg or similar instead of RGB. In (signed) YCoCg smaller chroma = smaller magnitudes of Co Cg components = more precision. This would help decorrelate color channels and avoid ugly chroma shifts when the color is less saturated (and when those shifts are more visible).

Unfortunately, R11G11B10 has no sign bit available – we would need to store 2 extra sign bits “somewhere” (different surface? lowest bit of mantissa / highest bit of exponent?).

Summary – to use R11G11B10F or not to use?

R11G11B10 and small 11 and 10 bit floats have many limitations, but are also extremely compelling storage format. They halve memory storage and bandwidth requirements compared to RGBA16F, are capable of storing high dynamic range signal and after some numerical tricks also provide precision acceptable in most color encoding scenarios. I use them a lot to non critical signals (ambient buffer, many post effects buffers), but I think that they are practical also for regular color buffers if you don’t need alpha blending or filtering and can tinker with the input data a bit.

Update: I got information from Volga Aksoy and Tom Forsyth that Oculus SDK now supports and recommends outputting into this format, so it is definitely practical. Because of darker / perfect viewing conditions with a HMD, human perception is much more sensitive in darks and R11G11B10F performs better than 8bit sRGB in this lower range.

In the next post I will show how to dither floats and get even better results with almost no perceived banding (trading it for noise).

Bonus – comparison with 10bit sRGB

As a small bonus, simple comparison with 10bit sRGB encoding (no hardware support, but some video out libraries support it to allow for more precise color profile / curves conversions). Two plots show error in full 0-1 range and in 0-0.1 darks range.

final_comparison.png

final_comparison_small.png

We can see that 10bit sRGB is much more superior throughout most of the range, but in very low/dark values 10bit floats are either equivalent or even a bit more superior.

References

https://www.opengl.org/wiki/Small_Float_Formats

http://steve.hollasch.net/cgindex/coding/ieeefloat.html Steve Hollasch, “IEEE Standard 754 Floating Point Numbers”

http://community.wolfram.com/groups/-/m/t/274061 Mathematica help – Convert floating point representation to any scientific notation & back

https://msdn.microsoft.com/en-us/library/windows/desktop/cc308050(v=vs.85).aspx#alpha_11_bit Direct3D 10 Floating point rules

Posted in Code / Graphics | Tagged , , , , , , , , , | 4 Comments

Dithering part three – real world 2D quantization dithering

In previous two parts of this blog post mini-series I described basic uses mentioned blue noise definition, referenced/presented 2 techniques of generating blue noise and one of many general purpose high-frequency low-discrepancy sampling sequences.

In this post, we will look at some more practical example – use of (blue) noise in 2D image dithering for quantization!

You can find Mathematica notebook for this post here and its pdf version here.

Bit quantization in 2D

Finally, we are getting to some practical use case. When you encode your images in 8 bits (typical framebuffer) – you are quantizing. When you encode your GBuffers, render targets and others – you are quantizing. Even typical 8 bit is enough to cause some banding on very smooth, slow gradients – like skies or lens flares.

We will cover here a more extreme example though – extreme 3 bit quantization of a linear gradient.

2dditheringorigsignal.png

2dditheringquantizenodither.png

We call those quantization artifacts – 8 visible bands the “banding” effect.

As we learned in previous parts, we can try to fix it by applying some noise. At first, let’s try regular, white noise:

2dditheringquantizewhitenoise.png

Doesn’t look too good. There are 2 problems:

  • “Clumping” of areas, identical to one we have learned before and we will address it in this post.
  • Still visible “bands” of unchanged values – around center of bands (where banding effect was not contributing to too much error.
2dditheringwhitebands.gif

Error visualized.

Those bands are quite distracting. We could try ti fix them by dithering even more (beyond the error):

2dditheringquantizewhitenoisex2.png

This solves this one problem! However image is too noisy now.

There is a much better solution that was described very well by Mikkel Gjoel.

Using triangular noise distribution fixes those bands without over-noising the image:

2dditheringquantizewhitenoisetriangle.png

Use of triangular noise distribution.

Since this is a well covered topic and it complicates analysis a bit (different distributions), I will not be using this fix for most of this post. So those bands will stay there, but we will still compare some distributions.

Fixed dithering patterns

In previous part we looked at golden ratio sequence. It is well defined and simple for 1D, however doesn’t work / isn’t defined in 2D (if we want it to be uniquely defined).

One of oldest, well known and used 2D dithering patterns is so called Bayer matrix or ordered Bayer. It is defined as a recursive matrix of a simple pattern for level zero:

1 2

3 0

With next levels defined as:

4*I(n-1) + 1  —– 4*I(n-1) + 2

4*I(n-1) + 3  —– 4*I(n-1) + 0

It can be replicated with a simple Mathematica snippet:

Bayer[x_, y_, level_] :=
Mod[Mod[BitShiftRight[x, level], 2] + 1 + 2*Mod[BitShiftRight[y, level], 2],
   4] + If[level == 0, 0, 4*Bayer[x, y, level – 1]]

2dditheringbayer.png

8×8 ordered Bayer matrix

What is interesting (and quite limiting) about Bayer is that due to its recursive nature, signal difference is maximized only in this small 2×2 neighborhood, so larger Bayer matrices add more intermediate steps / values, but don’t contribute much to any visible pattern difference. Therefore most game engines that I have seen used up to 4×4 Bayer pattern with 16 distinctive values.

If you plot a periodogram (frequency spectrum) of it, you will clearly see only 2 single, very high frequency dots!

2dditheringbayerperiodogram.png

2D periodogram – low frequencies are in the middle and high frequencies to the sides.

Obviously signal has some other frequencies, but much lower intensities… Plotting it in log scale fixes it:

2dditheringbayerperiodogramlogscale.png

So on one hand, Bayer matrix has lots of high frequency – would seem perfect for dithering. However presence of strong single frequency bands tends to alias it heavily and produce ugly pattern look.

This is our quantized function:

2dditheringquantizebayer.png

If you have been long enough playing with computers to remember 8 bit or 16 bit color modes and palletized images, this will look very familiar – as lots of algorithms used this matrix. It is very cheap to apply (a single look up from an array or even bit-magic ops few ALU instructions) and has optimum high frequency content. At the same time, it produces this very visible unpleasant patterns. They are much worse for sampling and in temporal domain (next 2 parts of this series), but for now let’s have a look at some better sequence.

Interleaved gradient noise

The next sequence that I this is working extremely well for many dithering-like tasks, is “interleaved gradient noise” by Jorge Jimenez.

Formula is extremely simple!

InterleavedGradientNoise[x_, y_] :=
FractionalPart[52.9829189*FractionalPart[0.06711056*x + 0.00583715*y]]

But the results look great, contain lots of high frequency and produce pleasant, interleaved smooth gradients (be sure to check Jorge’s original presentation and his decomposition of “gradients”):

interleavedgradientnoise.png

What is even more impressive is that such pleasant visual pattern was invented by him as a result of optimization of some “common web knowledge” hacky noise hashing functions!

Unsurprisingly, this pattern has periodogram containing frequencies that correspond to those interleaved gradients + their frequency aliasing (result of frac – similar to frequency aliasing of a saw wave):

2dditheringinterleavedgradientnoiseperiodogram.png

And the 3D plot (spikes corresponding to those frequency):

2ddithering3dplotinterleaved.png

Just like with Bayer, those frequencies will be prone to aliasing and “resonating” with frequencies in source image, but almost zero low frequencies given nice, clean smooth look:

2dditheringquantizeinterleavedgradient.png

Some “hatching” patterns are visible, but they are much more gradient-like (like the name of the function) and therefore less distracting.

Blue noise

Finally, we get again to using a blue noise pre-generated pattern. To recall from previous part, blue noise is loosely defined as a noise function with small low frequency component and uniform coverage of different frequencies. I will use here a pattern that again I generated using my naive implementation of “Blue-noise Dithered Sampling” by Iliyan Georgiev and Marcos Fajardo.

So I generated a simple 64×64 wrapping blue noise-like sequence (a couple hours on an old MacBook):

2dbluenoise.png

It has following periodogram / frequency content:

2dditheringblueperiodogram.png

And in 3D (who doesn’t love visualizations?! 😀 ):

2ddithering3dplot.png

Compared to white noise, it has a big “hole” in the middle, corresponding to low frequencies.

2dwhitenoisevsbluenoise.gif

White noise vs blue noise in 2D

At the same time, it doesn’t have linear frequency increase for higher frequencies, like audio / academic definition of blue noise. I am not sure if it’s because my implementation optimization (only 7×7 neighborhood is analyzed + not enough iterations) or the original paper, but doesn’t seem to impact the results for our use case in a negative way.

Without further ado, results of dithering using 2D blue noise:
2dditheringquantizeblue.png

It is 64×64 pattern, but it is optimized for wrapping around – so border pixels error metric is computed with taking into account pixels on the other side of the pattern. In this gradient, it is repeated 2×2.

And this is how it compared to regular white noise:

2dwhitenoisevsbluenoise.gif

White noise vs blue noise

Because of high frequency contents only, you can’t see this problematic “clumping” effect.

It also means that if we oversample (like with all those new fancy 1080p -> 1440p -> 2160p displays), blur it or apply temporal (one of next parts), it will be more similar to original pattern! So when we filter them with 2-wide Gaussian:

2dditheringwhitevsbluefiltered.png

Left: Gaussian-filtered white noise dithering. Right: Gaussian-filtered blue noise dithering.

And while I said we will not be looking at triangle noise distribution in this post for simplicity, I couldn’t resist the temptation of comparing them:

2dwhitenoisevsbluenoisetriangle.gif

White noise vs blue noise with triangular distribution remapping applied.

I hope that this at least hints at an observation that with a good, well distributed large enough blue noise fixed pattern you might get results maybe not the same quality level of error diffusion dithering, but in that direction and orders of magnitude better than standard white noise.

All four compared

Just some visual comparison of all four techniques:

2dditheringallfourcompared.png

White noise, blue noise, Bayer, interleaved gradient noise

2dditheringalltechniquescompared.gif

And with the triangular remapping:

2dditheringallfourcomparedtriangular.png

2dditheringallfourcomparedtriangular.gif

 

My personal final recommendations and conclusions and here would be:

  • Whenever possible, avoid ordered Bayer! Many game engines and codebases still use it, but it produces very visible unpleasant patterns. I still see it in some currently shipped games!
  • If you cannot spare any memory look-ups but have some ALU, use excellent interleaved gradient noise by Jorge Jimenez. It produces much more pleasant patterns and is extremely cheap with GPU instruction set! However patterns are still noticeable and it can alias.
  • Blue noise is really great noise distribution for many dithering tasks and if you have time to generate it and memory to store it + bandwidth to fetch, it is the way to go.
  • White noise is useful for comparison / ground truth. With pixel index hashing it’s easy to generate, so it’s useful to keep it around.

Summary

In this part of the series, I looked at the topic of quantization of 2D images for the purpose of storing them at limited bit depth. I analyzed looks and effects of white noise, ordered Bayer pattern matrices, interleaved gradient noise and blue noise.

In next part of the series (coming soon), we will have a look at the topic of dithering in more complicated (but also very common) scenario – uniform sampling. It is slightly different, because often requirements are different. For example if you consider rotations, values of 0 and 2pi will “wrap” and be identical – therefore we should adjust our noise distribution generation error metric for this purpose. Also, for most sampling topics we will need to consider more that 1 value of noise.

Blog post mini-series index.

Edit 10/31/2016: Fixed triangular noise remapping to work in -1 – 1 range instead of -0.5-1.5. Special thanks to Romain Guy and Mikkel Gjoel for pointing out the error.

References

http://loopit.dk/banding_in_games.pdf “Banding in games”, Mikkel Gjoel

https://engineering.purdue.edu/~bouman/ece637/notes/pdf/Halftoning.pdf C. A. Bouman: Digital Image Processing – January 12, 2015

B. E. Bayer, “An optimum method for two-level rendition of continuous-tone pictures”, IEEE Int. Conf. Commun., Vol. 1, pp. 26-1 1-26-15 (1973).

http://www.iryoku.com/next-generation-post-processing-in-call-of-duty-advanced-warfare “Next generation post-processing in Call of Duty: Advanced Warfare”, Jorge Jimenez, Siggraph 2014 “Advances in Real Time Rendering in games”

“Blue-noise Dithered Sampling”, Iliyan Georgiev and Marcos Fajardo

https://github.com/bartwronski/BlueNoiseGenerator 

http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/ “Quick And Easy GPU Random Numbers In D3D11”, Nathan Reed

Posted in Code / Graphics | Tagged , , , , , , , , , , , , | 4 Comments

Dithering part two – golden ratio sequence, blue noise and highpass-and-remap

In previous part of the mini-series I covered dithering definition and how dithering changes error characteristics of simple 1D quantization and functions.

In this part I will try to look at what blue noise is, but first wanted to have a look at a number sequence that I used in the previous post and I find very useful.

You can find a Mathematica notebook for golden sequence here and its pdf version here.

For the second part of the post you can find the notebook here and its pdf version here.

Golden ratio sequence

In previous post I used “some” quasi-random function / sequence and mentioned that it’s not perfect, but very useful. The sequence is a sequence made of fractional part of next multiplications of a golden number.

FractionalPart[N*GoldenRatio]

I found idea for using it in a paper “Golden Ratio Sequences For Low-Discrepancy Sampling” by Colas Schretter and Leif Kobbelt.

This is incredibly fascinating sequence, as it seems to distribute next values very well and quite far apart:

golden_sequence_lines.gif

The differences between next elements are in modulus 1:

{0.381966, 0.618034, 0.381966, 0.381966, 0.618034, 0.381966, 0.618034, 0.381966}

So oscillating golden number modulo 1 itself and 2 minus golden number modulo 1. Both numbers are distant enough from zero and one to produce well-distributed sequence where next samples add lots of information.

Edit: Mikkel Gjøl observed that modulo/toroidal minimum distance of 0.618034 is 0.381966, so the sequence becomes:

{0.381966, 0.381966, 0.381966, 0.381966, 0.381966, 0.381966, 0.381966, 0.381966}

I find it a remarkably beautiful property!

Even for small number of “samples” in the sequence, they cover whole 0-1 range very well:

golden_sequence_comparison.gif

Numbers plotted as colors also look “pleasant”:

golden1d.png

If we look at its periodogram:

GoldenNumberSequencePeriodogram.png

We also find some fascinating properties. First of all, energy seems to increase with frequencies. There are visible “spikes” in some frequencies and what is even more interesting is that every next spike happens at a frequency that is golden ratio times higher and it has golden ratio times more energy! I don’t have any explanation for it… so if you are better than me at maths, please contribute with a comment!

This frequency characteristic is extremely useful, however doesn’t satisfy all of our dithering needs. Why? Imagine that our source signal that we are dithering contains same frequencies. Then we would see extra aliasing in those frequencies. Any structure in noise used for dithering can become visible, and can produce undesired aliasing.

I still find this sequence extremely valuable and use it heavily in e.g. temporal techniques (as well as hemispherical sampling). There are other very useful low-discrepancy sequences that are heavily used in rendering – I will not cover those here, instead reference the physically based rendering “bible” – “Physically Based Rendering, Third Edition: From Theory to Implementation” by Matt Pharr, Wenzel Jakob and Greg Humphreys and chapter 7 that authors were kind enough to provide for free!

For now let’s look at blue noise and theoretically “perfect” dithering sequences.

Blue noise

What is blue noise? Wikipedia defines it as:

Blue noise is also called azure noise. Blue noise’s power density increases 3 dB per octave with increasing frequency (density proportional to f ) over a finite frequency range. In computer graphics, the term “blue noise” is sometimes used more loosely as any noise with minimal low frequency components and no concentrated spikes in energy.

And we will use here this more liberal definition (with no strict definition of frequency distribution density increase).

We immediately see that previous golden ratio sequence is not blue noise, as it has lots of visible spikes in spectrum. Perfect blue noise has no spikes and therefore is not prone to aliasing / amplifying those frequencies.

There are many algorithms for generating blue noise, unfortunately many of them heavily patented. We will have a look at 2 relatively simple techniques that can be used to approximate blue noise.

Generating blue noise – highpass and remap

The first technique we will have a look at comes from Timothy Lottes and his AMD GPUOpen blog post “Fine art of film grain”.

The technique is simple, but brilliant – in step one let’s take a noise with undesired frequency spectrum and just reshape it by applying high pass filter.

Unfortunately, arbitrary high pass filter will produce a signal with very uneven histogram and completely different value range than original noise distribution:

afterHighpass.png

After arbitrary highpass operation of random noise originally in 0-1 range.

This is where part 2 of the algorithm comes in. Remapping histogram to force it to be in 0-1 range! Algorithm is simple – sort all elements by value and then remap the value to position in the list.

Effect is much better:

afterHighpassAndRemap.png

Unfortunately, histogram remapping operation also changes the frequency spectrum. This is inevitable, as histogram remapping changes relative value of elements not linearly. Values in middle of the histogram (corresponding to areas that originally had lost of low frequency component) will be changed much more than values in areas with high frequency content. This way part of high-pass filtering effect is lost:

highpass_periodogram_before_after_remap.png

Comparison of histogram before (red) and after (black) the remap, renormalized manually. Note how some low frequency component reappeared.

Still, effect looks pretty good compared to no high pass filtering:

before_after_lowpass.png

Top – regular random noise. Bottom – with high pass and histogram remapping.

Its frequency spectrum also looks promising:

afterHighpassAndRemapPeriodogram.png

However, there is still this trailing low pass component. It doesn’t contain lots of energy, but still can introduce some visible low pass error in dithered image…

What we can do is to re-apply the technique again!

This is what we get:

highPassRemapTwice.gif

Frequency spectrum definitely looks better and whole algorithm is very cheap so we can apply it as many times as we need.

Unfortunately, no matter how many times we will reapply it, it’s impossible to “fix” all possible problematic spots.

I think about it this way – if some area of the image contains only very low frequency, after applying highpass filter, it will get few adjacent values that are very close to zero. After histogram remapping, they will get remapped to again similar, adjacent values.

HighpassAndRemapUnfixable.png

Small part of a sequence with a local minimum that algorithm repeated even 10 times cannot get out of. Notice few areas of almost uniform gray.

It’s possible that using a different high pass filter or adding some noise between iterations or detecting those problematic areas and “fixing” them would help – but it’s beyond scope of this post and the original technique.

What is worth noting is that original algorithm gives sequence that is not perfect, but often “good enough” – it leaves quite bad local spots, but optimizes frequency spectrum globally .Let’s check it in action.

Results

Let’s have a look at our initial, simple 1D dithering for binary quantization:

HighpassAndRemapDitherComparison.png

Rows 1, 3, 5 – original sine function. Row 2 – dithering with regular noise. Row 4 – dithering with golden ratio sequence. Row 6 – dithering with “highpass and remap” blue-noise-like sequence.

We can see that both golden ratio sequence and our highpass and remap are better than regular noise. However it seems like golden ratio sequence performs better here due to less “clumping”. You can see though some frequency “beating” corresponding to peak frequencies there:

HighpassRemapDitherPeriodograms.gif

Black – white noise. Red – golden ratio sequence. Green – highpass and remap noise sequence.

So this is not a perfect technique, but a) very fast b) tweakable and c) way better than any kind of white noise.

Better? Slower blue noise

Ok, what could we do if we wanted some solution that doesn’t contain those local “clumps”? We can have a look at Siggraph 2016 paper “Blue-noise Dithered Sampling” by Iliyan Georgiev and Marcos Fajardo from Solid Angle.

The algorithm is built around the idea of using probabilistic technique of simulated annealing to globally minimize desired error metric (in this case distance between adjacent elements).

I implemented a simple (not exactly simulated annealing; more like a random walk) and pretty slow version supporting 1, 2 and 3D arrays with wrapping: https://github.com/bartwronski/BlueNoiseGenerator/

As usually with probabilistic global optimization techniques, it can take pretty damn long! I was playing a bit with my naive implementation for 3D arrays and on 3yo MacBook after a night running it converge to at best average quality sequence. However, this post is not about the algorithm itself (which is great and quite simple to implement), but about the dithering and noise.

For the purpose of this post, I generated a 2000 elements, 1D sequence using my implementation.

This is a plot of first 64 elements:

generatedbluepiece.png

Looks pretty good! No clumping, pretty good distribution.

Frequency spectrum also looks very good and like desired blue noise (almost linear energy increase with frequency)!

generatedbluefrequencies.png

If we compare it with frequency spectrum of “highpass and remap”, they are not that different; slightly less very low frequencies and much more of desired very high frequencies:

highpassremapvsgenerated.gif

Highpass and remap (black) vs Solid Angle technique (red).

We can see it compared with all other techniques when applied to 1D signal dithering:

all1dtechniquescomparison.png

Every odd row is “ground truth”. Even rows: white noise, golden ratio sequence, highpass and remap and finally generated sequence blue noise.

It seems to me to be perceptually best and most uniform (on par with golden ratio sequence).

We can have a look at frequency spectrum of error of those:

all1dtechniquescomparedfrequencies.gif

Black – white noise. Red – golden ratio sequence. Green – highpass and remap. Yellow – generated sequence.

If we blur resulting image, it starts to look quite close to original simple sine signal:

comparisonall1dblurred.png

If I was to rate them under this constraints / scenario, I would probably use order from best to worst:

  • Golden ratio sequence,
  • Blue noise generated by Solid Angle technique,
  • Blue noise generated by highpass and remap,
  • White noise.

But while it may seem that golden ratio sequence is “best”, we also got lucky here, as our error didn’t alias/”resonate” with frequencies present in this sequence, so it wouldn’t be necessarily best case for any scenario.

Summary

In this part of blog post mini-series I mentioned blue noise definition, referenced/presented 2 techniques of generating blue noise and one of many general purpose high-frequency low-discrepancy sampling sequences. This was all still in 1D domain, so in the next post we will have a look at how those principles can be applied to dithering of a quantization of 2D signal – like an image.

Blog post mini-series index.

References

https://www.graphics.rwth-aachen.de/media/papers/jgt.pdf “Golden Ratio Sequences For Low-Discrepancy Sampling”, Colas Schretter and Leif Kobbelt

“Physically Based Rendering, Third Edition: From Theory to Implementation”, Matt Pharr, Wenzel Jakob and Greg Humphreys.

https://en.wikipedia.org/wiki/Colors_of_noise#Blue_noise 

http://gpuopen.com/vdr-follow-up-fine-art-of-film-grain/ “Fine art of film grain”, Timothy Lottes

“Blue-noise Dithered Sampling”, Iliyan Georgiev and Marcos Fajardo

Posted in Code / Graphics | Tagged , , , , , , , | 6 Comments

Dithering part one – simple quantization

Introduction

First part of this mini-series will focus on more theoretical side of dithering -some history and applying it for 1D signals and to quantization. I will try to do some frequency analysis of errors of quantization and how dithering helps them. It is mostly theoretical, so if you are interested in more practical applications, be sure to check the index and other parts.

You can find Mathematica notebook to reproduce results here and the pdf version here.

What is dithering?

Dithering can be defined as intentional / deliberate adding of some noise to signal to prevent large-scale / low resolution errors that come from quantization or undersampling.

If you have ever worked with either:

  • Audio signals,
  • 90s palletized image file formats.

You must have for sure encountered dithering options that by adding some noise and small-resolution artifacts “magically” improved quality of audio files or saved images.

However, I found on Wikipedia quite an amazing fact about when dithering was first defined and used:

…[O]ne of the earliest [applications] of dither came in World War II. Airplane bombers used mechanical computers to perform navigation and bomb trajectory calculations. Curiously, these computers (boxes filled with hundreds of gears and cogs) performed more accurately when flying on board the aircraft, and less well on ground. Engineers realized that the vibration from the aircraft reduced the error from sticky moving parts. Instead of moving in short jerks, they moved more continuously. Small vibrating motors were built into the computers, and their vibration was called dither from the Middle English verb “didderen,” meaning “to tremble.” Today, when you tap a mechanical meter to increase its accuracy, you are applying dither, and modern dictionaries define dither as a highly nervous, confused, or agitated state. In minute quantities, dither successfully makes a digitization system a little more analog in the good sense of the word.

— Ken Pohlmann, Principles of Digital Audio
This is inspiring and interesting historical fact and as I understand it that it works by avoiding bias in computations and resonances by randomly breaking up some mechanical vibration feedback loops.
But history aside, let’s look at the dithering process for 1D signals first, like audio.

Dithering quantization of a constant signal

We will start first with analyzing the most boring possible signal – a constant signal. If you know a bit about audio and audio-related DSP, you might ask – but you promised looking at audio and and audio by definition cannot have a constant term! (furthermore, both audio software and hardware deliberately remove so called DC offset)
That’s true and we will have a look at more complicated functions in a second, but first things first.
Imagine that we are doing a 1 bit quantization of a normalized floating point signal. This means we will be dealing with final binary values, 0 or 1.
If our signal is 0.3, simple rounding without any dithering will be the most boring function ever – just zero!
Error is also constant, 0.3 and therefore average is also 0.3. This means that we introduced quite big bias to our signal and completely lost original signal information.
We can try to dither this signal and have a look at results.
Dithering in this case (used rounding function) is applying just plain, random white noise (random value per every element, producing uniform noise spectrum) and adding random value from range (-0.5, 0.5) to the signal prior to quantization.

quantizedDitheredSignal =
  Round[constantSignalValue + RandomReal[] – 0.5] & /@ Range[sampleCount];

Constant_dither_noise.png
Constant_dither_noise_img.png
It’s difficult to see anything here, just that now result of quantization is some random ones and zeros… With (as hopefully expected) more zeros. It’s not terribly interesting signal on its own, but what is quite interesting is the plot of the error and average error.
Constant_dither_error.png
Ok, we can see that as expected, error is also alternating… but what is quite scary is that error got sometimes bigger (0.7 absolute value)! So our maximum error is worse, pretty unfortunate… however, average noise is:

Mean[ditheredSignalError]
0.013

Much much smaller than original error of 0.3. With sufficiently large amount of samples this error would go towards zero (limit). So error for constant term got much smaller, but let’s have a look at frequency plot of all errors.
spectrum_quantization_noise_comparison.gif

Red plot/spike = frequency spectrum of error when not using dithering (constant, no frequencies). Black – with white noise dithering.

Things are getting more interesting! This shows first major takeaway of this post – dithering distributes quantization error / bias among many frequencies.
We will have a look in the next section how it helps us.

Frequency sensitivity and low-pass filtering

So far we observed that dithering a quantized constant signal:
  • Increased maximal error.
  • Almost zeroed average, mean error.
  • Added constant white noise (full spectral coverage) to the error frequency spectrum, reducing the low-frequency error.
By itself it doesn’t help us too much… However, we are not looking at quantization of any arbitrary mathematical function / signal. We are looking here at signals that will be perceived by humans. Human perception is obviously limited, some examples:
  • Our vision has a limit of acuity. Lots of people are short-sighted and see blurred image of faraway objects without correction glasses.
  • We perceive medium scale of detail much better than very high or very low frequencies (small details of very smooth gradients may be not noticed).
  • Our hearing works in specific frequency range (20Hz -20kHz, but it gets worse with age) and we are most sensitive to middle ranges – 2kHz-5kHz.

Therefore, any error in frequencies closer to upper range of perceived frequency will be much less visible.

Furthermore, our media devices are getting better and better and provide lots of oversampling. In TVs and monitors we have “retina”-style and 4K displays (where it’s impossible to see single pixels), in audio we use at least 44kHz sampling file formats even for cheap speakers that often can’t reproduce more than 5-10kHz.

This means, that we can approximate perceptual look of a signal by low-pass filtering it. Here I did a low pass filtering (padding with zeros on the left -> “ramp up”):
Constant_dither_noise_lowpass.png

Red – desired non-quantized signal. Green – quantized and dithered signal. Blue – low pass filter of that signal.

Signal starts to look much closer to original, unquantized function!
Unfortunately we started to see some low frequencies that are very visible and were not present in the original signal. We will look at fixing it by using blue noise in part 3 of the series, but for now this is how it could look like with some quasi-noise function that has much less lower frequency content:
Constant_dither_noise_golden_lowpass.png
This is possible because our quasi-random sequence has following frequency spectrum:
Constant_dither_noise_golden_spectrum.png
But for now enough looking at simplistic, constant function. Let’s have a look at a sine wave (and if you know Fourier theorem – a building block of any periodic signal!).

Quantizing a sine wave

If we quantize a sine wave with 1 bit quantization, we get a simple… square wave.
sine_quantize.png
Square wave is quite interesting, as it comprises base frequency as well as odd harmonics.
It is interesting property that is used heavily in analog subtractive synthesizers to get hollow/brassy sounding instruments. Subtractive synthesis starts with a complex, harmonically rich sound and filters it by removing some frequencies (with filter parameters varying over time) to shape sounds in desired way.
Square wave frequency spectrum:
sine_quantize_spectrum.png
But in this post we are more interested in quantization errors! Let’s plot the error as well as frequency spectrum of the error:
sine_quantize_error.png
sine_quantize_error_spectrum.png
In this case we are in much better situation – average error is close to zero! Unfortunately, we still have lots of undesired low frequencies, very close to our base frequency (odd multiplies with decreasing magnitudes). This is known as aliasing or dithering noise – frequencies that were not present in the original signal appear and they have pretty large magnitudes.
Even low-pass filtering cannot help this signal much… As error has so many low frequencies:
sine_quantize_lowpass.png

Low-pass filtered quantized sine

sine_quantize_error_lowpass.png

Low-pass filtered quantized sine error

Let’s have a look at how this changes with dithering. At first sight things don’t improve a lot:
sine_quantize_dither.png
However if we display it as an image, it starts to look better:
sine_quantize_dither_img.png
And notice how again, quantization error gets distributed among different frequencies:
spectrum_quantization_noise_comparison_sine.gif
This looks very promising! Especially considering that we can try to filter it now:
sine_quantize_dither_lowpass.png
That’s slightly crooked sine, but looks much closer to original one than the non-dithered one with the exception of a phase shift introduced by asymmetrical filter (I am not going to describe or cover it here; it is fixable simply by applying symmetrical filters):
sine_quantize_dither_lowpass_comparison.png

Red – original sine. Green – low pass filtered undithered signal. Blue – low pass filtered dithered signal.

Plotting both error functions confirms numerically that error is much smaller:

sine_quantize_dither_lowpass_error.png

Red – error of low-pass filtered non-dithered signal. Blue – error of low-pass filterer dithered signal.

Finally, let’s just quickly look at a signal with better dithering function containing primarily high frequencies:

sine_quantize_dither_vs_golden_img.png

Upper image – white noise function. Lower image – a function containing more higher frequencies.

sine_quantize_dither_golden_lowpass_comparison.png

Low-pass filtered version dithered with a better function – almost perfect results if we don’t count filter phase shift!

And finally – all 3 comparisons of error spectra:

spectrum_quantization_noise_golden__comparison_sine.gif

Red – undithered quantized error spectrum. Black – white noise dithered quantized error spectrum. Blue – noise with higher frequencies ditherer error spectrum.

Summary

This is end of part one. Main takeaways are:
  • Dithering distributes quantization error / bias among many different frequencies that depend on the dithering function instead of having them focused in lower frequency area.
  • Human perception of any signal (sound, vision) works best in very specific frequency ranges. Signals are often over-sampled for end of perception spectrum in which perception is almost marginal. For example common audio sampling rates allow reproducing signals that most adults will not be able to hear at all. This makes use of dithering and trying to shift error into this frequency range so attractive because of previous point.
  • Different noise functions produce different spectra of error that can be used knowing which error spectrum is more desired.

In the next part we will have a look at various dithering functions – the one I used here (golden ratio sequence) and blue noise.

Blog post mini-series index.

Posted in Code / Graphics | Tagged , , , , , | 2 Comments