C#/.NET graphics framework

In my previous post about bokeh I promised to write a bit more about the simple C# graphics framework I use at home for prototyping various DX11 graphics effects.

You can download its early version with demonstration of bokeh effect here.

So, the first question I should probably answer is…

Why yet another framework?

Well, there really are not many. :) In the old days of DirectX 9, lots of coders seemed to be using ATI (now AMD) RenderMonkey. It is no longer supported and doesn't support modern DirectX APIs. I really doubt that with the more complex DX10+ style of API it would be possible to create something similar with a full feature set – UAVs in all shader stages, tessellation, geometry and compute shaders.

Also, most newly developed algorithms have become much more complex.

Lots of coders seem to be using Shadertoy (or something quite similar) to showcase effects – quite an awesome example is the implementation of Brian Karis area lights by ben. Unfortunately such frameworks work well only for fully procedural, usually raymarched rendering with a single pass – while you can demonstrate amazing visual effects (demoscene style), this is totally unlike regular rendering pipelines and is often useless for prototyping shippable rendering techniques. Also, because everything is based on raymarching, the code becomes hard to follow and understand, with tons of magic numbers, hacks and helper functions needed to achieve even simple functionality…

There are two frameworks I would consider using myself and that caught my attention:

  • “Sample Framework” by Matt Pettineo. It seems it wraps very well lots of common steps needed to set up simple DirectX 11 app and Matt adds new features from time to time. In the samples I tried it works pretty well and the code and structure are quite easy to follow. If you like coding in C++ this would be something I would look into first, however I wanted to have something done more in “scripting” style and that would be faster to use. (more about it later).
  • bgfx by Branimir Karadžić. I didn’t use it myself, cannot really tell more about it, but it has benefit of being multiplatform and multi API, so it should make it easy to abstract lots of stuff – this way algorithms should be easier to present in a platform agnostic way. But it is more of an API abstraction library, not a prototyping playground / framework.

A year or two ago I started to write my own simple tool, so I didn't look very carefully into them, but I really recommend that you do – both of them are for sure more mature and better written than my simple tech.

Let's get to my list of requirements and must-haves when developing and prototyping stuff:

  • Possibility of doing multi pass rendering.
  • Mesh and texture loading.
  • Support for real GPU profiling – an FPS counter or a single timing counter is not enough! (btw. paper authors, please stop using FPS as a performance metric…)
  • DX11 features, but wrapped – DX11 is not a very clean API; you need to write tens of lines of code to create a simple render target and all of its "interesting" views like RTV, UAV and SRV.
  • Data drivenness and “scripting-like” style of creating new algorithms.
  • Shader and possibly code reloading and hot swapping (zero iteration times).
  • Simple to create UI and data driven UI creation.

Why C# / .NET

I'm not a very big fan of C++ and its object-oriented style of coding. I believe that for some tasks (not performance critical) scripting or data-driven languages are much better, while other things are expressed much better in a functional or data-oriented style. C++ can be a "dirty" language, doesn't have a very good standard library, and templated extensions like boost (which you need for tasks as simple as regular expressions) are a nightmare to read. To make your program usable, you need to add tons of external library requirements, and it gets quite hard to have them compile properly across multiple machines, configurations or library versions.

Obviously, C++ is here to stay, especially in games; I work with it every day and can enjoy it as well. But on the other hand I believe that it is very beneficial if a programmer works in different languages with different working philosophies – this way you learn to "think" about problems and algorithms, not about language-specific solutions. So I also love Mathematica, multi-paradigm Python, and C#/.NET.

As I said, I wanted to be able to code new algorithms in a “scripting” style, not really thinking about objects, but more about algorithms themselves – so I decided to use .NET and C#.

It has many benefits:

  • .NET has lots of ways of expressing solutions to a problem. You can even write in a more dynamic/scripting style – Emit and dynamic objects are extremely powerful tools.
  • It has amazingly fast compilation times and quite decent edit & continue support.
  • Its performance is not that bad, as long as you don't use it for code that is executed thousands of times per frame.
  • .NET on Windows is an excellent environment / library and has everything I need.
  • It should run on almost every developer's Windows machine with Visual Studio Express (free!), and if you limit the used libraries (I use SlimDX), compilation / dependency resolving shouldn't be a problem.
  • It is very easy to write complex functional-style solutions to problems with LINQ (yes, probably all game developers would look disgusted at me right now :) ).
  • It is trivial to code UI, windows etc.

So, here I present my C# / .NET framework!

csharprenderer

 

Simplicity of adding new passes

As I mentioned, my main reason to create this framework was making sure that it is trivial to add new passes, especially with various render targets, textures and potentially compute shaders. Here is an example of adding a simple pass, binding some resources and a render target, and then rendering a typical post-process fullscreen pass:

 

using (new GpuProfilePoint(context, "Downsample"))
{
    context.PixelShader.SetShaderResource(m_MainRenderTarget.m_RenderTargets[0].m_ShaderResourceView, 0);
    context.PixelShader.SetShaderResource(m_MainRenderTarget.m_DepthStencil.m_ShaderResourceView, 1);
    m_DownscaledColorCoC.Bind(context);
    PostEffectHelper.RenderFullscreenTriangle(context, "DownsampleColorCoC");
}

We also get a wrapped GPU profiler for the given section. :)

To create the interesting resources (a render target texture with all the potentially interesting resource views), one simply types, once:

m_DownscaledColorCoC = RenderTargetSet.CreateRenderTargetSet(device, m_ResolutionX / 2, m_ResolutionY / 2, Format.R16G16B16A16_Float, 1, false);

Ok, but how do we handle the shaders?

Data driven shaders

I wanted to avoid the tedious manual compilation of shaders, creation of shader objects and determining their type. Adding a new shader should be done in just one place – the shader file – so I went with a data-driven approach.

A part of the code called ShaderManager parses all the .fx files in the executable directory with multiple regular expressions, looks for shader definitions, sizes of compute shader dispatch groups etc., and stores all this data.

So all shaders are defined in HLSL with some annotations in comments, and they are automatically found and compiled. It also supports shader reloading; on a shader compilation error it presents a message box with the error message, which you can close after fixing all of the shader compilation errors (multiple retries are possible).

This way shaders are automatically found and can be referenced in code by name.

// PixelShader: DownsampleColorCoC, entry: DownsampleColorCoC
// VertexShader: VertexFullScreenDofGrid, entry: VShader
// PixelShader: BokehSprite, entry: BokehSprite
// PixelShader: ResolveBokeh, entry: ResolveBokeh
// PixelShader: ResolveBokehDebug, entry: ResolveBokeh, defines: DEBUG_BOKEH
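
Just to illustrate the idea, here is a minimal sketch of how such annotation parsing can be done – this is not the actual ShaderManager code; the class, the regex and the exact annotation grammar below are my assumptions:

// Hypothetical sketch of annotation parsing - not the actual ShaderManager code.
// Assumes annotations of the form: // PixelShader: Name, entry: EntryPoint, defines: DEF
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class ShaderDefinition
{
    public string Type;       // "PixelShader", "VertexShader", "ComputeShader"...
    public string Name;       // name used to reference the shader from C# code
    public string EntryPoint; // HLSL entry point function
    public string Defines;    // optional preprocessor defines
    public string FilePath;   // .fx file the definition came from
}

static class ShaderAnnotationParser
{
    static readonly Regex s_ShaderRegex = new Regex(
        @"//\s*(?<type>Pixel|Vertex|Compute|Geometry)Shader:\s*(?<name>\w+)\s*,\s*entry:\s*(?<entry>\w+)(\s*,\s*defines:\s*(?<defines>[\w\s]+))?",
        RegexOptions.Compiled);

    public static List<ShaderDefinition> ParseDirectory(string directory)
    {
        var result = new List<ShaderDefinition>();
        foreach (string file in Directory.GetFiles(directory, "*.fx"))
        {
            foreach (Match m in s_ShaderRegex.Matches(File.ReadAllText(file)))
            {
                result.Add(new ShaderDefinition
                {
                    Type = m.Groups["type"].Value + "Shader",
                    Name = m.Groups["name"].Value,
                    EntryPoint = m.Groups["entry"].Value,
                    Defines = m.Groups["defines"].Success ? m.Groups["defines"].Value.Trim() : null,
                    FilePath = file,
                });
            }
        }
        return result; // each definition would then be compiled with D3DCompiler and stored by Name
    }
}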

Data driven constant buffers

I also support data-driven constant buffers and a manual reflection system – I never really trusted the DirectX effects framework / OpenGL reflection.

I use dynamic objects from .NET to access all constant buffer member variables just like regular C# member variables – both for read and write. It is definitely not the most efficient way to do it – forget about even hundreds of draw calls with different constant buffers – but raw performance was never the main goal of my simple framework; real speed of prototyping was.

Example of (messy) mixed read and write constant buffer code – none of the "member" variables are defined anywhere in the C# code:

mcb.zNear = m_ViewportCamera.m_NearZ;
mcb.zFar = m_ViewportCamera.m_FarZ;
mcb.screenSize = new Vector4((float)m_ResolutionX, (float)m_ResolutionY, 1.0f / (float)m_ResolutionX, 1.0f / (float)m_ResolutionY);
mcb.screenSizeHalfRes = new Vector4((float)m_ResolutionX / 2.0f, (float)m_ResolutionY / 2.0f, 2.0f / (float)m_ResolutionX, 2.0f / (float)m_ResolutionY);
m_DebugBokeh = mcb.debugBokeh > 0.5f;
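
For readers wondering how such "undeclared members" can work, here is a minimal sketch of a DynamicObject-based constant buffer wrapper – my assumption of one possible implementation, not necessarily how the framework actually does it (it handles floats only and skips the GPU upload):

// Minimal sketch - not the framework's actual code. The name -> offset map would be
// filled by the same regex pass that parses the cbuffer layout in the .fx file.
using System;
using System.Collections.Generic;
using System.Dynamic;

class DynamicConstantBuffer : DynamicObject
{
    readonly Dictionary<string, int> m_Offsets; // member name -> byte offset in the buffer
    readonly byte[] m_CpuData;                  // CPU shadow copy, uploaded to the GPU when dirty
    public bool Dirty { get; private set; }

    public DynamicConstantBuffer(Dictionary<string, int> offsets, int sizeInBytes)
    {
        m_Offsets = offsets;
        m_CpuData = new byte[sizeInBytes];
    }

    // mcb.zNear = 0.1f;  -> writes 4 bytes at the parsed offset of "zNear"
    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        int offset;
        if (!m_Offsets.TryGetValue(binder.Name, out offset)) return false;
        byte[] raw = BitConverter.GetBytes(Convert.ToSingle(value)); // floats only, for brevity
        Buffer.BlockCopy(raw, 0, m_CpuData, offset, raw.Length);
        Dirty = true;
        return true;
    }

    // float z = mcb.zNear;  -> reads the CPU shadow copy back
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        int offset;
        if (!m_Offsets.TryGetValue(binder.Name, out offset)) { result = null; return false; }
        result = BitConverter.ToSingle(m_CpuData, offset);
        return true;
    }
}

// Usage: dynamic mcb = new DynamicConstantBuffer(parsedOffsets, 256); mcb.zNear = 0.1f;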

A nice and useful part of parsing constant buffers with regular expressions is that I can directly specify which variables are supposed to be user-driven. This way my UI is also created procedurally.

procedural_ui

float ambientBrightness; // Param, Default: 1.0, Range:0.0-2.0, Gamma
float lightBrightness;   // Param, Default: 4.0, Range:0.0-4.0, Gamma
float focusPlane;        // Param, Default: 2.0, Range:0.0-10.0, Linear
float dofCoCScale;       // Param, Default: 6.0, Range:0.0-32.0, Linear
float debugBokeh;        // Param, Default: 0.0, Range:0.0-1.0, Linear

As you can see, it supports different curve responses for the sliders. Currently it is not very nice looking due to my low UI skills and laziness ("it kind of works, so why bother") – but I promise to improve it a lot in the near future, both on the code side and in usability.
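
To show what I mean by curve responses, here is a rough sketch of mapping a normalized slider position to a parameter value based on the parsed annotation – the exact gamma curve (and its exponent) is my assumption, not necessarily what the framework uses:

// Rough sketch of mapping a 0..1 slider position to a parameter value,
// based on the parsed "Range:min-max" and "Gamma"/"Linear" annotation.
// The 2.2 exponent here is my assumption, not necessarily the framework's value.
enum SliderCurve { Linear, Gamma }

struct ParamSlider
{
    public float Min, Max;
    public SliderCurve Curve;

    public float SliderToValue(float sliderPos01)
    {
        float t = sliderPos01;
        if (Curve == SliderCurve.Gamma)
            t = (float)System.Math.Pow(t, 2.2); // more slider precision near the low end
        return Min + (Max - Min) * t;
    }
}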

Profilers

The final feature I wanted to talk about, and something that was very important for me when developing my framework, is the possibility of using multiple GPU profilers extensively.

You can place lots of them in a hierarchy and the profiling system will resolve them (DX11 disjoint timestamp queries are not obvious to implement). I also created a very crude UI that presents the results in a separate window.

profilers
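
To give an idea of the non-obvious part, here is a minimal sketch of timing one GPU section with D3D11 timestamp / disjoint queries through SlimDX – my assumption of how such a profiler can be structured, not the framework's actual code (the exact SlimDX wrapper property names may differ slightly, and the delayed GetData readback of the query results is omitted):

// Key points: timestamps are only meaningful inside a TimestampDisjoint begin/end pair,
// and results must be read back a few frames later to avoid stalling the GPU.
using SlimDX.Direct3D11;

class GpuTimingSection
{
    public Query DisjointQuery; // QueryType.TimestampDisjoint - brackets the measured range
    public Query BeginQuery;    // QueryType.Timestamp
    public Query EndQuery;      // QueryType.Timestamp

    public GpuTimingSection(Device device)
    {
        DisjointQuery = new Query(device, new QueryDescription { Type = QueryType.TimestampDisjoint });
        BeginQuery = new Query(device, new QueryDescription { Type = QueryType.Timestamp });
        EndQuery = new Query(device, new QueryDescription { Type = QueryType.Timestamp });
    }

    public void Begin(DeviceContext context)
    {
        context.Begin(DisjointQuery);
        context.End(BeginQuery);   // timestamp queries use End() only
    }

    public void End(DeviceContext context)
    {
        context.End(EndQuery);
        context.End(DisjointQuery);
    }

    // Call this a few frames later, on an older frame's query data, so the results are ready.
    public static double ResolveMilliseconds(ulong beginTick, ulong endTick,
                                             ulong frequency, bool wasDisjoint)
    {
        // If the disjoint flag is set (e.g. the GPU changed clocks), the interval is unreliable.
        if (wasDisjoint) return 0.0;
        return (endTick - beginTick) / (double)frequency * 1000.0;
    }
}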

Future and licence

Finally, some words about the future of this framework and licence to use it.

This is 100% open source without any real licence name or restrictions, so use it however you want, at your own risk. If you use it, publish something based on it and respect the graphics programming community and its development, please share your sources as well and mention where and whom you got the original code from – but you don't have to.

I know that it is in a very rough form, with lots of unfinished code, but every week it gets better (every time I use it and find something annoying or not easy enough, I fix it :) ) and I can promise to release updates from time to time.

Lots of stuff is not very efficient – but it doesn’t really matter, I will improve it only if I need to. On the other hand, I aim to improve code quality and readability constantly.

My nearest plans are to fix the obj loader, add mesh and shader binary caching, better structured buffer handling (like append/consume buffers), more supported types in constant buffers, and a fixed-up UI. Further in the future I plan to add more reflection for texture and UAV resources, font drawing and GPU buffer-based on-screen debugging.

 

Bokeh depth of field – going insane! part 1

Recently I was working on a console version of depth of field suitable for gameplay – a simple, high quality effect running with decent performance on all target platforms and not eating a big percentage of the frame budget.

There are tons of publications about depth of field and bokeh rendering. Personally I like photographic, circular bokeh – it was also a request from the art director – so my approach is simple Poisson-like filtering: not separable, but it achieves nice circular bokeh. Nothing fancy to write about.

If you wanted to do it with other shapes, I have two recommendations:

1. For a hexagonal shape – a presentation on how to approximate it with a couple of passes of separable skewed box blurs, by John White and Colin Barré-Brisebois from Siggraph 2011. [1]

2. Probably best for "any" shape of bokeh – the smart, modern DirectX 11 / OpenGL idea of extracting "significant" bokeh sprites by Matt Pettineo. [2]

But… I looked at some old screenshots of the game I spent a significant part of my life on – The Witcher 2 – and missed its bokeh craziness. Just look at this bokeh beauty! :)

witcher_bokeh2

witcher_bokeh3

I will write a bit about the technique we used, and I aim to start a small series about getting an "insane", high quality bokeh effect intended only for cutscenes, and about how to optimize it (I already have some prototypes of tile-based and software-rasterizer-based approaches).

Bokeh quality

I am a big fan of analog and digital photography; I love medium format analog photography (nothing teaches you to expose and compose your shots better than 12 photos per quite expensive film roll, plus time spent in the darkroom developing it :) ) and, based on my photography experience, I sometimes really hate the bokeh used in games.

First of all – having "hexagon" bokeh in games that are not aiming to simulate lo-fi cameras is a very big art direction mistake for me. Why?

Almost all photographers just hate the hexagonal bokeh that comes from the shape of the aperture blades. Most "good quality" and modern lenses use either a higher number of aperture blades or rounded blades to help fight this artificial effect.

So while I understand the need for it in racing games or in Kane & Lynch gonzo-style lo-fi art direction – it's cool to simulate TV or cheap cameras with terrible lenses – having it in fantasy, historical or sci-fi games just makes no sense…

Furthermore, there are two quite contradictory descriptions of high quality bokeh that depend on the photo and the photographer:

  • "Creamy bokeh". For many the gold standard for bokeh, especially for portraits – it completely melts the background down and allows you to focus your attention on the main photo subject, the person being photographed. The irony here is that such "perfect" bokeh can be achieved with a simple and cheap Gaussian blur! :)

ND7_1514

  • "Busy bokeh" or "bokeh with personality" (the second one is a literal translation from Polish). The preference of others (including myself) – circular or ring-like bokeh that creates really interesting results, especially with foliage. It gives a quite "painterly" and 3D effect, showing the depth complexity of the photographed scene. It was characteristic of many older lenses, Leica or Zeiss, that we still love and associate with the golden age of photography. :)

ND7_1568

Both example photos were taken by me in Iceland. Even the first one (my brother), taken with a portrait 85mm lens, doesn't melt the background completely – a "perfect" portrait lens (135mm+) would.

So while the first kind of bokeh is quite cheap and easy to achieve (but since it doesn't eat a couple of milliseconds, nobody considers it a "truly next gen omg so many bokeh sprites wow" effect ;) ), the second one is definitely more difficult and requires arbitrary, complex shapes of your bokeh sprites.

The Witcher 2 insane bokeh

So… How did I achieve the bokeh effect in The Witcher 2? The answer is simple – full brute force with point sprites! :) While other developers proposed it as well at a similar time [3], [4], I believe we were the first ones to actually ship a game with this kind of bokeh, and as we didn't have DX10/11 support in our engine, I wrote everything using only vertex and pixel shaders.

Edit: Thanks to Stephen Hill for pointing out that actually Lost Planet was first… and much earlier, in 2007! [8]

The algorithm itself looked like:

  1. Downsample the scene color and circle of confusion calculated from depth into half-res.
  2. Render grid of quads – every quad corresponding to one pixel of half-res buffer. In vertex shader fetch depth and color, calculate circle of confusion and scale the sprite accordingly. Do it only for the far CoC - kill triangles corresponding to in-focus and near out-of-focus areas by moving them outside the viewport. In pixel shader fetch the bokeh texture, multiply by it (and by inverse sprite size squared) and output RGBA for premultiplied-alpha-like result. Alpha-blend them additively and hope for enough memory bandwidth.
  3. Do the same second time, for in-focus depth of field.
  4. Combine in one fullscreen pass with in-focus areas.
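
As a side note on step 1, here is a hedged sketch of how a "physically based" circle of confusion can be computed from depth – a standard thin-lens approximation that is entirely my assumption here; as I admit below, the actual CoC code in the game was rather hacked:

// A standard thin-lens circle-of-confusion approximation - my assumption of a
// "physically based" CoC, not the actual formula used in The Witcher 2.
// Distances and focal length in meters; the result is a sprite radius in pixels.
static class CocHelper
{
    public static float CocRadiusInPixels(float sceneDepth, float focusDistance,
                                          float focalLength, float fNumber,
                                          float sensorWidth, float screenWidthInPixels)
    {
        float apertureDiameter = focalLength / fNumber;
        // Thin lens: c = A * f * |S2 - S1| / (S2 * (S1 - f)), a blur diameter on the sensor.
        float cocOnSensor = apertureDiameter * focalLength *
                            System.Math.Abs(sceneDepth - focusDistance) /
                            (sceneDepth * (focusDistance - focalLength));
        // Convert from sensor space to screen pixels and return a radius for sprite scaling.
        return 0.5f * cocOnSensor * (screenWidthInPixels / sensorWidth);
    }
}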

Does the algorithm seem insane? Yes, it is! :) Especially for larger bokeh sprites the overdraw and performance costs were just insane… I think that some scenes could take up to 10ms on bokeh alone, even on the latest GPUs of that time…

However, it worked due to a couple of factors:

  • It was a special effect for the "Ultra" configuration and the best PCs. We turned it off even in the "High" configuration, which got a nice and optimized Gaussian-blur-based depth of field instead.
  • It was used only for cutscenes and dialogues, where we were willing to sacrifice some performance for amazing, eye-candy shots and moments.
  • We had very good cutscene artists setting up values in a "rational" way; they limited the depth of field to avoid such huge timings and to fit everything in the budget. Huge CoC was used in a physically based manner (telephoto lens with a wide aperture) – for very narrow angle shots where usually only one character and just a part of the background were rendered – so we had some budget for it.

Obviously, being older and more experienced, I see how many things we did wrong. AFAIR the code for calculating CoC and the later composition pass were totally hacked, I think I didn't use indexed draw calls (so potentially no vertex reuse), and the multi-pass approach was naive as well – all those vertex texture fetches done twice…

On the other hand, I think that our lack of DX10+ kind of saved us – we couldn't use expensive geometry shaders, so the vertex shaders were probably more optimal. You can check some recent AMD investigations on this topic with nice performance comparisons – they are quite similar to my experiences, even with the simplest geometry shaders. [5]

Crazy scatter bokeh – 2014!

As I mentioned, I have some ideas on how to optimize this effect using modern GPU capabilities such as UAVs, LDS and compute shaders. They are probably obvious to other developers. :)

But before I do (as I said, I hope this will become a whole post series), I reimplemented this effect at home "for fun" and to have a reference.

Very often at home I work just for myself on something that I wouldn't use in a shipping game, when I'm unsure if it will work or be shippable, or when I simply want to experiment. That's how I worked on the volumetric fog for AC4 – I worked on it in my spare time and on weekends at home and, after realizing that it actually could be shippable, I brought it to work. :)

Ok, so some results for scatter bokeh.

dof1dof2dof3dof4

I think it is a quite faithful representation of what we had quality-wise. You can see some minor half-res artifacts (it won't be possible to fully get rid of them… unless you do temporal supersampling :> ) and some blending artifacts, but the effect is quite interesting.

What is really nice about this algorithm is the possibility of having much better near-plane depth of field, with better "bleeding" onto the background (not perfect though!) – example here.

dofnear_blend

Another nice side-effect is having possibility of doing “physically-based” chromatic aberrations.

If you know about physical reasons for chromatic aberrations, you know that what games usually do (splitting RGB and offsetting it slightly) is completely wrong. But with custom bokeh texture, you can do them accurately and properly! :)

Here is an example of a bokeh texture with some aberrations baked in (these are incorrect – I should scale the color channels, not offset them – but done like this they are more pronounced and visible on such non-HDR screenshots).

bokeh_shape

And examples of how it affects the image – in non-HDR it is a very subtle effect, but you may have noticed it on the other screenshots.

dofnear_aberration dofnear_noaberration

Implementation

Instead of just talking about the implementation, here you have whole source code!

This is my C# graphics framework – some not optimal code written to make it extremely easy to prototype new graphics effects and for me to learn some C# features like dynamic scripting etc.

I will write more about it, its features and reasoning behind some decisions this or next week, meanwhile go download and play for yourself! :)

The licence to use both this framework and the bokeh DoF code is 100% open source with no strings attached – but if you publish some modifications to it / use it in your game, please just mention me and where it comes from (you don't have to). I used the Frank Meinl Sponza model [6] and the SlimDX C# DirectX 11 wrapper [7].

As I said, I promise I will write a bit more about it later.

Quality-wise the effect is 100% what was in The Witcher 2, but there are some performance improvements over the Witcher 2 version.

  1. I used indexed draw. Pretty obvious.
  2. I didn't store vertex positions in an array; instead I calculate them procedurally from the vertex ID. On such a bandwidth-heavy effect, everything that avoids thrashing your GPU caches and uses ALU instead will help a bit.
  3. I use a single draw call for both the near and far layers of DoF. Using MRT would be just insane and geometry shaders are a performance bottleneck, so instead I just used… atlasing! :) An old-school technique, but it works. Sometimes you can see edge artifacts from it (one plane leaks into the atlas space of the other one) – it is possible to remove them in your pixel shader or with some border padding, but I didn't do it (yet).

I think this atlasing part might require some explanation. For bokeh accumulation I use a double-width texture and spawn the "far" bokeh sprites into one half and the near ones into the other. This way I avoid overdraw / drawing the sprites multiple times (MRT), geometry shaders (necessary for texture arrays as render targets) and multiple vertex shader passes. Win-win-win! A sketch of how such a quad grid can be drawn without any vertex buffer follows below.
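
Here is a sketch of the C#-side setup for points 1–3 – illustrative only, with assumed names; the HLSL side is summarized in the comments:

// Sketch of points 1-3: an indexed draw of one quad per half-res pixel, with no vertex
// buffer bound at all - the vertex shader derives the pixel coordinate and quad corner
// purely from SV_VertexID (for an indexed draw, the vertex ID is the index value).
// Names are illustrative, not the framework's actual API.
static class BokehGrid
{
    public static int[] BuildQuadGridIndices(int halfResWidth, int halfResHeight)
    {
        int quadCount = halfResWidth * halfResHeight;
        var indices = new int[quadCount * 6];
        for (int q = 0; q < quadCount; ++q)
        {
            int v = q * 4;      // 4 logical vertices per quad, never stored anywhere
            int i = q * 6;
            indices[i + 0] = v + 0; indices[i + 1] = v + 1; indices[i + 2] = v + 2;
            indices[i + 3] = v + 0; indices[i + 4] = v + 2; indices[i + 5] = v + 3;
        }
        return indices;
        // In the vertex shader (HLSL side, not shown):
        //   quadIndex = vertexId / 4;  corner = vertexId % 4;
        //   pixel = int2(quadIndex % halfResWidth, quadIndex / halfResWidth);
        //   fetch color + depth, compute CoC, scale the quad and offset it into the left or
        //   right half of the double-width atlas depending on near/far CoC
        //   (this atlas-half selection is my assumption of one way to do it).
    }

    // Drawing needs only the index buffer, no vertex buffer:
    // context.InputAssembler.SetIndexBuffer(indexBuffer, SlimDX.DXGI.Format.R32_UInt, 0);
    // context.DrawIndexed(quadCount * 6, 0, 0);
}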

I will write more about performance later – but you can try it for yourself and check that it is not great; I have even seen 11ms with an extremely blurry near DoF plane filling the whole screen on a GTX Titan! :)

References

1. “More Performance! Five Rendering Ideas from Battlefield 3 and Need for Speed: The Run”, John White, Colin Barré-Brisebois http://advances.realtimerendering.com/s2011/White,%20BarreBrisebois-%20Rendering%20in%20BF3%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx

2. “Depth of Field with Bokeh Rendering”, Matt Pettineo and Charles de Rousiers, OpenGL Insights and  http://openglinsights.com/renderingtechniques.html#DepthofFieldwithBokehRendering http://mynameismjp.wordpress.com/2011/02/28/bokeh/

3. The Technology Behind the DirectX 11 Unreal Engine Samaritan Demo (Presented by NVIDIA), GDC 2011, Martin Mittring and Bryan Dudash http://www.gdcvault.com/play/1014666/-SPONSORED-The-Technology-Behind

4. Secrets of CryENGINE 3 Graphics Technology, Siggraph 2011, Tiago Sousa, Nickolay Kasyan, and Nicolas Schulz http://advances.realtimerendering.com/s2011/SousaSchulzKazyan%20-%20CryEngine%203%20Rendering%20Secrets%20((Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).ppt

5. Vertex Shader Tricks – New Ways to Use the Vertex Shader to Improve Performance, GDC 2014, Bill Bilodeau. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Vertex-Shader-Tricks-Bill-Bilodeau.ppsx

6. Crytek Sponza, Frank Meinl http://www.crytek.com/cryengine/cryengine3/downloads

7. SlimDX

8. Lost Planet bokeh depth of field http://www.beyond3d.com/content/news/499 http://www.4gamer.net/news/image/2007.08/20070809235901_21big.jpg

GCN – two ways of latency hiding and wave occupancy

I wanted to do another follow-up post to my GDC presentation, you can grab its slides here.

I talked for quite a long time about the shader occupancy concept, which is extremely important and allows hiding some memory latency.

The question that arises is “when should I care”? 

It is a perfect question, because sometimes high wave occupancy can have no impact on your shader cost, sometimes it can speed up a whole pass several times, and sometimes it can be counter-productive!

Unfortunately my presentation showed only the very basics of our experiences with the GCN architecture, so I wanted to talk about it a bit more.

I've had some very good discussions and investigations about it with my friend Michal Drobot (you may recognize his work on area lights in Killzone: Shadow Fall [1] and his earlier work on Parallax Occlusion Mapping acceleration techniques [2]) and we created a set of general rules / guidelines.

Before I begin, please download the AMD Sea Islands ISA [3] (a modern GCN architecture), the AMD GCN presentation [4] and the AMD GCN whitepaper [5] and have them ready! :)

Wait instruction

One of the most important instructions I will be referring to is

S_WAITCNT

According to the ISA this is a dependency-resolve instruction – it waits for the completion of scalar or vector data loads.

Waits for scalar data (for example constants from a constant buffer that are coherent across a whole wavefront) are signalled by:

LGKM_CNT

In general we don't care as much about them – you are unlikely to be bound by them, as the latency of the constant cache (a separate, faster cache unit – page 4 in the GCN whitepaper) is much lower and you should have all such values ready.

On the other hand, there is:

VM_CNT

which is the vector memory load/write dependency counter and has much higher potential latency and/or cost – if you have an L1 or L2 cache miss, for instance…

So if we look at an example of extremely simple shader disassembly (from my presentation):

s_buffer_load_dwordx4 s[0:3], s[12:15], 0x08
s_waitcnt     lgkmcnt(0)
v_mov_b32     v2, s2
v_mov_b32     v3, s3
s_waitcnt     vmcnt(0) & lgkmcnt(15)
v_mac_f32     v2, s0, v0
v_mac_f32     v3, s1, v1

We see some batched constant loading followed by an immediate wait for it before the values are moved to vector registers, and later a wait for the vector memory load into v0 and v1 (issued by earlier shader code, which I omitted – it just loads some basic data to operate on, so that the compiler doesn't optimize everything out as scalar ops :) ) before they can actually be used by the ALU.

If you want to understand the numbers in parentheses, read the explanation in the ISA – the counter is arranged in a kind of "stack" way, while reads are processed sequentially.

I will be mostly talking about s_waitcnt on vector data.

Latency hiding

We have two ways of latency hiding:

  • By issuing multiple ALU operations on different registers before waiting for the load of a specific value into a given register. Waiting for the results of a texture fetch obviously increases the register count, as it increases the lifetime of a register.
  • By issuing multiple wavefronts on a CU – while one wave is stalled on s_waitcnt, other waves can do both vector and scalar ALU. For this one we need multiple waves active on a CU.

The first option should be quite familiar to every shader coder – previous hardware also had similar capabilities – but unfortunately it is not always possible. If we have dependent texture reads, dependent ALU or nested branches based on the result of a data fetch, the compiler will have to insert an s_waitcnt and stall the whole wave until the result is available. I will talk about such situations later.

While the second option existed before, it was totally hidden from PC shader coders (you couldn't measure its impact in any way… especially on powerful NVIDIA cards), and in my experience it wasn't as important on the X360 – its effects were not as pronounced as on GCN. It allows you to hide lots of latency on dependent reads, branches or shaders with data-dependent flow control. I will also mention later which shaders really need it to perform well.

If we think about it, those two ways are a bit contradictory – one tends to come with a register explosion (present for example when we unroll a loop that contains some texture reads and some ALU on them), while the other one requires a low shader register count to get large wave occupancy.

Practical example – latency hiding by postponing s_waitcnt

Ok, so we know about two ways of hiding latency, how are they applied in practice? By default, compilers do lots of loop unrolling.

So let's say we have a simple shader like this (old-school Poisson DoF):

for(int i = 0; i < SAMPLE_COUNT; ++i)
{
    float4 uvs;

    uvs.xy = uv.xy + cSampleBokehSamplePoints[i].xy * samplingRadiusTextureSpace;
    uvs.zw = uv.xy + cSampleBokehSamplePoints[i].zw * samplingRadiusTextureSpace;

    float2 weight = 0.0f;
    float2 depthAndCocSampleOne = CocTexture.SampleLevel(PointSampler, uvs.xy, 0.0f ).xy;
    float2 depthAndCocSampleTwo = CocTexture.SampleLevel(PointSampler, uvs.zw, 0.0f ).xy;

    weight.x = depthAndCocSampleOne.x > centerDepth ? 1.0f : depthAndCocSampleOne.y;
    weight.y = depthAndCocSampleTwo.x > centerDepth ? 1.0f : depthAndCocSampleTwo.y;

    colorAccum += ColorTexture.SampleLevel(PointSampler, uvs.xy, 0.0f ).rgb * weight.xxx;
    colorAccum += ColorTexture.SampleLevel(PointSampler, uvs.zw, 0.0f ).rgb * weight.yyy;

    weightAccum += weight.x + weight.y;
}

The code is extremely simple and pretty self-explanatory, so there is no point writing much about it – but just to make it clear, I batched two sample reads per iteration to combine two Poisson xy offsets inside a single float4, for constant loading efficiency (they are read into 4 scalar registers with a single instruction).

Just a part of the generated ISA assembly (simplified a bit) could look something like:

image_sample_lz v[9:10], v[5:8], s[4:11], s[12:15]
image_sample_lz v[17:19], v[5:8], s[32:39], s[12:15]
v_mad_legacy_f32 v7, s26, v4, v39
v_mad_legacy_f32 v8, s27, v1, v40
image_sample_lz v[13:14], v[7:10], s[4:11], s[12:15]
image_sample_lz v[22:24], v[7:10], s[32:39], s[12:15]
s_buffer_load_dwordx4 s[28:31], s[16:19]
s_buffer_load_dwordx4 s[0:3], s[16:19]
s_buffer_load_dwordx4 s[20:23], s[16:19]
s_waitcnt lgkmcnt(0)
v_mad_legacy_f32 v27, s28, v4, v39
v_mad_legacy_f32 v28, s29, v1, v40
v_mad_legacy_f32 v34, s30, v4, v39
v_mad_legacy_f32 v35, s31, v1, v40
image_sample_lz v[11:12], v[27:30], s[4:11], s[12:15]
v_mad_legacy_f32 v5, s0, v4, v39
v_mad_legacy_f32 v6, s1, v1, v40
image_sample_lz v[15:16], v[34:37], s[4:11], s[12:15]
s_buffer_load_dwordx4 s[16:19], s[16:19]
image_sample_lz v[20:21], v[5:8], s[4:11], s[12:15]
v_mad_legacy_f32 v8, s3, v1, v40
v_mad_legacy_f32 v30, s20, v4, v39
v_mad_legacy_f32 v31, s21, v1, v40
v_mad_legacy_f32 v32, s22, v4, v39
v_mad_legacy_f32 v33, s23, v1, v40
s_waitcnt lgkmcnt(0)
v_mad_legacy_f32 v52, s17, v1, v40
v_mad_legacy_f32 v7, s2, v4, v39
v_mad_legacy_f32 v51, s16, v4, v39
v_mad_legacy_f32 v0, s18, v4, v39
v_mad_legacy_f32 v1, s19, v1, v40
image_sample_lz v[39:40], v[30:33], s[4:11], s[12:15]
image_sample_lz v[41:42], v[32:35], s[4:11], s[12:15]
image_sample_lz v[48:50], v[30:33], s[32:39], s[12:15]
image_sample_lz v[37:38], v[51:54], s[4:11], s[12:15]
image_sample_lz v[46:47], v[0:3], s[4:11], s[12:15]
image_sample_lz v[25:26], v[7:10], s[4:11], s[12:15]
image_sample_lz v[43:45], v[7:10], s[32:39], s[12:15]
image_sample_lz v[27:29], v[27:30], s[32:39], s[12:15]
image_sample_lz v[34:36], v[34:37], s[32:39], s[12:15]
image_sample_lz v[4:6], v[5:8], s[32:39], s[12:15]
image_sample_lz v[30:32], v[32:35], s[32:39], s[12:15]
image_sample_lz v[51:53], v[51:54], s[32:39], s[12:15]
image_sample_lz v[0:2], v[0:3], s[32:39], s[12:15]
v_cmp_ngt_f32 vcc, v9, v3
v_cndmask_b32 v7, 1.0, v10, vcc
v_cmp_ngt_f32 vcc, v13, v3
v_cndmask_b32 v8, 1.0, v14, vcc
v_cmp_ngt_f32 vcc, v11, v3
v_cndmask_b32 v11, 1.0, v12, vcc
s_waitcnt vmcnt(14) & lgkmcnt(15)
v_cmp_ngt_f32 vcc, v15, v3
v_mul_legacy_f32 v9, v17, v7
v_mul_legacy_f32 v10, v18, v7
v_mul_legacy_f32 v13, v19, v7
v_cndmask_b32 v12, 1.0, v16, vcc
v_mac_legacy_f32 v9, v22, v8
v_mac_legacy_f32 v10, v23, v8
v_mac_legacy_f32 v13, v24, v8
s_waitcnt vmcnt(13) & lgkmcnt(15) 

I omitted the rest of the waits and ALU ops – this is only a part of the final assembly – but note how much the scalar architecture makes your shaders longer and potentially less readable!

So we see that the compiler will probably unroll the loop and decide to pre-fetch all the required data into multiple VGPRs (a huge number of them!).

Our s_waitcnt on vector data is much later than the first texture read attempt.

But if we count the actual cycles (again – look into the ISA / whitepaper / AMD presentations) of all those small ALU operations that happen before it, we can estimate that if the data was in L1 or L2 (it probably was, as the CoC of the central sample must have been fetched before the actual loop), there will probably be no actual wait.

If you just look at the register count, it is huge (remember that a CU has only 256 VGPRs per SIMD!) and the occupancy will be very low. Does it matter? Not really :)

My experiments with forcing a real loop there (it is tricky and involves forcing the loop counter into a uniform…) show that even if you get much better occupancy, the performance can be the same or actually lower (cache thrashing, still not hiding all the latency, a limited number of texturing units).

So the compiler will probably guess correctly in such a case and we get our latency hidden very well even within one wave. This is not always the case – so you should count those cycles manually (it's not that difficult or tedious) or rely on special tools that help you track such stalls (I cannot describe them for obvious reasons).

Practical example – s_waitcnt necessary and waits for data

I mentioned that sometimes it is just impossible to place the s_waitcnt much later than the actual texture fetch.

A perfect example is code like this (it isn't useful in any way, just an example):

int counter = start;
float result = 0.0f;
while(result == 0.0f)
{
    result = dataRBuffer0[counter++];
}

It is quite obvious that every next iteration of the loop, and the early-out, relies on the fetch that has just happened. :(

Shader ISA disassembly will look something like:

label_before_loop:
v_mov_b32 v1, 0
s_waitcnt vmcnt(0) & lgkmcnt(0)
v_cmp_neq_f32 vcc, v0, v1
s_cbranch_vccnz label_after_loop
v_mov_b32 v0, s0
v_mov_b32 v1, s0
v_mov_b32 v2, 0
image_load_mip v0, v[0:3], s[4:11]
s_addk_i32 s0, 0x0001
s_branch label_before_loop
label_after_loop:

So in this case, having decent wave occupancy is the only way to hide latency and keep the CU busy – and only if there is ALU-heavy code somewhere else in your shader or in a different wave on the CU.

This was the case, for instance, in the screenspace reflections and parallax occlusion mapping code I implemented for AC4, and that's why I showed this concept of "wave occupancy" in my GDC presentation and why I find it so important. In such cases you must keep your vector register count very low.

General guidelines

I think that in general (take it with a grain of salt and always check yourself) low wave occupancy with a high unroll rate is a good way of hiding latency for all those "simple" cases when you have lots of independent texture reads and a moderate to high amount of simple ALU in your shaders.

The examples are countless, but it definitely applies to various old-school, simple post-effects taking numerous samples.

Furthermore, too high an occupancy could be counter-productive there, thrashing your caches (if you are using very bandwidth-heavy resources).

On the other hand, if you have only a small number of samples, require immediate calculations based on them, or – even worse – do some branching relying on them, try to go for bigger wave occupancy.

I think this is the case for lots of modern and “next-gen” GPU algorithms:

  • ray tracing
  • ray marching
  • multiple indirection tables / textures (this can totally kill your performance!)
  • branches on BRDF types in deferred shading
  • branches on light types in forward shading
  • branches inside your code that would use different samples from different resource
  • in general – data dependent flow control

But in the end and as always – you will have to experiment yourself.

I hope that with this post I have also convinced you how important it is to look through the ISA and all the documents / presentations on the hardware, its architecture and the final low-level disassembly – even if you consider yourself a "high level, features-oriented graphics / shader coder" (I believe there is no such thing as a "high level programmer" who doesn't need to know the target hardware architecture in real-time programming, and especially in high-quality console or PC games). :)

References:

[1] http://www.guerrilla-games.com/publications.1 

[2] http://drobot.org/ 

[3] http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture1.pdf 

[4] http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

[5] http://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

GDC follow-up: Screenspace reflections filtering and up-sampling

After GDC I’ve had some great questions and discussions about techniques we’ve used to filter and upsample the screenspace reflections to avoid flickering and edge artifacts. Special thanks here go to Angelo Pesce, who convinced me that our variation of weighting the up-sampling and filtering technique is not obvious and worth describing.

Reasons for filtering

As I mentioned in my presentation, there were four reasons to blur the screenspace reflections:

  • Simulating a different BRDF specular lobe for surfaces of different roughness – if they are rougher, reflections should appear very blurry (a wide BRDF lobe).
  • Filling holes from missed rays. Screenspace reflections are a very approximate technique that relies on screenspace depth and colour information, which very rarely represents the scene's geometric complexity properly. Therefore some rays will miss objects and you will have some holes in your reflection buffer.
  • Fighting aliasing and flickering. A quite obvious one – a lowpass filter helps a bit.
  • Upsampling half-resolution information. When raytracing in half resolution, all the previous problems become even more exaggerated, especially on geometry edges. We had to do something to fight them.

Filtering radius difference on rough and smooth surfaces

Up-sampling technique

First I’m going to describe our up-sampling technique, as it is very simple.

For up-sampling we first tried the industry-standard, depth-aware bilateral up-sampling. It worked just fine for geometric and normal edges, but we faced a different problem: due to the different gloss of various areas of the same surface, the blur kernel was also different (the blur was also done in half resolution).

We observed a quite serious problem on an important part of our environments – water puddles left after the rain. We saw typical jaggy-edge and low-res artifacts on the border of the very glossy and reflective water puddle surface (surrounded by quite rough ground / dirt).

As roughness also affects reflection / indirect specular visibility / intensity, the effect was even more pronounced. Therefore I tried adding a second up-sample weight based on a comparison of surface reflectivity (a combination of gloss-based specular response and Fresnel), and it worked just perfectly!

In our case it could even be used on its own – but that may not be true for other games – and we used it to save some ALU / BW. For us it discriminated general geometric edges very well (characters / buildings had very different gloss values than the ground), but probably not every game or scene could rely on that.

Filtering technique

We spent a really long time getting the filtering of the reflections buffer right – probably more than on the actual raytracing code or its optimizations.

As a kind of pre-pass to help with it, we did a slight cross-pattern blur during the downsampling of our color buffer for the screenspace reflections.

A similar technique was suggested by Mittring for bloom [1]; in general it is very useful for fighting various aliasing problems when using half-res colour buffers, and I recommend it to anyone using a half-res color buffer for anything. :)

Downsampling filtering / blur pattern

Later we performed a weighted separable blur for performance / quality reasons – to get properly blurred screenspace reflections for very rough surfaces, the blurring radius must be huge! Using a separable blur with a varying radius is in general incorrect (special thanks to Stephen Hill for reminding me of it), as the second pass can pick up samples that were blurred with a different radius in the orthogonal direction, but it worked in our case – as surface glossiness was quite coherent on screen, we didn't have any mixed patterns that would break it.

A screen-space blur is also, in general, an improper approximation of convolving multiple rays against the BRDF kernel, but as both Crytek and Guerrilla Games also mentioned in their GDC presentations [2] [3], it looks quite convincing.

Filtering radius

The filtering radius depended on just two factors. The obvious one is surface roughness. We ignored the effect of the cone widening with distance – I knew it would be "physically wrong", but from my experiments comparing against a real reference traced with multiple rays and convolved with the BRDF, the visual difference was significant only on rough but flat surfaces (like polished floors) and very close to the reflecting surface – with normal maps, on organic and natural surfaces, or at bigger distances, it wasn't noticeable as something "wrong". Therefore, for performance / simplicity reasons, we ignored it.

At first I tried basing the blur radius on an approximation of a fixed-distance cone and surface glossiness (similar to the way mips of pre-filtered cubemaps are biased). However, artists complained about the lack of control, and as our rendering was not physically based, I just gave them blur bias and scale controls based on the gloss.

There was a second filtering factor – when there was a "hole" in our reflections buffer, we artificially increased the blurring radius, even for shiny surfaces. Therefore we applied a form of push-pull filter:

  • Push – we tried to “push” further away proper ray-tracing information by weighting it higher
  • Pull – pixels that lacked proper information looked for it in larger neighbourhood.

It was better to fill the holes and look for proper samples in the neighbourhood than have ugly flickering image.

Filtering weight

Our filtering weight depended on just two factors:

  • Alpha of sample being read – if it was a hole or properly ray traced sample.
  • Gaussian function.

The reason for the first one was again to ignore missing samples and pull proper information from the pixel neighbourhood. We didn't weight hole samples to 0.0f – AFAIR it was 0.3f. The reason was to still get some proper fadeout of reflections and to have a lower screen-space reflection weight in "problematic" areas, blending them out towards the fall-back cube-map information.

Finally, the Gaussian function isn't a 100% accurate approximation of the Blinn-Phong BRDF shape, but it smoothed out the result nicely. Furthermore, as I mentioned previously, no screen-space blur is a proper approximation of a 3D multi-ray convolution with the BRDF – but it can look right to the human brain.

A thing worth noting here is that our filter didn't use the depth difference in the weighting function – but on depth discontinuities there was already no reflection information, so we didn't see any visible artifacts from reflection leaking. The Guerrilla Games presentation by Michal Valient [3] also mentioned doing a regular full blur – without any depth or edge-aware logic.

References

[1] Mittring, “The Technology behind the Unreal Engine 4 Elemental Demo”

[2] Schulz, “Moving to the Next Generation: The Rendering Technology of Ryse”

[3] Valient, “Taking Killzone Shadow Fall Image Quality into the Next Generation”

Temporal supersampling and antialiasing

Aliasing problem

Before I address temporal supersampling, just a quick reminder on what aliasing is.

Aliasing is a problem that is very well defined in signal theory. According to the sampling theorem, our signal spectrum must contain only frequencies lower than the Nyquist frequency. If it doesn't (and when rasterizing triangles it never will, as a triangle edge is a step-like response with an infinite frequency spectrum), some frequencies will appear in the final signal (reconstructed from the samples) that were not in the original signal. Visual aliasing can have different appearances – regular patterns (so-called moiré), noise or flickering.

Classic supersampling

Classic supersampling is a technique that is extremely widely used in the CGI industry. For every target image fragment we sample multiple times at much higher frequencies (for example by tracing multiple rays per single pixel, or by shading fragments multiple times at various positions that cover the same on-screen pixel) and then downsample / filter the signal – for example by averaging. There are various approaches even to the simplest supersampling (I talked about this in one of my previous blog posts), but the main problem with it is the associated cost – N-times supersampling usually means N times the basic shading cost (at least for some pipeline stages) and sometimes additionally N times the basic memory cost. Even simple, hardware-accelerated techniques like MSAA, which evaluate only some parts of the pipeline (pixel coverage) at a higher frequency and don't provide results as good, have quite a big cost on consoles.

But even if supersampling is often an impractical technique, its temporal variation can be applied at almost zero cost.

Temporal supersampling theory

So what is temporal supersampling? Temporal supersampling techniques are based on a simple observation – from frame to frame, most of the on-screen content does not change. Even with complex animations we see that most fragments just change their position, but apart from this they usually correspond to at least some other fragments in previous and future frames.

Based on this observation, if we know the precise texel position in the previous frame (and we often do – using the motion vectors that are used for per-object motion blur, for instance), we can distribute the multiple-fragment evaluation of supersampling across multiple frames.

What is even more exciting is that this technique can be applied to any pass – to your final image, to AO, to screen-space reflections and others – to either filter the signal or increase the number of samples taken. I will first describe how it can be used to supersample the final image and achieve much better AA, and then give an example of using it to double or triple the number of samples and the quality of effects like SSAO.

Temporal antialiasing

I have no idea which game was the first to use temporal supersampling AA, but Tiago Sousa from Crytek had a great presentation at Siggraph 2011 on that topic and its usage in Crysis 2 [1]. Crytek proposed applying a sub-pixel jitter, alternating every frame, to the final MVP transformation matrix – and combining two frames in a post-effect-style pass. This way they were able to double the sampling resolution at almost no cost!

Too good to be true?

Yes, the result of such a simple implementation looks perfect on still screenshots (and you can implement it in just a couple of hours!***), but it breaks in motion. Previous-frame pixels that correspond to the current frame were in different positions. This can easily be fixed by using motion vectors, but sometimes the information you are looking for was occluded or simply not present in the previous frame. To address that, you cannot rely on depth (as the whole point of this technique is getting extra coverage and edge information from the samples missing in the current frame!), so Crytek proposed relying on a comparison of motion vector magnitudes to reject mismatching pixels.

***yeah, I really mean a maximum of one working day if you have a 3D-developer-friendly engine. Multiply your MVP matrix with a simple translation matrix that jitters between (-0.5 / w, -0.5 / h) and (0.5 / w, 0.5 / h) every other frame, plus write a separate pass that combines frame(n) and frame(n-1) together and outputs the result.
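
A minimal sketch of that jitter using SlimDX types (illustrative – the function name and structure are my assumptions, not the exact AC4 code):

// The translation is applied after projection, i.e. in clip space, and alternates every
// other frame; the resolve pass then combines frame(n) and frame(n-1).
using SlimDX;

static class TemporalAAJitter
{
    public static Matrix ApplyJitter(Matrix viewProjection, int frameIndex,
                                     float backbufferWidth, float backbufferHeight)
    {
        // +-0.5/width in NDC is a quarter pixel (a full pixel spans 2/width in NDC), so the
        // two alternating positions end up half a pixel apart - effectively 2x supersampling.
        float jitterSign = (frameIndex & 1) == 0 ? 1.0f : -1.0f;
        Matrix jitter = Matrix.Translation(jitterSign * 0.5f / backbufferWidth,
                                           jitterSign * 0.5f / backbufferHeight,
                                           0.0f);
        // Row-vector convention (v * M): the rightmost matrix is applied last, after projection.
        return viewProjection * jitter;
    }
}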

Usage in Assassin’s Creed 4 – motivation

For a long time during our game's development we relied on FXAA (aided by depth-based edge detection) as a simple AA technique. This simple technique usually works "ok" on a static image and improves its quality, but breaks in motion – as the edge estimations and blurring factors change from frame to frame. While our motion blur (a simple and efficient implementation that used actual motion vectors for every skinned and moving object) helped to smooth the look of edges on objects moving quite fast (a small motion vector dilation helped even more), it didn't do anything for calm animations and subpixel detail. And our game was full of them – just look at all the ropes tied to sails, the nicely tessellated wooden planks and the dense foliage in the jungles! :) Unfortunately motion blur did nothing to help the antialiasing of such slowly moving objects, and FXAA added some nasty noise during movement, especially on grass. We didn't really have time to try so-called "wire AA", and MSAA was out of our budget, so we decided to try temporal antialiasing techniques.

I would especially like to thank Benjamin Goldstein, our Technical Lead, with whom I had the great pleasure of trying and prototyping various temporal AA techniques very late in production.

Assassin’s Creed 4 XboxOne / Playstation 4 AA

As a first iteration, we started with the single-frame variation of morphological SMAA by Jimenez et al. [2] Even in its most basic settings it was a definitely better-quality alternative to FXAA (at a slightly higher cost, but thanks to the much bigger computing power of the next-gen consoles it stayed in almost the same budget as FXAA on current-gen consoles). There was less noise, there were fewer artifacts and the morphological edge reconstruction was much better, but obviously it wasn't able to do anything to reconstruct all that subpixel detail.

So the next step was to try to plug in the temporal AA component. A couple of hours of work and voilà – we had much better AA. Just look at the following pictures.

No AA

FXAA – good but blurry AA on characters, terrible noise on sub-pixel detail

Single sample SMAA – sharper, but lots of aliasing untouched

Temporal AA

Pretty amazing, huh? :)

Sure, but this was at first the result only for static image – and this is where your AA problems start (not end!).

Getting motion vectors right

Ok, so we had some subtle and, we thought, "precise" motion blur, so getting motion vectors that allow proper reprojection of moving objects should have been easy, right?

Well, it wasn't. We were doing it right for most of the objects and the motion blur was ok – you can't really notice a lack of motion blur, or slightly wrong motion blur, on some specific objects. However, for temporal AA you need the motion vectors to be correct and pixel-perfect for all of your objects!

Otherwise you will get huge ghosting. If you try to mask out these objects and not apply temporal AA to them at all, you will get visible jittering and shaking from the sub-pixel camera position changes.

Let me list all the problems with motion vectors we faced, with some comments on whether we solved them or not:

  • Cloth and soft-body physical objects. From our physics simulation for cloth and soft bodies that was very fast and widely used in the game (characters, sails) we got full vertex information in world space. Object matrices were set to just identity. Therefore, such objects had zero motion vector (and only motion from camera was applied to them). We needed to extract such information from the engine and physics – fortunately it was relatively easy as it was used already for bounding box calculations. We fixed ghosting from moving soft body and cloth objects, but didn’t have motion vectors from the movement itself – we didn’t want to completely change the pipeline to GPU indirections and subtracting positions from two vertex buffers. It was ok-ish as they wouldn’t move very abruptly and we didn’t see artifacts from it.
  • Some "custom" object types that had custom matrices, and cases where we interpreted the data incorrectly. The same situation as with cloth also existed for other dynamic objects. We got a custom motion vector debug rendering mode working, and fixing all those bugs was just a matter of a couple of days in total.
  • Ocean. It did not write to the G-buffer. Instead of seeing motion vectors of the ocean surface, we had proper information but for the ocean floor, or for the "sky" behind it (with a very deep ocean there was no bottom surface at all). The fix was to overwrite some of the G-buffer information like depth and motion vectors. However, we still didn't store previous-frame simulation results and didn't try to use them, so in theory you could see some ghosting on big and fast waves during a storm. It wasn't a big problem for us and no testers ever reported it.
  • Procedurally moving vegetation. We had some vertex-noise-based, artist-authored vegetation movement and, again, the difference between the two frames' vertex positions wasn't calculated to produce proper motion vectors. This is the single biggest visible artifact from the temporal AA technique in the game, and we simply didn't have the time to modify our material shader compiler / generator and couldn't apply any significant data changes in a patch (we improved the AA in our first patch). The proper solution here would be to automatically replicate all the artist-created shader code that calculates the output local vertex position if it relies on any input data that changes between frames, like "time" or the closest character entity position (this one was used to simulate collision with vegetation), pass it through interpolators (perspective correction!), subtract it and output proper motion vectors. Artifacts like over-blurred leaves are sometimes visible in the final game and I'm not very proud of it – although maybe it is the usual programmer obsession. :)
  • Objects being teleported via skinning. We had some checks for entities and meshes being teleported, but in a few isolated, custom cases objects were teleported using skinning – and it would be impractical to analyze the whole skeleton looking for temporal discontinuities. We asked the gameplay and animation programmers to mark such objects on those frames and quickly fixed all the remaining bugs.

Problems with motion vector based rejection algorithm

Ok, we spent 1–2 weeks fixing our motion vectors (and motion blur also got much better! :) ), but in the meantime we realized that the approach proposed by Crytek and used in SMAA for motion rejection is definitely far from perfect. I would divide the problems into two categories.

Edge cases

It was something we didn't really expect, but temporal AA can break if a menu pops up quickly, you pause the game, you exit to the console dashboard (but the game remains visible), the camera teleports or some post-effect kicks in immediately. You will see some weird transition frame. We had to address each case separately – by disabling the jitter and frame combination on such frames. Add another week or two to your original plan of enabling temporal AA to find, test and fix all such issues…

Wrong rejection technique

This is my actual biggest problem with naive SMAA-like way of rejecting blending by comparing movement of objects.

First of all, we had a very hard time adjusting the "magic value" for the rejection threshold, and 8-bit motion vectors didn't help. Objects were either ghosting or shaking.

Secondly, there were huge problems on, for example, the ground and shadows – the shadow itself was ghosting – well, there is no motion vector for a shadow or any other animated texture, right? :) It was the same with explosions, particles and slowly falling leaves (which we simulated as particle systems).

For both of those issues we came up with a simple workaround – we not only compared the similarity of object motion, but added a threshold on top of it: if an object moved faster than around ~2 pixels per frame in the current or previous frame, don't blend it at all! We found such a value much easier to tweak and work with. It solved the issue of shadows and visible ghosting.

We also increased motion blur to reduce any potential visible shaking.

Unfortunately, it didn't do anything for transparencies or textures animated over time – they were blended and over-blurred – but as a cool side effect we got free antialiasing of raindrops and rain ripples, and our art director preferred such a soft, "dreamy" result. :)

Recently Tiago Sousa, in his Siggraph 2013 talk, proposed addressing this issue by changing the metric to a color-based one, and we will investigate it in the near future [3].

Temporal supersampling of different effects – SSAO

I wanted to mention another use of temporal supersampling that made it into the final game on the next-gen consoles and that I really liked. I was inspired by Matt Swoboda’s presentation [4] and its mention of distributing AO sampling patterns across multiple frames. For our SSAO we had three different (spiral-based) sampling patterns that changed (rotated) every frame, and we combined them just before blurring the SSAO results. This way we effectively tripled the number of samples, needed less blur and got much, much better AO quality and performance for the cost of storing just two additional history textures. :) Unfortunately I do not have screenshots to prove it and you have to take my word for it, but I will try to update this post later.
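
A minimal sketch of the per-frame pattern rotation, assuming a crude depth-only occlusion term just for illustration (all names – gLinearDepth, gFrameIndex, gAORadiusUV, gDepthBias – are assumptions, and the real AO term would also use normals and falloff):

Texture2D    gLinearDepth;   // linear view-space depth
SamplerState gPointClamp;

cbuffer SSAOConstants
{
    float2 gAORadiusUV;  // sampling radius in UV space
    uint   gFrameIndex;  // frame counter
    float  gDepthBias;   // minimum depth difference treated as occlusion
};

static const uint kNumSamples = 8;

float2 SpiralSample(uint i, float startAngle)
{
    // Simple spiral: radius grows with the sample index, angle advances along the way.
    float t     = (i + 0.5f) / kNumSamples;
    float angle = startAngle + t * 4.0f * 3.14159265f;
    return float2(cos(angle), sin(angle)) * t;
}

float ComputeSSAO(float2 uv)
{
    float centerDepth = gLinearDepth.SampleLevel(gPointClamp, uv, 0).r;

    // One of three patterns per frame: rotate the spiral by 2*pi/3 every frame,
    // so three consecutive frames together cover 3x the samples.
    float startAngle = (gFrameIndex % 3) * (2.0f * 3.14159265f / 3.0f);

    float occlusion = 0.0f;
    for (uint i = 0; i < kNumSamples; ++i)
    {
        float2 sampleUV    = uv + SpiralSample(i, startAngle) * gAORadiusUV;
        float  sampleDepth = gLinearDepth.SampleLevel(gPointClamp, sampleUV, 0).r;
        // Crude occlusion test for illustration only.
        occlusion += (centerDepth - sampleDepth) > gDepthBias ? 1.0f : 0.0f;
    }
    return 1.0f - occlusion / kNumSamples;
}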

For the rejection technique I relied on a simple depth comparison – we do not really care about SSAO on geometric foreground object edges and depth discontinuities, as by the definition of AO there should be almost none there. The only visible problem appeared when an SSAO caster moved very fast along a static SSAO receiver – a visible trail lagged behind in time – but this was more of an artificial case I investigated than a serious in-game situation. Unlike temporal antialiasing, putting this in the game (after having proper motion vectors) and testing it took under a day and there were no real problems, so I really recommend using such techniques – for SSAO, screen-space reflections and many more. :)
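
A sketch of how the history combine with depth-based rejection could look – assuming the two history textures store AO in R and the depth at which it was rendered in G, and that the reprojected UVs come from the motion vectors (all names are illustrative):

Texture2D    gAOCurrent;    // r = AO of the current frame
Texture2D    gAOHistory1;   // r = AO, g = linear depth at render time
Texture2D    gAOHistory2;
Texture2D    gLinearDepth;
SamplerState gLinearClamp;
SamplerState gPointClamp;
float        gDepthTolerance;

float CombineTemporalAO(float2 uv, float2 reprojUV1, float2 reprojUV2)
{
    float aoCurr = gAOCurrent.SampleLevel(gLinearClamp, uv, 0).r;
    float depth  = gLinearDepth.SampleLevel(gPointClamp, uv, 0).r;

    float2 hist1 = gAOHistory1.SampleLevel(gLinearClamp, reprojUV1, 0).rg;
    float2 hist2 = gAOHistory2.SampleLevel(gLinearClamp, reprojUV2, 0).rg;

    // Depth-based rejection: accept a history sample only if it comes from
    // (roughly) the same surface as the current pixel.
    float w1 = abs(hist1.g - depth) < gDepthTolerance ? 1.0f : 0.0f;
    float w2 = abs(hist2.g - depth) < gDepthTolerance ? 1.0f : 0.0f;

    return (aoCurr + hist1.r * w1 + hist2.r * w2) / (1.0f + w1 + w2);
}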

Summary

Temporal supersampling is a great technique that will improve the final look and feel of your game a lot, but don’t expect to do it in just a couple of days. Don’t wait till the end of the project “because it is only a post-effect, it should be simple to add” – it is not! Take weeks or even months to put it in, have testers report all the problematic cases, and then properly and iteratively fix all the issues. Have proper and optimal motion vectors, think about how to write them for artist-authored materials, and how to batch your objects in passes to avoid an extra MRT when you don’t need to write them (static objects and camera-only motion vectors). Look at the quality difference between 16-bit and 8-bit motion vectors (or maybe an R11G11B10 format with some other G-Buffer property in the B channel?), test all the cases and simply take your time to do it all properly and early in production, while for example tweaking the skeleton calculation a bit or caching vertex skinning information (having a “vertex history”) is still an acceptable option. :)
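
A minimal sketch of the “camera-only motion vector” idea mentioned above: for static geometry you don’t need to write per-object motion vectors into an extra MRT, since they can be reconstructed from the depth buffer and the previous frame’s camera matrix. All names (gDepth, gInvViewProj, gPrevViewProj) are illustrative assumptions:

Texture2D    gDepth;
SamplerState gPointClamp;

cbuffer CameraConstants
{
    float4x4 gInvViewProj;   // current frame: clip -> world
    float4x4 gPrevViewProj;  // previous frame: world -> clip
};

float2 CameraOnlyMotionVector(float2 uv)
{
    // Reconstruct the world-space position of this pixel from the depth buffer.
    float  depth = gDepth.SampleLevel(gPointClamp, uv, 0).r;
    float2 ndc   = uv * float2(2.0f, -2.0f) + float2(-1.0f, 1.0f);
    float4 world = mul(float4(ndc, depth, 1.0f), gInvViewProj);
    world /= world.w;

    // Reproject with the previous frame's camera and take the screen-space difference.
    float4 prevClip = mul(world, gPrevViewProj);
    float2 prevNdc  = prevClip.xy / prevClip.w;
    return (ndc - prevNdc) * float2(0.5f, -0.5f);  // NDC delta -> UV delta
}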

References

[1] http://iryoku.com/aacourse/ 

[2] http://www.iryoku.com/smaa/

[3] http://advances.realtimerendering.com/s2013/index.html

[4] http://directtovideo.wordpress.com/2012/03/15/get-my-slides-from-gdc2012/

My upcoming GDC 2014 presentation

Ok, so GDC 2014 is coming up next week, are you excited? Because I am. :) 

Thanks to the GDC Committee I will be giving a talk this year – http://schedule.gdconf.com/session-id/826051 – named “Assassin’s Creed IV – Road to Next-Gen Graphics”. As I’m more or less finished with the contents of my presentation after many iterations on it, I wanted to give you a small sneak peek of its contents so you can decide whether it’s worth coming to see it.

As the presentation title suggests, it will be mostly about the various next-gen techniques we developed to “next-genify” our game. Don’t expect any boring and common “we upped the texture and screen resolution and increased geometric LOD” stuff – I will talk only about novel, newly developed techniques and the next-gen console experience from a developer’s point of view. :)

Global Illumination

This section will be a bit different from the other ones, as I will briefly describe a partially baked GI solution we used on both next-gen and current-gen consoles.

GI

In a bit over a month, a small strike team consisting of Mickael Gilabert, John Huelin, Benjamin Rouveyrol and me created, iterated on and deployed a solution that uses around 600 kB of VRAM and almost zero main RAM, adds under 1 ms of GPU overhead on PS3 and is compatible with dynamic time of day and various weather presets. I think it was a huge and important improvement over the rendering of previous AC games.

GI

Light probes

Volumetric Fog

I will do a small introduction to various atmospheric-scattering related effects and how we tried to unify them in a single, coherent system (so no more separate volume shadows, fog, light shafts, god rays and post-effect based hacks!). The developed Volumetric Fog algorithm uses small-resolution volumetric textures to estimate participating media density (with procedural animation) and in-scattered lighting (from many, arbitrary light sources!), and then in a second step creates a lookup texture for the final in- and out-scattering to be applied during shading. It can be applied in either a deferred or a forward manner using one tex3D operation and one MAD instruction, as it is totally decoupled from scene geometry and the z-buffer. The final performance is a fixed cost of around 1.1 ms on both Sony PlayStation 4 and Microsoft Xbox One, including “common” engine operations like shadow map downsampling / ESM generation from depth-based shadow maps and applying the effect in a separate fullscreen pass.
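
To illustrate the “one tex3D and one MAD” application step, here is a hedged sketch (the volume layout, slice distribution and names like gFogLUT or gFogRange are assumptions, not the shipped code):

Texture3D    gFogLUT;       // rgb = in-scattered light, a = transmittance
SamplerState gLinearClamp;

cbuffer FogConstants
{
    float gFogRange;  // distance covered by the volume's Z slices
};

float3 ApplyVolumetricFog(float3 shadedColor, float2 screenUV, float linearDepth)
{
    // Map the pixel's depth to a slice of the low-resolution fog volume
    // (the real slice distribution is a detail of the technique and an assumption here).
    float slice = saturate(linearDepth / gFogRange);

    // One tex3D fetch...
    float4 fog = gFogLUT.SampleLevel(gLinearClamp, float3(screenUV, slice), 0);

    // ...and one MAD: out-scattering (transmittance) and in-scattering applied together.
    return shadedColor * fog.a + fog.rgb;
}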

Volumetric fog – local lights

Please note that the effect shown in this screenshot uses custom, art-driven (not physically based!) phase functions and exaggerated (but in-engine) settings.

Volumetric fog – light shafts

Screen-space Reflections

I will talk briefly about the reasons why it is sometimes beneficial to use screen-space reflections (see my previous blog post), describe how the algorithm works in general and then go into the details of our implementation and performance optimizations for next-gen consoles. I will show the achieved results and talk about how we got the cost of this effect down to 1–2 ms (depending on the scene).

AC4 – Screenspace Reflections On

AC4 – Screenspace Reflections Off
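
For context, a minimal, illustrative screen-space ray march – not the optimized implementation from the talk. It assumes a left-handed view space with +Z into the screen, a linear view-space depth texture and illustrative names (gLinearDepth, gSceneColor, gProj, gStepSize):

Texture2D    gLinearDepth;
Texture2D    gSceneColor;
SamplerState gPointClamp;
SamplerState gLinearClamp;

cbuffer SSRConstants
{
    float4x4 gProj;      // view -> clip
    float    gStepSize;  // march step in view-space units
};

float3 ScreenSpaceReflection(float3 viewPos, float3 viewNormal, float3 fallbackColor)
{
    float3 rayDir = reflect(normalize(viewPos), viewNormal);
    float3 pos    = viewPos;

    [loop]
    for (int i = 0; i < 32; ++i)
    {
        pos += rayDir * gStepSize;

        // Project the current march position to screen space.
        float4 clip = mul(float4(pos, 1.0f), gProj);
        float2 uv   = clip.xy / clip.w * float2(0.5f, -0.5f) + 0.5f;
        if (any(uv < 0.0f) || any(uv > 1.0f))
            break;

        // If the ray went behind the depth buffer, treat it as a hit.
        float sceneDepth = gLinearDepth.SampleLevel(gPointClamp, uv, 0).r;
        if (pos.z > sceneDepth)
            return gSceneColor.SampleLevel(gLinearClamp, uv, 0).rgb;
    }

    // No hit found on screen – fall back to e.g. an environment probe.
    return fallbackColor;
}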

Next-gen console GPU architecture, its impact on performance and optimizations

Definitely the most technical part of my presentation, but potentially the most useful for graphics programmers who have yet to ship a next-gen title. I will describe what we have learned about the GCN architecture of the PS4 and Xbox One GPUs and how we applied this knowledge in practice.

You can expect me to describe the GCN compute unit architecture and explain basic terms related to it, like:

  • vector/scalar registers and the difference between them
  • register pressure
  • wave occupancy
  • SIMD lanes
  • latency hiding
  • “superscalar-like” architecture

…and how all of them affect the performance of your shaders and GPU code. It won’t be a theory-only section – I will show some code snippets and talk about actual numbers.

Bonus content and summary

I planned lots of bonus content that I probably won’t be able to cover in the talk itself. However, I will post the presentation on my blog after the conference with all the slides, and I will be available to answer your questions during and after the conference. The bonus content includes our efficient Parallax Occlusion Mapping implementation and code, the SSAO algorithm we used and the reasoning behind it, possible next-gen-only extensions to our GI technique, and fully GPU-simulated procedural rain.

I hope to see you there! :)