Insider guide to tech interviews

I’ve been meaning to write this post for over a year… an unofficial insider guide to tech interviews!

This is the essence of the advice I give my friends (minus personal circumstances and preferences) all the time, and I figured more people could find it useful.

The advice covers both the interview itself – preparing for it and maximizing the opportunity to present your skills and strengths adequately – and the other aspects of the job finding and interviewing process: talking with the recruiter, negotiating salary, making sure you find a great team.

A few disclaimers here:

This post is my opinion only. It’s based on my experiences as an interviewer (I gave ~110 interviews at Google, and probably over 100 before that – yep, that’s a lot; I had weeks with 2-3 interviews!), as someone who always reads other interviewers’ feedback, and as an interviewee (I had offers from Google, Facebook, Apple – in addition to my gamedev past with many more). But it doesn’t represent any official policies or guidelines of any of my past or current employers. Any take on those practices is my opinion only. And I could be wrong about some things – being experienced doesn’t guarantee being great at interviewing.

All companies have different processes. Among “big tech”, Google, Facebook, and Amazon do “coding” (problem solving / algorithm) interviews based on standard, cross-company procedures, while Microsoft, Apple, and nVidia are more custom – each team does its own interviewing. A lot of my advice about problem solving interviews applies more to the former, but most of the other general advice is applicable at most tech companies.

Things change over time and evolve. Some of this info might be outdated – I will highlight parts that have already changed. Similarly, every interviewer is different, so you might encounter someone giving you exactly the opposite type of interview – and this randomness is always expected.

This is not a guide on how to prepare for the coding interviews. There are books out there (“Cracking the Coding Interview” is a bit outdated and has too many “puzzles”, but it was enough for me to learn) and competitive programming websites (I used Leetcode, but that was 5y ago) that focus on it, and I don’t plan to “compete” in a blog post with the hundreds of pages and thousands of hours others put into comprehensive guidelines. This is just a set of practical advice and common pitfalls that I see candidates fall into – plus a focus on things like chats with the recruiter and negotiating the salary.

The guide is targeted mostly at candidates who already have a few years of industry experience. I am sorry about this – and I know it’s stressful and challenging getting a job when starting out in a field. But this guide is based on my experience; I’ve been interviewing mostly senior candidates (not by choice, but because of a shortage of interviewers), and I started out over a decade ago and in Poland, so my own early experience would not be very relevant either. Still, hopefully even people looking to land their first job or an internship can learn something.

Most important advice

The most important thing that I’d want to convey and emphasize – if you are not in a financial position that desperately needs money, then expect respect from the recruiter and interviewers and don’t be desperate when looking for a job and interviewing.

Even if you think some position and company is a dream job, be willing and able to walk away at any stage of the process – and similarly, if you don’t succeed, don’t stress over it.

If you are desperate, you are more likely to fall victim to manipulations (through denial and biases) and make bad decisions. You would be more likely to ignore some strong red flags, get underleveled, get a bad offer, or even make some bad life choices.

I don’t mean treating hiring as some kind of sick poker game, hiding your excitement, or whatever – the opposite: if you are excited, that’s fantastic; genuine positive emotions resonate well with everyone!

But treat yourself with respect and be willing to say no and draw clear boundaries.

Kind of… like in any healthy human relationship?

For example – if you don’t want to relocate, state it very clearly and watch out for things like the sunk cost fallacy. If you don’t want to do any take-home tasks (I personally would not take any that’s longer than 2h of work), then don’t.

At the same time, I hope it goes without saying – expecting respect doesn’t mean entitlement to a specific interview outcome, or even worse being unpleasant or rude even if you think you are not treated well. Remember that for example your interviewers are not responsible for the recruiter’s behavior and vice versa.

Starting a discussion with the recruiter

Unless you are applying directly to a very small company / a start-up, your interview process is going to be overseen and coordinated by a recruiter. This applies both when you apply (whether to a generic position or a specific team) and when you are approached by someone. Such recruiters are not technical people and will not conduct your interviews, but clear communication with them is essential.

I generally discourage friends from using external recruiting agencies. I won’t go deeper into it here (some very bad personal experience – a recruiter threatening to call my then-current employer because I wanted to withdraw from the process with them) – and yes, they can be helpful in negotiating salaries and knowing the market, there are for sure some helpful external recruiters, and you’ll most likely hear some positive stories from colleagues. But remember that their incentive is not helping you or even the company, but earning the commission – and to do it, some are dishonest and will try to manipulate both sides!

Understand the recruiter’s role

At large tech companies your “case” (application) will be reviewed by a lot of people who don’t know each other – interviewers, hiring committees, hiring managers, compensation committees, up to VP or CEO approval. Many of them will not understand your past experience; most likely many will not even look at your CV. But they all need access to information like this – plus feedback from interviews, talks with internal teams and managers, your compensation expectations, and your interests. There are internal systems to organize this information and make it available as automatically as possible, but the person who collects it, organizes it, and relays it is the recruiter. Think of them as an attorney who builds a case.

…But they are not your attorney, and not your “friend”.

The job of a recruiter is to find a candidate who satisfies internal hiring requirements, walk them through the interviewing process, and make them sign an offer. I have definitely heard of recruiters lying to candidates and manipulating them to make it happen (most often by lying about under-leveling or putting some artificial time pressure on candidates – more on this later). Sometimes they are tasked with filling slots in a very specific team or geographic location and will try to pressure a candidate to pick such a team, despite other, better options being available. Recruiters often have specific hire targets/quotas as part of their performance evaluations and salary/bonus structure – and some might try to convince or even manipulate you to accept a bad deal to hit those targets.

But they are not your “enemy” either! After all, they want to see you hired, and you want to see yourself hired as well?

It’s exactly like signing any business deal – you share a common goal (signing this deal and you getting employed, and them finding the right employee), and some other goals are contradictory, but through negotiations you can hopefully find a compromise that both sides will feel happy about.

What this means in practice: be as clear as possible in your communication with the recruiter and actively help them build your case. Express your expectations and any caveats you might have very early.

At the same time, watch out for any type of “bullshit”, manipulation, or pressure practices. Watch out for specific agendas like trying to push you towards a particular team, or towards relocating if you don’t want it. If you have contacts inside the company, you can try to verify any information from the recruiter that seems questionable.

Do you need a recommendation / referral?

I often get asked by friends or acquaintances to refer them, as they believe it will make it easier to get hired. Generally, this is not the case. I still happily do it – to help them; and there is a few-thousand-dollar referral bonus if a referral is hired.

Worth noting that hiring in tech is extremely costly. Flying candidates in, the number of interviews, committee time – it adds up. There is also a huge cost of having teams that need more employees and cannot find them. Interviewing obviously also costs candidates (their own time, which is valuable, plus any potentially missed opportunities). But how costly is it in practice, in dollars? External hiring agencies typically take 3 monthly candidate salaries as commission – so this goes into multiple tens of thousands of dollars. And this is just for sourcing, on top of internal hiring costs! So a few thousand for an internal referral – especially since such a referral can be much higher quality than external sourcing – is not a lot.

But in practice, referrals of someone I didn’t work with seem to not matter and are immediately rejected by recruiters. The referral form has an explicit question asking whether I worked with the person, for how long, etc. – and it seems that referrals of people one didn’t work with are not enough to get past even the recruiter. Which seems expected.

On the other hand, if you can find a hiring manager inside the company who will get you referred and wants to get you hired on their team, this can increase your chances a lot. It almost guarantees going past the recruiter and directly to phone screens. Such a hiring manager can help you get targeted for a higher level and add some weight to the final hiring committee decision – but it’s not a guarantee. Some hiring managers try to “push” for a hand-picked panel of interviewers and convince them to ask easier questions – this is obviously ethically questionable, only sometimes works, and sometimes backfires.

Just don’t ask your friends to find such a hiring manager for you, it’s not really possible (especially in a huge company), and puts them in an uncomfortable position…

but it’s totally ok for you to reach out directly to people managing teams and ask to apply to work with them, instead of going through recruiters or official hiring forms. It’s also ok for managers to ignore you – don’t be upset if that happens.

Do not get underleveled

This might be the most important advice before the interviewing starts – understand the company’s level structure, the level you are aiming for, and what the recruiter envisions for you (typically based on a shallow read of your CV and simple metrics like career tenure and education degree obtained).

All companies have job hierarchy ladders; in small companies it’s just something like “junior – regular – senior”, in large companies it will be a number + title. See something like levels.fyi for levels, salaries, and equivalences between different companies.

Why do you need to do this before interviews start? Because interviewers are interviewing you for a target level and position!

A very common story is someone starting a hiring process, talking with a recruiter, doing phone screens and on-site interviews, and around the time of getting an offer saying “I want to be hired as a senior programmer” – and the recruiter replying with sadness in their voice, “but senior is L5, and you were interviewed for L4. I am sorry. You would need to go through the whole interviewing process again. And we don’t have time. But don’t worry. If you are really, truly an L5 candidate, you will quickly get promoted to L5!”.

This is misleading at best and a lie at worst, and a trap that numerous people fall into. Most people I know fell into it. I was also one who definitely got underleveled (both based on feedback from peers, plus it got “fixed” with a promotion a year later). I have seen many brilliant, high-performing people with many years of career experience at ridiculously low levels only because they didn’t negotiate it.

“Quickly get promoted” is such an understatement of how messy promotion can be (a topic beyond this post). It will require demonstrable, “objective” achievements including product launches (it’s not enough that you’re a high performer), getting recommendations and support from numerous high-level peers whom you first need to get to know and demonstrate your abilities to, and your case passing through committees – and it will take at least a year from when you got hired.

But what’s even worse, after your promotion it’s almost impossible to get another one for two more years – so by agreeing to a lower level, you are most likely removing those two years from your career at that company.

Is level important in practice? Yes and no. Your compensation and bonuses depend on it. In the case of equity, it can be significantly better at a higher level – and your initial grant is typically given to you for the whole first four (!) years. Some internal position availability might depend on it (internal mobility job postings with strict L5/L6/L7 requirements). On the other hand, even a low level doesn’t prevent you from leading projects or doing high-impact, cool work.

But my general advice would be – don’t agree to a lower level; even if you don’t care too much about compensation and stuff like that (for me it’s secondary; we are extremely privileged in tech that even “average” salaries are still fantastic), it’s possible you would later be bitter about it and less happy because of something so silly and unrelated to your actual, daily work.

So explain your expectations right away, before the interviews start – if you want to be a “senior”, don’t agree to sign a contract for another position with a “promise of future re-evaluation”; it doesn’t mean anything.

Yes it might mean ending the hiring process earlier, but remember the advice about being non-desperate and respecting yourself? It’s ok to walk away.

Some digression and side note – I think under-leveling is one of the many reasons for the gender salary disparity. All big corps claim they pay all genders equally for the same work, and I don’t have a reason to question their claims, but most likely this means per “level” – and it doesn’t take into account that one gender might get under-leveled compared to another. Whether it comes from a recruiter’s bias, or from cultural gender norms that make some people less likely to fight for proper, just treatment – under-leveling is in my opinion one of the main reasons for the gender (and probably other categories’) pay gap. Obviously it is not the fault of its victims, but of a system that tries to under-level “everyone”, yet affects some disproportionately. Discrimination is often setting the same rules for everyone, but applying them differently.

How long does the whole process take?

Interview processes can drag on for months. This sucks a lot – and I experienced it both at some giants and at some small companies (at my first job, it took me 1.5 years from interviewing to getting an offer… though you can imagine it was due to some special circumstances). Or it can be super fast – I have also seen some giants moving very fast, or smaller companies giving me an offer the same day I interviewed, while I was still in the building.

But take this into account – especially around holidays, vacation season etc., the whole thing can take multiple months – and plan accordingly. Typically after each stage (talking with the recruiter, phone screen, on-site interviews, getting an offer) there are 1-15 days of delay / wait time… At Google it took me roughly ~5 months from my first chat with a recruiter and agreeing to interviews to signing my offer. And I have heard worse timing stories.

This is especially important when interviewing at multiple places – worth letting recruiters know about it to avoid getting completely out of sync. Similarly important if you are under financial pressure.

Note that if you need to obtain a visa, everything might take even a year or a few years longer. I won’t be discussing it in this guide – but immigration complicates a lot and can be super stressful (I know it from first-hand experience…).

Why are the whiteboard “coding” interviews the way they are?

Before getting to some interviewing advice, it’s worth thinking about why the tech coding interviews (often “whiteboard” pre-pandemic) that get so much hatred online are the way they are. Are companies acting against their own interest for some irrational reason? Do they “torture” candidates for some weird sense of fun?

(Note: I am not defending all of the practices. But I think they are reasonable for the companies once you consider the context and understand they are for them, and not for candidates.)

Historically, companies like Facebook or Google were looking for “brilliant hackers”. People who might lack specialty, but are super fast to adapt to any new problem, quickly write a new algorithm, or learn a new programming language if needed. People who are not assigned to a specific specialty / domain / even team, but are encouraged to move around the company every two years, adding fresh new insights, learning new things, and spreading knowledge around. So one year a person might be working on a key-value store, another year on its infrastructure, another year on a message parsing library, and then another year developing a new frontend framework, or maybe a compiler. And it makes sense to me – one year XML was “hot”, another one JSON, and another one protocol buffers – knowing a given technology or not is irrelevant.

A lot has changed since then – those companies no longer have hundreds of engineers, but hundreds of thousands – yet I think the expectation of “smart generalists” mostly stayed until recently and is changing only now.

In such a view, you don’t care about coding experience, knowing any particular domain well, or programming languages and frameworks; instead you care about the ability to learn those, quickly jump onto new abstract problems that have not been solved before, and hack a solution before moving on to new ones.

Then testing for “problem solving” (where the problems are algorithmic coding problems) makes the most sense. Similarly, asking someone to spend 2 weeks catching up on algorithms (most of us have not touched them since college…) seems like a proxy for quick learning ability.

Those companies also did decades of research into hiring and what correlates with future employee performance, and found that yes, the type of interview that they believe represents “general cognitive ability” (I think it’s an unfortunate term) for software engineers correlates well with future performance.

Another finding was that it doesn’t really matter if the process is perfect; the more interviewers there are and the more they are in consensus, the better the outcomes. This shouldn’t be surprising from a purely mathematical / statistical point of view. So you’d have 4-5 problem solving interviews with algorithms and coding, and even if some are a bit bs, if the interviewers agree that you should be hired, you would be.

The world has changed and the process is changing as well – luckily. 5-6y ago, you would get barely any domain knowledge interviews – now you get some, especially for niche domains. At Google, there were no behavioral interviews testing emotional intelligence (who cares about hiring an asshole, if they are brilliant, right? This can end badly…) – now, luckily, they are mandatory.

But back then, ~6y ago, I heard stories of someone targeting a graphics-related team / position, having purely graphics experience, and failing the interview because of a “design the server infrastructure of [large popular service]” systems design question (I am sure the interviewer believed that every smart person should be able to do it on the spot; somehow I am also sure they themselves worked on it…).

I think this unfortunate reality is mostly gone, but if you are unlucky – it’s not your fault.

Coding interviews are not algorithms knowledge / trivia interviews

One of the most common misconceptions I see on the internet is outraged people angry that “I have been coding for 10 years and now they tell me to learn all these useless algorithms! Who cares how to code a B-tree, this is simple to find on the internet, and btw I am a frontend developer!”.

Let me emphasize – generally nobody expects you to memorize how to code a B-tree, or know Kruskal’s algorithm. Nobody will ask you about implementing a specific variant of sorting without explaining it to you first.

Those interviews are not algorithm knowledge interviews.

They are not trivia / puzzles either. If you spend weeks learning some obscure interval trees, you are wasting your time.

You are expected to use some of the simplest data structures to solve a simple coding problem.

I will write my recommendations on all that I think is needed in a further section, but the crux of the interview is – having some defined input/output and possibly an API, can you figure out how to use simple array loops, maybe some sorting, maybe some graph traversal to solve the problem and analyze its complexity? If you can discuss trade-offs of a given solution, propose a few other ones, discuss their advantages, and code it up concisely – this is exactly what is expected! Not citing some obscure kd-Tree balancing strategies.

And yes, every programmer should care about the big-O complexity of their solutions, otherwise you end up with accidentally quadratic scaling in production code.

So while you don’t need to memorize or practice any exotic algorithms, you should be able to competently operate on very simple ones. 

This might get some hate, but I think every programmer should be able to figure out how to traverse a binary tree in a certain order after a minute or two of thinking (and yes, I know that it’s different in a stressful interview environment; but this applies to any kind of “thinking”! Interviews are stressful, but it doesn’t mean one should not have to “think” during one) – this is not something anyone should memorize.

Similarly, any competent programmer should be able to “find elements of the array B that are also present in an array A”.

And we’re really talking mostly about this level of complexity of problems, just wrapped up in some longer “story” that obscures it.
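For concreteness, here is roughly what a full solution to the “elements of B present in A” problem above could look like – a minimal C++ sketch (the function name and interface are made up; in an interview you would agree on them with the interviewer):

  #include <unordered_set>
  #include <vector>

  // Returns the elements of b that are also present in a.
  // Building the set is O(|A|), each lookup is O(1) amortized,
  // so the whole thing is O(|A| + |B|) instead of the naive O(|A| * |B|).
  std::vector<int> elementsOfBInA(const std::vector<int>& a, const std::vector<int>& b) {
      std::unordered_set<int> inA(a.begin(), a.end());
      std::vector<int> result;
      for (int x : b) {
          if (inA.count(x) > 0) {
              result.push_back(x);
          }
      }
      return result;
  }

Half of the value is in the discussion around it: should the output contain duplicates, what if the arrays are already sorted (a two-pointer walk would avoid the extra memory), and what are the time/memory trade-offs.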

Randomness of the interview process

Worth mentioning that those large companies have so many candidates that they consider it acceptable to have multiple false negatives, or even a false negative ratio an order of magnitude higher than the false positive one. This was especially true in their earlier growth stages, when they had just become famous but were adding only tens of people a year, not tens of thousands. In other words – if they had one hiring slot and 10 truly brilliant candidates in a pool of 1000 applicants, it used to be “ok” to reject 5 of them prematurely – they were still left with 5 people to choose from.

When interviewing for a job, you will get a panel of different interviewers. In small companies, those people know each other. In huge companies, most likely they don’t and could be random engineers from across the company.

Being interviewed by random engineers is supposed to be “unbiased”, but in my opinion it sucks for the candidate – in smaller companies, or when being interviewed by the team directly, the interview goes both ways; they are interviewing you and you are interviewing them, deciding if you would like to work with those people, chatting with them, sensing personalities, and guessing what they might work on based on the questions. If a company has over 100k people and you are assigned random interviewers, this signal is random and useless for judging – a bad interviewer is someone you are likely to never even meet…

For some of those interviewers, it might even be the first interview they are conducting (sadly, interviewer training is very short and inadequate due to a general interviewer shortage). Some interviewers might even ask you questions that they are not supposed to ask (like “math puzzles” or knowledge trivia that you either know or not – those are explicitly banned at most big tech corps, and still some interviewers ask them by mistake). So if you get some unreasonable questions – it’s just the “randomness” of the process and I feel sorry for you. Luckily, it’s possible that it will not matter for the outcome, so don’t worry – if you ace the other interviews, the hiring committee that reviews all interviewer feedback will ignore one that is an outlier from a bad interviewer, or one where they see a bs question was asked.

But if you don’t get an offer and your application gets rejected – it can also be expected and not your fault; it doesn’t say anything about your skills or value.

The interview process is not designed for judging your “true” value, it’s designed for the company goals – efficiently and heavily filtering out too many candidates for too few open positions. Other world class people were rejected before you – it happens.

So don’t worry about it, don’t take it personally (it’s business!), look at other job opportunities or simply reapply at some later stage.

Shared advice

Regardless of the interview type (problem solving/coding, knowledge, design, behavioral) there is some advice that applies across the board.

Time management

It’s up to both you and your interviewer to do the time management of the interview. Generally, interviewers will try to manage time, but sometimes they can’t do much if you actively (even unknowingly!) obstruct this goal.

What is the goal of an interview? For you, it’s to show as many strengths as possible to support a positive hiring decision. For the interviewer, it’s to poke in many directions to find both your strengths and the limits of your skills/knowledge/experience that would determine your fit for the company/team.

Nobody knows everything to the highest level of detail. Everyone’s knowledge has holes – and it’s fine and expected, especially with the more and more complicated and specialized tech industry. 

The interviews are short. In a 45min interview, you realistically have 35min of interview time – and maybe 50min in a 1h interview. Any wasted 5 minutes is time in which you are not giving any evidence to support a hire.

Therefore, be concise and to the point!

Don’t go off on tangents. Avoid “fluff” talk, especially if you don’t know something. This is a natural human reaction – trying to say “something” to not show weakness. But this “fluff” will not convince an expert of any strength (it might even be considered a negative).

Don’t be afraid to ask clarifying questions (to not waste time going in the wrong direction), or to say that you have not seen much of a given concept but can try figuring out the answer – typically the interviewer will appreciate your willingness to challenge yourself, but will change the topic to something more productive.

Humility and being able to say “I don’t know” also make for great coworkers.

Someone trying to bullshit and pretend they know something they don’t is a huge red flag and a reason for rejection on its own.

Communication

Make your answers interactive and conversational.

This is expected from you both at work as well as during the interview process. It is explicitly evaluated by your interviewers.

Explain your decisions and rationale.

Justify the trade-offs (even if the justification is simple and pragmatic “here I will use a variable name ‘count’ to make the code more concise and fit on the whiteboard, in practice would name it more verbosely.”).

Explain your assumptions – and ask if they are correct.

Ask many clarifying questions for things that are vague or unclear – just like at work you would gather the requirements.

If you didn’t understand the question, express it (helping the interviewer by asking “did you mean X or Y?”).

As mentioned above, don’t add fluff and flowery words and sentences.

Generally, be friendly and treat the interviewer like a coworker – and they might become one.

Being respectful

I mentioned that you should expect to be treated respectfully. Any signs of belittling, proving you are less worthy, ridiculing your experience or takes are huge red flags.

It’s ok to even express that and speak up.

But you also absolutely have to be polite and respectful – even if you don’t like something.

Remember that your interviewers are people; and often operate within company interviewing rules/guidelines.

Don’t like the interview question, think it’s contrived, doesn’t test any relevant skills?

It happened to me many times. But keep this opinion to yourself. It’s possible that you misunderstood the question, its context, or why the interviewer asked it (it might be an intro and segue to another one).

Similarly, nobody cares about your “hot takes” or “well actually” disagreements with the interviewer on opinions (not facts) – keep it to yourself if you think that linked lists are not useful in practice (they are), that C++ is a garbage language (no, it’s not), that it’s not necessary to write tests (hmm…), or that complexity analysis is useless (what?).

What purpose would it serve, to show “how smart you are”?

Instead, it would show that you are a pain in the ass to work with – if even during an interview you cannot resist the urge to argue and bikeshed, how bad must it be in your daily work?

Problem solving (“coding”) interviews

Let me focus first on the most dreaded type of interview, the one that gets most of the bad reputation. I have already explained that these are not algorithm knowledge interviews, and here is some more practical advice.

Picking interview programming language

Most big companies nowadays let you use any common language during the interviews – and it makes sense; remember, they don’t look for specialty / expertise and believe that a “smart” candidate can learn a new language quickly; plus at work we use numerous languages (I wrote code in 6 different ones at work in 2021).

So you can pick almost any, but I think some choices are better than the others. I can give you some conflicting advice, use your judgment to balance it out:

  • Something concise like Python gives you a huge advantage. If you can code a solution in 10 lines instead of 70, you are going to be evaluated better on interview efficiency and get a chance to move further ahead in the process. You are likely to make fewer bugs. And it’s going to be much easier to read it multiple times and make sure it works correctly. If you know Python relatively well, pick it.
  • You do not want to pick a language you don’t feel confident in, or are likely to use incorrectly. So if you write mostly C/C++ code or Java, I would not recommend picking Python only because it’s more concise. It can backfire spectacularly. 

Does coding style matter?

A separate point worth mentioning – generally interviewers should not pay too much attention to things that can be googled easily, like library function names etc. It’s ok to tell the interviewer that you don’t remember and propose to use the name “x” – if you are consistent, it should be fine.

But I have seen some interviewers being picky about coding style, idiomatic constructs etc. and giving lower recommendations. I personally don’t approve of it and I never pay attention to it, but because of that I cannot give advice to ignore the coding style – it might be a good idea to ask your interviewer what their expectations are.

And it’s yet another reason to not pick a language you don’t feel very confident in.

Must-learn data structures / algorithms

While you are not expected to spend too much time on learning algorithms, below are some basics that I’d consider CS101 and something you might “have to” know to solve some of the interview problems.

Note that by “know” I don’t mean having read the wikipedia page, but knowing how to use them in practice and having actually done it. It’s like riding a bike – you don’t learn it from reading or watching youtube tutorials. So if you have not used something for a while, I highly recommend practicing those simple algorithms in your language of choice with something like Advent of Code or Leetcode.

Hash maps / dictionaries / sets

Most problem solving coding interview problems can be solved with loops and just a simple unordered hash map. I am serious.

And rightfully so; this is one of the most common data structures, and whole programming languages, databases (no-sql-like), and/or infrastructures are based on them! Know roughly how the ordered / unordered variants work and their complexity (amortized). Extra points for understanding things like hash collisions and hash functions, but no reasonable interviewer should ask you to write a good one.
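As a made-up toy illustration of the pattern (not an actual interview question from any particular company) – counting occurrences with an unordered map is behind a surprisingly large fraction of these problems:

  #include <string>
  #include <unordered_map>

  // Returns the first character that appears exactly once in the string,
  // or '\0' if there is none. Two linear passes: O(N) time, O(K) memory
  // for K distinct characters.
  char firstUniqueChar(const std::string& s) {
      std::unordered_map<char, int> counts;
      for (char c : s) {
          ++counts[c];
      }
      for (char c : s) {
          if (counts[c] == 1) {
              return c;
          }
      }
      return '\0';
  }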

Sorting

Don’t learn different sorting algorithms, this is a waste of time.

Know +/- how to implement one or two and their complexity (you want O(N log N) expected), but also understand trade-offs like in-place sorting vs one that allocates memory.

The only sorting-like algorithm that I think is worth singling out as something different is “counting sort” (though you might not even think of it as a sorting algorithm!) – a simpler relative of radix sort that is only O(N), but works for specialized problems only; some of the problems I have seen can take advantage of counting sort plus potentially a hash map.

Given a sorted array, be able to immediately code up binary search – it comes up relatively often in phone screens. Watch out for off-by-one errors!
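Two quick sketches of the above, as one possible way of writing them (names are made up) – a counting sort for small, known value ranges, and a classic “lower bound” binary search; the off-by-one traps live in the lo/hi updates:

  #include <cstddef>
  #include <vector>

  // Counting sort for values known to be in [0, maxValue]. O(N + maxValue)
  // time, O(maxValue) extra memory - only worth it for small value ranges.
  std::vector<int> countingSort(const std::vector<int>& values, int maxValue) {
      std::vector<int> counts(maxValue + 1, 0);
      for (int v : values) {
          ++counts[v];
      }
      std::vector<int> sorted;
      sorted.reserve(values.size());
      for (int v = 0; v <= maxValue; ++v) {
          sorted.insert(sorted.end(), counts[v], v);
      }
      return sorted;
  }

  // Index of the first element >= target in a sorted array ("lower bound"),
  // or sorted.size() if there is none. O(log N).
  size_t lowerBound(const std::vector<int>& sorted, int target) {
      size_t lo = 0, hi = sorted.size();  // half-open search range [lo, hi)
      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;
          if (sorted[mid] < target) {
              lo = mid + 1;
          } else {
              hi = mid;
          }
      }
      return lo;
  }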

Graphs

Do not memorize all kinds of exotic graph algorithms.

But be sure to know how to implement graph construction and traversal in your language of choice without thinking too much about it.

For traversal, know and be able to immediately implement BFS and DFS, and maybe how to expand BFS to a simple Dijkstra algorithm.
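For illustration, a minimal BFS sketch over an adjacency-list graph (in interviews the graph is often implicit, e.g. cells of a grid, but the shape of the code stays the same):

  #include <queue>
  #include <vector>

  // Shortest path lengths (in edges) from `start` to every node of an
  // unweighted graph given as adjacency lists; -1 means unreachable.
  // O(V + E) time. Swapping the queue for a priority queue keyed on
  // distance is the step towards Dijkstra's algorithm.
  std::vector<int> bfsDistances(const std::vector<std::vector<int>>& adjacency, int start) {
      std::vector<int> dist(adjacency.size(), -1);
      std::queue<int> toVisit;
      dist[start] = 0;
      toVisit.push(start);
      while (!toVisit.empty()) {
          int node = toVisit.front();
          toVisit.pop();
          for (int neighbor : adjacency[node]) {
              if (dist[neighbor] == -1) {
                  dist[neighbor] = dist[node] + 1;
                  toVisit.push(neighbor);
              }
          }
      }
      return dist;
  }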

If you don’t have too much time, don’t waste it on more advanced algorithms, spanning trees, cycle detection etc. I personally spent weeks on those and it was useless from the perspective of interviews, and at Google I have not seen any interviewer asking those. This wasn’t wasted time for me personally, I learned a lot and enjoyed it, but was not useful preparing for the interviews.

Trees

Know how to write and traverse a binary tree and how to do simple operations on it like adding/removing nodes. Know that trees are specialized graphs.
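For reference, the kind of thing I mean – a recursive in-order traversal over a minimal, hypothetical node type:

  #include <vector>

  struct TreeNode {
      int value = 0;
      TreeNode* left = nullptr;
      TreeNode* right = nullptr;
  };

  // In-order traversal: left subtree, node, right subtree. For a binary
  // search tree this visits values in sorted order. O(N) time, O(height)
  // stack space.
  void inorder(const TreeNode* node, std::vector<int>& out) {
      if (!node) return;
      inorder(node->left, out);
      out.push_back(node->value);
      inorder(node->right, out);
  }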

The most sophisticated tree structure that will pop up in some interviews (though it was so common that most companies “banned” it as a question) is a trie, for spell checks – but I think it’s also possible to derive / figure it out on the spot, and it might not even be the optimal solution to the problem!

Converting recursion to iteration

It’s worth understanding how to convert between recursion based approaches and iteration (loop) based. This is something that rather rarely comes up at work, but might during the interview, so play around with for example DFS implementation alternatives.
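For example, here is the in-order traversal from the Trees section rewritten with an explicit stack (reusing the hypothetical TreeNode struct from that sketch):

  #include <stack>
  #include <vector>

  // Iterative version of the in-order traversal - the explicit stack plays
  // the role of the call stack, pushing left children until we hit a leaf.
  void inorderIterative(const TreeNode* root, std::vector<int>& out) {
      std::stack<const TreeNode*> pending;
      const TreeNode* node = root;
      while (node || !pending.empty()) {
          while (node) {             // go as far left as possible
              pending.push(node);
              node = node->left;
          }
          node = pending.top();
          pending.pop();
          out.push_back(node->value);
          node = node->right;        // then visit the right subtree
      }
  }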

Dynamic programming

This one is a bit unfortunate I think – dynamic programming is extremely rare in practice; through 12y of my professional career, solving algorithmic problems almost every day, I had only a single use for it, during some cost pre-computations – but dynamic programming used to be an interview favorite…

I was stressed about it, and luckily spent some time learning those – and 6 out of 10 of my interviewers in 2017 asked me problems that could be solved optimally with dynamic programming in an iterative form.

Now I think it has luckily changed, I haven’t seen those in wide use for a while, but some interviewers still ask those questions.
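If you want one representative problem to practice the iterative (“bottom-up”) style on, a classic of roughly the right level is minimum-coins change (a made-up generic example, not a question from any particular company):

  #include <algorithm>
  #include <vector>

  // Minimum number of coins (of given denominations, unlimited supply)
  // needed to pay `amount` exactly; -1 if impossible. Iterative dynamic
  // programming: best[a] = 1 + min(best[a - coin]) over all coins.
  // O(amount * #denominations) time, O(amount) memory.
  int minCoins(const std::vector<int>& denominations, int amount) {
      const int kImpossible = amount + 1;  // sentinel larger than any real answer
      std::vector<int> best(amount + 1, kImpossible);
      best[0] = 0;
      for (int a = 1; a <= amount; ++a) {
          for (int coin : denominations) {
              if (coin <= a) {
                  best[a] = std::min(best[a], best[a - coin] + 1);
              }
          }
      }
      return best[amount] >= kImpossible ? -1 : best[amount];
  }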

Understanding the problem

I do only on-site interviews (which for the past two years were virtual, blurring the lines… but are still after “phone screens”), so I get to interview good candidates and generally, I like interviewing – it’s super pleasant to see someone blast through a problem you posed and have a productive discussion and a brainstorming session together.

One of the common reasons even such good candidates can get lost in the weeds and fail is if they didn’t understand the problem properly.

Sometimes they understand it at the beginning, but then forget what problem they were solving during the interview.

If the interviewer doesn’t give you an example, ALWAYS write a simple example and input/expected output in the shared document or on the whiteboard.

I am surprised that maybe less than a third of the candidates actually do that.

This will not only help you if you get stuck (you might notice some patterns more easily), but the interviewer can also help clarify a misunderstanding, and it can prevent getting lost and forgetting the original goals later.

It can also serve as a simple test case.

Testing

After you propose and code a solution, very quickly run it through the most simple (but not degenerate) example and test case. In my experience, if a candidate has a bug in their code and they try to test it, 90% of them will find those bugs – which is always a huge plus.

But testing doesn’t stop there.

Code testing is a ubiquitous engineering practice now, and big tech companies expect tests written for all code. (And yes, you can test even graphics code.)

So be sure to explain briefly how you would test your code – what kind of tests make sense (correctness, edge cases, performance/memory, regression?) and why.
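In an interview setting this can be as lightweight as talking through (or quickly writing) a handful of asserts – for example, against the hypothetical lowerBound sketch from the sorting section:

  #include <cassert>

  // The kind of quick checks worth running (or at least talking through):
  // typical case, element absent, boundaries, and degenerate inputs.
  void testLowerBound() {
      assert(lowerBound({1, 3, 5, 7}, 5) == 2);   // exact hit
      assert(lowerBound({1, 3, 5, 7}, 4) == 2);   // falls between elements
      assert(lowerBound({1, 3, 5, 7}, 0) == 0);   // smaller than everything
      assert(lowerBound({1, 3, 5, 7}, 8) == 4);   // larger than everything
      assert(lowerBound({}, 42) == 0);            // empty input
  }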

Behavioral interviews

Behavioral interviews are one of the older types of interviews; they come under different names, but in general it’s the type of interview where your interviewer looks at your CV and asks you to walk them through some past projects, often asking questions like “tell me about a situation when you…” and then questions about collaboration, conflicts, receiving and giving feedback, project management, gathering requirements, in recent years diversity/inclusion… basically almost anything.

The name “behavioral” comes from the focus on your (past) behaviors, and not your statements.

This is the type of interview that is the most difficult to judge – as it attempts to judge, among other things, emotional intelligence.

On the one hand, I think we absolutely need such interviews – they are necessary to filter out some toxic candidates, people who do not show enough maturity or EQ for a given role, types of personalities that won’t work well with the rest of your team, or maybe even people with different passions/interests who would not thrive there.

On the other hand, they are extremely prone to bias, like interviewers looking for candidates similar to themselves (infamous “culture fit” leading to toxic tech bro cliques).

Also, the vast majority of interviewers are totally underqualified and not experienced enough to judge interpersonal skills and traits from such an interview (hunting for subtle patterns and cues) – experienced, good managers or senior folks who have collaborated with many people often are, but assuming that a random engineer can do it is just weird and wrong. Sadly, there is not much you can do about it.

Google didn’t do those interviews until, I think, 2019 because of those reasons – but that was also bad.

Not trying to filter out assholes leads to assholes being hired.

Or someone who is a brilliant coder or scientist, but a terrible manager getting hired into a managing position.

Note that some companies also mix behavioral interviews with knowledge interviews and judge your competence based on them (this was true at every gamedev company I worked for) – while some others treat them only as a behavioral filter.

I suggest for you as an interviewee treating it as both – showing both your competence (whether expertise, engineering, managerial, or technical leadership) and confidence, as well as personality traits.

There are no “perfect” answers

From such a broad description it might seem that such an interview is about figuring out the “right” answers to some questions.

I think this is not true.

There are some clearly bad and wrong answers – examples of repeated, sustained toxic behaviors, bullying, arrogance – but I don’t think there are perfect ones, as this depends a lot on the team.

Example – teamwork. In some teams, you are expected to collaborate a lot with the whole team, having your designs reviewed almost daily, splitting and readjusting tasks constantly. Some personalities thrive in such an environment, some don’t and need more quiet, focused time.

On the other hand, on some teams there is a need for deep experts in some niche domain who can go on their own for a long period of time – driving the whole topic and involving others only when it’s much more mature. Some other personalities might thrive in this environment, while it might burn out the first type.

Another example – management. A relatively junior team needs a manager that is fairly involved and helps not only mentor, but also with some actual work and code reviews, while a more mature team needs a manager who is much more hands-off, less technically involved, and mostly deals with helping with team dynamics and organizational problems.

So “bullshitting” in such an interview (which is a very bad idea, see below), even if it works, can lead to being assigned to some role that won’t make you happy…

Prepare

This might be a vague and “everything goes” type of interview, but it doesn’t mean you cannot prepare for it. You don’t want to be put on the spot being asked about something that you have never thought about and having to come up with an answer that won’t be very deep or representative.

Two ways to prepare – the first one is to simply look online for the typical questions asked in such interviews, and note down the ones that surprise you, are tricky, or that you don’t know the answer to.

The second one, more important, is to go through your CV and all your projects – and think about how they went, what you did, what you learned, and what went wrong and could have been better. It’s easy to recall our successes, but also think about unpleasant interactions, conflicts, and what motivated the person you didn’t agree with.

Spend some time thinking about your career, skills, trajectory, motivations – and be honest with yourself. You can do this also while thinking about the typical behavioral interview questions you found online.

Focus on behaviors, not opinions

When asked about how you gather requirements for a new task, don’t give opinions on how it should be done. This is something anyone can have opinions on or read online.

Anchor the answer around your past project, explain how you approached it, what worked, and what didn’t.

Generally opinions are mostly useless for interviewers – and if you simply don’t have experience with something, explain it – remember the time management advice.

Be honest – no bullshit

Finally, absolutely don’t try to bullshit or make up things.

It’s not just about ethics, but also some basic self-preservation.

The industry is much smaller than you might think, and an interviewer could know someone who could verify your story.

Made-up stories also fall apart on details, or could be figured out by things that don’t match up in your CV.

Two situations have stuck in my memory. One was a “panel” interview a long time ago, with two of my colleagues, interviewing a programmer who had put something in their CV that they didn’t really understand. It caused one of the interviewers to almost snap and grill the interviewee about technical details to the point that they were about to completely break down – they were stopped by the other interviewer (a senior manager). This was very unprofessional of my colleague and made everyone uncomfortable – I didn’t speak up as I was pretty junior back then, and I feel bad about it now. But remember that bluffs might be called.

The second one was talking with someone who bragged about their numerous achievements, done personally on a “very small team, maybe 2-3 programmers other than them, mostly juniors”. Not long after, I met their ex-manager, and it turned out that this person did maybe 10% of what they bragged about, and the team was 15 people. This was… awkward.

Obviously, you don’t have to be self-incriminating – that one time you snapped and were unprofessional 5y ago? Or that time you were extremely wrong and didn’t change your mind even when faced with the evidence, until it was too late? You don’t have to mention them.

At the same time, a story where something you did was imperfect or turned out to be a mistake, but you learned your lesson, actually makes for a great answer – it shows introspection, insight, growth, and self-development. Someone who makes mistakes but recognizes them and shows constant learning and growth is a perfect candidate, as their future potential self is even better!

Be clear about attribution, credit, and contribution

This is an expansion of the previous point – when discussing team projects, be very clear about what your exact personal contributions were, and what the contributions of others were.

If you vaguely discuss “we did this, we did that”, how is the interviewer supposed to understand whether it was your work or your colleagues’, who made the decisions, and what the dynamics looked like?

Don’t inflate your contributions and achievements, but don’t hide them either.

If you made most of the technical decisions, even if you were not officially a lead – explain it.

At the same time, if someone else made some contributions, don’t put them under a vague “we” – mention that it was the work of your collaborator. Give proper credit.

Knowledge / domain interviews

Knowledge interviews are one of the oldest and still most common types of interviews. When I was working in video games, every company and team used knowledge interviews as the main tool. They ranged from legit poking at someone’s expertise – very relevant to their daily work; through some questionable ones (asking a junior programmer candidate fresh from college “what is a vtable”); to some that were just plain bad and showed almost nothing useful (“what does the keyword ‘mutable’ mean?”).

I think this kind of interview can often give super valuable hiring signals (especially when framed around CV/past experience or together with “problem solving” / design).

On the other hand, it can be not just useless, but also bias-reinforcing and intimidating – when the interviewer grills the interviewee on some irrelevant details that they personally specialize in, or when they want to hire someone with exactly the same skills as they have, while the team actually needs a different set of skills and experience.

So as always, it depends a lot on the experience and personality of the interviewer.

What makes a good domain knowledge question?

In my opinion, it’s an open-ended nature, potential for both depth and breadth, and allowing candidates of all levels to shine according to their abilities.

So for example for a low level programmer a question about “if” statements can lead to fascinating discussions about jumps, branches, branch prediction, speculative execution, out of order processors, vectorization of if statements to conditional selects, and many many more.

By contrast a lazy, useless domain knowledge question has only a single answer and you either know it or not – and it doesn’t lead to any interesting follow up discussions.

Still, it was quite a surprise to learn that Google used to not do any of those a while ago – I explained the reasons in one of the sections above. Now the times have changed, and Google or Facebook will interview you on those as well – not sure about some introductory roles, but that’s definitely the case in specialized organizations like Research. 

Before the interviews, talk with the recruiter about what kind of domain interviews you will have – which domains you should prepare to interview for and what the interviewers expect. It’s totally normal and ok to ask! Don’t be shy about it and don’t guess.

Even for a gamedev “engine” role you could be interviewed in a vast range of topics from maths, linear algebra, geometry, through low level, multithreading, how OS works, algorithms and data structures, to even being asked graphics (some places use the term “engine” programmer as “systems” programmer, and in some places it means a “graphics” programmer!).

A fun story from a colleague. They mentioned that straight after college, they applied to a start-up doing quant trading, but there was some miscommunication about the role and the interviews – they thought they had applied for an entry-level software engineer position, while it was for a junior quant researcher. The whole interview was in statistics and applied math, and they mentioned that the interviewer (a mathematician) quickly realized the miscommunication and tried to back off from intimidating questions and rescue the situation by finding some kind of simple question: “so… tell me what kind of statistical distributions you know?” “hmm… I don’t know, maybe Gaussian?”. Everyone was totally cool about it, but to avoid such situations, make sure you communicate properly about domain knowledge expectations.

Refresh the basics

I’ll mention some of the knowledge worth refreshing for the domain interviews I have given or received (graphics, engine/low-level/optimization, machine learning), but here is something that is easy to miss in any domain interview – refresh the absolute basics. You know, undergrad-level textbook definitions, problems, solutions.

Whether chasing the state of the art or simply solving practical problems, it’s easy to forget some complete basics of your domain and “beginner” knowledge – things you have not used or read about since college.

I think it’s always beneficial to refresh it – you won’t lose too much time (as you probably know those topics well, so a quick skim is enough), but won’t be caught off guard by an innocent, simple question that the interviewer intended just as a warm-up.

Getting stuck there for a minute or two is not something that would make you fail the interview, but it can be pretty stressful and sometimes it’s hard to recover from such a moment of anxiety.

Graphics domain interviews

Graphics can mean lots of different things (I think of 3 main categories of skills – 1. high level/rendering/math, 2. CPU rendering programming and APIs, 3. low level GPU / shader code writing and optimizations), but here are some example good questions that you should know answers to (within a specialty; if you are an expert on Vulkan CPU code programming, no reasonable interviewer should be asking about BRDF function reciprocity; and it’s ok for you to answer that you don’t know):

  • What happens on the CPU when you call the Draw() command and how it gets to the GPU?
  • Describe the typical steps of the graphics pipeline.
  • Your rendering code outputs a black screen. How do you debug it?
  • What are some typical ways of providing different, exclusive shader/material features to an artist? Discuss pros and cons.
  • A single mesh rendering takes too much time – 1ms. What are some of the possible performance bottleneck reasons and how to find out?
  • Calculate if two moving balls are going to collide.
  • Derive ray – triangle intersection (can be suboptimal, focus on clear derivation steps).
  • What happens when you sample a texture in a shader? (Follow up: Why do we need mip maps?)
  • How does perspective projection work? (Follow up for more senior roles, poke if they understand and can derive perspective correct interpolation)
  • What is Z-fighting, what causes it, and how can one improve it?
  • What is “culling”, what are some types and why do we need those? 
  • Why do we need mesh LODs?
  • When can rasterization outperform ray tracing and vice versa? What are the best case scenarios for both?
  • What are pros/cons of deferred vs forward lighting? What are some hybrid approaches?
  • What is the rendering equation? Describe the role of its sub-components.
  • What properties should a BRDF function have?
  • How do games commonly render indirect specular reflections? Discuss a few alternatives and compare them to each other.
  • Tell me about a recent graphics technique/paper you read that impressed/interested you.
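To give a sense of the expected depth on one of the more mathematical questions above (the moving balls one), the reasoning is roughly: work in the frame of one ball, reduce it to “when does the distance between centers drop below the sum of radii”, and solve the resulting quadratic. A hedged sketch, with a made-up minimal vector type:

  #include <cmath>

  struct Vec3 { float x, y, z; };

  static float dot(const Vec3& a, const Vec3& b) {
      return a.x * b.x + a.y * b.y + a.z * b.z;
  }
  static Vec3 operator-(const Vec3& a, const Vec3& b) {
      return {a.x - b.x, a.y - b.y, a.z - b.z};
  }

  // Do two balls moving at constant velocities collide within [0, maxTime]?
  // In the frame of ball 1 the relative center is d + v*t; they touch when
  // |d + v*t| = r1 + r2, i.e. (v.v)t^2 + 2(d.v)t + (d.d - R^2) = 0.
  bool ballsCollide(Vec3 c1, Vec3 v1, float r1,
                    Vec3 c2, Vec3 v2, float r2, float maxTime) {
      const Vec3 d = c2 - c1;
      const Vec3 v = v2 - v1;
      const float radiusSum = r1 + r2;
      const float c = dot(d, d) - radiusSum * radiusSum;
      if (c <= 0.0f) return true;               // already overlapping
      const float a = dot(v, v);
      const float b = 2.0f * dot(d, v);
      if (b >= 0.0f) return false;              // not approaching each other
      const float discriminant = b * b - 4.0f * a * c;
      if (discriminant < 0.0f) return false;    // closest approach still too far
      const float t = (-b - std::sqrt(discriminant)) / (2.0f * a);
      return t >= 0.0f && t <= maxTime;
  }

What interviewers usually care about here is the clean reduction to relative motion and the discussion of edge cases (already overlapping, no relative velocity, collision outside the time window), not any particular code style.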

Engine programming / optimization / low level domain

It’s been a while since I interviewed for such a position, and a similarly long time since I interviewed others for one, but here are some ideas of topics to know pretty well:

  • Cache, memory hierarchies, latency numbers.
  • Branches and branch prediction.
  • Virtual memory system and its implications on performance.
  • What happens when you try to allocate memory?
  • Different strategies of object lifetime management (manual allocations, smart pointers, garbage collection, reference counting, pooling, stack allocation).
  • What can cause stalls in in-order and out-of-order CPUs?
  • Multithreading, synchronization primitives, barriers.
  • Some lock-free programming constructs.
  • Basic algorithms and their implications on performance (going beyond big-O analysis, but including constants, like analysis of why linked lists are often slow in practice).
  • Basics of SIMD programming, vectorization, what prevents auto-vectorization.

Best interviews in this category will have you optimize some kind of code (putting your skills and practical knowledge to test). Ones I have been tasked with typically involved:

  • Taking something constant but a function call outside of a loop body.
  • Reordering class fields to fit a cache line and/or get rid of padding, and removing unused ones (split class).
  • Converting AoS to SoA structures.
  • Getting rid of temporary allocations and push_backs -> pre-reserve outside of the loop, possibly with alloca.
  • Removing too many indirections; decomposing some structures.
  • Rewriting math / computations to compute less inside loop body and possibly use less arithmetic.
  • Vectorizing code or rewriting in a form that allows for auto vectorizing (be careful; up to discussion with the interviewer).
  • (Sometimes) adding “__restrict” to function signatures.
  • Prefetching (nowadays might be questionable).

Important note: Generally do not suggest multithreading until all the other problems are solved. Multi-threading bad code is generally a very bad practice.
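To make the list above more concrete, here is a tiny, made-up before/after-style sketch combining a few of those transformations (hoisting invariant work out of the loop, pre-reserving to avoid repeated allocations, and an SoA-friendly layout). All names are hypothetical, and real interview snippets tend to be longer and messier:

  #include <cmath>
  #include <vector>

  struct ParticlesSoA {           // SoA layout: each field is contiguous,
      std::vector<float> vx, vy;  // which is cache- and SIMD-friendlier than
  };                              // a vector of "fat" particle structs (AoS).

  // Component of each particle's velocity along the wind direction.
  // A "before" version (not shown) would normalize the wind vector and grow
  // the output inside the loop; here the invariant work is hoisted out and
  // the output is pre-reserved - a single allocation and a tight loop body
  // that the compiler has a chance to auto-vectorize.
  std::vector<float> velocityAlongWind(const ParticlesSoA& particles,
                                       float windX, float windY) {
      const float windLength = std::sqrt(windX * windX + windY * windY);
      const float invWindLength = windLength > 0.0f ? 1.0f / windLength : 0.0f;
      const float nx = windX * invWindLength;
      const float ny = windY * invWindLength;

      std::vector<float> result;
      result.reserve(particles.vx.size());
      for (size_t i = 0; i < particles.vx.size(); ++i) {
          result.push_back(particles.vx[i] * nx + particles.vy[i] * ny);
      }
      return result;
  }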

General machine learning

This is something relatively recent to me and I haven’t done too many interviews about it; but some recurring topics that I have seen from other interviewers and questions to be able to easily answer:

  • Bias/variance trade-off in models.
  • What is under- and over-fitting, how to check which one occurs and how to combat it?
  • Training/test/validation set split – what is its purpose, how does one do it?
  • What is the difference between parameters and hyperparameters?
  • What can cause an exploding gradient during training? What are typical methods of debugging and preventing it?
  • Linear least squares / normal equations and statistical interpretation.
  • Discuss different loss functions – L1, L2, convex / non-convex, domain specific loss functions.
  • General dimensionality reduction – PCA / SVD, autoencoders, latent spaces.
  • How does semi-supervised learning fill in the gap between supervised and unsupervised learning? Pros/cons/applications.
  • You have a very small amount of labeled data. How can you improve your results without being able to acquire any new labels? (Lots of cool answers there!)
  • Explain why convolutional neural networks train much faster than fully connected ones.
  • What are generative models and what are they trying to approximate?
  • How would you optimize runtime of a neural network without sacrificing the loss too much?
  • When would you want to pick a traditional approach over deep learning methods and vice versa? (Open ended)

System design interviews

Systems design interviews are my favorite as an interviewer (and I thoroughly enjoyed them as an interviewee). They are required for more senior levels and involve a design process for larger-scale problems. So instead of simple algorithms, these are problems like designing a terrain and streaming system for a game, designing a distributed hash map, explaining how you’d approach finding photos containing document scans, or maybe designing an aspect of a service like Apple Photos.

This is obviously not something that can be done in a short interview time frame – but what can be done is gathering some requirements, identifying key difficulties, figuring out trade-offs, and proposing some high level solutions. 

This is the most similar to your daily work as a senior engineer – and this is why I love those.

I don’t have much specific advice about this type of interview that would be different from the other ones – but focus on communication, explaining the boundaries of your experience, explaining every decision, asking tons of questions, and avoiding filler talk.

An example of one design interview I partially failed (enough to get a follow-up one, but not be rejected). An engineer who had spent 10 years of his career designing a distributed key-value store asked me to design one. Back then I considered it a fair question, as I had focused quite a bit on this type of low-level data structure, though just on a single machine (now I would probably explain that it’s not my specialty, so my answers would be based mostly on intuition and not actual knowledge/experience – and the interviewer might change their question). At one point of the interview, I made an assumption about prioritizing on-core performance that traded off deletion performance as well as the ability to parallelize across multiple machines – and I didn’t mention it. This caused me to make a series of decisions that the interviewer considered as showing a lack of depth / insight / seniority. Luckily I got this feedback from the recruiter, had great feedback from the other interviews, and was given a chance to re-do the systems design interview with another engineer – and that one went well. In hindsight, if I had communicated this assumption and asked the interviewer about it, it could have gone better.

After the interview

After the interview, don’t waste your time and mental stamina on analyzing what could have been wrong and if you answered everything correctly.

Seeking self-improvement is great, but wait for the feedback from the recruiter.

I remember one of the interviews that I thought I did badly on (as I got stuck and stressed on a very simple question and went silent for two minutes; later recovered, but was generally stressed throughout) and was overthinking it for the following week, but later the recruiter (and the interviewer – I worked with them at some later point) told me I did very well.

Sometimes if you totally bomb one interview but do very well on others, you will get a chance to re-do a single interview in the following week. This happened both to me with my design interview I mentioned, and to many of my friends, so it’s an option – don’t stress and overthink.

Also note that if you fail to get an offer, you can try to ask for feedback on how to improve in the future. Sadly, big companies typically won’t give it to you – due to legal reasons and being afraid of lawsuits (if someone disagrees with some judgement and claims discrimination instead).

But small companies often will. I remember one candidate we interviewed who did well except for one area (normally it would be ok and we’d hire them, but there was also the question of obtaining a visa) and got this as feedback – in a few months they caught up on everything, emailed us, demonstrated their knowledge, and got the offer.

Interviewing your future team

At smaller companies and some of the giants (Apple and to my knowledge also Microsoft/nVidia), all of your interviews will be with your potential team members – an opportunity not only for them to interview you, but just as importantly for you to interview them, check the vibe, and decide if you want to work with them.

At Google and some other giants, this happens in a process called “team matching” after your technical interviews are over and successful. At Facebook it happened a month after you got hired, during a “bootcamp” (totally wrong if you ask me – since so much of your work and your happiness depends on the team, I would not want to sign an offer and work somewhere without knowing that I found a team I really want to work with).

In my opinion, a great team and a great manager is worth much more than a great product. Toxic environments with micromanagers and politics will burn you out quickly.

In any case, at some point you will have an opportunity to chat with potential teammates and potential manager and it’s going to be an interview that goes both ways.

You might be asked both some “soft” and personal motivation/strengths/work preferences questions, as well as more technical questions.

One piece of advice is to learn something about your potential future team before chatting with them, and to be prepared both for questions they might ask, as well as to ask them ones that maximize the amount of information (spending 10 minutes on hearing what they are working on is not a productive use of the half an hour you might have – if it’s information that can be obtained easily earlier).

Interview your prospective manager and teammates

Here are some questions that are worth asking (depending if something is important for you or not):

  • How many people are on your team? How many team members (and others outside of your team) do you work closely with?
  • How many hours a week do you spend coding vs meetings vs design vs reading/learning?
  • How much time is spent on maintenance / bug fixing vs developing new features / products?
  • Do you work overtime and if so, how often?
  • Do people send emails outside of the working hours?
  • How much vacation did you take last year?
  • Does the manager need to approve vacation dates, and are there any shipping periods when it’s not possible?
  • Do you have daily or weekly team syncs and stand-ups? 
  • Do you use some specific project management methodology?
  • What does the process of submitting a CL look like – how many hours or days can I expect between finishing writing a CL and it landing in the main repo?
  • What are the next big problems for your team to tackle in the next 6/12/24 months?
  • If I were to start working with you and get onboarded next week, what task/system would I work on?
  • Do you work directly with artists/designers/photographers/x, or are requirements gathered by leads/producers/product managers?
  • What are the 2-3 other teams you collaborate most closely with?
  • Are there other teams at the company with similar competences and goals?
  • How do you decide what gets worked on?
  • How do you decide on priorities of tasks?
  • How do you evaluate proposed technical solutions?
  • When there is a technical disagreement about a solution between two peers, how do you resolve conflicts? Tell me about a particular situation (anonymize if necessary).
  • Does the manager need to review/approve docs or CLs?
  • How often do you have 1-1s with a skip-manager / director?
  • How many people have left / have you hired in the last year?
  • How many people are you looking to hire now?
  • How much time per year do you spend on reducing the tech debt?
  • How do you support learning and growth, do you organize paper reading groups, dedicate some time for lectures / presentations, are there internal conferences?
  • Do you send people to conferences? Can I go to some fixed number of conferences per year, or is there some process that decides who gets sent?
  • Does your team present or publish at conferences? If not, why?
  • What is the company’s policy towards personal projects? Personal blog, personal articles?

Watch out for red flags

Just like hiring managers want to watch out for some red flags and avoid hiring toxic, sociopathic employees, you should watch out for some red flags of toxic, dysfunctional teams and manipulating, political or control-freak micromanagers.

Some of the questions I wrote above are targeted at this. But it’s kind of difficult and you as an interviewee are in a disadvantaged, extremely asymmetric situation.

Think about this – companies do background checks, reference checks, interview you for many hours, review your CV looking for any gaps or short tenures – and you get realistically 15, maximum 30 minutes of your questions where you get to figure out the potential red flags from often much more senior and experienced people (especially difficult if they are manipulative and political).

One way to counter the inherent asymmetry and time constraint is to find people who used to work at the given company / team, and ask them about your potential manager and the reason they left. Or find someone who works inside the company, but not on the team, and ask them anonymously for some feedback. The industry is small and it should be relatively straightforward – even if you don’t know those people personally, you could try reaching out.

A single negative or positive response doesn’t mean much (not everyone needs to get along well and conflicts or disliking something is part of human life!), but if you see some negative pattern – run away!

Similarly, one question that is very difficult to manipulate around is the one about attrition – if a lot of people are leaving the team, and almost nobody seems to stay for more than one project, it’s almost certainly a toxic place.

Another red flag is just the attitude – are they trying to oversell the team? Mention no shortcomings / problems, or hide them under some secrecy? There are legit reasons some places are secretive, but if you don’t hear anything concrete, why would you trust vague promises of it being awesome?

Similarly, vibes of arrogance or lack of respect towards you are clear reasons to not bother with such a team. Some managers use undervaluing of the candidates and playing “tough” or “elite” as a strategy to seem more valuable, so if you sense such a vibe, rethink if you’d want to work there.

Finally, there are some key phrases that are not signs of a healthy, functioning place. You know all those ads that mention they are “like a family”? This typically means unhealthy work-life balance, no clear structure or decision making process, and making people feel guilty if they don’t want to give 100%. Looking for “rockstars”? Lack of respect towards less experienced people, not valuing any non-shiny work, relying on toxic, outspoken individuals. When you ask for a reason to work somewhere and hear “we hire and work with only the best!”, it both means that they don’t have anything else good to say, and that management in its hubris will refuse to change anything (after all, they think they are “the best”…).

Figure out what “needs to be done”

I mentioned this in the proposed questions, but even if a perfect team works on a perfect product, it doesn’t mean you’d be happy with tasks you are assigned.

It’s worth asking what the next big things that need to get done are – basically why they want to hire you.

Is it for a new large project (lots of room for growth, but also not working on the one you might have picked the team for!), to help someone / as their replacement, or to just clear tech debt and for long term maintenance?

One not-so-secret – whatever system you work on, however temporary, will “stick” to you “forever” – especially if it’s something nobody else wants to do.

I think at every single team, my “first” tasks were the things that I had to maintain for as long as they were still in use. And those things pile up, and after some time your full time job will be maintaining your past tasks.

Therefore watch out for those and don’t treat this as just “temporary” – they might define and pigeonhole you in the new role forever.

My favorite funny example is my first job at CD Projekt Red and my second task (the first one was a data “cooker” – though most of it was written by the senior programmer who was mentoring me). It was optimizing memory allocations – as the game and engine were doing 100k memory allocations per frame. 🙂 After a week of work I cut it down to around ~600 (surprisingly easy with mostly just pooling, calling reserve on arrays before push_backs, and similar simple optimizations); the remaining ones were difficult to remove and still slow because of some locks and mutexes in the allocator. So I proposed to write a new memory allocator – seems like a perfectly reasonable thing to do for a complete junior straight out of college? 😀 Interestingly it survived Witcher 2 and Witcher 2 X360, so maybe it wasn’t actually too bad. Anyway, it was a block list allocator with intrusive pointers – so any memory overwrite on such a pointer would cause an allocator crash at some later, uncorrelated point. For the next 3 years, I was assigned every single memory overwrite bug, with comments from testers and programmers like “this fucking allocator is crashing again” and them rolling their eyes (you know, positive Polish workplace vibes!). I learned quite a lot about debugging, wrote some debug allocators with guard blocks that were trivial to switch to, systems for tracking and debugging allocations etc., but I think there is no need to mention that it was a frustrating experience, especially under 80h work weeks and tons of crunch stress around deadlines.

Negotiating the offer

Finally – you went through the interviews, found a perfect team, so it’s time to negotiate your salary!

I wish every company published expected salary ranges in the job listing so you knew them before interviewing – but it’s not the case; at least not everywhere.

This is yet another case of extreme asymmetry – as an applicant, you don’t have all the data that a company uses when negotiating salaries – you don’t know their ranges, bonuses, salary structure, competitor’s salaries. And they know all that, and salaries of all the employees they can compare you with.

Before negotiation, do your homework – figure out how much to expect and how much others in a similar position get paid. An example of a useful salary aggregator website for the US market is levels.fyi. You can also check the salaries (though only base salary, not total compensation) listed in every company’s H1B visa applications.

If you know someone who works or used to work at some place, you can also ask them about expected salary ranges – sadly this is something that many people treat as a taboo (I understand this, but hope it can change. If you are a friend of mine that I trust, feel free to ask me about my past or current salary. Also note that in Poland or California and many other places it is illegal for employers to ban employees from disclosing their salaries).

Salary negotiation is kind of like a business negotiation – business will (in general; there are obviously exceptions, and some fantastic and fair company owners) want to pay you the lowest salary you agree to / they can offer you. After all, capitalism is all about maximizing profit. Unfortunately, unlike a business deal, I think salary is much more – and not “just” your means of living, but also something that a lot of people associate personal value with and we expect fairness (example: tell someone that their peer at the same company earns 2x more while doing less work and being less skilled and watch their reaction).

I am pretty bad at negotiations – I was raised by my parents in a culture of “work hard, do great stuff, and eventually they will always reward you properly”. They would also be against loans/mortgages, investments, and generally we’d never talk about money. I love my parents and appreciate everything they gave me, and I understand where their advice was coming from – it might have been good advice in 1980s Poland – but it is terrible advice in the competitive US system, where you need to “fight” to be treated right. When changing jobs, most of the time I agreed to a lower salary than the one I had – believe it or not, this includes getting a salary cut when moving from Poland to Canada (!).

But here are some strategies on how to get a much better offer even if you are as bad at negotiating money as I am.

First is the one I keep repeating – don’t be desperate, respect yourself, and be able to walk away at any point if the negotiations don’t lead anywhere.

Always have competing offers / total compensation structure

The second piece of advice – and the most successful one – is simple: have some competing offers and a few employers fighting for you.

Tell all companies that you are interviewing at a few other places (obviously only if you actually do; don’t bullshit – this is very easy to verify).

Then when you get some offer, present the other offers you get.

And no, it’s not as bold as you might think and it’s definitely not outrageous; it’s totally acceptable and normal – when I was interviewing at both Google and Facebook, the Facebook recruiter told me they would give me an offer only after I presented the Google one (!). Google’s offer was initially pretty bad; Facebook offered me almost ~2x the total compensation. After I came back to Google, they equalized the offers…

Similarly, a few years earlier Sony gave me a fantastic offer (especially for gamedev), but it was only after I told them of the Apple offer I had around that time – and I’m sure it has influenced it.

Worth understanding here is the compensation structure. In big tech corps, the base salary is not very flexible / negotiable. You can negotiate it somewhat within ranges, but not too much (it depends on the level though! See my advice about under-leveling). Similarly the target bonus – it’s a fixed per-level percentage. But what is negotiable – and you can negotiate a lot – is the equity grant; as far as equity goes, the sky’s the limit. The huge difference between the Google and Facebook offers came purely from the difference in equity.

Note that I don’t condone applying to random places only to get some offers. You’d be wasting some people’s time. Apply to places you genuinely want to work for. This will help you also not be desperate and have some cool options to pick from based not just on salaries, but many more choice axes!

Don’t count on royalties / project shipping bonuses

Game companies sometimes might tell you that the salary might seem underwhelming, but the project bonuses are going to be so fantastic and so much worth it!

Don’t count on it. I don’t mean they are necessarily lying – everyone wants to be successful and get a gigantic pile of money that they might take a small part of and give to employees.

But unless you have something specific and guaranteed in your contract, this is subject to change. Additionally, projects get postponed or canceled, and it might be many years until you see any bonus – or you might never see it if the studio closes down. The game might underperform. Even if it’s successful, you might get laid off after the game ships (yes, this happens all the time) and not see any bonus.

There is also a very nasty practice of tying the bonus to a specific Metacritic score. Why is it nasty? Game execs know what metascore a game will most likely get (they hire journalists and do pre-reviews a few months ahead!), and set this bar a few points above.

FWIW my game shipping bonuses were 1-4 monthly salaries. So quite nice (hey, I got myself a nice used car for one of my bonuses!), but if you work on it for a few years, it’s like less than a monthly salary per year of work – much less than big tech target annual bonuses of 15/20/25% of base salary.

And definitely doesn’t compensate for brutal overtime in that industry.

You have a right to refuse disclosing your current salary

This point depends a lot on the place you live in – check your local laws and regulation, possibly consult a lawyer (some labor lawyers do community consultations even for free) – but in California, Assembly Bill 168 prohibits an employer from asking you about your salary history. You can disclose it on your own if you think it benefits you, but you don’t have to, and an employer asking for it breaks the law.

Sign on bonus

Something that I didn’t know about before moving to the US were “sign on” bonuses – simply a one time pile of cash you get for just signing the contract.

You can always negotiate one – it’s typically discretionary, and it makes sense in some cases: if by changing jobs you miss out on a bonus at the previous place, or need to pay back relocation to the previous company. Similarly, it makes sense if your stock grants start vesting only after some time, like a year – it gives you a bit more cash flow in the first year.

Companies are eager to offer you some sign on bonus as it doesn’t count towards their recurring costs, it’s just part of the hiring cost.

But if a recruiter tries to convince you that a lower salary is ok because you get some sign on bonus, think about it – are you going to remember your sign on bonus in a year or two? I don’t think so, and you might not get enough raises to compensate for it, so treat this argument as a bit of bullshit. But you could use it during negotiation.

Relocation package

If the company requires you to relocate (and you are ok with that), they are most likely to offer some relocation package. Note that those are typically also negotiable!

If you are moving to another country, don’t settle for anything less than 2 months of temporary housing (3 months preferred).

You will need it – obtaining documents, driving licenses etc. will take a lot of time, leaving you little to look for a place to rent (or buy), so getting some extra time is very useful.

If you have an option of “cashing out” instead of itemized relocation, it is typically not worth it – especially in most tech locations in the US where monthly apartment / house rental costs are… unreasonable.

You have a right to review the contract with a lawyer

If you are about to sign a contract, you can review it with a lawyer (or, the lowest effort option – ask your friends to review some questionable clauses, if they have seen anything similar). For example, according to the California Business and Professions Code Section 16600, “every contract by which anyone is restrained from engaging in a lawful profession, trade, or business of any kind is to that extent void.” So most non-compete clauses in California are simply bullshit and invalid.

On the other hand, I have heard that non-poaching agreements are not only enforceable in general, but also of some cases when they actually did get enforced.

Never buy into time pressure

Here’s a common trick that recruiters or even hiring managers might pull on you – putting artificial time pressure on you and making you commit to a decision almost immediately.

This is a bit disgusting and reminds me of used car sellers or realtors (I remember some rental place in LA that had a “this month only” monthly special for 3 years, every time I drove by; only the deadline date changed every 2 weeks to something close in time). Your offer will come with some deadline too.

This is human psychology – “losing” something (like a job opportunity) hurts just as much even if we don’t have it yet (or might not even have decided we want it!).

I know that you might be both excited, as well as afraid of “losing” the offer after so much time and effort put into it – but it works both ways.

Remember that hiring great people is extremely hard right now in tech, and who would pass on a great hire just because you waited for a week or two more? This can (very rarely) happen in a small company (if they have only 1 hiring slot and need to fill it as fast as possible), but generally won’t happen in big tech.

So yes, offers come with a deadline, but they almost always can write you a new one.

Take some time to calm down your emotions, think everything through, review your offer multiple times, consult friends and family. Or sign right away – but only if you want.

Someone close to me was pressured like this by a recruiter – being told the offer would be void if they didn’t decide in a day or two – while they needed just a week more. What makes it even sadder is that it was an external agency recruiter, and apparently the company and hiring manager didn’t know about it and were willing to give as much time as necessary – but the person didn’t take the offer because of this recruiter pressure and the unpleasant interactions with them.

Or another friend of mine who was told “if you don’t start next month, you won’t get a sign on bonus”.

I personally regret once starting a job a few months sooner than I wanted because I got pressured into it by the hiring manager (“but I have a perfect project for you [can’t tell you what exactly], if you don’t start now, it will be done or irrelevant later” – and later their management style was just as manipulative; I felt so silly about not seeing through this).

Fin

This is it, my longest post so far! I’ll keep updating and expanding it (especially with ideas for questions to ask future team / manager, as well as example interview topics) in the future.

I hope you found it useful and got some practical advice out of it – even if you disagree with some of my opinions.


Procedural Kernel (Neural) Networks

Last year I worked for a bit on a fun research project that ended up published as an arXiv “pre-print” / technical report, and here is a few-paragraph “normal language” description of this work.

Neural Networks are taking over image processing. If you only read conference papers and watch marketing materials, it’s easy to think there is no more “traditional” image processing.

This is not true – classical image processing algorithms are still there, running most of your bread and butter image processing tasks. Whether because they get implemented in mobile hardware (and each hardware cycle is a few years), or simply due to the inherent wastefulness of neural networks and memory or performance constraints (it’s easy for even simple neural networks to take seconds… or minutes to process 12MP images). Even Adobe uses “neural” denoising and super-resolution in their flagship products only in a slow, experimental mode.

The misconception often comes from a “telephone game” – engineering teams use neural networks and ML for tasks running at lower resolution like object classification, segmentation, and detection; marketing claims the technology is powered by “AI” (God, I hate this term. Detecting faces or motion is not “artificial intelligence”); and then influencers and media extrapolate it to everything.

Anyway, the quality of those traditional algorithms varies widely and depends a lot on a costly manual tuning process.

So the idea goes – why not use extremely tiny, shallow networks to produce local, per-pixel (though in half or quarter resolution) parameters and see if they improve the results of such a traditional algorithm? And by tiny I mean running at lower resolution and having fewer than 20k parameters – so small that it can be trained in just 10 minutes!
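
Just to make the “per-pixel parameters steering a classical filter” part concrete, here is a toy numpy sketch of the consumer side only – my own illustration, not the paper’s implementation (in the paper a tiny network predicts the parameter map; here it is simply an input, and the spatially varying Gaussian is approximated by interpolating within a small filter bank):

import numpy as np
from scipy.ndimage import gaussian_filter

def locally_adaptive_gaussian(noisy, sigma_map, sigmas=(0.5, 1.0, 2.0, 4.0)):
    # Pre-filter with a small bank of sigmas, then per pixel interpolate
    # between the two bank entries closest to the requested sigma.
    bank = np.stack([gaussian_filter(noisy, s) for s in sigmas])
    sigmas = np.asarray(sigmas)
    idx = np.clip(np.searchsorted(sigmas, sigma_map) - 1, 0, len(sigmas) - 2)
    lo, hi = sigmas[idx], sigmas[idx + 1]
    t = np.clip((sigma_map - lo) / (hi - lo), 0.0, 1.0)
    rows, cols = np.indices(noisy.shape)
    return (1.0 - t) * bank[idx, rows, cols] + t * bank[idx + 1, rows, cols]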

The answer is that it definitely works, and it significantly improves the results of a classic bilateral or non-local means denoising filter.

a) Reference image crop. b) Added noise. c) Denoising with a single global parameter. d) Denoising with locally varying, network-produced parameters.

This has a nice advantage of producing interpretable results, which is a huge deal when it comes to safety and accountability of algorithms – important in scientific imaging, medicine, evidence processing, possibly autonomous vehicles. Here is an example of a denoised image and a produced parameter map – white regions mean more denoising, dark regions less denoising:

This is great! But then we went a bit further and investigated the use of different simple “kernels” and functions to generate parameters to achieve tasks of denoising, deconvolution, and upsampling (often wrongly called single frame super-resolution).

The first example of a different kernel is using a simple Gaussian blur kernel (isotropic and anisotropic) to denoise an image:

Top left: Noisy image. Top right: Denoised with non-local means filter. Bottom left: Denoised with a network inferred isotropic Gaussian filter. Bottom right: Denoised with a network inferred anisotropic Gaussian filter.

It also works very well and produces softer, but arguably more pleasant results than a bilateral filter. Worth noting that this type of kernel is exactly what we used in our Siggraph 2019 work on multi-frame super-resolution, but I spent weeks tuning it by hand… And here an automatic optimization procedure found some very good, locally adaptive parameters in just 10 minutes of training.

The second application uses polynomial reblurring kernels (work of my colleagues, check it out) to combine deconvolution with denoising (a very hard problem). In their work, they used parameters producing perceptually pleasant and robust results, but again – tuned by hand. By comparison, the approach proposed here not only doesn’t require manual tuning and optimizes PSNR, but also denoises the image at the same time!

And it again works very well qualitatively:

Left: Original image. Middle: Blurred and noisy image. Right: Deconvolved and denoised image.

And it produces very reasonable, interpretable results and maps – I would be confident when using such an algorithm in the context of scientific images that it cannot “hallucinate” any non-existent features:

Left: Blurred and noisy image. Middle: Polynomial coefficients map. Right: Deconvolved and denoised image.

The final application is single frame (noisy) image upsampling (often wrongly called super-resolution). Here I again used an anisotropic Gaussian blur kernel, predicted at a smaller resolution and simply bilinearly resampled. Such a simple non-linear kernel produces way better results than Lanczos!

I simply love the visual look of the results here – and note that the same and simple network produced parameters for both results on the right and there was no retraining for different noise levels!

The tech report has numerical evaluations of the first application; some theoretical work on how to combine those different kernels and applications in a single mathematical framework; and finally some limitations, potential applications, and conclusions – so if you’re curious, be sure to check it out on arXiv!


Study of smoothing filters – Savitzky-Golay filters

Last week I saw Daniel Holden tweeting about Savitzky-Golay filters and their properties (less smoothing than a Gaussian filter) and I got excited… because I had never heard of them before and it was an opportunity to learn something. When I checked the wikipedia page and it mentioned a least squares polynomial fit, yet one yielding closed formula linear coefficients, all my spidey senses and interests were firing!

Convolutional, linear coefficients -> means we can analyze their frequency response and reason about their properties.

I’m writing this post mostly as my personal study of those filters – but I’ll also write up my recommendation / opinion. Important note: I believe my use-case is very different than Daniel’s (who gives examples of smoothed curves), so I will analyze it for a different purpose – working with digital sequences like image pixels or audio samples.

(If you want a spoiler – I personally probably won’t use them for audio or image smoothing or denoising, maybe with windowing, but I think they are quite an interesting alternative for derivatives/gradients, a topic I have written about before).

Savitzky-Golay filters

The idea of Savitzky-Golay filters is simple – for each sample in the filtered sequence, take its direct neighborhood of N neighbors and fit a polynomial to it. Then just evaluate the polynomial at its center (and the center of the neighborhood), point 0, and continue with the next neighborhood. 

Let’s have a look at it step by step.

Given N points, you can always fit a (N-1) degree polynomial to it.

Let’s start with 7 points and a 6th degree polynomial:

Red dots are the samples, and the blue line is sinc interpolation – or more precisely, Whittaker–Shannon interpolation, standard when analyzing bandlimited sequences.

It might be immediately obvious that the polynomial that goes through those points is… not great. It has very large oscillations. 

This is known as Runge’s phenomenon – equispaced points (our original sampling grid) generally don’t produce good polynomial fits for many functions, especially with higher polynomial degrees. Fitting high degree polynomials to arbitrary functions on equispaced grids is a bad idea…

Fortunately, Savitzky-Golay filtering doesn’t do this.

Instead, they find a lower order polynomial that fits the original sequence best in the least-squares sense – the sum of squared distances between the polynomial sampled at the original points and the original values is minimized.

(If you are a reader of my blog, you might have seen me mention least squares in many different contexts, but probably most importantly when discussing dimensionality reduction and PCA)

Since this is a linear least squares problem, it can be solved directly with some basic linear algebra and produces following fits:

On those plots, I have marked the distance from the original point at zero – after all, this is the neighborhood of this point and the polynomial fit value will be evaluated only at this point.

Looking at these plots there’s an interesting and initially non-obvious observation – degrees in pairs (0,1), (2,3), (4,5) have exactly the same center value (at x position zero). Adding the next odd degree to the polynomial can cause “tilt” and changes the oscillations, but doesn’t impact the symmetric component and the center.

For the Savitzky-Golay filters used for filtering signals (not their derivatives – more on it later), it doesn’t matter if you take the degree 4 or 5 – coefficients are going to be the same.

Let’s have a look at all those polynomials on a single plot:

As well as on an animation that increases the degree and interpolates between consequential ones (who doesn’t love animations?):

The main thing I got from this visualization and playing around is how lower orders not only fit the sequence worse, but also have less high frequency oscillations.

As mentioned, Savitzky-Golay filter repeats this on the sequence of “windows”, moving by single point, and by evaluating their centers – obtains the filtered value(s).
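
In code, the whole procedure is just a per-window least-squares fit and an evaluation at zero – a naive (slow, but hopefully clear) numpy version:

import numpy as np

def savgol_naive(y, window=7, degree=2):
    half = window // 2
    x = np.arange(-half, half + 1)
    out = np.empty(len(y) - 2 * half)   # boundaries are simply skipped here
    for i in range(half, len(y) - half):
        coeffs = np.polyfit(x, y[i - half:i + half + 1], degree)   # least-squares fit
        out[i - half] = np.polyval(coeffs, 0.0)                    # evaluate at the window center
    return out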

Here is a crude demonstration of doing this with 3 neighborhoods of points in the same sequence – marked (-3, 1), (-2, 2), (-1, 3) and fitting parabolas to each.

Savitzky-Golay filters – convolution coefficients

The cool part is that given the closed formula of the linear least squares fit for the polynomial and its evaluation only at the 0 point, you can obtain a convolutional linear filter / impulse response directly. Most signal processing libraries (I used scipy.signal) have such functionality and the wikipedia page has a neat derivation.
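
For reference, this is all it takes with scipy to get the coefficients (and the responses plotted below):

import numpy as np
from scipy.signal import savgol_coeffs, freqz

coeffs = savgol_coeffs(7, 2)   # 7-point window, degree 2 (same as degree 3)
print(np.round(coeffs, 4))     # [-0.0952  0.1429  0.2857  0.3333  0.2857  0.1429 -0.0952]
w, h = freqz(coeffs)           # frequency response, used in the next section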

Plotting the impulse responses for a bunch of 7 and 5 point filters:

The degree of 0 – an average of all points – is obviously just a plain box filter.

But the higher degrees start to be a bit more interesting – looking like a mix of lowpass (expected) and bandpass filters (a bit more surprising to me?).

It’s time to do some spectral analysis on those!

Analyzing frequency response – comparison with common filters

If we plot the frequency response of those filters, it’s visually clear why they have smoothing properties, but not oversmoothing:

As compared to a box filter (Savgol filters of degree 0) and a binomial/Gaussian-like [0.25, 0.5, 0.25] filter (my go to everyday generic smoother – be sure to check out my last post!), they preserve a perfectly flat frequency response for more frequencies.

What I immediately don’t like about it is that, similarly to a box filter, it is going to produce an effect similar to comb filtering, preserving some frequencies and filtering out some others – an oscillating frequency response; they are not lowpass filters. So for example a Savgol filter of 7 points and degree 4 completely zeroes out frequencies around 2.5 radians per sample, while preserving almost 40% at Nyquist! Leaving in frequencies around Nyquist often produces a “boxy”, grid-like appearance.

When filtering images, it’s definitely an undesirable property and one of the reasons I almost never use a box filter.

We’ll see in a second why such a response can be problematic on a real signal example.

On the other hand, preserving some more of the lower frequencies untouched could be a good property if you have for example a purely high frequency (blue) noise to filter out – I’ll check this theory in the next section (I won’t spoil the answer, but before you proceed I recommend thinking why it might or might not be the case).

Here’s another comparison – with IIR style smoothing, classic simple exponential moving average (probably the most common smoothing used in gamedev and graphics due to very low memory requirements and great performance – basically output = lerp(new_value, previous_output, smoothing) ) and different smoothing coefficients.

What is interesting here is that the Savitzky-Golay filter behaves a bit like the “opposite” – it keeps much more of the lower frequencies and decays to zero at some points, while the IIR never really reaches a zero.

Gaussian-like binomial filter on the other hand is somewhere in between (and has a nice perfect 0 at Nyquist).

Example comparison on 1D signals

On relatively smooth signals and with smaller filter spatial support, there’s not much difference between a higher degree savgol and the “Gaussian” (while the box filter is pretty bad!):

While on noisier signals and with higher spatial support, they definitely show less rounding of the corners than a binomial filter:

On any kind of discontinuity it also preserves the sharp edge, but…

Looks like the Savitzky-Golay filter causes some overshoots. Yes, it doesn’t smoothen the edge, but to me overshooting over the value of the noise is a very undesirable behavior; at least for images where we deal with all kinds of discontinuities and edges all the time…

Example comparison on images

…So let’s compare it on some example image itself.

Note that I have used it in separable manner (convolution kernel is an outer product of 1D kernels; or alternatively applying 1D in horizontal, and then vertical direction), not doing the regression on a 2D grid.

…At first glance, I quite “perceptually” like the results for the sharpest Savitzky-Golay filter…

…on the other hand, it makes the existing oversharpening of the image from Kodak dataset “wider”. It also produces a halo around the jacket on the left of the image for the 7,2 filter. It is not great at removing noise, way worse than even a simple 3 point binomial filter. It produces a “boxy pixels” look (result of both remaining high frequencies, as well as being separable / non-isotropic).

I thought they could be great on filtering out some “blue” noise…

…and it turns out not so much… The main reason is this lack of filtering of the highest frequencies:

…behavior around Nyquist – 40% of the noise and signal is preserved there!

Luckily, there is a way to improve the behavior of such filters around Nyquist (by compromising some of the preservation of higher frequencies).

Windowing Savitzky-Golay filters

One of the reasons for the not-so-great behavior is the use of equal weights for all samples. This is equivalent to applying a rectangular window, and causes “ripples” in the frequency domain.

This is similar to taking a “perfect” lowpass filter – a sinc – and truncating it, which produces undesirable artifacts. Common Lanczos resampling improves on this by applying an (also sinc!) window function.

Let’s try the same on the Savitzky-Golay filters. I will use a Hann window (squared cosine), no real reason here, other than “I like it” and I found it useful in some different contexts – overlapped window processing (sums to 1 with fully overlapped windows).

Applying a window here is as easy as simply multiplying the weights by the window function and renormalizing them. Quick visualization of the window function and windowed weights:

Note that the window function tends to shrink the outer samples strongly towards zero, so to get similar behavior you often need to go with a larger spatial support filter (a larger window).
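
In scipy terms, the windowing might look like this (a sketch – note that I evaluate the Hann window slightly wider than the filter, so the outermost taps are strongly shrunk rather than zeroed out; the exact windowing choice is up to you):

import numpy as np
from scipy.signal import savgol_coeffs

def windowed_savgol_coeffs(window_length=9, polyorder=4):
    coeffs = savgol_coeffs(window_length, polyorder)
    window = np.hanning(window_length + 2)[1:-1]   # Hann window without its zero endpoints
    weighted = coeffs * window
    return weighted / np.sum(weighted)             # renormalize to preserve DC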

How does it affect the frequency response?

It removes (or at least, reduces) some of the high frequency ripples, but at the same time makes the filter more lowpass (reduces its mid frequency preservation).

Here is how it looks on images:

If you look closely, box-shaped artifacts are gone in the first two of the windowed versions, and are reduced in the 9,6 case.

Savitzky-Golay smoothened gradients

So far I wasn’t super impressed with the Savitzky-Golay filters (for my use-cases) without windowing, but then I started to read more about their typical use – producing smoothened signal gradients / derivatives.

I have written about image (and discrete signal in general) gradients a while ago, and if you read my post… it’s not a trivial task at all. Lots of techniques with different trade-offs, lots of maths and physics connections, and somehow any solution I picked felt like a messy compromise.

The idea of Savitzky-Golay gradients is kind of brilliant – to compute the derivative at a given point, just compute the derivative of the fitted polynomial. This is extremely simple and again has a closed formula solution.
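
With scipy it is a one-liner – just ask for the first derivative of the local polynomial fit (synthetic example data):

import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=x.size)
# Derivative of the fitted polynomial; delta is the sample spacing.
dy = savgol_filter(y, window_length=7, polyorder=3, deriv=1, delta=x[1] - x[0])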

Note however that now we definitely care about differences between odd and even polynomial degrees (look at the slope of lines around 0):

The derivative at 0 can even have an opposite sign between degrees 2 vs 3 and 4 vs 5.

Since we get the convolution coefficients, we can also plot the frequency responses:

That’s definitely interesting and potentially useful with the higher degrees.

And it’s a neat mental concept (differentiating a continuous reconstruction of the discrete signal) and something I definitely might get back to.

Summary – smoothing filter recommendations

Note: this section is entirely subjective!

After playing a bit with the Savitzky-Golay filters, I personally wouldn’t have a lot of use for them for smoothing digital sample sequences (your mileage may vary).

For preserving edges and not rounding corners, it’s probably best to go with some non-linear filtering – from a simple bilateral, anisotropic diffusion, to Wiener filtering in wavelet or frequency domain.

When edge preservation is less important than just generic smoothing, I personally go with Gaussian-like filters (like the mentioned binomial filter) for their simplicity, trivial implementation, very compact spatial support. Or you could use some proper windowed filter tailored for your desired smoothing properties (you might consider windowed Savitzky-Golay).

I generally prefer to not use exponential moving average filters: a) they are kind of the worst when it comes to preserving mid and lower frequencies b) they are causal and always delay your signal / shift the image, unless you run it forwards and then backwards in time. Use them if you “have to” – like the classic use in temporal anti-aliasing where memory is an issue and storing (and fetching / interpolating) more than one past framebuffer becomes prohibitively expensive. There are obviously some other excellent IIR filters though, even for images – like Domain Transform, just not the exponential moving average.

On the other hand, the application of Savitzky-Golay filters to producing smoothened / “denoised” signal gradients was a bit of an eye opener to me and something I’ll definitely consider the next time I play with gradients of noisy signals. The concept of reconstructing a continuous signal in a simple form, and then doing something with the continuous representation is super powerful and I realized it’s something we explicitly used for our multi-frame super-resolution paper as well.


Practical Gaussian filtering: Binomial filter and small sigma Gaussians

Gaussian filters are the bread and butter of signal and image filtering. They are isotropic and radially symmetric, filter out high frequencies extremely well, and just look pleasant and smooth.

In this post I will cover two of my favorite small Gaussian (and Gaussian-like) filtering “tricks” and caveats that are not appreciated by textbooks, but are important in practice.

The first one is binomial filters – a generalization of literally the most useful (in my opinion) filter in computer graphics, image processing, and signal processing.

The second one is discretizing small sigma Gaussians – did you ever try to compute a Gaussian of sigma 0.3 and wonder why you get close to just [0.0, 1.0, 0.0]?

You can treat this post as two micro-posts that I didn’t feel like splitting and cluttering too much.

Binomial filter – fast and great approximation to Gaussians

First up in the bag of simple tricks is the “binomial filter”, as called by Aubury and Luk.

This one is in an interesting gray zone – if you do any signal or image filtering on CPUs or DSPs, it’s almost certain you have used them (even if you didn’t know them under such a name). On the other hand, through almost ~10 years of GPU programming I had not heard of them.

The reason is that their design is perfect for fixed point arithmetic (but I’d argue it is actually also neat and useful with floats).

Simplest example – [1, 2, 1] filter

Before describing what “binomial” filters are, let’s have a look at the simplest example of them – a “1 2 1” filter.

[1, 2, 1] refers to multiplier coefficients that then are normalized by dividing by 4 (a power of two -> can be implemented either by a simple bit shift, or super efficiently in hardware!).

In a shader you can use directly a [0.25, 0.5, 0.25] form instead – and for example on AMD hardware those small inverse power of two multipliers can become just instruction modifiers!

This filter is pretty close to a Gaussian filter with a sigma of ~0.85.

Here is the frequency response of both:

For the Gaussian, I used a 5 point Gaussian to prevent excessive truncation -> effective coefficients of [0.029, 0.235, 0.471, 0.235, 0.029].

So while the binomial filter here deviates a bit from the Gaussian in shape, unlike this sigma of Gaussian it has a very nice property of reaching a perfect 0.0 at Nyquist. This makes it a perfect filter for bilinear upsampling.
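
This is easy to verify numerically – the magnitude response of [0.25, 0.5, 0.25] is 0.5 + 0.5*cos(w): unity at DC and exactly zero at Nyquist:

import numpy as np
from scipy.signal import freqz

w, h = freqz([0.25, 0.5, 0.25], worN=[0.0, np.pi / 2.0, np.pi])
print(np.abs(h))   # ~[1.0, 0.5, 0.0]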

In practice, this truncation and difference is not going to be visible for most signals / images; here is an example (applied “separably”):

Last panel is the difference enhanced x10. It is the largest on a diagonal line due to how only a “true” Gaussian is isotropic. (“You say Gaussians have failed us in the past, but they were not the real Gaussians!” 😉 )

For quick visualization, here is the frequency response of both with a smooth isoline around 0.2:

Left: [121] filter applied separably, Right: Gaussian with a comparable sigma.

As you can see, Gaussian is perfectly radial and isotropic, while a binomial filter has more “rectangular” response, filtering less of the diagonals.

Still, I highly recommend this filter anytime you want to do some smoothing – a) it’s extremely fast to implement and simple to memorize, b) it’s just 3 taps in a single direction, so it can also easily be done as a non-separable 9-tap one, c) it has a zero at Nyquist, making it great for both downsampling and upsampling (it removes grid artifacts!).

Binomial filter coefficients

This was the simplest example of the binomial filter, but what does it have to do with the binomial coefficients or distribution? It is the third row from Pascal’s triangle.

You can construct the next Binomial filter by taking the next row from it.

Since we typically want odd-sized filters, I would compute the following row (ok, I have a few of them memorized) of “4 choose k” for k from 0 to 4:

[1, 4, 6, 4, 1]

The really cool thing about Pascal’s triangle rows is that they always sum to a power of two – this means that here we can normalize simply by dividing by 16!

…or alternatively you can completely ignore the “binomial” part and look at it as repeated convolution. If we keep convolving [1, 1] with itself, we get the next binomial coefficients:

import numpy as np

a = np.array([1, 1])
b = np.array([1])
for _ in range(7):
  print(b, np.sum(b))
  b = np.convolve(a, b)  # each convolution with [1, 1] produces the next row

[1] 1
[1 1] 2
[1 2 1] 4
[1 3 3 1] 8
[1 4 6 4 1] 16
[ 1  5 10 10  5  1] 32
[ 1  6 15 20 15  6  1] 64

At the same time, expanding the coefficients of (a + b)^n == (a + b)(a + b)…(a + b) is exactly the same as repeated convolution.

From this repeated convolution we can also see why the coefficients sum to powers of two – we always get 2x more from convolving with [1, 1].

Larger binomial filters

Here the [1, 4, 6, 4, 1] is approximately equivalent to a Gaussian of sigma ~1.03.

The difference starts to become really tiny. Similarly, the next two binomial filters:

Wider and wider binomial filters

Now something that you might have noticed on the previous plot is that I have marked Gaussian sigma squared as 3 and 4. For the next ones, the pattern will continue, and distributions will be closer and closer to Gaussian. Why is that so? I like how the Pascal triangle relates to a Galton board and binomial probability distribution. Intuitively – the more independent “choices” a falling ball can make, the closer final position distribution will resemble the normal one (Central Limit Theorem). Similarly by repeating convolution with a [1, 1] “box filter”, we also get closer to a normal distribution. If you need some more formal proof, wikipedia has you covered.

Again, all of them will sum up to a power of two, making a fixed point implementation very efficient (just widening multiply adds, followed by optional rounding add and a bit shift).

This leads us to a simple formula: if you want a Gaussian with a certain sigma, square it, and compute the binomial coefficients of “(4*sigma^2) choose k”.
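
As a small sketch (rounding n to an even number so that the filter has odd length):

import numpy as np
from scipy.special import comb

def binomial_filter_for_sigma(sigma):
    n = 2 * int(round(2.0 * sigma * sigma))   # n = 4*sigma^2, rounded to even
    coeffs = comb(n, np.arange(n + 1))        # row n of Pascal's triangle
    return coeffs / np.sum(coeffs)            # the sum is 2^n

print(binomial_filter_for_sigma(1.0))   # [0.0625 0.25 0.375 0.25 0.0625] == [1, 4, 6, 4, 1] / 16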

For example for sigma 2 and 3, almost perfect match:

Binomial filter use – summary

That’s it for this trick!

Admittedly, I rarely had uses for anything above [1, 4, 6, 4, 1], but I think it’s a very useful tool (especially for image processing on a CPU / DSP) and worth knowing the properties of those [1, 2, 1] filters that appear everywhere!

Discretizing small sigma Gaussians

The Gaussian (normal) probability distribution is well defined on an infinite, continuous domain – but it never reaches zero. When we want to discretize it for filtering a signal or an image – things become less well defined.

First of all, the fact that it never reaches zero means that we should use “infinite” filters… Luckily, in practice it decays so quickly that using ceil(2-3 sigma) for radius is enough and we don’t need to worry about this part of the discretization.

The other one is more troublesome and ignored by most textbooks or articles on Gaussian filtering – we point sample a probability density function, which can work fine for larger filters (sigma above ~0.7), but definitely causes errors and undersampling for small sigmas!

Let’s have a look at an example and what’s going on there.

Gaussian sigma of 0.4…?

If we have a look at Gauss of sigma 0.4 and try to sample the (unnormalized) Gaussian PDF, we end up with:

The values sampled at [-1, 0, 1] end up being: [0.0438, 0.9974, 0.0438].

This is… extremely close to 0.0 for the non-center taps.
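
In numpy terms (the values match the plots and the table further below):

import numpy as np

sigma = 0.4
x = np.array([-1.0, 0.0, 1.0])
pdf = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
print(np.round(pdf, 4))              # [0.0438 0.9974 0.0438]
print(np.round(pdf / pdf.sum(), 4))  # [0.0404 0.9192 0.0404] after renormalization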

At the same time, if you look at all the values at the pixel extent, it doesn’t seem to correspond well to the sampled value:

If you wonder if we should look at the pixel extent and treat it as a little line/square (oh no! Didn’t I read the memo..?), the answer is (as always…) – it depends. It assumes that nearest-neighbor / box would be used as the reconstruction function of the pixel grid. Which is not a bad assumption with contemporary displays actually displaying pixels, and it simplifies a lot of the following math.

On the other hand, value for the center is clearly overestimated under those assumptions:

I want to emphasize here – this problem is +/- unique to small Gaussians. If we look at a mere sigma of 0.8, the side samples already start to look much closer to a proper representative value:

Yes, edges and the center still don’t match, but with increasing radii, this problem becomes minimal even there.

Literature that observes this issue (most don’t!) suggests to ignore it; and they are right – but only assuming that you don’t want to use a sigma of anything below 1.0.

Which sometimes you actually want to use; especially when modelling some physical effects. Modeling small Gaussians is also extremely useful for creating synthetic training data for neural networks, or tasks like deblurring.

Integrating a Gaussian

So how do we compute an average value of Gaussian in a range?

There is good news and bad news. 🙂 

The good news is that this is very well studied – we just need to integrate a Gaussian, and we can use the error function. The bad news… it doesn’t have any nice closed formula; even in a numpy/scipy environment it is treated as a “special” function. (In practice those are often computed by a set of approximations / tables followed by Newton-Raphson or similar corrections to get to the desired precision.)

We just evaluate the error function at the pixel extent endpoints, subtract the two values, and divide by two.

The formula for a Gaussian integral “corrected” this way becomes:

import numpy as np
import scipy.special

def gauss_small_sigma(x: np.ndarray, sigma: float):
  # Average of the Gaussian PDF over each pixel extent [x - 0.5, x + 0.5].
  p1 = scipy.special.erf((x-0.5)/sigma*np.sqrt(0.5))
  p2 = scipy.special.erf((x+0.5)/sigma*np.sqrt(0.5))
  return (p2-p1)/2.0
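
And a quick usage example for the three-tap, sigma 0.4 case (using the function above; values match the table below after renormalization):

import numpy as np

w = gauss_small_sigma(np.array([-1.0, 0.0, 1.0]), 0.4)
print(np.round(w / np.sum(w), 4))   # [0.1056 0.7888 0.1056]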

Is it a big difference? For small sigmas, definitely! For the sigma of 0.4, it becomes [0.1056, 0.7888, 0.1056] instead of [0.0404, 0.9192, 0.0404]!

We can see on the plots how well this value represents the average:

When we renormalize the coefficients, we get this huge difference.

The difference is also (hopefully) visible with the naked eye:

Left: Point sampled Gaussian with a sigma of 0.4, Right: Gaussian integral for sigma of 0.4.

Verifying the model and some precomputed values

To verify the model, I also plotted a continuous Gaussian frequency response (which is super usefully also a Gaussian, but with an inverse sigma!) and compared it to two small sigma filters, 0.4 and 0.6:

The difference is dramatic for the sigma of 0.4, and much smaller (but definitely still visible; underestimating “real” sigma) for 0.6.

How does it fare in 2D? Luckily, with the error function, we can still use a product of two error functions (separability!). I expected that it wouldn’t be radially symmetric and fully isotropic anymore (we are integrating its response with rectangles), but luckily it’s actually not bad and way better than the undersampled Gaussian:

Left: Continuous 2D Gaussian with sigma 0.5 frequency response. Middle: Integrated 2D Gaussian with sigma 0.5 frequency response. Right: undersampled Gaussian with sigma 0.5 frequency response.

Plot shows that the integrated Gaussian has very nicely radially symmetric frequency response until we start getting close to Nyquist and very high frequencies (which is expected) – looks reasonable to me, unlike the undersampled Gaussian which suffers from both this problem, diamond-like response in lower frequencies, and being overall far away in the response shape/magnitude from the continuous one.

Finally, I have tabulated some small sigma Gaussian values (for quickly plugging those in if you don’t have time to compute). Here’s a quick cheat sheet:

Sigma | Undersampled             | Integral                 | Mean relative error
0.2   | [0.0,    1.0,    0.0   ] | [0.0062, 0.9876, 0.0062] | 67%
0.3   | [0.0038, 0.9923, 0.0038] | [0.0478, 0.9044, 0.0478] | 65%
0.4   | [0.0404, 0.9192, 0.0404] | [0.1056, 0.7888, 0.1056] | 47%
0.5   | [0.1065, 0.787,  0.1065] | [0.1577, 0.6845, 0.1577] | 27%
0.6   | [0.1664, 0.6672, 0.1664] | [0.1986, 0.6028, 0.1986] | 14%
0.7   | [0.2095, 0.5811, 0.2095] | [0.2288, 0.5424, 0.2288] | 8%
0.8   | [0.239,  0.522,  0.239 ] | [0.2508, 0.4983, 0.2508] | 5%
0.9   | [0.2595, 0.481,  0.2595] | [0.267,  0.466,  0.267 ] | 3%
1.0   | [0.2741, 0.4519, 0.2741] | [0.279,  0.442,  0.279 ] | 2%

To conclude this part, my recommendation is to always consider the integral if you want to use any sigmas below 0.7.

And for a further exercise for the reader, you can think of how this would change under some different reconstruction function (hint: it becomes a weighted integral, or an integral of convolution; for the box / nearest neighbor, the convolution is with a rectangular function).


Modding Korg MS-20 Mini – PWM, Sync, Osc2 FM

Almost every graphics programmer I know is playing right now with some electronics (virtual high five!), so I won’t be any different. But I decided to make it “practical” and write something that might be useful to others – how I modded a Korg MS-20 Mini.

The most important ones are a PWM mod input – a quick phone video grab here (poor quality audio, demo only, might upload separately recorded MP3s later):

And oscillator sync together with an oscillator 2 FM input:

I am not taking any credit for inventing the described modifications, but I am going to cover them in some more detail as compared to the previous sources.

Korg MS-20

Korg MS-20 mini is a reissue of the legendary Korg synthesizer from 1978 (!), made back then as a cheap alternative to the super expensive Moogs of that era. It has two extremely “dirty”, screeching filters that became famous throughout the emerging electronic music scene, wherever a distorted, aggressive sound fit perfectly. It also has very non-linear behavior and weird interactions between the lowpass and highpass filters when their resonance is cranked up, making it a very hands-on type of gear with many cool sounding sweet spots.

I mean, just look at the wikipedia list of its notable users:

From organic “Da Funk” of Daft Punk, through dirty blog house electro of MSTRKRFT, The Presets, and Mr Oizo, broken beats of Aphex Twin, to pulsating rhythms of the Industrial EBM sounds of Belgian and German artists of the 80s – this is an absolute classic, in many genres exceeding the use of Roland Juno’s or Moogs.

It was both super affordable (well you know… most bands on that list are not prog rock millionaires who could have afforded Moog or Roland modular systems, or even Minimoogs 🙂 ), and it just sounds great on its own.

I tried some emulations and thought they were pretty cool, but nothing extraordinary – yet when I visited my friend Andrzej and we jammed a bit on it + some other synths, I knew I needed to get the hardware one!

The semi-modular architecture allows for great flexibility and some “weird” patching and exploration of sound design and synthesis (it’s easy to synthesize a whole drum kit!), and the External Signal Processor is something extraordinary and not really seen anymore.

There are a few quirks (the way its envelopes work) and notable limitations as compared to modern synths – but some of those limitations can be “fixed”! Namely: the lack of PWM control of the square/pulse wave, the lack of oscillator sync, the lack of flexible oscillator wave mixing for the first oscillator, and the lack of separate frequency modulation of the second oscillator.

Modding MS-20 – sources

Luckily, the synth has been around for such a long time that many have explored extremely easy mods to allow for that and it’s pretty well documented.

I have followed mainly the Korg MS-20 Retro Expansion by Martin Walker, who in turn expanded on the work of Auxren: MS-20 Mini PCB Photos and OSC Sync Mod, MS-20: PWM CV In and Passive 4X Multiplier.

Please visit those links, and check out their diagrams, which I’m not going to re-post out of respect for the original authors.

I used them as my guidance and definitely without them would not have been able to mod it.

That said, Martin Walker’s description had a tiny bug in the text (the diagrams and photos are rock solid and greatly valuable!), and Auxren didn’t cover the (to me! 🙂 ) most exciting Osc2 FM mod, which is essential to make the oscillator sync sound like its most common, modulated uses.

One of the mods – the PWM input jack – is extremely easy. It requires less disassembly and the soldering is trivial; the same goes for the separate oscillator outputs. Those I can recommend to almost everyone.

The other ones are a bit more tricky and involve soldering to the top of tiny surface components.

Required components, tools, and skills

Here’s a list of minimum what you need to have (apologies if I didn’t get all the tool names right, English is not my native language):

  • A set of screwdrivers. Among those you’d probably want something pretty small and short, as some of the unscrewing happens inside the synth with limited space.
  • A drill with a set of metal drill bits to drill the holes for the pots and jacks. Step drills are an option as well if you know how to use them.
  • A set of wrench keys. Surprisingly not as useful as I thought (more on that later).
  • Soldering kit. Hopefully you have a good one already!
  • A decent multimeter and potentially a component tester.
  • Some optional clamps / holders. This makes a lot of steps easier and safer.
  • Safety goggles and gloves for drilling.
  • Some non-electrostatic (if possible) cloth to put components safely on.
  • Small wire cutters.
  • Some sanding paper.

You will need to buy / have / use as components:

  • 5 mono minijack (3.5mm jack) sockets.
  • Two 100k resistors, one 10k resistor. But it wouldn’t hurt just buying some resistor set.
  • A mini toggle switch (on/off, 2 pins are enough).
  • A single diode like 1N4148 or similar – doesn’t really matter.
  • Three 100k mini potentiometers. Martin suggests log taper (type A), but I’d probably suggest two type A for the volume, and one type B linear for the FM mod intensity.
  • Bunch of colored cables.

Finally, I’d say that if you have never soldered before, I would advise against modding a several-hundred-dollar synth (even if bought second hand…) that has tons of tiny components, where you’d need to solder to some extremely tiny and fragile surface-mounted spots.

If anything, stick to the relatively easy PWM mod.

You can very easily burn off parts of the PCB, or create solder blobs that bridge things that are not supposed to be connected together…

Despite the very small amount of soldering required, this is not really a starter DIY project – but I won’t say it’s impossible (I don’t want to gatekeep anyone, and learning by doing with some extra “pressure” is also fun). I’d still recommend first soldering together a guitar pedal or a delay effect, practicing soldering in general, and learning what the potential issues are, etc.

Opening it up

Before opening any piece of hardware, the most important advice:

Always take photos of everything you are about to disassemble.

Always place all the parts separately and clearly marked.

This will remove quite a lot of the stress of “figuring out how to put it back together”.

The first step (which I actually did a bit later – the order doesn’t matter, but it’s easiest when done first) – take off all the knobs. Large ones come off easily, smaller ones take some delicate pulling from different sides. Don’t use a screwdriver to lift them – they are made of quite delicate plastic and you could damage them!

Important note: unfortunately, the silver jack “screws” are just a decoration. I tried taking them off using a wrench only to realize I was breaking some plastic. 😦 Leave those in place. I was disappointed at the quality of this and the cost cutting measures… Other than that, the MS-20 mini is built well and very smartly (in my very limited experience).

However, do unscrew the black plastic headphone minijack with a wrench key.

I suggest taking off one of the sides (I took off the one closer to the patchbay), and the back screws.

This is how it looks inside, with its back taken off:

Now, the cool part is that to do just the PWM input mod, you don’t need to disassemble anything anymore! Just solder the cable to the back of the board in the required point (described later) and you’re done!

If you want to go with the other mods, then you need to take off some more screws. Those include some on the other side (you will see which ones), as well as this annoying metal piece that holds everything more stable:

It’s annoying as I later needed to trim it due to my incorrect pot placement, and it is a bit hard to reach, but you should have no problem.

After that and a few screws on the surface, the main board comes free – very carefully (watch out for the connected keyboard cables!) put it down:

Finally, you’ll need a wrench key to take off the nuts holding pots onto metal grounding shield:

And voila, Korg MS-20 mini (almost) disassembled!

Soldering wires

Ok, so now for the tricky part – soldering.

On the back it’s almost trivial:

You solder to one of the ground points (black cable), there are plenty of those, to the pulse width pot (purple cable), and to three middle legs of the oscillator wave selection switch.

On the front, things get trickier:

The osc sync requires soldering two cables at two diodes, marked on my picture with red and yellow – D4 and D7. D4 is an annoying one, you need to solder very precisely between some other components.

Virtual mixing ground for inputs is purple cable at IC14 op-amp.

Finally, for the oscillator 2 FM mod, you need to solder to R183 (green cable). This is where Martin’s diagram is correct, but his text (mentioning soldering to the wiper of the potentiometer) is a bit wrong.

The orange cable was there and was not soldered by me. 🙂 

At this point, I suggest double checking every spot and probing around with a multimeter (with the synth powered off and disconnected, at least as the first step) to make sure you didn’t short anything.

Drilling holes and mounting pots

Unfortunately I didn’t take photos from my drilling process.

It’s pretty standard – mark the spots using a ruler and some sharp point to “puncture” the paint there, then progressively drill, first with a small bit and then with larger and larger ones, until your switch/pots/jacks go through without a problem. You can use step drills – I even bought some, but decided not to use them (as I have never used those before).

After each step, clean everything up from scraps, make sure everything is solidly mounted, and at the end optionally sandpaper around the holes.

Now, other than a few scratches (that I have no idea how they happened, but they don’t bother me personally too much…), I made a more serious mistake in my hole placement.

The problem is that this placement: 

Is too far to the left by ~1.5cm (more than half an inch). 😦 This made putting it back problematic with this part:

This is not visible in the picture, but inside, it has a metal piece that touches the upper part. For me it was overlapping with the mini switch and mini jack socket… so I had to cut it slightly. I don’t think it’s a problem – everything worked out well and it holds well – but it was stressful, so be sure to verify the dimensions / sizing.

After your pots and jacks are in place, get back to soldering. I suggest starting with connecting all the grounds together – you are going to need a lot of cables there and be smart about their placement to not turn it into a mess.

It’s much easier to do it when you don’t have everything mounted together and other cables “dangling” around. Then solder the diode and the resistors to the legs of the switch / potentiometers.

At the end finally solder the cables from the mainboard – and be careful at this stage to not “pull” anything.

Putting it back

Before you put everything back together, I suggest – very carefully, taking care not to short anything and with proper insulation – turning the MS-20 on and starting to test. Does it work as expected with the basic functionality? Is the new functionality working?

(Internally, it uses 18V max, so it’s not very dangerous to you, but still be careful not to fry the synth!)

If it does (hopefully on the first go!), it’s time to start putting everything back together. Along the way, at two more stages (after putting the main board in its place, and after screwing in everything except for the back) I again tested that everything was ok.

Final results

Overall it took me a few whole evenings (a few hours each).

Generally I wouldn’t expect this to take less than ~2 days, unless you know exactly what you’re doing and you’re less scared of screwing something up than I was.

It was stressful at times (soldering), but other than a few scratches that you can clearly see on my photos, everything turned out well.

Ok, so this is again how the synth looks now (oops, didn’t realize it’s so dusty):

It was a super fun adventure and I love how it sounds! 🙂 

I was also impressed with how one puts together such a large, “mechanical” construction – mostly from plastic and cheap parts, but other than faux jack covers, built really well.

Posted in Audio / Music / DSP

Processing aware image filtering: compensating for the upsampling

This post summarizes some thoughts and experiments on “filtering aware image filtering” I’ve been doing for a while.

The core idea is simple – if you have some “fixed” step at the end of the pipeline that you cannot control (for any reason – from performance to something as simple as someone else “owning” it), you can compensate for its shortcomings with a preprocessing step that you control/own.

I will focus here on the upsampling -> if we know the final display medium, or an upsampling function used for your image or texture, you can optimize your downsampling or pre-filtering step to compensate for some of the shortcomings of the upsampling function.

Namely – if we know that the image will be upsampled 2x with a simple bilinear filter, can we optimize the earlier, downsampling operation to provide a better reconstruction?

Related ideas

What I’m proposing is nothing new – others have touched upon those before.

In the image processing domain, I’d recommend two reads. One is “generalized sampling”, a massive framework by Nehab and Hoppe. It’s a very impressive set of ideas, but a difficult read (at least for me) and an idea that hasn’t been popularized more – but probably should be.

The second one is a blog post by Christian Schüler, introducing compensation for the screen display (and “reconstruction filter”). Christian mentioned this on twitter and I was surprised I have never heard of it (while it makes lots of sense!).

There were also some other graphics folks exploring related ideas of pre-computing/pre-filtering for approximation (or solution to) a more complicated problem using just bilinear sampler. Mirko Salm has demonstrated in a shadertoy a single sample bicubic interpolation, while Giliam de Carpentier invented/discovered a smart scheme for fast Catmul-Rom interpolation.

As usual with anything that touches upon signal processing, there are many more audio related practices (not surprisingly – telecommunication and sending audio signals were solved by clever engineering and mathematical frameworks for many more decades than image processing and digital display that arguably started to crystalize only in the 1980s!). I’m not an expert on those, but have recently read about Dolby A/Dolby B and thought it was very clever (and surprisingly sophisticated!) technology related to strong pre-processing of signals stored on tape for much better quality. Similarly, there’s a concept of emphasis / de-emphasis EQ, used for example for vinyl records that struggle with encoding high magnitude low frequencies.

Edit: Twitter is the best reviewing venue, as I learned about two related publications. One is a research paper from Josiah Manson and Scott Schaefer looking at the same problem, but specifically for mip-mapping and using (expensive) direct least squares solves, and the other one is a post from Charles Bloom wondering about “inverting box sampling”.

Problem with upsampling and downsampling filters

I have written before about image upsampling and downsampling, as well as bilinear filtering.

I recommend those reads as a refresher or a prerequisite, but I’ll do a blazing fast recap. If something seems unclear or rushed, please check my past post on up/downsampling.

I’m going to assume here that we’re using “even” filters – standard convention for most GPU operations.

Downsampling – recap

A “Perfect” downsampling filter according to signal processing would remove all frequencies above the new (downsampled) Nyquist before decimating the signal, while keeping all the frequencies below it unchanged. This is necessary to avoid aliasing.

Its frequency response would look something like this:

If we fail to anti-alias before decimating, we end up with false frequencies and visual artifacts in the downsampled image. In practice, we cannot obtain a perfect downsampling filter – and arguably would not want to. A “perfect” filter from the signal processing perspective has infinite spatial support, causes ringing and overshooting, and cannot be practically implemented. Instead, we typically weigh some of the trade-offs like aliasing, sharpness, and ringing, and pick a compromise filter based on those. Good choices are some variants of bicubic (efficient to implement, flexible parameters) or Lanczos (windowed sinc) filters. I cannot praise highly enough the seminal paper by Mitchell and Netravali about those trade-offs in image resampling.

Usually when talking about image downsampling in computer graphics, the often selected filter is bilinear / box filter (why are those the same for the 2x downsampling with even filters case? see my previous blog post). It is pretty bad in terms of all the metrics, see the frequency response:

It has both aliasing (area to the right of half Nyquist), as well as significant blurring (area between the orange and blue lines to the left of half Nyquist).

Upsampling – recap

Interestingly, a perfect “linear” upsampling filter has exactly the same properties!

Upsampling can be seen as zero-insertion between the original samples, followed by some filtering. Zero-insertion squeezes the frequency content and “duplicates” it due to aliasing.

When we filter it, we want to remove all the new, unnecessary frequencies – ones that were not present in the source. Note that the nearest neighbor filter is the same as zero-insertion followed by filtering with a [1, 1] convolution (check for yourself – this is very important!).
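If you’d like to check it numerically, here’s a tiny 1D sketch (my own, not from the post):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0])
nearest = np.repeat(x, 2)                        # nearest neighbor 2x upsampling
zero_inserted = np.zeros(2 * len(x))
zero_inserted[::2] = x                           # zero-insertion between the samples
via_convolution = np.convolve(zero_inserted, [1.0, 1.0])[:2 * len(x)]
print(np.allclose(nearest, via_convolution))     # True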

Nearest neighbor filter is pretty bad and leaves lots of frequencies unchanged.

A classic GPU bilinear upsampling filter is the same as the filter [0.125, 0.375, 0.375, 0.125] – it removes some of the aliasing, but it is also a strongly over-blurring filter:

We could go into how to design a better upsampler, go back to the classic literature, and even explore non-linear, adaptive upsampling.

But instead we’re going to assume we have to deal and live with this rather poor filter. Can we do something about its properties?

Downsampling followed by upsampling

Before I describe the proposed method, let’s have a look at what happens when we apply a poor downsampling filter and follow it by a poor upsampling filter. We will get even more overblurring, while still getting remaining aliasing:

This is not just a theoretical problem. The blurring and loss of frequencies on the left of the plot is brutal!

You can verify it easily by looking at a picture compared to a downsampled and then upsampled version of it:

Left: original picture. Right: downsampled 2x and then upsampled 2x with a bilinear filter.

In motion it gets even worse (some aliasing and “wobbling”):

Compensating for the upsampling filter – direct optimization

Before investigating some more sophisticated filters, let’s start with a simple experiment.

Let’s say we want to store an image at half resolution, but would like it to be as close as possible to the original after upsampling with a bilinear filter.

We can simply directly solve for the best low resolution image:

Unknown – low resolution pixels. Operation – upsampling. Target – original full resolution pixels.  The goal – to have them be as close as possible to each other.

We can solve this directly, as it’s just a linear least squares. Instead I will run an optimization (mostly because of how easy it is to do in Jax! 🙂 ). I have described how one can optimize a filter – optimizing separable filters for desirable shape / properties – and we’re going to use the same technique.
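To make this concrete, here’s a minimal sketch of what such a per-image optimization could look like in Jax. It’s my own toy version (with made-up names, assuming a float `target` image with even dimensions), not the exact code used for the figures:

import jax
import jax.numpy as jnp
from jax.scipy.signal import convolve2d

def bilinear_upsample_2x(img):
  # Zero-insertion followed by a separable convolution with [0.25, 0.75, 0.75, 0.25]
  # (the [0.125, 0.375, 0.375, 0.125] filter scaled to keep unit gain after zero-insertion).
  k1d = jnp.array([0.25, 0.75, 0.75, 0.25])
  upsampled = jnp.zeros((img.shape[0] * 2, img.shape[1] * 2)).at[::2, ::2].set(img)
  return convolve2d(upsampled, jnp.outer(k1d, k1d), mode='same')

def loss(low_res, target):
  return jnp.mean((bilinear_upsample_2x(low_res) - target) ** 2)

@jax.jit
def gradient_step(low_res, target, lr=0.5):
  return low_res - lr * jax.grad(loss)(low_res, target)

# Usage sketch: initialize e.g. with a box-downsampled image and iterate a few hundred steps:
# low_res = target.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
# for _ in range(500):
#   low_res = gradient_step(low_res, target)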

This is obviously super slow (and the optimization is done per image!), but it has the advantage of being data dependent. We don’t need to worry about aliasing – depending on the presence or absence of some frequencies in an area, we might not need to preserve them at all, or might not care about the aliasing – and the solution will take this into account. Conversely, other frequencies might dominate in some areas and be more important to preserve. How does that work visually?

Left: original image. Middle: bilinear downsampling and bilinear upsampling. Right: Low resolution optimized for the upsampling.

This looks significantly better and closer – not surprising, bilinear downsampling is pretty bad. But what’s kind of surprising is that with Lanczos3, it is very similar by comparison:

Left: original image. Middle: Lanczos3 downsampling and bilinear upsampling. Right: Low resolution optimized for the upsampling.

It’s less aliased, but otherwise the results are very close to using bilinear downsampling. This might be surprising, but consider how poor and soft the bilinear upsampling filter is – it just blurs out most of the frequencies.

Optimizing (or directly solving) for the upsampling filter is definitely much better. But it’s not a very practical option. Can we compensate for the upsampling filter flaws with pre-filtering?

Compensating for the upsampling filter with pre-filtering of the lower resolution image

We can try to find a filter that simply “inverts” the upsampling filter frequency response. We would like a combination of those two to become a perfect lowpass filter. Note that in this approach, in general we don’t know anything about how the image was generated, or in our case – the downsampling function. We are just designing a “generic” prefilter to compensate for the effects of upsampling. I went with Lanczos4 for the used examples to get to a very high quality – but I’m not using this knowledge.

We will look for an odd (not phase shifting) filter that, concatenated with its mirror and multiplied by the frequency response of the bilinear upsampling, produces a response close to an ideal lowpass filter.

We start with an “identity” prefilter with a flat frequency response. Combined with the bilinear upsampler, it yields the same frequency response as the upsampler alone:

However, if we run optimization, we get a filter with a response:

The filter coefficients are…

Yes, that’s a super simple unsharp mask! You can safely ignore the extra samples and go with a [-0.175, 1.35, -0.175] filter.

I was very surprised by this result at first. But then I realized it’s not as surprising – as it compensates for the bilinear tent-like weights.
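If you want to double check this at home, here’s a small numpy sketch (mine, not from the post) of the combined response – note that the low resolution prefilter response appears “squeezed” 2x after zero-insertion:

import numpy as np

def freq_response(taps, w):
  # DTFT magnitude of a small FIR filter at angular frequencies w.
  return np.abs(sum(t * np.exp(-1j * w * k) for k, t in enumerate(taps)))

w = np.linspace(0.0, np.pi, 256)            # frequencies at the full (upsampled) rate
sharpen = [-0.175, 1.35, -0.175]            # prefilter, applied at the low resolution
bilinear_up = [0.25, 0.75, 0.75, 0.25]      # zero-insertion form of the bilinear upsampler
combined = freq_response(sharpen, 2.0 * w) * freq_response(bilinear_up, w)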

Something rings a bell… sharpening mip-maps… Does it sound familiar? I will come back to this in one of the later sections!

When we evaluate it on the test image, we get:

Left: original image. Middle: Lanczos4 downsampling and bilinear upsampling. Right: Lanczos4 downsampling, sharpening filter, and bilinear upsampling.

However if we compare against our “optimized” image, we can see a pretty large contrast and sharpness difference:

Left: original image. Middle: Lanczos4 downsampling, sharpening filter, and bilinear upsampling. Right: Low resolution optimized for the upsampling (“oracle”, knowing the ground truth).

The main reason for this (apart from local adaptation and being aliasing aware) is that we don’t know anything about the downsampling function and the original frequency content.

But we can design them jointly.

Compensating for the upsampling filter – downsample

We can optimize for the frequency response of the whole system – optimize the downsampling filter for the subsequent bilinear upsampling.

This is a bit more involved here, as we have to model steps of:

  1. Compute response of the lowpass filter we want to model <- this is the step where we insert variables to optimize. The variables are filter coefficients.
  2. Aliasing of this frequency response due to decimation.
  3. Replication of the spectrum during the zero-insertion.
  4. Applying a fixed, upsampling filter of [0.125, 0.375, 0.375, 0.125].
  5. Computing loss against a perfect frequency response.

I will not describe all the details of steps 2 and 3 here -> I’ll probably write some more about it in the future. It’s called multirate signal processing.

Step 4. is relatively simple – a multiplication of the frequency response.

For step 5., our “perfect” target frequency response would be similar to a perfect response of an upsampling filter – but note that here we also include the effects of downsampling and its aliasing. We also add a loss term to prevent aliasing (try to zero out frequencies above half Nyquist).

For step 1, I decided on an 8 tap, symmetric filter. This gives us effectively just 3 degrees of freedom – as the filter has to normalize to 1.0. Basically, it becomes a form of [a, b, c, 0.5-(a+b+c), 0.5-(a+b+c), c, b, a]. Computing frequency response of symmetric discrete time filters is pretty easy and also signal processing 101, I’ll probably post some colab later.
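For step 1, here’s a sketch of one possible parameterization and response computation (my own variable names, and not necessarily the exact setup used for the plots below):

import jax.numpy as jnp

def make_downsampling_filter(params):
  a, b, c = params
  d = 0.5 - (a + b + c)                  # enforces normalization of the 8 taps to 1.0
  return jnp.array([a, b, c, d, d, c, b, a])

def symmetric_freq_response(taps, w):
  # Symmetric taps pair up into cosines around the half-sample center,
  # so the amplitude response is real-valued (zero phase).
  offsets = jnp.arange(taps.shape[0]) - (taps.shape[0] - 1) / 2.0
  return jnp.sum(taps[None, :] * jnp.cos(w[:, None] * offsets[None, :]), axis=-1)

# params = (0.0, 0.0, 0.0) reproduces the box / bilinear [0.5, 0.5] downsampling filter,
# which is a natural initialization for the optimization.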

As typically with optimization, the choice of initialization is crucial. I picked our “poor” bilinear filter, looking for a way to optimize it.

Without further ado, this is the combined frequency response before:

And this is it after:

The green curve looks pretty good! There is a small ripple and some loss of the highest frequencies, but this is expected. There is also some aliasing left, but it’s actually better than with the bilinear filter.

Let’s compare the effect on the upsampled image:

Left: original image. Middle: Lanczos3 downsampling and bilinear upsampling. Right: Our new custom downsampling and bilinear upsampling.

I looked so far at the “combined” response, but it’s insightful to look again, but focusing on just the downsampling filter:

Notice the relatively strong mid-high frequency boost (the bump above 1.0) – this is there to “undo” the subsequent too-strong lowpass filtering of the upsampling filter. It’s quite similar to the sharpening from before!

At the same time, if we compare it to the sharpening solution:

Left: original image. Middle: Lanczos4 downsampling, sharpen filter, and bilinear upsampling. Right: Our new custom downsampling and bilinear upsampling.

We can observe more sharpness and contrast preserved (also some more “ringing”, but it was also present in the original image, so doesn’t bother me as much).

Different coefficient count

We can optimize for different filter sizes. Does this matter in practice? Yes, up to some extent:

Top-left: original. Next ones are 4 taps, 6 taps, 8 taps, 10 taps, 12 taps.

I can see the difference up to 10 taps (which is relatively a lot). But I think the quality is pretty good with 6 or 8 taps, or 10 if there is some more computational budget (more on this later).

If you’re curious, this is how coefficients look like on plots:

Efficient implementation

You might wonder about the efficiency of the implementation. A 1D filter of 10 taps means… 100 taps in 2D! Luckily, the proposed filters are completely separable. This means you can downsample in one dimension first, and then in the other. But 10 samples is still a lot.

Luckily, we can use an old bilinear sampling trick.

When we have two adjacent samples of the same sign, we can combine them together and, in the above plots, turn the first filter into a 3-tap one, the second one into 4 taps, the third into an impressive 4, then 6, and 6 taps.

I’ll describe quickly how it works.

If we have two samples with offsets -1 and -2 and the same weights of 1.0, we can instead take a single bilinear tap in between those – offset of -1.5, and weight of 2.0. Bilinear weighting distributes this evenly between sample contributions.

When the weights are uneven, the combined tap still has a total weight of w_a + w_b, and its position is the weighted centroid – offset from sample a by w_b / (w_a + w_b) of the distance toward sample b.
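In code, the merging step could look like this (a tiny sketch with a hypothetical helper name):

def merge_taps(p_a, w_a, p_b, w_b):
  # Combine two same-sign taps into one bilinear tap at their weighted centroid.
  w = w_a + w_b
  return (w_a * p_a + w_b * p_b) / w, w

print(merge_taps(-2.0, 1.0, -1.0, 1.0))  # (-1.5, 2.0) - matches the example above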

Here are the optimized filters:

[(-2, -0.044), (-0.5, 1.088), (1, -0.044)]
[(-3, -0.185), (-1.846, 0.685), (0.846, 0.685), (2, -0.185)]
[(-3.824, -0.202), (-1.837, 0.702), (0.837, 0.702), (2.824, -0.202)]
[(-5, 0.099), (-3.663, -0.293), (-1.826, 0.694), (0.826, 0.694), (2.663, -0.293), (4, 0.099)]
[(-5.698, 0.115), (-3.652, -0.304), (-1.831, 0.689), (0.83, 0.689), (2.651, -0.304), (4.698, 0.115)]

I think this makes this technique practical – especially with separable filtering.

Sharpen on mip maps?

One thing that occurred to me during those experiments was the relationship between a downsampling filter that performs some strong sharpening, sharpening of the downsampled images, and some discussion that I’ve had and seen many times.

This is not just close, but identical to… a common practice suggested by video game artists at some studios I worked at – if they had an option of manually editing mip-maps, sometimes artists would manually sharpen mip maps in Photoshop. Similarly, they would request such an option to be applied directly in the engine.

Being a junior and excited programmer, I would eagerly accept any feature request like this. 🙂 Later when I got grumpy, I thought it’s completely wrong – messing with trilinear interpolation, as well as being wrong from signal processing perspective.

Turns out, like with many many practices – artists have a great intuition and this solution kind of makes sense.

I would still discourage it and prefer the downsampling solution I proposed in this post. Why? The commonly used box filter is a poor downsampling filter. Applying sharpening on top of it enhances any aliasing (the boosted higher frequencies are exactly the ones that tend to alias), so it can have a bad effect on some content. Still, I find it fascinating and will keep repeating that artists’ intuition about something being “wrong” is typically right! (Even if some proposed solutions are “hacks” or not perfect – they are typically in the right direction).

Summary

I described a few potential approaches to address having your content processed by “bad” upsampling filtering, like a common bilinear filter: an (expensive) solution inverting the upsampling operation on the pixels (if you know the “ground truth”), simple sharpening of low resolution images, or a “smarter” downsampling filter that can compensate for the effect that worse upsampling might have on content.

I think it’s suitable for mip-maps, offline content generation, but also… image pyramids. This might be topic for another post though.

You might wonder if it wouldn’t be better to just improve the upsampling filter if you can have control over it? I’d say that changing the downsampler is typically faster / more optimal than changing the upsampler. Fewer pixels need to be computed, it can be done offline, upsampling can use the hardware bilinear sampler directly, and in the case of complicated, multi-stage pipelines, data dependencies can be reduced.

I have a feeling I’ll come back to the topics of downsampling and upsampling. 🙂 

Posted in Code / Graphics

Comparing images in frequency domain. “Spectral loss” – does it make sense?

Recently, numerous academic papers in the machine learning / computer vision / image processing domains (re)introduce and discuss a “frequency loss function” or “spectral loss” – and while for many it makes sense and nicely improves achieved results, some of them define or use it wrongly.

The basic idea is – instead of comparing pixels of the image, why not compare the images in the frequency domain for a better high frequency preservation?

While some variants of this loss are useful, this particular take and motivation is often wrong.

Unfortunately, current research contains a lot of “try some new idea, produce a few benchmark numbers that seem good, publish, repeat” papers – without going deeper and understanding what’s going on. Beating a benchmark and an intuitive explanation are enough to publish an idea (but most don’t stick around).

Let me try to dispel some myths and go a bit deeper into the so-called “frequency loss”.

Before I dig into details, here’s a rough summary of my recommendations and conclusion of this post:

  1. Squared difference of image Fourier transforms is pointless and comes from a misunderstanding. It’s the same as a standard L2 loss.
  2. Difference of amplitudes of Fourier spectrum can be useful, as it provides shift invariance and is more sensitive to blur than to mild noise. On the other hand, on its own it’s useless (by discarding phase, it doesn’t correspond at all to comparing actual images).
  3. Adding phase differences to the frequency loss makes it a more meaningful image comparison, but it has a “singularity” that makes it not very useful as a metric. It’s a weird, ad hoc remapping…
  4. A hybrid combination of pixel comparisons (like L2) and frequency amplitude loss combines useful properties of all of the above and can be a desirable tool.

Interested in why? Keep on reading!

Loss functions on images

I’ve touched upon loss functions in my previous machine learning oriented posts (I’ll highlight the separable filter optimization and generating blue noise through optimization, where in both I discuss some properties of a good loss), but for a fast recap – in machine learning, loss function is a “cost” that the optimization process tries to minimize.

Loss functions are designed to capture aspects of the process / function that we want to “improve” or solve.

They can be also used in a non-learning scenario – as simple error metrics.

Here we will have a look at reference error metrics for comparing two images.

Reference means that we have two images – “image a” and “image b” and we’re trying to answer – how similar or different are those two? More importantly, we want to boil down the answer to just a single number. Typically one of the images is our “ground truth” or target.

We just produce a single number describing the total error / difference, so we can’t distinguish if this difference comes from one being blurry, noisy, or a different color.

The most common loss function is the L2 loss – the average squared difference:

L2(a, b) = 1/N · Σ_i (a_i − b_i)²

It has some very nice properties for many math problems (like a closed-form solution, or corresponding to statistically meaningful estimates, e.g. under the assumption of white Gaussian noise). On the other hand, it’s known that it is not great for comparing images, as it doesn’t seem to capture human perception – it’s not sensitive to blurriness, but very sensitive to small shifts.

A variant of this loss called PSNR adds a logarithm transform:

PSNR(a, b) = −10 · log10(L2(a, b))   (for images normalized to the [0, 1] range)

This makes the values more “human readable”, as it’s easier to tell that a PSNR of 30 is better than 25 and by how much, while 0.0001 vs 0.0007 is not as easy to interpret. Generally, PSNR values above 30 are considered acceptable, and above ~36 very good.
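In code, both metrics are trivial (my own sketch, with the same assumption of images normalized to [0, 1]):

import numpy as np

def l2_loss(a, b):
  return np.mean((a - b) ** 2)

def psnr(a, b):
  # For images in [0, 1]; equivalent to 10 * log10(MAX^2 / MSE) with MAX = 1.
  return -10.0 * np.log10(l2_loss(a, b))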

In this post, I will be using simple scipy “ascent” image:

To demonstrate the shortcomings of L2 loss, I have generated 3 variations of the above picture.

One is shifted by 1 pixel to the left/up, one is noisy, and one is blurry:

Top left: reference. Top right: image shifted by (1, 1) pixels. Bottom left: image with added noise. Bottom right: blurred image. All three distortions have the same L2 error and PSNR!

The first two images look completely identical to us (because they are! just slightly shifted), yet this pair has the same PSNR and L2 error as the strongly blurred one and the mildly noisy one! All of those have a PSNR of ~22.5.

I think on its own this shows why it’s desired to find an alternative to L2 loss / PSNR.

Also, it’s worth noting that sensitivity to small shifts is exactly the same phenomenon that causes this “overblurring”. If a few different shifts generate the same large error, you can expect their average (aka… blur!) to also generate a similar error:

To combat those shortcomings, many better alternatives were proposed (a very decent recent one is nVidia’s FLIP).

One of the alternatives proposed by researchers is “frequency loss”, which we’ll focus on in this post.

Frequency spectrum

Let’s start with the basics and a recap. Frequency spectrum of a 2D image is a discrete Fourier transform of the image performed to obtain decomposition of the signal into different frequency bands.

I will use here the upper case notation for a Fourier transform:

X[k, l] = Σ_{m=0..M−1} Σ_{n=0..N−1} x[m, n] · e^(−2πi · (k·m/M + l·n/N))

A 2D Fourier transform can be expressed as such a double sum, or by applying a one dimensional discrete Fourier transform first in one dimension, then in the other one – as multi-dimensional DFT/FFT is separable. Through a 2D DFT, we can obtain a “complex” (as in complex numbers) image of the same size.

Note above that in DFT, every X value is a linear combination of the input pixels x with constant complex weights that depend only on the position – and as such is a linear transform and can be expressed as a gigantic matrix multiply – more on that later.

Every complex number in the DFT image corresponds to a complex phasor; or in simpler terms, amplitude and phase of a sine wave.

If we take a DFT of the image above and visualize complex numbers as red/green color channels, we get an image like this:

This… is not helpful. Much more intuitive is to look at the amplitude |X| of the complex numbers:

This is how we typically look at periodogram / spectrum images.

Different regions in this representation correspond to different spatial frequencies:

Center corresponds to low frequencies (slowly changing signal), horizontal/vertical axis to pure horizontal/vertical frequencies of increasing period, while corners represent diagonal frequencies.

The analyzed picture has more strong diagonal structures and it’s visible in the frequency spectrum.

Reference image presented again – notice many diagonal edges.

Most “natural” images (pictures representing the real world – like photographs, artwork, drawings) have a “decaying” frequency – which means much more of the lower frequencies than the higher ones. In the image above, there are very large almost uniform (or smoothly varying) regions, and a DFT captures this information well.

But taking the amplitude of the complex numbers, we have lost information – projected a complex number onto a single real one. Such an operation (taking the amplitude) is irreversible and lossy, but we can look at what we discarded – phase, which looks kind of random (yet structured):

The relationship between the magnitude/amplitude A, the phase φ, and the original complex number z can be expressed as z = A · e^(iφ). The phase is just as important, and we’ll get back to it later.

Comparing periodograms vs comparing images

After such an intro, we can analyze the motivation behind using such a transformation for comparing images.

One of the main reasons for the “frequency loss” (we’ll try to clarify this term in a second) to compare images is (hand-waving emerges) “if you use the L2 loss, it’s not sensitive to small variations of high frequencies and textures, and it’s not sensitive at all to blurriness. By directly comparing frequency spectrum content, we can reproduce the exact frequency spectrum, and help alleviate this and make the loss more sensitive to high frequencies”.

Is this intuition true? It depends. Different papers define the frequency loss differently. As a first go, I’ll start with the one that is wrong and that I have seen in at least two papers – the use of the squared difference of the complex DFT values:

L_freq(a, b) = 1/N · Σ_k |A_k − B_k|²   (where A and B are the DFTs of a and b)

Here is how average L2 (average squared difference) looks like compared to “normalized” power spectrum (depending on the implementation, some FFT implementations divide by N, some divide by N^2):

They are the same. PSNR is the same as log transform of this squared pseudo-frequency-loss.

Average squared complex spectrum difference is the same as squared image difference.

Squared complex spectrum difference is pointless

Those papers that use a standard squared difference on DFTs and claim it gives some better properties are simply wrong. If there are any improvements in experiments, it’s a result of poor methodology or some hyperparameter tuning, not having validation set etc. Well… 🙂 

But why is it the same?

I am not great at proofs and math notation / formalism. So for those who like a more formal treatment, go see Parseval’s theorem and its proofs. You could also try to derive it yourself by expanding the DFT and Euler’s formula. Parseval’s theorem states:

Σ_n |x_n|² = 1/N · Σ_k |X_k|²   (up to the DFT normalization convention)

Just consider x and X as not the image pixels and Fourier coefficients, but pixels and Fourier coefficients of differences – and those two are the same.

(Side note: Parseval’s theorem is super useful when you deal with computing statistics and moments like variance of noise that is non-white / spatially correlated. You can directly compute variance from the power spectrum by simply integrating it.)
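A quick numerical check that the two quantities really are identical (a sketch; numpy’s forward FFT is unnormalized, hence the division by the pixel count):

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 64, 64))
pixel_mse = np.mean((a - b) ** 2)
spectrum_mse = np.mean(np.abs(np.fft.fft2(a) - np.fft.fft2(b)) ** 2) / a.size
print(np.allclose(pixel_mse, spectrum_mse))  # True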

But this explanation and staring at lines of maths would not be satisfactory to me, I prefer a more intuitive treatment.

So another take is – DFT transform is a unitary transformation. And L2 norm (squared difference) is unitary transformation invariant. Ok, so far this probably doesn’t sound any more intuitive. Let’s take it apart piece by piece.

First – think of unitary transformation as a complex generalization of a rotation matrix. The DFT matrix (a matrix form of the discrete Fourier transform) can be analyzed using the same linear algebra tools and has many of exactly the same properties like a rotation matrix that you can check – verify that it’s normal, orthogonal etc. 🙂 With just one complication – it does the “rotation” in the complex numbers domain.

When I heard about this perspective of DFT in a matrix form and its properties, it took me a while to conceptualize it (and I still probably don’t fully grasp all of its properties). But it’s very useful to think of DFT as a “weird generalization of the rotation” (I hope more mathematically inclined readers don’t get angry at my very loose use of terminology here 🙂 but that’s my intuitive take!).
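If the “complex rotation” framing sounds hand-wavy, it’s easy to verify numerically that the DFT matrix is unitary (a sketch using scipy’s DFT matrix helper):

import numpy as np
from scipy.linalg import dft

M = dft(16, scale='sqrtn')                       # unitary-normalized 16x16 DFT matrix
print(np.allclose(M.conj().T @ M, np.eye(16)))   # True - its conjugate transpose is its inverse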

Having this perspective, we can look at a squared difference – the L2 norm. This is the same as squared Euclidean distance – which doesn’t depend on the rotations of vectors! You can see it on the following diagram:

On the other hand, some other norms like L1 (“Manhattan distance”) definitely depend on those and change, for example:

By rotating a square, you can change the L1 distance between two corners from √2 to 2.

We could look at using loss like this. But instead, we can have a look at a better and luckily more common use of spectral loss – comparing the amplitude (instead of full complex numbers).

Spectrum amplitude difference loss

We can look instead at the difference of squared amplitudes:

This representation is more common and has some interesting properties. Let’s use this loss / error metric to compare the three image distortions above:

Ok, now we can see something more interesting happening!

First of all – no, the shifted value is not missing. Shifting an image has an error of zero, so its negative logarithm is at “infinity”. I considered how to show it on this plot, but decided it’s best left out – yes, it is “infinity”.

This time, we are similarly sensitive to blur, but… not sensitive at all to shifts!

The implication – that shifting an image by a pixel (or even by 100 or 1000 pixels if you “wrap” the image) gives exactly the same frequency response magnitude and zero difference in amplitudes – is very important.

This should make some intuitive sense – the image frequency content is the same; but it’s at a different phase.

Spectrum amplitude is shift invariant for pixel-sized shifts and when not resampling!
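Here’s a minimal demonstration (my own sketch, assuming the mean squared difference of DFT amplitudes as the loss variant – the exact normalization doesn’t matter for this property):

import numpy as np

def amplitude_loss(a, b):
  return np.mean((np.abs(np.fft.fft2(a)) - np.abs(np.fft.fft2(b))) ** 2)

rng = np.random.default_rng(0)
img = rng.standard_normal((128, 128))
shifted = np.roll(img, shift=(1, 1), axis=(0, 1))  # whole-pixel shift with wrapping
print(amplitude_loss(img, shifted))                # ~0.0 - the amplitudes are identical
print(np.mean((img - shifted) ** 2))               # plain L2 is clearly non-zero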

This can be a very useful property for some super-resolution like tasks and other inverse problems, where we don’t care about small shifts, or might have some problems in for example image alignment stage and we’d like to be robust to those.

We are also less sensitive to noise. In this context, we can say that we are relatively more sensitive to blurriness. This metric has improved quite a lot the score of the noisy picture as compared to the blurry one – which also seems to be doing +/- what we need.

To get them to comparable score with this new metric, we’d need significantly more noise:

With frequency amplitude loss, you have to add more noise to get similar distortion score as compared to blurring.

But let’s have a look at the direct consequences of this phase invariance that makes this loss useless on its own…

Phase information is essential

Ok, now let’s use our amplitude loss on the original picture, and this one:

Pixels of a different image with the same frequency spectrum amplitudes.

The loss value is… 0. According to the frequency amplitude error metric, they are identical.

Again, we are comparing the picture above with: 

Yep, those two pictures have exactly the same spectral content amplitudes, just a different phase. I actually just reset the phase / angle to be all zeros (but you could set it to any random values).

By comparison, standard L2 metric error would indicate a huge error there.

For natural images, the phase is crucial. You cannot just ignore it.

This is in an interesting contrast with audio, where phases don’t matter as much (until you get into non-linearities, add many signals, or have quickly changing spectral content -> transients)

Comparing phase and amplitude?

I saw some papers “sneakily” (without explaining why or what’s going on, nor analyzing it!) try to solve it by comparing both the amplitude and the phase and adding them together:

Comparing phases is a bit tedious (as it needs to be “toroidal” comparison, -pi needs to yield the same results and zero loss as pi; I ignored it above), but does it make sense? Kind of?

I propose to look at it this way – summing an amplitude and a phase loss, instead of taking the shortest path between two complex numbers, sums a path along the radial direction and the angle difference, which can be considered a wide arc:

Above, La represents the amplitude difference, while Lp represents phase difference. In this drawing, Lp is supposed to be on the circle itself, but I was not able to do it in Slides.

Obviously, relative length of components can be scaled arbitrarily by the coefficient and the effect of phase reduces.

Here’s a visualization of a single frequency with real part of 1, and imaginary of 0 with real/imaginary numbers on the x/y axis, standard L2 loss:

Standard L2 loss.

The L2 loss is “radial” – the further we go away from our reference point (1, 0), the more the error increases, radially. Make sure you understand the above plot before going further.

If we looked only at the frequency amplitude difference, it would look like this:

Frequency amplitude loss.

Hopefully this is also intuitive – all vectors with the same magnitude form a circle of 0 error – whether (1, 0), (0, 1), or (√2/2, √2/2). So we have infinitely many values (corresponding to different phases) that are considered the same, and this time they are centered around (0, 0).

And here’s a sum of squared amplitude and angle/phase loss (with the phase slightly downweighted):

Combination of frequency amplitude and phase loss.

So it’s kind of a mix of the above behaviors, but with a nasty singularity around the center – arbitrarily small displacements of a (0, 0) vector cause the phase to shift and flip arbitrarily…

Imagine a difference between (0, 0.0000001) and (0, -0.0000001) – they have completely opposite phases!

This is a big problem for optimization. In the regions where you get close to zero signal, and the amplitudes are very small, the gradient from the phase is going to be very strong (while irrelevant to what we want to achieve) and discontinuous. And we also lose our original goal – of relying mostly on frequency amplitude similarity of signal.

So does this combined loss make sense? Because of the above reasons, I would recommend against it. I’d argue that papers that propose the above loss and not discuss this behavior are also wrong.

I’ll write my recommendation in the summary, but first let’s have a look at some other frequency spectrum related alternatives that I won’t discuss, but can be an excellent research direction.

Tiled/windowed DFTs/FFTs and wavelets

Generally, looking at the whole image and all of the frequencies with global/infinite support (and wrapping!) is rarely desired. A much more interesting mathematical domain was developed for that purpose – all kinds of “wavelets”, localized frequency decompositions. I believe this is my third blog post where I mention wavelets only to say I don’t know enough about them and have to defer the reader to the literature – so maybe it’s time for me to catch up on their fundamentals… 🙂

A simpler alternative (that doesn’t require a PhD in applied maths) is to look at “localized” DFTs. Chop up the image into small pieces (typically overlapping tiles), apply some window function to prevent discontinuities and weight the influence by the distance, and analyze / compare locally.

This is a very reasonable approach, used often in all kinds of image processing – but also one of the main audio processing building blocks: Short-time Fourier Transform. If you ever looked at audio spectrograms and spectral methods – you have seen those for sure. This is also a useful tool for comparing and operating on images in different contexts, but that’s a topic for yet another future post.

My recommendations

Ok, so going back to the original question – does it make sense to use frequency loss?

Yes, I think so, but in a more limited capacity.

As a first recommendation, I’d say – the amplitude alone is not enough, but don’t compare the phases. You can get a similar result (without the singularity!) by mixing a sum of the squared frequency amplitude difference and a regular, pixel-space squared loss, something looking like this (but further tuneable):

This is how this loss shape looks like:

A combination of L2 and frequency amplitude loss. Has desirable properties of both.
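In code, one possible formulation could look like this – my own sketch, where the relative weighting and the normalization of the amplitude term are tuneable assumptions:

import numpy as np

def hybrid_loss(a, b, amplitude_weight=1.0):
  pixel_l2 = np.mean((a - b) ** 2)
  amplitude_diff = np.abs(np.fft.fft2(a)) - np.abs(np.fft.fft2(b))
  # Dividing by the pixel count puts the amplitude term on a scale comparable to pixel L2
  # (see the Parseval's theorem discussion earlier in the post).
  amplitude_l2 = np.mean(amplitude_diff ** 2) / a.size
  return pixel_l2 + amplitude_weight * amplitude_l2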

With such a formulation, we can compare all the three original image distortion cases:

This way, we can obtain:

  • Slightly more sensitivity to blur than to noise as compared to standard L2,
  • Some shift invariance for minor shifts,
  • Robustness to major shifts – as L2 will dominate,
  • Similarly being robust to completely scrambled phases,
  • Smooth gradient and no singularities.

Note that I’m not comparing it with SSIM, LPIPS, FLIP, or one of numerous other good “perceptual” losses and will not give any recommendation about those – use of loss and error metric is heavily application specific!

But in a machine learning scenario, in the simplest case where you’d like to improve L2 – add some shift invariance, and care about preserving general spectral content  – for example for preserving textures in a GAN training scenario or for computer vision tasks where registration might be imperfect – it’s a useful simple tool to consider.

I would like to acknowledge my colleague Jon Barron with whom I discussed those topics on a few different occasions and who also noticed some of wrong uses of “frequency loss” (as in squared DFT difference) in some recent computer vision papers.

Posted in Code / Graphics

On leaving California and the Silicon Valley

At the beginning of the year, my wife and I made the final call – to leave California, “as soon as possible”. To be fair, we had talked about this for a long time – a few years – but without any concrete date or commitment. Now it was serious, and even the still ongoing pandemic and all the related challenges couldn’t stop us. On the first of April, we were already in a hotel in Midtown NYC, preparing to look for a new apartment, which we found two weeks later.

Not California anymore!

I get asked almost daily “why did you do it? Did your Californian dream not work out?”, and some New Yorkers are in shock and somewhat question how a sane person would move out of “perfect, sunny California”.

The Californian dream did work out – after all, we spent almost 7 years there!

Out of those, four were in the San Francisco Bay Area – so it wasn’t “suffering”. I am quick to make impulsive/irrational/radical decisions in my life, so if things had been bad, we wouldn’t have stayed there. There are some absolutely beautiful moments and memories that I’d like to keep for the rest of my life. And I could consider “retiring” in Los Angeles at some older age – it was a super cool place to be and live, and I loved many things about it (it was way more diverse – in every possible way – than the SF Bay Area), even if it didn’t satisfy all my needs.

But there are also many reasons we “finally” decided on the move, reasons why I would not want to spend any more time in Silicon Valley – and most are nuanced beyond a short answer.

Therefore, I wrote a personal post on this – detailing some of my thoughts on the Bay Area, California, and Silicon Valley (mostly skipping LA, as we left that area over four years ago) – and why they were not an ultimate match for me. If you expect any “hot takes”, you’ll probably be disappointed. 🙂

Some disclaimers

It’s going to be a personal, and a very subjective take – so feel free to strongly disagree or simply have opposite preferences. 🙂 Many people love living there, many Americans love suburbs, and as always different people can have completely opposite tastes!

It’s a decision that we made together with my wife (who advocated for it even more strongly, as she didn’t enjoy some of the things that I find cool in the South Bay as much), but I’m going to focus on just my perspective (even if we’d agree on 99% of it).

Finally, I am Polish, a Slav, and a European – so consider that my wording might be on a more negative side than most Americans would use. 🙂

Suburbs

Let me start with the obvious – living in California, and especially in the Bay Area, is for the vast majority of people equivalent to living in the suburbs. We used to live for over a year in San Francisco, but it didn’t live up to our expectations (that’s a longer story on its own).

Even Los Angeles, “big city” is de facto a collection of independent, patched suburbs. Ok, but what’s wrong with that?

Flying back above the Bay Area (here probably Santa Clara) – endless suburbs…

It’s super boring! Extremely suburban lifestyle, with a long drive almost anywhere. To me, suburbs are a complete dystopia in many ways – from car culture, lack of community, lack of inspiring places, zero walkability, segregation based on income and other factors, obsession over safety.

If you are a reader from Europe, I need to write a disclaimer – American suburbs are nothing like European ones. The scale (and the consequences thereof) is unbelievably different! When I was 14, my family moved from a small apartment in the city center to a house in the suburbs of Warsaw. Back then they were “far suburbs” – my parents built a house in the middle of a huge empty field. But I’d still go to a school in the city center, all my friends lived across the city, and I’d always use great public transport to be within an hour of almost any part of the city. I could even walk to any of those other districts (and I often did!), and in college every summer afternoon I’d be cycling across the city. By comparison, in the Bay Area walking from one suburb town – Mountain View – to an adjacent one, Palo Alto, would take 1.5 uninspiring hours, with many intersections on busy highway-like streets. Even a drive would be ~half an hour! Going to the city (San Francisco) meant either a 1.5h train ride (and then a switch to SF public transportation, where some parts could be another 1h away), or a 1h car ride with the next half an hour spent looking for a parking spot. Back on topic!

Generally the “vibe” I had was that nothing is happening, so as a typical leisure you have a choice of doing outdoors sports, driving for some hike, watching a movie, playing video games, or going to an expensive, but mediocre restaurant. (Ok, I have to admit – it was mediocre in the South Bay / Silicon Valley. SF restaurants were extremely expensive, but great!)

I grew up in cities and love city life in general.

By comparison, this is a collage of photos I took all within a 5-minute walk of my current home in NYC – it’s hard not to get immediate visual stimulation and inspiration!

I like density, things going on all the time, a busy atmosphere, and urban life. I love walking out, seeing some events happening, streets closed, music playing, concerts and festivals happening, and just getting immersed in it and taking part in some of those events. Or going out to a club with friends, but first checking out some pub or a house party, changing the location a few times, and finally going back home on foot and grabbing a late night bite. It’s all about spontaneity, many (sometimes overwhelming!) options to pick from, and not having to plan things ahead – not feeling trapped in a routine. Also, often the best things that you remember are all those that you didn’t plan – they just happened and surprised you in a new way. As an emigrant willing to experience some of the new culture directly and get immersed in it – Montreal was great, Los Angeles mediocre, the South Bay terrible.

Suburban boredom was extremely uninspiring and demotivating, and especially during the pandemic it felt like Groundhog Day.

With everything closed, one would think you’d at least be able to just walk around outdoors?

This is how people imagine that suburbs look like (ok, Halloween was great and just how I’d imagine it from American movies):

But this is how most of it actually looks: strip malls, parking lots, cars, streets (rainbows optional):

Public transport is super underfunded and not great (most people use cars anyway), and while some claim that owning and driving a car is “liberating”, for me it is the opposite – liberating is a feeling that I can end up in a random part of a city and easily get back home by a mix of public transport, walking, and city bikes (yeah!), with plenty of choices that I can pick based on my mood or a specific need.

Unfortunately, suburbs typically mean zero walkability. You cannot really just walk out of your house and enjoy it – pedestrian friendly infrastructure is non-existent. Sidewalks abruptly end, and there are constant high traffic roads that you need to cross. Even if that weren’t a problem, there are miles where there is nothing around except for people’s houses and some occasional strip malls – it doesn’t make for an interesting or pleasant walk. In many areas you kind of have to drive even to the closest corner-store type of small grocery. This meme captures the walkability and the non-spontaneous, car heavy, and sometimes stressful suburban life very well:

(Meme: “You need eggs for dinner but have none in your kitchen, what do you do?” – the rural farmer sends the kid to check if the hen has laid more eggs, the urban pedestrian walks with the kid to the corner store, and the suburbanite’s answer is the car-dependent punchline.)

I don’t want to make this blog post just about suburbs, so I’d recommend the Eco Gecko youtube channel, which covers various issues with suburbs – and particularly American suburbs – very well. A quote that I heard a few times and that resonated with me is that after WW2, driven by the economic boom, Americans decided to throw away thousands of years of know-how on architecture, urban planning, and the way we always lived, and instead performed a social experiment named “suburbs”, leading a “motorized” lifestyle. I think it has failed.

As noted, those are my preferences. Suburbs have “objective” problems – strip malls and parking lots are ugly, cars are not eco friendly, etc. – but yes, there are people who love suburbs, the relative safety (in the US it’s a problem – at least a perceived one), having a large house, clean streets, and say that it’s good for family life. I cannot argue with anyone’s preferences and I respect that – different people have different needs. But I’d like to dispel a myth that suburbs are great for kids. It might be true when they are 1-10 years old; later it gets worse. As I mentioned, when I was ~14 I moved to the suburbs – and I hated it… I think that for teenagers, it’s bad for their social life and their ability to grow up as independent humans. Now, being an adult, I think my parents have a very nice house and a beautiful garden and I can appreciate chilling there with a bbq – but back then I felt alone and isolated, and if it wasn’t for my friends in other parts of the city and great public transport, I’d have been very miserable. Which teenager wants to be driven around and “helicoptered” by parents and have their whole life revolve around school and “activities”?

Lifestyle

Ok, but people live there and are very happy, so what do they do?

This brings me to a difference in “West Coast lifestyle” (it’s definitely a thing).

Most of the people living there like a specific type of lifestyle – in their spare time they enjoy camping, hiking, rock climbing, mountain biking and similar.

And this is cool – while I’m not into any hardcore sports or camping (being a “mieszczuch”, which apparently translates to “city slicker” – not sure if that’s an accurate translation), I do love beautiful nature and can appreciate a hike.

Californian hikes and nature can be truly awesome, even outside of the famous spots:

But as much as I love hiking, it gets boring to me very quickly – either you go to see the same stuff and type of landscape over and over, or you need to drive for many hours.

Living in Silicon Valley, the closest hikes meant driving 45min one way, then hunting for half an hour for a parking spot (it doesn’t matter if there is vast open space if some percentage of the over 7 million people living around tries to find a parking spot by the road… and forget about getting there by any public transport), doing an hour hike, and then driving back. Boom, half of the weekend gone, most of it spent in a car. And quite a lot of the hikes have the same “harsh”, drought affected climate with dusty roads and dried out, yellow vegetation:

Anyway, the vast majority of friends and colleagues had such “outdoor” and active interests and hobbies (I’m not mentioning other classic techie hobbies that are interesting, but solitary like bread making or brewing and similar).

On the other hand, I am much more interested in other things like music, art, culture, museums, street activities, and clubbing. Note again – there is nothing wrong with loving the outdoors vs. the indoors – it’s just a matter of preference.

“Wait a second!” I might hear someone say. Most of it is available “somewhere” around – some museums, events, concerts, restaurants. Sure, but due to distances and things being sparse and scattered around, everything needs to be planned and done on purpose. If you want to drive to a museum, it’s one hour; to then go to a particular restaurant, it’s another hour. There is no room for randomness, for things just happening and you randomly taking part in them. You need to actively research, plan, book/reserve ahead, drive, do a single activity, and come back home. As I write it, I shiver; this lack of spontaneity kind of still terrifies me. There’s the “American prepper” stereotype – people who are super organized ahead of time and often over-prepared – for a reason.

Similarly, I grew up being part of alternative culture communities (punk rock, some goth, new wave, later nu-rave, blog house, rave, electronica, house, disco, underground techno) and was missing it. The kind of thing where I’d go to some club even when nothing special is going on – to get to know some new artist, band, or DJ; to see if any of my friends are around, maybe meet new people, maybe just hang around. This is how I made lifelong and best friends – completely randomly and through shared interests and communities – and we are still friends to this day, even if those interests changed later. I missed it greatly and it contributed to a feeling of loneliness.

Photo taken in my hometown, Warsaw. The most missed aspect of culture, non-existent in the Bay Area, is the kind of “pop-up” culture and community gathering spaces. A place where you just randomly end up with a bunch of your friends, meet some new people, discover some new place or a new music artist, and can expect the unexpected.

So this is a matter of preference and just something I’m into vs. not into, so I consider it “neutral”. On the other hand, there are other things I definitely didn’t like and consider negative about the vibe I got from some (obviously not all) people I met in Silicon Valley.

Car culture and driving to all places like restaurants, clubs, and bars also has a “dark side”. Quite a lot of people drive drunk – way more than in Europe. In Poland the blood alcohol limit is 0.02%; in California it is 0.08%. I’d see people visibly drunk, with slurred speech, go pick up their car from the valet and drive home. By comparison, in Poland among my friends if someone had a single weak beer, all of the friends would shout at them “wait a few hours, you cannot drive, you idiot”.

Side note: There is also the whole “tech bro” “alternative” culture – people into Burning Man, hardcore libertarianism, radical self expression and radical individualism, polyamory, and microdosing (to increase productivity and compete, obviously…) – but none of it is my thing. If you know me, I’m as far “ideologically” from technocratic libertarianism and rand-esque/randian “freedom for rich individuals” (or rather, attempting to slap a “culture” label on the lifestyle of self-obsessed young bros) as possible. Also, I think I haven’t really met in person even a single person (openly) into this stuff, so maybe it’s an exaggerated myth, or maybe simply much more common in just VC/startup/big money circles?

Competition, money and success

Lots of people in the Bay Area are very competitive and success obsessed.

There are jokes about some douches bragging on their phone in a supermarket checkout line about how many millions they will get from their upcoming IPO – those “jokes” are real situations.

Everyone talks about money, success, position in their company (1 person company = opportunity to be a founder, CEO, CFO and CTO, and a VP of engineering – all in one!), entrepreneurship. At corps / companies people think and talk about promos, bonuses, managers about getting more headcount, higher profile project, and claiming “impact” – everything in some unhealthy competition sauce (sometimes up to a direct conflict between 2 or more teams). You are often judged by your peers by your “success” in those metrics.

Such people see collaboration only as a tool for fulfilling their own goals (once those are fulfilled, at best you are a stranger, at worst they’ll backstab you). Teams are built to fulfill the goals of their leads, who want to “work through others” and even express it openly and unironically (I still shiver when I hear/read this phrase). Lots of “friendships” are more like networking, or purely transactional. It’s kind of disappointing when a “friend” of yours invites you “for socializing and catching up” and then turns it into a job or startup pitch.

I personally find it toxic.

Ok, at least some of the tech “competition” is not-so-serious. 🙂

It’s also a bad place to have a family (even though I’m not planning one now) – teenage suicide rates are very high around Palo Alto due to this competitive vibe and extremely high expectations and peer pressure… I cannot imagine growing up in an environment where “competition” starts before you are even born or conceived – with parents securing a house in the place with the best school district and fighting and filing applications for spots in a daycare.

The arena of competition is also very narrow – there is a lack of cultural diversity; everyone works in tech, works a lot, and talks mostly about tech.

In the Bay Area, tech is everywhere – here food delivery robots mention “hunger”, while there are actually hungry people on the street (more on that later).

I know, I’m a tech person, so I shouldn’t complain… but I actually thrive in more diverse environments; it’s fun when you have tech people meeting art people meeting humanities people meeting entertainment people 🙂 – extroverts mixing with introverts, programmers and office clerks and musicians and theater people and movie people etc. None of it is there – no real diversity of ideas or cultures.

I don’t think we really met anyone not working in tech or general entrepreneurship (other than spouses of tech colleagues) – which is kind of crazy given that we were there four years (!).

Tech arrogance

This one is even worse. Silicon Valley is full of extremely smart and talented people, who excel in their domain – technology. People who were always the best students at school, are ridiculously intelligent and capable of extreme levels of abstraction. I often felt very stupid surrounded by peers who grew up there and graduated from those prime colleges and grad schools. Super smart, super educated, super ambitious, and super hard working.

Unfortunately for some people who are smart and successful in one domain, one of the side effects is arrogance outside of their domain – “this is a simple problem, I can solve it”.

Isn’t this what Silicon Valley entrepreneurship is about? 🙂 A “big vision” that is indifferent to any potential challenges or obstacles (ok, I admit – this is actually a mindset/skill I am jealous of; I wish my insecurities about potentially not delivering didn’t turn into pessimism). It comes with a tendency to oversimplify things, ignore the nitty gritty details, and jump in with hyper optimism and a technocratic belief that everything can be solved just with technology.

Don’t get me started on the whole idea of “disruption”, which is often just ignoring and breaking existing regulations, stripping away workers’ rights, and extracting money from them at a huge profit (see Uber, Airbnb, electric scooters, and many more).

“Bro, you have just ‘reinvented’ coworkers/roommates/hotel/bus/taxi/wire transfers”

And you can feel this vibe around all the time – it’s absolutely ridiculous, but being surrounded with such a mindset (see my notes above about lack of cultural diversity…), you can start truly losing perspective.

Economy and prices

Finally, a point that every reader who knows anything about California was expecting – everything is super expensive.

Housing prices and the issues around them are very well known, and I’m not an expert on this. It’s enough to say that most apartments and houses around cost $1-4M, and it’s the buyers who compete (again!) and outbid each other to get to buy one, with final sale prices 30% above the asking price. Those “mansions” are typically made from cardboard – ugly, badly built, and with outdated, dangerous electrical installations. Why would sellers bother doing anything better, renovating etc., if people are going to pay fortunes for it anyway?

I didn’t buy a house, my rent was high (but not that bad), but this goes further.

On top of housing, every single “service” is way more expensive than in other parts of the US (or most of the world). In LA, for repairing my car (scratch and dent on a side) in a bodyshop with stellar reviews we paid $800 and were very happy with the service; in the Bay Area, for a much smaller but similar type of repair (only a scratched side, no dent) of the same car I got a quote for over $4K, of which only $400 was material cost (probably inflated as well) – and we decided not to repair; our car is probably worth less than twice that. ¯\_(ツ)_/¯

I am privileged and a high earner, so why would I care?

The worst part to me was the heartbreaking wealth disparity and homeless population.

Campers and cars with people living in them are everywhere, including parked at the side of some tech companies’ campuses / buildings.

A huge topic on its own, so I won’t add anything original or insightful here – people have written essays and books on it. But still – I cannot get over seeing one block with $4M mansions, and a few blocks away campers and cars with people living in them (who are not unemployed! Some of them actually work as so-called “vendors” – security, cleaning, and kitchen staff – for big tech…).

And these are the luckier ones, who still own a camper or a car and have a job – there are many homeless people living in tents (or under the starry sky) on the streets of San Francisco, often people with disabilities, mental health problems, or addictions.

It is no exaggeration at all to say that many people in the Bay Area care much more for their dogs than for fellow humans, who get dehumanized.

Climate…? And other misc

This one will be a fun and weird one, as many (even Americans!) assume that California has perfect weather.

It’s definitely not the case in San Francisco and generally closer to the coast – damp, cold, and windy most of the year. My mom, visiting me in the summer with her friend, brought mainly shorts, dresses, and shirts. They had to spend the first day shopping for a jacket, hoodie, and long trousers.

Ok, San Francisco is an extreme outlier – going more in-land, it is definitely warm and sunny, which comes at a price.

For the last few years, the whole summer has brought extreme heat waves, wildfires, bad air quality, and power blackouts. Last year was particularly bad… Climate change has already arrived and we get to pay the price – desert areas that we made habitable will soon stop being habitable. It feels truly depressing (and like humanity cannot get its s..t together).

Apocalypse? No, just Californian summer and wildfires. And some “beautiful” suburbs.

Ok, so you have to spend a few weeks inside with AC on and good air filters, but it’s just a small part of the year?

Here again comes personal preference – I got tired of almost zero rain, no weather, no seasons. I’m not a big fan of cold or long winters, but I missed autumn, rainy days, the excitement of spring, hot summer nights (not many people realize that on the West Coast, even in the summer, it very rarely gets warmer than 15C at night – desert climate)! You get a long summer, an autumn-like “winter” with some nice foliage colors around late November / mid December, and you used to get a few rainy days December through February – but not really anymore.

I have to admit that a few weeks of autumn-like “winter” can be very pretty!

The climate is getting worse, droughts and wildfires are making California a worse and worse place to live, but there’s another thing that sits in the back of your mind – earthquakes. You feel some minor ones once a month (sometimes once a week), but that’s not a problem at all (I sleep well, so even pretty heavy ones with the furniture shaking wouldn’t wake me up). But once every few decades, there is an earthquake that destroys whole cities, with huge casualties. 😦 I’d hope it never happens (especially having some friends who live there), but it will – and living above faults is like living on a ticking time bomb…

Among other misc and random things, California is (too) far away from Europe. This is the most personal of all of the above reasons, but the family gets older and I don’t get to be part of their lives, and I miss some of my friends. It’s a long and expensive trip, involving a connection (though there might be a direct one soon), so any trip needs to be planned and quite long, using plenty of “precious” vacation days.

Another gripe that might be surprising for non-Americans – lots of technology and infrastructure is pretty bad quality in Silicon Valley as well! Unless you are lucky to have fiber available, internet connections are way slower than in Europe (and way more expensive), and the financial infrastructure is old school (though this is a problem of the whole US). When I heard of Apple Pay or Venmo, I was surprised – what’s the big deal? We have had paypass contactless payments and instant zero-fee wire transfers (instead of Venmoing a friend, just wire them, they’ll get it right away) in Poland since the early 2000s!

Let me finish this with a more “random” and potentially polarizing (because politics) point.

I am a (democratic) socialist, left wing, atheist etc., and I thought California would at least partially fit my vision of politics and my political preferences. But I quickly learned that Democrats, especially Californian ones, are simply arrogant plutocratic political elites and work towards the interests of local lobbies. (On the major political axis defining the world post 2016 – elitist vs populist – many are as elitist as it gets.)

It was very clear during the pandemic, with way stricter restrictions than the rest of the country – local and state officials threatening people with cutting off their electricity and water for having a house party – while at the same time the same people who imposed those lockdowns were ignoring them: the California governor having a large party with lobbyists at a 3 Michelin star restaurant, the mayor of SF also dining out while telling citizens to stay home, or speaker Nancy Pelosi sneaking out to a hairdresser despite all such services being closed. Different laws and rules for us, the uneducated masses, different rules for our great leaders.

And there is no room for real democracy or public discourse – as it will be shut down with “trust the science and experts!” (whatever that means; science is a method of investigating the world and building a better understanding of it, and has nothing to do with conducting politics or policies).

Things I will miss

I warned you that I’m a pessimistic Eastern European. But I don’t want to end on such a negative note. Very often, life was also great there, I didn’t “suffer”. 🙂 

Here are some things I will definitely miss about California, and even the Bay Area.

I actually already miss some and am nostalgic about it – after all, it was 4 amazing years of my life.

Inspiring legacy

It was cool to work in THE Silicon Valley. You feel the history of technology all around you, see all the landmarks, and random buildings turn out to be of historical significance.

For example, the Googleplex is a collection of old Silicon Graphics buildings – how cool and inspiring is that for someone in graphics? A simple bike ride around meant seeing all the old and new offices of companies whose products I have used since I was a kid. It feels like being immersed in technology (and all the great things about it that I dreamt of with the 90s technooptimism).

Yeah, when I was a kid and I was reading about the mythical Silicon Valley (that I imagined as some secret labs and factories in the middle of a desert 😀 funny, but might actually become true in a few decades…) and all of the famous people who created computers and modern electronics, I never imagined I’d be given an opportunity to work there, or sometimes even with those people in person (!).

I still find it dream-like and kind of unbelievable – and I do appreciate this once in a lifetime, humbling experience. Being surrounded by it and all the companies plus their history was inspiring in many ways as well.

I’d often think about the future and in some ways I felt like I’m part of it / being a tiny cogwheel contributing to it in some (negligible) way. This felt – and still feels – surreal and amazing.

Career impact

Let me be clear – by moving out, I have severely limited my career growth opportunities. 

And it’s not just about the tech giants, but also all the smaller opportunities, start ups, or even working with some of the academics.

This is a fact – and the only other almost comparable hub would be the Seattle area.

In NYC (or worse, in Europe if I ever wanted to go back…), tech presence is limited and most likely I will have to work remotely for the foreseeable future. Most people will return to offices, and many of the career growth options (I don’t mean just the ladder climbing that I just criticized, but also exposure to projects, opportunities for collaboration, being in person and in the same office with all those amazing people) will be available only at headquarters.

Having an opportunity to present your work to the Alphabet CEO in person is not something that happens often – I’ll most likely never have a chance like that again in my life. But it’s possible if you live and work in “the place where it all happens”.

But life is not only a career – yes, I’ll have it a little bit more difficult, but I’ll manage and be just fine 🙂 – and life is so much more beyond that. You cannot “outwork” or buy happiness.

Life is not about narrowly understood “success” (that can be understood only by niche groups of experts) or competition, but fulfilment and happiness (or pursuit of it…).

Suburban pleasures

While I am not a suburban person, there are things I will miss.

Lots of green and foliage. Nice smells of flowers and freshly mowed grass – and omg, those bushes of jasmine and their captivating smell! Being able to ride a bike around smaller neighborhoods without worrying too much about the traffic. But most importantly – outdoor swimming pools in community/apartment complexes, and ubiquitous barbecues!

I love lap swimming and have been doing it almost every single day for the past few years and I will miss it (hey, I already miss it).

Similarly, a regular weekend evening grilling on your large terrace when it just starts to cool down after a hot day with a cold drink in your hand was quite amazing.

Weekend (or week!) night pizza making or a barbecue? No problem, you don’t even need to leave your house / apartment.

I love cycling, and some of the casual strolls were nice. I guess I’d already miss it (cycling in NYC is fun, but a different experience – a slalom between badly parked cars and unsafe traffic) if not for a bit of pandemic cycling burnout (every morning I was cycling around, but due to limited choice there were ~3-4 routes, each of which I rode dozens of times).

Finally, our neighborhood was full of cats. I love cats (who doesn’t…), but we cannot have one, and still we got some pandemic stress relief from petting some of our purring neighbors.

We’d pretty often get some nice unexpected guests entering the apartment.

Californian easy going

Finally – Californian optimistic and positive, easy going spirit is super nice and contagious.

Despite me complaining about having a hard time making friends and criticizing the tech culture, I truly liked most of the people I met there.

If you don’t cause any trouble (and have money…), life is very easy and comfortable, everyone is smiling, you exchange pleasant small talk, nothing is bothering you.

Some stranger grinning “hi how are you???” (that seems more extreme than in the other parts of the US) might be superficial, but it can make your day.

There’s some certain Californian optimism that I’ll definitely miss. After all, it’s a land of Californian dream. Yes, this is cheesy (or cheugy?), and the photo is as well. But I like it.

Conclusions

There are no conclusions! This was just my take, and I went very personal – the most I ever have on this blog. I know some people share some of my opinions; some problems of Silicon Valley are very real, but don’t read too much into my opinions on it. And it was such an important and great part of my life and my career, and I know I’ll miss many things about it. That said, if you get an offer to relocate to the Bay Area – I’d recommend thinking about all those aspects of living that are outside of work and whether they’d fit your preferences or not.

Is New York City going to better serve my needs? I certainly hope so, it seems so far (a completely different world!) and I’m optimistic looking into the future, but only time will tell.


Neural material (de)compression – data-driven nonlinear dimensionality reduction

Proposed neural material decompression (on the right) is similar to the SVD based one (left), but instead of a matrix multiplication uses a tiny, local support and per-texel neural network that can run with very small amounts of per-pixel computations.

In this post I come back to something I didn’t expect to come back to – dimensionality reduction and compression for whole material texture sets (as opposed to single textures) – a significantly underexplored topic.

In one of my past posts I described how the most simple linear correlation can be used to significantly reduce material texture set dimensionality, and in another one ventured further into a fascinating field of dictionary and sparse compression.

This time, I will describe how we can abandon the land of simple linear correlation (that we explored through the use of SVD) and use data-driven techniques (machine learning!) to discover non-linear functions – using differentiable programming / neural networks.

I’m going to cover:

  • Using non-linear decoding on per-pixel data to get better dimensionality reduction than with linear correlation / PCA,
  • Augmenting non-linear decoders with neighborhood information to improve the reconstruction quality,
  • Using non-linear decoders on SVD-encoded data for target platform/performance scaling,
  • Possibility of using learned encoders/decoders for GBuffer data.

Note: There is nothing magic or “neural” (and for sure not any damn AI!) about it, just non-linear relationships, and I absolutely hate that naming.

On the other hand, it does use neural networks, so investors, take notice!

Edit: I have added some example code for this post on github/colab. It’s very messy and not documented, so use at your own responsibility (preferably with a GPU colab instance)!

Inspiration

There have been a few papers recently that have changed the way I think about dimensionality reduction specifically for representing materials and how the design space might look like.

I won’t list all here, but here (and here or here) are some more recent ones that stood out and I remembered off the top of my head. The thing I thought was interesting, different, and original was the concept of “neural deferred rendering” – having compact latent codes that are used for approximating functions like BRDFs (or even final pixel directional radiance) expanded by a tiny neural network, often in the form of a multi-layer perceptron.

What if we used this, but in a semi-offline setting? 

What if we used tiny neural networks to decode materials?

To explain why I thought it could be interesting to bother to use a “wasteful” NN (NNs are always wasteful due to overcompleteness, even if tiny!), let’s start with function approximation and fitting using non-linear relationships.

Linear vs non-linear relationships

In my post about SVD and PCA, I discussed how linear correlation works and how it can be used. If you haven’t read it, I recommend pausing this one and reading it first, as I will use the same terminology and problem setting, and will go pretty fast.

It’s an extremely powerful relationship; and it works so well because many problems are linear once you “zoom in enough” – the principle behind truncated Taylor series approximations. Local line equations rule the engineering world! (And can be used as an often more desirable replacement for the joint bilateral filter).

So in the case of PCA material compression, we assume there is a linear relationship between the material properties and color channels and use it to decorrelate the input data. Then we can ignore the components that are less correlated and contribute less, reducing the representation and storage to just a few channels.

Assuming linear correlation of data and finding line fits allows us to reduce the dimensionality – here from 2D to 1D.

Then the output can be reconstructed as a weighted combination of the kept components.
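To make this concrete, here is a minimal numpy sketch (my own toy example, not code from the original post) of using the SVD to reduce 2D correlated points to 1D and reconstruct them:

import numpy as np

rng = np.random.default_rng(0)
# Toy data: y is roughly 2*x plus noise, so the 2D points hug a line.
x = rng.uniform(-1.0, 1.0, 256)
data = np.stack([x, 2.0 * x + 0.1 * rng.normal(size=256)], axis=1)  # shape (N, 2)

mean = data.mean(axis=0)
centered = data - mean
# SVD of the centered data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep only the first principal component: project down to 1D...
coeffs_1d = centered @ Vt[0]
# ...and reconstruct an approximation of the original 2D points.
reconstructed = np.outer(coeffs_1d, Vt[0]) + mean
print("mean abs reconstruction error:", np.abs(reconstructed - data).mean())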

Unfortunately, once we “zoom out” of local approximations (or basically look at most real world data), relationships in the data stop being so “linear”.

Often PCA is not enough. Like in the above example – we discard lots of useful distinction between some points.

But let’s go back and have a look at a different set of simple observations (represented as noisy data points on the x and y axes):

Attempting line fitting on quadratically related data fails to discover the distribution shape.

A linear correlation cannot capture much of the distribution shape and is very lossy. We can guess why – because this looks like a good old parabola, a quadratic function.

Line fit is not enough and errors are large both “inside” the set, as well as approaching its ends.

This is what drove further developments in the field of machine learning (note: PCA is definitely also a machine learning algorithm!) – to discover, explain, and approximate such increasingly complex relationships, with both parametric models (where we assume some model of reality and try to find the best parameters), as well as non-parametric ones (where we don’t formulate a model of reality or the data distribution).

The goal of Machine Learning is to give reasonable answers where we simply don’t know the answers ourselves – when we don’t have precise models of reality, there are too many dimensions, relationships are way too complex, or we know they might be changing.

In this case, we can use many algorithms, but the hottest ones are “obviously” neural networks. 🙂 The next section will look at why, but first let’s approximate our data with a small Multi-Layer Perceptron NN with a ReLU activation function (ReLU is a fancy name for max(x, 0) – probably the simplest non-linear function that is actually useful and differentiable almost everywhere!) and just 3 neurons in a single hidden layer:

Three neuron ReLU neural network approximation of the toy data.

Or for example with 4 neurons as:

Four neuron ReLU neural network approximation of the toy data.

If we’re “lucky” (good initialization), we could get a fit like this:

“Lucky” initialization leads to matching the distribution shape closely with a potential for extrapolation.

It’s not “perfect”, but much better than the linear model. SVD, PCA, or line fits in a vanilla form can discover only linear relationships, while networks can discover some other, interesting nonlinear ones.

And this is why we’re going to use them!
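For reference, here is a rough JAX sketch of fitting such a tiny ReLU MLP to noisy parabola-like samples (my own toy setup, not the exact code behind the plots above, and using plain gradient descent for simplicity):

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(42)
# Noisy samples drawn from a parabola-like distribution.
x = jnp.linspace(-1.0, 1.0, 128)[:, None]
y = x ** 2 + 0.02 * jax.random.normal(key, x.shape)

def init_params(key, hidden=3):
    k1, k2 = jax.random.split(key)
    return {
        "w1": 0.5 * jax.random.normal(k1, (1, hidden)), "b1": jnp.zeros(hidden),
        "w2": 0.5 * jax.random.normal(k2, (hidden, 1)), "b2": jnp.zeros(1),
    }

def mlp(params, x):
    hidden = jax.nn.relu(x @ params["w1"] + params["b1"])  # piecewise linear "kinks"
    return hidden @ params["w2"] + params["b2"]

def loss(params):
    return jnp.mean((mlp(params, x) - y) ** 2)

params = init_params(key)
grad_fn = jax.jit(jax.value_and_grad(loss))
for step in range(2000):  # plain full-batch gradient descent, good enough for a toy fit
    value, grads = grad_fn(params)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
print("final MSE:", float(value))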

Digression – why MLP, ReLU, and neural networks are so “hot”?

A few more words on the selected “architecture” – we’re going to use the most old-school, classic neural network algorithm – the Multilayer Perceptron.

A 3 input, 3 neurons in a hidden layer, and 2 output neurons Multilayer Perceptron. Source: wikipedia.

When I first heard about “Neural Networks” at college around the mid 00s, in the introductory courses, “neural networks” were almost always MLPs and considered quite outdated. I also learned about Convolutional Networks and implemented one for my final project, but those were considered slow, impractical, and generally worse performing than SVMs and other techniques – isn’t it funny how things change and history turns around? 🙂

Why are neural networks used today everywhere and for almost everything?

There are numerous explanations, from deeply theoretical ones about “universal function approximation”, through hardware based ones (efficient to execute on GPUs, with HW accelerators being built for them), but to me the most pragmatic – and perhaps a bit cynical – one is circular: because the field is so “hot”, a) actual research went into building great tools and libraries for them, and b) there is a lot of know-how.

It’s super trivial to use a neural network to solve a problem where you have “some” data, the libraries and tooling are excellent, and over a decade of recent research (based on half a century of fundamental research before that!) went into making them fast and well understood.

An MLP with a ReLU activation provides a piecewise linear approximation of the actual solution. In our example, it will never be able to learn a real parabola, but given enough data it can perform very well if we evaluate on that same data.

While it doesn’t make sense to use NNs for data that we know is drawn from a small parabola (because we can use a parametric model – both more accurate, as well as way more efficient), once we have more than a few dimensions or a few hundred samples, human reasoning, intuition, and “eyeballing” start to break down.

The volume of the space of possible solutions increases exponentially (the “curse of dimensionality”). For the use-case we’ll look at next (neural decompression of textures), we cannot just quickly look at the data and realize “ok, there is a cubic relationship in some latent space between the albedo and glossiness” – so finding those relationships with small MLPs is an attractive alternative.

But it can very easily overfit (fail to discover the real, underlying relationship in the data – the bias-variance trade-off); for example, increasing the hidden layer neuron count to 256, we can get:

With 256 neurons in the hidden layer, network overfits and completely fails to discover a meaningful or correct approximation of the “real” distribution.

Luckily, for our use-case – it doesn’t matter. For compressing fixed data, the more over-fitting, the better! 🙂 

Non-linear relationships in textures

With the SVD compressed example, we were looking at N input channels (stored in textures) and K (9 in the example we analyzed) output channels, with each output channel given by x_0 * w_0 + … + x_{n-1} * w_{n-1} – a simple linear relationship.

Let’s have a look again at N input channels, K output channels, but in between them a single hidden layer of 64 neurons with a ReLU function:

Instead of a SVD matrix multiplication, we are going to use a simple single hidden layer network.

This time, we perform a matrix multiplication (or a series of multiply-adds) for every neuron in the hidden layer and use those intermediate, locally computed values as our “temporary” input channels that get fed to a final matrix multiplication producing the output.

This is not a convolutional network. The data still stays “local” – we read the same amount of information per channel. During network training, we train both the network and the compressed input texture channels.

During training, we optimize the contents of the texture (“latent code“), as well as a tiny decoder network that operates per pixel.

If those terms don’t mean much to you, unfortunately I won’t go much deeper into how to train a network, what optimizers and learning rates to use (I used ADAM with a learning rate of 0.01), etc. – there are people who are much more competent and explain it much better. I learned from the Deep Learning book and can recommend it – a clear exposition of both theory and some practice – but it was a while ago, so maybe better resources are available now.

I did however write about optimizing latent codes! About optimizing SVD filters with Jax, optimizing blue noise patterns, and about finding latent codes for dictionary learning. In this case, texture contents are simply a set of variables optimized during network training, also through gradient descent / backpropagation. The whole process is not optimized at all and takes a couple of minutes (but I’m sure could be made much faster).
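To make this more concrete, here is a heavily simplified JAX sketch of the joint optimization (the sizes, names, and random stand-in target are mine, and I use plain gradient descent instead of the ADAM setup mentioned above; see the linked colab for the real, messier version). The latent texture channels and a tiny per-texel decoder MLP are trained together:

import jax
import jax.numpy as jnp

H, W, N_LATENT, N_OUT, HIDDEN = 256, 256, 4, 9, 64
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)

# Target: the full material texture set, H x W x 9 (albedo, normals, height, ...).
# Random data here is just a stand-in for real textures.
target = jax.random.uniform(k1, (H, W, N_OUT))

params = {
    "latent": 0.1 * jax.random.normal(k2, (H, W, N_LATENT)),  # the stored texture channels
    "w1": 0.1 * jax.random.normal(k3, (N_LATENT, HIDDEN)), "b1": jnp.zeros(HIDDEN),
    "w2": 0.1 * jax.random.normal(k4, (HIDDEN, N_OUT)), "b2": jnp.zeros(N_OUT),
}

def decode(params, latent):
    # Tiny per-texel MLP: one hidden ReLU layer, applied independently to every pixel.
    hidden = jax.nn.relu(latent @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]

def loss(params):
    return jnp.mean((decode(params, params["latent"]) - target) ** 2)

@jax.jit
def step(params, lr=0.01):
    l, grads = jax.value_and_grad(loss)(params)
    # Gradient descent updates both the decoder weights *and* the latent texture.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads), l

for _ in range(1000):
    params, l = step(params)
print("MSE:", float(l), "PSNR (dB):", float(-10.0 * jnp.log10(l)))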

Without further ado, here’s the numerical quality comparison, number of channels vs PSNR:

Comparison of PSNR: SVD (optimal linear encoding) vs. a non-linear encoding optimized jointly with a tiny NN decoder.

When I first saw the numbers and behavior for very low channel counts, I was pretty impressed. The quantitative improvement is significant – over a few dB. In the case of 4 channels, it turns it from useless (PSNR below 30dB) to pretty good – over 36dB!

Here’s a visual comparison:

A non-linear decoding of four channel data is way better than linear SVD on four channels not just numerically, but also immediately and perceptually visible even when zoomed out – see the effect on the normal map. Textures / material credit cc0textures, Lennart Demes.

That’s the case where you don’t even need gifs and switching back and forth – the difference is immediately obvious on the normal map as well as the height map (middle). It is comparable (almost exactly the same PSNR as well) to the 5 channels linear encoding.

This is the power of non-linearity – it can express approximations of relationships between the normal map channels like x^2 + y^2 + z^2 = 1, which is impossible to express through a single linear equation.

What do the latent codes look like? Surprisingly similar to the PCA data, and reasonably uncorrelated:

Latent space looks reasonably close to the PCA of the input data.

Efficient implementation

Before I go further, how fast / slow would that be in practice? 

I have not measured the performance, but the most straightforward implementation is to compute the 64 per-pixel values, storing them in either registers or LDS through a series of 64×4 MADDs, and then doing another 64×9 MADDs to compute the final channels – totaling 832 multiply-add ops.

For each intermediate channel i (64):
  For each input channel j (1-5):
    Intermediate[i] += Input[j] * weight_hidden[i][j]
  Intermediate[i] = max(0, Intermediate[i])  // ReLU applied once, after summing all inputs
For each output channel o (9):
  For each intermediate channel i (64):
    Output[o] += Intermediate[i] * weight_out[o][i]

This sounds expensive, however all “tricks” of optimizing NNs apply – I’m pretty sure you can use “tensor cores” on nVidia hardware, and very aggressive quantization – not just half floats, but also e.g. 8 or 4 bit smaller types for faster throughput / vectorization.

There’s also the possibility of decompressing this into some intermediate storage / cache like virtual textures, but I’m not sure if this would be beneficial. There’s an idea floating around for at least half a decade about texel shaders and how they could be the solution to this problem, but I’ve been too far away from GPU architecture for the past 4 years to have an opinion on that.

Would this work under bilinear interpolation?

A question that came to my mind was – does this non-linear decoding of data break under linear interpolation (or any linear combination) of the decoded inputs?

I have performed the most trivial test – upscale the encoded data bilinearly by 4x, decode, downsample, and compare to the original decoded result. It seems to be very close, but this is an empirical experiment on only one example. I am almost sure it’s going to work well on more materials – the reason is how tiny the network is compared to the number of texels, with not much room to overfit.
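The test itself is simple enough to sketch (again, my own rough version, assuming a decode(params, latent) function like in the earlier sketch and using jax.image.resize for the resampling):

import jax.numpy as jnp
from jax.image import resize

def bilinear_roundtrip_error(params, latent, decode):
    H, W, C = latent.shape
    # Reference: decode at the native resolution.
    reference = decode(params, latent)
    # Upscale the *latent code* bilinearly by 4x, decode, then box-downsample back.
    latent_up = resize(latent, (4 * H, 4 * W, C), method="bilinear")
    decoded_up = decode(params, latent_up)
    decoded_down = decoded_up.reshape(H, 4, W, 4, -1).mean(axis=(1, 3))
    return jnp.abs(decoded_down - reference).mean()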

To explain this intuition, here is again the example from above – where a small parameter count and underfitting (left) behave well between the input points, while the overfit case (right) behaves “badly” and non-linearly under linear interpolation:

Small networks (left) have no room for overfitting, while larger ones (right) don’t behave well / smoothly / monotonically between the data points.

What if we looked at neighbors / larger neighborhoods?

When you look at material textures, one thing that comes to mind is that per channel correlations might be actually less obvious than another type of correlations – spatial ones.

What I mean here is that some features depend not only on the contents of the single pixel, but much more on where it is and what it represents.

For example – a normal map might represent the derivative / gradient of the height map, and darker spots in a cavity or AO map tend to be surrounded by heightmap slopes.

Would looking at pixels of target resolution image, as well as the larger context of blurred / lower mip level neighborhood give non-linear decoder better information?

We could use either authored features like gradients, or just general convolution networks and let the network learn those non-linearities, but that could be too slow.

I had another idea that is much simpler/cheaper and seems to work quite ok – why not just look at the blurred version of the texel neighborhood? In the extreme close to zero cost approximation – if we looked at one of the next levels in the mip chain which is already built?

So as an input of the network, I add simulated reading from 2 mip levels down (by downsampling the image 4x, and then upsampling back).
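A sketch of how such an extra “blurred neighborhood” input could be produced and appended (same shapes as in the earlier sketch; a real implementation would just sample the already-built mip chain instead of resampling on the fly):

import jax.numpy as jnp
from jax.image import resize

def add_mip_features(latent):
    # Simulate reading "two mips down": downsample 4x, then upsample back to full resolution.
    H, W, C = latent.shape
    low = resize(latent, (H // 4, W // 4, C), method="bilinear")
    blurred = resize(low, (H, W, C), method="bilinear")
    # The decoder now sees 2*C inputs per texel: the texel itself plus its blurred neighborhood
    # (so the first weight matrix needs shape (2*C, HIDDEN)).
    return jnp.concatenate([latent, blurred], axis=-1)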

This seems to improve the results quite significantly in my favorite (strong compression, good PSNR) 2-5 channel regime:

Adding lower mip levels as an input to the small decoder network improves the PSNR significantly.

Maybe not as spectacular, but together with those two changes we’re talking about improving the PSNR of storing just 3 channels and reconstructing 9 from 25dB to over 35.5dB – a massive improvement!

Why not go ahead with full global encoding and coordinate networks?

The success of recent techniques in neural rendering and computer vision where the input is just spatial coordinates (like the brilliant NeRF) raises a question – why not use the same approach for compressing materials / textures? My understanding is that a fully trained NeRF actually uses more data than its input (expansion instead of compression!), but even if we went with a hybrid solution (some latent code + a mixed coordinate-and-regular-input network), for decoding on GPUs we generally don’t want a representation with “global support” (where every piece of information describes the behavior / decoded data in all locations).

If you need to access information about the whole texture to decode every single pixel, the data is not going to fit in the cache… and it doesn’t make sense. I think of it this way – if you were rendering an open world, to render a small object in one place, would you want to be required to always access information about a different object 3 miles / 5 km away?

Local support is way more efficient (as you need to read less data, which fits in the local cache), and even techniques like JPEG that use a global-support basis like the DCT apply it only within small, local tiles.

Neural decompressor of linear SVD data

This is an idea that occurred to me when playing with adding those spatial relationships. Can a “neural” decompressor learn some meaningful non-linear decoding of linearly encoded data? We could use a learned decoder on fast, linearly encoded data.

This would have three significant benefits:

  1. The compression / encoding would be significantly faster.
  2. Depending on the speed / performance (or memory availability if caching) of the platform, one could pick either fast, linear decompression, or a neural network to decompress.
  3. Artists could iterate on linear, SVD data and have a guarantee that the end result will be higher quality on higher end platforms.

Depending on the available computational budget, you could pick between either linear, single matrix-multiply decoding, or a non-linear one using NNs!

In this set-up, we just freeze the latent code computed through SVD and optimize only the network. The network has access to the SVD latent code and the two following (lower) mip levels.
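In code, this variant is a small change to the earlier sketch: the latent texture comes from the SVD encoding and is kept out of the optimized parameters (here hypothetically named svd_latent), so gradients flow only into the decoder weights:

import jax.numpy as jnp

def make_loss(svd_latent, target, decode):
    # svd_latent is a closed-over constant (the linearly encoded texture), not a parameter,
    # so jax.value_and_grad will only differentiate with respect to decoder_params.
    def loss(decoder_params):
        return jnp.mean((decode(decoder_params, svd_latent) - target) ** 2)
    return loss

# Usage sketch:
#   loss = make_loss(svd_latent, target, decode)
#   l, grads = jax.value_and_grad(loss)(decoder_params)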

Comparison of linear vs non-linear decoding of the linearly compressed, SVD data.

The results are not as impressive as before, but still – getting ~4dB extra from exactly the same input data is an interesting and most likely applicable result.

GBuffer storage?

As a final remark, in some cases, small NNs could be used for decoding data stored in GBuffer in the context of deferred shading. If many materials could be stored using the same representation, then decoding could happen just before shading.

This is not as bizarre an idea as it might seem – while compression would most likely be more modest, I see a different benefit – many game engines that use deferred shading have hand-written hard-coded spaghetti code logic for packing different material types and variations with many properties, with equally complicated decoding functions.

When I worked on GoW, I spent weeks on “optimal” packing schemes that were compromising the needs of different artists and different types of materials (anisotropic specular, double specular lobes, thin transmission, retroreflective specular, diffusion with different profiles…).

…and there were different types of encoding/decoding functions like perceptually linearizing roughness, and then converting it to a different representation for the actual BRDF evaluation. Lots of work that was interesting (a fun challenge!) at times, but also laborious, and sometimes heartbreaking (when you have to pick between using more memory and the whole game losing 10% performance, and telling an artist that a feature you wrote for them and they loved will not be supported anymore).

It could be an amazing time saver and offloading for the programmer to rely on automatic decoding instead of such imperfect heuristics.

Summary

In this post, I returned to some ideas related to compressing whole material sets together – and relying on their correlation.

This time, we focused on the assumption of the existence of non-linear relationships in data – and used tiny neural networks to learn this relationship for efficient representation learning and dimensionality reduction. 

My post is far from exhaustive or directly applicable – it’s more of an idea outline / sketch, and there are a lot of details and gaps to fill – but the idea definitely works! – and provides a real, measurable benefit on the limited example I tried it on. Would I try moving in this direction in production? I don’t know; maybe I’d play more with compressing material sets if I were in dire need of reducing material sizes, but otherwise I’d probably spend more time following the GBuffer compression automation angle (knowing it might be tricky to make practical – how do you obtain all the data? how do you evaluate it?).

If there’s a single outcome outside of me again toying with material sets, I hope it somewhat demystified using neural networks for simple data driven tasks:

  • It’s all about discovering non-linear relationships in lots of complicated, multi-dimensional data,
  • Despite limitations (like ReLU can provide only a piecewise linear function approximation) they can indeed discover those in practice,
  • You don’t need to run a gigantic end-to-end network that has millions of parameters to see benefits of them,
  • Small, shallow networks can be useful and fast (832 multiply-adds on limited precision data is most likely practical)!

Edit: I have added some example code for this post on github/colab. It’s very messy and not documented, so use at your own responsibility (preferably with a GPU colab instance)!


Superfast void-and-cluster Blue Noise in Python (Numpy/Jax)

This is a super short blog post to accompany this Colab notebook.

It’s not an official part of my dithering / Blue Noise post series, but it fits it well thematically – be sure to check that series out for some motivation on why we’re looking at blue noise dither masks!

It is inspired by a tweet exchange with Alan Wolfe (who has a great write-up about the original, more complex version of void and cluster and a lot of blog posts about blue noise and its advantages and cool mathematical properties) and Kleber Garcia who has recently released “Noice”, a fast, general tool for creating various kinds of 2D and 3D noise patterns.

The main idea is – one can write a very simple and very fast version of void-and-cluster algorithm that takes ~50 lines of code including comments!

How did I do it? Again in Jax. 🙂

Algorithm

The original void and cluster algorithm comprises 3 phases – initial random points and their reordering, removing clusters, and filling voids.

I didn’t understand why three phases are necessary (the paper didn’t explain them), so I went and coded a much simpler version with just initialization and a single loop:

1. Initialize the pattern to a few random seed pixels. This is necessary as the algorithm is fully deterministic otherwise, so without seeding it with randomness, it would produce a regular grid.

2. Repeat until all pixels are set:

  1. Find an empty pixel with the smallest energy.

  2. Set this pixel to the index of the added point.

  3. Add the energy contribution of this pixel to the accumulated LUT.

  4. Repeat until all pixels are set.

Initial random points

Nothing too sophisticated there, so I decided to use a jittered grid – it prevents most clumping and just intuitively seems ok.

Together with those random pixels, I also create a precomputed energy mask.

Centered energy mask for a 16×16 pattern
Uncentered energy mask for a 16×16 pattern

It is a toroidally wrapped Gaussian bell with a small twist / hack – in the added point center I place float “infinity”. The white pixel in the above plots is this “infinity”. This way I guarantee that this point will never have smaller energy than any unoccupied one. 🙂  This simplifies the algorithm a lot – no need for any bookkeeping, all the information is stored in the energy map!
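A sketch of how such a precomputed, toroidally wrapped energy kernel could be built in numpy (the Gaussian sigma is my arbitrary choice, not necessarily what the colab uses):

import numpy as np

def make_energy_kernel(size, sigma=1.9):
    # Toroidal (wrap-around) distance from the top-left texel, per axis.
    coords = np.arange(size)
    dist = np.minimum(coords, size - coords)
    dist_sq = dist[:, None] ** 2 + dist[None, :] ** 2
    kernel = np.exp(-dist_sq / (2.0 * sigma ** 2))  # Gaussian falloff
    # The "hack": the kernel center gets +inf, so an occupied texel can never be the argmin.
    kernel[0, 0] = np.inf
    return kernel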

This mask will be added as a contribution of every single point, so will accumulate over time, with our tiny infinities 😉 filling in all pixels one by one.

Actual loop

For the main loop of the algorithm, I look at the current energy mask, which looks, for example, like this:

And I use numpy/jax “argmin” – this function literally returns what we need: the index of the pixel with the smallest energy! Beforehand I convert the array to 1D (so that argmin works), get a 1D index, and then easily convert it back to 2D coordinates using division and modulo by a single dimension size. It’s worth noting that operations like “flatten” are free in numpy – they just readjust the array strides and the shape information.

I add this pixel with an increased counter to the final dither mask, and also take our precomputed energy table and “rotate it” according to pixel coordinates, and add to the current energy map.
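Put together, one step of the loop could look roughly like this in plain numpy (a rough reimplementation of the description above, not the exact colab code):

import numpy as np

def fill_dither_mask(size, seed_points, kernel):
    # seed_points: list of (y, x) coordinates of the initial jittered-grid pixels.
    dither_mask = np.zeros((size, size), dtype=np.int32)
    energy = np.zeros((size, size))
    count = 0
    for y, x in seed_points:
        energy += np.roll(kernel, (y, x), axis=(0, 1))  # "rotate" the kernel onto the pixel
        dither_mask[y, x] = count
        count += 1
    while count < size * size:
        flat_index = np.argmin(energy)                  # empty pixel with the smallest energy
        y, x = flat_index // size, flat_index % size
        dither_mask[y, x] = count
        energy += np.roll(kernel, (y, x), axis=(0, 1))  # accumulate its energy (incl. the inf)
        count += 1
    return dither_mask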

After the update it will look like this (notice how nicely we found the void):

After this update step we continue the loop until all pixels are set.

Results

The results look pretty good:

I think this is as good as the original void-and-cluster algorithm.

The performance is 3.27s to generate a 128×128 texture on a free GPU Colab instance (first call might be slower due to jit compilation) – I think also pretty good!

If there is anything I have missed or any bug, please let me know in the comments!

Meanwhile enjoy the colab here, and check out my dithering and blue noise blog posts here.


Computing gradients on grids of pixels and voxels – forward, central, and… diagonal differences

In this post, I will focus on gradients of image signals defined on grids in computer graphics and image processing. Specifically, gradients / derivatives of images, height fields, distance fields, when they are represented as discrete, uniform grids of pixels or voxels.

I’ll start with the very basics – what we typically mean by gradients (as it’s not always “standardized”), what they are used for, what the typical methods are (forward or central differences) and their cons and problems, and then proceed to discuss an interesting alternative with very nice properties – diagonal gradients.

My post will conclude with advice on how to use them in practice in a simple, useful scheme, how to extend it with a little bit of computation to the super useful concept of a structure tensor that can characterize the dominating direction of any gradient field, and finish with some signal processing fun – frequency domain analysis of forward and central differences.

Image gradients

What are image gradients or derivatives? How do you define gradients on a grid?

It’s not a very well defined problem for discrete signals, but the most common and useful way to think about it is inspired by signal processing and the idea of sampling:

Assuming there was some continuous signal that got discretized, what would be the partial derivative with regards to the spatial dimension of this continuous signal at gradient evaluation points?

This interpretation is useful both for computer graphics (where we might have discretized descriptions of continuous surfaces; like voxel fields, heightmaps, or distance fields), as well as in image processing (where we often assume that images are “natural images” that got photographed).

Continuous gradients and derivatives used for things like normals of procedural SDFs are also interesting, but a different story. I will not cover those here, and instead recommend you check out this cool post by Inigo Quilez with lots of practical tricks (re-reading it I learned about the tetrahedron trick) and advice. I am sure that no matter your level of familiarity with the topic, you will learn something new.

In my post, I will focus on the computer graphics, image processing, and basic signal processing takes on the problem. There are two much deeper connections that I haven’t personally worked with too much, so I leave them as something I hope to expand my knowledge of in the future – and also encourage my readers to explore:

Further reading one: The first connection is with Partial Differential Equations and their discretization. Solving PDEs and solving discretized PDEs is something that many specialized scientific domains deal with, and computing gradients is an inherent part of numerical discretized PDE solutions. I don’t know too much about those, but I’m sure literature covers this in much detail.

Further reading two: The second connection is wavelets, filter banks, and the frequency precision / localization trade-off. This is something used in communication theory, electrical engineering, radar systems, audio systems, and many more. While I read and am familiar with some theory, I haven’t found too many practical uses of wavelets (other than the simplest Gabor ones or in use for image pyramids) in my work, so again I’ll just recommend you some more specialized reading. 

Applications

Ok, why would we want to compute gradients? Two common uses:

Compute surface normals – when we have something like a scalar field – whether a distance field describing an underlying implicit surface, or simply a height field – we might want to compute its gradients to get normals of the surface or of the heightfield, for normal mapping, evaluating BRDFs and lighting, etc. Maybe even for physical simulation, collision, or animation on terrain!

Left: an example heightmap image, right: partial derivatives (exaggerated for demonstration) that can be used for generating normal maps for lighting, collision, and many other uses in a 3D environment.

Find edges, discontinuities, corners, and image features – the other application that I will focus much more on is simply finding image features like edges, corners, regions of “detail” or texture; areas where signal changes and becomes more “interesting”. The human visual system is actually built from many edge detectors and local differentiators and can be thought of as a multi-scale spatiotemporal gradient analyzer! This is biologically motivated – when signal changes in space or time, it means it’s a potential point of interest or a threat.

The bonus use-case of just computing the gradients is finding the orientation and description of those features. This is the bread and butter of any image processing, but also very common in computer graphics – from morphological anti-aliasing, through selectively super-sampling only the edges, to special effects. I will go back to it in the description of local features in the Structure Tensor section, but for now let’s use the most common application – just the gradient vector magnitude, sqrt((df/dx)^2 + (df/dy)^2), used for example in the Sobel operator.

Gradient magnitude can be used for simple edge detection in an image, but also many more (described later in the post!).

Test signals

I will be testing gradients on three test images:

  • Siemens star – a test pattern useful to test different “angles” and different frequencies,
  • A simple box – has corners that can immediately show problems,
  • 1 0 1 0 stripe pattern – frequencies exactly at Nyquist.
Test patterns / images

The Siemens star was rendered at 16x resolution and downsampled with a sharp antialiasing filter (windowed sinc). There is some aliasing in the center, but it’s ok for our use-case (we want to see a sharp signal change there and then detect it).

Forward differences

The first, most intuitive approach is computing forward difference.

This is an approximation of df/dx computed as f(x+1) – f(x).

What I love about this solution is that it is both intuitive and naive, as well as theoretically motivated! I don’t want to rephrase the wikipedia article and most readers of my blog don’t care so much for formal proofs, so feel free to have a look there, or in specialized courses (side note: the 2010s+ are amazing, with so many of the best professors sharing their course slides openly on the internet…). It’s basically built around the Taylor expansion of f(x+1) – f(x).

This difference operator is the same as convolving the image with a [-1, 1] filter.

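To make it concrete, here is a minimal numpy sketch of computing forward differences and the resulting gradient magnitude (the function and variable names are mine, just for illustration):

import numpy as np

def forward_diff(img):
    # df/dx ~ f(x+1) - f(x), df/dy ~ f(y+1) - f(y); same as convolving with [-1, 1].
    # Note that the results have one fewer sample along the differentiated axis
    # and are really "centered" between the original pixels.
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return dx, dy

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0                                      # simple box test signal
dx, dy = forward_diff(img)
grad_mag = np.sqrt(dx[:-1, :] ** 2 + dy[:, :-1] ** 2)    # crop to a common shape
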
It’s easy, works reasonably well, and is useful. But there are two bigger problems.

Problem 1 – even-shift

If you have read my previous blog post – on bilinear down/upsampling – one of the challenges might be visible right away. Convolving with an even-sized filter shifts the whole image by half a pixel.

It’s easily visible as well:

Images vs image gradient magnitude. Notice the shift to left/top.

Depending on the application, this can be a problem or not. “A solution” is simple – undo it with another even-sized filter. The perfect solution would be resampling, but resampling by half a pixel is surprisingly challenging (see another blog post of mine), so instead we can blur it with a symmetric filter. Blurring the gradient magnitude fixes it (at the cost of blur):

Blurring with an even-sized filter fixes the even-sized shift. But one other problem might have become much more visible right now.

Note: You might wonder what would happen if we blurred just the partial differences instead? We will see in a second.

But let’s focus on another, more serious problem.

Problem 2 – different gradient positions for partial differences

The second problem is more severe; what happens if we compute multiple partial derivatives, like df/dx and df/dy, this way?

Partial derivatives are approximated at different positions!

Notice how the gradient being defined between pixels means that the two partial forward derivatives are defined at different points! This is a big problem.

Now let’s have a look at the same plot again, this time focusing on “symmetry”. This shows why it’s a real, not just theoretical issue. If we look at gradients of a Siemens star and the box, we can see some asymmetry:

Notice left-top gradients being stronger for Siemens star, and producing a “hole” on the box corner. Oops.

This is definitely not good and will cause problems in image processing or when computing normals, but it’s even worse if we look at the gradient magnitude of a simple “square” in the center of the image – notice what happens to the image corners: one corner is cut off, and another one is 2x too intense!

This is a big problem for any real application. We need to look at better approximations.

Central differences

I mentioned that “blurring” partial derivatives can “recenter” them, right? What if I told you that it is also numerically more accurate? This is the so-called central difference.

It is evaluated as (f(x+1) – f(x-1)) / 2. The division by two is important and intuitively can be understood as dividing by the distance between the pixels, or the differential dx, which here is 2x larger.

While the forward difference is an approximation accurate to O(h), this one is more accurate – to O(h^2). I won’t explain those terms here, but wikipedia has a basic intro, and in numerical methods literature you can find a more detailed explanation.

It is also equivalent to “blurring” the forward difference with a [0.5, 0.5] filter! [0.5, 0.5] convolved with [-1, 1] leads to [-0.5, 0, 0.5]. I was initially very surprised by this connection – of a more accurate, theoretically motivated Taylor expansion, and just “some random ad-hoc practical blur”. This is an even-sized blur, so it also fixes the half pixel shift problem and centers both partial derivatives correctly:

Central difference leads to perfectly symmetric (though not isotropic) results!

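By the way, the kernel equivalence mentioned above can be verified in a couple of lines of numpy (just a sketch):

import numpy as np

forward = np.array([-1.0, 1.0])    # forward difference kernel
blur = np.array([0.5, 0.5])        # even-sized, half-pixel-shifting blur
print(np.convolve(forward, blur))  # -> [-0.5  0.   0.5], the central difference kernel
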
Problem – missing center…

However, there is another problem. Notice how the gradient estimate on a pixel doesn’t take the pixel’s value into account at all! This means that patterns like 0 1 0 1 0 1 will have… zero gradient!

This is how both methods compare on all gradients, including the “striped” pattern:

Top: analyzed images, center: forward difference, bottom: central difference. Notice how both stripes show no gradient, as well as the center of the Siemens star seems to be missing!

The central difference predicts zero gradient for any 1 0 1 0-like, highest frequency (Nyquist) sequence! On the Siemens star this manifests as missing / zero gradients just outside of the center.

We know that we have strong gradients there, and strong patterns/edges, so this is clearly not right.

This can and will lead to some serious problems. If we look at the Siemens star, we can notice how it is nicely symmetric, but misses gradient information in the center.

Sobel filter

It’s worth mentioning here what the commonly used Sobel filter is. The Sobel filter is nothing more or less than the central difference blurred (with a binomial filter approximation of a Gaussian) in the direction perpendicular to the evaluated gradient, so for the horizontal gradient something like:

[-1 0 1]
[-2 0 2]
[-1 0 1]

The practical motivation for it in the Sobel filter is a) filtering out some noise and the tendency to be sensitive to single pixel gradients that are not very visible, and more importantly b) the blurring introduces more “isotropic” behavior (you will see below how “diagonals” and “corners” are closer in magnitude to the horizontal/vertical edges).

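One way to see this “central difference blurred in the perpendicular direction” structure is to build the Sobel kernel as an outer product – a small numpy sketch:

import numpy as np

smooth = np.array([1.0, 2.0, 1.0])    # binomial smoothing, a Gaussian approximation
central = np.array([-1.0, 0.0, 1.0])  # (unnormalized) central difference
sobel_x = np.outer(smooth, central)   # smooth along y, differentiate along x
print(sobel_x)
# [[-1.  0.  1.]
#  [-2.  0.  2.]
#  [-1.  0.  1.]]
sobel_y = sobel_x.T
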
Here is how the Sobel operator compares to the central difference (comparing those two due to their similarity):

Does the isotropic and fully rotationally symmetric behavior matter?

Mentioning isotropic or anisotropic response on grids is always weird to me. Why? Because the Nyquist frequencies are not isotropic or radial – they form a box. This means that we can represent higher frequencies on the diagonals than along the principal axes. I have covered in an older post of mine how this property can be used in checkerboard rendering by rotating the sample grid (used for example by many PlayStation 4 Pro titles to achieve 4K rendering with spatiotemporal upsampling). But any analysis becomes “weird”, and I often see discussions of whether the separable sinc is “the best” filter for images, as opposed to the rotationally symmetric jinc.

I don’t have an easy answer for how important it is for image gradients – the answer is “it depends”. Do you care about detecting higher frequency diagonal gradients as stronger gradients or not? What’s the use-case? I haven’t thought it through in the context of normal mapping, as it could lead to “weird” curvatures, but I’m not sure. I might come back to it out of personal curiosity at some point, so please let me know in the comments what your thoughts and/or experiences are.

Image Laplacians

Edit: Jeremy Cowles suggested an interesting topic to add here and an area of common confusion. The Sobel operator can sometimes be used interchangeably with / confused with the Laplace operator. The Laplace operator can be discretized for example with the 5-point stencil:

[0  1 0]
[1 -4 1]
[0  1 0]

Both can be used for detecting edges, but it’s very important to distinguish the two. The Laplace operator is a scalar value corresponding to the sum of second order derivatives. It measures a very different physical property and has a different meaning – it measures the smoothness of the signal. This means that, for example, a linear gradient will have a Laplacian of exactly zero! It can be very useful in image processing and computer graphics, and sometimes even better than the gradient; but it’s important to think about what you are measuring and why!

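A tiny numpy check of that property (a sketch using the 5-point stencil above): the discrete Laplacian of a linear ramp is exactly zero in the interior.

import numpy as np

x = np.arange(8, dtype=float)
ramp = np.tile(x, (8, 1))   # a linear gradient along x

# 5-point stencil: f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1) - 4*f(x,y)
lap = (ramp[1:-1, 2:] + ramp[1:-1, :-2] +
       ramp[2:, 1:-1] + ramp[:-2, 1:-1] - 4.0 * ramp[1:-1, 1:-1])
print(np.abs(lap).max())    # -> 0.0, a linear gradient has zero Laplacian
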
Laplace operator is signed, but its absolute value on our test signals looks like:

Absolute value of the image laplacian – laplacian magnitude.

So despite being centered, it doesn’t have the problem of the vanishing 0 1 0 1 0 sequence – great! On the other hand, notice how it’s almost zero outside of the Siemens star center, and stronger in the center and on the corners – a very different semantic meaning and very different practical applications.

Larger stencils?

Notice that just as the central difference increases the accuracy of the forward difference (a higher order approximation), you can go even further, adding more and more taps to your derivative-approximating filter!

They can help with disappearing high frequency gradients, but your gradients become less and less localized in space. This is the connection to wavelets, audio, STFT and other areas that I mentioned.

I also like to think of it intuitively as half pixel resampling of the [-1, 1] filter. 🙂 For example if you were to resample [-1, 1] with a half pixel shifting larger sinc filter, you’d get similar behavior:

Left: [1, -1] filter shifted with a windowed sinc. Right: Least-squares optimal 9 tap derivative.

Diagonal differences

With all this background, it’s finally time for the main suggestion of the post.

This is something I learned from my coworkers who worked on the BLADE project (approximating image operations with discrete learned filters), and which we then used in the work that materialized as the Handheld Multiframe Super-Resolution paper and the Super-Res Zoom Pixel 3 project that I had the pleasure of leading.

I also get a lot of questions about this part of the paper in emails, so I think it’s not a very commonly known “trick”. If you know where it might have originated and which domains use it more often, please let me know.

Edit: Don Williamson noticed on twitter that it’s known as the Roberts cross and was developed in 1963!

The idea here is to compute gradients on the diagonals. For example in 2D (note that it extends to higher dimensions!):

We can see that with the diagonal difference, both partial derivative approximations are centered at the same point. We compute those just like forward differences, but need to compensate for the larger distance – sqrt(2), the size of the pixel diagonal – so the two diagonal differences are (f(x+1, y+1) – f(x, y)) / sqrt(2) and (f(x+1, y) – f(x, y+1)) / sqrt(2).

If you use gradients like this, you can “rotate” them back for normals. For just the gradient magnitude you don’t need to do anything special – it’s computed the same way (length of the vector)!

A gradient computed like this has neither the missing high frequency gradients problem, nor the asymmetry problem:

Diagonal gradients offer an easy solution to two problems of both central, and forward differences.

It has the problem of pixel shift, and some anisotropy, but the pixel shift can be easily fixed.

Diagonal differences – fixing shift in a 3×3 pixel window

In a 3×3 window, we can compute our gradients like this:

How to compute symmetric and not phase shifting diagonal gradients in a 3×3, 9 pixel window.

This means summing all the squared gradients and dividing by 4 (the number of gradient pairs) – all very efficient operations. Then you just take the square root to get the gradient magnitude – see the sketch below.

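Here is a minimal numpy sketch of my reading of that scheme (an assumption on my side: the four 2×2 quads of the 3×3 window each contribute one pair of diagonal differences, each scaled by 1/sqrt(2)):

import numpy as np

def diagonal_grad_mag(win):
    # Gradient magnitude from diagonal differences in a 3x3 window.
    assert win.shape == (3, 3)
    acc = 0.0
    for y in range(2):
        for x in range(2):
            q = win[y:y + 2, x:x + 2]
            d1 = (q[1, 1] - q[0, 0]) / np.sqrt(2.0)   # one diagonal
            d2 = (q[1, 0] - q[0, 1]) / np.sqrt(2.0)   # the other diagonal
            acc += d1 * d1 + d2 * d2
    return np.sqrt(acc / 4.0)                         # 4 gradient pairs in the window

win = np.array([[0.0, 0.0, 0.0],
                [1.0, 1.0, 1.0],
                [1.0, 1.0, 1.0]])
print(diagonal_grad_mag(win))   # horizontal edge -> ~0.707
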
This fixes the problem of phase shift and IMO is the best of all alternatives discussed so far:

Diagonal differences detect high frequencies properly and don’t have a pixel shift problem, and while they don’t show perfect isotropy, in practice it is often “good enough”.

Some blurring on the one hand reduces the “localization”, but on the other hand it can be desirable and improve robustness to noise and single pixel gradients.

But why stop there?

With just a few lines of code (and some extra computation…) we can extract even more information from it!

Structure Tensor and the Harris corner detector

We have a “bag” of gradient values for a bunch of pixels… What if we tried to find a least squares solution to “what is the dominant gradient direction in the neighborhood”?

I am not great at math derivations, but here’s an elegant explanation of the structure tensor. The structure tensor is a Gram matrix of the gradients (a vector of “stacked” gradients multiplied by its transpose).

Lots of difficult words, but in essence it’s a very simple 2×2 matrix, where the entries are averages of dx^2 and dy^2 on the diagonal, and dx*dy on the off-diagonal:

[mean(dx*dx)  mean(dx*dy)]
[mean(dx*dy)  mean(dy*dy)]

Now if we do an eigendecomposition of this matrix (which is always possible for a Gram matrix – symmetric, positive semidefinite), we get simple closed forms for the eigenvalues of such a symmetric 2×2 matrix [[a, b], [b, c]]:

lambda_1,2 = (a + c) / 2 ± sqrt(((a – c) / 2)^2 + b^2)

You might want to normalize the vectors; in the case of an off-diagonal Gram matrix entry of zero, the eigenvectors are simply the coordinate system principal axes (1, 0) and (0, 1) (be sure to “sort” them by eigenvalue in that case though; the same goes for underdetermined systems, where the ordering doesn’t matter).

Larger eigenvalue and associated eigenvector describe the direction and magnitude of the gradient in dominant direction; it gives us both its orientation and how strong the gradient is in that specific direction.

The second, smaller eigenvalue and perpendicular, second eigenvector gives us information about the gradient in the perpendicular direction.

Together this is really cool, as it allows us to distinguish edges, corners, regions with texture, as well as the orientation of edges. In our Handheld Multiframe Super-Resolution paper we used two “derived” properties – strength (simply the square root of the structure tensor dominant eigenvalue), describing how strong the gradient is locally, and coherence, computed as (sqrt(lambda_1) – sqrt(lambda_2)) / (sqrt(lambda_1) + sqrt(lambda_2)). (Note: the original version of the paper had a mistake in the definition of the coherence, a result of late edits for the camera ready version of the paper that we missed. This value has to be between 0 and 1).

Coherence is a bit more uncommon; it describes how much stronger the dominant direction is compared to the perpendicular one.

By analyzing the eigenvector (any of them is fine, since they are perpendicular), we can get information about the dominant gradient orientation.

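To make this concrete, here is a small numpy sketch of the whole pipeline under my assumptions (not the paper’s reference code): accumulate the Gram matrix entries over a window of gradients, use the closed-form eigenvalues, and derive strength, coherence, and orientation:

import numpy as np

def structure_tensor_features(dx, dy):
    # dx, dy: arrays of per-pixel gradients inside a local window.
    a = np.mean(dx * dx)   # Gram matrix entries
    b = np.mean(dx * dy)
    c = np.mean(dy * dy)
    half_trace = 0.5 * (a + c)
    root = np.sqrt(0.25 * (a - c) ** 2 + b * b)
    lam1 = half_trace + root
    lam2 = max(half_trace - root, 0.0)        # clamp tiny negative values
    strength = np.sqrt(lam1)
    denom = np.sqrt(lam1) + np.sqrt(lam2)
    coherence = 0.0 if denom == 0.0 else (np.sqrt(lam1) - np.sqrt(lam2)) / denom
    angle = 0.5 * np.arctan2(2.0 * b, a - c)  # dominant eigenvector orientation
    return strength, coherence, angle

dx = np.array([1.0, 0.9, 1.1, 1.0])
dy = np.array([0.1, 0.0, 0.05, -0.05])
print(structure_tensor_features(dx, dy))
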
In the case of our test signals, which are comprised of simple frequencies, those look as follows (please ignore the behavior on borders/boundaries between images in my quick and dirty implementation):

Row 1: Analyzed signals, Row 2: Gradient strength in the dominant direction, Row 3: Gradient orientation angle remapped to 0-1, Row 4: Gradient coherence – most of gradients are very coherent in all of the presented examples, with an exception of zero signal region (coherence undefined -> can be zero), and some corners.

Here’s some other example on an actual image, with a remapping often used for displaying vectors – hue defined by the angle, and saturation by the gradient strength:

Left: Processed image. Right: Color representation of the gradient vectors, where angle describes hue in HSV color space, and the saturation comes from the gradient strength

Such analysis is important for both edge-aware, non-linear image processing, as well as in computer vision. Structure tensor is a part of Harris corner detector, which used to be probably the most common computer vision question prior to the whole deep learning revolution. 🙂 (Still, if you’re looking for a job in the field even today, I suggest not skipping on such fundamentals as interviewers can still ask for classic approaches to verify how much of the background you understand!)

Fixing rotation

One thing that is worth mentioning is how to “fix” the rotation of a structure tensor computed from diagonal gradients. If you compute the angle through an arctangent, this is not something you’d care about (just add a constant offset to your angle), but in scenarios where you need higher computational efficiency you don’t want to be doing any inverse trigonometry, and this matters a lot. Luckily, rotating Gram or covariance matrices is super easy! The formula for rotating a covariance matrix S is simply R S R^T. Constructing a rotation matrix for 45 degrees from the standard angle formula, we get first:

R = 1/sqrt(2) * [1 -1]
                [1  1]

And then, for S = [[a, b], [b, c]] (with this sign convention – double check it against your diagonal axis ordering):

R S R^T = [(a + c)/2 - b   (a - c)/2    ]
          [(a - c)/2       (a + c)/2 + b]

(Note: worth double checking my math. But it makes sense that having exactly the same 4 diagonal elements cancels them and suggests a vertical only or diagonal only gradient!)

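If you want to double check that algebra yourself, a few lines of numpy do it (a sketch, using the rotation convention from above):

import numpy as np

a, b, c = 2.0, 0.3, 1.0                 # structure tensor entries in the diagonal frame
S = np.array([[a, b], [b, c]])
s = np.sqrt(0.5)
R = np.array([[s, -s], [s, s]])         # rotation by 45 degrees
print(R @ S @ R.T)
print(np.array([[(a + c) / 2 - b, (a - c) / 2],
                [(a - c) / 2, (a + c) / 2 + b]]))   # closed form, should match
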
Frequency analysis

Before I conclude the post, I couldn’t resist comparing the forward and central differences from the perspective of signal processing.

It’s probably a bit unorthodox and most people don’t look at it this way, but I couldn’t stop thinking about the disappearing high frequencies; an ideal differentiator of continuous signals has a frequency response that keeps growing with frequency, towards “infinity” as frequencies go to infinity!

Obviously frequencies can go only up to Nyquist in the discrete domain. But let’s have a look at the frequency response of the forward difference convolution filter and the central difference:

The central difference convolution filter has zero frequency response at Nyquist and seems to match the frequency response worse than a simple forward difference filter with increasing frequencies. This is the “cost” we pay for compensating for the phase shift.

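One way to reproduce those curves (a sketch; the plotting itself is left out) is to evaluate the magnitude of the frequency response of each kernel and compare it against the ideal differentiator response, which is just the frequency itself:

import numpy as np

w = np.linspace(0.0, np.pi, 256)        # frequencies from 0 to Nyquist
ideal = w                               # ideal differentiator magnitude response, for comparison

forward = np.abs(np.exp(1j * w) - 1.0)  # |H(w)| of [-1, 1], equal to 2*sin(w/2)
central = np.abs(np.sin(w))             # |H(w)| of [-0.5, 0, 0.5]

print(forward[-1], central[-1])         # at Nyquist: forward -> 2.0, central -> 0.0
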
But given this frequency response, it makes sense that a sequence of 1 0 1 0 completely “disappears”, as it is exactly at Nyquist, where the response is zero!

Finally, if we look at higher order / more filter tap gradient approximations, we can see that it definitely improves for most mid-high frequencies, but the behavior when approaching Nyquist is still a problem:

Ok, with 9 taps we get better and better behavior and closer to the differentiator line.

At the same time, the convergence is slow and we won’t be able to recover the exact Nyquist:

This is due to “perfect” resampling having a frequency response of 0 at Nyquist. Whether this is a problem or not depends on your use-case. In both photography, as well as computer graphics, we can get sequences occurring exactly at Nyquist (from rasterization or from sampling photographic signals). Anyway, I highly recommend the diagonal filters, which don’t have this issue!

Conclusions

In my post, I only skimmed the surface of the deep topic of grid (image, voxel) gradients, but covered a bunch of related topics and connections:

  1. I discussed differences between the forward and central differences, how to address some of their problems, and some reasons for the classic Sobel operator.
  2. I presented a neat “trick” of using diagonal gradients that fixes some of the shortcomings of the alternatives and I’d suggest it to you as the go-to one. Try it out with other signal sources like voxels, or for computing normal maps!
  3. I mentioned how to robustly compute centered gradients on a 3×3 grid using the diagonal method.
  4. I briefly touched on the concept of structure tensor, a least squares solution for finding dominant gradient direction on a grid, and how its eigenanalysis provides many more extensions / useful descriptors over just gradients.
  5. Finally, I played with analyzing frequency response of discrete difference operators and how they compare to theoretical perfect differentiator in signal processing.

As always, I hope that at least some of those points are going to be practically useful in your work, inspire you in your own research or further reading, and maybe to correct me or engage in a discussion in the comments or on twitter. 🙂


Bilinear down/upsampling, aligning pixel grids, and that infamous GPU half pixel offset

See this ugly pixel shift when upsampling a downsampled image? My post describes where it can come from and how to avoid those!

It’s been more than two decades of me using bilinear texture filtering, a few months since I’ve written about bilinear resampling, but only two days since I discovered a bug of mine related to it. 😅 Similarly, just last week a colleague asked for a very fast implementation of bilinear on a CPU and it caused a series of questions “which kind of bilinear?”.

So I figured it’s an opportunity for another short blog post – on bilinear filtering, but in context of down/upsampling. We will touch here on GPU half pixel offsets, aligning pixel grids, a bug / confusion in Tensorflow, deeper signal processing analysis of what’s going on during bilinear operations, and analysis of the magic of the famous “magic kernel”.

I highly recommend my previous post as a primer on the topic, as I’ll use some of the tools and terminology from there, but it’s not strictly required. Let’s go!

Edit: I wrote a follow-up post to this one, about designing downsampling filters to compensate for bilinear filtering.

Bilinear confusion

The term bilinear upsampling and downsampling is used a lot, but what does it mean? 

One of the few ideas I’d like to convey in this post is that bilinear upsampling / downsampling doesn’t have a single meaning or a consensus around this term use. Which is kind of surprising for a bread and butter type of image processing operation that is used all the time!

It’s also surprisingly hard to get it right even by image processing professionals, and a source of long standing bugs and confusion in top libraries (and I know of some actual production bugs caused by this Tensorflow inconsistency)!

Edit: there’s a blog post titled “How Tensorflow’s tf.image.resize stole 60 days of my life” and it describes the same issue. I know some of my colleagues spent months on fixing it in Tensorflow 2 – imagine the effort of fixing incorrect uses and “fixing” already trained models that were trained around this bug…

Image credit/source: Oleksandr Savsunenko

Some parts of it like phase shifting are so tricky that a famous blog post of “magic kernel” comes up every few years and again, experts re(read) it a few times to figure out what’s going on there, while the author simply rediscovered the bilinear! (Important note: I don’t want to pick on the author, far from it, as he is a super smart and knowledgeable person, and willingness to share insights is always respect worthy. “Magic kernel” is just an example of why it’s so hard and confusing to talk about “bilinear”. I also respect how he amended and improved the post multiple times. But there is no “magic kernel”.)

So let’s have a look at what the problem is. I will focus here exclusively on 2x up/downsampling and hope that the thought framework I propose and use here will also help you look at and analyze different (and non-integer) factors.

Because of bilinear separability, I will again abuse the notation and call “bilinear” a filter when applied to 1D signals and generally a lot of my analysis will be in 1D.

Bilinear downsampling and upsampling

What do we mean by bilinear upsampling?

Let’s start with the most simple explanation, without the nitty gritty: it is creating a larger resolution image where every sample is created from bilinear filtering of a smaller resolution image.

For the bilinear downsampling, things get a bit muddy. It is using a bilinear filter to prevent signal aliasing when decimating the input image – ugh, lots of technical terms. I will circle back to it, but first address the first common confusion.

Is this box or bilinear downsampling? Two ways of addressing it

When downsampling images by 2x, we very often use the terms box filter and bilinear filter interchangeably. And both can be correct. How so?

Let’s have a look at the following diagram: 

(Bi)linear vs box downsampling give us the same effective weights. Black dots represent pixel centers, the upper row is the target / low resolution texture, and the bottom row the source, higher resolution one. Blue lines represent the discretized weights of the kernel.

We can see that a 2 tap box filter is the same as a 2 tap bilinear filter. The reason is that in this case, both filters are centered between the pixels. After discretizing them (evaluating filter weights at sample points), there is no difference, as we no longer know what formula was used to generate them, and how the filter kernel looked outside of the evaluation points.

The most typical way of doing bilinear downsampling is the same as box downsampling. Using those two names interchangeably for 2x downsampling is correct! (Side note: Things diverge when talking about more than 2x downsampling. This might be a good topic for another blog post.) For 1D signals it means averaging every two elements together, for 2D images averaging 4 elements to produce a single one.

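In numpy, this 2x box / bilinear downsample is a one-liner (a sketch assuming even image dimensions):

import numpy as np

def box_downsample_2x(img):
    # Average every 2x2 block of pixels into a single output pixel.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
print(box_downsample_2x(img))   # -> [[2.5 4.5], [10.5 12.5]]
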
You might have noticed something that I implicitly assumed there – pixel centers there were shifted by half a pixel, and the edges/corners were aligned.

There is “another way” of doing bilinear downsampling, like this:

A second take on bilinear downsampling – this time with pixel centers (black dots) aligned. Again the source image / signal is on the bottom, target signal on the top.

This one definitely and clearly is also a linear tent, and it doesn’t shift pixel centers. The resulting filter weights of [0.25 0.5 0.25] are also called a [1 2 1] filter, or the simplest case of a binomial filter, a very reasonable approximation to a Gaussian filter. (To understand why, see what happens to the binomial distribution as the trial count goes to infinity!). It’s probably the filter I use the most in my work, but I digress. 🙂

Why is this second method not used that much? This is by design and the reason for half texel shifts in GPU coordinates / samplers – and you might have noticed the problem: the last texel of the high resolution array gets discarded. But let’s not get ahead of ourselves; first let’s have a look at the relationship with upsampling.

Two ways of bilinear upsampling – which one is “proper”?

If you were to design a bilinear upsampling algorithm, there are a few ways to address it.

Let me start with a “naive” one that can have problems. We can keep every original pixel, and between them just place averages of their two neighbors.

Naive bilinear upsampling when pixel centers are aligned. Some pixels receive a copy of the source (green line), the other ones (alternating) a blend between two neighbors.

Is it bilinear / tent? Yes, it’s a tent filter on zero-inserted image (more on it later). It has an unusual property; some pixels get blurred, some pixels stay “sharp” (original copied).

But more importantly, if you do box/bilinear downsampling as described above, and then upsample an image, it will be shifted:

Using box downsampling, and then copy / interpolate upsampling shifts the image by half a pixel. This is a wrong way to do it!

Or rather – it will not correct for the half pixel shift created by downsampling.

It will work however with downsampling using the second method. The second method interpolates every single output pixel; all are interpolated:

When done properly, bilinear down/upsample doesn’t shift the image.

This other way of doing bilinear upsampling might initially feel unintuitive: every pixel is 0.75 of one source pixel, and 0.25 of another one, alternating “to the left” and “to the right”. This is exactly what a GPU does when you upsample a texture by 2x:

There are two simple explanations for those “alternating” weights. The first, easiest one is just looking at the “tents” in this scheme:

If we draw the interpolation “tents”, we can see that the lower resolution image samples alternate on either side of the high resolution sample.

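In 1D code, this even (GPU-convention) upsampling with the alternating weights looks something like the following sketch (clamping at the borders is my assumption – what a GPU sampler does there depends on the addressing mode):

import numpy as np

def bilinear_upsample_2x(x):
    # Even 2x upsample: weights alternate [0.25, 0.75] and [0.75, 0.25].
    left = np.concatenate(([x[0]], x[:-1]))    # clamp at the left border
    right = np.concatenate((x[1:], [x[-1]]))   # clamp at the right border
    out = np.empty(2 * len(x))
    out[0::2] = 0.25 * left + 0.75 * x         # output sample 0.25 to the left of a source center
    out[1::2] = 0.75 * x + 0.25 * right        # output sample 0.25 to the right of a source center
    return out

print(bilinear_upsample_2x(np.array([0.0, 1.0])))   # -> [0.   0.25 0.75 1.  ]
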
I’ll have a look at the second interpretation of this filter – it’s [0.125 0.375 0.375 0.125] in disguise 🕵️‍♀️ – but first, with this intro, I think it’s time to make the main claim / statement: we need to be careful to use the same reference coordinate frames when discussing images of different resolutions.

Be careful about phase shifts

Your upsampling operations should be aware of what downsampling operations are and how they define the pixel grid offset, and the other way around!

Even / odd filters

One important thing to internalize is that signal filters can have an odd or even number of samples. If we have an even number of samples, such a filter doesn’t have a “center”, so it has to shift the whole signal by half a pixel in either direction. By comparison, symmetric odd filters can shift specific frequencies, but don’t shift the whole signal:

Odd length filters can stay “centered”, while even length filters shift the signal/image by half a pixel.

If you know signal processing, those are the type I and II linear phase filters.

Why shifts matter

Here’s a visual demonstration of why it matters. A Kodak dataset image processed with different sequences, first starting with box downsampling:

Using box / tent even downsampling followed by either even, or odd upsampling.

And now with [1 2 1] tent odd downsampling:

Using tent odd downsampling followed by either even, or odd upsampling.

If there is a single lesson from my post, I would like it to be this one: both “takes” on bilinear up/downsampling above can be valid and correct – you simply need to pick the proper one for your use-case and the convention used throughout your code / frameworks / libraries; always use a consistent coordinate convention for downsampling and upsampling. When you see the term “bilinear”, always double check what it means! Because of this, I actually like to reimplement those myself and be sure that I’m consistent…

That said, I’d argue that the “box” bilinear downsampling and the “alternating weights” upsampling are better for the average use-case. The first reason might be somewhat subjective / minor (because bilinear down/upsampling is inherently low quality and I don’t recommend using it when quality matters more than simplicity / performance). If we visually inspect the upsampling operation, we can see more leftover aliasing (just look at the diagonal edges) in the odd/odd combo:

Two types of upsampling/downsampling can prevent image shifting, but produce differently looking and differently aliased images.

The second reason, IMO a more important one is how easily they align images. And this is why GPU sampling has this “infamous” half a pixel offset.

That half pixel offset!

Ok, so my favorite part starts – half pixel offsets! A source of pain, frustration, and misunderstanding, but also a super reasonable and robust way of representing texture and pixel coordinates. If you started graphics programming relatively recently (DX10+ era) or are not a graphics programmer – this might not be a big deal for you. But basically, with older graphics APIs framebuffer coordinates didn’t have a half texel offset, while the texture sampler expected it, so you had to add it manually. Sometimes people added it in the vertex shader, sometimes in the pixel shader, sometimes by setting up uniforms on the CPU… a complete mess; it was a source of endless bugs found almost every day, especially on video games shipping on multiple platforms / APIs!

What do we mean by half pixel offset?

If you have a 1D texture of size 4, what are your pixel/texel coordinates?

They can be [0, 1, 2, 3]. But GPUs use a convention of half pixel offsets, so they end up being [0.5, 1.5, 2.5, 3.5]. This translates to UVs, or “normalized” coordinates [0.5/4, 1.5/4, 2.5/4, 3.5/4], which spans a range of [0.5/width, 1 – 0.5/width].

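In code, the convention is simply (a tiny sketch):

import numpy as np

width = 4
texel_coords = np.arange(width) + 0.5       # [0.5, 1.5, 2.5, 3.5]
uvs = texel_coords / width                  # [0.125, 0.375, 0.625, 0.875]
# Going back from a UV to the closest integer texel index needs only a floor:
print(np.floor(uvs * width).astype(int))    # [0 1 2 3]
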
This representation seems counterintuitive at first, but what it provides us is a guarantee and convention that the image corners are placed at [0 and 1] normalized, or [0, width] unnormalized.

This is really good for resampling images and operating on images with different resolutions.

Let’s compare the two on the following diagrams:

Half pixel offset convention aligns pixel grids perfectly, by aligning their corners/edges.
No offset convention aligns the first pixel center perfectly – and in the case of 2x scaling, also every other pixel. But images “overlap” outside of the 0,1 range and are not symmetric!

While the half pixel offset convention aligns pixel corners, the other way of down/upsampling comes from aligning the first pixel centers in the image.

Now, let’s have a look at how we compute the bilinear upsampling weights in the half a pixel shift convention:

This convention makes it amazingly simple and obvious where the weights come from – and how simple the computation is once we align the grid corners. I personally use it even in APIs outside of the GPU shader realm – everything is easier. If adding and removing 0.5 adds a performance cost, it can be removed at the micro-optimization stage, but it usually doesn’t matter that much.

Reasonable default?

Half a pixel offset for pixel centers used in GPU convention for both pixels and texels is a reasonable default for any image processing code dealing with images of different resolutions.

This is especially important when dealing with textures of different resolutions and for example mip maps of non power of 2 textures. A texture with 9 texels instead of 4? No problem:

A texture with 9 texels aligns easily and perfectly with one with 4 texels. Very useful for graphics operations, where you want to abstract the texture resolutions away.

It makes sure that grids are aligned, and the up/downsampling operations “just work”. To get box/bilinear downsampling, you can just take a single bilinear tap of the source texture, the same with the upsampling.

It is so trivial to use that when you start graphics programming, you rarely think about it. Which is a double-edged sword – great as an easy entry point for beginners, but also a source of confusion once you start getting deeper into it and analyzing what’s going on, or doing things like fractional or nearest neighbor downsampling (or e.g. creating a non-interpolable depth map pyramid…).

Even if there were no other reasons, this is why I’d recommend treating phase shifting box downsample and the [0.25 0.75] / [0.75 0.25] upsamplers as your default when talking about bilinear as well.

Bonus advantage: having texel coordinates shifted by 0.5 means that if you want to get an integer coordinate – for example for texelFetch instruction – you don’t need to round. Floor / truncation (which in some settings can be a cheaper operation) gives you the closest pixel integer coordinate to index!

Note: Tensorflow got it wrong. The “align_corners” parameter aligns… centers of the corner pixels??? This is a really bad and weird naming plus design choice, where upsampling a [0.0 1.0] by factor of 2 produces [0, 1/3, 2/3, 1], which is something completely unexpected and different from either of the conventions I described here.

Signal processing – bilinear upsampling

I love writing about signal processing and analyzing signals also in the frequency domain, so let me explain here how you can model bilinear up/downsampling in the EE / signal processing framework.

Upsampling usually is represented as two operations: 1. Zero insertion and 2. Post filtering.

If you have never heard of this way of looking at it (especially the zero insertion), it’s most likely because nobody in practice (at least in graphics or image processing) implements it like this – it would be super wasteful to do it in such a sequence. 🙂

Zero insertion 

Zero insertion is an interesting, counter-intuitive operation. You insert zeros between each element (often multiplying the original ones by 2x to preserve the constant/average energy in the signal; or we can fold this multiplication in our filter later) and get 2x more samples, but they are not very “useful”. You have an image consisting of mostly “holes”…

In 2D, zero insertion makes every 2×2 quad contain one original pixel and three zeros.

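In code, 1D zero insertion is as simple as (a sketch, with the 2x energy-preserving scaling mentioned above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
upsampled = np.zeros(2 * len(x))
upsampled[0::2] = 2.0 * x     # scale by 2 to preserve the average / DC level
print(upsampled)              # [2. 0. 4. 0. 6. 0. 8. 0.]
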
I think that looking at it in 1D might be more insightful:

1D zero insertion – notice high frequency oscillations.

From this plot, we can immediately see that with zero insertion, there are many high frequencies that were not there before! All of those zeros create lots of high frequency content coming from alternating / “oscillating” between the original signal and zero. Filters that are “dilated” and have zeros in between coefficients (like a-trous / dilated convolution) are called comb filters – because they resemble comb teeth!

Let’s look at it from the spectral analysis. Zero insertion duplicates the frequency spectrum:

Upsampling by zero insertion duplicates the frequency spectrum.

Every frequency of the original signal is duplicated, but we know that there were no frequencies like this present in the smaller resolution image; it wasn’t possible to represent anything above its Nyquist! To fix that, we need to filter them out after this operation with a low pass filter:

To get properly looking image, we’d want to remove high frequencies from zero insertion by lowpass filtering.

I have shown some remainder frequency content on purpose, as it’s generally hard to do “perfect” lowpass filtering (and it’s also questionable if we’d want this – ringing problems etc).

Here is how the progressively filtered 1D signal looks – notice the high frequencies and “comb teeth” disappearing:

Notice how progressively more blurring causes the upsampled signal to lose the wrong high frequency comb teeth and converge to the 2x higher resolution original!

Here’s an animation of blurring/filtering the 2D image, showing how it also causes the zero-inserted image to become more and more like a properly upsampled one:

Blurring zero inserted image converges to upsampled one!

Looks like image blending, but it’s just blending filters – imo it’s pretty cool. 😎

Nearest neighbor -> box filter!

Obviously, the choice of the blur (or technically – lowpass) filter matters – a lot. Some interesting connection: what if we convolve this zero-inserted signal with a symmetric [0.5, 0.5] (or 1,1 if we didn’t multiply the signal by 2 when inserting zeros) filter?

Convolving image with a [1, 1] filter is the same as nearest neighbor filter!

The interesting part here is that we kind of “reinvented” the nearest neighbor filter! After a second of thought, this should be intuitive; a sample that is zero gets a contribution from its single non-zero neighbor, which is like a copy, while the sample that is non-zero is surrounded by two zeros, which don’t affect it.

We can see on the spectral / Fourier plot where the nearest neighbor hard edges and post-aliasing comes from (red part of the plot):

The nearest neighbor upsampling is also shifting the signal (because it is even number of samples) and will work well to undo the box downsampling filter, which fits the common intuition of replicating samples being the “reverse” of box filtering and causing no shift problem.

Bilinear upsampling take one – direct odd filter

Let’s have a look at how the strategy of “keep one sample, interpolate between” can be represented in this framework.

It’s equivalent to filtering our zero-upsampled image with a [0.25 0.5 0.25] filter.

The problem is that in such a setup, if we multiply the weights by two (to keep the average signal the same) and then by the zeros (where the signal is zero), we get alternating [0.0 1.0 0.0] and [0.5 0.0 0.5] filters, with very different frequency response and variance reduction… I’ll reference you here again to my previous blog post, but basically you get alternating 1.0 and 0.5 of the original signal variance (sum of effective weights squared).

Bilinear upsampling take two – two even filters

The second approach of alternating weights of [0.25 0.75] can be seen as simply: nearest neighbor upsampling – a filter of [0.5 0.5], and then [0.25 0.5 0.25] filtering!

This sequence of two convolutions gives us an effective kernel of [0.125 0.375 0.375 0.125] on the zero-inserted image, so if we multiply it by 2 we get simply alternating [0.25 0.0 0.75 0.0] and [0.0 0.75 0.0 0.25]. Corners-aligned bilinear upsampling (the standard bilinear upsampling on the GPU) is exactly the same as the “magic kernel”! 🙂 This is also the second, more complicated explanation of the bilinear 0.25 / 0.75 weights that I promised.

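You can verify this chain of convolutions in a single line (a sketch):

import numpy as np

print(np.convolve([0.5, 0.5], [0.25, 0.5, 0.25]))   # -> [0.125 0.375 0.375 0.125]
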
The advantage of it is that with effective weights of [0.25 0.75] and [0.75 0.25] (ignoring zeros) on alternating pixels, they have the same amount of filtering and a variance reduction of 0.625 – very important!

This is how the combined frequency response compares to the previous one: 

Two ways of bilinear upsampling both leave aliasing (everything after the dashed line half Nyquist), as well as blur the signal (everything before it).

So as expected, more blurring, less aliasing, consistent behavior between pixels.

Neither is perfect, but the even one will generally cause you less “problems”.

Signal processing – bilinear downsampling

By comparison, downsampling process should be a bit more familiar to readers who have done some computer graphics or image processing and know of aliasing in this context.

Downsampling consists of two steps in opposite order: 1. Filtering the signal. 2. Decimating the signal by discarding every other sample.

The ordering and step no. 1 are important, as the second step, decimation, is equivalent to (re)sampling. If we don’t filter out the signal spectrum above frequencies representable in the new resolution, we are going to end up with aliasing – a folding back of frequencies above half of the previous Nyquist:

When decimating, original signal frequencies will alias, appearing as wrong ones after decimation. To prevent aliasing, you generally want to prefilter the image with a strong antialiasing – lowpass – filter.

This is the aliasing the nearest-neighbor (no filtering) image downsampling causes:

Aliasing manifests as wrong frequencies; notice on the bottom plot how end of the spectrum looks like 2x smaller frequency than before decimation.

Bilinear downsampling take one – even bilinear filter

The first antialiasing filter we’d want to analyze is our old friend “linear in a box disguise”, the [0.5, 0.5] filter. It is definitely imperfect, and we can see both blurring and some leftover aliasing:

The graphics community realized this a while ago – when doing a series of downsamples for post-processing, for example for bloom / glare, the default box/tent/bilinear filters are pretty bad. Even small aliasing like this can be really bad when it gets “blown up” to the whole screen, and especially in motion. It was even a large chunk of Siggraph presentations, like this excellent one from my friend Jorge Jimenez.

I also had a personal stab at addressing it early in my career, and even described the idea – weird cross filter (because it was fast on the GPU) – please don’t do it, it’s a bad idea and very outdated! 🙂 

Bilinear downsampling take two – odd bilinear filter

By comparison, the odd bilinear filter (that doesn’t shift the phase) offers a slightly different trade-off:


Less aliasing, more blurring. It might be better for many cases, but the trade-offs from breaking the half-pixel / corners aligned convention are IMO unacceptable. And it’s also more costly (not possible to do a single tap 2x downsampling).

To get better results you’ll need more samples, some of them with negative lobes. And you can design an even filter with more samples too, for example an even Lanczos:

It’s possible to design better downsampling filters. This is just an example, as it’s an art and craft of its own (on top of the hard science). 🙂

Side note – different trade-offs for up/downsampling?

One interesting thing that has occurred to me on a few occasions is that the trade-offs for low pass filtering for upsampling and downsampling are different. If you use a “perfect” upsampling lowpass filter, you will end up with nasty ringing.

This is typically not the case for downsampling. So you can opt for a sharper filter when downsampling, and a less sharp for upsampling, and this is what Photoshop suggests as well:

Photoshop also suggests a smoother / more blurry filter for upsampling, and a sharper (closer to “perfect”) lowpass filter for downsampling, because ringing / halos tend not to be as much of a problem there as in the case of upsampling.

Conclusions

I hope that my blog post helped to clarify some common confusions coming from using the same, very broad terms to represent some different operations.

A few of main takeaways that I’d like to emphasize would be:

  1. There are a few ways of doing bilinear upsampling and downsampling. Make sure that whatever you use uses the same convention and doesn’t shift your image after down/upsampling.
  2. Half pixel center offset is a very convenient convention. It ensures that image borders and corners are aligned. It is default on the GPU and happens automatically. When working on the CPU/DSP, it’s worth using the same convention.
  3. Different ways of upsampling/downsampling have different frequency response, and different aliasing, sometimes varying on alternating pixels. If you care about it (and you should!), look more closely into which operation you choose and optimal performance/aliasing/smoothing tradeoffs.

I wish more programmers were aware of those challenges and we’d never again hit bugs due to inconsistent coordinate and phase shifts between different operations or libraries… I also wish we would never see those “triangular” or jagged aliasing artifacts in images, but bilinear upsampling is so cheap and useful that instead we should simply be aware of the potential problems and proactively address them.

To finish this section, I would again encourage you to read my previous blog post on some alternatives to bilinear sampling.

PS. What was my bug that I mentioned at the beginning of the post? Oh, it was a simple “off by one” – in numpy, when convolving with np.signal.convolve1d and 2d, I assumed the wrong “direction” of the convolution for even filters. A subtle bug, but it was shifting everything by one pixel after a sequence of downsamples and upsamples. Oops. 😅


Is this a branch?

Let’s try a new format – “shorts”; small blog posts where I elaborate on ideas that I’d discuss at my twitter, but they either come back over and over, or the short form doesn’t convey all the nuances.

I often see advice about avoiding “if” statements in the code (especially GPU shaders) at all costs. The worst of all is when this happens in glsl and programmers replace some simple if statements with (subjectively) horrible step instruction which IMO completely destroys all readability, or even worse – sequences of mixes, maxes, clips and extra ALU.

After all, those ifs are branches, and branches are bad, right?

TLDR: No, definitely not – on all accounts.

Why do programmers want to avoid branches?

Let’s first address why programmers would like to avoid branches. There are some good reasons for that.

Scalar CPU

On the CPU, “branches” in non-SIMD code might seem “innocent” – it’s just a conditional jump to some other code. Instead of executing the instruction at the next address, we jump to an instruction at some other address. Unfortunately, there are many complicated things going on under the hood: all modern CPUs are heavily pipelined, superscalar, and out-of-order.

All modern CPUs are superscalar and deeply pipelined. Many instructions are processed and executed at the same time. Source: wikipedia

To extract maximum efficiency, they use a model of speculative execution – a CPU “gambles” and “guesses” results of a branch ahead of time, not knowing if it will be taken or not, and starts executing those instructions. If it’s right – great! However if it’s not, then the CPU needs to stall, cancel all the guesses, go back to the branch, and execute the other instructions. Oops.

Luckily, CPUs are usually very good at this “guessing” (they actually analyze past branch patterns and take them into account!). On the other hand, without going into too much detail – each branch consumes some of the branch predictor unit resources. One extra branch on some short, innocent code can make results of the branch prediction worse, and slow down other code (where branch predictor might be necessary for good performance!).

SIMD and the GPU

When considering SIMD or the GPU, the situation is even worse.

Simplifying – single program and same instruction stream executes on multiple “lanes” of the same data, processing 4/8/16/32/64 elements at the same time.

If you want to take a branch where some of the elements take one path and some another one, you have two main options: either abandon vectorization and scalarize the code (a very bad option; it usually means that it’s not worth vectorizing a given piece of code), or execute both code paths – first the first one, store the results somewhere, then execute the second one, and “combine” the results.

This is a general (over)simplification, and for example GPUs have dedicated hardware to help do it efficiently and without wasting resources (hardware execution masks etc.), but it generally means the extra cost of managing those masks, storing results somewhere, double execution costs, and finally – branches can serve as barriers, preventing some latency hiding (see my old blog post on it).

Combined

Given those two, avoiding branches might seem reasonable, and the common advice of avoiding ifs might make sense. But whether it makes sense to avoid a branch or not is more complicated (more on it later), and the second assumption (that an if is always a branch) is simply not the case.

Are if statements branches?

Let’s clear the worst misconception.

When you write an “if” in your code, the compiler won’t necessarily generate a branch with a jump.

I will focus now on CPUs, mostly because how easy it is to test it with Matt Godbolt’s amazing compiler explorer. 🙂 (Plus how there are two mainstream architectures, unlike many different shader ISAs)

Let’s have a look at the following simple integer function.

int is_this_a_branch(int v, int w) {
    if (v < 10) {
        return v + w; 
    } 
    else {
        return 2 * v - 2 * w;
    }
}

I tested it with 3 x64 compilers (GCC x64 trunk, Clang x64 trunk, MSVC 19.28) and one ARM (armv8-a clang), and only MSVC generates an actual branch there.

clang              
        mov     eax, edi
        sub     eax, esi
        add     esi, edi
        add     eax, eax
        cmp     edi, 10
        cmovl   eax, esi
        ret
gcc
        mov     eax, edi
        lea     edx, [rdi+rsi]
        sub     eax, esi
        add     eax, eax
        cmp     edi, 9
        cmovle  eax, edx
        ret
msvc        
        cmp     ecx, 10
        jge     SHORT $LN2@is_this_a_   BRANCH!!!
        lea     eax, DWORD PTR [rcx+rdx]
        ret     0
$LN2@is_this_a_:
        sub     ecx, edx
        lea     eax, DWORD PTR [rcx+rcx]
        ret     0

arm clang
        sub     w9, w0, w1
        add     w8, w1, w0
        lsl     w9, w9, #1
        cmp     w0, #10                         // =10
        csel    w0, w8, w9, lt

When using floats:

float is_this_a_branch2(float v, float w) {
    if (v < 10.0f) {
        return v + w; 
    } 
    else {
        return 2.0f * v - 2.0f * w;
    }
}

Now 2 out of 4 compilers generate a branch:

clang           
        movaps  xmm2, xmm0
        addss   xmm2, xmm1
        movaps  xmm3, xmm0
        addss   xmm3, xmm0
        addss   xmm1, xmm1
        cmpltss xmm0, dword ptr [rip + .LCPI1_0]
        subss   xmm3, xmm1
        andps   xmm2, xmm0
        andnps  xmm0, xmm3
        orps    xmm0, xmm2
        ret

gcc
        movss   xmm2, DWORD PTR .LC0[rip]
        comiss  xmm2, xmm0
        jbe     .L10  BRANCH!!!

        addss   xmm0, xmm1
        ret
.L10:
        addss   xmm1, xmm1
        addss   xmm0, xmm0
        subss   xmm0, xmm1
        ret

msvc
        movss   xmm2, DWORD PTR __real@41200000
        comiss  xmm2, xmm0
        jbe     SHORT $LN2@is_this_a_  BRANCH!!!
        addss   xmm0, xmm1
        ret     0
$LN2@is_this_a_:
        addss   xmm0, xmm0
        addss   xmm1, xmm1
        subss   xmm0, xmm1
        ret     0

arm clang
        fadd    s2, s0, s1
        fadd    s3, s0, s0
        fadd    s1, s1, s1
        fmov    s4, #10.00000000
        fsub    s1, s3, s1
        fcmp    s0, s4
        fcsel   s0, s2, s1, mi
        ret

What are those conditionals and selects?

This is basically the CPUs’ and compilers’ way of “serializing” both code paths. If the compiler considers the overhead of a branch larger than the amount of work it can save, it makes sense to compute both code paths, and use a single instruction to select which one is the actual result.

This obviously “depends” on the use case and on whether it is even “legal” for the compiler to do such a transformation (think “side effects”; or if the compiler added reads of memory that the original code would not have executed, it could be unsafe and lead to a segmentation fault).

Does writing ternary conditional help avoid branches?

No.

This is common advice, but it’s generally not true. If you look at the above example rewritten slightly:

float is_this_a_branch3(float v, float w) {
    return v < 10.0f ? (v + w) : (2.0f * v - 2.0f * w);
}

The generated code is identical (!).

It might be possible (?) that some compilers do generate branchless code for code like this, but giving this as general advice makes no sense. Whether you write an actual if or a ternary expression, it’s still a high level conditional, just expressed differently.

What about the GPU?

On the GPU, it’s the same; the compiler will generate conditional selects when appropriate.

I used Tim Jones Shader Playground and AMD ISA (mostly because I know it relatively better than the others), and the unfortunate step function:

#version 460

layout (location = 0) out vec4 fragColor;
layout (location = 0) in float inColor;

void main()
{
#if 1
	fragColor = vec4(step(1.0, inColor));
#else
    float x;
    if (inColor < 1.0)
        x = 1.0;
   	else
        x = 0.0;
   	fragColor = vec4(x);
#endif
}

See for yourself, both versions generate identical assembly:

  s_mov_b32     m0, s3
  s_nop         0x0000
  v_interp_p1_f32  v0, v0, attr0.x
  v_interp_p2_f32  v0, v1, attr0.x
  v_cmp_gt_f32  vcc, 1.0, v0
  v_cndmask_b32  v0, 1.0, 0, vcc
  v_cvt_pkrtz_f16_f32  v0, v0, v0
  exp           mrt0, v0, v0, v0, v0 done compr vm

When using fastmath and not obeying strict IEEE float specification (which is standard for compiling shaders for real time graphics), compilers can even generate other instructions like min or max from your if statements. But YMMV and be sure to check the assembly.

So generally – using “step” just to avoid if statements is pointless.

If you like it for your coding style, then it’s obviously up to you, use as much as you like just for this reason. But please consider programmers who don’t have as much shader or GLSL experience and will be always puzzled by it. 🙂

Are branches always to be avoided?

No.

This is way beyond my aimed “short” format, but branches are not necessarily “bad”.

They can be actually an awesome optimization, even on the GPU!

If you start to use a lot of conditionals

Lots of advice on how branches are bad and have to be avoided comes from a prehistoric era of old CPUs, GPUs, and compilers; advice that is passed on without (re)verifying – which is understandable, I am not re-checking my whole knowledge or intuition every few months either. 🙂

But the CPU/GPU/compiler architects improved designs significantly and adapted them to general, highly branchy code, and whether there is a penalty or not depends on way too many details to list or give advice.

Rule of thumb: if a branch is generally coherent, it makes sense to use it when it allows you to save on memory bandwidth or cache utilization. On GPUs it is usually worth adding branches if you can avoid texture fetches this way with high probability and coherence. And it’s usually better to have some conditional masks and selects around simple/short sequences of arithmetic operations.

Final advice

  1. Don’t assume that an “if” statement will generate a branch. For most simple “if” statements, the compiler can and will use conditional selects. Compilers can do many powerful transformations, sometimes ones that you wouldn’t even consider.
  2. Unless writing performance critical code, optimize for code readability. Don’t use obscure instructions and weird arithmetic.
  3. If you care about the performance of a hot section, always read the resulting assembly and verify the compiler output.
  4. Don’t assume that a branch is bad. Profile your code – both in microbenchmarks, as well as in surrounding context (effects on cache, memory bandwidth, and branch predictor).

Converting wavetables to Ableton Operator AMS waves

This blog post comes with Ableton Operator AMS “wavetables” here.

In Ableton’s FM synth you can use different types of oscillator waves as your operators (both carriers as well as modulators), as well as draw custom ones:

What is not very well known is that you can save those as AMS files (and obviously also load them). The idea for this small project and a blog post came from a video about Ableton Live tips and tricks from Cymatics where I learned about this possibility.

If Operator can write/read such files, this means that it’s also possible to easily create one with code, and thus to use the harmonic information of arbitrary wavetables!

Note: This post is going to be slightly less technical than most of mine – as it is targeted towards electronic musicians using such software and not software engineers.

Here you can download some Operator wavetables from the harmonics of Native Instruments Massive wavetables I found in an online wavetable pack. I’ve tested those AMS files in Operator in Ableton Live 10 Suite.

To use them, just drag a file and drop it onto the Oscillator tab in Ableton Operator.

This way you can use any wavetable harmonic information as your FM synthesis operator oscillators – for both carriers and modulators!

I also include a Python script that I used to convert those (note: it’s quite naive and hacky, so some programming knowledge and debugging might be required; when programming at home I go with the least effort), so you can use it yourself on some other wavetables and convert them to AMS files.

Disclaimer / legal note: I don’t know the legal status of those Massive wavetables that I downloaded and converted, nor can I verify how legit they are (they sound pretty accurate to me). IANAL, but I don’t think that sound data that can be extracted / generated by legal owners playing Massive is copyrightable in any way – but anyway, I’m not going to share the wavetables themselves.

The files I uploaded are simply text descriptions of harmonics, lack phase information (more about it below), and don’t use any of the wavetables directly. The format looks like this:

AMS 1
Generate MultiCycleFM
BaseFreq 261.625549
RootKey 60
SampleRate 44100
Channels 1
BitsPerSample 32
Volume Auto
Sine 1 0.73
Sine 2 0.38
Sine 3 0.30
Sine 4 0.32
Sine 5 0.51
Sine 6 1.00
Sine 7 0.23
Sine 8 0.09
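
For illustration, here is a minimal Python sketch of how such a file could be generated from a list of harmonic amplitudes. This is not the exact script from the download; the header values simply mirror the example above, and normalizing to the loudest harmonic is my assumption:

def write_ams(path, harmonics):
    # Header fields mirror the example file above.
    lines = [
        "AMS 1",
        "Generate MultiCycleFM",
        "BaseFreq 261.625549",
        "RootKey 60",
        "SampleRate 44100",
        "Channels 1",
        "BitsPerSample 32",
        "Volume Auto",
    ]
    # One "Sine <index> <amplitude>" line per harmonic, normalized to the strongest one.
    peak = max(harmonics) if max(harmonics) > 0 else 1.0
    for index, amplitude in enumerate(harmonics, start=1):
        lines.append("Sine %d %.2f" % (index, amplitude / peak))
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Example: the harmonic amplitudes listed above.
write_ams("example.ams", [0.73, 0.38, 0.30, 0.32, 0.51, 1.00, 0.23, 0.09])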

Why not just use some Wavetable synthesizer (like NI Massive, Ableton Wavetable, Xfer Serum etc.)?

You definitely can and should, they are excellent soft synths! This is just a fun experiment and an alternative.

However using those in Operator gives you:

  • Very low CPU usage (in my experience most optimized of Ableton’s synths),
  • Simplicity and immediacy of Operator,
  • Possibility to tweak or change existing Operator FM patches by just swapping the oscillator waves,
  • Opportunity to use wavetables as modulators or carriers and possibility of doing weird FM modulations of wavetables in different configurations including self-feedback,
  • A quick inspiration point – open Operator and just start loading those user waves for different weird sounds and play with modulating them one by another. A subtle organ chord modulated by a formant growl? Why not!

At the same time you don’t get wavetable interpolation (you can approximate it with fading between different operators, but it’s definitely not the same) and generally don’t get other benefits of those dedicated wavetable synths.

So it’s a new tool (or a toy), but not a replacement.

How additive synthesis works

While Operator is an FM synth, each of its oscillators uses additive synthesis to generate its waveform.

Additive synthesis adds together different harmonics (multiples of the base frequency) to approximate more complicated sounds.

If we look at a simple square wave:

According to the (oversimplified here) Fourier theorem, any repeating signal can also be represented as a sum of sine and cosine waves – for example, for the square wave:

1 sin(w) + 1/3 sin(3w) + 1/5 sin(5w) + …

This is known as a harmonic series. The amplitudes of the harmonics (those 1, 1/3, 1/5 etc. multipliers) and the type of harmonics present (notice that 1w, 3w, 5w are all odd! This is typical of square waves and their “oboe”-like sound, as opposed to saw waves, which contain both odd and even harmonics) represent the timbre of the sound.

Additive synthesis is simply summing up different harmonics represented as sines of increasing frequency multiplied by a precomputed amplitude.
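
As a minimal sketch (in Python/numpy, with an arbitrary number of harmonics), reconstructing one cycle of a square wave this way looks roughly like:

import numpy as np

def additive_square(num_harmonics, length=2048):
    # Sum odd harmonics (1, 3, 5, ...) with 1/n amplitudes - the square wave harmonic series.
    t = np.arange(length) / length  # one cycle, normalized time
    wave = np.zeros(length)
    for n in range(1, 2 * num_harmonics, 2):
        wave += np.sin(2.0 * np.pi * n * t) / n
    return wave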

A square wave might not be the best example because of the so-called Gibbs phenomenon (those nasty oscillations at discontinuities / jumps of the signal), but the reconstruction still looks (and will sound) similar.

Additive synthesis uses this concept: Instead of storing and representing waves faithfully, you can store the amplitude of harmonics.

This allows for very efficient data compression! Many waveforms can be represented very well with just a few numbers. The limitation to 64 harmonics might seem like a big one, but on the other hand this is the same as 6 octaves – so from a C0 note you can still get a C6 tone, not bad for most practical uses.

As I mentioned, the representation in Ableton’s Operator doesn’t use the wave phase (the relative shift of those sine waves compared to one another). Without using the exact phase (let’s assume just random phase for fun), the square wave reconstruction could look like this:

With random (or simply wrong) phase, our signal doesn’t “look” like a square anymore… But will still sound like one!

Not like a square at all! But… It will sound generally exactly like a square wave. So does it matter? No. Yes. Kind of. More on that later!

How to get harmonics from any wavetable?

Ok, so let’s say we have a wavetable with 2048 samples – how can we get harmonics for it? The answer is the Discrete Fourier Transform, and in general a process known as spectral decomposition.

It takes our 2048-sample signal and produces 2048 complex numbers representing shifted sine/cosine waves. The details are beyond the scope of this post, but for “real” signals (like audio samples), and ignoring the phase information, what you can extract out of it is simply 1023 amplitudes of different harmonics plus information about the “constant term” (the average value, also known in audio as the DC term).

Let’s take some simple wavetable (a recording of a TB-303 wave with its famous diode ladder filter applied):

When we run it through FFT and take the amplitude of the complex numbers, we get the information about the frequency and harmonic content:

From this plot, translating it to the format used by Ableton is very straightforward. Literally take the harmonic index (1, 2, 3, …) and the amplitude value – and that is all, the amplitude is the harmonic content for a given harmonic index! Unsurprisingly this can be coded up in a ~dozen lines of code. 🙂
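
A minimal sketch of this step (not the exact script I used; it assumes a single-cycle wavetable stored in a numpy array):

import numpy as np

def wavetable_to_harmonics(wave, num_harmonics=64):
    # rfft of a real signal: bin 0 is the DC term, bin k is the k-th harmonic.
    spectrum = np.fft.rfft(wave)
    amplitudes = np.abs(spectrum[1:num_harmonics + 1])  # skip DC, keep the first 64 harmonics
    peak = np.max(amplitudes)
    return amplitudes / peak if peak > 0 else amplitudes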

In this case, we are going to discard the phase information, but here’s how the phase looks, for the curious:

Here’s a comparison of the original (left) with the reconstructed:

Left: Original wave. Right: reconstruction from additive harmonics with the phase information missing.

Notice that the waves are definitely different after discarding the phase. An interesting characteristic of the reconstructed wave is that it is (and has to be!) “symmetrical” due to the use of sine waves and no phase information.

But if you listen to a wav file, then again, they should sound +/- identical.

Does the phase information matter?

No. Yes. Kind of.

Hmm, the answer is complicated. 🙂 

Generally when talking about the timbre of continuous tones (and not transients or changing signals), the human ear cannot tell the phase difference.

If it could, then when we move around (which changes the phase), sounds would change their timbre drastically. Similarly, for example every single EQ (including linear phase EQs) changes the phase of sounds in addition to their amplitude.

But the human hearing does use the phase information, just in a different way – the difference between the phase of signals between two ears is super useful for stereo localization of sounds.

Generally it would seem that the phase doesn’t matter so much?

It doesn’t matter if we are talking about a) linear processes and b) single signal sources.

If you have two copies of a signal, even subtle phase shifts between them can introduce some cancellations (this is +/- how phaser effects work). In this case, the effect is subtle (as it is very signal dependent), but notice how the phase cancellation reduces some of the harmonics:

Additionally, when introducing non-linearities like distortion or saturation, interactions of different harmonics and different “local” amplitudes can result in different distortion. Let’s have a look again at the waves, with added lines corresponding to some clipping points:

When soft clipping against the black lines, saturated signals will produce different harmonics.

For example after feeding through a “saturator” (I used a hyperbolic tangent function) the waves will be very different:

Left: Original wave fed through saturator. Right: additive harmonics reconstruction fed through saturator.

Notice the asymmetric soft clipping (rounding of the peaks) of the original vs symmetric on the reconstruction.

It results in different harmonics introduced through saturation:

It’s not that one is more saturated than the other, just that the harmonics are different. This goes into symmetric / asymmetric clipping, tube vs transistor, odd vs even harmonics etc. – a very deep topic on which I’m not an expert.

This can and will sound different, hear for yourself.
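
If you want to reproduce this comparison, the “saturator” above is just a hyperbolic tangent – a minimal sketch (the drive amount is an arbitrary choice):

import numpy as np

def saturate(wave, drive=3.0):
    # Soft clipping: tanh compresses the peaks smoothly; divide so full scale still maps to ~1.0.
    return np.tanh(drive * wave) / np.tanh(drive)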

Note: those phase differences affecting distortion so much are one of the reasons why it’s hard to get proper sounding emulations of hardware – for example, 303 emulations done with simple analog synths (saw/square waves and a generic filter) followed by a distortion will not sound exactly like the squelchy acid we love.

So generally for raw, unprocessed sounds the phase often can be ignored, but for further processing it can be necessary and YMMV. 

Conclusions

Despite the lack of phase encoding or wavetable interpolation, I still think using additive synthesis like in Operator for emulating more complex wavetable sounds is a cool, inspiring tool and a fun toy to play with. Next time you’re looking for some weird tones that could be a starting point for a track, why not have fun modulating different weird additive oscillators?

Bonus usage

Here’s a weird bonus: you can use those AMS files inside Ableton’s Simpler/Sampler as the single cycle waveforms, just drag and drop those. It doesn’t make any sense to me though. 🙂  

Side note: a bug in Ableton?

On another note, I think I found a bug in Ableton (when trying to figure out the format used by AMS files). When I save a slightly modified square wave and then load it, I get completely different results:

Notice the wave shape on the second screenshot – it’s completely wrong (the loaded one also sounds horrible…).

I think I know where the bug comes from – the loaded AMS files are treated as containing the absolute amplitude of a given harmonic, while the default waves’ visual representation (and the values Operator saves to AMS) have a 1/harmonic-index normalization applied. Kinda weird – I can imagine why they would include this normalization from the user’s perspective, but I’m not sure why the exporter/importer round-trip is wrong.

So don’t worry if your square waves look different from Ableton’s default ones.


Why are video games graphics (still) a challenge? Productionizing rendering algorithms

Intro

This post will cover challenges and aspects of production to consider when creating new rendering / graphics techniques and algorithms – especially in the context of applied research for real time rendering. I will base this on my personal experiences, working on Witcher 2, Assassin’s Creed 4: Black Flag, Far Cry 4, and God of War.

Many of those challenges are easily ignored – they are real problems in production, but not necessarily visible if you only read about those techniques, work on pure research, write papers, or create tech demos.

I have seen statements like “why is this brilliant research technique X not used in production?” both from gamers and from my colleagues with an academic background. And there are always some good reasons!

The post is also inspired by my joke tweet from a while ago about appearing smart and mature as a graphics programmer by “dismissing” most of the rendering techniques – that they are not going to work on foliage. 🙂 And yes, I will come back to vegetation rendering a few times in this post.

I tend to think of this topic as well when I hear discussions about how “photogrammetry, raytracing, neural rendering, [insert any other new hot topic] will be a universal answer to rendering and replace everything!”. Spoiler alert: not gonna happen (soon).

Target audience

The target audience of this post is:

  • Students in computer graphics and applied researchers,
  • Rendering engineers, especially ones earlier in their career – who haven’t built their intuition yet,
  • Tech artists and art leads / producers,
  • Technical directors and decision makers without background in graphics,
  • Hardware engineers and architects working on anything GPU or graphics related (and curious what makes it complicated to use new HW features),
  • People who are excited and care about game graphics (or real time graphics in general) and would like to understand a bit “how sausages are made”. Some concepts might be too technical and too much jargon, but then feel free to skip those.

Note that I didn’t place “pure” academic researchers on the above list – as I don’t think that pure research should be considering too many obstacles. The role of fundamental research is inspiration and creating theory that can later be productionized by people who are experts in productionization.

But if you are a pure researcher and somehow got here, I’ll be happy if you’re interested in what kinds of problems might be on the long way from idea or paper to a product (and why most new genuinely good research will never find its place in products).

How to interpret the guide

Very important note – none of the “obstacles” I am going to describe are deal breakers.

Far from it – most successful tech that became state of the art violates many of those constraints! It simply means that those are challenges that will need to be overcome in some way – manual workarounds, feature exclusivity, ignoring the problems, or applying them only in specific cases.

I am going to describe first the use-cases – potential uses of the technology and how those impact potential requirements and constraints.

Use case

The categories of “use cases” deserve some explanation and description of “severity” of their constraints.

Tech demo 

Tech demo is the easiest category. If your whole product is a demonstration of a given technique (whether for benchmarking, showing off some new research, artistic demoscene), most of the considerations go away.

You can actually retrofit everything: from the demo concept and the art itself, to the camera trajectory – all to show off the technology the best and avoid any problems.

The main considerations will be around performance (a choppy tech demo can be seen as a tech failure) and working very closely with artists able to show it off.

The rest? Hack away, write one-off code – just don’t have expectations that turning a tech demo into a production ready feature is simple or easy (it’s more like the 99% of work remaining).

Special level / one-off

The next level “up” in the difficulty is creating some special features that are one-off. It can be some visual special effect happening in a single scene, game intro, or a single level that is different from the other ones. In this case, a feature doesn’t need to be very “robust”, and often replaces many others.

An example could be lighting in the jungle levels in Assassin’s Creed 4: Black Flag that I worked on. 

Source: Assassin’s Creed 4: Black Flag promo art. Notice the caustics-like lightshafts that were a key rendering feature in jungle levels – and allowed us to save a lot on the shadows rendering!

Jungles were “cut off” from the rest of the open world by special streaming corridors and we completely replaced the whole lighting in them! Instead of relying on tree shadow maps and global lighting, we created fake “caustics” that looked much softer and played very nicely with our volumetric lighting / atmospherics system. They not only looked better, but also were much faster – obviously worked only because of those special conditions.

Cinematics

A slightly more demanding type of feature is a cinematic-only one. Cinematics are a bit different as they can be very strictly controlled by cinematic artists and most of their aspects like lighting, character placement, or animations are faked – just like in regular cinema! Cinematics often feature fast camera cuts (useful to hide any transitions/popping) and have a larger computational budget due to their more predictable nature (even in 60fps console games, cinematics are often rendered at 30fps).

Witcher 2 cinematic featuring higher character LODs, nice realistic large radius bokeh and custom lighting – but notice how few objects there are to render!

Regular rendering feature

“Regular” features – lighting, particles, geometry rendering – are the hardest category. They need to either be very easy to implement (most of the obstacles / problems solved easily), provide huge benefits surpassing the state of the art by far to justify the adoption pain, or have some very excited team pushing for them (never underestimate the drive of engineers or artists that really want to see something happen!).

Most of my post will focus on those. 

Key / differentiating feature

Paradoxically, if something is a key or differentiating feature, this can alleviate many of the difficulties. Let’s take VR – there, stereo, performance (low latency), and perfect AA with no smudging (so rather forget about TAA) are THE features and absolutely core to the experience. This means that you can completely ignore for example rendering foliage or some animations that would look uncanny – as being immersed and the VR experience of being there are much more important!

Feature compatibility

Let’s have a look at compatibility of a newly developed feature with some other common “features” (the distinction between “features”, and the next large section “pipeline” is fuzzy).

Features are not the most important of challenges – arguably the category I’m going to cover at the end (the “process”) is. But those are fun and easy to look at! 

Dense geometry

Dense geometry like foliage – a “feature” that inspired this post – is an enemy of most rendering algorithms.

The main problem is that with very dense geometry (lots of overlapping and small triangles), many “optimizations” and assumptions become impossible.

Early Z and occlusion culling? Very hard. 

Decoupling surfaces from volumes? Very hard.

Storing information per unique surface parametrization? Close to impossible.

Amount of vertices to animate and pixels to shade? Huge, shaders need simplification!

Witcher 2 foliage – that density! Still IMO one of the best forests in any game.

Dense geometry through which you can see (like grass blades) is incompatible with many techniques, for example lightmapping (storing a precomputed lighting per every grass blade texel would be very costly).

If a game has a tree here and there or is placed in a city, this might not be a problem. But for any “natural” environment, a big chunk of the productionization of any feature is going to be getting it to coexist well with foliage.

Alpha testing

Alpha testing is an extension of the above, as it disables even more hardware features / optimizations.

Alpha testing is a technique where a pixel shader evaluates an “alpha” value from a texture or from computations and, based on some fixed threshold, the pixel doesn’t get rendered/written.

It is much faster than alpha blending, but it disables for example early Z writes (early Z tests are ok), and with raytracing it requires hit shaders that read a texture to decide if a given texel was opaque or not.

It also makes antialiasing very challenging (forget about regular MSAA – you have to resort to tricks like alpha to coverage…).

For a description and a great visual explanation of some of the problems, see this blog post by Ben Golus.

Animation – skeletal

Most animators work with “skeletal animations”. Creating rigs, skinning meshes, animating skeletons. When you create a new technique for rendering meshes that relies on some heavy precomputations, would animators be able to “deform” it? Would they be able to plug it into a complicated animation blending system? How does it fit in their workflow?

Note that it can also mean rigid deformations, like a rotating wheel – it’s much cheaper to render complicated objects as a skinned whole, than splitting them.

And animation is a must, cannot be an afterthought in any commercial project.

Witcher 2 trebuchets were not people, but also had “skeletons” and “bones” and were using skeletal animations!

Animation – procedural and non-rigid

The next category of animations are “procedural” and non-rigid. Procedural animations are useful for any animation that is “endless”, relatively simple, and shouldn’t loop too visibly. The most common example is leaf shimmer and branch movement.

See this video of middleware Speedtree movement – all movement there is described by some mathematical formulas, not animated “by hand” and looks fantastic and plausible.

A good rendering technique that is applicable on elements like foliage (again!) needs to support the option of displacing it in any arbitrary fashion from simple shimmer to large bends – otherwise the world will look “dead”.

Non-rigid animation, modifying vertex positions, or even streaming whole vertex buffers causes headaches for raytracing – as it requires readjusting the spatial acceleration structures (essential for RT) every single frame, which can be costly or impractical.

Animation – textures

Yet another type of animation is animating textures on surfaces of objects. This is not just for emulating neons or TVs, but also for example raindrops and rain ripples. Technical and FX artists have lots of amazing tricks there, from just sliding and scaling UVs, having flowmaps, to directly animating “flipbook” textures.

Assassin’s Creed 4 rain was based on displacing normals with a procedurally generated rain heightmap and texture. See my Digital Dragons talk on rain rendering there.

Dynamic viewpoint

Many techniques work well assuming a relatively fixed camera viewpoint. From artists tricks and hacks, to techniques like impostor rendering.

Some rendering acceleration techniques optimize for a semi-constrained view (like Project Seurat that my colleagues worked on). Degrees of camera freedom are something to be considered when adapting any technique. A billboard-based tree can look ok from a distance, but if you get closer, or can see the scene from a higher viewpoint, it will break completely.

Also – think of early photogrammetry that was capturing the specular reflections as textures, which look absolutely wrong when you change the viewpoint even slightly.

Dynamic lighting

How dynamic is the lighting? Is there a dynamic time of day? Weather system? Can the user turn on/off lights? Do special effects cast lights? Are there dynamic shadow casting objects?

The answer is going to be “yes” to many of those questions for most of the contemporary real-time rendering products like games; especially with a tendency of creating larger, living worlds.

This doesn’t necessarily preclude precomputed/baked solutions (like our precomputed GI solution for AC4 that supported dynamic time of day!) but needs extra considerations.

There are still many new publications coming out that describe new approaches and solutions to those problems (combining precomputations of the light transport, and the dynamic lighting).

Shadows

Shadows, another never ending topic and an unsolved problem. Most games still use a mixture of precomputed shadows and shadow maps, some start to experiment with raytraced shadows.

Anytime you want to insert a new type of object to be rendered, you need to consider: is it going to be able to receive shadows from other objects? Is it going to cast shadows on other objects?

For particles or volumetrics, the answer might not be easy (as partial transmittance is not supported by shadow maps), but also “simple” techniques like mesh tessellation or parallax occlusion mapping are going to create a mismatch between shadow caster and the receiver, potentially causing shadowing artifacts!

A slide from my GDC 2014 talk showing parallax occlusion mapping in AC4 – notice complete lack of shadows there. 🙂

Dynamic environment

Finally, if the environment can be dynamic (through game story mandated changes, destruction, or in the extreme case through user content creation), any techniques relying on offline precomputation become challenging.

On God of War, the Caldera Lake bridge – with its moving water levels and bridge rotations – was one of the biggest concerns throughout the whole production from the perspective of the lighting / global illumination systems that rely on precomputations. There was no “general” solution; it all relied on manual work and streaming / managing the loaded data…

God of War Caldera Lake bridge. Based on what happens there throughout the story, it was quite challenging to manage its lighting changes! Source: Sony promo art.

Transparents and particles

Finally, there are transparent objects and particles – a very special and different category than anything else.

They don’t write depth or motion vectors (more on it later), require back-to-front sorting, need many evaluations of costly computations and texture samples per final output pixel, are usually lit in a simpler way, and need special handling of shadows… They are also very expensive due to overdraw – a single output pixel can mean many evaluations of alpha blended particles’ pixel shaders.

They represent person-years of work (on most projects I worked on there were N dedicated FX artists and at least one dedicated FX and particle rendering engineer) that cannot just be discarded or ignored by a newly introduced technique.

Pipeline compatibility

The above were some product features and requirements to consider. But what about some deeper, more low-level implications? Rendering of a single frame requires tens of discrete passes that are designed to work with each other, tens of intermediate outputs and buffers. Let’s look at parts of the rendering pipeline to consider.

Depth testing and writing

I mentioned above some challenges with alpha testing, alpha blending, sorting, and particles. But it gets even more challenging when you realize that many features require precise Z buffer values in the depth buffer for every object!

That creamy depth of field effect and many other camera effects? Most require a depth buffer. Old school depth fog? Required depth buffer!

I love bokeh on this screenshot and the work I’ve done on it with artists on Witcher 2, but notice the errors: Flame in front of the dragon is sharp, while the plane in front of the background is blurred. This is a bug as it should have the same DoF, but the depth read for the bokeh effect is the depth of the solid objects and not particles!

Ok, this seems like a deal breaker, as we know that alpha blended objects don’t write those and there cannot be a single depth value corresponding to them… But games are a mastery of smoke and mirrors and faking. 🙂 If you really care you can create depth proxies by hand, sort and reorder some things manually, alpha blend depth (wrong but can look “ok”), tweak depth of field until artifacts are not distracting… Lots of manual work and puzzles “which compromise is less bad”!

Motion vectors

A related pipeline component is the writing of motion vectors. While depth is essential for many of the mentioned depth based effects, proper occlusions, screen-space refractions, fog etc., the motion vectors are used for “only” two things: motion blur and temporal antialiasing (see below).

Motion blur seems like “just an effect”, but having small amounts of it is essential to reduce the “strobing” effect and generally cheap feel (see movie 24fps half shutter motion blur).

(Some bragging / Christmas time nostalgia: thinking about motion blur made me feel nostalgic – motion blur was the first feature I got interviewed about by the press. I still feel the pride that the 23 year old me – living in Eastern Europe and not even done with college – felt!)

Producing accurate motion vectors is not trivial – for every pixel, you need to estimate where this part of the object was in the previous frame. This can mean re-computing some animation data, storing it for the previous frame (extra memory used), or it can be too difficult / impossible (dealing with occlusions, shadows, or texture animations).
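
As an illustration of the basic math only (a simplified sketch with made-up names – per-pixel motion vectors for skinned or deforming objects are more involved): reproject the same object-space position with the current and the previous frame’s transforms and take the screen-space difference.

import numpy as np

def motion_vector(obj_pos, model, view_proj, prev_model, prev_view_proj):
    # NDC position of the same object-space point under a given pair of transforms.
    def to_ndc(m, vp):
        clip = vp @ m @ np.append(obj_pos, 1.0)
        return clip[:2] / clip[3]  # perspective divide
    # Note: this requires keeping the previous frame's transforms around (extra memory / bookkeeping).
    return to_ndc(model, view_proj) - to_ndc(prev_model, prev_view_proj)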

On AC4 we implemented it for almost everything – with the exception of the ocean – and ignored the TAA and motion blur problems on it…

Temporal antialiasing

Temporal antialiasing… one of the biggest achievements of the rendering engines in the last few years, but also one of the biggest sources of problems and artifacts. Not going to discuss here if there are alternatives to it or if it’s a good idea or not, but it’s here to stay for a while – not just for the antialiasing, but also temporal distribution of samples in Monte Carlo techniques, supersampling, stochastic rendering, and allowing for slow things to become possible or higher quality in real-time. 

It can be a problem for many new rendering techniques – not only because of the need for pretty good motion vectors, but also its nature can lead to some smearing artifacts, or even cancel out visual effects like sparkly snow (glints, sparkles, muzzle flash etc).

Sparkly snow was one of the most requested features by artists on God of War, but I was not able to create a robust version, as temporal AA was removing it all, treating it like temporal flicker / aliasing…

Deferred shading / lighting

The majority of engines use deferred shading (yes, there are many exceptions and they look and perform fantastic). It has many desirable properties and “lifts” / decouples parts of the rendering, simplifies things like decals, and can provide a sweet reduction of the overdraw…

An example GBuffer representation – textures representing different material and object properties. Lighting is applied later to those. Source: learnopengl

But having a “bottleneck” in the form of the GBuffer – with lighting not having access to any other data – can be very limiting.

New shading model? New look-up table? New data precomputed / prebaked and stored in vertices or textures? Need to squeeze those into GBuffer!

Modern GPUs handle branching and divergence with ease (provided there is “some” coherency), but it can complicate the pipeline and lead to “exclusive” features.

For example on God of War you couldn’t use subsurface scattering at the same time as the cloth/retroreflective specular model due to them (re)using the same slots in the GBuffer. The GBuffer itself was very memory heavy anyway and there were many more mutually exclusive features – I had many possible combinations written out in my notebook (IIRC there were 6 bits just for encoding the material type + its features) and was “negotiating” those compromises between different artists who all had different uses and needs.

Any new feature meant cutting out something else, or further compromises…

Camera cuts / no camera cuts

In most cinematics, there are heavy camera cuts all the time to show different characters, different perspectives of the scene, or maybe action happening in parallel. This is a tool widely used in all kinds of cinema (maybe except for Dogme 95 or Birdman 😉 ), and if a cinematic artist wants to use it, it needs to be supported.

But what happens when the camera cuts with all the per-pixel temporal accumulation history for TAA/temporal supersampling? How about all the texture streaming that suddenly sees a completely new view? View-dependent amortized rendering like shadowmap caching? All solvable, but also a lot of work, or might introduce unacceptable delay / popping / prohibitive cost of the new frame. A colleague of mine also noted that this is a problem for physics or animations – often when the camera cuts and animators adjusted some positions, you see physical objects “settling in”, for example a necklace move on a character. Another immersion breaker that requires another series of custom “hacks”.

Conversely, the lack of camera cuts (like in God of War) is also difficult, especially for cinematics lighting, good animations, transitions etc. In any case – both need to be solved/accounted for. It’s even worth adding a “camera was cut” flag in the renderer.

Frustum culling

Games generally don’t render what you don’t see in the camera frustum – which is obviously desirable, as why would you waste cycles on it?

Horizon: Zero Dawn gif showing frustum culling. Lots of gamedevs were a little amused that the gaming media and gamers “discovered” games do frustum culling (it was always there).
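
To make the idea concrete, a minimal sketch of one common form of it – testing object bounding spheres against the frustum planes (the plane representation and conventions here are my assumptions; real engines batch and vectorize this heavily):

import numpy as np

def is_sphere_visible(center, radius, frustum_planes):
    # frustum_planes: list of (normal, d) pairs, with normals pointing inside the frustum.
    for normal, d in frustum_planes:
        if np.dot(normal, center) + d < -radius:
            return False  # completely behind one plane -> culled
    return True  # conservative: may still be slightly outside near an edge/corner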

Now, it gets more complicated! Why would you animate objects that are not visible? Not doing so can save a lot of CPU time.

However things being visible in the main view is only part of the story – there are reflections, shadows… I cannot count in how many games I have seen the shadows of the off-screen characters in the “T-pose” (default pose when a character is not animated). Raytracing that can “touch” any objects poses a new challenge here!

Character in a T-Pose. I am sure you have seen it as a glitch way too many times! Worth paying attention to shadows and reflections as well. 🙂

Occlusion culling

Occlusion culling is the next step after frustum culling. You don’t want to render things outside of the camera? Sure. But if in front of the camera there is a huge building, you also don’t want to render the whole city behind it!

Robust occlusion culling (together with the LOD, streaming etc. below) is in many ways an unsolved problem – all existing solutions require compromises, precomputation, or extremely complex pipelines.

In a way an occlusion culling system is a new rendering feature that has to go through all the steps that I list in this post! 🙂 But given its complexity and general fragility of many solutions – yet another aspect to consider, we don’t want to make it even more difficult.

Level of detail

Any technique that requires some computational budget and memory needs to “scale” properly with distance. When a rendered object is 100m away and occupies a few pixels, you don’t want it to eat 100MB of memory or have its rendering/update take even a millisecond.

Enter “level of detail” – simplified representations of objects, meshes, textures (ok, this one is automatic in the form of mip-mapping, but you still need to stream those properly and separately), and computations.

Source: wikipedia.
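
A minimal sketch of the simplest, distance-based discrete LOD selection (the switch distances and the doubling scheme are arbitrary assumptions; real systems also consider screen-space size, hysteresis, and transitions):

import math

def select_lod(distance, lod0_distance=10.0, num_lods=4):
    # Double the switch distance with every LOD level.
    if distance <= lod0_distance:
        return 0
    return min(int(math.log2(distance / lod0_distance)) + 1, num_lods - 1)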

Unfortunately while the level of detail for meshes or textures is relatively straightforward (though in open world games it is not enough and gets replaced by custom long-distance rendering tech), it’s often hard to solve it for other effects.

How should the global illumination scale with distance? Volumetric effects? Lighting? Particles?

All hard problems that are often solved in very custom ways, through faking, hacking, and manual labor of artists…

Loading, LOD switching and streaming

Level of detail needs to be streamed. Ok, an object far away takes a few pixels on the screen, we can use just tens of vertices and maybe no textures at all – in total maybe a few kilobytes, all good. But then the camera starts to move closer, and we need to load the material, textures, meshes… All of this needs systems and solutions coded up. What happens when the camera just teleports? Streaming system needs to handle it.

Furthermore, switching between those LODs has to be as seamless as possible. Most visual glitches in games with some textures missing / blurry, things flickering, popping, appearing are caused by the streaming, loading, and the LOD switching systems!

Open world

I put the “open world” as a separate item, but it’s a collection of many constraints and requirements that together is different in a qualitative way – it’s not just a robust streaming system, but everything needs to be designed for streaming in open world games. It’s not just good culling – but a great culling system working with AI, animations and many other systems. Open world and large scale rendering is a huge pipeline constraint on almost everything, and has to be considered as such.

When I joined the AC4 team at Ubisoft and saw all the “open world” tech for streaming, long distance rendering and similar, or the long range shadow elements for Far Cry, I was amazed by how much specialized (and smart) work was needed to fit those on consoles.

Budget

Finally, the budget… I’ll tackle the “production” budget below, but hope this one is self-explanatory. If a new technique needs some memory or CPU/GPU cycles, they need to be allocated.

One common misconception is that 30ms counts as a “real time technique”. It can, but only if it is literally the only thing rendered, and only in a 30fps game.

For VR you might have to deal with budgets of sub 10ms to render 2 eye views for the whole frame! This means that a lighting technique that costs 1ms might already be way too expensive!
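
A quick back-of-the-envelope calculation (assuming a 90Hz headset refresh rate, which is common but not universal):

refresh_hz = 90
frame_budget_ms = 1000.0 / refresh_hz  # ~11.1 ms for the whole frame, both eye views
lighting_cost_ms = 1.0
print("Lighting alone: %.0f%% of the frame budget" % (100.0 * lighting_cost_ms / frame_budget_ms))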

Similarly with RAM memory – all the costs need to be considered holistically and usually games already use all available memory. Thus any new component means sacrificing something else. Is a feature X worth reducing texture streaming budget and risking texture streaming glitches?

Finally, another “silent” budget is disk space. This isn’t discussed very often, but with modern video game pipelines we are at a point where games hardly fit on Blu Ray disks (it was a bigger challenge on God of War than fitting in memory or the performance!).

Similarly, patches that take 30GBs – it’s kind of ridiculous and usually a sign of an asset packaging system that didn’t consider patching properly… But anyway, a new precomputed GI solution that uses “only” maybe 50MB per level chunk “suddenly” scales to 15GB on disk for the whole 30h game and that most likely will not be acceptable!

The process

While I left the process for the end (as it might be the least interesting to some of the audience), the challenges listed here are the most important.

Many “alternative” approaches to rendering are already practical in many ways (like signed distance fields), but the lack of tooling, pipelines, and production experience is a complete blocker for all but a few special examples that, for instance, rely on user created content (see the brilliant Dreams or Claybook).

Artistic control

It’s easy to get excited by some amazing results of procedural rendering (which has been around for many decades), photogrammetry, or generally 3D content capture and re-rendering – the results look like eye-candy and might seem like removing a huge chunk of production costs.

However artists need full control and agency in the process of content making.

Capturing a chair with a neural implicit representation is easy, but can you change the color of the seat, make some part of it longer, and maybe change the paint to varnished? Or when the art director decides that they want a baroque style chair, are you going to look for one?

Generally most artists are skeptical of procedural generation and capture (other than to quickly fill huge areas of levels that are not important to them) as they don’t have those controls.

On Assassin’s Creed 4, one of my teammates worked on procedural sky rendering (it was a collaboration with some other teams, and also IIRC had an evaluation of some middleware). It looked amazing, but the art director (who was also a professional painter) was not able to get exactly the look he wanted, and we went with very old-school, static (but pretty!) skyboxes.

Developing a procedural sky rendering system that will produce a sky exactly like this painterly Assassin’s Creed 4 concept art is going to be challenging! Source: Ubisoft promo art.

Tools

Every tool or technique that is creative / artistic, needs tooling to express the control, place it on levels, allow for interaction. “Tools” can be anything from editing CSV files to dedicated WYSIWYG full editors.

Even the simplest new techniques need some tools / control (even if it is as simple as on/off) and it needs to be written.

The next level of complexity is for example a new BRDF or material model – it will have to be integrated in a material editor, textures processed through platform pipeline and baking, compressed to the right format, shader itself generated and combined with other features… All while not breaking the existing functionality.

Finally, if you propose something to replace the mesh representation, you need to consider that suddenly the whole ecosystem that artists have been using – from content authoring tools like ZBrush, 3D Studio Max and Maya, through animation tools and exporters like MotionBuilder or dedicated mocap software, to all the engine plugins and importers/exporters – stops being applicable. This is like redoing decades of work of the whole industry. Doable? Sure. Reasonable? It depends on your budget and the benefits it might bring.

I have a theory about the popularity of engines like Unity and Unreal Engine – it didn’t come from any novel features (though there are many!), but from the mature and stable, expressive, and well known tooling (see below).

Education and experience

Some video game artists have been around for a few decades and built lots of knowledge, know-how and experience. Proposing something that doesn’t build on it can be a “waste” of this existing experience. To break old habits, benefits have to be huge, or become standard across the whole industry (Physically Based Rendering took a few years to popularize).

If you want to change some paradigms, are you able to spend time educating people using those in their daily work, and would they believe it is truly worth it?

Support of the others

It took me a while and some hard lessons to realize that the “emotional” side of collaboration is more important than the technical one.

If your colleagues like artists or other programmers are excited about something – everything will be much easier. If the artists are not willing to change their workflow, or don’t believe that something will help them, the whole feature or project can fail.

I learned it myself only in hindsight and the hard way: early in my career I was mostly on teams with people similarly excited and similarly junior as me; so later on, on some other team, the first time I proposed something that my colleagues didn’t care for, I didn’t understand the “resistance”, proceeded anyway, and the feature failed (it simply was not used).

Good “internal marketing”, hype, and building meaningful relationships full of trust in each other’s competence and intentions are super important and this is why it’s hard to ship something if you are an “outsider”, consultant, or a remote collaborator.

Testing

I’ve written a post about automatic testing of graphics features, and it goes without saying that productionizing anything takes many months of testing – by the researcher/programmer themselves, by artists, QA.

Anything that is not tested in practice can be assumed to be broken, or not working at all.

Debuggers and profilers

Every feature that requires some input data needs to provide means of validating and debugging both this input and the resulting behavior. Visualizations of intermediate results, profilers showing how much time is spent on different components, integration with other debugging tools.

Most graphics programmers I know truly enjoy writing those (as they both look cool, and pay off the time investment very quickly), but it has to be planned, costs time and needs to be maintained.

When developing a new technique, try to think: “if things go wrong for some reason, how will I or someone less technical be able to tell why and fix the problem? What kind of tools are needed for that?”

Edge cases

Ah, the edge cases. Things that are supposed to be “rare” or even negligible, but are guaranteed to happen during actual, large scale production. Non-watertight or broken topology meshes? Guaranteed. Zero-area triangles? Guaranteed. Flipped normals? Guaranteed. Materials with roughness of exactly zero? Sure. Black or fully white albedo? Yes. This one level where the walls have to be so thin that the GI leaks? Yep.

Production is full of those – sometimes to the extent that what is considered a pathologically bad input in a research paper, represents most of in-game assets.

Many junior graphics programmers who implement some papers in the first months of their careers find out that the described techniques are useless on “real” data – hitting an edge case on literally the first tried asset. This doesn’t encourage future experimentation (and to be fair – it could be avoided if authors were more honest in the “limitations” sections of publications and actually explored them more).

The practitioner mindset is to immediately think “How can this break? How can I trigger this division by zero? What happens if x/y/z?” and generally to prepare to spend most of the time handling and robustifying those edge cases.

Production budget

Finally, something obvious – the production budget. Everything added/changed/removed from production has a monetary cost – of time, software licenses, QA, modifying existing content, shifting the schedule. Factoring it in is difficult, especially from an engineer position (limited insight into full costs and their breakdown), but tends to come naturally with some experience.

Anecdote from Witcher 2: the sprite bokeh was something I implemented during a weekend ~2 weeks before gold master. I was super excited about it, and so were the environment and cinematic artists. We committed to shipping it and between us spent the next two weeks crunching to re-work the DoF settings in all existing cinematics (obviously many didn’t need to be tweaked, only some). I’m glad that we were super inexperienced and didn’t know not to do stuff like this – as the feature was super cool, and I cannot imagine it happening in any of my later and more mature companies. 🙂

Summary

To conclude, going from research to product in the case of graphics algorithms is quite a journey!

A basic implementation for a tech demo is easy, but the next steps are not:

  • One needs to make sure the new algorithm can coexist with and/or support all the other existing product features,
  • It has to fit well into an existing rendering pipeline, or the pipeline needs to be modified,
  • All the aspects of the production process from tooling, through education to budgeting need to include it.

The severity of the constraints varies – from ones that are easy to satisfy, work around, or hack, to more challenging ones (sometimes forcing heavy compromises or dropping the idea) – but it’s worth thinking about them ahead of time in the process and planning properly.

As I mentioned multiple times – I don’t want to discourage anyone from breaking or ignoring all those guidelines and “common knowledge”.

The bold and often “irrational” approaches are the ones that sometimes turn revolutionary, or at least are memorable. Similarly, pure researchers shouldn’t let some potential problems hinder their innovation.

But at the same time, if people representing both the “practical” and “theoretical” ends of the research and technology progress understand each other, it can lead to much healthier and easier “technology transfer” and more useful/practical innovation, driven by needs, and not the hype. 🙂

PS. if you’re curious about some almost 10 year old “war stories” about rendering of the foliage in The Witcher 2 – I tried to capture those below:

Supplement – case study – foliage

I started with a joke about how foliage rendering breaks everything else, but let’s have a more serious look at how much work and consideration went into the rendering of Witcher 2’s foliage.

Around the time I joined CD Projekt Red, during The Witcher 2 pre-production, there was no “foliage system” or special solution for rendering vegetation. Artists were placing foliage as normal meshes – grass blade after grass blade on a level. Every mesh was a component and object inside an entity object, and generally was very heavy, using “general” data structures – the same ones no matter whether for rocks, vegetation, or the player character. Throughout the production a lot had to change!

Speedtree importer and wind

One of the key artists working on vegetation was using the Speedtree middleware a lot – but just as an editor; we were not using their renderer.

Speedtree supports amazingly looking wind, and artists wanted similar behavior in game. One of my first graphics related tasks was adding importing of some wind data from Speedtree into the game, and translating some of the behavior into our vertex shaders.

This was mostly tooling work, and took a few weeks – I needed to modify the tools code, add storage for this wind data, different vertex shader type, material baking code, support wind on non-speedtree assets. All for just supporting something that existed and wasn’t any kind of innovative/research thing.

But it looked great and I confirmed to myself that I was right dreaming of becoming a graphics programmer – it was so rewarding!

Translucency and normals

The next category of tasks with foliage that I remember working on was not implementing anything new, but debugging. We had a general “fake translucency” solution (emulating scattering through an inverse, boosted Lambertian term). But it relied on the normals pointing away, while our engine supported two sided normals (a mesh face normal flipped depending on which side it is rendered from) and some assets were using that. Similarly, artists were “hacking” normals to point upwards like the terrain to avoid some harsh lighting and smooth everything out – and such meshes didn’t show translucency, or it looked like some wrong, weird specular.

Fixing importers, mesh editor, and generally debugging problems with artists, and educating them (and myself!) took quite a bit of time.

To alpha test, or to draw small triangles?

As we were entering more performance critical milestones, foliage was more and more of a performance problem. Alpha testing is costly and causes overdraw, while drawing small triangles causes so-called quad overdraw. Optimizing the overdraw of foliage was a never ending story – I remember when we started profiling some levels, there was a single asset that took 8ms to render on powerful PCs. We did many iterations and experiments to make it run fast. In the end, the solution was hybrid – many single grass blades, but leaves were organized as alpha tested “leaf cards”.

Level layer meshes

One of the first important optimizations – just for the part of the CPU cost, and reducing loading times – was to “rip” all meshes that were simple enough (no animations other than wind, no control by gameplay) into a separate, much smaller data representation.

This task was something I worked on with lots of mentorship from a senior programmer, and was going back to my engine programming roots (data management, memory, pipeline). 

While it didn’t address all the foliage needs, it was a first step that allowed us to process tens of thousands of such objects (instead of hundreds), and artists realized they needed better tooling to populate a large world.

Automatic impostor system

A feature that my much more senior colleague worked on – adding the option of automatic impostor generation. After a user-defined distance, a mesh would turn into an auto-rotating billboard – with different baked views from different sides.

While I didn’t implement it, I was later improving and debugging it quite a lot (the programmer who implemented it was the most senior person on the team, overwhelmed with tasks, and generally often worked by doing an initial implementation and then handing it off to more junior people – which was a great learning and mentoring opportunity) on things like inpainting alpha tested regions for correct mip-mapping, or how to optimize the baking w.r.t. the GBuffer structure (to avoid extra GBuffer packing/unpacking).

Foliage editor

The next foliage specific work was again around tooling!

Our senior programmer spent a few heavy weeks on making a dedicated foliage editor, with brushes, layers, editors. Lots of mundane work, figuring out UX issues, finding best data representations… Things that need to happen, but almost no paper will describe.

The result was quite amazing though, and during the demo, my colleague showed the artists how he was able to create a forest in 5 minutes – and right away, our game got populated by them with great looking foliage and vegetation (in my biased opinion, the best looking in games around that time).

Foliage culling and instancing

With lots of manual optimization work, we were nearing the shipping date of Witcher 2 on PC and starting to work on the X360 version. A newly hired programmer, now my good friend, was looking into the performance of the occlusion culling, especially on the X360 (it used an old school “hipster” algorithm that you have probably never heard of – a beam tree). Turns out that building a recursive BSP from hundreds of occluders every frame and processing thousands of meshes against it is not a great idea, especially on the in-order PowerPC CPU of the X360.

My colleague spent quite a lot of time on making the foliage system a real “system” (and not just a container for meshes) with hierarchical representation, better culling, and hardware instancing (which he had experience with on X360, unlike anyone else in the team). It was many weeks of work as well, and absolutely critical work for shipping on the Xbox.

Beyond Witcher 2 – terrain, procedural vegetation

This doesn’t end the story of vegetation in Witcher games.

For Witcher 3 (note: I barely worked on it; I was moved to the Cyberpunk 2077 team) my colleagues went on to integrate the terrain and foliage systems with streaming (necessary for an open world game!), while our awesome technical artist coded procedural vegetation generation inspired by physical processes (effects of altitude, the amount of sun, water proximity) and “game of life” rules that were used for “seeding” the world with plausibly looking plants and forests.

I am sure there were many other things that happened. But now imagine having to redo all this work because of some new technique – again, doable, but is it worth it? 🙂 


Compressing PBR material texture sets with sparsity and k-SVD dictionary learning

Introduction

In this blog post, I am going to continue exploration of compressing whole PBR texture sets together (as opposed to compressing each texture from the set separately) and using the fact that those textures are strongly correlated.

In my previous post, I described how we can “decorrelate” some of the channels using PCA/SVD and how many of the dimensions in such a decorrelated space contain a small amount of information, and therefore can be either dropped completely, stored at a lower resolution, or quantized aggressively. This idea is the backbone of almost all successful image compression algorithms – from JPEG to BC/DXT.

This time we will look at exploring some redundancy across both the stored dimensions, as well as the image space using sparse methods and kSVD algorithm (original paper link).

As for my personal motivation, I wanted to learn more about those sparse methods, play with implementing them, and build some intuition. One of my coworkers and collaborators on my previous team at Google, Michael Elad, is one of the key researchers in the field, and his presentations inspired and interested me a lot. And as I believe that until you can implement and explain something yourself, you don’t really understand it – I wanted to explore this fascinating field (in a basic, limited form) myself.

First I am going to present a simplified and intuitive explanation of what “sparsity” and “overcompleteness” mean and how their redundancy can be an advantage. Then I will describe how we can use sparse approximation methods for tasks like denoising, super-resolution, and texture compression. Then I will provide a quick explanation of the k-SVD algorithm steps and how it works, and finally show some results / evaluation for the task of compressing PBR material texture sets.

This post will come with a colab notebook (soon – I need to clean it up slightly) so you can reproduce all the code and explore those methods yourself.

Sparse and dictionary based methods in real-time rendering?

I tried to look for some other references to dictionary based methods in real time rendering and was able to find only three resources: a presentation by Manny Ko and Robin Green about kSVD and OMP, a presentation by Manny Ko about Dictionary Learning in Games, and a mention of using Vector Quantization for shadow mask compression by Bo Li. VQ is a method that is more similar to palletizing than full dictionary learning, but still a part of the family of sparse methods. I recommend those presentations (a lot of what I describe about the theoretical part – and how I describe it – overlaps with them), and if you are aware of any more mentions of those IMO underutilized methods in such a context, please leave a comment and I’ll be happy to link.

Vector storage – complete and orthogonal basis, vs overcomplete and sparsely used frames / dictionaries

To explain what sparse methods are, let’s start with a simple, 2 dimensional vector. How can we represent such a vector? “This is obvious, we can just use 2 numbers, x and y!”. Sure, but what do those numbers mean? They correspond to an orthonormal vector basis – x and y actually represent a proportion of the position along the x axis and the y axis. If you have done computer graphics, you know that you can easily change your basis. For example, a vector can be described in so called “world space” (with an arbitrary, fixed origin), “camera / view space”, or for some simple video game AI purpose in the space of a computer-controlled character.

Same vector, two different storage bases.

Vector bases might have different properties. Two that are very desirable are orthogonality and normality, often combined together as an orthonormal basis. Orthonormality gives us a trivial way of changing information between two bases back and forth (no matrix inversion required, no scaling/non-unity Jacobians introduced) and some other properties related to closed form products and integrals – it’s beyond the scope of this post, but I recommend “Stupid Spherical Harmonics Tricks” by Peter-Pike Sloan for some examples of how SH being orthonormal makes it a desired basis for storing spherical signals.

The choice of “the best” basis matters a lot when doing compression – I have described it in my previous blog post, but as a quick refresher, “decorrelating” dimensions can give us a basis in which information is maximized in each sequential subset of dimensions – meaning that the first dimension has the most information and variability present, the next one the same or a smaller amount, etc. This is limited to linear relationships and covariance, but lots of real data shows linear redundancy.

Demonstration from my previous post of finding a storage basis in which we maximize information stored in single dimension.

Among the requirements for a space to be a basis are completeness and uniqueness of representation. Uniqueness of the representation means that given a fixed basis, there is only a single way to describe a vector in this space. Completeness means that every vector can be represented without loss of information. In my last post, we looked at the truncated SVD – which is not complete, because we lose some of the original information and there is no way to recover it.

What is quite interesting is that we can go further than vector bases, towards overcomplete frames (let’s simplify – a lot – for the purpose of this post and say that a frame is just “some generalization” of a basis). This means that we have more vectors in the space than the “rank” of the information we want to store. Overcomplete frames can’t be orthogonal, and representations in them are not unique (I encourage readers for whom this is a new concept to try to prove it – even if just intuitively). Sounds pretty useless for information compression? Not really, let me give here a visual example:

Example of a set of points/vectors that cannot be represented well by any two orthogonal vectors, but on the other hand three frame vectors describe this set pretty well.

Note that while in the case of an overcomplete frame every point can be described in infinitely many ways (just like x + y = 5 has infinitely many solutions of x and y values that satisfy it), we are usually going to look for a sparse representation. We want as many of the components as possible to be zero (or close enough to zero that we can discard them, accepting some error), and the input vectors described by a linear combination of fewer frame vectors than the size of the original basis.

If this description seems vague, looking at the above example – with complete basis, we generally need two numbers to describe the input point set. However, with an overcomplete frame, we can use just a single 2-bit basis index and a float value describing magnitude along it:

Sparse representation in a “proper” overcomplete frame can approximate some input data pretty well.

Yes, we lose some information, and there is an approximation error – but there is always an error in the case of lossy compression – and the error in such a basis is way smaller than from any truncated SVD (keeping a single dimension of orthogonal basis).

Terminology wise – we commonly refer to a frame vector as an atom, and a collection of atoms as a dictionary – and I will mostly use atoms and dictionary throughout this post.

Dense vs sparse representation – a dense representation has a minimal number of dimensions (rank) of the set, but every item represented in it is going to be represented by multiple coefficients. With sparse representation, we might have many more dimensions, but every item is going to be represented by a small subset of those – most coefficients are zero.

Also note – we will come back to this later in the next section – I have used here a 2D example and a 1-sparse representation (single atom to represent input) – but with higher dimensions, we can (and in practice do) use multiple atoms per vector – so if you have 16 atoms, any representation that uses 1-15 of them can be considered sparse. Obviously if the input space is 3D, and your dictionary has 20 atoms, while a 10 atom representation of a vector “could” be seen as sparse, it probably won’t make sense for most practical scenarios.

Sparsity and images – palettes and tiles

Now I admit, this 2D case is not super compelling. However, things start to look much more interesting in much higher dimensions. Imagine that instead of storing 20 numbers for a 20 dimensional vector, you could store just a few indices and a few numbers – in very high dimensions, sparse representations get more and more interesting. Before going towards the high dimensional case, let me just show an interesting connection – what happens if for a single pixel of e.g. an RGB image we store only an index (not even a weight) from a dictionary? Then we end up with an image palette.

Example images compressed with palettization and their palettes – source: wikipedia, author: Ricardo Cancho Niemietz

Image palettization is an example of a sparse representation method! One of many algorithms to compute an image palette is k-means, which is a more basic version of the k-SVD that I’m going to describe later.

Palettization generalizes to many dimensions – if you have a PBR texture set with 20 channels, you can still palettize it – just your entries will be 20 dimensional vectors.

However, we can increase the dimensionality much further (hoping for more redundancy in more dimensions). Let’s have a look at this Kodak dataset image and at image tiles:

Notice in colored rectangles how much self similarity there is inside this image!

There is lots of self-similarity in this image – not just within pixel values (those would palettize very well), but among whole image patches! In such a scenario, we can take a whole patch to become our input vector. I have mentioned in the past that thinking beyond “image is 2D matrix / 3D tensor” is one of the most important concepts in numerical methods for image processing, but as a recap, we can take an RGB image and instead of a tensor, represent it by a matrix:

Representing an image or image patch as a matrix.

Or we can go even further, and represent it as a single vector:

Representing an image or image patch as a vector.

So you can turn a 16×16 RGB tile into a single 768 dimensional vector – and with those 16×16 tiles, a 512 x 512 x 3 (RGB) image into a 32×32 array of 768 dimensional vectors. This takes time to get used to, but if you internalize this concept (and experiment with it yourself), I promise it will be very useful!
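A minimal numpy sketch of this reshaping (the helper names are mine, and the image side is assumed to be divisible by the tile size):

```python
import numpy as np

def image_to_tile_vectors(img, tile=16):
    # img: (H, W, C) array; H and W are assumed to be divisible by `tile`.
    h, w, c = img.shape
    th, tw = h // tile, w // tile
    # Cut into (th, tw) tiles of shape (tile, tile, c)...
    tiles = img.reshape(th, tile, tw, tile, c).swapaxes(1, 2)
    # ...and flatten each tile into a single vector of length tile*tile*c.
    return tiles.reshape(th * tw, tile * tile * c)

def tile_vectors_to_image(vectors, h, w, c, tile=16):
    # Inverse transform: undo the flattening and the tile interleaving.
    th, tw = h // tile, w // tile
    tiles = vectors.reshape(th, tw, tile, tile, c).swapaxes(1, 2)
    return tiles.reshape(h, w, c)

img = np.random.rand(512, 512, 3)
vecs = image_to_tile_vectors(img, tile=16)   # shape (1024, 768): 32x32 tiles of 768 dims
assert np.allclose(tile_vectors_to_image(vecs, 512, 512, 3, 16), img)
```

The round trip is lossless, so we are free to switch between the “image” and the “matrix of tile vectors” views at any point.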

What are sparse methods useful for?

Sparse methods were (and to a lesser extent still are) a “hot” research topic because of a simple “philosophical” observation – the world we live in, and the signals we deal with (audio, images, 3D representation), are in many ways sparse and self-similar.

A 100% random image would not be sparse, but images that we capture with cameras or render are in some domains.

A random image is not sparse – there is no structure or self-similarity to it. On the other hand, natural images are – notice lots of self similarities, flat regions, simple geometric constructs like edges, corners, textures.

I will not present any proofs of it (also personally I haven’t seen one that would be both rigorous and not handwavy, while staying intuitive), however let me describe some implications and use-cases of it.

Compression

Very straightforward and the use-case I am going to focus on – if instead of storing full “dense” information of a large, multidimensional vector (like image tile), we can instead store a dictionary, and indices of atoms (plus their weights). If the compressed signal was truly sparse in some domain, this can result in massive compression ratios.

Denoising

What happens when we add white noise to a sparse signal? It’s going to become non-sparse! Therefore, by finding the best possible sparse approximation of it, we effectively decompose it into a sparse component and a non-sparse residual, which is going to be discarded and hopefully corresponds to the noise.

Crude, quickly coded k-SVD based denoising of a very noisy signal. While the output image doesn’t look good (things like blended, overlapping tiles or varying atom count per tile are a must), it is a demonstration that it’s possible to remove most of the noise!

Super-resolution

Principle of self-similarity and sparsity can be used for super-resolution of low-resolution, aliased images. Different “phases” of the sampled aliased signal produce different samples, but if we assume there is an underlying common shared representation, we could find it and reconstruct them.

Aliased triangle and different tiles corresponding to the same original shape.

In this example, if we correctly located all the aliased tiles and were certain that a single, shared high resolution atom corresponds to all of them, then we’d have enough information to remove most of the aliasing and reconstruct the high resolution edge.

Computer vision

Computer vision tasks like object recognition or object classification don’t work very well with very high dimensional data like images. Almost all computer vision algorithms start by constructing some representation of the image tile – corner and line detection. While in the “old times” those features were hard-coded, nowadays the state of the art relies on artificial neural networks with convolutional layers – and the convolutional layers with different filter banks jointly learn the dictionary (edges, corners, higher level patterns), as well as the use of this dictionary data (I will mention this connection around the end of this post).

Compressive sensing / compressed sensing

This is a really wild, profound idea that I have learned about just this year and definitely would like to explore more. If we assume that the world and the signals we sample are sparse in “some” domain and we know it, then we can capture and transmit much less information (orders of magnitude less!), like random subsamples of the signal. Later we can recompute the missing pieces of the information by forcing sparsity of the reconstruction in this other domain (in which we know the signal is sparse). This is a really interesting and wild area – think about all the potential applications in signal/image processing, or in computer graphics (strongly subsampling expensive raytracing or GI and easily reconstructing the full resolution signal).

Note: Sparsity of simple, orthogonal basis

It is important to emphasize that you don’t need any super fancy overcomplete dictionary to observe signal sparsity. Signals can be sparse even in simple, orthogonal bases – for example the Fourier transform, DCT, image gradient domain, decimating wavelets – and many of them are actually very good / close to optimal in “general” (average of different possible images) scenarios. This is the basis for many other compression algorithms.

Natural images are usually sparse in Fourier domain.
Left: original image. Right: only 10% of the largest coefficients in the Fourier domain = an immediate 10x compression ratio. The output doesn’t look great perceptually (ringing, separate treatment of color channels = color shifts; both can be avoided easily), but this took close to zero consideration or effort!
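A minimal numpy sketch of that experiment, assuming a single-channel float image – keep only the 10% largest-magnitude Fourier coefficients and invert the transform:

```python
import numpy as np

def keep_top_fourier(img, keep_ratio=0.1):
    # img: 2D float array (single channel). Transform to the Fourier domain...
    coeffs = np.fft.fft2(img)
    # ...zero out everything below the magnitude threshold that keeps `keep_ratio`
    # of the coefficients...
    threshold = np.quantile(np.abs(coeffs), 1.0 - keep_ratio)
    sparse_coeffs = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
    # ...and reconstruct. The imaginary part is only numerical noise for real inputs.
    return np.real(np.fft.ifft2(sparse_coeffs))
```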

kSVD for texture and material compression

Later in this post, I am going to describe the k-SVD algorithm – but first, let’s look at the use of sparse, dictionary representations and learning for PBR material compression.

I stated the problem and proposed a simple solution in my previous blog post. As a recap, PBR materials consist of multiple textures representing different physical properties of the surface of a material. Those properties represent surface interaction with light through BRDF models. Different parts of the texture set might correspond to different surface types, but usually large parts of the material correspond to the same surface with the same physical properties. I have demonstrated how those properties are globally correlated. Let’s have a look at it again:

An example PBR material, credit cc0textures, Lennart Demes. There are 5 different textures, but each texture corresponds to the same physical location.

After my longish introduction, it might already be clear that we can see lots of self-similarity in this image and very similar local patches. Maybe you can even imagine what those atoms will look like – “here is an edge”, “here is a stone top”, “here is a surface between the stones” – patterns that repeat themselves across the image.

It seems that it should be possible to reuse this “global” information and compress it a lot. And indeed, we can use one of the sparse representation learning algorithms, k-SVD, to both find the dictionary and fit the image tiles to it. I will describe the algorithm itself below, but it was used in the following way (a rough off-the-shelf sketch follows the list):

  • Split the NxN image into n x n sized tiles after subtracting the image mean,
  • Set the dictionary size to contain m atom tiles,
  • Run k-SVD to learn the dictionary, allowing for up to k atoms per tile,
  • Recombine the image.
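If you just want to play with this pipeline without implementing k-SVD yourself, here is a rough sketch using scikit-learn’s dictionary learning. Note the assumptions: sklearn’s fit algorithm is a closely related alternating scheme rather than exact k-SVD, and image_to_tile_vectors / tile_vectors_to_image are the hypothetical tiling helpers sketched earlier.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def compress_texture_set(stack, tile=8, num_atoms=256, atoms_per_tile=8):
    # stack: (H, W, D) array with all PBR channels concatenated along the last axis.
    mean = stack.mean(axis=(0, 1))
    vectors = image_to_tile_vectors(stack - mean, tile)   # (num_tiles, tile*tile*D)
    learner = DictionaryLearning(n_components=num_atoms,
                                 transform_algorithm='omp',
                                 transform_n_nonzero_coefs=atoms_per_tile)
    codes = learner.fit_transform(vectors)   # sparse weights, (num_tiles, num_atoms)
    dictionary = learner.components_         # atoms, (num_atoms, tile*tile*D)
    return codes, dictionary, mean

def decompress_texture_set(codes, dictionary, mean, h, w, d, tile=8):
    vectors = codes @ dictionary              # weighted sum of atoms per tile
    return tile_vectors_to_image(vectors, h, w, d, tile) + mean
```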

Just for the initial demo (not optimal parameters), I selected 8×8 tiles, a dictionary sized at 1/32nd of the original image, and 8 atoms per tile. Afterwards, the texture set will look like this:

Top: reconstruction, middle: reference, bottom: abs error.

I’m presenting it in a low resolution as later I will look more into proper comparisons and the effect of some parameters. From such a first look, I’d say it doesn’t look too bad at all, given that for every tile, instead of storing 8x8x9 numbers, we store only 9 indices and 9 weights, plus the dictionary is only 1/32nd of the original image!

But more importantly, what do the atoms / dictionary look like? Note: atoms are normalized across all channels and we have subtracted the image mean; I am showing the absolute value of each pixel of each atom:

Dimension-wise slices (visualization) of unsorted atoms from fitting a dictionary to a PBR texture set.

Ok, this looks a bit too messy for my taste. Let’s sort them based on similarity (a greedy, naive algorithm: start with the first tile, find the one most similar to it, then proceed to the next tile until you run out of tiles):

Dimension-wise slices (visualization) of sorted atoms from fitting a dictionary to a PBR texture set.

Important note: in the above 5 differently colored subsets, the first 8×8 tile of each of them corresponds to the same atom. They are sliced across different PBR properties for visualization purposes only (I don’t know about you, but I don’t like looking at 9 dimensional images😅).

This is hopefully clearer and also shows that there are some clear “categories” of patches and words in our dictionary: some “flat” patches with barely any texture or normal contribution at all, some strong edges, and some high frequency textured regions. An edge patch will show an edge both in the normal map as well as the AO, while a “textured” patch will show texture and variation in all of the channels. Dictionary learning extended correlation between global color channels to correlation of full local structures. I think this is pretty cool, and the first time I saw it running I was pretty excited – we can automatically extract quite “meaningful” (even semantically!) information, and the algorithm can find some self-similarities and repeating patterns and use them to compress the image a lot.

I will come back to analyzing the compression and quality (as it depends on the parameters), but let’s now look at how the k-SVD works.

k-SVD algorithm

I have some bad news and some good news. The bad news is that finding the optimal dictionary and atom assignments is an NP-hard problem. ☹ We cannot solve it for any real problems in reasonable time. The good news is that there are some approximate/heuristic algorithms that perform well enough for practical applications!

k-SVD is such an algorithm. It is a greedy, iterative, alternating optimization algorithm that is made of the following steps:

  1. Initialize some dictionary,
  2. Fit atoms to each vector,
  3. Update the dictionary given the current assignment,
  4. If satisfied with the results, finish. Otherwise, go back to step 2.

This going between steps 2 and 3 is the “alternating” part, and repeating it is “iterative”.

Note that there are many variations of the algorithm and improvements in many follow-up papers, so I will look at some flavor of it. 🙂 But let’s look at each step.

Dictionary initialization

For the dictionary initialization, you can apply many heuristics. You can start with completely random atoms in your dictionary! Or you can take some random subset of your initial vectors / image tiles. I have not investigated this properly (I am sure there are many great papers on initialization that maximizes the convergence speed), but generally taking some random subset seems to work reasonably well. An important note is that you’d want your atoms to be “normalized” across all channels – you’ll see why in the next step.

Atom assignment

For the atom assignment, we’re going to use Orthogonal Matching Pursuit (OMP). This part is “greedy” and not necessarily optimal. We start with the original image tile and call it a “residual”. Then we repeat the following:

  1. Find the best matching atom compared to residual. To find this best matching atom according to squared L2 norm, we can simply use dot product. (Think and try to prove why the dot product – cosine similarity of normalized vectors – ordering is equivalent to the squared L2 norm, subject to offset and scale).
  2. Looking at all atoms found so far, do a least squares fit against the original vector.
  3. Update the residual.

We repeat it as many times as the desired atom count, or alternatively until error / residual is under some threshold.
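A minimal numpy sketch of OMP for a single input vector, assuming the dictionary is stored as unit-norm rows (a simplified illustration, not the exact code used for the results in this post):

```python
import numpy as np

def omp(x, dictionary, k):
    # x: input vector (d,); dictionary: (num_atoms, d) with unit-norm rows;
    # k: maximum number of atoms to use.
    residual = x.copy()
    chosen = []
    weights = np.zeros(0)
    for _ in range(k):
        # 1. Greedily pick the atom most correlated with the current residual.
        best = int(np.argmax(np.abs(dictionary @ residual)))
        if best not in chosen:
            chosen.append(best)
        # 2. Least-squares fit of all chosen atoms against the *original* vector.
        sub = dictionary[chosen]                        # (len(chosen), d)
        weights, *_ = np.linalg.lstsq(sub.T, x, rcond=None)
        # 3. Update the residual.
        residual = x - sub.T @ weights
    return chosen, weights
```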

As I mentioned, this is a greedy and not necessarily optimal part. Why? Look at the following scenario where we have a single 4 dimensional vector, 3 atoms, and allow max 2 atoms per fit:

Given our yellow vector to fit, at the first step if we find the most similar atom, it is going to be the orange one – the error is the smallest. Then we would find the green one, and produce some combination that leaves the last 2 dimensions too high/too low – a non-negligible error. By comparison, an optimal, non-greedy assignment of the 2 atoms “red” and “green” would produce zero error!

Greedy algorithms are often reasonable approximations that run fast enough (the decision made is only locally optimal, so we consider only a very small subset of solutions), but in many cases they are not optimal.

Dictionary update

In this pass, we assume that our atom assignment is “fixed” and we cannot do anything about it – for now. We go even further, as we look at each atom one by one and assume that all the other atoms are also fixed! Given almost everything fixed, we ask: given this assignment and all the other atoms being fixed, what is the optimal value for this atom?

So in this example we will try to optimize the “red box” atom, given two “purple box” items using it.

The way we approach it is solving the following: if this atom’s contribution were removed, how can we minimize the error of all the items using it? We look at the errors of every single item using this atom and try to find a single vector that, multiplied by different per-item weights, best removes (matches) those errors on average.

In a visual form, it will look something like this:

Updating a single atom to approximate all the residuals of multiple items is simply a rank-1 SVD decomposition! One vector will be the locally optimal atom, while the orthogonal vector will be a weight vector.

If you are a reader of this blog, I hope you recognized that we are looking for a separable, rank 1 approximation of the error matrix! This is where the SVD of k-SVD comes from – we use a truncated SVD and compute a rank 1 approximation of the residual. One of the separable vectors will be our atom, the other one – the updated weights for this atom in each approximated vector.

Note that unlike the previous step, which could lead to suboptimal results, here we have at least a guarantee that each step will improve the average error. It can increase the error of some items, but on average / in total the error will always decrease (or stay the same, if the SVD finds that the initial atom and the weight assignment were already minimizing the residual).

Also note that updating the atoms one by one and updating the weights / residuals at the same time makes it similar to Gauss-Seidel method – resulting in a faster convergence, but at a cost of sacrificed parallelism.
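To make the rank-1 update concrete, here is a minimal numpy sketch of updating a single atom; the names `codes` (a dense (num_items, num_atoms) weight matrix) and `vectors` (input tiles as rows) are mine, chosen for illustration:

```python
import numpy as np

def update_atom(atom_idx, dictionary, codes, vectors):
    # Items that actually use this atom (nonzero weight).
    users = np.nonzero(codes[:, atom_idx])[0]
    if users.size == 0:
        return
    # Residual of those items with this atom's contribution removed
    # (all other atoms and their weights are kept fixed).
    approx_without = codes[users] @ dictionary - np.outer(codes[users, atom_idx],
                                                          dictionary[atom_idx])
    residual = vectors[users] - approx_without          # (num_users, d)
    # Rank-1 approximation of the residual matrix: the "SVD" in k-SVD.
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    dictionary[atom_idx] = vt[0]                         # new unit-norm atom
    codes[users, atom_idx] = s[0] * u[:, 0]              # updated weights for it
```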

And this is roughly it! We alternate the steps of assigning atoms and updating them until we are happy with the results. There are many nitty-gritty details and follow-up papers, but described like this, kSVD is pretty easy to implement and works pretty well!

How many iterations do you need? This is a good question and probably depends on the problem and those implementation details. For my experiments I was using generally 8-16 iterations.

Different parameters, trade-offs, and results

Note for this section – I am clearly not a compression expert. I am sure there are some compression specific interesting tricks and someone could do a much better analysis – I will not pretend to be competent in those. So please treat it as mostly theoretical analysis from a compression layman.

I will analyze all three main parameters one by one, but first, let’s write down a formula for general compression/storage cost:

Image tiles themselves – k indices and k weights for each of the tiles. Indices need as many bits as log2 of the number of atoms m in the dictionary, and the weights need to be quantized to b bits (which will affect the final precision/error). Then there is the dictionary, which is going to be simply m tiles of n*n size, each element containing the original number of dimensions d. So (assuming the elements are also stored at b bits each) the total storage is going to be roughly (N/n)^2 * k * (log2(m) + b) + m * n^2 * d * b.

It’s worth comparing it to a “break even” ratio, so simply N*N*d (times b bits). Ignoring the image, the dictionary break even is m <= (N/n)^2. Ignoring the dictionary, the image break even is k * (log2(m) + b) <= n^2 * d * b.

Seeing those formulas, we can analyze the components one by one.

First of all, we need to make sure that m is small enough so that we actually get any compression (and our dictionary is not simply the input image set 🙂 ). In my experiments I was setting it to 1/32-1/16 of the total tile count of the input for larger tiles, or simply a fixed small value for very small tiles.

Then, assuming that m * n^2 is small enough, we focus on the image itself. For better compression ratios we would like to have the tile size n as large as possible without blowing the earlier constraint – this is a quadratic term. As usual, think about the extremes: a 1×1 tile is the same as no tiling/spatial information at all (it still might offer a sparse frame alternative to the SVD discussed in my previous post), while large tiles mean that our “palette” elements are huge and there are few of them.
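To make the trade-offs easier to play with, here is a tiny sketch that plugs parameters into the storage formula above; the assumption that both the original channels and the quantized weights use b bits is mine, and the 512×512 image size is only a hypothetical example:

```python
import math

def compressed_bits(N, n, m, k, d, b):
    # Image part: (N/n)^2 tiles, each storing k (index, weight) pairs.
    image_bits = (N // n) ** 2 * k * (math.ceil(math.log2(m)) + b)
    # Dictionary part: m atoms of n*n*d elements, b bits each.
    dict_bits = m * n * n * d * b
    return image_bits + dict_bits

def compression_ratio(N, n, m, k, d, b):
    original_bits = N * N * d * b
    return original_bits / compressed_bits(N, n, m, k, d, b)

# Hypothetical 512x512, 9-channel set with 8x8 tiles, 8 atoms per tile,
# and a dictionary sized at 1/32 of the image (m = N^2 / 32 / n^2).
print(compression_ratio(N=512, n=8, m=512 * 512 // 32 // 64, k=8, d=9, b=8))
```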

Generally, for all the following experiments, I looked only at very significant compression ratios (mentioned dictionary itself being 3-6% size of the input) and visible / relatively large quality loss to be able to compare the error visually.

People more competent in compression will notice lots of opportunities for improvements – like splitting indices from weights, deinterleaving the dictionary atoms, exploiting the uniqueness of atom indices etc. – but again, this is outside of my expertise.

Because of that, I haven’t done any more thorough analysis of the “real” compression ratios, but let’s compare a few results in terms of image quality – both subjective, and objective numbers.

Tile size 1, 3 atoms per pixel, dictionary 32 atoms -> PSNR 34.9

Top: reconstruction, middle: reference, bottom: abs error.
Single pixel element dictionary is not the most fascinating or insightful one. 🙂

This use-case is the simplest, as it is a sparse equivalent of the SVD we have analyzed in my previous post. Instead of dense 6 weights per pixel (to represent 9 features), we can get away with just 3 and just 32 elements in the dictionary, and we get a pretty good PSNR! Note however the problem (it will be visible on those small crops in all of our examples) – the tiny green grass islands disappear, as they occupy a tiny portion of the data and don’t contribute too much to the error. This is also visible in the dictionary – none of the elements is green. Apart from that, the compressed image looks almost flawless.

Tile size 2, 4 atoms per tile, dictionary 256 atoms -> PSNR 32.16

Top: reconstruction, middle: reference, bottom: abs error.

In this case, we switch to a tiled approach – but with tiny, 2×2 tiles. We have as many weights per tile as there are pixels per tile – but pixels were originally 8 channel, and now we have only a single weight per atom. The PSNR / error are still ok, however the heightmap starts to show some “chunkiness”. In the dictionary, we can slowly start to recognize some patterns – and how some tiles represent only low frequency, while others represent higher frequency patterns! The exception is the heightmap – it wasn’t correlated too much with the other channels’ details, which explains the errors and the lack of high frequency information. A larger dictionary or more atoms per tile would help to decorrelate those.

Tile size 4, 8 atoms per tile, 6% of the image in dictionary -> PSNR 32.76

Top: reconstruction, middle: reference, bottom: abs error.
Note: this is only a subset of the dictionary.

4×4 tiles are an interesting switching point, as we can compress the dictionary with hardware BC compression. We’d definitely want that, as the dictionary gets large! Note that we get fewer atoms per tile than there were pixels in the first place! So it’s not only compressing the channel count, but going below the actual pixel count as well. The dictionary looks much more interesting – all those tiny lines, edges, textured areas.

Tile size 8, 12 atoms per tile, 6% of the image in dictionary -> PSNR 28.11

Top: reconstruction, middle: reference, bottom: abs error.
Note: This is only a subset of the dictionary.

At this point, the compression got very aggressive (for each 8×8 pixels of the image, we store only 12 weights and indices, both of which can be quantized!) – and the tiles are the same size as JPG ones. Still, it looks pretty good visually, and the oversmoothing of normals shouldn’t be an issue in practice – but if you look very closely at the height map, you will notice some blocky artifacts. The dictionary itself shows a lot of variety of patches and content types.

Performance – compression and decompression

Why are dictionary based techniques not used more widely? I think the main problem is that the compression step is generally pretty slow – especially the atom fitting step. For each element and for each atom, we are exhaustively searching for the best match, performing a least squares fit, recomputing the residual, and then searching again. Every operation is relatively fast, but there are many of them! How slow is it? To generate the examples in my post, colab was running for a few minutes per result – ugh. On the other hand, this is a Python implementation, with brute-force linear search. Every single operation in the atom fitting part is trivially parallelizable, and can be accelerated using spatial acceleration structures. I think that it could be made to run in a few seconds / up to a minute per texture set – which is not great, but also not terrible if it is not part of an interactive workflow, but an asset bake.

How about the decompression performance? Here things are much better. We simply fetch atoms one by one and add their weighted contributions. So similar to SVD based compression, this becomes just a series of multiply-adds and the cost is proportional to the atom count per tile or pixel. We incur a cost of indirection and semi-random access per tile, which can be slow (extra latency), however when decompressing larger tiles, this is very coherent across many neighboring pixels. Also, generally dictionaries are small so should fit in cache. I think it’s not unreasonable to also try to compress multiple similar texture sets using a single dictionary! In such a case when drawing sets of assets, we could be saving a lot on the texture bandwidth and cache efficiency. Specific performance trade-offs depend on the actual tile sizes, and for example whether block compression is used on the dictionary tiles. Generally, my intuition says that it can be made practical (and I’m usually pessimistic) – and Bo Li’s mention of using VQ for shadowmask (de)compression confirms it.
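A minimal sketch of what per-tile decompression boils down to, using the same hypothetical layout as the earlier sketches (a few atom indices and weights per tile plus a shared dictionary and a stored per-channel mean):

```python
import numpy as np

def decode_tile(indices, weights, dictionary, mean, tile, channels):
    # A tile is just the weighted sum of its atoms: a handful of multiply-adds,
    # with cost proportional to the atom count per tile.
    vector = weights @ dictionary[indices]    # (k,) @ (k, tile*tile*channels)
    # Un-flatten and add back the per-channel image mean.
    return vector.reshape(tile, tile, channels) + mean
```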

Relationship to neural networks

When you look at the dictionary atoms above, and at visualizations of some common neural networks, there is some similarity.

Neural network filters learned for image classification from a classic paper by Krizhevsky et al.

This resemblance might seem weak, but the network was trained on all kinds of natural images (and for a different task), while our dictionary was fit to a specific texture set. Still, you can observe some similar patterns – the appearance of edges, fixed color patches, more highly textured regions. It also resembles some other overcomplete frames like Gabor wavelets. My intuition is that since the network also learns some representation of the image (for the given task), those neural network features are also an overcomplete representation – and while they are not sparse, some regularization techniques can partially sparsify them.

Neural networks have mostly replaced sparse methods, as they decoupled the slow training from the (relatively) fast inference; however, I think there is something beautiful and very “interpretable” about the sparse approaches, and when you have already learned a dictionary and don’t need to iterate, for many tasks they can be attractive and faster.

Conclusions

My post aimed to provide a gentle (not comprehensive, but rather more intuitive and hopefully inspiring) introduction to sparse and dictionary learning methods and why they can be useful and attractive for some image compression and processing tasks. The only novelty is the proposal of using them for compressing PBR textures (again I want to strongly emphasize the idea of exploiting relationships between the channels – if you compress your PBR textures separately, you are throwing out many bits! 🙂 ), but I hope that you might find some other interesting uses. Compressing animations? Compressing volumetric textures and GI? Filtering of partial GI bakes (where one part of a level is baked and we use its atoms to clean up the other one)? Compressing meshes? Let me know what you think!


Dimensionality reduction for image and texture set compression

Teaser image: Example PBR texture set compressed with the presented technique. Textures credit cc0textures, Lennart Demes.

In this blog post I am going to describe some of my past investigations on reducing the number of channels in textures / texture sets automatically and generally – without assuming anything about texture contents other than correspondence to some physical properties of the same object. I am going to describe the use of Singular Value Decomposition (a topic that I have blogged about before) / Principal Component Analysis for this purpose.

The main use-case I had in mind is for Physically Based Rendering materials with large sets of multiple textures representing different physical material properties (often different sets per different material/object!), but can be extended to more general images and textures – from simple RGB textures, to storing 3D HDR data like irradiance fields.

As usual, I am also going to describe some of the theory behind it, with a gentle introduction on a “toy” problem and two examples on simple RGB textures from the Kodak dataset, before proceeding to the large material texture set use-case.

Note: There is one practical, significant caveat of this approach (related to GPU block texture compression) and a reason why I haven’t written this post earlier, but let’s not get ahead of ourselves. 🙂

This post comes with three colabs: Part 1 – toy problem, Part 2 – Kodak images, Part 3 – PBR materials.

Intro – problem with storage of physically based rendering materials

I’ll start with a confession – this is a blog post that I was aiming to write… almost 3 years ago. Recently a friend of mine asked me “hey, I cannot find your blog post on materials dimensionality reduction, did you end up writing it?” and the answer is “no”, so now I am going to make up for it. 🙂

The idea for it came when I was still working on God of War, and one of the production challenges was the disk and memory storage cost of textures and materials. Without going into details of the challenges of our specific material system and art workflows, I believe this is a challenge for all contemporary AAA video games – physically based rendering workflows simply require a lot of textures, stressing not only the BluRay disk capacities; gamers are also not happy about downloading tens of gigabytes of just patches…

To list a few examples of different textures used for a single object/material (note: probably no single material used all of them, some of them are rare, but I have seen all of them in production):

  • Albedo textures,
  • Specularity or specular color textures,
  • Gloss maps,
  • Normal maps,
  • Heightmaps for blending or parallax mapping,
  • Alpha/opacity maps for blending,
  • Subsurface or scattering color,
  • Ambient occlusion maps,
  • Specular cavity maps,
  • Overlay type detail maps.

Uff, that’s a lot! And some of those use more than one texture channel! (like albedo maps that use RGB colors, or normal maps that are usually stored in two or three channels). Furthermore, with increasing target screen resolutions, textures also need to be generally larger. While I am a fan of techniques like “detail mapping” and using masked tilers to avoid the problem of unique large textures everywhere, those cannot cover every case – and the method I am going to describe applies to them as well.

All video games use hardware accelerated texture block compression, and artists manually decide on the target resolution of different maps. It’s time consuming, not very creative, and fighting those disk usage and memory budgets is a constant production struggle – not just for the texture streaming or fitting on a BluRay disk, but also the end users don’t want to download 100s of GBs…

Could we improve that? I think so, but before we go to the material texture set use-case, let’s revisit some ideas that apply to “images” in more general sense.

Texture channels are correlated

Let’s start with a simple observation – all natural images have correlated color channels. The “physical” explanation for it is simple – no matter what the color of the surface is, when it is lit, its color is modulated by the light (in fact by the light spectrum after interacting with the surface through the BRDF, but let’s keep our mental model simple!). This means that all of those color channels will be at least partially correlated – light intensity / luminance modulates whatever the underlying physical surface color was (and vice versa).

What do you mean by “correlated” color channels and how to decorrelate them?

Let’s look at a contrived example – a list of points representing pixels in a two color (red and green) channel texture, to keep the visualization in 2D. In this toy scenario we can observe a high correlation between red and green in the original red/green color space. What does this mean? If the value of the green channel brightness is high, it is very likely that the red channel will also be high. On the following plot, every dot represents a pixel of an image positioned in the red/green space on the left:

Left: scatter plot of points in our example in original, RG space. Right: scatter plot of the same points when “decorrelated” by projecting them onto luminance axis.

On the right plot, I have projected those on the blue line, representing something like luminance (overall brightness).

We can see that the original signal, which was changing a lot on both axes, after projecting onto the luminance axis ends up with significantly less variation and no correlation on the perpendicular “chrominance” (here defined as a simple difference of the red and green channel values) axis. A decorrelated color space means that values of a single channel are not dependent on or linearly related to the values of another color channel. (There could be a non-linear relationship, but I am not going to cover it in this post; it becomes significantly more difficult to analyze, requiring iterative or non-linear methods.)

This property is useful for compression: while the original image needed for example 8 bits for both the red and green color channels, in the decorrelated space we can store the other axis with significantly smaller precision – there is much less variation, so possibly 1-2 bits would suffice, or we could store it at a smaller resolution.

This is one of a few reasons why we use formats like YUV for encoding, compressing and sending images and videos. (As a complete offtopic, three other reasons that come to my mind right now are also interesting – the human visual perception system being less sensitive to color changes, allowing for more compression of color; chroma varying more smoothly (fewer high frequencies) in natural images; and finally now-obsolete historical reasons, where transmitted signals needed “backwards compatibility” so they could still be viewed on old black&white TVs.)

Let’s have a look at two images from the Kodak dataset in RGB and YUV color spaces. I have deliberately picked one image with many different colors (greenish water, yellow boat, red vests, blue shorts, green and white t-shirts), and the other one relatively uniform (warm-green white balance of an image with foliage):

RGB image and Y, U, and V color channels.
RGB image and Y, U, and V color channels.

It’s striking how much of the image content is encoded just in the Y, luminance color channel! You can fully recognize the image from the luminance, while chrominance is much more “ambiguous”, especially on the second picture.

But let’s look at it more methodically and verify – let’s plot the absolute value of the correlation matrices of color channels of those images.

Correlation matrices of two Kodak images / image crops in RGB space. The one with less colors (right) has almost perfectly correlated color channels! The other one (left) still has very strong correlation, but only between some channel pairs.

This plot might deserve some explanation: Value of 1 means fully correlated channels (as expected on the diagonal – variable is fully correlated with itself).

A value of zero means that two color channels don’t have any linear correlation (note – there could be some other correlation, e.g. quadratic!), and the matrices are symmetric, as the red color channel has the same correlation with the green one as the green has with the red one.

The first picture, with more colors, has significantly less correlation between color channels, while the more uniformly colored one shows very strong correlation.
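For reference, a minimal numpy sketch of how such matrices can be computed, assuming an (H, W, 3) float image:

```python
import numpy as np

def channel_correlation(img):
    # Flatten spatial dimensions: each row is a pixel, each column a color channel.
    flat = img.reshape(-1, img.shape[-1])
    # np.corrcoef expects variables in rows, hence the transpose;
    # np.cov(flat.T) would give the covariance matrices shown a bit further below.
    return np.abs(np.corrcoef(flat.T))
```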

I claimed that YUV generally decorrelates color channels, but let’s put this to a test:

Absolute value of correlation matrices of two Kodak images / image crops in YUV space. A value of 1 means fully correlated or negatively correlated values (as expected on the diagonal – a variable is fully correlated with itself). The one with fewer colors (right) after conversion to YUV has almost no cross channel correlation. The other one (left) has some decorrelation, but it is not as strong as in the first example.

The YUV color space decorrelated the more colorful image to some extent, while the less colorful one got decorrelated almost perfectly.

But let’s also see how much information is in every “new” channel by displaying the absolute covariances (one can think of the covariance matrix as the correlation matrix multiplied by the variances):

Covariance matrices of the two YUV converted Kodak images. Covariance is a pretty good measurement of “information”, as it also shows us how much variation there is in each color channel, and in the relationships between them.

In those covariance plots we can verify our earlier intuition/observation that the luminance represents the image content/information pretty well and most of it is concentrated there.

Now, the Y luminance channel is constructed to roughly correspond to the human visual perception of brightness. There are some other color spaces like YCoCg that are designed for more decorrelation (as well as better computational efficiency).

But what if we have some correlation, but not along this single axis? Here’s an example:

Despite strong linear correlation, projecting onto fixed axis only partially decorrelates color channels.

Such a fixed axis doesn’t work very well here – we have somewhat reduced the ranges of variation, but there is still some “wasteful” strong linear correlation…

Is there some transformation that would decorrelate them perfectly while also providing us with the “best” representation for a given subset of components? I am going to describe a few different “angles” that we can approach the problem from.

Line fitting / least squares vs dimensionality reduction?

The problem as pictured above – finding best line fits – is simple. In theory, we could do some least squares linear fit there (similarly to what I described in my post on guided image filters). However, I wanted to strongly emphasize that line fitting is not what we are looking for here. Why so? Let’s have a look at the following plot:

Difference between least squares line fitting and PCA/SVD. The first one seeks smallest sum of squared errors in the original y space, while the latter obtains a line with the smallest sum of squared errors of the projection.

Least squares line fitting will minimize the error and fit for the original y axis (as it is written to minimize the average “residual” error a*x+b – y), while what we are looking for is a projection axis with smallest error when the point is projected on that line! In many cases those lines might be similar, but in some they might be substantially different.

Here’s an example on a single plot (hopefully still “readable”):

Least squares line fitting – blue line – might obtain substantially different results from optimal projection axis – orange line.

For the case of projection, we want to minimize the component perpendicular to the projection axis, while linear line fitting / linear least squares minimize the error measured in the y dimension. Those are very different theoretical (and also practical when the number of dimensions grows) concepts!

In the earlier plot I mentioned the name PCA, which stands for Principal Component Analysis. It is a technique for dimensionality reduction – something that we are looking for.

Instead of writing how to implement PCA with covariance matrices etc. (it is quite simple and fun to derive it using Lagrange multipliers, so I might write a separate post on it in the future), I wanted to look at it from a different perspective – that of Singular Value Decomposition, which I blogged about in the past in the context of filter separability.

I recommend that post of mine if you have not seen it before – not because of its relevance (or my arrogance), but simply as it describes a very different use of SVD. I think that seeing some very different use-cases of mathematical tools, concepts, and their connections is the best way to understand them, build intuition and potentially lead to novel ideas. As a side note – in linear algebra packages, PCA is usually implemented using SVD solvers.

Representing images as matrices – image doesn’t have to be a width x height matrix!

Before describing how we are going to use SVD here, I wanted to explain how we want to represent N-channel images by matrices.

This is something that was not obvious to me, and even after seeing it, defaulting to thinking about it in different ways took me years. The “problem” is that usually in graphics we immediately think of an image as a 2D matrix – the width and height of the image get represented by the matrix width and height; simply adding color channels immediately poses a problem, as it requires a 3D tensor. This thinking is very limited, only one of a few possible representations, and arguably not a very useful one. It took me quite a while to stop thinking about it this way, and it is important even for “simple” image filtering (bilateral filter, non-local means etc.).

In linear algebra, matrices represent “any” data. How we use a matrix to represent image data is up to us – but most importantly, up to the task and goal we want to achieve. The representation that we are going to use doesn’t care about the spatial positions of pixels – they don’t even have to be placed on a uniform grid! Think of them just as “some samples somewhere”. For such use-cases, we can represent the image with all the different pixel positions (or sample indices) on one axis, and color channels on the other axis.

Two equivalent image representations – as a w x h x c 3D tensor, or as a w*h x c 2D matrix.

This type of thinking becomes very important when talking about any kind of processing of image patches, collaborative filtering etc. – we usually “flatten” patch (or whole image) and pack all pixels on one axis. If there is one thing that I think is particularly important and worth internalizing from my post – it is this notion of not thinking of images as 2D matrices / 3D tensors of width x height pixels.

Important note – as long as we remember the original dimensions and sample locations, those two representations are equivalent and we can (and are going to!) transform back and forth between them.

This approach extends to 3D (volumetric textures), 4D (volumetric videos) etc. – we can always pack width * height * depth pixels along one axis.

SVD of image matrices

We can perform the Singular Value Decomposition of any matrix. What do we end up with for the case of our image represented as a w*h x c matrix?

(Note: I tried to be consistent with ordering of matrices and columns / rows, but it’s usually something that I trip over – but in practice it doesn’t matter so much as you can transpose the input matrix and swap the U and V matrices.)

The SVD is going to look like:

Singular value decomposition of our image data. We end up with three separate matrices, U, S, V where U matrix is the same size as the input image, singular value matrix S can be seen as a c x c diagonal matrix (depending on used interpretation there might be more rows, but the rest entries are zero), and a c x c V matrix.

Oh no, instead of a single matrix, we ended up with three different matrices, one of which is as big as the input! How does this help us?

Let’s analyze what those matrices represent and their properties. First of all, matrix U represents all of our pixels in our new color space. We still have C channels, but they have different “meaning”. I used color coding of “some” colors to emphasize that those are not our original colors anymore.

Those new c channels contain different amounts of energy / information and have different “importance”. The amount is represented in the diagonal S matrix, with decreasing singular values. The first new color channel will represent the largest amount of information about our original image, the next channel less, etc. (Note the same colors used as in matrix U.)

Finally, there is a matrix V that transforms between the new color channels space, and the old one. Now the most important part – this transformation matrix is orthogonal (and even further, orthonormal). It means that we can think of it as a “rotation” from the old color space, to the new one, and every next dimension provides different information without any “overlap” or correlation. I used here “random” colors, as those represent weights and remappings from the original, RGBA+other channels, to the new vector space.

Now, together the matrices V and S inform us about transforming image data into some color space where channels are orthogonal and sorted by their “importance”, and where the first k components are “optimal” in terms of contained information (see the Eckart–Young theorem).

Edit / bonus section: A few readers have commented in a thread on twitter (and on Facebook) that while Eckart-Young is the linear algebra perspective, image compression and stochastic process analysis usually use the name Karhunen–Loève theorem/transform. It is a concept that is new to me (as I have not really worked in either of those fields) and I have to – and will be happy to – read more about it. Thanks to everyone for pointing out this connection, always excited to learn some new concepts!

Think of it as rotating input data to a perfectly decorrelated linear space, with dimensions sorted based on their “importance” or contribution to the final matrix / image, and fully invertible – such representation is exactly what we are looking for.

How does it relate to my past post on SVD?

A question that you might wonder is – in my previous post, I described using SVD for analyzing separability and finding approximations of 2D image filters – how does this use of SVD relate to the other?

We have looked in the past at singular components, sorted by their importance, combined with single rows + columns reconstructing more and more of the original filter matrix.

Here we have a similar situation, but our U “rows” are all the pixels of the image, each one with a single channel, and the V “columns” are combinations of the original input channels that transform to this new channel space. Just like before, we can add singular values multiplied by the corresponding vectors one by one, to get a more and more accurate representation of the input (no matter whether it’s a filter matrix or a flattened representation of an image). The first singular value and color combination will tell us the most about the image, the next one tells a bit less, etc. Some images (with lots of correlation) have a huge disproportion of those singular values, while in some other ones (with almost no correlation) they are going to decay very slowly.

This might seem “dry” and is definitely not exhaustive. But let’s try to build some intuition for it by looking at the results of SVD on our “toy” problem. If we run it on the toy problem image data as it is, we will immediately encounter a limitation that is easy to address:

Left: original data with a plotted yellow line corresponding to the “luminance” and a green line representing new first “dimension” of the SVD. Center: Same data converted to luminance/chrominance. Right: Same data when transformed by SVD matrix V and S.

The line corresponding to the first singular value passes nicely through the cluster of our data points (unlike the simple “luminance”), however it clearly doesn’t decorrelate the points – we still see some simple linear relationship in the transformed data.

This is a bit disappointing, but there is a simple reason for it – we ran SVD on the original data, fitting a line/projection, but also forcing it to go through zero (the origin)!

Usually when performing PCA or SVD in the context of dimensionality reduction, we want to remove the mean of all input dimensions, effectively “centering” the data. This way we can “release” the line intercept. Let’s have a look:

Left: original data with a plotted yellow line corresponding to the SVD without mean subtraction, and a green line representing new first “dimension” of the SVD with “centering” of the data before fitting. Center: Uncentered SVD shows some remaining linear correlation and value range skew away from zero. Right: Same data with “centered” SVD shows perfect decorrelation and a significant range reduction of the second dimension, while more information is represented in the first dimension.

As presented, shifting the fit center to the mean of the data shifts the line fit and achieves perfect “decorrelating” properties. PCA that runs on covariance matrices does this centering implicitly, while in the case of SVD we need to do it manually.

In general, we would probably want to always center our data. For compression-like scenarios it has the additional benefit that we need to compress only the “residual”, that is the difference between the local color and the average for the whole image, making the compression and decompression easier at virtually no additional cost.
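A minimal numpy sketch of centered SVD used for channel decorrelation and reduction, assuming an (H, W, C) float image and keeping k of the new channels (the helper name is mine):

```python
import numpy as np

def svd_channel_reduction(img, k):
    h, w, c = img.shape
    flat = img.reshape(-1, c)          # pixels on one axis, channels on the other
    mean = flat.mean(axis=0)
    centered = flat - mean             # "release" the intercept by centering
    # Thin SVD of the (h*w, c) matrix: U (h*w, c), S (c,), Vt (c, c).
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the k most important decorrelated channels...
    reduced = u[:, :k] * s[:k]         # per-pixel coordinates in the new space
    # ...and reconstruct an approximation by rotating back and adding the mean.
    approx = reduced @ vt[:k] + mean
    return reduced.reshape(h, w, k), approx.reshape(h, w, c)
```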

In general, many ML applications go even a step further and “standardize” the data, bringing it to the same range of values by dividing the dimensions by their standard deviations. For many ML-specific tasks this makes sense, as for tasks like recognition we don’t really know the importance of the ranges of different dimensions / features. For example, if someone was writing a model to recognize a basketball player based on their weight, height, and maybe average shot distance, we don’t want to get different answers depending on whether we enter the player’s weight in kilograms vs pounds!

However, for a task like the one we are aiming for – dimensionality reduction for “compression” – we want to preserve the original ranges and meanings, and can assume that they are already scaled based on their importance. We can take it even further and use the “importance” to guide the compression, and I will describe it briefly around the end of my post.

SVD on Kodak images

Equipped with knowledge on how SVD applies to decorrelating image data, let’s try it on the two Kodak pictures that we analyzed using simple YUV transformations.

Let’s start with just visualizing both images in the target SVD color space:

Results of projecting images into a color space resulting from Singular Value Decomposition. Note that it seems to be very similar to the YUV color space!

The results of the projection seem to be generally very similar to our original YUV decomposition, with one main difference being that our projected color channels are “sorted”. But we shouldn’t just “eyeball” those; instead we can analyze this decomposition numerically.

Covariance matrices and singular values

Comparing covariance matrices (and square roots of those matrices – just for visualization / dynamic range "flattening") shows the real difference between a pretty good but "fixed" basis like YUV and a decorrelated basis found by SVD:

Absolute value of covariance matrices that I normalized to 1 in the first channel (position 0,0 of each different basis projection) – in addition to regular covariance matrices on the left, I also added a version displaying a square root for dynamic range compression / easier visualization on the right.
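(If you want to reproduce this kind of comparison, the per-color-space covariance is a one-liner – rgb_to_yuv and the SVD-projected image below are assumed to come from earlier steps and are not shown here:)

    def channel_covariance(img):
        # Covariance between channels, treating every pixel as one multi-channel sample.
        flat = img.reshape(-1, img.shape[-1])
        return np.cov(flat - flat.mean(axis=0), rowvar=False)

    # cov_yuv = channel_covariance(rgb_to_yuv(img))   # visible off-diagonal terms
    # cov_svd = channel_covariance(svd_img)           # (numerically) diagonal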

As expected, SVD fully decorrelates the color channels. We can also look at how the singular values for those images decay:

Decay of the energy in different channels of YUV conversion and in SVD – in linear and in log scales.

So again, hopefully it's clear that with SVD we get faster decay of the "energy" contained in different image channels – the first N channels will reconstruct the image better. Let's look at a visual example.

Visual difference when keeping only two channels

To fully appreciate the difference, let's look at how the images would look if we used only two channels – YUV vs SVD – and the corresponding PSNRs:

Using only two “decorrelated” color channels – YUV vs SVD – visual difference as well as PSNR.

The difference is striking visually, as well as very significant numerically.

This shouldn't come as a surprise, as SVD finds a reparametrization / channel space that is the "best" projection for the L2 metric (which is also used for PSNR computations), decorrelates it, and sorts this space according to importance.

I wonder why JPEG doesn't use such a decomposition – I would be very curious to read about it in the comments! 🙂 I have two guesses. First, JPEG is an old image format with lots of "legacy" and support in hardware – both encoders and decoders. A single point-wise floating point matrix multiply (where the matrix is just 4×3 – a 3×3 multiplication plus adding back the means) seems super cheap to any graphics programmer, but would take a lot of silicon and power to do efficiently in dedicated hardware. The second guess is related to the "perceptual" properties of chrominance (and the worse human vision sensitivity to it) – SVD doesn't guarantee that the decomposition will happen in the rough direction of luminance; it could be "any" direction.

Edit / addendum: While my remark re JPEG was about the YUV luminance/chrominance decomposition as compared to doing PCA on the color channels, the use of the DCT basis for the spatial compression component (instead of the previously mentioned KLT and eigendecomposition of the data directly) is considered "good enough", and I learned from a Twitter reply thread with Per Vognsen and Fabian Giesen about some justifications for its use and another set of connections to stochastic processes.

Now, this use case – decorrelating RGB channels – can be very useful not only for standard planar image textures, but also for 3D volumetric colored textures. I know that many folks store and quantize/compress irradiance fields in YUV-like color spaces, and using SVD there will most likely give better quality/compression ratios. This would become even more beneficial when storing sky visibility / ambient occlusion in addition to the irradiance, and one of the later sections includes an example AO texture (albeit for the 2D case).

Relationship to block texture compression

If you have worked with GPU “block texture compression” (like DXT / BC color formats) you can probably recognize some of the described concepts – in fact, the simplest BC1 format does a PCA/SVD and finds a single axis, on which colors are projected! I recommend Nathan Reed’s blog post on this topic and a cool visualization.

This is done per small, 4×4 block, and because of that we can expect much more of the energy to be stored in the single component. This is a visualization of how such BC compression would, more or less, look with our toy problem:

The simplest variants of BC compression project colors onto a single line that is found using PCA/SVD approaches and then quantized heavily.

Any block that doesn't have almost all of its information in the first singular value (so is neither constant nor a single color gradient) is going to end up with discoloration and block artifacts – but since we are talking about just 4×4 blocks, GPU block compression usually works pretty well.

Block compression will also pose a problem for the initial use-case and research direction I tried to follow – but more on that later!

What if we had many more channels?! PBR texture sets.

After a lengthy introduction, time to get back to the original purpose of my investigations – PBR texture sets for materials. How would SVD work there? My initial thinking was that with many more channels, some of them have to be correlated (as local areas will share physical properties – metal parts might have high specularity and gloss, while diffuse parts might be rough and more scratched). While such correlations might be weak, there are many channels, and a lot of potential for different, non-obvious correlations as well – or correlations that are "accidental" and not related to the physical properties. 🙂

Since I don't have access to any actual game assets anymore, and am not sure about the copyright / licensing of many of the scenes included with the engines, I downloaded a texture set from cc0textures by Lennart Demes (such an awesome initiative!). Texture sets for materials there are a bit different from what I was used to in the games I worked on, but are more than enough for demonstration purposes.

I picked one set with some rocks/ground (on GoW, the programmers were joking that half of the GoW disk was taken up by rock textures, and at some point in the production it wasn't that far from the truth 🙂 ).

The texture set looks like this:

Example PBR texture set created by Lennart Demes that we are going to analyze and “compress” with SVD.

For the grayscale maps I took only single channels, and normals are represented using 3 channels, giving a total of 9 channels.

The cool thing is – we don’t really have to “care” or understand what those color channels represent! We could even use 3 color channels to represent the “gray” textures, and the SVD would recognize it automatically (with some singular values ending up exactly zero) but I don’t want to present misleading/too optimistic data.

For visualization, I represented this texture set as horizontally stacked textures, but the first thing we are going to do is represent it as a 2048 x 1365 x 9 array, and then, for the purpose of SVD, as a (2048 * 1365) x 9 one.
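(As a sketch – assuming the 9 maps are already stacked into a numpy array called channels of shape (2048, 1365, 9); the variable names are my own:)

    h, w, c = channels.shape                     # (2048, 1365, 9)
    flat = channels.reshape(-1, c)               # (2048 * 1365) x 9 matrix
    means = flat.mean(axis=0)
    U, S, Vt = np.linalg.svd(flat - means, full_matrices=False)

    # New decorrelated "color" channels, scaled by their importance (U times S).
    new_channels = (U * S).reshape(h, w, c)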

If we run the SVD on it (and then reshape it to w x h x 9), we get the following 9 decorrelated new “color”/image channels:

New color channels resulting from the SVD decomposition, stacked together. I used the data in the U matrix multiplied by the S matrix to demonstrate their relative importance – feature magnitudes.

Alternatively, if we display it as 3 RGB images:

Same data as above (U times S from the SVD decomposition), but with 9 channels displayed as three RGB textures.

Immediately we can see that the information contained in those decays very quickly, with the last four containing barely any information at all!

We can verify this looking at the plot of singular values:

As seen from the textures earlier, after 5 color channels we get a significant drop-off of the information and start observing diminishing returns (the "knee" of the cumulative curve).

Let's have a look at how SVD remapped the input to the output channels and what kind of correlations it found.

In the following plot, I put abs(S.T) (I use the absolute value in most of my plots, as it's easier to reason about the existence of correlations than about their "direction"):

In this matrix, columns represent input features, and rows represent the output channels/features. Color/magnitude signifies the abs() of the contribution. Input channel 0 is AO and contributes to four output features; channels 1, 2, 3 are albedo and map mostly to the first two output channels, and a bit to channels 5/6/7, etc.

I marked two input textures – albedo and normals. As each is a 3-channel representation of the same physical property, we can see how, because of this correlation, they contribute to very similar output color channels, almost like the same "blocks". We can also observe that, for example, the last two output features contribute a little bit to AO, a little bit to albedo, and a bit to one channel of normals – comprising some leftover residual error correction in all of the channels.

Now, if we were to preserve only some of the output channels (for data compression), what would be the reconstruction quality? Let’s start with the most “standard” metric, PSNR:

PSNR of the reconstruction if we take only N channels from the SVD decomposition. Note that even with zero channels, we get non-zero PSNR due to the mean subtraction. Generally PSNRs above ~36dB are considered "ok". I "clip" values above 50dB as such high values don't make any difference for any perceptual purpose.
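(The reconstruction from N channels and the PSNR metric behind such a plot can be sketched like this, reusing the names from the earlier sketch:)

    def reconstruct(U, S, Vt, means, n_keep):
        # Keep only the first n_keep decorrelated channels and invert the transform.
        return (U[:, :n_keep] * S[:n_keep]) @ Vt[:n_keep, :] + means

    def psnr(reference, approximation, peak=1.0):
        mse = np.mean((reference - approximation) ** 2)
        return 10.0 * np.log10(peak ** 2 / mse)

    # for n in range(9): print(n, psnr(flat, reconstruct(U, S, Vt, means, n)))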

I skipped keeping all 9 channels in the plot, as there we would expect "infinite" PSNR (minus floating point imprecision / roundoff error).

So generally, for most of the channels, we can get above ~30dB while keeping only 5 data channels – which is what we intuitively expected from looking at the singular values. This is how the reconstruction from keeping only those 5 channels looks:

Top: Original PBR texture set. Bottom: Reconstruction from the first 5 SVD channels. The difference is most visible on the albedo and the roughness.

The main problems I can see are a slight albedo discoloration (some rocks lose the desaturated rock scratches) and different frequency content on the last, roughness texture.

Going back to the PSNR plots, we can see that they are different per channel – e.g. displacement becomes very high quickly, while the color and roughness stay pretty low… Can we "fix" those and increase the importance of e.g. color?

Input channel weighting / importance

I mentioned before that many ML applications would want to "standardize" the data before computing PCA / SVD and make sure the range (as expressed by min/max or standard deviations) is the same, so that we don't bake any assumption about the importance of some features as compared to the others into the analysis.

But what if we wanted to bake our desired importance / weighting into the decomposition? 🙂

Artists do this all the time manually by exporting different PBR textures at different resolutions; most 3D engines have some reasonable "defaults" there. We can use the same principle to guide our SVD and improve the reconstruction ratio. To do it, we simply rescale the input channels (after the mean subtraction) by some weights that control the feature importance. For the demonstration, I will rescale the color map to be 200% as important, while reducing the importance of displacement to 75%.
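(A sketch of such weighting, reusing the earlier names; the channel indices below are my assumption based on the stacking order described above, AO / albedo / displacement / normals / roughness:)

    weights = np.ones(c)
    weights[1:4] = 2.0    # albedo channels boosted to 200%
    weights[4] = 0.75     # displacement reduced to 75% (index is an assumption)

    Uw, Sw, Vtw = np.linalg.svd((flat - means) * weights, full_matrices=False)

    # At reconstruction time we undo the weighting to get back the original value ranges.
    recon = ((Uw[:, :5] * Sw[:5]) @ Vtw[:5, :]) / weights + means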

This is how the channel contributions and PSNR curves look before/after:

Channel contributions without and with channel weighting, increasing the contribution of albedo (channels 1/2/3) and decreasing the weight of the fourth channel. Notice how with the weighting most of the albedo contribution happens in the first singular value, and the 4th channel (displacement) gets distributed among a few singular values.
PSNR curves before and after the channel weighting, increasing the contribution of albedo (channels 1/2/3) and decreasing the fourth channel. Notice how increasing the reconstruction accuracy of some channels had to lower it for some other ones, as well as the global PSNR.

Such simple weighting directly improved some of the metrics we might care about. We had to "sacrifice" PSNR and reconstruction quality in different channels, but this can be a tunable, conscious decision.

And finally, the visual results for materials reconstructed from just 5 components:

Top: original textures. Middle: SVD decomposition and reconstruction from 5 components without weighting. Bottom: Reconstruction from 5 components that includes increasing the weight of the albedo while decreasing the weight of the height map (third displayed texture). We observe the visual error decreasing significantly on the albedo – no discoloration – while increasing significantly for the height map.

Other than some "perceptual" or artist-defined importance, I think that it could also be adjusted automatically – decreasing the weight of channels that are reconstructed beyond diminishing returns, and increasing it for the ones that get too high an error. It could be a semi-interactive process, dynamically adjusting the number of channels preserved and the weighting – an area for some potential future research / future work. 🙂

Dropping channels vs smaller channels

So far I have described "dropping channels" – this is a really easy way to get significant space savings (e.g. keeping 5/9 channels == almost 50% savings with PSNR in the mid 30s), but isn't it too aggressive? Maybe we can do the same thing that JPG and similar encoders do – just store the further components at smaller resolutions.

For the purpose of this post and just demonstration, I will do something really “crude” and naive – box filter and subsample (similar to mip-map generation), with nearest neighbor upsampling – we know we can do better, but it was simpler to implement and still demonstrates the point. 🙂
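(For reference, the "crude" downsampling/upsampling could be sketched like this – a box filter over factor x factor blocks plus sample repetition; my own helpers, not the colab code:)

    def box_downsample(channel, factor):
        # Crude box filter + subsample (similar to mip-map generation).
        h, w = channel.shape
        h, w = h - h % factor, w - w % factor
        blocks = channel[:h, :w].reshape(h // factor, factor, w // factor, factor)
        return blocks.mean(axis=(1, 3))

    def nn_upsample(channel, factor):
        # Nearest-neighbor upsampling - just repeat every sample.
        return np.repeat(np.repeat(channel, factor, axis=0), factor, axis=1)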

For the demonstration I kept the first 4 components at the original resolution, the next 3 components at half resolution, and the final 2 components at quarter resolution. In total this gives us slightly smaller storage than before (7/16th), and the PSNRs are as follows: average PSNR 36.96, AO 43.69, Color 41.08, Displacement 29.98, Normal 41.25, Roughness 35.13.

Not too bad for such naive downsampling and interpolation!

When downsampled, the full images looked more or less the same, so instead here is a zoom-in to crops (as I expected to see some blocky artifacts):

Top: original textures crops, bottom: crops of textures reconstructed from aggressive decimation of further SVD channels with super-naive bilinear/box filter and NN upsampling.

With the exception of the height map, which we purposefully "downweighted" a lot, the reconstruction looks perceptually very good! Note that its ugly "ringing" and blocky look would be significantly better if even simple bilinear interpolation was used.

I'd add that we typically also have a third option – adjusting the quantization and the number of bits with which we encode each of our channels. This actually makes the most sense (because our channel value ranges get smaller and smaller, and we know them directly from the proportions of the singular values), and is something I'd recommend playing with.

Runtime cost – conversion back to the original channel space

Before I conclude with the biggest limitation and potential "deal breaker" of the technique, a quick note on the runtime cost of such a decomposition.

The decomposed/decorrelated color channels are computed with zero mean and unit standard deviation, so we would generally want to shift and rescale them so that they cover the range [0, 1] instead of their original, signed range. Let's call this a (de)quantization matrix M. So far we have matrices U (pixels x N), S (N x N, diagonal), T (N x N, orthogonal), M (we could make it N+1 x N to use "homogeneous coordinates" and get translations for mean centering for free), an input feature weight matrix / vector W, and a vector of means A.

We can multiply all of those matrices (other than U, which becomes our new image) offline as a part of precomputation. Using homogeneous coordinates (N input features + a constant '1' in an additional channel), we end up with an (N+1 x N) conversion matrix if keeping all channels, or an (M+1 x N) one where M < N if dropping some channels. The full runtime application cost becomes a matrix multiply per pixel – or alternatively, a series of multiply-adds – each reconstructed feature becomes a weighted linear combination of the "compressed" / decorrelated features, so the total cost per single reconstructed texture channel is M+1 madds.
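(A sketch of the offline folding – here scale and offset are assumed per-channel dequantization factors mapping the stored [0, 1] data back to the U range, and weights/means come from the earlier steps; the exact composition order is my own illustration, not code from the colabs:)

    m_keep = 5
    A = (np.diag(scale * S[:m_keep]) @ Vt[:m_keep, :]) / weights    # (M x N) part
    b = (offset * S[:m_keep]) @ Vt[:m_keep, :] / weights + means    # constant (N,) part
    conversion = np.vstack([A, b])                                  # (M+1) x N matrix

    # Runtime, per pixel: append a constant 1 to the stored M channels and do one small
    # matrix multiply - M+1 madds per reconstructed texture channel.
    # features = np.concatenate([stored_texel, [1.0]]) @ conversion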

This is very cheap to do on a GPU without any optimizations (maybe one could even use "tensor cores", which are simply glorified mixed-precision matrix multipliers?), but it has the disadvantage that even if we are going to use a single channel (for example the heightmap for raymarching or in a depth prepass), we still need to read all the stored channels. If you have many passes or use-cases like this, this could increase the required texture bandwidth significantly, but otherwise it could be almost "free" (ALU is generally cheap, especially if it doesn't have any branching that would increase the used register count).

Note however that it’s not all-or-nothing – if you want you could store some textures as before, and compress and reduce dimensionality of the other ones.

Caveats / limitations – BC texture compression

Time to reveal the reason why I hadn't followed up on this research direction earlier – I imagine that to many of you it might have been obvious throughout the whole post. This approach is generally going to lose a lot of quality when compressed with GPU block compression formats.

Remember how I described that the simplest BC formats do some local PCA and then keep only a single principal component? Now, if we purposefully decorrelated our color channels and there is no correlation anymore, such a fit and projection is going to be pretty bad! 😦

Now, it's not that horrible – after all, we have found a global decorrelation for the whole texture. Luckily, this doesn't mean that there won't be a local (per-block) one.

BC formats do this analysis per block (otherwise every image would end up as some form of "colored grayscale"), and there still might be potential for local correlations and for not losing all the information with block compression. This is how the color channels we'd need to compress look:

Crops of the colors of analyzed texture transformed to the decorrelated SVD space. Despite global decorrelation, we still observe blocks and areas of constant color or just two colors.

So generally, locally we see mostly flat regions, or combinations of two colors.

This is not an exhaustive test, proof, or analysis – just a single observation – but maybe it won't be as bad as I thought when I abandoned this idea for a while. It remains a future research direction! If you have any thoughts, experiments, or would like to brainstorm, let me know in the comments or contact me. 🙂

Summary

In this post we went through quite a few different concepts:

  • Idea of “decorrelating” color channels to maximize information contained in them and remove the redundancy,
  • An alternative way to think about images or image patches for linear algebra applications – as 2D matrices, not 3D tensors,
  • A very brief introduction to the use of SVD for computing "optimal" decorrelated channel decompositions (SVD for PCA-like dimensionality reduction),
  • Relationship between the least-squares line fitting and SVD/PCA (they are not the same!),
  • Demonstration of SVD decomposition on a 2-channel, “toy” problem,
  • Relationship of such decomposition to Block Compression formats used on GPUs,
  • Reducing the resolution/quantization/fully dropping “less significant” decorrelated color channels,
  • Comparison of efficacy of SVD as compared to YUV color space conversion that is commonly used for image transmission and compression,
  • Potential application of dimensionality reduction for compressing the full PBR texture sets,
  • And finally, caveats / limitations that come with it.

I hope you found this post at least somewhat interesting / inspiring, and that it encourages some further experiments with using multi-channel texture / material cross-channel correlations.

While the scheme I explored here has some limitations for practical GPU applications (compatibility with BC formats), I hope I was able to convince you that we might be leaving some compression/storage benefits unexplored, and it’s worth investigating image channel relationships for dimensionality reduction and potentially some smarter compression schemes.

As a final remark I'll add that the applicability of this technique depends highly on the texture set itself and how quickly the singular values decay. I have definitely seen more correlated texture sets than this example, but it's also possible that on some inputs the correlation and the technique's efficacy might be worse – it all depends on your data. But you don't have to manually decide whether to use it or how many channels to keep – by analyzing the singular values as well as the PSNR itself, it can be easily deduced from the data and the decision automated.

The post comes with three colabs: Part 1 – toy problem, Part 2 – Kodak images, Part 3 – PBR materials which hopefully will encourage verification / reproduction / exploration.

Thanks for reading!


“Optimizing” blue noise dithering – backpropagation through Fourier transform and sorting

Introduction

This is the second blog post in an (unanticipated) series on interesting uses of the JAX numpy autodifferentiation library, as well as an extra post in my very old series on dithering in games and blue noise specifically.

Teaser: we will "optimize" a dithering matrix directly in the Fourier domain, backpropagating through the Fourier transform and through sorting for gradient descent, to obtain specific perceptual, blue-noise dithering properties. The image shows a comparison between white and blue noise dithering.

I am going to describe how we can use auto-differentiation and loss function optimization to perform a precomputation of a dithering pattern directly with a desired Fourier/frequency spectrum and the desired range of covered values by backpropagating through those two (differentiable!) transforms.

While I won't do any detailed analysis of the pattern properties and don't claim any "superiority" or make comparisons to the state of the art (hey, this is a blog, not a paper! 🙂 ), it seems pretty good, highly tweakable, and works pretty fast for tileable sizes – and to my knowledge, this idea of generating such patterns directly from the spectrum together with other (spatial, histogram) loss components is quite fresh.

If you are new to “optimization”, gradient descent, and back-propagation, I recommend at least my previous post just as a starting point, and then if you have some time, reading proper literature on the subject.

This blog post comes with a "scratchpad quality" colab notebook that you can find and play with here. It is quite ugly and unoptimized code, but it is as simple as possible and thus a good starting point for some other explorations.

Motivation

Motivation for this idea came from one of the main themes that I believe in and pursue in my personal spare time explorations / research:

Offline optimization is heavily underutilized by the computer graphics community and incorporating optimization of textures, patterns, and filters can advance the field much more than just “end-to-end” deep learning alone.

This was my motivation for my past blog post on optimizing separable filters, and this is what guides my interests towards ideas like accelerating content creation or compression using learning based methods.

Given a problem, we might want to simplify it and extract some information from data that is not known in a simple, closed form – this is what machine learning is about. However, if we run the "expensive" learning, optimization and precomputation offline, the runtime can be as fast as before – or even faster if we simplify the problem – and we can still use our domain knowledge to get the best out of it. Some people call it differentiable programming, but I prefer "shallow learning".

Playing with different ideas for differentiating and optimizing the Fourier spectrum for a different problem, I realized – hey, this looks directly applicable to the good old blue noise!

Blue noise for dithering – recap

Let's get back on topic! It's been a while since I looked at blue noise dithering, and when writing that mini-series I felt like it had exhausted the topic to the point of diminishing returns (FWIW I don't think that blue noise is a silver bullet for many problems; it has some cool properties, but its strength lies in dithering and only extremely low sample counts for sampling).

Before I describe benefits of the blue noise for dithering, have a look at an example dithered gradient from my past blog post:

A linear gradient dithered with some white vs blue noise. Can you guess which one is which? 🙂

Generally – using the informal definition that is most common in graphics – blue noise is a type of noise that, in a general sense, has no low frequency content. This spectral property is desirable for dithering for the following main reasons:

  • Low frequencies present in the noise cause low frequencies in the error, which are very easily "caught" by the human visual system,
  • With a blue noise dithering pattern, we are less likely to "miss" high frequency details in a given range of values. Imagine if a low frequency caused a "rise" of the pattern in a local spot while the signal there had high frequency detail in a different value range – it would all quantize to a similar value.
  • Finally, with further downstream processing and high resolution displays, high frequency error is much easier to filter out by either image processing or simply our imperfect eyes, with eyesight or processing reducing the variance of the apparent error.
  • Unlike some predefined patterns like the Bayer matrix, blue noise has no visible "artificial" structure, making it both more natural looking, as well as less prone to amplifying some artificial frequencies.

All those properties make it a very desirable type of noise / pattern for simple look-up based dithering. (Note for completeness: if you can afford to do error diffusion and optimize the whole image based on the error distribution, it will yield way better results.)

“Optimizing” Fourier spectrum

One of the blue noise generation methods that I initially tried for 1D data was "highpass filter and renormalize" by Timothy Lottes. The method is based on a simple, smart idea – highpass filtering followed by renormalizing values for a proper range/histogram – and is super efficient to the point of applicability in real-time, but has limited quality potential, as the renormalization "destroys" the desired spectral properties. Iterating this method doesn't give us much – every highpass operation done using a finite size convolution has only local support, while the renormalization is global. This imbalance was, in my explorations, leading to getting stuck in some low quality equilibrium, and after just a few iterations, to a ping-pong between two exactly identical patterns.

If only we knew some global support transform that is able to look directly at spectral content… If only such transform was “tileable”, so the spectra were “circular” and periodic… Hey, this is just plain old Fourier transform!

In many cases, the Fourier transform assuming periodic / circular signals is a nuisance that needs to be addressed, and similarly its global support (a single coefficient affects the whole image) – but here both of those properties work to our advantage! We want our spectral properties to hold on toroidal – tiled – domains, and we want to optimize the pattern globally.

Furthermore, the Fourier transform is trivially differentiable (every frequency bucket is a sum of complex exponentials over the input pixels) and we can directly write a loss function that operates on the frequency spectrum (though even a single "pixel" of the spectrum will impact every single pixel of the input image). With such gradient based optimization, at every step we can just slightly "nudge" our solution towards the desired outcome and maybe we won't get stuck immediately in a low quality equilibrium.

The Fourier transform is differentiable and we can backpropagate any loss function defined on the frequency spectrum back to the original spatial domain. An interesting side-effect of the global spatial support is that modifying even a single frequency bucket has an effect on all of the input samples. Similarly, every input sample contributes to all frequency buckets.

Similarly, if we sort pixels by value (to get an ordering that corresponds to a histogram), we can just push them "higher" and "lower" and back-propagate through the sorting.

Backpropagating a loss function defined on sorted values (~a CDF of a histogram) is trivial, as it requires only "reindexing" values (remembering which input value index corresponded to a given sorted bucket). In this case the loss and gradient are more "local" – a single value maps back to a single value – but if we optimize for the whole histogram, all values will have some non-zero gradient.

Writing the gradient of both of those manually wouldn't be very hard… But in JAX it is already done for us – both operations are implemented as part of the library and we can JIT the resulting gradient evaluation – so there is nothing left for us to do but design a loss and optimize it!
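(A tiny sketch of what this buys us – both jnp.fft.fft2 and jnp.sort have gradients defined in JAX, so any scalar loss touching either of them can be differentiated and JIT-compiled; the toy loss below is deliberately meaningless and only demonstrates the plumbing:)

    import jax
    import jax.numpy as jnp

    def toy_loss(pattern):
        spectrum = jnp.abs(jnp.fft.fft2(pattern))   # differentiable FFT magnitude
        sorted_vals = jnp.sort(pattern.ravel())     # differentiable ("reindexing") sort
        return jnp.sum(spectrum) + jnp.sum(sorted_vals ** 2)

    grad_fn = jax.jit(jax.grad(toy_loss))           # gradient w.r.t. every pattern pixel
    pattern = jax.random.uniform(jax.random.PRNGKey(0), (64, 64))
    grads = grad_fn(pattern)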

Fourier domain optimization – whole science fields and prior work

I have to mention here that optimizing frequency spectra, or backpropagating through them, is actually very common – it is used all the time in medical imaging, computational imaging, and radio / astronomical imaging, and there are types of microscopy and optics based on it. Those are super fascinating fields, it probably takes a lifetime to study them properly, and I am a complete layman when it comes to them – but before I conclude the section, I'll link a talk by Katie Bouman on the recent Event Horizon Telescope, which also does frequency domain processing (and in a way uses the whole Earth as such a Fourier telescope – no big deal! 🙂 ).

Setting up loss function and optimization tricks

As I mentioned in my previous post, designing a good loss function is the main part of the problem when dealing with optimization-based techniques… (along with things like convergence, proofs of optimality, conditioning, and better optimization schemes… but even many "serious" papers don't really care about those!).

So what would be the loss component of our blue noise function? I came up with the following:

Components of our dithering blue noise loss function (a rough code sketch follows the list):
1. Histogram uniformity. We want some desired histogram shape. For simplicity I used a uniform one, but a triangular or Gaussian one might be a better choice for smooth gradients!
2. Frequency content focused on high frequencies, with absence of the low ones.
3. General frequency spectrum uniformity – we don't want "spikes" similar to e.g. a Bayer pattern.
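(Here is a rough JAX sketch of how those three components could be expressed – the exact formulation in my colab differs, and the cutoff radius and masks below are my own choices:)

    def dither_loss(pattern, low_freq_cutoff=8.0):
        n = pattern.shape[0]
        spectrum = jnp.abs(jnp.fft.fft2(pattern))

        # 1. Histogram: sorted values should match a uniform ramp (a uniform "CDF").
        target = jnp.linspace(0.0, 1.0, n * n)
        histogram_loss = jnp.mean((jnp.sort(pattern.ravel()) - target) ** 2)

        # Wrapped frequency distance from DC, valid on the periodic/toroidal domain.
        f = jnp.minimum(jnp.arange(n), n - jnp.arange(n))
        dist = jnp.sqrt(f[:, None] ** 2 + f[None, :] ** 2)

        # 2. No low frequencies: penalize energy near DC (excluding DC itself).
        low_mask = (dist < low_freq_cutoff) & (dist > 0)
        lowfreq_loss = jnp.sum(jnp.where(low_mask, spectrum, 0.0) ** 2) / jnp.sum(low_mask)

        # 3. Spectrum uniformity: high frequency magnitudes should stay close to their mean.
        high_mask = dist >= low_freq_cutoff
        high_mean = jnp.sum(spectrum * high_mask) / jnp.sum(high_mask)
        uniformity_loss = jnp.sum(((spectrum - high_mean) * high_mask) ** 2) / jnp.sum(high_mask)

        return histogram_loss + lowfreq_loss + uniformity_loss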

This is just an example – again, I wanted to focus on the technique, not on the specific results or choices! If you have some alternative ideas, please let me know in the comments below 👇.

Again, code for this blog post is available and immediately reproducible here.

There are also many "meta" tuning parameters – what should our low frequency cutoff point be? How do we balance each loss component? What histogram shape do we want? The best part – you can try and adjust any of those to taste, to your liking and specific use-case. 🙂

When running the optimization on such a compound loss function, I almost immediately got stuck in some "dreaded" local minima. On the other hand, when optimizing only for individual loss sub-components, the optimization was converging nicely. Inspired by stochastic gradient descent – commonly known to generalize and converge to better minima than full-batch / whole dataset optimization – I used this observation to "stochasticize" / randomize the loss function slightly (to help "nudge" the solutions out of local minima or saddle points).

The idea is simple – scale the loss components by a random factor, different for each gradient update / iteration. Given my "complete" loss function composed of three sub-components, I scale each one by a random uniform value in the range 0–1. This seemed to work very well and helped the optimization achieve plausible looking results without getting stuck in local minima.
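(In code it could look roughly like this – assuming the three terms from the sketch above were split into separate functions, that pattern is already initialized, and with a hypothetical learning_rate; plain gradient descent shown for brevity:)

    key = jax.random.PRNGKey(42)
    learning_rate = 0.01
    for step in range(2000):
        key, subkey = jax.random.split(key)
        scales = jax.random.uniform(subkey, (3,))   # random 0-1 factor per loss component
        loss_fn = lambda p: (scales[0] * histogram_loss(p)
                             + scales[1] * lowfreq_loss(p)
                             + scales[2] * uniformity_loss(p))
        loss, grads = jax.value_and_grad(loss_fn)(pattern)
        pattern = pattern - learning_rate * grads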

As a final step, I run fine-tuning on the histogram-only part of the loss (I could as well just force-normalize, but I reused existing code), as without it the histograms were only "almost" perfect. I think in practice it wouldn't matter, but why not have a "perfect", unbiased dithering pattern if we can?

Left: Pretty good resulting optimized histogram – the result of optimizing for the whole loss; this individual component is very close to a perfect dithering histogram. Right: A "perfect" histogram after fine-tuning (or just renormalization).

Results

Without further ado, after a few tries playing with different loss components and learning rates, I got the following 64×64 dithering matrix:

And the following spectrum:

One of the components of the loss that is hardest to optimize for is spectrum uniformity. It makes sense to me intuitively – smoothness of the spectrum means that we can propagate values only between adjacent frequencies, which then affect all pixels and most likely cancel each other out (especially together with the other loss components). Most likely a smarter metric (maybe just optimizing directly for the desired value in each bucket? 🙂 ), a better optimizer, or simply running the optimization for longer than a few minutes would yield better results – but they are not too bad at all! And it's not 100% clear to me whether it is necessary to have a perfectly uniform spectrum.

Since those blue noise dithering matrices would be used (after tiling) for dithering, I tested how they look on 2-bit dithering of an RGB image from the Kodak dataset:

The difference is striking – which is not surprising at all. It's even more visible on a 1-bit quantized image (though here we can argue that both look unacceptably low quality / "retro", heh):

Still – in the blue noise dithered version you can actually read the text, and I think it looks much better! I haven't compared it with any other techniques and am not sure if it really matters in practice, especially for something as simple as dithering, which also has more expensive, better alternatives (like error diffusion).

Conclusions

I have presented an alternative approach to designing dither matrices – directly defining an energy function on the frequency spectrum and the value range, and running gradient descent to optimize for those objectives.

I haven’t really exhausted the topic and there are many interesting follow-up questions:

  • How does a triangular or Gaussian noise value PDF affect the optimization?
  • What is the desired frequency cutoff point for dithering?
  • How to balance the loss terms and how to express smoothness more elegantly? Does the smoothness matter?
  • Would better optimization schemes, e.g. momentum-based ones, yield even faster convergence?
  • What about extending it to more dimensions? Projection-slice property of Fourier transform acts against us, but maybe we can optimize frequency in 2D, and some other metric in other dimensions?

…and probably many more that I don't plan to work on for now, but I hope the post content and the immediately-reproducible Python JAX code that I included will inspire you to some further (and smarter than my brute-force!) experiments.

Post scriptum / digression – optimization vs deep learning – which one is “expensive”?

While writing this post, I contemplated something quite surprising – a digression on optimization and its relationship with deep learning, and the counter-intuitive, paradoxically opposite views of practitioners and academia. When switching from working in "the industry" on mostly production (and applied research) to working in research groups and interacting with academics, I immediately encountered some opposite views on deep neural networks.

For me, a graphics practitioner, they were cool, higher quality than some previous solutions, but impractical and too slow for most real product applications. At the same time, I have heard from at least some academics that neural networks were super efficient, albeit somewhat sloppy approximations.

What the hell, where does this disconnect come from? Then, reading more about state of the art methods prior to the DL revolution, I realized that most of them were optimizing some energy function over the whole problem – solving a complex, huge, non-linear problem with gradient descent (for example a non-linear system where every unknown is a pixel of an image!) for every single instance of the problem, from scratch.

By comparison, artificial neural networks might seem brute-force and wasteful to a practitioner (per every pixel, hundreds if not thousands/millions of huge matrix multiplies of mostly meaningless weights and "features", and a gigantic waste of memory bandwidth), but the key difference is that the expensive "loss" optimization happens offline. At runtime/evaluation it is "just" passing data through all the layers and linear algebra 101 arithmetic.

So yeah, our perspectives were very different. 🙂 But I am also sure that if we care about e.g. power consumption, energy efficiency and “operations per pixel”, we can take the idea of optimizing offline much further.


Bilinear texture filtering – artifacts, alternatives, and frequency domain analysis

In this post we will look at one of the staples of real-time computer graphics – bilinear texture filtering. To catch your interest, I will start by focusing on something that is often referred to as "bilinear artifacts" – the trapezoid/star-shaped artifacts of bilinear interpolation – and what causes them. I will discuss briefly some common bilinear filtering alternatives and how they fix those, link a few of my favorite papers on (fast) image interpolation, and analyze the frequency response of common cheap filters.

Screenshot from my favorite video game of all time – Deus Ex (2000). See this diamond shape of the pupil? This is what I will refer to as “bilinear artifacts”.

As usual, I will work on the topic in "layers", starting with the basics and an "introduction to graphics" level perspective – analyzing the "spatial" side of the artifact. Then I will (non-exhaustively) cover the alternatives – bicubic and biquadratic filters – and finally analyze those from a signal processing / EE / Fourier spectrum perspective. I will also comment on the relationship between filtering as used in the context of upsampling / interpolation, and shifting images. (Note: I am going to completely ignore downsampling, decimation, and generally trilinear filtering / mip-mapping here.)

Update: Based on a request, I added a link to a colab with plots of the frequency responses here.

What is bilinear filtering and why graphics use it so much?

I assume that most of the audience of my blog post are graphics practitioners or enthusiasts and most of us use bilinear filtering every day, but a quick recap never hurts. Bilinear filtering is a texture (or more generally, signal) interpolation filter that is separable – it is a linear filter applied on the x axis of the image (along the width), and then a second filter applied along the y axis (along the height).

Note on notation: Throughout this post I will use linear/bilinear almost interchangeably due to this property of bilinear filtering being a linear filter applied in the two directions sequentially.

Starting with a single dimension, the linear filter is often called a "tent" or "triangle" filter. In the context of input texture interpolation, we can look at it as either a "gather" ("when computing an output pixel, which samples contribute to it?") or a "scatter" operation ("for each input sample, what are its contributions to the final result?"). Most GPU programmers naturally think of "gather" operations, but here the "triangle" part is important, so let's start with the scatter.

For every input pixel, its contributions to the output signal form a "tent" – a "triangle" of pixel contributions centered at sample N, with height equal to the sample/pixel N value, and a base spanning from sample (N-1) to (N+1). To get the filtering result we sum all contributing "tents" (there are either 1 or 2 covering each output position). All of the triangles have the same base length, and each of them overlaps with a single other triangle on either side of its top vertex. Here is a diagram that hopefully will help:

(Bi)linear filtering is called triangle or tent filtering as contributions of the input samples to the filtered results are “triangles” or “tents”.

Switching from "scatter" to the more natural (at least to people who write shaders) gather approach – between samples N and (N+1) we sum up the contributions of all "tents" covering this area, so those of N and (N+1), and at the exact position of sample N we get only its own contribution.

The formula for this segment is simply x_n * (1 – t) + x_(n+1) * t. This means that in 1D each output sample has two contributing input pixels (though sometimes one contribution is zero!), and we linearly interpolate between them. If we repeat this twice, for x and y, we end up with four input samples and four weights: (1 – t_x) * (1 – t_y), t_x * (1 – t_y), (1 – t_x) * t_y, t_x * t_y. Since the weights are just multiplications of independent x and y weights, we can verify that this filter is separable (if you are still unsure what that means in practice, check out my previous blog post on separable filters).
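(A tiny numpy illustration of the gather form – my own helper names, not code from the post:)

    def linear_interp_1d(samples, x):
        # x is a continuous coordinate in sample-index units.
        i = int(np.floor(x))
        t = x - i
        return samples[i] * (1.0 - t) + samples[i + 1] * t

    def bilinear_weights(t_x, t_y):
        # Weights for the four samples (i, j), (i+1, j), (i, j+1), (i+1, j+1).
        return ((1 - t_x) * (1 - t_y), t_x * (1 - t_y),
                (1 - t_x) * t_y, t_x * t_y)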

Bilinear filtering is ubiquitous in graphics and there are a few reasons for it – but the main ones are that it is super cheap, and furthermore hardware accelerated, and that it provides a significant quality improvement over nearest-neighbor filtering.

All modern GPUs can do bilinear filtering of an input texture in a single hardware instruction. This involves both memory access (together with potentially uncompressing block compressed textures into local cache), as well as the interpolation arithmetic. Sometimes (depending on the architecture and input texture bit depth) the instruction might have slightly higher latency, but unless someone is micro-optimizing or has a very specific workload, the cost of the bilinear filtering can be considered “free” on contemporary hardware.

At the same time, it provides significant quality improvement over “nearest neighbor” filtering (just replicating each sample N times) for low resolution textures.

Nearest neighbor filtering (red) produces discontinuities, while bilinear sampling (blue) produces continuous values between samples.
Bilinear vs nearest neighbor filtering.

Being supported natively in hardware is a huge deal and the (bilinear) texture filtering was one of the main initial features of graphics accelerators (when they were still mostly separate cards, in addition to the actual GPUs).

Personal part of the story – I still remember the day when I saw a few “demo” games on 3dfx Voodoo accelerator on my colleague’s machine (it was a Diamond Monster3D) back in primary school. Those buttery smooth textures and no pixels jumping around made me ask my father over and over again to get us one (at that time we had a single computer at home, and it was my father’s primary work tool) – a dream that came true a bit later, with Voodoo 2.

Side remark – pixels vs filtering, interpolation, reconstruction

One thing worth mentioning is that I deliberately don't draw the input – or even the output – pixels as "little squares". Pixels are not little squares!

The classic tech memo from Alvy Ray Smith elaborates on it, but if you want to go deeper into understanding filtering, I find it extremely important not to make such a "mistake" (I blame "nearest neighbor" interpolation for it).

So what are those values stored in your texture maps and framebuffers? They are infinitely small samples taken at a given location. Between them, there is no signal at all – you can think of your image as "impulses". Such a signal is not very useful on its own; it needs a reconstruction filter that reconstructs a continuous, non-discrete representation. Such a reconstruction filter can be your monitor – and in this case pixels could actually be squares (though different LCD monitors have different pixel patterns), or the blurred dot of a diffracted ray in a CRT monitor. (This is something that contemporary pixel art gets wrong and that Timothy Lottes used to have interesting blog posts about – I think they might have gotten deleted, but his shadertoy showing a more correct emulation of a CRT monitor is still there.)

For me it is also useful to think of texture filtering / interpolation in a similar way – we are reconstructing a continuous (as in, not discrete/sampled) signal and then resampling it, taking measurements / values from this continuous representation at new pixel positions. This idea was one of the key components of our handheld mobile super-resolution work at Google, allowing us to resample reconstructed signals at any desired magnification level.

Example from past published work that I contributed to, showing how a continuous reconstruction of a sparse signal (from many locally, randomly shifted input frames) can be sampled at different destination grid densities, achieving higher zoom factors with less aliasing and more detail.

Bilinear artifacts – spatial perspective – pyramid, and mach bands

Ok, bilinear filtering seems to be “pretty good” and for sure is cheap! So what are those bilinear artifacts?

Again a zoom-in into a screenshot from Deus Ex intro – notice the star-like pupil.

In the post intro I showed an example artifact from a 20 year old video game, back when asset textures were very low resolution, but those artifacts happen anytime we interpolate from low resolution buffers. In my Siggraph 2014 talk about volumetric atmospheric effects rendering, I showed an example of problematic artifacts when upsampling a low resolution 3D fog texture:

Volumetric fog bilinear artifacts – notice stair/saw-like look.

In my old talk I proposed to solve this problem with an additional temporal low-pass effect (temporal jitter + temporal supersampling), because the alternative (bicubic / B-spline sampling – more about it later) was too expensive to apply in real time. Nota bene, such a solution is actually pretty good at approximating blurring / filtering for free if you already have a TAA-like framework.

If we get back to our "tent" filter scatter interpretation, it should become easier to see what causes this effect. Separable filtering first applies a 1D triangle filter in one direction, then in the other direction, multiplying the weights. Those two 1D ramps multiplied together result in a pyramid – this means that every input pixel will "splat" a small pyramid, and if the ratio of the output to the input resolution is large and the texture has high contrast, those pyramids will become very apparent. Similarly, on any edge not aligned with a perfect 45 degrees, those pyramids will create a jaggy, aliased appearance.

Horizontal tent filter followed by a vertical tent filter results in a “pyramid”.

Ok, it's supposed to be a pyramid, but what's the deal with these bright, star-like lines?

The second "spatial" reason why bilinear filtering is not great is the so-called Mach bands effect.

This phenomenon is one of the fascinating "artifacts" (or conversely – desired features / capabilities) of the human visual system. Human vision cannot be thought of as a "sensor" like in a digital camera or "film" in an analog one – it is more like a complicated video processing system, with all sorts of edge detectors, contrast boosting, motion detection, embedded localized tonemapping, etc. Mach bands are caused by the tendency of the HVS to "sharpen" the image and emphasize any discontinuity.

Perfectly linear gradient (for example generated by linear interpolation). Do you also see this apparent bright white line on the right side of the gradient?

(Bi)linear interpolation is continuous, however its derivative is not (the derivative of a piecewise linear function is a piecewise constant function). In the mathematical definition of smoothness and C-continuity, we say that it is C0 continuous, but not C1 continuous. I find it fascinating that the HVS can detect such a discontinuity – detecting features like lines and corners is like differentiation, and indeed the most common and basic feature detector in computer vision, the Harris Corner Detector, analyzes local gradient fields.

To get rid of this effect, we need filtering and interpolation that has a higher order of continuity. Before I move on to some other filters, I also have to mention a very smart "hack" from Inigo Quilez – by hacking the interpolation coordinates, you can get interpolation weights that are C1 continuous.

Screenshot from: https://www.shadertoy.com/view/XsfGDn, author Inigo Quilez. Using hacked UVs together with hardware bilinear interpolation to get smoother filtering. Be sure to check Inigo’s excellent article.

It's a really cool hack and works very well for procedural art (where manually computed gradients are used to produce, for example, normal maps), but it produces a "rounded rectangles" type of look and aliases in motion, so let's investigate some other fixes.
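(A rough 1D re-implementation of the idea – smoothstep the fractional part before the linear interpolation; see Inigo's article and shadertoy for the real, texture-based shader version:)

    def smooth_linear_interp_1d(samples, x):
        # Same two-tap linear interpolation, but with a C1-continuous "smoothstepped" weight.
        i = int(np.floor(x))
        t = x - i
        t = t * t * (3.0 - 2.0 * t)   # hacked interpolation coordinate
        return samples[i] * (1.0 - t) + samples[i + 1] * t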

Bilinear alternatives – bicubic / biquadratic

The most common alternative considered in graphics is bicubic interpolation. The name itself is in my opinion confusing, as cubic filtering can mean any filtering where we use 4 samples and the filter weights result from evaluating 3rd order polynomials (similarly to linear filtering using 2 samples and a 1st order polynomial). In 2D we have 4×4 = 16 samples, but we again interpolate separably. However, there are many alternative 3rd order polynomial weights that could be used for filtering… This is why I avoid using just the term "bicubic". Just look at the Photoshop image resizing options – does this make any sense!?

Bicubic, or bicubic sharper?  ¯\_(ツ)_/¯

The most common bicubic filter used in graphics is one that is also called B-Spline bicubic (after reading this post, see this classic paper from Don Mitchell and Arun Netravali that analyzes some other ones – I revisit this paper probably once every two months!). It looks like this:

As you can see, it is way smoother and seems to reconstruct shapes much more faithfully than bilinear! Smoothness and lack of artifacts comes from its higher order continuity – it is designed to have matching and continuous derivatives at each original sample point.

It has a few other cool and useful properties, but my personal favorite is that because all of its weights are positive, it can be optimized to use just 4 bilinear samples! This old (but not dated!) article from GPU Gems 2 (and a later blog post from Phill Djonov elaborating on the derivation) describes how to do it with just some extra ALU to compute the look-up UVs. (Similarly, in 1D this can be done with just 2 samples – which can be quite useful for low-dimensionality LUTs.)
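(For reference, the uniform cubic B-spline weights at fractional position t look like this – a small sketch; see the GPU Gems 2 article for the bilinear-tap optimization itself:)

    def bspline_cubic_weights(t):
        # Uniform cubic B-spline weights for the 4 taps at offsets -1, 0, +1, +2.
        w0 = (1 - t) ** 3 / 6.0
        w1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0
        w2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0
        w3 = t ** 3 / 6.0
        return np.array([w0, w1, w2, w3])   # all positive and summing to 1

    # Because all weights are positive, the pairs (w0, w1) and (w2, w3) can each be
    # fetched with a single bilinear tap at an adjusted coordinate.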

Its biggest disadvantage – it is visibly "soft"/blurry; it looks like the whole image got slightly blurred, which is exactly what is happening… Some other cubic interpolators are sharper, but you have to pay a price for that – negative filter weights that can cause halos / ringing, and that make it impossible to optimize so nicely.

The second alternative that caught my attention – and was actually the motivation to finally write something about image resampling – was a recent blog post from Thomas Deliot (and an efficient shadertoy implementation from Leonard Ritter) on biquadratic interpolation: continuous interpolation that reconstructs a quadratic function and considers 3 input samples (or 9 in 2D). I won't take the spotlight from them on how it works – I recommend checking the linked materials – but I included an animated comparison: it looks very good, sharper than the B-spline bicubic, and is a bit cheaper.

Post update: Won Chun pointed me to a super interesting piece of literature on quadratic filters from Neil Dodgson. The link is paywalled, but you can easily find references by googling for "Quadratic interpolation for image resampling". It derives the quadratic filter with an additional parameter that allows controlling the filter sharpness, similarly to the bicubic filters' B/C parameters. The one that I used for analysis is the blurriest, "approximating" one. Here is a shadertoy from Won Chun and a corresponding Desmos calculator that also propose a 2-sample approximation of the (bi)quadratic filter – cool stuff!

Interpolation – shifting reconstructed signal and position dependent filtering

A second perspective that I am going to discuss here is how texture filtering creates different local filters and filter weights depending on the fractional relationship between the output and input pixel positions.

Let's consider for now only 1D separable interpolation and the fractional offset range of 0.0–0.5 (-0.5 to 0.0 or 0.5 to 1.0 can be considered as switching to a different sample pair, or symmetric).

With a subpixel offset of 0.0, the linear weights will be [1.0, 0.0], and with a subpixel offset of 0.5, they will be [0.5, 0.5]. Those two extremes correspond to very different image filters! Imagine the input filtered with a convolution filter of [1.0] (this returns just the original image) vs a filter of [0.5, 0.5] – the latter corresponds to a low quality box blur of the whole image.

To demonstrate this effect, I made a shadertoy that compares bilinear, bicubic bspline, biquadratic, and a sinc filter (more about it later) with dynamically changing offsets.

Same texture bilinearly interpolated. The upper left half of the image includes a shift by 0.5 pixels, the other half does not (and thus gets no filtering). I upsampled the resulting image with NN interpolation for magnification.

So with an offset of U or V of 0.5, we are introducing very strong blurring (even more with both shifted – see above), and no blurring at all with no fractional offsets.

Knowing that our effective image filters change based on the pixel "phase", we can have a look at their spectral properties and frequency responses and see what we can learn about those resampling filters.

Bilinear artifacts – variance loss, and frequency perspective

So let’s have a look at filter weights of our 3 different types of filters as they change with pixel phase / fractional offset:

As expected, they are symmetric, and we can verify that they sum to 1. The weights are quite interesting in themselves – for example, quadratic interpolation with a fractional offset of 0.0 has the same filter weights as linear interpolation with an offset of 0.5!

Looking at the fractional-offset-dependent weights, we can analyze how they affect the signal variance, as well as the frequency response (and in effect, the frequency content of the filtered images). The variance change resulting from a linear filter applied to uncorrelated (white) input is equal to the sum of the squares of all its weights. So let's plot the effect of those three filters on the input signal variance.

Position dependent variance reduction of three different analyzed filters.
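(This is easy to verify numerically – reusing the B-spline weights helper from the earlier sketch; the quadratic filter would be handled analogously, its exact weights depend on the variant used:)

    def variance_scale(weights):
        # Output variance of white-noise input scales by the sum of squared filter weights.
        return np.sum(np.asarray(weights) ** 2)

    for t in (0.0, 1.0 / 6.0, 1.0 / 3.0, 0.5):
        print(t, variance_scale([1.0 - t, t]), variance_scale(bspline_cubic_weights(t)))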

With the bilinear filter, the total variance changes (a lot!) with the subpixel position – which will cause apparent contrast loss. Variance and contrast loss due to blending/filtering is an interesting statistical effect, very visible perceptually, and one of my favorite papers on procedural texture synthesis, from Eric Heitz and Fabrice Neyret, achieves very good results from blending random tilings of an input texture mostly "just" by designing a variance preserving blending operator!

Now, moving on to the full frequency response of those filters – for this we can use, for example, the Z-transform commonly used in signal processing (or go directly with complex phasors). This goes beyond the scope of this blog post (if you are interested in the details, let me know in the comments!), but I will present the resulting frequency responses of those 3 different filters:

Frequency responses of linear, quadratic, and bspline at fractional offsets of 0.0, 1/6, 1/3, 1/2. Each plot corresponds to a different fractional coordinate offset.
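(For the curious – such curves can be reproduced numerically by evaluating the discrete-time Fourier transform of the phase-dependent weights; a rough sketch, with names of my choosing:)

    def frequency_response(weights, num_freqs=256):
        # |H(w)| = |sum_k weights[k] * exp(-i * w * k)| for w in [0, pi] (up to Nyquist).
        w = np.linspace(0.0, np.pi, num_freqs)
        k = np.arange(len(weights))
        return np.abs(np.exp(-1j * np.outer(w, k)) @ np.asarray(weights))

    # frequency_response([1.0, 0.0]) is flat (identity), while
    # frequency_response([0.5, 0.5]) is the classic half-pixel-shift lowpass.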

Looking at those plots we can observe very different behaviors. As we realized above, linear interpolation at no fractional offset produces no filtering effect at all (identity response), while both other filters are lowpass filters. Interestingly, no offset is where quadratic interpolation performs the most lowpass filtering – the opposite of both other filters! To be able to reason about the filters in more isolation, I also plotted each filter's frequency response at different phases on a single plot per filter:

Frequency responses of linear, quadratic, and bspline at fractional offsets of 0.0, 1/6, 1/3, 1/2. Each plot corresponds to a different filter.

Looking at these plots, we can see that the B-spline cubic is the most "consistent" across the different fractional offsets, while quadratic is less consistent but filters out somewhat less of the high frequencies (which is part of its sharper look). Linear is the least consistent, going between a very sharp image and a very blurry one.

This inconsistency (both in terms of total variance and frequency response – in fact the first can be derived from the latter via Parseval’s theorem) is another look at what we perceive as bilinear artifacts – “pulsing” and inconsistent filtering of the image. Looking at a bilinearly interpolated image shifted with phase 0.0 and with phase 0.5 is like looking at two very different images. While the bicubic bspline over-blurs the input, it is consistent, doesn’t “pulse” with changing offsets, and doesn’t create visible aliased patterns. I see the biquadratic filter as a kind of “middle ground” that looks very reasonable, and I will definitely use it in practice.

In the original eye screenshot texture we can see filtering / variance reduction in action. Pixels close to the interpolated texel origins are sharp (green), some get blurred horizontally (purple), some vertically (yellow), and some both (red). Such a non-uniform effect of variable blurring and variable contrast is yet another take on the bilinear artifacts.

Given how bicubic and similar filters have quite consistent lowpass filtering, some of the blurring effect can be compensated for with an additional sharpening filter. This is what, for example, my friend Michal Drobot proposed in his talk on TAA for Far Cry 4 (and others have arrived at and used independently) – after resampling, do a global uniform sharpening pass on the whole image to counter some of the blurriness. Given how the blurriness is relatively constant, the sharpening filter can even be designed to recover the missing input frequencies accurately!
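
The exact sharpening filter used after resampling varies from implementation to implementation; as a toy illustration (my own sketch using an unsharp mask, not the filter from the talk), such a pass could look like:

from scipy.ndimage import gaussian_filter

def sharpen(image, strength=0.5, sigma=1.0):
  # Unsharp mask: add back a fraction of the high frequencies lost to blurring.
  blurred = gaussian_filter(image, sigma=sigma)
  return image + strength * (image - blurred)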

Bilinear / bicubic alternatives – (windowed) sinc

It’s worth noting that resampling filters don’t necessarily have to blur out the images so heavily, and their frequency response can be much more high-frequency preserving. This is beyond the scope of this blog post, but some bicubic filters are actually designed to be slightly sharpening and to boost some of the high frequencies (the Catmull-Rom spline – often used in TAA for continuous resampling).

Such built-in sharpening might not be desired, so many filters are designed to be as frequency preserving as possible (windowed sinc, like Lanczos). As I mentioned, this is beyond the scope of this blog post, but just for fun, I include an example frequency response of a truncated (non-windowed!) sinc, as well as a final gif comparison. Such a sinc is also featured in my shadertoy.
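
For reference, the Lanczos family is just a sinc multiplied by a wider sinc window; here is a small sketch of evaluating such kernel weights (np.sinc is the normalized sinc, sin(πx)/(πx); names are mine):

import numpy as np

def lanczos_weight(x, a=2):
  # Windowed sinc: sinc(x) * sinc(x / a) inside the support |x| < a, zero outside.
  x = np.asarray(x, dtype=np.float64)
  return np.where(np.abs(x) < a, np.sinc(x) * np.sinc(x / a), 0.0)

# Example: weights of a 4-tap Lanczos-2 filter for a fractional offset of 0.3.
offsets = np.array([-1.0, 0.0, 1.0, 2.0]) - 0.3
weights = lanczos_weight(offsets)
weights /= np.sum(weights)  # Normalize so the weights sum to 1.
print(weights)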

Unwindowed sinc frequency response at different subpixel offsets – note how there is no dominant lowpass filtering (with the exception of the last part, close to Nyquist), but we get lots of frequency response ringing and we will observe some oversharpening. Do not use unwindowed sinc in practice!
All filters compared. Note that the truncated sinc might be the sharpest, but it looks unacceptable (ugly!) with over-filtering and significantly stronger image contrast. If you’re looking for sharpness, a better choice might be some tweaked bicubic filter, or a windowed sinc – Lanczos.

For a much more complete and comprehensive treatment of filter trade-offs (sharpness, ringing, oversharpening), I again highly recommend this classic paper from Mitchell and Netravali – be sure to check it out.

Conclusions

This post touched on lots of topics and the connections between them (I hope it can inspire you to go deeper into them) – I started with analyzing the most common, simple, hardware-accelerated image interpolation filter – bilinear. I discussed its limitations and why it is often replaced (especially when interpolating from very low resolution textures) by other filters like bicubic – which might seem counter-intuitive given that they are much blurrier. I discussed what causes the “bilinear artifacts”, both from the simple spatial filtering perspective, as well as from the perspective of variance loss / contrast reduction / lowpass filtering. I barely scratched the surface of the topic of interpolation, but I had fun writing this post (and creating the visualizations), so expect some follow-up posts about it in the future!


Using JAX, numpy, and optimization techniques to improve separable image filters

By the end of this blog post, we will improve on the results of my previous blog post – approximating a circular bokeh with a sum of 4 separable filters – reducing its artifacts (left) through an offline optimization process to get the results on the right.

In today’s blog post I will look at two topics: how to use JAX (a “hyped” new Python ML / autodifferentiation library), and a basic application that is a follow-up to my previous blog post on using SVD for low-rank approximations and separable image filters – we will look at “optimizing” the filters to improve the filtered images.

My original motivation was to play a bit with JAX and see if it will be useful for me, and I immediately had this very simple use-case in mind. The blog post is not intended to be a comprehensive guide / tutorial to JAX, nor a complete optimization primer (one can spend decades studying and researching this topic), but a fun example application – hopefully immediately useful, and inspiring for other graphics or image processing folks.

The post is written as separate chapters/sections (problem statement, some basic theory and challenges, practical application, and the results) – feel free to skip ones that seem obvious to you. 🙂

This post comes with a colab that you can run and modify yourself. The code is not very clean, mostly scratchpad quality, but I decided that it’s still good for the others if I share it, no matter how poorly it reflects on my coding habits.

Recap of the problem – separable filters

In the last blog post, we looked at analyzing whether convolutional image filters can be made separable, and if not, at finding separable approximations (as a sum of N separable filters). For this we used Singular Value Decomposition, building a low-rank approximation by taking the first N singular values.
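
As a quick reminder, that low-rank approximation boils down to a few lines of numpy – a minimal sketch (the actual code in the previous post may differ in details):

import numpy as np

def low_rank_approximation(kernel_2d, rank):
  # SVD decomposes the 2D kernel into a sum of rank-1 (separable) terms.
  u, s, vt = np.linalg.svd(kernel_2d)
  # Keep only the first `rank` singular values / vectors.
  return sum(s[i] * np.outer(u[:, i], vt[i, :]) for i in range(rank))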

We found that to approximate a 50×50 circular filter (which can be thought of as a circular bokeh), one needs ~13 separable passes over the whole image, and that even 8 singular components (which have extremely low numerical error) can produce distracting, unpleasant visual artifacts – negative values cause “ringing”, and there are some “leftover” corner errors.

Unpleasant, unacceptable visual artifacts in an image filtered with a rank 4 approximation of the circular kernel.

I have suggested that this can be solved with optimization, and in this post I will describe most likely the simplest method to do so!

Side note: After my blog post, I had a chat with a colleague of mine, Soufiane Khiat – we used to work together at Ubisoft Montreal – and given his mathematical background (much better than mine), he had some cool insights on this problem. One of his suggestions was to use a singular value thresholding algorithm and some proximal methods, and that is probably the way to go for this specific problem – but it is a bit against the “educational” goal of my blog posts (and staying as general as possible). I still recommend reading about this approach if you want to go much deeper into the topic – and again, thanks Soufiane for a great discussion.

What is optimization?

Optimization is a loaded term, and quite likely I am going to use it in a way you didn’t expect! Most graphics folks understand it as “low-level optimization” (or algorithmic optimization), aka optimizing computational cost – memory, CPU/GPU usage, total time spent on computations – and this is how I used to think of the term “optimization” for many years.

But this is not the kind of optimization that a mathematician, or most academics think of and not the topic of my blog post!

The optimization that I will use is the kind in which you have some function that depends on a set of parameters, and your goal is to find the set of parameters that achieves a certain goal, usually the minimum (sometimes the maximum, but the two are equivalent in most cases) of this function. This definition is very general – in theory it even covers computational performance optimizations (we are looking for a set of computer program instructions that optimizes performance while not diverging from the desired output).

Optimization of arbitrary functions is generally an NP-hard problem (there are no solutions other than exploring every possible value, which is impossible in the case of continuous functions), but under some constraints, like convexity, or when only looking for local minima, it can be made feasible, relatively robust, and fast – and it is the basis of modern convolutional neural networks and of algorithms like gradient descent.

This post will skip explaining the gradient descent. If it’s a new or vague concept for you, I don’t recommend just trusting me 🙂 – so if you would like to get a good basic, but intuitive and visual understanding, be sure to check this fantastic video from Grant Sanderson on the topic.

What is JAX?

JAX is a new ML library that supports auto-differentiation and just-in-time compilation targeting CPUs, GPUs, and TPUs.

JAX got very popular recently (might be a sampling bias, but seemed to me like half of ML twitter was talking about it), and I am usually very skeptical about such new hype bandwagons. But I was also not very happy with my previous options, so I tried it, and I think it is popular for a few good reasons:

  • It is extremely lean, in the most basic form it is a “replacement” import for numpy.
  • Auto-differentiation “just works” (without any gradient tapes, graphs etc.) over most of numpy and Python constructs like loops, conditionals etc.
  • There are higher level constructs developed on top of it – like NNs and other ML infrastructure, but you can still use the low-level constructs.
  • There is no more weird array reshaping and expressing everything over batches, or constantly thinking about shapes and dimensions. If you want to process a batch, just use functions that map your single-example functions over batches (see the small sketch after this list).
  • It is actively developed both by open-source, as well as some Google and DeepMind developers, and available right at your fingers with Colab – zero installation needed.
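
To illustrate the point about batches from the list above – a function written for a single example can be lifted to operate over a whole batch with jax.vmap (a tiny, self-contained sketch with made-up function names):

import jax
import jax.numpy as jnp

# A function written for a single example (one vector)...
def normalize(v):
  return v / jnp.linalg.norm(v)

# ...lifted to operate over a whole batch - no manual reshaping or loops.
batched_normalize = jax.vmap(normalize)

batch = jnp.ones((8, 3))                # a batch of 8 three-element vectors
print(batched_normalize(batch).shape)   # (8, 3)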

I have played with it for just a few days, but definitely can recommend it and it will become my go-to auto-diff / optimization library for Python.

Optimization 101 – defining a loss function

This might seem obvious, but before we can start optimizing an objective, we have to define it in some way that is understandable for the computer and optimizable.

What is non-obvious is that coming up with a decent objective function is the biggest challenge of machine learning, IMO a much bigger problem than any particular choice of a technique. (Note: it also has much wider implications than our toy problems; ill-defined objectives can lead to catastrophes in wider social context, e.g. optimizing for engagement time leads to clickbaits, low quality filler content, and fake news).

For our problem – optimizing filters, we know three things that we want to optimize for:

  • Maintain the low rank of the filter,
  • Keep the separable filter similar to the original one,
  • Avoid the visual artifacts.

The first one is simple – we don’t have much wiggle room there. We have a hard limit of a maximum of N separable passes and a fixed number of coefficients. But on the other hand, the two other goals are not well defined.

The “similarity” that we are looking for can be anything – average squared error, average absolute error, maximum error, some perceptual/visual similarity… anything else, or even any combination of the above. Mathematically, we use the term distance, or metric. Convenient and well-researched are metrics based on p-norms, and mathematicians like to operate on the squared Euclidean distance (L2 norm), so average squared errors. Average squared error is used so often because it usually has simple, closed-form solutions (like linear least squares, linear regression, or PCA), but in many cases it might not be the right loss. Defining perceptual similarity is the most difficult and is an open problem in computer vision, with the recent universal approach being to use the similarity of features extracted by neural networks trained for image recognition.

The third one “avoid the visual artifacts” is even more difficult to define, as we are talking about artifacts that are present in the final image, and not about numerical error in the approximated filter. Deciding on components of the loss function to avoid visual artifacts and tuning them is often the most time consuming part of optimization and machine learning.

Left: original filter, right: rank 4 approximation. Notice the negative values (blue color) and some leftover “garbage” values in the corners. Those will contribute to visual artifacts.

Looking at artifacts in the filtered images I think that the two most objectionable ones are: negative values causing ringing, and some “garbage” pixels in the corners of the filter.

Putting it together – target loss function for optimizing separable image filters

Given all of the above, I decided on minimizing a loss function that sums three terms:

  • Squared mean error term,
  • Heavily penalizing any negative elements in the approximated filter,
  • Additional penalty when a zero element of the original filter becomes non-zero after the approximation.

This is a very common approach – summing together multiple different terms with different meanings and finding the parameters that optimize such a sum. The open challenge is tuning the weights of the different components of the loss function; a further section will show the impact of it.

This loss function might not be the best one, but this is what makes such problems fun – often designing a better loss (one corresponding more closely to our intentions) can lead to significantly better results without changing the algorithm at all! This was one of my huge surprises – so many great papers just propose a simple optimization framework together with an improved loss function to advance the state of the art significantly.

Also, if you think that it is quite “ad hoc” – yes, it is! Most academic papers (and productionized code…) have such ad hoc loss functions, where each component might have a good motivation, but they end up as a hodge-podge that doesn’t make much sense when put together, yet empirically works (often verified by ablation studies).

In numpy, an example implementation of the loss function might look like:

import numpy as np

# L2_WEIGHT, NON_NEGATIVE_WEIGHT, KEEP_ZEROS_WEIGHT are tunable term weights.
def loss(target, evaluated_filter):
  # Mean squared difference from the original (target) filter.
  l2_term = L2_WEIGHT * np.mean(np.square(evaluated_filter - target))
  # Penalize negative values (ringing) and values that should have stayed zero.
  non_negative_term = NON_NEGATIVE_WEIGHT * np.mean(-np.minimum(evaluated_filter, 0.0))
  keep_zeros_term = KEEP_ZEROS_WEIGHT * np.mean((target == 0.0) * np.abs(evaluated_filter))
  return l2_term + non_negative_term + keep_zeros_term

Now that we have a loss function to optimize, we need to find the parameters that minimize it. This is where things can become difficult. If we want a rank 4 approximation of a 50×50 filter, we end up with 4*(50*2) == 400 parameters. If we wanted to do a brute force search in this 400-dimensional space, it could take a very long time! Let’s say we wanted to evaluate just 10 different potential values per parameter – this would take 10^400 loss function evaluations – and this is quite a toy problem!

Luckily, we are in a situation where our initial guess obtained through SVD is already kind of ok, and we just want to improve it. This is a classic assumption of “local optimization” and can be achieved by greedily minimizing the error, for example with coordinate descent, or even better, gradient descent – going in the direction where the error decreases the most. In a general scenario we don’t get any theoretical guarantees of finding the true, best solution, but we are guaranteed to get a solution that is better, or at worst the same, when compared to the initial one (according to our loss function).

Gradient descent in JAX

How do we compute the gradient of our loss function? For the function that I wrote it is relatively simple and follows calculus 101, but any time we changed it, we would need to re-derive our gradient… Wouldn’t it be great if we could compute it automatically?

This is where various auto-differentiation libraries can help us. Given some function, we can compute its gradient / derivative with regard to some variables completely automatically! This can be achieved either symbolically, or in some cases even numerically, if a closed-form gradient is impossible to compute.

In C++ you can use Ceres (I highly recommend it; its templates can be non-obvious at first, but once you understand the basics, it’s really powerful and fast), in Python one of many frameworks, from smaller ones like Autograd to huge ones like Tensorflow or PyTorch. Compared to tf and pt, I wanted something lower level, simpler, and more convenient to use (setting up a graph or gradient tapes is not great 😦 ) – and JAX fills my requirements perfectly.

JAX can be a drop-in replacement for a combo of pure Python and numpy, keeping most of the functions exactly the same! In colab, you can import it either instead of numpy, or in addition to numpy. Here is code that computes our separable filter from a list of separable vector pairs, along with the loss function.
(note: I will paste some code here, but I highly encourage you to open the colab that accompanies this blog post and explore it yourself.)

import numpy as np
import jax.numpy as jnp

# We just sum the outer tensor products.
# vs is a list of tuples - pairs of separable horizontal and vertical filters.
def model(vs):
  # FILTER_SIZE and REF_SHAPE (the reference 2D filter we approximate)
  # come from the colab setup.
  dst = jnp.zeros((FILTER_SIZE, FILTER_SIZE))
  for separable_pass in vs:
    dst += jnp.outer(separable_pass[0], separable_pass[1])
  return dst

# Our loss function.
def loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight):
  approx = model(vs)
  l2_term = l2_weight * jnp.mean(jnp.square(approx - REF_SHAPE))
  non_negative_term = non_negativity_weight * jnp.mean(-jnp.minimum(approx, 0.0))
  keep_zeros_term = keep_zeros_weight * jnp.mean((REF_SHAPE == 0.0) * jnp.abs(approx))
  return l2_term + non_negative_term + keep_zeros_term

Note how we mix simple Python loops and logic in the model function with numpy-like code (I could have imported jax.numpy simply as np, and you can do that if you want, but to me such library shadowing feels a bit confusing).

Ok, but so far there is nothing interesting about it; I just kind of rewrote some numpy code as jax.numpy, what’s the big deal?

Now, the “magic” is that you can compute the gradient of this function just by writing jax.grad(loss)! By default the gradient is taken with respect to the first function parameter (you can change that if you want). JAX has some limitations and gotchas on what it can compute the gradient of and can require workarounds, but most of them feel quite “natural” (e.g. that the PRNG requires explicit state) and I haven’t hit them in practice with my toy example.

Our whole gradient descent step function would look like:

import jax

def update_parameters_step(vs, learning_rate, l2_weight, non_negativity_weight, keep_zeros_weight):
  # jax.grad differentiates loss with respect to its first argument (vs).
  grad_loss = jax.grad(loss)
  # grads has the same structure as vs: a list of (horizontal, vertical) pairs.
  grads = grad_loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight)
  return [(param[0] - learning_rate * grad[0], param[1] - learning_rate * grad[1]) for param, grad in zip(vs, grads)]

I was mind-blown by how simple it is – no gradient tapes, no graph definitions required. This is how auto-differentiation should look from the user’s perspective. 🙂 What is even cooler is that the resulting function code can be JIT-compiled to native code for orders of magnitude speed-up! You can place a decorator @jax.jit above your function, or manually create an optimized function with jax.jit(function). This makes it really fast and allows you to pick and choose what you jit (you can even unroll a few optimization iterations if you want).
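
As a quick sketch of the second option (the fast_update_step name is mine), wrapping the per-step function from above could look like this:

# JIT-compile the update step: the first call traces and compiles the function,
# and subsequent calls with same-shaped inputs reuse the compiled version.
fast_update_step = jax.jit(update_parameters_step)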

Tuning loss function term weights

Time to run an optimization loop on our loss function. I picked a learning_rate of 0.1 and 5000 steps. Those are ad-hoc choices – the step count is definitely overkill – but it just worked, and the optimization is pretty fast even on the CPU, so I didn’t bother tweaking them. The whole optimization looks like this:

# Our whole optimization loop.
def optimize_loop(vs, l2_weight, non_negativity_weight, keep_zeros_weight, print_loss):
  NUM_STEPS = 5000
  for n in range(NUM_STEPS):
    vs = update_parameters_step(vs, learning_rate=0.1, l2_weight=l2_weight, non_negativity_weight=non_negativity_weight, keep_zeros_weight=keep_zeros_weight)
    if print_loss and n % 1000 == 0:
      # Pass the same term weights when printing the current loss value.
      print(loss(vs, l2_weight, non_negativity_weight, keep_zeros_weight))
  return vs

Finally, we get to the problem of testing different loss function term weights. Luckily, because our problem is small and the jit’d optimization runs very fast, we can actually test it with a few different parameters.

One thing to note is that while we have 3 terms, if we multiply all 3 of their weights by the same value, we end up with a proportionally scaled loss function that still has the same minimizer. So we effectively have just 2 degrees of freedom and can keep one of the weights “frozen” – I will set the weight of the L2 mean loss to 1.0 and operate on the weights for non-negativity and keeping zeros.

Without further ado, here are our rank 4 separable circular filters after the optimization:

Left to right: keep_zeros_weight of 0.0, 0.3, 0.8.
Top to bottom: non_negativity_weight of 0.0, 0.5, 1.5.

We can notice how those two loss terms affect the results and interact with each other. Non-negativity will very effectively reduce the negative ringing, but won’t address the artifacts in corners of the filter.

The extra penalty of the zero terms becoming non-zero nicely cleans up the corners, but there is still mild ringing around the radius of the filter.

Both of them together reduce all artifacts, but the final shape starts to deviate from our circle, becoming more like a sum of four separate box-filters (which is exactly what is happening)! So it is a trade-off of accuracy, filter shape, and the artifacts. There is no free lunch! 🙂

Results when applied to a real image

Let’s look at the same filters when applied to the same image we have looked at in my previous blog post. Note: I have slightly boosted the gamma function from 7.0 to 8.0 as compared to my previous post to emphasize the visual error.

Left to right: keep_zeros_weight of 0.0, 0.3, 0.8.
Top to bottom: non_negativity_weight of 0.0, 0.5, 1.5.
I highly encourage opening this image in a new tab and zooming in – otherwise the artifacts and differences are hard to notice.

When you zoom in, the difference in artifacts and overall appearance becomes quite obvious. I personally like the center picture the most, along with the one below it. The column on the right minimizes the artifacts the most, especially the bottom-right example, but to me it looks too “blocky”.

Can you design a better loss function and parameters for this problem? I am sure you can, feel free to play with the colab! 🙂

Some random ideas that could be fun to evaluate to get a better understanding of how loss functions affect the results: penalizing zeros that become non-zero more strongly the further away they are from the center, using an L1 loss instead of the L2 loss, playing with the maximum allowed error, or optimizing for computational performance by trying to create a filter with explicitly zero weights that could be skipped in a shader.
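
For example, the first idea could be sketched as a distance-weighted variant of the keep-zeros term (hypothetical and untuned, reusing REF_SHAPE and FILTER_SIZE from the snippets above – just to show how little code such experiments take):

# Distance of each filter element from the kernel center; penalize "garbage"
# values more strongly the further from the center they appear.
coords = jnp.arange(FILTER_SIZE) - (FILTER_SIZE - 1) / 2.0
radius = jnp.sqrt(coords[jnp.newaxis, :] ** 2 + coords[:, jnp.newaxis] ** 2)

def distance_weighted_keep_zeros_term(approx, weight):
  return weight * jnp.mean(radius * (REF_SHAPE == 0.0) * jnp.abs(approx))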

When is such simple optimization through gradient descent going to work?

While I mentioned that optimization is a whole field of mathematics and computer science, beyond the scope of any single blog post, in my posts I usually try to give readers some “intuition” about the topics I write about, along with practical tips and tricks, so I will mention a few gotchas regarding optimization.

Gradient descent is such a successful and ubiquitous technique (have you heard about this whole field of machine learning through artificial neural networks?!) because of a few interesting properties:

First of all, in the case of multivariate functions, the problem doesn’t need to be strictly convex! There might exist some path in the parameter space that gradient descent can use to “walk in the valleys and around the hills”.

If we are “lucky”, descent can optimize even a non-convex function!

The second, related observation is that the more dimensions we have, the easier it is to find some “good” solution (which might not be the global minimum, but an almost-as-good local minimum). Over-completeness and over-parametrization are among the things that make very deep networks train well. We are not looking for a single, unique, perfect solution, but for one of many good ones, and the more parameters and dimensions there are, the more (combinatorially explosive) combinations of parameters can be ok. Think about the example I visualized above – in 1D we could never “walk around” a locally bad solution; the more dimensions we add, the more paths there are that can lead to improving the results.

This is one of the areas researched in machine learning under the name “lottery ticket hypothesis”. I personally found it super non-intuitive at first and was totally mind-blown to see how adding fully linear layers to a network can make a huge difference in whether it converges to good results.

What helps gradient descent a lot is a good initialization. This not only increases the chances of convergence, but also the convergence speed. For our filters, initializing them with the SVD-based low-rank approximation was very helpful – feel free to check whether you can get the same results with a fully randomized initialization. The worst possible initialization would be something totally uniform, with the gradient being the same for all variables (or even worse, a zero gradient). An example: if optimizing a function that is the product of a and b, don’t initialize them both to zero – think about how the gradients would look and what would happen in such a case. 🙂
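
You can check this directly with jax.grad – for f(a, b) = a * b the gradient with respect to each variable is the value of the other one, so initializing both to zero leaves gradient descent stuck at (0, 0):

import jax

product = lambda a, b: a * b
grad_a = jax.grad(product, argnums=0)  # d(a*b)/da == b
grad_b = jax.grad(product, argnums=1)  # d(a*b)/db == a
print(grad_a(0.0, 0.0), grad_b(0.0, 0.0))  # 0.0 0.0 - no direction to move in.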

A second piece of practical advice is to keep the learning rate as low as your patience allows. You avoid the risk of “jumping over” your optimal solution.

Visualization on a 1D toy example of how a wrong learning / optimization rate can prevent correct convergence.
Red: too large learning / optimization rate can jump over the global maximum!
Green: Lower learning rate is “safer” for local optimization.

Too high a learning rate can make your gradients “explode” and jump completely out of the reasonable parameter space. Probably everyone who has ever played with gradient descent has seen the loss or parameters becoming NaNs or infinity at some point and then debugged why. There can be many causes, from a wrong loss function, through badly behaving gradients (derivatives of certain functions can get huge, e.g. sqrt around zero), but very often it is too large a learning rate. 🙂 If your optimization is too slow, instead of increasing the optimization rate, try momentum-based or higher-order optimization – both are actually trivial to code with JAX, especially for such small use-cases!
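
For example, a classic “heavy ball” momentum update needs only a few extra lines on top of what we already have (a sketch with my own names, using jax.tree_util.tree_map to handle our list-of-pairs parameter structure; velocity starts as zeros with the same structure as the parameters):

import jax

def momentum_step(params, velocity, grads, learning_rate=0.1, momentum=0.9):
  # Keep a decayed running sum of past gradients and step along it
  # instead of the raw gradient.
  velocity = jax.tree_util.tree_map(lambda v, g: momentum * v + g, velocity, grads)
  params = jax.tree_util.tree_map(lambda p, v: p - learning_rate * v, params, velocity)
  return params, velocity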

Conclusions

In this blog post, we have looked at incorporating some simple offline optimization using numpy and JAX into our programming toolset. We discussed the difficulties of designing a “good” loss function and then tuning it, and applied it to our problem of producing good separable image filters. If you do any kind of high performance work, don’t be deterred from machine learning – you don’t have to do it in real time and on the device; you can use it offline. For an example of optimizing/computing offline approximations that can make something feasible in real time, check out one of the papers that finally got out (which I contributed to a while ago) – real time stylization through optimizing discrete filters.

My personal final conclusion is that I want more of such simple, yet expressive and powerful libraries like JAX. The less boilerplate the better! 🙂
