PhD Summary

As I begin the run up to the work that will make up my final thesis I wanted to be able to put down the theory we are following and how we plan on evaluating it.

We propose that, with Moore’s Law coming to an end and the “Free Lunch” no longer available as multi-processor systems run into the limits described by Amdahl’s Law, developers will have to look beyond standard scalability optimisation and accurate program transforms to get the best performance from the software that they write.

As a result they will need to use appropriate approximation to write code that will perform only the computations needed to reach the desired result. As computing currently stands we frequently compute answers to the highest possible precision, even when that isn’t necessary. This is the standard behaviour of most shared libraries (as we have spoken about on this blog before) and leads to a significant waste of energy worldwide.

At a small scale this kind of optimisation is already taking place for specialised use-cases - such as limited hardware or the games industry. However, scaling the solution to be part of the standard software engineering pipeline is more difficult. Currently, hand-crafting solutions to these problems can take many hours of an engineer’s time, and validating the solution, especially when a code base is changing, is very challenging.

To be able to automate this process we need to be able to change the whole hardware-agnostic software development pipeline.

We want programmers to be able to specify a function with an output type but also a desired accuracy and, for each input to the function, a valid input range. These annotations can then be evaluated by the compiler and used to automatically produce safe transforms which will improve performance.
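As a rough illustration of the kind of markup described above, the sketch below attaches an accuracy target and valid input ranges to a function. The decorator name `approx_spec` and the attributes it sets are hypothetical, and the range check runs at call time in Python rather than inside a compiler. It is only meant to show what information the programmer would supply.

```python
import math
from functools import wraps


def approx_spec(max_abs_error, **input_ranges):
    """Attach a (hypothetical) accuracy contract to a function.

    max_abs_error: the largest absolute error the caller will accept.
    input_ranges:  one (low, high) pair per named argument.
    """
    def decorate(fn):
        @wraps(fn)
        def checked(**kwargs):
            # Keyword-only calls keep the sketch short.
            for name, (low, high) in input_ranges.items():
                if not low <= kwargs[name] <= high:
                    raise ValueError(
                        f"{name}={kwargs[name]} is outside [{low}, {high}]"
                    )
            return fn(**kwargs)

        # A compiler could read these to pick a cheaper implementation.
        checked.max_abs_error = max_abs_error
        checked.input_ranges = input_ranges
        return checked
    return decorate


@approx_spec(max_abs_error=1e-3, x=(0.0, math.pi / 2))
def sine(x):
    return math.sin(x)


print(sine(x=0.5), sine.max_abs_error, sine.input_ranges)
```

In a real pipeline the contract would be consumed at compile time, so an implementation that is only accurate to 1e-3 on [0, pi/2] could be substituted without any run-time check.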

With this system a library could be provided semi-compiled and when a function from that library is used it can be marked up appropriately so that it can be correctly compiled into existing programs optimally and avoid wasted computing.

This does mean that linking to existing dynamic libraries would produce a boundary through which approximations could not be assumed and as such would perform worse than full or partial source compilations.

With this setup it would be possible to compile approximately generically and allow specific optimisations for specified hardware. This would allow for us to tackle global data centre energy waste by reducing the overall power needed to compute programs.


The Problem With Abstractions and Libraries

In 1952, two years after his involvement in the invention of the first stored program computer, David Wheeler proposed the idea of the subroutine. A simple idea which is now ubiquitous throughout computing.

David Wheeler, first person ever to earn a PhD in Computer Science. The 50s were unprepared for his collars.


The introduction of subroutines was proposed in ACM 52 as “a self-contained part of a programme, which is capable of being used in different programmes. It is an entity of its own within a programme”. In his paper he defines functions as well as the concept of libraries of functions.

These concepts define the fundamental pattern of modern software engineering. It has become commonplace to emphasise the phrase “Don’t reinvent the wheel” to new programmers to teach that it is bad practice to write code for anything that already exists and works unless they have a very good justification. It is much more common for people to use code from a popular or tested library for any problem which is not highly specific to the task at hand.

In his paper Wheeler outlines the standard specification of a library: well documented, highly specified and appropriately accessible. Although he refers to problems related to punch cards and paper tape, the concepts he introduced have remained surprisingly unchanged. He even went as far as to point out that documentation for the library may be the hardest part - a highly prescient statement for the state of libraries sixty years later.

Before we get ahead of ourselves: this is not an article in praise of David Wheeler’s subroutines, quite the opposite. This contribution has been the basis of so many of our software engineering design principles, and undoubtedly we would not have come so far in computer software without such a pivotal idea being proposed. Nevertheless, this is an article to outline the problems of this exact approach and how, in the age of widespread computing, the simple principle of library-based subroutines may be more akin to a dark pattern.

Subroutines were proposed for use on the Electronic Delay Storage Automatic Calculator (EDSAC).


This drastic attack on the major contribution of a famous computer scientist (renowned hero of my department no less!) can thankfully be explained and defended with his own words: “All problems in computer science can be solved by another level of indirection”. In this quote he is expressing the common idea in computer science that adding abstractions to computing problems often makes them easier to solve, understand or integrate into existing solutions.

The concept of abstraction can be demonstrated by looking at programming languages which are themselves an abstraction built between the programmer and the lowest level of computer instructions. Without modern programming languages it would be practically impossible to maintain and test the large code-bases and complexity of systems we now require as part of managing the modern world.

What is unsaid in the quote is that abstraction is not free. Each level of indirection comes at a cost. We could expand on the quote by saying “Each level of indirection required to solve the problem easily, increases the price of solving the problem above its true cost”.

We can return to our programming language example to show how this becomes true. In the old days of Wheeler working on EDSAC the instructions would have been written specifically for the machine that he was working on. His library would have consisted of subroutines written with his exact hardware and use-cases in mind. When we add the indirection of a programming language, we have added a barrier between the hardware and the programmer. We can no longer guarantee that the instructions which emerge from the compilation process will be the ones the programmer expects. This is definitely the case when we consider modern design processes which target multiple hardware configurations from a single programming language abstraction. We have in effect removed the guarantee of optimal code output in exchange for the ability to solve the problem of complex programming or multiple hardware targets.

Once again Wheeler can defend me in my attack on our use of subroutines. In his paper on subroutines he gives two examples of what would make good subroutines but appends the suggestions with a question about how they should be implemented. He raises the problem of: Given a library which contains abstractions of problems we want to solve, how should those abstractions be implemented? No one solution will result in a library that would be optimal for all use-cases and so we would incur overhead cost for some uses.

It is my belief that the current approach of implementing very rigid libraries has resulted in high costs of abstraction. Let’s take one of the example functions in Wheeler’s paper, sine(x). It is a reasonably complex function on which many papers have been written with the aim of optimising the implementation to give minimal error. Most programming languages with a standard mathematical library will contain an implementation of sine. But which implementation will it contain? Will it be as accurate as possible at the given precision? Is that what is required by your program? Is the cost of reaching that accuracy linear? Would you get a result that is only slightly less accurate for much less computation? These are all valid questions for someone who wants to write a program that runs optimally. You could say they are financially important (or environmentally important, depending on how you see the world) questions if this is a program that is going to take up hours of processing time over multiple datacentres.
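To make the accuracy-versus-cost question concrete, here is a minimal sketch that approximates sine with a truncated Taylor series and reports the worst error on [0, pi/2] for different numbers of terms. The helper names `taylor_sin` and `max_error` are made up for this example, and a Taylor series is only one of many possible approximations, not what any particular maths library uses.

```python
import math


def taylor_sin(x, terms):
    """Sum the first `terms` terms of the Taylor series of sin(x) around 0."""
    total, term = 0.0, x
    for n in range(terms):
        total += term
        # Next term: multiply by -x^2 / ((2n + 2)(2n + 3)).
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total


def max_error(terms, samples=1000):
    """Largest absolute error against math.sin on [0, pi/2]."""
    xs = (i * (math.pi / 2) / samples for i in range(samples + 1))
    return max(abs(taylor_sin(x, terms) - math.sin(x)) for x in xs)


# Each extra term costs a few multiplications and buys extra accuracy.
for terms in (2, 3, 4, 5):
    print(terms, max_error(terms))
```

Whether the extra terms are worth paying for depends entirely on the accuracy the calling program actually needs, which is exactly the information a fixed library implementation never sees.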

This is where subroutines as proposed by Wheeler and adopted by the whole tech world have failed us. There will be one implementation and very little choice in configuring the result without “rolling your own” and that is considered bad practice and dangerous (think of the bugs!). So, we have a generation of programmers highly reliant on prewritten solutions which are not ideal for their use-case, often with very little information to determine if they are actually what they want or measure the impact they have on the final result. They are paying a huge cost for abstraction and it is likely they don’t realise it as this is how we have taught them to program.

While this may have only been a minor inconvenience in the past, this problem may be turning into a disaster much faster than anyone might realise. We have a group of problems which are making this a major dilemma: the end of the free lunch (Moore and Amdahl), cloud computing increasing global access to large scale computing, huge increase in hardware abstractions and increased pressure for highly performant software!

My research is into optimisation for large scale computing where I am working on approaches to library development and function generation to try and increase the performance of applications without sacrificing the accuracy of the output. The majority of this work is through the removal of unneeded abstractions and correct measurement of what is truly needed for a problem. Without this work and the work of others in related fields, computing will hit a performance wall when we can no longer make chips any smaller or more parallel. As a result, to get the best performance out of the hardware we have, I believe we will need a paradigm shift into new ways of thinking about abstractions, and to see Wheeler’s definitions as the basis for more complex but efficient library systems rather than using them as the unscalable blueprint they are.

Ruminations on software complexity

This article is a collection of some of my thoughts on the complexity of modern software design and the effect it has on the performance and power efficiency of computing in our lives.

I have a growing concern about what I am thinking of as the `soft upper-bound of computability`. By this, I am referring to a boundary in the ability to effectively compute ever more complex problems. As software developers we have all been able to see the effects of Moore’s Law as systems moved to be ever more parallel, first with multiple cores and then with ever growing data centers. Then over time we have been hearing even more mention of Amdahl’s and Gustafson’s laws as the benefit of each extra processing thread is reduced at scale.

What we are really saying when worrying about the effect of these ‘laws’ is that we do not have enough computation to be able to perform the task we want at the scale we want or in the time we need.

My concern is that we are developing software inefficiently and with needless complexity. This may be due to tight deadlines and a need for generic computing. In the end this could lead to the software community hitting a wall of performance and then being forced to go back and rewrite entire systems we rely on for them to be able to perform as well as they originally should have.

As a community we know that global power consumption for computing devices is not insignificant. If we are writing inefficient code we are using power we didn’t need to and on large distributed systems that has major consequences.

Java, Javascript and Python

High-level, dynamically typed, interpreted - these words should strike fear into the complexity conscious programmer. I think it is safe to say that an interpreted language is unlikely to reach the performance of a well compiled language. Or to put it another way, an interpreted language is going to waste power through inefficient execution more than a well compiled program.

What do I mean by inefficient execution? It means that the program will perform tasks that are either not necessary for the successful execution of the given task, or will perform the given task in a way which requires more steps than an optimal solution would choose. For most programs of a reasonable complexity the ‘optimal’ solution may be in practical terms unsolvable, but we can measure more and less optimal solutions.

Java, Javascript, Python and similar languages all have nice usability features which make writing programs simpler, cheaper and easier to distribute to the myriad of different hardware targets that exist today - and nearly all of those features incur these costs. Then, because they are so nice to use and run everywhere, they are used EVERYWHERE. Even modern bank cards have a version of the JVM on them!

Don’t misunderstand my concerns here - these languages are wonderful tools for prototyping and the ease of use brings many users who would have difficulty with more complex languages. The problem is that these are being used for world-wide distributions such as the Android operating system!

Java has the ability to Just-In-Time compile programs, but that only allows compilation with a very small scope and is limited by many factors of how the program is used. When you think about it though, the program itself is not going to change a lot and doesn’t need to use this feature for the task. Things could be precompiled in a language that is more efficient, but they aren’t, because Android was originally written in Java for historical reasons and is now locked in, with support for other binaries being an afterthought.

We can also consider the problem of Javascript: a scripting language that, despite the name, is not based on Java, is supported on all major browsers and is the main way of writing “applications” that run in a browser. Javascript is known for its poor performance, security problems and generally strange behavior. So why was it used? Javascript is simple. Web developers historically aren’t programmers and as a result simpler interfaces have been used for content generation - HTML, CSS and Javascript. As the web became more dynamic we have continued to use these inefficient tools at greater and greater scale. So now we have websites being accessed billions of times, all running code locally on user devices in inefficient ways. This is all made worse by the growth of ‘web apps’ which seek to implement fully featured programs in a browser - I couldn’t tell you if this is better or worse than Flash.

A consequence of this has been websites which drain mobile batteries - sometimes inadvertently because of advertisements - on modern devices which have specifications much more impressive than common desktop computers a decade ago but which are incapable of smoothly running software at the same performance.

Shown here: Two devices which probably have the same performance when browsing an Excel spreadsheet.


As a group of engineers, we really need to measure the effectiveness of the solutions we are using against the scale of the task we are doing. If this same approach was used in building construction the end user would be able to see the result. In computing these problems stay hidden and are just assumed to be because of the hardware or other problems, when they can be improved with a thorough look at the pipeline of tools, languages and output of our software production process.

Packages, Libraries and Other Junk Drawers

Have you ever had a drawer at your desk where you put useful things? Some of those would be pulled out and put away frequently, others only now and then, and some things would go into that drawer and never be removed until you moved desk. Let me introduce you to Python packages.

When we write programs we are told to not “reinvent the wheel” - if a solution exists already and it is proven to be good we should use it. The problem is that in Python, and many other languages which have a shared repository of libraries, when we try and fetch the wheel we also often get the horse and cart too.

If you want to use only a small subset of features from a library you must take the entire package. This increases the complexity of the program. It can now potentially do more. Bad for security, bad for compilation, bad for helpful IDEs.

You have just downloaded a large package to use one function or one type. But due to the nature of some languages which are interpreted/run-time compiled/JIT compiled or just in general compiled at the user side of things, you have just performed the same wasteful and dangerous task for every user, run, or installation of your program (depending on how it is used).

Did your program request version v1.0.1 of the package, but the user only has v1.0.2 installed? Better download hundreds of megabytes of software so you can access that important function you were told not to implement.

Why don’t you just extract that single function or class out of the package and ship that embedded in your own code? Is that allowed in the license? Will that somehow change the licensing of the code you are shipping with? Best not walk into those risky waters - make the user download 300 MB of useless code instead.

This sounds hyperbolic, but some small packages in Python have very large downloads most of which are simply dependencies. The used:unused ratio of code is frankly crazy and leads to waste.

We can get around this with compiling libraries into our programs so that only the used code is shipped and other nice things (which happen to have good performance benefits) but then it is all complicated by dynamically linked programs which are unfortunately essential for some use-cases.

Packages also make for complex documentation of code. If you have a large project it is going to happen that some library is used partially for a problem and then due to the library not providing all the functionality that is required for a problem a subset of similar functions will be written for more specific behavior. Now, a new person on the project sees some functions being called and has to dig into a massive project to see if the function is from the library or your own code base. Increasing reader complexity.

An interesting side note about programming and language: English is a writer-responsible language. This means that it is the responsibility of the writer to provide a valid argument and all evidence in a structured and easy to follow way and connect all the dots. In English a good argumentative paragraph will be formed like

‘A plus B gives C. We know this because A is … and … is b. Therefore A plus B must give C.’

This is a property of language that doesn’t exist in some other languages. Unfortunately it appears coding is one of the ones without this property and as a result if everything is not strictly defined it can be very difficult for a person reading code for the first time to know for certain that A, B and C have any relation and if they do - what that relation is or where it is defined.

Exponential Code Complexity By Line

Let’s propose a new simple low-level coding language. It will have a small number of instructions, a single fixed type, 6 fixed register variables and will be used to solve simple maths problems.

Our instructions:

  • ADD

  • MUL

  • STORE

Our type: Float64
Our registers: INPUT0, INPUT1, INPUT2, x, y, z
Example program:

f(x,y,z) = y * x + z

1: STORE INPUT0 , x
2: STORE INPUT1 , y
3: STORE INPUT2 , z
4: MUL x , y, i
5: ADD i , z, i

This results in a simple language that can do simple things.
Now, I want to know how likely it is that the example program above is correct. One way we can look at this is to consider each line in the program a decision.

The first line could be any one of our 3 instructions with any combination of possible inputs for them. This gives ADD and MUL 214 possible combinations of register inputs each and STORE 30 possibilities (36 - 6 for the six assignments of a register to itself, which would be invalid), for 458 choices in total.

If we were asked to pick the first line at random we would have a 1/458 or 0.2% chance of guessing the correct first line. However, the first three lines are order independent so the first line guess odds go up to 3/458.

So what are the odds of generating the whole sequence or a valid version of this sequence?

Lines 1-3 = (3/458 * 2/458 * 1/458). Each line after is 1/458

Line | Chance | Running Chance
1-3 | 6 * (1/458^3) | 6 * (1/458^3)
4 | 1/458 | 6 * (1/458^4)
5 | 1/458 | 6 * (1/458^5)
6 | 1/458 | 6 * (1/458^6)

This means we would have to run up to 1.5 * 10^15 iterations to find even this simple function by brute force. The fact that this number increases so quickly is the reason most people consider function synthesis to be an impossible problem.
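The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch of the counting argument above: the constants are the per-instruction counts quoted in the text, and the function name `brute_force_chance` is made up for the example.

```python
from fractions import Fraction
from math import factorial

# Choices per line, using the counts quoted in the text.
ADD_CHOICES = 214
MUL_CHOICES = 214
STORE_CHOICES = 30
CHOICES_PER_LINE = ADD_CHOICES + MUL_CHOICES + STORE_CHOICES


def brute_force_chance(total_lines, unordered_lines):
    """Chance that a randomly generated program is a valid one, when the
    first `unordered_lines` lines may appear in any order."""
    valid_orderings = factorial(unordered_lines)
    return Fraction(valid_orderings, CHOICES_PER_LINE ** total_lines)


# Six lines in total, the first three of which are order independent.
chance = brute_force_chance(total_lines=6, unordered_lines=3)
expected_guesses = float(1 / chance)
print(f"{expected_guesses:.2e}")
```

Using exact fractions avoids any rounding in the intermediate powers of 458, so the only rounding happens when the final count is printed.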

However, humans perform function synthesis every day, so the complexity of the problem can’t be this high for a simple problem. We simply haven’t provided enough constraints. A human implementing this simple function would not be trying to produce the entire function in one step. They would approach the problem as a setup phase, a work phase and a return phase, or something more complicated with real world knowledge of the problem we can’t provide to the computer. If we adjust our program to match this phase setup we get:

1.1: STORE INPUT0 , x
1.2: STORE INPUT1 , y
1.3: STORE INPUT2 , z

2.1: MUL x , y, i
2.2: ADD i , z, i


At each phase we are guaranteeing that the phase before has completed correctly. This gives us an additive rather than multiplicative association between the phases: the searches are run one after another, so their costs add up instead of compounding.

Phase | Chance | Iterations needed (up to)
Phase 1 | 6 * (1/458^3) | 458^3 / 6 ≈ 1.6 * 10^7
Phase 2 | 1/(458^2) | 458^2 ≈ 2.1 * 10^5
Phase 3 | 1/458 | 458
Total | | ≈ 1.6 * 10^7

In this configuration we only need to do approximately 1.6 * 10^7 iterations to find the correct solution, nearly eight orders of magnitude fewer than before. A ridiculous improvement over our original outlook, gained by validating the steps as we go along.

If we were a human programmer writing this function and wanted to validate the behavior by debugging, we would essentially have broken the function into the stages needed to verify the behavior against some test set. This is a simplification that humans use all the time when programming.

For this article we are interested in that human behavior because, like the computer, a programmer cannot synthesize a new program from scratch in one go; it takes steps. We have shown that. So why do we see 1000-line functions in programs?

If you have had any experience with large code bases, you will know that these gargantuan functions and classes are where the bugs creep in, or where unexpected behavior begins after a change elsewhere. The programmer writing or editing that function is having to deal with (choices in the environment)^(lines in the function)† complexity.

This is where what we mentioned about packages comes back. Introducing a large package of functions effectively increases the number of choices in the environment thus drastically reducing the comprehension of larger less isolated functions. This is why we have namespaces and limit the visibility of libraries/modules being included in projects but this is often not managed well and code-bases become incomprehensible to new programmers who don’t have a full idea of the project in their head. This means that changes can’t be safely made because the odds that it was the “correct” line that has been placed is much much lower.

When programs get complex in this way the chances of them being safely changed or optimised is low. It is simply too expensive in programmer time and risk. We need to approach the problem of optimisation from the ground-up in development.

If you don’t believe me - open the Chromium project and tell me that, given a task, you would be confident making a change without inadvertently affecting another part of the program or introducing a bug.

(† - In reality, not lines but number of operations or decisions)


I am going to conclude these rants here for now and maybe continue with a few more examples of complexity problems affecting performance another time.

To wrap it all up: We are using abstractions which are resulting in inefficient code being deployed at a large scale. This inefficiency is coming not only from the languages we use, but also from how we use them and from a lack of understanding of how mistakes and sub-optimal choices are being made. If we don’t begin to produce high-quality scalable code now, we are going to hit a wall where performance increases won’t be possible at the scale we require, and that will require a tremendous amount of work to rebuild under higher pressure than we have now. And high-pressure coding doesn’t get the best results either…

16-bit Floating-point Error and Activation Function Tolerance

In the last post we showed that small error in the activation function used for a network did not negatively impact the performance or output of a neural network (using our relatively small networks).

As part of our work on systems which are error tolerant, we took the different activation functions that are commonly used, along with an approximation known as the “serpentine” curve, and ran them through our networks to see how they performed, to determine if error in the activation function or the shape of the function had much of an impact.

Below we show the series of results for each function with a 16-bit implementation and after that there is a table showing the amount of error relative to a full-precision implementation for each.

The error is calculated by taking every half-precision floating point value between 0.0 and 5.0, passing it into the target function and measuring the difference from the full 64-bit implementation, as opposed to the earlier work where we compared approximations to their source.
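The measurement described above can be sketched with NumPy. This is an illustration under assumptions rather than the harness behind the results: it relies on NumPy's own float16 `tanh`, which may round differently from a hand-written half-precision implementation, and the helper names `float16_values`, `error_stats` and `relu` are made up for the example.

```python
import numpy as np


def float16_values(lo, hi):
    """Every finite float16 value in [lo, hi], in ascending order."""
    # Enumerate all 65536 bit patterns and reinterpret them as float16.
    bits = np.arange(1 << 16, dtype=np.uint32).astype(np.uint16)
    values = bits.view(np.float16)
    values = values[np.isfinite(values)]
    return np.unique(values[(values >= lo) & (values <= hi)])


def error_stats(fn, xs):
    """Absolute error of fn on float16 inputs against a float64 reference."""
    half = fn(xs).astype(np.float64)       # evaluated on float16 inputs
    full = fn(xs.astype(np.float64))       # float64 reference
    err = np.abs(half - full)
    return {"max": err.max(), "mean": err.mean(), "median": np.median(err)}


def relu(x):
    # Keep the dtype of the input so float16 stays float16.
    return np.maximum(x, x.dtype.type(0))


xs = float16_values(0.0, 5.0)
for name, fn in (("relu", relu), ("tanh", np.tanh)):
    print(name, error_stats(fn, xs))
```

Enumerating bit patterns rather than stepping by a fixed increment guarantees that every representable half-precision input in the range is visited exactly once.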

From this we can see that dropping to 16-bit floating point doesn’t incur any penalty to the functions with only linear components (Relu, LinearX, Constant) as these functions all take an input that exists in half-precision and return the same input or a fixed output that is also a valid half-precision number without any chance of overflow or rounding.

On the other hand our ‘tanh’ and ‘serpentine’ functions both perform calculations in half-float, leading to errors which propagate to the return value. We can see by the median value that most results do not incur very much error at all but some values are very wrong.


Noticeably the approximation of the ‘Tanh’ function, ‘Serpentine’, has lower maximum and average error. To inspect this further we plotted the error for every input in the range 0.0 to 5.0 for both functions.

In both results the inherent pattern of the floating-point numbers is visible, but much more noticeable on the ‘serpentine’ plot. The exact pattern is a little worrying as it changes the error from being close enough to ‘random’ to possibly having some implication in error propagation.

In our tests the network we used appears to be tolerant enough of either error and give good enough results for both - however the sigmoid-like ‘serpentine’ function is much cheaper than the ‘tanh’ which raises the question why it isn’t being used instead.

Additionally, the error across the range is non-linear for both. If we were to use this in a network, the weights which allow for input closer to 0 will have a more diverse set of choices for values, allowing for a higher-precision fit than those with much larger activation values. This should result in a change in how the network converges on the correct answer.

More important is the propagation of error. In neural networks there is the commonly known problem of exploding and vanishing gradients (one of the reasons for the popularity of non-linear but basically linear functions like ReLU). Allowing error to propagate through an application means that the combined error could cause unwanted behaviour such as that seen with gradients, but also that the gradient you are using during back propagation may not be correct for the activation function that is being used.

This leaves us with the question of how tolerant are networks to small errors in the activation function, particularly with low precision data-types. Are there certain patterns of input or weights which in conjunction with non-linear error would slow or distort the learning process? Or will this type of error on larger networks limit the convergence onto the best possible solution?

Sometimes Activation Functions Only Vaguely Matter

This is a follow on from the last post where we discussed improving neural network performance by allowing for cheaper approximate functions to be used instead of the costly full-fat activation functions.

In Machine Learning an activation function is used to decide which neurons should be activated. It is a transform of the signal from one layer to the next. To be able to fit to complex functions an activation layer must be non-linear, otherwise the network is only capable of simple linear regression.
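The need for non-linearity can be seen in a small NumPy sketch. It assumes two bias-free dense layers with random weights (the shapes and the seed are arbitrary choices for this example) and shows that stacking them without an activation is equivalent to a single linear layer, while inserting ‘tanh’ breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first layer weights
W2 = rng.normal(size=(2, 4))   # second layer weights
x = rng.normal(size=3)         # one input sample

# Two linear layers applied in sequence...
stacked_linear = W2 @ (W1 @ x)
# ...are the same as one linear layer with the combined weight matrix.
single_linear = (W2 @ W1) @ x
# With a non-linear activation in between, no single matrix reproduces it.
stacked_tanh = W2 @ np.tanh(W1 @ x)

print(np.allclose(stacked_linear, single_linear))
print(np.allclose(stacked_tanh, single_linear))
```

However many purely linear layers are stacked, the same collapse applies, which is why the activation is what gives depth its extra expressive power.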

A list of common activation functions can be found on the Wikipedia page which also gives the properties and limitations of each.

As you can see in that table, there are a lot of them and they seem to be very strictly defined. If we were to stick with ‘tanh’ as we looked at in the last post we could see it being implemented with more or less error. We were interested in figuring out if an implementation of ‘tanh’ which allowed an enormous amount of error, but was still non-linear, would be unable to perform the same task as the standard implementation of the ‘tanh’ function.

So we came up with two alternatives. ‘ApproxTanh3’ an order 3 polynomial approximation and ‘ApproxTanh12’ an order 12 polynomial approximation. The plot of these against the standard implementation can be seen below:


As you can see, the high order approximation is a reasonable fit to the actual tanh implementation, whereas the low order approximation basically falls off a cliff around x=0.5.
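For readers who want to experiment with similar approximations, the sketch below fits order 3 and order 12 polynomials to ‘tanh’ by least squares. The coefficients of ‘ApproxTanh3’ and ‘ApproxTanh12’ are not given here, so this is not a reproduction of them: the fitting interval [-3, 3], the sample count and the helper names are all assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial


def fit_tanh_poly(order, lo=-3.0, hi=3.0, samples=2001):
    """Least-squares polynomial fit to tanh on [lo, hi] (assumed interval)."""
    xs = np.linspace(lo, hi, samples)
    return Polynomial.fit(xs, np.tanh(xs), deg=order)


def max_abs_error(poly, lo=-3.0, hi=3.0, samples=2001):
    """Worst absolute error of the polynomial against np.tanh on [lo, hi]."""
    xs = np.linspace(lo, hi, samples)
    return float(np.max(np.abs(poly(xs) - np.tanh(xs))))


approx3 = fit_tanh_poly(3)
approx12 = fit_tanh_poly(12)

print("order 3 max error: ", max_abs_error(approx3))
print("order 12 max error:", max_abs_error(approx12))
```

A polynomial fitted this way is only trustworthy inside the interval it was fitted on, so the working range of the activation inputs matters as much as the order of the polynomial.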

So, what happens when we use this in a real neural network?

For this task we have chosen the standard MNIST handwriting dataset and a pretty simple convolutional neural network.

The dataset is a very common one: a collection of 28x28 pixel images of handwritten digits and the digit each represents. The standard test error rate of a convolutional neural network on this dataset is between ~0.2% and ~1.7% for different standard implementations.

And our neural network looks like this where {CHOSEN ACTIVATION} is either one of our approximations or the real ‘tanh’ implementation:

Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu')
MaxPooling2D(pool_size=(2, 2))
Dense(128, activation={CHOSEN ACTIVATION})
Dense(num_classes, activation='softmax')

With this network with the same seed we get these results:

Function | Test error rate (%)
Keras.Tanh | 0.97
ApproxTanh3 | 0.84
ApproxTanh12 | 1.31

This was very surprising to us. We expected the terrible order 3 approximation to break the ability of the neural network to fit the problem, but its results were within the ranges we expected and beat the standard implementation’s error rate for this specific seed (other runs show some variance of +/- one percent).

In this example, our terrible activation function was able to match the standard implementation. It seems for certain problems where the inputs to the activation stay within certain ranges that the exact shape of the activation function isn’t important as long as it is vaguely correct within the working range.

This means that there is a large range of error tolerance in this system. A fact which it seems, for some networks, could be exploited to reduce learning times!

Improving the Performance of TensorFlow Activation Functions

In our research we are currently investigating existing applications which are tolerant to functions which return a value that is within a tolerance range rather than relying on absolute unit-in-last-place guarantees.

Machine-learning is a popular area of computing at the moment which just happens to conform to this paradigm. From specialised GPUs to TPUs we can see large teams approaching machine learning with variable precision stages and noise inducing layers to speed up or aid in the learning process.

While there are a lot of areas here which show levels of error tolerance, we have chosen to focus on the activation functions. In particular we are looking at the popular ‘tanh’ activation.


The function ‘tanh’ is useful due to its non-linearity and convenient output range of (-1, 1). Unfortunately, ‘tanh’ is an expensive function to compute accurately.

So, if we want to improve this we need to begin with the facts we know already:

  • Some activation layers in a neural network can be expensive

  • Machine learning algorithms often use low-precision floats (16/32-bit)

  • Floating-point cannot give absolutely correct answers for real numbers

  • Inputs to some learning algorithms are often normalised to be values between zero and one

  • Floating-point precision between zero and one is non-uniform: representable values are spaced further apart closer to one, so absolute error is higher in functions working closer to one

  • Approximations of complex functions can result in better performance

  • Machine learning is inherently error-tolerant

  • Machine learning has a major problem with run-times being too long
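The point about floating-point spacing can be seen directly. This is a minimal sketch using Python's standard `math.ulp` (Python 3.9 or newer), which reports the gap between a double and the next representable value. The sample inputs are arbitrary choices for illustration, and the same pattern holds for 16 and 32-bit floats, just with larger gaps.

```python
import math

# math.ulp(x) is the distance from x to the next representable double.
# The gap grows with the magnitude of x, so values near one are stored
# with a larger absolute spacing than values near zero.
samples = [0.001, 0.01, 0.1, 0.5, 0.9]
spacing = {x: math.ulp(x) for x in samples}

for x, gap in spacing.items():
    print(f"x = {x:<6} gap to next double = {gap:.3e}")
```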

From these facts we can assume:

  1. The current ‘tanh’ function implementation has an acceptable and variable level of error

  2. An approximation of ‘tanh’ which has better performance and acceptable levels of error will not impact on the overall outcome of the learning process and be valuable to the programmer

  3. Any implementation which does not impact the overall result negatively and improves run-time would be a positive contribution.

So, if we want to show our assumptions apply to the real-world, we need to prove each one.

1) This is the simplest assumption to prove. The TensorFlow website provides a few machine learning tutorials, one of which is on Text Classification. This example uses a sigmoid activation in its final step. Changing this to a ‘tanh’ activation has no negative impact on the learning process. The accuracy of the output model remains in the 80-90% range.

Since we are using ‘tanh’ and the output is acceptable, the error that we know it has due to its underlying floating-point type must be acceptable too, and hasn’t impacted the convergence of the learning process.

2) Next we want to take an approximation and compare the result of using it to the official implementation, to see if it has any different characteristics which might be detrimental.

To do this we must select an implementation for the replacement. We have kept to a simple polynomial for this example as the resulting error is very small on the limited input range we are using ([0,1]), although there are other, better performing, approximations with different costs associated.
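As an illustration of what such a replacement can look like, here is a minimal sketch. It uses the truncated Taylor series of tanh evaluated with Horner's rule, which is an assumption made for the example and not the polynomial used in these tests. The name `approx_tanh` is also hypothetical.

```python
import math

# Taylor coefficients of tanh(x) for x, x^3, x^5, x^7, x^9.
# Assumption for illustration only: the tests in this post use a different,
# higher-order polynomial whose coefficients are not reproduced here.
TANH_COEFFS = [1.0, -1.0 / 3.0, 2.0 / 15.0, -17.0 / 315.0, 62.0 / 2835.0]


def approx_tanh(x):
    """Evaluate the odd polynomial with Horner's rule in x*x."""
    x2 = x * x
    acc = 0.0
    for c in reversed(TANH_COEFFS):
        acc = acc * x2 + c
    return x * acc


# Measure the worst absolute error on the working range [0, 1].
points = [i / 1000.0 for i in range(1001)]
max_error = max(abs(approx_tanh(x) - math.tanh(x)) for x in points)
print(f"max |error| on [0, 1]: {max_error:.4f}")
```

The error of a truncated series like this one grows towards the end of the range, which is one reason a fitted, higher-order polynomial is likely to be the better choice in practice.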

With our replacement activation function selected, we simply replace the final activation layer to point to our function and can then run the learning process. To ensure fairness between all implementations we use a fixed seed and reset the global state of TensorFlow for each run. This will allow us to see the ways in which the implementations may diverge through the learning process.


The tests give us the above results. The first chart is the standard implementation, while the other three are order-16 polynomial implementations in 64, 32 and 16-bit floating-point. As can be seen, there is very little difference between the approximations. The approximations do diverge from the official implementation around epochs 15-20. This divergence is expected due to the accumulation of small differences between the implementations, but it has no impact on the overall accuracy of the resulting model. In all tests that were run, the approximations and the official implementation converged to around the same accuracy.

This shows that the approximations do not negatively impact the overall result of the model.

NOTE: These tests were all trained on a CPU due to the NVIDIA libraries being non-deterministic. This non-determinism prevented 1-to-1 comparison of different activation functions with the same seed.

3) Next we need to consider the performance impact. Because problems with NVIDIA’s implementation prevent 1-to-1 testing on the GPU, we are limited to working with a very small model. The activation layer is only a small part of that small model, which makes it difficult to measure accurately through the noise. To get around this we will be testing the implementations on the CPU separately (until we do further work with more time!).

Run on (8 X 3600 MHz CPU s)
CPU Caches: L1 Data 32K     (x4),  L1 Instruction 32K   (x4),
            L2 Unified 262K (x4),  L3 Unified 8388K     (x1)
Benchmark                                      Time  
Tanh Standard Implementation            |      212 ms 
--Order 16 Polynomial Implementations-- |
SSE           64bit Float               |      114 ms 
Nonvectorised 64bit Float               |      144 ms 
SSE           32bit Float               |      114 ms 
Nonvectorised 32bit Float               |      129 ms 

Different machine learning libraries will take different approaches to vectorisation. As our functions are trivially vectorised on the CPU we see huge improvements over the standard ‘tanh’ performance, but we also see large improvements even in the standard scalar implementations.

A win of 25-50% on performance is significant when running a network with hours spent training, even if only a portion of that is spent with the activation functions.
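To put numbers on this, the sketch below works out the saving for each row of the benchmark table, then applies Amdahl's Law to estimate the effect on a whole training run. The 20% share of run-time spent in the activation is a hypothetical figure chosen only for illustration; it was not measured.

```python
# Timings in milliseconds, copied from the benchmark table above.
timings_ms = {
    "Tanh Standard Implementation": 212,
    "SSE 64bit Float": 114,
    "Nonvectorised 64bit Float": 144,
    "SSE 32bit Float": 114,
    "Nonvectorised 32bit Float": 129,
}
baseline = timings_ms["Tanh Standard Implementation"]


def time_saved(ms):
    """Fraction of the baseline run-time that is removed."""
    return (baseline - ms) / baseline


def overall_speedup(fraction_in_activation, activation_speedup):
    """Amdahl's Law: only the activation part of the run gets faster."""
    remaining = (1.0 - fraction_in_activation) + fraction_in_activation / activation_speedup
    return 1.0 / remaining


for name, ms in timings_ms.items():
    print(f"{name:<30} {time_saved(ms):6.1%} less time")

# Hypothetical: assume 20% of training time is spent in the activation.
print(f"whole-run speedup: {overall_speedup(0.20, baseline / 114):.3f}x")
```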


With these three assumptions shown to be true in limited practice, we are in a good position to make stronger assertions for larger neural networks and hopefully present a method to improve the performance of your TensorFlow models without having to make any tangible sacrifice!

Installing TensorFlow for GPU

The TensorFlow website provides a brief outline of how to get TensorFlow programs running on an NVIDIA card, but due to updates on the NVIDIA site some of the instructions are a little hard to follow and it is easy to make a mistake, so I will give an updated guide here.


  • Run DDU, a tool for removing current display drivers from your computer.
    DDU removes the current NVIDIA driver from your computer and any associated files. This prevents a common error when rolling back to older NVIDIA drivers.
  • Install the CUDA Toolkit version 9.0. This toolkit comes with the correct driver version for using it. There are newer versions of the toolkit now, but version 9.0 is what is required for TensorFlow-GPU. If the link here breaks, this version of the toolkit can be found in the archives.
  • Install cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0. Like the CUDA Toolkit there are newer versions of this tool, so this one is found in the archives.

The TensorFlow website suggests setting the correct PATH variables in Windows according to the documentation - but in my experience with these specific installers that is done during the installation process.

Once everything is installed and you have restarted your computer, you can install tensorflow-gpu from pip or through Anaconda:

pip3 install --upgrade tensorflow-gpu

Then you can run the test Python scripts found here to validate the GPU support.
NOTE: There is a rather nasty bug in the example GPU scripts - when they suggest testing a maximum percentage usage of memory on the GPU, ensure that the maximum you set is less than what is currently available. Any contention for memory on the GPU causes a hard driver crash.

With that all up and running - enjoy much faster machine learning!

What can we learn from GPU Frame Captures: Stardew Valley

In my last post on frame captures we discussed how GPU frame captures can be used to analyse performance and determine how a scene is rendered for a specific game.

It also gives insight into what the game is doing right (from a performance perspective) and how it can be improved.

In the last post we looked into Europa Universalis 4 where we found an astronomical amount of draw-calls, strange scaling behaviour and sub-optimal texture packing to render a relatively simple (although quite large) world.

In this post we are looking at a retro game for comparison. Many people who haven't kept up with modern graphics development may believe that modern "retro" looking games use the same old rendering techniques that used to be mandatory on the old consoles - this isn't the case!

Stardew Valley mimics the aesthetic of old SNES games like Harvest Moon, but the process by which it draws the scene makes it more dynamic, smooth and pretty, doing many things that would have been impossible back then to make a great experience now.

Rendering Pattern

The rendering pattern is very simple for this 2D game. It follows the expected pattern of drawing the floor and background, followed by the objects in distance-sorted order. In a 2D game with this view, the objects "further away" are those higher up the screen, as they will not obscure those further down. This allows the renderer to not have to worry about depth testing.
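The sorting step can be sketched in a few lines. The sprite names and coordinates are hypothetical, and the sketch assumes screen coordinates where y = 0 is the top of the screen; it illustrates the idea rather than the game's actual code.

```python
# Hypothetical sprites: (name, y) where y is the screen row of the sprite's
# base and y = 0 is the top of the screen.
sprites = [("player", 120), ("tree", 80), ("fence", 200), ("chicken", 150)]


def draw_order(objects):
    """Back-to-front order: smaller y (higher up the screen) is drawn first."""
    return sorted(objects, key=lambda obj: obj[1])


for name, y in draw_order(sprites):
    print(f"draw {name} at y={y}")
```

Because later draws simply overwrite earlier ones, objects lower down the screen end up in front without any depth buffer.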

The full pattern for rendering as I could figure it out from a capture appears to be:


In this pipeline the only steps which really take any time are the two full-screen passes: the initial ground rendering after the screen clear, and the full-screen weather effects - in our case, snow.

The initial ground rendering is a simple tiled geometry representing the game world grid and each grid cell has UV coordinates to map to a cell in the bound ground textures. 
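As a rough sketch of that mapping, the function below turns a cell index into the UV rectangle of that cell in a grid-packed texture. The 4x4 layout and the name `cell_uv` are assumptions for illustration; the real texture layout is whatever the capture shows.

```python
def cell_uv(cell_index, columns, rows):
    """Return (u0, v0, u1, v1) for one cell of a grid-packed texture atlas."""
    col = cell_index % columns
    row = cell_index // columns
    return (col / columns, row / rows, (col + 1) / columns, (row + 1) / rows)


# Hypothetical 4x4 atlas; cell 5 sits in the second column of the second row.
print(cell_uv(5, columns=4, rows=4))  # (0.25, 0.25, 0.5, 0.5)
```

Changing `cell_index` from frame to frame is enough to change what a quad shows without touching its geometry.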

The snow rendering is similar but mapping to a different section of the texture as shown below:

The snow is then animated by updating which cell is read for each game-world quad each frame. Notice that the background of the snow texture is not black but a dark grey. This is what adds the hazy effect to the scene: the texture is not written directly to the scene as the other blocks have been, but applied as an additive effect (D3DBLENDOP_ADD), resulting in a softening of the image.
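The effect of the grey background is easy to see with a small sketch of additive blending on 8-bit colours. It assumes both the source and destination are added at full strength, which the capture does not confirm (only the blend operation itself is visible), and the pixel values are hypothetical.

```python
def additive_blend(dst, src):
    """Add two 8-bit RGB colours channel by channel, clamping at 255."""
    return tuple(min(255, d + s) for d, s in zip(dst, src))


scene_pixel = (30, 60, 90)      # hypothetical pixel already in the frame
snow_background = (40, 40, 40)  # dark grey: lifts every channel slightly
snow_black = (0, 0, 0)          # pure black would leave the scene untouched

print(additive_blend(scene_pixel, snow_background))  # (70, 100, 130)
print(additive_blend(scene_pixel, snow_black))       # (30, 60, 90)
```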

All in all it produces quite a nice, retro effect - even though this approach might be too intensive for most real "retro" games.


Another interesting feature of this game is that it lets you customise how your character looks. Instead of using those options to generate a new sprite sheet for rendering your character, the game instead keeps all the possible options in memory and builds up the character chunk by chunk based on your choice from these options, each frame.


This is a neat and simple trick that avoids having to create a new texture when a game is loaded - but limits the extension and number of choices for the player as adding any new options increases the memory usage of the game through the entire run-time, although this is probably trivial on a game of this scale.

So how do we improve on this?

For a game of this scale, with already tiny frame rendering times (<2ms), there isn't really any need to improve anything. Unless you were running an incredibly outdated machine this game will run incredibly well.

However, if we were thinking of running this game on a very limited device we would probably want to consider the techniques that were used on the actual old retro games.

For starters, the majority of the scene (with the exception of the weather effects) doesn't change frame to frame - and those bits which do are somewhat predictable. There is no reason why parts of the scene couldn't be redrawn only as things or the camera move and otherwise kept the same - this would in fact throw out a majority of the draw-calls when the camera is stationary. The drawback of this approach is that any mistakes in the implementation produce really ugly artefacts.
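A minimal sketch of that idea is a cache that tracks which tiles are "dirty" and only redraws those. The class and its names are hypothetical, and it takes the simplest route of redrawing everything when the camera moves, where an old console would scroll the existing image instead.

```python
class TileCache:
    """Redraw only tiles that changed since the last frame."""

    def __init__(self, tile_count):
        self.tile_count = tile_count
        self.dirty = set(range(tile_count))  # nothing is cached yet
        self.camera = (0, 0)

    def mark_dirty(self, tile):
        self.dirty.add(tile)

    def move_camera(self, position):
        if position != self.camera:
            self.camera = position
            self.dirty = set(range(self.tile_count))  # simplest case: redraw all

    def render(self):
        """Return how many tiles were redrawn this frame."""
        redrawn = len(self.dirty)
        self.dirty.clear()
        return redrawn


cache = TileCache(tile_count=400)
first = cache.render()       # 400: everything is drawn once
idle = cache.render()        # 0: stationary camera, nothing changed
cache.mark_dirty(17)
animated = cache.render()    # 1: only the changed tile
print(first, idle, animated)
```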

This game also falls prey to the same issues as EU4 when it comes to texture sizes (although most are trivial in this case, making the slight error not too big of a problem). There are a number of large textures in this game that are just slightly above a power-of-two texture size - resulting in some wasted memory.

Otherwise, this frame capture is incredibly nice and easy to investigate and navigate.

What can we learn from GPU Frame Captures: Europa Universalis 4

A common technique used in the games industry to analyse the state of a game in development, and to look for where improvements can be made, is to capture a frame (all the GPU work done between one frame being shown to the user and the next) and analyse the results to see the impact of the different elements on the screen and how they might be rearranged.

In doing this we can look at the relative costs of the different aspects of the processes in the scene. An example of this would be that the programmer suspects that the newly implemented bloom post-effect has had too big an impact on performance. So the programmer grabs a few frames from the game and looks at the time that is being taken to perform that effect at different places in the game, and what the cost of that is relative to the rest of the scene.

Another use for this approach is for new programmers coming onto a complex project to get a quick look at the rendering 'pipeline' that has been implemented. As in, what is rendered when, by what shader and in what order. This is quite useful for someone who only needs to make a minor change on a project.

An interesting side-effect of this is that we are able to frame-capture fully finished games and look at what the code is asking our computer to do and from that derive how the rendering system of that game works to some extent.

In this post we will be looking at Europa Universalis 4 (EU4) frame captures and constructing a flow chart of how that game is rendered. EU4 is the fourth game in the Europa series from Paradox Interactive, released in 2013.

To capture frames from the game we will be using Intel's GPA. GPA is one of the simpler and less detailed frame capturing tools available, but it is good enough for this example and will allow anyone with a GPU to follow along if they wish.

(For those interested in more complex capture tools, the most popular amongst those I know in industry is RenderDoc, but AMD and NVIDIA each have their own tools which provide specific functionality for features of their cards, in Radeon GPU Profiler and NVIDIA Nsight respectively. Microsoft also has a tool called PIX; I haven't personally used it in some time, but I have heard it works with AMD's tools now.)


The first thing we want to do is use the GPA Graphics Monitor tool to launch EU4 from its directory. This will launch the game with the GPA overlay, giving us real-time performance information and the option to capture frames. In this example I will be running on an NVIDIA 1080 GPU, so I will be targeting the max settings so that we can see where each option is placed in the pipeline. The settings and overlay can be seen in the screenshot below.


The intention was to run with max settings, but we have had to step down from 4k to 2560x1440 due to the game performance dropping to below 1FPS at the higher resolution while running with Intel GPA (and it is not good without it either...). This is something that will be covered later.

With the settings covered and GPA running we now need to take some frame captures. As this game has a relatively simple view, we will be taking a capture close to the ground, at mid-zoom and at full zoom-out to get coverage of some of the most common cases.

This is performed by pressing Ctrl+Shift+C when the game is in the right position. An important note when capturing is to take note of the CPU usage. This data is not always captured with the frame and can indicate if there is a CPU bottleneck which may be causing poor GPU performance. In the case of our capture (below) we aren't using very much CPU at all, so that should not impact the results much here.

In my captures I have used a save file from a playthrough with a random 'new world'. This is due to my own interest in seeing if the randomly generated terrain is handled differently than the standard world map geometry.

Once the three captures have been taken, open Intel's Graphics Frame Analyzer tool, where the captures should be displayed.


We can start by opening the zoomed-out capture first, as this has the least detail and should give us the broad strokes of the rendering process. This will give you a view of the data like this:


The important things to look at here are the top timeline and the view frame at the bottom left. The timeline on the top will allow you to set the x and y axis to be the GPU duration to highlight particularly expensive calls and the view on the bottom will highlight what is being rendered on each render target for each of the calls in the timeline.


As an example I selected the group of similar draw calls (ID=~1000 to ~10400) which appears to handle only the drawing of the borders in the world. This is shown in the bottom frame with a pink highlight of the pixels which are being rendered to. Above that there is information on my selection: it says that we selected 9483 draw calls which account for 90.1%(!!!) of the total rendering time for the frame! Scary, but we will get to that in the section on how to improve this pipeline. For now this is just to show you how this tool works.

We will use this approach to see how the scene is rendered by stepping through the draw calls and seeing what they are doing. Starting with the zoomed-out capture, the process for drawing the world appears to be:

Zoomed-out capture analysis

With the numbers referring to the GPU instruction ID



The timeline is broken up by some taller commands. These are the buffer copies, which nicely split the pipeline into the four coloured sections shown in the diagram. The length of each section is not entirely representative of the total rendering time but is an indicator that something isn't as ideal as it could be in this particular pipeline.

Overall the pipeline is relatively simple (by modern standards). It uses a depth map for correctly layering objects in the scene, and each stage of the pipeline ties nicely into the options from the settings, with the exception of shadows, which appear to be disabled or not rendering at this level of zoom.

From a graphics programmer's view, there are a number of interestingly strange things going on with the way objects are submitted for drawing in this pipeline.

For starters, the sheer number of draw calls is ridiculously large, even for a major AAA game, and this leads to a bottleneck in the GPU, as each of the submitted draw calls is relatively tiny, with only tens of primitives at the best of times. This is shown, rather crudely, in GPA by the occupancy and thread dispatch boxes showing red for the frame.


This is starving the GPU. For a lot of these calls, more time is probably spent in the driver than the GPU spends actually processing the data it is given. This is particularly noticeable when the game is rendering borders, state names, rivers and trees.

The terrain is also interesting. The world is rendered as 36 grids of 8196 triangles each. Essentially it is one big height map broken into chunks, which reads its lighting from the precomputed base texture in the setup stage. 8196 triangles is a relatively insignificant number of triangles to draw, and each of the 36 grids appears to share the same texture and shader state. There doesn't appear to be a reason for this to be separated instead of being one giant geometry. If the reason is LOD (level of detail) related, this could be resolved on the CPU side very cheaply, by selecting and combining multiple vertex buffers or vertex buffer sections into one call.

There are a lot of other details, but they appear to be the same throughout all captures, so I will address them in the optimisation section at the end.

Close Capture Analysis


As suspected, the shadows showed up in the close zoom levels. In the zoomed-out pipeline map there were some strange depth maps being cleared which appeared not to be used later in the pipeline. When we have a closer view these render targets are used for storing and processing the shadow map that is produced in the setup stages. The shaders being used later in the pipeline also bind this as a texture and appear to be using it. The rest of the pipeline remains unchanged as far as my casual exploration could tell.


The borders, text and UI remain the main culprits of GPU use, but with the addition of a lot of calls rendering the scene objects for the shadow map, and its processing, adding some extra work.

This is offset by some good culling of the terrain. In the far-out LOD the entire terrain and water were rendered, as many chunks. At this level fewer chunks are submitted, which reduces the total number of draw calls, although the culling is not perfect and some out-of-view chunks are still submitted.


Mid-Zoom Capture

Our mid-level capture appears to have the same pattern as the far-out zoom. All the city, unit detail and shadow is LOD'ed out, leaving only the map rendering. This gives us the same behaviour with fewer draw-calls, as less is visible. Rendering borders still covers the majority of the calls, but it isn't as dramatic as in the far-out zoom.


We can see in this image that the depth buffers used for the shadow map are still bound and cleared each frame (RT2-5) and are apparently unused, but I may have missed something.

Sins of a Render Empire

So in the last three sections I gave an overview of the pipeline and how it changes in different configurations - essentially some features are disabled as LOD features based on distance. However, this isn't the only place to look when we are considering performance. So in this section we are going to cover a few oddities that would need addressing if this game were going to be optimised.

4k Support

As I mentioned in the setup for this, I intended to run this on max settings at 4k resolution. This currently isn't possible, even with a high-end Intel chip and an NVIDIA 1080 - a little strange for a game from 2013.

The game runs roughly fine at 4k when we don't enable all the on-off options in the video menu. A quick investigation suggests that the shadows take most of the time. A problematic quirk of resolutions is that when we double the width and height we quadruple the number of pixels (which also quadruples the amount of work). In this engine, the shadow buffers appear to be sized based on the full rendering resolution. So stepping up to 4k doesn't just multiply the cost of rendering the scene by the increase in pixel count (2.25x from 1440p, 4x from 1080p); the scene has to be rendered twice at that size, so it is roughly a 4.5x increase from 1440p and ~8x from 1080p.
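The scaling can be checked with a little arithmetic. The sketch below assumes the common meanings of the resolution names and models the shadow pass as a second full-resolution pass, which is the reading of the capture given above rather than a confirmed detail of the engine. Doubling both dimensions (for example 1080p to 4k) gives 4x the pixels, while 1440p to 4k works out as 2.25x.

```python
# Assumed resolutions: these are the common meanings of the names.
RESOLUTIONS = {"1080p": (1920, 1080), "1440p": (2560, 1440), "4k": (3840, 2160)}


def pixels(name):
    width, height = RESOLUTIONS[name]
    return width * height


def cost_ratio(src, dst, passes_at_dst=1):
    """Pixel work of N full-size passes at dst, relative to one pass at src."""
    return passes_at_dst * pixels(dst) / pixels(src)


print(cost_ratio("1080p", "4k"))      # 4.0: doubling both dimensions
print(cost_ratio("1440p", "4k"))      # 2.25
print(cost_ratio("1440p", "4k", 2))   # 4.5: scene plus full-size shadow pass
```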

Additionally, the trees are expensive at 4k. The reason is relatively simple: at 4k we see more pixels on each tree. Each tree has high texture detail and, for some reason, a detailed normal map. So the texture read cost also scales up with the pixel count, with little coherency because the trees are small and dense and the texture resolution is reasonably high.

Water is similarly affected, but not as much as the trees, due to similarity in the pixel space helping cache coherency.

In the game there are a number of what appear to be generated textures, the size of which appears to be loosely based on the resolution that is being used. However, they do not stick to power-of-two texture sizes. So a texture that is 1029x1029 is actually a 2048x2048 texture under the hood on certain hardware. This doesn't affect the appearance of the texture but does have performance implications and is just a massive waste of memory.
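To see how large that waste is, here is a small sketch. It assumes the hardware pads each dimension up to the next power of two, which only applies to the "certain hardware" mentioned above; the function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def wasted_fraction(width, height):
    """Share of the padded allocation that holds no texture data."""
    padded = next_power_of_two(width) * next_power_of_two(height)
    return 1.0 - (width * height) / padded


print(next_power_of_two(1029))               # 2048
print(f"{wasted_fraction(1029, 1029):.1%}")  # 74.8%
```

In other words, under this assumption roughly three quarters of the allocation for a 1029x1029 texture is padding.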

This next complaint is just because I play this game a lot. At 4k the menus get really tiny and that's a pain.

DirectX 9.0c

This game is written with DirectX 9.0c. In 2013 when this came out, DirectX 10 and 11 had long been standard, and DirectX 12 was well on the way, as were AMD's early experiments with Mantle, which led to Vulkan.

AMD and NVIDIA put a lot of work into optimising drivers for modern hardware. This focus is obviously for patterns and use-cases in common software. DirectX 9.0c misses a lot of features that could really make this type of game fly on even a basic laptop.

Overbinding and draw calls

Textures are bound each time they are used. Tiny objects are submitted as individual draw calls. There is either a massive lack of batching or it is not apparent in the capture.

This puts the game at the mercy of the PCI bus and the drivers. Every time you ask the GPU to do something you risk a major state change stalling the next draw. Many of the textures being bound are identical (outside of the UI) and there are a lot of texture slots to bind. Some textures are so low resolution that they may as well be constant buffers, which would allow some nice out-of-order operations.
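A minimal sketch of what batching buys is shown below. The draw calls and texture names are hypothetical, and the sketch assumes the calls can be reordered freely, which only holds where draw order does not affect the result (for example geometry that relies on the depth buffer for layering).

```python
from itertools import groupby

# Hypothetical draw calls: (texture, primitive_count).
draw_calls = [
    ("borders.dds", 12), ("trees.dds", 20), ("borders.dds", 8),
    ("trees.dds", 16), ("borders.dds", 10), ("trees.dds", 24),
]


def texture_binds(calls):
    """Count binds when a texture is only rebound if it differs from the last."""
    return sum(1 for _ in groupby(calls, key=lambda call: call[0]))


def batch_by_texture(calls):
    """Merge calls that share a texture into one call per texture."""
    ordered = sorted(calls, key=lambda call: call[0])
    return [(texture, sum(count for _, count in group))
            for texture, group in groupby(ordered, key=lambda call: call[0])]


print(texture_binds(draw_calls))     # 6 binds as submitted
print(batch_by_texture(draw_calls))  # [('borders.dds', 30), ('trees.dds', 60)]
```

Six small draws with six binds become two larger draws with two binds, which is the kind of change that keeps the GPU fed.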

Texture Space Wastage

The cubemap for the sea is a good example of this. Due to the angle of the camera only a limited section of the cubemap will be accessed, so half of the cube map is just blank. This is more likely a trade-off than an error as the DX9 cubemaps have some strange rules - but it isn't ideal in a modern game.




From this we have shown how to analyse a game through frame capture, how to extract the pattern it uses for rendering, and how to view the content of each rendering step. We then covered what could be causing some of the performance hiccups we see during gameplay and how they could be fixed.

I would like to add the captured frames to this post, but I think that might count as distributing content from the game and I am not sure of the legality of that. So for now I hope my instructions on how to set this up on your own system with your own copy of the game are enough!


C++17: The new problem with 'auto'

Since C++11 introduced 'auto' there have been discussions about whether it increases or decreases the readability and comprehensibility of code. 

Personally, I believe that auto is useful for making code concise when used with an IDE that can resolve the type for you when needed without too much digging around, but it can be harmful if overused or applied in non-obvious ways.

C++14 extended the use of 'auto' in a logical fashion. If it is alright to use for variable types then it should be acceptable as the return type of a function, where the type can be deduced from the return statement or stated with a trailing return type at the end of the function declaration.

This I find to be a little less reasonable in the first case, as it requires the programmer to actively explore the implementation of a function to understand its usage. Due to the limitations, though, there can only be one return type, so finding and understanding any single path through the function will give you a thorough understanding of the type that is being returned.

This isn't too terrible, but it leaves the user experience of the programmer being a little unnecessarily tedious as the first half of the function definition is very unclear.

Let's look at the other half of the function declaration, the inputs to a function. From C we already had variadic inputs, which allow for any number of inputs to a function. This was then further extended to variadic function templates in C++11.

In combination with the auto return type we now have a function that can take any arguments and return any single type. The exact types that are in play or acceptable are non-obvious from the function definition and require a full understanding of the source implementation to be able to use safely. This is very bad, and it is the current state of play as of C++14 - but not as bad as it is going to get.

Now, to the point of this post. C++17 introduces the very much sought-after compile-time if statement in the form of 'if constexpr(...)'. This allows whole blocks of a function to be discarded or included based on a logical check at compile-time. Very, very useful and a great addition that could simplify a lot of code and produce more efficient output by giving more information to the compiler.

However, if we consider it alongside what we have been discussing so far, we will see that this changes the behaviour of the auto return type. Whereas in earlier versions of C++ the auto would refer to a single return type (unless some complicated templating was in use), we can now have a function of arbitrary return type based on a compile-time decision. This changes our single deduced return type with arbitrary inputs into an arbitrary return type with arbitrary inputs, essentially removing all useful information from the definition of the function and requiring a full understanding of all control paths through the function to fully know which inputs are valid and what it will return.

This is a problem of weakly typed languages and one of the strengths of C++ was not having this problem. It leads to very confusing code like this:

// Abusive case (needs C++17):
#include <map>
#include <tuple>
#include <type_traits>

template<class... Args>
auto AutoFunction(Args... args)
{
    constexpr int n = sizeof...(Args);
    using ArgTuple = std::tuple<Args...>;   // lets us look up each argument type by index

    if constexpr (n == 1)
    {
        if constexpr (std::is_same_v<std::tuple_element_t<0, ArgTuple>, int>)
            return 0;
        else if constexpr (std::is_same_v<std::tuple_element_t<0, ArgTuple>, float>)
            return 0.f;
        // any other single argument: no return statement, so the function returns void
    }
    else if constexpr (n == 2)
    {
        if constexpr (   std::is_same_v<std::tuple_element_t<0, ArgTuple>, float>
                      && std::is_same_v<std::tuple_element_t<1, ArgTuple>, float>)
            return false;
    }
    else
    {
        return std::map<void*, int>();
    }
}

// Placeholder so the example compiles; its real return type could be just as opaque.
auto SomeFunction(int x) { return x * 0.5f; }

int main()
{
    // All cases.
    auto a = AutoFunction(2.f);        // returns float
    auto b = AutoFunction(2);          // returns int
    auto c = AutoFunction(2.f, 2.f);   // returns bool
    auto d = AutoFunction();           // returns std::map

    auto input = SomeFunction(12);
    auto whatAmI = AutoFunction(input);

    return 0;
}

In this example 'AutoFunction' is essentially acting as four different functions and which function it is behaving as will be determined by the result-type of 'SomeFunction' which itself could have the same problem.

The amount of code needed to be able to correctly and safely use 'whatAmI' has gone from simply the signature of AutoFunction to the entire function body, as well as any functions which may feed its input.

This is a terrible way to write code. From the outside the function appears to be sensible but can hide strange behaviour. Programmers are far from constantly vigilant and this will only lead to problems.

What is especially problematic with this way of writing code is that it is actually very powerful. There are numerous algorithms and patterns which could be improved this way and may result in a better compiled output. It is simply that the behaviour is not clear, it is not signaled that it may behave that way and therein lies our problem.

I don't want 'if constexpr' removed, it is incredibly useful. I don't want 'auto' return types removed either. I simply believe that for them to be a non-dangerous addition there needs to be something else present to make the programmer using the function aware.