From 41 frames per second to 1560 - Full app 38x speedup

Having spent a couple of evenings backporting Async to iOS 7 and OS X Mavericks (10.9) and releasing it as Async.legacy, I've gone back to trying to squeeze some more performance out of the GrayScott Cellular Automata app that Simon Gladman presented at the last London Swift Meetup.

For me this was an interesting case for seeing how fast something that is almost entirely CPU and memory bound can be made, and it gave me a chance to play. Not many things need optimising if they are well structured, but this was a case where it could clearly be relevant.

Simon's original code calculates about 10fps in debug mode and displays many of those frames. Built with optimisations it calculates about 41fps, but it very rarely updates the screen because of the timing mechanism he used rather than calling back to the main thread. All of this was on a 70 pixel square calculation.

Running the latest code on a 70 pixel square calculation, it computes between 1550 and 1600 frames in most seconds, a speedup of about 40 times, and it displays far more frames to the screen too (well, it assigns the images to the image property of the imageView; the actual screen framerate is far lower).

This post focuses on making the main solving work multi-threaded for performance and on optimising the inner loop. At this point we are moving beyond optimising by improving the style, purity and immutability of the code. Some of the changes (inlining simple functions) go directly against good style and should only be done in inner loops. The parallelisation of the main solver also makes the code less clean and tidy, as does the incorporation of the pixelData generation into the main loop.

I've already posted twice about this progress. The first post dealt with the change from NSOperation to GCD, how to call back to the main thread to update the UI, and the performance benefits of constant Swift arrays compared to NSArray/NSMutableArray, at least when building with optimisations enabled (debug builds can be significantly slower). In the second post I sped up the rendering by switching from drawing pixels as 1x1 rectangles into a drawing context to directly filling in the pixel data of a bitmap and making a UIImage/NSImage from the resulting CGImage.

All the code can be found in the Github repo and you can explore the progression and some of the dead end branches if you really want.

Swift Debug / -Onone Builds Are Slow

This is very important and can make a massive difference; in some code allowing optimisation may give a 50x speedup. Do not complain that Swift is slow unless you are doing release builds or building with -O (Fastest) or even -Ounchecked (Fastest, Unchecked).

Further Optimisation is Still Possible

I'm far from claiming that the app is fully optimised even with the changes described below. Apart from anything else it may be possible to apply the Metal or Accelerate frameworks to it, and even before that the inner loop probably still has room for tuning. The other area that I'm not convinced is optimal is the bitmap creation for rendering; I wonder whether it may be possible to write even more directly to the offscreen buffer for the layer. If anyone can point me in the direction of good guides to these areas, please do.

Optimisation is not a Straight Line

While in this post I will largely discuss the changes that did improve performance, if you look at the network view of my Github repo you will see that there are a number of branches to nowhere, and these were probably not the only false starts. Some things that you expect to make things faster don't, and can actually make things slower, so there was a fair bit of experimentation on the way to this state.

The Optimisations

Break the work into n pieces, where n is the number of processors available

In the multi-core world that we now live in this is an obvious choice for performance; however, it isn't always easy to do and the gains are often smaller than you expect...

In fact, until the last few commits I was finding that on my quad-core MacBook it was actually fastest running with only a single queue for the main work, effectively single-threaded through that code. I had assumed that the threads writing to adjacent blocks of memory (actually different areas of the same array) were congesting the memory bus with the work of keeping all the processors' views of the memory in sync. The fact that this turned out not to be the case suggests that I should investigate more; it may have been that waiting multiple times on a semaphore, rather than waiting once on a dispatch group, was causing the problem (in the code examples I will show the right way to do it).
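To make that last point concrete, here is a minimal sketch of the group-based waiting (made-up names, written against the modern DispatchQueue/DispatchGroup API rather than the dispatch_* C functions the app used at the time): every block is dispatched into one group and there is a single wait for the whole batch, instead of signalling a semaphore from each block and waiting on it once per block.

```swift
import Dispatch

// Minimal sketch: run `sectionCount` chunks of work concurrently and block
// the caller until every chunk has finished, waiting once on a group.
func runSections(sectionCount: Int, work: @escaping (Int) -> Void) {
    let queue = DispatchQueue.global(qos: .userInitiated)
    let group = DispatchGroup()

    for section in 0..<sectionCount {
        queue.async(group: group) {
            work(section)
        }
    }

    // One wait for the whole batch, rather than one semaphore wait per block.
    group.wait()
}
```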

If you read the previous article, you will remember that when we left the solver function it was a pure function without side effects, and it looked like this:
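The actual listing is embedded in that earlier post; the sketch below (with made-up names and the per-cell maths elided) just shows the shape that matters here: the next generation is built purely from the previous one and nothing outside the function is touched.

```swift
// Shape of a pure solver: the output depends only on the input arguments.
struct Cell {
    var u: Double
    var v: Double
}

func solve(previous: [Cell], width: Int, height: Int) -> [Cell] {
    var next = previous
    for index in previous.indices {
        // Each new cell is computed purely from values read out of `previous`
        // (the actual Gray-Scott update is sketched further down).
        next[index] = previous[index]
    }
    return next
}
```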

The fact that it is a pure function, where the output depends only on the arguments and there are no side effects (except for printing some performance measurements), meant that it was immediately safe to parallelise. As long as the result is collated and prepared by the point of return, and the function continues to have limited side effects, the change cannot affect other areas of the code.

UPDATE - WARNING - Arrays are not thread-safe, so the following code is incorrect. See the Github repo for the latest. Either you have to create separate arrays for each queue to process OR you can use .withUnsafeMutableBufferPointer to get at a raw memory area. See this follow-up post.
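For reference, the second of those options looks something like this minimal sketch (a made-up function using DispatchQueue.concurrentPerform; the repo's real fix is in the follow-up post). The point is that each block writes only to its own disjoint range of the raw buffer, so no Array copy-on-write machinery runs on a background thread:

```swift
import Dispatch

// Sketch: give each concurrent iteration its own disjoint slice of a single
// output array through an unsafe buffer pointer.
func fillInParallel(count: Int, sections: Int, value: (Int) -> Double) -> [Double] {
    var output = [Double](repeating: 0, count: count)
    output.withUnsafeMutableBufferPointer { buffer in
        guard let base = buffer.baseAddress else { return }
        DispatchQueue.concurrentPerform(iterations: sections) { section in
            let start = section * count / sections
            let end = (section + 1) * count / sections
            for i in start..<end {
                base[i] = value(i)   // each section writes only its own range
            }
        }
    }
    return output
}
```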

As the function works through an array of data and the results do not affect each other (all results are based on the previous state), it was clearly possible to split the array into sections and have a different dispatched block handle each section. This is where, inside the function, we go highly impure and use what I would regard as very poor functional programming style by passing the output arrays as inout parameters to an inner function responsible for processing the section. This was an efficiency guess on my part that copying and joining arrays of significant size would be slower than writing directly into different parts of one array.

Here is the public function and the declaration of the inner function (I've removed the logging code; if you want to see the real one, look in the Github repo):
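What follows is a sketch of the shape rather than the repo's listing: the names and types are made up, and the sections are driven sequentially here. In the app each band is dispatched onto a concurrent queue inside a dispatch group, which is exactly where the thread-safety warning above applies.

```swift
struct Cell { var u: Double; var v: Double }
struct PixelData { var a: UInt8; var r: UInt8; var g: UInt8; var b: UInt8 }

// Declaration of the inner worker: it solves one band of rows, writing the new
// cells and the matching bitmap bytes back through inout parameters.
func solveSection(rows: Range<Int>, width: Int, height: Int,
                  previous: [Cell], cells: inout [Cell], pixels: inout [PixelData]) {
    // Body sketched in the next listing.
}

// Rough shape of the public function: it now returns the pixel data alongside
// the solved cells so the grid only has to be walked once per frame.
public func solveGrayScott(previous: [Cell], width: Int, height: Int,
                           sections: Int) -> (cells: [Cell], pixels: [PixelData]) {
    var cells = previous
    var pixels = [PixelData](repeating: PixelData(a: 255, r: 0, g: 0, b: 0),
                             count: previous.count)
    for section in 0..<sections {
        // In the app each of these bands goes onto a concurrent queue inside a
        // dispatch group; writing into the shared arrays from those blocks is
        // what needs withUnsafeMutableBufferPointer (or per-section arrays).
        let rows = (section * height / sections)..<((section + 1) * height / sections)
        solveSection(rows: rows, width: width, height: height,
                     previous: previous, cells: &cells, pixels: &pixels)
    }
    return (cells, pixels)
}
```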

One thing the eagle-eyed may note is that the function now returns a tuple of two arrays rather than a single array. The reason is that I brought the bulk of the renderer method into the same place to avoid having to iterate through the data array again to create the PixelData. I should test separating it again with the latest code. This is probably an example of using a worse style (less separation of concerns, less reusability) to try to get better performance in the inner loop; in this case I was aiming to improve the temporal locality of access to the data.

The internal solver code, converted to solve a particular section rather than the whole frame, looked something like this:
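Again this is a sketch rather than the repo's listing: the names are made up, the Gray-Scott constants are illustrative, and a simple greyscale pixel mapping stands in for the real one. It keeps the wrapped index reads and the clip as small helper calls, which is the style the next section removes:

```swift
extension Int {
    // Wrap an index onto 0..<count so reads off one edge come from the other.
    func wrapped(to count: Int) -> Int {
        return (self % count + count) % count
    }
}

extension Double {
    // Clip a value into the range 0.0...1.0.
    func clipped() -> Double {
        return Swift.min(1.0, Swift.max(0.0, self))
    }
}

struct Cell { var u: Double; var v: Double }
struct PixelData { var a: UInt8; var r: UInt8; var g: UInt8; var b: UInt8 }

// Solve one horizontal band of rows, writing its results (and the matching
// bitmap bytes) straight into the output arrays.
func solveSection(rows: Range<Int>, width: Int, height: Int,
                  previous: [Cell], cells: inout [Cell], pixels: inout [PixelData]) {
    // Illustrative constants; the app exposes its parameters as adjustable values.
    let (dU, dV, feed, kill, dt) = (0.16, 0.08, 0.06, 0.062, 1.0)

    for y in rows {
        for x in 0..<width {
            let index = y * width + x
            let cell = previous[index]

            // Neighbour reads wrap around the edges of the grid.
            let north = previous[(y - 1).wrapped(to: height) * width + x]
            let south = previous[(y + 1).wrapped(to: height) * width + x]
            let west  = previous[y * width + (x - 1).wrapped(to: width)]
            let east  = previous[y * width + (x + 1).wrapped(to: width)]

            let lapU = north.u + south.u + west.u + east.u - 4 * cell.u
            let lapV = north.v + south.v + west.v + east.v - 4 * cell.v
            let uvv = cell.u * cell.v * cell.v

            let u = (cell.u + dt * (dU * lapU - uvv + feed * (1 - cell.u))).clipped()
            let v = (cell.v + dt * (dV * lapV + uvv - (feed + kill) * cell.v)).clipped()

            cells[index] = Cell(u: u, v: v)
            let grey = UInt8(u * 255)          // illustrative colour mapping
            pixels[index] = PixelData(a: 255, r: grey, g: grey, b: grey)
        }
    }
}
```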

This is very similar to how the whole solver looked before I changed it to work section by section rather than on the whole frame every time, except for the inclusion of the code from the renderer that prepares the bitmap to be converted to an image.

Parallel Results

At this stage the results of parallelism were a little disappointing. On the (dual-core) iPad Retina Mini I was getting a real speedup of about 30% for an additional 76% CPU utilisation (170% CPU usage), and when I ported to OS X the results on my (quad-core i7) Mac were very disappointing. Despite being able to get the CPU usage up over 500% depending on the number of queues I split the job into (i7 processors support hyperthreading, with 2 virtual processors sharing one real core, which really helps when code is likely to stall), the parallel performance was consistently lower (almost half) than when I was running it with a single solver queue.

This was very disappointing, and I tried a few things that didn't help very much, such as making the partial solvers return separate output arrays in case there were issues with cache coherency and memory contention. I did look at the GCD dispatch_data methods but didn't actually get round to trying them. After going away from it for a couple of days I just concluded that the i7 wasn't great at threading where the data is shared between the threads (even though they write to different sections, those sections are adjacent). This turned out to be wrong when I got back to this code and looked at optimising the inner loop...

Inner Loop Optimisation

Initial Failure

I initially started trying to separate out the processing of the edges, and I was planning to use the Accelerate framework to speed up the processing of the data in the middle. I may yet return to some of that work, but my initial results showed that processing the edges was taking as long as the bulk in the middle that I was hoping to accelerate.

Function Call Removal

What I did notice was that the function calls dealing with the wrapping of cell reads could be inlined. Simon had done a nice job of abstracting the logic to wrap the reading of cells into an extension on the Int type, but it turned out that deep in the processing loop this was an expensive call. This change led to something more than a three times speed improvement, and I gained a further 10% or so by inlining the clip method that Simon had extended Double with to clip values to between 0.0 and 1.0.

The resulting function looks like this:
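As before, this is a sketch with the same made-up names rather than the repo's function; it shows the kind of inlining described, with the wrap reduced to a comparison per index and the clip written out as inline checks:

```swift
struct Cell { var u: Double; var v: Double }
struct PixelData { var a: UInt8; var r: UInt8; var g: UInt8; var b: UInt8 }

// Same section solver as above, but with the Int wrapping and Double clipping
// written inline instead of calling the extension helpers.
func solveSectionInlined(rows: Range<Int>, width: Int, height: Int,
                         previous: [Cell], cells: inout [Cell], pixels: inout [PixelData]) {
    let (dU, dV, feed, kill, dt) = (0.16, 0.08, 0.06, 0.062, 1.0)

    for y in rows {
        // Wrapping resolved once per row with simple comparisons.
        let rowAbove = (y == 0 ? height - 1 : y - 1) * width
        let rowBelow = (y == height - 1 ? 0 : y + 1) * width
        let row = y * width

        for x in 0..<width {
            let left  = x == 0 ? width - 1 : x - 1
            let right = x == width - 1 ? 0 : x + 1
            let index = row + x
            let cell = previous[index]

            let lapU = previous[rowAbove + x].u + previous[rowBelow + x].u +
                       previous[row + left].u + previous[row + right].u - 4 * cell.u
            let lapV = previous[rowAbove + x].v + previous[rowBelow + x].v +
                       previous[row + left].v + previous[row + right].v - 4 * cell.v
            let uvv = cell.u * cell.v * cell.v

            // Clip inlined rather than calling a Double extension.
            var u = cell.u + dt * (dU * lapU - uvv + feed * (1 - cell.u))
            var v = cell.v + dt * (dV * lapV + uvv - (feed + kill) * cell.v)
            u = u < 0 ? 0 : (u > 1 ? 1 : u)
            v = v < 0 ? 0 : (v > 1 ? 1 : v)

            cells[index] = Cell(u: u, v: v)
            let grey = UInt8(u * 255)
            pixels[index] = PixelData(a: 255, r: grey, g: grey, b: grey)
        }
    }
}
```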

This is less tidy but turned out to work much faster, and it even enabled the OS X build to perform well multithreaded. My current guess at why it does so much better is that it is much easier for the CPU's branch prediction to pick the right path more often, but I'm not sure how to test that.

To be clear, this is NOT nicer code, but it is faster.

Performance - iPad Retina Mini (dual-core ARM64)

To slow it down to more sensible speeds I normally run with a much larger square to calculate. 256 pixels square is a good balance on the iPad, and on my laptop, which I'll cover in more detail below, I can run at 512 pixels square comfortably. The performance should scale with the area (the square of the side length), so 256 pixels should be about 13.4 times slower than 70 pixels.

At 70 pixels - About 1560fps at 140% CPU usage (2+ solver dispatches); 1570fps with a single dispatch. There just isn't enough data to be worth splitting it.

At 256 pixels - About 220fps at about 150% CPU usage (2+ solver dispatches); 175fps at 118% CPU usage (single solver dispatch).

At 512 pixels - About 60fps at about 160% CPU usage (2+ solver dispatches); 45fps at 113% CPU usage (single solver dispatch). On such a big canvas with default settings this moves pretty slowly.

Performance - 2.3GHz Quad Core Haswell i7 MacBook Pro

Take these figures with a very large pinch of salt: they are only approximate, being my visual estimates from the varying fps figures displayed roughly every second. As you can see, dispatching up to 4 blocks does help, and there are some gains beyond that, but they are limited, at least on my computer; they may be different on machines with more real cores. I was also finding that the NSImageView seemed to be much more sensitive than the UIImageView on the iPad to the size of the view (whether it needs scaling) and to whether it was visible, so there may be a better way to do this; I hadn't written OS X applications before this test.


Approximate calculated fps for each canvas size (pixels square) against the number of solver dispatches:

| Pixels square | 1 | 2 | 3 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|---|
| 70 (crashing - needs investigation) | 9400 | 13500 | | 12000 | 13000 | 12000 | 9800 |
| 256 | 720 | 1230 | | 1500 | 1600 | 1740 | 1720 |
| 512 | 185 | 270 | | 370 | 330 | 340 | 340 |
| 1024 | 42 | 70 | 80 | 85 | | | 90 |


Anyone with a Mac Pro out there willing to give it a run?

Let me know what you think and whether you have suggestions for the next phases of improvement.