How changing one character improved performance by 83%
At Codeweavers we strongly believe in and practice Continuous Integration (CI) and Continuous Deployment (CD). Last year we deployed to production 21,174 times (51,154 commits last year too!). If you are unfamiliar with CI and CD in-short; a developer commits a piece of code. That code is then built on our CI server and tests (unit, integration etc) are run against it. If everything builds and passes tests, then it is automatically queued for deployment to our testing environment. If that deploy is successful it is automatically queued for deployment to our production environment.
It is not uncommon for a developer to commit their code and have it deployed to production in fifteen minutes. We refer to this entire process as the pipeline, and we monitor the pipeline via a real-time application we built called Gaffa. Keeping the pipeline open and builds flowing through at a consistent pace is incredibly important to us.
So imagine our dismay when Gaffa showed about twenty builds in the queue, builds that used whizz through CI were taking what felt like forever. A quick walk around the office revealed a crowd of developers hovering around a desk rapidly discussing what it could be. Having nothing to add to the conversation and not wanting to become another cook, I left them to it.
Fast forward a few weeks, builds were flowing through CI at their regular speedy pace. After enquiring to the fix; turns out no one could find out what caused the builds to run slowly; but it was established that the code built fine but test execution was slow.
However, someone did discover that if we opened the Resharper installer on CI then test execution speed was fast. Yes, you read that right. How that particular workaround was discovered, I have no idea. Sure enough, remoting onto CI and closing the Resharper installer resulted in builds on CI taking longer. Re-opening the Resharper installer and the builds are quick again!
Fast forward another few weeks, and between client work. I re-visit the ‘accepted’ solution. It did not feel right that leaving the Resharper installer open had become apart of our CI and CD flow. A quick check of the CI logs showed that every back-end application. For brevity we will focus on three back-end applications.
Application | Installer Open | Installer Closed | Diff |
---|---|---|---|
A (20,518 tests) | 157 seconds | 512 seconds | +226% |
B (14,957 tests) | 63 seconds | 304 seconds | +382% |
C (10,332 tests) | 104 seconds | 373 seconds | +258% |
Yikes, that is a pretty severe degradation in test execution. So what is special about the Resharper installer? In short, nothing. After spending two or so days tearing it apart, we came to the conclusion that the Resharper installer was not doing anything particularly clever or nefarious.
Time for a change of tact; does anyone else have this issue? Plugging “nunit slow test execution” into Google pops out this issue. Reading the entire issue thread did not set any bulbs off. However, that issue thread was linked to a more recent issue. Several dings went off:-
Having Visual Studio open did not reproduce the same behaviour for us, but having the Modern UI application open did! From here on out we switched to a much smaller test suite that only contained 147 tests, this was to save waiting minutes between slow and fast runs. With the smaller test suite a ‘slow’ run was seven seconds, and a ‘fast’ run was one second.
Now that we have at least two applications that show the same behaviour and a smaller test suite, it is time to go looking for other applications that have the same behaviour. The only noticeable link between the applications being WPF. We find two more examples; Animation Examples and Animation Timing. The latter example is especially interesting because test execution improved only when the animation was running.
It seemed like something animation related was causing test execution to either be slow or fast. Having very little experience with animation and programming related to that. I admit on the issue thread that I am beyond my skill level and begin the process of bowing out, feeling defeated I go to bed.
Fresh eyed and bushy tailed (okay, that was a blatant lie - I am not a morning person) I set about re-reading the issue thread, a lot of the time walking away and re-visiting a problem helps viewing it from a different angle.
A couple of people make references to “high resolution timing mode” and “windows timer resolution”. At that point in my career I had no idea what they were talking about, Googling the latter phrase reveals one of my now favourite blog posts by Bruce Dawson. Please do read the post by Bruce, it goes into great detail about the Windows Timer Resolution, highlights WPF as a culprit for changing it, and explains how to get the current value.
Grabbing a copy of ClockRes
it was trivial to confirm each of the aforementioned applications lower
the Windows Timer Resolution from 15.625ms
to
0.500ms
. It is not easy to explain how the Windows Timer
Resolution can affect your code. It is easier to demonstrate it instead.
Set to its default value (15.625ms
) this piece of code will
actually sleep for ~15ms
in every loop:-
for (int x = 0; x < 10; x++)
{
var sw = Stopwatch.StartNew();
.Sleep(1);
Thread.WriteLine(sw.Elapsed.TotalMilliseconds);
Console}
13.9164
15.1045
15.4075
15.4501
15.4724
15.4026
15.4622
15.4798
15.4576
15.4696
If we now lower the Windows Timer Resolution to 0.500ms
and re-run the above code we get:-
1.3298
1.4895
1.4491
1.4496
1.4639
1.4486
1.4363
1.4509
1.4621
1.4611
This
comment in the issue thread describes how to change the Windows
Timer Resolution programmatically if you want to, not that it is
recommended. NtSetTimerResolution
looks to be an
undocumented API. If anyone knows the history of it - who discovered it,
when it added to Windows I would love to know!
Out of interest I decide to compile the smaller test suite against
NUnit v2.6.4 and confirm
it is not sensitive to the Windows Timer
Resolution. This discovery helps us a lot, as it entirely eliminates our
code and suggests a performance regression was added sometime after
NUnit v2.6.4. Deciding to slow move up the versions of NUnit within two
hours I confirm that NUnit became sensitive to the Windows Timer
Resolution at version
3.4.0. The version before was v3.2.1
.
Not knowing the NUnit code base at all, it made sense to binary
chop the commits between versions v3.2.1
and
v3.4.0
. Then build each chopped version and run the smaller
test suite with the Windows Timer Resolution set to its default
(15.625ms
) and its minimum (0.500ms
). This
process was somewhat tedious and laborious (but made easier with Jammy
Dodgers and PG Tips). A day later I find the commit
responsible for making NUnit sensitive to the Windows Timer
Resolution. I submit a pull request
containing a one character change to fix the performance regression.
But why is that the fix? We have established that the default value
for the Windows Timer Resolution is 15.625ms
and
Thread.Sleep(1)
sleeps for the value of the Windows Timer
Resolution. Therefore, changing the value from 15.625ms
to
0.500ms
meant that NUnit would sleep for a shorter period
of time. Using Thread.Sleep(1)
makes NUnit sensitive to the
Windows Timer Resolution. Using Thread.Sleep(0)
allows any
other threads to execute if they
are ready and in the case there is no outstanding work it
immediately returns and continues.
Trying the fix out on the applications we tested before, we can see an amazing difference.
Application | Pre-Change | Post-Change | Diff |
---|---|---|---|
A (20,518 tests) | 512 seconds | 85 seconds | -83.39% |
B (14,957 tests) | 304 seconds | 68 seconds | -77.63% |
C (10,332 tests) | 373 seconds | 60 seconds | -83.91% |
It took a while for the issue to be found, fixed and released. It would not have been possible without the collaborative effort all of those involved in the issue thread - both offline and online. Quick overview of the timeline:-
A bug was introduced in NUnit v3.4.0
that made it
sensitive to the Windows Timer Resolution. This meant depending on what
your Windows Timer Resolution is set to, your test execution speed
could be slow. The bug was fixed in NUnit v3.8.0
.
It is strongly recommended that you upgrade to the latest version of
NUnit to take advantage of the performance fix.
We discovered the same sensitivity in JetBrains dotCover v3.1 onwards. However, they chose a different fix unfortunately it looks like this fix has not been released yet.
Tweet