Strings Are Evil

Codeweavers is a financial services software company, and part of what we do is enable our customers to bulk import their data into our platform. For our services we require up-to-date information from all our clients, which include lenders and manufacturers across the UK. Each of those imports can contain several hundred megabytes of uncompressed data, and they are often run on a daily basis.

This data is then used to power our real-time calculations. Currently this import process has to take place outside of business hours because of the impact it has on memory usage.

In this article we will explore potential optimisations to the import process, specifically ones that reduce the memory it uses. If you want to have a go yourself, you can use this code to generate a sample input file and you can find all of the code talked about here.

Establishing a baseline

The current implementation uses StreamReader and passes each line to the lineParser.

using (StreamReader reader = File.OpenText(@"..\..\example-input.csv"))
{
    try
    {
        while (reader.EndOfStream == false)
        {
            lineParser.ParseLine(reader.ReadLine());
        }
    }
    catch (Exception exception)
    {
        throw new Exception("File could not be parsed", exception);
    }
}

The most naïve implementation of a line parser that we originally had looked something like this:-

public sealed class LineParserV01 : ILineParser
{
    public void ParseLine(string line)
    {
        var parts = line.Split(',');

        if (parts[0] == "MNO")
        {
            var valueHolder = new ValueHolder(line);
        }
    }
}

The ValueHolder class is used later on in the import process to insert information into the database:-

public class ValueHolder
{
    public int ElementId { get; }
    public int VehicleId { get; }
    public int Term { get; }
    public int Mileage { get; }
    public decimal Value { get; }

    public ValueHolder(string line)
    {
        var parts = line.Split(',');

        ElementId = int.Parse(parts[1]);
        VehicleId = int.Parse(parts[2]);
        Term = int.Parse(parts[3]);
        Mileage = int.Parse(parts[4]);
        Value = decimal.Parse(parts[5]);
    }
}

Each version is run as a console application and measured like this:-

public static void Main(string[] args)
{
    AppDomain.MonitoringIsEnabled = true;

    // do the parsing

    Console.WriteLine($"Took: {AppDomain.CurrentDomain.MonitoringTotalProcessorTime.TotalMilliseconds:#,###} ms");
    Console.WriteLine($"Allocated: {AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize / 1024:#,#} kb");
    Console.WriteLine($"Peak Working Set: {Process.GetCurrentProcess().PeakWorkingSet64 / 1024:#,#} kb");

    for (int index = 0; index <= GC.MaxGeneration; index++)
    {
        Console.WriteLine($"Gen {index} collections: {GC.CollectionCount(index)}");
    }
}

Our main goal today is to reduce allocated memory. In short, the less memory we allocate, the less work the garbage collector has to do. The garbage collector operates against three generations, and we will be monitoring those as well. Garbage collection is a complex topic and outside the scope of this article, but a good rule of thumb is that short-lived objects should never be promoted past generation 0.

The baseline (V01) makes almost 7.5GB of memory allocations to parse a three hundred megabyte file, which is less than ideal. Now that we have established the baseline, let us find some easy wins…

Easy win 1

Eagle-eyed readers will have spotted that we call string.Split(',') twice: once in the line parser and again in the constructor of ValueHolder. This is wasteful; we can overload the constructor of ValueHolder to accept a string[] array and split the line once in the parser. After that simple change, V02 allocates 4,288,289 kb.

Great! We are down from 7.5GB to 4.2GB. But that is still a lot of memory allocations for processing a three hundred megabyte file.
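
The V02 change is only described in words above, so here is a minimal sketch of what it might look like. The type names ValueHolderV02Sketch and LineParserV02Sketch, the Last property and the use of CultureInfo.InvariantCulture are assumptions made for illustration; they are not taken from the original code.

```csharp
using System.Globalization;

// Hypothetical V02 shape; type names are illustrative, not the author's.
public sealed class ValueHolderV02Sketch
{
    public int ElementId { get; }
    public int VehicleId { get; }
    public int Term { get; }
    public int Mileage { get; }
    public decimal Value { get; }

    // New overload: takes the parts that the caller has already split.
    public ValueHolderV02Sketch(string[] parts)
    {
        // InvariantCulture is an assumption so "123.45" parses the same on any machine.
        ElementId = int.Parse(parts[1], CultureInfo.InvariantCulture);
        VehicleId = int.Parse(parts[2], CultureInfo.InvariantCulture);
        Term = int.Parse(parts[3], CultureInfo.InvariantCulture);
        Mileage = int.Parse(parts[4], CultureInfo.InvariantCulture);
        Value = decimal.Parse(parts[5], CultureInfo.InvariantCulture);
    }
}

public sealed class LineParserV02Sketch
{
    // Kept only so the result can be inspected; the article's parser discards it.
    public ValueHolderV02Sketch Last { get; private set; }

    public void ParseLine(string line)
    {
        var parts = line.Split(','); // the only Split call for this line

        if (parts[0] == "MNO")
        {
            Last = new ValueHolderV02Sketch(parts);
        }
    }
}
```

Calling new LineParserV02Sketch().ParseLine("MNO,1,2,36,10000,123.45,,") (a made-up line) splits once and hands the same string[] to the constructor, which is where the saving comes from.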

Easy win 2

Quick analysis of the input file reveals that there are 10,047,435 lines, but we are only interested in the 10,036,466 lines that are prefixed with MNO. That means we are unnecessarily processing an additional 10,969 lines. A quick change in V03 to only parse lines prefixed with MNO:-

public sealed class LineParserV03 : ILineParser
{
    public void ParseLine(string line)
    {
        if (line.StartsWith("MNO"))
        {
            var valueHolder = new ValueHolder(line);
        }
    }
}

This means we defer splitting the entire line until we know it is a line we are interested in. Unfortunately this did not save us much memory, mainly because we are interested in 99.89% of the lines in the file. V03 allocates 4,284,873 kb, barely less than V02.

Strings in the .NET ecosystem are immutable, meaning that any operation that appears to modify a string actually returns a brand new string. Therefore calling string.Split(',') on every line (remember there are 10,036,466 lines we are interested in) returns that line split into several smaller strings. Each line has at minimum five sections we want to process. That means that over the lifetime of the import process we create at least 10,036,466 × 5 = 50,182,330 strings! Next we will explore what we can do to eliminate the use of string.Split(',').

Splits are never cool

Calling string.Split(',') on the above line will return a string[] containing:-

Guarantees established, we can now build a short-lived index of the positions of all the commas for a given line:-

private List<int> FindCommasInLine(string line)
{
    var list = new List<int>();

    for (var index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            list.Add(index);
        }
    }

    return list;
}

Once we know the position of each comma, we can directly access the section we care about and manually parse that section.

private decimal ParseSectionAsDecimal(int start, int end, string line)
{
    var sb = new StringBuilder();

    for (var index = start; index < end; index++)
    {
        sb.Append(line[index]);
    }

    return decimal.Parse(sb.ToString());
}

private int ParseSectionAsInt(int start, int end, string line)
{
    var sb = new StringBuilder();

    for (var index = start; index < end; index++)
    {
        sb.Append(line[index]);
    }

    return int.Parse(sb.ToString());
}

public void ParseLine(string line)
{
    if (line.StartsWith("MNO"))
    {
        var findCommasInLine = FindCommasInLine(line);

        var elementId = ParseSectionAsInt(findCommasInLine[0] + 1, findCommasInLine[1], line); // equal to parts[1] - element id
        var vehicleId = ParseSectionAsInt(findCommasInLine[1] + 1, findCommasInLine[2], line); // equal to parts[2] - vehicle id
        var term = ParseSectionAsInt(findCommasInLine[2] + 1, findCommasInLine[3], line); // equal to parts[3] - term
        var mileage = ParseSectionAsInt(findCommasInLine[3] + 1, findCommasInLine[4], line); // equal to parts[4] - mileage
        var value = ParseSectionAsDecimal(findCommasInLine[4] + 1, findCommasInLine[5], line); // equal to parts[5] - value
        var valueHolder = new ValueHolder(elementId, vehicleId, term, mileage, value);
    }
}

Whoops, V04 is worse than expected: allocations are back up to about 6.7GB. It is an easy mistake to make but dotTrace can help us here…

Constructing a StringBuilder for every section in every line is incredibly expensive. Luckily it is a quick fix: we construct a single StringBuilder when V05 is constructed and clear it before each use. V05 allocates 3,199,195 kb.

Phew, we are back on a downward trend. We started at 7.5GB and now we are down to 3.2GB.
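
The V05 fix is not shown above, so here is a minimal sketch of the idea, assuming the parsing helpers live in one class that owns the builder. The names SectionParserV05Sketch and CopySection are hypothetical, and the helpers are public only so they can be called directly.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical sketch of the V05 idea: one StringBuilder, reused for every section.
public sealed class SectionParserV05Sketch
{
    // Constructed once with the parser instead of once per section.
    private readonly StringBuilder _stringBuilder = new StringBuilder();

    public int ParseSectionAsInt(int start, int end, string line)
    {
        return int.Parse(CopySection(start, end, line), CultureInfo.InvariantCulture);
    }

    public decimal ParseSectionAsDecimal(int start, int end, string line)
    {
        return decimal.Parse(CopySection(start, end, line), CultureInfo.InvariantCulture);
    }

    private string CopySection(int start, int end, string line)
    {
        _stringBuilder.Clear(); // resets the length but keeps the internal buffer

        for (var index = start; index < end; index++)
        {
            _stringBuilder.Append(line[index]);
        }

        return _stringBuilder.ToString(); // still allocates one string per section
    }
}
```

Because the builder is shared state, a parser written this way should not be used from several threads at once. Note that ToString() still allocates a string for every section.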

Lists are not always nice

At this point dotTrace becomes an essential part of the optimisation process. Looking at V05 dotTrace output:-

Building the short-lived index of comma positions is expensive, because underneath any List<T> is just a standard T[] array, and the framework takes care of resizing the underlying array as elements are added. This is very handy in typical scenarios. However, we know that there are six sections we need to process (but we are only interested in five of those sections), ergo there are at least seven commas we want indexes for. We can optimise for that:-

private int[] FindCommasInLine(string line)
{
    var nums = new int[7];
    var counter = 0;

    for (var index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            nums[counter++] = index;
        }
    }

    return nums;
}

That brings us down to 2.6GB, which is pretty good, but what happens if we declare the indexes in this method as byte instead of int:-

private byte[] FindCommasInLine(string line)
{
    byte[] nums = new byte[7];
    byte counter = 0;

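    // a byte index only works for lines of at most 255 characters;
    // on a longer line index would overflow before reaching line.Length
    for (byte index = 0; index < line.Length; index++)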
    {
        if (line[index] == ',')
        {
            nums[counter++] = index;
        }
    }

    return nums;
}

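2.6GB was pretty good, 2.4GB is even better. This is because an int occupies four bytes whereas a byte occupies one, so each seven-element array carries 21 fewer bytes of payload (before any alignment padding), and we create one such array for every line.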

Pooling byte arrays

V06 now has a byte[] array that holds the index of each comma for each line. It is a short-lived array, but it is created many times. We can eliminate the cost of creating a new byte[] for each line by using a recent addition to the .NET ecosystem: System.Buffers. Adam Sitnik has a great breakdown on using it and why you should. The important thing to remember when using ArrayPool<T>.Shared is that you must always return the rented buffer once you are done with it; otherwise the pool has to allocate replacement buffers and the benefit of pooling is lost.

public void ParseLine(string line)
{
    if (line.StartsWith("MNO"))
    {
        var tempBuffer = _arrayPool.Rent(7);

        try
        {
            var findCommasInLine = FindCommasInLine(line, tempBuffer);
            // truncated for brevity
        }
        finally
        {
            _arrayPool.Return(tempBuffer, true);
        }
    }
}

private byte[] FindCommasInLine(string line, byte[] nums)
{
    byte counter = 0;

    for (byte index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            nums[counter++] = index;
        }
    }

    return nums;
}

Down to 2.2GB, having started at 7.5GB. It is pretty good, but we are not done yet.
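
One detail worth knowing when renting: the array that comes back may be longer than the length requested, so the code has to track how many slots it actually filled. The following is a minimal sketch of the rent/use/return pattern under that assumption. CommaIndexSketch and its methods are hypothetical names, and the length guard is an addition that the code above does not have.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch of renting a buffer for the comma index.
public static class CommaIndexSketch
{
    // Records comma positions into indexes and returns how many were written.
    // Only indexes[0..count) is meaningful afterwards.
    public static int FindCommas(string line, byte[] indexes)
    {
        if (line.Length > 256)
        {
            // positions 0..255 are all a byte can represent
            throw new ArgumentException("Line is too long for a byte index.", nameof(line));
        }

        int count = 0;

        for (int index = 0; index < line.Length && count < indexes.Length; index++)
        {
            if (line[index] == ',')
            {
                indexes[count++] = (byte)index;
            }
        }

        return count;
    }

    public static int CountCommasUsingPool(string line)
    {
        ArrayPool<byte> pool = ArrayPool<byte>.Shared;
        byte[] rented = pool.Rent(7); // Length is at least 7, possibly more

        try
        {
            return FindCommas(line, rented);
        }
        finally
        {
            // always hand the buffer back, even if parsing throws
            pool.Return(rented, clearArray: true);
        }
    }
}
```

On .NET Framework, System.Buffers comes from a NuGet package; on .NET Core and later it is part of the framework.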

Goodbye StringBuilder

Calling StringBuilder.ToString() inside the decimal and int parsers is incredibly expensive. It is time to deprecate StringBuilder and write our own¹ int and decimal parsers that do not rely on strings or on calling int.Parse() / decimal.Parse(). According to the profiler this should shave off around 1GB. With our own int and decimal parsers, V08 clocks in at 1,160,856 kb.

1.1GB is a huge improvement from where we were last (2.2GB) and even better than the baseline (7.5GB).
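
The hand-written parsers are not reproduced in this section, so the following is only a sketch of how such parsers could work, not the author's implementation. It assumes sections contain nothing but ASCII digits and at most one decimal point (no sign, no thousands separators, no whitespace), and the class name ManualParsersSketch is hypothetical.

```csharp
using System;

// Hypothetical allocation-free parsers; assumes unsigned ASCII digits only.
public static class ManualParsersSketch
{
    public static int ParseSectionAsInt(string line, int start, int end)
    {
        int result = 0;

        for (int index = start; index < end; index++)
        {
            // checked so an overlong number throws instead of silently wrapping
            result = checked(result * 10 + ToDigit(line[index]));
        }

        return result; // an empty section yields 0
    }

    public static decimal ParseSectionAsDecimal(string line, int start, int end)
    {
        long unscaled = 0;      // all digits with the decimal point removed
        int scale = 0;          // number of digits after the decimal point
        bool seenPoint = false;

        for (int index = start; index < end; index++)
        {
            char character = line[index];

            if (character == '.')
            {
                if (seenPoint)
                {
                    throw new FormatException("More than one decimal point.");
                }

                seenPoint = true;
                continue;
            }

            unscaled = checked(unscaled * 10 + ToDigit(character));

            if (seenPoint)
            {
                scale++;
            }
        }

        // decimal is a struct, so this constructor does not allocate on the heap
        int lo = unchecked((int)unscaled);
        int mid = unchecked((int)(unscaled >> 32));
        return new decimal(lo, mid, 0, false, (byte)scale);
    }

    private static int ToDigit(char character)
    {
        int digit = character - '0';

        if (digit < 0 || digit > 9)
        {
            throw new FormatException("Unexpected character in numeric section.");
        }

        return digit;
    }
}
```

The digits are accumulated directly from the characters of the line, so no intermediate string is ever created.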

Skipping commas

Until V08 our strategy has been to find the index of every comma on each line and then use that information to create a sub-string which is then parsed by calling int.Parse() / decimal.Parse(). V08 deprecates the use of sub-strings but still uses the short-lived index of comma positions.

An alternative strategy would be to skip to the section we are interested in by counting the number of preceding commas then parse anything after the required number of commas and return when we hit the next comma.

This would also mean we can deprecate the rented byte[] array because we are no longer building a short-lived index:-

public sealed class LineParserV09 : ILineParser
{
    public void ParseLine(string line)
    {
        if (line.StartsWith("MNO"))
        {
            int elementId = ParseSectionAsInt(line, 1); // equal to parts[1] - element id
            int vehicleId = ParseSectionAsInt(line, 2); // equal to parts[2] - vehicle id
            int term = ParseSectionAsInt(line, 3); // equal to parts[3] - term
            int mileage = ParseSectionAsInt(line, 4); // equal to parts[4] - mileage
            decimal value = ParseSectionAsDecimal(line, 5); // equal to parts[5] - value
            var valueHolder = new ValueHolder(elementId, vehicleId, term, mileage, value);
        }
    }
}

Unfortunately V09 does not save us any memory (it still allocates 1,160,856 kb); it does, however, reduce the time taken from 6,047 ms to 5,703 ms.

Another benefit of V09 is that it reads much closer to the original implementation.
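
The V09 listing calls ParseSectionAsInt(line, 1) without showing its body. Below is a minimal sketch of how the comma-skipping strategy might be implemented for the int case, under the same assumption as before that sections hold only unsigned ASCII digits. The class name SkipCommasSketch and the parameter name numberOfCommasToSkip are hypothetical.

```csharp
using System;

// Hypothetical sketch of the "skip N commas, then parse" strategy.
public static class SkipCommasSketch
{
    public static int ParseSectionAsInt(string line, int numberOfCommasToSkip)
    {
        int commasSeen = 0;
        int result = 0;

        for (int index = 0; index < line.Length; index++)
        {
            char character = line[index];

            if (character == ',')
            {
                if (commasSeen == numberOfCommasToSkip)
                {
                    break; // this comma closes the section we were parsing
                }

                commasSeen++;
                continue;
            }

            if (commasSeen < numberOfCommasToSkip)
            {
                continue; // still in an earlier section
            }

            int digit = character - '0';

            if (digit < 0 || digit > 9)
            {
                throw new FormatException("Unexpected character in numeric section.");
            }

            result = checked(result * 10 + digit);
        }

        return result;
    }
}
```

Each call rescans the line from the start, so this trades a little repeated scanning for not having to build or rent an index at all. A decimal variant would follow the same shape with the digit handling swapped out.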

The war between classes and structs

This blog post is not going to cover the difference or the pros/cons of classes vs structs. That topic has been covered many times. In this particular context, it is beneficial to use a struct. Changing ValueHolder to a struct in V10 brings allocations down to 768,803 kb.

Finally, we are below the 1GB barrier. A word of warning, though: please do not use a struct blindly; always test your code and make sure the use case is correct.
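
For reference, the struct version of ValueHolder might look something like the sketch below. The name ValueHolderStructSketch is hypothetical, and the readonly modifier (C# 7.2 or later) is an assumption rather than something the article states. As a rough sanity check on the numbers, the drop from 1,160,856 kb to 768,803 kb spread over 10,036,466 MNO lines is about 40 bytes per line, which is in the region of what one small heap object per line would cost.

```csharp
// Hypothetical struct version of ValueHolder; readonly is an assumption.
public readonly struct ValueHolderStructSketch
{
    public int ElementId { get; }
    public int VehicleId { get; }
    public int Term { get; }
    public int Mileage { get; }
    public decimal Value { get; }

    public ValueHolderStructSketch(int elementId, int vehicleId, int term, int mileage, decimal value)
    {
        ElementId = elementId;
        VehicleId = vehicleId;
        Term = term;
        Mileage = mileage;
        Value = value;
    }
}
```

A struct held in a local variable does not need its own heap allocation, which is why it helps here. That benefit disappears if the value is boxed, for example by assigning it to an object or to an interface-typed variable.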

Goodbye StreamReader

As of V10 the line parser itself is virtually allocation free. dotTrace reveals where the remaining allocations occur:-

Well this is awkward, the framework itself is costing us memory allocations: every call to StreamReader.ReadLine() allocates a new string. We can interact with the file at a lower level than a StreamReader:-

private static void ViaRawStream(ILineParser lineParser)
{
    var sb = new StringBuilder();

    using (var reader = File.OpenRead(@"..\..\example-input.csv"))
    {
        try
        {
            bool endOfFile = false;
            while (reader.CanRead)
            {
                sb.Clear();

                while (endOfFile == false)
                {
                    var readByte = reader.ReadByte();

                    // -1 means end of file
                    if (readByte == -1)
                    {
                        endOfFile = true;
                        break;
                    }

                    // assumes a single-byte encoding such as ASCII
                    var character = (char)readByte;

                    // this means the line is about to end so we skip
                    if (character == '\r')
                    {
                        continue;
                    }

                    // this line has ended
                    if (character == '\n')
                    {
                        break;
                    }

                    sb.Append(character);
                }

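                // stop only once the file is exhausted and no partial line is buffered,
                // so a final line without a trailing newline is still parsed
                if (endOfFile && sb.Length == 0)
                {
                    break;
                }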

                var buffer = new char[sb.Length];

                for (int index = 0; index < sb.Length; index++)
                {
                    buffer[index] = sb[index];
                }

                lineParser.ParseLine(buffer);
            }
        }
        catch (Exception exception)
        {
            throw new Exception("File could not be parsed", exception);
        }
    }
}

Well, 695MB is still better than 768MB. Okay, that was not the improvement I was expecting (and rather anti-climactic). That is, until we remember that we have seen and solved this problem before. In V07 we used ArrayPool<T>.Shared to avoid allocating lots of small byte[] arrays. We can do the same here:-

private static void ViaRawStream(ILineParser lineParser)
{
    var sb = new StringBuilder();
    var charPool = ArrayPool<char>.Shared;

    using (var reader = File.OpenRead(@"..\..\example-input.csv"))
    {
        try
        {
            bool endOfFile = false;
            while (reader.CanRead)
            {
                // truncated for brevity

                char[] rentedCharBuffer = charPool.Rent(sb.Length);

                try
                {
                    for (int index = 0; index < sb.Length; index++)
                    {
                        rentedCharBuffer[index] = sb[index];
                    }

                    lineParser.ParseLine(rentedCharBuffer);
                }
                finally
                {
                    charPool.Return(rentedCharBuffer, true);
                }
            }
        }
        catch (Exception exception)
        {
            throw new Exception("File could not be parsed", exception);
        }
    }
}

With the rented char[] in place, V11 allocates just 32 kb. The statistics for every version:-

Version	Took (ms)	Allocated (kb)	Peak Working Set (kb)	Gen 0 Collections
01	8,750	7,412,303	16,720	1,809
02	6,922	4,288,289	16,716	1,046
03	8,375	4,284,873	16,744	1,046
04	9,813	6,727,664	16,872	1,642
05	8,125	3,199,195	16,636	781
06	8,078	2,454,297	16,548	599
07	8,891	2,258,272	16,752	551
08	6,047	1,160,856	16,816	283
09	5,703	1,160,856	16,572	283
10	5,594	768,803	16,512	187
11	6,781	32	12,620	0