LINQ Batch extensions: A correction

In a post last year I explored how we can use LINQ to batch or page items from a collection in such a way as to return a sequence of sub-sequences, each of the same length (except for possibly the last one). In that post I made a significant mistake that I would like to correct here.

Overview

Let us quickly recap the idea behind the batching method I implemented in the other post.

Take for example the list { 1, 2, 3, 4, 5, 6, 7, 8 }. We could tell the method in question to batch it into groups of 3, which would return { {1, 2, 3}, {4, 5, 6}, {7, 8} }.
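
In code, using the method might look something like this (a quick sketch of the same example):

var numbers = new[] {1, 2, 3, 4, 5, 6, 7, 8};

foreach (var batch in numbers.Batch(3))
{
    // prints "1, 2, 3", then "4, 5, 6", and finally "7, 8"
    Console.WriteLine(string.Join(", ", batch));
}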

Back then, I ended up with two implementations, one that enumerated the entire list and then returned the sub-sequences in a deferred fashion, and a second one which worked fully deferred.

Unfortunately, the second method contained a bug that could cause elements to end up in the wrong sub-sequences – or worse – in certain use cases.

Analyzing the original implementation

The completely deferred implementation we ended up with was as follows.

public static IEnumerable<IEnumerable<T>> Batch<T>
    (this IEnumerable<T> query, int batchSize)
{
    using (var enumerator = query.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            // the enumerator is now positioned on the
            // first element of the next batch
            yield return enumerator.enumerateSome(batchSize);
        }
    }
}

static IEnumerable<T> enumerateSome<T>
    (this IEnumerator<T> enumerator, int count)
{
    // return the element the enumerator is positioned on, plus
    // up to count - 1 further elements, leaving the enumerator
    // on the last element returned so that the MoveNext call in
    // Batch finds the start of the next batch
    yield return enumerator.Current;
    for (int i = 1; i < count; i++)
    {
        if (!enumerator.MoveNext())
            yield break;
        yield return enumerator.Current;
    }
}

The public Batch method is the entry point of the algorithm. All it does is create the enumerator and yield return the sequences created by the enumerateSome method.

That method returns the element the enumerator is currently positioned on, followed by up to count - 1 further elements from the original collection. It leaves the enumerator on the last element it returned, so that the MoveNext call in Batch positions it on the first element of the next batch.

And therein lies the hitch: We use the same enumerator object for all the different batches.

This is no problem if we enumerate all batches fully and in order.

If we do not, however, we will get all sorts of strange re-ordering, duplication, and skipping of our elements, or even worse.

For example, consider the following code.

var integers = new int[] {1, 2, 3, 4, 5};
var batches = integers.Batch(3).ToList();

What we would expect is the list of batches to have two elements that would enumerate to {1, 2, 3} and {4, 5}. However, that is not the case.

If you like, this is the moment where you can try to figure out for yourself what is actually happening by stepping through the implementation above. Or just read on and let me tell you.

I actually did not realise how bad the result would be before writing this. Before, I thought we would only get batches with switched elements, and similar minor problems.

But instead…

What we actually get is a list of five lazy sequences, each of which will throw an exception when we try to enumerate it. The reason for this is that we never enumerate any of the batches before starting the next one – the call to ToList runs the Batch method all the way to the end – so the only enumeration that actually happens is inside Batch itself, which therefore creates one batch for each element.

The reason that all the batches throw is that they do not actually store which part of the sequence to return; they simply return whatever element the shared enumerator is currently at. And at this point we have already enumerated the entire sequence, so there is nothing left to return, and Current throws an InvalidOperationException.
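
To make this concrete, here is a small sketch of the failure, using the deferred implementation from above:

var integers = new int[] {1, 2, 3, 4, 5};
var batches = integers.Batch(3).ToList();

// we get five batches, not the expected two
Console.WriteLine(batches.Count); // 5

// the shared array enumerator is already past the end of the
// array, so trying to realise any of the batches throws an
// InvalidOperationException
var first = batches[0].ToList();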

So that is bad.

Fixing the problem

Unfortunately, the cause of this problem was my eagerness to make the method entirely deferred. As it turns out, this is not possible. (While I did not find an analytical proof, I am very sure about that. If I am wrong, please let me know. I would be delighted!)

As mentioned above, the issue lies with how enumerateSome never actually enumerates any elements – unless we enumerate the batch right away, before continuing to enumerate the sequence of batches.

The only way of fixing this that I can see is to change the method to actually enumerate the entire batch into a collection, and to return that instead.

There are a number of ways of going about that, but the easiest is probably the following.

static IEnumerable<T> enumerateSome<T>
    (this IEnumerator<T> enumerator, int count)
{
    // start with the element the enumerator is positioned on,
    // then take up to count - 1 further elements
    var list = new List<T>(count) { enumerator.Current };
    for (int i = 1; i < count; i++)
    {
        if (!enumerator.MoveNext())
            break;
        list.Add(enumerator.Current);
    }
    return list;
}

This will enumerate the entire batch into a list, and return that list, instead of the deferred enumerator.
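
With this change in place, the example from before behaves as we wanted (a quick check):

var integers = new int[] {1, 2, 3, 4, 5};
var batches = integers.Batch(3).ToList();

Console.WriteLine(batches.Count); // 2
Console.WriteLine(string.Join(", ", batches[0])); // 1, 2, 3
Console.WriteLine(string.Join(", ", batches[1])); // 4, 5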

If we like, we can still give the method the appearance of using deferred execution, and at the same time hide our internal list, by returning the batch through a small iterator helper instead of returning it directly. The buffering itself has to stay outside the iterator method: an iterator block does not run until it is first enumerated, so moving the buffering into it would bring back exactly the bug we are trying to fix.

static IEnumerable<T> enumerateSome<T>
    (this IEnumerator<T> enumerator, int count)
{
    // buffer the batch eagerly, exactly as before
    var list = new List<T>(count) { enumerator.Current };
    for (int i = 1; i < count; i++)
    {
        if (!enumerator.MoveNext())
            break;
        list.Add(enumerator.Current);
    }
    // hide the list behind an iterator
    return list.enumerate();
}

static IEnumerable<T> enumerate<T>(this IEnumerable<T> sequence)
{
    foreach (var item in sequence)
    {
        yield return item;
    }
}

We could make further micro-optimisations by using an array instead of a list, but this is unlikely to make a difference in any but the most performance-critical use cases. And in those cases, going with deferred execution may not be the right thing to do in the first place.
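
For completeness, such an array-based variant could look something like the following sketch. Note that we have to remember how many elements we actually stored, since the last batch may not fill the array.

static IEnumerable<T> enumerateSome<T>
    (this IEnumerator<T> enumerator, int count)
{
    // buffer the batch eagerly into a fixed-size array
    var items = new T[count];
    items[0] = enumerator.Current;
    var stored = 1;
    while (stored < count && enumerator.MoveNext())
    {
        items[stored++] = enumerator.Current;
    }
    return items.enumerate(stored);
}

static IEnumerable<T> enumerate<T>(this T[] items, int stored)
{
    // only yield the elements we actually stored
    for (int i = 0; i < stored; i++)
    {
        yield return items[i];
    }
}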

In any case, all of these replacements of enumerateSome work fine and solve the problem in the original code.
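
In particular, we can now hold on to the batches and enumerate them in any order, which is exactly what broke the original implementation:

var batches = new[] {1, 2, 3, 4, 5}.Batch(3).ToList();

// enumerating the batches out of order now works as expected
Console.WriteLine(string.Join(", ", batches[1])); // 4, 5
Console.WriteLine(string.Join(", ", batches[0])); // 1, 2, 3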

Conclusion

I hope you found this analysis of a severe problem in my original algorithm, as well as the search for an appropriate solution, interesting.

If you did, make sure to share this post on your favourite social media.

In either case, let this serve as proof that one should not blindly believe everything one reads on the internet – no matter the intentions of the author – though I doubt this was a contested statement to begin with.

Enjoy the pixels!
