LINQ Extensions 3: Batch into sub-sequences

After last weeks post on extracting elements out of a list by minimum or maximum keys Ody Mbegbu mentioned on Google+ how he feels that something LINQ is missing is the functionality to batch, page, or divide a sequence into sub-sequences of a given size.

That is what we are going to look at today!

In fact, Ody already posted a method doing what we want:

``````static IEnumerable<IEnumerable<T>> Batch<T>(
this IEnumerable<T> query, int batchSize)
{
int count = query.Count();
int batchCount = count <= batchSize
? 1
: (count % batchSize == 0
? (count / batchSize)
: (count / batchSize + 1)
);

for (int i = 0; i < batchCount; i++)
yield return query
.Skip(i * batchSize)
.Take(batchSize);
}
``````

We will take this method as a starting point, and explore if and how it can be improved.

A slight modification

First, note how code calculating the number of batches is quite complicated. It handles three different cases:

1. less than one full batch;
2. last batch is exactly full;
3. last batch is only partly full.

The method will always return the minimum number of batches needed to contain all elements, with the exception of an empty input. In that case, a sequence containing an empty sequence will be returned.

While that may be desired behaviour, this is something I will change. Doing so will allow us to simplify both this and the rest of the code easier. Of course, we could always re-add the special case, if we like.

In my version, the method will simply return an empty sequence for an empty input.

Calculating number of batches

With that out of the way, note how the number of batches will always be the length of the input, divided by the batch size, rounded up.

Unfortunately, there is no such thing as rounding up with integer division. We can easily simulate it however. All that is needed is to see that rounding up really means to always add 1 to the result, except for when the division has no remainder. Instead of deciding between the two cases using a conditional statement however, we can add `(batchSize - 1)` before dividing, which results in exactly the desired behaviour:

``````var batchCount = (count + batchSize - 1) / batchSize;
``````

Analysing runtime

The above is really just a minor quibble however.

The real meat of this post is looking at the runtime of the presented method.

Note how it enumerates the input once to determine its count, and then once more for each batch.

That means that the method’s runtime is O(n + nk) with n the size of the input and k the number of batches. The latter is really a linear function of n, so the worst case performance of the method is O(n + nO(n)) or O(n2).

This seems less than optimal. After all, in the end we return each item of the input only a single time – albeit differently organized.

Optimising runtime – from square to linear

The former intuition is in fact correct. We can make the method run in linear time easily enough. All we have to do is make sure we do not enumerate the entire list, every time we `Skip()` the previous batches.

As a first step, we can convert out input into a `List<T>` using LINQ’s `ToList()` which we can then access directly using indices:

Unfortunately, as it turns out, the problematic `Skip()` not actually optimised to take advantage of the collection type. We could easily enough write out own optimised version, however, if we like:

``````static IEnumerable<IEnumerable<T>> Batch<T>(
this IEnumerable<T> query, int batchSize)
{
var asList = query.ToList();
var batchCount = (asList.Count + batchSize - 1)
/ batchSize;

for (int i = 0; i < batchCount; i++)
yield return query
.SkipOnList(i * batchSize)
.Take(batchSize);
}

static IEnumerable<T> SkipOnList<T>(
IList<T> list, int count)
{
for (int i = count; i < list.Count; i++)
yield return list[i];
}
``````

This method now does in fact run in guaranteed linear time.

Examining deferred execution

Another topic to consider is whether our method should execute immediately, or defer until we actually need the result.

The original method was entirely deferred – except for one thing: It determined the number of batches on its first call. That means that if the collection should change before we are done with our query, we might get unexpected results. We might miss elements that fall into additional batches, or we might get any number of empty batches in the end. Even worse, we might skip elements, or get them more than once, if elements are inserted or removed.

Our improved implementation does not have these problems anymore. The result will contain exactly those elements that were in the sequence when we called the method. In that way it is executed immediately.

The result of the method is deferred however. Double so in fact. Both the outer for loop, as well as the skip-take combination use `yield return` to defer execution.

Implementing total deferred execution

While the previous implementation is entirely adequate, I thought it would be nice to see the method implemented in an completely deferred way.

To make sure we do not end up with more code than we need, let us start completely fresh by specifying in pseudo code what we want our method to do:

``````while(query is not empty)
{
return up to batchSize items from query as IEnumerable
}
``````

That is really all we want to do. No calculations or other complications are required.

The way we can implement this is by using only a single `IEnumerator<T>`, that we share between the logic that splits into batches, and the logic that enumerates each batch.

The first step is to get the enumerator, and loop over it, until there are no more elements left:

``````using(var enumerator = query.GetEnumerator())
{
while (enumerator.MoveNext())
{
// return batchSize items here!
}
}
``````

Note how we use the `using` statement to make sure the enumerator will be appropriately disposed in case of an exception.

We have multiple options for implementing the loop’s body. For example, we could create a `List<T>` which we fill with our items, and then return and switch out with a new list, every time it reaches the maximum batch size.

EDIT: Unfortunately, the implementation below is severely broken. I did not notice this until well after implementing it. Do NOT use the code below!

It is in fact not possible to implement this part in a deferred fashion, and so the solution mentioned above (enumerating the batch into a list or array) has to be used.

For a corrected version, see this post, which also analyses and explains the problem with the code below.

This would again result in a not fully deferred solution. So instead we will write a helper method which allows us to return a deferred `IEnumerable<T>` from an enumerator, with a maximum number of elements:

``````static IEnumerable<T> enumerateSome<T>
(this IEnumerator<T> enumerator, int count)
{
for (int i = 0; i < count; i++)
{
yield return enumerator.Current;
if (!enumerator.MoveNext())
yield break;
}
}
``````

Note how we call `MoveNext()` after yielding the first element of the batch. The reason for this is that we already called it once in our calling method, to make sure there are any elements to batch in the first place. Other than that, the method is rather straight forward.

All in all, we end up with the following solution:

``````public static IEnumerable<IEnumerable<T>> Batch<T>
(this IEnumerable<T> query, int batchSize)
{
using(var enumerator = query.GetEnumerator())
{
while (enumerator.MoveNext())
{
yield return enumerator.enumerateSome(batchSize);
}
}
}

static IEnumerable<T> enumerateSome<T>
(this IEnumerator<T> enumerator, int count)
{
for (int i = 0; i < count; i++)
{
yield return enumerator.Current;
if (!enumerator.MoveNext())
yield break;
}
}
``````

This solution is completely deferred – on both levels – and has linear runtime as well.

Conclusion

Starting out with a quick implementation of a method to batch or page a sequence into equal parts, I hope this exploration of how to analyse and optimise a method like this has been interesting.

We ended up with what I consider a rather elegant – even pretty – implementation that I hope will come in handy for some.

Enjoy the pixels!

1. Benny Tordrup says:

Not quite sure that I follow the use of the function. Could you provide an example?

• Paul Scharf says:

Hey Benny,

It is a bit of a situational thing for sure.

However, there are a number of ways in which the function can come in handy.
An example would be for dividing a large number of items into pages, maybe you only want to show 10 items on each page.
Another case might be that you may want to split up a large number of independent tasks, either to be run on different threads, or to be run one batch at a time, without blocking the UI.