LINQ: merging collections with Concat, Zip, and Join

Continuing my series of posts on LINQ, today I want to write about a few of the LINQ extension methods that take multiple input collections and return a single one.

Specifically, I want to take a look at the following methods: Concat(), Zip() and Join(). These all take two input collections, and combine their elements into a single resulting collection in different ways.

Concat

Concat() is easily the simplest of the methods we will be looking at today.

As its name suggests, this method concatenates two sequences into a single one. It simply appends the second sequence to the first, so we can enumerate both, one after the other, in a single loop.

Here are a few examples.

var a = new int[] { 1, 2 };
var b = new int[] { 3, 4 };

foreach (var n in a.Concat(b))
{
    // enumerates 1, 2, 3, 4
}
foreach (var n in b.Concat(a))
{
    // enumerates 3, 4, 1, 2
}
foreach (var n in a.Concat(b).Concat(a))
{
    // enumerates 1, 2, 3, 4, 1, 2
}

Note that, like most LINQ methods, Concat() uses deferred execution: it only starts enumerating each input sequence when we access the first element of that sequence.
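To make the deferred execution visible, here is a small sketch. The Logged() helper is hypothetical and only exists for this illustration: it writes to a log when its enumeration actually starts. We stop the loop early, so the second sequence should never be touched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<string>();

// Hypothetical helper for illustration: records when enumeration really begins.
IEnumerable<int> Logged(string name, params int[] values)
{
    log.Add("start " + name);
    foreach (var v in values)
        yield return v;
}

// Building the query enumerates nothing, so the log is still empty here.
var query = Logged("a", 1, 2).Concat(Logged("b", 3, 4));
var entriesAfterBuild = log.Count;

foreach (var n in query)
{
    // Stop before the first sequence is exhausted.
    if (n == 2) break;
}

// Only "start a" was logged; sequence "b" was never started.
Console.WriteLine(string.Join(", ", log));
```

If we let the loop run to the end instead, "start b" would be logged only after the last element of the first sequence has been returned.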

Naturally, this also means that modifications to the input collections will be reflected when enumerating afterwards.

var a = new int[] { 1, 2 };
var b = new int[] { 3, 4 };

var query = a.Concat(b);

foreach (var n in query)
{
    // enumerates 1, 2, 3, 4
}

a[0] = 0;

foreach (var n in query)
{
    // enumerates 0, 2, 3, 4
}

It is interesting to note that Concat() is in a way a special case of the SelectMany() LINQ extension method. However, that is a topic for another post.
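Without going into detail, a minimal sketch of that relationship could look like this: we put both arrays into an outer array and flatten it with SelectMany(). This is just one way to show the equivalence, not how Concat() is implemented.

```csharp
using System;
using System.Linq;

var a = new int[] { 1, 2 };
var b = new int[] { 3, 4 };

var viaConcat = a.Concat(b);

// Flattening a "sequence of sequences" yields the same elements in the same order.
var viaSelectMany = new[] { a, b }.SelectMany(sequence => sequence);

Console.WriteLine(viaConcat.SequenceEqual(viaSelectMany)); // prints True
```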

Zip

Zip() is somewhat more complicated than Concat() and also behaves quite differently.

We can think of it somewhat like a physical zipper. However, instead of interleaving elements (alternating between the two input collections), it takes one element from each input collection and merges the two into one output value.

To do this, we need to supply not only the two sequences, but also a Func<TFirst, TSecond, TResult>, where TFirst and TSecond are the element types of the input sequences, and TResult is the element type of the output.

var a = new int[] { 1, 2 };
var b = new int[] { 3, 4 };

foreach (var n in a.Zip(b, (x, y) => x + y))
{
    // enumerates 4 (1 + 3), 6 (2 + 4)
}

var strings = new string[] { "..", "..." };

foreach (var n in a.Zip(strings, (x, s) => x + s.Length))
{
    // enumerates 3 (1 + 2), 5 (2 + 3)
}

If the two input sequences are of different lengths, Zip() will stop returning values once one of the inputs reaches its end. In other words, the output will always have as many elements as the shorter input sequence.

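Unsurprisingly, Zip() is also implemented using deferred execution and will only enumerate its input sequences as far as it has to. It stops as soon as either input runs out, so it will not walk through the remaining values of a longer input collection (at most, it may fetch a single extra element from the first sequence before noticing that the second one has ended).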
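Here is a short sketch of both points. The first part zips inputs of different lengths. The second part zips against an endless sequence, using a hypothetical NaturalNumbers() helper written just for this example, which only terminates because Zip() stops when the shorter input ends.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var shortInput = new[] { 1, 2 };
var longInput = new[] { 10, 20, 30, 40 };

// Two results only: 11 (1 + 10) and 22 (2 + 20). 30 and 40 are never paired.
var sums = shortInput.Zip(longInput, (x, y) => x + y).ToList();

// Hypothetical helper for illustration: an endless sequence 0, 1, 2, ...
IEnumerable<int> NaturalNumbers()
{
    var n = 0;
    while (true)
        yield return n++;
}

var letters = new[] { "a", "b", "c" };

// Ends after three elements: "0:a", "1:b", "2:c"
var numbered = letters.Zip(NaturalNumbers(), (s, index) => index + ":" + s).ToList();

Console.WriteLine(string.Join(", ", sums));
Console.WriteLine(string.Join(", ", numbered));
```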

In addition, it will reflect changes to the input collections, as we would expect.

var a = new int[] { 1, 2 };
var b = new List<int> { 3, 4 };

var query = a.Zip(b, (x, y) => x + y);

foreach (var n in query)
{
    // enumerates 4 (1 + 3), 6 (2 + 4)
}

a[0] = 0;
b.Add(10);

foreach (var n in query)
{
    // enumerates 3 (0 + 3), 6 (2 + 4)
}

Join

Join() is the most complicated extension method we are going to look at today.

Those familiar with databases and query languages like SQL will be able to guess what this method does. However, we will take things slowly, to make sure we do not miss any important details.

The key idea behind the method is that it “correlates the elements of two sequences based on matching keys” (MSDN).

Conceptually, this means that it looks at all pairs of elements from the two input sequences (i.e. all elements of their Cartesian product) and keeps only those pairs whose two items share some specified key value.

Like Zip(), it will then apply a function to those pairs of elements, returning one new value per matching pair.
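To make that mental model concrete, here is a small sketch with made-up data: we join numbers with words whose length equals the number. The second query spells out the "Cartesian product, then filter" idea by hand. It only illustrates what Join() computes, and is not meant to suggest how Join() works internally.

```csharp
using System;
using System.Linq;

var numbers = new[] { 3, 5 };
var words = new[] { "one", "two", "three" };

// Join: the key of a number is the number itself, the key of a word is its length.
var viaJoin = numbers.Join(
    words,
    n => n,
    w => w.Length,
    (n, w) => n + ": " + w
    ).ToList();

// The same result by hand: build every pair, keep the matching ones, then merge.
var viaProduct = numbers
    .SelectMany(n => words, (n, w) => (Number: n, Word: w))
    .Where(pair => pair.Word.Length == pair.Number)
    .Select(pair => pair.Number + ": " + pair.Word)
    .ToList();

// Both contain "3: one", "3: two", "5: three"
Console.WriteLine(viaJoin.SequenceEqual(viaProduct)); // prints True
```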

For example, imagine we have a list of Players, each having a name and an integer id. Further, we have a list of Units with a name, that each belong to a player. The way that each unit keeps track of its player is by storing the player’s integer id.

class Player
{
    public string Name { get; set; }
    public int Id { get; set; }
}
class Unit
{
    public string Name { get; set; }
    public int PlayerId { get; set; }
}

Let us now say that we wanted to print a list of all units, outputting the unit’s name and the name of its player as follows:

Unit1, Player1
Unit2, Player1
Unit3, Player2
etc.

Of course, there are many different ways to do this. For example, we could create a dictionary from player id to player (if we like, even using LINQ), then loop over all units, retrieve each unit’s player, and print the information.

That code could look something like this:

var playersById = players.ToDictionary(p => p.Id);
foreach (var unit in units)
{
    var player = playersById[unit.PlayerId];
    Console.WriteLine("{0}, {1}", unit.Name, player.Name);
}

Overall, this really is not bad, and for a simple example like this I would probably even write exactly this kind of code.

For more complex situations, however, it can be nice to use Join() instead. If we rewrite our example, it looks as follows.

foreach(var names in units.Join(
    players,
    u => u.PlayerId,
    p => p.Id,
    (u, p) => Tuple.Create(u.Name, p.Name)
    ))
{
    Console.WriteLine("{0}, {1}", names.Item1, names.Item2);
}

I put each parameter of Join() on its own line, to make it easier to read.

Note how we have to give Join() two key-selector delegates that select the keys to match – in our case u.PlayerId and p.Id.

Further, just like with Zip(), we need to specify how to merge the two elements from the different input collections into one value. In this case we simply create a tuple to easily pass them on to the loop body.

At first glance, this may not seem clearer than the first approach above. However, consider that in this case we do not have to introduce any additional variables, nor create a dictionary manually, which makes it more likely that our code will be correct.

Further, Join() can handle a lot of cases that we did not deal with in our first implementation.

If we had multiple players with the same id, Join() would print additional name tuples for all units with that id, while our approach above would actually fail at runtime (ToDictionary() will throw an exception if there are duplicate keys). While that is a case we are unlikely to encounter in the specific example of units and players, we could easily find other examples where such duplicates occur a lot.
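A quick sketch of the duplicate case follows. To keep it short and self-contained, I use named tuples as stand-ins for the Player and Unit classes, and the data is made up.

```csharp
using System;
using System.Linq;

// Named tuples as stand-ins for Player and Unit; note the duplicate id 1.
var players = new[] { (Name: "Player1", Id: 1), (Name: "Player1b", Id: 1) };
var units = new[] { (Name: "Unit1", PlayerId: 1) };

// Join() returns one result per matching pair: "Unit1, Player1" and "Unit1, Player1b"
var lines = units.Join(
    players,
    u => u.PlayerId,
    p => p.Id,
    (u, p) => u.Name + ", " + p.Name
    ).ToList();

// The dictionary approach cannot even be set up with duplicate keys.
var dictionaryFailed = false;
try
{
    players.ToDictionary(p => p.Id);
}
catch (ArgumentException)
{
    dictionaryFailed = true;
}

Console.WriteLine(string.Join(" | ", lines));
Console.WriteLine(dictionaryFailed); // prints True
```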

Additionally, Join() copes with a unit that has no associated player: it simply does not produce any tuple for that unit.
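The same kind of sketch works for a unit without a player. Again, named tuples stand in for the Player and Unit classes, and the id 99 is simply an id that no player has.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Named tuples as stand-ins for Player and Unit; "Orphan" refers to a missing player.
var players = new[] { (Name: "Player1", Id: 1) };
var units = new[] { (Name: "Unit1", PlayerId: 1), (Name: "Orphan", PlayerId: 99) };

// Join() silently skips the unit without a match: only "Unit1, Player1" remains.
var lines = units.Join(
    players,
    u => u.PlayerId,
    p => p.Id,
    (u, p) => u.Name + ", " + p.Name
    ).ToList();

// The dictionary lookup from the first approach throws for the same unit.
var playersById = players.ToDictionary(p => p.Id);
var lookupFailed = false;
try
{
    _ = playersById[99];
}
catch (KeyNotFoundException)
{
    lookupFailed = true;
}

Console.WriteLine(string.Join(" | ", lines));
Console.WriteLine(lookupFailed); // prints True
```

Whether silently dropping such units is what we want depends on the situation, so it is worth keeping this behaviour in mind.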

We could add both of these features to our first implementation, but by then we would end up with significantly longer code than the LINQ solution allows.

Lastly, there is of course value in using common and well known functionality like LINQ over writing the code yourself, since other programmers will be able to see that you are performing a join at a glance, without having to read and understand the entire block of code.

Conclusion

Though simple, Concat() can be a very useful LINQ method. Zip() and Join() are somewhat more complicated, but by no means less useful.

All three of them – but especially the first two – are very specific in what they can do, so that one may not use them as often as methods such as Select() and Where(), but they are nonetheless powerful helpers to deal with merging of collections in different ways.

I hope this post has given you an overview of what these methods are about, and that you got an idea of what they can be used for.

Certainly let me know if you have any questions or if there are any other topics you would like me to cover in a future post.

Enjoy the pixels!
