Linq and set theory: Union() and Intersect()

So far, in my series of posts on LINQ we have covered only methods acting on single collections. Today we are going to look at some of the LINQ methods that take a second collection as parameter.

Specifically, we will at the classical set theory operators of union, intersection, and relative complement, which in LINQ are implemented with the Union() and Intersect() methods.

Element uniqueness

Before we can get started looking at these three LINQ extension methods, we first have to discuss element uniqueness, or distinctiveness in the context of sets and set theory.

We start with the definition of a set:

A set in set theory is a collection of well defined and distinct objects.

This means that a set never includes duplicates, or multiple equivalent elements. We will see below why this is a useful property.

Most types of collections of virtually any programming language do not model this constraint. For example, it is perfectly possible to have an array or a list of equal elements in C#.

var ones = new int[] { 1, 1, 1, 1 };

This is not a valid set.

As it happens, C# includes collection types that can be used to preserve this constraint, like HashSet or SortedSet, both implementing the ISet<T> interface.

This comes at a cost however, and while these can be very useful types, for most applications it is easier and more straight forward to use simple lists. In other cases, especially while working with LINQ, we may not have a choice but to only deal with simple IEnumerable<T>s.

At times however, we want to make sure to enforce the uniqueness property on a collection, effectively enumerating it while ignoring any additional time an element that we have already seen occurs.

LINQ has a method just for that purpose.

Distinct()

LINQ’s Distinct() method takes a collection and returns a deferred IEnumerable<T> that will return all distinct elements inside the original collection. In other words it will skip all occurrences of elements that have already been enumerated.

var numbers = new int[] { 1, 1, 2, 1, 3, 2 };

foreach (var n in numbers.Distinct())
{
    // enumerates 1, 2, 3
}

Note that the result has no guaranteed order. While in Microsoft’s .NET framework, the order remains the same as that of the source collection, this is not a requirement according to the specifications. Therefore we should not write code depending on the order being preserved.

Further, as so many LINQ extension methods, Distinct() is implemented using deferred execution. That means that any changes to the collection will be represented in following enumerations.

var numbers = new int[] { 1, 1, 2, 1, 3, 2 };

ver distinct = numbers.Distinct();

numbers[0] = 4;

foreach (var n in distinct)
{
    // enumerates 1, 2, 3, 4
}

numbers[1] = 5;

foreach (var n in distinct)
{
    // enumerates 1, 2, 3, 4, 5
}

Union()

In set theory, the union of two sets is another set containing all elements of either set. For example, the union of the sets {a, b} and {c, d} would be {a, b, c, d}.

Note that since we are looking at sets, duplicate elements only show up a single time in the result. For example, the union of the sets {a, b} and {c, a} is {a, b, c}.

LINQ’s Union() extension method implements just this operation. It takes two collections, and returns a new one combining all distinct elements from the two sources – no matter if elements have duplicates within either sequence.

var first = new int[] { 1, 1, 2 };
var second = new int[] { 2, 3, 4 };

foreach (var n = first.Union(second))
{
    // enumerates 1, 2, 3, 4
}

Unlike Distinct(), Union()‘s output is guaranteed to be ordered in such a way, that it will first enumerate the distinct elements in the first collection – in the same order as they occur there – and then enumerate any remaining elements from the second collection, again in the same order.

This means that Union() is not a commutative operation in the strictest sense. Switching the two collections could cause the result to enumerate in a different order. However, it is commutative in the sense that the resulting collection contains the same elements in either case.

var first = new int[] { 1, 1, 2 };
var second = new int[] { 2, 3, 4 };

foreach (var n = first.Union(second))
{
    // enumerates 1, 2, 3, 4
}
foreach (var n = second.Union(first))
{
    // enumerates 2, 3, 4, 1
}

Unsurprisingly, Union() is also implemented in a deferred fashion.

Intersect()

Intersection is another set operation taking two sets as input. Similar to the union, the resulting set contains elements of the input sets. However, it only contains those elements that are found in both of the input sets.

For example, the intersection of the sets {a, b} and {c, a} is {a}, since only a appears in both sets.

This is implemented in LINQ’s Intersect() method. Just like with Union(), this method also forces element uniqueness and will not return any duplicates, even if both input collections contain duplicate elements.

var first = new int[] { 1, 1, 2, 2 };
var second = new int[] { 2, 3, 4, 1, 1 };

foreach (var n = first.Intersect(second))
{
    // enumerates 1, 2
}

The order in which elements are returned is the one in which they are found in the first collection.

Again, this method is also implemented using deferred execution. However, it is noteworthy that once enumeration starts, the entire second list is enumerated at once before the first element is returned. The first collection is enumerated as the resulting sequence is consumed. (The official documentation gives different behaviour, however it seems to be incorrect.)

Note that similar to Union(), Intersect() is commutative in the sense that the returned collection contains the same elements in either case. However, the order may differ.

var first = new int[] { 1, 1, 2, 2 };
var second = new int[] { 2, 3, 4, 1, 1 };

foreach (var n = first.Intersect(second))
{
    // enumerates 1, 2
}
foreach (var n = second.Intersect(first))
{
    // enumerates 2, 1
}

The distinctiveness of the returned collections of both methods are the reason I spent time talking about Distinct() earlier. These operations are commonly used and well understood in set theory, which is likely the reason they were implemented to enforce the distinctiveness constraint. Any other implementation – for example allowing duplicates if both sets contain the same duplicate – may lead to serious confusion, and would also require a more complicated implementation.

Conclusion

In this post we took a look at the LINQ methods Union() and Intersect(), while also mentioning Distinct() to discuss the distinctiveness property of sets.

These represent some of the most commonly used operations in set theory, and having them implemented in LINQ can save a lot of worry and code, in a variety of circumstances.

I hope you found this useful, and in either case feel free to drop a comment below to let me know what you think, and what you would like me to write about next.

Enjoy the pixels!

Leave a Reply

2 comments

  1. Scott Westfall says:

    Thanks, I’m a weekend programmer moving from Pascal to C#. I was wondering if C# contained the set functionality of Pascal. Based on your article, it does, though maybe not quite as conveniently.

    • Paul Scharf says:

      I’m afraid I’m not very familiar with Pascal, so I might not be able to help much. But from what I can see from a quick Google search, Pascal seems to have some collection specific syntax that C# indeed does not have.

      What is shown in the post above is pretty much as concise and neat as collections can be defined and worked with in C#, though that also highly depends on what kind of collection we are talking about.

      I use mostly arrays in the post because they are so easy to instantiate, but in real code, I would for most things use specialized and higher level collection classes like List or Set.

      Hope that helps at all.