by skeet via Jon Skeet: Coding Blog on 9/3/2010 7:54:46 PM
Warning: this post is quite long. Although I've chosen a simple operator to implement, we'll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.
We're going to implement the "Where" clause/method/operator. It's reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It's generic, but only uses one type parameter (which is a big deal, IMO - the more type parameters a method has, the harder I find it to understand in general). Oh, and it's a starting point for query expressions, which is a bonus.
"Where" has two overloads:
Before I go into what it actually does, I'll point out a few things which are going to be common across nearly all of the LINQ operators we'll be implementing:
I fully expect that most readers will be comfortable with all of these concepts, so I won't go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you're likely to have a hard time.
The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you'll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won't.)
Now, a few important details about the behaviour:
Many of these points will be true for a lot of our other operators too.
The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.
Ideally, we'd like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we'll come back to those.
Let's have a look at some tests. First, here's a simple "positive" test - we're starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It's easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)
I've kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:
Basically, AssertSequenceEqual does what you'd expect it to - it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).
So far, so good. Now let's check argument validation:
I'm not bothering to check the name within the ArgumentNullException, but importantly I'm testing that the arguments are being validated immediately. I'm not trying to iterate over the result - so if the validation is deferred, the test will fail.
The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:
We'll need to do something similar for other operators, so I've written a small helper method in ThrowingEnumerable:
Now we can use that to check that Where really defers execution:
These tests have all dealt with the simpler predicate overload - where the predicate is only passed the item, not the index. The tests involving the index are very similar.
With all the tests passing when running against the real LINQ to Objects, it's time to implement it in our production code. We're going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details... or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free... but that can be a curse as well as a blessing, as we'll see in a minute.
At its heart, the implementation is going to look something like this:
Simple, isn't it? Iterator blocks allow us to write the code pretty much how we'd describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.
Lo and behold, some of our tests pass already. Now we just need argument validation. That's easy, right? Let's give it a go:
Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn't help... they're not getting hit. What's going on?
I've given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don't iterate over the result, so validation is never performed.
We've just hit a design flaw in C#. Iterator blocks in C# simply don't work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:
It's ugly, but it works: all our index-less tests go green. From here, it's a short step to implement the version using an index as well:
Now the bar is green, and we're done. Hang on though... we haven't used it every way we can yet.
So far, we've been calling the method directly (although as an extension method) - but LINQ also provides us with query expressions. Here's our "SimpleFiltering" test rewritten to use a query expression:
(Note that the name is different here to in the downloadable code, to stop the blog server software blocking the name of the method. Grr.)
That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet... but in this case we have a "no-op" select projection; we're not actually performing a real transformation. In that case - and so long as there's something else in the query, in this case our "where" clause - the compiler effectively omits the "select" clause, so it doesn't matter that we haven't implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.
The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility - it's how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there's nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions - which is how entirely separate interfaces such as IObservable<T> work just as well.
There's been a lot to take in here, in terms of both implementation and core LINQ principles:
Linq-To-Objects-2.zip
Many people have asked for a source repository for the project, and that makes sense. I'm putting it together a source repository for it now; it's likely to be done before I post the next part.
Original Post: Reimplementing LINQ to Objects: Part 2 - "Where"
The content of the postings is owned by the respective author. CSharpFeeds is not responsible for the contents of the postings. This site is automatically generated and cannot be reviewed for abusive content. If you find abusive content on CSharpFeeds, please contact us. Designated trademarks and brands are the property of their respective owners. All rights reserved.