by skeet via Jon Skeet's Coding Blog : C# on 1/4/2008 11:42:00 PM
Marc Gravell and I have now implemented a lot of LINQ standard query operators on the "push" model of IDataProducer as opposed to the "pull" model of IEnumerable. My good friend Douglas Leeder (who doesn't use C#) has been with me this weekend, and through explaining the "big picture" to him in various ways, and taking his feedback, I think I've now got a good way of communicating it. Voting.
It's a "real life analogy" which is always dangerous - don't think of it too literally - I'm not claiming that it's meant to be an absolute 1:1 correspondence. However, I think it neatly demonstrates the problem and some of the benefits of the solution we've come up with.
In order to make all this concrete, all of the code is real, and can be downloaded as a zip of a VS2008 solution. It contains a binary of an unreleased version of MiscUtil which is where the DataProducer stuff currently lives.
Let's suppose we're trying to find out what the favourite colour is of everyone in the world. Now, for the purposes of the demo code, there are only four colours and six people in the world: that makes the diagrams nice and easy, and we can see results simply too. Extending the data to the rest of the real world is left as an exercise to the reader. We may also want additional information, such as the average ages of people voting for particular colours.
Here's our complete sample data - the five members of my family, and Douglas:
Note how the colours are specified with variations of case. We'll use that later as an example of why you might need to specify a "key comparer".
There are various ways of implementing this in LINQ, and for each model we'll provide code and think about how it would work in the real world.
This is the model which "normal" LINQ uses - you only ever pull data, using an IEnumerable<T>. Here's a simple query expression which gives the answers we want (admittedly unordered - we'll ignore that for now):
There are two problems here.
Firstly, we're using ToUpper() to get round the "RED != red" problem. This is not only bad in terms of internationalisation, but it also loses data. We really want to get the original string as the key, and then use a case-insensitive comparer. We can do this by a manual call to GroupBy instead of using query expressions - there's an overload which takes an IEqualityComparer.
Secondly, the result of the "group ... by" keeps all the voter data temporarily. It has to all be available at the same time before the "select" kicks in. This runs counter to the normal "streaming" idea of LINQ. This is inherent in the nature of the "pull" model, as I'll explain in a minute.
Now, let's see what this looks like in the real world. People come into a room through a door, and a "grouper" asks them for their favourite colour. The grouper then tells each voter (immediately) which corner of the room to stand in. The result at the end of the grouping is this:
After the initial grouping, another person goes to each group in turn, finding out their key and doing a head count. That group is then free to go. The important thing is that this person can't do their job until all the data is in, because they've got to be able to see everyone in order to count them.
The fact that we used "group voter by ..." meant that the result of the grouping still involved whole people. As we're only going to do a head count, we only need something saying "There was a person here." We can change our original query to do that quite easily:
This time, after the grouping takes place, the room looks like this:
The use of 1 here is purely incidental: we could have used 'group "Spartacus" by ...' and the results would have been the same. It's just something which can be counted.
Now, there's good and bad here:
The problem with the pull model is that each aggregator always wants to be the only thing pulling. The call to MoveNext will block until more data is available. That's a real problem when you want to have multiple aggregators (one vote counter per colour). We could do a complicated threading manoeuvre, with each colour getting its own thread and the "grouper" pushing items out to relevant threads. Again though, that doesn't scale - the four extra threads in our example aren't too bad, but imagine other groupings with potentially thousands of keys.
The alternative is to change the model. Instead of having a greedy aggregator pulling data, we change to aggregators who observe data being pushed past them, and also observe a special "all the data has now been pushed" signal. Before we look at the code to do this, let's think about what it could be like in real life. We don't know how many different colours will be voted on, but we know what we need to do with each one: count the number of votes for them. In detail, the situation would be something like this:
We never have more than one voter in the room at once:
Let's have a look at the code involved now.
There are two sides to the code here: the code that the LINQ user has to write, and the code Marc Gravell and I have implemented. We'll look at the client code in a few different scenarios.
Keeping to the normal "start with a data source, do something, then select" model involves stepping away from query expressions. We've got a new extension method on IEnumerable<T> called GroupWithPipeline, which takes a key selector (just like the normal GroupBy) and what to do with the results of each grouping. Here's the new code (which requires a using directive for MiscUtil.Linq.Extensions):
IEnumerable<T>
GroupWithPipeline
GroupBy
MiscUtil.Linq.Extensions
How about making this a bit smarter now? Let's try to also work out the minimum and maximum ages of the voters for each colour. Conceptually this is just a case of adding extra observers along with each vote counter in the "real life" model above. The code is remarkably simple:
The fact that it uses "Value1", "Value2" and "Value3" isn't ideal, but unfortunately there's no way round that as far as we've worked out - for this part.
GroupWithPipeline uses a few types internally which you can use directly instead: DataProducer (implementing IDataProducer) and Future (implemeting IFuture). If I go into the details here, I'll never get this posted - but that may come into another post if there's enough interest. However, let's have a look at how it can be used. First, let's find the results of a few aggregates of our voters, this time without any groupings:
The output of the code is what you'd expect, but there are a few things to note:
IFuture<int>
int
Value
Count
ProduceAndEnd
Select
Max
Where
This part wouldn't have been available when I started writing this post. I hadn't quite realised the power of the pattern yet. By implementing GroupBy on IDataProducer, we can perform the original grouping in a query expression, in a pretty normal kind of way... except that this time we can apply multiple aggregates, never buffering the data beyond the results of the aggregation:
IDataProducer
There's just one tricky bit in here - you must call AsEnumerable before the data is produced, otherwise the aggregators will stream all their data with nothing watching for the results. In fact, AsEnumerable builds a list internally - the final results are buffered, but only those results. There's really not a lot that can be done about that.
AsEnumerable
So, there we go. That may or may not be a bit clearer now. I'm still learning both the power of the pattern, its potential uses, and the best ways of explaining it. Feedback is very welcome, both on the technical front and about the explanation. I'm absolutely convinced that it's a useful pattern in some situations (though not all). All the code will be released as part of MiscUtil eventually, of course - we're still tidying it up and producing a bit more documentation at the moment.
Original Post: "Push" LINQ revisited - next attempt at an explanation
The content of the postings is owned by the respective author. CSharpFeeds is not responsible for the contents of the postings. This site is automatically generated and cannot be reviewed for abusive content. If you find abusive content on CSharpFeeds, please contact us. Designated trademarks and brands are the property of their respective owners. All rights reserved.