CSharpFeeds - All your C# feeds in one place.

Sponsors

Sunday, September 20, 2009

Efficient "vote counting" with LINQ to Objects - and the value of nothing

by skeet via Jon Skeet: Coding Blog on 9/20/2009 8:48:00 PM

EDIT: Please ignore this post for the moment. It's wrong. I'll leave up the incorrect version for the moment so you can see what it will be about, but it doesn't quite work. The size of Empty ends up being 1, annoyingly enough. I'll edit this post properly when I get the chance.

A long time ago, I'm pretty sure I started writing Push LINQ in response to a request to effectively count votes. To recap the example in the linked blog post, I wanted to work out the most popular colours by taking a bunch of people, asking them for their favourite colours, and then counting the votes.

This is easy to express in LINQ to Objects, with a query like this:

var query = from voter in Voter.AllVoters()
            group voter by voter.FavouriteColour into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

This has the unfortunate side-effect of keeping references to all the people until the query has completed. In a streaming scenario, this is definitely not desirable. The workaround - until I thought of the quirk which is the subject of this blog post - is to just keep a constant value for each person; we only need to know which way they "voted", after all:

var query = from voter in Voter.AllVoters()
            group 1 by voter.FavouriteColour into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
     Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

This unfortunately still means keeping a value for each person. We'll end up with an awful lot of copies of "1" in memory in this case. But we can do better! Let's introduce a single extra line of code (outside the method with this implementation) and change the grouping slightly...

public struct Empty {}
...
var query = from voter in Voter.AllVoters()
            group new Empty() by voter.FavouriteColour into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
     Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

I don't know why I hadn't thought of it before. We don't really want to save any data at all about the individual votes, other than that they happened. So let's not store any. The great thing about the Empty struct is that it takes up no space at all when you create an array of N of them, however big N is. There's the array overhead of just "here's the size of the array" (zero bounds are assumed for "vectors" in the CLR) but that's it. When LINQ to Objects has to create a grouping for each colour, it's incredibly cheap to do so. Oh, every so often it will need to resize an array, but should be extremely cheap too, so long as it ends up as a bit blit somewhere - after all, copying 0 bytes is pretty quick. (This is the bit that doesn't work. We end up with 1 byte per value, annoyingly.)

Note that it has to be a struct to give the benefit. We don't want references to a singleton instance or anything like that. That would basically be like the second case, but even worse on a 64-bit CLR. We want a value which takes no memory at all.

The beautiful thing is that because the internal class used to implement IGrouping<TKey, TValue> implements IList, even the Count() operation will be O(1). Basically we have an implementation which should be as efficient as Push LINQ, but without all the complexity of learning a new concept.

I'm sure there are improvements which could be made - overriding Equals etc, but they wouldn't have any benefit in this particular case.

This is not to say that Push LINQ is redundant, of course - it just happens to be avoidable in this particular case. I suspect there may be other times where Empty could be useful, too. It's a tricky worth remembering...

email it!bookmark it!digg it!

Original Post: Efficient "vote counting" with LINQ to Objects - and the value of nothing

Subscribe

New Feed

Product Spotlight

Recently Updated Sources

Legal Note

The content of the postings is owned by the respective author. CSharpFeeds is not responsible for the contents of the postings. This site is automatically generated and cannot be reviewed for abusive content. If you find abusive content on CSharpFeeds, please contact us. Designated trademarks and brands are the property of their respective owners. All rights reserved.

Advertise with us