I had not seen a good .NET performance challenge in a while so I figured that I would start one. Obviously the real benefit will come in about two weeks when all of these items are analyzed and discussed (I wager everyone involved will learn something!). If this works out well I will try to run these either bi-weekly or monthly. Let me know your opinion in comments below.
The Prizes
Aside from becoming alpha nerd for a few weeks, the winner of this contest will receive a book from the very well known computer scientist Donald E. Knuth. Just to give things an interesting twist the book has absolutely nothing to do with optimizations. The prize is a copy of Things a Computer Scientist Rarely Talks About, it is the transcripts of Knuth’s 1999 lecture series at MIT discussing the relations between faith and science. If you wish to watch the web casts of these lectures you still can here.
The runner up will receive a copy of Donald Knuth’s TAOCP V4F3 “Generating all Combinations and Partitions” as this book will help them win a future contest.
There will also be a prize TBD for the fastest safe version of the code.
I will do a submission but I cannot win a prize.
The Problem
Wc (word count) is an old unix program. One of its main tasks is to go through a set of data and count the number of repetitions of a given word. For instance given the sentence “Hello, yes hello” it would tell you that the word Hello was used twice and that the word yes was used once. While seemingly trivial this problem offers many interesting aspects such as dealing with the trade off of memory usage for large vs small data sets.
Hint: since it is a classic problem there is a lot of research on it
Here is a naive version of a word counter with a test harness if you can’t beat this you need more optimization J
|
class Program {
static Dictionary<string, int> WordCounts = new Dictionary<string, int>();
static void Main(string[] args) {
string test = "The quick brown fox jumps over the lazy dog";
Array.ForEach<string>(test.ToUpper().Split(' '), new Action<string>(AddToHash));
foreach (string current in WordCounts.Keys) {
Console.WriteLine(current + ":" + WordCounts[current]);
}
}
static void AddToHash(string _Item) {
if (!WordCounts.ContainsKey(_Item)) {
WordCounts.Add(_Item, 0);
}
WordCounts[_Item] += 1;
}
}
|
|
Listing 1: Example of a naive word counter |
Some notes.
1) Capitalization differences do not equate to a difference.
2) You must handle normal punctuation properly there will be no apostrophes/quotes in the test data as they bring up many questions on their own.
3) Code will be run on a 32 bit processor (don’t use 64 bit stuff hoping for a 64 bit processor)
4) My processor is a dual core (hint)
5) You must use the IWordCounter interface
6) You do not need to sort your data before returning
7) If you do not pass unit tests performance tests will not be run.
UPDATE: here are some test strings to show ... the quick sample code meets some but not all of these
"Hello! Greg" results in Hello 1, Greg 1
"Hello!Greg" results in HelloGreg 1
"Hello\nGreg" results in Hello 1, Greg 1
"A.D.D." results in ADD 1
"Wow, how great!" results in Wow 1, how 1, great 1
"wow \n\n\n great" results in wow 1, great 1
"it was a man-eating shark." results in it 1, was 1, a 1, man-eating 1, shark 1
numbers can be a bit odd .. although the data has no numbers in it I will describe the behavior (I will also show how to avoid this in a parser if you wanted to). If you wanted to you could support a special case for numbers but I have NO test data with numbers in it .. this is how they will turn out if you follow the general rule above of ignoring punctuation.
1,000,000 results in 1000000 1
$1,000,000 results in $1000000 1
$ 1,000,000 results in $ 1, 1000000 1
"1.2 1.20 120" results in 12 1, 120 2
Submission Process
Submissions must use the IWordCounter interface provided here, this is to help me by making unit/performance testing a bit easier.
|
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApplication39 {
struct WordEntry {
public string Word;
public int Count;
}
interface IWordCounter {
WordEntry[] CountWords(string _Text);
}
class WordComparer : IComparer<WordEntry> {
public int Compare(WordEntry x, WordEntry y) {
return x.Word.CompareTo(y.Word);
}
}
class WordCounter :IWordCounter {
Dictionary<string, int> WordCounts = new Dictionary<string, int>();
private void AddToHash(string _Item) {
if (!WordCounts.ContainsKey(_Item)) {
WordCounts.Add(_Item, 0);
}
WordCounts[_Item] += 1;
}
public WordEntry [] CountWords(string _Text) {
List<WordEntry> Words = new List<WordEntry>();
Array.ForEach<string>(_Text.ToUpper().Split(' '), new Action<string>(AddToHash));
foreach (string current in WordCounts.Keys) {
WordEntry w;
w.Word = current;
w.Count = WordCounts[current];
Words.Add(w);
}
return Words.ToArray();
}
}
class Program {
static void Main(string[] args) {
string test = "The quick brown fox jumps over the lazy dog";
IWordCounter Counter = new WordCounter();
WordEntry [] Counts = Counter.CountWords(test);
Array.Sort(Counts, new WordComparer());
Array.ForEach<WordEntry>(Counts, new Action<WordEntry>(PrintEntry));
}
static void PrintEntry(WordEntry _Entry) {
Console.WriteLine(_Entry.Word + " " + _Entry.Count);
}
}
}
|
|
Listing 2: Naïve example using required interface |
Submissions are to be EMAILED to me at Gregory<insertmylastname>1@gmail.com before midnight (EST) on Aug 10th, I will have results ready by the 13th. Do not post your submission in comments or I will be forced to delete it. I will post when I receive entries so you can be sure you entry has actually reached me. Also it is not your best interest to do this as someone could come through and optimize one line of your code to beat you :P
I will the day that the contest ends create a post including performance results for all code. I will also post a zip file of all submissions (including my performance harness & unit tests so others can view it).
Scoring
3 sets of data will be run through the submissions each will be run many times with a new object constructed for each iteration. There will be one very small set of data, one medium set of data and one large set of data. Each of these sections is worth the same number of points.
1. 10 points
2. 8 points
3. 7 points
4. 6 points
5. 5 points
The submission with the most combined points from the three sections will win. If there is a point based tie, the entry that did better on section three (the large set) will win.
Why three datasets? Optimizations for the largest data set often come at the expense of the smallest, what is sought after is the best all around solution.
Additional Rules
1) Only one submission per person to be evaluated, you can submit multiples for discussion if you come across an interesting tidbit that you think is worth sharing
2) Items will be run on the MS 2.0.50727 CLR (there can be no tricks such as adding IL instructions to mono)
3) You may not call out to unmanaged code!
4) You CAN use unsafe code
5) No you can’t know what the data is beforehand.
6) You must pass unit tests in order to be included (what good is fast if it doesn’t work?)
If you have any questions, need clarification, or have thoughts on the overall idea of these competitions please post them here.