CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Greg Young [MVP]


Performance Challenge: #1 Word Count

I had not seen a good .NET performance challenge in a while so I figured that I would start one. Obviously the real benefit will come in about two weeks when all of these items are analyzed and discussed (I wager everyone involved will learn something!). If this works out well I will try to run these either bi-weekly or monthly. Let me know your opinion in comments below.

The Prizes

Aside from becoming alpha nerd for a few weeks, the winner of this contest will receive a book from the very well known computer scientist Donald E. Knuth. Just to give things an interesting twist the book has absolutely nothing to do with optimizations. The prize is a copy of Things a Computer Scientist Rarely Talks About, it is the transcripts of Knuth’s 1999 lecture series at MIT discussing the relations between faith and science. If you wish to watch the web casts of these lectures you still can here.

The runner up will receive a copy of Donald Knuth’s TAOCP V4F3 “Generating all Combinations and Partitions” as this book will help them win a future contest.

There will also be a prize TBD for the fastest safe version of the code.

I will do a submission but I cannot win a prize.

The Problem

Wc (word count) is an old unix program. One of its main tasks is to go through a set of data and count the number of repetitions of a given word. For instance given the sentence “Hello, yes hello” it would tell you that the word Hello was used twice and that the word yes was used once. While seemingly trivial this problem offers many interesting aspects such as dealing with the trade off of memory usage for large vs small data sets.

Hint: since it is a classic problem there is a lot of research on it

Here is a naive version of a word counter with a test harness if you can’t beat this you need more optimization J

    class Program {

        static Dictionary<string, int> WordCounts = new Dictionary<string, int>();

        static void Main(string[] args) {

            string test = "The quick brown fox jumps over the lazy dog";

            Array.ForEach<string>(test.ToUpper().Split(' '), new Action<string>(AddToHash));

            foreach (string current in WordCounts.Keys) {

                Console.WriteLine(current + ":" + WordCounts[current]);

            }

        }

        static void AddToHash(string _Item) {

            if (!WordCounts.ContainsKey(_Item)) {

                WordCounts.Add(_Item, 0);

            }

            WordCounts[_Item] += 1;

        }

    }

 

Listing 1: Example of a naive word counter

 

Some notes.

1)      Capitalization differences do not equate to a difference.

2)      You must handle normal punctuation properly there will be no apostrophes/quotes in the test data as they bring up many questions on their own.

3)      Code will be run on a 32 bit processor (don’t use 64 bit stuff hoping for a 64 bit processor)

4)      My processor is a dual core (hint)

5)      You must use the IWordCounter interface

6)      You do not need to sort your data before returning

7)      If you do not pass unit tests performance tests will not be run.

UPDATE: here are some test strings to show ... the quick sample code meets some but not all of these

"Hello! Greg" results in Hello 1, Greg 1
"Hello!Greg" results in HelloGreg 1
"Hello\nGreg"  results in Hello 1, Greg 1
"A.D.D." results in ADD 1
"Wow, how great!" results in Wow 1, how 1, great 1
"wow     \n\n\n    great" results in wow 1, great 1
"it was a man-eating shark." results in it 1, was 1, a 1, man-eating 1, shark 1

numbers can be a bit odd .. although the data has no numbers in it I will describe the behavior (I will also show how to avoid this in a parser if you wanted to). If you wanted to you could support a special case for numbers but I have NO test data with numbers in it .. this is how they will turn out if you follow the general rule above of ignoring punctuation.


1,000,000 results in 1000000 1 
$1,000,000 results in $1000000 1 
$ 1,000,000 results in $ 1, 1000000 1 
"1.2 1.20 120" results in 12 1, 120 2

 

Submission Process

Submissions must use the IWordCounter interface provided here, this is to help me by making unit/performance testing a bit easier.

using System;

using System.Collections.Generic;

using System.Text;

 

namespace ConsoleApplication39 {

    struct WordEntry {

        public string Word;

        public int Count;

    }

 

    interface IWordCounter {

        WordEntry[] CountWords(string _Text);

    }

 

    class WordComparer : IComparer<WordEntry> {

        public int Compare(WordEntry x, WordEntry y) {

            return x.Word.CompareTo(y.Word);

        }

    }

 

    class WordCounter :IWordCounter {

        Dictionary<string, int> WordCounts = new Dictionary<string, int>();

       

        private void AddToHash(string _Item) {

            if (!WordCounts.ContainsKey(_Item)) {

                WordCounts.Add(_Item, 0);

            }

            WordCounts[_Item] += 1;

        }

 

        public WordEntry [] CountWords(string _Text) {

            List<WordEntry> Words = new List<WordEntry>();

            Array.ForEach<string>(_Text.ToUpper().Split(' '), new Action<string>(AddToHash));

            foreach (string current in WordCounts.Keys) {

                WordEntry w;

                w.Word = current;

                w.Count = WordCounts[current];

                Words.Add(w);

            }

            return Words.ToArray();

        }

    }

    class Program {

       

        static void Main(string[] args) {

            string test = "The quick brown fox jumps over the lazy dog";

            IWordCounter Counter = new WordCounter();

            WordEntry [] Counts = Counter.CountWords(test);

            Array.Sort(Counts, new WordComparer());

            Array.ForEach<WordEntry>(Counts, new Action<WordEntry>(PrintEntry));

        }

 

        static void PrintEntry(WordEntry _Entry) {

            Console.WriteLine(_Entry.Word + " " + _Entry.Count);

        }

    }

}

 

Listing 2: Naïve example using required interface

 

Submissions are to be EMAILED to me at Gregory<insertmylastname>1@gmail.com before midnight (EST) on Aug 10th, I will have results ready by the 13th. Do not post your submission in comments or I will be forced to delete it. I will post when I receive entries so you can be sure you entry has actually reached me. Also it is not your best interest to do this as someone could come through and optimize one line of your code to beat you :P

I will the day that the contest ends create a post including performance results for all code. I will also post a zip file of all submissions (including my performance harness & unit tests so others can view it).

Scoring

3 sets of data will be run through the submissions each will be run many times with a new object constructed for each iteration. There will be one very small set of data, one medium set of data and one large set of data. Each of these sections is worth the same number of points.

1.       10 points

2.       8 points

3.       7 points

4.       6 points

5.       5 points

The submission with the most combined points from the three sections will win. If there is a point based tie, the entry that did better on section three (the large set) will win.

Why three datasets? Optimizations for the largest data set often come at the expense of the smallest, what is sought after is the best all around solution.

Additional Rules

1)      Only one submission per person to be evaluated, you can submit multiples for discussion if you come across an interesting tidbit that you think is worth sharing

2)      Items will be run on the MS 2.0.50727 CLR (there can be no tricks such as adding IL instructions to mono)

3)      You may not call out to unmanaged code!

4)      You CAN use unsafe code

5)      No you can’t know what the data is beforehand.

6)      You must pass unit tests in order to be included (what good is fast if it doesn’t work?)

If you have any questions, need clarification, or have thoughts on the overall idea of these competitions please post them here.

                                                                            


Published Jul 27 2006, 05:37 PM by Greg
Filed under:

Comments

legobuff said:

Question on note 2... does "handle normal punctuation properly" also mean that I need to handle money also?  or will this be strictly words?

And thank you for the mental exercise.
# July 28, 2006 11:50 AM

Greg said:


*DELETED*
# July 28, 2006 1:55 PM

bushmango said:

How about you give us one of those fabled unit tests? You know, so we can learn some Test Driven Development =) That would allow us to make sure we understand and interpret the rules the same.
# July 28, 2006 3:09 PM

bushmango said:

Also, is this for English-only, ascii only, or do we have to support all of unicode?
# July 28, 2006 4:20 PM

Greg said:

OK. I apologize for my quick response a few hours ago which I am forced to change do to some things I didn't think of (namely abbreviations). Some people have brought up some really fun edge conditions with punctuation and my term of "handle properly". If you did it that way .. it should be 1-2 lines of code to change (and again I apologize).

I am updating the original post to include a bit more specific information. Properly will mean that it does not get included with the word.

here are some test strings to show ...

"Hello! Greg" results in Hello 1, Greg 1
"Hello!Greg" it results in HelloGreg 1
"A.D.D." results in ADD 1
"Wow, how great!" results in Wow 1, how 1, great 1
"wow     \n\n\n    great" results in wow 1, great 1
"it was a man-eating shark." results in it 1, was 1, a 1, man-eating 1, shark 1

numbers can be a bit odd .. although the data has no numbers in it I will describe the behavior (I will also show how to avoid this in a parser if you wanted to). If you wanted to you could support a special case for numbers but I have NO test data with numbers in it .. this is how they will turn out if you follow the general rule above of ignoring punctuation.


1,000,000 results in 1000000 1
$1,000,000 results in $1000000 1
$ 1,000,000 results in $ 1, 1000000 1
"1.2 1.20 120" results in 12 1, 120 2


If you think I am missing a case here let me know otherwise the general rule is to simply remove punctuation

The data is english data only. Data is passed as a .NET string which means it will be in the form of utf 16 but it will only be english characters.

# July 28, 2006 7:03 PM

Greg said:

First entry has been received from Eric Bowen (http://scrappydog.com/blogs/blog/default.aspx) of dingineer (http://digineer.com)

This submission will now be known as Bowen.

# July 31, 2006 1:57 PM

johnwood said:

Man I wish I had time to implement this! Next time be nice to me and choose a problem that doesn't involve optimizing a tokenizer too :)
# August 9, 2006 2:14 AM

Greg Young [MVP] said:

24 hour warning
# August 9, 2006 11:39 PM

bushmango said:

Hmm these new rules almost completely killed my tokenizer....
# August 10, 2006 12:30 PM

bushmango said:

My submission should be in your inbox
# August 10, 2006 1:58 PM

Greg said:

More submissions.

Garmon Edmon http://weblogs.asp.net/egarmon/ will be known as Edmon
Alois Kraus http://geekswithblogs.net/akraus1/ will be known as Kraus
Steve Bushman will be known as Bushman
Kevin Idzi http://kevin.idzi.com from CostCo will be known as Idzi
Blaine Wildon will be known as Wilson
Brandon Grossutti will be known as Grossutti
My sumission will be known as Young

Still 7 hours for submissions.

# August 10, 2006 5:06 PM

Greg said:

submissions are closed.
# August 11, 2006 12:37 AM

Greg said:

Sorry for delay over the weekend .. results will be up today.
# August 14, 2006 6:44 AM

Greg Young [MVP] said:

Results and analysis for &quot;Performance Contest #1 - Word Count&quot;
# August 15, 2006 1:30 AM

CoqBlog said:

Greg Young avait lancer un premier concours autour de l'optimisation : le comptage de mots.
Il vient...
# August 15, 2006 4:17 AM

Leave a Comment

(required)  
(optional)
(required)  

Enter the numbers above:
Add
Check out Devlicio.us!