CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Greg Young [MVP]

July 2006 - Posts

  • Performance Challenge: #1 Word Count

    I had not seen a good .NET performance challenge in a while so I figured that I would start one. Obviously the real benefit will come in about two weeks when all of these items are analyzed and discussed (I wager everyone involved will learn something!). If this works out well I will try to run these either bi-weekly or monthly. Let me know your opinion in comments below.

    The Prizes

    Aside from becoming alpha nerd for a few weeks, the winner of this contest will receive a book from the very well known computer scientist Donald E. Knuth. Just to give things an interesting twist the book has absolutely nothing to do with optimizations. The prize is a copy of Things a Computer Scientist Rarely Talks About, it is the transcripts of Knuth’s 1999 lecture series at MIT discussing the relations between faith and science. If you wish to watch the web casts of these lectures you still can here.

    The runner up will receive a copy of Donald Knuth’s TAOCP V4F3 “Generating all Combinations and Partitions” as this book will help them win a future contest.

    There will also be a prize TBD for the fastest safe version of the code.

    I will do a submission but I cannot win a prize.

    The Problem

    Wc (word count) is an old unix program. One of its main tasks is to go through a set of data and count the number of repetitions of a given word. For instance given the sentence “Hello, yes hello” it would tell you that the word Hello was used twice and that the word yes was used once. While seemingly trivial this problem offers many interesting aspects such as dealing with the trade off of memory usage for large vs small data sets.

    Hint: since it is a classic problem there is a lot of research on it

    Here is a naive version of a word counter with a test harness if you can’t beat this you need more optimization J

        class Program {

            static Dictionary<string, int> WordCounts = new Dictionary<string, int>();

            static void Main(string[] args) {

                string test = "The quick brown fox jumps over the lazy dog";

                Array.ForEach<string>(test.ToUpper().Split(' '), new Action<string>(AddToHash));

                foreach (string current in WordCounts.Keys) {

                    Console.WriteLine(current + ":" + WordCounts[current]);

                }

            }

            static void AddToHash(string _Item) {

                if (!WordCounts.ContainsKey(_Item)) {

                    WordCounts.Add(_Item, 0);

                }

                WordCounts[_Item] += 1;

            }

        }

     

    Listing 1: Example of a naive word counter

     

    Some notes.

    1)      Capitalization differences do not equate to a difference.

    2)      You must handle normal punctuation properly there will be no apostrophes/quotes in the test data as they bring up many questions on their own.

    3)      Code will be run on a 32 bit processor (don’t use 64 bit stuff hoping for a 64 bit processor)

    4)      My processor is a dual core (hint)

    5)      You must use the IWordCounter interface

    6)      You do not need to sort your data before returning

    7)      If you do not pass unit tests performance tests will not be run.

    UPDATE: here are some test strings to show ... the quick sample code meets some but not all of these

    "Hello! Greg" results in Hello 1, Greg 1
    "Hello!Greg" results in HelloGreg 1
    "Hello\nGreg"  results in Hello 1, Greg 1
    "A.D.D." results in ADD 1
    "Wow, how great!" results in Wow 1, how 1, great 1
    "wow     \n\n\n    great" results in wow 1, great 1
    "it was a man-eating shark." results in it 1, was 1, a 1, man-eating 1, shark 1

    numbers can be a bit odd .. although the data has no numbers in it I will describe the behavior (I will also show how to avoid this in a parser if you wanted to). If you wanted to you could support a special case for numbers but I have NO test data with numbers in it .. this is how they will turn out if you follow the general rule above of ignoring punctuation.


    1,000,000 results in 1000000 1 
    $1,000,000 results in $1000000 1 
    $ 1,000,000 results in $ 1, 1000000 1 
    "1.2 1.20 120" results in 12 1, 120 2

     

    Submission Process

    Submissions must use the IWordCounter interface provided here, this is to help me by making unit/performance testing a bit easier.

    using System;

    using System.Collections.Generic;

    using System.Text;

     

    namespace ConsoleApplication39 {

        struct WordEntry {

            public string Word;

            public int Count;

        }

     

        interface IWordCounter {

            WordEntry[] CountWords(string _Text);

        }

     

        class WordComparer : IComparer<WordEntry> {

            public int Compare(WordEntry x, WordEntry y) {

                return x.Word.CompareTo(y.Word);

            }

        }

     

        class WordCounter :IWordCounter {

            Dictionary<string, int> WordCounts = new Dictionary<string, int>();

           

            private void AddToHash(string _Item) {

                if (!WordCounts.ContainsKey(_Item)) {

                    WordCounts.Add(_Item, 0);

                }

                WordCounts[_Item] += 1;

            }

     

            public WordEntry [] CountWords(string _Text) {

                List<WordEntry> Words = new List<WordEntry>();

                Array.ForEach<string>(_Text.ToUpper().Split(' '), new Action<string>(AddToHash));

                foreach (string current in WordCounts.Keys) {

                    WordEntry w;

                    w.Word = current;

                    w.Count = WordCounts[current];

                    Words.Add(w);

                }

                return Words.ToArray();

            }

        }

        class Program {

           

            static void Main(string[] args) {

                string test = "The quick brown fox jumps over the lazy dog";

                IWordCounter Counter = new WordCounter();

                WordEntry [] Counts = Counter.CountWords(test);

                Array.Sort(Counts, new WordComparer());

                Array.ForEach<WordEntry>(Counts, new Action<WordEntry>(PrintEntry));

            }

     

            static void PrintEntry(WordEntry _Entry) {

                Console.WriteLine(_Entry.Word + " " + _Entry.Count);

            }

        }

    }

     

    Listing 2: Naïve example using required interface

     

    Submissions are to be EMAILED to me at Gregory<insertmylastname>1@gmail.com before midnight (EST) on Aug 10th, I will have results ready by the 13th. Do not post your submission in comments or I will be forced to delete it. I will post when I receive entries so you can be sure you entry has actually reached me. Also it is not your best interest to do this as someone could come through and optimize one line of your code to beat you :P

    I will the day that the contest ends create a post including performance results for all code. I will also post a zip file of all submissions (including my performance harness & unit tests so others can view it).

    Scoring

    3 sets of data will be run through the submissions each will be run many times with a new object constructed for each iteration. There will be one very small set of data, one medium set of data and one large set of data. Each of these sections is worth the same number of points.

    1.       10 points

    2.       8 points

    3.       7 points

    4.       6 points

    5.       5 points

    The submission with the most combined points from the three sections will win. If there is a point based tie, the entry that did better on section three (the large set) will win.

    Why three datasets? Optimizations for the largest data set often come at the expense of the smallest, what is sought after is the best all around solution.

    Additional Rules

    1)      Only one submission per person to be evaluated, you can submit multiples for discussion if you come across an interesting tidbit that you think is worth sharing

    2)      Items will be run on the MS 2.0.50727 CLR (there can be no tricks such as adding IL instructions to mono)

    3)      You may not call out to unmanaged code!

    4)      You CAN use unsafe code

    5)      No you can’t know what the data is beforehand.

    6)      You must pass unit tests in order to be included (what good is fast if it doesn’t work?)

    If you have any questions, need clarification, or have thoughts on the overall idea of these competitions please post them here.

                                                                                

    Posted Jul 27 2006, 05:37 PM by Greg with 15 comment(s)
    Filed under:
  • Review: Dunn Consulting BizTalk Boot Camp

    A few weeks ago I attended the beta version of Dunn Consulting’s BizTalk Boot Camp. I wanted to write a review on the course and to let others know about it because BizTalk is quickly gaining in popularity and is likely something you will have to deal with in the next few years.

    The course was taught mainly by Mark Berry with the assistance of Mark Dunn during labs (and a few hours of lecturing). Both Marks are highly regarded as teachers and have more experience teaching than I do writing code. Mark Berry in particular not only knows the material being taught but is also quite good with “other” material that undoubtedly will come up. If you do a bit of background on Mark Berry you can also see that many hold him in high regard; he has also spoken on BizTalk in such events as TechEd and MS livemeetings. The fact that every student has given the Marks high marks (pun intended) should also go a long way.

    As for the materials, let me just say that this is not BizTalk for bozos, expect to be challenged during the boot camp.  I have copy/pasted the syllabus below from the dunntraining.com website for those who are interested in looking through it. I don't know about you but one of my major anxieties about any training course is that it will cover approximately the first four chapters of a book with an instructor who has read the first five :)

    On our first day we spent most of our time setting up our VMs (while discussing general concepts on another thread). I feel this was a very important area of the class as if I had had everything setup for me I would have been staring at the computer for a while at home to get everything setup.  The setup for BizTalk including SharePoint and other tools is not a trivial process (especially trying to do it and make sure things are fairly secure).

    On a poor note the place hosting the class must have had some issues with their machines as some were very slow (even though they were running the same VM on supposedly the same hardware) as such some of the installations took a very long time. This was an issue for some throughout the class but I wager this would not be the same the next time the course was taught.

    On the first and second days we also spent some time going over the general layout of BizTalk. We learned the overall architecture of both BizTalk and competing solutions. The discussion that was had going over the pros and cons of BizTalk and the comparison to other similar products on the market was worth going to the course alone if you are in any type of decision making position.

    During the middle of the week we focused on the “meat” of BizTalk. Starting with setting up a basic pipeline that was “the most expensive file copy of all time” and ending with a complex set of orchestrations. This material flew by; I would wager more was covered than in a BizTalk book.

    My personal favorite in this area was learning about the schema mappings. I will be the first one to admit that I am an XML idiot. Somehow I managed to make it through this section without completely killing my BizTalk installation and eventually I even got the right data to show up. Note that in this section we did not just do simple mappings but also got to the level of writing our own functoids.

    What I would deem the most important item was the focus on debugging and solving problems. BizTalk often does not fail in such a way that the problem is immediately apparent if you are not very familiar with it. Loads of time is spent on how to troubleshoot the work that you do. When a student had a problem, the problem was put up onto the projector in order to have the class help with the debugging process. In fact in some cases bugs are deliberately introduced just so you can see how to fix them. For me the ability to understand what is going on and how to debug it far more important than learning any particular feature as when that feature breaks you won’t know how to get in and figure out what’s wrong.

    The last day although I missed a bit of the morning due to having a prior commitment we went through business rules engine, performance topics, and one of my favorite discussions versioning. When dealing with BizTalk in an enterprise level app versioning is of the upmost importance, it is not as simple as one may think and can in fact become quite complex. Some of the versioning does require forethought in order to be able to successfully versionize later … isn’t it great when you learn about these “gotchas” ahead of time as opposed to when you later need to versionize something?

    Overall, the mastery labs are really what made the boot camp for me. Basically the mastery labs would be done individually (with Marks walking around the room in case you had questions). The masteries were not simply reproducing what you had just learned, they often times included items that were natural extensions of what you had done. Best said you had to understand what you just learned in order to complete the mastery assignments.

    The other nice thing about the mastery labs is that they are given to you in much the same way that you would be handed an assignment at work. You are expected to work from a list of functional requirements to a working piece of software. Obviously in trying to do so you hit many road blocks and things you thought you understood but really did not. These masteries help show those weaknesses and give the instructor a chance to work through them with you.

    The only real negatives that I saw including the slow computers were items that should be expected in a beta class. I came across a few times where instructions were incorrect or where something took a bit longer than had been planned for.  In my book this is what a beta class is supposed to find and the Marks did a great job of handling any of the situations that did come up whether taking notes to alter the instructions, rescheduling the estimated times for sections, or even staying late on their own time to discuss further items they did not expect heavy interest in.

    I won’t say that I left the class a BizTalk expert but I did leave the class with a general understanding and feel ready to join a team with a more experienced member without feeling completely lost.

    Overall I give it a 4.5/5 which if you know me at all is very high as I am usually a tough grader J

     

    Module 1 – BizTalk Introduction

    -         Problem Space

    -         Alternatives

    -         Single and Multi-box Installation

    Module 2 – Designing Solutions

    -         Patterns – Splitter/Aggregator, FIFO, and Interrupt

    -         Automated Testing

    -         Source File Management

    Module 3 – Messaging Normalization and Routing

    -         Schema Creation

    -         Map/XSLT Creation

    -         Publish and Subscribe Routing

    -         Underlying SQL Tables and Stored Procedures

    -         Flat File to XML

    -          Custom Pipelines

    -         De-batching Interchanges

    -         Recovering Failed Interchanges

    Module 4 – Extending External Systems Integration

    -         Custom Pipeline Components

    -         Built-in Adapters

    -         Custom Adapters

    -         Signing and Encrypting Messages

    Module 5 – Debugging and Monitoring

    -         Health and Activity Tracking (HAT)

    -         Group Hub Queries

    -         Debugging Components

    -         C# Source

    -         Orchestration Debugger

    -         Subscription Viewer

    Module 6 – Business Process Management

    -         Orchestrations

    -         Transactions

    -         Correlation

    -         Message Aggregation

    -         Convoys

    -         FIFO settings

    Module 7 – Business Rules Engine

    -         The Rete Algorithm

    -         Vocabulary

    -         Defining Good Rules

    -         Using Policies from Code

    -         Tracking Rule Events

    Module 8 – Performance

    -         Load Testing

    -         Performance Counters

    -         Messagebox Clusters

    -         Settings for Tuning

    Module 9 – Updating

    -         Versioning

    -         High Availability Updates

    -         Interrupt Pattern

    -         Alternate Strategies

     

    Posted Jul 27 2006, 02:15 AM by Greg with 1 comment(s)
    Filed under:
  • Agile: Resistance is Futile

     

    Ok I am still catching up on all my blogs from the two ten day periods I was not around for. I found a real gem on Michael Stal's blog entitled Agility and the Borgs where he begins to analyze the process model of the borgs. I have to admit that growing up I watched TNG all the time which might be what made this post so interesting.

    It seems as if the queen is in charge of providing central goals but Borgs are able to achieve these goals by any means they consider appropriate. For now, we recognize that Borgs use communication and interaction, share their knowledge with each other, can interoperate independently as groups but not as individuals, continuously share their knowledge and abilities, strive for common code ownership.

    Does this sound oddly familiar; it is at the least an interesting thought. While seeing this, I remembered a paper by Anders Sandberg I had read discussing in some depth how such organisms would work.  One of the key points that are hit on is communication within the borganism.

    Communication is central to borganisation. By definition the units making up a borganism will be in close mental contact; the bandwidth and structure of this contact will determine much of the properties of the borganism.

    Could we also say that many of the properties of an agile team are directly derived from the bandwidth and structure of its communications? Another point that Anders discusses that we can probably learn a lot from deals with Group Think.

    Groupthink is a common problem in human groups: the group becomes divorced from reality due to its internal consensus (which may even be illusory); it fails to question its own assumptions and to take unwelcome aspects of reality into account. If the borganism has to keep its units in line, it is likely it will directly or indirectly counteract dissent, which may promote groupthink. Often the best way of avoiding groupthink is to allow dissenting minorities to present their view. On the other hand, borganisms with sufficiently high bandwidth may be less susceptible to groupthink than human groups. If the units can present not only their views but the mental processes which reached these views it may become easier to judge the relative merit of the different positions. They are no longer assertions about reality but rather different models which can be analysed using critical thinking, empirical testing or synthesis.

    This is a very common problem on software teams where the entire team has become deluded into thinking that they are in fact producing best software that they can. Agile teams often have less problems with this as there is much higher communication amoungst the team members. This also bolsters the commonly held belief that it is important to let minorities debate their beliefs (to a limit) within the team in order to help promote code ownership.

    Another thought deals with assimilation, do we as agilists when seeking new members not look for units who introduce new areas of knowledge in hopes that they will be assimilated into the team thus spreading the knowledge to the other units?

    Perhaps it is in fact agility which is is the underlying borg belief. When we look at their common saying it sure sounds agile.

    "We are the Borg. Lower your shields and power down your weapons. You will be assimilated. We will add your biological and technological distinctiveness to our own. Resistance is futile."

    They do however not always sound agile as can be seen in their other common hail

    "We are the Borg. Lower your shields and surrender your ships. We will add your biological and technological distinctiveness to our own. Your culture will adapt to service us. Resistance is futile."

    Perhaps the addition of the queen made them no longer agile? The major point of contention between the models in that all borgs are subservient to a single ruler whereas an agile structure would promote something more akin to an Island Model. Obviously in the case of following a single ruler the borganism would run into to scalability issues either due to a bottleneck created by the ruler or in communication to and from the ruler. Wikipedia had a great quote discussing some of the borg scalability issues.

    The hive mind, however, has its limitations. In every Borg episode, Borg drones react so slowly to incursions that they often don't respond to an enemy soldier until he actually destroys something or attacks somebody. In Q Who, the cube just sat there trying to decide what to do, waiting for so long that the Enterprise's crew was able to hold its own meetings, decide what to do, and then explore the Borg ship! In fact, the only time that we ever saw the Borg react and move quickly was "Descent", when they were not part of a collective. This is due to the inefficiencies of the hive mind and other similar symmetric multi-processing systems. (SMP while being great at some tasks unfortunately introduces extra overhead as a side effect; the inability of trillions of minds to equal Voyager's holo-doctor shows every indication that they have reached and then surpassed a scalability limit, thus explaining why they react so slowly).

    hmm ... what would happen if the borg were not part of the collective but instead part of the distinct groups as in "Descent"? Could the collective be agility gone wrong?

    Can we learn something from analyzing the borg?

     

     

    Posted Jul 24 2006, 10:48 AM by Greg with 1 comment(s)
    Filed under:
  • Performance : String Reverse

    There are some people are trying to get the “fastest” string reverse function. http://sqljunkies.com/WebLog/amachanic/archive/2006/07/17/22253.aspx

    http://weblogs.sqlteam.com/mladenp/archive/2006/03/19/9350.aspx

     There is some history on this problem at Justin Roger's blog

    You know me, I can't stay away from a performance challenge...

    Here is my original submission (about what you would expect and it’s a fairly naïve implementation) … Note that using the same string (breaking immutabilty rules and doing it in place) is not acceptable as an optimization :)

            public static unsafe string GregsUnsafeReverse(string s) {

                string ret = string.Copy(s);

                fixed (char* start = ret) {

                    char* begin = start;

                    char* end = start + s.Length - 1;

                    char* stop = start + (end - start) / 2;

                    while (begin < stop) {

                        *begin ^= *end;

                        *end ^= *begin;

                        *begin ^= *end;

                        begin++;

                        end--;

                    }

                }

                return ret;

            }

    Listing 1: unsafe code to reverse string

     

    There are two other unsafe entries over on Justin Roger's blog with similar performance characteristics.

    What no one is talking about is that this is a piss poor way of doing it. We are copying 16 bits at a time. I think we can do much better. In this example I will use int32s to do it (although it can be done even faster with int64s (especially on a 64 bit machine)). If I could get it to use a ROTL instead, this should easily beat the copy.

            public static unsafe string GregsInt32Reverse(string s) {

                UInt32 Low;

                UInt32 High;

       

                string ret = string.Copy(s);

                fixed (char* start = ret) {