An obligatory bit about me (skip this paragraph to get to real information) ... I am an independent consultant from Atlanta Georgia. I live here with my girlfriend Laura and our two dogs Machiavelli (a Neapolitan Mastiff) and Margeaux (our soon to be robotic Labrador). Our previous home was in New Orleans which is still our favorite travel destination. I like to post about a wide variety of items (DDD, Agile Development, Performance, Mono / Cross Platform Development, AOP, ice cream) I do my best to keep them in categories but I will not promise that they are always well categorized.
One of the things I like to post about (performance) involves some ramp up on the part of the reader. I figured I would start off by walking people through some basic unmanaged debugging using Visual Studio .NET so that it is a bit easier to follow later examples, and it allows me to just link to it in the future :D. I personally also use windbg (even though it has an interface which is an acquired taste) but VS.NET is quickly becoming a fully featured debugging tool and is much more accessible for simple optimization questions.
When we are dealing with code with a need to be highly performant looking at IL is often not good enough as the JIT optimizer will optimize our code much further than we could ever hope to on our own without things like unsafe code which also makes our code unverifiable. If you use a tool such as reflector or ildasm you can quickly see that the code generated in release and debug mode are almost identical. This brings to mind the question of why release mode is so much faster; it is the JIT optimizations that are applied. We cannot see these JIT optimizations by looking at the code in a managed way so we will look at the native code generated from our code.
I want to make it clear that I am not suggesting people do this all of the time. I cannot stress enough to be very wary of Premature Optimization. Make your code work first; your aim should be clarity not speed as it is not worth the effort … profile your code once done and re-visit sections that need to be sped up. You will generally see that <10% of your code is using up 70+% of your execution time, go back and optimize that 10%. Your focus should also generally be on measured speed (not just looking at generated code). Also keep in mind that the choice of data structures within your application is far more important than lower level optimizations.
That said knowing what happens under the covers is also fun! Let’s get on to setting up Visual Studio and walk through a simple example. To start with we need some code to expriment with.
|
static void Main(string[] args) {
for (int i = 0; i < 10; i++) {
Console.WriteLine("Hello World!");
}
} |
Now we need to get unmanaged debugging setup in Visual Studio. In order to do this we go to the properties of our project then to the debug tab. There is a checkbox there to “Enable unmanaged code debugging”, select it.
It is important to note that this option only applies to the current configuration that we are running in. We should therefore set it for both of our configurations. There is some additional work that we need to in order to deal with release unmanaged debugging but we will get to that later.
Insert a break point on the beginning of the for loop and go to the debug menu and select start (or press f5 J)
You will hit a breakpoint as usual... your screen should look something like the display in Figure 2 <insert picture>. If you do not have a call stack window, bring it up by going to the debug menu -> windows -> call stack (or by hitting ctrl + d then c)
Once we have our call stack up we can right click on our function and select “go to disassembly” by doing this we can get the following code listing for our code.
|
static void Main(string[] args) {
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 sub esp,38h
00000009 xor eax,eax
0000000b mov dword ptr [ebp-10h],eax
0000000e xor eax,eax
00000010 mov dword ptr [ebp-1Ch],eax
00000013 mov dword ptr [ebp-3Ch],ecx
00000016 cmp dword ptr ds:[00912DC8h],0
0000001d je 00000024
0000001f call 792B228E
00000024 xor esi,esi
00000026 xor edi,edi
00000028 nop
for (int i = 0; i < 10; i++) {
00000029 xor esi,esi
0000002b nop
0000002c jmp 0000003D
0000002e nop
Console.WriteLine("Hello World!");
0000002f mov ecx,dword ptr ds:[022B303Ch]
00000035 call 785D9074
0000003a nop
}
0000003b nop
for (int i = 0; i < 10; i++) {
0000003c inc esi
0000003d cmp esi,0Ah
00000040 setl al
00000043 movzx eax,al
00000046 mov edi,eax
00000048 test edi,edi
0000004a jne 0000002E
}
0000004c nop
0000004d lea esp,[ebp-0Ch]
00000050 pop ebx
00000051 pop esi
00000052 pop edi
00000053 pop |
What we are viewing is the actual native code being generated by the JIT mapped up to our code that we wrote. We can quite easily see how our for loop works at the native level, if you have never seen one before this probably looks quite odd but it is in fact rather normal. Let’s take a close look at what is going on here.
00000029 xor esi,esi
0000002b nop
0000002c jmp 0000003D
This bit of code initializes our counter in ESI, which is an index register that can be used for indexing arrays. You will notice here that an age old assembly “trick” is being used to set our counter to 0, instead of loading the value 0 into the register we XOR it against itself (anything XOR’ed to itself is 0). The next line Nop meaning (no operation) does just that. The code then tells it to immediately jump to 3D .. Sometimes jumps like this make us go through our code in a non-top down fashion (just like gotos in higher level languages such as C, VB, or C#). If we go to the other half of the for loop we can continue analyzing our code.
0000003c inc esi
The first instruction in the tail bit of code (3C) increments our counter which is held in the ESI register (You can view the velues in your registers using the "registers" window in the debug windows or by hitting ctrl+d then r). In our first iteration however we will skip this line, the jump to 3D as we do not want to increment our index on the first iteration.
0000003d cmp esi,0Ah
00000040 setl al
00000043 movzx eax,al
00000046 mov edi,eax
00000048 test edi,edi
0000004a jne 0000002E
3D (the place the previous code jumped to and the immediately following lines) represent the actual comparison back to our stop value 10 (0A) and if we have not reached that value the last line will jump to 4A (which is the inner body of our loop).
0000002f mov ecx,dword ptr ds:[022B303Ch]
00000035 call 785D9074
The first line here will load our constant string from memory into the ECX register (which is a common purpose register), you will notice that ECX is always used for passing the first parameter to a method (in the case of instance methods ECX always contains “this”), followed by EDX for the second parameter then a series of “push” statements to store others on the machine’s stack.
The next line actually calls the function. I will at a later point go through how to look this function up but right now we have no issues figuring it out as we can see from our source that it is Console.WriteLine. The code will then fall back through to our increment discussed earlier and back through our verification code (which will then jump back into our inner loop code).
Needless to say, we can see significant waste happening in our trivial example. A perfect example of such waste can be seen here
00000009 xor eax,eax
0000000b mov dword ptr [ebp-10h],eax
0000000e xor eax,eax
00000010 mov dword ptr [ebp-1Ch],eax
Where we set EAX to 0 twice in a row... This is due to the fact that we are currently running in debug mode and not allowing optimizations to occur. In other words this code has been JIT’ed without allowing the JIT to do some intelligent things to make our code go faster.
There are a few issues with viewing optimized code. The first issue is the fact that the debugger by default will disable JIT optimizations (believe me I myself have gone through many a long hour late at night not realizing I was looking at un-optimized code). The second issue has to deal with “Just My Code” and the optimized code is therefore skipped. I had originally found the method of getting around this in Vance Morrison's Post on the subject. (Thanks Vance I had been bothered by this for a while and ended up just looking at raw assembler without source using the other method listed).
To get around these issues we should follow the instructions given
1) Go to Tools -> Options -> Debugging -> General
2) Make certain that box labeled ‘Suppress JIT optimization on module load’ is UNCHECKED.
3) Also make certain the box labled ‘Enable Just My Code’ is also UNCHECKED.
Vance also suggests going to the advanced build screen to insure that your release dll is built with pdb only, but this defaults on my machine. We would then simply run our code the same way we ran it before.
Another way to view our code with JIT is to build in release mode, (add a Console.Read() at the end) . Start the executable without the debugger, then attach the visual studio debugger (select break all from the debug menu or hit ctrl+alt+break to break).
Using either method we can now view our JIT optimized code. Our code should come out something like the following
|
static void Main(string[] args) {
for (int i = 0; i < 10; i++) {
00000000 push esi
00000001 xor esi,esi
Console.WriteLine("Hello World!");
00000003 cmp dword ptr ds:[02271084h],0
0000000a jne 00000016
0000000c mov ecx,1
00000011 call 786FC654
00000016 mov ecx,dword ptr ds:[02271084h]
0000001c mov edx,dword ptr ds:[0227307Ch]
00000022 mov eax,dword ptr [ecx]
00000024 call dword ptr [eax+000000D8h]
for (int i = 0; i < 10; i++) {
0000002a add esi,1
0000002d cmp esi,0Ah
00000030 jl 00000003
00000032 pop esi
}
}
00000033 ret |
Wow this is a lot less code than what we had the first time through. JIT optimizations have really done a great job on this code. This difference is why it is so important to look at the actual disassembly at times as opposed to the IL as the JIT will often times recognize patterns in the IL that it can optimize.
An astute reader will notice that there is in fact more code in the inner part of our loop. This can be scary at first, but in this case the optimizer has inlined the Console.WriteLine method for us, saving us quite a bit of code. I will get more into inlining in a later post but it is a huge optimization that occurs.
We have at this point made our way through to looking at non-optimized and optimized native code in the debugger. I think this is a good place to start. Over the next few posts we will build off of this knowledge to look more in depth at some common optimizations that occur with the JIT, we will also get into some other tools and look how they can assist in our quest for better code. Hope to see you there.
Some Resources:
http://en.wikipedia.org/wiki/X86
http://www.codeguru.com/csharp/.net/net_general/il/article.php/c4635/ IL tutorial
http://burks.brighton.ac.uk/burks/language/asm/asmtut/asm1.htm ASM tutorial