Debugging Visual Studio 2010

I've been spending quite a bit of time recently exploring the Visual Studio 2010 beta. I'm impressed by many of its features, but that's not to say it doesn't have bugs. One of the most frustrating bugs I've encountered completely hangs my machine. It happens regularly (I've hit it over 50 times), and the only way to recover is to hard power-off the machine and reboot.

So, what do we know about the bug? For one thing, it happens sporadically, usually right when I initiate a C++ project build, though it has occasionally happened at other times. Second, when it happens the machine becomes completely unusable and windows stop drawing. Third, the machine's CPU spikes to 100% (given observation two, the only way to see this is to watch the host's CPU when the bug happens inside a VM). So I'm going to poke at this problem in WinDbg and see if I can't figure out what's wrong.

So where to begin? Well, surely the problem must be somehow related to Visual Studio. Since the CPU was stuck at 100%, I suspected either an infinite loop or a spinlock. The first step is to see if any process stands out as having far more CPU time than normal. I could have blindly drilled straight down into the Visual Studio processes, but it could always be a driver problem, or something running in the System process. A quick survey of processes showed the following:

0: kd> !process 0 f
PROCESS 822338b0 SessionId: 0 Cid: 0250 Peb: 7ffde000 ParentCid: 0294
            DirBase: 02a50380 ObjectTable: e264ffb8 HandleCount: 33104.
            Image: vcpkgsrv.exe
            VadRoot 823b4138 Vads 1774 Clone 0 Private 8274. Modified 77725. Locked 0.
            DeviceMap e1d5cac8
            Token e17a83c8
            ElapsedTime 01:07:16.912
            UserTime 00:01:57.812
            KernelTime 00:16:32.203
            QuotaPoolUsage[PagedPool] 550980
            QuotaPoolUsage[NonPagedPool] 71840
            Working Set Sizes (now,min,max) (33950, 50, 345) (135800KB, 200KB, 1380KB)
            PeakWorkingSetSize 43646
            VirtualSize 249 Mb
            PeakVirtualSize 271 Mb
            PageFaultCount 882867
            MemoryPriority BACKGROUND
            BasePriority 8
            CommitCharge 8860

16 minutes of kernel-mode CPU time certainly seems interesting, especially since the computer had frozen about 15 minutes before this and I had let it run the whole time. If my guess is right, there's one thread that's used almost all of this time.

Sure enough, listing threads for this process I find the following:

0: kd> !threads 822338b0  f
THREAD 820ca2b0 Cid 0250.0408 Teb: 7ffdb000 Win32Thread: 00000000 RUNNING on processor 0
            Not impersonating
            DeviceMap e1d5cac8
            Owning Process 0 Image:
            Attached Process 822338b0 Image: vcpkgsrv.exe
            Wait Start TickCount 553117 Ticks: 4 (0:00:00:00.062)
            Context Switch Count 68260
            UserTime 00:00:03.125
            KernelTime 00:15:57.750
            Win32 Start Address cpfe!a_compiler_thread::compiler_thread_routine (0x3fb7582c)
            Start Address kernel32!BaseThreadStartThunk (0x7c8106f9)
            Stack Init f6be6000 Current f6be5a30 Base f6be6000 Limit f6be3000 Call 0
            Priority 8 BasePriority 8 PriorityDecrement 0 DecrementCount 16
            ChildEBP RetAddr Args to Child
            f6be5a6c 80545129 00000001 00000000 000000d1 nt!RtlpBreakWithStatusInstruction
            f6be5a6c 805cf8ef 00000001 00000000 000000d1 nt!KeUpdateSystemTime+0x175
            f6be5b00 805bc91f 55d9c6a8 00034828 00000670 nt!PsChargeSharedPoolQuota+0x5d
f6be5b1c 805bd1fe e2814088 825c23d8 f6be5b75 nt!ObpChargeQuotaForObject+0x6f
            f6be5b90 805bda20 f6be5bf0 822338b0 e28140a0
            f6be5be4 805c3052 e28140a0 000f0007 00000001 nt!ObpCreateUnnamedHandle+0x86
            f6be5cd8 805ab50a e28140a0 00000000 000f0007 nt!ObInsertObject+0xb0
            f6be5d40 8054162c 00fceb54 000f0007 00000000 nt!NtCreateSection+0x15c
            f6be5d40 7c90e514 00fceb54 000f0007 00000000 nt!KiFastCallEntry+0xfc
            00fceaf8 7c90d18a 7c8094e5 00fceb54 000f0007 ntdll!KiFastSystemCallRet
            00fceafc 7c8094e5 00fceb54 000f0007 00000000 ntdll!ZwCreateSection+0xc
            00fceb58 7c80955f 0000016c 00000000 00000004 kernel32!CreateFileMappingW+0x10b
            00fceb84 3fbeb857 0000016c 00000000 00000004 kernel32!CreateFileMappingA+0x6e
            00fcebbc 3fc2b88c 00000000 000000a4 00010000 cpfe!map_file_region+0x67
            00fcebd0 3fb95781 00010000 00010000 000000a0 cpfe!alloc_new_mem_block+0x36
            00fcebec 3fb14302 00000001 00000000 00000000 cpfe!_setjmp3+0x15dc7
            00fcec14 3fb159c3 00000098 278bf7f8 278c7d54 cpfe!check_operator_function_params+0x5f2
            00fcec44 3fb24f03 00000006 00009e38 278bf7f8
            00fcecdc 3fb57bbc 278bf7f8 00000003 00000003 cpfe!scan_class_definition+0x1a3
            00fced58 3fb48afb 278c7c70 00fcee60 00000000 cpfe!alloc_decl_position_supplement+0x245
            00fcee20 3fb396ca 00000000 02c30420 00000000 cpfe!required_token+0x55a
            00fcf148 3fb4f8da 00fcf1d4 00000000 00000000 cpfe!proc_define+0x252a
            00fcf178 3fb16edb 00fcf1d4 00000000 00000000 cpfe!template_directive_or_declaration+0x97
            00fcf1a4 3fb21310 00fcf250 00000000 00fcf1d4 cpfe!is_type_start+0x3db
            00fcf398 3fb55190 00000001 00000000 00000000 cpfe!declaration+0x250
            00fcf454 3fb16f53 00fcf4ac 00000000 00000290 cpfe!scan_void_expression+0x3ec
            00fcf47c 3fb21310 00fcf528 00000000 00fcf4ac cpfe!is_type_start+0x453
            00fcf670 3fb55190 00000001 00000000 00000000 cpfe!declaration+0x250
            00fcf72c 3fb16f53 00fcf784 00000000 00000148 cpfe!scan_void_expression+0x3ec
            00fcf754 3fb21310 00fcf800 00000000 00fcf784 cpfe!is_type_start+0x453
            00fcf948 3fb55190 00000001 00000000 00000000 cpfe!declaration+0x250
            00fcfa04 3fb16f53 00fcfa5c 00000000 00000000 cpfe!scan_void_expression+0x3ec
            00fcfa2c 3fb21310 00fcfad8 00000000 00fcfa5c cpfe!is_type_start+0x453
            00fcfc20 3fb5461c 00000001 00000000 00000001 cpfe!declaration+0x250
            00fcfc40 3fb6c891 00000001 00000000 00448bc8 cpfe!translation_unit+0x68
            00fcfc50 3fb6c946 00000000 00fcfcf8 00fcfcd8 cpfe!process_translation_unit+0xf4
            00fcfc80 3fb702d9 0000003b 004546b0 00fcfd54 cpfe!process_translation_unit+0x1a9
            00fcfc90 3fad8c41 0000003b 004546b0 00000000 cpfe!edg_main+0x2e
            00fcfd54 3fb75a9d 0000003c 01851858 00000000 cpfe!InvokeCompilerPassW+0x46b
            00fcfda4 3fb75a1a 00000025 0045fd10 b1b2ece7 cpfe!edge_compiler_main+0x50

15 minutes and 57 seconds of kernel-mode time used, and just my luck, it's in the Visual Studio code. It appears to be code that parses my C++ in the background, perhaps to flag syntax errors or to power IntelliSense. Nothing seems particularly unusual about the call stack, though; it doesn't appear to be stuck in a spinlock like I originally guessed. So I wonder if it's stuck at all.

At this point we don't know if it's stuck in kernel mode, or if the parsing code is stuck in some sort of loop calling an expensive kernel-mode function over and over. If it ever returns from kernel mode, it should eventually hit the address 3fbeb857, which you can see above is the return address from CreateFileMapping back into the Visual Studio code. So I put a breakpoint at 0x3fbeb857 and hit run. The breakpoint never gets hit.
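For the record, checking that in WinDbg is a one-liner; the address comes straight from the stack listing above:

```
0: kd> bp 3fbeb857
0: kd> g
```

If execution ever made it back to user mode, this would break immediately.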

So it is stuck in kernel mode. Perhaps it would be useful to know what file it's trying to map, since it's obviously unable to map it. Referring back to the call stack, the first argument of CreateFileMapping is 0000016c, which the MSDN documentation says is a handle to the file being mapped. Luckily, that makes it easy to get information about the handle.

0: kd> !handle 16c f
processor number 0, process 822338b0
PROCESS 822338b0 SessionId: 0 Cid: 0250 Peb: 7ffde000 ParentCid: 0294
            DirBase: 02a50380 ObjectTable: e264ffb8 HandleCount: 33104.
            Image: vcpkgsrv.exe

Handle table at e18b9000 with 33104 Entries in use
016c: Object: 81f4b808 GrantedAccess: 0013019f Entry: e10402d8
            Object: 81f4b808 Type: (825e9ad0) File
            ObjectHeader: 81f4b7f0 (old version)
                        HandleCount: 1 PointerCount: 1
                        Directory Object: 00000000 Name:
                                    \DOCUME~1\ADMINI~1\LOCALS~1\Temp\edg41.tmp {HarddiskVolume1}

Ahaaa. Err, not really. At least it's more information, but this doesn't help too much.

At this point it might be useful to see a call trace. A crude way to get one is to just hit F5 and break a few times, and see where you end up. When I did this about five times, I only ever landed inside a total of three different functions: nt!PspExpandQuota, nt!PsChargeSharedPoolQuota, and nt!MmRaisePoolQuota.

A cursory disassembly of these three functions shows that a certain code path in nt!PsChargeSharedPoolQuota calls nt!PspExpandQuota, which in turn has a code path that calls nt!MmRaisePoolQuota. There doesn't seem to be any indirect recursion among the three, so we can rule that out and instead focus on finding the highest function in the call stack that ever gets reached. An obvious first guess, based on the call graph, is nt!PsChargeSharedPoolQuota. Putting a breakpoint on that function's ret instruction reveals that it is never reached.

So, we're getting close. We've identified an infinite loop in nt!PsChargeSharedPoolQuota. Perhaps PspExpandQuota is violating one of PsChargeSharedPoolQuota's assumptions. At this point things get a little more difficult: none of these three functions are documented, so to get any further we're going to have to reverse engineer the disassembly to see what they're doing.

To get a slightly better call trace, we can set breakpoints at each call instruction, and immediately after it, that print what's about to happen or what just happened, along with the arguments or return value. By disassembling these functions you can see how many arguments are passed to each simply by counting how many are pushed onto the stack. For example,

bp nt!PsChargeSharedPoolQuota+0x5d ".echo About to call PspExpandQuota. Arguments = ; dd /c 5 esp esp+0x10; g"

will print a message displaying all the arguments being passed to PspExpandQuota right before the function is invoked. /c 5 means 5 columns (since there are 5 arguments), and I've used the value 0x10 because it causes dd to display the values {esp, esp+4, esp+8, esp+0xC, esp+0x10}, which are the 5 arguments we need. When I did this, I found it repeating the exact same sequence of PspExpandQuota -> MmRaisePoolQuota over and over in an infinite loop, with the exact same arguments every time.

This certainly explains the CPU spike; the unusability of the machine takes a little more digging.

So why did it lock up the entire machine? A quick examination of !stacks shows every other thread in the system blocked, even though I was using 2 virtual CPUs. Using !locks, I then found that thread 820ca2b0 was the exclusive owner of an ERESOURCE lock, and that there were 4 threads waiting on it. Recall that thread 820ca2b0 is the same thread stuck in this infinite loop. I didn't examine further because this was all I really needed to know.

If I have some time in the next few days, I may write a follow-up post walking through a reverse engineering of PspExpandQuota and MmRaisePoolQuota: determining their function signatures, figuring out the structure of the code, and, if possible, reversing just the infinite-loop portion of PsChargeSharedPoolQuota to see what could be causing it.

