I still say it doesn't suck

Somehow my last post ended up on Reddit after being dormant for 4 months. Well, I saw one comment in particular, which I'll quote here:

Oh by the way i would so have liked him to make a rebuttal of this instead of the hand picked stupid anti C++ trolls he has chosen in his article.

Had I consulted with the author of this comment before I wrote the blog post, that might have been possible. Instead, it was just a guy rebutting some of the common (and stupid) claims I see about why C++ sucks. But let's go down the list of points in that FQA since they brought it up:

No compile time encapsulation
Kind of true. Shipping C++ interfaces that contain implementation details to customers is a bad idea. You can always use the pimpl idiom for your publicly exposed interface.
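A minimal sketch of the pimpl idiom, using C++11's std::unique_ptr (the Boost-era equivalent would be boost::scoped_ptr); the class and member names are just for illustration:

```cpp
#include <memory>
#include <string>

// Public header: clients see only this declaration, never the members.
class Widget {
public:
    Widget();
    ~Widget();                   // must be defined where Impl is complete
    std::string name() const;
private:
    struct Impl;                 // implementation detail, hidden from clients
    std::unique_ptr<Impl> impl_;
};

// Implementation file: members can change without recompiling clients.
struct Widget::Impl {
    std::string name = "widget";
};

Widget::Widget() : impl_(new Impl) {}
Widget::~Widget() = default;
std::string Widget::name() const { return impl_->name; }
```

The cost is one heap allocation and an indirection per access; the payoff is that the header never exposes implementation details.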

Outstandingly complicated grammar
While the grammar is indeed complicated, this has very little to do with slow compile times and error messages. Almost all compile-time and error-message complaints come from template instantiations. If parsing really were that much of a problem (which I highly doubt), I guess you could wait until parsers start using SSE4.2 hardware string instructions specifically designed to speed up parsing. But yeah, I don't really think parsing is an issue at all.

No way to locate definitions
A lot of the discussion here is true, but it has little to do with the issue of locating definitions. I'm not saying it's *easy* to locate definitions; it's a complicated problem, which is why few tools get it right. Visual Assist and Visual Studio 2010 (both MSVC-only) are extremely accurate (and fast), and based on how easily these tools locate definitions, one could be fooled into thinking they were using a module-oriented language. So it's possible.

No run time encapsulation
This is a feature of the language. The author even points that out in the last paragraph. Not sure why this is here.

No binary implementation rules
Considering that C++ code should be able to run on arbitrary processor architectures (check out Fermi if you want your mind blown), I don't see how a consistent ABI is even a possibility.

No reflection
Lack of reflection is a hindrance sometimes, but I fail to see how this is a fault of the language. Does Haskell have built-in support for reflection? Not that I'm aware of (although I'm no expert), but I believe it does have reflection libraries.

One could add reflection support to C++ by putting a certain macro at the beginning of every class definition. The great thing about that solution is that it's selective: you only pay the overhead of reflection for the classes you actually need to reflect on.
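A rough sketch of what such an opt-in macro might look like (REFLECT here is hypothetical, my own illustration rather than an existing library): it records the class name and field names, so classes that don't use it pay nothing:

```cpp
#include <string>
#include <vector>

// Hypothetical opt-in reflection macro: records the class name and a
// list of field names. Unannotated classes carry no overhead at all.
#define REFLECT(ClassName, ...)                                            \
    static const char* class_name() { return #ClassName; }                 \
    static std::vector<std::string> field_names() { return {__VA_ARGS__}; }

struct Point {
    REFLECT(Point, "x", "y")
    double x = 0, y = 0;
};
```

A real library would also record field offsets and types, but the principle is the same: reflection data exists only where you ask for it.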

Very complicated type system
Although it's a simple example, it seems somewhat contrived to me. Nevertheless, it does kind of suck. In theory the problem could probably be solved with an appropriate copy constructor that copies the vector's internal pointer but not the data it points to, though it takes some additional work on the part of the library writer.
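To make that workaround concrete, here is one hedged sketch of a wrapper whose copy constructor shares the underlying buffer instead of deep-copying it (SharedVec is an invented name, and std::shared_ptr stands in for the reference counting the library writer would otherwise hand-roll):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Copying a SharedVec copies only the internal pointer; both copies view
// the same storage, so no deep copy of the elements ever happens.
class SharedVec {
public:
    explicit SharedVec(std::vector<int> v)
        : data_(std::make_shared<std::vector<int>>(std::move(v))) {}
    std::size_t size() const { return data_->size(); }
    const int& operator[](std::size_t i) const { return (*data_)[i]; }
private:
    std::shared_ptr<std::vector<int>> data_;
};

// Returns true if a copy really does share storage with the original.
bool copies_share_storage() {
    SharedVec a(std::vector<int>{1, 2, 3});
    SharedVec b = a;                      // pointer copy, not element copy
    return b.size() == 3 && &a[0] == &b[0];
}
```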

Very complicated type-based binding rules
I agree that the rules are complicated.

Defective operator overloading
This doesn't make any sense. It claims that overloaded operators have to return their arguments by value, which is completely false. The rest of the argument is rendered moot by rvalue references.
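For instance, an arithmetic operator can happily return by value, and with C++11 move semantics the returned temporary is moved rather than deep-copied (BigVec is a made-up type for illustration):

```cpp
#include <cstddef>
#include <vector>

struct BigVec {
    std::vector<double> data;
};

// Returns by value, but thanks to NRVO and move semantics the vector's
// storage is handed over to the result, never deep-copied element by element.
BigVec operator+(BigVec lhs, const BigVec& rhs) {
    for (std::size_t i = 0; i < lhs.data.size(); ++i)
        lhs.data[i] += rhs.data[i];
    return lhs;
}
```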

Defective exceptions
Another one that doesn't make any sense. a) Even if C++ exceptions are bad, what are you comparing them to? Java exceptions, maybe? Those are TERRIBLE. b) RAII is trivial to get right. c) Any debugger worth anything can automatically break at the point of the throw. For those that can't, you can always manually put a breakpoint in the constructors of your exception classes.

Duplicate facilities
I'm not sure how the new C++ ways of doing things are worse than their C counterparts. The only example I can think of off the top of my head is printf/scanf and friends, but that problem is solved in C++0x with variadic templates. Any other occurrences of the C++ counterpart being worse are probably related to excessive copying, which is also solved in C++0x with rvalue references.

That being said, all of the old facilities in the language should be deprecated. But that's not exactly practical when potentially billions of lines of code depend on them.

No high-level built-in types
The initialization problems described have been eliminated in C++0x, and more complex initialization schemes can be created through clever (i.e. obtuse) template mechanics. There are already ways to initialize vectors with syntax such as vector v = vec_init(3)(4)(5)(6)(7);
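A hedged sketch of how a vec_init-style helper can be written (this particular vec_init is my own illustration, not a library): each operator() call appends a value, and a conversion operator hands over the finished vector:

```cpp
#include <vector>

// Builder for chained initialization: vec_init(3)(4)(5) collects values,
// then converts implicitly to std::vector<int> on assignment.
struct vec_init {
    std::vector<int> v;
    explicit vec_init(int x) { v.push_back(x); }
    vec_init& operator()(int x) { v.push_back(x); return *this; }
    operator std::vector<int>() const { return v; }
};
```

With this in place, `std::vector<int> v = vec_init(3)(4)(5)(6)(7);` compiles as written, modulo the element type being spelled out.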

Sure, it's not as nice as having them built into the language, but what's amazing about C++ is that the generic programming system is powerful enough to let you extend the core language in ways you never thought possible, and you can achieve very elegant syntax for doing things. Sure, the machinery behind such techniques is complicated and ugly to read. But show me another language with the power to essentially redefine its own syntax that is easy to read in 100% of cases.

I'm not denying that more built-in types would be nice; they would.

Manual memory management
Feature. Regarding the "owner" issue, it can now be expressed (by and large) in C++ syntax using rvalue references. The issue of preventing access to dead objects is addressed by the combination of boost::shared_ptr (which defines multiple "owners") and boost::weak_ptr (which defines multiple "observers").
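A small sketch using the std:: versions standardized from Boost's in C++11: the weak_ptr observer can safely detect that the object has died instead of dereferencing a dangling pointer:

```cpp
#include <memory>

// Owners (shared_ptr) keep the object alive; observers (weak_ptr) can
// check liveness instead of ever touching a dead object.
bool observer_sees_death() {
    std::weak_ptr<int> observer;
    {
        std::shared_ptr<int> owner = std::make_shared<int>(42);
        observer = owner;                          // observe without owning
        if (*observer.lock() != 42) return false;  // alive: lock() succeeds
    }                                              // last owner gone; freed
    return observer.expired();                     // observer knows it died
}
```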

Defective metaprogramming facilities
Hah! Templates, defective? They're a lot of things, but I'm not sure defective is one of them. Difficult, yes. If by defective you mean "perhaps the most powerful of any language in existence," then I guess they're defective. With power comes complexity.

Unhelpful standard library
Unhelpful because it doesn't provide GUI facilities? Come on, you have got to be kidding. This whole point is just full of nonsense. Not all platforms support GUIs or network sockets. C++0x does have regular expressions. Matrix requirements are highly platform- and application-dependent. Do you want row-major or column-major matrices? Should they use SSE instructions if available?

Defective inlining
It sounds to me like he's admitting that the standard's requirements for inlining in C++ aren't actually defective, despite the headline of the section, but rather that link-time code generation is still a relatively new technology.

I suspect this is one of those people who writes about how other people have never measured the performance of inline functions, yet has never measured it himself. In other words, looking for points to rant about without being able to put his money where his mouth is. Many inline functions actually get compiled *to nothing*. That's the reason you can have complicated template mechanics that end up instantiating 800,000 template classes, which call 5,000,000 functions all over the place, and the generated assembly ends up being a couple of instructions. Did you know that if you try to inline a large function, the compiler will often refuse?

Implicitly called & generated functions
I suppose the alternative to implicitly generated functions is having objects be in an undefined state if, for example, you don't supply the constructor or assignment operator? Or maybe just having the program fail to compile, in which case the author would complain that you have to add lots of boilerplate code just to get a class to compile, even when you don't need it.

Maybe his problem is with the debugger he used. I've certainly never heard of the problem described here with seeing tons of assembly code in the debugger and having to manually reconstruct offsets.

That's it! I hope to see at least 400 new comments on the reddit thread now.



Why C++ Doesn't Suck

It seems like it's becoming easier and easier these days to find rants from people about why C++ sucks. To see how easy it is, just go to Google and type "C++ sucks" and you'll find tons of blogs, just like this one, albeit with people making the opposite argument to the one I'm going to make. Since there seem to be so many people explaining why C++ sucks, I figured I might as well play devil's advocate and argue why C++ doesn't suck.

First, we might as well establish before we even start that the point I'm arguing here (that C++ doesn't suck) is part meaningless and part obvious. What does it even mean to say that a language sucks? Sure, some things are easier in some languages than in others, and some languages are more prone to user error than others, but does that make the entire language suck? I think we should agree that for an entire language to suck, there must be no compelling reason to use it for any purpose in any industry. Such a language should excel at nothing and lag behind at everything.

With that out of the way, I think the most logical place to start is by responding to some common complaints about C++. One of the most well known opponents of C++ is Linus Torvalds. I don't necessarily think that just because Linux is his brainchild that his opinion automatically holds more weight than anyone else's opinion on why C++ sucks, but since he's the most well-known, I'll start there.

Linus: "It's made horrible by the fact that a lot of substandard programmers use it, to the point that it's much easier to generate complete and utter crap with it."
Are we talking about a property of the language here, or a property of the people using it? Haskell is more difficult than C++, but not a lot of substandard programmers use it. So is one to infer that the only reason Haskell is a good language and C++ is a bad language is because Haskell doesn't have substandard programmers?

How does Perl fit into this picture? It's just as easy to generate complete and utter crap with Perl, and the reason for that actually is a property of the language. No doubt a lot of people will just respond by saying that Perl also sucks, but far fewer than will say that C++ sucks.

Linus: "C++ leads to really really bad design choices. You invariably start using the "nice" library features of the language like STL and Boost and other total and utter crap, that may "help" you program, but causes:
- infinite amounts of pain when they don't work (and anybody who tells me that STL and especially Boost are stable and portable is just so full of BS that it's not even funny)
- inefficient abstracted programming models where two years down the road you notice that some abstraction wasn't very efficient, but now all your code depends on all the nice object models around it, and you cannot fix it without rewriting your app."

First of all, when he says "you" invariably start using... is this the substandard programmer "you" again? In any case, the source code of the STL is always available, and Boost is fully open source and peer reviewed. Furthermore, I hate to say something that may seem like a personal attack (it's definitely not), but some of the biggest contributors to Boost are probably quite a bit smarter than Linus. That's not to downplay him or his contributions to the community; on the contrary, it's to demonstrate how great and unbelievable the contributions of the Boost people have been.

I get the feeling that Linus hasn't used Boost very much. I don't blame him; after all, he hates C++, and Boost has a high learning curve, so what motivation could he possibly have for trying to learn it? But here, answer me this:
- What do you do when writing cross-platform code in C and you need to make heavy use of the filesystem? Can you make an API in C that handles this easily and is easy to use? Even most higher-level languages don't have detailed, complete filesystem APIs generic enough to be used across multiple platforms.
- What if you're writing an application built around an I/O completion port model of asynchronous I/O, and the app should work on multiple platforms? How long would it take you to create such a library in C that provided a single interface for clients of your library to use?

Well, maybe he can't buy any of these arguments because he said himself that Boost is neither portable nor stable (wrong on both counts, but oh well). What if you need to write a non-portable app that preprocesses C or C++? Oh hey, I can do that in a few lines of code with Boost.Wave.

Okay, enough about Linus's complaints with C++. How about some more general complaints that aren't directly about how C is better than C++?


Here is a blog that explains that C++ sucks because the author couldn't find an elegant solution to manage HMODULEs (which, for those who don't know anything about Windows, are sort of like a specialized type of file descriptor). I guess the author isn't familiar with reference counting, because a shared_ptr<> with a custom deleter solves this problem immediately with no extra work involved.
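A sketch of the idea with a stand-in handle type, since the real thing would pair HMODULE with ::FreeLibrary as the deleter (load_module/free_module here are invented placeholders for LoadLibrary/FreeLibrary):

```cpp
#include <memory>

static int g_open_handles = 0;   // tracks leaks for demonstration

struct FakeModule {};            // stands in for the OS handle type

FakeModule* load_module() { ++g_open_handles; return new FakeModule; }
void free_module(FakeModule* m) { --g_open_handles; delete m; }

// shared_ptr with a custom deleter: the handle is released exactly once,
// when the last owner goes away -- no manual bookkeeping anywhere.
int handles_after_use() {
    {
        std::shared_ptr<FakeModule> m(load_module(), free_module);
        std::shared_ptr<FakeModule> copy = m;  // shared ownership is fine
    }                                          // free_module runs here, once
    return g_open_handles;
}
```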


Here is a blog that explains that C++ (kind of) sucks for a bunch of reasons, all of which are, as usual, invalid. In this particular blog, the author is only comparing it with Java.

Ant vs. make -
Completely irrelevant. You can use Ant with C++ projects, and you can use make with Java projects.

Pointers - "Do you really need pointers? Do you really need to convert integers (memory addresses) to objects?"
Yes actually, I do. I guess the author has never worked on embedded software or real-time systems. Or games, for that matter. Or, I guess, anything other than canned business software.

Try making a game in C++ / DirectX and then try making the same game in XNA using C#. Make a scene with a single model and write some code that determines, when the user clicks the mouse, whether an object was intersected and, if so, which point on the object was hit. In C++ this is trivial: you have direct access to the vertex buffer, so you can simply offset into it, cast it to a pointer, and you're done. In XNA using C#, there is no good way to do this. The recommended way, which is pretty freaking awful (although admittedly very simple), is to write a fairly simple extension to the build system that augments the model with easily accessible vertex information. But here's the thing: the model already contains all the vertex information in the vertex buffer. This approach at a *minimum* doubles the amount of storage required for a model's vertex information. You're storing the exact same information twice.

"Are you willing to put up with the hassle of manually collecting garbage (using delete statements)?"
Actually, no, I'm not. That's why I don't do it in C++. Instead, I use reference-counted pointers and automatically scoped pointers about 99% of the time. I don't remember the last time I manually invoked the delete keyword, but I invoke the new keyword constantly.
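What that looks like in practice, using C++11's std::unique_ptr (the era's boost::scoped_ptr plays the same role): new appears, delete never does:

```cpp
#include <memory>
#include <string>

// Ownership lives in the smart pointer, so cleanup is automatic when it
// goes out of scope -- even if an exception is thrown in between.
std::string use_without_delete() {
    std::unique_ptr<std::string> s(new std::string("no leak"));
    return *s;   // *s is copied out; the owned string is deleted on return
}
```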

"How are you going to test that you don't have memory leaks?"
Sorry, are you trying to imply that Java and other managed languages don't have memory leaks? Because that is certainly not the case: if you leave a dangling reference around, you have a memory leak in Java or any other garbage-collected language. In any case, I'll test that I don't have memory leaks the same way I would in a managed language: with a profiler.

CPU Architecture and OS -
"What if the architecture you're working on uses 48-bit memory addresses and you want to port your program to an architecture that uses 64-bit memory addresses? Have you thought about how you're going to do that?"
Of course, although I think a better example would have been 32- and 64-bit. C++ has a sizeof operator that lets you determine how big a pointer is. Admittedly it requires more care than in managed languages, but I really don't remember the last time I saw an actual programmer using magic numbers in their code. I suppose it happens, though; maybe this is related to Linus's "substandard programmers"?
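For example, code that cares about address width can ask the compiler instead of hard-coding a number:

```cpp
#include <cstddef>

// sizeof reports the pointer width for whatever target you compile for,
// so serialization or allocation code can adapt instead of assuming 4 or 8.
std::size_t pointer_bits() { return sizeof(void*) * 8; }
```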

"What CPU are you programming for?"
Doesn't matter. We're talking about C++, not assembly.

"What OS?"
Rarely matters. I'm using Boost and the STL. The times when it does matter, Java wouldn't help me anyway. For example, my application deals with the filesystem, and I need to be able to manage all the types of objects that might exist on a filesystem. On Windows this means junctions, symbolic links, hard links, etc. On Linux this means pipes, sockets, block devices, char devices, etc. Does Java abstract all of this out for me so that I don't need to know which operating system I'm on?

"Are you going to use Unicode?"
One of the only points in this blog I agree with, but only a little. C++0x has full Unicode support.

JUnit - "Unit testing: How will you do it in C++?"
Probably using Boost.Test.

Graphical User Interface -
"Suppose you want a GUI for your application. Which one will you use? Gtk, Qt, Win32, MFC, .NET?"
I'm not sure if this is a serious question. I thought we were talking about C++; why is .NET listed? Win32 and MFC are clearly platform-specific, so I obviously won't be using them if cross-platform is a concern. Gtk and Qt are both fine and both do a decent job of emulating a native GUI. So, again, I'm not sure if this is a serious question.

Perhaps an even better answer is in order, though. I won't use any GUI toolkit, because I won't develop GUI code in C++. Why would I use the language for something at which it doesn't excel? That's stupid. A good programmer knows how to use many different tools and picks the one most appropriate for the job. If I only cared about Windows, I'd use a .NET GUI and invoke my C++ code through C++/CLI or P/Invoke. If I needed a cross-platform GUI, I'd write it in Java and use JNI. There are plenty of things Java doesn't excel at either. Why aren't you talking about how to write Java code that deals with unsigned types, whether they're stored in a database, a file, or coming from the network or some server? Probably because it's a freaking nightmare in Java, and of course there are languages much more suitable for this type of work if it's a huge part of your application.

Web Applications -
"What are you going to do about web applications?"
Nothing. I'll use .NET, Java, or something else more appropriate. Again with the one-size-fits-all.

Dynamic Linking - "Suppose you want to enable your application to have plugins, so that other developers can contribute parts of your application without seeing the core application's source code. How exactly are you going to do this? In Windows, you can use DLLs. What about in Linux? How are you going to distinguish between the two?"
The more I read this, the more I realize the author just doesn't have any experience with C++. We're working on the assumption that this app is cross-platform. So how hard is it to require that the plugin be cross-platform too?

In this hypothetical situation I'm already producing a different binary for Windows and Linux, each of which uses some platform-specific code (even though, except in specialized situations, all that platform-specific code is hidden in the depths of Boost, the STL, or some other third-party library). Is it that hard to just say that the plugin compiles on both platforms as well? It then exposes a simple C interface of routines that the main application dynamically links to (LoadLibrary/GetProcAddress on Windows, dlopen/dlsym on Linux). Problem solved, simple.
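A sketch of the loading half on the POSIX side (on Windows the same shape uses LoadLibrary/GetProcAddress). Here libm serves as a stand-in "plugin" and cos as its exported routine, since a real plugin would export its own extern "C" entry points:

```cpp
#include <dlfcn.h>   // POSIX dynamic loading: dlopen/dlsym/dlclose

// Load a shared library at runtime, resolve one C symbol from it, call
// it, and unload. Returns -1.0 if the library or symbol is unavailable.
double call_plugin_cos(double x) {
    void* plugin = dlopen("libm.so.6", RTLD_NOW);
    if (!plugin) return -1.0;
    typedef double (*cos_fn)(double);
    cos_fn f = reinterpret_cast<cos_fn>(dlsym(plugin, "cos"));
    double result = f ? f(x) : -1.0;
    dlclose(plugin);
    return result;
}
```

The main application only ever sees function pointers resolved by name, so the plugin's internals stay completely opaque.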

Exceptions vs. Core Dumps -
"Would you rather receive an exception or a core dump when something goes seriously wrong in your application?"
Obviously a core dump. But is this a hypothetical application that has no logging, where we must make a mutually exclusive choice between exceptions and core dumps? Any production-quality app would be writing to a log file and keeping track of anything unexpected. That said, I agree core dumps are more useful if you have to pick one or the other. But this poster seems to be arguing that if Java is better at anything, then C++ must be terrible. By that same logic, if C++ is better at anything, then Java must be terrible. Well, C++ is better at things. One example is interfacing directly with the OS.

In conclusion, neither language is terrible. They just excel at different things.

I think I should at least mention a few of my own thoughts on why C++ doesn't suck -- some advantages of C++ over other languages if you will.

C++ has a very strong, flexible type system.
Describing a type system as 'painfully' strict certainly doesn't instill confidence in how awesome it is, but in the case of a type system, the stronger the better. The more the compiler can do for you, the better the code will be. Sure, there are certain problems that just flat out can't be solved at compile time, like catching buffer overruns, but it's well known that type problems represent a massive percentage of logic errors.

I know what you're thinking. "Are you f**king kidding me? You can cast arbitrary objects to pointers of any other type, and you call that strong?" Well, err.. No actually. That's a clear violation of a strong type system. But I've always subscribed to the philosophy that when you need a jackhammer, it's really nice to have a jackhammer sitting around, even if it's dusty.

So what do I mean by a strong type system then? The const keyword is one of the most excellent examples. It's one of the reasons I feel uncomfortable in almost every other language, no matter how well I understand the syntax and paradigms. It is really, really, really useful to have the compiler enforce that the state of certain objects cannot be modified. I absolutely hate the fact that Java and C# don't have an equivalent of this.
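The guarantee is concrete: a const reference parameter makes mutation a compile error, not a convention (pretty_print here is just an illustrative function):

```cpp
#include <string>

// The compiler enforces that s cannot be modified through this reference;
// callers get a guarantee that Java and C# simply cannot express.
std::string pretty_print(const std::string& s) {
    // s += "!";   // would not compile: s is const
    return "[" + s + "]";
}
```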

Haskell and ML get this automatically since they're functional languages: you don't modify values anyway, you create new ones, and this is such a powerful feature for the same reason that functional languages are so powerful. Not being able to modify state is actually the fundamental differentiating factor of functional languages and what makes them as unique and powerful as they are, and in a way the 'const' keyword in C++ brings us ever so slightly closer to this.

From a design standpoint, it's terrible knowing that, whenever I pass an instance of a class to a function, I have no guarantee about the state of that object after the function returns. I'd love to hear about other languages that support this kind of type checking; I can't think of any off the top of my head.

Code Generation
I've already talked about this a little above, but code generation really takes generic programming to the next level. Of course, there is always a tradeoff. In C++ the tradeoff is difficult-to-diagnose error messages and easily misunderstood or misused features.

Difficult-to-diagnose error messages should eventually be a thing of the past, particularly if Concepts ever make it into C++0x or a technical report, although they're definitely annoying in the meantime. With some practice, though, you begin to be able to skip over all the BS and get to the meat of an error message, especially if you frequently use the same compiler.

Misunderstanding and misuse of features is the biggest issue when it comes to generic programming in C++, and it can definitely be a project killer in extreme cases (I've worked on projects that were literally killed because of this). Coding standards should be tailored to the lowest common denominator of programmer on a given team. C++ is a more advanced language than it was six or seven years ago, and I agree it's just not appropriate for certain people to be using all of its functionality.

The real issue here is that "old-style" C++ programmers are still applying old-style practices to codebases in which new-style C++ methodologies are being used. This is a somewhat fundamental problem; in some respects it would have been nice if the current evolution of C++ were a completely new language, simply not backwards compatible with old C++. I agree this is a real problem, and probably the biggest one facing C++ currently. The best solution, until we see what the future holds, is to know the programmers on your team.

So, where is all of this leading? I concede that C++ is growing increasingly difficult to master and that it takes a certain breed of programmer to deal with all of its intricacies. I too have been burned by the terrible template programmer whose bad template design left me rewriting his entire framework. Does this mean C++ sucks? No, it just means that your coding standard should be somewhat tied to the lowest common denominator of C++ programmer on your team.

On that note, nobody, and I mean nobody, should be using C++ without Boost. I'm serious: it should be just as expected to use Boost as it is to use classes and object-oriented design. I'm not going to sit around and argue the pros and cons of object-oriented design (it has both, just like anything else in the world), so if you reject the idea that objects are useful, be my guest.

Is Boost complicated? Yes, it is. But it is possible to realize massive benefits from Boost while using only an extremely small subset of the available functionality. If all you use is boost::shared_ptr, you already gain massive benefits. It's honestly a little insane to completely reject Boost.

C++0x takes this one step further and eliminates even more of the complexity currently involved in using C++. Things like shared_ptr and threads become part of the standard. Unicode support is built into the language.

In the past, C++'s main advantage has been viewed as being very close to the OS while still providing a reasonable model of object-oriented design. I would say this is changing: C++'s main strength is starting to become its ability to express generic code. Even better, you don't have to be an expert at *writing* generic code to take advantage of this. Leave that to the academics and the library designers if you think it's too complicated; all you have to do is use what's already out there. If you need neither generic code nor closeness to the operating system, obviously there's going to be a more appropriate language for you. If you need both (or even just one), then C++ is still very appropriate. I think you'll be hard pressed to find an application that doesn't benefit from generic programming, though.

C++ is certainly rather verbose at expressing highly generic code, much more so than its highly generic competitors like Haskell or ML, so if *all* you care about is genericity, one of those languages might be more appropriate. But in some ways C++ is actually more generic than those languages, thanks to its code generation abilities. And those languages are completely dead in the water if you need to program close to the operating system, as are most other languages except C. C, in turn, gets even uglier and more verbose when you start trying to go cross-platform. Even if you aren't an expert at creating generic code with C++ templates and boost::mpl, it doesn't take a genius to be an expert at *using* generic code.



Debugging Visual Studio 2010

I've been spending quite a bit of time recently exploring the Visual Studio 2010 beta. I'm impressed by many of its features, but that's not to say it doesn't have bugs. One of the most frustrating bugs I've encountered completely hangs my machine. It happens regularly (I've experienced it over 50 times), and the only way to remedy the situation is to do a hard power-off of the machine and reboot.

So, what do we know about the bug? For one thing, it happens sporadically, usually right when I initiate a C++ project build, but there have been occasions where it happened at other times. Second, the machine becomes completely unusable when it happens, and windows stop drawing. Third, the machine's CPU spikes to 100% (given observation 2, the only way to see this is to look at the host's CPU when the bug happens inside a VM). So I'm going to look at this problem in WinDbg a little and see if I can't figure out what's wrong.

So where to begin? Surely the problem must be somehow related to Visual Studio. Since the CPU was stuck at 100%, I suspected either an infinite loop or a spinlock. The first step is to see if any process stands out as having far more CPU time than normal. I could have blindly drilled directly down into the Visual Studio processes, but it could always be a driver problem, or something running in the System process. A quick scan of processes showed the following:

0: kd> !process 0 f
PROCESS 822338b0 SessionId: 0 Cid: 0250 Peb: 7ffde000 ParentCid: 0294
            DirBase: 02a50380 ObjectTable: e264ffb8 HandleCount: 33104.
            Image: vcpkgsrv.exe
            VadRoot 823b4138 Vads 1774 Clone 0 Private 8274. Modified 77725. Locked 0.
            DeviceMap e1d5cac8
            Token e17a83c8
            ElapsedTime 01:07:16.912
            UserTime 00:01:57.812
            KernelTime 00:16:32.203
            QuotaPoolUsage[PagedPool] 550980
            QuotaPoolUsage[NonPagedPool] 71840
            Working Set Sizes (now,min,max) (33950, 50, 345) (135800KB, 200KB, 1380KB)
            PeakWorkingSetSize 43646
            VirtualSize 249 Mb
            PeakVirtualSize 271 Mb
            PageFaultCount 882867
            MemoryPriority BACKGROUND
            BasePriority 8
            CommitCharge 8860

Sixteen minutes of kernel-mode CPU time certainly seems interesting, especially since the computer had frozen about 15 minutes prior to this and I had been letting it run the whole time. If my guess is right, there's one thread that's used almost all of this time.

Sure enough, listing threads for this process I find the following:

0: kd> !threads 822338b0  f
THREAD 820ca2b0 Cid 0250.0408 Teb: 7ffdb000 Win32Thread: 00000000 RUNNING on processor 0
            Not impersonating
            DeviceMap e1d5cac8
            Owning Process 0 Image:
            Attached Process 822338b0 Image: vcpkgsrv.exe
            Wait Start TickCount 553117 Ticks: 4 (0:00:00:00.062)
            Context Switch Count 68260
            UserTime 00:00:03.125
            KernelTime 00:15:57.750
            Win32 Start Address cpfe!a_compiler_thread::compiler_thread_routine (0x3fb7582c)
            Start Address kernel32!BaseThreadStartThunk (0x7c8106f9)
            Stack Init f6be6000 Current f6be5a30 Base f6be6000 Limit f6be3000 Call 0
            Priority 8 BasePriority 8 PriorityDecrement 0 DecrementCount 16
            ChildEBP RetAddr Args to Child
            f6be5a6c 80545129 00000001 00000000 000000d1 nt!RtlpBreakWithStatusInstruction
            f6be5a6c 805cf8ef 00000001 00000000 000000d1 nt!KeUpdateSystemTime+0x175
            f6be5b00 805bc91f 55d9c6a8 00034828 00000670 nt!PsChargeSharedPoolQuota+0x5d
            f6be5b1c 805bd1fe e2814088 825c23d8 f6be5b75 nt!ObpChargeQuotaForObject+0x6f
            f6be5b90 805bda20 f6be5bf0 822338b0 e28140a0
            f6be5be4 805c3052 e28140a0 000f0007 00000001 nt!ObpCreateUnnamedHandle+0x86
            f6be5cd8 805ab50a e28140a0 00000000 000f0007 nt!ObInsertObject+0xb0
            f6be5d40 8054162c 00fceb54 000f0007 00000000 nt!NtCreateSection+0x15c
            f6be5d40 7c90e514 00fceb54 000f0007 00000000 nt!KiFastCallEntry+0xfc
            00fceaf8 7c90d18a 7c8094e5 00fceb54 000f0007 ntdll!KiFastSystemCallRet
            00fceafc 7c8094e5 00fceb54 000f0007 00000000 ntdll!ZwCreateSection+0xc
            00fceb58 7c80955f 0000016c 00000000 00000004 kernel32!CreateFileMappingW+0x10b
            00fceb84 3fbeb857 0000016c 00000000 00000004 kernel32!CreateFileMappingA+0x6e
            00fcebbc 3fc2b88c 00000000 000000a4 00010000 cpfe!map_file_region+0x67
            00fcebd0 3fb95781 00010000 00010000 000000a0 cpfe!alloc_new_mem_block+0x36
            00fcebec 3fb14302 00000001 00000000 00000000 cpfe!_setjmp3+0x15dc7
            00fcec14 3fb159c3 00000098 278bf7f8 278c7d54 cpfe!check_operator_function_params+0x5f2
            00fcec44 3fb24f03 00000006 00009e38 278bf7f8
            00fcecdc 3fb57bbc 278bf7f8 00000003 00000003 cpfe!scan_class_definition+0x1a3
            00fced58 3fb48afb 278c7c70 00fcee60 00000000 cpfe!alloc_decl_position_supplement+0x245
            00fcee20 3fb396ca 00000000 02c30420 00000000 cpfe!required_token+0x55a
            00fcf148 3fb4f8da 00fcf1d4 00000000 00000000 cpfe!proc_define+0x252a
            00fcf178 3fb16edb 00fcf1d4 00000000 00000000 cpfe!template_directive_or_declaration+0x97
            00fcf1a4 3fb21310 00fcf250 00000000 00fcf1d4 cpfe!is_type_start+0x3db
            00fcf398 3fb55190 00000001 00000000 00000000 cpfe!declaration+0x250
            00fcf454 3fb16f53 00fcf4ac 00000000 00000290 cpfe!scan_void_expression+0x3ec
            00fcf47c 3fb21310 00fcf528 00000000 00fcf4ac cpfe!is_type_start+0x453
            00fcf670 3fb55190 00000001 00000000 00000000 cpfe!declaration+0x250
            00fcf72c 3fb16f53 00fcf784 00000000 00000148 cpfe!scan_void_expression+0x3ec
            00fcf754 3fb21310 00fcf800 00000000 00fcf784 cpfe!is_type_start+0x453
            00fcf948 3fb55190 00000001 00000000 00000000 cpfe!declaration+0x250
            00fcfa04 3fb16f53 00fcfa5c 00000000 00000000 cpfe!scan_void_expression+0x3ec
            00fcfa2c 3fb21310 00fcfad8 00000000 00fcfa5c cpfe!is_type_start+0x453
            00fcfc20 3fb5461c 00000001 00000000 00000001 cpfe!declaration+0x250
            00fcfc40 3fb6c891 00000001 00000000 00448bc8 cpfe!translation_unit+0x68
            00fcfc50 3fb6c946 00000000 00fcfcf8 00fcfcd8 cpfe!process_translation_unit+0xf4
            00fcfc80 3fb702d9 0000003b 004546b0 00fcfd54 cpfe!process_translation_unit+0x1a9
            00fcfc90 3fad8c41 0000003b 004546b0 00000000 cpfe!edg_main+0x2e
            00fcfd54 3fb75a9d 0000003c 01851858 00000000 cpfe!InvokeCompilerPassW+0x46b
            00fcfda4 3fb75a1a 00000025 0045fd10 b1b2ece7 cpfe!edge_compiler_main+0x50

15 minutes and 57 seconds of kernel-mode time used, and just my luck, it's in the Visual Studio code. It appears to be in some kind of code that's parsing my C++ in the background, perhaps looking for syntax errors or to enable IntelliSense. Nothing seems particularly unusual about the callstack, though; it doesn't appear to be stuck in a spin lock like I originally guessed. So I wonder if it's stuck at all.

At this point we don't know if it's stuck in kernel mode, or if the parsing code is stuck in some sort of loop calling an expensive kernel mode function over and over. If it does ever return from kernel mode, it should eventually hit the address 3fbeb857, which you can observe above is the return address of CreateFileMapping back into the Visual Studio code. So I put a breakpoint at 0x3fbeb857 and hit run. The breakpoint never gets hit.

So it is stuck in kernel mode. Perhaps it would be useful to know what file it's trying to map, since it's obviously unable to map it. Referring back to the callstack, the first argument of CreateFileMapping is 0000016c, which the MSDN documentation states is a handle to the file being mapped. Luckily this means it's easy to get information about this handle.

0: kd> !handle 16c f
processor number 0, process 822338b0
PROCESS 822338b0 SessionId: 0 Cid: 0250 Peb: 7ffde000 ParentCid: 0294
            DirBase: 02a50380 ObjectTable: e264ffb8 HandleCount: 33104.
            Image: vcpkgsrv.exe

Handle table at e18b9000 with 33104 Entries in use
016c: Object: 81f4b808 GrantedAccess: 0013019f Entry: e10402d8
            Object: 81f4b808 Type: (825e9ad0) File
            ObjectHeader: 81f4b7f0 (old version)
                        HandleCount: 1 PointerCount: 1
                        Directory Object: 00000000 Name:
                                    \DOCUME~1\ADMINI~1\LOCALS~1\Temp\edg41.tmp {HarddiskVolume1}

Ahaaa. Err, not really. At least it's more information, but this doesn't help too much.

At this point it might be useful to see a call trace. A crude way to do this is to just hit F5 (go), break a few times, and see where you end up. When I did this about 5 times, I only ever ended up inside a total of 3 different functions: nt!PspExpandQuota, nt!PsChargeSharedPoolQuota, and nt!MmRaisePoolQuota.

A cursory disassembly of these 3 functions shows that a certain code path of nt!PsChargeSharedPoolQuota calls nt!PspExpandQuota, which in turn has a certain code path that calls nt!MmRaisePoolQuota. There doesn't seem to be any indirect recursion between these 3 functions, so we can rule that out and instead focus on finding the highest function in the callstack that ever returns. An obvious first guess is nt!PsChargeSharedPoolQuota, just based on the call graph. Putting a breakpoint on the ret instruction of that function reveals that it is never reached.

So, we're getting close. We've identified an infinite loop in the function nt!PsChargeSharedPoolQuota. Perhaps PspExpandQuota is not behaving according to one of PsChargeSharedPoolQuota's assumptions. At this point things become a little bit more difficult. None of these 3 functions are documented, so if we want to get any further we're going to have to do some reverse engineering of the disassembly to see what these functions are doing.

To get a slightly better call trace, we can set breakpoints at each call instruction and immediately after each call instruction, printing out what's about to happen or what just happened, along with the arguments or return value. By disassembling these functions you can see how many arguments are being passed to each one simply by looking at how many arguments are pushed on the stack. For example,

bp nt!PsChargeSharedPoolQuota+0x5d ".echo About to call PspExpandQuota. Arguments = ; dd /c 5 esp esp+0x10; g"

will print a message displaying all the arguments being passed to PspExpandQuota right before the function is invoked. /c 5 means 5 columns (since there are 5 arguments), and I've used the value 0x10 because this causes dd to display the values {esp, esp+4, esp+8, esp+0xC, esp+0x10}, which are the 5 arguments we need. When I did this I found that it was repeating the exact same sequence of PspExpandQuota -> MmRaisePoolQuota over and over in an infinite loop, with the exact same arguments every time.

This certainly explains the CPU spike, and the unusability of the machine. 

So why did it lock the entire machine? Well, a quick examination of !stacks shows every other thread in the system is blocked, even though I was using 2 virtual CPUs. Using !locks I then found that thread 822338b0 was the exclusive owner of an ERESOURCE lock, and that there were 4 threads waiting on this lock. Recall that thread 822338b0 is the same thread stuck in this infinite loop. I didn't examine further because this was all I really needed to know.

If I have some time in the next few days I may try to make a post walking through reverse engineering PspExpandQuota and MmRaisePoolQuota, determining their function signatures, figuring out the structure of the code, and, if possible, reversing only the infinite-loop portion of PsChargeSharedPoolQuota to see what could be causing it.



Not all bytes are created equal.

Suppose you're trying to read a really large file. By really large I mean a few hundred GB. What's the first solution that comes to mind? Presumably the simplest one: open the file and start reading. So you go with this and begin implementing it. You're careful about efficiency, because with such a large file it's obviously going to matter. So you put in some periodic logging to show you how fast the file is being read, start it up, and verify that the performance is stable and good. Then you go get a cup of coffee and have a talk over the water machine, only to come back and find that the performance is about 40% of what it was when you left.

WHAT IS GOING ON? Hopefully, to save you 3-4 days of fruitless investigation, I can answer this question for you. Not all parts of your disk are created equal. It makes perfect sense once you understand why (doesn't everything?) but it can be a real shocker to bear live witness to such a drastic degradation of performance simply by performing an everyday operation like reading a file. Normally you just copy a file using Windows Explorer, your Unix shell, or some other method, and you never see the live transfer rate. If it's an FTP transfer or an upload you do see the transfer rate, but the file is never big enough to actually witness it degrade over time.

To understand why this happens, all you have to do is consider the rotational mechanics of a (non-SSD) hard drive.

There are platters, which are discs that contain data on both sides. Each side is organized into concentric circles called tracks. Each track contains a certain number of sectors, and each sector contains a certain number of bits to store data.

3.5" is a very common number for the outer track radius, and 1.5" is a common number for the inner track radius. This is a ratio of about 2.33, so that means the outermost track is about 2.33x as long as the innermost track.

There are two possible ways to organize the tracks on a platter:

  1. All tracks contain the same number of sectors
  2. Tracks contain variable numbers of sectors
Under the first method, there is a ton of wasted disk capacity on the outer tracks. Since the number of bytes / sector is fixed (usually 512), and since as you move toward the center of the disk the circumference of the tracks gets smaller and smaller, in order to organize a disk according to method 1 above you must pack the bits tighter and tighter into the tracks. If you're doing this, however, then you could have packed the outer tracks just as tightly and gotten more capacity. This would then imply the second method, variable numbers of sectors on each track.

Thus there is also more data on the outer tracks than on the inner tracks. Since a single platter is spinning at a fixed angular velocity (e.g. 7,200 RPM), you can read data faster from the outside. Not surprisingly, you will read data about 2.33x faster on the outside than on the inside.

Remember this next time you're profiling disk throughput and save yourself 2-3 days of headaches :)



Asynchronous I/O Using Boost

One of the things I've been spending a lot of time on lately is measuring performance and scalability of asynchronous disk I/O versus synchronous I/O for an application that reads and processes gigabytes (or even terabytes) of data off of various types of filesystems such as NTFS, ext, etc. The current code we have that handles the I/O reading is completely synchronous, and uses a rather naive memory model. Imagine something like this (simplified for the sake of brevity):

boost::uint32_t chunk_size = 4096; //read 4k at a time
boost::uint64_t remaining = get_byte_count();
boost::uint64_t offset = 0;

while (remaining > 0)
{
     boost::uint32_t read_bytes = static_cast<boost::uint32_t>(
         std::min<boost::uint64_t>(chunk_size, remaining));
     std::vector<char> read_buf(read_bytes);
     read_data(offset, read_bytes, &read_buf[0]);

     //process the data
     offset += read_bytes;
     remaining -= read_bytes;
}

Obviously this is silly for a lot of reasons, most notably the re-allocation of a 4KB buffer on every iteration through the loop. It also doesn't allow the processing of one chunk of data to happen while reading the next chunk. If the processing is something that takes a long time, like encrypting or compressing the data, you can basically get all of it for free, since disk I/O is usually pretty slow.

Anyway, for the code above, in the more complex setting in which I'm actually using it, with a SATA 3.0 Gb/s interface and a purported maximum sustained data transfer rate of 78MB/s, I get 25MB/s. Pretty god awful.

I tried to do some benchmarking using Intel Thread Profiler and MicroFocus DevPartner Studio, but both of them tanked when profiling my code for more than a few seconds. I think this was in part due to the producer / consumer model that we had in our legacy codebase, where "producing" involved generating a 4KB chunk of data, and "consuming" involved sending the same 4KB chunk of data across a network connection. Reading this much data happens so fast that we were generating millions and millions of context switches. So many context switches, in fact, that Intel Thread Profiler would die, I guess because it couldn't deal with context switches happening constantly every few microseconds.

I decided to try an asynchronous approach. I didn't think it would buy much in this limited setting because the processing of the data had been left commented out just to get a theoretical upper bound on how fast I could be doing stuff. My initial attempt at this was to use I/O Completion Ports directly through the Windows API. The I/O Completion Model is basically Windows' answer to high performance scalable asynchronous I/O. There are ultimately 4 different categories of I/O:

* synchronous blocking I/O

* synchronous non-blocking I/O

* asynchronous blocking I/O

* asynchronous non-blocking I/O

I/O Completion Ports are the 4th type. For details on how to support I/O Completion Ports in your own application please refer to the MSDN documentation or to Jeffrey Richter's excellent book Windows Via C/C++. From a high level, the programming model using IOCP can be expressed using the following pseudo-code:

//Define constants for the events that we care about.
//We use this code when beginning an operation on an
//IOCP, and IOCP uses this code to notify us of what
//type of operation completed.
typedef enum {
    CustomEvent1,
    CustomEvent2
} IocpEventType;

IOCP port = CreateIoCompletionPort(/*...*/);
HANDLE device = CreateFile(/*...*/);

IocpAssociateDeviceEvent(port, device, CustomEvent1);
IocpAssociateDeviceEvent(port, device, CustomEvent2);

while (!done)
{
    IocpEventType what = GetQueuedCompletionStatus(port, /*...*/);
    switch (what)
    {
    case CustomEvent1:
        /*If this was, for example, a read, we should begin
          an asynchronous write to save the data*/
        break;
    case CustomEvent2:
        //Perhaps a write completed, in which case we can
        //begin a new read
        break;
    }
}
The beauty of this is that non-I/O-related tasks can easily be fit into this model. In the handling of CustomEvent1, maybe we want to encrypt the data. Before the loop begins we can initialize a thread pool (Windows even provides built-in worker thread pool support) and post a message to a worker thread telling it there is new data to work on. When it's done working on the data, it simply calls PostQueuedCompletionStatus() with a different enumeration value, such as EncryptionEvent.

Benchmarking this solution showed that I was now reading off the disk at a sustained speed of about 35 MB/s, with occasional burst speeds of 150-200 MB/s. This was exciting! But not good enough. There are still a few concerns:

1. Portability is a requirement of this application. It must work on Linux.

2. Although I'm not always reading from the disk sequentially, I am reading sequentially more often than not. If I can determine that the next 1MB of data will all be sequential, why should I read 4KB at a time? Or if only the next 8KB of data is sequential, I can still read 2 chunks at a time. How should I determine exactly how much to read?

3. Why was the asynchronous model faster? In neither of the benchmarks was there a significant amount of computation going on with the data. There was the poor buffer usage in the first example, but certainly that doesn't account for 10MB/s of lost throughput.

To address the first concern I began looking into Boost.Asio. Although all of the documentation, samples, and pre-built classes are geared towards network usage, I saw no reason I could not fit the model to work with disk I/O, or for that matter arbitrary computations.

In my next post I will give a brief tutorial of Boost.Asio (the documentation leaves a little bit to be desired), what I had to do to fit the model I needed around the Boost.Asio model, and the performance I was able to get out of Boost.Asio. After that, I will discuss how I addressed the second issue, as well as a nice use of inline assembly that, combined with the asynchronous model, allowed me to achieve a massive performance increase for these reads. Finally I offer some thoughts on the 3rd point.

