6.22.2009

Not all bytes are created equal.

Suppose you're trying to read a really large file. By really large I mean a few hundred GB. What's the first solution that comes to mind? Presumably it's the simplest one: open the file and start reading. So you go with this and begin implementing it. You're careful about efficiency, because with such a large file it's obviously going to matter. So you put in some periodic logging to show how fast the file is being read, start it up, and verify that the performance is stable and good. Then you go get a cup of coffee and have a chat at the water cooler, only to come back and find that the performance is about 40% of what it was when you left.
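For reference, the "open the file and start reading, with periodic logging" approach might look something like the sketch below. This is not the code I actually ran; the function name, chunk size, and logging interval are all made up for illustration.

```python
import time


def read_with_throughput_log(path, chunk_size=4 * 1024 * 1024,
                             log_every=256 * 1024 * 1024, report=print):
    """Read `path` front to back and report MB/s for each `log_every` bytes.

    All names and default sizes are illustrative assumptions.
    Returns the total number of bytes read.
    """
    total = 0
    window_bytes = 0
    window_start = time.monotonic()
    # buffering=0 skips Python's own buffer; the OS page cache still applies,
    # so use a file much larger than RAM if you want to measure the disk.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            total += len(chunk)
            window_bytes += len(chunk)
            if window_bytes >= log_every:
                elapsed = time.monotonic() - window_start
                rate = window_bytes / max(elapsed, 1e-9) / 1e6
                report(f"offset {total / 1e9:.2f} GB: {rate:.1f} MB/s")
                window_bytes = 0
                window_start = time.monotonic()
    return total
```

Logging the rate per window, rather than the average since the start, is what makes a slowdown visible: a running average hides it for a long time.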

WHAT IS GOING ON? Hopefully I can save you 3-4 days of fruitless investigation by answering this question for you. Not all parts of your disk are created equal. It makes perfect sense once you understand why (doesn't everything?), but it can be a real shocker to witness such a drastic degradation of performance live, during an everyday operation like reading a file. Normally you just copy a file using Windows Explorer, your Unix shell, or some other method, and you never see the live transfer rate. If it's an FTP transfer or an upload you do see the transfer rate, but the file is never big enough for you to actually watch it degrade over time.

To understand why this happens, all you have to do is consider the rotational mechanics of a (non-SSD) hard drive.

There are platters, which are discs that contain data on both sides. Each side is organized into concentric circles called tracks. Each track contains a certain number of sectors, and each sector contains a certain number of bits to store data.

3.5" is a very common platter size (it's a diameter, not a radius), and 1.5" is a plausible diameter for the innermost track. Since circumference scales with diameter, that's a ratio of about 2.33, which means the outermost track is about 2.33x as long as the innermost track.

There are two possible ways to organize the tracks on a platter:

  1. All tracks contain the same number of sectors
  2. Tracks contain variable numbers of sectors

Under the first method, there is a ton of wasted disk capacity on the outer tracks. Since the number of bytes / sector is fixed (usually 512), and since as you move toward the center of the disk the circumference of the tracks gets smaller and smaller, in order to organize a disk according to method 1 above you must pack the bits tighter and tighter into the tracks. If you're doing this, however, then you could have packed the outer tracks just as tightly and gotten more capacity. This would then imply the second method, variable numbers of sectors on each track.
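To put a rough number on how much capacity method 1 wastes, here is a small sketch that counts sectors under both layouts. The track count and the bit density are made-up values, and the 1.5" and 3.5" figures are treated as track diameters; only their ratio matters for the comparison.

```python
import math


def sectors_on_track(diameter_in, bits_per_inch, bytes_per_sector=512):
    """Whole sectors that fit on one track at a given linear bit density."""
    circumference = math.pi * diameter_in
    return int(circumference * bits_per_inch // (bytes_per_sector * 8))


def compare_layouts(inner=1.5, outer=3.5, tracks=1000, bits_per_inch=500_000):
    """Return (method_1_sectors, method_2_sectors) for one platter side.

    `tracks` and `bits_per_inch` are hypothetical; tracks are assumed to be
    evenly spaced between the inner and outer diameters.
    """
    sizes = [inner + (outer - inner) * i / (tracks - 1) for i in range(tracks)]
    # Method 1: every track holds only what the innermost track can hold.
    fixed = tracks * sectors_on_track(inner, bits_per_inch)
    # Method 2: every track is packed at the same density, so longer tracks hold more.
    variable = sum(sectors_on_track(d, bits_per_inch) for d in sizes)
    return fixed, variable


fixed, variable = compare_layouts()
print(f"method 2 stores about {variable / fixed:.2f}x as much as method 1")
```

Under these assumptions the gain works out to roughly the average track length divided by the innermost track length, i.e. about (1.5 + 3.5) / 2 / 1.5.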

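Thus there is also more data on the outer tracks than on the inner tracks. Since a single platter is spinning at a fixed angular velocity (e.g. 7,200 RPM), a full outer track and a full inner track pass under the head in the same amount of time, so you can read data faster from the outside. Not surprisingly, with the numbers above you will read data about 2.33x faster from the outermost track than from the innermost. Drives typically number their blocks from the outer edge inward, so a long sequential read tends to start on the fast outer tracks and drift toward the slower inner ones, which is exactly the slowdown that shows up while you're off getting coffee.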
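The arithmetic is easy to sketch. The sector count for the innermost track below is a made-up figure, and the model ignores seeks, caching, and everything else except rotation: it assumes one full track is read per revolution.

```python
def track_read_rate_mb_s(track_bytes, rpm=7200):
    """MB/s when one full track passes under the head per revolution."""
    revolutions_per_second = rpm / 60
    return track_bytes * revolutions_per_second / 1e6


# Hypothetical innermost track: 500 sectors of 512 bytes each.
inner_track_bytes = 500 * 512
# The outermost track is 3.5 / 1.5 times as long, so at the same bit density
# it holds that many times the data.
outer_track_bytes = int(inner_track_bytes * 3.5 / 1.5)

inner_rate = track_read_rate_mb_s(inner_track_bytes)
outer_rate = track_read_rate_mb_s(outer_track_bytes)
print(f"inner: {inner_rate:.1f} MB/s, outer: {outer_rate:.1f} MB/s, "
      f"ratio: {outer_rate / inner_rate:.2f}x")
```

Whatever sector count you plug in, the ratio stays at about 2.33, because the RPM cancels out and only the track lengths differ.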



Remember this next time you're profiling disk throughput and save yourself 3-4 days of headaches :)
