CS 111 Operating Systems Principles, Fall 2006

Lecture 15 Notes: File System Atomicity and Disk Scheduling

by Jesse Chen, Victoria Pan, and Matt Esquivel

modified by Jonathan Chang
December 10, 2006

Incommensurate Scaling

Accessing data stored on hard disks is a common but costly operation for many applications, and techniques for improving performance in this area must adapt as technology advances. Due to incommensurate scaling, processor speeds improve more quickly than disk speeds (they do not scale at the same rate). As a result, programmers must compensate for this imbalance with techniques that improve performance while maintaining robustness, neutrality, and simplicity.

Disk Speeds

To figure out how to improve communication between the processor and the disk, we must first understand how a disk operates and have a way to measure the performance of different strategies. [Figure: a typical hard drive. The example specs used below are a 24 ms full-stroke seek, 7200 RPM rotation, and a 66 MB/s sustained transfer rate.]

Calculating Disk Latency

A hard disk consists of many circular platters stacked on top of each other (mmm... pancakes). Each platter has many concentric rings, known as tracks, where data can be stored. A mechanical arm terminated with a magnetic read head hovers over the track containing the desired data and reads it. To read data the drive must do the following:

  1. Move the head to the track where the data is located (Seek)
  2. Wait for the data to rotate under the read head (Rotational Latency)
  3. Read the data as it passes under the head (Transfer)

Seek Time
The total seek time is the time it takes the read head to move from the outermost track to the innermost track. In this example our total seek time is 24 ms. Note, however, that the average seek time is NOT half of 24 ms as one might suspect. This is because both the starting track of the head and the track of the data to be read are random. If the read head is on an outermost track, the average seek to a random track takes 12 ms; but if the head happens to be on a middle track, the average seek takes only 6 ms. Considering both a random head location and a random target track, the average seek time is one third of 24 ms, or 8 ms.
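Where does the one-third factor come from? Here is the standard derivation (not in the original notes), assuming the head position X and the target track Y are independent and uniformly distributed over the disk, scaled to [0, 1]:

E[\,|X - Y|\,] = \int_0^1 \int_0^1 |x - y| \, dx \, dy = \frac{1}{3}

so the average seek covers one third of the full stroke, giving 24 ms / 3 = 8 ms.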

Rotational Latency
The platters spin at a very fast rate: a 7200 RPM drive makes one revolution in 8.33 ms (7200 RPM = 120 rotations/second = 8.33 ms/rotation). The data to be read must be under the read head; if it is not, the drive must wait until it is. If you're lucky, the data you want will be just arriving under the read head, but if you're not, it can be up to one full revolution away. The average rotational latency is therefore half the time of one revolution, which for a 7200 RPM disk is 4.17 ms.

Sustained Transfer Rate
The sustained transfer rate is the speed at which the disk can output data; for a disk with our specs it is 66 MB/s. On average, the time needed to read one randomly chosen 4 KB block is:

read/write random blk = seek + rot lat + transfer
                      = 8ms + 4.17ms + 4KB/(66MB/s)
                      = 12.23ms  (approximately 12 million CPU cycles, assuming a 1 GHz CPU)

An example

Consider the following user program which:

  1. reads data from the disk
  2. does some computation (takes 1 ms)
  3. writes computed data back to disk.
while (1) {
	char buf[4096];
	read(fd, buf, 4096);      /* 1. read a 4 KB block from disk */
	compute(buf);             /* 2. ~1 ms of computation (hypothetical helper) */
	write(fd2, buf, 4096);    /* 3. write the computed data back to disk */
}

The average time for one iteration of this loop is:

time for 1 iter = read random blk + compute + write random blk 
                = 12.23ms + 1ms + 12.23ms 
                = 25.46 ms/iter

39.28 iterations per second
157 KB computed per second
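As a sanity check, here is a small C program (not part of the original notes) that plugs the assumed disk parameters into the model above and reproduces these figures:

#include <stdio.h>

int main(void)
{
	/* Disk parameters assumed in the notes */
	double seek_ms     = 24.0 / 3.0;                      /* average seek: 1/3 of 24 ms */
	double rot_ms      = (60000.0 / 7200.0) / 2.0;        /* half a revolution at 7200 RPM */
	double transfer_ms = 4.0 / (66.0 * 1024.0) * 1000.0;  /* 4 KB at 66 MB/s */

	double blk_ms  = seek_ms + rot_ms + transfer_ms;  /* one random block access */
	double iter_ms = blk_ms + 1.0 + blk_ms;           /* read + 1 ms compute + write */

	printf("random block access: %.2f ms\n", blk_ms);                 /* ~12.23 */
	printf("per iteration:       %.2f ms\n", iter_ms);                /* ~25.46 */
	printf("throughput:          %.0f KB/s\n", 4.0 * 1000.0 / iter_ms); /* ~157 */
	return 0;
}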

Performance Strategies

How can we improve the performance of the previous example? Notice that the heaviest costs are the seek and the rotational latency, so to improve performance we must avoid them. If data is placed randomly on the disk, we pay these overheads every time we read or write. To improve performance, the file system must lay data out on the disk intelligently.

When observing the behavior of common programs, many exhibit locality of reference: immediately after an access to item x, we are likely to access an item close to x. Spatial locality means that after accessing a memory location, we will most likely access memory locations near it; this is particularly true when fetching instructions, since the next instruction is usually very close to the current one. Temporal locality means that if a memory location is accessed now, it is likely to be accessed again in the near future, as when we repeatedly operate on the same variable.

The file system should aim to keep blocks from a single file (or files in the same directory) in close proximity on disk. That way, reading multiple blocks is fast because the only cost is data transfer, which is almost negligible. File fragmentation occurs when a file's data is not located contiguously on the disk; proximity should be maintained to exploit locality of reference and improve performance.

Improving File System Performance

Below are some approaches that can be implemented by the kernel to improve performance.


Speculation

Recall the definition: speculation means requesting data in advance, hoping it will turn out to be useful.

To achieve speculation, we can do the following:

  • On one read request, read many blocks into buffer cache
  • Later requests to the same data can be serviced immediately

Example - Buffer one track:

  1. If files are stored contiguously on the disk, it is safe to speculate that subsequent reads will occur on the same track
  2. When a disk sector is requested, check whether it is already buffered; if not, read the entire track into a buffer
  3. If a track has 64 blocks, then the OS kernel should read blocks i through 63 when block i is requested
  4. Subsequent reads that fall on this track will be serviced immediately (see the sketch below)
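Here is a minimal sketch of the track-buffering idea in C. The track-read primitive disk_read_track and the geometry constants are hypothetical stand-ins, not a real driver interface:

#include <string.h>

#define BLOCKS_PER_TRACK 64
#define BLOCK_SIZE 4096

/* Hypothetical low-level routine: reads one entire track into memory. */
extern void disk_read_track(int track, char *buf);

static char track_buf[BLOCKS_PER_TRACK * BLOCK_SIZE];
static int  buffered_track = -1;    /* track currently in the buffer; -1 = none */

/* Read one block, speculatively buffering its whole track. */
void read_block(int blockno, char *dst)
{
	int track  = blockno / BLOCKS_PER_TRACK;
	int offset = blockno % BLOCKS_PER_TRACK;

	if (track != buffered_track) {           /* miss: pay one seek + one rotation */
		disk_read_track(track, track_buf);
		buffered_track = track;
	}
	/* Hit: later reads on the same track cost only a memcpy. */
	memcpy(dst, track_buf + (size_t)offset * BLOCK_SIZE, BLOCK_SIZE);
}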


Calculations, based on the example program above:

read 1 track = seek + rot lat (one full rotation, not the average) + transfer
             = 8ms + 8.33ms + 0ms = 16.33ms

1st [iteration]: read 1 track + compute + write
		 = 16.33ms + 1ms + 12.23ms = 29.56ms

2nd-64th:       compute + write
                = 0ms + 1ms + 12.23ms = 13.23ms

Average = (29.56ms + 63 × 13.23ms) / 64 = 13.49ms/iter
74 iterations per second
296KB computed per second


Notes:

  1. Performance has nearly doubled.
  2. The transfer time for reading a track is 0 because it overlaps the rotational latency.
  3. Blocks 2 through 64 are already in the buffer cache, so their read time is 0.

Dallying and Batching

We can improve performance by dallying: delaying a request in the hope that we can batch it with future requests, or that the request turns out not to be needed at all. For example, as mentioned in our course reader, a request to overwrite a disk block may be delayed in the hope that a second request will ask to write the same block; if the second request shows up, the first can be dropped and only the second performed. Performance is enhanced by not wasting time on requests that are not needed. How long should we wait? There is no specific answer -- it depends on system and application specifics.

Dallying can also increase chances for batching, which combines several operations into one to reduce setup overhead (6-3 in your reader).

How can we achieve dallying/batching? Here's an example:

  • Write to the buffer cache; periodically flush many blocks to disk
  • Write 64 blocks at a time (assume 64 blocks equal one track)

As with the speculation technique, we use a buffer cache in front of the file system. This reduces seek time for expensive operations such as reads and writes: reads are served from the cache, and writes update the cache instead of the disk. Dallying and batching then let us perform many operations at once, reducing seek overhead even further.

1st: read + compute + write
     = [seek + rot lat (full rotation) + transfer] + compute + [seek + rot lat (full rotation) + transfer]
     = 8ms + 8.33ms + 0ms + 1ms + 8ms + 8.33ms + 0ms = 33.66ms

2nd-64th: compute = 1ms

Note: For the 2nd to 64th iteration, assume read and write are in cache.

Average = (33.66ms + 63 × 1ms) / 64 = 1.51ms/iter = 662.25 iter/s = 2649 KB/s

We see much better performance here than with speculation alone, because the expensive write overhead is paid only in the first iteration. Dallying and batching work together: dallying delays the read and write operations, which creates the opportunity to batch them into the first iteration. (A sketch of the write-behind idea follows.)
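Here is a minimal sketch of the write-behind idea in C; disk_write_track is again a hypothetical one-track write primitive, and a real system would also flush on a timer or on fsync():

#include <string.h>

#define BLOCKS_PER_TRACK 64
#define BLOCK_SIZE 4096

/* Hypothetical low-level routine: writes one entire track in a single pass. */
extern void disk_write_track(int track, const char *buf);

static char write_buf[BLOCKS_PER_TRACK * BLOCK_SIZE];
static int  ndirty = 0;    /* blocks we are currently dallying on */

/* Dally on writes: accumulate them in memory, then batch out a full track. */
void write_block(int track, const char *src)
{
	memcpy(write_buf + (size_t)ndirty * BLOCK_SIZE, src, BLOCK_SIZE);
	if (++ndirty == BLOCKS_PER_TRACK) {
		disk_write_track(track, write_buf);  /* one seek + one rotation buys 64 writes */
		ndirty = 0;  /* caveat: dallied data is lost if the system crashes first */
	}
}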

Batching also creates opportunities to reduce latency by reordering requests, which brings us to disk scheduling.

Disk Scheduling

Disk scheduling decides the optimal order to perform a sequence of requests such that the total latency is reduced, by reducing the movement of the disk arm. In other words, the main goal is to maximize the overall throughput and not necessarily the individual delay of each request. At the same time, the situation where a certain disk request is never executed, known as starvation, must be avoided.

Here's a simple model:

LISK (Linear Disk): |0|1|.............|N| (0 to N)

  • The largest seek time is N
  • Seek time from i to j = |j - i|
  • Given a set {B} of requests with block numbers b0, b1, ..., bm, what order do we choose?
  • Assume we know the head's position = h

Here are five scheduling algorithms:

First Come First Serve (FCFS)

A common example is a line. The first person that gets in line will be served first, the second will be served second, etc.

FCFS is basically a queue.

  • Serve b0, b1, ... in increasing order of job arrival; when a job becomes ready, it is added to the end of the queue.
  • This order does not suffer from starvation, because each job executes based on its position in the queue; a job is delayed only while the jobs ahead of it run. The problem with FCFS is that we do nothing smart to reduce seek time: the wait time depends entirely on the order in which requests are received.
  • Each request takes, on average, N/3 to seek; therefore, the expected completion time is (m+1)N/3 (see below).
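In symbols, using the same uniform-seek argument as in the seek-time section: the head position h_k when request b_k is served is itself uniformly random (it is wherever the previous request left it), so

E[\text{total seek}] = \sum_{k=0}^{m} E[\,|b_k - h_k|\,] = (m+1) \cdot \frac{N}{3}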

Shortest Seek Time First (SSTF)

  • Go to the block closest to the head -- the one that minimizes seek time and rotational latency.
  • This order can cause starvation: whether a job gets executed depends on how close it is to the head.
  • To see how starvation could occur, consider a sequence of requests such as 5, 6, 10, 5, 6, 5, 6, ... with the head starting near blocks 5 and 6.

Notice that although block 10 was requested 3rd, it will never get served as long as requests for blocks 5 and 6 continue.

But is it optimal? SSTF yields the shortest possible schedule, provided no starvation occurs.

For example:

h = 10
{b0, b1, b2, b3} = {11, 0, 10, 1}
FCFS: 11, 0, 10, 1		
  time = 31 units (1+11+10+9)
SSTF: 10, 11, 1, 0	
  time = 12 units (0+1+10+1)

Example of starvation:
Sequence: {10, 11, 0, 1, 12, 13, 14, 15, 16...}
SSTF: 10, 11, 12, 13, 14, 15, 16...
      0 and 1 are starved because 10 through 16 are hogging the scheduler.
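The pick-next step of SSTF might look like this in C (an illustrative sketch over an array of pending block numbers, not from the original notes):

/* Return the index of the pending request closest to head position h,
   or -1 if there are no pending requests. */
int sstf_next(const int *pending, int n, int h)
{
	int best = -1, best_dist = 0;
	for (int i = 0; i < n; i++) {
		int dist = pending[i] > h ? pending[i] - h : h - pending[i];
		if (best == -1 || dist < best_dist) {
			best = i;
			best_dist = dist;
		}
	}
	return best;    /* note: nothing here stops far-away requests from starving */
}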

To minimize starvation

We use the following idea: take requests in chunks; within a chunk use SSTF, and between chunks use FCFS.

Elevator scheduling

Let's use the idea of an elevator! We have a direction, either up or down, and we keep going in that direction until we reach the end (no more requests in that direction). After reaching the end, we switch directions and go all the way to the other end. Here is some pseudocode to demonstrate this idea:

d = head direction (UP or DOWN)
h = current head position

int getNextBlock() {
  if (no pending bi)            // no requests at all
    return -1;
  if (d == UP) {
    if (no bi >= h) {           // nothing left in this direction,
      d = DOWN;                 // so reverse and try again
      return getNextBlock();
    }
    return smallest bi >= h;    // closest request at or above the head
  } else {                      // d == DOWN
    if (no bi <= h) {
      d = UP;
      return getNextBlock();
    }
    return largest bi <= h;     // closest request at or below the head
  }
}

(A request is removed from the pending set once it has been serviced.)

Using the same set {11, 0, 10, 1} with h = 10:

Direction   Order           Time
d = UP      10, 11, 1, 0    12 units
d = DOWN    10, 1, 0, 11    21 units

Notice that this algorithm does not suffer from starvation! However, if you look carefully, the middle sectors are actually serviced more often than the sectors at the ends: each full sweep cycle passes over a middle sector twice but reaches each end sector only once.

C-Scan/Circular Elevator Scheduling

So how do we solve this minor issue and make service fair for every sector? We disconnect the cables and drop the elevator to the floor! In other words, we move only in one direction but in a circular manner (when we reach the end, wrap around back to the beginning).


  • Move only in one direction, wrapping from the end back around to the beginning.
  • Treats all areas of the disk equally often (see the sketch below).
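A sketch of the circular pick-next rule in C: serve the closest request at or above the head, and when none remains, wrap around to the lowest pending request. (Illustrative only; the names are made up.)

/* C-SCAN: pick the smallest pending block >= h; if none, wrap around
   to the smallest pending block overall. Returns an index, or -1. */
int cscan_next(const int *pending, int n, int h)
{
	int up = -1, lowest = -1;
	for (int i = 0; i < n; i++) {
		if (lowest == -1 || pending[i] < pending[lowest])
			lowest = i;             /* global minimum, for the wraparound */
		if (pending[i] >= h && (up == -1 || pending[i] < pending[up]))
			up = i;                 /* closest request in the travel direction */
	}
	return up != -1 ? up : lowest;  /* -1 only when nothing is pending */
}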

Anticipatory Scheduling

Some processes may issue disk requests synchronously, which can cause poor performance in the other algorithms. Process A may issue successive requests only after the previous request has completed, so that it only has one pending request at any given moment. So what may happen is that at the moment when the request is completed, the scheduler assumes that Process A has no further requests (since Process A has not yet issued the next request) and moves on to perform Process B's requests. This is known as deceptive idleness.

The anticipatory scheduling algorithm lets the disk handle Process A's requests consecutively, which improves performance by reducing the time spent moving the disk head when switching between requests from different processes.


  • After completing a request, wait for a brief period of time
  • If a nearby request occurs during this period, handle that request
  • Otherwise, use Circular Elevator Scheduling to choose the next request to service
  • If implementing this strategy, one must be careful not to allow a process to starve another process's requests (a sketch follows this list)
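A sketch of the anticipatory decision in C; wait_for_request_near (block briefly, waiting for a request close to the head) and the 1-unit wait are hypothetical placeholders:

/* Hypothetical primitive: block for up to max_wait time units, returning the
   block number of any request arriving near head position h, or -1 on timeout. */
extern int wait_for_request_near(int h, int max_wait);

extern int pending[];    /* pending request block numbers */
extern int npending;
extern int cscan_next(const int *pending, int n, int h);   /* sketch above */

int anticipatory_next(int h)
{
	int b = wait_for_request_near(h, 1);   /* dally briefly after each completion */
	if (b >= 0)
		return b;     /* a nearby request arrived: deceptive idleness avoided */
	int i = cscan_next(pending, npending, h);  /* timeout: fall back to C-SCAN */
	return i >= 0 ? pending[i] : -1;
}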

An example:

Process A and Process B both start synchronously requesting disk 
accesses (with A's initial request being sent infinitesimally earlier 
than B's initial request), with a small delay (1 unit of time) between 
the completion of each respective process's request and the next 
issued request.

Order of request block numbers for each process:
A: 1, 2, 3, 4, 5
B: 11, 12, 13, 14, 15

With h = 1 initially, the disk scheduler sees and processes the requests in the following orders:

Using Circular Elevator Scheduling:
1, 11, 2, 12, 3, 13, 4, 14, 5, 15
  time = 86 units

With Anticipatory Scheduling (with 1 time unit wait time)
1, 2, 3, 4, 5, 11, 12, 13, 14, 15
  time = 14 units spent moving the disk head + 9 units waiting = 23 units

This is just a hypothetical example with an arbitrary wait time, chosen to demonstrate a situation where anticipatory scheduling can offer great benefits. In practice, the wait time would be tuned to the typical delay between requests on the system.


Performance vs. Robustness

But wait! Last lecture, the professor emphasized that careful ordering of disk writes preserves the invariants for file system correctness. As a reminder, these invariants were:


  1. Every block is used for exactly one purpose
  2. A referenced block must be initialized
  3. A referenced block must be marked not free
  4. An unreferenced block must be marked free

Recall that if the ordering of atomic writes is carefully chosen, none of the invariants will be violated (except the 4th, which is OK to violate). If none of these invariants is violated, we are assured that even if the system crashes during an operation, the file system will continue to operate correctly.

So how can we get file system correctness while still gaining the performance benefits of smart disk scheduling (provided with dallying and batching)? It is important to note that some writes will affect the invariants while others will not. Only changes to non-data blocks affect the invariants. These are writes to things such as inodes and the free block bitmap, so careful ordering is crucial here. However, data block writes can be done in any order, so any disk scheduling method can be used there.

File System Atomicity

There is, however, a small problem. Suppose that there is a sequence of blocks to be written denoted by:

old blocks: ABCDE
new blocks: A'B'C'D'E'

Using a First Come First Serve order for scheduling the data writes, the possible outcomes (should your computer crash in the middle of the writes) are the prefixes:

ABCDE, A'BCDE, A'B'CDE, A'B'C'DE, A'B'C'D'E, A'B'C'D'E'

For a non-FCFS order, any intermediate state is possible (any 1 block written, any 2 blocks written, etc.).

Imagine writing your CS111 take-home final (in your dreams) and suddenly the power goes out. When you reboot, you'd probably like to see all of your changes or none at all, not some in-between mixture. It would be a pain to track down which changes were saved and which were not, and in some situations a partially saved file could be corrupted, making all the data unreadable. This is why we want writes to a file to be atomic: a write either happens completely or does not affect the file at all.

A very simple way to get atomic writes without sacrificing too much performance is to write the file twice. According to the Golden Rule of Atomicity, you should never overwrite the only copy: if the system crashes mid-write, you would be unable to restore either the original file or the modified file.
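One common user-level realization of this rule (a sketch of one approach, not the only one) writes the complete new copy to a temporary file and renames it over the original, so a crash leaves either the old file or the new one:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Save buf to path atomically: the old copy survives on disk until the new
   copy is complete, per the Golden Rule of Atomicity. */
int atomic_save(const char *path, const char *tmp, const char *buf, size_t len)
{
	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);       /* failure leaves the original file untouched */
		return -1;
	}
	close(fd);
	return rename(tmp, path);   /* POSIX rename atomically replaces path */
}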


Journaling

One way to achieve file system robustness is to reserve an area of the disk for a journal, or log, that keeps track of changes to the file system. It works in the following fashion:

To make a file system change

  1. Write modified blocks to log
  2. Write COMMIT RECORD to log
  3. Write modified blocks in log to main disk

After a crash

  1. Process log
  2. For each committed record, copy modified blocks to main disk

If a crash occurs before the COMMIT RECORD is written to the log, the system will ignore the log entry on reboot, and the original file will be intact. If a crash occurs after the COMMIT RECORD, the system will copy the modified blocks to the main disk, replacing the original blocks with the new blocks. If yet another crash happens during the copy, on reboot, the modified blocks will be copied to the main disk once again. After the copy is complete, the COMMIT RECORD is cleared, which will indicate that the main disk now contains the modified copy in its entirety.
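The commit protocol can be sketched in C; the log primitives here are hypothetical placeholders for whatever interface the file system provides:

struct change { int blockno; const char *data; };

/* Hypothetical log primitives; assume each writes one block synchronously. */
extern void log_write_block(int blockno, const char *data);
extern void log_write_commit_record(void);
extern void log_clear_commit_record(void);
extern void disk_write_block(int blockno, const char *data);

void journaled_update(const struct change *ch, int n)
{
	for (int i = 0; i < n; i++)
		log_write_block(ch[i].blockno, ch[i].data);   /* 1. modified blocks to log */
	log_write_commit_record();                        /* 2. the commit point */
	for (int i = 0; i < n; i++)
		disk_write_block(ch[i].blockno, ch[i].data);  /* 3. copy to main disk */
	log_clear_commit_record();    /* recovery no longer needs this entry */
}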

The following are details on how to write a single block to the log:

Writing a block to log

  1. Mark block as used in Free Block Bitmap
  2. Add block to inode
  3. Zero out the block
  4. Write the COMMIT RECORD