Total Pageviews

Tuesday, March 30, 2010

System Calls (Linux/Unix Kernel)

System Calls
In simple words, it is a call from the user application to the Operating System to do some work for it.

Why do we need a System Call?
A very common example is reading a document from the hard disk.
To read the document from the scsi disk, you must know about the hardware details of disk. You need to program its controller so that you can pass scsi read command to it and tell the controller that you want to read a specific block. You must know the device specific protocol  (like scsi) so that you can talk to the disk. Overall it will be a nightmare for the user to do a simple read/write from the disk.
Solution is let the operating system (linux/unix/windows) do it for you. All operating systems have a driver code for any hardware device (disk, mouse, keyboard, memory, ethernet card etc) attached to the system. The driver code knows how to talk to the hardware devices.
So, application/user can simply make a call to the system saying “Hey ! can you please do this for me”.

How to make a System Call ?
Unix/Linux systems include various standard libraries that provide various APIs to the users. These APIs are defined by the POSIX standard. The advantage of using these standard APIs is that these are available on all unix systems which makes the application portable.
These APIs in turn call the platform specific system calls. These systems calls are generally a part of standard C library libc and are typically written in Assembly language. These are highly system dependent (both OS and processor specific).
A user program can make a system-call directly but it is not advisable to do so as these system calls may vary from system to system. There’s no guarantee that a system call present on one system (say HP-UX) will also be present on a different system (say Solaris). But, the APIs defined by POSIX are guaranteed to be present on all POSIX compliant systems.

Typically the flow is as follows:
Application ----> POSIX library APIs (fread) ----> System Call (read)

How System Call works
The purpose of system call is to transfer control from user application to operating system. User application cannot directly call kernel functions as the kernel has protected address space which is essential for system stability and security.
A processor provides certain instructions (like “int 0x80” on i386, “sc” on PowerpC) which are used to tarnsfer the control from user to kernel. When a user program execute these instructions, an exception is taken which changes the processor mode from user to kernel. The Program counter is changed so as to execute the exception handler at a specific location. The exception handler is nothing but the system call handler in kernel. This can then invoke appropriate driver code to interact with hardware.



System Call Number and Arguments
Typically there are 100s of different system calls to do different kinds of tasks. All of them execute the same instruction (“int 0x80” on i386,  “sc” on PowerPC). So, user must tell the kernel that which system call is being made. This is done by assigning a unique number to each system call.
Every processor specify an Appilication Binary Interface (ABI) that describes the low level interface between and application and operating system. The ABI describes how the system call number and arguments etc should be passed to OS. They also define how the OS will tell the user if the system call was executed successfully or not. This is done typicallly by passing the error number in a specific register.

Lets see how it happens on i386
As per the System V ABI for x86 (Linux, NetBSD follow System V ABI)
  • System call number is passed into eax register.
  • Arguments are passed in registers ebx, ecx, edx, esi and edi in order (the first five arguments). In case, the number of arguments are more than 5, then a single register is used to pass a pointer to user space address where the parameters are stored.
  • The return value is passed to the user via eax register. If return value is between -1 and -125 (value of eax), it means system call finished with an error. The actual error code (-eax) is then copied to “errno” variable and -1 is returned in eax.
read system call library function for i386
// read (fd, buf, numBytes)
read:
    pushl %ebx              // save ebx on stack
    mov1  8(%esp), %ebx     // pass first argument in ebx
    movl 12(%esp), %ecx     // pass second parameter in ecx
    movl 16(%esp), %edx     // pass third parameter in edx
    movl $3, %eax           // pass system call no in eax
    int  $0x80              // invoke system call
    cmpl $-126, %eax        // check return value
    jbe  out                // if no error goto out
    negl %eax               // negate the value of eax
    mov %eax, errno         // save the error code in errno
    mov, $-1, %eax          // pass –1 to eax
out:
    pop %ebx
    ret                     // return


System call interface on PowerPC
As per System V PowerPC ABI (linux also follows System V ABI),
  • System call number is passed in register r0
  • The arguments are passed in registers r3 to r10 (total 8 argumenst in order).
  • The error code from OS to system call is passed in r3. The summary overflow bit (CR0_SO) in the condition register CR0 indicates if error occurred. A unix system checks if there is an error, it copies error code to “errno” and return -1 to user application.
  • The return value(s) from system call to user application are passed in r3 and r4.
read system call library function for PowerPC
// read (fd, buf, numBytes)
read:
   li   r0, 3         // pass system call number to r0
                      // arguments have already been passed
                      // in r3, r4, r5 by the caller.
   sc                 // invoke system call 
   bnslr              // if no error return (Summary
                      // Overflow bit not set)
   lis  r4, errno@ha  // copy error code to errno variable
   stw  r3, errno@l(r4)
   li   r3, -1        // return -1 to user application
   blr

Accessing/Copying user data from kernel 
User is not allowed to access kernel pages which is essential for system security and stablity. However, kernel can read/write data from/to user space. Linux kernel provides two functions for this:
copy_to_user() is used to copy data from kernel address space to user address space.
copy_from_user() is used to read the data from user buffer to kernel page.
I will discuss more about these in some other post.

Kernel must verify all the arguments passed by user. The most important is the validity of the pointer provided by the user. Kernel must ensure that
  • address should lie with in user address space. It should not lie in kernel region else user may corrupt the kernel data.
  •  user has valid permissions to read/write data for that address.

This is all for today. We will discuss more on how kernel handles the system call later.

Till then, Have Fun !!!


Sunday, March 28, 2010

Synchronization in Linux/Unix Kernel - Part 4 (Reader-Writer Lock)

Till now we saw how we can atomically read-modify-write global variables using special instructions provided by the processor. We also saw the use and implementation of Spin Locks and Semaphores. In this post we will discuss about a new locking mechanism called Reader-Writer Lock.

Consider a case where we have a file shared across various processes with the following constraints.
  1. Multiple processes can read the file simultaneously.
  2. Only one process can write to the file at a time. No other process should be allowed to read/write the file while one of the process is writing to the file.
We can’t use Spin Lock here as it allows only one process to access the file. Spin Lock can’t distinguish between readers and writers. Also, read/write to file on disk is time consuming operation and spin lock is not suitable here.
Semaphores or Counting Semaphores are also not suitable here as they can’t distinguish between reader and writer.

We need a separate type of lock here called Reader Writer Lock.
It keeps track of number of readers/writers reading/writing to the file. It also keep track of number of readers/writers that are waiting to take the lock.

Reader Write Lock
  • If there are only readers it allows all of them to take the lock.
  • If a new writer comes it blocks the writer until all readers are done. But, if new readers keep coming and we allow them to take the lock, it will cause writers to starve.
  • Similarly, if a writer is active, all new readers and writers will be blocked.
Question is whom to wake up after the writer is done ? Should it wake up another writer or all other readers ? If writers are given preference, readers could starve.



To solve the problem of starvation
  • All new readers are blocked if there is any writer already waiting so that we don’t starve writers.
  • If a reader releases the lock, it takes no action if some readers are already active. When the last reader releases the lock, it issues wakeup to the waiting writer.
  • If a writer realease the lock, it issues wakeup to all the blocked readers. If there are no blocked readers, it issues wakeup to the single blocked writer.
With this scheme we can alternate access to the resource between a writer and bunch of readers.

Implementation of reader writer lock
typedef struct  rwLock {
   spinLock sl;          // spin lock to synchronize access
                         // to following members of lock
                         // by multiple threads.
   int nActive;          // number of active readers.
                         // if -1, a writer is active
   int nReadPending;     // number of blocked readers
   int nWritePending;    // number of blocked writers
   queue readers;        // queue to put blocked readers
   queue writers;        // queue to put blocked writers
} rwLock;

void LockRead (rwLock *lock)
{
   acquire the spin lock  // acquire the spin lock
   lock->nReadPending++;
   if (nWritePending) {   // if any writer already blocked
      add itself to readers wait queue
      release the spin lock
      sleep
      acquire the spin lock
   }
   while (lock->nActive == -1) {  // if any writer already
                                  //acquired the lock
      add itself to reader wait queue
      release the spin lock
      sleep
      acquire the spin lock
   }  
  
   lock->nActive++;
   lock->nReadPending--;
   release the spin lock
}

void LockWrite (rwLock *lock)
{
   spinLock(&lock->sl);     // acquire the spin lock
   lock->nWritePending++;

   while (lock->nActive) {  // any reader/writer active
      add itself to writer queue
      release spin lock
      sleep
      acquire the spin lock
   }
   lock->nWritePending--;
   lock->nActive = -1;
   release the spin lock
}

void UnlockRead (rwLock *lock)
{
   acquire the spin lock
   lock->nActive--;
   if (lock->nActive == 0) {  // no readers/writers active
      issue wakeup to writers
      release spin lock
   } else {
      release the spin lock   // if any reader writer
                              // active, don’t do anything
   }  
}

void UnlockWrite (rwLock *lock)
{
   Acquire the spin lock
   lock->nActive = 0;
  
   if (nReadPending) {       // if any readers blocked
      wake all readers
      release the spin lock
   } else {
      wake the single writer
      release the spin lock
   }
}

Note that  the structure rwLock lock has several members. All these members can be accessed by various processes concurrently. So, to synchronize access to all these members a single spin lock is used. This spin lock should be taken before accessing any member of the rwLock.

Reader Writer lock is used by filesystem code while doing any read/write to a file in a filesystem. The lock is acquired at the start of read/write system call and released as soon as the system call is done. It allows multiple readers to read the file but allows only a single writer to write the file.

Saturday, March 27, 2010

Synchronization in linux/unix kernel - Part 3 (Semaphores)


In my previous post we discussed how Spin Locks can be useful when we need to update multiple global variables atomically. We saw that Spin Locks should be acquired only for a short duration of time as the blocking threads spin in a loop and not doing any thing useful. In this post we will discuss the scenarios where we need to grab a lock for long duration and the ways to solve such problems.

Single Printer and Multiple Users Problem
Lets consider a common example where there is a single printer and multiple threads (thread1, thread2, thread3) that want to print their files (say file1, file2, file3 respectively) As there is only one printer, while one of the threads is printing its file, the others should wait. Also, we are not sure about how much time a single job will take to finish. It may take from few seconds to few hours depending on the size of the file. So, the waiting threads cannot spin in a loop as they will waste lot of cpu cycles. The better option for them is to sleep so that some other thread can run. When the printing job is done the thread at the front of queue should be awakened so that it can start printing its file.

This kind of problem is solved by using a synchronization mechanism called Semaphore.
To implement a semaphore we need a flag to indicate if printer is in use and a queue to keep info about waiting threads.

struct semaphore {
   int   flag;          // flag = 0, means resource BUSY,
                        // flag = 1, means resource is FREE
   queue q;             // queue to store pids of all waiting threads
                        // pid can be later used to issue wakeup to the 
                        // sleeping threads
};

The thread that needs to print the file will first check if the flag is equal to 0 (printer is available). If flag == 0, it will set flag to 1 (to indicate printer is in use) and start printing its file. The test-and-set should be done atomically else two or more threads may feel that they all have acquired the lock. See my previous posts to find out the problems if test-and-set is not done atomically.
If any thread finds that the flag is already set, (printer is busy), it will put itself on the queue and will sleep.
When the printing job is done, flag is again set to 0 to indicate that the printer is free And wakeup is issued to the threads waiting for the printer to become free. No need to issue wakeup to all waiting threads. Wakeup the thread which is waiting for the longest time (ie the thread at the front of queue). The woken-up thread again sets the flag = 1 and starts printing its file.
Semaphores are also called SLEEPING LOCKS as the waiting threads don’t spin but sleep.

Semaphore and related operations.
typedef struct  semaphore {
   int    flag;           
   queue  q;
} semaphore;

semaphore sem;                         // Create a semaphore for printer

void initSem(semaphore *sem)
{
   sem->flag = 0;                      // Mark resource as free. Printer is available
   initialize the queue with all 0s.   // No thread is waiting initially
}

void acquireSem(semaphore *sem)
{
   while (test_and_set(sem->flag)) {   // atomically test-and-set the flag
      add itself to the end of queue   // if flag is set, add itself to queue
         sleep;
      }
}

void releaseSem(semaphore *sem)
{
   atomic_set(sem->flag, 0);           // Mark the resource free. Set flag to 0
      if threads are waiting on queue,
      issue wakeup to the first thread on queue
}


Note: Care must be taken while adding/removing threads from the queue. As multiple threads may be adding/removing themselves to the queue, a spin lock should be used to prevent queue from corruption. Spin lock will ensure that only a single thread is being added/removed from the queue. In fact, a single spin lock can be used to synchronize access to both flag and queue. The spin lock should be released before going to sleep.
See the example at the end of this post to see how spin lock can be used to implement a semaphore.
This is not the actual implementation but gives you a basic idea on how semaphores work.
The semaphore discussed above can be used to implement a mutex as it allows only a single thread to access the shared resource. These are also called Binary Semaphore.

Multiple Printers and Multiple Users Problem
We saw how a semaphore can be used to synchronize the access to a shared resource. Think of a case where there are multiple resources of same type eg multiple printers. If we have 10 printers then we should allow 10 users to print their files simultaneously.

Solution is Counting Semaphore
This problem is solved by using Counting Semaphores. Here a counter is used to keep track of the threads that are using the shared resources.
struct semaphore {
    int   count;      // initial value of count is equal to the number of shared     
                      // resources.               
    queue q;
};

Implementing a Counting Semaphore
To implement a Counting Semaphore,  the initial value of the count is set to the number of available resources. So, if we have 10 printers initial value of count = 10.

To aquire a semaphore, thread first decrements the value of count by 1. And then tests if count >= 0, which means some printers are available and the thread can start printing his job. So, atmost 10 threads can be printing their files simultaneously.

If the value of count after decrement becomes negative, it means no printer is free and  the incoming thread will put itself on the sleep queue. After any printing job is done, count is inceremented by 1 and wakeup is issued to the first thread on the queue.

There are three cases
Case 1:  count is > 0. This mean that resource is available. For our 10 printer example, if value of count is 4, it means 4 printers are free.

Case 2:  count = 0. No resource available. For our 10 printer case, 0 means no printer is available and no one is waiting on the queue.

Case 3:  count < 0. No resource is available. For our 10 printer case, if value of count is -3, it means all printers are busy and 3 threads are waiting in the queue.

So, the value of count may be used to determine if any thread is waiting in the queue. If value of lock < 0, it means some threads are waiting in the queue.

Implementing a Counting Semaphore
typedef struct semaphore {
   int    count;
   queue  q
} semaphore;

semaphore sem;

void initSem(semaphore *sem)
{
   sem->count = 10;                  // Initialize the count to 10, to indicate
                                     // 10 resources (printers) are available
   initialize the queue with all 0s  // No thread is waiting initially  
}

void acquireSem(semaphore *sem)
{
   decrement sem->count
   if (sem->count < 0)  {
      add itself to the end of queue.
      sleep;                         // Sleep if no resource is available
    }
}

void releaseSem(semaphore *sem)
{
   increment sem->count         
   if (sem->count <= 0)   {                    
      remove thread from the fron of waiting queue and
      issue wakeup to it
   }
}

Again, this is not the actual implementation. The count should be modified/checked atomically. Similarly, a thread may be added only if the resource is not available. While acquiring the semaphore, all the operations like decrementing the count, testing its value and then putting the thread on sleep queue should be done atomically. So, a single spin lock should be used to prevent access to both count and queue.

Sempahore using a spin lock to prevent access to its data
typedef struct semaphore {
    int spinLock;      // spin lock to prevent access to count and queue.
    int    count;
    queue  q
}

Spin lock is accessed only for short time, while modifying/testing count value and while adding thread to the queue. The spin lock should be released before going to sleep.

Semaphore operations using a spin lock
void acquireSem(semaphore *sem)
{
   acquire spin lock
   decrement sem->count
   if (sem->count < 0)  {
      add itself to the end of queue.
      release spin lock
      sleep;                          // Sleep if no resource is available
    }
}

void releaseSem(semaphore *sem)
{
   Acquire spin lock
   increment sem->count         
   if (sem->count <= 0)   {                   
      remove thread from the front of waiting queue
      release spin lock
      issue wake up to removed thread    
   }
}

Note: Counting semaphores can be used to implement Mutex or Binary Semaphores by initializaing the count to 1.

Hope you find this article useful. If you have any comments/suggestions please post them below. 
Have Fun !!! 


Followers