Tuesday, March 30, 2010

System Calls (Linux/Unix Kernel)

System Calls
In simple terms, a system call is a request from a user application to the operating system to do some work on its behalf.

Why do we need a System Call?
A very common example is reading a document from the hard disk.
To read the document from a SCSI disk, you must know the hardware details of the disk. You need to program its controller so that you can pass a SCSI read command to it and tell the controller which block you want to read. You must also know the device-specific protocol (like SCSI) so that you can talk to the disk. Overall, it would be a nightmare for the user to do a simple read/write from the disk.
The solution is to let the operating system (Linux/Unix/Windows) do it for you. Operating systems include driver code for the hardware devices (disk, mouse, keyboard, memory, Ethernet card etc.) attached to the system. The driver code knows how to talk to the hardware devices.
So, the application/user can simply make a call to the system saying “Hey! Can you please do this for me?”

How to make a System Call?
Unix/Linux systems include standard libraries that provide various APIs to the users. These APIs are defined by the POSIX standard. The advantage of using these standard APIs is that they are available on all Unix systems, which makes the application portable.
These APIs in turn invoke the platform-specific system calls. The system call wrappers are generally part of the standard C library (libc) and are typically written in assembly language. They are highly system dependent (both OS and processor specific).
A user program can make a system call directly, but this is not advisable because system calls may vary from system to system. There’s no guarantee that a system call present on one system (say HP-UX) will also be present on a different system (say Solaris). The APIs defined by POSIX, however, are guaranteed to be present on all POSIX-compliant systems.
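To make the trade-off concrete, here is a minimal sketch that fetches the process ID both ways. It assumes a Linux system with glibc, where syscall() and the SYS_getpid constant from <sys/syscall.h> are available; the two function names are made up for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes syscall() in <unistd.h> on glibc */
#endif
#include <sys/syscall.h>       /* SYS_getpid: Linux-specific system call numbers */
#include <sys/types.h>
#include <unistd.h>            /* getpid(), syscall() */

/* Portable route: ask through the POSIX API. */
pid_t pid_via_posix(void)
{
    return getpid();
}

/* Non-portable route: pass the system call number ourselves.
 * SYS_getpid and syscall() are Linux/glibc specifics (an assumption here). */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both functions should return the same value on Linux, but only the first one is expected to build unchanged on other POSIX systems.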

Typically the flow is as follows:
Application ----> POSIX library APIs (fread) ----> System Call (read)
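The flow above can be sketched in C by reading the same file through both layers. This is only an illustration: the helper names are hypothetical, and the claim that fread() ends up calling read() describes typical Unix C libraries rather than something the standard requires.

```c
#include <fcntl.h>     /* open(), O_RDONLY */
#include <stdio.h>     /* fopen(), fread(), fclose() */
#include <unistd.h>    /* read(), close() */

/* Library route: fread() buffers data in user space and, on typical
 * Unix C libraries, calls read() underneath to fill that buffer. */
long read_with_stdio(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, cap, fp);
    fclose(fp);
    return (long)n;
}

/* Thin-wrapper route: read() maps almost directly onto the system call. */
long read_with_syscall_wrapper(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap);
    close(fd);
    return (long)n;
}
```

For a small regular file, both helpers should hand back the same bytes; the difference is only in how many layers sit between the application and the kernel.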

How System Call works
The purpose of a system call is to transfer control from the user application to the operating system. A user application cannot call kernel functions directly because the kernel has a protected address space, which is essential for system stability and security.
A processor provides certain instructions (like “int 0x80” on i386, “sc” on PowerPC) which are used to transfer control from user to kernel. When a user program executes one of these instructions, an exception is taken which changes the processor mode from user to kernel. The program counter is changed so as to execute the exception handler at a specific location. The exception handler is nothing but the system call handler in the kernel. This can then invoke the appropriate driver code to interact with the hardware.



System Call Number and Arguments
Typically there are hundreds of different system calls to do different kinds of tasks. All of them execute the same instruction (“int 0x80” on i386, “sc” on PowerPC). So, the user must tell the kernel which system call is being made. This is done by assigning a unique number to each system call.
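One way to picture the numbering is a table of handlers indexed by system call number. The sketch below is a user-space model only, not kernel code: the numbers, handler names and demo_dispatch() are all invented for illustration, and the only real identifier borrowed is ENOSYS from <errno.h>.

```c
#include <errno.h>   /* ENOSYS: "function not implemented" */

/* Every handler in this toy model takes three arguments. */
typedef long (*demo_syscall_fn)(long a, long b, long c);

/* Hypothetical system call numbers for the model. */
enum { NR_DEMO_ADD = 0, NR_DEMO_MUL = 1, NR_DEMO_COUNT };

static long demo_add(long a, long b, long c) { (void)c; return a + b; }
static long demo_mul(long a, long b, long c) { (void)c; return a * b; }

/* The table: the system call number is simply an index into it. */
static const demo_syscall_fn demo_table[NR_DEMO_COUNT] = {
    [NR_DEMO_ADD] = demo_add,
    [NR_DEMO_MUL] = demo_mul,
};

/* One entry point for every call, mirroring the single trap instruction.
 * Unknown numbers are rejected with a negative error code. */
long demo_dispatch(long nr, long a, long b, long c)
{
    if (nr < 0 || nr >= NR_DEMO_COUNT)
        return -ENOSYS;
    return demo_table[nr](a, b, c);
}
```

Calling demo_dispatch(NR_DEMO_ADD, 2, 3, 0) runs demo_add, while an unknown number such as 99 yields -ENOSYS without touching the table.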
Every processor architecture has an Application Binary Interface (ABI) that describes the low-level interface between an application and the operating system. The ABI describes how the system call number, arguments etc. should be passed to the OS. It also defines how the OS will tell the user whether the system call was executed successfully or not. This is typically done by passing the error number in a specific register.
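From the application's side, the register-level details are hidden behind the usual C convention of a -1 return value plus errno. A small sketch of what the caller observes when a call fails (the function name is hypothetical; read(), errno and EBADF are standard POSIX):

```c
#include <errno.h>    /* errno, EBADF */
#include <unistd.h>   /* read() */

/* Try to read from a file descriptor that cannot be valid and report
 * the error number that the library stored in errno. */
int errno_after_bad_read(void)
{
    char c;
    errno = 0;
    ssize_t n = read(-1, &c, 1);    /* -1 is never an open descriptor */
    return (n == -1) ? errno : 0;   /* EBADF is expected here */
}
```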

Let's see how it happens on i386
As per the System V ABI for x86 (Linux, NetBSD follow System V ABI)
  • The system call number is passed in the eax register.
  • Arguments are passed in registers ebx, ecx, edx, esi and edi, in that order (the first five arguments). If there are more than five arguments, a single register is used to pass a pointer to the user-space address where the parameters are stored. (Later Linux kernels also accept a sixth argument in ebp.)
  • The return value is passed to the user via the eax register. If the value in eax is between -1 and -125, the system call finished with an error. The library wrapper then copies the actual error code (-eax) to the “errno” variable and returns -1 in eax.
read system call library function for i386
// read (fd, buf, numBytes)
read:
    pushl %ebx              // save ebx on stack (callee-saved)
    movl  8(%esp), %ebx     // pass first argument (fd) in ebx
    movl 12(%esp), %ecx     // pass second argument (buf) in ecx
    movl 16(%esp), %edx     // pass third argument (numBytes) in edx
    movl $3, %eax           // pass system call number (3 = read) in eax
    int  $0x80              // invoke system call
    cmpl $-126, %eax        // check return value (unsigned compare)
    jbe  out                // if no error goto out
    negl %eax               // negate the value of eax
    movl %eax, errno        // save the error code in errno
    movl $-1, %eax          // return -1 in eax
out:
    popl %ebx               // restore ebx
    ret                     // return
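The error check at the end of the listing above can be restated in C. This is a sketch of the logic only, under the article's assumption that error codes occupy the range -1 to -125; demo_i386_syscall_ret() is a hypothetical helper, not a libc function.

```c
#include <errno.h>

/* Convert a raw kernel return value (as left in eax) into the C
 * convention of "-1 and errno set". The unsigned comparison mirrors
 * the cmpl $-126 / jbe pair in the assembly listing. */
long demo_i386_syscall_ret(long raw)
{
    if ((unsigned long)raw > (unsigned long)-126L) {  /* raw in [-125, -1] */
        errno = (int)-raw;   /* store the positive error code */
        return -1;
    }
    return raw;              /* success: pass the value through */
}
```

Treating the value as unsigned lets a single comparison separate the small band of error codes from every legitimate return value.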


System call interface on PowerPC
As per the System V PowerPC ABI (Linux also follows the System V ABI),
  • The system call number is passed in register r0.
  • The arguments are passed in registers r3 to r10 (a total of 8 arguments, in order).
  • The error code from the OS is passed in r3. The summary overflow bit (CR0_SO) in condition register field CR0 indicates whether an error occurred. If it is set, the library wrapper copies the error code to “errno” and returns -1 to the user application.
  • The return value(s) from the system call to the user application are passed in r3 and r4.
read system call library function for PowerPC
// read (fd, buf, numBytes)
read:
   li   r0, 3         // pass system call number to r0
                      // arguments have already been passed
                      // in r3, r4, r5 by the caller.
   sc                 // invoke system call 
   bnslr              // if no error return (Summary
                      // Overflow bit not set)
   lis  r4, errno@ha  // copy error code to errno variable
   stw  r3, errno@l(r4)
   li   r3, -1        // return -1 to user application
   blr
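For comparison with i386, the same wrapper logic can be modelled in C. Here the error flag travels separately from the value, so no range check is needed. The helper name and the boolean parameter standing in for the CR0_SO bit are assumptions made for this sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* r3 holds either the result or a positive error code; the summary
 * overflow bit says which one it is. */
long demo_ppc_syscall_ret(long r3, bool summary_overflow)
{
    if (summary_overflow) {
        errno = (int)r3;   /* r3 already holds the positive error code */
        return -1;
    }
    return r3;             /* success: r3 is the return value */
}
```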

Accessing/Copying user data from kernel
A user program is not allowed to access kernel pages; this restriction is essential for system security and stability. The kernel, however, can read and write data in user space. The Linux kernel provides two functions for this:
copy_to_user() is used to copy data from the kernel address space to the user address space.
copy_from_user() is used to copy data from a user buffer into a kernel page.
I will discuss more about these in some other post.

The kernel must verify all the arguments passed by the user. The most important check is the validity of any pointer provided by the user. The kernel must ensure that
  • the address lies within the user address space. It must not lie in the kernel region, or else the user may corrupt kernel data.
  • the user has valid permissions to read/write data at that address.
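The first check can be sketched as a simple range test. This is a user-space model, not the kernel's actual code: the function name is hypothetical, and DEMO_USER_LIMIT assumes the classic 3 GB user / 1 GB kernel split used on 32-bit x86 Linux.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed boundary: user space below 0xC0000000, kernel above it. */
#define DEMO_USER_LIMIT ((uintptr_t)0xC0000000u)

/* Return true if the buffer [addr, addr + len) lies entirely in user
 * space. The test is written as addr <= LIMIT - len so that a huge len
 * cannot wrap around and slip past the check. */
bool demo_range_in_user_space(uintptr_t addr, size_t len)
{
    if (len > DEMO_USER_LIMIT)
        return false;
    return addr <= DEMO_USER_LIMIT - len;
}
```

The second check (read/write permission on the pages) depends on the page tables and is not modelled here.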

This is all for today. We will discuss more on how kernel handles the system call later.

Till then, Have Fun !!!


2 comments:

  1. Thanks to the readers (Viney Kumar) for pointing out few typos in this article. Corrected them as per his comments.

  2. Nice Article,

    but now a days INT 0x80 instruction is obsolete and kernels make use of SYSENTER/SYSEXIT for making system calls
