Event Handling Framework
libevent provides a simple, portable framework for receiving events that uses the most efficient system calls available on your system.
Sunday, February 3, 2008
Saturday, February 2, 2008
Unix Memory Model
There are some basic regions ("segments") provided by all Unix variants:
- Stack: (Variable size) This is where information about function call sequences is stored. There is microprocessor support for the stack.
- Code: (Fixed size) The area of memory containing machine code instructions. Typically r+x permissions. Aka the text segment.
- Data: (Fixed size) The area of memory containing initialized data. This includes static variables, string constants, etc.
- BSS: (Fixed size) The area of memory containing uninitialized (zero-filled) static data. The heap, where dynamically allocated objects live, is variable size and traditionally begins just past the BSS.
System V shared memory
System V provides an alternative mechanism for setting up shared memory, via the shmctl (), shmget (), shmat (), and shmdt () set of calls. These are not suggested for use, because:
- System V entities live in a separate namespace with separate access permissions and administrative tools (e.g., ipcs).
- System V entities are not automatically cleaned up if all programs using them exit, and can be a resource management nightmare.
Shared memory
Physical memory can be shared between two processes merely by manipulating their page tables. This happens automatically in modern Unixes in various circumstances, e.g.:
- When implementing shared libraries, the dynamic loader will mmap the library, and the kernel will share the maps amongst processes.
- When forking, the child gets a copy of the parent's page table, i.e., their pages originally all coexist in physical memory; but the first time a page is written (by either), the kernel traps the write and makes a copy. Thus a child can share a large read-only data structure constructed by the parent prior to forking; as a caveat, in some languages a nominally read-only data structure is still written to by the runtime (e.g., garbage collection metadata), which forces the pages to be copied.
- Threads share their entire page map. The OS will simply reset the stack pointer when switching contexts, as opposed to flushing the TLB.
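The copy-on-write behaviour of fork described above can be observed directly. The following is a minimal sketch, not a definitive test of kernel behaviour; the helper name child_write_is_private is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, let the child overwrite *value, and report
 * whether the parent's copy stayed unchanged. Returns 1 if the parent's
 * value is untouched, 0 if it changed, -1 on error. */
static int child_write_is_private(int *value)
{
    int before = *value;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* First write to this page in the child: the kernel copies the
         * page, so the parent keeps the old contents. */
        *value = before + 1;
        _exit(*value == before + 1 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return *value == before;
}
```

The child sees its own modified copy, while the parent's value is unchanged once the child has exited.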
mmap
The mmap () system call allows the programmer to associate a region of the process virtual address space with a file. It is an extremely general purpose utility:
- It allows the mapped memory to have protection attributes (readable, writable, executable).
- It allows the process to have a private (copy-on-write) version of the file; changes are private to the process and disappear when the process exits. Alternatively, it allows the process to share the mapping with other processes; writing the memory area is equivalent to writing the file.
- The memory mapped region need not correspond to an actual file (i.e. anonymous); by creating an anonymous mmap in a parent and forking, the children can share memory.
- The namespace for mmap corresponds to the filesystem, adhering to the "everything is a file" Unix ideal.
- Access permissions correspond to file permissions.
- The actual relationship between the virtual address space and physical memory consumed by the mmap is controlled by the OS; in particular, memory resources are automatically freed when all processes using an mmap either munmap or exit.
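As an illustration of the anonymous case in the list above, here is a minimal sketch in which a parent creates an anonymous shared mapping, forks, and reads back what the child wrote. The helper name share_int_with_child is hypothetical, and the sketch assumes MAP_ANONYMOUS is available (it is on Linux and the BSDs).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child stores `value` into an anonymous shared
 * mapping and the parent reads it back. Returns the value seen by the
 * parent, or -1 on error. */
static int share_int_with_child(int value)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        *shared = value;   /* MAP_SHARED: this write is visible to the parent */
        _exit(0);
    }

    int status = 0;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        seen = *shared;
    munmap(shared, sizeof *shared);   /* memory is released once unmapped */
    return seen;
}
```

With MAP_PRIVATE in place of MAP_SHARED, the child's write would be copy-on-write and the parent would still read 0.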
Associated with mmap are the system calls msync () and madvise (). msync instructs the OS to write all modified pages to disk, either synchronously (don't return until the write is complete) or asynchronously (return after the sync has been scheduled); for a mapping created with MAP_SHARED, the OS will also write dirty pages back to disk asynchronously on its own schedule. madvise provides hints to the kernel as to how the program will access the mmap, so that the kernel can optimize read-ahead and caching.
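A minimal sketch of a file-backed shared mapping that uses both calls might look as follows. The helper name write_via_mmap is hypothetical, and MADV_SEQUENTIAL is just one plausible hint chosen for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `msg` into the file at `path` through a shared
 * mapping and flush it with msync. Returns 0 on success, -1 on error. */
static int write_via_mmap(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    if (len == 0)
        return -1;                      /* mmap rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {   /* the file must cover the mapping */
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping keeps the file open */
    if (map == MAP_FAILED)
        return -1;

    madvise(map, len, MADV_SEQUENTIAL); /* a hint only; failure is harmless */
    memcpy(map, msg, len);              /* writing memory writes the file */
    int rc = msync(map, len, MS_SYNC);  /* block until the pages are written */
    munmap(map, len);
    return rc;
}
```

Passing MS_ASYNC instead of MS_SYNC schedules the write-back and returns immediately.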
Unix Signals
Signals are an asynchronous notification mechanism. Signals are covered by a POSIX standard. Under Linux, the signal (7) man page contains the list of signals supported.
POSIX.1 signals
Event Server
Servers typically handle three types of events: file descriptor events, signals, and timeouts.
1. File Descriptor Events
There are several system calls available to receive file descriptor events.
- select (2) is the most portable and least efficient.
- poll (2) is nearly as portable, less inefficient, and has a very intelligible interface.
- Linux has epoll (7), which is a vastly more efficient variant of poll.
- FreeBSD has kqueue, which is a single extensible kernel interface for all event handling.
- Finally, POSIX.4 defines asynchronous I/O (AIO).
These interfaces compare as follows:
- select: Portability: maximum, Efficiency: worst, Notification Type: readiness, level triggered
- poll: Portability: maximum, Efficiency: poor, Notification Type: readiness, level triggered
- /dev/poll: Portability: Solaris, Efficiency: acceptable, Notification Type: readiness, level triggered
- epoll: Portability: Linux 2.4+, Efficiency: good, Notification Type: readiness, level or edge triggered
- AIO: Portability: Linux 2.6, FreeBSD, Efficiency: variable, Notification Type: completion
- kqueue: Portability: BSD, OS X, Efficiency: good, Notification Type: completion and readiness, level or edge triggered
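To make "readiness, level triggered" concrete, here is a minimal sketch using poll. The helper name wait_readable is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error or hangup. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Because poll is level triggered, calling wait_readable twice on a pipe that holds unread data reports readiness both times; an edge-triggered interface would report it only when the data first arrives.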
2. Signal Events
Using kqueue makes it easy to mix signal event and file descriptor event notification. There is an event filter for signals, interest is registered in the same way as for file descriptors, and the events are delivered in the same way as file descriptor events.
Besides kqueue, every other way is ugly.
With poll, one approach is to keep the signals of interest blocked except while waiting: set the signal mask with sigprocmask to unblock them, call poll, then restore the mask with a second sigprocmask call. If a signal is delivered during the poll system call, poll is interrupted and returns -1 with errno set to EINTR, even if the timeout is -1. Thus, you will handle the signal event with low latency. The bad news is that if a signal is delivered between the first sigprocmask call and the poll call, and the poll timeout is -1, poll will not be interrupted and the event loop will not act on the signal until (if) a file descriptor event occurs. One way to guard against this is to cap the poll timeout, e.g., at 100ms: in exchange for the (slight) extra overhead of 10 system calls a second when idle, you get a maximum signal latency of circa 100ms.
POSIX provides the pselect (2) system call, which is like the sequence
sigprocmask (SIG_SETMASK, &mask, &oldmask);
select (...);
sigprocmask (SIG_SETMASK, &oldmask, NULL);
except that the system call eliminates the possibility of a signal being delivered between when the sigprocmask call returns and the select call begins. It was designed for the usage outlined above with poll, and therefore sounds ideal. Unfortunately, pselect has historically been broken under Linux: before kernel 2.6.16 it was emulated in glibc using exactly this sequence, so the race remained. Also, it uses select, and we prefer poll.
Another alternative is to have your signal handler write the signal number to a file descriptor that is included in your poll set.
With this setup, you can have an arbitrary poll timeout and maintain low latency. However, it's important to use a pipe, so that the (typically 4 byte) write and read of the signo is atomic; under POSIX, only pipes (and FIFOs) guarantee atomic writes, for sizes up to PIPE_BUF, which is at least 512 bytes. It's also important that the pipe be set to non-blocking (see below), to avoid deadlock.
Both techniques also apply to epoll.
3. Timeout Events
The next event type of interest is the timer, e.g., you want a 50ms timeout on a sub request. In general, you may have many more simultaneous timers pending in a complicated server than you have file descriptors, since there are generally multiple timeouts per request.
Once again, kqueue makes it easy. There is a timer filter type which is treated similarly to the file descriptor filters.
Also once again, every other way makes it ugly.
poll, epoll, etc. all provide a single timeout argument to the system call. The problem, then, is to consider the entire set of timers, determine the delta-t until the next timer goes off, and use that delta-t as the timeout argument to the system call. By storing the timers in a balanced binary tree sorted by (absolute, not relative) expiration time, the next expiring timer can be found in O (log (N)) time, and creating and removing timers can also be done in O (log (N)) time; the latter is especially important, since most timers are cancelled before expiring (they are most often used to time out subrequests, and most of the time your subservers are within SLA). This technique uses the gettimeofday (2) system call to obtain the current time.
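A minimal sketch of this bookkeeping follows. Note that it swaps in an array-backed binary min-heap for the sorted binary tree; the heap gives the same O (log (N)) bounds for adding and cancelling, and finds the next expiration in O (1). The names (struct timer, timer_add, timer_cancel, next_poll_timeout, MAX_TIMERS) are hypothetical, and times are absolute milliseconds.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/time.h>

#define MAX_TIMERS 1024                 /* assumed fixed capacity */

struct timer {
    long long expires_ms;               /* absolute expiration time */
    size_t heap_index;                  /* current heap slot, used by cancel */
};

static struct timer *heap[MAX_TIMERS];
static size_t heap_len = 0;

static void heap_place(size_t i, struct timer *t)
{
    heap[i] = t;
    t->heap_index = i;
}

static void sift_up(size_t i)
{
    struct timer *t = heap[i];
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (heap[parent]->expires_ms <= t->expires_ms)
            break;
        heap_place(i, heap[parent]);
        i = parent;
    }
    heap_place(i, t);
}

static void sift_down(size_t i)
{
    struct timer *t = heap[i];
    for (;;) {
        size_t child = 2 * i + 1;
        if (child >= heap_len)
            break;
        if (child + 1 < heap_len &&
            heap[child + 1]->expires_ms < heap[child]->expires_ms)
            child++;                    /* pick the earlier of the two children */
        if (t->expires_ms <= heap[child]->expires_ms)
            break;
        heap_place(i, heap[child]);
        i = child;
    }
    heap_place(i, t);
}

static int timer_add(struct timer *t, long long expires_ms)
{
    if (heap_len == MAX_TIMERS)
        return -1;
    t->expires_ms = expires_ms;
    heap_place(heap_len, t);
    heap_len++;
    sift_up(heap_len - 1);
    return 0;
}

/* Remove a timer that is currently in the heap. */
static void timer_cancel(struct timer *t)
{
    size_t i = t->heap_index;
    heap_len--;
    if (i == heap_len)
        return;                         /* it occupied the last slot */
    struct timer *moved = heap[heap_len];
    heap_place(i, moved);               /* fill the hole with the last element */
    sift_up(i);
    sift_down(moved->heap_index);
}

static long long now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* Timeout argument for poll: -1 if no timers, 0 if one is already due. */
static int next_poll_timeout(long long now)
{
    if (heap_len == 0)
        return -1;
    long long delta = heap[0]->expires_ms - now;
    if (delta < 0)
        delta = 0;
    return delta > INT_MAX ? INT_MAX : (int)delta;
}
```

An event loop would call next_poll_timeout (now_ms ()) before each poll and, after poll returns, fire and cancel every timer at the top of the heap whose expiration time has passed.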