This part of the document will remain a random to-do list until I figure out how to do a glossary properly in SGML!
Please send suggestions.
MM algorithm based on two areas, one managed as a FIFO queue, and one as an LRU list.
Outdated Programmable Interrupt Controller present on Intel hardware.
Application Binary Interface; the interface of passed structures between the user processes (and libraries) and the kernel. For compatibility, it is important that these remain as static as possible (i.e. making sure that variables and structure members have the same bytesize as before, and in the same ordering). Occasionally breakage is necessary, requiring re-compilation of the user-space sources (note that this does not affect source-compatibility; that is a separate issue).
Advanced Configuration and Power Interface - replacement for APM that has the advantage of allowing O/S control of power management facilities.
Address Generation Interlocking, on x86. When execution of an instruction requires an address resulting from a non-completed instruction, the CPU must wait - this is known as an AGI stall.
Accelerated Graphics Port, on x86 boxes.
Generally, used for something which doesn't have the usual associated object. For example an anonymous address space is not interested in user address space (that is, no process context). Some common ones are:
A page of memory that is not associated with a file on a file system. This can come from expanding the process's data segment with brk(), shared memory segments, or mmap() with a MAP_ANON or MAP_PRIVATE flag. MAP_PRIVATE, although it maps in data from a file, is considered anonymous because any changes do not get written back to the file (any dirty pages have to be moved to swap if the page is freed from main memory).
The buffer cache contains buffers of data on their way to/from the disk. An anonymous buffer is not associated with a file. One example of this is data from a deleted file - it will not be written to any file, but is kept around until it is flushed.
See local APIC and IO-APIC.
Advanced Power Management, power management standard superseded by ACPI. APM and SMP just don't mix.
This is an acronym for the Address Resolution Protocol and this is how a network machine associates an IP Address with a hardware address.
Abstract Syntax Notation, a protocol for structured data, used, for example, in the Q.3 management protocol.
Professor Andrew S. Tanenbaum, writer of MINIX and several essential O/S books.
ATA Packet Interface, used by most CD-ROMs, and other devices.
Technique used in the VM code, referring to balancing various parameters such as the number of pages currently free, to avoid thrashing and other bad memory capacity artefacts. See zones, kswapd bug.
Base Address Registers, for PCI devices.
Binary-Coded Decimal - see a textbook.
See highmem.
kernel_lock, which locks the entire kernel from entry (no other task may run in the kernel code). It is recursive per process and dropped automatically when a process gives up the CPU, then regained on wake-up, in contrast to other spinlocks.
Used colloquially to mean a single bit error in some memory address. Often due to faulty memory (ECC memory can correct single bit errors). Often results in fake Oopsen, with addresses like 0x0008000. Also seen are values some small offset from zero, plus a bit error, which is where the value passed a NULL check due to the bit error, and then the kernel tried to access a structure member by means of the pointer, leading to the offset.
In UNIX-like filesystems, the usage of disks blocks is recorded in the block bitmap, where each set bit indicates a specific allocated block.
A set of standard kernel threads that execute tasks on a queue that have been registered with that type of bottom-half handler for execution. The code is run on return to user space or at the end of a hardware interrupt. In 2.3.43 a more general solution with softirqs and tasklets was implemented. Sometimes abbreviated to "bh", which should not be confused with buffer head, which is also abbreviated to "bh".
An intermediate buffer. Used for example, in "faking" alignment to a client from non-aligned resources.
Big-reader locks, used when there are many contending for read access to a resource, and very few contending for writes (thus the balance is towards very fast read locking, and very slow write locking).
BootStrap Processor, or the CPU which enables the other CPUs in an SMP system.
Block storage segment. This is the memory mapping section containing the data allocated for a binary image at execution time. Also known as "Block Started by Symbol" and "Bull-Shit Storage".
Branch Target Buffer - on x86 processors, the cache of recent conditional jump results.
The memory allocation scheme used in the kernel. A vector of lists of free pages is kept, ordered by the size of the chunk (in powers of two). When a chunk is allocated, it is removed from the relevant list. When a chunk is freed back to the free pages pool, it is placed in the relevant list, starting from the top. If it is physically contiguous with a present chunk, they are merged and placed in the list above (i.e. where the chunks are twice the size), and this operation percolates up the vector. As regions are merged whenever possible, this design helps to reduce memory fragmentation. FIXME
The buffer cache is a hash table of buffers, indexed by device and block number. LRU lists are maintained for the buffers in the various states, with separate lists for buffers of different sizes. With 2.3's unification of the buffer and page caches, each buffer head points to part or all of a page structure, through which the buffer's actual contents are available. FIXME
A structure containing information on I/O for some page in real memory. A buffer can be locked during I/O, or in several other states depending on its usage or whether it is free. Each buffer is associated with one page, but every page may have several buffers (consider the floppy on x86, where the I/O block size is 512 bytes, but each page is commonly 4096 bytes).
Used in kernel code in tests for "impossible" conditions. Signify a kernel bug or faulty hardware.
Giving a card on a bus (e.g. ISA,PCI) the ability to read/write directly to main memory. This is how DMA is performed on PCI busses.
Endianness.
Where the cache of a CPU represents the current memory set used by a task, there is said to be cache affinity with that task. A good thing if the task is regularly scheduled on that CPU. See processor affinity.
On an SMP system, ensuring that the local memory cache of each CPU is consistent with respect to the values which may be stored in other CPUs' caches, avoiding coherency problems such as the "lost update". This is achieved by the hardware in concert with the operating system.
A section of the hardware cache, around 32 bytes large. Kernel structures are often designed such that the commonly-accessed members all fit into one cache-line, which reduces cache pollution. Structures such as this are cache line aligned.
A hardware phenomenon in an SMP system, where two tasks on different CPUs are both accessing the same physical memory in a cache line. This means as each task runs, when it changes the memory, it must invalidate the other CPU's relevant cache line (to ensure cache coherency). Then, when the task on the other CPU runs, it must reload the cache line (as it's set invalid), before changing it. Repeat ad jocularum. A bad thing (TM). A common reason for putting a lock on a different cache line than the data mutexed by the lock : then the "other" task can grab and drop the lock without having to necessarily invalidate the cache line on the first CPU. FIXME
Where during execution of a task, another task is scheduled onto that CPU which disrupts useful lines of the current cache contents, which will be used soon. That is, cache pollution is a non-optimal situation where the other process would have been bettered scheduled on a different CPU or at a different time. The aim is to minimise the need to replace cache lines, obviously increasing efficiency.
x86 hardware support for mode switch to kernel (i.e. system call). In Linux, int 0x80 will trigger the call gate.
These are defined names of capabilities for specific tasks provided by the kernel, e.g. CAP_SYS_NICE.*
A process under the aegis of a chroot() syscall is in a chroot jail, and cannot access the file system above its notion of root directory /.
x86 assembler instructions for disabling and enabling interrupts, respectively. There are CPU-local and global variants of these. Code running with interrupts disabled must be fast, for obvious reasons (this is called interrupt latency).
Eric Raymond's proposal for a replacement to the current kernel build system. See http://www.tuxedo.org/~esr/kbuild.
A cache whose content is invalid or irrelevant with respect to some task to be run.
I/O interface used in O/S's such as Windows NT. Userspace notifies the kernel of each file descriptor the program is interested. The O/S uses a callback for each fd to indicate that I/O is ready.
Where two tasks each want an exclusive resource. You may hear talk of, for example, spinlock contention, which is where one or more tasks is commonly busy-waiting for a spinlock to become unlocked, as it is being taken by other tasks.
Refers to the changes necessary in the CPU when the scheduler schedules a different process to run on the CPU. This involves invalidating the TLB, loading the registers with the saved values, etc. There is an associated cost with such a switch, so it is best to avoid unnecessary context switch when possible. Note that the division of kernel-mode and user-mode means a similar, but simpler, operation is necessary when a syscall moves into kernel mode. However this is not called a context switch, as the mode switch doesn't change the current process. See lazy TLB. One good of feature of Linux is its extremely low context and mode switch cost, compared to an operating system like Solaris.
Copy-On-Write, efficiency method where a page or other resource is shared until an attempt to write is made. In that case a copy is made, and the write is done to the copy.
Current Privilege Level. FIXME
A vital code path which should be optimised for the common case. Critical paths are executed frequently and form the important trunk routes of various kernel operations. An example would be buffer head manipulation during file I/O.
Code storage segment, aka text section. This is the memory mapping containing the executable code (text) for a binary image.
FIXME
An experimental new Makefile set up for configuring and compiling the kernel, written by Michael Elizabeth Chastain.
The cache of dentry structures. Under UNIX an entry in a particular directory must be searched for linearly, so even if the disk block containing the directory entry list is in-core, there is an associated cost. The dcache stores recent results of these searches which in general speeds up these disk searches by a large factor. Recent 2.3 work uses the dentries to allow multiple mounting, union mount, and more. The hardware data cache is usually referred to as the D-cache.
Any of a number of situations where two or more processes cannot proceed because they are both waiting for the other to release some resource. FIXME(give good references).
See write behind.
In demand paging, where the page is to be zeroed when actually created (common case: bss segment of an executable image, which is uninitialised heap data for the executable). Also called ZFOD.
Directory entry, in-core structure defining a file's details: inode, parent dentry etc. Cached in a hash table indexed by hashed filename (see dcache).
IP packet bit indicating it should not be fragmented. The remote host will return ICMP notifications if the packet had to be split anyway, and these are used in MTU discovery.
Provides hooks for notifying tasks when the contents of a directory has changed. Note "contents" can refer to dentries, the file inodes, or even the file contents themselves (file notification).
Dial-On-Demand for PPP connections over a standard telephone line.
In stream I/O conditions, data that has already been read and processed is not needed again. The VM ideally should recognise this and mark the used pages as unneeded, so they can be discarded first. This technique is called "drop behind".
Data storage segment, aka data section. This is the memory mapping containing the initialised data for a binary image.
Processors such as the Pentium Pro, that can decode and execute two instructions simultaneously.
Abbrev. fr. duplication.
Double word, i.e. 4 bytes on x86.
See extended attributes.
What the buddy allocator currently does, i.e. merge adjacent blocks as soon as possible.
The interrupt is triggered by the rising or falling edge of the interrupt line. This makes IRQ line sharing difficult, as an edge may occur while an ISR is running, and it could be easily missed; to allow sharing level-triggered interrupts are usually used.
Extended Instruction Pointer. This register contains the PC value of a task, that is, it points to the next instruction to be fetched, decoded etc.
This algorithm, often used in disk accesses, keeps an ordered list of requests. When the current request on the disk (e.g. the disk block) has been satisfied, the next strictly greater request on the list is dealt with. When a new request arrives, it is inserted into the ordered list in position (e.g. if the new requested block number is less than the current handled request, it goes before it in the list). When reaching the end of the list, the elevator changes direction, and the situation is reversed.
Explicitly-Parallel Instruction set Computing, an instruction set architecture where every dependency for an instruction is encoded into the instruction itself. This has the potential to be faster as the compiler can encode the data dependencies in the instructions.
A general algorithm for dealing with contention cases; for example, collisions on a network bus, or contention for a spinlock.
Also known as multi-part or multi-stream files, files with extended attributes deviate from the principle of files being a simple single data stream. An example of extended attributes is the Macintosh's "resource fork", which is associated with a specific file (known as the "data fork").
A scheduler which ensures fairness between users, such that a user's process count and associated cost only impacts that user, rather than the whole system as currently. Rik van Riel and Borislav Deianov have both produced different patches to implement this.
On SMP caches, when two parts of single block are accessed, neither of which collide with the other, the cache coherency protocol may not be able to detect this, and mark the block as "shared" even when it isn't. This is known as false sharing.
The code path most commonly taken, often optimised heavily at the expense of less frequently-taken blocks of code. This is the reason you see so many gotos in core functions - it produces common-path code far more efficient than an optimising compiler can manage.
file descriptor.
The mapping of a file's contents into memory.
A user-space request for a mmap starting at a fixed virtual address. Generally not useful or guaranteed to work; a notable exception is overlayed mmaps, where a mmaped area has further mmaps of different types at fixed positions in the map.
Fully-Qualified Domain Name, e.g. martyr.darrenemerson.co.uk.
For AGP setups, Graphics Aperture Relocation Table.
GNOME's source documentation system (similar to javadoc). Available by CVS from gnome. Kernel driver interface descriptions, built from source using gdoc, are currently being written in 2.3.
Global Descriptor Table. Something to do with x86 memory segmentation I think (FIXME). See ldt.
In the kernel, often means "get a reference to". This may be as simple as incrementing a usage count, or it may imply attempting to retrieve an object from a cache of some sort, or allocating kernel memory. See put.
Generalised Kernel Hook Infrastructure, an IBM patch to implement hooks into the kernel code.
On-disk filesystem structure, containing information for a block group, such as the inode bitmap and block bitmap.
Human Interface Device (for USB).
On the x86 platform, memory addresses accessed using Intel's Physical Address Extension, which allows real memory of up to 64Gb on IA32. Only certain things can be stored in this space (FIXME: only userspace ? or what ? no bh or irqhandler can use it ...)
Possibly byte-swapping conversion from host-endian to network-endian format. The network is big-endian.
A standard for executable image interfacing (syscalls etc.). This is orthogonal to binary image file format standards such as ELF.
On lkml, this almost always refers to the inode cache, rather than the hardware instruction cache (I-cache).
Internet Engineering Task Force, a standards organisation for protocols used on the Internet.
Integrated Kernel Debugger. A patched version of the kernel containing additional facilities for debugging the kernel.
Instruction-Level Parallelism, i.e. executing more than one instruction at once. See dual-issue.
What you might expect: code with knowledge of used code's internals beyond the API. This breaks modularity, increases complexity and causes bugs, and a design goal is to minimise this behaviour.
As per UNIX, the structure defining a file (except for the filename which is stored in the directory entry). The inode cache is a closed hash table indexed by superblock and inode number, and is only hit when the dentry lookup failed.
In UNIX-like filesystems, the usage of on-disk inodes is recorded in the inode bitmap, where each set bit indicates a specific allocated inode.
Abbrev. fr. "instruction".
This is code executing in the kernel in response to a generated interrupt, without a process context. For more time-consuming work, a bottom-half handler task can be added to be executed at some later time.
x86 hardware support for hardware interrupts. Interrupt gate facilities are provided to allow a known entry path from untrusted privilege levels to the interrupt routines.
Receives interrupts external to a CPU (e.g. from a device) and routes them through the local APIC to the CPU.
I/O privilege level system call and bit in x86 EFLAGS register.
Inter-processor Interrupt. On SMP machines this refers to an interrupt sent between CPUs, indicating some event that the other CPU needs to be aware of, for example smp_invalidate_interrupt, which invalidates a CPU's TLB.
On-disk format often used for CDs. See also Joliet.
Interrupt Service Routine, or interrupt handler. Also, on x86 APICs, In-Service Register, confusingly enough.
Basic packet of kernel time, around 10ms on x86. Related to HZ, the basic resolution of the operating system. The timer interrupt is raised each 10ms, which then performs some h/w timer related stuff, and marks a couple of bh's ready to run if applicable.
Microsoft extension to ISO9660, commonly used.
A filesystem that writes all changes to some log file, to preserve filesystem consistency (for example, an inode's link count should equal the number of directory entries pointing to that inode). There are two types; FIXME log only metadata such as link count, whereas FIXME logs both meta-data and data, effectively writing all data twice. ext3, ReiserFS and XFS are all meta-data loggers, the second type are rare as the same effect can be achieved using RAID.
FIXME
This allows user-space memory access from other contexts, such as in a bh. This also allows kernel drivers to set up DMA from a device directly to user-space pages.
Kernel memory allocation routine. See vmalloc. kmalloc() ensures physical address contiguity.
Problems with VM balancing in late 2.3 kernels. People in a variety of situations found that the VM subsystem was far from optimal, leading to the kernel swapping daemon kswapd taking an inordinate amount of CPU resources; thrashing was also a common problem. Work is undergoing to fix these scenarios.
See L2.
The L2 (level 2) cache is a memory cache in the order of half a megabyte wide, located on the motherboard. The much faster L1 cache is in-CPU, but correspondingly smaller. x86 has a two-level cache, while some architectures such as Alpha AXP have an additional L3 cache. Often referred to is the memory hierarchy, which goes something like (with increasing latency and capacity, decreasing cost) registers, L1 cache, L2 cache, L3 cache, main memory, local mass storage device, network.
Some tasks, such as kernel threads like kswapd, don't have a process context, which renders the costly invalidation of the TLB unnecessary (as the user-level page tables are not used). The task structure member mm indicates the user-level address space, which will be NULL for kernel threads. In this case active_mm indicates the used address space (kernel). These address spaces are known as anonymous. Note also that certain tasks may occasionally temporarily drop the user-level address space and have an anonymous address space for a short while.
Local Descriptor Table. The ldt is a per-process memory management table used on x86. See gdt.
General term used in unreliably-connected components. A resource is given a lease which eventually expires. This means the connection can break, without leaving state hanging. NFS, being stateless, does not have the concept of leases.
Also level-sensitive. As used on PCI. PCI's ability to have multiple devices sharing a single interrupt line is made possible by its use of level-triggered interrupts. If interrupts were edge-triggered, it is possible that some of the incoming interrupts might be lost, for example if a second interrupt were to come in while the processor was still in the midst of processing a previous one, the processor would return from servicing the first interrupt but not be aware that a second interrupt had been asserted. With level-triggered interrupts, by contrast, the interrupt line remains active until specifically deactivated by an ISR. This does mean, however, that if a buggy routine forgets to reset the interrupt line after agreeing to deal with the interrupt, then the machine may hang in a loop of constant interrupts.
Large File Support, specifically files larger than 2Gb on 32-bit systems.
Least Frequently Used, another MM technique. The tension between recency of access and frequency of access of a an area of memory is one of the crucial parts of any memory management system.
The standard C libraries. Remember that the kernel build makes no use of libc !
Loop Initialization Primitive, related to target IDs on SCSI busses.
When tasks are fighting for an exclusive resource, but are also ready to run. For example both tasks may be in a loop checking for a resource's availability.
The Linux-Kernel Mailing List. See the FAQ for details.
On-chip interrupt controller provided on P6 and above Intel CPUs. Linux uses the timer interrupt register if a local APIC is available to provide its timer interrupt (is this true ?). The local APIC is part of a replacement for the old-style 8259 PIC, and receives external interrupts through an IO-APIC if there is one.
The fake net device representing the local host net interface, lo. This can also refer to file loopback mounts.
There are two types of mount called loopback mounting. First is the file loopback mount, where a single file on a filesystem itself containing a filesystem is mounted (useful in constructing ISO9660 images for one). The other type, which is possible in 2.4, is "true" loopback mount, where you can do something like mount -t bind /safebin /chrootjail/bin.
As per the CS literature, Least Recently Used. A selection algorithm where a list is kept, ordered by last usage of each element. Often used in page replacement algorithms. Also Most Recently Used and LFU.
This is a blacklist of SCSI devices that do not properly handle probes with a Logical UNit number other than zero. Some mischievous devices respond to all LUNs, while others can hang with a LUN more than zero. LUNs are used in CD changers amongst other devices.
Logical Volume Manager. This allows several physical partitions to be represented as a single block device, amongst other things.
A major page fault occurs when an attempt to access a page not currently present in physical memory was made. The page must be swapped in to physical memory by the fault fixup code.
Memory barrier; ensures on SMP systems that each CPU's view of memory is the same, i.e. enforces order. rmb() ensures reads are done in order, and wmb() ensures writes are done in order. Placing a barrier after some code basically makes sure that all previous memory reads or writes are done (as CPUs can re-order usually to increase performance). (FIXME: this needs to be more clear).
A race condition scenario. FIXME
Name for problems with mm code in v2.1, where pages got rusted and stuck in memory when they should really have been swapped out. Was a problem on small-memory machines. Now fixed (right ?)
The name for a superset of cache coherency protocols. There are five possible states for a cacheline to be in:
This protocol is described in more detail in the Schimmel book.
Information about a file, such as permissions, in contrast to the actual data in the file.
A minor page fault occurs when an attempt to access a page present in physical memory, but without the correct permissions. An example is the first write to a second reference to a shared page, when the kernel must perform the copy-on-write and allow the task to update the copied page.
Memory-mapped I/O, that is I/O memory (memory on hardware devices) accessible through a memory mapping. Also known as an I/O region.
Technical term, meaning very roughly "new data doesn't affect existing data". Used to mean a wide range of different things. For example, a monotonic clock is one which never gives a time prior to the time given previously (that is, sequential reads of the clock always increase).
Model-Specific Register. For the x86 platform, registers that are not guaranteed as part of the Intel Architecture (i.e. could be not present in future CPU models).
The Maximum Segment Size is the largest quantity of data that can be transmitted at one time over a net interface.
Memory Type Range Registers.FIXME: explain what they do.
The Maximum Transmission Unit is a parameter that determines the largest datagram than can be transmitted by an IP interface without it needing to be broken down into smaller units. Typical values are 1500 bytes for an ethernet interface, or 576 bytes for a SLIP interface.
The process of discovering the MTU of a remote site that can be used without causing expensive fragmentation. See DF.
Used occasionally for "network".
TCP algorithm for reducing network traffic in a connection. The local machine will accumulate outgoing data into packets until receiving a packet from the machine, and then the packets are sent. This way small amounts of outgoing data are amalgamated into the packet's larger payloads. Setting the TCP option NODELAY disables this behaviour. FIXME
Network Block Device - presents a local block device to a remote device.
Meaning an error return rather than the expected entry from some vector or cache. A negative dentry is used when an NFS client tries to operate on a stale file, and with other situations. Without the negative dentry, a lookup is done on each request. This would allow a DoS attack by stealing CPU time doing lookups which will fail. So the dentry is kept around to provide a near-immediate "stale file" response to the client. See fs/dcache.c:d_delete(). Apparently positive dentries are not allowed to become negative. A negative dentry is one with a NULL d_inode field.
Non-maskable interrupt - highest-priority interrupt that can even interrupt standard h/w interrupt ISRs. The NMI watchdog is a kernel facility, running off NMIs, which checks CPUs for recent activity (in the last five minutes), to detect permanently stalled CPUs. NMIs are also delivered by motherboards in the case of serious hardware problems (like bad RAM).
A routine executed when a virtual address has been accessed when the page is for some reason not available in real memory. An example would be a swapped out page, hence the nopage operation would need to bring the page into real memory from the swap file.
"Number of", e.g. nr_free_pages(). Preferred to the "no_" form as it's not ambiguous with the negative case (compare no_free_pages()).*
Possibly byte-swapping conversion from network-endian to host-endian format. The network is big-endian.
Non-Uniform Memory Architecture. Usually on an SMP system, all memory beyond the caches costs an equal amount to reach for each CPU. In NUMA systems, some memory can be accessed more quickly than other parts. Later kernels have some NUMA support for preferring to use memory "nearer" to the CPU, rather than higher-latency distant memory. See Documentation/vm/numa for details.
A graphics card manufacturer that produce proprietary, non-open-source/binary kernel module drivers. Complaints about this module go to NVidia; such complaints are not welcome on linux-kernel. Also check #nvidia on irc.openprojects.net.
Notation from Computer Science literature for complexity of algorithms. An O(1) algorithm takes constant time; an O(n) algorithm takes time proportional to N, the number of data, and so on. Also used for space complexity, the number of memory "units" used in the algorithm.
Compaq standard for USB controllers.
VM algorithm. A one-handed clock algorithm has one "clock hand" sweeping through pages, performing ageing and other activities.
In networking Out-Of-Band data. FIXME
An oops is generated on some kernel errors. This may be due to kernel bugs or faulty hardware (especially bad RAM). Often the kernel can continue to operate mostly correctly after an oops. More serious or potentially dangerous errors can cause a kernel panic (the O/S freezes). This is sometimes done to protect a filesystem from incurring further harm. An oops report will appear in the kernel logs, but is only useful to developers after decoding it with ksymoops (see Documentation/oops-tracing.txt). Note also that klogd can corrupt oops reports by incorrectly decoding it automatically. It is recommended to pass the -x option to klogd on startup to prevent this from happening. The plural form is oopsen.
Some processors (such as Intel Pentium Pro and above) can decode and execute more than one instruction at a time in parallel. This means a sequence of instructions is not necessarily the order in which those instructions are completed. See mb().
Physical Address Extension - Intel's 36-bit physical address extensions for highmem.
Marking a page as "old" in some way, that is, it hasn't been used for some time. Old pages are preferred for swapping out (although this is a simplification of the page swap selection algorithm), and are aged by kernel threads via a clock algorithm. Confusingly, page->age==0 indicates an infinitely old page (FIXME: true ?). The pte's are also aged for selection for flushing from the page table to the page or swap cache.
The page cache caches pages mapped from files (e.g. mapping in an executable image).
A system to ensure that virtually-contiguous pages are not mapped to the same cache lines, improving performance. FreeBSD does this, but Linux, after some discussion, doesn't (yet).
A task which steals from a process's vm, taking pages for itself. An example is kswapd, which runs through the pages of the system swapping out pages to keep freepages.min number of free pages.
By analogy with optimal, the worst situation possible in some context.
FIXME
Page Global Directory, or the abstracted top level of the multi-level page tables. See pmd, pte. On x86, there are only two real levels of page tables; this fact is abstracted away by macros in the source. Each level of page table deals with different sizes of memory. For example, this global directory may deal with areas 4Mb in size. Each entry will be a pointer to a lower table of a smaller-sized directory. On x86, the next level (middle) is non present in hardware, and is folded in to the pgd in the kernel code. The bottom level deals in pages directly (PAGE_SIZE). So the pgd is a directory of page tables. Code which traverses this structure (such as the bttv driver) is said to walk the page tables.
FIXME
Programmable Interrupt Controller. See 8259 PIC, local APIC, IO-APIC.
The old-fashioned transfer protocol for devices on an IDE bus. Now mostly used is DMA, and standards like UDMA.
A pinned page cannot be swapped out of main memory.
Page Middle Directory. The middle level of page tables.
Used in the slab cache for debugging purposes. Slabs are filled with a special value after they have been freed in order to catch double frees.
A standard specifying semantics and interfaces for a UNIX-like kernel interface, along with standards for user-space facilities. There is a core POSIX definition which must be supported by all POSIX-conforming OSes, and several optional standards for specific facilities; so you may see references to "POSIX shm" or "POSIX real-time signals".
PPP (point-to-point protocol, often used in modem up-links) running over Ethernet.
Involuntary switching of a CPU from one task to another. User-space is pre-empted by interrupts, which can then either return to the process, or schedule another process (a process switch will also occur when the process "voluntarily" gives up the CPU by e.g. waiting for a disk block in kernel mode).
Kernel mode tasks are never pre-empted (except by interrupts) - they are guaranteed use of the CPU until they sleep or yield the CPU. Some kernel code runs with interrupts disabled, meaning nothing except an NMI can interrupt the execution of the code.
Many pages (such as those for a process's BSS segment) need to be zeroed when accessed. As the kernel only takes the page on the page fault for these areas, it is possible to pre-zero some amount of pages before the page faults occur. FreeBSD does this, and there has been some inconclusive talk of including this feature in Linux. See demand zero.
Obviously, the equivalent of printf for kernel code (remember, you can't use libc functions in the kernel). Can be called from interrupt context (although it does take the console spinlock) but floating-point cannot be used, as throughout the kernel.
See priority inversion.
A situation where a low-priority task holds a resource which a higher-priority task is waiting for. This can occur with tasks scheduled with SCHED_OTHER or SCHED_IDLE, where a task with a higher scheduling priority will always be scheduled before in preference. This leads to deadlock as the low-priority task cannot be scheduled in order to release the lock, and the high-priority task can't continue without the lock on the resource. Priority inheritance is a putative solution, where the low-priority task temporarily takes on the priority of the other task, enabling its scheduling and release of the lock. However this approach can itself lead to deadlock (FIXME: is this really true ?).
Some OS's have system calls for binding a process to a particular CPU. The kernel ensures that when the process is re-scheduled it will be run on that processor. Linux currently doesn't support this and its relative merits are debatable.
The ordering of memory read and writes as seen from the CPU's point of view.
Code executing in the kernel on behalf of a user process (from a syscall), is said to have a process context. Code not in a process context is not allowed to sleep or wait (see interrupt context).
A set of user processes sharing the same group pid. One of these processes will be the group leader (same pid as pgid).
When a VM is capable of completely suspending processes for a while. Also known as process swapping.
Of a network device, sucks all packets regardless of destination.
Page Table Entry, or a value containing the physical address of a page along with associated bits indicating, for example, that the entry is valid and the related page is present in real memory.
In the kernel, means "drop a reference to". This indicates that whatever task was using the object previously is no longer interested. This may just decrement a usage count, or perform a more complicated reaping routine.
Quality of Service.
Generally, where the ordering of events can affect the outcome of a situation. These bugs are caused by incomplete mutual exclusion, so a task or interrupt can come in and change something vital under the feet of another task. They are notoriously hard to debug as they are often marginal (often only occur in pathological=="very unlikely" situations).
Redundant Array of Independent/Inexpensive Disks. See the standard literature.
Reliability, Availability & Serviceability.
To avoid DoS attacks, some facilities have upper limits on how often they can trigger. Most trite example is logging.
I/O that does not go through the caches that Linux provides as standard to all block devices. This feature can be useful to database systems, where the database back-end has a better idea of the data access pattern than the kernel can, hence increasing performance.
Where I/O is performed before strictly required in an attempt to improve performance. This is performed by the Linux kernel, and often by hardware as well.
See completion ports.
Adding protection (e.g. read or write) to kernel pages, for use in detecting stack overflow and other situations. This often works by adding a guard area at the end of each buffer in the slab allocator, which is checked to make sure it hasn't been overwritten.
FIXME
On out-of-order processors, a retired instruction is one that has been executed and completed.
ring 0 is the protection level provided by x86 hardware for kernel mode. Ring 3 is used in Linux for user mode, and rings 1 and 2 are unused. Ring transition to a higher protection level is only possible through gates. Interrupt gates are set up for transitions due to interrupts, and call gates are used for system calls.
A buffer of data which is of fixed size; when it fills, further data is placed back at the start of the buffer, overwriting the old data, in a "ring". Commonly used in device drivers.
An NFS option whereby any actions performed by a remote client's root account are actually attempted by the user "nobody" on the server.
Common abbreviation used in net code for "receive".
An I/O mechanism, whereby devices can write to memory at several physically non-contiguous addresses (scatter), and compose reads from physically non-contiguous addresses in memory (gather).
As per the literature, a mutual exclusion or resource management method where tasks sleep waiting for a semaphore to be granted to them. Later versions of the Linux kernel also provide a more complicated type, the read-write semaphore, which allows a task to specify its intentions with regard to the resource. Hence several readers are allowed for the resource, but a writer will have mutual exclusion.
Sessions are related to terminal handling. Sessions can have a controlling terminal; by becoming a session group leader, a daemon process can detach itself from any controlling terminal. A session group is a set of processes created by the session leader. FIXME
The set_fs() facility enables the copy_*_user routines to access kernel (KERNEL_DS) or user (KERNEL_FS) memory. This is required in kernel code that uses functions that expect to copy data from user-space, so that the correct segment is used (i.e. copying from kernel memory rather than user space memory).
Shared memory. This is memory accessible through more than one processes' virtual address map. There are standardised shm interfaces derived from System V, as well as POSIX.
Silly-deleting in NFS is used when somebody tries to delete a file that is held open by some other process. The deletion is delayed, and the file is instead renamed to something along the lines of .nfsxxxxxxxxx. The last process to close the file will then attempt to delete it.
Cache-based kernel memory allocation scheme used by kmalloc(). Bonwick's paper can be found at ftp://ftp.bitmover.com/pub/slab.ps. When actually requiring new pages, __get_free_pages() is called which uses alloc_pages(), with the standard buddy allocator.
Interrupts run on return to user space or after a hardware interrupt, and there is a fixed number available. See tasklet, bottom-half handler. These methods are usually used to run kernel code resulting from a hardware interrupt of some kind, where the necessary operations take too long to execute in hardirq context. A notable case is routines queuing on TIMER_BH which are executed depending on the hardware timer tick. These are different from DOS soft interrupts as they don't make use of hardware soft interrupt support (int instruction), although these types of soft interrupt are used in kernel code, for example kernel_thread().
FIXME
A busy-wait method of ensuring mutual exclusion for a resource. Tasks waiting on a spin-lock sit in a busy loop until the spinlock becomes available. On a UP (single processor) system, spinlocks are not used and are optimised out of the kernel. There are also read-write spinlocks, where multiple readers are allowed, but only one writer. See Documentation/spinlocks.txt in the kernel source for details.
See cli/sti.
An fs-specific data structure, stored on disk, that holds various bits of information about the fs on the disk, such as a pointer to the free blocks structure, the number of free inodes, etc. In the code, the inode manipulation routines are entered in super_operations vector of function pointers. Often abbreviated to sb.
The Single Unix Specification, a standard defined with the aim of standardisation of Unix-like OS semantics and APIs. There is also a second version of this standard, referred to as SuS v2.
System V Release 4, as in the UNIX source that POSIX is mostly based on.
Protection against TCP/IP SYN flooding DoS attack by storing state information about SYN packets from hosts. See ftp://koobera.math.uic.edu/syncookies.html.
The method by which a user process requests facilities provided by the kernel. Examples are wait4(2), open(2), and sched_setscheduler(2). By an architecture-specific method, the kernel is executed on behalf of the process in kernel mode at the entry point of the system call. Generally there is a convention in the source that syscall entry points are prefixed with sys_, for example, the entry point to open(2) is the function sys_open in the file fs/open.c. Code executing in kernel mode (that is, with access to the kernel's memory) from a syscall entry point is said to have a process context. On x86, all system calls executed from user mode use int 0x80 and came through the syscall call gate in arch/i386/kernel/entry.S.
Tail merging is a technique which collects tails of multiple files together in one block to save space, reducing internal fragmentation, especially at larger block sizes.
The kernel abstraction for every user process, user thread, and kernel thread. All of these are handled as tasks in the kernel. Every task is described by its task_struct. User processes/threads have an associated user_struct. When in process context, the process's task_struct is accessible through the routine get_current, which does assembly magic to access the struct, which is stored at the bottom of the kernel stack. When running in kernel mode without process context, the struct at the bottom of the kernel stack refers to the idle task.
"Software interrupt" routines running when the software interrupt is received at a return to user space or after a hardware interrupt. They do not run concurrently on multiple CPUs, but are dynamically allocatable. See softirq and bottom-half handler.
A linked list of processes or threads. Tasks, for example, can queue on bottom-half handlers waiting to be executed. See wait queue.
FIXME
Tagged Command Queuing. This facility allows the kernel to send several asynchronous requests at once to a block device. The device can service the requests as it sees fit; the advantage being that the device can arrange the transfers as best suits the device hardware. TCQ works for both reads and writes.
Consider a set of tasks on a wait queue, waiting for some condition. When the event occurs, it is often the case that only some of the tasks has real work to do. Older kernels always woke up all tasks on the wait queue - this meant that each task immediately vied for scheduling - the "thundering herd". More recent kernels have added the TASK_EXCLUSIVE flag to tasks in the wait queue. This ensures that one of the tasks with this flag set will be the only one to be woken up.
Flagging part or all of a CPU's translation lookaside buffer (which is a virtual-address-indexed cache of physical addresses) as invalid. Depending on hardware, the whole TLB may be flushed at any one time, or just a part.
One of those cyclical lkml threads. O/S's such as Solaris have a "tmpfs" filesystem, providing a fs interface using main memory as storage. This differs from the usual ram disk in that it is swappable. The fs is often used for /tmp, which provides performance gains. Under Linux, ext2fs is already faster than Solaris' tmpfs, so no-one has ever implemented this.
The process by which a compiler has special code to deal with code presented as data within a binary's segment, which is to be executed, amongst other scenarios. A short routine is called before the real procedure which sets up the stack environment correctly. This facility is used in languages such as Ada, and complicates provision of non-executable stack support at kernel level (useful for preventing simplistic buffer over-runs). FIXME
Time Stamp Counter, a very high resolution counter on the Intel platform.
VM algorithm. A two-handed clock algorithm has two "clock hands" sweeping through pages, performing ageing and other activities. Imagine one hand ageing pages, whereas the later one decides to mark pages as unused.
Common abbreviation used in net code for "transmit".
Universal Asynchronous Receiver/Transmitter. UARTs are hardware devices that convert byte-wide data streams in to serial (bit-wide) data streams.
Unified Buffer Cache - the combination of the buffer and page caches to use the same memory pool (disk I/O and VM use the same pool), and data I/O utilises the page cache.
Intel standard for USB controllers.
Each mount structure is represented in the kernel by a dentry tree. Sharing of a tree allows union mount, where two separate mounts are accessible under one mount point.
For micro-operation, or the basic set of CPU operations which are generated from machine code in microcode processors. Intel P6 processors and above have the facility for upgrading the microcode (stepping of the processor).
USB Request Block, or the message struct for Universal Serial Bus.
A patched version of the kernel modified to run on top of user-space. This is intended to be used for testing, experimentation and debugging. See http://user-mode-linux.sf.net.
Use The Source, Luke!
Video 4 Linux
A filesystem that tracks changes to files, for example by storing the entire previous file on a change in data.
VFS is the filesystem-independent in-kernel interface used to access each particular filesystem supported by Linux.
Very Long Instruction Word, instruction sets with large-sized complex instructions encoded into one instruction.
Allocates a virtual address space. Can be slow for some purposes as it requires a TLB flush on all processors. Does not ensure physical address contiguity, but can allocate arbitrary sized areas. Also requires (along with related vfree()) the big kernel lock.
Commercial virtual O/S software. If the modules won't compile for you, DO NOT complain to linux-kernel ! And see FreeMWare/Plex86.
From the C keyword. Used throughout the source and in discussion, signifying a variable or inline asm that cannot be optimised as usual by the compiler. This usually means the compiler shouldn't store the variable in a register, as another task can alter the value of the real memory contents at the time. For example, this is necessary with I/O address spaces, where the hardware device can fill up a buffer asynchronously to a reading task. Variables which are changed from both interrupts and other contexts are also volatile.
A linked list of tasks (processes or threads) waiting on some specific condition. For example, a locked page may have a wait queue of tasks waiting for I/O on that page to complete (that is, !PageLocked(page)).
Real time, that is, the sort you would expect to find on a clock on your wall...
On journalling filesystems, the metadata must be logged. A wandering log system spreads log blocks throughout the filesystem for performance, rather than using a static log area on the disk.
A cache whose contents are not invalid and are useful to the task to be run next. It is important to benchmark code in both cold-cache and warm-cache circumstances.
Code to protect interactivity from being harmed by large writes. FIXME
A.K.A. delayed write, asynchronously writing after the write request was made.
FIXME
External Data Representation, the marshalled form of RPC communications. XDR deals with sizing and endianness issues.
"Execute in-place". For normal storage devices, executables are loaded into main memory before execution. For devices like a ram disk or flash system, it is possible to execute the program straight from the device, without duplication.
Networking where no data is copied to/from userspace during network transmit or receive. This is often touted as a universally good thing, but in fact it is a trade-off between the cost of the copy, and the cost of setting up page tables etc. such that user-space can read the data.
Zero-fill-on-demand, a page filled with zeros for which no space has yet been allocated.
A completed process whose parent has not waited for it. When a child exits, the return status must remain available to the parent, as it may later do a wait4(). A zombie holds its place in the process table until the parent waits for it.
A particular area of free memory. Zoned allocators differentiate between intended uses, and are generally used to model different characteristics of the memory. For example, on the x86 there is only 16Mb of DMA-able memory; the zoned allocator will try to save DMA pages for processes specifically requesting ZONE_DMA memory. There are currently three zone types: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM (see highmem).