Page Tables
Contents
(cont:mm:page-tables) =
This chapter describes page tables in general, how multi-level page tables are allocated, how they work and all the various bits in the PTEs work.
14.3. Page Tables#
14.3.1. Single-level Page Table#
One of the simplest ways to structure a page table for mapping 20-bit page numbers is as a simple array with \(2^{20}\) entries. With this configuration, each virtual page has an entry, and the value in that entry is the corresponding physical page number, as seen in Fig. 14.8. This single-level table is located in physical memory, and the MMU is given a pointer to this table, which is stored in an MMU register. (On Intel-compatible CPUs, the page table pointer is Control Register 3, or CR3.) This is shown in Fig. 14.8, where we see the first two entries in a \(2^{20}\) or 1048576-entry mapping table.
In addition to the translated page number, each entry contains a P bit to indicate whether or not the entry is “present,” i.e., valid. Unlike in C or Java we can’t use a special null pointer, because 0 is a perfectly valid page number[^3].
Below we see pseudo-code for the translation algorithm implemented in an MMU using a single-level table; VA and PA stand for virtual and physical addresses, and VPN and PPN are the virtual and physical page numbers.
PA = translate(VA):
VPN, offset = split[20 bits, 12 bits](VA)
PTE = physical_read(CR3 + VPN*sizeof(PTE), sizeof(PTE))
if not PTE.present:
fault
return PTE.PPN + offset
Note that this means that every memory operation performed by the CPU now requires two physical memory operations: one to translate the virtual address, and a second one to perform the actual operation. If this seems inefficient, it is, and it will get worse. However, in a page or two we’ll discuss the translation lookaside buffer or TLB, which caches these translations to eliminate most of the overhead.
The single-level page table handles the problem of encoding the virtual-to-physical page map, but causes another: it uses 4 MB of memory per map. Years ago (e.g. in the mid-80s when the first Intel CPUs using this paging structure were introduced) this was entirely out of the question, as a single computer might have a total of 4 MB of memory or less. Even today, it remains problematic. As an example, when these notes were first written (2013), the most heavily-used machine in the CCIS lab (login.ccs.neu.edu) had 4 GB of memory, and when I checked it had 640 running processes. With 4 MB page tables and one table per process, this would require 2.5GB of memory just for page tables, or most of the machine’s memory. Worse yet, each table would require a contiguous 4MB region of memory, running into the same problem of external fragmentation that paged address translation was supposed to solve.
14.3.2. Multi-Level Page Tables#
In order to solve the large physically contiguous page table problem and the very large page table memory requirements multi-level page tables are used. For example a 2-level page tables is comprised of one first level page or page directory page and several second level or page table pages that map the actual page frames mapped into the virtual address space.
In order to support a virtual address space larger than 32 bits more levels of page tables are required.
In Fig. 14.11 we see a page table constructed of 3 pages: physical pages 00000 (the root directory), 00001, and 00003. Two data pages are mapped: 00002 and 00004. Any entries not shown are assumed to be null, i.e., the present bit is set to 0. As an example we use this page table to translate a read from virtual address 0x0040102C.
The steps involved in translating this address are:
1) Split the address into page number and offset
2) Split the page number into top and bottom 10 bits, giving 0x001
and 0x001
. (in the figure the top row is hex, the middle two rows are
binary, and the bottom is hex again.)
3) Read entry [001]
from the top-level page directory (physical page
00000
) (note sizeof(entry) is 4 bytes):
address = start [00000000] + index [001] * sizeof(entry)
read 4 bytes from physical address 00000004 (page 00000, offset 004)
result = [p=1, pgnum = 00001]
4) Read entry [001]
from the page table in physical page 00001
:
address = 00001000 + 001*4 = 00001004
read 4 bytes from physical address 00001004
:result = [p=1, pgnum = 00002]
This means that the translated physical page number is 00002
. The
offset in the original virtual address is 02C
, so combining the two we
get the final physical address, 0000202C
.
14.3.2.1. Review questions#
Note
13 A famous computer science quote attributed to David Wheeler is: “All problems in computer science can be solved by another level of indirection,” to which some add “except the performance problems caused by indirection.” A corollary to this is that most performance problems can be solved by adding caching. How are these quotes applicable to paged address translation?
14.3.3. Page Table Entries(PTE)#
The components of a 32-bit Intel page table entry are shown in Fig. 14.13; for more information you may wish to refer to http://wiki.osdev.org/Paging.
14.3.4. Page Permissions - P, W, and U bits#
Page tables allow different permissions to be applied to memory at a per-page level of granularity. The page permission bits are set by software and read by the hardware to prevent illegal access to a page.
P=0/1 - If the present bit is zero, the entry is ignored entirely by the MMU, thus preventing any form of access to the corresponding virtual page. When the present bit is zero the entire PTE can be used by software.
W = 0/1 - Write permission. If the W bit is zero, then read accesses to this page will be allowed, but any attempt to write will cause a fault. By setting the W bit to zero, pages that should not be modified (i.e., program instructions) can be protected. Since correctly-functioning programs in most languages do not change the code generated by the compiler, any attempt to write to such a page must be a bug, and stopping the program earlier rather than later may reduce the amount of damage caused.
U = 0/1 - User permission. If the U bit is zero, then accesses to this page will fail unless the CPU is running in supervisor mode. Typically the OS kernel will “live” in a portion of the same address space as the current process, but will hide its code and data structures from access by user processes by setting U=0 on the OS-only mappings.
14.3.5. Page Access - Accessed and Dirty bits#
Page tables convey whether or not pages have been accessed and/or modified.
These bits are set by hardware and read by software to determine how
frequently a page is being accessed and whether or not its in a modified state.
A=0/1 - If the accessed bit is set the page has been referenced since the last time the bit was cleared. This aids the kernel in determining how active a page is. If the active bit is frequently being set then the page is being referenced frequently and probably not a good candidate for reclaiming.
D-0/1 - If the dirty bit is set the page frame associated with the PFN is dirty or modified and must be written to storage before it can be reclaimed for correctness. Subsequent access to the associated virtual page will result in a page fault so its contents can be restored as long as the present bit was cleared when the page was reclaimed.
14.3.6. Creating a Page Table#
char hello[] = 'hello world\n';
void _start(void)
{
syscall(4, 1, hello, 12); /* syscall 4 = write(fd,buf,len) */
syscall(1); /* syscall 1 = exit() */
}
To see how a page table is created, we start by examining the virtual memory map of perhaps the simplest possible Linux program, shown above. This program doesn’t use any libraries, but rather uses direct system calls to write to standard output (always file descriptor 1 in Unix) and to exit. In Linux, _start is the point at which execution of a program begins; normally the _start function is part of the standard library, which performs initialization before calling main.
When this program runs and its memory map is examined (using the pmap
command) you see the following:
00110000 4K r-x-- [ anon ] <- file header - used by OS
08048000 4K r-x-- /tmp/hello <- .text segment (code)
08049000 4K rwx-- /tmp/hello <- .data segment
bffdf000 128K rwx-- [ stack ]
The address space is constructed of a series of contiguous segments, each a multiple of the 4 KB page size (although most are the minimum 4 KB here), with different permissions for each. (realistic programs will have many more segments; as an example, the address space for the Nautilus file manager process on my Ubuntu 15.10 system has more than 800 segments.) To create a page table for this program, the first step is splitting the page numbers into top and bottom halves (all numbers given in hex or binary), as shown below.
VPN 00110 = 0000 0000 00 01 0001 0000
top10 = 000 bottom10 = 110
VPN 08048 = 0000 1000 00 00 0100 1000
top10 = 020 bottom10 = 048
VPN 08049 = 0000 1000 00 00 0100 1001
top10 = 020 bottom10 = 049
VPN BFFDF = 1011 1111 11 11 1101 1111
top10 = 2FF bottom10 = 3DF
The first three segments are one page long; note that the last segment is 32 pages (128 KB), so it uses entries 0x3DF to 0x3FF in the second-level page table.
The program needs four physical pages for the table; assume that pages 0000, 0001, 0002, and 0003 are used for the table, and pages 00004 and up for data/code pages. The actual page table and associated PTEs may be seen in Fig. 14.14. (note that the choice of physical pages is arbitrary; the page numbers within the page directory and page table entries would of course change if different physical pages were used.)