Xen is responsible for managing the allocation of physical memory to domains, and for ensuring safe use of the paging and segmentation hardware.
1. Memory Allocation
As well as allocating a portion of physical memory for its own private use, Xen also reserves a small fixed portion of every virtual address space.
This is located in the top 64MB on 32-bit systems, the top 168MB on PAE systems, and a larger
portion in the middle of the address space on 64-bit systems.
Unreserved physical memory is available for allocation to domains at a page granularity.
Xen tracks the ownership and use of each page, which allows it to enforce secure partitioning between domains.
Each domain has a maximum and current physical memory allocation. A guest OS may run a ‘balloon driver’ to dynamically adjust its current memory allocation up to its limit.
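As a hedged illustration, the sketch below shows how a balloon driver might hand a single page back to Xen using the XENMEM_decrease_reservation memory operation. It assumes Linux-style guest support code (the HYPERVISOR_memory_op wrapper, a pfn_to_mfn() helper like the one sketched in the pseudo-physical memory section below, and the set_xen_guest_handle macro); the xen_memory_reservation field names follow the public memory interface headers and may differ slightly between Xen versions.

#include <xen/interface/memory.h>  /* struct xen_memory_reservation, XENMEM_* (Linux header path assumed) */

/* Return one 4kB page (pseudo-physical frame 'pfn') to Xen, shrinking the
 * domain's current allocation; memory can later be reclaimed, up to the
 * domain's maximum, with XENMEM_increase_reservation. */
static void balloon_out_page(unsigned long pfn)
{
    unsigned long mfn = pfn_to_mfn(pfn);      /* Xen deals in machine frames */
    struct xen_memory_reservation res = {
        .nr_extents   = 1,
        .extent_order = 0,                    /* order-0 extent: a single page */
        .domid        = DOMID_SELF,
    };

    set_xen_guest_handle(res.extent_start, &mfn);
    HYPERVISOR_memory_op(XENMEM_decrease_reservation, &res);
}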
2. Pseudo-Physical Memory
Since physical memory is allocated and freed on a page granularity, there is no guarantee that a domain will receive a contiguous stretch of physical memory.
However most operating systems do not have good support for operating in a fragmented physical address space. To aid porting such operating systems to run on top of Xen, we make a distinction between machine memory and pseudo-physical memory.
Put simply, machine memory refers to the entire amount of memory installed in the machine, including that reserved by Xen, in use by various domains, or currently unallocated. We consider machine memory to comprise a set of 4kB machine page frames numbered consecutively starting from 0. Machine frame numbers mean the same within Xen or any domain.
Pseudo-physical memory, on the other hand, is a per-domain abstraction. It allows a guest operating system to consider its memory allocation to consist of a contiguous range of physical page frames starting at physical frame 0, despite the fact that the underlying machine page frames may be sparsely allocated and in any order.
To achieve this, Xen maintains a globally readable machine-to-physical table which records the mapping from machine page frames to pseudo-physical ones. In addition, each domain is supplied with a physical-to-machine table which performs the inverse mapping. Clearly the machine-to-physical table has size proportional to the amount of RAM installed in the machine, while each physical-to-machine table has
size proportional to the memory allocation of the given domain.
Architecture dependent code in guest operating systems can then use the two tables to provide the abstraction of pseudo-physical memory.
In general, only certain specialized parts of the operating system (such as page table management) need to understand the difference between machine and pseudo-physical addresses.
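As a rough sketch, the two translations reduce to simple array lookups. The table names below (machine_to_phys_mapping and phys_to_machine_mapping) follow the Linux guest code and are assumptions rather than part of the interface itself.

/* Globally readable machine-to-physical table supplied by Xen. */
extern unsigned long *machine_to_phys_mapping;
/* Per-domain physical-to-machine table maintained by the guest. */
extern unsigned long *phys_to_machine_mapping;

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
    return phys_to_machine_mapping[pfn];     /* pseudo-physical -> machine */
}

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
    return machine_to_phys_mapping[mfn];     /* machine -> pseudo-physical */
}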
3. Page Table Updates
In the default mode of operation, Xen enforces read-only access to page tables and requires guest operating systems to explicitly request any modifications.
Xen validates all such requests and only applies updates that it deems safe.
This is necessary to prevent domains from adding arbitrary mappings to their page tables.
To aid validation, Xen associates a type and reference count with each memory page.
A page has one of the following mutually-exclusive types at any point in time:
page directory (PD), page table (PT), local descriptor table (LDT), global descriptor table (GDT), or writable (RW). Note that a guest OS may always create readable mappings of its own memory regardless of its current type.
This mechanism is used to maintain the invariants required for safety; for example, a domain cannot have a writable mapping to any part of a page table as this would require the page concerned to simultaneously be of types PT and RW.
mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid)
This hypercall is used to make updates either to the domain's page tables or to the machine-to-physical mapping table.
It supports submitting a queue of updates, allowing batching for maximal performance.
Explicitly queuing updates using this interface will cause any outstanding writable pagetable state to be flushed from the system.
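For example, a guest might batch two page-table entry writes into a single hypercall roughly as follows. This is a sketch only: it assumes the guest kernel's HYPERVISOR_mmu_update wrapper and the public mmu_update_t layout, in which ptr holds the machine address of the entry to modify (its low two bits selecting the update type) and val holds the new entry value.

#include <xen/interface/xen.h>   /* mmu_update_t, MMU_NORMAL_PT_UPDATE, DOMID_SELF (Linux header path assumed) */

/* Write new values into two PTEs identified by their machine addresses. */
static void update_two_ptes(unsigned long pte_maddr0, unsigned long val0,
                            unsigned long pte_maddr1, unsigned long val1)
{
    mmu_update_t req[2];
    int success_count;

    req[0].ptr = pte_maddr0 | MMU_NORMAL_PT_UPDATE;   /* ordinary page-table update */
    req[0].val = val0;
    req[1].ptr = pte_maddr1 | MMU_NORMAL_PT_UPDATE;
    req[1].val = val1;

    /* Xen validates each request and reports how many it applied. */
    HYPERVISOR_mmu_update(req, 2, &success_count, DOMID_SELF);
}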
4. Writable Page Tables
Xen also provides an alternative mode of operation in which guests have the illusion that their page tables are directly writable.
Of course this is not really the case, since Xen must still validate modifications to ensure secure partitioning. To this end, Xen traps any write attempt to a memory page of type PT (i.e., that is currently part of a page table).
If such an access occurs, Xen temporarily allows write access to that page while at the same time disconnecting it from the page table that is currently in use.
This allows the guest to safely make updates to the page because the newly-updated entries cannot be used by the MMU until Xen revalidates and reconnects the page.
Reconnection occurs automatically in a number of situations: for example, when the guest modifies a different page-table page, when the domain is preempted, or whenever the guest uses Xen’s explicit page-table update
interfaces.
Writable pagetable functionality is enabled when the guest requests it, using a vm_assist hypercall.
Writable pagetables do not provide full virtualisation of the MMU, so the memory management code of the guest still needs to be aware that it is running on Xen. Since the guest’s page tables are used directly, it must translate pseudo-physical addresses to real machine addresses when building page table entries. The guest may not attempt to map its own pagetables writably, since this would violate the memory type invariants; page tables will automatically be made writable by the hypervisor, as necessary.
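A minimal sketch of that translation step is shown below, assuming writable pagetable mode is already enabled, the pfn_to_mfn() helper sketched earlier, and Linux-style _PAGE_* flag values (both helpers and flags are assumptions, not part of the interface).

#define _PAGE_PRESENT 0x001
#define _PAGE_RW      0x002

/* Point a directly-written PTE at pseudo-physical frame 'pfn'.  The entry
 * must contain the machine frame number, not the pseudo-physical one. */
static void set_pte_direct(unsigned long *pte, unsigned long pfn)
{
    unsigned long mfn = pfn_to_mfn(pfn);
    *pte = (mfn << 12) | _PAGE_PRESENT | _PAGE_RW;
    /* Xen traps the write, detaches the page-table page, and revalidates it
     * before the MMU can see the new entry. */
}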
5. Shadow Page Tables
Finally, Xen also supports a form of shadow page tables, in which the guest OS uses an independent copy of its page tables that is unknown to the hardware (i.e., never pointed to by CR3). Instead, Xen propagates changes made to the guest's tables into the real ones, and vice versa. This is useful for logging page writes (e.g., for live migration or checkpointing). A full version of the shadow page tables also allows guest OS porting with less effort.
6. Segment Descriptor Tables
At start of day a guest is supplied with a default GDT, which does not reside within its own memory allocation. If the guest wishes to use other than the default ‘flat’ ring-1 and ring-3 segments that this GDT provides, it must register a custom GDT and/or LDT with Xen, allocated from its own memory.
The following hypercall is used to specify a new GDT:
int set_gdt(unsigned long *frame_list, int entries)
frame_list: An array of up to 14 machine page frames within which the GDT resides. Any frame registered as a GDT frame may only be mapped read-only within the guest's address space (e.g., no writable mappings, no use as a page-table page, and so on). Only 14 pages may be specified because pages 15 and 16 are reserved for the hypervisor's GDT entries.
entries: The number of descriptor-entry slots in the GDT.
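A minimal sketch of registering a one-page GDT follows, assuming the guest's HYPERVISOR_set_gdt wrapper and the pfn_to_mfn() helper sketched earlier; since each descriptor occupies 8 bytes, a single 4kB frame holds up to 512 entries.

#define MY_GDT_ENTRIES 256                 /* illustrative size; fits in one frame */

/* 'gdt_pfn' is the pseudo-physical frame holding the descriptors.  The guest
 * must only ever map this frame read-only once it is registered. */
static int install_custom_gdt(unsigned long gdt_pfn)
{
    unsigned long frame_list[1];

    frame_list[0] = pfn_to_mfn(gdt_pfn);   /* set_gdt expects machine frames */
    return HYPERVISOR_set_gdt(frame_list, MY_GDT_ENTRIES);
}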
The LDT is updated via the generic MMU update mechanism (i.e., via the mmu_update hypercall).
7. Start of Day (the domU boot environment)
The start-of-day environment for guest operating systems is rather different to that provided by the underlying hardware. In particular, the processor is already executing in protected mode with paging enabled.
Domain 0 is created and booted by Xen itself. For all subsequent domains, the analogue of the boot-loader is the domain builder, user-space software running in domain 0. The domain builder is responsible for building the initial page tables for a domain and loading its kernel image at the appropriate virtual address.
8. VM assists
Xen provides a number of “assists” for guest memory management. These are available on an “opt-in” basis to provide commonly-used extra functionality to a guest.
vm_assist(unsigned int cmd, unsigned int type)
The cmd parameter describes the action to be taken, whilst the type parameter describes the kind of assist that is being referred to. Available commands are as follows:
VMASST_CMD_enable Enable a particular assist type
VMASST_CMD_disable Disable a particular assist type
And the available types are:
VMASST_TYPE_4gb_segments Provide emulated support for instructions that rely on 4GB segments (such as the techniques used by some TLS solutions).
VMASST_TYPE_4gb_segments_notify Provide a callback (via trap number 15) to the guest if the above segment fixups are used: allows the guest to display a warning message during boot.
VMASST_TYPE_writable_pagetables Enable writable pagetable mode - described above.
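For instance, a guest wishing to use writable pagetable mode would issue something like the following during boot (assuming the usual HYPERVISOR_vm_assist wrapper and Linux-style header paths):

#include <xen/interface/xen.h>   /* VMASST_CMD_enable, VMASST_TYPE_writable_pagetables */

/* Opt in to the writable pagetable assist described in section 4. */
static void enable_writable_pagetables(void)
{
    HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_writable_pagetables);
}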