Invoking Hypercalls
Hypercalls are invoked in a manner analogous to system calls in a conventional operating system; a software interrupt is issued which vectors to an entry point within Xen.
On x86/32 machines the instruction required is int $0x82; the (real) IDT is set up so that this may only be issued from within ring 1.
The particular hypercall to be invoked is contained in EAX — a list mapping these values to symbolic hypercall names can be found in xen/include/public/xen.h.
On some occasions a set of hypercalls will be required to carry out a higher-level function; a good example is when a guest operating system wishes to context switch to a new process, which requires updating various privileged CPU state.
As an optimization for these cases, there is a generic mechanism to issue a set of hypercalls as a batch:
multicall(void *call_list, int nr_calls)
Execute a series of hypervisor calls; nr_calls is the length of the array of multicall_entry_t structures pointed to by call_list.
Each entry contains the hypercall operation code followed by up to 7 word-sized arguments.
Note that multicalls are provided purely as an optimization; there is no requirement to use them when first porting a guest operating system.
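As a minimal sketch of how a guest might assemble such a batch, the following C fragment fills an array of entries before it would be handed to multicall. The structure layout follows only the description above (an operation code plus up to 7 word-sized arguments); the type name multicall_entry_sketch_t, the helper batch_add and the operation codes are hypothetical, and the real definitions in xen/include/public/xen.h may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout based on the text: op code + up to 7 word-sized args.
 * The real multicall_entry_t is defined in xen/include/public/xen.h. */
#define SKETCH_MULTICALL_MAX_ARGS 7

typedef struct {
    unsigned long op;
    unsigned long args[SKETCH_MULTICALL_MAX_ARGS];
} multicall_entry_sketch_t;

/* Illustrative operation codes only; these are NOT Xen's values. */
enum { SKETCH_OP_STACK_SWITCH = 1, SKETCH_OP_FPU_TASKSWITCH = 2 };

/* Append one call to a batch of capacity cap.
 * Returns the new batch length, or -1 if the batch is already full. */
int batch_add(multicall_entry_sketch_t *batch, int len, int cap,
              unsigned long op, const unsigned long *args, int nargs)
{
    assert(batch != NULL);
    assert(nargs >= 0 && nargs <= SKETCH_MULTICALL_MAX_ARGS);
    if (len >= cap)
        return -1;
    batch[len].op = op;
    /* Copy the supplied arguments and zero the unused slots. */
    for (int i = 0; i < SKETCH_MULTICALL_MAX_ARGS; i++)
        batch[len].args[i] = (i < nargs) ? args[i] : 0;
    return len + 1;
}
```

A guest would then pass the filled array and its length to multicall in a single trap into Xen, rather than trapping once per operation.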
Virtual CPU Setup
At start of day, a guest operating system needs to set up the virtual CPU it is executing on.
This includes installing vectors for the virtual IDT so that the guest OS can handle interrupts, page faults, etc. However, the very first thing a guest OS must set up is a pair of hypervisor callbacks: these are the entry points which Xen will use when it wishes to notify the guest OS of an occurrence.
set_callbacks(unsigned long event_selector, unsigned long event_address,
unsigned long failsafe_selector, unsigned long failsafe_address)
Register the normal (“event”) and failsafe callbacks for event processing. In each case the code segment selector and address within that segment are provided. The selectors must have RPL 1; in XenLinux we simply use the kernel’s CS for both event_selector and failsafe_selector.
The value event_address specifies the address of the guest OS’s event handling and dispatch routine; the failsafe_address specifies a separate entry point which is used only if a fault occurs when Xen attempts to use the normal callback.
On x86/64 systems the hypercall takes slightly different arguments. This is because the callback CS does not need to be specified (since the callbacks are entered via SYSRET), and also because an entry address needs to be specified for SYSCALLs from guest user space:
set_callbacks(unsigned long event_address, unsigned long failsafe_address, unsigned long syscall_address)
After installing the hypervisor callbacks, the guest OS can install a ‘virtual IDT’
by using the following hypercall:
set_trap_table(trap_info_t *table)
Install one or more entries into the per-domain trap handler table (essentially a software version of the IDT). Each entry in the array pointed to by table includes the exception vector number with the corresponding segment selector and entry point. Most guest OSes can use the same handlers on Xen as when running on the real hardware.
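To illustrate the idea of a "software IDT", here is a small self-contained model in C of a per-domain trap table: entries carry a vector number, a segment selector and an entry point, as described above. The names trap_entry_sketch_t, trap_table_sketch_t, install_traps and lookup_trap are hypothetical, and the field layout is an assumption; the real trap_info_t is declared in the Xen public headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: vector number, code segment selector, entry point. */
typedef struct {
    uint8_t       vector;
    uint16_t      cs;
    unsigned long address;
} trap_entry_sketch_t;

#define SKETCH_NR_VECTORS 256

/* Model of the per-domain table that the hypervisor would maintain. */
typedef struct {
    trap_entry_sketch_t slot[SKETCH_NR_VECTORS];
    uint8_t             installed[SKETCH_NR_VECTORS];
} trap_table_sketch_t;

/* Install n entries; a later entry for the same vector replaces an earlier one. */
void install_traps(trap_table_sketch_t *t,
                   const trap_entry_sketch_t *entries, size_t n)
{
    assert(t != NULL && (entries != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        t->slot[entries[i].vector] = entries[i];
        t->installed[entries[i].vector] = 1;
    }
}

/* Return the handler registered for a vector, or NULL if none is installed. */
const trap_entry_sketch_t *lookup_trap(const trap_table_sketch_t *t,
                                       uint8_t vector)
{
    assert(t != NULL);
    return t->installed[vector] ? &t->slot[vector] : NULL;
}
```

In this model a guest would register, for example, vector 14 (page fault) and vector 3 (breakpoint) with its kernel code selector, and the hypervisor would consult the table when bouncing an exception to the guest.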
A further hypercall is provided for the management of virtual CPUs:
vcpu_op(int cmd, int vcpuid, void *extra_args)
This hypercall can be used to bootstrap VCPUs, to bring them up and
down and to test their current status.
Scheduling and Timer
Domains are preemptively scheduled by Xen according to the parameters installed by domain 0.
In addition, however, a domain may choose to explicitly control certain behavior with the following hypercall:
sched_op_new(int cmd, void *extra_args)
Request scheduling operation from hypervisor. The following sub-commands are available:
SCHEDOP_yield voluntarily yields the CPU, but leaves the caller marked as runnable. No extra arguments are passed to this command.
SCHEDOP_block removes the calling domain from the run queue and causes it to sleep until an event is delivered to it. No extra arguments are passed to this command.
SCHEDOP_shutdown is used to end the calling domain’s execution. The extra argument is a sched_shutdown structure which indicates the reason for the shutdown (e.g., reboot, halt, power-off).
SCHEDOP_poll allows a VCPU to wait on a set of event channels with an optional timeout (all of which are specified in the sched_poll extra argument). The semantics are similar to the UNIX poll system call. The caller must have event-channel upcalls masked when executing this command.
sched_op_new was not available prior to Xen 3.0.2. Older versions provide only the following hypercall:
sched_op(int cmd, unsigned long extra_arg)
This hypercall supports the following subset of sched_op_new commands:
SCHEDOP_yield (extra argument is 0).
SCHEDOP_block (extra argument is 0).
SCHEDOP_shutdown (extra argument is numeric reason code).
To aid the implementation of a process scheduler within a guest OS, Xen provides a virtual programmable timer:
set_timer_op(uint64_t timeout)
Request a timer event to be sent at the specified system time (time in nanoseconds since system boot).
Note that calling set_timer_op prior to sched_op allows block-with-timeout semantics.
Page Table Management
Since guest operating systems have read-only access to their page tables, Xen must be involved when making any changes.
The following multi-purpose hypercall can be used to modify page-table entries, update the machine-to-physical mapping table, flush the TLB, install a new page-table base pointer, and more.
mmu_update(mmu_update_t *req, int count, int *success_count)
Update the page table for the domain; a set of count updates is submitted for processing in a batch, with success_count being updated to report the number of successful updates.
Each element of req[] contains a pointer (address) and value; the least significant two bits of the pointer are used to distinguish the type of update requested as follows:
MMU_NORMAL_PT_UPDATE: update a page directory entry or page table entry to the associated value; Xen will check that the update is safe.
MMU_MACHPHYS_UPDATE: update an entry in the machine-to-physical table. The calling domain must own the machine page in question (or be privileged).
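The encoding of the update type in the low two bits of the pointer can be sketched in C as below. The numeric values chosen for the two update types, the structure mmu_update_sketch_t and the helper functions are assumptions made for illustration; the authoritative definitions are in xen/include/public/xen.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed numeric values, for illustration only. */
#define SKETCH_MMU_NORMAL_PT_UPDATE 0u
#define SKETCH_MMU_MACHPHYS_UPDATE  1u
#define SKETCH_MMU_TYPE_MASK        3ULL /* least significant two bits */

/* Hypothetical request element: pointer (address) plus value. */
typedef struct {
    uint64_t ptr; /* address with the update type folded into bits 0-1 */
    uint64_t val; /* new contents */
} mmu_update_sketch_t;

/* Build one request; addr must have its low two bits clear. */
mmu_update_sketch_t make_update(uint64_t addr, unsigned type, uint64_t val)
{
    assert((addr & SKETCH_MMU_TYPE_MASK) == 0);
    assert(type <= SKETCH_MMU_TYPE_MASK);
    mmu_update_sketch_t u = { addr | type, val };
    return u;
}

/* Recover the update type from a request. */
unsigned update_type(const mmu_update_sketch_t *u)
{
    return (unsigned)(u->ptr & SKETCH_MMU_TYPE_MASK);
}

/* Recover the target address from a request. */
uint64_t update_addr(const mmu_update_sketch_t *u)
{
    return u->ptr & ~SKETCH_MMU_TYPE_MASK;
}
```

This works because page-table entries are at least word-aligned, so the low bits of a valid entry address are always zero and are free to carry the command type.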
Explicitly updating batches of page table entries is extremely efficient, but can require a number of alterations to the guest OS.
Using the writable page table mode is recommended for new OS ports.
Regardless of which page table update mode is being used, however, there are some occasions (notably handling a demand page fault) where a guest OS will wish to modify exactly one PTE rather than a batch, and where that PTE is mapped into the current address space. This is catered for by the following:
update_va_mapping(unsigned long va, uint64_t val, unsigned long flags)
Update the currently installed PTE that maps virtual address va to new value val. As with mmu_update, Xen checks the modification is safe before applying it. The flags determine which kind of TLB flush, if any, should follow the update.
Finally, sufficiently privileged domains may occasionally wish to manipulate the pages of others:
update_va_mapping_otherdomain(unsigned long va, uint64_t val,
unsigned long flags, domid_t domid)
Identical to update_va_mapping save that the pages being mapped must belong to the domain domid.
An additional MMU hypercall provides an “extended command” interface. This
provides additional functionality beyond the basic table updating commands:
mmuext_op(struct mmuext_op *op, int count, int *success_count,
domid_t domid)
This hypercall is used to perform additional MMU operations. These include updating cr3 (or just re-installing it for a TLB flush), requesting various kinds of TLB flush, flushing the cache, installing a new LDT, or pinning & unpinning page-table pages (to ensure their reference count doesn’t drop to zero, which would require a revalidation of all entries). Some of the operations available are restricted to domains with sufficient system privileges.
It is also possible for privileged domains to reassign page ownership via an extended MMU operation, although grant tables are used instead of this where possible; see Section A.8.
Finally, a hypercall interface is exposed to activate and deactivate various optional
facilities provided by Xen for memory management.
vm_assist(unsigned int cmd, unsigned int type)
Toggle various memory management modes (in particular writable page tables).
Segmentation Support
Xen allows guest OSes to install a custom GDT if they require it; this is context switched transparently whenever a domain is [de]scheduled. The following hypercall is effectively a ‘safe’ version of lgdt:
set_gdt(unsigned long *frame_list, int entries)
Install a global descriptor table for a domain; frame_list is an array of up to 16 machine page frames within which the GDT resides, with entries being the actual number of descriptor-entry slots. All page frames must be mapped read-only within the guest’s address space, and the table must be large enough to contain Xen’s reserved entries (see xen/include/public/arch-x86_32.h).
Many guest OSes will also wish to install LDTs; this is achieved by using mmu_update with an extended command, passing the linear address of the LDT base along with the number of entries. No special safety checks are required; Xen needs to perform this task simply since lldt requires CPL 0.
Xen also allows guest operating systems to update just an individual segment descriptor in the GDT or LDT:
update_descriptor(uint64_t ma, uint64_t desc)
Update the GDT/LDT entry at machine address ma; the new 8-byte descriptor is stored in desc. Xen performs a number of checks to ensure the descriptor is valid.
Guest OSes can use the above in place of context switching entire LDTs (or the
GDT) when the number of changing descriptors is small.
Context Switching
When a guest OS wishes to context switch between two processes, it can use the page table and segmentation hypercalls described above to perform the bulk of the privileged work.
In addition, however, it will need to invoke Xen to switch the kernel (ring 1) stack pointer:
stack_switch(unsigned long ss, unsigned long esp)
Request kernel stack switch from hypervisor; ss is the new stack segment, while esp is the new stack pointer.
A useful hypercall for context switching allows “lazy” save and restore of floating
point state:
fpu_taskswitch(int set)
This call instructs Xen to set the TS bit in the cr0 control register; this means that the next attempt to use floating point will cause a trap which the guest OS can handle. Typically it will then save/restore the FP state, and clear the TS bit, using the same call.
This is provided as an optimization only; guest OSes can also choose to save and
restore FP state on all context switches for simplicity.
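The lazy scheme can be modelled with a small state machine, shown here as a C sketch. The structure cpu_sketch_t and the functions sketch_fpu_taskswitch, sketch_context_switch and sketch_use_fpu are hypothetical stand-ins: the first models the effect of the hypercall on the TS bit, and the last models the guest's trap handler. No real FP state is saved or restored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the relevant CPU and guest state. */
typedef struct {
    int ts;        /* models the TS bit in cr0 */
    int current;   /* process currently running */
    int fpu_owner; /* process whose FP state is loaded, -1 if none */
    int traps;     /* number of FP traps taken */
    int reloads;   /* number of FP save/restore operations performed */
} cpu_sketch_t;

/* Stands in for the fpu_taskswitch hypercall: set or clear TS. */
void sketch_fpu_taskswitch(cpu_sketch_t *cpu, int set)
{
    assert(cpu != NULL);
    cpu->ts = set ? 1 : 0;
}

/* On a context switch the guest only sets TS; FP state is left alone. */
void sketch_context_switch(cpu_sketch_t *cpu, int next)
{
    assert(cpu != NULL);
    cpu->current = next;
    sketch_fpu_taskswitch(cpu, 1);
}

/* Models a floating point instruction executed by the current process. */
void sketch_use_fpu(cpu_sketch_t *cpu)
{
    assert(cpu != NULL);
    if (cpu->ts) {
        /* TS is set, so the instruction traps into the guest handler. */
        cpu->traps++;
        if (cpu->fpu_owner != cpu->current) {
            /* Save the old owner's state and restore ours (elided). */
            cpu->fpu_owner = cpu->current;
            cpu->reloads++;
        }
        sketch_fpu_taskswitch(cpu, 0); /* clear TS using the same call */
    }
}
```

The benefit shows up when processes that never touch floating point are scheduled: they incur neither a trap nor a save/restore, and a process that regains the CPU while its state is still loaded pays for the trap only.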
Finally, a hypercall is provided for entering vm86 mode:
switch_vm86
This allows the guest to run code in vm86 mode, which is needed for
some legacy software.
Physical Memory Management
As mentioned previously, each domain has a maximum and current memory allocation. The maximum allocation, set at domain creation time, cannot be modified. However a domain can choose to reduce and subsequently grow its current allocation by using the following call:
memory_op(unsigned int op, void *arg)
Increase or decrease current memory allocation (as determined by the value of op). The available operations are:
XENMEM_increase_reservation: Request an increase in machine memory allocation; arg must point to a xen_memory_reservation structure.
XENMEM_decrease_reservation: Request a decrease in machine memory allocation; arg must point to a xen_memory_reservation structure.
XENMEM_maximum_ram_page: Request the frame number of the highest-addressed frame of machine memory in the system. arg must point to an unsigned long where this value will be stored.
XENMEM_current_reservation: Returns current memory reservation of the specified domain.
XENMEM_maximum_reservation: Returns maximum memory reservation of the specified domain.
In addition to simply reducing or increasing the current memory allocation via a
‘balloon driver’, this call is also useful for obtaining contiguous regions of machine
memory when required (e.g. for certain PCI devices, or if using superpages).
Inter-Domain Communication
Xen provides a simple asynchronous notification mechanism via event channels. Each domain has a set of end-points (or ports) which may be bound to an event source (e.g. a physical IRQ, a virtual IRQ, or a port in another domain). When a pair of end-points in two different domains are bound together, then a ‘send’ operation on one will cause an event to be received by the destination domain.
The control and use of event channels involves the following hypercall:
event_channel_op(evtchn_op_t *op)
Inter-domain event-channel management; op is a discriminated union which allows the following 7 operations:
alloc_unbound: allocate a free (unbound) local port and prepare for connection from a specified domain.
bind_virq: bind a local port to a virtual IRQ; any particular VIRQ can be bound to at most one port per domain.
bind_pirq: bind a local port to a physical IRQ; once more, a given pIRQ can be bound to at most one port per domain. Furthermore the calling domain must be sufficiently privileged.
bind_interdomain: construct an interdomain event channel; in general, the target domain must have previously allocated an unbound port for this channel, although this can be bypassed by privileged domains during domain setup.
close: close an interdomain event channel.
send: send an event to the remote end of an interdomain event channel.
status: determine the current status of a local port.
For more details see xen/include/public/event_channel.h.
Event channels are the fundamental communication primitive between Xen domains and seamlessly support SMP. However they provide little bandwidth for communication per se, and hence are typically married with a piece of shared memory to produce effective and high-performance inter-domain communication.
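The notification semantics of a bound pair of ports can be illustrated with a toy model in C. Everything here is hypothetical (the structures, the 32-port limit, the pending bitmap and the function names); it only mimics the behaviour described above, where a send on one end-point marks an event as pending on the end-point it is bound to in the other domain.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_PORTS 32

/* One end-point of a domain. */
typedef struct {
    int bound;       /* non-zero once connected to a remote port */
    int remote_dom;  /* index of the peer domain */
    int remote_port; /* port number in the peer domain */
} port_sketch_t;

/* A domain with its ports and a bitmap of pending events. */
typedef struct {
    port_sketch_t port[SKETCH_NR_PORTS];
    uint32_t      pending;
} domain_sketch_t;

/* Bind two ports together; returns 0 on success, -1 if either is in use. */
int sketch_bind_interdomain(domain_sketch_t *doms,
                            int dom_a, int port_a, int dom_b, int port_b)
{
    assert(port_a >= 0 && port_a < SKETCH_NR_PORTS);
    assert(port_b >= 0 && port_b < SKETCH_NR_PORTS);
    if (doms[dom_a].port[port_a].bound || doms[dom_b].port[port_b].bound)
        return -1;
    doms[dom_a].port[port_a] = (port_sketch_t){ 1, dom_b, port_b };
    doms[dom_b].port[port_b] = (port_sketch_t){ 1, dom_a, port_a };
    return 0;
}

/* Send on a local port: mark the remote end-point as pending.
 * Returns 0 on success, -1 if the local port is not bound. */
int sketch_send(domain_sketch_t *doms, int dom, int port)
{
    assert(port >= 0 && port < SKETCH_NR_PORTS);
    const port_sketch_t *p = &doms[dom].port[port];
    if (!p->bound)
        return -1;
    doms[p->remote_dom].pending |= (uint32_t)1 << p->remote_port;
    return 0;
}
```

Note that an event carries no payload beyond "something happened on this port", which is why the data itself normally travels through a shared memory page.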
Safe sharing of memory pages between guest OSes is carried out by granting access on a per page basis to individual domains. This is achieved by using the grant_table_op hypercall.
grant_table_op(unsigned int cmd, void *uop, unsigned int count)
Used to invoke operations on a grant reference, to set up the grant table and to dump the tables’ contents for debugging.
IO Configuration
Domains with physical device access (i.e. driver domains) receive limited access to certain PCI devices (bus address space and interrupts).
However many guest operating systems attempt to determine the PCI configuration by directly accessing the PCI BIOS, which cannot be allowed for safety.
Instead, Xen provides the following hypercall:
physdev_op(void *physdev_op)
Set and query IRQ configuration details, set the system IOPL, set the TSS IO bitmap.
For examples of using physdev_op, see the Xen-specific PCI code in the linux sparse tree.
Administrative Operations
A large number of control operations are available to a sufficiently privileged domain (typically domain 0). These allow the creation and management of new domains, for example. A complete list is given below: for more details on any or all of these, please see xen/include/public/dom0_ops.h
dom0_op(dom0_op_t *op)
Administrative domain operations for domain management. The options are:
DOM0_GETMEMLIST: get list of pages used by the domain
DOM0_SCHEDCTL:
DOM0_ADJUSTDOM: adjust scheduling priorities for domain
DOM0_CREATEDOMAIN: create a new domain
DOM0_DESTROYDOMAIN: deallocate all resources associated with a domain
DOM0_PAUSEDOMAIN: remove a domain from the scheduler run queue.
DOM0_UNPAUSEDOMAIN: mark a paused domain as schedulable once again.
DOM0_GETDOMAININFO: get statistics about the domain
DOM0_SETDOMAININFO: set VCPU-related attributes
DOM0_MSR: read or write model specific registers
DOM0_DEBUG: interactively invoke the debugger
DOM0_SETTIME: set system time
DOM0_GETPAGEFRAMEINFO:
DOM0_READCONSOLE: read console content from hypervisor buffer ring
DOM0_PINCPUDOMAIN: pin domain to a particular CPU
DOM0_TBUFCONTROL: get and set trace buffer attributes
DOM0_PHYSINFO: get information about the host machine
DOM0_SCHED_ID: get the ID of the current Xen scheduler
DOM0_SHADOW_CONTROL: switch between shadow page-table modes
DOM0_SETDOMAINMAXMEM: set maximum memory allocation of a domain
DOM0_GETPAGEFRAMEINFO2: batched interface for getting page frame info
DOM0_ADD_MEMTYPE: set MTRRs
DOM0_DEL_MEMTYPE: remove a memory type range
DOM0_READ_MEMTYPE: read MTRR
DOM0_PERFCCONTROL: control Xen’s software performance counters
DOM0_MICROCODE: update CPU microcode
DOM0_IOPORT_PERMISSION: modify domain permissions for an IO port range (enable / disable a range for a particular domain)
DOM0_GETVCPUCONTEXT: get context from a VCPU
DOM0_GETVCPUINFO: get current state for a VCPU
DOM0_GETDOMAININFOLIST: batched interface to get domain info
DOM0_PLATFORM_QUIRK: inform Xen of a platform quirk it needs to handle (e.g. noirqbalance)
DOM0_PHYSICAL_MEMORY_MAP: get info about dom0’s memory map
DOM0_MAX_VCPUS: change max number of VCPUs for a domain
DOM0_SETDOMAINHANDLE: set the handle for a domain
Most of the above are best understood by looking at the code implementing them (in xen/common/dom0_ops.c) and in the user-space tools that use them (mostly in tools/libxc).
Access Control Module Hypercalls
Hypercalls relating to the management of the Access Control Module are also restricted to domain 0 access for now. For more details on any or all of these, please see xen/include/public/acm_ops.h. A complete list is given below:
acm_op(int cmd, void *args)
This hypercall can be used to configure the state of the ACM, query that state, request access control decisions and dump additional information.
ACMOP_SETPOLICY: set the access control policy
ACMOP_GETPOLICY: get the current access control policy and status
ACMOP_DUMPSTATS: get current access control hook invocation statistics
ACMOP_GETSSID: get security access control information for a domain
ACMOP_GETDECISION: get access decision based on the currently enforced access control policy
Most of the above are best understood by looking at the code implementing them (in xen/common/acm_ops.c) and in the user-space tools that use them (mostly in tools/security and tools/python/xen/lowlevel/acm).
Debugging Hypercalls
A few additional hypercalls are mainly useful for debugging:
console_io(int cmd, int count, char *str)
Use Xen to interact with the console; operations are:
CONSOLEIO_write: Output count characters from buffer str.
CONSOLEIO_read: Input at most count characters into buffer str.
A pair of hypercalls allows access to the underlying debug registers:
set_debugreg(int reg, unsigned long value)
Set debug register reg to value.
get_debugreg(int reg)
Return the contents of the debug register reg.
And finally:
xen_version(int cmd)
Request Xen version number.
This is useful to ensure that user-space tools are in sync with the underlying hypervisor.