Split drivers exchange requests and responses in shared memory, with an event channel for asynchronous notifications of activity.
When the frontend driver comes up, it uses Xenstore to set up a shared memory frame and an interdomain event channel for communications with the backend.
Once this connection is established, the two can communicate directly by placing requests / responses into shared memory and then sending notifications on the event channel.
This separation of notification from data transfer allows message batching, and results in very efficient device access.
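As a rough sketch of this pattern (the structure and the evtchn_notify() helper below are hypothetical stand-ins, not the real Xen ring or event-channel interfaces), a frontend can publish a whole batch of requests and then raise a single notification:

#include <stdint.h>

#define RING_SIZE 32

/* Hypothetical shared-memory ring: only the producer index and the
 * request slots matter for this illustration.                        */
struct demo_ring {
    uint32_t req_prod;          /* producer index, visible to both ends */
    uint64_t req[RING_SIZE];    /* request slots                        */
};

/* Stand-in for sending an event-channel notification. */
static void evtchn_notify(int port) { (void)port; }

/* Place a batch of requests on the ring, then notify once: because
 * notification is decoupled from data transfer, one event suffices
 * for the whole batch.                                                */
static void send_batch(struct demo_ring *ring, int port,
                       const uint64_t *reqs, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        ring->req[(ring->req_prod + i) % RING_SIZE] = reqs[i];
    ring->req_prod += n;        /* publish the whole batch */
    evtchn_notify(port);        /* single notification     */
}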
Network I/O
Virtual network device services are provided by shared memory communication with a backend domain.
From the point of view of other domains, the backend may be viewed as a virtual ethernet switch element with each domain having one or more virtual network interfaces connected to it.
From the point of view of the backend domain itself, the network backend driver consists of a number of ethernet devices. Each of these has a logical direct connection to a virtual network device in another domain. This allows the backend domain to route, bridge, firewall, etc. the traffic to / from the other domains using normal operating system mechanisms.
Backend Packet Handling
The backend driver is responsible for a variety of actions relating to the transmission and reception of packets from the physical device.
With regard to transmission, the backend performs these key actions:
• Validation: To ensure that domains do not attempt to generate invalid (e.g. spoofed) traffic, the backend driver may validate headers, ensuring that source MAC and IP addresses match the interface they were sent from (a minimal sketch of such a check follows this list).
Validation functions can be configured using standard firewall rules (iptables in the case of Linux).
• Scheduling: Since a number of domains can share a single physical network interface, the backend must mediate access when several domains each have packets queued for transmission. This general scheduling function subsumes basic shaping or rate-limiting schemes.
• Logging and Accounting: The backend domain can be configured with classifier rules that control how packets are accounted or logged.
For example, log messages might be generated whenever a domain attempts to send a TCP packet containing a SYN.
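As a minimal sketch of the validation action mentioned above (illustrative only, not the actual backend code), a per-interface check might compare the source MAC address of an outgoing frame with the MAC assigned to that virtual interface:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

struct vif {
    uint8_t mac[6];             /* MAC assigned to this domain's virtual interface */
};

struct eth_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;
};

/* Reject frames whose source MAC does not belong to the sending vif. */
static bool tx_frame_permitted(const struct vif *vif,
                               const struct eth_hdr *eth)
{
    return memcmp(eth->src, vif->mac, sizeof(vif->mac)) == 0;
}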
On receipt of incoming packets, the backend acts as a simple demultiplexer: Packets are passed to the appropriate virtual interface after any necessary logging and accounting have been carried out.
Data Transfer
Each virtual interface uses two “descriptor rings”, one for transmit, the other for receive.
Each descriptor identifies a block of contiguous machine memory allocated to the domain.
The transmit ring carries packets to transmit from the guest to the backend domain.
The return path of the transmit ring carries messages indicating that the contents have been physically transmitted and the backend no longer requires the associated pages of memory.
To receive packets, the guest places descriptors of unused pages on the receive ring. The backend will return received packets by exchanging these pages in the domain’s memory with new pages containing the received data, and passing back descriptors regarding the new packets on the ring.
This zero-copy approach allows the backend to maintain a pool of free pages to receive packets into, and then deliver them to appropriate domains after examining their headers.
If a domain does not keep its receive ring stocked with empty buffers then packets
destined to it may be dropped. This provides some defence against receive livelock problems because an overloaded domain will cease to receive further data.
Similarly, on the transmit path, it provides the application with feedback on the rate at which packets are able to leave the system.
Flow control on rings is achieved by including a pair of producer indexes on the shared ring page. Each side will maintain a private consumer index indicating the next outstanding message.
In this manner, the domains cooperate to divide the ring into two message lists, one in each direction. Notification is decoupled from the immediate placement of new messages on the ring; the event channel will be used to generate notification when either a certain number of outstanding messages are queued, or a specified number of nanoseconds have elapsed since the oldest message was placed on the ring.
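A minimal sketch of this flow-control scheme, assuming simplified structures in place of the real shared ring layout, shows how the backend's private consumer index chases the shared producer index:

#include <stdint.h>

#define RING_SIZE 256

struct shared_ring {
    uint32_t req_prod;   /* written by the frontend, read by the backend */
    uint32_t rsp_prod;   /* written by the backend, read by the frontend */
    /* request and response slots follow in the shared page ...          */
};

struct backend_state {
    struct shared_ring *ring;
    uint32_t req_cons;   /* private consumer index, not on the shared page */
};

/* The backend consumes every request published since its last pass. */
static void backend_poll(struct backend_state *be)
{
    while (be->req_cons != be->ring->req_prod) {
        uint32_t idx = be->req_cons % RING_SIZE;
        /* process request slot 'idx' here ... */
        (void)idx;
        be->req_cons++;
    }
}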
Transmit requests are described by the following structure:
typedef struct netif_tx_request {
    grant_ref_t gref;     /* Grant reference for the network buffer page.     */
    uint16_t offset;      /* Offset of the data within the buffer page.       */
    uint16_t flags;       /* Transmit flags (NETTXF_*; currently only
                             NETTXF_csum_blank is supported, to indicate that
                             the protocol checksum field is incomplete).       */
    uint16_t id;          /* Echoed to the guest by the backend in the
                             ring-level response so that the guest can match
                             it to this request.                               */
    uint16_t size;        /* Packet size in bytes.                             */
} netif_tx_request_t;
Each transmit request is followed by a transmit response at some later time.
This is part of the shared-memory communication protocol and allows the guest to (potentially) retire internal structures related to the request.
It does not imply a network-level response. This structure is as follows:
typedef struct netif_tx_response {
    uint16_t id;          /* Echo of the ID field in the corresponding
                             transmit request.                                 */
    int16_t status;       /* Success / failure status of the transmit
                             request.                                          */
} netif_tx_response_t;
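To tie the two structures together, the following sketch shows how a frontend might fill in a transmit request and rely on the id field to match the eventual response; grant_access() and notify_backend() are hypothetical helpers standing in for the real grant-table and event-channel operations.

#include <stdint.h>

typedef uint32_t grant_ref_t;                 /* assumed width */

typedef struct netif_tx_request {
    grant_ref_t gref;
    uint16_t offset, flags, id, size;
} netif_tx_request_t;

/* Hypothetical helpers for the grant-table and event-channel steps. */
extern grant_ref_t grant_access(uint16_t backend_domid, void *page);
extern void notify_backend(void);

static void queue_tx(netif_tx_request_t *slot, uint16_t backend_domid,
                     void *page, uint16_t offset, uint16_t len, uint16_t id)
{
    slot->gref   = grant_access(backend_domid, page); /* grant the buffer page  */
    slot->offset = offset;            /* where the packet starts in the page    */
    slot->flags  = 0;                 /* or NETTXF_csum_blank where appropriate */
    slot->id     = id;                /* echoed back in netif_tx_response       */
    slot->size   = len;               /* packet length in bytes                 */
    notify_backend();                 /* kick the event channel                 */
}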
Receive requests must be queued by the frontend, accompanied by a donation of page-frames to the backend. The backend transfers page frames full of data back to the guest.
typedef struct {
    uint16_t id;          /* Echoed by the frontend to identify this request
                             when responding.                                  */
    grant_ref_t gref;     /* Transfer reference - the backend will use this
                             reference to transfer a frame of network data
                             to us.                                            */
} netif_rx_request_t;
Receive response descriptors are queued for each received frame.
Note that these may only be queued in reply to an existing receive request, providing an in-built form of traffic throttling.
typedef struct {
    uint16_t id;          /* ID echoed from the original request, used by the
                             guest to match this response to the original
                             request.                                          */
    uint16_t offset;      /* Offset of the data within the transferred frame.  */
    uint16_t flags;       /* Receive flags (NETRXF_*; currently only
                             NETRXF_csum_valid is supported, to indicate that
                             the protocol checksum field has already been
                             validated).                                       */
    int16_t status;       /* Success / error status for this operation.        */
} netif_rx_response_t;
Note that the receive protocol includes a mechanism by which guests receive incoming memory frames, but there is no explicit transfer of frames in the other direction. Guests are expected to return memory to the hypervisor in order to use the network interface; if they do not, they will exceed their maximum memory reservation and will be unable to receive incoming frame transfers. When necessary, the backend replenishes its pool of free network buffers by claiming some of this freed memory from the hypervisor.
Block I/O
All guest OS disk access goes through the virtual block device (VBD) interface.
This interface allows domains access to portions of the block storage devices visible to the block backend device.
The VBD interface is a split driver, similar to the network interface described above. A single shared memory ring is used between the frontend and backend drivers for each virtual device, across which IO requests
and responses are sent.
Any block device accessible to the backend domain, including network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices, can be exported as a VBD.
Each VBD is mapped to a device node in the guest, specified in the guest’s startup configuration.
Data Transfer
The per-(virtual)-device ring between the guest and the block backend supports two messages:
READ: Read data from the specified block device. The front end identifies the device and location to read from and attaches pages for the data to be copied to (typically via DMA from the device).
The backend acknowledges completed read requests as they finish.
WRITE: Write data to the specified block device. This functions essentially as READ, except that the data moves to the device instead of from it.
Block ring interface
The block interface is defined by the structures passed over the shared memory interface. These structures are either requests (from the frontend to the backend) or responses (from the backend to the frontend).
The request structure is defined as follows:
typedef struct blkif_request {
.......
} blkif_request_t;
The fields are as follows:
operation        operation ID: one of the operations described above
nr_segments      number of segments for scatter / gather IO described by this request
handle           identifier for a particular virtual device on this interface
id               this value is echoed in the response message for this IO; the guest may use it to identify the original request
sector_number    start sector on the virtual device for this request
frame_and_sects  this array contains structures encoding the scatter / gather IO to be performed:
    gref         The grant reference for the foreign I/O buffer page.
    first_sect   First sector to access within the buffer page (0 to 7).
    last_sect    Last sector to access within the buffer page (0 to 7).
Data will be transferred into frames at an offset determined by the value of first_sect.
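Since the request structure itself is elided above, the sketch below shows one plausible layout implied by the field list; the exact field types and the bound on the segment array vary between interface versions, so treat it as illustrative rather than canonical.

#include <stdint.h>

typedef uint32_t grant_ref_t;                /* assumed width          */
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11    /* assumed segment bound  */

typedef struct blkif_request {
    uint8_t  operation;       /* READ or WRITE                          */
    uint8_t  nr_segments;     /* number of scatter / gather segments    */
    uint16_t handle;          /* virtual device on this interface       */
    uint64_t id;              /* echoed in the response message         */
    uint64_t sector_number;   /* start sector on the virtual device     */
    struct {
        grant_ref_t gref;     /* grant reference for the buffer page    */
        uint8_t first_sect;   /* first sector within the page (0 to 7)  */
        uint8_t last_sect;    /* last sector within the page (0 to 7)   */
    } frame_and_sects[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} blkif_request_t;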
Virtual TPM
Virtual TPM (VTPM) support provides TPM functionality to each virtual machine that requests this functionality in its configuration file.
The interface enables domains to access their own private TPM as if it were a hardware TPM built into the machine.
The virtual TPM interface is implemented as a split driver, similar to the network and block interfaces described above. The user domain hosting the frontend exports a character device /dev/tpm0 to user-level applications for communicating with the virtual TPM. This is the same device interface that is also offered if a hardware TPM is available in the system. The backend provides a single interface /dev/vtpm where the virtual TPM is waiting for commands from all domains that have located their backend in a given domain.
Data Transfer
A single shared memory ring is used between the frontend and backend drivers.
TPM requests and responses are sent in pages where a pointer to those pages and other information is placed into the ring such that the backend can map the pages into its memory space using the grant table mechanism.
The backend driver accepts only well-formed TPM requests: the length indicator in the TPM request must correctly indicate the length of the request. Otherwise an error message is automatically sent back by the device driver.
The virtual TPM implementation listens for TPM requests on /dev/vtpm. Since it must be able to apply each TPM request packet to the virtual TPM instance associated with the virtual machine, a 4-byte virtual TPM instance identifier is prepended to each packet by the backend driver (in network byte order) for internal routing of the request.
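As an illustration of that framing (the function and buffer names are hypothetical, not the actual backend code), prepending the instance identifier in network byte order might look like this:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl() */

/* Prepend the virtual TPM instance number, in network byte order,
 * to a TPM command before handing it on for internal routing.      */
static size_t prepend_instance(uint8_t *out, uint32_t instance,
                               const uint8_t *tpm_req, size_t len)
{
    uint32_t be_instance = htonl(instance);          /* network byte order */
    memcpy(out, &be_instance, sizeof(be_instance));  /* 4-byte prefix      */
    memcpy(out + sizeof(be_instance), tpm_req, len); /* original request   */
    return sizeof(be_instance) + len;
}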
Virtual TPM ring interface
The TPM protocol is a strict request/response protocol and therefore only one ring is used to send requests from the frontend to the backend and responses on the reverse path.
The request/response structure is defined as follows:
typedef struct {
........
} tpmif_tx_request_t;
The fields are as follows:
addr    The machine address of the page associated with the TPM request/response; a request/response may span multiple pages.
ref     The grant table reference associated with the address.
size    The size of the remaining packet; up to PAGE_SIZE bytes can be found in the page referenced by 'addr'.
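Since the structure body is elided above, the following hedged sketch shows one plausible layout implied by these fields; the real tpmif definition may type or pad them differently.

#include <stdint.h>

typedef uint32_t grant_ref_t;   /* assumed width */

typedef struct {
    unsigned long addr;   /* machine address of the request/response page   */
    grant_ref_t   ref;    /* grant table reference for that page            */
    uint16_t      size;   /* bytes remaining; at most PAGE_SIZE per page    */
} tpmif_tx_request_t;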
The frontend initially allocates several pages whose addresses are stored in the ring. Only these pages are used for exchange of requests and responses.