Category: LINUX
2010-05-03 23:22:41
Note: the io.c code discussed below has since been removed from the kernel tree. Background reference:
http://lists.xensource.com/archives/html/xen-devel/2006-06/msg00166.html
[Xen-devel] [PATCH 1/9] Xen Share: Simplified I/O Mechanism, Rusty Russell, 2006/06/05
Part 1: The overall model
From lguest.txt:
Lguest I/O model:
Lguest uses a simplified DMA model plus shared memory for I/O. Guests can communicate with each other if they share underlying memory (usually by the lguest program mmapping the same file), but they can use any non-shared memory to communicate with the lguest process.
Guests can register DMA buffers at any key (must be a valid physical address) using the LHCALL_BIND_DMA(key, dmabufs, num<<8|irq) hypercall. "dmabufs" is the physical address of an array of "num" "struct lguest_dma": each contains a used_len, and an array of physical addresses and lengths. When a transfer occurs, the "used_len" field of one of the buffers which has used_len 0 will be set to the length transferred and the irq will fire.
Using an irq value of 0 unbinds the dma buffers.
To send DMA, the LHCALL_SEND_DMA(key, dma_physaddr) hypercall is used, and the number of bytes used is written to the used_len field. This can be 0 if no one else has bound a DMA buffer to that key or some other error occurred. DMA buffers bound by the same guest are ignored.
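As a concrete illustration, here is a minimal sketch of the guest side of buffer registration, assuming a hypercall wrapper hcall(call, arg1, arg2, arg3) like the one in the 2.6.23-era guest code (treat the exact signature as an assumption). Note how num and irq share one argument, exactly as described above:
static int bind_rx_buffers(unsigned long key, struct lguest_dma *dmas,
			   unsigned int num, u8 irq)
{
	/* The Host reads `num` descriptors starting at __pa(dmas), and
	 * will fire `irq` whenever a transfer lands in one of them. */
	return hcall(LHCALL_BIND_DMA, key, __pa(dmas), (num << 8) | irq);
}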
From lguest_launcher.h:
/*D:200
* Lguest I/O
*
* The lguest I/O mechanism is the only way Guests can talk to devices. There
* are two hypercalls involved: SEND_DMA for output and BIND_DMA for input. In
* each case, "struct lguest_dma" describes the buffer: this contains 16
* addr/len pairs, and if there are fewer buffer elements the len array is
* terminated with a 0.
*
* I/O is organized by keys: BIND_DMA attaches buffers to a particular key, and
 * SEND_DMA transfers to buffers bound to a particular key. By convention, keys
* correspond to a physical address within the device's page. This means that
* devices will never accidentally end up with the same keys, and allows the
 * Host to use The Futex Trick (as we'll see later in our journey).
*
* SEND_DMA simply indicates a key to send to, and the physical address of the
* "struct lguest_dma" to send. The Host will write the number of bytes
* transferred into the "struct lguest_dma"'s used_len member.
*
* BIND_DMA indicates a key to bind to, a pointer to an array of "struct
* lguest_dma"s ready for receiving, the size of that array, and an interrupt
* to trigger when data is received. The Host will only allow transfers into
* buffers with a used_len of zero: it then sets used_len to the number of
* bytes transferred and triggers the interrupt for the Guest to process the
* new input. */
struct lguest_dma
{
	/* 0 if free to be used, filled by the Host. */
	u32 used_len;
	unsigned long addr[LGUEST_MAX_DMA_SECTIONS];
	u16 len[LGUEST_MAX_DMA_SECTIONS];
};
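For example, a single-buffer send could look like the sketch below; lguest_send_dma() is the guest-side helper from that era's lguest code, but treat the exact names as assumptions:
static void send_one_buffer(unsigned long key, void *buf, unsigned int len)
{
	struct lguest_dma dma;

	/* Unused addr/len pairs stay zero; a zero len terminates the list. */
	memset(&dma, 0, sizeof(dma));
	dma.addr[0] = __pa(buf);	/* the Host expects physical addresses */
	dma.len[0] = len;

	lguest_send_dma(key, &dma);
	/* On return, dma.used_len holds the bytes actually transferred
	 * (0 if nothing was bound at `key`, or on error). */
}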
(1) Each virtual device generally embeds one or more lguest_dma structures.
(2) Each guest holds an array of lguest_dma_info structures, whose size caps the number of DMA bindings it can register:
struct lguest_dma_info dma[LGUEST_MAX_DMA];
struct lguest_dma_info
{
	struct list_head list;
	union futex_key key;
	unsigned long dmas;	/* physical address of the lguest_dma array */
	u16 next_dma;
	u16 num_dmas;		/* number of entries in the lguest_dma array */
	u16 guestid;
	u8 interrupt;		/* 0 when not registered */
};
(3) So that guests can share memory with one another, every active lguest_dma_info is linked into a hash table:
static struct list_head dma_hash[61];
bind_dma therefore does two things (see the sketch after this list):
(1) registers the lguest_dma array in the guest's lguest_dma_info array;
(2) links the entry into dma_hash, so that the matching lguest_dma_info objects of other guests can be found quickly.
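A minimal sketch of those two steps, modeled on the 2.6.23 io.c (the free-slot search and error handling are elided; hash() is io.c's hash over the futex key):
static int bind_dma_sketch(struct lguest *lg, union futex_key *key,
			   unsigned long dmas, u16 num_dmas, u8 interrupt)
{
	/* Assume slot 0 is free; the real code scans lg->dma[] for an
	 * entry whose interrupt is 0. */
	struct lguest_dma_info *info = &lg->dma[0];

	/* (1) Record the Guest's array of lguest_dma descriptors. */
	info->dmas = dmas;
	info->num_dmas = num_dmas;
	info->next_dma = 0;
	info->key = *key;
	info->guestid = lg->guestid;
	info->interrupt = interrupt;

	/* (2) Hash the futex key so send_dma() from any Guest sharing
	 * this memory can find these buffers quickly. */
	list_add(&info->list, &dma_hash[hash(key)]);
	return 0;
}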
Part 2: Asynchronous requests from a virtual device to the guest (interrupt-like)
1. setup_waker forks a Waker process. The Waker listens on the input side of every virtual device (in practice, all the open fds) via the select() in wake_parent. When a virtual device has input pending, the Waker issues an LHREQ_BREAK request (args here is { LHREQ_BREAK, 1 }):
	write(lguest_fd, args, sizeof(args));
The write traps into the hypervisor, which calls break_guest_out() and executes the following code:
	if (on) {
		lg->break_out = 1;
		/* Pop it out (may be running on different CPU) */
		wake_up_process(lg->tsk);
		/* Wait for them to reset it */
		return wait_event_interruptible(lg->break_wq, !lg->break_out);
	}
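For reference, a minimal sketch of the Waker's loop under those assumptions (the real wake_parent() also watches a pipe from the Launcher so the fd set can be updated):
static void waker_loop(int lguest_fd, fd_set *device_fds, int max_fd)
{
	u32 args[] = { LHREQ_BREAK, 1 };

	for (;;) {
		fd_set rfds = *device_fds;

		/* Sleep until some virtual device has input pending. */
		if (select(max_fd + 1, &rfds, NULL, NULL, NULL) <= 0)
			continue;

		/* Kick the Guest out to the Launcher: this is the write
		 * that ends up in break_guest_out() above. */
		if (write(lguest_fd, args, sizeof(args)) < 0)
			err(1, "Sending break");
	}
}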
2. While the guest OS runs, any of several events (a real hardware interrupt, a hypercall, and so on) returns control to the hypervisor, which executes run_guest():
	/* If Waker set break_out, return to Launcher. */
	if (lg->break_out)
		return -EAGAIN;
3. Control returns to user space (the Launcher), which goes run_guest => handle_input to service each virtual device's pending input and then release the Waker (args here is { LHREQ_BREAK, 0 }, which clears the break):
	/* Service input, then unset the BREAK which releases
	 * the Waker. */
	handle_input(lguest_fd, device_list);
	if (write(lguest_fd, args, sizeof(args)) < 0)
		err(1, "Resetting break");
handle_input dispatches to per-device handlers (handle_console_input, handle_tun_input, and so on):
each calls get_dma_buffer to obtain a DMA buffer bound by the Guest, reads the device's input into it, and then calls trigger_irq, which sets the corresponding bit in lg->irqs_pending (see the sketch below).
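trigger_irq itself is tiny; here is a sketch assuming the 2.6.23-era LHREQ_IRQ write format (two u32s written to /dev/lguest):
static void trigger_irq_sketch(int lguest_fd, unsigned int irq)
{
	u32 args[] = { LHREQ_IRQ, irq };

	/* The Host sets the bit in lg->irqs_pending; the interrupt is
	 * injected the next time the Guest runs (step 4 below). */
	if (write(lguest_fd, args, sizeof(args)) < 0)
		err(1, "Triggering irq %u", irq);
}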
After handle_input completes, the Launcher re-enters the hypervisor with this read:
	readval = read(lguest_fd, arr, sizeof(arr));
4. The hypervisor executes maybe_do_interrupt() to inject the pending interrupt into the guest OS, then resumes running the guest.
Part 3: Synchronous requests from the guest to a virtual device
Take the virtual block device as an example.
1. When the guest OS accesses the block device (the code lives in lguest_blk.c), it issues an LHCALL_SEND_DMA hypercall and traps into the hypervisor. The heart of the hypervisor is the run_guest() function below: do_hypercalls() dispatches the SEND_DMA hypercall to its handler, send_dma(), which contains the line:
	lg->dma_is_pending = 1;
Because lg->dma_is_pending is now set, control switches back to the Launcher.
int run_guest(struct lguest *lg, unsigned long __user *user)
{
	/* First we run any hypercalls the Guest wants done: either in
	 * the hypercall ring in "struct lguest_data", or directly by
	 * using int 31 (LGUEST_TRAP_ENTRY). */
	do_hypercalls(lg);

	/* It's possible the Guest did a SEND_DMA hypercall to the
	 * Launcher, in which case we return from the read() now. */
	if (lg->dma_is_pending) {
		if (put_user(lg->pending_dma, user) ||
		    put_user(lg->pending_key, user+1))
			return -EFAULT;
		return sizeof(unsigned long)*2;
	}
2. Backing up: before step 1, the Launcher had called run_guest(int lguest_fd, struct device_list *device_list), whose read() on /dev/lguest blocks the Launcher and sets the guest OS running. The read handler of /dev/lguest contains:
	/* If we returned from read() last time because the Guest sent DMA,
	 * clear the flag. */
	if (lg->dma_is_pending)
		lg->dma_is_pending = 0;

	/* Run the Guest until something interesting happens. */
	return run_guest(lg, (unsigned long __user *)user);
Now back in the Launcher's run_guest(int lguest_fd, struct device_list *device_list), we find:
	if (readval == sizeof(arr)) {
		handle_output(lguest_fd, arr[0], arr[1], device_list);
		continue;
	}
After handle_output finishes, the loop continues and the guest OS resumes execution.
3. For the console and TUN devices, handle_output (handle_console_output, handle_tun_output) is trivial: little more than a write(). For the block device, handle_block_output is more involved: it carries out the actual I/O (a read or a write of the backing file) and then fires the interrupt via trigger_irq. A sketch follows.
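A minimal sketch of that path, with field names following struct lguest_block_page from lguest_blk.c (the real handle_block_output maps the Guest's DMA buffers; here a flat `data` buffer stands in for them):
static void handle_block_sketch(int lguest_fd, int backing_fd,
				struct lguest_block_page *p,
				void *data, unsigned int irq)
{
	off_t off = (off_t)p->sector * 512;

	if (p->type)	/* 1 = write request */
		p->bytes = pwrite(backing_fd, data, p->bytes, off);
	else		/* 0 = read request */
		p->bytes = pread(backing_fd, data, p->bytes, off);

	p->result = 1;	/* done; 2 would mean error */
	trigger_irq_sketch(lguest_fd, irq);	/* see the sketch in Part 2 */
}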
Part 4: Network device one --- sharenet (now abandoned), implementing inter-guest communication
The documentation was too poor to work out how to use it.
The replacement:
[RFC PATCH 5/5] lguest: Inter-guest networking
The key functions are setup_net_file and dma_transfer.
Main flow:
All guests open the same net file, which is one page in size; each guest occupies one struct lguest_net-sized slot in that page (the position is called its slot). setup_net_file mmaps the net file,
with these two lines:
	dev->mem = (void *)(dev->desc->pfn * getpagesize());
	if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE,
		 MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem)
		err(1, "could not mmap '%s'", filename);
This guarantees that every guest's net device really maps the same page (file), and each guest publishes its MAC address in that page. lguestnet_open contains:
	/* Copy our MAC address into the device page, so others on the network
	 * can find us. */
	memcpy(info->peer[info->me].mac, dev->dev_addr, ETH_ALEN);
Finally, lguestnet_start_xmit finds the destination guest(s) with the following loop (a sketch of transfer_packet follows the loop):
	/* Look through all the published ethernet addresses to see if we
	 * should send this packet. */
	for (i = 0; i < info->mapsize/sizeof(struct lguest_net); i++) {
		/* We don't send to ourselves (we actually can't SEND_DMA to
		 * ourselves anyway), and don't send to unused slots. */
		if (i == info->me || unused_peer(info->peer, i))
			continue;

		/* If it's broadcast we send it. If they want every packet we
		 * send it. If the destination matches their address we send
		 * it. Otherwise we go to the next peer. */
		if (!broadcast && !promisc(info, i) && !mac_eq(dest, info, i))
			continue;

		pr_debug("lguestnet %s: sending from %i to %i\n",
			 dev->name, info->me, i);
		/* Our routine which actually does the transfer. */
		transfer_packet(dev, skb, i);
	}
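transfer_packet() then boils down to building a struct lguest_dma over the skb and SEND_DMAing it to the peer's slot key. A sketch, with peer_key() assumed to return the physical address of peer `peernum`'s slot in the shared page:
static void transfer_packet_sketch(struct net_device *dev,
				   struct sk_buff *skb, unsigned int peernum)
{
	struct lguestnet_info *info = netdev_priv(dev);
	struct lguest_dma dma;

	memset(&dma, 0, sizeof(dma));
	dma.addr[0] = virt_to_phys(skb->data);
	dma.len[0] = skb_headlen(skb);

	/* The Host matches this key against the peer's BIND_DMA buffers;
	 * if the peer had no free buffer, used_len comes back short and
	 * the real driver bumps its tx error counters. */
	lguest_send_dma(peer_key(info, peernum), &dma);
}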
From the patch's description of io.c:
+io.c:
+ lguest provides DMA-style transfer, and buffer registration.
+ The guest can dma send to a particular address, or register a
+ set of DMA buffers at a particular address. This provides
+ inter-guest I/O (for shared addresses, such as a shared mmap)
+ or I/O out to the userspace process (lguest).
+
+ We currently use the futex infrastructure to see if a given
+ address is shared: if it is, we look for another guest which
+ has registered a DMA buffer at this address and copy the data,
+ then interrupt the recipient. Otherwise, we notify the guest
+ userspace (which has access to all the guest memory) to handle
+ the transfer.
+
+ TODO: We could flip whole pages between guests at this point
+ if we wanted to, however it seems unlikely to be worthwhile.
+ More optimization could be gained by having servers for certain
+ devices within the host kernel itself, avoiding at
+ least two switches into the lguest binary and back.
+
And from the comments in io.c itself:
 * We want Guests which share memory to be able to DMA to each other: two
 * Launchers can mmap the same file, then the Guests can communicate.
* Fortunately, the futex code provides us with a way to get a "union
* futex_key" corresponding to the memory lying at a virtual address: if the
* two processes share memory, the "union futex_key" for that memory will match
* even if the memory is mapped at different addresses in each. So we always
* convert the keys to "union futex_key"s to compare them.
*
* Before we dive into this though, we need to look at another set of helper
* routines used throughout the Host kernel code to access Guest memory.
:*/
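Putting the pieces together, the Host-side lookup ("The Futex Trick") is roughly the sketch below, modeled on the 2.6.23 io.c (hash() and key_eq() are the helpers from that file): hash the futex key for the SEND_DMA address and scan dma_hash for a binding by some *other* guest on the same underlying memory.
static struct lguest_dma_info *find_peer_dma(struct lguest *lg,
					     union futex_key *key)
{
	struct lguest_dma_info *i;

	list_for_each_entry(i, &dma_hash[hash(key)], list) {
		/* Same underlying memory, but bound by a different Guest?
		 * (Buffers bound by the sender itself are ignored.) */
		if (key_eq(key, &i->key) && i->guestid != lg->guestid)
			return i;
	}
	/* No match: fall back to notifying the Launcher. */
	return NULL;
}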
Part 5: Network device two --- TUN/TAP
Unlike sharenet, where the device memory page is shared by all guests, the TUN/TAP device's memory page holds just two MAC addresses: see setup_tun_net. The crucial point is that the guest's network device sits in slot 1 (NET_PEERNUM) while the TUN side occupies slot 0, so transmitting a packet ultimately becomes a send_dma to the key of the TUN side's slot, peer_offset(0).
	/* We create the net device with 1 page, using the features field of
	 * the descriptor to tell the Guest it is in slot 1 (NET_PEERNUM), and
	 * that the device has fairly random timing. We do *not* specify
	 * LGUEST_NET_F_NOCSUM: these packets can reach the real world.
	 *
	 * We will put our MAC address in slot 0 for the Guest to see, so
	 * it will send packets to us using the key "peer_offset(0)": */
	dev = new_device(devices, LGUEST_DEVICE_T_NET, 1,
			 NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS, netfd,
			 handle_tun_input, peer_offset(0), handle_tun_output);
	/* We are peer 0, ie. first slot, so we hand dev->mem to this routine
	 * to write the MAC address at the start of the device memory. */
	configure_device(ipfd, ifr.ifr_name, ip, dev->mem);
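On the output side, Part 3 noted that handle_tun_output is essentially just a write. A sketch under that description, where read_guest_mem() is a hypothetical helper standing in for the Launcher's real guest-memory access:
static void tun_output_sketch(int netfd, const struct lguest_dma *dma)
{
	char buf[65536];	/* one Ethernet frame fits easily */
	unsigned int i, len = 0;

	/* Gather the scatter-gather list the Guest sent. */
	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS && dma->len[i]; i++) {
		read_guest_mem(buf + len, dma->addr[i], dma->len[i]);
		len += dma->len[i];
	}

	/* One write per packet: the tap device frames it for us. */
	if (write(netfd, buf, len) != (ssize_t)len)
		warn("writing packet to tun");
}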