Category: LINUX
2010-05-03 23:22:41
Note: the io.c code discussed below has since been removed from the kernel tree. Background reference:
http://lists.xensource.com/archives/html/xen-devel/2006-06/msg00166.html
[Xen-devel] [PATCH 1/9] Xen Share: Simplified I/O Mechanism, Rusty Russell, 2006/06/05
Part 1: The overall model
From lguest.txt:
Lguest I/O model:
Lguest uses a simplified DMA model plus shared memory for I/O. Guests can communicate with each other if they share underlying memory (usually by the lguest program mmapping the same file), but they can use any non-shared memory to communicate with the lguest process.
Guests can register DMA buffers at any key (must be a valid physical address) using the LHCALL_BIND_DMA(key, dmabufs, num<<8|irq) hypercall. "dmabufs" is the physical address of an array of "num" "struct lguest_dma": each contains a used_len, and an array of physical addresses and lengths. When a transfer occurs, the "used_len" field of one of the buffers which has used_len 0 will be set to the length transferred and the irq will fire.
Using an irq value of 0 unbinds the dma buffers.
To send DMA, the LHCALL_SEND_DMA(key, dma_physaddr) hypercall is used, and the number of bytes used is written to the used_len field. This can be 0 if no one else has bound a DMA buffer to that key or some other error occurred. DMA buffers bound by the same guest are ignored.
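As a concrete illustration, here is a minimal sketch of the guest side of buffer registration, assuming a hypercall wrapper hcall(call, arg1, arg2, arg3) like the one in the 2.6.23-era guest code (treat the exact signature as an assumption). Note how num and irq share one argument, exactly as described above:
static int bind_rx_buffers(unsigned long key, struct lguest_dma *dmas,
			   unsigned int num, u8 irq)
{
	/* The Host reads `num` descriptors starting at __pa(dmas), and
	 * will fire `irq` whenever a transfer lands in one of them. */
	return hcall(LHCALL_BIND_DMA, key, __pa(dmas), (num << 8) | irq);
}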
From lguest_launcher.h:
/*D:200
* Lguest I/O
*
* The lguest I/O mechanism is the only way Guests can talk to devices. There
* are two hypercalls involved: SEND_DMA for output and BIND_DMA for input. In
* each case, "struct lguest_dma" describes the buffer: this contains 16
* addr/len pairs, and if there are fewer buffer elements the len array is
* terminated with a 0.
*
* I/O is organized by keys: BIND_DMA attaches buffers to a particular key, and
 * SEND_DMA transfers to buffers bound to a particular key. By convention, keys
* correspond to a physical address within the device's page. This means that
* devices will never accidentally end up with the same keys, and allows the
 * Host to use The Futex Trick (as we'll see later in our journey).
*
* SEND_DMA simply indicates a key to send to, and the physical address of the
* "struct lguest_dma" to send. The Host will write the number of bytes
* transferred into the "struct lguest_dma"'s used_len member.
*
* BIND_DMA indicates a key to bind to, a pointer to an array of "struct
* lguest_dma"s ready for receiving, the size of that array, and an interrupt
* to trigger when data is received. The Host will only allow transfers into
* buffers with a used_len of zero: it then sets used_len to the number of
* bytes transferred and triggers the interrupt for the Guest to process the
* new input. */
struct lguest_dma
{
	/* 0 if free to be used, filled by the Host. */
	u32 used_len;
	unsigned long addr[LGUEST_MAX_DMA_SECTIONS];
	u16 len[LGUEST_MAX_DMA_SECTIONS];
};
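For example, a single-buffer send could look like the sketch below; lguest_send_dma() is the guest-side helper from that era's lguest code, but treat the exact names as assumptions:
static void send_one_buffer(unsigned long key, void *buf, unsigned int len)
{
	struct lguest_dma dma;

	/* Unused addr/len pairs stay zero; a zero len terminates the list. */
	memset(&dma, 0, sizeof(dma));
	dma.addr[0] = __pa(buf);	/* the Host expects physical addresses */
	dma.len[0] = len;

	lguest_send_dma(key, &dma);
	/* On return, dma.used_len holds the bytes actually transferred
	 * (0 if nothing was bound at `key`, or on error). */
}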
(1) Each virtual device generally embeds one or more lguest_dma structures.
(2) Each guest holds an array of lguest_dma_info structures, whose size caps the number of DMA bindings it can register:
struct lguest_dma_info dma[LGUEST_MAX_DMA];
struct lguest_dma_info
{
	struct list_head list;
	union futex_key key;
	unsigned long dmas;	/* physical address of the lguest_dma array */
	u16 next_dma;
	u16 num_dmas;		/* number of entries in the lguest_dma array */
	u16 guestid;
	u8 interrupt;		/* 0 when not registered */
};
(3) So that guests can share memory with one another, every active lguest_dma_info is linked into a hash table:
static struct list_head dma_hash[61];
bind_dma therefore does two things (see the sketch after this list):
(1) registers the lguest_dma array in the guest's lguest_dma_info array;
(2) links the entry into dma_hash, so that the matching lguest_dma_info objects of other guests can be found quickly.
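A minimal sketch of those two steps, modeled on the 2.6.23 io.c (the free-slot search and error handling are elided; hash() is io.c's hash over the futex key):
static int bind_dma_sketch(struct lguest *lg, union futex_key *key,
			   unsigned long dmas, u16 num_dmas, u8 interrupt)
{
	/* Assume slot 0 is free; the real code scans lg->dma[] for an
	 * entry whose interrupt is 0. */
	struct lguest_dma_info *info = &lg->dma[0];

	/* (1) Record the Guest's array of lguest_dma descriptors. */
	info->dmas = dmas;
	info->num_dmas = num_dmas;
	info->next_dma = 0;
	info->key = *key;
	info->guestid = lg->guestid;
	info->interrupt = interrupt;

	/* (2) Hash the futex key so send_dma() from any Guest sharing
	 * this memory can find these buffers quickly. */
	list_add(&info->list, &dma_hash[hash(key)]);
	return 0;
}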
Part 2: Asynchronous requests from a virtual device to the guest (interrupt-like)
1. setup_waker forks a Waker process. The Waker listens on the input side of every virtual device (in practice, all the open fds) via the select() in wake_parent. When a virtual device has input pending, the Waker issues an LHREQ_BREAK request (args here is { LHREQ_BREAK, 1 }):
	write(lguest_fd, args, sizeof(args));
The write traps into the hypervisor, which calls break_guest_out() and executes the following code:
	if (on) {
		lg->break_out = 1;
		/* Pop it out (may be running on different CPU) */
		wake_up_process(lg->tsk);
		/* Wait for them to reset it */
		return wait_event_interruptible(lg->break_wq, !lg->break_out);
	}
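For reference, a minimal sketch of the Waker's loop under those assumptions (the real wake_parent() also watches a pipe from the Launcher so the fd set can be updated):
static void waker_loop(int lguest_fd, fd_set *device_fds, int max_fd)
{
	u32 args[] = { LHREQ_BREAK, 1 };

	for (;;) {
		fd_set rfds = *device_fds;

		/* Sleep until some virtual device has input pending. */
		if (select(max_fd + 1, &rfds, NULL, NULL, NULL) <= 0)
			continue;

		/* Kick the Guest out to the Launcher: this is the write
		 * that ends up in break_guest_out() above. */
		if (write(lguest_fd, args, sizeof(args)) < 0)
			err(1, "Sending break");
	}
}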
2. While the guest OS runs, any of several events (a real hardware interrupt, a hypercall, and so on) returns control to the hypervisor, which executes run_guest():
	/* If Waker set break_out, return to Launcher. */
	if (lg->break_out)
		return -EAGAIN;
3. Control returns to user space (the Launcher), which goes run_guest => handle_input to service each virtual device's pending input and then release the Waker (args here is { LHREQ_BREAK, 0 }, which clears the break):
	/* Service input, then unset the BREAK which releases
	 * the Waker. */
	handle_input(lguest_fd, device_list);
	if (write(lguest_fd, args, sizeof(args)) < 0)
		err(1, "Resetting break");
handle_input dispatches to per-device handlers (handle_console_input, handle_tun_input, and so on):
each calls get_dma_buffer to obtain a DMA buffer bound by the Guest, reads the device's input into it, and then calls trigger_irq, which sets the corresponding bit in lg->irqs_pending (see the sketch below).
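trigger_irq itself is tiny; here is a sketch assuming the 2.6.23-era LHREQ_IRQ write format (two u32s written to /dev/lguest):
static void trigger_irq_sketch(int lguest_fd, unsigned int irq)
{
	u32 args[] = { LHREQ_IRQ, irq };

	/* The Host sets the bit in lg->irqs_pending; the interrupt is
	 * injected the next time the Guest runs (step 4 below). */
	if (write(lguest_fd, args, sizeof(args)) < 0)
		err(1, "Triggering irq %u", irq);
}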
After handle_input completes, the Launcher re-enters the hypervisor with this read:
	readval = read(lguest_fd, arr, sizeof(arr));
4. The hypervisor executes maybe_do_interrupt() to inject the pending interrupt into the guest OS, then resumes running the guest.
Part 3: Synchronous requests from the guest to a virtual device
Take the virtual block device as an example.
1. When the guest OS accesses the block device (the code lives in lguest_blk.c), it issues an LHCALL_SEND_DMA hypercall and traps into the hypervisor. The heart of the hypervisor is the run_guest() function below: do_hypercalls() dispatches the SEND_DMA hypercall to its handler, send_dma(), which contains the line:
	lg->dma_is_pending = 1;
Because lg->dma_is_pending is now set, control switches back to the Launcher.
int run_guest(struct lguest *lg, unsigned long __user *user)
{
	/* First we run any hypercalls the Guest wants done: either in
	 * the hypercall ring in "struct lguest_data", or directly by
	 * using int 31 (LGUEST_TRAP_ENTRY). */
	do_hypercalls(lg);

	/* It's possible the Guest did a SEND_DMA hypercall to the
	 * Launcher, in which case we return from the read() now. */
	if (lg->dma_is_pending) {
		if (put_user(lg->pending_dma, user) ||
		    put_user(lg->pending_key, user+1))
			return -EFAULT;
		return sizeof(unsigned long)*2;
	}
2. Backing up: before step 1, the Launcher had called run_guest(int lguest_fd, struct device_list *device_list), whose read() on /dev/lguest blocks the Launcher and sets the guest OS running. The read handler of /dev/lguest contains:
	/* If we returned from read() last time because the Guest sent DMA,
	 * clear the flag. */
	if (lg->dma_is_pending)
		lg->dma_is_pending = 0;

	/* Run the Guest until something interesting happens. */
	return run_guest(lg, (unsigned long __user *)user);
Now back in the Launcher's run_guest(int lguest_fd, struct device_list *device_list), we find:
	if (readval == sizeof(arr)) {
		handle_output(lguest_fd, arr[0], arr[1], device_list);
		continue;
	}
After handle_output finishes, the loop continues and the guest OS resumes execution.
3. For the console and TUN devices, handle_output (handle_console_output, handle_tun_output) is trivial: little more than a write(). For the block device, handle_block_output is more involved: it carries out the actual I/O (a read or a write of the backing file) and then fires the interrupt via trigger_irq. A sketch follows.
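A minimal sketch of that path, with field names following struct lguest_block_page from lguest_blk.c (the real handle_block_output maps the Guest's DMA buffers; here a flat `data` buffer stands in for them):
static void handle_block_sketch(int lguest_fd, int backing_fd,
				struct lguest_block_page *p,
				void *data, unsigned int irq)
{
	off_t off = (off_t)p->sector * 512;

	if (p->type)	/* 1 = write request */
		p->bytes = pwrite(backing_fd, data, p->bytes, off);
	else		/* 0 = read request */
		p->bytes = pread(backing_fd, data, p->bytes, off);

	p->result = 1;	/* done; 2 would mean error */
	trigger_irq_sketch(lguest_fd, irq);	/* see the sketch in Part 2 */
}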
Part 4: Network device one --- sharenet (now abandoned), implementing inter-guest communication
The documentation was too poor to work out how to use it.
The replacement:
[RFC PATCH 5/5] lguest: Inter-guest networking
The key functions are setup_net_file and dma_transfer.
Main flow:
All guests open the same net file, which is one page in size; each guest occupies one struct lguest_net-sized slot in that page (the position is called its slot). setup_net_file mmaps the net file,
with these two lines:
	dev->mem = (void *)(dev->desc->pfn * getpagesize());
	if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE,
		 MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem)
		err(1, "could not mmap '%s'", filename);
This guarantees that every guest's net device really maps the same page (file), and each guest publishes its MAC address in that page. lguestnet_open contains:
	/* Copy our MAC address into the device page, so others on the network
	 * can find us. */
	memcpy(info->peer[info->me].mac, dev->dev_addr, ETH_ALEN);
Finally, lguestnet_start_xmit finds the destination guest(s) with the following loop (a sketch of transfer_packet follows the loop):
	/* Look through all the published ethernet addresses to see if we
	 * should send this packet. */
	for (i = 0; i < info->mapsize/sizeof(struct lguest_net); i++) {
		/* We don't send to ourselves (we actually can't SEND_DMA to
		 * ourselves anyway), and don't send to unused slots. */
		if (i == info->me || unused_peer(info->peer, i))
			continue;

		/* If it's broadcast we send it. If they want every packet we
		 * send it. If the destination matches their address we send
		 * it. Otherwise we go to the next peer. */
		if (!broadcast && !promisc(info, i) && !mac_eq(dest, info, i))
			continue;

		pr_debug("lguestnet %s: sending from %i to %i\n",
			 dev->name, info->me, i);
		/* Our routine which actually does the transfer. */
		transfer_packet(dev, skb, i);
	}
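transfer_packet() then boils down to building a struct lguest_dma over the skb and SEND_DMAing it to the peer's slot key. A sketch, with peer_key() assumed to return the physical address of peer `peernum`'s slot in the shared page:
static void transfer_packet_sketch(struct net_device *dev,
				   struct sk_buff *skb, unsigned int peernum)
{
	struct lguestnet_info *info = netdev_priv(dev);
	struct lguest_dma dma;

	memset(&dma, 0, sizeof(dma));
	dma.addr[0] = virt_to_phys(skb->data);
	dma.len[0] = skb_headlen(skb);

	/* The Host matches this key against the peer's BIND_DMA buffers;
	 * if the peer had no free buffer, used_len comes back short and
	 * the real driver bumps its tx error counters. */
	lguest_send_dma(peer_key(info, peernum), &dma);
}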
From the patch's description of io.c:
+io.c:
+ lguest provides DMA-style transfer, and buffer registration.
+ The guest can dma send to a particular address, or register a
+ set of DMA buffers at a particular address. This provides
+ inter-guest I/O (for shared addresses, such as a shared mmap)
+ or I/O out to the userspace process (lguest).
+
+ We currently use the futex infrastructure to see if a given
+ address is shared: if it is, we look for another guest which
+ has registered a DMA buffer at this address and copy the data,
+ then interrupt the recipient. Otherwise, we notify the guest
+ userspace (which has access to all the guest memory) to handle
+ the transfer.
+
+ TODO: We could flip whole pages between guests at this point
+ if we wanted to, however it seems unlikely to be worthwhile.
+ More optimization could be gained by having servers for certain
+ devices within the host kernel itself, avoiding at
+ least two switches into the lguest binary and back.
+
And from the comments in io.c itself:
 * We want Guests which share memory to be able to DMA to each other: two
 * Launchers can mmap the same file, then the Guests can communicate.
* Fortunately, the futex code provides us with a way to get a "union
* futex_key" corresponding to the memory lying at a virtual address: if the
* two processes share memory, the "union futex_key" for that memory will match
* even if the memory is mapped at different addresses in each. So we always
* convert the keys to "union futex_key"s to compare them.
*
* Before we dive into this though, we need to look at another set of helper
* routines used throughout the Host kernel code to access Guest memory.
:*/
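Putting the pieces together, the Host-side lookup ("The Futex Trick") is roughly the sketch below, modeled on the 2.6.23 io.c (hash() and key_eq() are the helpers from that file): hash the futex key for the SEND_DMA address and scan dma_hash for a binding by some *other* guest on the same underlying memory.
static struct lguest_dma_info *find_peer_dma(struct lguest *lg,
					     union futex_key *key)
{
	struct lguest_dma_info *i;

	list_for_each_entry(i, &dma_hash[hash(key)], list) {
		/* Same underlying memory, but bound by a different Guest?
		 * (Buffers bound by the sender itself are ignored.) */
		if (key_eq(key, &i->key) && i->guestid != lg->guestid)
			return i;
	}
	/* No match: fall back to notifying the Launcher. */
	return NULL;
}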
Part 5: Network device two --- TUN/TAP
Unlike sharenet, where the device memory page is shared by all guests, the TUN/TAP device's memory page holds just two MAC addresses: see setup_tun_net. The crucial point is that the guest's network device sits in slot 1 (NET_PEERNUM) while the TUN side occupies slot 0, so transmitting a packet ultimately becomes a send_dma to the key of the TUN side's slot, peer_offset(0).
	/* We create the net device with 1 page, using the features field of
	 * the descriptor to tell the Guest it is in slot 1 (NET_PEERNUM), and
	 * that the device has fairly random timing. We do *not* specify
	 * LGUEST_NET_F_NOCSUM: these packets can reach the real world.
	 *
	 * We will put our MAC address in slot 0 for the Guest to see, so
	 * it will send packets to us using the key "peer_offset(0)": */
	dev = new_device(devices, LGUEST_DEVICE_T_NET, 1,
			 NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS, netfd,
			 handle_tun_input, peer_offset(0), handle_tun_output);
	/* We are peer 0, ie. first slot, so we hand dev->mem to this routine
	 * to write the MAC address at the start of the device memory. */
	configure_device(ipfd, ifr.ifr_name, ip, dev->mem);
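On the output side, Part 3 noted that handle_tun_output is essentially just a write. A sketch under that description, where read_guest_mem() is a hypothetical helper standing in for the Launcher's real guest-memory access:
static void tun_output_sketch(int netfd, const struct lguest_dma *dma)
{
	char buf[65536];	/* one Ethernet frame fits easily */
	unsigned int i, len = 0;

	/* Gather the scatter-gather list the Guest sent. */
	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS && dma->len[i]; i++) {
		read_guest_mem(buf + len, dma->addr[i], dma->len[i]);
		len += dma->len[i];
	}

	/* One write per packet: the tap device frames it for us. */
	if (write(netfd, buf, len) != (ssize_t)len)
		warn("writing packet to tun");
}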