Category: System Operations
2013-08-25 17:07:27
1 DMA Transfer Configuration Modes
A DMA controller can be configured in two ways: Register Mode and Descriptor Mode. When the DMA runs in Register Mode, the DMA controller simply uses the values contained in its registers. In Descriptor Mode, the DMA controller looks in memory for its configuration values.
In register-based DMA, the processor directly programs DMA control registers to initiate a transfer. As one reference puts it: "A DMA controller can generate addresses and initiate memory read or write cycles. It contains several registers that can be written and read by the CPU. These include a memory address register, a byte count register, and one or more control registers."
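The register-mode programming model described above can be sketched in a few lines of C. This is a toy simulation, not any real controller's interface: the register names, the start bit, and the modeling of the memory address as an offset into a local array are all illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register file for a register-mode DMA controller: the CPU
 * writes these fields directly, then sets the start bit in `control`. */
struct dma_regs {
    uint32_t mem_addr;   /* memory address register */
    uint32_t byte_count; /* byte count register */
    uint32_t control;    /* control register; bit 0 = start */
};

/* Simulate one register-mode transfer: copy byte_count bytes from src
 * into the buffer addressed by mem_addr (modeled as an offset into ram),
 * then clear the start bit to mark completion. */
static void dma_run(struct dma_regs *r, uint8_t *ram, const uint8_t *src)
{
    if (!(r->control & 1))
        return;                /* start bit not set: nothing to do */
    memcpy(ram + r->mem_addr, src, r->byte_count);
    r->control &= ~1u;         /* transfer complete */
}
```

The point is that every parameter of the transfer lives in device registers, so the CPU must reprogram them for each transfer.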
For Descriptor Mode, on the other hand, "DMA transfers that are descriptor-based require a set of parameters stored within memory to initiate a DMA sequence. The descriptor contains all of the same parameters normally programmed into the DMA control register set." The following material, quoted from another source, elaborates: "To initiate a DMA operation, the device driver within the operating system creates DMA descriptors that refer to regions of memory. Each DMA descriptor typically includes an address, a length, and a few device-specific flags. In commodity x86 systems, devices lack support for virtual-to-physical address translation, so DMA descriptors always contain physical addresses for main memory. Once created, the device driver passes the descriptors to the device, which will later use the descriptors to transfer data to or from the indicated memory regions autonomously. When the requested I/O operations have been completed, the device raises an interrupt to notify the device driver." As we can see, some of the contents that used to live in DMA registers have been moved into in-memory DMA descriptors, so the DMA engine must be told the start address of the descriptors (usually a dedicated DMA register holds it).
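The descriptor layout in the quote (an address, a length, a few flags) can be modeled directly. This is a sketch under stated assumptions: the struct layout, the OWN flag, and the offset-based "physical address" are invented for illustration, not taken from any real device.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory DMA descriptor, per the quoted text: a physical
 * address, a length, and device-specific flags. */
struct dma_desc {
    uint32_t paddr;   /* physical address of the data buffer */
    uint16_t len;     /* transfer length in bytes */
    uint16_t flags;   /* bit 0 = OWN: device may process this entry */
};

#define DESC_OWN 0x0001

/* Simulate the device side: fetch one descriptor and copy the data it
 * points to (paddr modeled as an offset into ram). Returns the number
 * of bytes moved, or -1 if the driver has not handed over the entry. */
static int device_process(struct dma_desc *d, uint8_t *ram, uint8_t *out)
{
    if (!(d->flags & DESC_OWN))
        return -1;             /* entry still owned by the driver */
    memcpy(out, ram + d->paddr, d->len);
    d->flags &= ~DESC_OWN;     /* hand ownership back to the driver */
    return d->len;
}
```

The ownership bit is the key design point: driver and device never touch the same descriptor at the same time.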
How is the DMA engine told the start address of the DMA descriptors? "The precise mechanism of that notification depends on the particular network interface, but typically involves a programmed I/O operation to the device telling it the location of the new descriptors. The network interface would then retrieve the descriptors from main memory using DMA—if they were not written to the device directly by programmed I/O."
In practice, the driver usually provides the device with more than one DMA descriptor; the descriptors are organized as a ring (the so-called DMA ring buffer) or as an array. That way the NIC can still receive multiple packets while the CPU is busy; likewise, the NIC can transmit several packets before notifying the CPU, which also reduces overhead.
The discussion below uses the DMA ring buffer as an example. As the article quoted below shows, operating on DMA descriptors is a joint effort of the CPU and the DMA engine.
"The DMA engine has an internal code and is programmed to move the data using the information in the buffer descriptor, update the status field in the buffer when the transfer completes and move to the next entry in the buffer ring." That is, after the DMA engine finishes with a descriptor it automatically updates the descriptor's status bits and automatically moves on to the next descriptor. Note, however, that "Some DMA engines only work on one entry to completion then must be explicitly re-enabled to work on the next entry."
"The DMA engine maintains a pointer to the active buffer, which is the buffer it is actually working on or the one it will work on when enabled. The DMA engine is triggered to perform a data transfer when a specific bit is set in the status register of the active buffer. The engine works on the buffers until it is either stopped by the CPU or it encounters a buffer that is not ready for data transfer." In other words, the DMA engine's operation is controlled both by the CPU and by the status bits in the DMA descriptors.
"The software on the CPU maintains a put and a get pointer, and it accesses the ring buffer space and the hardware registers through memory-mapped I/O. The put pointer specifies the address of the next free buffer that will be used to queue a DMA request, and the get pointer points to the next entry that the software will check for completion. When the software put pointer is equal to the software get pointer, the ring is empty. Note that the software must track and detect a ring-full condition so that subsequent DMA requests can wait until a buffer becomes available on the ring. The software queues DMA requests to the DMA engine by getting the address of the put pointer and setting the appropriate fields (origin and destination address, and the 'valid bit' in the status field that allows the DMA engine to process the entry; the software can also set some software-control fields if needed). When the DMA engine completes the entry, it updates the status field, which the software inspects to determine whether the hardware has completed the request." This is the CPU and the DMA engine cooperating on the ring buffer: a classic producer/consumer model, in which the three pointers (put, get, and the engine's active-buffer pointer) plus the status bits control the whole process.
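The put/get scheme in the quote can be sketched as follows. One detail it glosses over is how software detects the full condition; a common choice, assumed here, is to sacrifice one slot so that full and empty are distinguishable. The ring size and names are illustrative.

```c
#include <assert.h>

#define RING_SIZE 8   /* illustrative size */

/* Software-side ring state, as in the quoted text: empty when put == get;
 * software itself must detect full. Here "full" leaves one slot unused. */
struct sw_ring {
    unsigned put;  /* next free entry used to queue a DMA request */
    unsigned get;  /* next entry to check for completion */
};

static int ring_empty(const struct sw_ring *r) { return r->put == r->get; }

static int ring_full(const struct sw_ring *r)
{
    return (r->put + 1) % RING_SIZE == r->get;
}

/* Queue one request; returns 0 on success, -1 if the ring is full. */
static int ring_put(struct sw_ring *r)
{
    if (ring_full(r))
        return -1;
    r->put = (r->put + 1) % RING_SIZE;
    return 0;
}

/* Retire one completed request; returns -1 if the ring is empty. */
static int ring_get(struct sw_ring *r)
{
    if (ring_empty(r))
        return -1;
    r->get = (r->get + 1) % RING_SIZE;
    return 0;
}
```

An alternative to the sacrificed slot is to keep an explicit count of in-flight entries; either way, the full test is software's job, exactly as the quote warns.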
Documentation/DMA-mapping.txt
2 Example 1: DMA Buffers of the RTL8139 A/B NIC
The 8139C additionally supports a more sophisticated DMA buffering scheme. The Linux driver is drivers/net/8139too.c.
2.1 Transmit Buffers
The 8139 has four transmit descriptor pairs, Transmit Start Address (TSAD0-3) and Transmit Status (TSD0-3), used in round-robin order. The driver's rtl8139_init_ring fixes the buffer location of each TSADx at initialization time (this appears to cause some inefficiency; see below, there seems to be room for improvement). Consequently, when rtl8139_start_xmit transmits a packet, the skb must first be copied into the TSADx buffer; then the corresponding TSDx is filled in with the size of the packet and the early-transmit threshold, and the OWN bit in the TSD is cleared (which starts the PCI operation).
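The round-robin use of the four descriptor slots can be sketched like this. The counters mirror the cur_tx/dirty_tx bookkeeping style common in Linux NIC drivers, but the code below is a simplified illustration, not an excerpt from 8139too.c.

```c
#include <assert.h>

#define NUM_TX_DESC 4   /* the 8139 has exactly four TSAD/TSD pairs */

/* Free-running counters: their difference is the number of slots the
 * NIC still owns. Names are illustrative. */
struct tx_state {
    unsigned cur_tx;    /* next descriptor slot to use */
    unsigned dirty_tx;  /* oldest slot not yet reclaimed */
};

/* Returns the slot (0..3) for the next packet, or -1 if all four
 * descriptors are still owned by the NIC. */
static int tx_next_slot(struct tx_state *t)
{
    if (t->cur_tx - t->dirty_tx >= NUM_TX_DESC)
        return -1;                      /* all four slots in flight */
    return (int)(t->cur_tx++ % NUM_TX_DESC);
}
```

When a transmit-OK interrupt arrives, the driver advances dirty_tx, freeing slots for reuse in the same fixed order.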
2.2 Receive Buffers
The receive buffer rx_ring is allocated once up front, which means the same problem as above exists: packets must be copied from rx_ring into skbs. rx_ring is used as a circular buffer: "The Rx buffer should be pre-allocated and indicated in RCR before packet reception. All received packets stored in Rx buffer, including Rx header and 4-byte CRC, are double-word aligned. I.e., each received packet is placed at the next available double-word aligned address after the last received packet in Rx buffer."
The process of packet reception:
1. Data received from the line is stored in the receive FIFO.
2. When the Early Receive Threshold is met, data is moved from the FIFO to the Receive Buffer.
3. After the whole packet is moved from the FIFO to the Receive Buffer, the receive packet header (receive status and packet length) is written in front of the packet. CBA is updated to the end of the packet.
4. CMD (BufferEmpty) and ISR (TOK) are set.
5. The ISR routine is called; the driver then clears ISR (TOK) and updates CAPR.
As can be seen, each packet in the buffer consists of two parts: a header (containing the receive status and the packet length) followed by the packet itself. All of this is filled in automatically by the NIC hardware; the header is not part of the on-wire packet and exists only to make the driver's job easier.
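Given that layout (4-byte header, then the packet, then padding up to the next double-word boundary), the driver's advance to the next packet is simple alignment arithmetic. A sketch (8139too.c does the equivalent with its own variable names; `next_rx_offset` here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Given the current offset of a packet's 4-byte header in rx_ring and
 * the packet length reported in that header (which includes the 4-byte
 * CRC), return the offset of the next packet's header: skip header plus
 * data, then round up to the next double-word (4-byte) boundary. */
static uint32_t next_rx_offset(uint32_t cur, uint16_t pkt_len)
{
    return (cur + 4 + pkt_len + 3) & ~3u;
}
```

Because descriptors and data share one contiguous ring, the driver must do this walking and splitting itself, which is precisely the inefficiency the summary at the end of this article points out.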
0038h-0039h  R/W  CAPR  Current Address of Packet Read (C mode only; initial value 0FFF0h)
003Ah-003Bh  R    CBR   Current Buffer Address (C mode only; initial value 0000h; reflects the total received byte count in the Rx buffer)
CBA (the CBR above) is maintained automatically by the NIC, while CAPR is maintained by the ISR; the region the NIC may write into lies between CBA and CAPR. The rtl8139_rx function maintains the CAPR register with the following statement:
RTL_W16 (RxBufPtr, (u16) (cur_rx - 16));
As a result, "The CAPR register points to an offset that is 16-bytes ahead of the newly received packet's packet-header"; the reason for this offset is unclear to me.
http://wiki.osdev.org/RTL8139
Datasheet for the RTL8139C
Datasheet for the RTL8139D (has more information)
Programming guide for the RTL8139
3 Example 2: DMA Buffers of the Intel e100 NIC
The e100 development manual covers the Intel 8255x family of NICs; the Linux driver is drivers/net/e100.c.
The following material comes from Section 6.1 of the manual:
The
shared memory structure is divided into three parts: the Control/Status
Registers (CSR), the Command Block List (CBL), and the Receive Frame Area
(RFA). The CSR physically resides on the LAN controller and can be accessed by
either I/O or memory cycles, while the rest of the memory structures reside in
system (host) memory. The first 8 bytes of the CSR is called the System Control
Block (SCB). The SCB serves as a central communication point for exchanging
control and status information between the host CPU and the 8255x. The host software
controls the state of the Command Unit (CU) and Receive Unit (RU) (for example,
active, suspended or idle) by writing commands to the SCB. The device posts the
status of the CU and RU in the SCB Status word and indicates status changes
with an interrupt. The SCB also holds pointers to a linked list of action
commands called the CBL and a linked list of receive resources called the RFA.
This type of structure is shown in the figure below.
As we can see, the e100 has two kinds of DMA buffers: the receive buffer list RFA, and the command buffer list CBL (which carries transmit commands, configuration commands, and so on).
3.1 Receive Buffers
"The RFA is the list of free receive resources and consists of Receive Frame Descriptors (RFDs)." The RFD is exactly the kind of structure discussed at the end of Section 1:
struct rfd {
	__le16 status;      /* status bits: completion, etc. */
	__le16 command;
	__le32 link;        /* pointer to the next RFD */
	__le32 rbd;
	__le16 actual_size;
	__le16 size;
};
Now let us look at how the RFA list is implemented in Linux. At initialization time, e100_rx_alloc_list allocates nic->params.rfds.count struct rx objects. struct rx is just scaffolding forming a doubly linked list; the sk_buff it points to contains not only the RFD but also the corresponding receive data buffer.
The RFA list itself is wired up in e100_rx_alloc_skb:
static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
{
	if (!(rx->skb = netdev_alloc_skb_ip_align(nic->netdev, RFD_BUF_LEN)))
		return -ENOMEM;

	/* Init, and map the RFD. */
	skb_copy_to_linear_data(rx->skb, &nic->blank_rfd, sizeof(struct rfd));
	...
	if (rx->prev->skb) {
		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
		put_unaligned_le32(rx->dma_addr, &prev_rfd->link); /* chain into the list */
	}
}
This builds an RFA list of the configured size, which from then on is essentially fixed. Next the Receive Unit (RU) must be started. Table 16 of the manual gives the relevant command:
RUC Field | RU Command | SCB General Pointer                         | Added to
1         | RU Start   | Pointer to first RFD in the Receive Frame Area | RU Base
In e100_start_receiver, the following statement accomplishes this:
e100_exec_cmd(nic, ruc_start, rx->dma_addr);
Once the RU is started, packet reception begins. When a complete packet has been received, the DMA engine raises an interrupt and automatically advances to the next RFD. On receiving the interrupt, the driver must push the data-carrying skb up the protocol stack and allocate a fresh skb for that RFD. The interrupt path is e100_intr => e100_poll (NAPI mode) => e100_rx_clean. The driver maintains two pointers; struct nic has the following two members:
rx_to_use: the next buffer the DMA engine will place data into, but which has not yet been given a new skb. Note that the DMA engine's current pointer does not necessarily correspond to rx_to_use.
rx_to_clean: the first buffer that already holds a received frame and is waiting for the driver to clean it.
Packet reception in e100 starts at rx_to_clean and walks the RFDs, stopping at the first RFD whose status bit is not set or once the scheduled amount of work is done. The buffers between rx_to_use and rx_to_clean are frames the driver has already handled: their old skbs have been pushed up to the higher protocol layers, and new skbs must be allocated for them.
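The rx_to_clean walk described above, stop at the first descriptor the hardware has not completed, can be sketched as follows. The RFD_C bit value and function name are illustrative, not taken from e100.c.

```c
#include <assert.h>
#include <stdint.h>

#define RFD_C 0x8000   /* illustrative "complete" bit in the RFD status word */

/* Count how many consecutive descriptors, starting at index `start` in a
 * ring of n status words, have been completed by the RU and can be
 * cleaned. Stops at the first incomplete descriptor. */
static int count_cleanable(const uint16_t *status, int n, int start)
{
    int cleaned = 0;
    for (int i = start; cleaned < n; i = (i + 1) % n, cleaned++)
        if (!(status[i] & RFD_C))
            break;
    return cleaned;
}
```

A real driver would also cap the count by the NAPI budget, which corresponds to the "scheduled amount of work" mentioned above.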
3.2 Transmit Buffers
Packet transmission on the e100 appears to be synchronous (fix me): the driver issues one command and waits for it to complete (see the loop around e100_exec_cmd), wasting quite a bit of CPU time. Ideally the CPU could simply append a transmit descriptor at the tail and then start the CU (Command Unit) to process the whole CB list. In fact that is exactly how e100 reception works, so it is odd that transmission does not do the same.
static int e100_exec_cb(struct nic *nic, struct sk_buff *skb,
	void (*cb_prepare)(struct nic *, struct cb *, struct sk_buff *))
{
	struct cb *cb;
	unsigned long flags;
	int err = 0;

	cb = nic->cb_to_use;
	nic->cb_to_use = cb->next;
	nic->cbs_avail--;
	cb->skb = skb;

	cb_prepare(nic, cb, skb);

	/* Order is important otherwise we'll be in a race with h/w:
	 * set S-bit in current first, then clear S-bit in previous. */
	cb->command |= cpu_to_le16(cb_s);
	wmb();
	cb->prev->command &= cpu_to_le16(~cb_s);

	while (nic->cb_to_send != nic->cb_to_use) {
		if (unlikely(e100_exec_cmd(nic, nic->cuc_cmd,
			nic->cb_to_send->dma_addr))) {
			break;
		} else {
			nic->cuc_cmd = cuc_resume;
			nic->cb_to_send = nic->cb_to_send->next;
		}
	}

	return err;
}
Executing e100_exec_cmd relies on the following behavior: "The 8255x updates the SCB status and clears the SCB command word to indicate that acceptance has completed."
With NAPI, however, the transmit path is still optimized: e100_xmit_prepare arranges for an interrupt only once every 16 transmitted packets (reception may also trigger interrupts), reducing the interaction between NIC and CPU. e100_tx_clean can then reclaim several skbs in one pass.
> The tx interrupt, if used, indicates that a frame has been transmitted
> so can be cleaned from the tx ring. The e100 driver uses NAPI, kicked by
> rx/tx interrupts. When e100_poll() is called via NAPI, it does tx
> cleaning automatically. There is no need to do this ASAP via tx
> interrupts so the driver disables tx interrupts most of the time and
> does the cleaning when the next NAPI poll is processed.
>
> To handle the case of transmitting lots of frames while not receiving
> anything, a tx interrupt is generated every 16 frames. See
> e100_xmit_prepare(). The interrupt handler will kick the NAPI poll which
> will do tx clean processing.
>
> The e100 design minimizes interrupts as much as possible in order to
> maximize packets/sec throughput. An interrupt is used only to kick off
> NAPI polling - interrupts are disabled while in NAPI polled mode.
> Polling continues while there is work to do (tx/rx). When no more frames
> are queued for tx or are being received by rx, the driver re-enables its
> interrupt and goes back to interrupt mode. In other words, interrupts
> are used only to wake up driver tx/rx processing, not per-frame
> processing.
>
> The netif_rx_schedule() function name is badly named - it is actually
> scheduling NAPI polling which can be used for both rx/tx work. The e100
> driver does both rx _and_ tx clean processing in its NAPI poll handler:
> e100_poll(). Other drivers use NAPI for rx only - I've never understood
> why.
> Hope I've helped.
linux协议栈之链路层上的数据传输之二 (Linux protocol stack: link-layer data transmission, part 2)
http://developer.intel.com/design/network/datashts/29736001.pdf
~baker/devices/restricted/notes06/ch17.html
Documentation/networking/e100.txt
4 Example 3: DMA Buffers of the Intel e1000 NIC
As noted above, e100 transmit and receive are asymmetric, but on the e1000 the transmit and receive structures are alike: the Receive Descriptor Queue and the Transmit Descriptor Ring are both circular buffer queues, each with a head and a tail pointer. The hardware works from the head, while the driver inserts at the tail.
From Section 3.2.3 of the manual:
Software adds receive descriptors by writing
the tail pointer with the index of the entry beyond the last valid descriptor.
As packets arrive, they are stored in memory and the head pointer is incremented
by hardware. When the head pointer is equal to the tail pointer, the ring is
empty. Hardware stops storing packets in system memory until software advances
the tail pointer, making more receive buffers available.
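The head/tail convention in the quote can be sketched as follows: software posts descriptors by moving the tail, hardware consumes them by moving the head, and head == tail means nothing is available to hardware. The ring size and names below are illustrative.

```c
#include <assert.h>

#define RING_N 16   /* illustrative ring size */

/* Head/tail descriptor ring as quoted: software writes tail to the index
 * one beyond the last valid descriptor; hardware increments head as it
 * consumes entries; head == tail means the ring is empty for hardware. */
struct hw_ring {
    unsigned head;  /* advanced by hardware */
    unsigned tail;  /* advanced by software */
};

/* Number of descriptors currently available to the hardware. */
static unsigned hw_avail(const struct hw_ring *r)
{
    return (r->tail + RING_N - r->head) % RING_N;
}

/* Hardware consumes one descriptor, if any; returns 1 on success,
 * 0 if the ring is empty and the hardware must stall. */
static int hw_consume(struct hw_ring *r)
{
    if (r->head == r->tail)
        return 0;
    r->head = (r->head + 1) % RING_N;
    return 1;
}
```

Note the symmetry with the software put/get ring in Section 1: head plays the role of the get pointer and tail the put pointer, only here one side of the protocol is implemented in silicon.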
Official material for all Intel NICs.
The following two links contain some development information about the e100 and e1000:
http://oss.sgi.com/projects/netdev/archive/2005-04/msg01822.html
http://blog.gmane.org/gmane.linux.drivers.e1000.devel/month=20050201
5 Summary
Overall, DMA engine design has evolved from simple to complex, with performance steadily improving, and DMA itself has gone from supporting a single buffer to ring buffers. On the RTL8139, descriptors and data are stored contiguously, so the driver must separate the individual packets itself, causing unnecessary computation and one unnecessary copy. The e100's descriptors are organized as a linked list, so packets are cleanly delimited, and descriptors and skbs are allocated dynamically, avoiding the extra copy. Another interesting question is whether the transmit and receive buffers should be symmetric. Fundamentally, reception is somewhat more complex, because the outside world is unpredictable while transmission is under the host's own control; the transmit buffer structures of the RTL8139 and the e100 are both simpler than their receive counterparts, whereas the e1000's transmit structure mirrors its receive structure.