Category: System Operations
2013-08-25 17:07:27
1 DMA Transfer Configuration Modes
A DMA controller can be configured in two ways: Register Mode and Descriptor Mode. When the DMA runs in Register Mode, the DMA controller simply uses the values contained in its registers. In Descriptor Mode, the DMA controller looks in memory for its configuration values.
In register-based DMA, the processor directly programs DMA control registers to initiate a transfer. As one reference puts it: "A DMA controller can generate addresses and initiate memory read or write cycles. It contains several registers that can be written and read by the CPU. These include a memory address register, a byte count register, and one or more control registers."
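The register-mode programming model described above can be sketched in a few lines of C. This is a toy simulation, not any real controller's interface: the register names, the start bit, and the modeling of the memory address as an offset into a local array are all illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register file for a register-mode DMA controller: the CPU
 * writes these fields directly, then sets the start bit in `control`. */
struct dma_regs {
    uint32_t mem_addr;   /* memory address register */
    uint32_t byte_count; /* byte count register */
    uint32_t control;    /* control register; bit 0 = start */
};

/* Simulate one register-mode transfer: copy byte_count bytes from src
 * into the buffer addressed by mem_addr (modeled as an offset into ram),
 * then clear the start bit to mark completion. */
static void dma_run(struct dma_regs *r, uint8_t *ram, const uint8_t *src)
{
    if (!(r->control & 1))
        return;                /* start bit not set: nothing to do */
    memcpy(ram + r->mem_addr, src, r->byte_count);
    r->control &= ~1u;         /* transfer complete */
}
```

The point is that every parameter of the transfer lives in device registers, so the CPU must reprogram them for each transfer.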
For Descriptor Mode, on the other hand, "DMA transfers that are descriptor-based require a set of parameters stored within memory to initiate a DMA sequence. The descriptor contains all of the same parameters normally programmed into the DMA control register set." The following material, quoted from another source, elaborates: "To initiate a DMA operation, the device driver within the operating system creates DMA descriptors that refer to regions of memory. Each DMA descriptor typically includes an address, a length, and a few device-specific flags. In commodity x86 systems, devices lack support for virtual-to-physical address translation, so DMA descriptors always contain physical addresses for main memory. Once created, the device driver passes the descriptors to the device, which will later use the descriptors to transfer data to or from the indicated memory regions autonomously. When the requested I/O operations have been completed, the device raises an interrupt to notify the device driver." As we can see, some of the contents that used to live in DMA registers have been moved into in-memory DMA descriptors, so the DMA engine must be told the start address of the descriptors (usually a dedicated DMA register holds it).
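The descriptor layout in the quote (an address, a length, a few flags) can be modeled directly. This is a sketch under stated assumptions: the struct layout, the OWN flag, and the offset-based "physical address" are invented for illustration, not taken from any real device.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory DMA descriptor, per the quoted text: a physical
 * address, a length, and device-specific flags. */
struct dma_desc {
    uint32_t paddr;   /* physical address of the data buffer */
    uint16_t len;     /* transfer length in bytes */
    uint16_t flags;   /* bit 0 = OWN: device may process this entry */
};

#define DESC_OWN 0x0001

/* Simulate the device side: fetch one descriptor and copy the data it
 * points to (paddr modeled as an offset into ram). Returns the number
 * of bytes moved, or -1 if the driver has not handed over the entry. */
static int device_process(struct dma_desc *d, uint8_t *ram, uint8_t *out)
{
    if (!(d->flags & DESC_OWN))
        return -1;             /* entry still owned by the driver */
    memcpy(out, ram + d->paddr, d->len);
    d->flags &= ~DESC_OWN;     /* hand ownership back to the driver */
    return d->len;
}
```

The ownership bit is the key design point: driver and device never touch the same descriptor at the same time.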
How is the DMA engine told the start address of the DMA descriptors? "The precise mechanism of that notification depends on the particular network interface, but typically involves a programmed I/O operation to the device telling it the location of the new descriptors. The network interface would then retrieve the descriptors from main memory using DMA—if they were not written to the device directly by programmed I/O."
In practice, the driver usually provides the device with more than one DMA descriptor; the descriptors are organized as a ring (the so-called DMA ring buffer) or as an array. That way the NIC can still receive multiple packets while the CPU is busy; likewise, the NIC can transmit several packets before notifying the CPU, which also reduces overhead.
The discussion below uses the DMA ring buffer as an example. As the article quoted below shows, operating on DMA descriptors is a joint effort of the CPU and the DMA engine.
"The DMA engine has an internal code and is programmed to move the data using the information in the buffer descriptor, update the status field in the buffer when the transfer completes and move to the next entry in the buffer ring." That is, after the DMA engine finishes with a descriptor it automatically updates the descriptor's status bits and automatically moves on to the next descriptor. Note, however, that "Some DMA engines only work on one entry to completion then must be explicitly re-enabled to work on the next entry."
"The DMA engine maintains a pointer to the active buffer, which is the buffer it is actually working on or the one it will work on when enabled. The DMA engine is triggered to perform a data transfer when a specific bit is set in the status register of the active buffer. The engine works on the buffers until it is either stopped by the CPU or it encounters a buffer that is not ready for data transfer." In other words, the DMA engine's operation is controlled both by the CPU and by the status bits in the DMA descriptors.
"The software on the CPU maintains a put and a get pointer, and it accesses the ring buffer space and the hardware registers through memory-mapped I/O. The put pointer specifies the address of the next free buffer that will be used to queue a DMA request, and the get pointer points to the next entry that the software will check for completion. When the software put pointer is equal to the software get pointer, the ring is empty. Note that the software must track and detect a ring-full condition so that subsequent DMA requests can wait until a buffer becomes available on the ring. The software queues DMA requests to the DMA engine by getting the address of the put pointer and setting the appropriate fields (origin and destination address, and the 'valid bit' in the status field that allows the DMA engine to process the entry; the software can also set some software-control fields if needed). When the DMA engine completes the entry, it updates the status field, which the software inspects to determine whether the hardware has completed the request." This is the CPU and the DMA engine cooperating on the ring buffer: a classic producer/consumer model, in which the three pointers (put, get, and the engine's active-buffer pointer) plus the status bits control the whole process.
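The put/get scheme in the quote can be sketched as follows. One detail it glosses over is how software detects the full condition; a common choice, assumed here, is to sacrifice one slot so that full and empty are distinguishable. The ring size and names are illustrative.

```c
#include <assert.h>

#define RING_SIZE 8   /* illustrative size */

/* Software-side ring state, as in the quoted text: empty when put == get;
 * software itself must detect full. Here "full" leaves one slot unused. */
struct sw_ring {
    unsigned put;  /* next free entry used to queue a DMA request */
    unsigned get;  /* next entry to check for completion */
};

static int ring_empty(const struct sw_ring *r) { return r->put == r->get; }

static int ring_full(const struct sw_ring *r)
{
    return (r->put + 1) % RING_SIZE == r->get;
}

/* Queue one request; returns 0 on success, -1 if the ring is full. */
static int ring_put(struct sw_ring *r)
{
    if (ring_full(r))
        return -1;
    r->put = (r->put + 1) % RING_SIZE;
    return 0;
}

/* Retire one completed request; returns -1 if the ring is empty. */
static int ring_get(struct sw_ring *r)
{
    if (ring_empty(r))
        return -1;
    r->get = (r->get + 1) % RING_SIZE;
    return 0;
}
```

An alternative to the sacrificed slot is to keep an explicit count of in-flight entries; either way, the full test is software's job, exactly as the quote warns.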
Documentation/DMA-mapping.txt
2 Example 1: DMA Buffers of the RTL8139 A/B NIC
The 8139C additionally supports a more sophisticated DMA buffering scheme. The Linux driver is drivers/net/8139too.c.
2.1 Transmit Buffers
The 8139 has four transmit descriptor pairs, Transmit Start Address (TSAD0-3) and Transmit Status (TSD0-3), used in round-robin order. The driver's rtl8139_init_ring fixes the buffer location of each TSADx at initialization time (this appears to cause some inefficiency; see below, there seems to be room for improvement). Consequently, when rtl8139_start_xmit transmits a packet, the skb must first be copied into the TSADx buffer; then the corresponding TSDx is filled in with the size of the packet and the early-transmit threshold, and the OWN bit in the TSD is cleared (which starts the PCI operation).
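The round-robin use of the four descriptor slots can be sketched like this. The counters mirror the cur_tx/dirty_tx bookkeeping style common in Linux NIC drivers, but the code below is a simplified illustration, not an excerpt from 8139too.c.

```c
#include <assert.h>

#define NUM_TX_DESC 4   /* the 8139 has exactly four TSAD/TSD pairs */

/* Free-running counters: their difference is the number of slots the
 * NIC still owns. Names are illustrative. */
struct tx_state {
    unsigned cur_tx;    /* next descriptor slot to use */
    unsigned dirty_tx;  /* oldest slot not yet reclaimed */
};

/* Returns the slot (0..3) for the next packet, or -1 if all four
 * descriptors are still owned by the NIC. */
static int tx_next_slot(struct tx_state *t)
{
    if (t->cur_tx - t->dirty_tx >= NUM_TX_DESC)
        return -1;                      /* all four slots in flight */
    return (int)(t->cur_tx++ % NUM_TX_DESC);
}
```

When a transmit-OK interrupt arrives, the driver advances dirty_tx, freeing slots for reuse in the same fixed order.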
2.2 Receive Buffers
The receive buffer rx_ring is allocated once up front, which means the same problem as above exists: packets must be copied from rx_ring into skbs. rx_ring is used as a circular buffer: "The Rx buffer should be pre-allocated and indicated in RCR before packet reception. All received packets stored in Rx buffer, including Rx header and 4-byte CRC, are double-word aligned. I.e., each received packet is placed at the next available double-word aligned address after the last received packet in Rx buffer."
The process of packet reception:
1. Data received from the line is stored in the receive FIFO.
2. When the Early Receive Threshold is met, data is moved from the FIFO to the Receive Buffer.
3. After the whole packet is moved from the FIFO to the Receive Buffer, the receive packet header (receive status and packet length) is written in front of the packet. CBA is updated to the end of the packet.
4. CMD (BufferEmpty) and ISR (TOK) are set.
5. The ISR routine is called; the driver then clears ISR (TOK) and updates CAPR.
As can be seen, each packet in the buffer consists of two parts: a header (containing the receive status and the packet length) followed by the packet itself. All of this is filled in automatically by the NIC hardware; the header is not part of the on-wire packet and exists only to make the driver's job easier.
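Given that layout (4-byte header, then the packet, then padding up to the next double-word boundary), the driver's advance to the next packet is simple alignment arithmetic. A sketch (8139too.c does the equivalent with its own variable names; `next_rx_offset` here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Given the current offset of a packet's 4-byte header in rx_ring and
 * the packet length reported in that header (which includes the 4-byte
 * CRC), return the offset of the next packet's header: skip header plus
 * data, then round up to the next double-word (4-byte) boundary. */
static uint32_t next_rx_offset(uint32_t cur, uint16_t pkt_len)
{
    return (cur + 4 + pkt_len + 3) & ~3u;
}
```

Because descriptors and data share one contiguous ring, the driver must do this walking and splitting itself, which is precisely the inefficiency the summary at the end of this article points out.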
0038h-0039h  R/W  CAPR  Current Address of Packet Read (C mode only; initial value 0FFF0h)
003Ah-003Bh  R    CBR   Current Buffer Address (C mode only; initial value 0000h; reflects the total received byte count in the Rx buffer)
CBA (the CBR above) is maintained automatically by the NIC, while CAPR is maintained by the ISR; the region the NIC may write into lies between CBA and CAPR. The rtl8139_rx function maintains the CAPR register with the following statement:
RTL_W16 (RxBufPtr, (u16) (cur_rx - 16));
As a result, "The CAPR register points to an offset that is 16-bytes ahead of the newly received packet's packet-header"; the reason for this offset is unclear to me.
http://wiki.osdev.org/RTL8139
Datasheet for the RTL8139C
Datasheet for the RTL8139D (has more information)
Programming guide for the RTL8139
3 Example 2: DMA Buffers of the Intel e100 NIC
The e100 development manual covers the Intel 8255x family of NICs; the Linux driver is drivers/net/e100.c.
The following material comes from Section 6.1 of the manual:
The
shared memory structure is divided into three parts: the Control/Status
Registers (CSR), the Command Block List (CBL), and the Receive Frame Area
(RFA). The CSR physically resides on the LAN controller and can be accessed by
either I/O or memory cycles, while the rest of the memory structures reside in
system (host) memory. The first 8 bytes of the CSR is called the System Control
Block (SCB). The SCB serves as a central communication point for exchanging
control and status information between the host CPU and the 8255x. The host software
controls the state of the Command Unit (CU) and Receive Unit (RU) (for example,
active, suspended or idle) by writing commands to the SCB. The device posts the
status of the CU and RU in the SCB Status word and indicates status changes
with an interrupt. The SCB also holds pointers to a linked list of action
commands called the CBL and a linked list of receive resources called the RFA.
This type of structure is shown in the figure below.
As we can see, the e100 has two kinds of DMA buffers: the receive buffer list RFA, and the command buffer list CBL (which carries transmit commands, configuration commands, and so on).
3.1 Receive Buffers
"The RFA is the list of free receive resources and consists of Receive Frame Descriptors (RFDs)." The RFD is exactly the kind of structure discussed at the end of Section 1:
struct rfd {
	__le16 status;      /* status bits: completion, etc. */
	__le16 command;
	__le32 link;        /* pointer to the next RFD */
	__le32 rbd;
	__le16 actual_size;
	__le16 size;
};
Now let us look at how the RFA list is implemented in Linux. At initialization time, e100_rx_alloc_list allocates nic->params.rfds.count struct rx objects. struct rx is just scaffolding forming a doubly linked list; the sk_buff it points to contains not only the RFD but also the corresponding receive data buffer.
The RFA list itself is wired up in e100_rx_alloc_skb:
static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
{
	if (!(rx->skb = netdev_alloc_skb_ip_align(nic->netdev, RFD_BUF_LEN)))
		return -ENOMEM;

	/* Init, and map the RFD. */
	skb_copy_to_linear_data(rx->skb, &nic->blank_rfd, sizeof(struct rfd));
	...
	if (rx->prev->skb) {
		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
		put_unaligned_le32(rx->dma_addr, &prev_rfd->link); /* chain into the list */
	}
}
This builds an RFA list of the configured size, which from then on is essentially fixed. Next the Receive Unit (RU) must be started. Table 16 of the manual gives the relevant command:
RUC Field | RU Command | SCB General Pointer                         | Added to
1         | RU Start   | Pointer to first RFD in the Receive Frame Area | RU Base
In e100_start_receiver, the following statement accomplishes this:
e100_exec_cmd(nic, ruc_start, rx->dma_addr);
Once the RU is started, packet reception begins. When a complete packet has been received, the DMA engine raises an interrupt and automatically advances to the next RFD. On receiving the interrupt, the driver must push the data-carrying skb up the protocol stack and allocate a fresh skb for that RFD. The interrupt path is e100_intr => e100_poll (NAPI mode) => e100_rx_clean. The driver maintains two pointers; struct nic has the following two members:
rx_to_use: the next buffer the DMA engine will place data into, but which has not yet been given a new skb. Note that the DMA engine's current pointer does not necessarily correspond to rx_to_use.
rx_to_clean: the first buffer that already holds a received frame and is waiting for the driver to clean it.
Packet reception in e100 starts at rx_to_clean and walks the RFDs, stopping at the first RFD whose status bit is not set or once the scheduled amount of work is done. The buffers between rx_to_use and rx_to_clean are frames the driver has already handled: their old skbs have been pushed up to the higher protocol layers, and new skbs must be allocated for them.
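The rx_to_clean walk described above, stop at the first descriptor the hardware has not completed, can be sketched as follows. The RFD_C bit value and function name are illustrative, not taken from e100.c.

```c
#include <assert.h>
#include <stdint.h>

#define RFD_C 0x8000   /* illustrative "complete" bit in the RFD status word */

/* Count how many consecutive descriptors, starting at index `start` in a
 * ring of n status words, have been completed by the RU and can be
 * cleaned. Stops at the first incomplete descriptor. */
static int count_cleanable(const uint16_t *status, int n, int start)
{
    int cleaned = 0;
    for (int i = start; cleaned < n; i = (i + 1) % n, cleaned++)
        if (!(status[i] & RFD_C))
            break;
    return cleaned;
}
```

A real driver would also cap the count by the NAPI budget, which corresponds to the "scheduled amount of work" mentioned above.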
3.2 Transmit Buffers
Packet transmission on the e100 appears to be synchronous (fix me): the driver issues one command and waits for it to complete (see the loop around e100_exec_cmd), wasting quite a bit of CPU time. Ideally the CPU could simply append a transmit descriptor at the tail and then start the CU (Command Unit) to process the whole CB list. In fact that is exactly how e100 reception works, so it is odd that transmission does not do the same.
static int e100_exec_cb(struct nic *nic, struct sk_buff *skb,
	void (*cb_prepare)(struct nic *, struct cb *, struct sk_buff *))
{
	struct cb *cb;
	unsigned long flags;
	int err = 0;

	cb = nic->cb_to_use;
	nic->cb_to_use = cb->next;
	nic->cbs_avail--;
	cb->skb = skb;

	cb_prepare(nic, cb, skb);

	/* Order is important otherwise we'll be in a race with h/w:
	 * set S-bit in current first, then clear S-bit in previous. */
	cb->command |= cpu_to_le16(cb_s);
	wmb();
	cb->prev->command &= cpu_to_le16(~cb_s);

	while (nic->cb_to_send != nic->cb_to_use) {
		if (unlikely(e100_exec_cmd(nic, nic->cuc_cmd,
			nic->cb_to_send->dma_addr))) {
			break;
		} else {
			nic->cuc_cmd = cuc_resume;
			nic->cb_to_send = nic->cb_to_send->next;
		}
	}

	return err;
}
Executing e100_exec_cmd relies on the following behavior: "The 8255x updates the SCB status and clears the SCB command word to indicate that acceptance has completed."
With NAPI, however, the transmit path is still optimized: e100_xmit_prepare arranges for an interrupt only once every 16 transmitted packets (reception may also trigger interrupts), reducing the interaction between NIC and CPU. e100_tx_clean can then reclaim several skbs in one pass.
> The tx interrupt, if used, indicates that a frame has been transmitted
> so can be cleaned from the tx ring. The e100 driver uses NAPI, kicked by
> rx/tx interrupts. When e100_poll() is called via NAPI, it does tx
> cleaning automatically. There is no need to do this ASAP via tx
> interrupts so the driver disables tx interrupts most of the time and
> does the cleaning when the next NAPI poll is processed.
>
> To handle the case of transmitting lots of frames while not receiving
> anything, a tx interrupt is generated every 16 frames. See
> e100_xmit_prepare(). The interrupt handler will kick the NAPI poll which
> will do tx clean processing.
>
> The e100 design minimizes interrupts as much as possible in order to
> maximize packets/sec throughput. An interrupt is used only to kick off
> NAPI polling - interrupts are disabled while in NAPI polled mode.
> Polling continues while there is work to do (tx/rx). When no more frames
> are queued for tx or are being received by rx, the driver re-enables its
> interrupt and goes back to interrupt mode. In other words, interrupts
> are used only to wake up driver tx/rx processing, not per-frame
> processing.
>
> The netif_rx_schedule() function name is badly named - it is actually
> scheduling NAPI polling which can be used for both rx/tx work. The e100
> driver does both rx _and_ tx clean processing in its NAPI poll handler:
> e100_poll(). Other drivers use NAPI for rx only - I've never understood
> why.
> Hope I've helped.
linux协议栈之链路层上的数据传输之二 (Linux protocol stack: link-layer data transmission, part 2)
http://developer.intel.com/design/network/datashts/29736001.pdf
~baker/devices/restricted/notes06/ch17.html
Documentation/networking/e100.txt
4 Example 3: DMA Buffers of the Intel e1000 NIC
As noted above, e100 transmit and receive are asymmetric, but on the e1000 the transmit and receive structures are alike: the Receive Descriptor Queue and the Transmit Descriptor Ring are both circular buffer queues, each with a head and a tail pointer. The hardware works from the head, while the driver inserts at the tail.
From Section 3.2.3 of the manual:
Software adds receive descriptors by writing
the tail pointer with the index of the entry beyond the last valid descriptor.
As packets arrive, they are stored in memory and the head pointer is incremented
by hardware. When the head pointer is equal to the tail pointer, the ring is
empty. Hardware stops storing packets in system memory until software advances
the tail pointer, making more receive buffers available.
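The head/tail convention in the quote can be sketched as follows: software posts descriptors by moving the tail, hardware consumes them by moving the head, and head == tail means nothing is available to hardware. The ring size and names below are illustrative.

```c
#include <assert.h>

#define RING_N 16   /* illustrative ring size */

/* Head/tail descriptor ring as quoted: software writes tail to the index
 * one beyond the last valid descriptor; hardware increments head as it
 * consumes entries; head == tail means the ring is empty for hardware. */
struct hw_ring {
    unsigned head;  /* advanced by hardware */
    unsigned tail;  /* advanced by software */
};

/* Number of descriptors currently available to the hardware. */
static unsigned hw_avail(const struct hw_ring *r)
{
    return (r->tail + RING_N - r->head) % RING_N;
}

/* Hardware consumes one descriptor, if any; returns 1 on success,
 * 0 if the ring is empty and the hardware must stall. */
static int hw_consume(struct hw_ring *r)
{
    if (r->head == r->tail)
        return 0;
    r->head = (r->head + 1) % RING_N;
    return 1;
}
```

Note the symmetry with the software put/get ring in Section 1: head plays the role of the get pointer and tail the put pointer, only here one side of the protocol is implemented in silicon.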
Official material for all Intel NICs.
The following two links contain some development information about the e100 and e1000:
http://oss.sgi.com/projects/netdev/archive/2005-04/msg01822.html
http://blog.gmane.org/gmane.linux.drivers.e1000.devel/month=20050201
5 Summary
Overall, DMA engine design has evolved from simple to complex, with performance steadily improving, and DMA itself has gone from supporting a single buffer to ring buffers. On the RTL8139, descriptors and data are stored contiguously, so the driver must separate the individual packets itself, causing unnecessary computation and one unnecessary copy. The e100's descriptors are organized as a linked list, so packets are cleanly delimited, and descriptors and skbs are allocated dynamically, avoiding the extra copy. Another interesting question is whether the transmit and receive buffers should be symmetric. Fundamentally, reception is somewhat more complex, because the outside world is unpredictable while transmission is under the host's own control; the transmit buffer structures of the RTL8139 and the e100 are both simpler than their receive counterparts, whereas the e1000's transmit structure mirrors its receive structure.