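In DPDK network functions, rte_mbuf plays a role similar to sk_buff in kernel networking: it is the interface between the network driver and the protocol stack. The memory for rte_mbufs is created when the application creates its mbuf pool. That process is covered in the article "DPDK rte_mempool创建与使用" (creating and using an rte_mempool) and is not repeated here. The public APIs for creating a pktmbuf pool are: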
struct rte_mempool *
rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
        unsigned int cache_size, uint16_t priv_size,
        uint16_t data_room_size, int socket_id,
        const char *ops_name);

struct rte_mempool *
rte_pktmbuf_pool_create(const char *name, unsigned int n, unsigned int cache_size,
        uint16_t priv_size, uint16_t data_room_size, int socket_id);
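The difference between rte_pktmbuf_pool_create_by_ops() and rte_pktmbuf_pool_create() is that the former takes the name of the rte_mempool_ops to use. If the application does not implement its own rte_mempool_ops, and no ops name is given through the EAL option --mbuf-pool-ops-name at initialization, a pool created with rte_pktmbuf_pool_create() uses ring_mp_mc by default (multi-producer, multi-consumer). As the two prototypes show, creating a pktmbuf pool requires: the mempool name, the number of mbufs, the per-lcore local cache size, the size of the private data in each mbuf, the size of the data room in each mbuf, and the socket_id to allocate from. The application has to estimate the number of mbufs from its own traffic model at initialization so that both the network device drivers and the application have enough; otherwise mbuf allocation may fail at run time.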
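The estimate mentioned above can be sketched as a worst-case count of the mbufs that may be in use at the same time. This is only one possible heuristic, not a DPDK API: the struct, its field names and the formula are assumptions made for illustration, and a real application may need extra margin for clones, reassembly queues and similar consumers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizing inputs; none of these names come from DPDK. */
struct pool_demand {
    uint32_t nb_ports;    /* ports served by this pool */
    uint32_t rx_queues;   /* Rx queues per port */
    uint32_t tx_queues;   /* Tx queues per port */
    uint32_t rx_desc;     /* descriptors per Rx queue */
    uint32_t tx_desc;     /* descriptors per Tx queue */
    uint32_t nb_lcores;   /* lcores that allocate from the pool */
    uint32_t cache_size;  /* per-lcore mempool cache size */
    uint32_t burst;       /* packets one lcore may hold while processing */
};

/* Worst-case number of mbufs that can be in use at once. */
static uint32_t
estimate_nb_mbufs(const struct pool_demand *d)
{
    uint32_t in_rings, in_lcores;

    assert(d != NULL);
    /* Every Rx descriptor holds an mbuf; Tx descriptors may hold unsent ones. */
    in_rings = d->nb_ports *
        (d->rx_queues * d->rx_desc + d->tx_queues * d->tx_desc);
    /* Each lcore can keep a full local cache plus one burst in flight. */
    in_lcores = d->nb_lcores * (d->cache_size + d->burst);

    return in_rings + in_lcores;
}
```

The result would be passed as the n argument when creating the pool; rounding it up leaves headroom for consumers the model does not count.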
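The following code from rte_pktmbuf_pool_create_by_ops() shows that the memory of each rte_mbuf element consists of three parts: the rte_mbuf structure, the private data, and the data room. The data room in turn consists of the headroom and the packet data area. The headroom size is controlled by the RTE_PKTMBUF_HEADROOM macro, which defaults to 128 bytes and can be changed if needed; this space is available to the application during packet processing.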
elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size + (unsigned)data_room_size;
memset(&mbp_priv, 0, sizeof(mbp_priv));
mbp_priv.mbuf_data_room_size = data_room_size;
mbp_priv.mbuf_priv_size = priv_size;
Combined with the following code from the pktmbuf initialization function rte_pktmbuf_init(), this gives the memory layout of an rte_mbuf.
priv_size = rte_pktmbuf_priv_size(mp);
mbuf_size = sizeof(struct rte_mbuf) + priv_size;
buf_len = rte_pktmbuf_data_room_size(mp);

memset(m, 0, mbuf_size);
/* start of buffer is after mbuf structure and priv data */
m->priv_size = priv_size;
m->buf_addr = (char *)m + mbuf_size;
m->buf_iova = rte_mempool_virt2iova(m) + mbuf_size;
m->buf_len = (uint16_t)buf_len;

/* keep some headroom between start of buffer and data */
m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
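The resulting rte_mbuf memory layout (the original figure is not reproduced here) is, in address order: the rte_mbuf structure, priv_size bytes of private data, and then the data room, which starts with the headroom and continues with the packet data area.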
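As a stand-in for the figure, the layout can be written down as offsets from the start of one mempool element. The sketch below only mirrors the arithmetic of the two snippets above; mbuf_layout, mbuf_layout_of and DEMO_HEADROOM are hypothetical names, and the header size is passed in as a parameter rather than taken from the real sizeof(struct rte_mbuf).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEADROOM 128u  /* stands in for the default RTE_PKTMBUF_HEADROOM */

/* Byte offsets of each region, measured from the start of the element. */
struct mbuf_layout {
    size_t priv_off;   /* private data starts right after the mbuf header */
    size_t buf_off;    /* where buf_addr points: start of the data room */
    size_t data_off;   /* headroom reserved at the start of the data room */
    size_t elt_size;   /* total size of one mempool element */
};

static struct mbuf_layout
mbuf_layout_of(size_t mbuf_hdr_size, uint16_t priv_size, uint16_t data_room_size)
{
    struct mbuf_layout l;

    l.priv_off = mbuf_hdr_size;
    l.buf_off = mbuf_hdr_size + priv_size;
    /* Same clamp as RTE_MIN(RTE_PKTMBUF_HEADROOM, buf_len). */
    l.data_off = data_room_size < DEMO_HEADROOM ? data_room_size : DEMO_HEADROOM;
    l.elt_size = mbuf_hdr_size + priv_size + data_room_size;
    assert(l.buf_off + data_room_size == l.elt_size);
    return l;
}
```

Assuming a 128-byte header (two 64-byte cache lines), 16 bytes of private data and a 2176-byte data room, the data room starts at offset 144, packet data starts 128 bytes into it, and the element is 2320 bytes long.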
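Note: the rte_mbuf structure spans two cache lines and has many members. They are not all covered here, though some fields are mentioned below. rte_mbuf has a next field: if a packet fits in a single mbuf, next is NULL; if a packet is made up of several mbufs, next points to the following mbuf, so the segments form a singly linked chain (the original figure is not reproduced here).
When sending and receiving packets, the per-packet information is generally stored in the first mbuf.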
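mbuf allocation
There are several mbuf allocation interfaces, each serving a different purpose.
1. struct rte_mbuf *rte_mbuf_raw_alloc(struct rte_mempool *mp)
Allocates an uninitialized mbuf from the mempool mp. It is typically used in a network driver's Rx path, where the driver is responsible for initializing every field that has to be initialized. For example, the Kunpeng 920 hns3 NIC driver allocates the mbufs of an Rx queue during queue initialization as follows: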
static int
hns3_alloc_rx_queue_mbufs(struct hns3_hw *hw, struct hns3_rx_queue *rxq)
{
    struct rte_mbuf *mbuf;
    uint64_t dma_addr;
    uint16_t i;

    for (i = 0; i < rxq->nb_rx_desc; i++) {
        mbuf = rte_mbuf_raw_alloc(rxq->mb_pool);
        if (unlikely(mbuf == NULL)) {
            hns3_err(hw, "Failed to allocate RXD[%u] for rx queue!",
                     i);
            hns3_rx_queue_release_mbufs(rxq);
            return -ENOMEM;
        }

        rte_mbuf_refcnt_set(mbuf, 1);
        mbuf->next = NULL;
        mbuf->data_off = RTE_PKTMBUF_HEADROOM;
        mbuf->nb_segs = 1;
        mbuf->port = rxq->port_id;

        rxq->sw_ring[i].mbuf = mbuf;
        dma_addr = rte_cpu_to_le_64(rte_mbuf_data_iova_default(mbuf));
        rxq->rx_ring[i].addr = dma_addr;
        rxq->rx_ring[i].rx.bd_base_info = 0;
    }

    return 0;
}
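2. struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
Allocates a new mbuf from the mempool mp with some fields already initialized to default values, such as a packet length of 0 and nb_segs of 1. This API is typically used by applications to build packets: after obtaining the mbuf, the application still has to add the headers and payload and make sure that data_len, pkt_len and related fields reflect what was written (rte_pktmbuf_append() and rte_pktmbuf_prepend() update both length fields themselves). For example, from examples/ptpclient/ptpclient.c: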
created_pkt = rte_pktmbuf_alloc(mbuf_pool);
pkt_size = sizeof(struct rte_ether_hdr) +
    sizeof(struct delay_req_msg);

if (rte_pktmbuf_append(created_pkt, pkt_size) == NULL) {
    rte_pktmbuf_free(created_pkt);
    return;
}
created_pkt->data_len = pkt_size;
created_pkt->pkt_len = pkt_size;
eth_hdr = rte_pktmbuf_mtod(created_pkt, struct rte_ether_hdr *);
rte_ether_addr_copy(&eth_addr, &eth_hdr->src_addr);

/* Set multicast address 01-1B-19-00-00-00. */
rte_ether_addr_copy(&eth_multicast, &eth_hdr->dst_addr);

eth_hdr->ether_type = htons(PTP_PROTOCOL);
req_msg = rte_pktmbuf_mtod_offset(created_pkt,
    struct delay_req_msg *, sizeof(struct
    rte_ether_hdr));

req_msg->hdr.seq_id = htons(ptp_data->seqID_SYNC);
req_msg->hdr.msg_type = DELAY_REQ;
req_msg->hdr.ver = 2;
req_msg->hdr.control = 1;
req_msg->hdr.log_message_interval = 127;
req_msg->hdr.message_length =
    htons(sizeof(struct delay_req_msg));
req_msg->hdr.domain_number = ptp_hdr->domain_number;
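3. int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool, struct rte_mbuf **mbufs, unsigned count)
For performance reasons mbufs are often allocated in batches, which is what this bulk interface is for. It allocates count mbufs from the mempool pool, and every mbuf is reset to default values (the same state as an mbuf returned by rte_pktmbuf_alloc()). It is used both in PMDs (for example in ena_populate_rx_queue() in the ena driver) and in applications.
4. int rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
Gets n objects, here mbufs, from the mempool mp; rte_pktmbuf_alloc_bulk() calls it internally. The mbufs obtained this way are in the same state as those from rte_mbuf_raw_alloc(): raw and uninitialized. PMDs and applications also use it as a performance optimization. For example, the bulk allocation used in the Rx function of the Kunpeng hns3 PMD: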
static inline struct rte_mbuf *
hns3_rx_alloc_buffer(struct hns3_rx_queue *rxq)
{
    int ret;

    if (likely(rxq->bulk_mbuf_num > 0))
        return rxq->bulk_mbuf[--rxq->bulk_mbuf_num];

    ret = rte_mempool_get_bulk(rxq->mb_pool, (void **)rxq->bulk_mbuf,
                               HNS3_BULK_ALLOC_MBUF_NUM);
    if (likely(ret == 0)) {
        rxq->bulk_mbuf_num = HNS3_BULK_ALLOC_MBUF_NUM;
        return rxq->bulk_mbuf[--rxq->bulk_mbuf_num];
    } else
        return rte_mbuf_raw_alloc(rxq->mb_pool);
}
Another example is the packet transmit code in txonly.c of testpmd:
static void
pkt_burst_transmit(struct fwd_stream *fs)
{
    struct rte_mbuf *pkts_burst[MAX_PKT_BURST];

    ...
    if (rte_mempool_get_bulk(mbp, (void **)pkts_burst,
                             nb_pkt_per_burst) == 0) {
        for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
            if (unlikely(!pkt_burst_prepare(pkts_burst[nb_pkt], mbp,
                                            &eth_hdr, vlan_tci,
                                            vlan_tci_outer,
                                            ol_flags,
                                            nb_pkt, fs))) {
                rte_mempool_put_bulk(mbp,
                                     (void **)&pkts_burst[nb_pkt],
                                     nb_pkt_per_burst - nb_pkt);
                break;
            }
        }
    } else {
        for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
            pkt = rte_mbuf_raw_alloc(mbp);
            if (pkt == NULL)
                break;
            if (unlikely(!pkt_burst_prepare(pkt, mbp, &eth_hdr,
                                            vlan_tci,
                                            vlan_tci_outer,
                                            ol_flags,
                                            nb_pkt, fs))) {
                rte_pktmbuf_free(pkt);
                break;
            }
            pkts_burst[nb_pkt] = pkt;
        }
    }

    if (nb_pkt == 0)
        return;

    nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_pkt);
    ...
}
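mbuf release
Because of the mbuf clone operation described below, two kinds of mbuf come up when releasing: direct mbufs and indirect mbufs. A direct mbuf is the original mbuf, the one that holds the packet data in its own buffer. An indirect mbuf is cloned from a direct mbuf: its buf_iova and buf_addr are the same as those of the direct mbuf, so the packet data of the indirect mbuf is the packet data held by the direct mbuf. The rest of the mbuf header information is the same in both, except that the indirect mbuf has its own reference count (refcnt) of 1.
1. void rte_mbuf_raw_free(struct rte_mbuf *m)
Returns an mbuf to the mempool it belongs to. The caller must ensure that refcnt=1, next=NULL and nb_segs=1. It does not support freeing an indirect mbuf or an mbuf with an attached external buffer. The macros used to tell direct, cloned, external-buffer and pinned-external-buffer mbufs apart are defined as follows: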
#define RTE_MBUF_DIRECT(mb) \
    (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
#define RTE_MBUF_CLONED(mb) ((mb)->ol_flags & RTE_MBUF_F_INDIRECT)
#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & RTE_MBUF_F_EXTERNAL)
#define RTE_MBUF_HAS_PINNED_EXTBUF(mb) \
    (rte_pktmbuf_priv_flags(mb->pool) & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
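2. void rte_pktmbuf_free(struct rte_mbuf *m)
Frees an mbuf chain back to the mempool; if the mbuf has multiple segments, every segment is returned to its own mempool. Both direct and indirect mbufs can be freed this way.
3. void rte_pktmbuf_free_seg(struct rte_mbuf *m)
Frees a single mbuf segment. For a multi-segment mbuf, use rte_pktmbuf_free() instead.
4. void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
Frees a batch of mbufs; for multi-segment mbufs, all segments are returned to the mempool as well. Each segment goes through the same per-segment release logic as rte_pktmbuf_free(), so indirect mbufs are handled here too.
5. void rte_mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table, unsigned int n)
Puts several mbufs straight back into the given mempool in one call; the caller must ensure that every mbuf being released came from that mempool.
The mbufs released this way must be direct mbufs; putting back an indirect mbuf may cause the application to misbehave.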
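mbuf copy and clone
An mbuf clone is a shallow copy. The interface is:
struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
Allocates mbufs from the mempool mp to clone the target mbuf md, which may be an indirect mbuf, a direct mbuf, or an mbuf with an external buffer. Internally, the clone relies on one key function:
void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m). It attaches the mbuf mi to the mbuf m. Apart from having its own reference count of 1, mi ends up with the same data fields as m, and its buf_addr and buf_iova point at the packet data held by m. When m has no external buffer, the reference count of the direct mbuf that actually embeds the data, located through rte_mbuf_from_indirect(m), is incremented by 1; this also covers the case where m is itself an indirect mbuf.
rte_pktmbuf_attach() is implemented as follows: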
static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
{
    RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
               rte_mbuf_refcnt_read(mi) == 1);

    if (RTE_MBUF_HAS_EXTBUF(m)) {
        rte_mbuf_ext_refcnt_update(m->shinfo, 1);
        mi->ol_flags = m->ol_flags;
        mi->shinfo = m->shinfo;
    } else {
        /* if m is not direct, get the mbuf that embeds the data */
        rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
        mi->priv_size = m->priv_size;
        mi->ol_flags = m->ol_flags | RTE_MBUF_F_INDIRECT;
    }

    __rte_pktmbuf_copy_hdr(mi, m);

    mi->data_off = m->data_off;
    mi->data_len = m->data_len;
    mi->buf_iova = m->buf_iova;
    mi->buf_addr = m->buf_addr;
    mi->buf_len = m->buf_len;

    mi->next = NULL;
    mi->pkt_len = mi->data_len;
    mi->nb_segs = 1;

    __rte_mbuf_sanity_check(mi, 1);
    __rte_mbuf_sanity_check(m, 0);
}
In the branch without an external buffer, the attached mbuf (mi) is flagged with RTE_MBUF_F_INDIRECT; such an mbuf is called an indirect mbuf.
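The reference counting behind attach and release can be illustrated with a small model. This is a deliberately simplified sketch, not DPDK code: toy_mbuf, toy_attach and toy_free are hypothetical names, external buffers and segment chains are left out, and the detach step does not restore the indirect mbuf's own buffer pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct rte_mbuf; only the fields the model needs. */
struct toy_mbuf {
    uint16_t refcnt;
    int indirect;             /* models the RTE_MBUF_F_INDIRECT flag */
    struct toy_mbuf *owner;   /* direct mbuf that embeds the data */
    const uint8_t *buf_addr;  /* data pointer, shared after attach */
    int in_pool;              /* set to 1 once returned to the pool */
};

/* Attach mi to the data of m (non-external-buffer branch only). */
static void
toy_attach(struct toy_mbuf *mi, struct toy_mbuf *m)
{
    struct toy_mbuf *direct = m->indirect ? m->owner : m;

    assert(!mi->indirect && mi->refcnt == 1);
    direct->refcnt++;         /* the data owner gains one reference */
    mi->indirect = 1;
    mi->owner = direct;
    mi->buf_addr = m->buf_addr;
}

/* Drop one reference; an indirect mbuf also releases its data owner. */
static void
toy_free(struct toy_mbuf *m)
{
    if (--m->refcnt > 0)
        return;               /* still referenced elsewhere */
    if (m->indirect) {
        struct toy_mbuf *direct = m->owner;

        m->indirect = 0;      /* detach before going back to the pool */
        m->owner = NULL;
        m->buf_addr = NULL;
        toy_free(direct);
    }
    m->refcnt = 1;            /* objects sit in the pool with refcnt 1 */
    m->in_pool = 1;
}
```

In this model, cloning a clone still raises the count on the original direct mbuf, and the data buffer only returns to the pool after the last mbuf referring to it has been freed, in whatever order the frees happen.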
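An mbuf copy is a deep copy. The interface is:
struct rte_mbuf *rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp, uint32_t off, uint32_t len)
Allocates mbufs from the mempool mp and copies the target mbuf m into them; m may be an indirect mbuf, a direct mbuf, or an mbuf with an external buffer. off is the offset in the packet data at which copying starts, and len is the number of bytes to copy. If len exceeds the amount of packet data available, the function adjusts it internally.
rte_pktmbuf_copy() supports multi-segment mbufs, but the mbuf private data is not copied. If the source is an indirect mbuf or an mbuf with an external buffer, those flags are cleared from ol_flags in the copy.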
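mbuf decapsulation operations
These operations are mainly used by applications or user-space protocol stacks when processing packets.
Getting the start of the packet data in an mbuf: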
#define rte_pktmbuf_mtod_offset(m, t, o) \
    ((t)(void *)((char *)(m)->buf_addr + (m)->data_off + (o)))
#define rte_pktmbuf_mtod(m, t) rte_pktmbuf_mtod_offset(m, t, 0)
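m->data_off is the size of the headroom, so m->buf_addr + m->data_off is where the packet data starts in the mbuf. For example, to get the address of the IP header of an IPv4 UDP packet:
ip_hdr = rte_pktmbuf_mtod_offset(pkt, struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
Offsetting the start of the packet by the size of one Ethernet header gives the address of the IP header.
The address of the UDP header is obtained in the same way:
udp_hdr = rte_pktmbuf_mtod_offset(pkt, struct rte_udp_hdr *,
    sizeof(struct rte_ether_hdr) + sizeof(struct rte_ipv4_hdr));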
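The fixed offset used for the UDP header above assumes an untagged Ethernet frame and an IPv4 header without options. A more defensive variant, sketched here on a toy structure rather than the real rte_mbuf, reads the IPv4 header length field and checks the segment length first. toy_mbuf, toy_mtod_offset and toy_udp_hdr are hypothetical names; only the pointer arithmetic is taken from rte_pktmbuf_mtod_offset().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN  14u   /* Ethernet header without VLAN tags */
#define IPV4_MIN_HDR 20u   /* IPv4 header with no options */
#define UDP_HDR_LEN   8u

struct toy_mbuf {
    uint8_t *buf_addr;     /* start of the data room */
    uint16_t data_off;     /* headroom in front of the packet */
    uint16_t data_len;     /* bytes of packet data in this segment */
};

/* Same arithmetic as rte_pktmbuf_mtod_offset(), on the toy struct. */
#define toy_mtod_offset(m, t, o) \
    ((t)(void *)((m)->buf_addr + (m)->data_off + (o)))

/* Return the UDP header, or NULL when the segment is too short. */
static uint8_t *
toy_udp_hdr(const struct toy_mbuf *m)
{
    const uint8_t *ip;
    uint32_t ihl;

    assert(m != NULL);
    if (m->data_len < ETH_HDR_LEN + IPV4_MIN_HDR)
        return NULL;
    ip = toy_mtod_offset(m, const uint8_t *, ETH_HDR_LEN);
    ihl = (uint32_t)(ip[0] & 0x0f) * 4;   /* IHL counts 32-bit words */
    if (ihl < IPV4_MIN_HDR || m->data_len < ETH_HDR_LEN + ihl + UDP_HDR_LEN)
        return NULL;
    return toy_mtod_offset(m, uint8_t *, ETH_HDR_LEN + ihl);
}
```

With a 20-byte IPv4 header this yields the same offset of 34 bytes as the fixed-size computation; with IP options present the UDP header moves further back, which the sizeof-based version would miss.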
The interface for getting the size of the headroom in an mbuf is uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m). It simply returns m->data_off.
The interface for getting the size of the tailroom in an mbuf is:
static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
{
    __rte_mbuf_sanity_check(m, 0);
    return (uint16_t)(m->buf_len - rte_pktmbuf_headroom(m) -
                      m->data_len);
}
In other words, the tailroom is the total buffer size minus the headroom and minus data_len.
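A worked instance of this arithmetic, on a toy structure with only the three fields involved (toy_mbuf, toy_headroom and toy_tailroom are illustrative names, not DPDK symbols):

```c
#include <assert.h>
#include <stdint.h>

struct toy_mbuf {
    uint16_t buf_len;    /* size of the data room */
    uint16_t data_off;   /* headroom */
    uint16_t data_len;   /* packet bytes stored in this segment */
};

static uint16_t
toy_headroom(const struct toy_mbuf *m)
{
    return m->data_off;
}

static uint16_t
toy_tailroom(const struct toy_mbuf *m)
{
    /* The three regions must fit inside the data room. */
    assert(m->data_off + m->data_len <= m->buf_len);
    return (uint16_t)(m->buf_len - toy_headroom(m) - m->data_len);
}
```

For a 2176-byte data room with the default 128-byte headroom and a 1500-byte packet, the tailroom is 2176 - 128 - 1500 = 548 bytes; headroom, data and tailroom always add up to buf_len.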
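Getting the length of the packet in an mbuf: #define rte_pktmbuf_pkt_len(m) ((m)->pkt_len)
Getting the length of the data in the current mbuf: #define rte_pktmbuf_data_len(m) ((m)->data_len)
The two lengths differ: m->pkt_len is the length of the whole packet, while m->data_len is the length of the packet data held in the current mbuf. When a packet consists of a single mbuf, they are equal. When a packet consists of several mbufs, the pkt_len of the first mbuf equals the sum of the data_len of every mbuf in the chain.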
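The relationship between the two lengths can be expressed as a consistency check over a segment chain. The sketch uses a toy segment type with hypothetical names (toy_seg, toy_lengths_consistent); it also checks nb_segs, which, like pkt_len, is only meaningful in the first segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_seg {
    uint16_t data_len;      /* bytes held by this segment */
    uint32_t pkt_len;       /* whole-packet length, valid in the head only */
    uint16_t nb_segs;       /* segment count, valid in the head only */
    struct toy_seg *next;   /* NULL for the last segment */
};

/* Return 1 when the head's pkt_len and nb_segs match the chain, else 0. */
static int
toy_lengths_consistent(const struct toy_seg *head)
{
    const struct toy_seg *s;
    uint32_t sum = 0;
    uint16_t segs = 0;

    assert(head != NULL);
    for (s = head; s != NULL; s = s->next) {
        sum += s->data_len;
        segs++;
    }
    return sum == head->pkt_len && segs == head->nb_segs;
}
```

A three-segment packet with 1000, 1000 and 500 bytes per segment is consistent only if the head reports pkt_len 2500 and nb_segs 3.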
Extending the start of the packet data by len bytes into the headroom: rte_pktmbuf_prepend(m, len)
static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
                                        uint16_t len)
{
    __rte_mbuf_sanity_check(m, 1);

    if (unlikely(len > rte_pktmbuf_headroom(m)))
        return NULL;

    /* NB: elaborating the subtraction like this instead of using
     * -= allows us to ensure the result type is uint16_t
     * avoiding compiler warnings on gcc 8.1 at least */
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    m->pkt_len = (m->pkt_len + len);

    return (char *)m->buf_addr + m->data_off;
}
Extending the packet data by len bytes into the tailroom: rte_pktmbuf_append(m, len)
static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
{
    void *tail;
    struct rte_mbuf *m_last;

    __rte_mbuf_sanity_check(m, 1);

    m_last = rte_pktmbuf_lastseg(m);
    if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
        return NULL;

    tail = (char *)m_last->buf_addr + m_last->data_off + m_last->data_len;
    m_last->data_len = (uint16_t)(m_last->data_len + len);
    m->pkt_len = (m->pkt_len + len);
    return (char*) tail;
}
This is also used in PMDs: for example, when a packet to be sent is too short, it has to be padded up to a length the hardware supports, otherwise a hardware error may be triggered.
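The padding case can be sketched on a single-segment toy mbuf. Everything here is illustrative: toy_mbuf, toy_append and toy_pad_short_frame are hypothetical names, and the 60-byte minimum (a 64-byte Ethernet frame without the 4-byte FCS) is an assumption; the actual minimum depends on the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN_FRAME_LEN 60u   /* assumed minimum frame length, FCS excluded */

struct toy_mbuf {
    uint8_t *buf_addr;
    uint16_t buf_len;
    uint16_t data_off;
    uint16_t data_len;
    uint32_t pkt_len;
};

/* Single-segment version of the append logic shown above. */
static uint8_t *
toy_append(struct toy_mbuf *m, uint16_t len)
{
    uint16_t tailroom;
    uint8_t *tail;

    assert(m->data_off + m->data_len <= m->buf_len);
    tailroom = (uint16_t)(m->buf_len - m->data_off - m->data_len);
    if (len > tailroom)
        return NULL;
    tail = m->buf_addr + m->data_off + m->data_len;
    m->data_len = (uint16_t)(m->data_len + len);
    m->pkt_len += len;
    return tail;
}

/* Zero-pad a short frame; returns 0 on success, -1 if there is no room. */
static int
toy_pad_short_frame(struct toy_mbuf *m)
{
    uint16_t pad_len;
    uint8_t *pad;

    if (m->pkt_len >= MIN_FRAME_LEN)
        return 0;
    pad_len = (uint16_t)(MIN_FRAME_LEN - m->pkt_len);
    pad = toy_append(m, pad_len);
    if (pad == NULL)
        return -1;
    memset(pad, 0, pad_len);   /* append does not clear the new bytes */
    return 0;
}
```

The explicit memset matters because appending only moves the length fields; whatever was left in the tailroom from an earlier use of the buffer would otherwise be sent on the wire.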
The functions above extend the packet data into the headroom and tailroom. The rte_mbuf library also exposes interfaces that remove packet content from either end.
Removing len bytes of packet data from the start of the packet: rte_pktmbuf_adj(m, len)
static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
{
    __rte_mbuf_sanity_check(m, 1);

    if (unlikely(len > m->data_len))
        return NULL;

    /* NB: elaborating the addition like this instead of using
     * += allows us to ensure the result type is uint16_t
     * avoiding compiler warnings on gcc 8.1 at least */
    m->data_len = (uint16_t)(m->data_len - len);
    m->data_off = (uint16_t)(m->data_off + len);
    m->pkt_len = (m->pkt_len - len);
    return (char *)m->buf_addr + m->data_off;
}
Removing len bytes of packet data from the end of the packet: rte_pktmbuf_trim(m, len)
static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
{
    struct rte_mbuf *m_last;

    __rte_mbuf_sanity_check(m, 1);

    m_last = rte_pktmbuf_lastseg(m);
    if (unlikely(len > m_last->data_len))
        return -1;

    m_last->data_len = (uint16_t)(m_last->data_len - len);
    m->pkt_len = (m->pkt_len - len);
    return 0;
}
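A typical decapsulation step removes a header from the front and a trailer from the back of the same packet. The sketch below models both operations on a single-segment toy mbuf and combines them so that either both succeed or nothing changes. toy_mbuf, toy_adj, toy_trim and toy_strip are hypothetical names, not DPDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_mbuf {
    uint16_t data_off;
    uint16_t data_len;
    uint32_t pkt_len;
};

/* Remove len bytes from the front (single segment). */
static int
toy_adj(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_len)
        return -1;
    m->data_len = (uint16_t)(m->data_len - len);
    m->data_off = (uint16_t)(m->data_off + len);
    m->pkt_len -= len;
    return 0;
}

/* Remove len bytes from the back (single segment). */
static int
toy_trim(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_len)
        return -1;
    m->data_len = (uint16_t)(m->data_len - len);
    m->pkt_len -= len;
    return 0;
}

/* Strip a header and a trailer, or leave the mbuf untouched on failure. */
static int
toy_strip(struct toy_mbuf *m, uint16_t hdr_len, uint16_t trailer_len)
{
    int ret;

    assert(m != NULL);
    /* Check the combined length first so a failure changes nothing. */
    if ((uint32_t)hdr_len + trailer_len > m->data_len)
        return -1;
    ret = toy_adj(m, hdr_len);
    if (ret == 0)
        ret = toy_trim(m, trailer_len);
    return ret;
}
```

Note that removing data from the front moves data_off forward, so the headroom grows, while removing data from the back only shortens the lengths and grows the tailroom.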
Chaining one mbuf onto another: rte_pktmbuf_chain(m1, m2)
Chaining multi-segment mbufs is supported.
static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
{
    struct rte_mbuf *cur_tail;

    /* Check for number-of-segments-overflow */
    if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
        return -EOVERFLOW;

    /* Chain 'tail' onto the old tail */
    cur_tail = rte_pktmbuf_lastseg(head);
    cur_tail->next = tail;

    /* accumulate number of segments and total length.
     * NB: elaborating the addition like this instead of using
     * -= allows us to ensure the result type is uint16_t
     * avoiding compiler warnings on gcc 8.1 at least */
    head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
    head->pkt_len += tail->pkt_len;

    /* pkt_len is only set in the head */
    tail->pkt_len = tail->data_len;

    return 0;
}
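rte_mbuf read operations
Reading packet content from an rte_mbuf:
const void *rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off, uint32_t len, void *buf)
Reads len bytes of packet content starting at offset off from the beginning of the packet data. If the requested range is contiguous within one segment, a pointer into the mbuf data is returned directly; otherwise the bytes are copied into buf and buf is returned. NULL is returned if the requested range goes beyond the end of the packet.
Dumping the header information and packet content of an mbuf to a file is also supported:
void rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
dump_len: the number of bytes of packet data to dump.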