为梦而战
Category: LINUX
2015-03-24 10:49:14
vswitchd is the user-space daemon whose core job is to run the ofproto logic. OVS is implemented according to the OpenFlow switch specification. Take L2 forwarding as an example: a traditional switch (including the Linux bridge implementation) looks up the CAM table to find the port for the destination MAC, whereas Open vSwitch takes the incoming skb and checks whether a matching flow exists. If a flow is found, this skb is not the first packet of the flow, and the output port can be found in flow->action. Note that the core idea of SDN is that every packet maps to a flow, and the packet's behavior is given by that flow's actions. Traditional actions are little more than forward, accept, or drop; SDN defines many more: rewriting the skb's contents, changing the packet's path, cloning multiple copies out along different paths, and so on.
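To make the contrast concrete, here is a toy flow-table lookup in C. All types and names are hypothetical sketches of the idea, not OVS code: a hit returns the flow's action and output port; a miss is what would trigger the upcall/flow-creation path described next.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flow key/action types -- a sketch of the idea, not the OVS API. */
enum action { ACT_MISS, ACT_DROP, ACT_FORWARD };

struct flow_key   { uint8_t dst_mac[6]; uint16_t in_port; };
struct flow_entry { struct flow_key key; enum action act; uint16_t out_port; int used; };

#define TABLE_SIZE 8
static struct flow_entry table[TABLE_SIZE];

/* Look up a flow: a hit returns its action; a miss would become an
 * upcall to vswitchd in the real datapath. */
static enum action flow_lookup(const struct flow_key *k, uint16_t *out_port)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].used && !memcmp(&table[i].key, k, sizeof *k)) {
            *out_port = table[i].out_port;
            return table[i].act;
        }
    }
    return ACT_MISS; /* first packet of the flow: a flow must be created */
}

/* Install a flow in the first free slot (no eviction in this sketch). */
static void flow_install(const struct flow_key *k, enum action a, uint16_t port)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].used) {
            table[i].key = *k;
            table[i].act = a;
            table[i].out_port = port;
            table[i].used = 1;
            return;
        }
    }
}
```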
If the skb has no matching flow, it is the first packet of the flow and a flow must be created for it. vswitchd repeatedly checks, in a while loop, whether any ofproto requests have arrived. These may come from ovs-ofctl, or be upcall requests sent by openvswitch.ko over netlink; in most cases they are flow-creation requests triggered by a flow miss, and vswitchd then creates the flow and its actions according to the OpenFlow specification. Let's walk through this process:
Since Open vSwitch models an L2 switch, every packet starts by being received on some port, i.e. ovs_dp_process_received_packet is called. That function first generates a key from the skb via ovs_flow_extract, then calls ovs_flow_tbl_lookup to look up a flow by that key. If no flow is found, it calls ovs_dp_upcall, which sends a dp_upcall_info structure to vswitchd over netlink (via genlmsg_unicast).
vswitchd handles these netlink requests in handle_upcalls. For a flow-table miss it calls handle_miss_upcalls, which in turn calls handle_flow_miss. Let's look at the implementation of handle_miss_upcalls first:
static void
handle_miss_upcalls(struct dpif_backer *backer, struct dpif_upcall *upcalls,
size_t n_upcalls)
{
/* Construct the to-do list.
*
* This just amounts to extracting the flow from each packet and sticking
* the packets that have the same flow in the same "flow_miss" structure so
* that we can process them together. */
hmap_init(&todo);
n_misses = 0;
As the comment says, the loop below iterates over the struct dpif_upcall entries delivered to user space over netlink. Each one contains the missed packet and the flow key generated from it; packets with the same flow key are processed together.
for (upcall = upcalls; upcall < &upcalls[n_upcalls]; upcall++) {
fitness = odp_flow_key_to_flow(upcall->key, upcall->key_len, &flow);
port = odp_port_to_ofport(backer, flow.in_port);
odp_flow_key_to_flow first calls lib/parse_flow_nlattrs to parse upcall->key and upcall->key_len, recording the parsed attributes in the bitmap present_attrs, while the corresponding struct nlattr pointers go into struct nlattr *attrs[]. Then, for each bit set in present_attrs, it reads the corresponding value from upcall->key and stores it in the flow. VLAN parsing is handled specially by parse_8021q_onward.
odp_port_to_ofport converts flow.in_port, the datapath port number, into an OpenFlow port, i.e. a struct ofport_dpif *port.
flow_extract(upcall->packet, flow.skb_priority,
&flow.tunnel, flow.in_port, &miss->flow);
This parses the packet into the flow; the function overlaps with odp_flow_key_to_flow in places.
/* Add other packets to a to-do list. */
hash = flow_hash(&miss->flow, 0);
existing_miss = flow_miss_find(&todo, &miss->flow, hash);
if (!existing_miss) {
hmap_insert(&todo, &miss->hmap_node, hash);
miss->ofproto = ofproto;
miss->key = upcall->key;
miss->key_len = upcall->key_len;
miss->upcall_type = upcall->type;
list_init(&miss->packets);
n_misses++;
} else {
miss = existing_miss;
}
list_push_back(&miss->packets, &upcall->packet->list_node);
}
flow_hash computes the hash of miss->flow, which is then used to look up a struct flow_miss * in the todo hmap. If nothing is found, this is the first flow_miss: it is initialized and inserted into todo. Finally the packet is appended to the flow_miss->packets list. This confirms the earlier point: when multiple upcalls arrive in one batch, packets belonging to the same flow_miss are chained under that flow_miss and then handled together.
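The batching logic above can be sketched as follows. The types here are deliberately simplified assumptions (the real code hashes a full struct flow and chains struct ofpbuf packets under each flow_miss):

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of the batching idea in handle_miss_upcalls: packets whose
 * flow keys compare equal are chained under one "miss" entry so they
 * can be processed together. Types are hypothetical simplifications. */
struct miss {
    uint32_t key;   /* stand-in for the extracted flow key */
    int n_packets;  /* packets chained under this miss */
};

#define MAX_MISSES 16
static struct miss todo[MAX_MISSES];
static int n_misses;

/* flow_miss_find + insert-if-absent, rolled into one helper. */
static struct miss *miss_find_or_add(uint32_t key)
{
    for (int i = 0; i < n_misses; i++) {
        if (todo[i].key == key) {
            return &todo[i];            /* existing_miss */
        }
    }
    struct miss *m = &todo[n_misses++]; /* first packet with this key */
    m->key = key;
    m->n_packets = 0;
    return m;
}

/* Analogue of the upcall loop: group each packet under its miss. */
static void batch_upcalls(const uint32_t *keys, int n)
{
    n_misses = 0;
    for (int i = 0; i < n; i++) {
        miss_find_or_add(keys[i])->n_packets++; /* list_push_back analogue */
    }
}
```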
OVS defines a facet to represent a user-space program's (e.g. vswitchd's) view of a matched flow. The kernel likewise has its own view of the flow; the facet represents the part the two views share, while the differing parts are represented by subfacets. struct subfacet holds the action behavior.
If the flow key computed by the datapath matches exactly the flow key vswitchd computes from the packet, the facet contains a single subfacet. If the datapath's flow key has more members than the one vswitchd derives from the packet, each extra variant becomes a subfacet of its own.
struct subfacet {
/* Owners. */
struct hmap_node hmap_node; /* In struct ofproto_dpif 'subfacets' list. */
struct list list_node; /* In struct facet's 'facets' list. */
struct facet *facet; /* Owning facet. */
/* Key.
*
* To save memory in the common case, 'key' is NULL if 'key_fitness' is
* ODP_FIT_PERFECT, that is, odp_flow_key_from_flow() can accurately
* regenerate the ODP flow key from ->facet->flow. */
enum odp_key_fitness key_fitness;
struct nlattr *key;
int key_len;
long long int used; /* Time last used; time created if not used. */
uint64_t dp_packet_count; /* Last known packet count in the datapath. */
uint64_t dp_byte_count; /* Last known byte count in the datapath. */
/* Datapath actions.
*
* These should be essentially identical for every subfacet in a facet, but
* may differ in trivial ways due to VLAN splinters. */
size_t actions_len; /* Number of bytes in actions[]. */
struct nlattr *actions; /* Datapath actions. */
enum slow_path_reason slow; /* 0 if fast path may be used. */
enum subfacet_path path; /* Installed in datapath? */
}
Now let's look at handle_flow_miss:
/* Handles flow miss 'miss' on 'ofproto'. May add any required datapath
* operations to 'ops', incrementing '*n_ops' for each new op. */
static void
handle_flow_miss(struct ofproto_dpif *ofproto, struct flow_miss *miss,
struct flow_miss_op *ops, size_t *n_ops)
{
struct facet *facet;
uint32_t hash;
/* The caller must ensure that miss->hmap_node.hash contains
* flow_hash(miss->flow, 0). */
hash = miss->hmap_node.hash;
facet = facet_lookup_valid(ofproto, &miss->flow, hash);
This looks up the flow in struct ofproto_dpif *ofproto, the structure representing the datapath. ofproto->facets is a hash map: the hash of the miss flow is computed first, then the hmap_node list for that hash is searched for a matching flow. The comparison is rather brute force: a straight memcmp.
if (!facet) {
struct rule_dpif *rule = rule_dpif_lookup(ofproto, &miss->flow);
if (!flow_miss_should_make_facet(ofproto, miss, hash)) {
handle_flow_miss_without_facet(miss, rule, ops, n_ops);
At this point it is judged not worth creating a flow facet; for some trivial traffic, creating a facet would cost more overhead than it saves.
return;
}
facet = facet_create(rule, &miss->flow, hash);
OK then, create a facet for this flow.
}
handle_flow_miss_with_facet(miss, facet, ops, n_ops);
}
struct flow_miss is a wrapper around a flow, used to speed up batch processing of missed flows. In most cases the facet does get created:
2012-10-26T07:15:43Z|22522|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 1, src mac 0:16:3e:83:0:1, dst mac 0:25:9e:5d:62:53
2012-10-26T07:15:43Z|22529|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 2, src mac 0:25:9e:5d:62:53, dst mac 0:16:3e:83:0:1
As the logs show, one bidirectional conversation creates two flows, along with their facets.
Next, look at handle_flow_miss_with_facet, which calls subfacet_make_actions to generate the actions. That function first calls action_xlate_ctx_init to initialize an action_xlate_ctx structure, defined as follows:
struct action_xlate_ctx {
/* action_xlate_ctx_init() initializes these members. */
/* The ofproto. */
struct ofproto_dpif *ofproto;
/* Flow to which the OpenFlow actions apply. xlate_actions() will modify
* this flow when actions change header fields. */
struct flow flow;
/* The packet corresponding to 'flow', or a null pointer if we are
* revalidating without a packet to refer to. */
const struct ofpbuf *packet;
/* Should OFPP_NORMAL update the MAC learning table? Should "learn"
* actions update the flow table?
*
* We want to update these tables if we are actually processing a packet,
* or if we are accounting for packets that the datapath has processed, but
* not if we are just revalidating. */
bool may_learn;
/* The rule that we are currently translating, or NULL. */
struct rule_dpif *rule;
/* Union of the set of TCP flags seen so far in this flow. (Used only by
* NXAST_FIN_TIMEOUT. Set to zero to avoid updating rules'
* timeouts.) */
uint8_t tcp_flags;
/* xlate_actions() initializes and uses these members. The client might want
* to look at them after it returns. */
struct ofpbuf *odp_actions; /* Datapath actions. */
tag_type tags; /* Tags associated with actions. */
enum slow_path_reason slow; /* 0 if fast path may be used. */
bool has_learn; /* Actions include NXAST_LEARN? */
bool has_normal; /* Actions output to OFPP_NORMAL? */
bool has_fin_timeout; /* Actions include NXAST_FIN_TIMEOUT? */
uint16_t nf_output_iface; /* Output interface index for NetFlow. */
mirror_mask_t mirrors; /* Bitmap of associated mirrors. */
/* xlate_actions() initializes and uses these members, but the client has no
* reason to look at them. */
int recurse; /* Recursion level, via xlate_table_action. */
bool max_resubmit_trigger; /* Recursed too deeply during translation. */
struct flow base_flow; /* Flow at the last commit. */
uint32_t orig_skb_priority; /* Priority when packet arrived. */
uint8_t table_id; /* OpenFlow table ID where flow was found. */
uint32_t sflow_n_outputs; /* Number of output ports. */
uint16_t sflow_odp_port; /* Output port for composing sFlow action. */
uint16_t user_cookie_offset;/* Used for user_action_cookie fixup. */
bool exit; /* No further actions should be processed. */
struct flow orig_flow; /* Copy of original flow. */
};
Then xlate_actions is called. OpenFlow 1.0 defines the following actions:
enum ofp10_action_type {
OFPAT10_OUTPUT, /* Output to switch port. */
OFPAT10_SET_VLAN_VID, /* Set the 802.1q VLAN id. */
OFPAT10_SET_VLAN_PCP, /* Set the 802.1q priority. */
OFPAT10_STRIP_VLAN, /* Strip the 802.1q header. */
OFPAT10_SET_DL_SRC, /* Ethernet source address. */
OFPAT10_SET_DL_DST, /* Ethernet destination address. */
OFPAT10_SET_NW_SRC, /* IP source address. */
OFPAT10_SET_NW_DST, /* IP destination address. */
OFPAT10_SET_NW_TOS, /* IP ToS (DSCP field, 6 bits). */
OFPAT10_SET_TP_SRC, /* TCP/UDP source port. */
OFPAT10_SET_TP_DST, /* TCP/UDP destination port. */
OFPAT10_ENQUEUE, /* Output to queue. */
OFPAT10_VENDOR = 0xffff
};
Each action type carries its own argument structure, e.g.:
/* Action structure for OFPAT10_SET_VLAN_VID. */
struct ofp_action_vlan_vid {
ovs_be16 type; /* OFPAT10_SET_VLAN_VID. */
ovs_be16 len; /* Length is 8. */
ovs_be16 vlan_vid; /* VLAN id. */
uint8_t pad[2];
};
/* Action structure for OFPAT10_SET_VLAN_PCP. */
struct ofp_action_vlan_pcp {
ovs_be16 type; /* OFPAT10_SET_VLAN_PCP. */
ovs_be16 len; /* Length is 8. */
uint8_t vlan_pcp; /* VLAN priority. */
uint8_t pad[3];
};
union ofp_action {
ovs_be16 type;
struct ofp_action_header header;
struct ofp_action_vendor_header vendor;
struct ofp_action_output output;
struct ofp_action_vlan_vid vlan_vid;
struct ofp_action_vlan_pcp vlan_pcp;
struct ofp_action_nw_addr nw_addr;
struct ofp_action_nw_tos nw_tos;
struct ofp_action_tp_port tp_port;
};
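Because every OpenFlow 1.0 action begins with the common 16-bit type and length fields (see struct ofp_action_header in the union above), a buffer of actions can be walked header by header and dispatched on the type. A minimal sketch of that walk; the counting "handler" is a placeholder, not real translation logic:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Common prefix of every OpenFlow 1.0 action: the length field counts
 * the whole action, header included, so it also advances the cursor. */
struct action_header {
    uint16_t type;
    uint16_t len;   /* total length, including this header */
};

/* Walk a raw buffer of actions and count those of a given type,
 * stopping at the first malformed header. */
static int count_actions_of_type(const uint8_t *buf, size_t buf_len, uint16_t want)
{
    size_t ofs = 0;
    int n = 0;
    while (ofs + sizeof(struct action_header) <= buf_len) {
        struct action_header h;
        memcpy(&h, buf + ofs, sizeof h);
        if (h.len < sizeof h || ofs + h.len > buf_len) {
            break;  /* malformed action: bail out */
        }
        if (h.type == want) {
            n++;    /* a real dispatcher would switch (h.type) here */
        }
        ofs += h.len;
    }
    return n;
}
```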
do_xlate_actions takes an array of struct ofp_action and performs a different operation for each action, e.g.:
case OFPUTIL_OFPAT10_OUTPUT:
xlate_output_action(ctx, &ia->output);
break;
case OFPUTIL_OFPAT10_SET_VLAN_VID:
ctx->flow.vlan_tci &= ~htons(VLAN_VID_MASK);
ctx->flow.vlan_tci |= ia->vlan_vid.vlan_vid | htons(VLAN_CFI);
break;
case OFPUTIL_OFPAT10_SET_VLAN_PCP:
ctx->flow.vlan_tci &= ~htons(VLAN_PCP_MASK);
ctx->flow.vlan_tci |= htons(
(ia->vlan_pcp.vlan_pcp << VLAN_PCP_SHIFT) | VLAN_CFI);
break;
case OFPUTIL_OFPAT10_STRIP_VLAN:
ctx->flow.vlan_tci = htons(0);
break;
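The VLAN cases above are just bit operations on the 802.1Q TCI field (PCP in bits 15-13, CFI in bit 12, VID in bits 11-0). A small self-contained sketch, using host byte order for readability (the real code operates on the big-endian vlan_tci via htons):

```c
#include <assert.h>
#include <stdint.h>

/* 802.1Q TCI layout, mirroring the SET_VLAN_VID / SET_VLAN_PCP /
 * STRIP_VLAN cases above, in host byte order for clarity. */
#define VLAN_VID_MASK  0x0fff
#define VLAN_PCP_MASK  0xe000
#define VLAN_PCP_SHIFT 13
#define VLAN_CFI       0x1000

/* OFPAT10_SET_VLAN_VID: replace the VID bits, set CFI. */
static uint16_t set_vlan_vid(uint16_t tci, uint16_t vid)
{
    tci &= ~VLAN_VID_MASK;
    return tci | (vid & VLAN_VID_MASK) | VLAN_CFI;
}

/* OFPAT10_SET_VLAN_PCP: replace the priority bits, set CFI. */
static uint16_t set_vlan_pcp(uint16_t tci, uint8_t pcp)
{
    tci &= ~VLAN_PCP_MASK;
    return tci | (uint16_t)(pcp << VLAN_PCP_SHIFT) | VLAN_CFI;
}

/* OFPAT10_STRIP_VLAN: clear the whole TCI. */
static uint16_t strip_vlan(uint16_t tci)
{
    (void)tci;
    return 0;
}
```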
For forwarding, the most important function is xlate_output_action, which calls xlate_output_action__. The port passed in is either a datapath port index or one of the control values defined in enum ofp_port:
enum ofp_port {
/* Maximum number of physical switch ports. */
OFPP_MAX = 0xff00,
/* Fake output "ports". */
OFPP_IN_PORT = 0xfff8, /* Send the packet out the input port. This
virtual port must be explicitly used
in order to send back out of the input
port. */
OFPP_TABLE = 0xfff9, /* Perform actions in flow table.
NB: This can only be the destination
port for packet-out messages. */
OFPP_NORMAL = 0xfffa, /* Process with normal L2/L3 switching. */
OFPP_FLOOD = 0xfffb, /* All physical ports except input port and
those disabled by STP. */
OFPP_ALL = 0xfffc, /* All physical ports except input port. */
OFPP_CONTROLLER = 0xfffd, /* Send to controller. */
OFPP_LOCAL = 0xfffe, /* Local openflow "port". */
OFPP_NONE = 0xffff /* Not associated with a physical port. */
};
In xlate_output_action__ the common case is OFPP_NORMAL, which calls xlate_normal. That in turn calls mac_learning_lookup to find the packet's output port in the MAC table, then calls output_normal, which ultimately calls compose_output_action:
static void
compose_output_action__(struct action_xlate_ctx *ctx, uint16_t ofp_port,
                        bool check_stp)
{
const struct ofport_dpif *ofport = get_ofp_port(ctx->ofproto, ofp_port);
uint16_t odp_port = ofp_port_to_odp_port(ofp_port);
ovs_be16 flow_vlan_tci = ctx->flow.vlan_tci;
uint8_t flow_nw_tos = ctx->flow.nw_tos;
uint16_t out_port;
...
out_port = vsp_realdev_to_vlandev(ctx->ofproto, odp_port,
ctx->flow.vlan_tci);
if (out_port != odp_port) {
ctx->flow.vlan_tci = htons(0);
}
commit_odp_actions(&ctx->flow, &ctx->base_flow, ctx->odp_actions);
nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port);
ctx->sflow_odp_port = odp_port;
ctx->sflow_n_outputs++;
ctx->nf_output_iface = ofp_port;
ctx->flow.vlan_tci = flow_vlan_tci;
ctx->flow.nw_tos = flow_nw_tos;
}
commit_odp_actions encodes all the accumulated actions into nlattr format in ctx->odp_actions; the subsequent nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port) then appends the packet's output port. At that point the flow's action list is essentially complete.
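The nlattr encoding that nl_msg_put_u32 performs boils down to appending a 4-byte type/length header plus a payload padded to a 4-byte boundary. A simplified sketch of that wire format; the buffer handling and the attribute type value are assumptions, not the real ofpbuf API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Netlink attribute header: total length (header + payload) and type. */
struct nlattr_hdr {
    uint16_t nla_len;
    uint16_t nla_type;
};

/* Netlink attributes are padded to 4-byte boundaries. */
#define NLA_ALIGN(n) (((n) + 3) & ~3u)

/* Append one u32 attribute to 'buf' at offset 'ofs'; returns the new
 * offset, i.e. where the next attribute would start. */
static size_t put_u32_attr(uint8_t *buf, size_t ofs, uint16_t type, uint32_t value)
{
    struct nlattr_hdr hdr;
    hdr.nla_len = sizeof hdr + sizeof value;  /* 8 bytes total */
    hdr.nla_type = type;
    memcpy(buf + ofs, &hdr, sizeof hdr);
    memcpy(buf + ofs + sizeof hdr, &value, sizeof value);
    return ofs + NLA_ALIGN(hdr.nla_len);
}
```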
Next, let's discuss the CAM table inside vswitchd; the code lives in lib/mac-learning.h and lib/mac-learning.c.
vswitchd maintains an internal MAC/port CAM table whose MAC entries age out after 300 seconds. The table defines the notion of a flooding VLAN: if a VLAN is set to flood, no addresses are learned on it and all forwarding within that VLAN is done by flooding.
/* A MAC learning table entry. */
struct mac_entry {
struct hmap_node hmap_node; /* Node in a mac_learning hmap. */
struct list lru_node; /* Element in 'lrus' list. */
time_t expires; /* Expiration time. */
time_t grat_arp_lock; /* Gratuitous ARP lock expiration time. */
uint8_t mac[ETH_ADDR_LEN]; /* Known MAC address. */
uint16_t vlan; /* VLAN tag. */
tag_type tag; /* Tag for this learning entry. */
/* Learned port. */
union {
void *p;
int i;
} port;
};
/* MAC learning table. */
struct mac_learning {
    struct hmap table;          /* Learning table: hmap of mac_entry, linked
                                 * through hmap_node. */
    struct list lrus;           /* In-use entries, least recently used at the
                                 * front, most recently used at the back;
                                 * mac_entry linked through lru_node. */
    uint32_t secret;            /* Secret for randomizing hash table. */
    unsigned long *flood_vlans; /* Bitmap of learning disabled VLANs. */
    unsigned int idle_time;     /* Max age before deleting an entry. */
};
static uint32_t
mac_table_hash(const struct mac_learning *ml, const uint8_t mac[ETH_ADDR_LEN],
uint16_t vlan)
{
unsigned int mac1 = get_unaligned_u32((uint32_t *) mac);
unsigned int mac2 = get_unaligned_u16((uint16_t *) (mac + 4));
return hash_3words(mac1, mac2 | (vlan << 16), ml->secret);
}
A mac_entry's hash is computed by hash_3words from the mac_learning's secret, the VLAN, and the MAC address together.
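The scheme can be sketched as follows: fold the 6-byte MAC into two words, then combine them with the VLAN and the per-table secret. The mixing function below is a simple stand-in with odd-constant multiplies, not OVS's actual hash_3words:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in mixer: each odd multiplier is invertible mod 2^32, so
 * changing any one input word always changes the result. Not the
 * real hash_3words, just an illustration of the combining step. */
static uint32_t mix_3words(uint32_t a, uint32_t b, uint32_t c)
{
    return (a * 0x9e3779b1u) ^ (b * 0x85ebca77u) ^ (c * 0xc2b2ae3du);
}

/* Analogue of mac_table_hash: mac[0..3] and mac[4..5] become two words,
 * the VLAN is packed into the upper half of the second word. memcpy
 * plays the role of get_unaligned_u32/get_unaligned_u16. */
static uint32_t mac_vlan_hash(const uint8_t mac[6], uint16_t vlan, uint32_t secret)
{
    uint32_t mac1 = 0, mac2 = 0;
    memcpy(&mac1, mac, 4);
    memcpy(&mac2, mac + 4, 2);
    return mix_3words(mac1, mac2 | ((uint32_t)vlan << 16), secret);
}
```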
mac_entry_lookup: check whether a mac_entry already exists for a given MAC address and VLAN.
get_lru: return the first mac_entry on the LRU list.
mac_learning_create/mac_learning_destroy: create/destroy the mac_learning table.
mac_learning_may_learn: returns true if the VLAN is not a flooding VLAN and the MAC address is not a multicast address.
mac_learning_insert: insert a mac_entry into the mac_learning table. It first uses mac_entry_lookup to check whether an entry for the (mac, vlan) pair exists; if not, and the table already holds MAC_MAX entries, the oldest entry is evicted, then a new mac_entry is created and inserted into the CAM table.
mac_learning_lookup: call mac_entry_lookup to find the entry for a MAC address in a given VLAN in the CAM table.
mac_learning_run: periodically age out mac_entry structures that have expired.
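Putting the pieces together, the learn/lookup/age behavior described above can be sketched with a toy table. The structures are hypothetical simplifications: the real table uses an hmap plus an LRU list and evicts the least recently used entry when MAC_MAX is reached, whereas this sketch simply refuses to learn when full.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy MAC learning table: entries keyed by (mac, vlan), a 300-second
 * idle time, and lookups that ignore expired entries. */
#define MAC_IDLE_TIME 300
#define MAC_MAX 4

struct entry {
    uint8_t mac[6];
    uint16_t vlan;
    int port;
    long expires;   /* absolute expiration time */
    int used;
};

static struct entry cam[MAC_MAX];

/* Learn or refresh a (mac, vlan) -> port mapping at time 'now'. */
static void mac_learn(const uint8_t mac[6], uint16_t vlan, int port, long now)
{
    struct entry *free_e = NULL;
    for (int i = 0; i < MAC_MAX; i++) {
        struct entry *e = &cam[i];
        if (e->used && e->vlan == vlan && !memcmp(e->mac, mac, 6)) {
            e->port = port;                  /* refresh existing entry */
            e->expires = now + MAC_IDLE_TIME;
            return;
        }
        if (!e->used && !free_e) {
            free_e = e;
        }
    }
    if (free_e) {   /* when full, a real table would evict the LRU entry */
        memcpy(free_e->mac, mac, 6);
        free_e->vlan = vlan;
        free_e->port = port;
        free_e->expires = now + MAC_IDLE_TIME;
        free_e->used = 1;
    }
}

/* Return the learned port, or -1 (unknown: would be flooded). */
static int mac_lookup(const uint8_t mac[6], uint16_t vlan, long now)
{
    for (int i = 0; i < MAC_MAX; i++) {
        struct entry *e = &cam[i];
        if (e->used && e->vlan == vlan && !memcmp(e->mac, mac, 6)
            && now < e->expires) {
            return e->port;
        }
    }
    return -1;
}
```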