19.2 General Packet Handling - learn007 - ChinaUnix Blog


Category: LINUX

2011-03-22 19:00:46

19.2. General Packet Handling

The rest of this chapter covers general considerations that the kernel must take into account when handling ingress IP packets, such as checksumming and options. Subsequent chapters describe in detail how packets are forwarded, transmitted, and fragmented or defragmented.

19.2.1. Protocol Initialization

The IPv4 protocol is initialized by ip_init, defined in net/ipv4/ip_output.c. Because IPv4 support cannot be removed from the kernel (i.e., it cannot be compiled as a module), there is no ip_uninit function.

Here are the main tasks accomplished by ip_init:

Register the handler for IP packets with the dev_add_pack function (see Chapter 13). This handler is a function named ip_rcv.
Initialize the routing subsystem, including the protocol-independent cache (see Chapter 32).
Initialize the infrastructure used to manage IP peers (see the discussion of long-living IP peer information in Chapter 23).

ip_init is invoked at boot time by inet_init, which takes care of the initialization of all the subsystems related to IPv4, including the L4 protocols.
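The handler registration performed by ip_init can be pictured with a small user-space model. This is only a sketch under stated assumptions: toy_add_pack, toy_deliver, toy_ip_rcv, and toy_ip_init are hypothetical names invented for illustration, and the struct below is a heavily simplified stand-in for the kernel's packet_type, not the real definition.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP 0x0800   /* EtherType value that identifies IPv4 */

struct pkt;               /* opaque stand-in for struct sk_buff */

/* Simplified stand-in for the kernel's struct packet_type (assumption). */
struct packet_type {
    uint16_t type;                 /* L3 protocol identifier */
    int (*func)(struct pkt *p);    /* handler invoked for matching packets */
    struct packet_type *next;
};

static struct packet_type *handlers;   /* list of registered handlers */

/* Toy counterpart of dev_add_pack: push the handler onto the list. */
void toy_add_pack(struct packet_type *pt)
{
    pt->next = handlers;
    handlers = pt;
}

/* Hand the packet to the first handler registered for proto.
 * Returns the handler's result, or -1 if nobody registered for proto. */
int toy_deliver(uint16_t proto, struct pkt *p)
{
    for (struct packet_type *pt = handlers; pt != NULL; pt = pt->next)
        if (pt->type == proto)
            return pt->func(p);
    return -1;
}

/* Toy counterpart of ip_rcv: accepts every packet. */
static int toy_ip_rcv(struct pkt *p)
{
    (void)p;
    return 0;
}

static struct packet_type ip_packet_type = {
    .type = ETH_P_IP,
    .func = toy_ip_rcv,
};

/* Toy counterpart of the registration step in ip_init. */
void toy_ip_init(void)
{
    toy_add_pack(&ip_packet_type);
}
```

Before toy_ip_init runs, toy_deliver finds no handler for ETH_P_IP; afterward, IPv4 packets reach toy_ip_rcv while other protocol identifiers are still unclaimed.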

19.2.2. Interaction with Netfilter

We will not examine the Netfilter firewalling subsystem in this book, but we can examine its main working principles now, particularly its relationship to the aspects of the IPv4 implementation we discuss in this part of the book.

Firewalling essentially hooks into certain places in the network stack code that packets always pass through when the packets or the kernel meet certain conditions. At those points, the firewall allows network administrators to manipulate the contents or disposition of the traffic. Those points in the kernel, as shown in Figure 18-1 in Chapter 18, include:

Packet reception
Packet forwarding (before routing decision)
Packet forwarding (after routing decision)
Packet transmission

The reason why it is useful to distinguish between pre-routing and post-routing will become clearer in Part V.

In each case just listed, the function in charge of the operation is split into two parts, usually called do_something and do_something_finish. (In a few cases, the names are do_something and do_something2.) do_something contains only some sanity checks and maybe some housekeeping. The code that does the real job is in do_something_finish or do_something2. do_something ends by calling the Netfilter function NF_HOOK, passing in the point where the call comes from (for instance, packet reception) and the function to execute if the filtering rules configured by the user with the iptables command do not decide to drop or reject the packet. If there are no rules to apply or they simply indicate "go ahead," the function do_something_finish is executed. Given the following general call:

NF_HOOK(PROTOCOL, HOOK_POSITION_IN_THE_STACK, SKB_BUFFER, IN_DEVICE, OUT_DEVICE, do_something_finish)

the output value of NF_HOOK can be one of the following:

The output value of do_something_finish, when the latter is executed
-EPERM if SKB_BUFFER is dropped because of a filter
-ENOMEM if there was insufficient memory to perform the filtering operation

In this chapter, we do not need to worry about those details. We will assume that no filters are configured and therefore that, at the end of do_something, the call to the Netfilter function will simply execute do_something_finish. We will see the first example at the end of the ip_rcv function.
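The two-stage pattern can be sketched in user space as follows. This is a minimal model, not the kernel's NF_HOOK macro: toy_nf_hook, toy_rcv, toy_rcv_finish, and toy_drop_all are hypothetical names, the filter is reduced to a single optional callback, and only the accept and drop outcomes are modeled (the -ENOMEM case is left out).

```c
#include <errno.h>
#include <stddef.h>

struct pkt { int len; };          /* stand-in for struct sk_buff */

enum verdict { TOY_ACCEPT, TOY_DROP };

typedef enum verdict (*toy_filter_fn)(const struct pkt *p);
typedef int (*toy_okfn)(struct pkt *p);

/* Model of the hook: consult the filter, if one is configured.
 * On drop return -EPERM; otherwise return whatever okfn returns. */
int toy_nf_hook(toy_filter_fn filter, struct pkt *p, toy_okfn okfn)
{
    if (filter != NULL && filter(p) == TOY_DROP)
        return -EPERM;
    return okfn(p);
}

/* Second stage ("do_something_finish"): the real work. Here it just
 * reports the packet length so the caller can see that it ran. */
int toy_rcv_finish(struct pkt *p)
{
    return p->len;
}

/* First stage ("do_something"): sanity checks, then the hook. */
int toy_rcv(struct pkt *p, toy_filter_fn filter)
{
    if (p->len <= 0)
        return -EINVAL;
    return toy_nf_hook(filter, p, toy_rcv_finish);
}

/* Example filter that drops every packet. */
enum verdict toy_drop_all(const struct pkt *p)
{
    (void)p;
    return TOY_DROP;
}
```

With no filter configured, toy_rcv returns the result of toy_rcv_finish, which is the situation this chapter assumes throughout.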

19.2.3. Interaction with the Routing Subsystem

The IP layer needs to interact with the routing table in several places, such as when receiving and when transmitting a packet. I will cover the details in Part VII, where I describe the routing subsystem; for now I'll just briefly describe three of the functions used by the IP layer to consult the routing table:

ip_route_input

Determines the destiny of an input packet. As you can see in Figure 18-1 in Chapter 18, the packet could be delivered locally, forwarded, or dropped.

ip_route_output_flow

Used before transmitting a packet. This function returns both the next hop gateway and the egress device to use.

dst_pmtu

Given a routing table cache entry, returns the associated Path Maximum Transmission Unit (PMTU).

The ip_route_xxx functions, described in detail in Chapters 33 and 35, consult the routing table and base their decisions on a set of fields:

Destination IP address.
Source IP address.
Type of Service (ToS).
Receiving device in the case of reception.
List of allowed transmitting devices.

Among the more complex factors that could influence the decision returned by these functions are the presence of policy routing and the presence of a firewall.

The functions store the result of the routing table lookup in skb->dst. This structure includes several fields, including the input and output function pointers that will be called to complete the reception or the transmission of the packet. The ip_route_xxx functions return a negative value if the lookup fails.

Both ip_route_xxx functions also use a cache to route a stream of packets to the same destination quickly. The destination IP address is the most important criterion for making the decision, and is used as the search key into the cache. But each cache entry also includes several other parameters that distinguish which route is used. For instance, the cache keeps track of each route's PMTU.

19.2.4. Processing Input IP Packets

Chapter 13 showed that the kernel routes traffic at every level to the proper protocol by invoking the handler function registered by that protocol. In that chapter, we also saw how the IP protocol registers its protocol handler ip_rcv, defined in net/ipv4/ip_input.c, with the kernel. We can now start to analyze the path of IP packets inside the kernel network stack, starting with the ip_rcv function.

ip_rcv is a classic case of the two-stage process described in the section "Interaction with Netfilter." Its work consists just of applying sanity checks to the packet and then invoking the Netfilter hook. Most processing will take place in ip_rcv_finish, called from the Netfilter hook.

Here is the prototype of ip_rcv. The third input parameter is not used.

int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)

The netif_receive_skb function sets the pointer to the L3 protocol header (skb->nh) at the end of the L2 header. IP layer functions can therefore safely cast it to an iphdr structure.

Most of the fields of sk_buff are set before the call to ip_rcv, as explained in previous chapters, during the sequence of events that take place from the interrupt notification by an NIC to the invocation of the L3 protocol handler. Figure 19-1 shows the values of some of the sk_buff fields when ip_rcv starts. Note that skb->data, which is usually used to point to the payload, here points to the L3 header.

Figure 19-1. Part of sk_buff data structure at the beginning of ip_rcv

In earlier chapters we saw how the NIC's device driver sets the L3 protocol identifier skb->protocol and the packet type skb->pkt_type. Ethernet drivers, for instance, do that by means of the eth_type_trans function.

skb->pkt_type is set to PACKET_OTHERHOST when the L2 destination address of the frame is different from the address of the receiving interface. Normally those packets are discarded by the NIC itself. However, if the interface has been put into promiscuous mode, it receives all packets regardless of the destination L2 address and passes them up to higher layers. The kernel invokes sniffers that have requested access to all packets. But ip_rcv is not concerned with packets for other addresses and simply drops them:

if (skb->pkt_type == PACKET_OTHERHOST) goto drop;

Note that receiving a packet for a different L2 address is not the same as receiving a packet that should be routed to another system. In the latter case, the packet has the interface's L2 address but an L3 address that is different from that of the current recipient. A router is configured to accept such packets and route them.

skb_share_check checks whether the reference count of the packet is bigger than 1, which means that other parts of the kernel have references to the buffer. As discussed in earlier chapters, sniffers and other users might be interested in packets, so each packet contains a reference count. The netif_receive_skb function, which is the one that calls ip_rcv, increments the reference count before it calls a protocol handler. If the handler sees a reference count bigger than 1, it creates its own copy of the buffer so that it can modify the packet. Any following handlers will receive the original, unchanged buffer. If a copy is needed but memory allocation fails, the packet is dropped.

if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL) {
    IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
    goto out;
}

The job of pskb_may_pull is to make sure that the area pointed to by skb->data contains a block of data at least as big as the IP header, since each IP packet (fragments included) must include a complete IP header. If the condition is met, there is nothing to do. Otherwise, the missing part is copied from the data fragments (if any) stored in skb_shinfo(skb)->frags[]. If this fails, the function terminates with an error. If it succeeds, the function must initialize iph again because pskb_may_pull could change the buffer structure.

Do not confuse data fragments with IP fragments. The use of the skb_shinfo macro is covered elsewhere in the book.

if (!pskb_may_pull(skb, sizeof(struct iphdr)))
    goto inhdr_error;
iph = skb->nh.iph;

Next come some sanity checks on the IP header. The size of a basic IP header is 20 bytes, and since the size stored in the header is expressed in multiples of 32 bits (4 bytes), if its value is smaller than 5 it means there is an error. The second check in the if statement is rather fussy. Currently there are two versions of the IP protocol: IPv4 and IPv6. The if statement makes sure the packet is an IPv4 packet. But because the two protocols are handled by two different functions, the ip_rcv function should never have been called for IPv6 in the first place.

if (iph->ihl < 5 || iph->version != 4) goto inhdr_error;
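The version and header-length test can be tried out on a raw byte buffer. The sketch below is a user-space illustration, not kernel code: toy_ipv4_header_len is a hypothetical helper, and it reads the two 4-bit fields directly from the first header byte instead of going through struct iphdr bit fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the IP header length in bytes as claimed by the IHL field,
 * or -1 if the buffer is shorter than a basic header or fails the
 * version/IHL sanity check. */
int toy_ipv4_header_len(const uint8_t *buf, size_t len)
{
    if (len < 20)                     /* basic header must be present */
        return -1;

    unsigned version = buf[0] >> 4;   /* high nibble of the first byte */
    unsigned ihl = buf[0] & 0x0F;     /* low nibble, in 32-bit words */

    if (ihl < 5 || version != 4)
        return -1;

    return (int)(ihl * 4);            /* convert words to bytes */
}
```

A first byte of 0x45 is the common case (version 4, five words, no options); 0x46 claims one word of options, so the caller still has to verify that 24 bytes are actually available, which is the purpose of the second pskb_may_pull call.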

Now we repeat the same check as before, but this time we use the full IP header size (including the options). If the IP header claims a size of iph->ihl 32-bit words, the packet should be at least iph->ihl*4 bytes long. This check was left until now because the function needs first to make sure the basic header (i.e., the header without options) has not been truncated and that it passes a basic sanity check before reading something from it (ihl in this case).

if (!pskb_may_pull(skb, iph->ihl*4))
    goto inhdr_error;
iph = skb->nh.iph;

After these two protocol consistency checks have been performed, the function needs to compute the checksum and see whether it matches the one carried in the header. If it doesn't, the packet is dropped. The ip_fast_csum routine was introduced earlier in the book.

if (ip_fast_csum((u8 *)iph, iph->ihl) != 0) goto inhdr_error;
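To make the checksum test concrete, here is a portable sketch of the Internet checksum over an IP header. It is not the kernel's ip_fast_csum, which is architecture-specific and optimized; toy_ip_csum is a hypothetical name, and the function simply sums the header as big-endian 16-bit words, folds the carries, and complements the result.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over an IP header of ihl 32-bit words.
 * When the header's checksum field holds a correct value the result
 * is 0; when the field is zeroed, the result is the value to store. */
uint16_t toy_ip_csum(const uint8_t *hdr, unsigned ihl)
{
    uint32_t sum = 0;

    /* Add the header as a sequence of big-endian 16-bit words. */
    for (unsigned i = 0; i < ihl * 4; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

As in the kernel code above, a nonzero result on a received header means the header was corrupted and the packet should be dropped.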

After the checksum, there are two other sanity checks:

Make sure the length of the buffer (i.e., the received packet) is greater than or equal to the length reported in the IP header.
Make sure the size of the packet is at least as large as the IP header size.

{
    __u32 len = ntohs(iph->tot_len);
    if (skb->len < len || len < (iph->ihl<<2))
        goto inhdr_error;

Here we need to explain why those two checks are needed. The first one arises from the fact that the L2 protocols (e.g., Ethernet) can pad out the payload, so there may be extra bytes after the IP payload. (This happens, for instance, when the L2 size of the frame is smaller than the minimum required by the protocol. Ethernet frames have a minimum frame length of 64 bytes.) In such a case, the packet would look bigger than the length reported in the IP header. The different sizes and padding are shown in Figure 19-2.

From the L2 perspective, the payload is the IP header and everything that follows it.

Figure 19-2. L2 padding needed to reach the minimum payload size

The second check derives from the fact that an IP header cannot be fragmented, and that each IP fragment must therefore contain at least an IP header. The reason for the <<2 in the condition is that the size of the header (iph->ihl) is measured in units of 32 bits. This check should fail only in an extremely rare situation. It would mean that the checksum had been computed on a corrupted packet but happened by chance to produce the same checksum as the original packet (i.e., the checksum did not detect the error).
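The two length checks can be restated as a small user-space function. This is a sketch for illustration only: toy_check_lengths is a hypothetical name, and its parameters stand in for skb->len, the host-order value of iph->tot_len, and iph->ihl.

```c
#include <stdint.h>

/* Returns 0 if the lengths are consistent, -1 otherwise.
 *   skb_len: bytes actually received (may include L2 padding)
 *   tot_len: total length from the IP header, in host byte order
 *   ihl:     header length from the IP header, in 32-bit words */
int toy_check_lengths(uint32_t skb_len, uint16_t tot_len, uint8_t ihl)
{
    uint32_t hdr_bytes = (uint32_t)ihl << 2;   /* words to bytes */

    if (skb_len < tot_len)      /* buffer shorter than the header claims */
        return -1;
    if (tot_len < hdr_bytes)    /* packet cannot be smaller than its header */
        return -1;
    return 0;
}
```

For example, a 28-byte IP packet carried in a minimum-size Ethernet frame arrives with 46 bytes of L2 payload; the check passes because extra trailing bytes are allowed, while a buffer of only 20 bytes for the same header fails.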

The IP protocol specification (RFC 791) says that an Internet host must be able to forward a datagram of 68 bytes without having to fragment it: in other words, the L2 protocol must be able to transmit a frame with a payload of at least 68 bytes.

The minimum MTU associated with a route is in fact 68, which comes from RFC 791. Since the IP header can be up to 60 bytes long (20+40) and the minimum fragment length (with the exception of the last one) is 8 bytes, it follows that every IP router must be able to forward an IP packet of 68 bytes without any further fragmentation.

As you can imagine, all of the sanity checks that we have seen so far and that we will see later are very important for the stability of the system. If, by chance, the sk_buff structure was incorrectly initialized, or if the IP header itself was corrupted, the kernel could process packets in a wrong way or could access invalid memory locations, which could indirectly cause a crash.

We said that the L2 protocols could have padded out the packet to reach a specific minimum length. The function pskb_trim_rcsum checks whether that happened and, if it did, trims the packet to the right size with __pskb_trim and invalidates the L4 checksum in case it had been computed by the receiving NIC. __pskb_trim is slightly complex because it may need to deal with fragmented buffers, too (examples of fragmented buffers appear elsewhere in the book).

When the L4 checksum is computed in hardware by the network card, it could include the L2 padding if the card is not smart enough to leave it out. Since here there is no way to know whether that was the case, to be on the safe side, pskb_trim_rcsum simply invalidates the checksum and forces the L4 protocol to recompute it.

    if (pskb_trim_rcsum(skb, len)) {
        IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
        goto drop;
    }
}
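The effect of the trimming step can be modeled with a toy buffer descriptor. The sketch below makes several simplifying assumptions: struct toy_buf and toy_trim_rcsum are hypothetical, the buffer is linear (so the fragmented-buffer handling of __pskb_trim is not modeled), and the hardware checksum state is reduced to a single flag.

```c
/* Toy buffer descriptor: only the fields needed for the illustration. */
struct toy_buf {
    unsigned len;        /* bytes currently held, padding included */
    int hw_csum_valid;   /* 1 if the NIC already verified the L4 checksum */
};

/* Trim the buffer to len bytes if it is longer than that. Because a
 * hardware checksum may have covered the padding, it is invalidated
 * whenever bytes are removed. Always returns 0 in this model. */
int toy_trim_rcsum(struct toy_buf *b, unsigned len)
{
    if (b->len <= len)        /* no padding: nothing to do */
        return 0;

    b->hw_csum_valid = 0;     /* force the L4 protocol to recompute */
    b->len = len;
    return 0;
}
```

A buffer that already has the length announced by the IP header is left untouched, including its checksum state.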

Finally we get to the end of the function. Note that no routing decision or option handling has been done so far; that's the job of ip_rcv_finish. As we anticipated earlier in the chapter, the function ends with a call to the Netfilter subsystem, which more or less can be read in this way:

"skb is the packet that was received from device dev; please check whether the packet is allowed to proceed with its travel, or if it needs changes. Take into consideration that we are asking you this from the NF_IP_PRE_ROUTING point within the network stack (which means the packet was received but no routing decision was taken yet). If you decide not to drop the packet, execute ip_rcv_finish."

return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);

See the earlier section "Interaction with Netfilter" for background information.

19.2.5. The ip_rcv_finish Function

ip_rcv did not do much more than a basic sanity check of the packet. So when ip_rcv_finish is called, it will take care of the main processing, such as:

Deciding whether the packet has to be locally delivered or forwarded. In the second case, it needs to find both the egress device and the next hop.
Parsing and processing some of the IP options. Not all of the options are processed here, however, as we will see when analyzing the forwarding case.

This is the prototype of the ip_rcv_finish function, defined in the same net/ipv4/ip_input.c file as ip_rcv.

static inline int ip_rcv_finish(struct sk_buff *skb)

The skb->nh field was initialized in netif_receive_skb, which came earlier in the receiving path. At that time, the L3 protocol was not yet known, so it was initialized using nh.raw. Now the function can get a pointer to the IP header.

struct net_device *dev = skb->dev;
struct iphdr *iph = skb->nh.iph;

skb->dst may contain information about the route to be taken by the packet to get to its destination. If that information is not known yet, the function asks the routing subsystem where to send the packet, and if the latter says the destination is unreachable, the packet is dropped. There are cases, discussed elsewhere in the book, in which skb->dst is not NULL at this point.

if (skb->dst == NULL) {
    if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))
        goto drop;
}

Then the function updates some statistics that are used by Traffic Control (the Quality of Service, or QoS, layer).

#ifdef CONFIG_NET_CLS_ROUTE
if (skb->dst->tclassid) {
    struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id();
    u32 idx = skb->dst->tclassid;
    st[idx&0xFF].o_packets++;
    st[idx&0xFF].o_bytes += skb->len;
    st[(idx>>16)&0xFF].i_packets++;
    st[(idx>>16)&0xFF].i_bytes += skb->len;
}
#endif
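The index arithmetic in this accounting code is easy to misread, so here is a user-space sketch of it. The names toy_acct and toy_account are hypothetical, and the per-CPU offset (256*smp_processor_id()) is left out: the model uses a single 256-entry table in which the low byte of tclassid selects the entry whose o_ counters are updated and bits 16 to 23 select the entry whose i_ counters are updated.

```c
#include <stdint.h>

/* Simplified counterpart of the kernel's struct ip_rt_acct (assumption). */
struct toy_acct {
    uint32_t o_packets, o_bytes;
    uint32_t i_packets, i_bytes;
};

/* Update the 256-entry table st for one packet of len bytes. */
void toy_account(struct toy_acct st[256], uint32_t tclassid, uint32_t len)
{
    uint32_t out_idx = tclassid & 0xFF;          /* low byte */
    uint32_t in_idx  = (tclassid >> 16) & 0xFF;  /* bits 16..23 */

    st[out_idx].o_packets++;
    st[out_idx].o_bytes += len;
    st[in_idx].i_packets++;
    st[in_idx].i_bytes += len;
}
```

With tclassid equal to 0x00030007 and a 100-byte packet, entry 7 gets its o_ counters bumped and entry 3 gets its i_ counters bumped.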

When the length of the IP header is bigger than 20 bytes (5 x 32 bits) it means there are options to process. skb_cow, whose name comes from the well-known phrase "Copy on Write," is called here to make a copy of the buffer if the latter is shared with someone else. Exclusive ownership of the buffer is needed because we are about to process the options and will probably need to change the IP header.

20 bytes is the length of an IP header without options.

if (iph->ihl > 5) {
    struct ip_options *opt;

    if (skb_cow(skb, skb_headroom(skb))) {
        IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
        goto drop;
    }
    iph = skb->nh.iph;

ip_options_compile is used to interpret the IP options carried in the header. The next section describes its implementation in detail. Right now we are interested in the output of that function. As we saw earlier in the book, skb contains a field called cb that can be used to store private data by whoever manages an sk_buff buffer. In this case, the IP layer uses it to store the result of the IP header option parsing plus some other information, such as fragmentation-related data. The result is stored in a data structure of type struct inet_skb_parm, defined in include/net/ip.h and accessed with the macro IPCB.

If there are any wrong options, the packet is discarded and a special Internet Control Message Protocol (ICMP) message is sent back to the sender to notify the latter about the problem. As we will see when discussing ICMP, these messages contain information about where the error was found in the header, something that could help the sender to understand what happened.

You will see in the next section that when the first input parameter to ip_options_compile is NULL, the output of the parsing process is stored in IPCB(skb)->opt; this explains why the parsed options are retrieved with IPCB.

    if (ip_options_compile(NULL, skb))
        goto inhdr_error;
    opt = &(IPCB(skb)->opt);

Note that ip_options_compile simply checks whether the options are correct and stores them in an ip_options structure inside the private data field skb->cb. The function does not handle any of them. Instead, the upcoming piece of code partially takes care of that.

In case the packet was source routed, the kernel needs to check whether the configuration of the device allows that option to be used. (If you are not familiar with IP source routing, see the book's discussion of IP options.)

I briefly describe the in_device structure and the associated APIs in the section "in_device Structure" elsewhere in the book. If there was no explicit configuration for IP source routing, the option would be allowed by default. If, on the other hand, that option was disabled, the packet is dropped (but no ICMP message is generated). NIPQUAD is a simple macro defined in include/linux/kernel.h that splits a 32-bit variable into four 8-bit components.

    if (opt->srr) {
        struct in_device *in_dev = in_dev_get(dev);
        if (in_dev) {
            if (!IN_DEV_SOURCE_ROUTE(in_dev)) {
                if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit())
                    printk(KERN_INFO "source route option %u.%u.%u.%u -> %u.%u.%u.%u\n",
                           NIPQUAD(iph->saddr), NIPQUAD(iph->daddr));
                in_dev_put(in_dev);
                goto drop;
            }
            in_dev_put(in_dev);
        }
        if (ip_options_rcv_srr(skb))
            goto drop;
    }
}
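What NIPQUAD achieves in the printk call can be reproduced in user space. The sketch below is an illustration under assumptions, not the kernel macro: toy_format_addr is a hypothetical helper that takes an address stored in network byte order (as iph->saddr and iph->daddr are) and prints its four bytes in memory order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an IPv4 address held in network byte order as a dotted quad.
 * Returns the value returned by snprintf. */
int toy_format_addr(uint32_t addr_be, char *out, size_t outlen)
{
    unsigned char b[4];

    /* View the 32-bit value as four bytes in memory order; for a
     * network-order address, b[0] is the most significant octet. */
    memcpy(b, &addr_be, sizeof b);

    return snprintf(out, outlen, "%u.%u.%u.%u",
                    (unsigned)b[0], (unsigned)b[1],
                    (unsigned)b[2], (unsigned)b[3]);
}
```

Because the bytes are read in memory order, the same code works on both little-endian and big-endian hosts as long as the address is kept in network byte order.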

When IP source routing is allowed on the device, the code calls ip_options_rcv_srr to set skb->dst and decide how to handle the packet, which means deciding which device to use to forward the packet toward the next hop in the source route list. Normally, the requested next hop refers to another host, and the function simply sets opt->srr_is_hit to indicate the address has been found. The ip_options_rcv_srr function has to take into account, however, the possibility that the "next hop" may be an interface on the local host. If that happens, the function writes the IP address into the destination IP address of the IP header and goes on to check the next address in the source routing list, if there is one (in the code, this is called a superfast loopback forward). ip_options_rcv_srr keeps browsing the list of next hops in the IP header source routing option block until it finds an IP address that is not local to the host. Normally, there will be no more than one local IP address in that list. However, it is legal to have more than one. In the latter case, going from one next hop to the following one is a no-op, i.e., one more loop inside ip_options_rcv_srr. The srr_is_hit flag is set when the last next hop found by ip_options_rcv_srr is not a local IP address, which means the packet has not reached its final destination and needs to be forwarded.
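The traversal of the source route list can be sketched with a simplified model. Everything here is an assumption made for illustration: toy_rcv_srr, toy_srr_result, and toy_is_local are hypothetical, addresses are plain integers, the option block is an array of hops, and the routing lookup that the real function performs for each hop is replaced by a caller-supplied "is this address local?" predicate.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*toy_is_local_fn)(uint32_t addr);

struct toy_srr_result {
    size_t next;      /* index of the first hop not yet consumed */
    uint32_t daddr;   /* destination address after processing */
    int srr_is_hit;   /* 1 if a non-local next hop was found */
};

/* Walk the hop list starting at index start. Local hops are consumed
 * by copying them into the destination address (the "superfast
 * loopback forward"); the walk stops at the first non-local hop. */
struct toy_srr_result toy_rcv_srr(const uint32_t *hops, size_t nhops,
                                  size_t start, uint32_t daddr,
                                  toy_is_local_fn is_local)
{
    struct toy_srr_result r = { start, daddr, 0 };

    while (r.next < nhops) {
        uint32_t hop = hops[r.next];

        if (!is_local(hop)) {
            r.srr_is_hit = 1;   /* packet still has to be forwarded */
            break;
        }
        r.daddr = hop;          /* local hop: just move on */
        r.next++;
    }
    return r;
}

/* Example predicate: pretend addresses 10 and 11 belong to this host. */
int toy_is_local(uint32_t addr)
{
    return addr == 10 || addr == 11;
}
```

With hops {10, 11, 20, 30}, the two local entries are skipped and the walk stops at 20 with srr_is_hit set; with a list that contains only local addresses, the flag stays clear because the packet has reached its final destination.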

If the packet is to be forwarded, as we will see when discussing forwarding, the initialization of srr_is_hit tells ip_forward_options to take care of the source routing option by adding the necessary data to the IP header. If the packet is being transmitted (that is, if it originated on this host), opt->faddr will be used instead and the opt->srr_is_hit flag will not be used.

The term MARTIANS is used in the previous code to decide whether a parameter value is wrong. The term is not a fanciful choice by the Linux developers but comes from the RFCs themselves.

ip_rcv_finish ends with a call to dst_input, which actually invokes the function stored in the input field of skb->dst. skb->dst was initialized either near the beginning of ip_rcv_finish, or near the end within ip_options_rcv_srr (which is called if the IP source routing option is present in the header). skb->dst->input is set to ip_local_deliver or ip_forward, depending on the destination address of the packet. The call to dst_input therefore completes the processing of the packet (see the earlier section "Interaction with the Routing Subsystem").
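The dispatch through skb->dst->input is an ordinary function-pointer call, which the following user-space sketch illustrates. All names are hypothetical (toy_skb, toy_dst, toy_route_input, toy_dst_input, and so on), and the routing lookup is reduced to a flag that says whether the destination is local.

```c
#include <stddef.h>

struct toy_skb;

/* Stand-in for the routing cache entry: only the input handler. */
struct toy_dst {
    int (*input)(struct toy_skb *skb);
};

struct toy_skb {
    struct toy_dst *dst;
};

enum { TOY_LOCAL = 1, TOY_FORWARD = 2 };

/* Stand-ins for ip_local_deliver and ip_forward. */
int toy_local_deliver(struct toy_skb *skb) { (void)skb; return TOY_LOCAL; }
int toy_forward(struct toy_skb *skb)       { (void)skb; return TOY_FORWARD; }

/* Stand-in for the routing lookup: choose the handler and attach the
 * entry to the buffer. */
void toy_route_input(struct toy_skb *skb, struct toy_dst *entry,
                     int dest_is_local)
{
    entry->input = dest_is_local ? toy_local_deliver : toy_forward;
    skb->dst = entry;
}

/* Stand-in for dst_input: call whatever the lookup selected. */
int toy_dst_input(struct toy_skb *skb)
{
    return skb->dst->input(skb);
}
```

The caller of toy_dst_input does not need to know which path the packet takes; that decision was made once, when the route was looked up.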

The relationship between the call to ip_route_input in ip_rcv_finish and the one in ip_options_rcv_srr is discussed elsewhere in the book.
