2007-02-14 07:04:51
With the emergence of Napster in the fall of 1999, peer-to-peer (P2P) applications and their user base have grown rapidly in the Internet community. Given the popularity of P2P and the bandwidth it consumes, there is a growing need to identify P2P users within network traffic.
In this paper the author will propose a new method based on traffic behavior that helps identify P2P users, and even helps to distinguish what type of P2P applications are being used.
To perform port based analysis, administrators just need to observe the network traffic and check whether any connection records use these ports. A match may indicate P2P activity. Port based analysis is almost the only choice for network administrators who don't have special software or hardware (such as an IDS) to monitor traffic.
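As a rough illustration of port matching, the sketch below checks flow records against a small table of default ports. The `Flow` record layout and the function name are hypothetical, and the port table holds only a few commonly cited defaults; it is an assumption for illustration, not the list the article refers to.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Flow(NamedTuple):
    """Hypothetical connection record, e.g. exported from a router."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str  # "tcp" or "udp"


# Illustrative table of commonly cited default ports (an assumption,
# not an exhaustive or authoritative list).
DEFAULT_P2P_PORTS = {
    ("tcp", 1214): "FastTrack/Kazaa",
    ("tcp", 4662): "eDonkey2000",
    ("udp", 4672): "eMule",
    ("tcp", 6346): "Gnutella",
    ("tcp", 6881): "BitTorrent",
}


def match_default_ports(flows: Iterable[Flow]) -> List[Tuple[Flow, str]]:
    """Return (flow, application) pairs whose source or destination
    port appears in the default-port table."""
    hits = []
    for flow in flows:
        for port in (flow.src_port, flow.dst_port):
            app = DEFAULT_P2P_PORTS.get((flow.proto, port))
            if app is not None:
                hits.append((flow, app))
                break  # one hit per flow is enough
    return hits
```

A match here is only a hint: any application can be configured to use one of these ports, and a P2P client can be moved off them.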
Port matching is very simple in practice, but its limitations are obvious. Most P2P applications allow users to change the default port numbers by manually selecting whatever port(s) they like. Additionally, many newer P2P applications are more inclined to use random ports, thus making the ports unpredictable. There is also a trend for P2P applications to disguise their traffic by using well-known application ports such as port 80. All these issues make port based analysis less effective.
With this approach, an application or piece of equipment monitors traffic passing through the network and inspects the data payload of the packets according to some previously defined P2P application signatures. Many of today's commercial and open source P2P application identification solutions are based on this approach, including the L7-filter, Cisco's PDML, Juniper's netscreen-IDP, Alteon Application Switches, Microsoft common application signatures, and NetScout. Each of them runs regular expression matches against the application layer data to determine whether a particular P2P application is being used.
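A minimal sketch of signature matching on payload bytes follows. The two patterns are written for illustration from the publicly documented Gnutella and BitTorrent handshakes; they are not the signatures shipped by any of the products named above, and the function name is hypothetical.

```python
import re
from typing import Optional

# Illustrative signatures only (assumptions for this sketch):
# - Gnutella connections open with the text "GNUTELLA CONNECT/<version>".
# - BitTorrent handshakes open with byte 0x13 followed by
#   the string "BitTorrent protocol".
SIGNATURES = {
    "Gnutella": re.compile(rb"^GNUTELLA CONNECT/\d\.\d"),
    "BitTorrent": re.compile(rb"^\x13BitTorrent protocol"),
}


def classify_payload(payload: bytes) -> Optional[str]:
    """Return the name of the first signature matching the start of
    the payload, or None when nothing matches."""
    for app, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return app
    return None
```

Because the decision depends only on the payload, changing the port does not affect the match; encrypting the payload does.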
Because protocol analysis focuses on the packet payload and raises alerts only on a definite match, any client-side tricks that P2P applications use to avoid detection, such as non-default or dynamic ports, will fail. The result of this approach is normally more accurate and believable, but it still has some shortcomings. Here are some points to remember with protocol analysis of P2P networks:
Most importantly, if your organization cannot afford the special appliances or applications that perform protocol analysis, is port matching your only alternative? Fortunately, the answer is no. An approach based on traffic behavior patterns proves to be both functional and cost-effective.
Although network traffic information is still coarse to some degree, there is valuable information inside the traffic and useful patterns can be uncovered. Looking at host UDP sessions is one good example of this.
What exactly does it mean to look at a UDP connection pattern, and how can it help us? Before answering these questions, let's review the first popular P2P application, Napster.
The network structure of Napster has an Achilles' heel: it is highly dependent on the static central server. If the central server is down, the network will collapse. This was shown by the actions of the recording industry, which forced the original Napster to be shut down.
The Napster case illustrates the vulnerability of a centralized network structure and greatly influenced subsequent P2P applications. For legal, security, scalability, anonymity and some other reasons, more and more P2P applications nowadays work in a totally or partially decentralized network structure, or are moving in that direction. Major P2P file-sharing networks and protocols, such as Edonkey2k, FastTrack, Gnutella, Gnutella2, Overnet, and Kad, all use this concept.
Here the author must make it clear that Bittorrent is not a general purpose P2P network although it is a popular P2P application. It still needs tracker servers; while the network structure of Bittorrent is partially decentralized, the technique discussed in this article can't be used to identify Bittorrent users.
Decentralized means a network structure with no dedicated central index servers. It is a trend for P2P evolution. Today, there are many P2P camps using their own network and protocol, but normally their network structures are totally or partially decentralized. Some P2P applications such as EMule and Edonkey support fully decentralized protocols such as Kademlia, which needs no servers at all. And as a partially decentralized model, hybrid decentralized networks have won broad support from various P2P applications and are thus recognized as the most popular P2P network model.
In a hybrid decentralized network, there are still central servers, but they are no longer dedicated and static. Instead, some peers with more resources (CPU, disk, bandwidth, and active time) automatically take over the central indexing server functions; these peers are called ultrapeers (supernodes). Each of them is elected from the normal peers and serves a group of normal peers. They communicate with each other to form the backbone of the hybrid decentralized network. New ultrapeers are continuously added when appropriate peers join the network. At the same time, ultrapeers are removed when they leave the network.
In order to join the network, a peer must find a way to connect with one or a few of the live ultrapeers. It gets the ultrapeer list by some means such as a bootstrap list stored in the program or a download from a special web site. After connecting to a proper ultrapeer, apart from the normal file transfer work, the P2P application must interact with the P2P network to stay connected and function well in it: uploading information to the server, checking the status of the ultrapeer to which it is connected, getting the most current list of available ultrapeers, comparing the available ultrapeers, actively switching to a better ultrapeer, searching for files, probing the status of file suppliers, storing available ultrapeers for future use, and so on. In short, besides the real file transfer traffic itself, peers need to send out many control packets (probe, inform and some other packets) to many different hosts to keep up with the changing network environment in real time. This is the first key element of our traffic behavior identification: peers need to send out many control packets to interact with the decentralized network during their lifetime.
Why do they select UDP? UDP is simple, effective and low-cost. It does not need to guarantee packet delivery, establish a connection, or maintain connection state. All these features make UDP fit for fast delivery of data to many destinations, which is just what P2P applications need. Inspecting different P2P applications carefully, you will find that most modern decentralized P2P applications adopt a similar network behavior. When they start up, they create one or several UDP sockets to listen on, and then communicate with a large number of outside addresses during their lifetime by using these UDP ports to assist their interaction in the P2P world. This is the second key element of our traffic behavior identification: peers keep using one or several UDP ports to make connections to fulfill the control work.
Now, let's turn to a popular P2P application, Edonkey2000, to see how it can be identified.
Edonkey2000 UDP traffic example
The following is a trace file of Edonkey's outgoing UDP traffic. The output shown here is sanitized and abridged; it is only a fraction of the captured traffic. In fact, for this example there were 390 records in just two minutes. For example purposes, the source address is replaced with x and the first octet of the destination address is replaced with y.
11:24:19.650034 IP x.10810 > y.34.233.22.8613: UDP, length: 25
11:24:19.666047 IP x.2587 > y.138.230.251.4246: UDP, length: 6
11:24:19.666091 IP x.10810 > y.127.115.17.4197: UDP, length: 25
11:24:19.681433 IP x.10810 > y.76.27.4.4175: UDP, length: 25
11:24:19.681473 IP x.2587 > y.28.31.240.4865: UDP, length: 6
11:24:19.696907 IP x.2587 > y.162.178.102.4265: UDP, length: 6
......
11:24:20.946921 IP x.2587 > y.250.47.34.4665: UDP, length: 6
11:24:20.962509 IP x.2587 > y.152.93.254.4665: UDP, length: 6
11:24:20.978275 IP x.2587 > y.28.31.241.5065: UDP, length: 6
11:24:20.993871 IP x.2587 > y.135.32.97.580: UDP, length: 6
11:24:21.009621 IP x.2587 > y.149.102.1.4246: UDP, length: 6
11:24:29.681224 IP x.10810 > y.32.97.189.5312: UDP, length: 4
11:24:29.696903 IP x.10810 > y.10.34.181.7638: UDP, length: 4
11:24:29.716503 IP x.10810 > y.26.234.251.12632: UDP, length: 4
......
11:26:20.291874 IP x.10810 > y.19.149.0.21438: UDP, length: 19
From the output, we can see that all traffic comes from two source ports, UDP 2587 and UDP 10810 (these ports are randomly selected by Edonkey, so the port numbers will differ from host to host). The destination IP addresses are diverse. In fact, Edonkey uses one port to send server status requests to the Edonkey servers, and another port for connection setup, IP queries, searches, publishing and some other work.
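The observation above, a few fixed source ports talking to many destinations, can be checked mechanically. The sketch below parses text lines in the same layout as the trace and counts distinct destination addresses per source port. The regular expression is written against the sanitized lines shown in this article (with x and y placeholders), so it is an assumption about the input format rather than a general tcpdump parser, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Matches lines such as:
#   11:24:19.650034 IP x.10810 > y.34.233.22.8613: UDP, length: 25
LINE_RE = re.compile(
    r"^(?P<time>\S+) IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): UDP, length:? (?P<length>\d+)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one trace line; return None for anything that does not
    match (for example the '......' elision markers)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": match["time"],
        "src": match["src"],
        "sport": int(match["sport"]),
        "dst": match["dst"],
        "dport": int(match["dport"]),
        "length": int(match["length"]),
    }


def destinations_per_source_port(lines: Iterable[str]) -> Dict[int, int]:
    """Count distinct destination addresses seen per source port."""
    seen = defaultdict(set)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            seen[record["sport"]].add(record["dst"])
    return {port: len(dsts) for port, dsts in seen.items()}
```

Run over the full capture, a host behaving like the one in this example would show only a couple of source ports, each with a large destination count.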
A study of some other decentralized P2P applications, such as BearShare, Skype, Kazaa, EMule, Limewire, Shareaza, Xolox, MLDonkey, Gnucleus, Sancho, and Morpheus, leads to a similar result. All these applications have the same connection pattern: they use one or several UDP ports to communicate with many outside hosts during their lifetime. At the network layer, this pattern can be summarized as:
For a period of time (x), from one single IP, fixed UDP port -> many destination IPs (y), fixed or random UDP ports
Experience shows that with x equal to five and y equal to three, administrators scanning for P2P applications will get a satisfying result. Administrators can change the x and y values to get a more precise or a coarser result according to their requirements.
In practice, we can export network connection records from corresponding equipment and use a database and shell scripts to process them. For every given minute, if the result shows that any host sends out some number of UDP packets to different hosts from a fixed source port, it is highly probable that the host is a P2P host.
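The pattern can be sketched in a few lines, here in Python rather than the database and shell scripts mentioned above. The `UdpFlow` record layout, the function name and the use of fixed, non-overlapping time windows are assumptions for illustration; the window length is left as a required parameter because its unit depends on how x is interpreted.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class UdpFlow(NamedTuple):
    """Hypothetical outbound UDP record taken from exported flow data."""
    timestamp: float  # seconds
    src_ip: str
    src_port: int
    dst_ip: str


def find_p2p_candidates(
    flows: Iterable[UdpFlow],
    window_seconds: float,
    min_destinations: int = 3,
) -> Set[Tuple[str, int]]:
    """Return (src_ip, src_port) pairs that contacted at least
    `min_destinations` distinct destination IPs within one window."""
    destinations = defaultdict(set)
    for flow in flows:
        # Assign each record to a fixed, non-overlapping time bucket.
        bucket = int(flow.timestamp // window_seconds)
        destinations[(bucket, flow.src_ip, flow.src_port)].add(flow.dst_ip)

    return {
        (src_ip, src_port)
        for (_, src_ip, src_port), dsts in destinations.items()
        if len(dsts) >= min_destinations
    }
```

Fixed buckets keep the sketch short, but a burst that straddles a bucket boundary is split in two; a sliding window would avoid that at the cost of more bookkeeping.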
The author of this article set up a test environment on one of China's largest ISP nodes. The network connection records were exported from the router as Netflow data and stored in a MySQL database. With the help of a small script to process all the data, many hosts were identified as P2P peers, and some interesting, locally developed new P2P applications were also discovered.
This sounds like a good method to perform P2P host identification, but what about false positives? Fortunately, this kind of network traffic behavior is seldom seen in other types of usage around the Internet. An exception would be a host that is a traditional game server, DNS server or media server. This kind of server will also produce traffic records in which many UDP packets are sent out to many different IP addresses from a single source. But administrators can easily distinguish whether a host is a traditional server, because a server normally will not send any kind of traffic on ports other than its functional port, which is not the model used by a P2P host.
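One way to apply this check is sketched below: among the hosts already flagged, those whose outbound traffic uses a single source port are set aside as likely servers. The strict single-port rule, the record layout and the function name are simplifying assumptions for illustration; a real server may legitimately use a few additional ports.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def split_servers_from_peers(
    records: Iterable[Tuple[str, int]],
    candidates: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Split flagged hosts into (likely_servers, likely_peers).

    `records` holds (src_ip, src_port) pairs for all outbound traffic,
    TCP and UDP alike. A flagged host that only ever sends from one
    source port is treated as a traditional server.
    """
    ports_by_host = defaultdict(set)
    for src_ip, src_port in records:
        ports_by_host[src_ip].add(src_port)

    likely_servers, likely_peers = set(), set()
    for host in candidates:
        if len(ports_by_host.get(host, set())) <= 1:
            likely_servers.add(host)
        else:
            likely_peers.add(host)
    return likely_servers, likely_peers
```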
The value of this UDP connection pattern is obvious: this approach does not need any kind of application layer information, yet the result is still quite satisfactory. It does not rely on any kind of signature, so newly developed P2P applications can still be identified quickly in large networks. Meanwhile, analyzing the network layer information requires almost no extra software or hardware, and dramatically reduces the pressure that might otherwise be put on the corresponding equipment.
Disadvantages of this approach
To be sure, this UDP session method also has two disadvantages. First, it can only be used to identify P2P applications that use a decentralized structure (although most modern P2P applications are indeed decentralized). Second, if the P2P application chooses TCP rather than UDP to perform its control function, our identification work will fail.
Up to this point we have identified P2P users by relying on network connection records. We now go one step further and identify exactly which P2P application a host is running, without the help of any higher layer data.
Examining the UDP traffic of different P2P applications more carefully, you will find even more interesting patterns. It has been mentioned that a decentralized network structure needs control packets, and it is not difficult to understand that a given P2P application has many kinds of control packets. Packets that serve the same control purpose are very often identical in size. Therefore, UDP packet sizes can even help us identify exactly which P2P application is running, in the absence of any higher level information.
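To look for such fixed sizes in a capture, one can build a length profile per source port, as in the sketch below. The function names and the 20% share threshold are assumptions for illustration, and no per-application signature values are implied.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def length_profile(packets: Iterable[Tuple[int, int]]) -> Dict[int, Counter]:
    """Build {src_port: Counter({udp_length: count})} from
    (src_port, udp_length) pairs."""
    profile = defaultdict(Counter)
    for src_port, length in packets:
        profile[src_port][length] += 1
    return dict(profile)


def dominant_lengths(counter: Counter, min_share: float = 0.2) -> List[int]:
    """Return the packet lengths that account for at least
    `min_share` of the packets seen on one port, sorted ascending."""
    total = sum(counter.values())
    return sorted(
        length for length, count in counter.items()
        if count / total >= min_share
    )
```

Applied to the lines visible in the Edonkey excerpt earlier in this article, port 2587 shows a single length (6), while port 10810 is dominated by lengths 4 and 25.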
Most P2P applications do not have complete documentation of their implementation details and some of them are closed source, so the exact makeup of most applications' UDP packets is still unclear. Therefore, the author of this article randomly selected seven popular decentralized P2P applications and observed their traffic. The results confirm the hypothesis that all these applications use some fixed length packets to contact outside hosts.
The result of these simple tests is quite interesting. It means that after identifying the peers in the network records, we could in the future use this technique to determine exactly which application a peer uses. However, research on the size of different P2P applications' control packets is still in its infancy and there are many things left to do. For a detailed and accurate result, each application may need special focus and a lot of research work is still needed.
Furthermore, there are other means that can be combined with the methods discussed in this article to better identify P2P users and P2P applications. Some P2P applications make connections to fixed outside IP addresses to perform such functions as version checks, authentication, downloading bootstrap data, or even advertising. For example, Kazaa will connect to ssa.Kazaa.com, desktop.Kazaa.com and some other sites when it operates. Skype will make a TCP connection to ui.skype.com whenever it starts up.
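As a small sketch of this idea, the code below maps clients to applications based on the host names they look up. The host names are the ones mentioned above; using a DNS query log as input, and the function name, are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Host names taken from the examples in the text, stored lowercase.
KNOWN_HOSTS = {
    "ssa.kazaa.com": "Kazaa",
    "desktop.kazaa.com": "Kazaa",
    "ui.skype.com": "Skype",
}


def apps_by_client(
    dns_queries: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each client IP to the applications suggested by the names
    it queried. `dns_queries` holds (client_ip, queried_name) pairs."""
    result: Dict[str, Set[str]] = {}
    for client_ip, name in dns_queries:
        # DNS names are case-insensitive and may carry a trailing dot.
        app = KNOWN_HOSTS.get(name.rstrip(".").lower())
        if app is not None:
            result.setdefault(client_ip, set()).add(app)
    return result
```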
There are also other aspects of traffic behavior, such as the amount of data transferred and connection duration, that may be used in P2P identification, but this adds another level of complexity.
As always, there is no one-size-fits-all solution for P2P identification. Although port based analysis and protocol analysis are currently the most important and commonly used technologies, we should not be content with them. With some brainstorming, other methods may emerge to reinforce P2P identification solutions.
Acknowledgement
My special thanks to Kelly Martin for his careful review and suggestions!
About the author
Yiming Gong has worked for China Telecom for more than 5 years as a senior systems administrator, and now he works as a researcher at the Research Department, NSFocus Information Technology Co.Ltd.