Category: Java

2010-04-25 10:35:10

HttpClient Self-Study Tutorial: Theory

1. Handling Character Encoding

The headers of an HTTP request or response must be in US-ASCII format. It is not possible to use non-US-ASCII characters in the header of a request or response. Generally this is not an issue, because HTTP headers are designed to facilitate the transfer of data rather than to carry the data itself.

Cookies are one exception. Since cookies are transferred as HTTP headers, they are confined to the US-ASCII character set. See the Cookie Guide for more information.

Note: HTTP headers are defined to be US-ASCII encoded. Headers do not carry the actual data. Cookies are the exception, because cookies are carried in the HTTP headers.

The request or response body can be any encoding, but by default is ISO-8859-1. The encoding may be specified in the Content-Type header, for example:

Content-Type: text/html; charset=UTF-8

In this case the application should be careful to use UTF-8 when converting the body to a String, or some characters may be corrupted. You can set the Content-Type header of a request with the addRequestHeader method of each method object, and retrieve the encoding of the response body with the getResponseCharSet method.

If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.

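Note: the default encoding is ISO-8859-1. For a page fetched with GET, the encoding can be determined from the HTTP headers, and getResponseCharSet returns the encoding of the response. If the response is just a string, getResponseBodyAsString can be used directly; it decodes with the charset given in the HTTP header, or with the default encoding if none is given.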
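The charset rule described above can be sketched with the JDK alone, so that it runs without HttpClient on the classpath. The class and method names here (ContentTypeCharset, charsetOf, decode) are hypothetical and are not part of HttpClient; the sketch only mirrors the documented behaviour of falling back to ISO-8859-1 when the Content-Type header names no usable charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of HttpClient: picks the charset for a body
// from a Content-Type header value, defaulting to ISO-8859-1.
final class ContentTypeCharset {

    /** Returns the charset named in the header value, or ISO-8859-1 if none is usable. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            // Parameters follow the media type and are separated by ';'.
            String[] parts = contentType.split(";");
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // The value may be quoted, e.g. charset="UTF-8".
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: fall back to the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Decodes the raw body bytes with the charset taken from the header value. */
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }
}
```

Decoding the same UTF-8 bytes once with "text/html; charset=UTF-8" and once with plain "text/html" shows the corruption the guide warns about: the second call treats every byte as one ISO-8859-1 character.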

 

2. Cookies

HttpClient supports automatic management of cookies, including allowing the server to set cookies and automatically return them to the server when required. It is also possible to manually set cookies to be sent to the server.

Unfortunately, there are several at times conflicting standards for handling Cookies: the Netscape Cookie draft, RFC2109, RFC2965 and a large number of vendor specific implementations that are compliant with neither specification. To deal with this, HttpClient provides policy driven cookie management. This guide will explain how to use the different cookie specifications and identify some of the common problems people have when using Cookies and HttpClient.

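Note: HttpClient can send cookies and handle the cookies in responses automatically. Different browsers implement cookies differently, however, so HttpClient provides a policy-driven mechanism to manage them.

Handling cookies manually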

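HttpMethod method = new GetMethod();
// Turn off automatic cookie handling for this request
method.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
// Send the cookie by hand as a raw request header
method.setRequestHeader("Cookie", "special-cookie=value");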
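When cookies are set by hand like this, the header value has to be assembled by the caller. The following is a minimal sketch of that step using only the JDK; CookieHeaderBuilder is a hypothetical name, not an HttpClient class. It joins name=value pairs with "; " and rejects non-US-ASCII characters, since headers are confined to US-ASCII. It does not validate the other cookie syntax rules.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of HttpClient: builds the value of a
// "Cookie" request header from name/value pairs.
final class CookieHeaderBuilder {

    /** Joins the pairs as "name=value; name=value", in the map's iteration order. */
    static String build(Map<String, String> cookies) {
        StringJoiner joined = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            requireAscii(cookie.getKey());
            requireAscii(cookie.getValue());
            joined.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joined.toString();
    }

    // HTTP headers are limited to US-ASCII, so anything above 0x7F is rejected.
    private static void requireAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                throw new IllegalArgumentException("non US-ASCII character in cookie: " + text);
            }
        }
    }
}
```

The resulting string could then be passed as the second argument of setRequestHeader("Cookie", ...) in the snippet above.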

3. Exception Handling

There are two main kinds of exceptions: transport exceptions and protocol exceptions.

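In some circumstances, usually when under heavy load, the web server may be able to receive requests but unable to process them. This may cause the server to drop the connection to the client without giving any response.

Note: I ran into this situation when writing a crawler. When a resource on a server is requested too frequently, the server can no longer produce a response, which in my case showed up as a 404 error. The same problem exists in Python.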

In most cases it is safe to retry a method that failed with NoHttpResponseException.

The best remedy is to request the resource again when the exception occurs, and to keep doing so until a 200 is returned.
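The retry idea can be sketched as follows, using only the JDK. RetrySketch and its retry method are hypothetical names, and a plain java.io.IOException stands in for the transport failure. One deliberate difference from the note above: the number of attempts is bounded (an assumption of this sketch), because retrying forever against an overloaded server tends to make the overload worse.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HttpClient: re-runs a request a limited
// number of times when it fails with a transport-level IOException.
final class RetrySketch {

    /**
     * Calls the request up to maxAttempts times and returns its first result.
     * Rethrows the last IOException if every attempt fails; any other
     * exception is treated as not recoverable and propagates at once.
     */
    static <T> T retry(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                // Transport failure (e.g. the server dropped the connection): try again.
                last = e;
            }
        }
        throw last;
    }
}
```

A real crawler would probably also pause between attempts and check the status code of the response it finally gets; both are left out to keep the sketch short.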

org.apache.commons.httpclient.ConnectTimeoutException

This exception signals that HttpClient is unable to establish a connection with the target server or proxy server within the given period of time.

Protocol exceptions generally indicate logical errors caused by a mismatch between the client and the server (web server or proxy server) in their interpretation of the HTTP specification. Usually protocol exceptions cannot be recovered from without making adjustments to either the client request or the server. Some aspects of the HTTP specification allow for different, at times conflicting, interpretations. HttpClient can be configured to support different degrees of HTTP specification compliance varying from very lenient to very strict.

Note: a protocol exception is caused by a mismatch between the protocol settings of the client and those of the server.

 

org.apache.commons.httpclient.auth.AuthChallengeException, which signals a failure to process an authentication challenge, is one you may run into when working with HTTPS.

 

4. Methods

These represent the various options of an HTTP request. The HTTP request methods are listed below.

The GET method means retrieve whatever information is identified by the requested URL. Also refer to the .

The HEAD method is identical to GET except that the server must not return a message-body in the response. This method can be used for obtaining metainformation about the document implied by the request without transferring the document itself.

The POST method is used to request that the origin server accept the data enclosed in the request as a new child of the request URL. POST is designed to allow a uniform method to cover a variety of functions such as appending to a database, providing data to a data-handling process or posting to a message board.

The multipart post method is identical to the POST method, except that the request body is separated into multiple parts. This method is generally used when uploading files to the server.

The PUT method requests that the enclosed document be stored under the supplied URL. This method is generally disabled on publicly available servers because it is generally undesirable to allow clients to put new files on the server or to replace existing files.

The DELETE method requests that the server delete the resource identified by the request URL. This method is generally disabled on publicly available servers because it is generally undesirable to allow clients to delete files on the server.

 

 

5. Handling 30x Redirects

· 301 Moved Permanently: HttpStatus.SC_MOVED_PERMANENTLY
· 302 Moved Temporarily: HttpStatus.SC_MOVED_TEMPORARILY
· 303 See Other: HttpStatus.SC_SEE_OTHER
· 307 Temporary Redirect: HttpStatus.SC_TEMPORARY_REDIRECT

// Stays null when the response carries no Location header
String redirectLocation = null;
Header locationHeader = method.getResponseHeader("location");
if (locationHeader != null) {
    redirectLocation = locationHeader.getValue();
} else {
    // The response is invalid and did not provide the new location for
    // the resource. Report an error or possibly handle the response
    // like a 404 Not Found error.
}
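Once the Location value has been read, it still has to be turned into the URL of the next request, and servers sometimes send a relative value. The sketch below shows that step with java.net.URI from the JDK. RedirectSketch and its methods are hypothetical names; the status codes are the four listed above, written as plain integers so the example does not depend on HttpStatus.

```java
import java.net.URI;

// Hypothetical helper, not part of HttpClient: decides whether a status code
// is one of the redirects listed above and resolves the Location value.
final class RedirectSketch {

    /** True for 301, 302, 303 and 307. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307;
    }

    /**
     * Resolves a Location header value against the URI that was requested.
     * Returns null when the header is missing or empty, which the caller
     * should treat as an invalid response.
     */
    static URI resolveLocation(URI requestUri, String location) {
        if (location == null || location.trim().isEmpty()) {
            return null;
        }
        // An absolute Location is returned as is; a relative one is
        // resolved against the original request URI.
        return requestUri.resolve(location.trim());
    }
}
```

A caller following redirects in a loop would normally also cap the number of hops, so that two URLs redirecting to each other cannot keep it running forever.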

 

 
