Category: Java
2010-04-25 10:35:10
HttpClient Self-Study Tutorial: Theory
1. Handling Character Encoding
The headers of an HTTP request or response must be in US-ASCII format. It is not possible to use non-US-ASCII characters in the header of a request or response. Generally this is not an issue, however, because the HTTP headers are designed to facilitate the transfer of data rather than to actually transfer the data itself.
One exception, however, is cookies. Since cookies are transferred as HTTP headers, they are confined to the US-ASCII character set. See the Cookie Guide for more information.
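Note: HTTP headers are defined to be US-ASCII encoded, and they do not carry the actual data. Cookies are the exception, because cookies travel in HTTP headers (Set-Cookie in a response, Cookie in a request).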
The request or response body can be in any encoding, but by default it is ISO-8859-1. The encoding may be specified in the Content-Type header, for example:
Content-Type: text/html; charset=UTF-8
In this case the application should be careful to use UTF-8 encoding when converting the body to a String; otherwise some characters may be corrupted. You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method, which will automatically use the encoding specified in the Content-Type header, or ISO-8859-1 if no charset is specified.
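Note: the default encoding is ISO-8859-1. For a page fetched with GET, the encoding can be determined from the HTTP headers: getResponseCharSet returns the encoding of the response. If the response is simply a string, getResponseBodyAsString can be used directly; it decodes with the encoding given in the HTTP headers, or with the default encoding if none is given.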
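The charset rule described above (use the charset parameter of Content-Type, otherwise fall back to ISO-8859-1) can be sketched with the JDK alone. This is not HttpClient's own code: the class ContentTypeCharset and the methods charsetOf and decodeBody are hypothetical names chosen for illustration, and the parsing is deliberately simple.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of HttpClient: mirrors the fallback rule only.
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or ISO-8859-1 if none is usable. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            // Parameters follow the media type, separated by ';'.
            String[] parts = contentType.split(";");
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // The value may be quoted, as in charset="UTF-8".
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the default instead.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Decodes a raw body using the charset declared in the Content-Type value. */
    static String decodeBody(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoded with the declared charset, the text survives intact.
        System.out.println(decodeBody(body, "text/html; charset=UTF-8"));
        // Without a charset the ISO-8859-1 default applies and the last character is garbled.
        System.out.println(decodeBody(body, "text/html"));
    }
}
```

Decoding the same bytes with and without the charset parameter shows why a wrong default corrupts non-ASCII characters.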
2. Cookies
HttpClient supports automatic management of cookies, including allowing the server to set cookies and automatically returning them to the server when required. It is also possible to manually set cookies to be sent to the server.
Unfortunately, there are several, at times conflicting, standards for handling cookies: the Netscape Cookie draft, RFC2109, RFC2965 and a large number of vendor-specific implementations that comply with none of these specifications. To deal with this, HttpClient provides policy-driven cookie management. This guide will explain how to use the different cookie specifications and identify some of the common problems people have when using cookies and HttpClient.
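Note: HttpClient can simulate sending and receiving cookies, and it does so automatically. Different browsers implement cookies differently, however, so HttpClient provides a fairly intelligent, policy-driven management mechanism.
Handling cookies manually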
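import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.GetMethod;

// Turn off automatic cookie handling for this request, then set the Cookie header by hand.
HttpMethod method = new GetMethod();
method.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
method.setRequestHeader("Cookie", "special-cookie=value");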
3. Exception Handling
There are two main kinds of exceptions: transport exceptions and protocol exceptions.
In some circumstances, usually when under heavy load, the web server may be able to receive requests but unable to process them. This may cause the server to drop the connection to the client without giving any response.
Note: I ran into this while building a crawler. When one server resource was requested too frequently, the server could no longer respond, and the requests failed with 404 errors. The same problem occurs in Python.
In most cases it is safe to retry a method that failed with NoHttpResponseException.
Note: the best fix is to request the resource again whenever this exception occurs, until a 200 comes back.
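A minimal sketch of the retry idea, assuming a cap on the number of attempts and a short pause between them rather than retrying forever. It uses only the JDK: RetrySketch, withRetries, maxAttempts and pauseMillis are hypothetical names, and the nested NoResponseException merely stands in for HttpClient's NoHttpResponseException so that the example stays self-contained.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HttpClient.
final class RetrySketch {

    /** Stand-in for a transport failure where the server sent no response. */
    static final class NoResponseException extends IOException {
        NoResponseException(String message) {
            super(message);
        }
    }

    /**
     * Runs the request up to maxAttempts times. Only NoResponseException triggers
     * a retry; any other exception is passed straight to the caller.
     */
    static <T> T withRetries(Callable<T> request, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NoResponseException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (NoResponseException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait a little longer after each failure so the server can recover.
                    Thread.sleep(pauseMillis * attempt);
                }
            }
        }
        throw last; // every attempt failed
    }
}
```

Capping the attempts is a design choice of this sketch: an unbounded loop against an overloaded server would only add to the load that caused the dropped connections.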
org.apache.commons.httpclient.ConnectTimeoutException
This exception signals that HttpClient is unable to establish a connection with the target server or proxy server within the given period of time.
Protocol exceptions generally indicate logical errors caused by a mismatch between the client and the server (web server or proxy server) in their interpretation of the HTTP specification. Usually protocol exceptions cannot be recovered from without making adjustments to either the client request or the server. Some aspects of the HTTP specification allow for different, at times conflicting, interpretations. HttpClient can be configured to support different degrees of HTTP specification compliance varying from very lenient to very strict.
Note: a protocol exception arises when the client and the server disagree about how the protocol is configured or interpreted.
org.apache.commons.httpclient.auth.AuthChallengeException
Note: I ran into this exception when working with HTTPS.
4. Methods
These represent the different kinds of HTTP request. The HTTP request methods are listed below.
| GET | The GET method retrieves whatever information is identified by the requested URL. |
| HEAD | The HEAD method is identical to GET except that the server must not return a message-body in the response. This method can be used for obtaining metainformation about the document implied by the request without transferring the document itself. |
| POST | The POST method is used to request that the origin server accept the data enclosed in the request as a new child of the request URL. POST is designed to allow a uniform method to cover a variety of functions such as appending to a database, providing data to a data-handling process or posting to a message board. |
| Multipart POST | The multipart post method is identical to the POST method, except that the request body is separated into multiple parts. This method is generally used when uploading files to the server. |
| PUT | The PUT method requests that the enclosed document be stored under the supplied URL. This method is generally disabled on publicly available servers because it is generally undesirable to allow clients to put new files on the server or to replace existing files. |
| DELETE | The DELETE method requests that the server delete the resource identified by the request URL. This method is generally disabled on publicly available servers because it is generally undesirable to allow clients to delete files on the server. |
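To show how the method name ends up on a request, here is a small sketch that uses the JDK's HttpURLConnection instead of HttpClient's method classes, so it runs without extra libraries. MethodSketch and prepare are hypothetical names and the URL is a placeholder; the request is only prepared, never sent.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper built on the JDK, not on HttpClient.
final class MethodSketch {

    /** Prepares, but does not send, a request that uses the given HTTP method. */
    static HttpURLConnection prepare(String url, String method) throws IOException {
        // openConnection() only creates the object; nothing goes over the network yet.
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Throws ProtocolException if the method name is not one the JDK supports.
        connection.setRequestMethod(method);
        // POST and PUT carry a request body, so output must be enabled for them.
        connection.setDoOutput(method.equals("POST") || method.equals("PUT"));
        return connection;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection head = prepare("http://example.invalid/doc", "HEAD");
        System.out.println(head.getRequestMethod() + " body allowed: " + head.getDoOutput());
    }
}
```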
5. Handling 30x Redirects
· 301 Moved Permanently. HttpStatus.SC_MOVED_PERMANENTLY
· 302 Moved Temporarily. HttpStatus.SC_MOVED_TEMPORARILY
· 303 See Other. HttpStatus.SC_SEE_OTHER
· 307 Temporary Redirect. HttpStatus.SC_TEMPORARY_REDIRECT
String redirectLocation;
Header locationHeader = method.getResponseHeader("location");
if (locationHeader != null) {
    redirectLocation = locationHeader.getValue();
} else {
    // The response is invalid and did not provide the new location for
    // the resource. Report an error or possibly handle the response
    // like a 404 Not Found error.
}
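Once the Location value is in hand, it still has to be turned into a URL that can be requested, and it may be relative to the original request. The following JDK-only sketch shows one way to do that under simple assumptions: RedirectSketch, isRedirect and redirectTarget are hypothetical names, and the status codes are the four listed above.

```java
import java.net.URI;

// Hypothetical helper, not part of HttpClient.
final class RedirectSketch {

    /** True for the redirect status codes 301, 302, 303 and 307. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307;
    }

    /**
     * Returns the absolute redirect target, or null when the response is not a
     * redirect or carries no usable Location value.
     */
    static URI redirectTarget(URI requestUri, int status, String location) {
        if (!isRedirect(status) || location == null || location.trim().isEmpty()) {
            return null;
        }
        // resolve() keeps absolute locations as they are and expands relative ones
        // against the URI of the original request.
        return requestUri.resolve(location.trim());
    }

    public static void main(String[] args) {
        URI request = URI.create("http://example.com/a/b");
        System.out.println(redirectTarget(request, 302, "/login"));
    }
}
```

A null result corresponds to the else branch in the snippet above: the caller decides whether to report an error or treat the response like a 404.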