2012-04-25 17:00:05
You need to understand HTTP caching. No, really, you do. I have mentioned repeatedly that you need to choose your HTTP methods carefully when building a web service, in part because you can get the performance benefits of caching with GET. If you want the real advantages of GET, you need to understand caching and how to use it effectively to improve the performance of your service.
This article will not explain how to set up caching for your particular web server, nor will it cover the different kinds of caches. If you want that kind of information I recommend Mark Nottingham's tutorial on HTTP caching.
Goals

First you need to understand the goals of the HTTP caching model. One objective is to let both the client and server have a say over when to return a cached entry. As you can imagine, letting both client and server decide when a cached entry is considered stale introduces some complexity.
The HTTP caching model is based on validators, which are bits of data that a client can use to validate that a cached response is still valid. They are fundamental to the operation of caches since they allow a client or intermediary to query the status of a resource without having to transfer the entire response again: the server returns an entity body only if the validator indicates that the cache has a stale response.
Validators

One of the validators for HTTP is the ETag. An ETag is like a fingerprint for the bytes in the representation; if a single byte changes, the ETag also changes.
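To make the fingerprint idea concrete, here is a minimal sketch of how a server might derive an ETag from the bytes of a representation. Hashing the body with SHA-256 is an assumption made for illustration (real servers often build ETags from other attributes), and `make_etag` is a hypothetical helper, not part of any library.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Return a quoted ETag derived from the representation's bytes.

    Assumption: a truncated SHA-256 digest stands in for whatever
    scheme a real server uses.
    """
    digest = hashlib.sha256(body).hexdigest()[:16]
    return '"%s"' % digest


original = make_etag(b"<p>hello</p>")
changed = make_etag(b"<p>hellO</p>")  # a single byte differs
print(original, changed)
```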
Using validators requires that you already have done a GET once on a resource. The cache stores the value of the ETag header if present and then uses the value of that header in later requests to that same URI.
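The round trip described above can be sketched without any network access. Everything here is hypothetical scaffolding: `fake_origin` stands in for a real server, and the cache entry is a plain dict with lowercase header names.

```python
def build_conditional_headers(cache_entry):
    """Request headers for a GET; adds If-None-Match when an ETag is cached."""
    headers = {}
    if cache_entry and "etag" in cache_entry["headers"]:
        headers["If-None-Match"] = cache_entry["headers"]["etag"]
    return headers


def apply_response(cache_entry, status, headers, body):
    """Return the cache entry to keep after seeing the server's response."""
    if status == 304 and cache_entry is not None:
        return cache_entry  # validator matched: reuse the stored body
    if status == 200:
        return {"headers": headers, "body": body}  # store the new representation
    return cache_entry


def fake_origin(request_headers, current_etag, current_body):
    """Stand-in for a server: 304 if the validator matches, else 200."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304, {}, b""
    return 200, {"etag": current_etag}, current_body


entry = None
for etag, body in [('"v1"', b"one"), ('"v1"', b"one"), ('"v2"', b"two")]:
    status, headers, payload = fake_origin(
        build_conditional_headers(entry), etag, body)
    entry = apply_response(entry, status, headers, payload)
    print(status, entry["body"])
```

The first request carries no validator and gets a full response; the second is answered with a 304 and the stored body is reused; the third sees a changed ETag and replaces the entry.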
For example, if I send a request to example.org and get back this response:

    HTTP/1.1 200 OK
    Date: Fri, 30 Dec 2005 17:30:56 GMT
    Server: Apache
    ETag: "11c415a-8206-243aea40"
    Accept-Ranges: bytes
    Content-Length: 33286
    Vary: Accept-Encoding,User-Agent
    Cache-Control: max-age=7200
    Expires: Fri, 30 Dec 2005 19:30:56 GMT
    Content-Type: image/png

    -- binary data --

Then the next time I do a GET I can add the validator in. Note that the value of ETag is placed in the If-None-Match: header.

    GET / HTTP/1.1
    Host: example.org
    If-None-Match: "11c415a-8206-243aea40"

If there was no change in the representation then the server returns a 304 Not Modified.

    HTTP/1.1 304 Not Modified
    Date: Fri, 30 Dec 2005 17:32:47 GMT

If there was a change, the new representation is returned with a status code of 200 and a new ETag.

    HTTP/1.1 200 OK
    Date: Fri, 30 Dec 2005 17:32:47 GMT
    Server: Apache
    ETag: "0192384-9023-1a929893"
    Accept-Ranges: bytes
    Content-Length: 33286
    Vary: Accept-Encoding,User-Agent
    Cache-Control: max-age=7200
    Expires: Fri, 30 Dec 2005 19:30:56 GMT
    Content-Type: image/png

    -- binary data --

While validators are used to test whether a cached entry is still valid, the Cache-Control: header is used to signal how long a representation can be cached. The most fundamental of all the cache-control directives is max-age. This directive asserts that the cached response can be at most max-age seconds old before it is considered stale. Note that max-age can appear in both request headers and response headers, which gives both the client and server a chance to assert how old they like their responses cached. If a cached response is fresh then we can return it immediately; if it's stale then we need to validate it before returning it.
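As a rough sketch of the fresh-or-stale decision, the age of a response can be estimated from its Date: header and compared against max-age. This is a simplification: it ignores the Age: header and clock skew, and the function names are hypothetical.

```python
from email.utils import parsedate_to_datetime


def current_age(date_header, now_header):
    """Seconds elapsed between the response's Date: value and 'now'."""
    received = parsedate_to_datetime(date_header)
    now = parsedate_to_datetime(now_header)
    return int((now - received).total_seconds())


def is_fresh(age, response_max_age, request_max_age=None):
    """True if the cached response may be returned without validation."""
    if age >= response_max_age:
        return False  # older than the server allows
    if request_max_age is not None and age > request_max_age:
        return False  # older than the client is willing to accept
    return True


# The first example response was sent at 17:30:56; suppose it is now 17:32:47.
age = current_age("Fri, 30 Dec 2005 17:30:56 GMT",
                  "Fri, 30 Dec 2005 17:32:47 GMT")
print(age, is_fresh(age, 7200), is_fresh(age, 7200, request_max_age=60))
# prints: 111 True False
```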
Let's take another look at our example response from above. Note that the Cache-Control: header is set and that a max-age of 7200 means that the entry can be cached for up to two hours.

    HTTP/1.1 200 OK
    Date: Fri, 30 Dec 2005 17:32:47 GMT
    Server: Apache
    ETag: "0192384-9023-1a929893"
    Accept-Ranges: bytes
    Content-Length: 33286
    Vary: Accept-Encoding,User-Agent
    Cache-Control: max-age=7200
    Expires: Fri, 30 Dec 2005 19:30:56 GMT
    Content-Type: text/xml

There are lots of directives that can be put in the Cache-Control: header, and the header may appear in requests, in responses, or in both.
Directives allowed in a request:

Directive | Description |
---|---|
no-cache | The cached response must not be used to satisfy this request. |
no-store | Do not store this response in a cache. |
max-age=delta-seconds | The client is willing to accept a cached response that is delta-seconds old without validating. |
max-stale=delta-seconds | The client is willing to accept a cached response that is no more than delta-seconds stale. |
min-fresh=delta-seconds | The client is willing to accept only a cached response that will still be fresh delta-seconds from now. |
no-transform | The entity body must not be transformed. |
only-if-cached | Return a response only if there is one in the cache. Do not validate or GET a response if no cache entry exists. |

Directives allowed in a response:

Directive | Description |
---|---|
public | This can be cached by any cache. |
private | This can be cached only by a private cache. |
no-cache | The cached response must not be used on subsequent requests without first validating it. |
no-store | Do not store this response in a cache. |
no-transform | The entity body must not be transformed. |
must-revalidate | If the cached response is stale it must be validated before it is returned in any response. Overrides max-stale. |
max-age=delta-seconds | The response can be served from a cache without validating until it is delta-seconds old; after that it is stale. |
s-maxage=delta-seconds | Just like max-age but it applies only to shared caches. |
proxy-revalidate | Like must-revalidate, but only for proxies. |
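
Before a cache can act on these directives it has to pull them out of the header value. The following is a minimal sketch of such a parser; `parse_cache_control` is a hypothetical helper, and it deliberately ignores quoted arguments that themselves contain commas.

```python
def parse_cache_control(value):
    """Parse a Cache-Control value into {directive: argument-or-True}.

    Numeric arguments become ints; directives without an argument map
    to True. Quoted lists containing commas are not handled.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        name = name.strip().lower()
        arg = arg.strip().strip('"')
        if not sep:
            directives[name] = True
        elif arg.isdigit():
            directives[name] = int(arg)
        else:
            directives[name] = arg
    return directives


print(parse_cache_control("private, max-age=3600"))
# prints: {'private': True, 'max-age': 3600}
```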
Let's look at some Cache-Control: header examples.

    Cache-Control: private, max-age=3600

If sent by a server, this Cache-Control: header states that the response can be cached only in a private cache, and only for one hour.

    Cache-Control: public, must-revalidate, max-age=7200

The response can be stored by any cache, including shared ones, for two hours; after that the cache must revalidate the entry before returning it to a subsequent request.

    Cache-Control: must-revalidate, max-age=0

This forces the client to revalidate on every request, since max-age=0 makes the cached entry instantly stale. See Mark Nottingham's Leveraging the Web: Caching for a nice example of how this can be applied.

    Cache-Control: no-cache

This is pretty close to must-revalidate, max-age=0, except that a client could use a max-stale directive on a request and get a stale response. The must-revalidate directive overrides max-stale. I told you that giving both client and server some control would make things a bit complicated.
So far all of the Cache-Control: header examples we have looked at are on the response side, but the header can be added to requests too.

    Cache-Control: no-cache

This forces an "end-to-end reload," where the client forces the cache to reload its entry from the origin server.

    Cache-Control: min-fresh=200

Here the client asserts that it wants a response that will be fresh for at least 200 seconds.
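
One way to see how the request directives interact is to write the cache's decision as a single function. This is only a sketch under stated assumptions: `age` and `lifetime` are in seconds (with `lifetime` taken from the response's max-age), `request_cc` is a dict of already-parsed request directives, and the function name is hypothetical.

```python
def can_use_cached(age, lifetime, request_cc, must_revalidate=False):
    """Decide whether a cached response may be returned without validation."""
    if "no-cache" in request_cc:
        return False  # end-to-end reload requested
    if "max-age" in request_cc and age > request_cc["max-age"]:
        return False  # older than the client will accept
    remaining = lifetime - age
    if remaining > 0:
        # Fresh: also honour min-fresh, if the client asked for it.
        return remaining >= request_cc.get("min-fresh", 0)
    # Stale from here on.
    if must_revalidate:
        return False  # must-revalidate overrides max-stale
    if "max-stale" in request_cc:
        allowed = request_cc["max-stale"]
        return allowed is True or -remaining <= allowed
    return False


print(can_use_cached(7100, 7200, {"min-fresh": 200}))   # False: fresh for only 100s
print(can_use_cached(7300, 7200, {"max-stale": 200}))   # True: 100s stale is tolerated
```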
Vary

You may be wondering about situations where a cache might get confused. For example, what if a server does content negotiation, where different representations can be returned from the same URI? For cases like this HTTP supplies the Vary: header. The Vary: header tells the cache the names of all the request headers that might cause a resource's representation to change.
For example, if a server did do content negotiation then the Content-Type: header would be different for the different types of responses, depending on the type of content negotiated. In that case the server can add a Vary: Accept header, which causes the cache to consider the Accept: header when caching responses from that URI.

    Date: Mon, 23 Jan 2006 15:37:34 GMT
    Server: Apache
    Accept-Ranges: bytes
    Vary: Accept-Encoding,User-Agent
    Content-Encoding: gzip
    Cache-Control: max-age=7200
    Expires: Mon, 23 Jan 2006 17:37:34 GMT
    Content-Length: 5073
    Content-Type: text/html; charset=utf-8

In this example the server is stating that responses can be cached for two hours, but that responses may vary based on the Accept-Encoding and User-Agent headers.
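One way a cache can honour Vary: is to fold the named request headers into its lookup key, so that requests differing in those headers never share an entry. The sketch below assumes that approach; `cache_key` is a hypothetical helper and does not treat the special value `Vary: *`.

```python
def cache_key(uri, request_headers, vary_value):
    """Build a lookup key from the URI plus the request headers named in Vary."""
    names = sorted(n.strip().lower() for n in vary_value.split(",") if n.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    # Missing headers count as empty so "absent" and "present" differ.
    selected = tuple((name, lowered.get(name, "")) for name in names)
    return (uri, selected)


vary = "Accept-Encoding,User-Agent"
gzip_key = cache_key("http://example.org/",
                     {"Accept-Encoding": "gzip", "User-Agent": "A"}, vary)
plain_key = cache_key("http://example.org/",
                      {"Accept-Encoding": "identity", "User-Agent": "A"}, vary)
print(gzip_key == plain_key)  # prints: False
```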
Connection

When a server successfully validates a cached response, using for example the If-None-Match: header, then the server returns a status code of 304 Not Modified. So nothing much happens on a 304 Not Modified response, right? Well, not exactly. In fact, the server can send updated headers for the entity that have to be updated in the cache. The server can also send along a Connection: header that says which headers shouldn't be updated.
Some headers are by default excluded from the list of headers to update. These are called hop-by-hop headers and they are: Connection, Keep-Alive, Proxy-Authenticate, Proxy-Authorization, TE, Trailers, Transfer-Encoding, and Upgrade. All other headers are considered end-to-end headers.

    HTTP/1.1 304 Not Modified
    Content-Length: 647
    Server: Apache
    Connection: close
    Date: Mon, 23 Jan 2006 16:10:52 GMT
    Content-Type: text/html; charset=iso-8859-1
    ...

In the above example Date: is not a hop-by-hop header nor is it listed in the Connection: header, so the cache has to update the value of Date: in the cache.
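The update rule can be sketched as a merge that skips the hop-by-hop headers and anything named in Connection:. This is an illustration only: header names are assumed to be lowercase already, `update_cached_headers` is a hypothetical helper, and a production cache would need extra care with headers such as Content-Length.

```python
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailers", "transfer-encoding", "upgrade",
}


def update_cached_headers(cached, not_modified):
    """Merge end-to-end headers from a 304 response into the cached headers."""
    excluded = set(HOP_BY_HOP)
    # Tokens listed in Connection: are also excluded from the update.
    connection = not_modified.get("connection", "")
    excluded.update(t.strip().lower() for t in connection.split(",") if t.strip())
    merged = dict(cached)
    for name, value in not_modified.items():
        if name not in excluded:
            merged[name] = value
    return merged


cached = {"date": "Mon, 23 Jan 2006 15:37:34 GMT", "etag": '"abc"'}
fresh = {"date": "Mon, 23 Jan 2006 16:10:52 GMT", "connection": "close"}
print(update_cached_headers(cached, fresh)["date"])
# prints: Mon, 23 Jan 2006 16:10:52 GMT
```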
If Only It Were That Easy

While a little complex, the above is at least conceptually nice. Of course, one of the problems is that we have to be able to work with HTTP 1.0 servers and caches, which use a different set of headers, all time-based, to do caching. Out of necessity those headers are carried forward into HTTP 1.1.
The older cache control model from HTTP 1.0 is based solely on time. The Last-Modified cache validator is just that, the last time that the resource was modified. The cache uses the Date:, Expires:, Last-Modified:, and If-Modified-Since: headers to detect changes in a resource.
If you are developing a client you should always use both validators if present; you never know when an HTTP 1.0 cache will pop up between you and a server. HTTP 1.1 was published seven years ago, so you'd think that at this late date most things would be updated, but you can't count on it. Sending both validators is the protocol equivalent of wearing a belt and suspenders.
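A belt-and-suspenders client might build its validation headers as in the sketch below, and an HTTP 1.0 style check on the other end reduces to a date comparison. Both function names are hypothetical, and the cached headers are assumed to use lowercase names.

```python
from email.utils import parsedate_to_datetime


def validation_headers(cached_headers):
    """Send every validator the cached response gave us."""
    headers = {}
    if "etag" in cached_headers:
        headers["If-None-Match"] = cached_headers["etag"]
    if "last-modified" in cached_headers:
        headers["If-Modified-Since"] = cached_headers["last-modified"]
    return headers


def unchanged_since(last_modified, if_modified_since):
    """Time-based check: True means the resource has not changed since then."""
    return (parsedate_to_datetime(last_modified)
            <= parsedate_to_datetime(if_modified_since))


cached = {"etag": '"11c415a-8206-243aea40"',
          "last-modified": "Fri, 30 Dec 2005 12:00:00 GMT"}
print(validation_headers(cached))
```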
Now that you understand caching you may be wondering if the client library in your favorite language even supports caching. I know the answer for Python, and sadly that answer is currently no. It pains me that my favorite language doesn't have one of the best HTTP client implementations around. That needs to change.
Introducing httplib2

Introducing httplib2, a comprehensive Python HTTP client library that supports a local private cache that understands all the caching operations we just talked about. In addition it supports many features left out of other HTTP libraries.
HTTP and HTTPS: HTTPS support is available only if the socket module was compiled with SSL support.
Keep-Alive: Supports HTTP 1.1 Keep-Alive, keeping the socket open and performing multiple requests over the same connection if possible.
Authentication: Three types of HTTP Authentication are supported: Digest, Basic, and WSSE. These can be used over both HTTP and HTTPS.

See the httplib2 project page for more details.
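
A short sketch of what using the cache looks like in practice. It assumes httplib2 is installed (it is a third-party package) and that the URI you pass serves a cacheable response; `fetch_twice` and the `.cache` directory name are illustrative choices, and the import is guarded only so the snippet loads without the library.

```python
try:
    import httplib2  # third-party: pip install httplib2
except ImportError:  # keep the sketch importable without the library
    httplib2 = None


def fetch_twice(uri, cache_dir=".cache"):
    """GET uri twice through one cache directory.

    Returns the first status code and whether the second response
    was served from the local cache.
    """
    if httplib2 is None:
        raise RuntimeError("httplib2 is not installed")
    http = httplib2.Http(cache_dir)  # a directory name enables caching
    first_response, _ = http.request(uri, "GET")
    second_response, _ = http.request(uri, "GET")
    return first_response.status, second_response.fromcache
```

Request-side directives go in the headers argument; for example, `http.request(uri, headers={"cache-control": "no-cache"})` asks for an end-to-end reload.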
Next Time

Next time I will cover HTTP authentication, redirects, keep-alive, and compression in HTTP and how httplib2 handles them. You might also be wondering how the "big guys" handle caching. That will take a whole other article to cover.