Category: Python/Ruby

2015-01-06 13:33:51

Original article: http://www.mobify.com/blog/http-requests-are-hard/

A lot of things can go wrong when requesting information over HTTP from a remote web server: requests time out, servers fail, government operatives cut undersea cables. You get the picture.

Identifying and handling failures helps build fault-tolerant systems that stay up even when services they rely on are down. A nice side effect is that your phone is less likely to beep in the middle of the night with a message from your coworkers talking in all caps.

This guide will introduce you to the common ways HTTP requests fail and how to handle the failures.

The examples use Python's fantastic requests library, but the principles shown work across all languages. You can follow along on your computer by grabbing requests off PyPI.

The requests.get method is the cornerstone for all the examples. It makes a synchronous HTTP GET request to fetch the content from url:

    # Importing `requests` is omitted from here on for brevity. If you are coding
    # along with the article, make sure to include it before trying the examples!
    import requests

    response = requests.get(url="")

Where possible, the examples use a hosted HTTP testing service to illustrate the specific failure scenarios. A service like that is great for testing how your code will react in a hostile world!

The guide assumes familiarity with making HTTP requests and uses the following terminology:

  • Client: The code making the HTTP requests and the server it lives on.
  • Server: The box that delivers the HTTP response we requested.
  • Caller: The code which instantiates the client and tells it to make a request.

Ready to make some requests? Let's go!

DNS lookup failures

HTTP requests can fail before the client can even make a connection to the server. If the URL specified by the caller has a domain name, the client must look up its IP address before making the request. If the domain name doesn't resolve, it's possible that it isn't configured correctly or doesn't exist.

    # This domain name doesn't exist!
    url = ""
    try:
        response = requests.get(url)
    except requests.exceptions.ConnectionError as e:
        print("These aren't the domains we're looking for.")

It's important to let the caller know they may have entered the wrong domain!
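
One way to give the caller a clearer message is to check name resolution on its own before making the request. The following is a minimal sketch, not part of requests: `resolves` is a hypothetical helper built on the standard library's `socket.getaddrinfo`, and the `resolver` parameter is an assumption added only so the lookup can be swapped out in tests.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address."""
    try:
        return len(resolver(hostname, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup itself failed.
        # UnicodeError: the name cannot be encoded as a valid hostname.
        return False
```

A client could call `resolves(hostname)` first and report "unknown domain" separately from "server is down".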

Errors connecting to the server

Even if the hostname of the URL correctly resolves, we might not always succeed in connecting to the server. If someone tripped on its power cord and took it down, it's unlikely it will accept our connection!

Errors of this nature often block the client, tying it up waiting for a server that will never respond. For this reason, it's a good idea to add timeouts to the client. That way, if the server takes too long to respond, the client can move on to do something else rather than waiting indefinitely. requests accepts two timeouts: connect and read. The connect timeout is the amount of time the client should wait to establish a connection to the server:

    # Using a very short `connect_timeout` gives us a feel for what happens when the
    # server is slow to pick up the connection.
    connect_timeout = 0.0001
    try:
        response = requests.get(url="", timeout=(connect_timeout, 10.0))
    except requests.exceptions.ConnectTimeout as e:
        print("Too slow Mojo!")

The read timeout is the amount of time the client should wait between bytes from the server:

    # Our resource takes longer than `read_timeout` to send a byte.
    read_timeout = 1.0
    try:
        response = requests.get(url="", timeout=(10.0, read_timeout))
    except requests.exceptions.ReadTimeout as e:
        print("Waited too long between bytes.")

The exact values used for the timeout are usually less important than just setting one. You don't want the client to be blocked forever on a slowpoke server. Start with 10 seconds and watch your logs.

Extra Credit: Depending on the profile of the system you're building, you may want to implement dynamic timeouts that use historical data to wait longer for servers that are known to be slow. You may want to ban your client from even trying to connect to servers that always time out.
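
The dynamic timeout idea could be sketched roughly as follows. This is one possible approach under stated assumptions, not something requests provides: the class name `TimeoutTracker`, the padding multiplier, and the floor and ceiling values are all made up for illustration.

```python
from collections import defaultdict, deque


class TimeoutTracker:
    """Suggest a per-host timeout from recently observed response times."""

    def __init__(self, default=10.0, multiplier=3.0, floor=1.0,
                 ceiling=30.0, window=20):
        self.default = default
        self.multiplier = multiplier
        self.floor = floor
        self.ceiling = ceiling
        # Keep only the most recent `window` samples for each host.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, host, seconds):
        """Remember how long a successful request to `host` took."""
        self._samples[host].append(seconds)

    def timeout_for(self, host):
        """Return the default until there is history, then a padded average."""
        samples = self._samples.get(host)
        if not samples:
            return self.default
        average = sum(samples) / len(samples)
        return min(self.ceiling, max(self.floor, average * self.multiplier))
```

The result of `timeout_for(host)` could then be passed as the read half of the `timeout` tuple shown earlier.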

HTTP errors

What if something goes sideways while the server is preparing our response? Maybe its database is unresponsive or it was switched into maintenance mode. Whatever the reason, if the server is able to detect that it isn't functioning correctly, it should respond with a 5xx server error status code.

Alternatively, if the client is incorrectly constructing the request, the server may respond with a 4xx client error status code.

In most cases we'll want to identify these bad response status codes and let the caller handle them. With requests, this is as easy as calling the raise_for_status method on the response object:

    # This URL returns an HTTP 500 Server Error.
    url = ""
    response = requests.get(url)
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("And you get an HTTPError:", e)
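
When the caller needs to react differently to client and server errors, the status code can be bucketed before deciding what to do. A minimal sketch; `classify_status` and its category names are hypothetical and chosen only for illustration:

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse category for error handling."""
    if 200 <= status_code < 400:
        return "ok"
    if 400 <= status_code < 500:
        # The request itself was rejected; repeating it unchanged
        # usually won't help.
        return "client_error"
    if 500 <= status_code < 600:
        # The server failed; the same request may succeed later.
        return "server_error"
    return "unknown"
```

With requests, the value to pass in would be `response.status_code`.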

Responses that aren't what we expect

It's possible that the caller could request a resource that our client wasn't designed to handle. For example, what if someone uses our RSS reader to request an MKV file of the last episode of Game of Thrones?

We can assert that the Content-Type response header matches what we expect. Our RSS reader example might look for the following:

    class WrongContent(requests.exceptions.RequestException):
        """The response has the wrong content."""

    # This URL sets the `Content-Type` to `text/plain`.
    url = ""
    response = requests.get(url)
    # Use `.get` so that a missing header doesn't raise `KeyError`.
    if response.headers.get("content-type") != "application/rss+xml":
        raise WrongContent(response=response)

Note that even if the Content-Type header does match what we are expecting, there is no guarantee that the response's body will. Calling code should account for this. For example, if we're expecting JSON and we don't get back JSON, that's a problem. In requests, the response.json() method tries to convert the response body into a Python object from JSON:

    # This URL returns an XML document.
    url = ""
    response = requests.get(url)
    try:
        data = response.json()
    except ValueError:
        raise WrongContent(response=response)

Extra Credit: If we're processing text data like HTML, don't forget to detect its charset and correctly decode it. You'll need to check the response's Content-Type header as well as potentially the content itself to avoid decoding errors.
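
Real Content-Type headers often carry parameters such as `charset`, so comparing the whole header against a bare media type can reject valid responses. One way to split the header apart, sketched here with the standard library's `email.message.Message`; the helper name `parse_content_type` is an assumption for illustration:

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media_type, charset or None)."""
    message = Message()
    message["Content-Type"] = header_value
    # `get_content_type` lowercases the media type and drops parameters;
    # `get_param` returns the unquoted parameter value, or None if absent.
    return message.get_content_type(), message.get_param("charset")
```

Note that `get_content_type` falls back to `text/plain` when the header value is missing or malformed, so an empty header should be checked for separately.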

Responses that are too large

Let's go back to our movie example. Not only is the movie not the content type our RSS reader expects, it's also really big. If we're not careful, these kinds of responses could exhaust our client's resources.

To ensure our client hasn't been asked to download the entire internet, we must track how much content we've received. With requests, we can stream the response and count the bytes as they arrive:

    from contextlib import closing

    class TooBig(requests.exceptions.RequestException):
        """The response was way too big."""

    TOO_BIG = 1024 * 1024 * 10  # 10MB
    CHUNK_SIZE = 1024 * 128

    url = ""
    with closing(requests.get(url, stream=True)) as response:
        content_length = 0
        for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
            # Count the bytes actually received; the final chunk is usually
            # smaller than `CHUNK_SIZE`.
            content_length = content_length + len(chunk)
            if content_length > TOO_BIG:
                raise TooBig(response=response)
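
Counting bytes catches every oversized response, but when the server announces the size up front we can bail out before reading anything. A minimal sketch of that pre-check, assuming a hypothetical helper `declared_too_big` that takes any mapping of headers:

```python
def declared_too_big(headers, limit):
    """Return True if Content-Length announces more than `limit` bytes."""
    value = headers.get("Content-Length")
    if value is None:
        # Size not announced (for example, chunked transfer encoding):
        # the caller still has to count bytes while streaming.
        return False
    try:
        return int(value) > limit
    except ValueError:
        # Malformed header: fall back to counting bytes while streaming.
        return False
```

Because the header can be absent or wrong, this complements the streaming check rather than replacing it.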

Requests to unexpected URLs

If the client is located inside your network, it may have privileged access to internal servers not addressable from the public internet. For example, what if the caller requests a URL on one of those internal servers?

If we're letting callers request arbitrary URLs, we need to check that they are allowed to request what they are asking for.

One strategy is to prevent callers from requesting sensitive hosts using a blacklist. A blacklist checks whether the requested domain is present in a set of restricted domains. If it is, the request is rejected before it's even made. At a minimum, we'll want to blacklist internal IP addresses.

Python 3.3 added the ipaddress module to the standard library, and in Python 2 we can install a backport. Here we use it to filter requests for internal IP addresses:

    import ipaddress
    from urllib.parse import urlparse  # Python 2: `from urlparse import urlparse`

    url = ""
    hostname = urlparse(url).hostname

    # `localhost` isn't an IP address, but we probably don't want callers hitting it.
    if hostname == 'localhost':
        raise requests.exceptions.InvalidURL(url)

    # If `hostname` quacks like an IP address, make sure it isn't internal.
    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        pass
    else:
        if ip.is_loopback or ip.is_reserved or ip.is_private:
            raise requests.exceptions.InvalidURL(url)

We might extend our blacklist to include internal hostnames or other sensitive servers. Maybe we also don't want callers to call the server doing the calling. Otherwise it could be turtles all the way down.

Extra Credit: If you want to get serious you'll need to resolve the domain name of the requested resource and check whether it maps to a local IP address.
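
A rough sketch of that resolution check, using only the standard library. The helper name `resolves_to_internal` and the injectable `resolver` parameter are assumptions made for illustration and testing; lookup failures are left to propagate to the caller.

```python
import ipaddress
import socket


def resolves_to_internal(hostname, resolver=socket.getaddrinfo):
    """Return True if any address `hostname` resolves to is internal."""
    for info in resolver(hostname, None):
        # Each result is (family, type, proto, canonname, sockaddr); the
        # address string is the first element of sockaddr.
        address = info[4][0]
        # Drop an IPv6 zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(address.split("%", 1)[0])
        if ip.is_loopback or ip.is_private or ip.is_reserved or ip.is_link_local:
            return True
    return False
```

Keep in mind that the HTTP library performs its own lookup when it connects, so a name that passes this check could in principle resolve differently a moment later.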

Alternatively, if callers should only be able to request from a narrow set of servers it may be easier to use a whitelist to reject requests which aren't directed at a known host:

    from urllib.parse import urlparse  # Python 2: `from urlparse import urlparse`

    WHITELISTED_HOSTS = {"rainbows.com", "magic.com"}

    url = ""
    if urlparse(url).hostname not in WHITELISTED_HOSTS:
        raise requests.exceptions.InvalidURL(url)

Extra Credit: Depending on your needs, you might also want to restrict other parts of the HTTP request, including the protocol used, or the ports. Additionally, if you find a caller abusing the system, you might want to build a mechanism to ban them!
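
Restricting the protocol and port might look something like the following. The allowed sets and the helper name `is_allowed` are assumptions chosen for illustration; adjust them to whatever your system actually needs.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_PORTS = {80, 443}
DEFAULT_PORTS = {"http": 80, "https": 443}


def is_allowed(url):
    """Return True if `url` uses an allowed scheme and an allowed port."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError if the port is not a valid number
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    if port is None:
        # No explicit port in the URL: use the scheme's default.
        port = DEFAULT_PORTS[parts.scheme]
    return port in ALLOWED_PORTS
```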

Handling errors

So now that we've identified all these errors, what the heck should we do with them?

Logging

What broke? When? Where? Logging failures creates a trail that you can search for patterns. Logs will often give you insight about how you can further tweak your configuration to best suit your system or whether someone is abusing the system.
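
A minimal sketch of logging a failed request with Python's standard `logging` module. The logger name `http_client`, the helper `log_failure`, and the message layout are assumptions for illustration; the "when" comes from the timestamp your logging configuration adds.

```python
import logging

logger = logging.getLogger("http_client")


def log_failure(url, error, elapsed):
    """Record what broke, where, and how long we waited before it broke."""
    logger.warning(
        "request failed url=%s error=%s elapsed=%.3fs",
        url,
        type(error).__name__,
        elapsed,
    )
```

Keeping the fields in a fixed `key=value` layout makes the log easy to search for patterns later.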

Retrying

When you're firing bits around the world, sometimes you just get unlucky. Depending on what you're doing, it may make sense to just retry the request if you think the error was intermittent. requests provides an interface for creating custom adapters that can be used to implement retries:

    # Use a `Session` instance to customize how `requests` handles making HTTP requests.
    session = requests.Session()

    # `mount` a custom adapter that retries failed connections for HTTP and HTTPS requests.
    session.mount("http://", requests.adapters.HTTPAdapter(max_retries=1))
    session.mount("https://", requests.adapters.HTTPAdapter(max_retries=1))

    # Rejoice with new fault tolerant behaviour!
    session.get(url="")

Just make sure you only retry requests that are idempotent!
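
If you want more control than a retry count, for example waiting longer between attempts, a small wrapper can do it. This is a generic sketch rather than a requests feature: `retry` is a hypothetical helper, and the `sleep` parameter is there only so tests can avoid real waiting.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, retry_on=(Exception,),
          sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
```

Pass a narrow `retry_on` tuple (such as connection and timeout errors) so that genuine bugs are not silently retried, and wrap only idempotent requests.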

Notification

Finally, you'll need to raise the error to the caller. You'll want to do it in a way that makes it easy for the caller to handle all possible exceptions, but also in a way that makes it clear why the exception was raised. This is especially important if you will be displaying the error to a non-technical user and you want to provide clear instructions about whether they've mistyped the domain or the server they are trying to connect to is down. In Python, this is a great chance to read up on properly re-raising exceptions!
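
One common pattern is to wrap low-level errors in a single exception type while keeping the original as the cause. The sketch below uses Python's built-in `TimeoutError` and `ConnectionError` as stand-ins; with requests you would catch its own exception classes instead. `FetchError`, `fetch`, and the `transport` callable are hypothetical names for illustration.

```python
class FetchError(Exception):
    """The single exception type the caller needs to catch."""

    def __init__(self, url, reason):
        super().__init__("Could not fetch {}: {}".format(url, reason))
        self.url = url
        self.reason = reason


def fetch(url, transport):
    """Call `transport(url)` and translate low-level failures."""
    try:
        return transport(url)
    except TimeoutError as error:
        # `from error` keeps the original exception as `__cause__`.
        raise FetchError(url, "the server took too long to respond") from error
    except ConnectionError as error:
        raise FetchError(url, "could not connect to the server") from error
```

The caller gets one type to catch and a human-readable reason, while the full original traceback stays available for debugging.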

For Further Consideration

SSL

SSL is pretty cool and we should do more of it. The requests library verifies SSL certificates for HTTPS requests by default. If you're using a different library or language, be sure to check that your client verifies that certificates are valid. You don't want someone intercepting your connection!

    try:
        response = requests.get(url="", verify=True)
    except requests.exceptions.SSLError as e:
        print("That domain looks super sketchy.")
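
If your client is built on Python's standard `ssl` module rather than requests, one way to confirm that verification is switched on is to inspect the context it uses. A small sketch; `verifies_certificates` is a hypothetical helper name:

```python
import ssl


def verifies_certificates(context):
    """Return True if `context` checks the certificate chain and the hostname."""
    return context.verify_mode == ssl.CERT_REQUIRED and context.check_hostname


# `ssl.create_default_context()` returns a context with verification enabled.
default_context = ssl.create_default_context()
```

A check like this fits well in a unit test, so that nobody quietly disables verification later.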

Internationalized Domain Names

Internationalized domain names are a thing. Many libraries will handle these by default now, but you probably want to throw a test case in there that makes sure they work:

requests.get(url=u"") 
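
To see what such a hostname looks like on the wire, Python's built-in `idna` codec converts it to its ASCII form. A minimal sketch with a hypothetical helper `to_ascii_hostname`; note that the built-in codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a complete solution.

```python
def to_ascii_hostname(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


# "b\u00fccher" is "bücher", written with an escape to stay ASCII-only.
print(to_ascii_hostname("b\u00fccher.example"))  # prints xn--bcher-kva.example
```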

Performance

Depending on how you've built your client, there are a variety of ways you might be able to improve its performance:

  • Consider requesting compressed response content by setting the header Accept-Encoding: gzip. You'll need to make sure your client decompresses the content.
  • Consider having your client connect through a caching HTTP proxy. If you expect to be requesting the same resources again and again, the proxy's cache may considerably reduce response times for cacheable resources.
  • requests is synchronous. That means that your client will only be able to process one request at a time. If your system needs to support many concurrent requests, you might consider going async using a library built for it.
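
Short of going fully async, a thread pool is another way to keep several blocking requests in flight at once. This is a different technique from the async libraries mentioned above, sketched with the standard library; `fetch_all` and its `fetch` argument are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Run `fetch(url)` for every URL concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # `map` yields results in input order and re-raises the first
        # exception it encounters when the results are consumed.
        return list(pool.map(fetch, urls))
```

Here `fetch` would be whatever function wraps a single request, including the timeout and error handling discussed earlier.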

Tooling

There are a number of tools out there that can help simplify putting this all together:

  •  allows you to quickly test a number of different HTTP response scenarios. It's  and  endpoints are especially useful for testing weird edge conditions.
  •  is a mocking library that can make cranking out unit tests for all these errors relatively simple.

Wrapping it all up

Wow, there are a lot of ways HTTP requests can fail. TLDR, when making a request:

  • Account for DNS lookup failures
  • Set a connection and read timeout
  • Be sure to handle HTTP errors
  • Check that the response has the content type you expect
  • Limit the maximum response size
  • Ensure that private URLs are not requestable
  • Always. Be. SSLing.

Now it's your turn!

Go forth and write fault tolerant services that request data using HTTP!

Did we miss anything? Let us know in the comments below.
