2008-04-22 14:40:03

Abstract

In this article you will learn what the CURL library is, how to use it, and some of its (advanced) options.

Introduction

Sooner or later you're bound to run across a certain problem in your script: how to retrieve content from other websites. There are several methods for this, and the simplest one is probably to use the fopen() function (if it's enabled), but there aren't really a lot of options you can set when using the fopen function. What if you're building a web spider, and want to have a custom user agent? That isn't really possible with fopen, nor is it possible to define the request method (GET or POST).

That's where the CURL library comes in. This library, usually included with PHP, allows you to retrieve other pages, and also makes it possible to define dozens of different options.

In this article we'll have a look at how to use the CURL library, what it can do, and explore some of its options. But first, let's get started with the basics of CURL.

The Basics

The first step in using CURL is to create a new CURL resource, by calling the curl_init() function, like so:


<?php
// create a new curl resource
$ch = curl_init();
?>

Now that you've got a curl resource, it's possible to retrieve a URL, by first setting the URL you want to retrieve using the curl_setopt() function:


<?php
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "google.com/");
?>

After that, to get the page, call curl_exec(), which will execute the retrieval and automatically print the page:


<?php
// grab URL and pass it to the browser
curl_exec($ch);
?>

Finally, it's probably wise to close the curl resource to free up system resources. This can be done with the curl_close() function, as follows:


<?php
// close curl resource, and free up system resources
curl_close($ch);
?>

That's all there is to it, and the above code snippets together form the following working demo:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "google.nl/");

// grab URL and pass it to the browser
curl_exec($ch);

// close curl resource, and free up system resources
curl_close($ch);

?>


The only problem we have now is that the output of the page is immediately printed, but what if we want to use the output in some other way? That's no problem, as there's an option called CURLOPT_RETURNTRANSFER which, when set to TRUE, will make sure the output of the page is returned instead of printed. See the example below:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "google.nl/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// grab URL, and return output
$output = curl_exec($ch);

// close curl resource, and free up system resources
curl_close($ch);

// Replace 'Google' with 'PHPit'
$output = str_replace('Google', 'PHPit', $output);

// Print output
echo $output;

?>
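
One thing the example above doesn't cover is a failed request. curl_exec() returns false when the transfer fails, and curl_error() gives a readable description of what went wrong. Below is a minimal sketch of how that check might look; fetch_page() is a hypothetical helper name chosen for illustration and is not part of the curl extension.

```php
<?php

// fetch_page() is a hypothetical helper, not part of the curl extension.
// It returns the page body as a string, or null when the transfer fails.
function fetch_page(string $url, ?string &$error = null): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $output = curl_exec($ch);

    if ($output === false) {
        // curl_error() describes the last error on this handle
        $error = curl_error($ch);
        $output = null;
    }

    curl_close($ch);

    return $output;
}
```

Calling fetch_page('google.nl/', $error) would then give you either the page as a string, or null with the reason stored in $error.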


In the previous two examples you might've noticed we used the curl_setopt() function to define how the page should be retrieved, and that's where the real power of curl lies. By setting all kinds of different options, pretty much anything is possible, so let's have a look at that a bit more.

What's possible with the curl options

If you have a look at the manual for the curl_setopt() function you'll notice there's a huge list of different options. Let's go through the most interesting.

The first interesting option is CURLOPT_FOLLOWLOCATION. When this is set to true, curl will automatically follow any redirect it gets sent. For example, when you try to retrieve a PHP page, and the PHP page uses header("Location: "), curl will automatically follow it. The example below demonstrates this:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "google.com/");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// grab URL, and print
curl_exec($ch);

?>


If Google decides to send a redirect, the example above will now follow to the new location. Two options that are related to this are the CURLOPT_MAXREDIRS and CURLOPT_AUTOREFERER options.

The CURLOPT_MAXREDIRS option sets the maximum number of redirects curl will follow; once that limit is reached, curl stops and reports an error instead of following any further. If the CURLOPT_AUTOREFERER option is set to TRUE, curl will automatically set the Referer header each time it follows a redirect. Not that important really, but it could be useful in certain cases.
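
To see the redirect options working together, here is a rough sketch that collects them in one array and applies them with curl_setopt_array(). The helper name redirect_options() and the limit of 5 redirects are assumptions made for this illustration only.

```php
<?php

// redirect_options() is a hypothetical helper; the default limit of 5
// redirects is an arbitrary choice for this sketch.
function redirect_options(string $url, int $maxRedirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many redirects
        CURLOPT_AUTOREFERER    => true,          // set Referer: on each followed redirect
    ];
}

$ch = curl_init();

// curl_setopt_array() applies several options in a single call
curl_setopt_array($ch, redirect_options('google.com/'));
```

After running curl_exec($ch), curl_getinfo($ch, CURLINFO_REDIRECT_COUNT) tells you how many redirects were actually followed, and CURLINFO_EFFECTIVE_URL gives the address curl ended up at.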

Next up is the CURLOPT_POST option. This is a very useful option, as it allows you to do POST requests instead of GET requests, which means you can submit forms to other pages without having to actually fill in the form. The example below demonstrates what I mean:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "%20curl%20php/demos/handle_form.php");

// Do a POST
$data = array('name' => 'Dennis', 'surname' => 'Pallett');

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);

// grab URL, and print
curl_exec($ch);

?>


And the handle_form.php file:

<?php

echo '<p>Form variables I received:</p>';

echo '<pre>';
print_r($_POST);
echo '</pre>';

?>

As you can see this makes it really easy to submit forms, and it's a great way to test all your forms, without having to fill them in all the time.
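
One detail worth knowing here: when CURLOPT_POSTFIELDS is given an array, curl sends the data as multipart/form-data, whereas a URL-encoded string is sent as application/x-www-form-urlencoded, which is what an ordinary HTML form submits. The sketch below builds such a string with http_build_query(). The localhost URL is a placeholder assumed for this illustration, not the demo address used in this article.

```php
<?php

$data = array('name' => 'Dennis', 'surname' => 'Pallett');

// http_build_query() URL-encodes the array into a single query string
$body = http_build_query($data);

$ch = curl_init();

// Placeholder URL (an assumption); point this at your own form handler
curl_setopt($ch, CURLOPT_URL, 'http://localhost/handle_form.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// $response = curl_exec($ch); // uncomment to actually send the request
```

If the receiving script only needs the values in $_POST, either form works; the string form simply mimics a normal browser submission more closely.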

The CURLOPT_CONNECTTIMEOUT option sets how long, in seconds, curl should wait while trying to connect. This is a very important option: set it too low and requests will fail on slow connections, but set it too high (e.g. 1000, or 0 for unlimited) and an unresponsive server can leave your PHP script hanging. A related option is CURLOPT_TIMEOUT, which sets the maximum number of seconds a curl request is allowed to take in total. If you set this to a low value, slow pages may be cut off before they have finished downloading.
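
As a rough illustration, the two timeouts might be set and checked like this. The helper names timeout_options() and timed_out(), and the values of 5 and 15 seconds, are assumptions picked for the sketch; suitable values depend on the sites you're retrieving.

```php
<?php

// Both values are in seconds; 5 and 15 are arbitrary illustrative choices.
function timeout_options(int $connectSeconds = 5, int $totalSeconds = 15): array
{
    return [
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // max time to establish the connection
        CURLOPT_TIMEOUT        => $totalSeconds,   // max time for the whole request
        CURLOPT_RETURNTRANSFER => true,
    ];
}

// Returns true when the last transfer on $ch was aborted by a timeout.
function timed_out($ch): bool
{
    return curl_errno($ch) === CURLE_OPERATION_TIMEDOUT;
}

$ch = curl_init('google.com/');
curl_setopt_array($ch, timeout_options());
```

After curl_exec($ch) returns false, timed_out($ch) lets you tell a timeout apart from other failures, for example to retry with a longer limit.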

The final interesting option is the CURLOPT_USERAGENT option, which allows you to set the user agent of the request. This makes it possible to create your own web spiders, with their own user agent, like so:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "useragent.org/");
curl_setopt($ch, CURLOPT_USERAGENT, 'My custom web spider/0.1');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// grab URL, and print
curl_exec($ch);

?>


Now that we've had most of the interesting options, let's have a look at the curl_getinfo() function and what it can do for us.

Getting info about the page

The curl_getinfo() function is used to get all kinds of information about the page that was retrieved and the request itself. You can either specify what information you want by setting the second argument, or you can simply leave the second argument out and get an associative array with every detail. The example below demonstrates this:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "google.com");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FILETIME, true);

// grab URL
$output = curl_exec($ch);

// Print info
echo '<pre>';
print_r(curl_getinfo($ch));
echo '</pre>';

?>


Most of the information returned is about the request itself, like the amount of time it took and the HTTP status code, but there's also some information on the page, like the content-type and last modified time (only if you explicitly ask for the last modified time with CURLOPT_FILETIME, like I did in the example).
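
To give an idea of how the returned array can be used, here is a small sketch that picks out a few of its keys and turns them into a one-line summary. The helper name summarise_info() is made up for this illustration; the array keys ('url', 'http_code', 'content_type', 'total_time' and 'filetime') are the ones curl_getinfo() returns.

```php
<?php

// summarise_info() is a hypothetical helper that formats a few keys from
// the associative array returned by curl_getinfo($ch).
function summarise_info(array $info): string
{
    // 'filetime' is -1 when the server did not report a modification time
    $modified = $info['filetime'] >= 0
        ? gmdate('Y-m-d H:i:s', $info['filetime']) . ' UTC'
        : 'unknown';

    return sprintf(
        '%s returned HTTP %d (%s) in %.2F seconds, last modified: %s',
        $info['url'],
        $info['http_code'],
        $info['content_type'],
        $info['total_time'],
        $modified
    );
}
```

After a call to curl_exec($ch), echo summarise_info(curl_getinfo($ch)); would print the summary for that request.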

That's all about curl_getinfo(), so let's have a look at some practical uses now.

Practical uses

The first useful thing the curl library could be used for is checking whether a page really exists. To do this, we first have to retrieve the page, and then check the response code (404=not found, and thus it doesn't exist). See the example below:

<?php

// create a new curl resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "google.com/does/not/exist");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// grab URL
$output = curl_exec($ch);

// Get response code
$response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

// Not found?
if ($response_code === 404) {
        echo 'Page doesn\'t exist';
} else {
        echo $output;
}

?>
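
If you only care whether the page exists and don't need its contents, a lighter variation is to send a HEAD request with the CURLOPT_NOBODY option, so the body is never downloaded. The sketch below is one possible way to wrap that up. The function name page_exists(), the timeout values, and the choice to count only 2xx and 3xx responses as "exists" are all assumptions made for this illustration.

```php
<?php

// page_exists() is a hypothetical helper. It sends a HEAD request
// (CURLOPT_NOBODY) so only the headers are transferred.
function page_exists(string $url): bool
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD instead of GET
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // judge the final page, not the redirect
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // arbitrary illustrative limits
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $ok   = curl_exec($ch) !== false;
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Transfer failures and any 4xx/5xx status count as "does not exist"
    return $ok && $code >= 200 && $code < 400;
}
```

Keep in mind that some servers don't answer HEAD requests properly, so a false result may be worth confirming with a normal GET request.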


Another possibility is to create an automatic link checker, which retrieves a page, checks whether each link on it works (using the code above), and then retrieves each linked page and repeats the process.
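
The part of a link checker that curl doesn't do for you is finding the links in the page. As a rough sketch, and assuming PHP's DOM extension is available, that step could look like the following; extract_links() is a hypothetical helper name, not an existing function.

```php
<?php

// extract_links() is a hypothetical helper: it returns the href of every
// <a> element found in an HTML string, without duplicates.
function extract_links(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so silence the parser warnings
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return array_values(array_unique($links));
}
```

Each returned link can then be run through the response-code check shown earlier. Relative links such as /about would first have to be turned into full URLs, which this sketch leaves out.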

Curl also makes it possible to write your own web spider, similar to Google's web spider, or any other web spider. This article isn't about writing a web spider, so I won't talk about it any further, but a future article on PHPit will show you exactly how to create your own web spider.

Conclusion

In this article I've shown how to use the CURL library, and taken you through most of its options.

For most basic tasks, like simply getting a page, you probably won't need the curl library, since PHP comes with inbuilt support for remote pages. But as soon as you want to do anything slightly advanced, you're probably going to want to use the curl library.

In the near-future I will show you exactly how to build your own web spider, similar to Google's web spider, so stay tuned to PHPit.

If you have any questions or comments on this article, feel free to leave them below.
