Cobrawgl's Blog

Using cURL to Grab Data

2009-04-25 07:21:41

Using cURL is a simple and effective way to gather data from another website, run it through a script, parse it, and transform it into something you can use on your own site. Whether you are “scraping” data to build a summary of a link, pulling an XML file to parse into a database, or simply fetching the contents of a file, cURL lets you pull data from an outside source into your page.

Making sure cURL is enabled and set up
First things first: you need to make sure cURL is enabled on your web host. The easiest way to do this is to check the phpinfo() output on your server. Deploy a PHP file with the following contents to your server, and name it whatever you want.

<?php
phpinfo();
?>

After the file is uploaded or saved onto your web server, open it in your browser and look through the output for a section headed “curl” that reports cURL support as enabled.

If the phpinfo() output doesn’t have this section, or anything similar to it, then your hosting service may not support cURL, or it may not be enabled. If you are on a hosting service, you can ask your host to enable it for you; if you are on your own server, you can modify your php.ini file to enable the extension.
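
If you would rather get a direct answer than read through the whole phpinfo() page, a minimal sketch like the one below may help. It relies only on PHP's built-in extension_loaded() and curl_version(); the function name curlStatus is just an illustrative choice and is not part of the original article.

```php
<?php
// Hypothetical helper: report whether the cURL extension is available
// to the PHP runtime that executes this script.
function curlStatus(): string
{
    if (!extension_loaded('curl')) {
        return 'cURL extension is NOT loaded';
    }
    // curl_version() returns an array describing the linked libcurl
    $info = curl_version();
    return 'cURL extension is loaded (libcurl ' . $info['version'] . ')';
}

echo curlStatus(), PHP_EOL;
```

Keep in mind that the command-line PHP and the PHP used by your web server can load different configurations, so it is safest to run this through the web server.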

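You can modify your php.ini file as follows (if you can’t find it, look near the top of the phpinfo() output from the script above; the “Loaded Configuration File” row gives the path). The php_curl.dll line shown here is the Windows form; on other platforms the extension is usually installed through the system’s package manager instead.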

// Find this line in your php.ini
;extension=php_curl.dll
// Remove the semi-colon in front, to make the line look like this:
extension=php_curl.dll
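
As a rough sketch of the same check done in code, the snippet below locates the loaded php.ini with php_ini_loaded_file() and reports whether the cURL extension line is active, commented out, or absent. The helper name curlIniStatus and the pattern it matches are assumptions made for illustration; installations that load extensions from additional .ini files would not be covered by it.

```php
<?php
// Hypothetical helper: classify the cURL extension line found in php.ini text.
// Returns 'enabled', 'commented', or 'missing'.
function curlIniStatus(string $iniText): string
{
    $status = 'missing';
    $pattern = '/^\s*(;?)\s*extension\s*=\s*"?(php_)?curl(\.dll|\.so)?"?\s*$/i';
    foreach (preg_split('/\R/', $iniText) as $line) {
        if (preg_match($pattern, $line, $m)) {
            if ($m[1] === '') {
                return 'enabled';   // an active line wins
            }
            $status = 'commented';  // line exists but starts with ';'
        }
    }
    return $status;
}

// php_ini_loaded_file() returns false when no php.ini is loaded
$iniPath = php_ini_loaded_file();
if ($iniPath !== false) {
    echo $iniPath, ': ', curlIniStatus((string) file_get_contents($iniPath)), PHP_EOL;
} else {
    echo 'No php.ini file is loaded', PHP_EOL;
}
```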

After modifying and saving your php.ini file, you are going to have to restart your web service.

- If you are running on Apache, you should be able to restart it with a simple “apachectl restart” command.

- If you are running an IIS web server, you are going to have to restart IIS or just recycle the application pool that is running your PHP. This can be done through the MMC IIS Snap-In.

- If you are running WAMP on your local machine, simply right-click on the WAMP icon in your system tray, find the Apache menu, and click “Restart”.

Just make sure you go back to the page running phpinfo() to confirm that cURL now shows up in the output. If not, you may want to seek additional support from your IT department, co-workers, or web hosting provider to find out why cURL will not function on your server.

Assuming everything is running now, and cURL is enabled, we will continue onwards.

A simple cURL Request

cURL isn’t incredibly hard to use to pull the data in, as illustrated below.

<?php
// Init $curl as a cURL handle
$curl = curl_init();
// Tell cURL what URL we are going after
curl_setopt($curl, CURLOPT_URL, '');
// Tell cURL we would like headers as well
curl_setopt($curl, CURLOPT_HEADER, 1);
// Tell cURL we would like the results as a string instead of just dumping them on the screen
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
// Execute the cURL request; $data is false if the request failed
$data = curl_exec($curl);
// Close the cURL handle
curl_close($curl);
// Display the data from the variable to ensure it's there
var_dump($data);


The code above requests the URL set with CURLOPT_URL and stores the response in $data: the HTTP headers (because CURLOPT_HEADER is turned on) followed by the HTML contents of the page. The var_dump($data) at the end merely spits it back out onto your screen so you can see the data you have to work with.
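
Because the headers and the page arrive together in one string, it can help to check for failure and then separate the two before doing anything else. The following is a minimal sketch under stated assumptions: splitResponse and fetchWithHeaders are hypothetical names, and the split relies on the header size that cURL reports through CURLINFO_HEADER_SIZE.

```php
<?php
// Hypothetical helper: split a raw response into header text and body,
// given the header size (in bytes) reported by cURL.
function splitResponse(string $raw, int $headerSize): array
{
    return [
        'headers' => rtrim(substr($raw, 0, $headerSize)),
        'body'    => (string) substr($raw, $headerSize),
    ];
}

// Hypothetical wrapper around the same request, with a failure check.
function fetchWithHeaders(string $url): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    $raw = curl_exec($curl);
    if ($raw === false) {
        // curl_error() describes what went wrong on this handle
        $error = curl_error($curl);
        curl_close($curl);
        return ['ok' => false, 'error' => $error];
    }

    $status     = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    $headerSize = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    curl_close($curl);

    return ['ok' => true, 'status' => $status] + splitResponse($raw, $headerSize);
}
```

Calling fetchWithHeaders() with the URL you are after returns an array whose 'ok' entry tells you whether 'status', 'headers', and 'body' are available or whether 'error' holds a message instead.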

Now, what you end up doing with this data is up to you! You could run it through some regular expressions to pull out relevant information, parse it line by line and store certain portions somewhere, or, if you are pulling an XML file, begin to parse the XML. Since this article is just about cURL, we won’t get into that.

Using a cURL Request Object

This is a bit more on the advanced side, but if you want an object to handle all your requests for you, I’ve pulled one out of my code library that you may find useful.


<?php
class curlHandler {
    public $url = '';
    public $output = '';
    public $curl = null;

    // Set up the handle and run the request right away
    function __construct($url) {
        $this->curl = curl_init();
        // url() stores the URL and sets CURLOPT_URL on the handle
        $this->url($url);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
        // Send a browser User-Agent string with the request
        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
        // The response (or false on failure) ends up in $this->output
        $this->output = curl_exec($this->curl);
    }

    // Close the handle when the object is destroyed
    function __destruct() {
        curl_close($this->curl);
    }

    function url($url) {
        $this->url = $url;
        curl_setopt($this->curl, CURLOPT_URL, $url);
    }
}

// Init the object and do the request; the handle is closed when the object is destroyed
$curlHandler = new curlHandler("");

// Display what we've found
var_dump($curlHandler->output);
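
The class above runs its single request inside the constructor. If you need to fetch several pages, one possible variation is to keep the handle open and expose a separate method per request. The sketch below is only an illustration of that idea: ReusableCurl, fetch(), and lastError() are hypothetical names and are not taken from the original code library.

```php
<?php
// Hypothetical variant: one cURL handle reused for many requests.
class ReusableCurl
{
    private $curl;

    public function __construct(string $userAgent)
    {
        $this->curl = curl_init();
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($this->curl, CURLOPT_USERAGENT, $userAgent);
    }

    // Fetch one URL; returns the response body, or null if the request failed.
    public function fetch(string $url): ?string
    {
        curl_setopt($this->curl, CURLOPT_URL, $url);
        $output = curl_exec($this->curl);
        return $output === false ? null : $output;
    }

    // Error message for the most recent request ('' if there was none).
    public function lastError(): string
    {
        return curl_error($this->curl);
    }

    // Close the handle when the object is destroyed.
    public function __destruct()
    {
        curl_close($this->curl);
    }
}
```

With this shape, you create the object once with the User-Agent string you want, call fetch() for each URL, and check lastError() whenever fetch() returns null.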

Gathering data this way is pretty simple when you know what you are passing in. Notice that the class above passes a Firefox browser string (the User-Agent) into the cURL request. Why? Some websites try to block cURL or other automated requests (such as the World of Warcraft Armory, which is what I was scraping), so by mimicking a browser, we can get past these obstacles.
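
The User-Agent is only one of the things a browser sends. As a hedged sketch of taking the idea a little further, the helper below collects a few browser-like options into one array and applies them with curl_setopt_array(). The function names and the specific header values are assumptions chosen for illustration; whether a given site accepts them is not something this sketch can promise.

```php
<?php
// Hypothetical helper: build cURL options that resemble a browser request.
function browserLikeOptions(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the response as a string
        CURLOPT_USERAGENT      => $userAgent,  // browser identification string
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects like a browser would
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.8',
        ],
    ];
}

// Hypothetical wrapper: returns the response string, or false on failure.
function fetchAsBrowser(string $url, string $userAgent)
{
    $curl = curl_init();
    curl_setopt_array($curl, browserLikeOptions($url, $userAgent));
    $output = curl_exec($curl);
    curl_close($curl);
    return $output;
}
```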
What you do with all of this newfound data is up to you. Eventually I will write a post about parsing the data you find, but that is for another day.
