Writing an HTML Scraper in PHP
Sometimes we want to extract the HTML content of a page on a remote website; this technique is called HTML scraping. This article discusses how to fetch a remote web page and pull data out of its HTML.
HTML scraping is a two-step operation:
- Call the remote web page and retrieve its HTML content.
- Match the HTML tags of interest using regular expressions.
Calling a Remote Web Page Using PHP:
PHP offers several ways to call a remote web page. Here we will use cURL.
$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);
$url holds the remote URL you want to connect to, and $file_contents holds the HTML content of the page that was fetched. If the request fails, curl_exec() returns false, so check for that before parsing.
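The cURL call above can be wrapped in a small function so that a failed request is easy to detect. This is a minimal sketch, not part of the original article: the name fetch_html and the choice to return null on failure are assumptions made for illustration.

```php
<?php
// Hypothetical helper (the name and null-on-failure convention are assumptions):
// fetch a URL with cURL and return the body, or null if the request failed.
function fetch_html(string $url, int $timeout = 5): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds to wait while connecting

    $body = curl_exec($ch); // string on success, false on failure

    // The handle is released automatically when $ch goes out of scope.
    return $body === false ? null : $body;
}
```

A caller would then write $file_contents = fetch_html($url); and skip the parsing step whenever the result is null.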
Matching HTML Tags with Regular Expressions in PHP:
Here we use preg_match and preg_match_all to read HTML tags from the source. The regular expressions below extract the content inside particular tags.
Extracting data from HTML tags
preg_match_all('/<span>[\/\(\)-:<>\w\s]+<\/span>/', $file_contents, $htmlContent);
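Assume that $file_contents holds the HTML source. After the preg_match_all call above, $htmlContent[0] contains the span elements that matched. The character class only permits word characters, whitespace and a few punctuation characters, so a span whose text contains anything else (a quotation mark or an ampersand, for example) is skipped. Because the class also allows < and >, adjacent spans separated only by permitted characters come back as a single match. To extract a different element, replace span in the pattern with that tag name.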
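To see the matches, loop over the result array and print each entry. The sketch below is a variant rather than the pattern above: it swaps the character class for a lazy capture group, (.*?), so that each span is returned separately and any text is accepted. The sample markup is an assumption standing in for a fetched page.

```php
<?php
// Sample markup standing in for the fetched page source (an assumption for this sketch).
$file_contents = '<p><span>First item</span> and <span>Second item</span></p>';

// (.*?) is lazy, so each match stops at the nearest </span>.
// The "s" modifier lets "." match newlines; "i" ignores the case of the tag name.
preg_match_all('#<span>(.*?)</span>#is', $file_contents, $htmlContent);

// $htmlContent[0] holds the full matches, $htmlContent[1] the captured inner text.
foreach ($htmlContent[1] as $text) {
    echo trim($text), PHP_EOL;
}
```

This prints one line per span. Nested spans are not handled, which is a general limit of matching HTML with regular expressions.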
preg_match_all('/<span class="test">[\/\(\)-:<>\w\s]+<\/span>/', $file_contents, $htmlContent);
Assume that $file_contents holds the HTML source. After the preg_match_all call above, $htmlContent[0] contains only the span elements whose opening tag is exactly <span class="test">. Spans without that class, or with additional attributes, are not matched.
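If the same kind of match is needed for several tags or classes, the pattern can be built from parameters. The helper below is a hypothetical sketch (extract_by_class is not a PHP built-in and does not come from the article). It assumes the class attribute is written with double quotes and holds exactly one class name, and it tolerates other attributes on the tag.

```php
<?php
// Hypothetical helper: return the inner contents of every <$tag> element
// whose class attribute is exactly $class (double-quoted, single class name).
function extract_by_class(string $html, string $tag, string $class): array
{
    $t = preg_quote($tag, '#');
    $c = preg_quote($class, '#');

    // [^>]* skips any other attributes; \s requires whitespace before "class"
    // so that attributes such as data-class are not matched by mistake.
    $pattern = '#<' . $t . '\b[^>]*\sclass="' . $c . '"[^>]*>(.*?)</' . $t . '>#is';

    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Sample markup (an assumption for this sketch).
$html = '<span class="test">A</span><span>B</span><span id="x" class="test">C</span>';
$found = extract_by_class($html, 'span', 'test');
```

With the sample markup, $found holds the contents of the first and third spans and leaves out the span that has no class.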
preg_match('%<table class="test"[^>]*>.*?</table>%is', $file_contents, $htmlContent);
Assume that $file_contents holds the HTML source. The preg_match call above extracts the first table whose opening tag starts with <table class="test" and stores it in $htmlContent[0]. The s modifier lets . match newlines, which matters because a table usually spans several lines; without it the pattern fails on most real pages. Now that we have the table markup, we can extract the data inside its td tags.
preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $htmlContent[0], $td_matches);
Here we pass the extracted table markup to preg_match_all, which reads every td element inside it, with or without attributes. $td_matches[0] holds the full td elements and $td_matches[1] holds the content captured between each pair of td tags.
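Putting the table and td steps together, the cells can be grouped by row and printed. This is a sketch under stated assumptions: the sample table is invented for illustration, and the extra tr pass and the strip_tags call are additions that the article does not describe.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$file_contents = '<table class="test">' . "\n"
    . '  <tr><td>Alice</td><td align="right">30</td></tr>' . "\n"
    . '  <tr><td>Bob</td><td align="right">25</td></tr>' . "\n"
    . '</table>';

$rows = [];

// Step 1: grab the contents of the first table with class="test".
if (preg_match('%<table class="test"[^>]*>(.*?)</table>%is', $file_contents, $table)) {
    // Step 2: split the table body into rows.
    preg_match_all('#<tr\b[^>]*>(.*?)</tr>#is', $table[1], $tr_matches);

    foreach ($tr_matches[1] as $tr) {
        // Step 3: collect the cells of this row and reduce them to plain text.
        preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $tr, $td_matches);
        $rows[] = array_map(function ($cell) {
            return trim(strip_tags($cell));
        }, $td_matches[1]);
    }
}

print_r($rows);
```

After this runs, $rows holds one array per table row, each containing that row's cell text. A table nested inside another table would break the matching, so this approach suits simple, flat tables.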
GREAT Article!
I am gonna use it in my next project.
Maybe you just need phpQuery to scrape:
http://code.google.com/p/phpquery/
How will it be printed? Please give the function!
Hi,
It's not working at all. I mean, it's not returning any value after preg_match, even when I pass this URL: http://www.nfl.com/teams/sandiegochargers/roster?team=SD.