Writing an HTML Scraper in PHP

Sometimes we want to extract the HTML content of a remote web page; this technique is called HTML scraping. This article discusses how to extract the HTML content of a remote web page in PHP.

HTML scraping can be done in two steps:

  • Call the remote web page and fetch its HTML content.
  • Match the HTML tags using regular expressions.

Calling a Remote Web Page using PHP:
PHP offers several ways to call a remote web page. Here we will use cURL.

$ch = curl_init();
$timeout = 5; // connection timeout in seconds; 0 waits indefinitely
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch); // false on failure
curl_close($ch);

$url holds the remote URL you want to connect to, and $file_contents holds the HTML content of the remote web page. If the request fails, curl_exec returns false instead.
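
The request can also be wrapped in a small function that checks for failure. The following is a minimal sketch rather than a definitive implementation: the function name fetchHtml, its parameters, the overall transfer timeout, and the choice to follow redirects are all assumptions made for illustration.

```php
<?php
/**
 * Fetch the HTML of a remote page with cURL.
 * fetchHtml() is a hypothetical helper name, not part of PHP.
 * Returns the response body as a string, or null on failure.
 */
function fetchHtml($url, $timeout = 5)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds to wait while connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);    // assumed cap on the whole transfer
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // assumption: follow HTTP redirects

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...)
        error_log('cURL error: ' . curl_error($ch));
        $body = null;
    }

    // The handle is released when $ch goes out of scope at the end of the function.
    return $body;
}
```

With this helper, $file_contents = fetchHtml($url); gives either the page source or null, so the matching step can be skipped when the download did not succeed.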

Matching HTML tags using regular expressions in PHP:
Here we use preg_match and preg_match_all to read HTML tags from the HTML source. The following regular expressions extract the content inside HTML tags.

Extracting data from HTML tags

    preg_match_all('#<span>(.*?)</span>#is', $file_contents, $htmlContent);

Assume that $file_contents holds the HTML source code. After executing the above preg_match_all call, $htmlContent[0] holds every span element found in the source. To extract data from a different tag, replace span in the pattern with that tag's name.
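
Swapping the tag name by hand can be turned into a reusable function. This is a sketch under assumptions: extractTagContents is a hypothetical helper name, the sample markup is invented, and the pattern allows attributes on the opening tag. It does not handle elements of the same tag nested inside each other.

```php
<?php
/**
 * Return the inner content of every <$tag>...</$tag> element in $html.
 * extractTagContents() is a hypothetical helper name used for illustration.
 */
function extractTagContents($html, $tag)
{
    // preg_quote() escapes the tag name and the '#' delimiter.
    $name = preg_quote($tag, '#');
    // \b stops "p" from also matching "<pre>"; [^>]* allows attributes;
    // (.*?) is non-greedy so each element is matched separately.
    $pattern = '#<' . $name . '\b[^>]*>(.*?)</' . $name . '>#is';

    preg_match_all($pattern, $html, $matches);

    return $matches[1]; // index 1 holds the captured inner content
}

// Sample markup, invented for this example.
$file_contents = '<p>Intro</p><span>First</span> <span>Second</span>';

// Print each span's content on its own line.
foreach (extractTagContents($file_contents, 'span') as $text) {
    echo $text, "\n";
}
```

The loop at the end also shows one way to print the results: preg_match_all fills an array, so the matches can be walked with foreach or dumped with print_r.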

    preg_match_all('#<span class="test">(.*?)</span>#is', $file_contents, $htmlContent);

Assume that $file_contents holds the HTML source code. After executing the above preg_match_all call, $htmlContent[0] holds every span element whose opening tag is exactly <span class="test">. Span tags without that class attribute are skipped.
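
Real pages often put other attributes next to class, or list several classes in one attribute, and an exact-match pattern misses those. A more tolerant variant might look like the sketch below. The helper name extractSpansByClass and the sample markup are assumptions for illustration, and the pattern only understands double-quoted class attributes.

```php
<?php
/**
 * Return the inner content of <span> elements whose class list includes $class.
 * extractSpansByClass() is a hypothetical helper name used for illustration.
 */
function extractSpansByClass($html, $class)
{
    $c = preg_quote($class, '#');
    // class="..." may appear anywhere in the opening tag and may list
    // several classes; (?<![\w-]) and (?![\w-]) keep "test" from
    // matching inside "testing" or "my-test".
    $pattern = '#<span\b[^>]*\sclass="[^"]*(?<![\w-])' . $c
             . '(?![\w-])[^"]*"[^>]*>(.*?)</span>#is';

    preg_match_all($pattern, $html, $matches);

    return $matches[1];
}

// Sample markup, invented for this example.
$file_contents = '<span class="test">A</span>'
    . '<span id="x" class="big test">B</span>'
    . '<span class="testing">C</span>'
    . '<span>D</span>';

// Only the first two spans carry the class "test".
print_r(extractSpansByClass($file_contents, 'test'));
```

Here the returned array contains A and B; the span with class "testing" and the span without a class are left out.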

    preg_match('%<table class="test"[^>]*>.*?</table>%is', $file_contents, $htmlContent);

Assume that $file_contents holds the HTML source code. After executing the above preg_match call, $htmlContent[0] holds the first table element whose opening tag begins with <table class="test"; preg_match stops after the first match. Now that we have the table markup, we can extract the data inside its td tags.

    preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $htmlContent[0], $td_matches);

Here we pass the extracted table markup to preg_match_all, which reads the data inside every td tag. The cell contents end up in $td_matches[1].
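
Put together, the table step and the td step could look like the following sketch. The sample markup is invented, and the patterns are assumptions chosen to tolerate extra attributes and line breaks; cleaning the cells with strip_tags and trim is an optional extra, not something the steps above require.

```php
<?php
// Sample markup, invented for this example.
$file_contents = <<<HTML
<table class="test" border="1">
  <tr><td class="name">Alice</td><td>30</td></tr>
  <tr><td class="name">Bob</td><td>25</td></tr>
</table>
HTML;

$cells = [];

// Step 1: isolate the first table whose opening tag starts with class="test".
if (preg_match('%<table class="test"[^>]*>.*?</table>%is', $file_contents, $htmlContent)) {
    // Step 2: capture the inner content of every cell in that table.
    preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $htmlContent[0], $td_matches);

    foreach ($td_matches[1] as $cell) {
        // strip_tags() drops nested markup; trim() removes surrounding whitespace.
        $cells[] = trim(strip_tags($cell));
    }
}

print_r($cells);
```

For this sample, $cells ends up holding Alice, 30, Bob and 25 in that order. If no matching table is found, $cells stays empty, which is worth checking before using the result.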


Categories: PHP
  1. August 22nd, 2009 at 17:03 | #1

    GREAT Article!

    I am gonna use it in my next project.

  2. tung
    October 26th, 2009 at 20:17 | #2

    maybe you just need some phpquery to scrape
    http://code.google.com/p/phpquery/

  3. saad
    June 28th, 2010 at 14:21 | #3

    How will it be printed? please give the func!!

  4. August 4th, 2010 at 03:15 | #4

    Hi,
    It's not working at all. I mean it's not returning any value after preg_match, even though I am passing this url http://www.nfl.com/teams/sandiegochargers/roster?team=SD.
