Coding with Jesse

Easy web scraping with PHP

February 17th, 2008

Web scraping is a technique of web development where you load a web page and "scrape" the data off the page to be used elsewhere. It's not pretty, but sometimes scraping is the only way to access data or content from a web site that doesn't provide RSS or an open API.

I'm not going to discuss the legal aspects of scraping, except to note that it may be considered copyright infringement in some situations. However, there are also perfectly legitimate reasons to scrape, such as when you have permission.

To make things really easy, we're going to let the power of regular expressions do all the work for us. If you're not familiar with regular expressions, you may want to google for a tutorial. Here is the documentation for PHP regular expression syntax.

First, we start off by loading the HTML using file_get_contents. Next, we use preg_match_all with a regular expression to turn the data on the page into a PHP array.

This example will demonstrate scraping this web site's blog page to extract the most recent blog posts. This is just for demo purposes - of course, the RSS feed is much better suited for this.

// get the HTML
$html = file_get_contents("http://www.thefutureoftheweb.com/blog/");

Here is what the HTML looks like for the blog posts:

<ul id="main">
    <li>
        <h1><a href="[link]">[title]</a></h1>
        <span class="date">[date]</span>
        <div class="section">
            [content]
        </div>
    </li>
</ul>

So we will use a regular expression that looks for all the li elements and captures the content using parentheses at the appropriate places (link, title, date & content).

preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];
    $date = $post[3];
    $content = $post[4];

    // do something with data
}
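To try the pattern without hitting the network, here is a minimal sketch that runs the same regular expression against an inline sample. The sample posts, links and dates are made up for illustration, and the $results array is just one assumed way to collect the data.

```php
<?php
// Made-up markup standing in for the downloaded page.
$html = <<<HTML
<ul id="main">
    <li>
        <h1><a href="/blog/first">First post</a></h1>
        <span class="date">February 1st, 2008</span>
        <div class="section">
            Hello world
        </div>
    </li>
    <li>
        <h1><a href="/blog/second">Second post</a></h1>
        <span class="date">February 2nd, 2008</span>
        <div class="section">
            More text
        </div>
    </li>
</ul>
HTML;

// Same pattern as in the post.
preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts,
    PREG_SET_ORDER
);

// Collect each post as a small associative array.
$results = [];
foreach ($posts as $post) {
    $results[] = [
        'link'    => $post[1],
        'title'   => $post[2],
        'date'    => $post[3],
        'content' => trim($post[4]), // strip the indentation around the text
    ];
}

print_r($results);
```

The trim() call is there because the captured content includes the whitespace between the div tags and the text.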

There's a lot going on inside that regular expression, but there are really only a few "tricks" that are used. Anytime I want to say "skip over whatever is between" I use .*?. And any time I want to say "match whatever is in here" I use (.*?). And lastly, the s at the end tells PHP to allow the dot . to match newlines. That's about all there is to it.
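Those tricks are easier to see on a tiny string. This is only a sketch with a made-up two-line input; it compares the lazy .*? with the greedy .* and shows what the s modifier changes.

```php
<?php
// Two bold tags separated by a newline (made-up input).
$html = "<b>one</b>\n<b>two</b>";

// Lazy: .*? stops at the first closing tag it can reach.
preg_match('/<b>(.*?)<\/b>/', $html, $lazy);

// Greedy: .* runs on to the last closing tag.
// The s modifier lets the dot cross the newline to get there.
preg_match('/<b>(.*)<\/b>/s', $html, $greedy);

// Without s the dot cannot match the newline, so this finds nothing...
$withoutS = preg_match('/one.*two/', $html);
// ...and with s it matches.
$withS = preg_match('/one.*two/s', $html);

var_dump($lazy[1], $greedy[1], $withoutS, $withS);
```

This is why the scraping pattern uses .*? everywhere: a greedy .* would swallow everything up to the last post on the page.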

The regular expression will only match blog posts, because they are the only <li> elements that contain an <h1>, <span class="date"> and <div class="section">.

Web scraping is highly unreliable - if the HTML structure were to change this code would break instantly. However, it's often quite easy to write this code, and usually produces a perfectly usable hack solution.


Comments

1 . Perry on March 17th, 2008

This is a perfect tutorial for scraping, thanks, it's a big help!

2 . Alan on May 20th, 2008

Nice article! I was originally planning to write a small scraper for my web app in PHP or RoR, but then I came across Feedity ( http://feedity.com ) which made things a lot easier. Feedity generates custom RSS feeds from webpages, and now I just consume the resulting RSS feed in my application. Simple and straight! Check it out sometime!

3 . Tristan on May 29th, 2008

Hi, thanks a lot for the information, it has really helped with scraping information from a video site. For future reference, other users will want to fix this:

$link = $post[1]

swap with

$link = $post[1];

Lastly, I was going to ask for your help: if I wanted to use this to get the body content but also needed an additional piece of information, such as the number of results, how would this be achieved?

Thanks

4 . Jesse Skinner on May 31st, 2008

@Tristan - thanks for the correction! I changed it in the post.

To answer your question, once you have the $posts array you can just use count($posts) to see how many posts were found.

5 . Tristan on May 31st, 2008

Thanks Jesse, glad to be of help. That count($posts) will be helpful, but what I mean is that I need to scrape a different item. Say you were scraping Google: what I need is the number at the top, "Results 1 - 10 of about search of SOMEVALUE". How would I obtain that? Could I just run a preg_match before scraping the main content?

Thanks Again
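One way to approach this is a separate preg_match for the single value, run on the same HTML before the preg_match_all. This is only a sketch: the markup below, including the stats div and the wording of the results line, is invented for illustration and will differ on any real page.

```php
<?php
// Invented markup: a results line followed by a list of links.
$html = '<div id="stats">Results 1 - 10 of about 1,230,000</div>'
      . '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>';

// One-off value: preg_match stops at the first match.
$total = null;
if (preg_match('/of about ([\d,]+)/', $html, $m)) {
    // Drop the thousands separators before converting to a number.
    $total = (int) str_replace(',', '', $m[1]);
}

// Repeated items: preg_match_all on the same string.
preg_match_all(
    '/<li><a href="(.*?)">(.*?)<\/a><\/li>/s',
    $html,
    $items,
    PREG_SET_ORDER
);

echo $total . ' total, ' . count($items) . " on this page\n";
```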

6 . Tom on June 27th, 2008

Good post.
Shouldn't the preg_match_all statement use a backslash to escape each forward slash in the pattern? So it would read:

preg_match_all(
'/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
$html,
$posts, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);

Thanks

7 . Jesse Skinner on June 28th, 2008

@Tom - Yep, you're right. I've fixed the post code. Thanks!

8 . Sam on July 9th, 2008

Hi,

Nice and neat article. However, this would only work if the HTML is predictable. I'm trying to scrape the content of ANY website or blog and I find it very difficult. Currently I'm relying only on RSS feeds, but not everyone provides one.

Have you tried scraping more websites?

Thanks

9 . Yuriy on October 2nd, 2008

Thanks, man! You saved me so much time. This is just a perfect tutorial on web scraping.

10 . Ankit on October 20th, 2008

Hi,

First of all, thanks for an awesome tutorial. I was trying to tweak this to make a movie showtime listing engine based on Google's results. Here's the code I used.

<html><body>

<?-php

&s=$_GET['s];
&s1=&_GET['s1'];

echo "<p><i>Search for $s</i></p>";

$s=urlencode($s);
$s1=urlencode($s1);


$html = file_get_contents("http://www.google.com/movies?q=".s."&btnG=Search+Movies&hl=en&near=".s1."");
preg_match_all('/<table cellpadding=3>.(.*?)</td class=k>/s', $string, $matches),
print $matches[1];
$html,
PREG_SET_ORDER
print $matches[1];
);




}
else
{
?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>
<input name="s" type="text" id="s" size="50" />
<input name="s1" type="text" id="s1" size="50" />
<input type="submit" name="Submit" value="Search" />
</p>
</div>
</form>

<p>
<?php
}
?>
</p>

</body></html>

As you stated, I found that the movie show timings are nested between <table cellpadding=3> and <td class=k> tags. hence we could exploit this for the engine. But the above doesn't seem to work. Could you please help?

11 . Jesse Skinner on October 20th, 2008

@Ankit - You need to escape the forward slash in the regular expression with a backslash:

'/<table cellpadding=3>.(.*?)</td class=k>/s'

instead of:

'/<table cellpadding=3>.(.*?)</td class=k>/s'

12 . Jesse Skinner on October 20th, 2008

Oops I guess the backslash got lost which is probably why yours did in your comment.

let's try this:

'/<table cellpadding=3>.(.*?)<\/td class=k>/s'

instead of:

'/<table cellpadding=3>.(.*?)</td class=k>/s'

13 . Ankit on October 21st, 2008

Hi Jesse,

Thanks for the prompt reply. Well I changed the code, but it still doesn't work. Would you mind reviewing it, saving the entire code as a php file and running it in a browser? I think this is a good tool, with good applicability. But it's not working, and that's bugging me :)

Regards,
Ankit

14 . gav on November 29th, 2008

Hi, is there any way I can scrape a background image? For example:

<td width="46" height="49" background="images/0.gif" align="center" valign="top" nowrap>

On the site it's sometimes images/1.gif or 2 or 5, so can I get this information?
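A possible answer, as a sketch only: capture the value of the background attribute and then pull the file name out of it. The td line is copied from the comment above with one of the image numbers filled in; the variable names are my own.

```php
<?php
// The cell from the comment, using images/5.gif as the example value.
$html = '<td width="46" height="49" background="images/5.gif" align="center" valign="top" nowrap>';

// [^>]* keeps the match inside a single tag; ([^"]*) grabs the attribute value.
preg_match_all('/<td[^>]*\sbackground="([^"]*)"/i', $html, $m);
$images = $m[1]; // one entry per matching <td>

// File name without the extension, here the image number.
$numbers = [];
foreach ($images as $path) {
    $numbers[] = pathinfo($path, PATHINFO_FILENAME);
}

print_r($images);
print_r($numbers);
```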

15 . hassan on March 14th, 2009

Thanks for uploading such understandable code for scraping.
I have a problem: I want to scrape the links of my site, which is hosted on a local server at "http://192.168.1.100/$sitepreview/marketingmanager.com/". If I run the following code:

<?php
$siteurl="http://192.168.1.100/$sitepreview/marketingmanager.com/";
$html=file_get_contents($siteurl);
?>
<ul id="main">
<li>
<h1><a href="[link]">[title]</a></h1>
<span class="date">[date]</span>
<div class="section">[content]
</div>
</li>
</ul>
<?php
preg_match_all('/<a href="(.*?)">(.*?)</a>/s',$html,$posts,PREG_SET_ORDER);
echo $count_post=count($post);
foreach ($posts as $post)
{
echo $link = $post[1];
$title = $post[2];
$date = $post[3];
$content = $post[4];

// do something with data
}
?>
I am getting this error on the web page:
Warning: file_get_contents(https://192.168.1.100//marketingmanager.com/): failed to open stream: Invalid argument in C:\Inetpub\vhosts\marketingmanager.com\httpdocs\test_urls.php on line 3
and the count of $post is 0. Could you look into my problem?

16 . Jesse Skinner on March 16th, 2009

@hassan - the problem is the dollar sign $ in your URL string, which PHP thinks is a variable. You can wrap the URL in single quotes to avoid this, i.e.:

$siteurl='http://192.168.1.100/$sitepreview/marketingmanager.com/';
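The difference is easy to check with a small sketch. Here $sitepreview is given a made-up value so the substitution is visible; in hassan's script the variable was undefined, which is why the error message shows an empty path segment.

```php
<?php
// Made-up value, only so the interpolation can be seen.
$sitepreview = 'preview';

// Double quotes: PHP substitutes the variable.
$double = "http://192.168.1.100/$sitepreview/marketingmanager.com/";

// Single quotes: the dollar sign stays a literal character.
$single = 'http://192.168.1.100/$sitepreview/marketingmanager.com/';

// Double quotes with an escaped dollar sign behave like single quotes.
$escaped = "http://192.168.1.100/\$sitepreview/marketingmanager.com/";

echo $double . "\n" . $single . "\n" . $escaped . "\n";
```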

17 . vince on March 17th, 2009

Hi,
Great tutorial, what if the code I am scraping gets bland at one point in the scrape and it becomes hard to decipher one html tag from another? Please see example below, I am trying to capture the percentage data, but not sure how to ignore the first 4 tds and zero in on the correct td:

<tr>
<td>03/11/09</td>

<td>3509</td>
<td>7-13-2</td>
<td>1-1-0</td>
<td>8-14-2</td>
<td>36.36%</td>
<td>92</td>

</tr>

Thanks.

18 . vince on March 17th, 2009

p.s. I don't control the web pages I am scraping.

19 . zlot on April 16th, 2009

Thanks for the tutorial, now I can finally scrape my competitor's website ^___^

20 . NomikOS on April 28th, 2009

@Vince: Try:

<tr>.*?<td>([^<]+)%<\/td>.*?<\/tr>

This extracts only the percentage data; in your example: 36.36

@Jesse: Finally I can participate in your blog.
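Here is that pattern applied to the row from Vince's comment, as a runnable sketch. The / delimiters and the s modifier are added because the row spans several lines; the variable names are my own.

```php
<?php
// The table row from the question above.
$html = <<<HTML
<tr>
<td>03/11/09</td>
<td>3509</td>
<td>7-13-2</td>
<td>1-1-0</td>
<td>8-14-2</td>
<td>36.36%</td>
<td>92</td>
</tr>
HTML;

// [^<]+ cannot cross a tag, so the only cell that can match is one ending in %.
preg_match_all(
    '/<tr>.*?<td>([^<]+)%<\/td>.*?<\/tr>/s',
    $html,
    $rows,
    PREG_SET_ORDER
);

$percentages = [];
foreach ($rows as $row) {
    $percentages[] = $row[1]; // the text before the % sign
}

print_r($percentages);
```

One caveat with this approach: if a row had no percentage cell, the lazy .*? could run on into the next row, so it assumes every row has one.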

21 . leo on July 16th, 2009

Hi, I have the following script:

--------------- START CODE ---------------
<?php
function hyperlinkextract($s1,$s2,$s){
$myarray=array();
$s1=strtolower($s1);
$s2=strtolower($s2);
$L1=strlen($s1);
$L2=strlen($s2);
$scheck=strtolower($s);

do{
$pos1 = strpos($scheck,$s1);
if($pos1!==false){
$pos2 = strpos(substr($scheck,$pos1+$L1),$s2);
if($pos2!==false){
$myarray[]=substr($s,$pos1+$L1,$pos2);
$s=substr($s,$pos1+$L1+$pos2+$L2);
$scheck=strtolower($s);
}
}
} while (($pos1!==false)and($pos2!==false));
return $myarray;
}

$content = file_get_contents('./sample.htm');
$myarray = hyperlinkextract("href=\"","\"",$content);

// Process all the links
foreach($myarray as $key => $val) {
echo "<br />".$val."\n";
}
?>

--------------- END CODE ---------------

It's working well and captures all links on the given page, but I'm trying, without success, to filter the results to get only links from a specific id or class.

I would also like to get links from the current page into the "$content" variable, so it should work like "$content = file_get_contents('this.href');".

Thanks in advance !
LEO

22 . Jenni on September 1st, 2009

I often need to scrape our own web pages for legal reasons - review of text version by legal dept. I use biterscripting ( http://www.biterscripting.com ) for that. Take a look at a sample script they have posted at their site at http://www.biterscripting.com/SS_WebPageToText.html.

That script extracts plain text from a web page. Similarly, script SS_WebPageToCSV extracts a table from a web page, such as stock table.

Jenni

23 . Joe on September 3rd, 2009

I used regexes in my early days of web scraping, but found they can be fragile. Try a good library instead, like LWP for Perl.

24 . paul on September 12th, 2009

I need help.. I want to add a MORTGAGE RATES for my loan site.. how can i do that? I want to post a rate without the name of the link i copied..would that be possible?

25 . Gerson Jaber on September 17th, 2009

You can use DOMHtml to do this; see this article:
http://www.developertutorials.com/tutorials/php/scraping-links-with-php-8-01-05/page7.html
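For comparison with the regular expression approach, here is a small sketch of link extraction with a DOM parser. I'm assuming the suggestion refers to PHP's built-in DOM extension and its DOMDocument class, not to the exact code in the linked article; the sample markup is made up.

```php
<?php
// Made-up page fragment with two links.
$html = '<html><body><h3>Links</h3>'
      . '<a href="/one">One</a> <a href="/two" class="x">Two</a>'
      . '</body></html>';

$doc = new DOMDocument();

// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Walk every <a> element and read its href attribute and text.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```

A parser like this does not care about attribute order or extra whitespace, which is where hand-written patterns tend to break.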

26 . Gerschel on September 26th, 2009

Okay, I would like to scrape a large list of links:
<a href="">""</a>

Just before and after the links there is a header tag:
<h3>""</h3>

I would like to scrape every one into their own variable, not have them all go onto $post[1].

Basically, my goal is to go through 300 different pages, where all the pages in the directory are named "page1.html", "page2.html", and so on.

I was able to come up with this so far:

<? if (!isset($_POST['sub'])) {
$page_number = 1 ;
}
if (isset($_POST['sub'])) {
$page_number = $_POST['page_numeral'];
}
?>
<?
$html = file_get_contents("http://example.com/directory/page$page_number.html");
?>

And further down have a form where I can select the page number as a try and work it.

<form id="form1" name="form1" method="post" action="">
<label>page number
<input type="text" name="page_numeral" id="page_numeral" />
</label>
<p>
<label>
<input type="submit" name="sub" id="sub" value="Submit" />
</label>
</p>
</form>

Now that I can control the pages of links, I would like to get one link, go into it, find the <blockquote> that follows a level three header tag that has the same name of the link. Go back to main page. Start on to the next link and do the same thing. Back to main page. Once all links are completed in main page, go to page2.html
As I do this, I will be saving the <blockquote> into a database. The only problem that I am having is that page1.html may have a slightly different amount of links than page50.html or whatever.

Is there a shorthand to say something like:
'<h3>(.*?)<\/h3><a href=(.*?)>(.*?)<\/a> next link by adding a variable that increments until <h3>.*?<\/h3>

$link+incrementing variable = $post[incrementing variable]

27 . Gerschel on September 27th, 2009

Okay, there are 880 links, I don't want to write the link part 880 times. Any ideas. By the way, I got it to get the <blockquote> automatically in one swoop. Here's my code, p.s. I am new to all of this, I started sometime earlier this month:

<? if (!isset($_POST['sub'])) {
$page_number = 1 ;
}
if (isset($_POST['sub'])) {
$page_number = $_POST['page_numeral'];
}
?>
<?php
// get the HTML
$html = file_get_contents("http://exampleurl.com/directory/page$page_number.html");

preg_match_all(
'/<blockquote>.*?<a href="(.*?)>(.*?)<\/a>/s',
$html,
$posts, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
$link = $post[1];
$a =$post[2];

echo $link; echo $a; // do something with data
$html1 = file_get_contents("http://exampleurl.com/directory/$a.html");

preg_match_all(
'/<blockquote>(.*?)<\/blockquote>/s',
$html1,
$posts1, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);

$quote = $posts1[0][0];

echo $quote; // do something with data
}
?>
<form id="form1" name="form1" method="post" action="">
<label>page number
<input type="text" name="page_numeral" id="page_numeral" />
</label>
<p>
<label>
<input type="submit" name="sub" id="sub" value="Submit" />
</label>
</p>
</form>

28 . Laura Grant on October 29th, 2009

I need 10 values scraped from this portion of HTML (name, address, etc):

<div id="leftnav">
<h1>Charity Rating</h1>
</div>
<div id="sideads">
<div class="rating">

<p><strong>NAME</strong><br />

ADDRESS<br />

Memphis,&nbsp;TN&nbsp;38105<br />
tel: (800) 805-5856<br /> fax: (901) 578-2805<br />
<a href="javascript:openBrWindow('print=1')">EIN</a>: 351044
</p>


<p><a href="mailto:[email protected]">Contact Email</a><br /> <a href="http://www.donorexample.org" target="_blank" onclick="javascript: pageTracker._trackPageview('/outgoing/5234.htm');">Visit Web Site</a></p>


</div>

<div>

And here is the PHP code I am trying to use... but it's not working. I don't understand the backslash escape issue and that might be the problem?

$arr = array(10003,10029);

foreach($arr as $value){
// get the HTML
$web = 'http://www.example.org/orgid='.$value;
echo $web."<br/>";

$html = file_get_contents($web);

preg_match_all(


'/<div id="leftnav"><h1>Charity Rating</h1>.*?<p><strong>(.*?)</strong><br />(.*?)<br />(.*?),&nbsp;(.*?)&nbsp;(.*?)<br />(.*?)<br />(.*?)<br />.*?">EIN</a>: (.*?)</p>.*?<p><a href="(.*?)".*?<a href="(.*?)"\',
$html,
$posts,
PREG_SET_ORDER
);

foreach ($posts as $post) {
$name = $post[1];
$address = $post[2];
$city = $post[3];
$state = $post[4];
$zip = $post[5];
$tel = $post[6];
$fax = $post[7];
$ein = $post[8];
$email = $post[9];
$link = $post[10];
}

// Create date stamp
$dateStamp = strftime("%D %T", time());

echo $name."|".$address."|".$city."|".$state."|".$zip."|".$tel."|".$fax."|".$ein."|".$email."|".$link."|".$dateStamp."<br/>";

}

THANKS in advance for your help... this is really cool script and will really speed up my research :)

29 . Jesse Skinner on October 29th, 2009

@Laura - yes, it's an escaping issue. Regular expressions start and end with the / slash, like:

/hello/

so whenever you need to put in a /, like </strong>, you need to escape it with a \ like:

/<\/strong>/

so just go through your regular expression, add a \ before all the /s, and make sure it ends with a / too.

30 . Jesse Skinner on October 29th, 2009

@Laura - actually you may want to make sure it ends with /s - that 's' means that the dot '.' matches line breaks, and HTML is full of line breaks.
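To illustrate the escaping rule, here is a small sketch on a made-up fragment. It also shows an alternative that PHP's preg functions allow: picking a different delimiter such as #, so that the slashes in closing tags need no backslash at all.

```php
<?php
// Made-up fragment with a line break inside the tag.
$html = "<p><strong>Some\nName</strong><br />";

// Slash delimiter: every / inside the pattern needs a backslash.
preg_match('/<strong>(.*?)<\/strong>/s', $html, $a);

// Hash delimiter: slashes can be written as they appear in the HTML.
preg_match('#<strong>(.*?)</strong>#s', $html, $b);

// Both patterns capture the same text, line break included thanks to the s modifier.
var_dump($a[1] === $b[1]);
```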

31 . Laura Grant on October 29th, 2009

Thanks Jesse! This has been so helpful -- I was able to debug the code with the '\' escapes and I am getting my output - yay!

But I have another related question -- if I want just part of a url, like the last four characters, and I am using the other part as an identifying tag and it has '/'s, e.g. http://www.example.com/1345, how do I block those '/'s?
Cheers!

32 . Jesse Skinner on October 30th, 2009

@Laura - in that case it might look like:

/http:\/\/www.example.com\/(.{4})/

The regular expression parser won't treat the \ characters as text to match. They just tell it that the following / is a literal slash and that the regular expression isn't over yet.

33 . digital on November 5th, 2009

Hi, I really like your tutorial. I have also found a script which search nth results of google search


<?php

$query = "adobe dreamweaver"; // urlencode() is applied once, when the URL is built below

preg_match_all('/<a title=".*?" href=(.*?)>/', file_get_contents("http://www.google.com/ie?q=" . urlencode($query) . "&num=100&start=1"), $matches);

print implode("<br>", $matches[1]);

?>

It returns the URLs from the search results, but I want it to also return the descriptions of those URLs.

34 . Sumit on December 8th, 2009

Hi,
This is an excellent article, probably the simplest one to explain every bit of web scraping. I am trying to use the same approach with the following HTML:
<table width="100%" border="0" cellspacing="0" cellpadding="3">
<tr>
<td style="padding-bottom: 0px; line-height: 20px; padding-top: 6px;" valign="top" width="1%">
<input type="checkbox" name="job" value="7644383" /><input type="hidden" id="7644383" value="0"></td>
<td style="padding-bottom: 0px; line-height: 20px;"><a href="http://a.com/details/7644383.html" target="_blank" id="link7644383" style="text-decoration: underline;" >Java</a>,<span class="small txt_grey">18th Nov 2009</span><br>Infinity Services<br><div style="line-height: normal;"><span class="txt_green">Hyderabad, 2-4 years, 2.50-3.50 lacs:</span> Total of 2 to 3 years experience with the Java language, object oriented programming, and related concepts such as refactoring1 year experience with SQL and database based programmingFamiliarity with UNIX & Junit.</div><a href="javascript:findSimilar(7644383)" class="txt_blue1">Similar Jobs</a>&nbsp;&nbsp;-&nbsp;&nbsp;<a href="http://a.com/searchresult.html" class="txt_blue1">All Jobs by this Recruiter</a></td>
</tr><tr>
<td style="padding-bottom: 0px; line-height: 20px; padding-top: 6px;" valign="top">&nbsp;</td>
<td style="padding-bottom: 0px; line-height: 20px;">&nbsp;</td></tr>
<tr>
<td style="padding-bottom: 0px; line-height: 20px; padding-top: 6px;" valign="top" width="1%">
<input type="checkbox" name="job" value="7466305" /><input type="hidden" id="7466305" value="0"></td>
<td style="padding-bottom: 0px; line-height: 20px;"><a href="http://a.com/details/7466305.html" target="_blank" id="link7466305" style="text-decoration: underline;" >Java Specialist</a>,<span class="small txt_grey">18th Nov 2009</span><br>Magna Infotech Pvt Ltd<br>
<div style="line-height: normal;"><span class="txt_green">Chennai, 4-7 years:</span> Java Developer with strong technical developer with focus and expertise in the Java based tools and technologies. The individual must be proficient in Java development and unit testing</div><a href="javascript:findSimilar(7466305)" class="txt_blue1">Similar Jobs</a>&nbsp;&nbsp;-&nbsp;&nbsp;<a href="http://a.com/searchresult.html" class="txt_blue1">All Jobs by this Recruiter</a>
</td></tr></table>
using

<?php
$html = file_get_contents("data.html");
preg_match_all(
'/<td style="padding-bottom: 0px; line-height: 20px;">(<a href="(.*?.)" .*?.)<\/a>.*?<span class="small txt_grey">(.*?).<\/span><br>.*?<\/span><br>(.*?).<br>.*?<div style="line-height: normal;"><span class="txt_green">(.*?).</span>.*?</span>(.*?).</div>/s',
$html,
$posts, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
$link = $post[1];
$title = $post[2];
$date = $post[3];
$content = $post[4];
$loc = $post[5];
$desc= $post[6];
echo $link."<br>". $title."<br>".$date."<br>".$content."<br>".$loc."<br>".$desc;
// do something with data
}
?>
I am getting Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'p'. I got the result upto <\/span><br>(.*?).<br> but when I add other tags I am getting the warning.
I am also getting only one record instead of two. Why so?
Can you please check this and let me know where I am doing wrong?
Regards

35 . NomikOS on December 8th, 2009

>> "[function.preg-match-all]: Unknown modifier 'p'. "
A: escape these slashes too: </span>.*?</span>(.*?).</div>',

Besides:

1.- (.*?.) and (.*?). are very odd expressions. The second dot seems to be redundant.

2.- .*? is already a lazy expression (.* is the greedy form). Study how lazy and greedy matching differ.

3.- (<a href="(.*?.)" .*?.) this is a nested capture group. It will give you something like $post[n] for the outer parentheses and $post[n+1] for the inner parentheses.

In short, you need more practice with regular expressions.

-------------

Do you want to scrape info from each table row?

36 . Sumit on December 8th, 2009

Hi,
Thanks a lot.
Can you please show me how to scrape info from each table row?
It will be very helpful.
I am not good in php, still learning.
With best regards
Sumit

37 . NomikOS on December 8th, 2009

If the layout doesn't change, this will work:

<td style="padding\-bottom\: 0px; line\-height\: 20px;"><a href="(.*?)" target="_blank" id="link\d+" style="text\-decoration\: underline;" >(.*?)<\/a>,<span class="small txt_grey">(.*?)<\/span><br>(.*?)<br>\s*<div style="line\-height\: normal;"><span class="txt_green">(.*?)\:<\/span>(.*?)<\/div>

use var_dump($posts) to check;

bye.-

38 . Sumit on December 8th, 2009

Hi,
Thanks a lot. Great Work!!!!!!!!.
I want to learn this. Where I can learn preg_match_all in detail?
Regards

39 . NomikOS on December 8th, 2009

Sumit, PHP has some of the best documentation online.
http://www.php.net/docs.php

40 . Sumit on December 9th, 2009

Hi,
I am getting no result when I am using
preg_match_all(
'/<td style="padding\-bottom\: 0px; line\-height\: 20px;"><a href="(.*?)" target="_blank" id="link\d+" style="text\-decoration\: underline;" >(.*?)<\/a>,<span class="small txt_grey">/s',
$html,
$posts, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
//$link = $post[1];
$title = $post[2];
$date = $post[3];

echo $title."<br>".$date."<br>";


}
What is wrong?
Regards

41 . Cwjones on January 12th, 2010

Im looking to scrape a page in my directory rather than writing it out all again.
I'm using php's $_GET from the URL but scraping doesn't seem to want to do the leg work.
If i process
$url = 'http://localhost/~test/result.php?School=$School';
$page = file_get_contents($url);
I get nothing, although if i process,
$url = 'http://localhost/~test/result.php?School=FullSchoolName';
$page = file_get_contents($url);
I get a response.
I'm using the $_GET but like I say, there's no response. Any ideas?

42 . Jesse Skinner on January 12th, 2010

@Cwjones - PHP $variables aren't parsed between single quotes. Try this:

$url = "http://localhost/~test/result.php?School=$School";

or this:

$url = 'http://localhost/~test/result.php?School='.$School;

43 . Cwjones on January 12th, 2010

Thanks for the quick response,

Seems the double quotes don't want to work and this is my fault for not disclosing the full info but there are more than one $variable eg.

$url = "http://localhost/~test/result.php?School=$School&Ward=$Ward&Term=$Term";

So I now cant get my head around the php?School='.$School --- with extras...

Thanks again

44 . Jesse Skinner on January 12th, 2010

@Cwjones - the period concatenates strings together. You can just do this:

$url = "http://localhost/~test/result.php?School=".$School."&Ward=".$Ward."&Term=".$Term;

Try using 'echo' to print out the URL for debugging so you can see what's actually going on.
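When the values can contain spaces or characters like &, concatenation alone can produce a broken URL. As a sketch of one alternative, PHP's http_build_query builds and encodes the whole query string from an array. The three values below are made up; the parameter names come from the comment above.

```php
<?php
// Made-up values, including a space and an ampersand to show the encoding.
$School = 'Full School Name';
$Ward   = 'North & East';
$Term   = '2010';

// http_build_query URL-encodes each value and joins the pairs with &.
$query = http_build_query(
    ['School' => $School, 'Ward' => $Ward, 'Term' => $Term],
    '',
    '&'
);

$url = 'http://localhost/~test/result.php?' . $query;

// Print the URL for debugging, as suggested above.
echo $url . "\n";
```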

45 . svnlabs on January 27th, 2010

Great Idea!!

Really scraping is great tool for web developers...

Why we not utilize it for productive work?

Thanks
SV

46 . NomikOS on January 27th, 2010

For a better performance use curl: http://curl.haxx.se/
Among other things handles HTTP headers, SSL, cookies, proxies, etc.

47 . mike on February 3rd, 2010

I am trying to extract link from html content
eg.

<a href="/contents/text/logic">Value</a>
<a href="/contents/something/logic">Value2</a>

I am trying to pattern match and extract the Value based on the known path.
Each link will have a different value depending on the path like "/contents/text/logic"

Which reg. ex pattern will help me do that
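One possible approach, sketched below: build the pattern from the known path and let preg_quote escape it, including the slashes. The helper name linkTextForPath is hypothetical, and the sketch assumes href is the first and only attribute of the link, as in the two sample lines.

```php
<?php
// The two links from the comment above.
$html = '<a href="/contents/text/logic">Value</a>' . "\n"
      . '<a href="/contents/something/logic">Value2</a>';

// Hypothetical helper: returns the link text for an exact href, or null.
function linkTextForPath($html, $path)
{
    // preg_quote escapes regex characters and, with '/', the delimiter too.
    $pattern = '/<a href="' . preg_quote($path, '/') . '">(.*?)<\/a>/s';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

echo linkTextForPath($html, '/contents/something/logic') . "\n";
```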

48 . Noxier on February 19th, 2010

Hey Jesse, thanks for your simple and meaningful web scraping tutorial.

I am a newbie here. I tried your tutorial to scrape web content from a WordPress-based blog, and I had some trouble with pages that have content like this:

<ul id="main">
<li id="comment1"><a href="http://some.url">Links1</a></li>
<li id="comment2"><a href="http://some.url">Links2</a></li>
<li id="comment3"><a href="http://some.url">Links3</a></li>
</ul>

In this case, every li tag has a different id. How do I scrape it, and what regular expression should be used here?

thanks for your answer :smile:
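A sketch of one way to handle the changing ids: match the id with [^"]* instead of a fixed value, and capture it if it is useful. The markup is the sample from the comment above; the variable names are my own.

```php
<?php
// The list from the question above.
$html = <<<HTML
<ul id="main">
<li id="comment1"><a href="http://some.url">Links1</a></li>
<li id="comment2"><a href="http://some.url">Links2</a></li>
<li id="comment3"><a href="http://some.url">Links3</a></li>
</ul>
HTML;

// ([^"]*) accepts any id value, whatever it happens to be.
preg_match_all(
    '/<li id="([^"]*)"><a href="([^"]*)">(.*?)<\/a><\/li>/s',
    $html,
    $items,
    PREG_SET_ORDER
);

foreach ($items as $item) {
    // $item[1] is the id, $item[2] the link, $item[3] the link text.
    echo $item[1] . ': ' . $item[3] . ' (' . $item[2] . ")\n";
}
```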

49 . Runtest on February 28th, 2010

First I would love to thank you for the super simple tutorial.
I could use a little help though. Nothing is echoing back from this script.

Did I mess up the syntax?

$pGet = file_get_contents("http://fedcoelectronics.com/detail.tpl?SKU=P250C-10ALX&_fid=35");

preg_match_all('/<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>/s',
$pGet,
$pInfo,
PREG_SET_ORDER
);

foreach($pInfo as $pInfo) {
$partNumber = $pInfo[1];

echo $partNumber;
}

50 . JEsteban on March 26th, 2010

I tried the file_get_contents function on a website that I'd like to collect data from. It's a drill down type database application and I would like to get all of it's data into a database so that I can make it searchable with query tools.

The problem is that file_get_contents is not a browser, so the Ajax functions which load most of the data on the site don't get executed because there is no browser loading the page. Any ideas?

51 . Jesse Skinner on March 26th, 2010

@JEsteban - you can try using Firebug or Fiddler to see what URLs are being called via Ajax, and then use file_get_contents or cURL to call those URLs and get the data you need.

52 . JEsteban on March 26th, 2010

Oh that's a good idea. I didn't know that would work. Thanks I'll try that.

53 . Goha on March 27th, 2010

nice tip... thanks..

54 . I3L1nd on April 1st, 2010

Wow,

This really came in handy because I have to update show dates for a clubs website.

Now I can just pull the show dates from the Myspace.


Thanks a lot.

55 . Forbes on April 6th, 2010

awesome tutorial! I am finally able to get my head around data scraping.

Just a note on your site code... your <div class="tags"> aren't being closed. Ran across this while tweaking the scraped data being spit out from your site.

56 . Andrei on April 6th, 2010

THANKS A LOT DUDE.
A VIRTUAL BEER FROM ME TO YOU!

57 . TechRedNeck on April 22nd, 2010

I've been using a scraping software called mozenda which allows you to add in custom code. Does anyone know if this will work with them? It's http://www.mozenda.com if anyone thinks they can find it in their support section. I looked but I'm a dip when it comes to finding things. Thanks :)

58 . trendzvijay on April 29th, 2010

Hi, I have an error with the following code. When I try to get the count of the array, it shows zero. Can you help me?

$h1count = preg_match_all('/<div id="nutritions"><table class="blk_brd" width="270" cellpadding="0" cellspacing="1">
<tbody><tr><td colspan="3" class="PadLft" height="20"><span style="font\-size\: 20px;">
<b>Nutrition Facts<\/b><\/span><\/td><\/tr>
<tr><td colspan="3" class="PadLft" height="15">Serving Size 1 cup <\/td><\/tr>
<tr><td colspan="3" class="blk" height="1"><img src="(.*?)" width="1" height="8"><\/td>
<\/tr><tr><td colspan="3" class="PadLft" height="15"><b>Amount Per 1 Serving<\/b><\/td><\/tr>
<tr><td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Calories<\/b> 120 <\/div><\/td><\/tr>
<tr><td colspan="3" class="brdtp" align="right"><b>% Daily Value * <\/b><\/td><\/tr>
<tr><td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Total Fat <\/b>1.0g<\/div>
<div class="divrht"><b>2<\/b>%<\/div><\/td><\/tr>
<tr>
<td width="9%">&nbsp;<\/td>
<td colspan="2" class="brdtp"><div class="divlft">Saturated Fat 0.0g<\/div>
<div class="divrht"><b>0<\/b>%<\/div><\/td>
<\/tr>
<tr>
<td width="9%">&nbsp;<\/td>
<td colspan="2" class="brdtp"><div class="divlft">Trans Fat 0.0g<\/div>
<div class="divrht"><\/div><\/td>
<\/tr>
<tr>
<td>&nbsp;<\/td>
<td colspan="2" class="brdtp"><div class="divlft">Polyunsaturated Fat 0.0g<\/div>
<div class="divrht"><\/div><\/td>
<\/tr>
<tr>
<td>&nbsp;<\/td>
<td colspan="2" class="brdtp"><div class="divlft">MonoUnsaturated Fat 0.0g<\/div>
<div class="divrht"><\/div><\/td>
<\/tr>
<tr>
<td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Cholesterol&nbsp;<\/b>0.0mg<\/div>
<div class="divrht"><b>0<\/b>%<\/div><\/td>
<\/tr>
<tr>
<td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Sodium <\/b> 540.0mg<\/div>
<div class="divrht"><b>23<\/b>%<\/div><\/td>
<\/tr>
<tr>
<td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Total Carbohydrates <\/b>12.0g<\/div>
<div class="divrht"><b>4<\/b>%<\/div><\/td>
<\/tr>
<tr>
<td>&nbsp;<\/td>
<td colspan="2" class="brdtp"><div class="divlft">Dietary Fiber 6.0g <\/div>
<div class="divrht"><b>24<\/b>%<\/div><\/td>
<\/tr>
<tr>
<td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Protein <\/b>26.0 g<\/div>
<div class="divrht"><b>52<\/b>%<\/div><\/td>
<\/tr>
<tr>
<td colspan="3" class="blk" height="1"><img src="(.*?)" width="1" height="8"><\/td>
<\/tr>
<tr>
<td colspan="3"><table width="100%" border="0" cellpadding="0" cellspacing="0">
<\/table><\/td>
<\/tr>
<tr>
<td colspan="3" class="PadLft brdtp">* Based on a<u> 2,000 calorie diet<\/u>.<\/td>
<\/tr>
<\/tbody>
<\/table>
<\/div>/s',$file,$patterns);
echo $h1count ;

thank you

59 . NomikOS on April 29th, 2010

NomikOS

Are you crazy? I never saw something like that.

Look. First isolate the table. I designed this function:

# Takes only the first occurrence in $string (very important!)
# Returns an empty string if either delimiter is not found
function getUnit($string, $start, $end)
{
    if (($pos = stripos($string, $start)) === false)
        return '';

    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));

    if (($second_pos = stripos($str_two, $end)) === false)
        return '';

    $str_three = substr($str_two, 0, $second_pos);
    return trim($str_three);
}

do:

$unit = getUnit($fileToParse, '<div id="nutritions">', '</div>');

and then preg_match_all over $unit:

if ( preg_match_all('/<img src="(.*?)" width="1" height="8">/si', $unit, $src, PREG_SET_ORDER))
{
var_dump($src);
}

For values between quotes, this pattern is more appropriate than (.*?): ([^"]*)

..V; ^^

60 . NomikOS on April 29th, 2010

NomikOS

Correction:
$unit = getUnit($fileToParse, '<div id="nutritions">', '</tbody>');

Delimiters must be unique (or at least you must be sure they delimit the block you're interested in within $fileToParse). Here the id= attribute ensures that.

61 . trendzvijay on April 29th, 2010

trendzvijay

Hi NomikOS,

Thank you for your quick reply. I'm a newbie to PHP. After the var_dump($src); step, how can we retrieve the data? Please give brief details; it will be helpful to many people like me.

62 . NomikOS on April 29th, 2010

NomikOS

Sure but you must go to php.net and study preg_match_all. In this case we use PREG_SET_ORDER. So you must do this:

foreach ($src as $aux)
{
    $this_array_got_you_want[] = $aux[1];
}
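
Putting the steps from comments 59 to 62 together, here is a minimal sketch of the "isolate first, then match" idea. The sample HTML and file names are made up for illustration, and the isolation is done inline with stripos/substr rather than through getUnit so the snippet stands on its own.

```php
<?php
// Hypothetical sample standing in for the downloaded page ($fileToParse above).
$page = '<img src="logo.gif" width="1" height="8">'
      . '<div id="nutritions"><table><tbody>'
      . '<tr><td><img src="line1.gif" width="1" height="8"></td></tr>'
      . '<tr><td><img src="line2.gif" width="1" height="8"></td></tr>'
      . '</tbody></table></div>';

// Step 1: isolate the block between two delimiters (first occurrence only).
$startTag = '<div id="nutritions">';
$endTag   = '</tbody>';
$unit     = '';
$start    = stripos($page, $startTag);
if ($start !== false) {
    $start += strlen($startTag);
    $end = stripos($page, $endTag, $start);
    if ($end !== false) {
        $unit = substr($page, $start, $end - $start);
    }
}

// Step 2: match only inside the isolated block.
$images = array();
if (preg_match_all('/<img src="([^"]*)" width="1" height="8">/i', $unit, $src, PREG_SET_ORDER)) {
    foreach ($src as $match) {
        $images[] = $match[1]; // [0] is the whole match, [1] is the first capture group
    }
}

print_r($images);
```

Because the match runs on $unit only, the logo image outside the nutritions block is ignored.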

63 . trendzvijay on April 29th, 2010

trendzvijay

Thanks a lot NomikOS !! now its working well. thank you very much dude.

64 . NomikOS on April 29th, 2010

NomikOS

Unbelievable! I didn't expect to give you a complete solution, only a hint. You're in luck; I'm happy for you.

Some useful regular expressions are:

<?php
([^"]*) // match all until "
([^>]*) // match all until >
\\$(\d+\.*\d*) // match prices (It never hurts)
?>
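
As a rough illustration of how these three patterns behave, here is a small sketch against a made-up HTML fragment. One assumption on my part: the price pattern uses \.? (at most one decimal point) instead of \.* from the list above.

```php
<?php
// Hypothetical fragment, for illustration only.
$html = '<a href="/item/42" class="buy">Widget</a> <span>Price: $19.95</span>';

// ([^"]*) : everything up to the closing quote of the attribute
preg_match('/href="([^"]*)"/', $html, $m);
$href = $m[1];

// ([^>]*) : everything up to the end of the opening tag
preg_match('/<a ([^>]*)>/', $html, $m);
$attributes = $m[1];

// \$(\d+\.?\d*) : a dollar sign followed by a number with an optional decimal part
preg_match('/\$(\d+\.?\d*)/', $html, $m);
$price = $m[1];

echo $href, "\n", $attributes, "\n", $price, "\n";
```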

65 . trendzvijay on May 1st, 2010

trendzvijay

Hi NomikOS,

I got the following error while scraping content from a blog [website]. The message appeared after I had collected about 347 items [around the 38th page]:

[file_get_contents]: failed to open stream

$file = file_get_contents($url); is my code for this scraping work.

Is there any other way to get a complete solution?

thank you

66 . NomikOS on May 1st, 2010

NomikOS

To avoid breaking the program flow, use @:

$file = @file_get_contents($url);
if ($file) {}

and carry on...

To scrape seriously you must use cURL, but learning how is your task. Search for a ready-to-use class.

I.-
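
A minimal sketch of one way to make a long scraping run more tolerant of occasional "failed to open stream" errors: retry each fetch a few times before giving up. The function names fetchWithRetry and curlFetch are hypothetical, and curlFetch assumes the cURL extension is installed.

```php
<?php
// Try a fetch callback several times before giving up.
// $fetch is any callable that returns the page body as a string, or false on failure.
function fetchWithRetry($url, $fetch, $maxAttempts = 3, $delaySeconds = 0)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = call_user_func($fetch, $url);
        if ($body !== false) {
            return $body;
        }
        // Wait a little before the next attempt, to be gentle on the remote server.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }
    return false;
}

// A cURL-based fetcher (assumes the curl extension is available).
function curlFetch($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    return curl_exec($ch);                          // false on failure
}

// Usage (not executed here):
// $file = fetchWithRetry($url, 'curlFetch', 3, 2);
// if ($file === false) { /* log the URL and move on to the next page */ }
```

Passing the fetcher in as a callback keeps the retry logic independent of whether file_get_contents or cURL does the actual download.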

67 . trendzvijay on May 1st, 2010

trendzvijay

Thank you. Now I'm learning how to use cURL to scrape. Thanks for your support.

68 . Frederick Aristotle on May 5th, 2010

Frederick Aristotle

Is there a reason why I'm not getting any results when I code it using the following example?

<div style="margin: 10px 0px 0px 0px; padding: 5px; width: 500px; border: 1px solid #000000;">

<?php
// get the HTML
$html = file_get_contents("http://www.dailydealcafe.com/index.php");

preg_match_all(
'/<div class="main-product-image"><img src="(.*?)" alt="(.*?)" title="(.*?)" border="0" height="(.*?)" width="(.*?)"><\/div>
/s',
$html,
$posts, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
$link = $post[1];
$title = $post[2];
$date = $post[3];
$content = $post[4];
$content2 = $post[5];

// do something with data
echo $link . '<br/>' . $title . '<br/>' . $date . '<br/>' . $content;
}


?>
</div>

69 . Jeff Nelson on May 12th, 2010

Jeff Nelson

Great post and comments.

Can you comment on the consequences of adopting HTML 5 for the business of site scraping, as done by Yodlee and others for financial information? Does the use of RIA make scraping more difficult?

thanks/JN

70 . NomikOS on May 12th, 2010

NomikOS

Please provide a suitable link. Thanks...

71 . prakash on May 24th, 2010

prakash

Hi friend, I want to scrape this data:
<DIV CLASS="contenttext">
Many USMS clubs have their own web sites with local information, workout times, club events, and other useful information. Please stop by and visit one of our club sites!
<P><A HREF="edit_club_link.php?add=1">Add USMS Club Link</A>
<FORM ACTION="/links/usmsclubs.php" METHOD="POST">

<SELECT NAME="a">
<OPTION VALUE="">-All-
<OPTION VALUE="AL">Alabama
<OPTION VALUE="AK">Alaska
<OPTION VALUE="AZ">Arizona
<OPTION VALUE="AR">Arkansas
<OPTION VALUE="CA">California
<OPTION VALUE="CO">Colorado
<OPTION VALUE="CT">Connecticut
<OPTION VALUE="DE">Delaware
<OPTION VALUE="DC">District Of Columbia
<OPTION VALUE="FL">Florida
<OPTION VALUE="GA">Georgia
<OPTION VALUE="HI">Hawaii
<OPTION VALUE="ID">Idaho
<OPTION VALUE="IL">Illinois
<OPTION VALUE="IN">Indiana

<OPTION VALUE="IA">Iowa
<OPTION VALUE="KS">Kansas
<OPTION VALUE="KY">Kentucky
<OPTION VALUE="LA">Louisiana
<OPTION VALUE="ME">Maine
<OPTION VALUE="MD">Maryland
<OPTION VALUE="MA">Massachusetts
<OPTION VALUE="MI">Michigan
<OPTION VALUE="MN">Minnesota
<OPTION VALUE="MS">Mississippi
<OPTION VALUE="MO">Missouri
<OPTION VALUE="MT">Montana
<OPTION VALUE="NE">Nebraska
<OPTION VALUE="NV">Nevada
<OPTION VALUE="NH">New Hampshire
<OPTION VALUE="NJ">New Jersey
<OPTION VALUE="NM">New Mexico

<OPTION VALUE="NY">New York
<OPTION VALUE="NC">North Carolina
<OPTION VALUE="ND">North Dakota
<OPTION VALUE="OH">Ohio
<OPTION VALUE="OK">Oklahoma
<OPTION VALUE="OR">Oregon
<OPTION VALUE="PA">Pennsylvania
<OPTION VALUE="RI">Rhode Island
<OPTION VALUE="SC">South Carolina
<OPTION VALUE="SD">South Dakota
<OPTION VALUE="TN">Tennessee
<OPTION VALUE="TX">Texas
<OPTION VALUE="UT">Utah
<OPTION VALUE="VT">Vermont
<OPTION VALUE="VA">Virginia
<OPTION VALUE="WA">Washington
<OPTION VALUE="WV">West Virginia

<OPTION VALUE="WI">Wisconsin
<OPTION VALUE="WY">Wyoming
</SELECT>
<INPUT TYPE="submit" VALUE="Go">
</FORM>
<P>

</DL><DL><DT><B>Alabama</B>
<DD><A HREF="http://www.ag.auburn.edu/~cbailey/masters.html" TARGET="_new"> Auburn Masters Swimming</A> (Auburn)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=982"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>

<DD><A HREF="http://wng1.home.att.net/cams/" TARGET="_new"> CAMS</A> (Montgomery)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=983"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>
<DD><A HREF="http://www.ctaswim.com" TARGET="_new"> Crimson Tide Aquatics</A> (Tuscaloosa)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=1279"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>

<DD><A HREF="http://www.teamunify.com/SubTabGeneric.jsp?team=csfcast&_stabid_=5389" TARGET="_new"> FAST Masters - Fort Collins Area Swim Team</A> (Fort Collins)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=1417"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>
<DD><A HREF="http://www.swimhsa.org/masters" TARGET="_new"> Huntsville Swim Association</A> (Huntsville)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=985"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>

<DD><A HREF="http://www.magiccitymasters.org/" TARGET="_new"> Magic City Masters Swim Team</A> (Birmingham)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=984"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>
<DD><A HREF="http://www.shoalssharks.com" TARGET="_new"> Shoals Sharks Masters Swimming</A> (Florence)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=1249"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>

<DD><A HREF="http://www.mybswim.org/mastersswimming.htm" TARGET="_new"> YMCA Barracudas</A> (Montgomery)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=987"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>
</DL></DIV>


How can I do this? Please help me.
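
One possible approach for the club list above, as a sketch only: match each DD entry and capture the URL, the club name and the city in parentheses. The /i modifier is there because the tags are upper-case. The sample below copies two entries from the comment; the pattern is my assumption about the structure and would need checking against the full page.

```php
<?php
// Two entries copied from the HTML in the comment above.
$html = '<DL><DT><B>Alabama</B>
<DD><A HREF="http://www.ag.auburn.edu/~cbailey/masters.html" TARGET="_new"> Auburn Masters Swimming</A> (Auburn)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=982"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>
<DD><A HREF="http://www.ctaswim.com" TARGET="_new"> Crimson Tide Aquatics</A> (Tuscaloosa)
<SMALL>[</SMALL> <A HREF="edit_club_link.php?a=1279"><SMALL>Modify</SMALL></A> <SMALL>]</SMALL></DD>
</DL>';

$clubs = array();
preg_match_all(
    '/<DD><A HREF="([^"]*)"[^>]*>\s*(.*?)<\/A>\s*\(([^)]*)\)/is',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $match) {
    $clubs[] = array(
        'url'  => $match[1],
        'name' => trim($match[2]),
        'city' => $match[3],
    );
}

print_r($clubs);
```

The "Modify" links are skipped because they are not directly preceded by a DD tag.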

72 . Jose on June 10th, 2010

Jose

That is great information! Nowhere else did I find an easy and effective explanation. Proven, PHP CAN do the [email protected]!

73 . Shrikant on July 23rd, 2010

Shrikant

Hello sir,
I am new to the CakePHP framework and am currently facing a problem.
I am working on scraping in CakePHP, and the site I am scraping is built on the .NET platform.
My problem is how to log in to that .NET site from PHP code: how do I POST a username and password to that site and get the response back on my own PHP site? After that I will scrape data from there and store it in the DB.

Reference site link: www.plentyoffish.com (.NET site)
For example, I would like to log in to a Gmail account from my PHP code and scrape it, but how is that possible?

please help me

Thanks in advance

74 . NomikOS on July 23rd, 2010

NomikOS

That is easy, use cURL and a professional. http://www.rentacoder.com/RentACoder/DotNet/SoftwareCoders/ShowBioInfo.aspx?lngAuthorId=7064234

75 . theMaab on September 2nd, 2010

theMaab

What would the preg_match_all string look like to loop the TERMs and DEFINITIONs on this page? http://www.cancer.gov/drugdictionary/?expand=%23

Thanks in advance. I'm horrible with regex, :(

76 . Steve on September 18th, 2010

Steve

nice tutorial. I've been working on a similar project (www.quickscrape.com) and found that some web hosts require you to use curl instead.

77 . Praveen on December 16th, 2010

Praveen

Dear All,

Could you please help me? How can I scrape the "Latest News Sideshow" data from the "http://www.indiatimes.com/" site and show it on our site with PHP?

Also, when I click on a scraped feed URL, the main site should open on my next page in an iframe, because I want to keep showing our site's header portion
(like http://www.samachar.com).

Regards.
Praveen

78 . Ashneil on December 17th, 2010

Ashneil

Thanks for this tutorial. I really needed it. You have a really nice theme on your website.

79 . lioness on December 25th, 2010

lioness

I need help with scraping.

A user fills in a form on my site1, and the request is made to another site2, which sends the results directly to the user's browser along with a lot of irrelevant information.

I want to grab the results before the user sees them and display only the relevant part.

<form name="example" action="http://www.site2.com/index.php?option=com_content&task=view&id=49&Itemid=10" method="post" onsubmit="return validate_form()" target="_blank";> ......

80 . Raj Keshwani on March 10th, 2011

Raj Keshwani

Great example this is!!!

81 . Freddy on March 22nd, 2011

Freddy

There is this new web scraping tool called Helium Scraper at http://www.heliumscraper.com also.

82 . Lisa Waters on March 28th, 2011

Lisa Waters

I need to extract data from multiple urls and have it inserted into a MySQL database. I am a newbie so, I have no idea what I am doing. I need some information from the body of the page and some from the url parameters. This is what I have so far:

<?php
$arr = array(10003,10029);

foreach($arr as $value){
// get the HTML
$web = 'http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=40'.$value;
echo $web."<br/>";

$html = file_get_contents($web);

preg_match_all(

'/<span class="nav em">(.*?)<br />(.*?).*?<\/span>/s',

$html,
$posts,
PREG_SET_ORDER
);

foreach ($posts as $post) {
$reportingcategory = $post[1];
$standard = $post[2];

}

// Create date stamp
$dateStamp = strftime("%D %T", time());

echo $name."|".$reportingcategory."|".$standard."<br/>";

}
?>
<?php

$url = "http://www.mysite.com/search/question.aspx?mcasyear=".$year."&QuestionSetID=".$QuestionSetID."&grade=".$grade."&subjectcode=".$QuestionType."&questionnumber=".$QuestionNumber;


echo $name."|".$reportingcategory."|".$standard."|".$grade."<br/>";
?>
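
For the "information from the URL parameters" part of this question, PHP's parse_url and parse_str can split a query string into an array, so no regular expression is needed there. A small sketch, using the URL pattern from the code above with the first value of $arr appended:

```php
<?php
// URL built the same way as in the loop above ('...questionnumber=40' . 10003).
$web = 'http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=4010003';

// Take only the part after the "?" ...
$query = parse_url($web, PHP_URL_QUERY);

// ... and turn it into an associative array of parameter => value (all values are strings).
parse_str($query, $params);

$year           = $params['mcasyear'];
$grade          = $params['grade'];
$subject        = $params['subjectcode'];
$questionNumber = $params['questionnumber'];

echo $year . "|" . $grade . "|" . $subject . "|" . $questionNumber . "\n";
```

The values could then be written to MySQL along with the scraped fields, ideally through prepared statements rather than by concatenating them into the SQL string.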

83 . Gerry_castlow on March 30th, 2011

Gerry_castlow

haha idiot! regex loses against tidy xpath.
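
For comparison, here is a rough sketch of the XPath approach. It uses PHP's bundled DOMDocument and DOMXPath classes rather than Tidy, which is my substitution, and it assumes the DOM extension is enabled. The sample HTML is made up, following the structure shown in the article.

```php
<?php
// Hypothetical sample in the same shape as the blog markup from the article.
$html = '<ul id="main">
    <li><h1><a href="/blog/first">First post</a></h1><span class="date">February 17th</span></li>
    <li><h1><a href="/blog/second">Second post</a></h1><span class="date">February 18th</span></li>
</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$posts = array();

// Select every list item in the main list, then query relative to each item.
foreach ($xpath->query('//ul[@id="main"]/li') as $li) {
    $a    = $xpath->query('.//h1/a', $li)->item(0);
    $date = $xpath->query('.//span[@class="date"]', $li)->item(0);
    $posts[] = array(
        'link'  => $a->getAttribute('href'),
        'title' => $a->textContent,
        'date'  => $date->textContent,
    );
}

print_r($posts);
```

The queries describe the structure of the document instead of its exact text, so changes in whitespace or attribute order do not break them the way they can break a regular expression.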

84 . djam on April 9th, 2011

djam

Hi..

This is the html code:

<div style="margin: 0px; z-index: 1000;" class="my-entry">
<ul>
<li><strong><a href="link.html">Sepucuk Surat Buat Presiden...</a></strong></li>
<li><strong><a href="link.html">Momentum, Rahasia Sukses</a></strong></li>
<li><strong><a href="link.html">Etika Bisnis Negeri Matahari Terbit</a></strong></li>
<li><strong><a href="link.html">Kiat Menjaga Motivasi Untuk Berolah Raga</a></strong></li>
</ul>
</div>


This is my code:

$url="http://www.theurl.com";
$text = file_get_contents($url);

preg_match_all(
'/<div style="margin\: 0px; z-index\: 1000;" class="my\-entry"><ul><li>.*?<strong><a href="(.*?)">(.*?)<\/a>.*?<\/strong>.*?<\/li>.*?<\/ul>.*?<\/div>/s',$text,$posts,PREG_SET_ORDER);

foreach ($posts as $post) {
$link = $post[1];
$title = $post[2];
$date = $post[3];
$content = $post[4];

echo $title;
echo $link;
echo $date;
echo $content;

}

But I got no result..
Please help...
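
As far as I can tell, there are a few likely causes here: the pattern expects the div, ul and li tags to follow each other with nothing in between, while the HTML has line breaks there; it wraps the whole div, so it could match at most once; and it has only two capture groups, so $post[3] and $post[4] never exist. A sketch of one way around this, using shortened sample data with hypothetical link names:

```php
<?php
// Shortened sample based on the HTML above (link names are made up).
$text = '<div style="margin: 0px; z-index: 1000;" class="my-entry">
<ul>
<li><strong><a href="link1.html">Momentum, Rahasia Sukses</a></strong></li>
<li><strong><a href="link2.html">Etika Bisnis Negeri Matahari Terbit</a></strong></li>
</ul>
</div>';

$links = array();

// First isolate the my-entry block, then pull out every link inside it.
if (preg_match('/<div[^>]*class="my-entry"[^>]*>(.*?)<\/div>/s', $text, $block)) {
    preg_match_all(
        '/<li>\s*<strong>\s*<a href="([^"]*)">(.*?)<\/a>/s',
        $block[1],
        $posts,
        PREG_SET_ORDER
    );
    foreach ($posts as $post) {
        // Two capture groups, so only $post[1] and $post[2] are available.
        $links[] = array('link' => $post[1], 'title' => $post[2]);
    }
}

print_r($links);
```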

85 . Elena Gallegos on May 6th, 2011

Elena Gallegos

Hi, and if I wanted to do this in Java, how would I do it? Thanks.

86 . NomikOS on May 6th, 2011

NomikOS

Elena, the scraping method shown in this post uses regular expressions. Regular expressions are hard to learn and use. Java has a package for this: java.util.regex (http://www.regular-expressions.info/java.html).

There are other ways to scrape too, for example with XPath, which are a bit simpler. This page covers two Java-based ones: http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java

Either way, there is no easy solution if you want to do professional work.

I hope this helps. Don't forget to visit my blog (http://nomikos.info), OK?

NomikOS.-

87 . Sudip Rooj on May 27th, 2011

Sudip Rooj

This code is really helpful.
Great job.

88 . ras on May 30th, 2011

ras

Many thanks for this scraping tutorial, you saved my day and my time;)

89 . Sudip Rooj on June 7th, 2011

Sudip Rooj

The preg_match function is not working with any regular expression on the clickindia dot com site. Please give me some suggestions.

90 . tony on June 27th, 2011

tony

Please help. I have 600 files and I'm stuck. This is a sample from one file. Can you give me some sample code?
I need to extract the markets, items and prices.
I think the img is the first key for finding the ids, and then I can search by ids.

<div id="centerData" class="dm">
<table class="fixw" cellspacing="0" cellpadding="0" border="0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<tbody>
<tr class="h1 rh1">
<td align="center" width="32">
<a onclick="return false" href="#">
<img id="cpnBtn_981#38120821" border="0" align="absmiddle" onclick="clickOpenClose('981#38120821',4115,'',1,7,'',981,4,'',38120821,3,'',1,8,'',38120821,3,'',0,0);" src="mainpage_data/iconOpen.gif">
</a>
</td>
<td>
<a onclick="javaScript: gPC(100000,'',1,7,'',981,4,'',38120821,3); return false;" href="#">MARKET 1</a>
</td>
</tr>
</tbody>
</table>
<div id="cpnDiv_981#38120821" xmlns:fo="http://www.w3.org/1999/XSL/Format" style="display:inline">
<table cellspacing="0" cellpadding="0" width="565" style="border-bottom: 1px solid rgb(211, 211, 211);">
<tbody>
<tr>
<td style="height: 1px;"></td>
</tr>
<tr class="rcpn">
<td class="dcpnl ex clba cbb" onclick="javascript:number('pt=N#o=21/20#f=38120821#fp=194761477#so=0#c=1#');">ITEM 1</td>
<td class="dcpnr ex1 clab cbb" onclick="javascript: number('pt=N#o=21/20#f=38120821#fp=194761477#so=0#c=1#');">PRICE1</td>
<td class="dcpnl ex clba cbb" onclick="javascript: number('pt=N#o=3/4#f=38120821#fp=194761478#so=0#c=1#');">ITEM 2</td>
<td class="dcpnr ex1 clab cbr cbb" onclick="javascript: number('pt=N#o=3/4#f=38120821#fp=194761478#so=0#c=1#');">PRICE2</td>
</tr>
</tbody>
</table>
</div>
<table class="fixw" cellspacing="0" cellpadding="0" border="0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<tbody>
<tr>
<td class="w" width="565" height="1px" colspan="1"></td>
</tr>
</tbody>
</table>
<table class="fixw" cellspacing="0" cellpadding="0" border="0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<tbody>
<tr class="h1 rh1">
<td align="center" width="32">
<a onclick="return false" href="#">
<img id="cpnBtn_10202#38120821" border="0" align="absmiddle" onclick="clickOpenClose('10202#38120821',4115,'',1,7,'',10202,4,'',38120821,3,'',1,8,'',38120821,3,'',0,0);" src="mainpage_data/iconOpen.gif">
</a>
</td>
<td>
<a onclick="javaScript: gPC(100000,'',1,7,'',10202,4,'',38120821,3); return false;" href="#">MARKET 2</a>
</td>
</tr>
</tbody>
</table>
<div id="cpnDiv_10202#38120821" xmlns:fo="http://www.w3.org/1999/XSL/Format" style="display: inline;">
<table class="o4 no_b_tlrb" cellspacing="0" cellpadding="0" border="0" width="565">
<tbody>
<tr class="H1">
<td width="565" height="1" colspan="5"></td>
</tr>
</tbody>
</table>
<table cellspacing="0" cellpadding="0" border="0" width="565" style="border-bottom: 1px solid rgb(211, 211, 211);">
<tbody>
<tr>
<td style="height: 1px;"></td>
</tr>
<tr class="rcpn">
<td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=1/16#f=38120821#fp=194832550#so=0#c=1#');">ITEM 1</td>
<td class="dcpnr ex1 clab" onclick="javascript: number('pt=N#o=1/16#f=38120821#fp=194832550#so=0#c=1#');">PRICE 1</td>
<td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=9/1#f=38120821#fp=194832551#so=0#c=1#');">ITEM 2</td>
<td class="dcpnr ex1 clab cbr" onclick="javascript: number('pt=N#o=9/1#f=38120821#fp=194832551#so=0#c=1#');">PRICE 2</td>
</tr>
<tr class="rcpn">
<td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=3/10#f=38120821#fp=194832553#so=0#c=1#');">ITEM 3</td>
<td class="dcpnr ex1 clab" onclick="javascript: number('pt=N#o=3/10#f=38120821#fp=194832553#so=0#c=1#');">PRICE 3</td>
<td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=12/5#f=38120821#fp=194832554#so=0#c=1#');">ITEM 4</td>
<td class="dcpnr ex1 clab cbr" onclick="javascript: number('pt=N#o=12/5#f=38120821#fp=194832554#so=0#c=1#');">PRICE 4</td>
</tr>
<tr class="rcpn">
<td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=5/2#f=38120821#fp=194832556#so=0#c=1#');">ITEM 4</td>
<td class="dcpnr ex1 clab" onclick="javascript: number('pt=N#o=5/2#f=38120821#fp=194832556#so=0#c=1#');">PRICE 4</td>
<td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=2/7#f=38120821#fp=194832557#so=0#c=1#');">ITEM 5</td>
<td class="dcpnr ex1 clab cbr" onclick="javascript: number('pt=N#o=2/7#f=38120821#fp=194832557#so=0#c=1#');">PRICE 5</td>
</tr>
</tbody>
</table>
</div>
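
A possible starting point for the item and price cells, sketched under the assumption that every item cell has a class beginning with "dcpnl" and is immediately followed by a price cell whose class begins with "dcpnr", as in the sample above. The function name extractPairs is hypothetical, and the onclick handlers in the sample are shortened.

```php
<?php
// Simplified, hypothetical sample of one "rcpn" row (onclick handlers shortened).
$html = '<tr class="rcpn">
<td class="dcpnl ex clba cbb" onclick="number(1);">ITEM 1</td>
<td class="dcpnr ex1 clab cbb" onclick="number(1);">PRICE1</td>
<td class="dcpnl ex clba cbb" onclick="number(2);">ITEM 2</td>
<td class="dcpnr ex1 clab cbr cbb" onclick="number(2);">PRICE2</td>
</tr>';

// Return a list of item/price pairs found in the given HTML.
function extractPairs($html)
{
    $pairs   = array();
    $pattern = '/<td class="dcpnl[^"]*"[^>]*>(.*?)<\/td>\s*<td class="dcpnr[^"]*"[^>]*>(.*?)<\/td>/s';
    if (preg_match_all($pattern, $html, $rows, PREG_SET_ORDER)) {
        foreach ($rows as $row) {
            $pairs[] = array('item' => trim($row[1]), 'price' => trim($row[2]));
        }
    }
    return $pairs;
}

$pairs = extractPairs($html);
print_r($pairs);
```

For a folder of saved files, the same function could be called inside a loop over glob() results, reading each file with file_get_contents.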

91 . Jax on June 27th, 2011

Jax

thanks for the tutorial, but I always get 0 results from this:

'/<td style="font\-weight\: bold;" class="rightNum">(.*?)<\/td><td style="padding\-left\: 40px;"><a href="(.*?)">(.*?)<\/a><\/td><td style="white\-space\: nowrap;">(.*?)<\/td><td class="centNum"><img src="(.*?)" onmouseover="setTipText\(\'(.*?)\'\);" class="staticTip"><\/td><td style="font\-weight\: bold; color\: rgb\(103, 135, 5\);" class="rightNum">(.*?)<\/td><td style="font\-weight\: bold; color\: rgb\(154, 20, 1\);" class="rightNum">(.*?)<\/td><td style="font\-weight\: bold;" class="rightNum">(.*?)<\/td><\/tr>/'

Any ideas why? Does anyone see an error?

92 . kaushal sinha on June 30th, 2011

kaushal sinha

Excellent overview, it pointed me out something I didn’t realize before. I should encourage for your wonderful work. I am hoping the same best work from you in the future as well. Thank you for sharing this information with us.

93 . Stanley on July 1st, 2011

Stanley

Excellent - glad I found your article. I just scraped my local yellow pages (not ALL of it) to help our company do a bit of targeted marketing.

Much easier than I thought thanks to your simple explanation of how it works.

Many thanks.

94 . Stanley on July 1st, 2011

Stanley

@Jax Don't forget to end it with /s' instead of /' - that's the mistake I just made.

Can't vouch for the rest of your code though - depends on the HTML you're using it on. I'm no expert - do you really need to escape all those hyphens and colons? Maybe.

P.S. Just a point of interest for anyone else - when I was using this to grab a few names and addresses it worked fine until there was a piece of information (e.g. a url or an email address) missing from the records I was scraping, so the script naturally jumped ahead to find the next record where there was a "mailto", for instance.

Instead of trying to find a way to add an IF clause or two in my script, to test if there was a URL or an email address, I simplified my script to grab all of the HTML for this section of the records, then did a bit of weeding by using str_ireplace to get rid of the bits of HTML I didn't want and add a few "|" delimiters to re-separate the URL and email values.

Worked a treat. I also added a "page=" querystring to my page to make it quicker to load up the next page of records - similar to how Gerschel did it in Comment 27, but I just loaded the page, copied and pasted the records into my spreadsheet, then typed the next page number in my address bar and hit return, then repeated the process. I grabbed about 16 pages of records in just a few minutes using this method.
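
On the /s point at the top of this comment: by default the dot in a PHP regular expression does not match a newline, so (.*?) cannot capture content that spans several lines. A tiny demonstration with a made-up string:

```php
<?php
// A cell whose content is split over two lines.
$html = "<td>first\nsecond</td>";

// Without /s the dot stops at the newline, so nothing matches.
$without = preg_match('/<td>(.*?)<\/td>/', $html, $m1);

// With /s the dot also matches newlines, so the whole cell is captured.
$with = preg_match('/<td>(.*?)<\/td>/s', $html, $m2);

var_dump($without, $with);
```

If none of the captured values in a pattern ever spans a line break, /s makes no difference, so a missing /s is only one of the possible reasons for zero matches.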

95 . Stanley on July 2nd, 2011

Stanley

I improved my method of paging through the records by adding a loop. I first checked to see how many pages of records there were and added a $totpages variable, like so:

$totpages = 18;

for ( $page_number = 1; $page_number <= $totpages; $page_number += 1) {

// get the HTML
$html = file_get_contents("http://www.example.com/listing.php?categoryid=123&page=".$page_number);

// then all the other stuff as per this tutorial, then an additional curly bracket right at the end to close the loop...

}

Works great. Cheers Jesse - you've unleashed a monster, lol!

I suppose I "could" make a list of all the category IDs and numbers of pages and just loop through the whole lot in one go... hmmm....
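
The "whole lot in one go" idea could look roughly like this: keep a map of category IDs to page counts and generate every listing URL from it. The function name, the second category ID and the page counts are made up for illustration; the URL format is the one from the loop above.

```php
<?php
// Build every listing URL for a map of category ID => number of pages.
function buildListingUrls($baseUrl, $pagesPerCategory)
{
    $urls = array();
    foreach ($pagesPerCategory as $categoryId => $totalPages) {
        for ($page = 1; $page <= $totalPages; $page++) {
            $urls[] = $baseUrl . '?categoryid=' . $categoryId . '&page=' . $page;
        }
    }
    return $urls;
}

// Hypothetical categories: 18 pages in category 123, 4 pages in category 456.
$urls = buildListingUrls('http://www.example.com/listing.php', array(123 => 18, 456 => 4));

echo count($urls), " URLs, starting with ", $urls[0], "\n";
```

Each URL can then be fetched and parsed in turn; pausing between requests keeps the load on the other site reasonable.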

96 . Rashed on September 8th, 2011

Rashed

$html = file_get_contents("http://www.footbo.com/Teams/Real_Madrid");

preg_match_all(

'/<div class="bottom rounded6">(.*?)<\/div>/s',

$html,
$posts, // will contain the blog posts
PREG_SET_ORDER // formats data into an array of posts
);


foreach ($posts as $post) {
$link = $post[1];
$title = $post[2];
$date = $post[3];
$content = $post[4];


}
It's a great article, Jesse Skinner; many, many thanks. The regular expression is clear to me, but I can't understand the array loop. Please explain why you use $link = $post[1];, $title = $post[2];, $date = $post[3]; and $content = $post[4];. Please let me know, and please correct my code.
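
To illustrate the question about the indexes: with PREG_SET_ORDER, each $post is one match, where $post[0] is the whole matched text and $post[1], $post[2] and so on are the capture groups (the parenthesised parts) in order. The pattern in this comment has only one pair of parentheses, so only $post[1] exists. A small sketch with made-up HTML:

```php
<?php
// Hypothetical page with two of the divs the pattern is looking for.
$html = '<div class="bottom rounded6">First</div><div class="bottom rounded6">Second</div>';

preg_match_all(
    '/<div class="bottom rounded6">(.*?)<\/div>/s',
    $html,
    $posts,
    PREG_SET_ORDER
);

$contents = array();
foreach ($posts as $post) {
    // $post[0] is the full match, $post[1] is the single capture group.
    $contents[] = $post[1];
}

// Number of capture groups available in each match.
$groups = count($posts[0]) - 1;

print_r($contents);
echo $groups, "\n";
```

The article's pattern has four pairs of parentheses (link, title, date, content), which is why it reads $post[1] through $post[4].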

97 . Juned Ahmad on September 16th, 2011

Juned Ahmad

This tutorial is very helpful for scraping. I am very thankful to you.
Thank you.

98 . toto on October 28th, 2011

toto

I am a new PHP developer. Thank you for sharing how to scrape a site with PHP. This tutorial is very helpful for me.

99 . Jonny on October 28th, 2011

Jonny

Thanks this came in useful and I have linked back to you.

100 . Gwen on November 9th, 2011

Gwen

I can't make this script work. Can anyone pls tell me what's wrong. I don't get any error, its just not working


<?

$html = file_get_contents("http://www.yellowpages.com/fort-lauderdale-fl/acupunture");

preg_match_all(
'/
<div class="listing_content">.*?
<h3 .*?>
<a .*?>(.*?)<\/a>
<\/h3>
<span class="listing-address adr">
<span class="street-address">(.*?)<\/span>
<span class="city-state">
<span class="locality">(.*?)<\/span>,
<span class="region">(.*?)<\/span>
<span class="postal-code">(.*?)<\/span>
<\/span>
<\/span>
<span class="business-phone phone">(.*?)<\/span>.*?
<li><a href="(.*?)">/s',
$html,
$posts,
PREG_SET_ORDER
);


$listing=array();

foreach ($posts as $post) {

$listing['title'][] = $post[1];

$listing['street'][] = $post[2];

$listing['city'][] = $post[3];

$listing['state'][] = $post[4];

$listing['zip'][] = $post[5];

$listing['phone'][] = $post[6];

$listing['website'][] = $post[7];

// do something with data

echo $post[4];
}


print_r($listing)


?>

101 . Jesse Skinner on November 9th, 2011

Jesse Skinner

@Gwen - you probably need to make the regular expression one line. Try using .* to capture the whitespace in between, with the /s ending as I describe in the article.
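
A sketch of what that can look like in practice: put \s* wherever the HTML may contain line breaks or indentation between tags, and [^>]* where attributes may vary. The sample markup below is a made-up, shortened version of a listing, not the real page.

```php
<?php
// Hypothetical, shortened listing with line breaks and indentation between tags.
$html = '<div class="listing_content">
  <h3 class="title">
    <a href="/biz/1">Acme Acupuncture</a>
  </h3>
  <span class="street-address">1 Main St</span>
</div>';

// The pattern is on one line; \s* absorbs the whitespace between tags.
$pattern = '/<div class="listing_content">\s*<h3[^>]*>\s*<a[^>]*>(.*?)<\/a>\s*<\/h3>\s*<span class="street-address">(.*?)<\/span>/s';

preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

$listing = array();
foreach ($posts as $post) {
    $listing[] = array('title' => $post[1], 'street' => $post[2]);
}

print_r($listing);
```

A pattern split across several lines contains literal newlines and spaces of its own, which only match if the page has exactly the same whitespace in exactly the same places.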

102 . Gwen on November 10th, 2011

Gwen

Thanks for the reply. I replaced a few tags with .*? and that did it!!!. Thank you Jesse. This tut rocks! : )

103 . Farhan on November 21st, 2011

Farhan

Please help. How can I get data from the IMDb coming soon movies page?

104 . Piyush on January 7th, 2012

Piyush

Hi,
I want to scrape all the text from this:
<tr>
<td class="product-specs" colspan="2">
<h1>
<span style="font-size: small">
<span style="font-family: Arial">
Dell Optiplex 745 Tower Computer<br />
</span></span></h1>
<p>Intel Core 2 Duo 2.4 GHz <br />

2 GB&nbsp;RAM <br />
80 GB&nbsp;HDD <br />
DVDRW<br />
Windows XP&nbsp;Professional <br />
Keyboard <br />
and Mouse.</p>
<p><em>Factory refurbished desktop computer</em></p>

<h1><span style="color: rgb(255, 0, 0);"><strong><span style="font-size: small;"><span style="font-family: Arial;">3-year advance replacement warranty (no charge for parts, labor, and shipping)</span></span></strong></span></h1></td>
</tr>
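
When the goal is simply "all the text", a regular expression per field may not be needed. A minimal sketch, assuming it is acceptable to treat br tags and closing block tags as line breaks and to drop all other markup with strip_tags; the function name htmlToLines is hypothetical and the sample is shortened from the HTML above.

```php
<?php
// Convert a fragment of HTML into a list of non-empty text lines.
function htmlToLines($html)
{
    // Treat <br> and the end of common block elements as line breaks.
    $html = preg_replace('/<br\s*\/?>|<\/(p|h1|div|tr)>/i', "\n", $html);
    $text = strip_tags($html);
    // Turn non-breaking spaces into plain spaces, then decode the remaining entities.
    $text = str_replace('&nbsp;', ' ', $text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    $lines = array();
    foreach (explode("\n", $text) as $line) {
        $line = trim(preg_replace('/\s+/', ' ', $line)); // collapse runs of whitespace
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return $lines;
}

// Shortened sample based on the HTML in this comment.
$html = '<h1><span>Dell Optiplex 745 Tower Computer<br /></span></h1>'
      . '<p>Intel Core 2 Duo 2.4 GHz <br />2 GB&nbsp;RAM <br />DVDRW<br /></p>';

$lines = htmlToLines($html);
print_r($lines);
```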

105 . roisun on January 25th, 2012

roisun

I really want to scrape some sites, but I do not understand at all what to do.

I am new to the internet and to this PHP code.

106 . Alastair on February 11th, 2012

Alastair

Two libraries that I recommend for scraping:
PHP Simple HTML DOM Parser (simplehtmldom.sourceforge.net)
Magpie RSS (magpierss.sourceforge.net)

107 . Selo Bania on March 20th, 2012

Selo Bania

I have a really interesting question. In my stat counter I've found many Russian
sites linking to my website (http://selo-banya.com). It looks like some kind of web scraping.
Could someone here please help me with this? I found a script on another site
(zdravstvo.rs) that I have nothing to do with, and it includes my domain name. If you
go to the footer menu of that website, you can see the information under "Odakle nam dolaze".

Here is the script - a
href="http://www.google.bg/url?sa=t&rct=j&q=link%3Ahttp%3A%2F%2Fselo-banya.com%20zvezdaput.net&source=web&cd=3&ved=0CDoQFjAC&url=http%3A%2F%2Fwww.zdravstvo.rs%2Fbaza%2Findex.php%3Fsrch%3D%26kategorija%3D41%26grad%3DBeograd&ei=itxoT_t8heW1BtyzteYH&usg=AFQjCNHGF-LTs4xiUYZI14DS7eAx36PfAw" rel="nofollow">www.google.bg

Also, in the last 2 months my website's ranking has dropped. Its Alexa rank was 1 600 000 and has now
fallen to 4 675 000. Please, I need some help.

108 . Marcus on March 26th, 2012

Marcus

@Alastair - Thanks, that DOM parser was exactly the kind of scraping tool I was looking for.

@Jesse - Thanks for making me find this :)

109 . Marin on March 26th, 2012

Marin

Good article, I use a similar approach in my own freeware PHP web scraper: http://code.google.com/p/universal-web-scraper/

110 . Joe on March 28th, 2012

Joe

Hi, I am trying to use a web scraping script to search for telephone numbers on websites. Does anyone have a script that works?

111 . Martin on April 10th, 2012

Martin

Hi,

I want to scrape a website (not mine) but I was wondering if they could trace me while using a script like in this example. Is it possible to trace someone who is scraping your website with "file_get_contents"?

Thanks!

112 . nimo_Q on May 21st, 2012

nimo_Q

Wonderfully done! Thank you.

113 . Lane on May 26th, 2012

Lane

This is an older post so a lot has changed over the past few years. Most people frown upon regular expressions for regular scraping needs and file_get_contents() works in some cases but not in others. If you are writing new web scraping code, I recommend looking at using the excellent Ultimate Web Scraper Toolkit:

http://barebonescms.com/documentation/ultimate_web_scraper_toolkit/

It comes with everything someone needs to get started with modern web scraping.

@Martin - It is possible to trace someone who is scraping a site if they are paying attention to their logs (or if software is monitoring for unusual activity from an IP address). Basically, your IP address may get banned by the administrator and, if what you are doing is illegal, there might be legal action taken, but I have yet to hear of anyone getting sued over it. Banning the IP address is a pretty effective measure.

Comments are closed, but I'd still love to hear your thoughts.