Simple HTML DOM problem



Dobby89
11-14-2009, 02:29 PM
I've been trying to use the Simple HTML DOM class to scrape search results from Google. The code I'm using at the moment is below.

Now, the code works if I change the source URL to just http://www.google.com, but it doesn't seem to work once I add the search suffix "search?q=example".

Is there something I'm missing?

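<?php
include('simplehtmldom/simple_html_dom.php');

// Create the HTML DOM from the search results URL
$html = file_get_html('http://www.google.com/search?q=example');

// Some versions of file_get_html() return false when the fetch fails
if (!$html) {
    die('Could not load the page.');
}

// Print the href of every link on the page
foreach ($html->find('a') as $e) {
    echo $e->href . "<br>";
}
?>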

JShor
11-15-2009, 12:06 AM
Google's servers are set up to prevent exactly what you're doing: automated spidering of their indexed search results.

Dobby89
11-16-2009, 09:51 PM
Yeah, I've done some further reading and learnt that :(

However, I've heard from my university tutor that cURL might provide a way round it by tricking Google into thinking it's serving a browser request or something. Something for me to investigate anyway.
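
For anyone following along, here is a rough sketch of what that cURL idea might look like. It only shows how to send a browser-like User-Agent header with cURL and pull the links out of whatever comes back; whether Google actually serves the results page to such a request is not something this thread confirms. The function names fetch_as_browser() and extract_hrefs() and the User-Agent string used in the usage note are made up for illustration, and the link extraction uses PHP's built-in DOMDocument so the snippet runs without the Simple HTML DOM library.

```php
<?php
// Hypothetical helper: fetch a URL with cURL while sending a custom User-Agent.
function fetch_as_browser(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects
        CURLOPT_USERAGENT      => $userAgent, // the header a browser would normally send
        CURLOPT_TIMEOUT        => 10,         // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extract_hrefs(string $markup): array
{
    if ($markup === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed, so suppress parser warnings.
    @$doc->loadHTML($markup);
    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}
```

Something like extract_hrefs(fetch_as_browser('http://www.google.com/search?q=example', 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)')) would then give an array of links to loop over. If you would rather keep using Simple HTML DOM for the parsing, the fetched string can be handed to the library's str_get_html() instead.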

tixrus
12-11-2009, 06:40 AM
If I open http://www.google.com/search?q=wombat in a browser and view the page source of the result, I get links with normal hrefs. It would seem that once you got that into your own script, you could parse and tweak it at will. So it must be that Google doesn't serve the same thing to requests that it thinks are not from browsers. I know simplehtmldom uses file_get_contents to fetch pages, but if it used cURL you could impersonate a browser agent, no?
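
As a side note, file_get_contents can also send a custom User-Agent without cURL, by way of a stream context. The sketch below just builds such a context; the helper name browser_context() and the User-Agent string are assumptions for illustration, and nothing here guarantees that Google will answer the request the way it answers a real browser.

```php
<?php
// Hypothetical helper: build a stream context that sends a custom User-Agent.
function browser_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // read timeout in seconds
        ],
    ]);
}

// Example User-Agent string, chosen for illustration only.
$context = browser_context('Mozilla/5.0 (compatible; ExampleFetcher/1.0)');

// The context is passed as the third argument of file_get_contents():
// $page = file_get_contents('http://www.google.com/search?q=wombat', false, $context);
```

If the installed version of file_get_html() forwards its extra arguments to file_get_contents, the same context could probably be passed there as the third argument too, but that is worth checking against the copy of simple_html_dom.php you have.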