Thread: Simple HTML DOM problem

  1. #1
    Join Date
    Nov 2009
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Simple HTML DOM problem

    I've been trying to use the Simple HTML DOM class to scrape search results from Google. The code I'm using at the moment is below.

    The code works if I change the source URL to just http://www.google.com, but it doesn't seem to like the search path and query string "search?q=example".

    Is there something I'm missing?

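    <?php
    include('simplehtmldom/simple_html_dom.php');

    // Create the HTML DOM from the search results page
    $html = file_get_html('http://www.google.com/search?q=example');

    // Some versions of the library return false when the fetch fails
    if (!$html) {
        die('Could not load the page');
    }

    // Find all links and print each href
    foreach ($html->find('a') as $e) {
        echo $e->href . "<br>";
    }
    ?>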

  2. #2
    Join Date
    Mar 2007
    Location
    New York, NY
    Posts
    557
    Thanks
    8
    Thanked 66 Times in 66 Posts

    Google has restrictions in place on its servers to prevent exactly what you're doing: spidering its indexed search results.
    - Josh

  3. #3
    Join Date
    Nov 2009
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Yeah, I've done some further reading and learnt that.

    However, I've heard from my university tutor that cURL might provide a way round it by tricking Google into thinking it's serving a browser request or something. Something for me to investigate anyway.
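
    A minimal sketch of that cURL idea, assuming the PHP cURL extension is installed. The helper names browser_curl_options() and fetch_as_browser() are made up for illustration and are not part of Simple HTML DOM; the thread does not confirm that Google will actually accept a request sent this way.

```php
<?php
// Hypothetical helper: cURL options that make a request look more like a browser's.
function browser_curl_options($userAgent)
{
    return array(
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent request header
        CURLOPT_TIMEOUT        => 10,         // give up after 10 seconds
    );
}

// Hypothetical helper: fetch a URL with those options.
// Returns the response body as a string, or false on failure.
function fetch_as_browser($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, browser_curl_options($userAgent));
    return curl_exec($ch);
}
```

    The returned string could then be handed to the library's str_get_html() in place of file_get_html(), for example $html = str_get_html(fetch_as_browser($url, $agent)); where $agent is whatever user agent string your own browser sends.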

  4. #4
    Join Date
    Apr 2008
    Posts
    7
    Thanks
    0
    Thanked 0 Times in 0 Posts

    How does Google stop you?

    If I open http://www.google.com/search?q=wombat in a browser and view the page source of the result, I get links with normal hrefs. It would seem that once you got that markup into your own script, you could parse and tweak it at will. So it must be that Google doesn't serve the same thing to requests that it thinks are not from browsers. I know simplehtmldom uses file_get_contents to fetch pages, but if it used cURL you could impersonate a browser user agent, no?
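
    Since the library fetches with file_get_contents, the same user agent idea can be sketched without cURL by passing a stream context. The helper name user_agent_context() is hypothetical, and whether a different User-Agent alone changes what Google serves is an assumption this thread does not settle.

```php
<?php
// Hypothetical helper: build a stream context whose HTTP requests
// carry the given User-Agent header.
function user_agent_context($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent, // value of the User-Agent request header
            'timeout'    => 10,         // read timeout in seconds
        ),
    ));
}
```

    Usage would look like $body = file_get_contents($url, false, user_agent_context($agent)); followed by $html = str_get_html($body); once $body has been checked against false.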
