Thread: HTML Parser Confusion

  1. #1

    HTML Parser Confusion

    Hey there, I'm trying to pull product information from Amazon using an HTML parser, but I don't understand why two different URLs come back with the same content.

    The HTML parser I'm using can be found here: http://simplehtmldom.sourceforge.net/index.htm

    Here are the Amazon URLs:

    http://www.amazon.co.uk/s/ref=nb_ss?...311082&x=0&y=0

    The number in the URL comes from a barcode scanner. The URL above is the search results page, whereas the URL below is what I get by clicking through to the first link on the results page:

    http://www.amazon.co.uk/Worst-Witch-...1029691&sr=1-1

    If you open those two URLs in a browser, you'll see the content is completely different.

    However, when I fetch them with the HTML parser, both return the same data:

    PHP Code:
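    require_once 'simple_html_dom.php'; // assumed path to the Simple HTML DOM library

    // Search results page for the scanned barcode
    $html = file_get_html('http://www.amazon.co.uk/s/ref=nb_ss?url=search-alias%3Dstripbooks&field-keywords=9780140311082&x=0&y=0');

    // Product page reached by clicking the first search result
    $other = file_get_html('http://www.amazon.co.uk/Worst-Witch-Puffin-Books/dp/0140311084/ref=sr_1_1?ie=UTF8&s=books&qid=1251029691&sr=1-1');

    echo $html->plaintext;
    echo $other->plaintext;

    // They will both echo the same data, even though they are different URLs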
    It's as if the parser is clicking through the first result automatically and going straight to the product information. Can anyone explain this?
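One way to check the "clicking through automatically" theory, as a rough sketch: PHP's HTTP stream wrapper follows redirects by default, so if the search URL answers with a redirect to the product page, a fetch made through that wrapper would return the product page for both URLs. The helper below pulls the `Location` headers out of a header list such as the one `get_headers()` returns. The name `findRedirectTargets` is hypothetical and not part of Simple HTML DOM, and whether Amazon actually redirects this search is an assumption to verify, not something confirmed here.

```php
<?php
// Sketch: collect redirect targets from a flat list of raw header lines.
// get_headers() follows redirects by default and returns the headers of
// every hop in one array, so any "Location:" line marks a redirect.
function findRedirectTargets(array $headerLines): array
{
    $targets = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match the prefix loosely.
        if (is_string($line) && stripos($line, 'Location:') === 0) {
            $targets[] = trim(substr($line, strlen('Location:')));
        }
    }
    return $targets;
}

// Usage (needs network access, so it is left commented out):
// $headers = get_headers($searchUrl);   // $searchUrl: the search URL above
// if ($headers !== false) {
//     print_r(findRedirectTargets($headers));
// }
```

If the list comes back non-empty for the search URL and ends at the product page, the two fetches are reading the same document, which would explain the identical output.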

  2. #2


    My first instinct was that these pages rely on a lot of JavaScript to load content and CSS for layout. I looked at the source of each page and, sure enough, there is plenty of both. I bet the HTML you are pulling out is similar enough, perhaps even exactly the same, but because of all the JavaScript and CSS the pages look completely different in a browser.

    My guess is that the DOM changes after the page has loaded. If your parser could somehow grab the HTML after all the JavaScript has run, perhaps you'd get different results.

    I didn't test anything, this is just my best guess.
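A cheap way to test that guess would be to compare the raw markup of the two responses before any parsing. The sketch below is only an illustration: `pageTitle` and `summarizeDifference` are hypothetical helper names, and the regex-based title lookup is a shortcut that assumes a single, ordinary `<title>` element.

```php
<?php
// Sketch: return the text of the first <title> element, or null if none.
function pageTitle(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Sketch: report whether two documents are byte-for-byte identical
// and what each one calls itself in its <title>.
function summarizeDifference(string $a, string $b): array
{
    return [
        'identical' => $a === $b,
        'titleA'    => pageTitle($a),
        'titleB'    => pageTitle($b),
    ];
}

// Usage (needs network access, so it is left commented out):
// $a = file_get_contents($searchUrl);   // assumed: the search URL
// $b = file_get_contents($productUrl);  // assumed: the product URL
// if ($a !== false && $b !== false) {
//     print_r(summarizeDifference($a, $b));
// }
```

If both titles come back as the product title, the server is already sending the same page for both URLs and JavaScript is not the cause. If the titles differ, the difference is being lost somewhere after the fetch.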
