HTML Parser Confusion



Schmoopy
08-23-2009, 12:34 PM
Hey there, I'm trying to pull product information from Amazon using an HTML parser. What I don't understand is that two different URLs return the same content when I parse them.

The HTML parser I'm using can be found here: http://simplehtmldom.sourceforge.net/index.htm

Here are the amazon URLs:

http://www.amazon.co.uk/s/ref=nb_ss?url=search-alias%3Dstripbooks&field-keywords=9780140311082&x=0&y=0

The number in the search URL comes from a barcode scanner. The URL above is the search results page, whereas the URL below is what I get if I click through to the first link on that results page:

http://www.amazon.co.uk/Worst-Witch-Puffin-Books/dp/0140311084/ref=sr_1_1?ie=UTF8&s=books&qid=1251029691&sr=1-1

If you open those two URLs in a browser, you'll see the content is completely different.

However, when I fetch them with the HTML parser, both return the same data:



// Load the Simple HTML DOM library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';

// Search results URL
$html = file_get_html('http://www.amazon.co.uk/s/ref=nb_ss?url=search-alias%3Dstripbooks&field-keywords=9780140311082&x=0&y=0');

// Product page URL
$other = file_get_html('http://www.amazon.co.uk/Worst-Witch-Puffin-Books/dp/0140311084/ref=sr_1_1?ie=UTF8&s=books&qid=1251029691&sr=1-1');

echo $html->plaintext;
echo $other->plaintext;

// Both echo the same data, even though the URLs are different


It's as if the parser is clicking through automatically from the first page and going straight to the product information. Can anyone explain this?
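
One way to test the "clicking through automatically" idea is to check whether the server answers the search URL with an HTTP redirect, since PHP's HTTP stream wrapper follows redirects by default. The sketch below is only a possible diagnostic, not a confirmed explanation: summarize_redirects() is a hypothetical helper name, and it assumes the header lines come from PHP's get_headers(), which returns the headers of every hop in the chain.

```php
<?php
// Hypothetical helper: walks a flat list of raw response header lines
// (as returned by get_headers()) and records, for each response in the
// chain, its status code and any Location header it carried.
function summarize_redirects(array $headerLines): array
{
    $hops = [];
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#i', $line, $m)) {
            // A status line starts a new hop.
            $hops[] = ['status' => (int) $m[1], 'location' => null];
        } elseif ($hops && preg_match('/^Location:\s*(.+)$/i', $line, $m)) {
            // Attach the redirect target to the most recent hop.
            $hops[count($hops) - 1]['location'] = trim($m[1]);
        }
    }
    return $hops;
}

// Usage against a live URL (needs network access, so left commented out):
// $searchUrl = 'http://www.amazon.co.uk/s/ref=nb_ss?url=search-alias%3Dstripbooks&field-keywords=9780140311082&x=0&y=0';
// print_r(summarize_redirects(get_headers($searchUrl)));
```

If the first hop shows a 3xx status with a Location pointing at the product page, that would account for both calls returning the same content. If there is only a single 200 response, the cause lies elsewhere.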

JasonDFR
08-24-2009, 03:38 PM
My first instinct was that these pages use a lot of JavaScript to load content and CSS for layout. I looked at the source of each page and, sure enough, there is a lot of both. I bet the HTML you are pulling out is similar, perhaps even identical, and the pages only look completely different because of all the JavaScript and CSS.

My guess is that the DOM changes after the page has loaded. If your parser could somehow grab the HTML after all the JavaScript has run, perhaps you'd get different results.

I didn't test anything; this is just my best guess.
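
The guess above could be checked by comparing the raw markup of the two responses directly instead of eyeballing the echoed text. This is a minimal sketch under stated assumptions: compare_markup() is a hypothetical helper, and the "same text" check is a rough approximation, because strip_tags() removes tags but keeps the text inside script and style elements.

```php
<?php
// Hypothetical helper: reports whether two HTML strings are byte-for-byte
// identical, and whether their tag-stripped, whitespace-normalized text matches.
function compare_markup(string $a, string $b): array
{
    $normalize = function (string $html): string {
        // Drop tags, collapse runs of whitespace, trim the ends.
        return trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    };

    return [
        'identical' => $a === $b,
        'same_text' => $normalize($a) === $normalize($b),
    ];
}

// Usage with the two pages from the first post (needs network access,
// so left commented out; $searchUrl and $productUrl are placeholders):
// $result = compare_markup(file_get_contents($searchUrl), file_get_contents($productUrl));
// var_dump($result);
```

If 'identical' comes back true, both requests are receiving the same document from the server, and the difference seen in a browser does not come from the parser.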