Grabbing A Website



Deadweight
05-13-2015, 05:52 PM
So I am trying to grab a website from my own domain. However, I get an error like this:

XMLHttpRequest cannot load http://thebcelements.com/NewWebsite/?_=1431538964656. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost' is therefore not allowed access.

I am not sure if this is JavaScript related or something else. I can grab the same website on my local machine just fine.

james438
05-13-2015, 08:45 PM
Just curious, but what do you mean by "grab" a website?

Deadweight
05-13-2015, 11:06 PM
I will show you:


function loadObject(area, file){
    //alert(file!='null'?area+file:area)
    $.ajax({
        // Request area+file, or just area when file is the string 'null'.
        url: file!='null'?area+file:area,
        cache: false,
        dataType:"html",
        success: function(data){
            // Everything in the head after the closing title tag.
            var head_start = data.indexOf('</title>')+'</title>'.length;
            var head_end = data.indexOf('</head>');
            var head_html = data.substring(head_start, head_end);

            // Prefix resource paths with the remote base URL.
            head_html = head_html.replace(/src="+/g, 'src="'+area);
            head_html = head_html.replace(/href="+/g, 'href="'+area);

            // Contents of the body element (assumes a bare <body> tag with no attributes).
            var body_start = data.indexOf('<body>')+'<body>'.length;
            var body_end = data.indexOf('</body>');
            var body_html = data.substring(body_start, body_end);

            $("body").html(head_html+body_html);
        }
    });
}

jscheuer1
05-14-2015, 06:08 PM
That looks more like grabbing a page than an entire site. Regardless, when using AJAX we are mostly bound by the same-origin policy. I know that at one time, and perhaps still, it was possible to get around that using only JavaScript if both domains were in on it. I'm not sure if this is still possible or not. It required certificates and was complicated. And this applies not only to AJAX, but also to virtually any cross-domain JavaScript scripting.

Now, when you are working on the localhost with a virtual (or even a live) server like WAMP, XAMPP, etc., everything on localhost is considered to be on a single domain, so you can cross script and do all the AJAX you like, as long as all of the pages involved are on that 'server' and addressed as localhost. The same is also true for any single live domain. However, if you have more than one domain on a live server, even if you own both of them, browsers will see them as two distinct domains. What counts is the scheme, host name, and port in the URL, not how the folders are laid out on the server. In fact, even if it is the same root folder, accessing http://www.thedomain.com from http://thedomain.com will run afoul of the same-origin policy.
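
The comparison described above can be sketched in a few lines. This is only an illustration, assuming a runtime with the standard URL class (current browsers and Node.js); sameOrigin is a made-up helper name, not a browser API.

```javascript
// Hypothetical helper: two URLs share an origin only when
// scheme, host name, and port all match.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

console.log(sameOrigin('http://localhost/page1.html', 'http://localhost/sub/page2.html')); // true
console.log(sameOrigin('http://thedomain.com/', 'http://www.thedomain.com/'));             // false
console.log(sameOrigin('http://localhost/', 'http://thebcelements.com/NewWebsite/'));      // false
```

The last line is the situation from the first post: a page served from localhost asking for a page on thebcelements.com.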

If you are working simply on the local machine with no virtual server (opening the files directly from disk), it differs by browser how that's treated. Often it's considered all one server. Just about as often, each folder is treated as a separate domain, even child folders.

Getting back to where actual domains and servers are involved, information can be sent cross-domain via query strings and POST data. With query strings you are pretty limited as to the number of characters, the values (unless they are very simple) almost always must be at least URL-component encoded and decoded as part of the process, and the string is seen in the address bar. POST data needs server-side code to receive it.

If you have server-side code available, and assuming the permissions on both domains are liberal enough, you can more easily grab a page by using the various file-read functions to read the offsite file on the server side and then process that data via server-side and/or client-side code.

At this point another consideration is copyright. In many cases you do not own the copyright to the remote site's content. In cases like that, even though you might be technically able to grab it, if you then publish that content, you are violating copyright. You can though, in most cases, use this content for your own purposes, as long as you are the only person able to view it. If you own both sites and their content already, or have permission to publish the other site's content, this is not a problem.
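
For the query-string route mentioned above, the encoding step might look like the sketch below. buildQueryUrl and receive.php are placeholder names chosen for illustration; encodeURIComponent is the standard JavaScript function.

```javascript
// Hypothetical helper: append key/value pairs to a URL as an encoded query string.
function buildQueryUrl(base, params) {
  const pairs = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  });
  return pairs.length ? base + '?' + pairs.join('&') : base;
}

// receive.php stands in for whatever page reads the values on the other domain.
const exampleUrl = buildQueryUrl('http://example.com/receive.php', { team: 'A & B', score: '3-1' });
console.log(exampleUrl); // http://example.com/receive.php?team=A%20%26%20B&score=3-1
```

The receiving page would decode the values again, in JavaScript with decodeURIComponent, or on the server side where most languages do it automatically.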

Deadweight
05-14-2015, 06:35 PM
Okay let me ask you something else.
This might be a little confusing.

You have two different domains. Domain A (DA) and Domain B (DB).
You have a script on DA and you copy it into the head of DB. When I load DB, the script would run and load the index page (hosted on DA) into the DB page. Is that possible (without using iframes)?

jscheuer1
05-14-2015, 07:16 PM
Without frames, (iframe, frame, and I would include the object tag here as well as a sort of frame, because it can act like a frame), and using only javascript and/or AJAX, no.
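
One caveat to that "no": the error message in the first post names a way around it that involves the server. If the server on DA sends an Access-Control-Allow-Origin response header matching DB's origin, browsers will let an AJAX request made from DB read the response. It needs a change on DA, so it is not a JavaScript-only fix. A rough sketch of the server-side decision follows, written in JavaScript purely for illustration; corsHeadersFor and the list of allowed origins are assumptions, not part of any library.

```javascript
// Hypothetical list of origins that DA is willing to serve its pages to.
const allowedOrigins = ['http://localhost', 'http://www.domain-b.example'];

// Return the extra response headers for a request carrying the given Origin header.
function corsHeadersFor(requestOrigin) {
  if (allowedOrigins.indexOf(requestOrigin) === -1) {
    return {}; // no header sent: the browser blocks the cross-origin read
  }
  return {
    'Access-Control-Allow-Origin': requestOrigin, // echo back the one allowed origin
    'Vary': 'Origin'                              // caches must key the response on Origin
  };
}

console.log(corsHeadersFor('http://localhost'));
console.log(corsHeadersFor('http://not-allowed.example'));
```

Whatever serves the pages on DA would add these headers to its response; the same header can be sent from PHP or from the web server configuration.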

Deadweight
05-14-2015, 07:33 PM
What about using PHP with ajax?

jscheuer1
05-14-2015, 08:19 PM
Sure. If you have PHP on the domain that wants to grab the other domain, you can use file() or file_get_contents(), etc. (sometimes more extreme measures are needed depending upon the length and characteristics of the grabbed content) to grab the other domain's page and then use AJAX with that either to initiate it and/or process the content returned. But AJAX isn't needed. The content can be processed using PHP and then included on another PHP page. You really only need AJAX if you want to import the result to an HTML page without setting that extension for parsing via PHP (usually only .php is processed by PHP).

However, the grabbing domain must have permission to use its file-processing functions like file(), etc. on files from remote domains (in PHP this is governed by the allow_url_fopen setting), and the remote domain must not have this action blocked.

And, as mentioned in my other post, that's just the technical side. If you do not have permission/the right to publish the grabbed content on the grabbing domain, you are probably violating copyright and/or other laws.

Deadweight
05-14-2015, 08:58 PM
They will have the right to copy the website because their website (pages) will be hosted on my domain. I would then give them some code that lets them show their website on their own domain without storing the files there.

jscheuer1
05-14-2015, 10:03 PM
That's certainly doable. But it can get tricky with paths for links, css, js, and other resources. What's usually done in cases like that is that the user from the remote site actually comes to your domain. Information about them or what they're doing (if any) is posted to the page they enter your domain on. Any information that needs to go back to the site they came from is posted back either in more or less real time via AJAX and PHP, or as a form submission as they return to the originating site.

If there are a lot of pages involved, this is generally easier and faster.

If there is just one relatively simple third-party service you are offering, then it might be feasible. But even then, consider how PayPal does it: the user goes to PayPal to make payment to the vendor (the owner of the site they came from), carrying POST data with them about what they are paying for and to whom, and is then returned to the vendor's site with POST data about the transaction. Separate data is available to the vendor by logging in to their PayPal account to see records of all transactions, and both vendor and buyer receive confirmation emails.

If you could give a concrete example, I could be of more help. It is possible that fetching the page(s) from your domain via PHP as we have been discussing would be best. It would require the client domain have PHP with permission set to allow that. There's more time lag though* with that approach than with almost any other way I can think of for doing something like this.

*The client domain must request the content from your domain, download it to their server, then process it on their server and then serve it to their user.

The best use of something like this would be for data, say, sports scores or stock prices that you host, that they could fetch and present to their users. Even at that, it would be better to have them fetch a data file (XML is good for this), then parse it and present the data to their user in a table on an otherwise ready-made page on their end.
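
As a rough sketch of that last idea, here is the parse-and-present step in JavaScript. It swaps JSON in for XML, only because JSON.parse is built into every JavaScript runtime; the sample data, escapeHtml, and renderTable are all made up for illustration.

```javascript
// Escape text so it is safe to place inside HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build an HTML table from an array of flat objects; columns come from the first row.
function renderTable(rows) {
  if (rows.length === 0) return '<table></table>';
  const columns = Object.keys(rows[0]);
  const head = '<tr>' + columns.map(c => '<th>' + escapeHtml(c) + '</th>').join('') + '</tr>';
  const body = rows.map(row =>
    '<tr>' + columns.map(c => '<td>' + escapeHtml(row[c]) + '</td>').join('') + '</tr>'
  ).join('');
  return '<table>' + head + body + '</table>';
}

// Made-up sample standing in for the fetched data file.
const feed = '[{"team":"Reds","score":3},{"team":"Blues & Co","score":1}]';
console.log(renderTable(JSON.parse(feed)));
```

Only the small data file crosses between servers; the page around the table stays on the client's own domain.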

Deadweight
05-15-2015, 01:05 AM
What I was actually thinking is having the person register an account (with a unique username). It would auto-create a folder in my domain. The registration would be saved in MySQL and contain their "other domain(s)" with an encoded password and username. I then would give them a short script that will send the encoded username, password, and current domain to a PHP file to check whether their domain address is valid. If valid, then it [the code] will grab an XML file. The main contents of the XML file are the scripts and stylesheet links. Then the PHP would also grab the main page (first time loading) and send that back. One of my own JS files is also loaded into the page; if a link is clicked, it prevents the default action of going to that page. Instead it would send a request for the page to PHP via AJAX, and the page would get sent back and loaded into the correct areas. Sounds mega confusing I know.
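
The link-handling part of that plan might be sketched like this. shouldIntercept is a hypothetical helper, the base URL is simply the one used elsewhere in this thread, and the actual AJAX request is left as a comment because its details depend on the PHP side.

```javascript
// Hypothetical helper: should a clicked link be loaded through the PHP/AJAX route
// instead of a normal page load? Only links under siteBase qualify.
function shouldIntercept(href, siteBase) {
  let target;
  try {
    target = new URL(href, siteBase); // resolves relative links against siteBase
  } catch (e) {
    return false; // unparsable link: let the browser handle it
  }
  const base = new URL(siteBase);
  return target.origin === base.origin && target.pathname.indexOf(base.pathname) === 0;
}

// Browser-only wiring; skipped when no DOM is present.
if (typeof document !== 'undefined') {
  document.addEventListener('click', function (event) {
    const link = event.target.closest ? event.target.closest('a[href]') : null;
    if (link && shouldIntercept(link.getAttribute('href'), 'http://www.thebcelements.com/NewWebsite/')) {
      event.preventDefault(); // stop the normal navigation
      // Request the page from the PHP script here and insert the returned HTML.
    }
  });
}
```

Links that point off the hosted site, or that are not ordinary page links (mailto: and so on), fall through to the browser's normal behaviour.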

jscheuer1
05-18-2015, 04:44 PM
That's kind of the opposite of what I asked for. I wanted a specific explanation of what they get. Not a general explanation of the methods/mechanisms.

The key thing that I think I can give you here is a method of grabbing files from a remote site via PHP. It can be (assuming the sought after file isn't overly complex or long, and that the requesting server has permission to request files from the web and the requested from server isn't blocking this) as simple as using the include command, ex:


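<?php
// Fetching over HTTP with include needs allow_url_include enabled, and any PHP code
// in the response is executed locally. readfile($file) outputs the content without evaluating it.
$file = 'http://www.whatever.com/somefile.htm';
include $file;
?>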

Now you have a copy of the served source of http://www.whatever.com/somefile.htm on the other server to do with whatever you like. You can access it via AJAX and parse it any way you want. Or the above could be inside a (possibly hidden) div element; then you can parse it right on the same page using JavaScript and the DOM or the innerHTML property of that div.

Instead of include $file, you can use file($file) or file_get_contents($file). file() yields an array of the lines in the file; file_get_contents() gives you a string with the contents. The advantage with these is that you can set a server-side variable equal to one of them and then parse its content on the server side.

If the file is too long or too complex, I have found that using cURL (if available; it comes with many PHP installations, and can be added to virtually any PHP server) often takes care of any problems those other simpler methods can create with longer and/or more complex files. It's just a little more complicated to set up. The simpler methods can handle fairly long and complex files though.

Still, you are having one server request the file, process it, then serve it yet again to a user. If there's any way to cut down on that, it's a good idea. I've mentioned two ways of doing that in this thread already:

1) Taking the user to the other site for a period of time in order to complete some action. Useful for third-party utils like PayPal, mail handlers, etc.
2) Instead of sending an entire page, just send the data required to build that page. Great if you are just offering content like scores, listings or prices.

If the situation warrants, a combination of these can be used. But to reconstruct entire pages wholesale, especially if there are a lot of them, can get resource intensive and cause more of a lag than these other approaches.

Deadweight
05-18-2015, 11:10 PM
Hey John,
I think I got it. I did something like this:


<?php
// Page to fetch, and the base URL to put in front of src/href values.
$page = 'http://www.thebcelements.com/NewWebsite/';
$domain = 'http://www.thebcelements.com/NewWebsite/';

// htmlentities() escapes the markup, so the echo below shows the source as text
// instead of rendering it.
$homepage = htmlentities(file_get_contents($page));
$headObjects = strSplitter($homepage, htmlentities("</title>"), htmlentities("</head>"));
$bodyObjects = strSplitter($homepage, htmlentities("<body>"), htmlentities("</body>"));
$finalHead = headReplace($headObjects, $domain);

echo $finalHead.'<br /><br />'.$bodyObjects;

// Prefix every src="..." and href="..." value with $website.
function headReplace($string, $website){
    $string = str_replace(htmlentities('src="'), htmlentities('src="'.$website), $string);
    $string = str_replace(htmlentities('href="'), htmlentities('href="'.$website), $string);
    return $string;
}

// Return the text between the first $beg and the first $end, or '' if either is missing.
function strSplitter($string, $beg, $end){
    $s = strpos($string, $beg);
    $e = strpos($string, $end);
    if ($s === false || $e === false) {
        return '';
    }
    $s += strlen($beg);
    return substr($string, $s, $e - $s);
}

?>


I think that will work best. $domain would be different from $page in later runs.

Thanks,

jscheuer1
05-19-2015, 12:14 AM
You seem to have the general idea of what can be done. I've played around with all of these in various ways in various situations. Generally, that's how you are going to fine tune things, if necessary. But if you have any further questions, let me know.
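
One thing to watch when fine-tuning the replace step in the code posted earlier in the thread: prefixing every src and href with the base URL also alters values that are already absolute. A possible alternative, sketched in JavaScript and assuming the standard URL class is available, resolves each value against the base instead; absolutizeUrls and the sample markup are made up for illustration.

```javascript
// Hypothetical helper: resolve every src="..." and href="..." value against baseUrl.
// Relative paths gain the base; absolute URLs come back unchanged.
function absolutizeUrls(html, baseUrl) {
  return html.replace(/\b(src|href)="([^"]*)"/g, function (match, attr, value) {
    try {
      return attr + '="' + new URL(value, baseUrl).href + '"';
    } catch (e) {
      return match; // leave anything unparsable untouched
    }
  });
}

const sampleHead = '<script src="js/site.js"></script><link href="http://cdn.example/x.css">';
console.log(absolutizeUrls(sampleHead, 'http://www.thebcelements.com/NewWebsite/'));
// <script src="http://www.thebcelements.com/NewWebsite/js/site.js"></script><link href="http://cdn.example/x.css">
```

The same idea carries over to the PHP version; only the URL-resolving call would differ.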