Rip text from website



Marcymarc
10-03-2010, 12:34 PM
Hi,

I need some sort of script that, when run, will rip the text from this website http://m.2dayfm.com.au/playing and append it to a text file. Each time it is run, it will keep appending. I can schedule the script using Windows.

It would be good too if just the songs could be appended and not any other text, but this part's not important.

bluewalrus
10-03-2010, 02:26 PM
Do you have access to php?

jscheuer1
10-03-2010, 02:28 PM
This can probably be done with PHP. But depending upon what that content is and what you intend to do with it, it may be a violation of copyright.

What would you be using this for?

You would need a PHP server or one that can run on your OS. If using windows, that could be WAMP. Then you just get the file from the remote server and append it to a file on your server. You could filter it before appending it using various PHP string functions (probably preg_replace()) in an attempt to only save the sort of information you're looking for.

Marcymarc
10-03-2010, 08:31 PM
Yep, got access to PHP, like a web server. But if I use this, in the script it would need to be set to run every 5 minutes for example or any numeral that I specify.

Would you have an example of a PHP script that would do this?

As for your question of why: I'm using this website as an example, a base so to speak. There might be others, but once I get the script, changing the URL it reads would be pretty easy.

jscheuer1
10-03-2010, 11:26 PM
That would be what I believe is known as a cron job. Something the server does every whatever. I don't know the specifics of that.

Back to my question - I don't really care what site you are grabbing information from. What concerns me is what you are doing or intend to be doing with that information once you get it. What do you want the information for? What will you do with it once you have it?

Marcymarc
10-04-2010, 07:29 PM
Hey John, this shouldn't be a concern of yours, I come here for help. Can you please help by posting me a script or pointing me in the right direction?

I need some sort of script that, when run, will rip the songs from this website http://www.2dayfm.com.au/nowplaying/iframe and append them to a text file. Each time it is run, it will keep appending. It can either be scheduled to run using Windows, or the script can be set to loop or run every X minutes.

traq
10-04-2010, 07:50 PM
Marcymarc,

John is simply trying to point out potential problems with what you're trying to do. If I were in your position, I would appreciate his concern.

Furthermore, it is entirely appropriate for him to be concerned with what he is helping someone to do. Everyone here helps others because they want to. If your purpose is legit, I'm sure he won't have any problem helping you figure it out.

Beyond the ethical concerns, what you are trying to do will probably have a significant impact on what the best solution will be. If you want help, you need to be as forthcoming as possible.

jscheuer1
10-04-2010, 09:27 PM
Yes. It's against forum rules to help with illegal requests. Since you will not answer the question, we pretty much have to assume that your purpose here is illegal. From the rules (http://www.dynamicdrive.com/forums/rules.htm):


1.4) No illegal requests- Do not post requests that are illegal or break the usage terms of the service in question, such as where to download warez, disable pop up ads on your free host etc.



That would include taking copyrighted material such as song or play lists and displaying them on your own site. I checked the terms of service of that site, and something like that is prohibited. If you have permission to do so, show us some proof.

Marcymarc
10-05-2010, 06:50 AM
The reason I need the information is so I can keep track of what music is played on other radio stations, like a report. Instead of listening to the radio in real time and writing down songs one by one, having a script pump them into a text file is easier for me; I can then sit down and run my eye over the list in one go.

Once the songs are published on a site, I can take this information for my own use. As long as I don't re-use or sell the information, it's okay. It's the same as me listening to the radio or recording the radio; people are allowed to record radio for their own use.

I run a radio station too, and I need to keep track of what other stations are playing. Quite a normal thing to do in the industry.

jscheuer1
10-05-2010, 08:43 AM
I run a radio station too, I need to keep track of what other stations are playing. Quite a normal thing to do in the industry.

Good enough for me.

OK, let's get back to business here. I'm not aware of any specific program for this. And it would depend upon what PHP version, as to the exact commands required, but that can be worked out either by using a fall back for older versions or by your knowing what version of PHP you have. What version do you have?

Not long ago I worked out a method whereby one website could capture a page of another. There is a form. The user inputs the desired page and submits. The page's content is fed back to the user with a Flash application superimposed over it in the lower right corner. There's quite a bit of detail to how this is done, but you don't need to know all that. The basic thing is to grab the page:


$url = 'http://www.somedomain.com/';
$requested_page = file_get_contents($url);

and write to the log file:


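$file = 'somedomain.log'; // May be any filename that you like, or obtained from the $url variable in some fashion.
if (version_compare(PHP_VERSION, '5.1.0', '>=')) {
    // The LOCK_EX flag for file_put_contents() was added in PHP 5.1.0
    file_put_contents($file, $requested_page, FILE_APPEND | LOCK_EX);
} else {
    // Fallback for older versions: open in append mode and write manually
    $afile = fopen($file, 'a');
    fwrite($afile, $requested_page);
    fclose($afile);
}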

There are all sorts of things one could do with the $requested_page variable to alter it (like strip out HTML tags, and/or everything except that data you are looking for) after it's obtained and before writing it to the log file.
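As a rough illustration of that filtering step, here is a minimal sketch that reduces the fetched HTML to plain text lines before logging. The function name html_to_text_lines() is made up for this example, and it assumes the data you want is actually present in the HTML that file_get_contents() returns.

```php
<?php
// Hypothetical helper (not part of PHP): reduce fetched HTML to plain text lines.
function html_to_text_lines($html)
{
    // Remove <script> and <style> blocks first; strip_tags() drops the tags
    // themselves but would leave their contents behind.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Turn <br> and common block-level closing tags into line breaks
    // so that separate entries stay on separate lines.
    $html = preg_replace('#<br\s*/?>|</(p|div|li|tr|h[1-6])>#i', "\n", $html);
    // Strip the remaining tags and decode entities such as &amp;
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    $lines = array();
    foreach (explode("\n", $text) as $line) {
        // Collapse runs of whitespace and skip lines that end up empty
        $line = trim(preg_replace('/\s+/', ' ', $line));
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return $lines;
}
```

The result could then be joined with implode("\n", ...) and written to the log in place of the raw $requested_page. Picking out only the song lines would still need a pattern specific to the page in question.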

As I alluded to before, the page that this code is on could be run periodically. Data could be passed to the page to set the URL to grab and/or the filename of the file to be written to.

djr33
10-05-2010, 02:49 PM
That's a good start and there are a lot of options as John says.

One thing I can add is that to use file_get_contents(), or really any functions that would do the same thing, you need to set your server to allow external URLs. If that parameter is disabled it will only work for local files. If you have problems with it, check this. If it works, then your server is already configured to handle it.

The parameter is called allow_url_fopen. There might be some more ways to configure all of it, but generally speaking that should be enough info for now.
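A quick way to check that setting from a script is to read it with ini_get(). This is only a sketch; the function name url_fopen_enabled() is invented for the example.

```php
<?php
// Hypothetical helper (not part of PHP): report whether allow_url_fopen is on.
function url_fopen_enabled()
{
    // ini_get() returns the setting as a string such as "1", "0", "On" or "",
    // or false if the directive does not exist.
    $value = ini_get('allow_url_fopen');
    return in_array(strtolower((string) $value), array('1', 'on', 'true', 'yes'), true);
}

if (!url_fopen_enabled()) {
    echo "allow_url_fopen is disabled; file_get_contents() will not fetch remote URLs.\n";
}
```

The setting itself has to be changed in php.ini or the server configuration; it cannot be switched on from inside the script with ini_set().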


Also, I want to give you some more advice about the nature of your request: while your purposes may be entirely legal, the methods you are using could be potentially against the terms of service for the website. While I don't personally care that much what you do, the station from which you are taking this data certainly will. They probably want to protect this data just as much as you want to keep it. The way that this might become an issue is that you aren't "listening to the radio" but instead are automating a process in an unusual way. It is normal to listen to the radio. It is not normal to create an automated script that just checks the playlist and skips actually listening to the music. This means that your behavior (rather, your server's behavior) may get flagged by their server as an automated request and they may block you. This entirely depends on how obvious it is (that the behavior of your server is different from a regular visitor) and how aware they are of the activity on the server. Furthermore, they could even just check that the IP of the request is the same as a server and therefore is probably an automated request, rather than an IP without a website attached to it-- probably a regular user.
Do what you'd like, but be aware that the legal situation is complex, not just in what you do, but also how you do it. And regardless of legality, a host can block you if they feel like it, and if that happens there's nothing you can do about it (as far as I know, there are no legal rights associated with being able to access a domain).
Of course one simple option for all of this is to just ask them to work with you, letting you have access to their playlists (either through this method or another) and perhaps you could trade. They might say no, but if they agree, at least you'd know for sure that you aren't going to randomly get blocked, or even sued (you don't need to be doing anything illegal to get sued, only to be proven guilty). I imagine that without asking for permission this could eventually lead to a cease and desist letter, which basically means "stop or we'll try something else". At that point, of course, you could look for another solution, but getting their cooperation beyond that is unlikely, I'd guess.

jscheuer1
10-05-2010, 03:24 PM
I didn't think of that:


it would need to be set to run every 5 minutes for example or any numeral that I specify.

Even every 5 minutes could possibly be construed as an assault on their bandwidth. Do it much more frequently and it almost certainly would be. If they discovered it, and that wouldn't be too hard for them to do, they would have to block it.

djr33
10-06-2010, 05:04 AM
Yes, and if you do it that often, that dramatically increases the odds that that particular connection will be noticed by the host and they will block you. As I said, regardless of legality, they can block you if they want, so the only way to stop that from happening is to "play nice" (and if you choose to, ask permission).
Plus, does the playlist actually change every 5 minutes? There must be a more efficient way to approach this.
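If the playlist has not changed between two runs, there is at least no reason to log the same entry twice. A minimal sketch, assuming one entry per line in the log; append_if_changed() is a made-up name. Note that this only keeps the log clean and does not reduce the number of requests sent to the remote site.

```php
<?php
// Hypothetical helper (not part of PHP): append $entry to $file only if it
// differs from the last line already in the log.
function append_if_changed($file, $entry)
{
    $entry = trim($entry);
    if ($entry === '') {
        return false; // nothing worth logging
    }

    $last = '';
    if (is_file($file)) {
        // Reads the whole log into memory; fine for a small text file
        $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if ($lines) {
            $last = $lines[count($lines) - 1];
        }
    }

    if ($entry === $last) {
        return false; // same entry as last time, skip it
    }
    return file_put_contents($file, $entry . "\n", FILE_APPEND | LOCK_EX) !== false;
}
```

The function returns true when something was written and false when the entry was empty, unchanged, or could not be written.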

Marcymarc
10-11-2010, 07:12 AM
I won't be running this script all the time, maybe for about 3 hours here, and 2 hours there. It can be set to run every 5-10 minutes; I would need to be able to change this setting.

If I run the script that John posted, it captures the whole page, all of it, but not the songs. It captures the text on the page as id="LastPlayedArtist-1", id="LastPlayedArtist-2", id="LastPlayedArtist-3" etc., the same as when you open the page in a browser and view source.

So all I want is just the actual songs, without all the other HTML code. The PHP would have to read the page in such a way that it would produce the songs.
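For a run limited to a few hours at a fixed interval, one option is a loop with sleep(), started from the command line. This is only a sketch under those assumptions; run_for() and grab_playlist() are made-up names, and a scheduled task or cron job calling the script would do the same job without a long-running process.

```php
<?php
// Hypothetical helper (not part of PHP): call $task every $interval seconds
// until $duration seconds have passed. Returns how many times $task ran.
function run_for($task, $interval, $duration)
{
    $runs = 0;
    $end = time() + $duration;
    while (true) {
        call_user_func($task);
        $runs++;
        if (time() + $interval > $end) {
            break; // the next run would fall outside the time window
        }
        sleep($interval);
    }
    return $runs;
}

// Example (commented out): call a hypothetical grab_playlist() function
// every 10 minutes for 3 hours.
// set_time_limit(0); // lift PHP's execution time limit for the long run
// run_for('grab_playlist', 600, 3 * 3600);
```

The task always runs once straight away, then repeats for as long as another full interval still fits in the window.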

If you can please put a sample script together, that would be very helpful.

jscheuer1
10-11-2010, 02:41 PM
That's because those values are generated by javascript. They're not part of the source code of the page. To capture something like that in an automated fashion, this (requires Excel and VBA):

http://www.associatedcontent.com/article/1506423/internet_explorerapplication_for_vba.html?cat=55

looks promising.