wkenny
06-23-2007, 01:59 PM
I want something where I can type in a URL and then a search string. The prog should go to the URL and crawl the site. If it finds the string it should return the page name on which the search string appears and stop the crawl.
I would want basically a single line report with either a page URL or a 'string not found' message.
djr33
06-23-2007, 07:55 PM
Interesting.
Is this a "first time found" report, or would you want a full listing of all pages?
When you say crawling, do you mean searching through your site by every file/directory (from the server end of things), or following links on each page as real search engines like google do?
Both would be possible with PHP, and both would likely run quite slowly. The way this is done is by loading each page into a variable as a string and then searching that string, which takes a significant amount of time when there is a lot of data to get through.
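A minimal sketch of that load-then-search step, assuming the page source is already in a string (for instance from file_get_contents()). The function name pageContains is made up for illustration:

```php
<?php
// Hypothetical helper: true if $needle occurs anywhere in $pageSource.
function pageContains(string $pageSource, string $needle): bool
{
    // strpos() returns false for "no match" but 0 for a match at the very
    // start, so compare with !== false instead of a loose truthiness check.
    return $needle !== '' && strpos($pageSource, $needle) !== false;
}

$page = '<html><body><a href="http://www.mysite.com/">My site</a></body></html>';
var_dump(pageContains($page, 'www.mysite.com'));    // bool(true)
var_dump(pageContains($page, 'www.othersite.com')); // bool(false)
```

Note this searches the raw source, tags included, which is what you want for finding a link but not for finding visible words.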
A more practical way might be to use a database for this and rely on its built-in, faster search features. However, you would then have to pick one of several options:
1. Create a mirror of the pages (or actually store the pages' content directly) in the database.
2. Create a representation of the pages' content through a list of keywords by:
a) automatically generating a list of terms/words by parsing each page, like the original search idea, but one time, so the speed wouldn't be so crucial. Even if your pages change frequently, this could be run each night, so it would stay up to date, but not slow your visitors down.
b) manually create a terms/keywords list like above, but without parsing the actual page to get it. Though this seems weird, I've used this method and it works well, since you can, for each page, type up a list of 100 or so keywords without too much trouble, then just leave it at that. You could also do this by cutting and pasting important parts of the content (paragraph by paragraph) into the database.
However, note that a database might start running a bit sluggishly if you are still searching through a ton of text for a match to the search term.
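For option 2a, here is a rough sketch of how a page could be boiled down to a keyword list. The function name extractKeywords, the three-character minimum, and the sample page are all assumptions for illustration, and the database write itself is left out:

```php
<?php
// Hypothetical helper: build a de-duplicated keyword list from a page's HTML.
function extractKeywords(string $html, int $minLength = 3): array
{
    // Put a space before each tag so words in adjacent elements do not
    // fuse together once strip_tags() removes the markup.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Decode entities and lowercase so later matching is case-insensitive.
    $text = strtolower(html_entity_decode($text));
    // Split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Keep words of at least $minLength characters, once each.
    $words = array_filter($words, fn($w) => strlen($w) >= $minLength);
    return array_values(array_unique($words));
}

// Nightly job (sketch): each list would be stored in the database per page.
$keywords = extractKeywords('<h1>About Us</h1><p>We sell garden tools.</p>');
print_r($keywords); // about, sell, garden, tools
```

This only handles plain ASCII words; accented characters would need a different split pattern.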
The "string not found" message is very easy; just, at the end of all this, check using an if statement:
if ($finalresult == '') { $finalresult = 'Sorry, no results found.'; }
(This can be expanded on a bit, too.)
The other two things to consider in this that could become difficult if you choose to implement them:
1. Parsing the html on a page. If PHP is searching a page for "<br>" (a line break), then it will find literally that code and return all pages with a line break. In the same sense, if it is searching for "input", a page with an <input> tag will also be found to match. This means you would need to program around this, so that the html would not be found as a match. Creating the keywords list is the easiest way, or you could actually parse the html, to a limited extent, to skip the code and go right to the content, but this might prove difficult, depending on your time/knowledge. I'd certainly consider it a challenge, especially if there is any chance you do have some errors on your page. On top of that, you might need to not just strip all tags, but skip some elements, OR even return pages with some elements within tags that match, such as image descriptions.
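To give an idea of the "limited extent" version of point 1, this sketch strips the markup but keeps image alt text. The name searchableText is hypothetical, and regex-based handling like this is fragile on broken html, so treat it as a starting point only:

```php
<?php
// Hypothetical helper: reduce a page to searchable text
// (visible text plus image alt text, no tags).
function searchableText(string $html): string
{
    // Remove script and style blocks entirely; their contents are code.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Collect double-quoted alt="..." values before the tags are stripped.
    preg_match_all('/<img\b[^>]*\balt\s*=\s*"([^"]*)"/i', $html, $m);
    $alts = implode(' ', $m[1]);
    // Pad tags with spaces so adjacent words do not fuse, then strip them.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Decode entities and collapse runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', html_entity_decode($text . ' ' . $alts)));
}

$html = '<p>Fill in the form<br>below.</p><input name="q"><img src="a.png" alt="input diagram">';
$text = searchableText($html);
echo $text, "\n";                             // Fill in the form below. input diagram
var_dump(strpos($text, 'name=') !== false);   // bool(false): the <input> tag is gone
var_dump(strpos($text, 'diagram') !== false); // bool(true): alt text still matches
```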
2. You may want to interpret the user input to the searching more than just as a simple string. If there is a space do you find any or all of the words? Is order important? Will you allow boolean operators, like google? Will you strip words like "the"? Will you allow any characters? Will you convert characters to their html entities when searching? Error correction if the user types a bad symbol, or even error correction for spelling? As you can see, these details get complex fast. They aren't actually needed, but they are certainly a nice feature, if possible, though a lot of work to deal with.
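As a small illustration of the "any or all of the words" question in point 2, something like the following would do. The function names, the stop-word list, and the 'all'/'any' mode values are all invented for this sketch:

```php
<?php
// Hypothetical helper: split a query into lowercase terms, minus stop words.
function parseQuery(string $query, array $stopWords = ['the', 'a', 'an', 'of']): array
{
    $terms = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($terms, $stopWords));
}

// Hypothetical helper: mode 'all' needs every term, 'any' needs at least one.
function matchesQuery(string $text, array $terms, string $mode = 'all'): bool
{
    if ($terms === []) {
        return false; // nothing left to search for
    }
    $text = strtolower($text);
    $hits = 0;
    foreach ($terms as $term) {
        if (strpos($text, $term) !== false) {
            $hits++;
        }
    }
    return $mode === 'any' ? $hits > 0 : $hits === count($terms);
}

$terms = parseQuery('The garden TOOLS'); // ['garden', 'tools']
var_dump(matchesQuery('We sell garden furniture.', $terms, 'all')); // bool(false)
var_dump(matchesQuery('We sell garden furniture.', $terms, 'any')); // bool(true)
```

Word order, boolean operators, and spelling correction are left out on purpose; each would add a fair bit of code.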
And these two are related in a way; you will want to consider how to deal with something like "word." on a page. It makes sense to match only words that are not surrounded by symbols (to avoid matching html), but "word." with its trailing period should still count as a valid result. Moreover, there are things to think about like capitalization. Personally, I'd say not to worry about it, and use strtolower(), a function that makes it all lowercase, for both the user input search string and all of the page/keyword data, however you store it.
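One way to handle both the "word." case and capitalization is to normalize the page text and the search string the same way before comparing. This is only a sketch: normalize and containsWord are made-up names, it assumes the html has already been stripped, and it only understands ASCII letters and digits:

```php
<?php
// Hypothetical helper: lowercase a string and turn punctuation into spaces,
// so "Word." on a page and "word" from the user end up identical.
function normalize(string $s): string
{
    $s = strtolower($s);
    // Anything that is not a letter, digit, or whitespace becomes a space.
    $s = preg_replace('/[^a-z0-9\s]+/', ' ', $s);
    return trim(preg_replace('/\s+/', ' ', $s));
}

// Whole-word match: pad both sides with spaces before comparing.
function containsWord(string $pageText, string $word): bool
{
    $word = normalize($word);
    if ($word === '') {
        return false;
    }
    return strpos(' ' . normalize($pageText) . ' ', ' ' . $word . ' ') !== false;
}

var_dump(containsWord('That was the last Word.', 'word')); // bool(true)
var_dump(containsWord('A password is required.', 'word')); // bool(false)
```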
wkenny
06-24-2007, 05:00 PM
Thanks for the very comprehensive reply.
Really all I need is a simple version of something like Copyscape. I want to be able to type in for example www.yoursite.com and a search string such as www.mysite.com. The prog should go to yoursite.com and report the page name containing the first reference to www.mysite.com then exit.
djr33
06-24-2007, 07:15 PM
Stopping the first time it finds the string will make things easier, yes.
So, this isn't on your server... you will need to find links and follow them.
You could just search the page itself... is that all you need? Or would you want to check the whole site/links?
I'll give this some more thought and may try writing up a script soon.
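In the meantime, here is a rough sketch of the "follow links, stop at the first hit" idea. Everything named here (resolveLink, findFirstPage, the $fetch callback, the 50-page cap) is an assumption for illustration, not a finished tool. It only follows links on the same host, ignores ports, and does not tidy up "../" paths. The sample uses a fake three-page site so it runs without a network connection:

```php
<?php
// Hypothetical helper: resolve an href against the page it was found on.
// Handles absolute, protocol-relative, root-relative, and simple relative links.
function resolveLink(string $base, string $href): ?string
{
    $href = preg_replace('/#.*$/', '', trim($href)); // drop #fragments
    if ($href === '' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $p = parse_url($base);
    if (!isset($p['scheme'], $p['host'])) {
        return null;
    }
    if (substr($href, 0, 2) === '//') {
        return $p['scheme'] . ':' . $href;
    }
    $root = $p['scheme'] . '://' . $p['host'];
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Relative link: keep the base path up to and including its last slash.
    $dir = preg_replace('#/[^/]*$#', '/', $p['path'] ?? '/');
    return $root . $dir . $href;
}

// Breadth-first crawl; returns the first URL whose source contains $needle,
// or null if nothing matched within $maxPages pages.
function findFirstPage(string $startUrl, string $needle, callable $fetch, int $maxPages = 50): ?string
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $visited = 0;
    while ($queue && $visited < $maxPages) {
        $url = array_shift($queue);
        $visited++;
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // fetch failed, skip this page
        }
        if (stripos($html, $needle) !== false) {
            return $url; // first hit: stop the crawl
        }
        preg_match_all('/<a\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']/i', $html, $m);
        foreach ($m[1] as $href) {
            $link = resolveLink($url, $href);
            if ($link !== null && parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return null;
}

// Fake site standing in for real HTTP requests.
$site = [
    'http://www.yoursite.com/' => '<a href="/about.html">About</a> <a href="links.html">Links</a>',
    'http://www.yoursite.com/about.html' => '<p>Nothing here.</p>',
    'http://www.yoursite.com/links.html' => '<a href="http://www.mysite.com/">A friend</a>',
];
$fetch = fn(string $url) => $site[$url] ?? false;
$result = findFirstPage('http://www.yoursite.com/', 'www.mysite.com', $fetch);
echo $result ?? 'string not found', "\n"; // http://www.yoursite.com/links.html
```

To try it against a live site, the $fetch callback could be swapped for something like fn($url) => @file_get_contents($url), assuming allow_url_fopen is enabled. The script can be run from the command line on any machine with PHP installed, so it does not have to sit on the server being checked.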
wkenny
06-24-2007, 07:52 PM
Not on my server, so I cannot use PHP.
On my own site I use Enarion to build a sitemap. Something like that would be ideal but instead of reporting all pages on the site, just report the page containing the search string. It is basically a link checker I need.
Most of the sites I need to examine do not have a search function. I can manually check but this is very time consuming.