View Full Version : Software or Script to Search a Website for a Given Search String and Provide Report
fergusoncg
11-12-2013, 05:14 PM
I am currently looking for a script or software that will allow me to:
1. Enter a URL like www.website.com
2. Enter a search string like "company name"
3. Script/Software will search all webpages and provide a report where the search string appears: URL, Number Times Search String Appears on Page
Any guidance or advice with this is greatly appreciated.
-RF
keyboard
11-13-2013, 05:45 AM
Just a quick clarification question:
Do you mean you give your script a specific website to search through for the string, or it searches the entire internet for that string?
This isn't overly simple, but could be done. Are you looking for someone to make this for you? I can't say I've heard of a pre-built package that does this for you, but it's possible you could find one of course. I'm also doubtful that you'd find someone to make it for you, unless you post a paid work request.
Are you planning on creating this yourself at all?
If so, you'd need to (if I'm interpreting your needs correctly):
Parse the url in question.
Search for the piece of text.
Crawl the website for links to other pages on the same domain.
Go to each page and repeat.
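The steps above could be sketched roughly like this. This is Python rather than anything the thread settles on, and all the names (PageParser, crawl_and_count, the injected fetch function) are hypothetical; a real version would also need politeness delays, robots.txt handling, and a depth limit.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PageParser(HTMLParser):
    """Collects visible text and same-domain links from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url = urljoin(self.base_url, value)
                    # Only follow links on the same domain (step 3 above).
                    if urlparse(url).netloc == urlparse(self.base_url).netloc:
                        self.links.append(url.split("#")[0])

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl_and_count(start_url, term, fetch):
    """fetch(url) -> HTML string. Returns {url: times term appears on that page}."""
    seen, queue, report = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page; skip it
        parser = PageParser(url)
        parser.feed(html)
        text = " ".join(parser.text_parts)
        count = len(re.findall(re.escape(term), text, re.IGNORECASE))
        if count:
            report[url] = count
        queue.extend(parser.links)  # step 4: go to each page and repeat
    return report
```

The fetch function is injected so you can plug in a real downloader (e.g. urllib.request) or a stub for testing; on a live site you would pass something like `lambda u: urllib.request.urlopen(u).read().decode()`.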
The more information you can provide us, the easier it will be to work out a solution for you!
djr33
11-13-2013, 06:07 AM
Why?
That's really an important question here. I don't understand the goal enough to determine the best way to do it.
As I understand it, you basically intend to build a search engine. (Another way of phrasing that is that Google would have no trouble doing this-- it would be easy, within a real search engine.)
This is not simple, as keyboard said. So you'd want to look into tutorials for building a search engine. But I don't know that you really want to do that.
This would be simpler if it's only on your own website (if this is a website-internal search). If you want it to search the web, it will effectively mean building a search engine. The basic idea is as follows:
1. Browse the whole web (or whatever you care about) and store a copy in a local database (removing extra irrelevant data like HTML code).
2. When a search is performed, it is performed on that stored copy.
Technically, you could try to do this in real time using a serverside language like PHP, but it would be very slow, impractical, and not work very efficiently. Basically you could follow every link on that page (and the links on those pages, etc.) and search the text of each page [parsed to remove HTML?] to see how many times the search term comes up, then create a list. In theory, that's fine. In practice, that's a terrible idea-- it'll be very slow, use a lot of server resources, and not actually be a useful service.
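The two-step idea (store a stripped copy first, then run searches against the copy) could be sketched like this. Everything here is a toy stand-in: a dict plays the role of the local database, and strip_html is a crude regex stripper, not a real HTML parser.

```python
import re

# Step 1: store a stripped copy of each page in a local "database" (a dict here).
def strip_html(html):
    """Crudely remove tags so searches run on visible text only."""
    return re.sub(r"<[^>]+>", " ", html)

def build_index(pages):
    """pages: {url: raw HTML}. Returns {url: plain text} -- the stored copy."""
    return {url: strip_html(html) for url, html in pages.items()}

# Step 2: every search hits the stored copy, never the live site.
def search(index, term):
    """Returns [(url, count), ...] sorted by count, most matches first."""
    results = {}
    for url, text in index.items():
        n = len(re.findall(re.escape(term), text, re.IGNORECASE))
        if n:
            results[url] = n
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```

The point of the split is that the expensive work (fetching and parsing) happens once per page, while each query only scans text you already hold locally.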
keyboard
11-13-2013, 10:38 AM
This is not simple, as keyboard said. So you'd want to look into tutorials for building a search engine. But I don't know that you really want to do that.
This would be simpler if it's only on your own website (if this is a website-internal search). If you want it to search the web, it will effectively mean building a search engine. The basic idea is as follows:
1. Browse the whole web (or whatever you care about) and store a copy in a local database (removing extra irrelevant data like HTML code).
2. When a search is performed, it is performed on that stored copy.
I was working under the assumption that they were just talking about a single website, not the entire internet. We'll have to wait for the OP on that one though.
Technically, you could try to do this in real time using a serverside language like PHP, but it would be very slow, impractical and not work very effectively/efficiently. Basically you could follow every link on that page (and the links on those pages, etc.) and search the text of each page [parsed to remove HTML?] to see how many times the search term comes up, then create a list. In theory, that's fine. In practice, that's a terrible idea-- it'll be very slow, use a lot of server resources, and not actually be a useful service.
I'm trying to think of a more efficient way and failing miserably. Though I guess you would want to do it on a stored copy. How would you do this, Daniel?
djr33
11-13-2013, 02:41 PM
I was working under the assumption that they were just talking about a single website, not the entire internet. We'll have to wait for the OP on that one though.
I'd hope so, because that would be easier. But the example was of a domain-level query.
I'm trying to think of a more efficient way and failing miserably. Though I guess you would want to do it on a stored copy. How would you do this, Daniel?
The only practical answer is caching. That's why search engines use spiders to gather a database, parse that data once, then search it efficiently later. Caching can take many forms, but it's almost always better than raw input. Sometimes it can go as far as n-grams or pre-searched text. An n-gram is a sequence of N items (say, a 3-gram-- 3 words) with associated data. So we might know that "on this website" appears in every website a certain number of times without searching at all-- it's all pre-searched. But that's a huge amount of pre-organizational work. It's worth it if you're likely to actually receive all of those search queries multiple times.
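The n-gram idea above can be made concrete with a small sketch: precompute, for every 3-word phrase on every page, how often it appears, so an exact-phrase query becomes a dictionary lookup instead of a scan. The function names here are made up for illustration.

```python
from collections import Counter, defaultdict

def ngrams(words, n=3):
    """Yield every run of n consecutive words as a tuple."""
    return zip(*(words[i:] for i in range(n)))

def precompute(pages, n=3):
    """pages: {url: plain text}. Returns {ngram: {url: count}} -- the pre-searched table."""
    table = defaultdict(Counter)
    for url, text in pages.items():
        for gram in ngrams(text.lower().split(), n):
            table[gram][url] += 1
    return table

def lookup(table, phrase):
    # An exact 3-word phrase query is now just a dict lookup, no page scanning.
    return dict(table.get(tuple(phrase.lower().split()), {}))
```

This shows both halves of the trade-off djr33 describes: the table costs memory and up-front work proportional to the whole corpus, but each query it can answer is then near-free.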
It's simply not something web languages are suited for. If anyone is interested, Lucene + Solr (http://lucene.apache.org/) is a good (and fairly standard) solution.
(I've looked at it, but never used it in a project.)
Powered by vBulletin® Version 4.2.2 Copyright © 2021 vBulletin Solutions, Inc. All rights reserved.