Thread: Software or Script to Search a Website for a Given Search String and Provide Report

  1. #1
    Join Date
    Nov 2013
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts

    I am currently looking for a script or software that will allow me to:

    1. Enter a URL like www.website.com
    2. Enter a search string like "company name"
    3. Script/Software will search all webpages and provide a report of where the search string appears: URL, Number of Times Search String Appears on Page

    Any guidance or advice with this is greatly appreciated.

    -RF
    Last edited by keyboard; 11-13-2013 at 05:46 AM. Reason: Format: Removed URL

  2. #2
    Join Date
    Mar 2011
    Location
    N 11° 19' 0.0012 E 142° 15' 0
    Posts
    1,591
    Thanks
    57
    Thanked 99 Times in 97 Posts
    Blog Entries
    4

    Just a quick clarification question:
    Do you mean you give your script a specific website to search through for the string, or it searches the entire internet for that string?

    This isn't overly simple, but it could be done. Are you looking for someone to make this for you? I can't say I've heard of a pre-built package that does this, but it's possible you could find one, of course. I'm also doubtful that you'd find someone to make it for you unless you post a paid work request.

    Are you planning on creating this yourself at all?
    If so (and if I'm interpreting your needs correctly), you'd need to:
    1. Fetch and parse the page at the URL in question.
    2. Search it for the piece of text.
    3. Crawl the page for links to other pages on the same domain.
    4. Go to each of those pages and repeat.
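
    The steps above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not a finished tool: the names `PageParser`, `fetch_html`, `crawl_and_count` and the `max_pages` limit are made up for this example, matching is assumed to be case-insensitive, and the sketch ignores robots.txt, rate limiting and pages built by JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def fetch_html(url):
    """Default fetcher (hypothetical helper): plain HTTP download, no retries."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_and_count(start_url, needle, fetch=fetch_html, max_pages=100):
    """Return {url: count} for pages on start_url's host that contain needle."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    report = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        fetched += 1
        parser = PageParser()
        parser.feed(html)
        # Join text chunks with spaces, collapse whitespace, compare lowercased.
        text = " ".join(" ".join(parser.text_parts).split()).lower()
        count = text.count(needle.lower())
        if count:
            report[url] = count
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return report
```

    Calling `crawl_and_count("http://www.website.com/", "company name")` would return a dictionary mapping each matching URL to its hit count. The `fetch` parameter is there so the crawl logic can be tried against a dictionary of test pages without touching the network.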


    The more information you can provide us, the easier it will be to work out a solution for you!
    Posting Tips + FAQ | Issues? Feel free to PM me
    - keyboard1333[at]gmail[dot]com

  3. #3
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Why?

    That's really an important question here. I don't understand the goal enough to determine the best way to do it.

    As I understand it, you basically intend to build a search engine. (Another way of phrasing that is that Google would have no trouble doing this-- it would be easy, within a real search engine.)

    This is not simple, as keyboard said. So you'd want to look into tutorials for building a search engine. But I don't know that you really want to do that.

    This would be simpler if it's only on your own website (if this is a website-internal search). If you want it to search the web, it will, effectively, really mean building a search engine. The basic idea is as follows:
    1. Browse the whole web (or whatever you care about) and store a copy in a local database (removing extra irrelevant data like HTML code).
    2. When a search is performed, it is performed on that stored copy.
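
    As a rough illustration of the two steps above, here is a small sketch that keeps the stored copy in SQLite and answers searches from it. It assumes the page text has already been fetched and stripped of HTML; the table layout and the function names `open_store`, `store_page` and `search` are invented for this example, and matching is assumed to be case-insensitive.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) the local copy: one row per crawled page."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return db


def store_page(db, url, text):
    """Save the tag-free text of a page, replacing any older copy."""
    normalized = " ".join(text.split()).lower()  # collapse whitespace, lowercase
    db.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, normalized)
    )
    db.commit()


def search(db, needle):
    """Return [(url, count)] for stored pages containing needle, most hits first."""
    needle = " ".join(needle.split()).lower()
    if not needle:
        return []
    # Let SQLite discard non-matching pages, then count hits in Python.
    rows = db.execute(
        "SELECT url, body FROM pages WHERE instr(body, ?) > 0", (needle,)
    ).fetchall()
    hits = [(url, body.count(needle)) for url, body in rows]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

    The crawl only has to happen when the site changes; every later search reads from the database file instead of fetching pages again.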


    Technically, you could try to do this in real time using a server-side language like PHP, but it would be very slow and impractical, and it would not work very effectively or efficiently. Basically you could follow every link on that page (and the links on those pages, etc.) and search the text of each page [parsed to remove HTML?] to see how many times the search term comes up, then create a list. In theory, that's fine. In practice, that's a terrible idea-- it'll be very slow, use a lot of server resources, and not actually be a useful service.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  4. #4
    Join Date
    Mar 2011
    Location
    N 11° 19' 0.0012 E 142° 15' 0
    Posts
    1,591
    Thanks
    57
    Thanked 99 Times in 97 Posts
    Blog Entries
    4

    Quote (post #3):
        This is not simple, as keyboard said. So you'd want to look into tutorials for building a search engine. But I don't know that you really want to do that.

        This would be simpler if it's only on your own website (if this is a website-internal search). If you want it to search the web, it will, effectively, really mean building a search engine. The basic idea is as follows:
        1. Browse the whole web (or whatever you care about) and store a copy in a local database (removing extra irrelevant data like HTML code).
        2. When a search is performed, it is performed on that stored copy.

    I was working under the assumption that they were just talking about a single website, not the entire internet. We'll have to wait for the OP on that one though.

    Quote (post #3):
        Technically, you could try to do this in real time using a serverside language like PHP, but it would be very slow, impractical and not work very effectively/efficiently. Basically you could follow every link on that page (and the links on those pages, etc.) and search the text of each page [parsed to remove HTML?] to see how many times the search term comes up, then create a list. In theory, that's fine. In practice, that's a terrible idea-- it'll be very slow, use a lot of server resources, and not actually be a useful service.

    I'm trying to think of a more efficient way and failing miserably. Though I guess you would want to do it on a stored copy. How would you do this, Daniel?
    Posting Tips + FAQ | Issues? Feel free to PM me
    - keyboard1333[at]gmail[dot]com

  5. #5
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Quote (post #4):
        I was working under the assumption that they were just talking about a single website, not the entire internet. We'll have to wait for the OP on that one though.

    I'd hope so, because that would be easier. But the example was of a domain-level query.

    Quote (post #4):
        I'm trying to think of a more efficient way and failing miserably. Though I guess you would want to do it on a stored copy. How would you do this Daniel?

    The only practical answer is caching. That's why search engines use spiders to gather a database, parse that data once, then search it efficiently later. Caching can take many forms, but it's almost always better than working from raw input. Sometimes it can go as far as n-grams or pre-searched text. An n-gram is a sequence of N consecutive items (say, a 3-gram of 3 words) stored with data such as how often it occurs. So we might know that "on this website" appears on each website a certain number of times without searching at all-- it's all pre-searched. But that's a huge amount of pre-organizational work. It's worth it only if you're likely to actually receive those search queries multiple times.
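
    To make the n-gram idea concrete, here is a small sketch that counts every 3-word sequence once, up front, so that a later query for a 3-word phrase is a dictionary lookup, not a scan. The function names `word_ngrams`, `build_ngram_index` and `lookup` are invented for this example, words are assumed to be runs of letters and digits, and the input is assumed to be page text already stripped of HTML.

```python
import re
from collections import Counter, defaultdict


def word_ngrams(text, n=3):
    """Yield every run of n consecutive words in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def build_ngram_index(pages, n=3):
    """Map each n-gram to a Counter of {url: occurrences}, computed once."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for gram in word_ngrams(text, n):
            index[gram][url] += 1
    return index


def lookup(index, phrase):
    """Answer a query for an exactly-n-word phrase without rescanning any page."""
    key = " ".join(re.findall(r"\w+", phrase.lower()))
    return dict(index.get(key, {}))
```

    The trade-off mentioned above shows up directly: the index holds an entry for nearly every word position on every page, so it is much larger than the text itself and only pays off when the same kinds of queries come in repeatedly.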
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  6. #6
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    It's simply not something web languages are suited for. If anyone is interested, Lucene + Solr is a good (and fairly standard) solution.
    (I've looked at it, but never used it in a project.)

  7. The Following 2 Users Say Thank You to traq For This Useful Post:

    djr33 (11-14-2013),keyboard (11-14-2013)
