Page 1 of 4 123 ... LastLast
Results 1 to 10 of 37

Thread: Theories of a search algorithm

  1. #1
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    Default Theories of a search algorithm

    I am currently working on a project where for one section of the site I will need a strong search algorithm.

    The content will be user submitted and keywords would be difficult to get/require.

    Using a custom google search is problematic for a few reasons (since this is already a complex system, and that google doesn't integrate all that well with a few things).

    That said, what I need is to figure out how to design a search system that is efficient, fast and reliable. And of course that means that it gets good results.

    At the moment I'm trying to figure out if it makes sense to even attempt this or if I need to find a way to use an existing system.

    I'm using MySQL and PHP.

    I'd like to just start a discussion of good methods and known problems to figure out how to approach it. If there are any good references out there for a summary of this situation that would help too.
    Daniel - Freelance Web Design | <?php?> | <html>| espa˝ol | Deutsch | italiano | portuguŕs | catalÓ | un peu de franšais | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  2. #2
    Join Date
    Aug 2005
    Location
    Other Side of My Monitor
    Posts
    3,486
    Thanks
    5
    Thanked 105 Times in 104 Posts
    Blog Entries
    1

    Default

    Okay, the obvious question here is what are you searching?

    So, the content to be searched is uploaded (or written, or whatever) into the site, and that stuff is stored on site and in database.

    I am guessing you then want to have a 3rd person come along months later and say "I wanna see that article BLiZZ wrote on how John is the greatest JS author ever.. ahhh a search bar" And then plug in keywords as if he was on Google and get results based on his search terms.

    I have seen this done a few times, and the answer (assuming on how the content is gathered) has always been to make the author (or uploader, or whatever) add key words. You then of course have the authors email addy, username, post date, and a list of CSV keywords in which to add to the DB table_SearchTerms.

    I must be missing something...
    {CWoT - Riddle } {OSTU - Psycho} {Invasion - Team}
    Follow Me on Twitter: @Negative_Chaos
    PHP Code:
    $result mysql_query("SELECT finger FROM hand WHERE id=3");
    echo 
    $result

  3. The Following User Says Thank You to BLiZZaRD For This Useful Post:

    djr33 (04-29-2010)

  4. #3
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    Default

    Yes, a search field like on google where you type in words.

    I've written keyword based searches before and it actually works really well, but it's a lot to maintain if there's a lot of content.

    As I said this will all be user contributed.

    It's going to be roughly a wiki. (Due to various reasons integrating existing packages is very difficult-- if this turns out to be too hard I might try going that route again, as a last resort).

    Basically I want to handle things just like google does: full text searching, but of course using a much smaller database, but still quite large.
    It's equivalent to searching dynamic drive's forums, probably, or at least once it gets going.


    Also, this will be across many translations and various complications.

    The simplest way to imagine this is that there will be a tables for large strings in the database. I want to be able to search those strings automatically. If this means some sort of pre-parsing fine; if it's in-the-moment parsing that's fine too; or perhaps a mix: caching of some sort. The problem though is that this content will always be changing. Again, it's all user contributed (and controlled).
    Daniel - Freelance Web Design | <?php?> | <html>| espa˝ol | Deutsch | italiano | portuguŕs | catalÓ | un peu de franšais | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  5. #4
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,696
    Thanks
    82
    Thanked 90 Times in 88 Posts

    Default

    Off and on I have spent a lot of time with creating php/MySQL based search programs. The preferred method is to use fulltext indexing for fast, efficient, and accurate returns on queries. I am against this for a few reasons. fulltext indexing results are based on the indexed tables, which need to be indexed at regular intervals in order to get up to date returns. Stopwords are a factor. The min word length (ft_min_word_len) is often set at 4 letters and in order to alter this from 4 to 3 you need root access to your MySQL database, which with my GoDaddy account requires the significantly more expensive Deluxe account. Terms are also included that I would rather were not. Searching for "keyboards" could return "keyboard". Searches that are found in over 50% of the results are ignored as well. As I understand it Google uses search patterns similar to this.

    All that being said I use MySQL's LIKE function.
    Code:
    SELECT * FROM table WHERE col1 LIKE '%term%'
    .

    The searches will be be in realtime and will find results even if the text to be searched was entered into a database 2 seconds ago. Remember to sanitize the searches though so as to avoid SQL injection. The searches will be slower, but for most people with hardly be noticeable. This also is not the same as fulltext indexing as listed above.

    Skipping forward several paragraphs here is an example of one of my queries:

    Code:
    SELECT ID, summary, image FROM memoblog WHERE 
    (lcase( concat(cast(ID as char), summary, IFNULL(image,''))) 
    LIKE '%yoo%' )
    Breaking this apart I am returning the ID, summary, and image from table "memoblog" if "yoo" is found anywhere in ID, summary, or image columns.

    lcase is used to make the search case insensitive.

    concat means concatenation, which means the 3 columns are searched as if they were a single unified column. This is useful when searching multiple columns with multiple terms where one term may be found in one col and another is found in another column, but all of the terms are not found in a single column together. I hope that makes sense.

    cast is used here because if you are using MySQL 5.0 and later and you search your database for a column that is text and another column that is an integer the search becomes case sensitive again. I forget why or how though. cast (col as char) will return integers as alphanumeric characters.

    IFNULL if any field is null then no results will be returned for that particular row. IFNULL will allow you to recognize NULL values as something else. In this case the NULL value is returned as empty as opposed to NULL.

    The actual code I use in my search program is certainly more dynamic and can be rather complex if needed.

    There are other parameters that need to be considered for the admin like how do you make certain results unsearchable? Sometimes you want to keep certain entries from being found via your search program. How do you make those hidden results appear depending on the user? How do you keep the program simple enough so as to be easily editable? How do you search specific columns in multiple tables? Are you using MySQL 5.x or 4.x (big difference here)?

    PHP based questions include: how will you return the results? Will you return the enitre document or just a portion of 200 words or so that contains one or more of the search terms? Will you limit the searches to 10 results per page? Pagination? Can the user search for exact phrases?

    Other questions are implied, so I will avoid mentioning them here.

    On a side note there are free search programs out there, but I had some very specific requirements, so I made my own. It was not easy and a lot of the information was not all that easy for me to find, like ft_min_word_len manipulation, but it was rewarding. It has been a while since I have worked on my search engine and so I may be a bit out of practice with this.
    Last edited by james438; 03-29-2013 at 03:10 AM. Reason: typos
    To choose the lesser of two evils is still to choose evil. My personal site

  6. The Following User Says Thank You to james438 For This Useful Post:

    djr33 (04-29-2010)

  7. #5
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    Default

    That's some great info and I'm glad to hear from someone who's tried all this out before.

    The details will be helpful, but I have a general question:
    I've considered using the LIKE operator-- that's a clear place to start.
    But is it efficient? What if there is a LOT of text? That's a lot to go through and I know that MySQL is generally efficient, but if I'm searching through a big database (again, this will be roughly the size of searching through DD's posts), will this be a problem?
    Daniel - Freelance Web Design | <?php?> | <html>| espa˝ol | Deutsch | italiano | portuguŕs | catalÓ | un peu de franšais | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  8. #6
    Join Date
    Aug 2005
    Location
    Other Side of My Monitor
    Posts
    3,486
    Thanks
    5
    Thanked 105 Times in 104 Posts
    Blog Entries
    1

    Default

    I have seen searches set up obviously using the LIKE operator. And it's a chore reading through the results. A causal user will type in things including "a, an, the, and" etc.

    Which will then return each result that matches that. Which will be 99.999999% of them, or more.

    You will have to limit it some way, maybe a character limit, or auto strip those words.
    {CWoT - Riddle } {OSTU - Psycho} {Invasion - Team}
    Follow Me on Twitter: @Negative_Chaos
    PHP Code:
    $result mysql_query("SELECT finger FROM hand WHERE id=3");
    echo 
    $result

  9. The Following User Says Thank You to BLiZZaRD For This Useful Post:

    djr33 (04-29-2010)

  10. #7
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    Default

    Pre-processing the search input is required; that's fine.

    Given good input, is the LIKE operator reliable/efficient/stable through lots of data?
    Daniel - Freelance Web Design | <?php?> | <html>| espa˝ol | Deutsch | italiano | portuguŕs | catalÓ | un peu de franšais | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  11. #8
    Join Date
    Aug 2005
    Location
    Other Side of My Monitor
    Posts
    3,486
    Thanks
    5
    Thanked 105 Times in 104 Posts
    Blog Entries
    1

    Default

    You and everyone else knows a lot more than I do about so I am trying (really) to stay out of this, but I just can't help it.

    I think the query using the like operand would be fine, speed will be more based on the server proc anyway and bandwidth allotments.

    However, for set up and use the RLIKE and NOT RLIKE may be a little better suited as there are less chances of coding failures (for example in LIKE and NOT LIKE you need to look for a character, say "x", then you run %x% but in RLIKE you just use 'x' ). It also looks for a match anywhere in the search area where LIKE only looks for an exact match.

    From what I know (and again, its not a lot) RLIKE is a off shoot of REGEXP with the added bonuses of LIKE.
    {CWoT - Riddle } {OSTU - Psycho} {Invasion - Team}
    Follow Me on Twitter: @Negative_Chaos
    PHP Code:
    $result mysql_query("SELECT finger FROM hand WHERE id=3");
    echo 
    $result

  12. The Following User Says Thank You to BLiZZaRD For This Useful Post:

    djr33 (04-29-2010)

  13. #9
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    Default

    Interesting. More to look into.

    As I said, I'm currently just thinking about this. I've got other things to do at the moment on the site and at some point I'll have to start this part.

    My main concern is that I'll spend a lot of time and build a very nice system then find that after users contribute something like 100MB of text into the database (admittedly, it'll be a long while), the search will start overloading the server or even just take 10 seconds for each query...


    I guess I could test all of this, but it is difficult to test without having a real database to play with.
    Daniel - Freelance Web Design | <?php?> | <html>| espa˝ol | Deutsch | italiano | portuguŕs | catalÓ | un peu de franšais | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  14. #10
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,696
    Thanks
    82
    Thanked 90 Times in 88 Posts

    Default

    RLIKE is synonymous with regex. The terms can be used interchangeably. I have recently discovered that MySQL is very efficient and can search through large amounts of data almost instantly. I am not sure how processor heavy RLIKE is with MySQL though. I imagine not very, but I have only tried regex with MySQL once for the novelty of it.

    fulltext indexing is designed for handling large amounts of data to help greatly speed up searches, etc. What exactly constitutes a large amount of data I am not sure. I assume the DD forums would constitute a large amount of data, but MySQL is designed to handle really large amounts of data that I suspect would dwarf even Dynamic Drive forums. I generally do not like fulltext searches due to all of the restrictions fulltext searches have. It also behaves quite differently. Really differently. It is a whole other world compared to LIKE. It can analyze the results and order them by most relevant. The searches are quite intelligent. When using this method you will be using MATCH as opposed to LIKE. It is a minor syntax thing. The real difference is in how it analyzes data.

    fulltext searches are good for DD forum sized data, but the designer should know what they are getting into. For me I will avoid it as much as possible, because I do not like coding that is both user friendly and uncustomizable. You can customize it, but there are many hoops to jump through like needing root access to alter the startup data so as to alter the source code of MySQL. You can read more about all this at: http://dev.mysql.com/doc/refman/5.0/...xt-search.html

    As far as the engine differences djr33 I wish I knew the answer.

    LIKE has been quite useful when searching through what I consider large amounts of data. It is fast, efficient, and reliable. I have worked with it extensively to be sure that the correct number of results are returned and to avoid "false positives" as well. False positives and the like were due to errors in my coding.

    If you want to ignore things like a, an, the, or, is then you are beginning to think along the lines of fulltext indexing.

    For limiting results and speed in returning results I use two methods. First, I limit results to 500 characters of sample text, second I use DD's Virtual Pagination script. If you are returning thousands of results then this may take up to a minute to return all of that data. Remember that you are dealing with text, so the results will be returned rather quickly as compared to an image. If you have 2000 results then I would recommend DD's Ajax pagination script as opposed to the Virtual Pagination script so as to save on bandwidth and load time. I plan on implementing this at some point, but have been a bit intimidated about working with Ajax and integrating it into my search program.

    If I search for "What time is the lecture?" It will search for all of the entries that contain all of the words: "what" "time" "is" "the" "lecture?" in any order. 'lecture?' with a '?' is considered a word. I could eliminate non word characters in my search, but I prefer not to, because sometimes I like to look for "lecture?" as opposed to "lecture"
    To choose the lesser of two evils is still to choose evil. My personal site

  15. The Following User Says Thank You to james438 For This Useful Post:

    djr33 (04-30-2010)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •