Page 3 of 4 FirstFirst 1234 LastLast
Results 21 to 30 of 37

Thread: Theories of a search algorithm

  1. #21
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Thank you for helping me confirm this. After some testing, it appears that "NOT" can precede a parenthetical block:
    "WHERE NOT (statement)" is completely valid.

    Now to put all of this together and I'll have a very helpful function.

    Thanks!


    And yes, you do need the "AND" and a full statement in the parentheses, but that's quite possible to setup.


    For you information, the query I used to test this (that worked) was:
    WHERE `field` LIKE '%a%' AND NOT (`field` LIKE '%b%')

    Based on that, I'm working on the assumption that truth values for the parenthetical are reversed so this should be properly recursive, etc.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  2. #22
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Not sure if I should start a new thread for this or not, but I may as well post here.

    I've finished my MySQL search-query generator.

    Input: the search terms in a text field including operators:
    "AND" is implied by just adding more words:
    a b will search for anything containing both a AND b.

    OR allows for one or the other (or both):
    a OR b searches for anything containing a and/or b.
    OR is case sensitive. Lowercase 'or' will act as a content-word and it will search for articles containing "or".

    quotation marks ("") make the text inside literal: syntax operators are ignored inside and spaces are allowed; the exact contents must be in the search:
    "a b" matches exactly a-space-b.

    () Parentheses can group sections:
    a OR (b c) searches for something containing a OR (both b and c).

    - negation is marked by a minus sign immediately before something:
    -a searches for anything that does not have a.
    -(a OR b) searches for anything that does not have either a or b.


    All of this can be layered into very complex setups and combinations of the logical operators:
    a -(b (c OR d))
    This generates a search for something that:
    1) contains a
    2) does not: contain b AND (c OR d)

    As an example, the above syntax generates:
    `field` LIKE '%a%' AND NOT (`field` LIKE '%b%' AND (`field` LIKE '%c%' OR `field` LIKE '%d%'))


    This function is multibyte safe after a lot of reworking. For the encoding of Japanese characters for example, 3 bytes are used for one character. The normal string functions ignore this. So I've fixed it up and it works. You can have any search term you'd like EXCEPT the logical operators above-- if you need those, put them in quotes. (I don't have a way to search for quotation marks-- " "--- themselves yet, but I don't really care that much).


    The result of this is as you can see a CONDITION for MySQL WHERE statements.

    So just use the syntax:
    WHERE $condition
    You can of course combine it with other things as well:
    WHERE `a` LIKE '%1%' AND $condition


    The value $field in the function is very important: this will generate the MySQL table's column name that you are matching for each of the LIKE statements. Due to MySQL's structure, this must be done literally as far as I know.


    Here's the code. It's a lot, but commented.

    I've minimally tested it and as far as I can tell it's working well.

    Note that if you give it a particularly weird input it'll return FALSE. Blank queries, mismatched quotes/parentheses and other odd things do this.

    I'm *hopeful* that no invalid queries will be generated by disallowing certain things.
    PHP Code:
    <?php

    //supporting function, splits a string into an array of [multibyte] characters
    function mbstring2chararray($s,$maxcharlen=4) {
        if (
    $maxcharlen!=4) {
            echo 
    'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
            
    //this should be later fixed, but the syntax is hard...
        
    }
        
    $o = array();
        for(
    $x=0;$x<strlen($s);$x++) {
            
    $char $s[$x];

            
    $skipx 0//assume skipping no characters
            //now go through each byte number:
            //if the sequence of bytes is one character, then group them:
            //a 2byte char
            
    if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
                
    $char $s[$x].$s[$x+1]; $skipx 1;
            }
            
    //a 3byte char
            
    if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2]; $skipx 2;
            }
            
    //a 4byte char
            
    if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx 3;
            }
            
    ///will not currently do more than 4byte chars!
            
            //how many following letters [and loop cycles] do we skip?
            
    $x $x+$skipx;
            
            
    $o[] = $char//store it
        
    }
        return 
    $o//return final array
    }


    function 
    search2mysql($q,$field) { //transform a user search to a mysql code 'WHERE' string
        
    if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
        
    $q mbstring2chararray($q); //split query into [multibyte] character array
        
    $qwords = array(); //setup an array to store the content 'words'
        
    $ac=0//setup array counter for search content array
        
    $inquote 0//setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
        
    $plevel 0//setup counter for levels of parenthetical embedding
        
    $qstring ''//setup a blank term for the query syntax, later to get content too
        
    $newword 1//status for whether we're at a word boundary
        
    $negwords = array(); //setup blank array for tracking negative terms ("NOT")
        
    $qlen count($q); //how many characters?
        
    for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
            //open quote: not in quotes and at word boundary
            
    if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
                
    $inquote=1;
            }
            
    //end quote: found while at the end of a word and while in quotes
            
    elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
                
    $inquote=0;
            }
            
    //start parentheses: at new word and not in quotes
            
    elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
                
    $qstring .= '#('//add the parenthesis and marker [for later]
                
    $plevel++;
                if (
    $plevel<0) { //error: can't have close before open
                    
    return FALSE;
                }
            }
            
    //end parentheses: at end of word and not in quotes
            
    elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
                
    $qstring .= ')#'//add the parenthesis and marker [for later]
                
    $plevel--;
            }
            
    //a space is a break unless in a quote:
            
    elseif ($q[$l]==' '&&$inquote==0) {
                
    $newword=1;
                
    $ac++;
            }
            
    //do OR which must be both the beginning and end of a word:
            
    elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
                
    $l++; //skip past the R
                
    $l++; //skip past extra space to avoid issues
                
    $qstring .= ' OR ';
            }
            
    //deal with - marker for "NOT" terms
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $negwords[] = $ac;
            }
            
    //deal with - marker for "NOT" parentheticals
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $qstring .= '#NOT ';
            }
            else { 
    //not syntactic, must be content
                
    if ($newword==1) { //begin a new word
                    
    $newword=0//no longer a new word
                    
    $qwords[$ac] = ''//setup blank string
                    
    $qstring .= '#*'.$ac.'*#'//add the reference + markers [for later] 
                
    }
                
    $qwords[$ac] .= $q[$l]; //add the letter
            
    }
        }
        
    //errors:
        
    if ($inquote==1) { //uneven quotes
            
    return FALSE;
        }
        if (
    $plevel!=0) { //uneven parentheses
            
    return FALSE;
        }
        if (
    in_array('',$qwords)) { //a blank word was found: input is badly formatted
            
    return FALSE;
        }
        if (
    count($qwords)==0) { //no blank searches...
            
    return FALSE;
        }
        if (
    strpos($qstring,'###')!==FALSE) { //weird error with markers
            
    return FALSE;
        }
        
    //end errors

        
    $qstring str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
        
    $qstring str_replace('#','',$qstring); //no need for floating markers now

        //escape data: this is confusing
        
    foreach($qwords as $qwkey=>$qword) { //go through the words
            
    $qword mysql_real_escape_string($qword); //make safe!
            
    $qword str_replace('%','\\%',$qword); //escape % so it's not a wildcard
            
    $qword str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
            
    $not in_array($qwkey,$negwords)?'NOT ':''//is it negated?
            
    $qwords[$qwkey] = '`'.$field.'` '.$not.'LIKE \'%'.$qword.'%\''//format as a mysql query condition
        
    }

        
    //now that the terms and query are setup separately, time to put them together
        
    $invar=0//tracker value
        
    $q ''//reset as blank string for output
        
    for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
            
    if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
                
    $invar=1//set as compiling key
                
    $qwkey ''//setup blank string as key
            
    }
            elseif (
    $qstring[$x]=='*'&&$invar==1) { //key end reached
                
    $q .= $qwords[$qwkey]; //add the word to the query
                
    $invar=0//reset
            
    }
            elseif(
    $invar==1) { //continue compiling replace key
                
    $qwkey .= $qstring[$x];
            }
            else { 
    //it's just syntax
                
    $q .= $qstring[$x];
            }
        }

        return 
    $q//return the final CONDITION(s) for WHERE; use as required
    }

    ?>
    Note: the one thing I haven't dealt with yet is making the search case-insensitive: the problem is that quoted data should be case sensitive (no?) but other data should be case-insensitive. So because of this it's not very easy to determine what to do since that may be higher-level stuff than just for the condition following WHERE. I'll look into it. Obviously for now you can just make the whole query case-insensitive and be done with it, but I'm not sure how to make the quoted material stay case sensitive then.


    A couple other things to consider for the future:
    1. Wildcards aren't handled. You could either allow direct input of "%" and "_" or you could make up your own symbols and use these. It would also involve messing with the words rather than the query itself and that's not really covered yet. Easy enough to add, but I don't plan to do that now.
    2. Word boundaries are ignored-- searching for "good" will find "goodbye". Making it word-boundary only would require that the MySQL is a lot more complex than it is-- this is about the STORED data not the new user input. This is not on my list of things to do now.
    3. The results are not ranked in any sensible way. You can of course add an ORDER BY component, perhaps "ORDER BY `popularity`" and that would be a good start. But for now this isn't about "good" or "bad" matches, just "match" TRUE/FALSE and nothing more. Likewise it doesn't handle "partial" matches.
    4. There is not yet any function where certain words [ex. 'the'] are removed. It's a lot easier to *pre-process* the string to remove words like that.

    This is all solved by careful searching by the user, though.
    Last edited by djr33; 05-06-2010 at 04:34 AM.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  3. #23
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Thanks to all of you for the input. I've reworked this a bit more and now I'm happy with what I have:

    PHP Code:
    <?php

    function mbstring2chararray($s,$maxcharlen=4) {
        if (
    $maxcharlen!=4) {
            echo 
    'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
            
    //this should be later fixed, but the syntax is hard...
        
    }
        
    $o = array();
        for(
    $x=0;$x<strlen($s);$x++) {
            
    $char $s[$x];

            
    $skipx 0//assume skipping no characters
            //now go through each byte number:
            //if the sequence of bytes is one character, then group them:
            //a 2byte char
            
    if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
                
    $char $s[$x].$s[$x+1]; $skipx 1;
            }
            
    //a 3byte char
            
    if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2]; $skipx 2;
            }
            
    //a 4byte char
            
    if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx 3;
            }
            
    ///will not currently do more than 4byte chars!
            
            //how many following letters [and loop cycles] do we skip?
            
    $x $x+$skipx;
            
            
    $o[] = $char//store it
        
    }
        return 
    $o//return final array
    }

    //transform a user search to a mysql code 'WHERE' string
    function search2mysql($q,$column,$maxterms=10,$ifnull=1) {
        
    //$q string to convert;
        //$column [string or string array to concatenate] to be searched;
        //$maxterms limits search length to prevent overloading the server; 0 is unlimited;
        //$ifnull 0 or 1: compensate for possible null values? default is YES
        
    if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
        
    $q mbstring2chararray($q); //split query into [multibyte] character array
        
    $qwords = array(); //setup an array to store the content 'words'
        
    $ac=0//setup array counter for search content array
        
    $inquote 0//setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
        
    $plevel 0//setup counter for levels of parenthetical embedding
        
    $qstring ''//setup a blank term for the query syntax, later to get content too
        
    $newword 1//status for whether we're at a word boundary
        
    $negwords = array(); //setup blank array for tracking negative terms ("NOT")
        
    $cswords = array(); //setup blank array for tracking case-sensitive terms [within quotes]
        
    $qlen count($q); //how many characters?
        
    for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
            //open quote: not in quotes and at word boundary
            
    if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
                
    $inquote=1;
            }
            
    //end quote: found while at the end of a word and while in quotes
            
    elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
                
    $inquote=0;
                
    $cswords[] = $ac//add this to a list of words to keep case sensitive
            
    }
            
    //start parentheses: at new word and not in quotes
            
    elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
                
    $qstring .= '#('//add the parenthesis and marker [for later]
                
    $plevel++;
                if (
    $plevel<0) { //error: can't have close before open
                    
    return FALSE;
                }
            }
            
    //end parentheses: at end of word and not in quotes
            
    elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
                
    $qstring .= ')#'//add the parenthesis and marker [for later]
                
    $plevel--;
            }
            
    //a space is a break unless in a quote:
            
    elseif ($q[$l]==' '&&$inquote==0) {
                
    $newword=1;
                
    $ac++;
            }
            
    //do OR which must be both the beginning and end of a word:
            
    elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
                
    $l++; //skip past the R
                
    $l++; //skip past extra space to avoid issues
                
    $qstring .= ' OR ';
            }
            
    //deal with - marker for "NOT" terms
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $negwords[] = $ac;
            }
            
    //deal with - marker for "NOT" parentheticals
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $qstring .= '#NOT ';
            }
            else { 
    //not syntactic, must be content
                
    if ($newword==1) { //begin a new word
                    
    $newword=0//no longer a new word
                    
    $qwords[$ac] = ''//setup blank string
                    
    $qstring .= '#*'.$ac.'*#'//add the reference + markers [for later] 
                
    }
                
    $qwords[$ac] .= $q[$l]; //add the letter
            
    }
        }
        
    //errors:
        
    if ($inquote==1) { //uneven quotes
            
    return FALSE;
        }
        if (
    $plevel!=0) { //uneven parentheses
            
    return FALSE;
        }
        if (
    in_array('',$qwords)) { //a blank word was found: input is badly formatted
            
    return FALSE;
        }
        if (
    count($qwords)==0) { //no blank searches...
            
    return FALSE;
        }
        if (
    $maxterms!=0&&count($qwords)>$maxterms) { //over max length: prevent overload
            
    return FALSE;
        }
        if (
    strpos($qstring,'###')!==FALSE) { //weird error with markers
            
    return FALSE;
        }
        
    //end errors

        
    $qstring str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
        
    $qstring str_replace('#','',$qstring); //no need for floating markers now

        //format $column including concatenation as needed
        
    if (!is_array($column)) { //must switch string to array
            
    $skipconcat=1//skip concatenation later
            
    $column = array($column); //convert to array
        
    }
        foreach(
    $column as $colk=>$col) { //format each column individually
            
    $col '`'.$col.'`'//format as column
            
    if ($ifnull==1) {
                
    $col 'IFNULL('.$col.',\'\')'//avoid null errors
            
    }
            
    $column[$colk] = $col//store
        
    }
        
    $column implode(', ',$column); //implode them into a list
        
    if (!isset($skipconcat)) { //concatenate them:
            
    $column 'concat('.$column.')';
        }
        
        
    //escape data, then format as a condition
        
    foreach($qwords as $qwkey=>$qword) { //go through the words
            //escaping:
            
    $qword mysql_real_escape_string($qword); //make safe!
            
    $qword str_replace('%','\\%',$qword); //escape % so it's not a wildcard
            
    $qword str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
            //format as condition:
            
    $not in_array($qwkey,$negwords)?' NOT ':' '//is it negated?
            //handle case sensitivity: lcase matching normal; in quotes is case sensitive
            
    $lcase ''//set to blank by default
            
    if (!in_array($qwkey,$cswords)) { //this is a non-case-sensitive word [otherwise do nothing special]
                
    $lcase 'lcase('//set col prefix
                
    $not ')'.$not//add col end to trailing $not var
                
    $qword mb_strtolower($qword); //make string lowercase
            
    }
            
    $qwords[$qwkey] = $lcase.$column.$not.'LIKE \'%'.$qword.'%\''//format as a mysql query condition
        
    }

        
    //now that the terms and query are setup separately, time to put them together
        
    $invar=0//tracker value
        
    $q ''//reset as blank string for output
        
    for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
            
    if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
                
    $invar=1//set as compiling key
                
    $qwkey ''//setup blank string as key
            
    }
            elseif (
    $qstring[$x]=='*'&&$invar==1) { //key end reached
                
    $q .= $qwords[$qwkey]; //add the word to the query
                
    $invar=0//reset
            
    }
            elseif(
    $invar==1) { //continue compiling replace key
                
    $qwkey .= $qstring[$x];
            }
            else { 
    //it's just syntax
                
    $q .= $qstring[$x];
            }
        }

        return 
    $q//return the final CONDITION(s) for WHERE; use as required
    }

    ?>

    Here are the improvements from the last post (and all of the rest still applies):
    1. Search is caseless except in quoted input.

    2. I have also added a "$maxterms" parameter.

    This prevents huge queries that overload the server. It's a number or 0 to turn this off and allow infinite terms, but I don't recommend that.

    3. $column (formerly 'field') is now able to take both array and string arguments.
    If it is a string then a single field is searched; if it is an array then these fields are concatenated and all are searched.

    Also, I made an optional "$isnull" parameter in the function that checks to see if you want to convert NULL fields to an empty string. You can turn this off and I bet it'll run faster (that's why I made it optional) but the default is that it's on to limit any possible errors.

    Thanks a lot to James for showing me the mysql syntax for all of this.



    Note: this is still untested for a actual MySQL query usage-- I've just been designing it to get the correct condition for a where statement generated.
    I'll be testing it soon (probably later tonight).



    I'm still a bit wary about all of this but overall I'm very happy with it. If you guys have any ideas, then let me know.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  4. #24
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    It works!

    I'm excited about this. I've been testing this on an active database and as far as I can tell it works really well. The query is generated in what looks (on casual inspection) correct, and the results seem reasonable. I obviously haven't manually cofirmed that all of the results are matches, but I see no reason they are not. The query is well formed and it works while searching.


    The time required in searching is not wonderful, but it's also not too worrying.
    For the average query it's about .5 seconds on my server.
    It is a bit faster if you just have one term. But if you have more than that it only gets slightly slower. Even at several layers and about 5 terms it's still around .5 seconds (hasn't yet been over 1 sec).


    I think I've actually created a workable search tool here

    Now on to the rest of the wiki, but after this I'm optimistic.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  5. #25
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    If you don't mind, I'd love to experiment with testing this out. I'll share my results, of course.

    About whole words vs. partial words: could you modify this so that if a single word is "quoted", the query condition will by '% word %' (note the spaces) instead of '%word%'?

    Hmm... actually, that would be complicated. It would only work for words in the middle of a sentence (not touching punctuation, for example). Even the last word in a record might not be matched if it were not followed by a space...

    Maybe it would be best served in post-processing (running a regex on the result set, for example, to only include the results that are not touching other letters -and therefore, part of a larger word...).

    Quote Originally Posted by djr33 View Post
    All of this can be layered into very complex setups and combinations of the logical operators:
    a -(b (c OR d))
    This generates a search for something that:
    1) contains a
    2) does not: contain a AND (c OR d)
    What goes on with "b" in the example above?

  6. #26
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Ah, oops. Too much typing. I meant "b" in that second line. It's fixed in my post above and here it is again:
    a -(b (c OR d))
    This generates a search for something that:
    1) contains a
    2) does not: contain b AND (c OR d)
    Or, a single line:
    a AND NOT (b AND (c OR d))

    It can be layered as much as you'd like. This is a "simple" query compared to the theoretical bounds of what you can enter. The only limit is the $maxterms parameter to save the server from overload. Turn that off if you'd like.

    Yes, of course, feel free to test. I am going to paste a working self-contained test page below here so that you can search. The only problem is that you must configure a database and have enough data to make this be a realistic test. Can't help you there, though.


    I have found one problem: while the results are cool and clearly accurate to some degree, something is wrong with 'NOT' as I feared it might be before.
    The result set of -a -b does not match the result set of -(a b).
    These generate the queries of:
    Code:
    -a -b
    lcase(`body`) NOT LIKE '%a%' AND lcase(`body`) NOT LIKE '%b%'
    -(a b)
    NOT (lcase(`body`) LIKE '%a%' AND lcase(`body`) LIKE '%b%')
    I also tried adding parentheses around each condition like (`col` LIKE 'X') but that didn't fix anything.

    I'm going to keep testing to try to figure out exactly how this works then see if it's possible to fix...



    Partial words are very complicated. This would need to be handled by parsing the database text itself. You could also include 3+ queries for each (word between spaces, before punctuation, after punctuation), but this would get messy and I didn't want to deal with it. Additionally, this is even more complex because in quotes you can specific that aspect as well.
    For my purposes I also require that this works across all languages so I didn't want to add too many unnecessary complications: the specific operators are sensitive and anything else is a "content" character.
    Of course if you do want to search for just "good" and not "goodbye" you can search for "good " manually.
    The odds of lots of false positives are low especially if you give a few terms.
    If there's a way to work this into the system that might help, but it's beyond this particular aspect I think. Parsing them out after this is complete might be one way to do it, though that would be really tough because you'd no longer have access to the terms from within the string parser fucntion's scope.


    EDIT: The case-insensitivity may not be working. More info soon.
    Last edited by djr33; 05-06-2010 at 06:27 AM.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  7. #27
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Here's the standalone page as promised:
    PHP Code:
    <?php


    $t1 
    microtime(1); //start counting...

    //connect to DB
    mysql_connect('HOST','USER','PASS') or die ('db conn problem'); /////EDIT THIS
    mysql_select_db('DBNAME'); /////EDIT THIS



    //MultiByte String setup
    //setup php for working with Unicode data
    mb_internal_encoding('UTF-8');
    mb_http_output('UTF-8');
    mb_http_input('UTF-8');
    mb_language('uni');
    mb_regex_encoding('UTF-8');
    ob_start('mb_output_handler');


    function 
    mbstring2chararray($s,$maxcharlen=4) {
        if (
    $maxcharlen!=4) {
            echo 
    'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
            
    //this should be later fixed, but the syntax is hard...
        
    }
        
    $o = array();
        for(
    $x=0;$x<strlen($s);$x++) {
            
    $char $s[$x];

            
    $skipx 0//assume skipping no characters
            //now go through each byte number:
            //if the sequence of bytes is one character, then group them:
            //a 2byte char
            
    if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
                
    $char $s[$x].$s[$x+1]; $skipx 1;
            }
            
    //a 3byte char
            
    if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2]; $skipx 2;
            }
            
    //a 4byte char
            
    if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx 3;
            }
            
    ///will not currently do more than 4byte chars!
            
            //how many following letters [and loop cycles] do we skip?
            
    $x $x+$skipx;
            
            
    $o[] = $char//store it
        
    }
        return 
    $o//return final array
    }

    //transform a user search to a mysql code 'WHERE' string
    function search2mysql($q,$column,$maxterms=10,$ifnull=1) {
        
    //$q string to convert;
        //$column [string or string array to concatenate] to be searched;
        //$maxterms limits search length to prevent overloading the server; 0 is unlimited;
        //$ifnull 0 or 1: compensate for possible null values? default is YES
        
    if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
        
    $q mbstring2chararray($q); //split query into [multibyte] character array
        
    $qwords = array(); //setup an array to store the content 'words'
        
    $ac=0//setup array counter for search content array
        
    $inquote 0//setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
        
    $plevel 0//setup counter for levels of parenthetical embedding
        
    $qstring ''//setup a blank term for the query syntax, later to get content too
        
    $newword 1//status for whether we're at a word boundary
        
    $negwords = array(); //setup blank array for tracking negative terms ("NOT")
        
    $cswords = array(); //setup blank array for tracking case-sensitive terms [within quotes]
        
    $qlen count($q); //how many characters?
        
    for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
            //open quote: not in quotes and at word boundary
            
    if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
                
    $inquote=1;
            }
            
    //end quote: found while at the end of a word and while in quotes
            
    elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
                
    $inquote=0;
                
    $cswords[] = $ac//add this to a list of words to keep case sensitive
            
    }
            
    //start parentheses: at new word and not in quotes
            
    elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
                
    $qstring .= '#('//add the parenthesis and marker [for later]
                
    $plevel++;
                if (
    $plevel<0) { //error: can't have close before open
                    
    return FALSE;
                }
            }
            
    //end parentheses: at end of word and not in quotes
            
    elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
                
    $qstring .= ')#'//add the parenthesis and marker [for later]
                
    $plevel--;
            }
            
    //a space is a break unless in a quote:
            
    elseif ($q[$l]==' '&&$inquote==0) {
                
    $newword=1;
                
    $ac++;
            }
            
    //do OR which must be both the beginning and end of a word:
            
    elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
                
    $l++; //skip past the R
                
    $l++; //skip past extra space to avoid issues
                
    $qstring .= ' OR ';
            }
            
    //deal with - marker for "NOT" terms
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $negwords[] = $ac;
            }
            
    //deal with - marker for "NOT" parentheticals
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $qstring .= '#NOT ';
            }
            else { 
    //not syntactic, must be content
                
    if ($newword==1) { //begin a new word
                    
    $newword=0//no longer a new word
                    
    $qwords[$ac] = ''//setup blank string
                    
    $qstring .= '#*'.$ac.'*#'//add the reference + markers [for later] 
                
    }
                
    $qwords[$ac] .= $q[$l]; //add the letter
            
    }
        }
        
    //errors:
        
    if ($inquote==1) { //uneven quotes
            
    return FALSE;
        }
        if (
    $plevel!=0) { //uneven parentheses
            
    return FALSE;
        }
        if (
    in_array('',$qwords)) { //a blank word was found: input is badly formatted
            
    return FALSE;
        }
        if (
    count($qwords)==0) { //no blank searches...
            
    return FALSE;
        }
        if (
    $maxterms!=0&&count($qwords)>$maxterms) { //over max length: prevent overload
            
    return FALSE;
        }
        if (
    strpos($qstring,'###')!==FALSE) { //weird error with markers
            
    return FALSE;
        }
        
    //end errors

        
    $qstring str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
        
    $qstring str_replace('#','',$qstring); //no need for floating markers now

        //format $column including concatenation as needed
        
    if (!is_array($column)) { //must switch string to array
            
    $skipconcat=1//skip concatenation later
            
    $column = array($column); //convert to array
        
    }
        foreach(
    $column as $colk=>$col) { //format each column individually
            
    $col '`'.$col.'`'//format as column
            
    if ($ifnull==1) {
                
    $col 'IFNULL('.$col.',\'\')'//avoid null errors
            
    }
            
    $column[$colk] = $col//store
        
    }
        
    $column implode(', ',$column); //implode them into a list
        
    if (!isset($skipconcat)) { //concatenate them:
            
    $column 'concat('.$column.')';
        }
        
        
    //escape data, then format as a condition
        
    foreach($qwords as $qwkey=>$qword) { //go through the words
            //escaping:
            
    $qword mysql_real_escape_string($qword); //make safe!
            
    $qword str_replace('%','\\%',$qword); //escape % so it's not a wildcard
            
    $qword str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
            //format as condition:
            
    $not in_array($qwkey,$negwords)?' NOT ':' '//is it negated?
            //handle case sensitivity: lcase matching normal; in quotes is case sensitive
            
    $lcase ''//set to blank by default
            
    if (!in_array($qwkey,$cswords)) { //this is a non-case-sensitive word [otherwise do nothing special]
                
    $lcase 'lcase('//set col prefix
                
    $not ')'.$not//add col end to trailing $not var
                
    $qword mb_strtolower($qword); //make string lowercase
            
    }
            
    $qwords[$qwkey] = $lcase.$column.$not.'LIKE \'%'.$qword.'%\''//format as a mysql query condition
        
    }

        
    //now that the terms and query are setup separately, time to put them together
        
    $invar=0//tracker value
        
    $q ''//reset as blank string for output
        
    for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
            
    if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
                
    $invar=1//set as compiling key
                
    $qwkey ''//setup blank string as key
            
    }
            elseif (
    $qstring[$x]=='*'&&$invar==1) { //key end reached
                
    $q .= $qwords[$qwkey]; //add the word to the query
                
    $invar=0//reset
            
    }
            elseif(
    $invar==1) { //continue compiling replace key
                
    $qwkey .= $qstring[$x];
            }
            else { 
    //it's just syntax
                
    $q .= $qstring[$x];
            }
        }

        return 
    $q//return the final CONDITION(s) for WHERE; use as required
    }


    ///do the stuff and setup the output!
    $out='';
    if (isset(
    $_POST['query'])) {
        
    $condition search2mysql($_POST['query'],array('COL1','COL2')); /////EDIT THIS
        
    $r mysql_query("SELECT * FROM `YOURTABLE` WHERE $condition ORDER BY `ORDERCOLUMN`;"); ////EDIT THIS
        
    $out '<hr><b>Results found: '.mysql_num_rows($r).'</b><hr>';
        
    $count=0;
        while ((
    $result mysql_fetch_assoc($r))&&$count<10) {
            
    extract($result);
            
    $out .= "<b>$COL1<b><hr>";
            
    $count++;
        }
        
    $out .= microtime(1)-$t1;
    }



    ?>
    <html>
    <head>
    <title>search test</title>
    </head>
    <body>
    <form action="?" method="post">
    Query: <textarea name="query"><?php if (isset($_POST['query'])) { echo $_POST['query']; } ?></textarea> <input type="submit" value="submit">
    </form>
    <p style="font-family:courier;font-size:80%;"><?php echo isset($condition)?($condition===FALSE?'error':$condition):''?></p>
    <p><?php echo $out?></p>
    </body>
    </html>

    How to set it up:
    1. At the top you must setup your database connection. Do that as you'd like. 2 lines to edit using my current syntax.
    2. Near the bottom (in the last bit of PHP) you must edit two lines to reflect your DB's structure.



    I'm well aware that the HTML there is lazy and that the PHP even (aside from the function) may be messy. But it works for testing.


    This uses mysql_real_escape_string(), so I think it's reasonably safe, but I certainly wouldn't advise yet doing anything public with it. Select statements luckily aren't that dangerous, but be a bit careful...



    Also, if you have any trouble reading it (it's well commented, but also very complex), feel free to ask. The overall logic of the situation is:
    1. Go through character by character and split up the input into content bits and syntax bits.*
    2. Toss the syntax bits into a pseudo query without any content, leaving some markers for later
    3. Toss the content bits into an array of 'words' to search for later, keeping references [array keys] to the markers in the syntax
    4. Parse all of this now individually: A) make sure the syntax of the query is not badly formed; B) parse all of the "words" into the syntax for a "LIKE" condition in MySQL
    5. Now that the parts are ready, replace the markers with the corresponding "words", and put the whole query back together.
    (6. Place this after 'WHERE' in a MySQL query and do the search.)

    (*Regex might seem better here, but due to the immense recursive complexity here, I'm not even going to think about trying that.)
    Last edited by djr33; 05-06-2010 at 04:56 AM.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  8. #28
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    Quote Originally Posted by djr33 View Post
    something is wrong with 'NOT' as I feared it might be before.
    The result set of -a -b does not match the result set of -(a b).
    These generate the queries of:
    Code:
    -a -b
    lcase(`body`) NOT LIKE '%a%' AND lcase(`body`) NOT LIKE '%b%'
    -(a b)
    NOT (lcase(`body`) LIKE '%a%' AND lcase(`body`) LIKE '%b%')
    Are they supposed to match? On a DB with 7 entries ("a", "b", "c", "ab", "ac", "bc", "abc"), these are my result-sets:
    Code:
    -a -b:
    
    Array
    (
        [id] => 7
        [content] => c
    )
    
    
    -(a b):
    
    Array
    (
        [id] => 3
        [content] => bc
    )
    Array
    (
        [id] => 4
        [content] => ac
    )
    Array
    (
        [id] => 5
        [content] => a
    )
    Array
    (
        [id] => 6
        [content] => b
    )
    Array
    (
        [id] => 7
        [content] => c
    )
    They seem to be working fine; they're just not equivalent statements. -a -b ("not-a and not-b") returns the entry that has neither a nor b; where -(a b) ("not a-and-b") returns the five entries that don't have a and b. Or am I misunderstanding what your syntax is supposed to mean?
    Last edited by traq; 05-06-2010 at 06:45 AM.

  9. The Following User Says Thank You to traq For This Useful Post:

    djr33 (05-06-2010)

  10. #29
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Responding to the issue of negation with "NOT" above, I was wrong.
    MySQL is NOT wrong here.
    Here's why:
    I was not following the rules of formal logic:
    -(a b) is NOT the same as -a -b.
    It is instead the same as -a OR -b
    (Think of this as the "opposite" of "AND"=>"OR", and vice versa.)

    While this is confusing to deal with it's not in fact an error in mysql (or my algorithm).
    So that's fixed, if confusing for users.


    EDIT: Traq, that's exactly correct. Just as you posted this I had figured out what was wrong.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  11. #30
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    Quote Originally Posted by djr33
    Just as you posted this I had figured out what was wrong.
    You mean, "nothing"?

    I spent a long time staring at those two statements before I could figure out what I was even looking for. Once I did, though, everything suddenly struck me as completely logical.

  12. The Following User Says Thank You to traq For This Useful Post:

    djr33 (05-06-2010)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •