Page 4 of 4 FirstFirst ... 234
Results 31 to 37 of 37

Thread: Theories of a search algorithm

  1. #31
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Exactly. Once you can think in logic, everything logical is logical. It's a loophole. And human brains don't natively think in logic... just computers.


    Edit: haha, yes, I mean nothing. Well, something was wrong, but in my head, not in the SQL.
    Last edited by djr33; 05-06-2010 at 06:47 AM.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  2. #32
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    Quote Originally Posted by djr33
    haha, yes, I mean nothing. Well, something was wrong, but in my head, not in the SQL.
    so, in your query syntax, is nothing equivalent to "nothing" (...or -"something")?
    Last edited by traq; 05-06-2010 at 10:31 PM.

  3. #33
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Sorry, I don't understand. 'nothing' is not a keyword so if you type that it would be searching for the word "nothing" the same as if you searched for "cars".

    nothing (no quotes) is searched in the default way: it's caseless and parsed like normal.
    "nothing" (quotes) is searched literally and must match in case; it's also NOT parsed. This isn't clear here, but if you have syntax in it, it'll not parse in the quotes. For example:
    Harry Potter will find any articles containing harry and also potter (caseless). "Harry Potter" will find any articles containing the two words TOGETHER in that order with one space between them and also capital letters.



    However, if you are asking about the absense of any text in the search-- that is a "blank" query-- then the result will be an error
    I set this up so that it requires at least one search term. If not, you'll just be reading everything from the database which seems stupid.
    Of course that's easy enough to change-- just take out a few of the errors (I'd have to look through the code again to see which specifically), or even just add a line at the top of the function if($q=='') { return ''; }

    If all of the errors were removed for that, I believe it would simply return nothing at all, an empty string, so you would end up with a broken WHERE clause and no conditions. You could fix this of course.



    Current status: today I'm going to figure out why cases are always ignored. Quoted material should be searched case-sensitively... but it's currently not. That should be the last major issue from what I can tell.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  4. #34
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    Sorry: I was just kidding around. see my edited post above.

    I've run a couple queries using your script; it's working as expected/ described so far (though I haven't run it on a live database yet). Amazing job. :thumbsup:

    About case-insensitivity: are you sure you want it? people don't necessarily use proper capitalization when entering search terms - I'm thinking case-sensitive searches might exclude desired results in many cases. You could make it an option (e.g., [ ] make this search Case-Sensitive )

  5. The Following User Says Thank You to traq For This Useful Post:

    djr33 (05-06-2010)

  6. #35
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    Oh, haha. Well it brought up a valid point: currently it throws an error if nothing is searched for and that's kinda awkward.

    I'm thinking that I will instead not throw any errors directly but return an array with 'query' as the final result, 'error' as the error if any, 'numterms' as how many terms were searched, etc. THEN you can take those values and decide what to do, like stop if there was an error or too many terms were used.

    Thanks, and yes, I'm excited this is working.

    I actually do want to keep it case-insensitive most of the time, but for quoted material I thought it might be better. However, I have figured out a few things:
    1. People are really lazy. They'll put "harry potter" instead of "Harry Potter", so actually even in quoted material it won't be that bad to have it be case-less.
    2. My database is currently configured for caseless searches. There's nothing I can do except modifying the collations of the tables while searching them. This seems really messy and like it'll slow things a lot.
    3. Since for my purposes this will work across many languages, the specifics of English big and small letters really isn't all that important

    Because of this, I think I'll add an extra parameter of $casemode:
    '0': do nothing to cases (use the database default, either sensitive or insensitive)
    '1': do the entire search case-insensitive (regardless of DB config)
    '2': do regular terms insensitive and quoted material sensitive (as it is setup now).

    BUT: if the DB is configured to ignore case anyway, then all 3 of these will do the exact same search every time. So 0 will be best because it will make things faster: no need to convert all of the fields to lcase() for every search.

    (So if someone feels like dealing with a case sensitive search [for ANY DB configuration] and adding this function to the search, have fun with that, and let me know how it turns out.)


    I'm going to rework the function a bit with the $casemode parameter and the array output (errors, query, numterms). That'll make it easier to work with, I think.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  7. #36
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    I'm coming back to this after a while because I have tested the code and have some updates.
    PHP Code:
    <?php

    //"BASIC SEARCH"
    //transform a user search to a mysql code for creating temp RANKING column
    function search2mysql_rank($q,$column,$wmin=3,$maxterms=10,$ifnull=1) {
        
    //$q string to convert;
        //$column [string or string array to concatenate] to be searched;
        //$maxterms limits search length to prevent overloading the server; 0 is unlimited;
        //$ifnull 0 or 1: compensate for possible null values? default is YES
        
    if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
        
    $q explode(' ',$q); //split by words
        
    foreach($q as $qk=>$qv) { //go through them all to check format/content of each word
            
    if ($qv=='') { unset($q[$qk]); } //remove blanks
            
    if (strpos($qv,'-')===0) { $qv substr($qv,1); } //remove negation
            
    else if (($wmin!=0)&&(strlen($qv)<$wmin)) { //is the term too short?
                
    unset($q[$qk]); //remove it
            
    }
        }
        
    //check for errors:
        
    if (count($q)==0) { //no blank searches...
            
    return FALSE;
        }
        if (
    $maxterms!=0&&count($q)>$maxterms) { //over term limit
            
    return FALSE;
        }
        
    //end errors

        //format $column including concatenation as needed
        
    if (!is_array($column)) { //must switch string to array
            
    $skipconcat=1//skip concatenation later
            
    $column = array($column); //convert to array
        
    }
        foreach(
    $column as $colk=>$col) { //format each column individually
            
    $col '`'.$col.'`'//format as column
            
    if ($ifnull==1) {
                
    $col 'IFNULL('.$col.',\'\')'//avoid null errors
            
    }
            
    $column[$colk] = $col//store
        
    }
        
    $column implode(', ',$column); //implode them into a list
        
    if (!isset($skipconcat)) { //concatenate them:
            
    $column 'concat_ws(\' \','.$column.')';
        }
        
    $column 'lcase('.$column.')'//lowercase for caseless search
        
        //format terms:
        
    $negcount=0//no negative terms, yet
        
    foreach($q as $qk=>$qv) {
            
    //escaping:
            
    $qv mysql_real_escape_string($qv); //make safe!
            
    $qv str_replace('%','\\%',$qv); //escape % so it's not a wildcard
            
    $qv str_replace('_','\\_',$qv); //escape _ so it's not a wildcard
            
            //handle 'not':
            
    $not '';
            if (
    strpos($qv,'-')===0) { //this term is negated
                
    $qv substr($qv,1); //remove the negative operator
                
    $not ' NOT'//add for query
                
    $negcount++; //for later use
            
    }

            
    $qv mb_strtolower($qv); //lowercase for caseless search
            
            //format as parameter:
            
    $q[$qk] = 'IF('.$column.$not.' LIKE \'%'.$qv.'%\','.($not==''?1:'0.999').',0)';
        }

        if (
    $negcount==count($q)) { //only negative terms?
            
    return FALSE//not allowed
        
    }

        
    $q implode('+',$q); //sum these up
        
        
    if ($negcount>1) { //negative terms?
            
    $q $q.'-'.($negcount-1); //ignore results based on only the negative
        
    }

        return 
    $q//return the final OPERATIONS(s) for GENERATING COLUMN; use as required
    }


    //"ADVANCED SEARCH"
    //transform a user search to a mysql code 'WHERE' string
    function search2mysql_adv($q,$column,$wmin=3,$maxterms=10,$ifnull=1) {
        
    //$q string to convert;
        //$column [string or string array to concatenate] to be searched;
        //$maxterms limits search length to prevent overloading the server; 0 is unlimited;
        //$ifnull 0 or 1: compensate for possible null values? default is YES
        
    if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
        
    $q mbstring2chararray($q); //split query into [multibyte] character array
        
    $qwords = array(); //setup an array to store the content 'words'
        
    $ac=0//setup array counter for search content array
        
    $inquote 0//setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
        
    $plevel 0//setup counter for levels of parenthetical embedding
        
    $qstring ''//setup a blank term for the query syntax, later to get content too
        
    $newword 1//status for whether we're at a word boundary
        
    $negwords = array(); //setup blank array for tracking negative terms ("NOT")
        
    $cswords = array(); //setup blank array for tracking case-sensitive terms [within quotes]
        
    $qlen count($q); //how many characters?
        
    for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
            //open quote: not in quotes and at word boundary
            
    if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
                
    $inquote=1;
            }
            
    //end quote: found while at the end of a word and while in quotes
            
    elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
                
    $inquote=0;
                
    $cswords[] = $ac//add this to a list of words to keep case sensitive
            
    }
            
    //start parentheses: at new word and not in quotes
            
    elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
                
    $qstring .= '#('//add the parenthesis and marker [for later]
                
    $plevel++;
                if (
    $plevel<0) { //error: can't have close before open
                    
    return FALSE;
                }
            }
            
    //end parentheses: at end of word and not in quotes
            
    elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
                
    $qstring .= ')#'//add the parenthesis and marker [for later]
                
    $plevel--;
            }
            
    //a space is a break unless in a quote:
            
    elseif ($q[$l]==' '&&$inquote==0) {
                
    $newword=1;
                
    $ac++;
            }
            
    //do OR which must be both the beginning and end of a word:
            
    elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
                
    $l++; //skip past the R
                
    $l++; //skip past extra space to avoid issues
                
    $qstring .= ' OR ';
            }
            
    //deal with - marker for "NOT" terms
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $negwords[] = $ac;
            }
            
    //deal with - marker for "NOT" parentheticals
            
    elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
                
    $qstring .= '#NOT ';
            }
            else { 
    //not syntactic, must be content
                
    if ($newword==1) { //begin a new word
                    
    $newword=0//no longer a new word
                    
    $qwords[$ac] = ''//setup blank string
                    
    $qstring .= '#*'.$ac.'*#'//add the reference + markers [for later] 
                
    }
                
    $qwords[$ac] .= $q[$l]; //add the letter
            
    }
        }
        
    //errors:
        
    if ($inquote==1) { //uneven quotes
            
    return FALSE;
        }
        if (
    $plevel!=0) { //uneven parentheses
            
    return FALSE;
        }
        if (
    in_array('',$qwords)) { //a blank word was found: input is badly formatted
            
    return FALSE;
        }
        
    $qwordstemp $qwords//copy to check min length for words and query:
        
    foreach($qwordstemp as $qk=>$qv) { //go through them all
            
    if ($wmin!=0&&strlen($qv)<$wmin) { //are all terms too short? --> too hard right now to eliminate short terms, but can require that at least some are long enough
                
    unset($qwordstemp[$qk]); //remove it
            
    }
        }
        if (
    count($qwordstemp)==0) { //no blank searches... at least some terms must be valid: exist and be long enough, see above
            
    return FALSE;
        } 
    //end min length check
        
    if ($maxterms!=0&&count($qwords)>$maxterms) { //over max length: prevent overload
            
    return FALSE;
        }
        if (
    count($negwords)==count($qwords)) { //all words are negative?
            
    return FALSE;
        }
        if (
    strpos($qstring,'###')!==FALSE) { //weird error with markers
            
    return FALSE;
        }
        if (
    strpos($qstring,'()')!==FALSE) { //input had parentheses with no content
            
    return FALSE;
        }
        
    //end errors

        
    $qstring str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
        
    $qstring str_replace('#','',$qstring); //no need for floating markers now

        //format $column including concatenation as needed
        
    if (!is_array($column)) { //must switch string to array
            
    $skipconcat=1//skip concatenation later
            
    $column = array($column); //convert to array
        
    }
        foreach(
    $column as $colk=>$col) { //format each column individually
            
    $col '`'.$col.'`'//format as column
            
    if ($ifnull==1) {
                
    $col 'IFNULL('.$col.',\'\')'//avoid null errors
            
    }
            
    $column[$colk] = $col//store
        
    }
        
    $column implode(', ',$column); //implode them into a list
        
    if (!isset($skipconcat)) { //concatenate them:
            
    $column 'concat_ws(\' \','.$column.')';
        }
        
        
    //escape data, then format as a condition
        
    foreach($qwords as $qwkey=>$qword) { //go through the words
            //escaping:
            
    $qword mysql_real_escape_string($qword); //make safe!
            
    $qword str_replace('%','\\%',$qword); //escape % so it's not a wildcard
            
    $qword str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
            //format as condition:
            
    $not in_array($qwkey,$negwords)?' NOT ':' '//is it negated?
            //handle case sensitivity: lcase matching normal; in quotes is case sensitive
            
    $lcase ''//set to blank by default
            
    if (!in_array($qwkey,$cswords)) { //this is a non-case-sensitive word [otherwise do nothing special]
                
    $lcase 'lcase('//set col prefix
                
    $not ')'.$not//add col end to trailing $not var
                
    $qword mb_strtolower($qword); //make string lowercase
            
    }
            
    $qwords[$qwkey] = $lcase.$column.$not.'LIKE \'%'.$qword.'%\''//format as a mysql query condition
        
    }

        
    //now that the terms and query are setup separately, time to put them together
        
    $invar=0//tracker value
        
    $q ''//reset as blank string for output
        
    for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
            
    if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
                
    $invar=1//set as compiling key
                
    $qwkey ''//setup blank string as key
            
    }
            elseif (
    $qstring[$x]=='*'&&$invar==1) { //key end reached
                
    $q .= $qwords[$qwkey]; //add the word to the query
                
    $invar=0//reset
            
    }
            elseif(
    $invar==1) { //continue compiling replace key
                
    $qwkey .= $qstring[$x];
            }
            else { 
    //it's just syntax
                
    $q .= $qstring[$x];
            }
        }

        return 
    $q//return the final CONDITION(s) for WHERE; use as required
    }

    ?>
    [continued below]
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  8. #37
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    [Continued]

    You will also need to define the following function as a resource that's used in the script above.
    PHP Code:
    function mbstring2chararray($s,$maxcharlen=4) {
        if (
    $maxcharlen!=4) {
            echo 
    'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
            
    $maxcharlen=4//default
            //this should be later fixed, but the syntax is hard...
        
    }
        
    $o = array();
        for(
    $x=0;$x<strlen($s);$x++) {
            
    $char $s[$x];

            
    $skipx 0//assume skipping no characters
            //now go through each byte number:
            //if the sequence of bytes is one character, then group them:
            //a 2byte char
            
    if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
                
    $char $s[$x].$s[$x+1]; $skipx 1;
            }
            
    //a 3byte char
            
    if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2]; $skipx 2;
            }
            
    //a 4byte char
            
    if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
                
    $char $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx 3;
            }
            
    ///will not currently do more than 4byte chars!
            
            //how many following letters [and loop cycles] do we skip?
            
    $x $x+$skipx;
            
            
    $o[] = $char//store it
        
    }
        return 
    $o//return final array

    (This function helps to deal with non-western character ranges in UTF8 for other languages than English.)



    As you can see in the first set of code, there are actually two functions. One is faster and more accurate without any advanced control from the user; the other allows the user to control all of the specific details of the search but it is a little less accurate without specification (as a default), and it's also a little slower. In my site, I use the basic mode as the default and the advanced mode as an option for advanced users.
    The two different functions work very differently but achieve the same result.


    Some demos:
    PHP Code:
    //BASIC MODE
    $ranking search2mysql_rank($INPUT,'my_column',3,10);
    if (
    $ranking == FALSE) { //probably no term recognized
        
    $error 'Error in search format';
    }
    else {
        
    $q 'SELECT *,('.$ranking.') as `ranking` FROM `my_table` HAVING `ranking` >= 1 ORDER BY `ranking` DESC;';

    //ADVANCED MODE:
    $condition search2mysql_adv($INPUT,'my_column',3,10);
    if (
    $condition == FALSE) { //probably no term recognized
        
    $error 'Error in search format';
    }
    else {
        
    $q 'SELECT * FROM `my_table` WHERE '.$condition.';';

    Format of user input:
    Basic mode: just a list of terms, no special characters. One exception. For negating (removing) a term, use this: -term
    Advanced mode: Quotes " group multi-word terms into single terms; parentheses () group sets of terms together for and/or logic; OR is or and and is implied by a space; -term removes the term "term".


    Error correction is not implemented, so if you need that you'll have to try it yourself at least until I get the time to go into all of it.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •