PDA

View Full Version : Theories of a search algorithm



djr33
04-29-2010, 02:27 AM
I am currently working on a project where for one section of the site I will need a strong search algorithm.

The content will be user submitted and keywords would be difficult to get/require.

Using a custom google search is problematic for a few reasons (since this is already a complex system, and that google doesn't integrate all that well with a few things).

That said, what I need is to figure out how to design a search system that is efficient, fast and reliable. And of course that means that it gets good results.

At the moment I'm trying to figure out if it makes sense to even attempt this or if I need to find a way to use an existing system.

I'm using MySQL and PHP.

I'd like to just start a discussion of good methods and known problems to figure out how to approach it. If there are any good references out there for a summary of this situation that would help too.

BLiZZaRD
04-29-2010, 04:00 AM
Okay, the obvious question here is what are you searching?

So, the content to be searched is uploaded (or written, or whatever) into the site, and that stuff is stored on site and in database.

I am guessing you then want to have a 3rd person come along months later and say "I wanna see that article BLiZZ wrote on how John is the greatest JS author ever.. ahhh a search bar" And then plug in keywords as if he was on Google and get results based on his search terms.

I have seen this done a few times, and the answer (assuming on how the content is gathered) has always been to make the author (or uploader, or whatever) add key words. You then of course have the authors email addy, username, post date, and a list of CSV keywords in which to add to the DB table_SearchTerms.

I must be missing something...

djr33
04-29-2010, 04:34 AM
Yes, a search field like on google where you type in words.

I've written keyword based searches before and it actually works really well, but it's a lot to maintain if there's a lot of content.

As I said this will all be user contributed.

It's going to be roughly a wiki. (Due to various reasons integrating existing packages is very difficult-- if this turns out to be too hard I might try going that route again, as a last resort).

Basically I want to handle things just like google does: full text searching, but of course using a much smaller database, but still quite large.
It's equivalent to searching dynamic drive's forums, probably, or at least once it gets going.


Also, this will be across many translations and various complications.

The simplest way to imagine this is that there will be a tables for large strings in the database. I want to be able to search those strings automatically. If this means some sort of pre-parsing fine; if it's in-the-moment parsing that's fine too; or perhaps a mix: caching of some sort. The problem though is that this content will always be changing. Again, it's all user contributed (and controlled).

james438
04-29-2010, 06:53 AM
Off and on I have spent a lot of time with creating php/MySQL based search programs. The preferred method is to use fulltext indexing for fast, efficient, and accurate returns on queries. I am against this for a few reasons. fulltext indexing results are based on the indexed tables, which need to be indexed at regular intervals in order to get up to date returns. Stopwords (http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html) are a factor. The min word length (ft_min_word_len) is often set at 4 letters and in order to alter this from 4 to 3 you need root access to your MySQL database, which with my GoDaddy account requires the significantly more expensive Deluxe account. Terms are also included that I would rather were not. Searching for "keyboards" could return "keyboard". Searches that are found in over 50% of the results are ignored as well. As I understand it Google uses search patterns similar to this.

All that being said I use MySQL's LIKE function.
SELECT * FROM table WHERE col1 LIKE '%term%'.

The searches will be be in realtime and will find results even if the text to be searched was entered into a database 2 seconds ago. Remember to sanitize the searches though so as to avoid SQL injection. The searches will be slower, but for most people with hardly be noticeable. This also is not the same as fulltext indexing as listed above.

Skipping forward several paragraphs here is an example of one of my queries:


SELECT ID, summary, image FROM memoblog WHERE
(lcase( concat(cast(ID as char), summary, IFNULL(image,'')))
LIKE '%yoo%' )

Breaking this apart I am returning the ID, summary, and image from table "memoblog" if "yoo" is found anywhere in ID, summary, or image columns.

lcase is used to make the search case insensitive.

concat means concatenation, which means the 3 columns are searched as if they were a single unified column. This is useful when searching multiple columns with multiple terms where one term may be found in one col and another is found in another column, but all of the terms are not found in a single column together. I hope that makes sense.

cast is used here because if you are using MySQL 5.0 and later and you search your database for a column that is text and another column that is an integer the search becomes case sensitive again. I forget why or how though. cast (col as char) will return integers as alphanumeric characters.

IFNULL if any field is null then no results will be returned for that particular row. IFNULL will allow you to recognize NULL values as something else. In this case the NULL value is returned as empty as opposed to NULL.

The actual code I use in my search program is certainly more dynamic and can be rather complex if needed.

There are other parameters that need to be considered for the admin like how do you make certain results unsearchable? Sometimes you want to keep certain entries from being found via your search program. How do you make those hidden results appear depending on the user? How do you keep the program simple enough so as to be easily editable? How do you search specific columns in multiple tables? Are you using MySQL 5.x or 4.x (big difference here)?

PHP based questions include: how will you return the results? Will you return the enitre document or just a portion of 200 words or so that contains one or more of the search terms? Will you limit the searches to 10 results per page? Pagination? Can the user search for exact phrases?

Other questions are implied, so I will avoid mentioning them here.

On a side note there are free search programs out there, but I had some very specific requirements, so I made my own. It was not easy and a lot of the information was not all that easy for me to find, like ft_min_word_len manipulation, but it was rewarding. It has been a while since I have worked on my search engine and so I may be a bit out of practice with this.

djr33
04-29-2010, 08:59 PM
That's some great info and I'm glad to hear from someone who's tried all this out before.

The details will be helpful, but I have a general question:
I've considered using the LIKE operator-- that's a clear place to start.
But is it efficient? What if there is a LOT of text? That's a lot to go through and I know that MySQL is generally efficient, but if I'm searching through a big database (again, this will be roughly the size of searching through DD's posts), will this be a problem?

BLiZZaRD
04-29-2010, 09:06 PM
I have seen searches set up obviously using the LIKE operator. And it's a chore reading through the results. A causal user will type in things including "a, an, the, and" etc.

Which will then return each result that matches that. Which will be 99.999999% of them, or more.

You will have to limit it some way, maybe a character limit, or auto strip those words.

djr33
04-29-2010, 09:08 PM
Pre-processing the search input is required; that's fine.

Given good input, is the LIKE operator reliable/efficient/stable through lots of data?

BLiZZaRD
04-29-2010, 09:38 PM
You and everyone else knows a lot more than I do about so I am trying (really) to stay out of this, but I just can't help it. :D

I think the query using the like operand would be fine, speed will be more based on the server proc anyway and bandwidth allotments.

However, for set up and use the RLIKE and NOT RLIKE may be a little better suited as there are less chances of coding failures (for example in LIKE and NOT LIKE you need to look for a character, say "x", then you run %x% but in RLIKE you just use 'x' ). It also looks for a match anywhere in the search area where LIKE only looks for an exact match.

From what I know (and again, its not a lot) RLIKE is a off shoot of REGEXP with the added bonuses of LIKE.

djr33
04-29-2010, 10:57 PM
Interesting. More to look into.

As I said, I'm currently just thinking about this. I've got other things to do at the moment on the site and at some point I'll have to start this part.

My main concern is that I'll spend a lot of time and build a very nice system then find that after users contribute something like 100MB of text into the database (admittedly, it'll be a long while), the search will start overloading the server or even just take 10 seconds for each query...


I guess I could test all of this, but it is difficult to test without having a real database to play with.

james438
04-29-2010, 11:51 PM
RLIKE is synonymous with regex. The terms can be used interchangeably. I have recently discovered that MySQL is very efficient and can search through large amounts of data almost instantly. I am not sure how processor heavy RLIKE is with MySQL though. I imagine not very, but I have only tried regex with MySQL once for the novelty of it.

fulltext indexing is designed for handling large amounts of data to help greatly speed up searches, etc. What exactly constitutes a large amount of data I am not sure. I assume the DD forums would constitute a large amount of data, but MySQL is designed to handle really large amounts of data that I suspect would dwarf even Dynamic Drive forums. I generally do not like fulltext searches due to all of the restrictions fulltext searches have. It also behaves quite differently. Really differently. It is a whole other world compared to LIKE. It can analyze the results and order them by most relevant. The searches are quite intelligent. When using this method you will be using MATCH as opposed to LIKE. It is a minor syntax thing. The real difference is in how it analyzes data.

fulltext searches are good for DD forum sized data, but the designer should know what they are getting into. For me I will avoid it as much as possible, because I do not like coding that is both user friendly and uncustomizable. You can customize it, but there are many hoops to jump through like needing root access to alter the startup data so as to alter the source code of MySQL. You can read more about all this at: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

As far as the engine differences djr33 I wish I knew the answer.

LIKE has been quite useful when searching through what I consider large amounts of data. It is fast, efficient, and reliable. I have worked with it extensively to be sure that the correct number of results are returned and to avoid "false positives" as well. False positives and the like were due to errors in my coding.

If you want to ignore things like a, an, the, or, is then you are beginning to think along the lines of fulltext indexing.

For limiting results and speed in returning results I use two methods. First, I limit results to 500 characters of sample text, second I use DD's Virtual Pagination script. If you are returning thousands of results then this may take up to a minute to return all of that data. Remember that you are dealing with text, so the results will be returned rather quickly as compared to an image. If you have 2000 results then I would recommend DD's Ajax pagination script as opposed to the Virtual Pagination script so as to save on bandwidth and load time. I plan on implementing this at some point, but have been a bit intimidated about working with Ajax and integrating it into my search program.

If I search for "What time is the lecture?" It will search for all of the entries that contain all of the words: "what" "time" "is" "the" "lecture?" in any order. 'lecture?' with a '?' is considered a word. I could eliminate non word characters in my search, but I prefer not to, because sometimes I like to look for "lecture?" as opposed to "lecture"

djr33
04-30-2010, 01:44 AM
Ok, that makes sense. From what you've both said, this sounds reasonable and probably much more useful than spending more endless hours trying to hack mediawiki to function within my site or something else. The rest of the functions should be easy enough, but I didn't want to get stuck with a search that doesn't work well after all that.

I can deal with all of the issues mentioned above, mostly outside of MySQL. I'll remove common words like "a", "the", etc. But I don't want to search an index without them because sometimes you do want those words-- such as when you put them in quotes. There's a lot of user-input parsing to deal with, but the actual search should be easy enough.

I also don't often expect to have a huge number of results, or at least not unless someone searches a word that is too relevant to be useful.

BLiZZaRD
04-30-2010, 07:44 PM
You could put a limit on the search results. Kind of like how Google does with their "Similar results hidden, click here to show results" thing. You could limit the results to show the first 5 or 10 or 20, then say "there were more but found to be less relevant, would you like to see them?" Or something.

djr33
05-01-2010, 08:07 PM
I just did some testing to see what happens with all of this.

I took the database from an old forum and search through the posts.
There are 27000 posts.

For each matching post, I echoed the username of the poster and the full text of the post. Obviously in a search this may not be needed, but it will give an idea of the higher end of resource-consumption.


For a standard against which to compare, echoing ALL 27000 posts took 13 seconds.

Now searching through using LIKE %term%:
0 matches: 0 seconds.
232 matches: 3 seconds.
2500 matches: 5 seconds.



While that's a bit worrying I don't really see much of a problem with that.


And now echoing only the id of the post:
2500 matches: .3 seconds.



Finally, if I limit the DISPLAYED results to 10 and only echo the post IDs, then:
232 matches (10 IDs displayed): .2 seconds
2500 matches (10 IDs displayed): .25 seconds.

(And if I remove all output and just loop through [doing nothing] for 10 entries, then it takes .12 seconds.)


Does that sound like a reasonable time to you?

If a few people are using the system at once, will it begin to slow down the server?

It's certainly not looking terrible, but it's still showing a little lag on the server.



Also, as an interesting note, I believe that I have now figured out that MySQL caches results: once the initial search takes .2 seconds (for the last round of testing) after that it's always about .001 seconds.

james438
05-02-2010, 02:49 AM
Also, as an interesting note, I believe that I have now figured out that MySQL caches results: once the initial search takes .2 seconds (for the last round of testing) after that it's always about .001 seconds.
Good to know.

djr33
05-05-2010, 07:06 AM
This was originally in a new thread but I realized I should merge it here.


I'm working on dynamically generating queries for a search engine and it's very complex. The good news is that I have gotten basically everything else working and it can do multiple levels of parentheticals and use OR. And is implied by spaces.
This all works.

But now I need to add negation.

So:
a OR (b c)

That's easy enough.

But how would I do:
a OR -(b c)

Meaning: "find something where X is like a or X is not like b and not like c"


The basic solution here is to find a true negation operator in MySQL, but I am not sure how this works.

Both ! and NOT are operators, but I can't find ANY reference on them aside from specific cases (like "NOT LIKE" and "NOT EXISTS").

Is this a valid query:
...WHERE NOT (`field` LIKE '%term%')

So that will find results that do not have 'term' anywhere in them?



The problem is that if I can't find an operator that will work across parentheticals then I'll need to "multiply out" all parentheticals and that will just be awful...
(remember, this is all dynamic)

traq
05-05-2010, 02:35 PM
other than creating a really long query -maybe like WHERE (`a` LIKE '%b%' (NOT (`a` LIKE '%c%'))) - I'm not sure. Maybe you could create a VIEW and run a second query on it to filter out the NOT results. Or do it in php.

Have you done much experimenting with either of those?

boogyman
05-05-2010, 04:33 PM
... WHERE (field NOT LIKE 'x' AND field NOT LIKE 'y')

djr33
05-05-2010, 05:09 PM
Is there no way at all to invert the value of a section? I understand that I could "multiply it out", but that would nearly impossible because this is generated automatically from any possible input including multiple levels:
Search input: "a OR -(b -(c d OR -e)))"

To create a dynamic system capable of multiplying out the negation, that would require tracking layers of parentheticals, multiple levels of negation and other factors. It's possible, but while I was thinking through how hard it would be, I decided it's best to find some other way because it's so hard...


Is there some reason mysql can't do a higher-order negation? Is this part of its efficiency/processing?

I really have no idea how mysql works at a technical level, so perhaps I'm missing something.


---
traq, no, I have not done much experimenting yet. I thought it would be best to ask first because mysql confuses me especially with complex queries. I can handle all the normal single-level queries, but once it gets like this I wanted to check a few things first.


WHERE (`a` LIKE '%b%' (NOT (`a` LIKE '%c%')))
Is this valid syntax? My parser is already generating this (from 'b -(c)').
Actually, it would generate:

WHERE `a` LIKE '%b%' AND NOT (`a` LIKE '%c%')
I could adjust the syntax slightly if the extra parentheses help or something.

---

In short, I just need to know if "NOT" can act as a standalone operator negating what's inside a parenthetical section.


WHERE (NOT (term))
Is that valid within a where statement?

I have been searching google, but I can't find any reference to NOT as a standalone operator. It's only referenced within constructs such as "NOT LIKE" and "NOT EXISTS" etc.


Here's the most "useful" page I've found, but it's not clear at all:
http://dev.mysql.com/doc/refman/5.1/en/logical-operators.html#operator_not

djr33
05-05-2010, 05:55 PM
Apparently this works:
http://www.hscripts.com/tutorials/mysql/logical-operator.php#not

So you can negate a parenthetical section.

I'll test this out later. I wish this was actually documented somewhere. In another tutorial it said you could only use NOT before IN, BETWEEN, and EXISTS....

traq
05-05-2010, 08:06 PM
WHERE (`a` LIKE '%b%' (NOT (`a` LIKE '%c%')))
I've been experimenting with stuff like that lately; I think I did try that particular statement, but I'd have to check. I may have an extra set of parenthesis in there. I'll let you know.

I think the big stumbling block here is that you can write WHERE(`a` LIKE '%b%' AND NOT `a` LIKE '%c%') but you can't just write WHERE(`a` LIKE '%b%' NOT '%c%')...

djr33
05-05-2010, 08:48 PM
Thank you for helping me confirm this. After some testing, it appears that "NOT" can precede a parenthetical block:
"WHERE NOT (statement)" is completely valid.

Now to put all of this together and I'll have a very helpful function.

Thanks!


And yes, you do need the "AND" and a full statement in the parentheses, but that's quite possible to setup.


For you information, the query I used to test this (that worked) was:
WHERE `field` LIKE '%a%' AND NOT (`field` LIKE '%b%')

Based on that, I'm working on the assumption that truth values for the parenthetical are reversed so this should be properly recursive, etc.

djr33
05-06-2010, 12:07 AM
Not sure if I should start a new thread for this or not, but I may as well post here.

I've finished my MySQL search-query generator.

Input: the search terms in a text field including operators:
"AND" is implied by just adding more words:
a b will search for anything containing both a AND b.

OR allows for one or the other (or both):
a OR b searches for anything containing a and/or b.
OR is case sensitive. Lowercase 'or' will act as a content-word and it will search for articles containing "or".

quotation marks ("") make the text inside literal: syntax operators are ignored inside and spaces are allowed; the exact contents must be in the search:
"a b" matches exactly a-space-b.

() Parentheses can group sections:
a OR (b c) searches for something containing a OR (both b and c).

- negation is marked by a minus sign immediately before something:
-a searches for anything that does not have a.
-(a OR b) searches for anything that does not have either a or b.


All of this can be layered into very complex setups and combinations of the logical operators:
a -(b (c OR d))
This generates a search for something that:
1) contains a
2) does not: contain b AND (c OR d)

As an example, the above syntax generates:
`field` LIKE '%a%' AND NOT (`field` LIKE '%b%' AND (`field` LIKE '%c%' OR `field` LIKE '%d%'))


This function is multibyte safe after a lot of reworking. For the encoding of Japanese characters for example, 3 bytes are used for one character. The normal string functions ignore this. So I've fixed it up and it works. You can have any search term you'd like EXCEPT the logical operators above-- if you need those, put them in quotes. (I don't have a way to search for quotation marks-- " "--- themselves yet, but I don't really care that much).


The result of this is as you can see a CONDITION for MySQL WHERE statements.

So just use the syntax:
WHERE $condition
You can of course combine it with other things as well:
WHERE `a` LIKE '%1%' AND $condition


The value $field in the function is very important: this will generate the MySQL table's column name that you are matching for each of the LIKE statements. Due to MySQL's structure, this must be done literally as far as I know.


Here's the code. It's a lot, but commented.

I've minimally tested it and as far as I can tell it's working well.

Note that if you give it a particularly weird input it'll return FALSE. Blank queries, mismatched quotes/parentheses and other odd things do this.

I'm *hopeful* that no invalid queries will be generated by disallowing certain things.
<?php

//supporting function, splits a string into an array of [multibyte] characters
function mbstring2chararray($s,$maxcharlen=4) {
if ($maxcharlen!=4) {
echo 'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
//this should be later fixed, but the syntax is hard...
}
$o = array();
for($x=0;$x<strlen($s);$x++) {
$char = $s[$x];

$skipx = 0; //assume skipping no characters
//now go through each byte number:
//if the sequence of bytes is one character, then group them:
//a 2byte char
if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
$char = $s[$x].$s[$x+1]; $skipx = 1;
}
//a 3byte char
if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2]; $skipx = 2;
}
//a 4byte char
if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx = 3;
}
///will not currently do more than 4byte chars!

//how many following letters [and loop cycles] do we skip?
$x = $x+$skipx;

$o[] = $char; //store it
}
return $o; //return final array
}


function search2mysql($q,$field) { //transform a user search to a mysql code 'WHERE' string
if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
$q = mbstring2chararray($q); //split query into [multibyte] character array
$qwords = array(); //setup an array to store the content 'words'
$ac=0; //setup array counter for search content array
$inquote = 0; //setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
$plevel = 0; //setup counter for levels of parenthetical embedding
$qstring = ''; //setup a blank term for the query syntax, later to get content too
$newword = 1; //status for whether we're at a word boundary
$negwords = array(); //setup blank array for tracking negative terms ("NOT")
$qlen = count($q); //how many characters?
for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
//open quote: not in quotes and at word boundary
if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
$inquote=1;
}
//end quote: found while at the end of a word and while in quotes
elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
$inquote=0;
}
//start parentheses: at new word and not in quotes
elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
$qstring .= '#('; //add the parenthesis and marker [for later]
$plevel++;
if ($plevel<0) { //error: can't have close before open
return FALSE;
}
}
//end parentheses: at end of word and not in quotes
elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
$qstring .= ')#'; //add the parenthesis and marker [for later]
$plevel--;
}
//a space is a break unless in a quote:
elseif ($q[$l]==' '&&$inquote==0) {
$newword=1;
$ac++;
}
//do OR which must be both the beginning and end of a word:
elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
$l++; //skip past the R
$l++; //skip past extra space to avoid issues
$qstring .= ' OR ';
}
//deal with - marker for "NOT" terms
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$negwords[] = $ac;
}
//deal with - marker for "NOT" parentheticals
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$qstring .= '#NOT ';
}
else { //not syntactic, must be content
if ($newword==1) { //begin a new word
$newword=0; //no longer a new word
$qwords[$ac] = ''; //setup blank string
$qstring .= '#*'.$ac.'*#'; //add the reference + markers [for later]
}
$qwords[$ac] .= $q[$l]; //add the letter
}
}
//errors:
if ($inquote==1) { //uneven quotes
return FALSE;
}
if ($plevel!=0) { //uneven parentheses
return FALSE;
}
if (in_array('',$qwords)) { //a blank word was found: input is badly formatted
return FALSE;
}
if (count($qwords)==0) { //no blank searches...
return FALSE;
}
if (strpos($qstring,'###')!==FALSE) { //weird error with markers
return FALSE;
}
//end errors

$qstring = str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
$qstring = str_replace('#','',$qstring); //no need for floating markers now

//escape data: this is confusing
foreach($qwords as $qwkey=>$qword) { //go through the words
$qword = mysql_real_escape_string($qword); //make safe!
$qword = str_replace('%','\\%',$qword); //escape % so it's not a wildcard
$qword = str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
$not = in_array($qwkey,$negwords)?'NOT ':''; //is it negated?
$qwords[$qwkey] = '`'.$field.'` '.$not.'LIKE \'%'.$qword.'%\''; //format as a mysql query condition
}

//now that the terms and query are setup separately, time to put them together
$invar=0; //tracker value
$q = ''; //reset as blank string for output
for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
$invar=1; //set as compiling key
$qwkey = ''; //setup blank string as key
}
elseif ($qstring[$x]=='*'&&$invar==1) { //key end reached
$q .= $qwords[$qwkey]; //add the word to the query
$invar=0; //reset
}
elseif($invar==1) { //continue compiling replace key
$qwkey .= $qstring[$x];
}
else { //it's just syntax
$q .= $qstring[$x];
}
}

return $q; //return the final CONDITION(s) for WHERE; use as required
}

?>
Note: the one thing I haven't dealt with yet is making the search case-insensitive: the problem is that quoted data should be case sensitive (no?) but other data should be case-insensitive. So because of this it's not very easy to determine what to do since that may be higher-level stuff than just for the condition following WHERE. I'll look into it. Obviously for now you can just make the whole query case-insensitive and be done with it, but I'm not sure how to make the quoted material stay case sensitive then.


A couple other things to consider for the future:
1. Wildcards aren't handled. You could either allow direct input of "%" and "_" or you could make up your own symbols and use these. It would also involve messing with the words rather than the query itself and that's not really covered yet. Easy enough to add, but I don't plan to do that now.
2. Word boundaries are ignored-- searching for "good" will find "goodbye". Making it word-boundary only would require that the MySQL is a lot more complex than it is-- this is about the STORED data not the new user input. This is not on my list of things to do now.
3. The results are not ranked in any sensible way. You can of course add an ORDER BY component, perhaps "ORDER BY `popularity`" and that would be a good start. But for now this isn't about "good" or "bad" matches, just "match" TRUE/FALSE and nothing more. Likewise it doesn't handle "partial" matches.
4. There is not yet any function where certain words [ex. 'the'] are removed. It's a lot easier to *pre-process* the string to remove words like that.

This is all solved by careful searching by the user, though.

djr33
05-06-2010, 03:11 AM
Thanks to all of you for the input. I've reworked this a bit more and now I'm happy with what I have:


<?php

function mbstring2chararray($s,$maxcharlen=4) {
if ($maxcharlen!=4) {
echo 'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
//this should be later fixed, but the syntax is hard...
}
$o = array();
for($x=0;$x<strlen($s);$x++) {
$char = $s[$x];

$skipx = 0; //assume skipping no characters
//now go through each byte number:
//if the sequence of bytes is one character, then group them:
//a 2byte char
if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
$char = $s[$x].$s[$x+1]; $skipx = 1;
}
//a 3byte char
if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2]; $skipx = 2;
}
//a 4byte char
if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx = 3;
}
///will not currently do more than 4byte chars!

//how many following letters [and loop cycles] do we skip?
$x = $x+$skipx;

$o[] = $char; //store it
}
return $o; //return final array
}

//transform a user search to a mysql code 'WHERE' string
function search2mysql($q,$column,$maxterms=10,$ifnull=1) {
//$q string to convert;
//$column [string or string array to concatenate] to be searched;
//$maxterms limits search length to prevent overloading the server; 0 is unlimited;
//$ifnull 0 or 1: compensate for possible null values? default is YES
if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
$q = mbstring2chararray($q); //split query into [multibyte] character array
$qwords = array(); //setup an array to store the content 'words'
$ac=0; //setup array counter for search content array
$inquote = 0; //setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
$plevel = 0; //setup counter for levels of parenthetical embedding
$qstring = ''; //setup a blank term for the query syntax, later to get content too
$newword = 1; //status for whether we're at a word boundary
$negwords = array(); //setup blank array for tracking negative terms ("NOT")
$cswords = array(); //setup blank array for tracking case-sensitive terms [within quotes]
$qlen = count($q); //how many characters?
for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
//open quote: not in quotes and at word boundary
if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
$inquote=1;
}
//end quote: found while at the end of a word and while in quotes
elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
$inquote=0;
$cswords[] = $ac; //add this to a list of words to keep case sensitive
}
//start parentheses: at new word and not in quotes
elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
$qstring .= '#('; //add the parenthesis and marker [for later]
$plevel++;
if ($plevel<0) { //error: can't have close before open
return FALSE;
}
}
//end parentheses: at end of word and not in quotes
elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
$qstring .= ')#'; //add the parenthesis and marker [for later]
$plevel--;
}
//a space is a break unless in a quote:
elseif ($q[$l]==' '&&$inquote==0) {
$newword=1;
$ac++;
}
//do OR which must be both the beginning and end of a word:
elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
$l++; //skip past the R
$l++; //skip past extra space to avoid issues
$qstring .= ' OR ';
}
//deal with - marker for "NOT" terms
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$negwords[] = $ac;
}
//deal with - marker for "NOT" parentheticals
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$qstring .= '#NOT ';
}
else { //not syntactic, must be content
if ($newword==1) { //begin a new word
$newword=0; //no longer a new word
$qwords[$ac] = ''; //setup blank string
$qstring .= '#*'.$ac.'*#'; //add the reference + markers [for later]
}
$qwords[$ac] .= $q[$l]; //add the letter
}
}
//errors:
if ($inquote==1) { //uneven quotes
return FALSE;
}
if ($plevel!=0) { //uneven parentheses
return FALSE;
}
if (in_array('',$qwords)) { //a blank word was found: input is badly formatted
return FALSE;
}
if (count($qwords)==0) { //no blank searches...
return FALSE;
}
if ($maxterms!=0&&count($qwords)>$maxterms) { //over max length: prevent overload
return FALSE;
}
if (strpos($qstring,'###')!==FALSE) { //weird error with markers
return FALSE;
}
//end errors

$qstring = str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
$qstring = str_replace('#','',$qstring); //no need for floating markers now

//format $column including concatenation as needed
if (!is_array($column)) { //must switch string to array
$skipconcat=1; //skip concatenation later
$column = array($column); //convert to array
}
foreach($column as $colk=>$col) { //format each column individually
$col = '`'.$col.'`'; //format as column
if ($ifnull==1) {
$col = 'IFNULL('.$col.',\'\')'; //avoid null errors
}
$column[$colk] = $col; //store
}
$column = implode(', ',$column); //implode them into a list
if (!isset($skipconcat)) { //concatenate them:
$column = 'concat('.$column.')';
}

//escape data, then format as a condition
foreach($qwords as $qwkey=>$qword) { //go through the words
//escaping:
$qword = mysql_real_escape_string($qword); //make safe!
$qword = str_replace('%','\\%',$qword); //escape % so it's not a wildcard
$qword = str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
//format as condition:
$not = in_array($qwkey,$negwords)?' NOT ':' '; //is it negated?
//handle case sensitivity: lcase matching normal; in quotes is case sensitive
$lcase = ''; //set to blank by default
if (!in_array($qwkey,$cswords)) { //this is a non-case-sensitive word [otherwise do nothing special]
$lcase = 'lcase('; //set col prefix
$not = ')'.$not; //add col end to trailing $not var
$qword = mb_strtolower($qword); //make string lowercase
}
$qwords[$qwkey] = $lcase.$column.$not.'LIKE \'%'.$qword.'%\''; //format as a mysql query condition
}

//now that the terms and query are setup separately, time to put them together
$invar=0; //tracker value
$q = ''; //reset as blank string for output
for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
$invar=1; //set as compiling key
$qwkey = ''; //setup blank string as key
}
elseif ($qstring[$x]=='*'&&$invar==1) { //key end reached
$q .= $qwords[$qwkey]; //add the word to the query
$invar=0; //reset
}
elseif($invar==1) { //continue compiling replace key
$qwkey .= $qstring[$x];
}
else { //it's just syntax
$q .= $qstring[$x];
}
}

return $q; //return the final CONDITION(s) for WHERE; use as required
}

?>


Here are the improvements from the last post (and all of the rest still applies):
1. Search is caseless except in quoted input.

2. I have also added a "$maxterms" parameter.
This prevents huge queries that overload the server. It's a number or 0 to turn this off and allow infinite terms, but I don't recommend that.

3. $column (formerly 'field') is now able to take both array and string arguments.
If it is a string then a single field is searched; if it is an array then these fields are concatenated and all are searched.

Also, I made an optional "$isnull" parameter in the function that checks to see if you want to convert NULL fields to an empty string. You can turn this off and I bet it'll run faster (that's why I made it optional) but the default is that it's on to limit any possible errors.

Thanks a lot to James for showing me the mysql syntax for all of this.



Note: this is still untested for a actual MySQL query usage-- I've just been designing it to get the correct condition for a where statement generated.
I'll be testing it soon (probably later tonight).



I'm still a bit wary about all of this but overall I'm very happy with it. If you guys have any ideas, then let me know.

djr33
05-06-2010, 03:52 AM
It works!

I'm excited about this. I've been testing this on an active database and as far as I can tell it works really well. The query is generated in what looks (on casual inspection) correct, and the results seem reasonable. I obviously haven't manually cofirmed that all of the results are matches, but I see no reason they are not. The query is well formed and it works while searching.


The time required in searching is not wonderful, but it's also not too worrying.
For the average query it's about .5 seconds on my server.
It is a bit faster if you just have one term. But if you have more than that it only gets slightly slower. Even at several layers and about 5 terms it's still around .5 seconds (hasn't yet been over 1 sec).


I think I've actually created a workable search tool here :)

Now on to the rest of the wiki, but after this I'm optimistic.

traq
05-06-2010, 04:02 AM
If you don't mind, I'd love to experiment with testing this out. I'll share my results, of course.

About whole words vs. partial words: could you modify this so that if a single word is "quoted", the query condition will by '% word %' (note the spaces) instead of '%word%'?

Hmm... actually, that would be complicated. It would only work for words in the middle of a sentence (not touching punctuation, for example). Even the last word in a record might not be matched if it were not followed by a space...

Maybe it would be best served in post-processing (running a regex on the result set, for example, to only include the results that are not touching other letters -and therefore, part of a larger word...).


All of this can be layered into very complex setups and combinations of the logical operators:
a -(b (c OR d))
This generates a search for something that:
1) contains a
2) does not: contain a AND (c OR d)

What goes on with "b" in the example above?

djr33
05-06-2010, 04:35 AM
Ah, oops. Too much typing. I meant "b" in that second line. It's fixed in my post above and here it is again:
a -(b (c OR d))
This generates a search for something that:
1) contains a
2) does not: contain b AND (c OR d)
Or, a single line:
a AND NOT (b AND (c OR d))

It can be layered as much as you'd like. This is a "simple" query compared to the theoretical bounds of what you can enter. The only limit is the $maxterms parameter to save the server from overload. Turn that off if you'd like.

Yes, of course, feel free to test. I am going to paste a working self-contained test page below here so that you can search. The only problem is that you must configure a database and have enough data to make this be a realistic test. Can't help you there, though.


I have found one problem: while the results are cool and clearly accurate to some degree, something is wrong with 'NOT' as I feared it might be before.
The result set of -a -b does not match the result set of -(a b).
These generate the queries of:
-a -b
lcase(`body`) NOT LIKE '%a%' AND lcase(`body`) NOT LIKE '%b%'
-(a b)
NOT (lcase(`body`) LIKE '%a%' AND lcase(`body`) LIKE '%b%')

I also tried adding parentheses around each condition like (`col` LIKE 'X') but that didn't fix anything.

I'm going to keep testing to try to figure out exactly how this works then see if it's possible to fix...



Partial words are very complicated. This would need to be handled by parsing the database text itself. You could also include 3+ queries for each (word between spaces, before punctuation, after punctuation), but this would get messy and I didn't want to deal with it. Additionally, this is even more complex because in quotes you can specific that aspect as well.
For my purposes I also require that this works across all languages so I didn't want to add too many unnecessary complications: the specific operators are sensitive and anything else is a "content" character.
Of course if you do want to search for just "good" and not "goodbye" you can search for "good " manually.
The odds of lots of false positives are low especially if you give a few terms.
If there's a way to work this into the system that might help, but it's beyond this particular aspect I think. Parsing them out after this is complete might be one way to do it, though that would be really tough because you'd no longer have access to the terms from within the string parser fucntion's scope.


EDIT: The case-insensitivity may not be working. More info soon.

djr33
05-06-2010, 04:51 AM
Here's the standalone page as promised:
<?php


$t1 = microtime(1); //start counting...

//connect to DB
mysql_connect('HOST','USER','PASS') or die ('db conn problem'); /////EDIT THIS
mysql_select_db('DBNAME'); /////EDIT THIS



//MultiByte String setup
//setup php for working with Unicode data
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');


function mbstring2chararray($s,$maxcharlen=4) {
if ($maxcharlen!=4) {
echo 'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
//this should be later fixed, but the syntax is hard...
}
$o = array();
for($x=0;$x<strlen($s);$x++) {
$char = $s[$x];

$skipx = 0; //assume skipping no characters
//now go through each byte number:
//if the sequence of bytes is one character, then group them:
//a 2byte char
if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
$char = $s[$x].$s[$x+1]; $skipx = 1;
}
//a 3byte char
if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2]; $skipx = 2;
}
//a 4byte char
if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx = 3;
}
///will not currently do more than 4byte chars!

//how many following letters [and loop cycles] do we skip?
$x = $x+$skipx;

$o[] = $char; //store it
}
return $o; //return final array
}

//transform a user search to a mysql code 'WHERE' string
function search2mysql($q,$column,$maxterms=10,$ifnull=1) {
//$q string to convert;
//$column [string or string array to concatenate] to be searched;
//$maxterms limits search length to prevent overloading the server; 0 is unlimited;
//$ifnull 0 or 1: compensate for possible null values? default is YES
if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
$q = mbstring2chararray($q); //split query into [multibyte] character array
$qwords = array(); //setup an array to store the content 'words'
$ac=0; //setup array counter for search content array
$inquote = 0; //setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
$plevel = 0; //setup counter for levels of parenthetical embedding
$qstring = ''; //setup a blank term for the query syntax, later to get content too
$newword = 1; //status for whether we're at a word boundary
$negwords = array(); //setup blank array for tracking negative terms ("NOT")
$cswords = array(); //setup blank array for tracking case-sensitive terms [within quotes]
$qlen = count($q); //how many characters?
for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
//open quote: not in quotes and at word boundary
if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
$inquote=1;
}
//end quote: found while at the end of a word and while in quotes
elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
$inquote=0;
$cswords[] = $ac; //add this to a list of words to keep case sensitive
}
//start parentheses: at new word and not in quotes
elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
$qstring .= '#('; //add the parenthesis and marker [for later]
$plevel++;
if ($plevel<0) { //error: can't have close before open
return FALSE;
}
}
//end parentheses: at end of word and not in quotes
elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
$qstring .= ')#'; //add the parenthesis and marker [for later]
$plevel--;
}
//a space is a break unless in a quote:
elseif ($q[$l]==' '&&$inquote==0) {
$newword=1;
$ac++;
}
//do OR which must be both the beginning and end of a word:
elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
$l++; //skip past the R
$l++; //skip past extra space to avoid issues
$qstring .= ' OR ';
}
//deal with - marker for "NOT" terms
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$negwords[] = $ac;
}
//deal with - marker for "NOT" parentheticals
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$qstring .= '#NOT ';
}
else { //not syntactic, must be content
if ($newword==1) { //begin a new word
$newword=0; //no longer a new word
$qwords[$ac] = ''; //setup blank string
$qstring .= '#*'.$ac.'*#'; //add the reference + markers [for later]
}
$qwords[$ac] .= $q[$l]; //add the letter
}
}
//errors:
if ($inquote==1) { //uneven quotes
return FALSE;
}
if ($plevel!=0) { //uneven parentheses
return FALSE;
}
if (in_array('',$qwords)) { //a blank word was found: input is badly formatted
return FALSE;
}
if (count($qwords)==0) { //no blank searches...
return FALSE;
}
if ($maxterms!=0&&count($qwords)>$maxterms) { //over max length: prevent overload
return FALSE;
}
if (strpos($qstring,'###')!==FALSE) { //weird error with markers
return FALSE;
}
//end errors

$qstring = str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
$qstring = str_replace('#','',$qstring); //no need for floating markers now

//format $column including concatenation as needed
if (!is_array($column)) { //must switch string to array
$skipconcat=1; //skip concatenation later
$column = array($column); //convert to array
}
foreach($column as $colk=>$col) { //format each column individually
$col = '`'.$col.'`'; //format as column
if ($ifnull==1) {
$col = 'IFNULL('.$col.',\'\')'; //avoid null errors
}
$column[$colk] = $col; //store
}
$column = implode(', ',$column); //implode them into a list
if (!isset($skipconcat)) { //concatenate them:
$column = 'concat('.$column.')';
}

//escape data, then format as a condition
foreach($qwords as $qwkey=>$qword) { //go through the words
//escaping:
$qword = mysql_real_escape_string($qword); //make safe!
$qword = str_replace('%','\\%',$qword); //escape % so it's not a wildcard
$qword = str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
//format as condition:
$not = in_array($qwkey,$negwords)?' NOT ':' '; //is it negated?
//handle case sensitivity: lcase matching normal; in quotes is case sensitive
$lcase = ''; //set to blank by default
if (!in_array($qwkey,$cswords)) { //this is a non-case-sensitive word [otherwise do nothing special]
$lcase = 'lcase('; //set col prefix
$not = ')'.$not; //add col end to trailing $not var
$qword = mb_strtolower($qword); //make string lowercase
}
$qwords[$qwkey] = $lcase.$column.$not.'LIKE \'%'.$qword.'%\''; //format as a mysql query condition
}

//now that the terms and query are setup separately, time to put them together
$invar=0; //tracker value
$q = ''; //reset as blank string for output
for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
$invar=1; //set as compiling key
$qwkey = ''; //setup blank string as key
}
elseif ($qstring[$x]=='*'&&$invar==1) { //key end reached
$q .= $qwords[$qwkey]; //add the word to the query
$invar=0; //reset
}
elseif($invar==1) { //continue compiling replace key
$qwkey .= $qstring[$x];
}
else { //it's just syntax
$q .= $qstring[$x];
}
}

return $q; //return the final CONDITION(s) for WHERE; use as required
}


///do the stuff and setup the output!
$out='';
if (isset($_POST['query'])) {
$condition = search2mysql($_POST['query'],array('COL1','COL2')); /////EDIT THIS
$r = mysql_query("SELECT * FROM `YOURTABLE` WHERE $condition ORDER BY `ORDERCOLUMN`;"); ////EDIT THIS
$out = '<hr><b>Results found: '.mysql_num_rows($r).'</b><hr>';
$count=0;
while (($result = mysql_fetch_assoc($r))&&$count<10) {
extract($result);
$out .= "<b>$COL1<b><hr>";
$count++;
}
$out .= microtime(1)-$t1;
}



?>
<html>
<head>
<title>search test</title>
</head>
<body>
<form action="?" method="post">
Query: <textarea name="query"><?php if (isset($_POST['query'])) { echo $_POST['query']; } ?></textarea> <input type="submit" value="submit">
</form>
<p style="font-family:courier;font-size:80%;"><?php echo isset($condition)?($condition===FALSE?'error':$condition):''; ?></p>
<p><?php echo $out; ?></p>
</body>
</html>


How to set it up:
1. At the top you must setup your database connection. Do that as you'd like. 2 lines to edit using my current syntax.
2. Near the bottom (in the last bit of PHP) you must edit two lines to reflect your DB's structure.



I'm well aware that the HTML there is lazy and that the PHP even (aside from the function) may be messy. But it works for testing.


This uses mysql_real_escape_string(), so I think it's reasonably safe, but I certainly wouldn't advise yet doing anything public with it. Select statements luckily aren't that dangerous, but be a bit careful...



Also, if you have any trouble reading it (it's well commented, but also very complex), feel free to ask. The overall logic of the situation is:
1. Go through character by character and split up the input into content bits and syntax bits.*
2. Toss the syntax bits into a pseudo query without any content, leaving some markers for later
3. Toss the content bits into an array of 'words' to search for later, keeping references [array keys] to the markers in the syntax
4. Parse all of this now individually: A) make sure the syntax of the query is not badly formed; B) parse all of the "words" into the syntax for a "LIKE" condition in MySQL
5. Now that the parts are ready, replace the markers with the corresponding "words", and put the whole query back together.
(6. Place this after 'WHERE' in a MySQL query and do the search.)

(*Regex might seem better here, but due to the immense recursive complexity here, I'm not even going to think about trying that.)

traq
05-06-2010, 06:35 AM
something is wrong with 'NOT' as I feared it might be before.
The result set of -a -b does not match the result set of -(a b).
These generate the queries of:
-a -b
lcase(`body`) NOT LIKE '%a%' AND lcase(`body`) NOT LIKE '%b%'
-(a b)
NOT (lcase(`body`) LIKE '%a%' AND lcase(`body`) LIKE '%b%')


Are they supposed to match? On a DB with 7 entries ("a", "b", "c", "ab", "ac", "bc", "abc"), these are my result-sets:
-a -b:

Array
(
[id] => 7
[content] => c
)


-(a b):

Array
(
[id] => 3
[content] => bc
)
Array
(
[id] => 4
[content] => ac
)
Array
(
[id] => 5
[content] => a
)
Array
(
[id] => 6
[content] => b
)
Array
(
[id] => 7
[content] => c
)

They seem to be working fine; they're just not equivalent statements. -a -b ("not-a and not-b") returns the entry that has neither a nor b; where -(a b) ("not a-and-b") returns the five entries that don't have a and b. Or am I misunderstanding what your syntax is supposed to mean?

djr33
05-06-2010, 06:36 AM
Responding to the issue of negation with "NOT" above, I was wrong.
MySQL is NOT wrong here.
Here's why:
I was not following the rules of formal logic:
-(a b) is NOT the same as -a -b.
It is instead the same as -a OR -b
(Think of this as the "opposite" of "AND"=>"OR", and vice versa.)

While this is confusing to deal with it's not in fact an error in mysql (or my algorithm).
So that's fixed, if confusing for users.


EDIT: Traq, that's exactly correct. Just as you posted this I had figured out what was wrong.

traq
05-06-2010, 06:41 AM
Just as you posted this I had figured out what was wrong.
You mean, "nothing"?

:) I spent a long time staring at those two statements before I could figure out what I was even looking for. Once I did, though, everything suddenly struck me as completely logical.

djr33
05-06-2010, 06:42 AM
Exactly. Once you can think in logic, everything logical is logical. It's a loophole. And human brains don't natively think in logic... just computers.


Edit: haha, yes, I mean nothing. Well, something was wrong, but in my head, not in the SQL.

traq
05-06-2010, 04:15 PM
haha, yes, I mean nothing. Well, something was wrong, but in my head, not in the SQL.
so, in your query syntax, is nothing equivalent to "nothing" (...or -"something")?

djr33
05-06-2010, 05:10 PM
Sorry, I don't understand. 'nothing' is not a keyword so if you type that it would be searching for the word "nothing" the same as if you searched for "cars".

nothing (no quotes) is searched in the default way: it's caseless and parsed like normal.
"nothing" (quotes) is searched literally and must match in case; it's also NOT parsed. This isn't clear here, but if you have syntax in it, it'll not parse in the quotes. For example:
Harry Potter will find any articles containing harry and also potter (caseless). "Harry Potter" will find any articles containing the two words TOGETHER in that order with one space between them and also capital letters.



However, if you are asking about the absense of any text in the search-- that is a "blank" query-- then the result will be an error :)
I set this up so that it requires at least one search term. If not, you'll just be reading everything from the database which seems stupid.
Of course that's easy enough to change-- just take out a few of the errors (I'd have to look through the code again to see which specifically), or even just add a line at the top of the function if($q=='') { return ''; }

If all of the errors were removed for that, I believe it would simply return nothing at all, an empty string, so you would end up with a broken WHERE clause and no conditions. You could fix this of course.



Current status: today I'm going to figure out why cases are always ignored. Quoted material should be searched case-sensitively... but it's currently not. That should be the last major issue from what I can tell.

traq
05-06-2010, 10:33 PM
Sorry: I was just kidding around. see my edited post above.

I've run a couple queries using your script; it's working as expected/ described so far (though I haven't run it on a live database yet). Amazing job. :thumbsup:

About case-insensitivity: are you sure you want it? people don't necessarily use proper capitalization when entering search terms - I'm thinking case-sensitive searches might exclude desired results in many cases. You could make it an option (e.g., [ ] make this search Case-Sensitive )

djr33
05-06-2010, 10:55 PM
Oh, haha. Well it brought up a valid point: currently it throws an error if nothing is searched for and that's kinda awkward.

I'm thinking that I will instead not throw any errors directly but return an array with 'query' as the final result, 'error' as the error if any, 'numterms' as how many terms were searched, etc. THEN you can take those values and decide what to do, like stop if there was an error or too many terms were used.

Thanks, and yes, I'm excited this is working.

I actually do want to keep it case-insensitive most of the time, but for quoted material I thought it might be better. However, I have figured out a few things:
1. People are really lazy. They'll put "harry potter" instead of "Harry Potter", so actually even in quoted material it won't be that bad to have it be case-less.
2. My database is currently configured for caseless searches. There's nothing I can do except modifying the collations of the tables while searching them. This seems really messy and like it'll slow things a lot.
3. Since for my purposes this will work across many languages, the specifics of English big and small letters really isn't all that important :p

Because of this, I think I'll add an extra parameter of $casemode:
'0': do nothing to cases (use the database default, either sensitive or insensitive)
'1': do the entire search case-insensitive (regardless of DB config)
'2': do regular terms insensitive and quoted material sensitive (as it is setup now).

BUT: if the DB is configured to ignore case anyway, then all 3 of these will do the exact same search every time. So 0 will be best because it will make things faster: no need to convert all of the fields to lcase() for every search.

(So if someone feels like dealing with a case sensitive search [for ANY DB configuration] and adding this function to the search, have fun with that, and let me know how it turns out.)


I'm going to rework the function a bit with the $casemode parameter and the array output (errors, query, numterms). That'll make it easier to work with, I think.

djr33
06-29-2011, 11:36 PM
I'm coming back to this after a while because I have tested the code and have some updates.

<?php

//"BASIC SEARCH"
//transform a user search to a mysql code for creating temp RANKING column
function search2mysql_rank($q,$column,$wmin=3,$maxterms=10,$ifnull=1) {
//$q string to convert;
//$column [string or string array to concatenate] to be searched;
//$maxterms limits search length to prevent overloading the server; 0 is unlimited;
//$ifnull 0 or 1: compensate for possible null values? default is YES
if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
$q = explode(' ',$q); //split by words
foreach($q as $qk=>$qv) { //go through them all to check format/content of each word
if ($qv=='') { unset($q[$qk]); } //remove blanks
if (strpos($qv,'-')===0) { $qv = substr($qv,1); } //remove negation
else if (($wmin!=0)&&(strlen($qv)<$wmin)) { //is the term too short?
unset($q[$qk]); //remove it
}
}
//check for errors:
if (count($q)==0) { //no blank searches...
return FALSE;
}
if ($maxterms!=0&&count($q)>$maxterms) { //over term limit
return FALSE;
}
//end errors

//format $column including concatenation as needed
if (!is_array($column)) { //must switch string to array
$skipconcat=1; //skip concatenation later
$column = array($column); //convert to array
}
foreach($column as $colk=>$col) { //format each column individually
$col = '`'.$col.'`'; //format as column
if ($ifnull==1) {
$col = 'IFNULL('.$col.',\'\')'; //avoid null errors
}
$column[$colk] = $col; //store
}
$column = implode(', ',$column); //implode them into a list
if (!isset($skipconcat)) { //concatenate them:
$column = 'concat_ws(\' \','.$column.')';
}
$column = 'lcase('.$column.')'; //lowercase for caseless search

//format terms:
$negcount=0; //no negative terms, yet
foreach($q as $qk=>$qv) {
//escaping:
$qv = mysql_real_escape_string($qv); //make safe!
$qv = str_replace('%','\\%',$qv); //escape % so it's not a wildcard
$qv = str_replace('_','\\_',$qv); //escape _ so it's not a wildcard

//handle 'not':
$not = '';
if (strpos($qv,'-')===0) { //this term is negated
$qv = substr($qv,1); //remove the negative operator
$not = ' NOT'; //add for query
$negcount++; //for later use
}

$qv = mb_strtolower($qv); //lowercase for caseless search

//format as parameter:
$q[$qk] = 'IF('.$column.$not.' LIKE \'%'.$qv.'%\','.($not==''?1:'0.999').',0)';
}

if ($negcount==count($q)) { //only negative terms?
return FALSE; //not allowed
}

$q = implode('+',$q); //sum these up

if ($negcount>1) { //negative terms?
$q = $q.'-'.($negcount-1); //ignore results based on only the negative
}

return $q; //return the final OPERATIONS(s) for GENERATING COLUMN; use as required
}


//"ADVANCED SEARCH"
//transform a user search to a mysql code 'WHERE' string
function search2mysql_adv($q,$column,$wmin=3,$maxterms=10,$ifnull=1) {
//$q string to convert;
//$column [string or string array to concatenate] to be searched;
//$maxterms limits search length to prevent overloading the server; 0 is unlimited;
//$ifnull 0 or 1: compensate for possible null values? default is YES
if (!is_string($q)||$q=='') { return FALSE; } //must have a [nonblank] string...
$q = mbstring2chararray($q); //split query into [multibyte] character array
$qwords = array(); //setup an array to store the content 'words'
$ac=0; //setup array counter for search content array
$inquote = 0; //setup for check for being within quoted [literal] search data: 0 [false] or 1 [true]
$plevel = 0; //setup counter for levels of parenthetical embedding
$qstring = ''; //setup a blank term for the query syntax, later to get content too
$newword = 1; //status for whether we're at a word boundary
$negwords = array(); //setup blank array for tracking negative terms ("NOT")
$cswords = array(); //setup blank array for tracking case-sensitive terms [within quotes]
$qlen = count($q); //how many characters?
for($l=0;$l<$qlen;$l++) { //go through each character of the input and parse:
//open quote: not in quotes and at word boundary
if ($q[$l]=='"'&&$newword==1&&$inquote==0) {
$inquote=1;
}
//end quote: found while at the end of a word and while in quotes
elseif ($q[$l]=='"'&&$inquote==1&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')) {
$inquote=0;
$cswords[] = $ac; //add this to a list of words to keep case sensitive
}
//start parentheses: at new word and not in quotes
elseif ($q[$l]=='('&&$newword==1&&$inquote==0) {
$qstring .= '#('; //add the parenthesis and marker [for later]
$plevel++;
if ($plevel<0) { //error: can't have close before open
return FALSE;
}
}
//end parentheses: at end of word and not in quotes
elseif ($q[$l]==')'&&($l==($qlen-1)||$q[$l+1]==' '||$q[$l+1]==')')&&$inquote==0) {
$qstring .= ')#'; //add the parenthesis and marker [for later]
$plevel--;
}
//a space is a break unless in a quote:
elseif ($q[$l]==' '&&$inquote==0) {
$newword=1;
$ac++;
}
//do OR which must be both the beginning and end of a word:
elseif ($newword==1&&$q[$l]=='O'&&($l<($qlen-1))&&$q[$l+1]=='R'&&($l==($qlen-2)||$q[$l+2]==' '||$q[$l+1]=='(')&&$inquote==0) {
$l++; //skip past the R
$l++; //skip past extra space to avoid issues
$qstring .= ' OR ';
}
//deal with - marker for "NOT" terms
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]!='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$negwords[] = $ac;
}
//deal with - marker for "NOT" parentheticals
elseif ($newword==1&&$q[$l]=='-'&&($l<($qlen-1))&&$q[$l+1]=='('&&$inquote==0&&($l==0||$q[$l-1]!='-')) {
$qstring .= '#NOT ';
}
else { //not syntactic, must be content
if ($newword==1) { //begin a new word
$newword=0; //no longer a new word
$qwords[$ac] = ''; //setup blank string
$qstring .= '#*'.$ac.'*#'; //add the reference + markers [for later]
}
$qwords[$ac] .= $q[$l]; //add the letter
}
}
//errors:
if ($inquote==1) { //uneven quotes
return FALSE;
}
if ($plevel!=0) { //uneven parentheses
return FALSE;
}
if (in_array('',$qwords)) { //a blank word was found: input is badly formatted
return FALSE;
}
$qwordstemp = $qwords; //copy to check min length for words and query:
foreach($qwordstemp as $qk=>$qv) { //go through them all
if ($wmin!=0&&strlen($qv)<$wmin) { //are all terms too short? --> too hard right now to eliminate short terms, but can require that at least some are long enough
unset($qwordstemp[$qk]); //remove it
}
}
if (count($qwordstemp)==0) { //no blank searches... at least some terms must be valid: exist and be long enough, see above
return FALSE;
} //end min length check
if ($maxterms!=0&&count($qwords)>$maxterms) { //over max length: prevent overload
return FALSE;
}
if (count($negwords)==count($qwords)) { //all words are negative?
return FALSE;
}
if (strpos($qstring,'###')!==FALSE) { //weird error with markers
return FALSE;
}
if (strpos($qstring,'()')!==FALSE) { //input had parentheses with no content
return FALSE;
}
//end errors

$qstring = str_replace('##',' AND ',$qstring); //adjacent terms need to be conjoined with AND
$qstring = str_replace('#','',$qstring); //no need for floating markers now

//format $column including concatenation as needed
if (!is_array($column)) { //must switch string to array
$skipconcat=1; //skip concatenation later
$column = array($column); //convert to array
}
foreach($column as $colk=>$col) { //format each column individually
$col = '`'.$col.'`'; //format as column
if ($ifnull==1) {
$col = 'IFNULL('.$col.',\'\')'; //avoid null errors
}
$column[$colk] = $col; //store
}
$column = implode(', ',$column); //implode them into a list
if (!isset($skipconcat)) { //concatenate them:
$column = 'concat_ws(\' \','.$column.')';
}

//escape data, then format as a condition
foreach($qwords as $qwkey=>$qword) { //go through the words
//escaping:
$qword = mysql_real_escape_string($qword); //make safe!
$qword = str_replace('%','\\%',$qword); //escape % so it's not a wildcard
$qword = str_replace('_','\\_',$qword); //escape _ so it's not a wildcard
//format as condition:
$not = in_array($qwkey,$negwords)?' NOT ':' '; //is it negated?
//handle case sensitivity: lcase matching normal; in quotes is case sensitive
$lcase = ''; //set to blank by default
if (!in_array($qwkey,$cswords)) { //this is a non-case-sensitive word [otherwise do nothing special]
$lcase = 'lcase('; //set col prefix
$not = ')'.$not; //add col end to trailing $not var
$qword = mb_strtolower($qword); //make string lowercase
}
$qwords[$qwkey] = $lcase.$column.$not.'LIKE \'%'.$qword.'%\''; //format as a mysql query condition
}

//now that the terms and query are setup separately, time to put them together
$invar=0; //tracker value
$q = ''; //reset as blank string for output
for($x=0;$x<strlen($qstring);$x++) { //go through the whole query string
if ($qstring[$x]=='*'&&$invar==0) { //start compiling a key
$invar=1; //set as compiling key
$qwkey = ''; //setup blank string as key
}
elseif ($qstring[$x]=='*'&&$invar==1) { //key end reached
$q .= $qwords[$qwkey]; //add the word to the query
$invar=0; //reset
}
elseif($invar==1) { //continue compiling replace key
$qwkey .= $qstring[$x];
}
else { //it's just syntax
$q .= $qstring[$x];
}
}

return $q; //return the final CONDITION(s) for WHERE; use as required
}

?>

[continued below]

djr33
06-29-2011, 11:37 PM
[Continued]

You will also need to define the following function as a resource that's used in the script above.

function mbstring2chararray($s,$maxcharlen=4) {
if ($maxcharlen!=4) {
echo 'WARNING: mbstring2chararray() can only do [up to] 4byte chars now.';
$maxcharlen=4; //default
//this should be later fixed, but the syntax is hard...
}
$o = array();
for($x=0;$x<strlen($s);$x++) {
$char = $s[$x];

$skipx = 0; //assume skipping no characters
//now go through each byte number:
//if the sequence of bytes is one character, then group them:
//a 2byte char
if (($x<(strlen($s)-1))&&mb_strlen($s[$x].$s[$x+1])===1) {
$char = $s[$x].$s[$x+1]; $skipx = 1;
}
//a 3byte char
if (($x<(strlen($s)-2))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2]; $skipx = 2;
}
//a 4byte char
if (($x<(strlen($s)-3))&&mb_strlen($s[$x].$s[$x+1].$s[$x+2].$s[$x+3])===1) {
$char = $s[$x].$s[$x+1].$s[$x+2].$s[$x+3]; $skipx = 3;
}
///will not currently do more than 4byte chars!

//how many following letters [and loop cycles] do we skip?
$x = $x+$skipx;

$o[] = $char; //store it
}
return $o; //return final array
}(This function helps to deal with non-western character ranges in UTF8 for other languages than English.)



As you can see in the first set of code, there are actually two functions. One is faster and more accurate without any advanced control from the user; the other allows the user to control all of the specific details of the search but it is a little less accurate without specification (as a default), and it's also a little slower. In my site, I use the basic mode as the default and the advanced mode as an option for advanced users.
The two different functions work very differently but achieve the same result.


Some demos:

//BASIC MODE
$ranking = search2mysql_rank($INPUT,'my_column',3,10);
if ($ranking == FALSE) { //probably no term recognized
$error = 'Error in search format';
}
else {
$q = 'SELECT *,('.$ranking.') as `ranking` FROM `my_table` HAVING `ranking` >= 1 ORDER BY `ranking` DESC;';

//ADVANCED MODE:
$condition = search2mysql_adv($INPUT,'my_column',3,10);
if ($condition == FALSE) { //probably no term recognized
$error = 'Error in search format';
}
else {
$q = 'SELECT * FROM `my_table` WHERE '.$condition.';';
}

Format of user input:
Basic mode: just a list of terms, no special characters. One exception. For negating (removing) a term, use this: -term
Advanced mode: Quotes " group multi-word terms into single terms; parentheses () group sets of terms together for and/or logic; OR is or and and is implied by a space; -term removes the term "term".


Error correction is not implemented, so if you need that you'll have to try it yourself at least until I get the time to go into all of it.