Tokenizing Search Queries of Quoted Strings in PHP

by Charles Iliya Krempeaux, published on Thu Oct 4th, 2007

Ever wanted to handle search-engine-style queries in PHP‽

Today, anyone who has used a Web-based search engine with any bit of sophistication has come across quoted strings.

For example, people might search for...

  • "halo 3"
  • "video blogging" software
  • "Internet TV"
  • "Zend Avesta" "english translation"

Often people do searches like these, using quoted strings, to improve the results they get from the search engine. Quoting restricts the results to those that include each phrase contained in the quoted string(s) as a whole.
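As a rough illustration of what matching a phrase "as a whole" means, here is a minimal sketch. The helper name matches_all_tokens() and the sample documents are hypothetical, made up for this example, and a real search engine would use an index rather than plain substring checks.

```php
<?php
// Hypothetical helper (not part of PHP): TRUE when every token occurs in
// $text as one unbroken, case-insensitive substring.
function matches_all_tokens($text, $tokens)
{
    foreach ($tokens as $token) {
        if (FALSE === stripos($text, $token)) {
            return FALSE;
        }
    }
    return TRUE;
}

// Made-up sample documents.
$docs = array(
    'Free video blogging software for beginners',
    'Blogging about video editing software',
);

// Tokens for the quoted query:   "video blogging" software
$phrase = array('video blogging', 'software');
// Tokens for the unquoted query: video blogging software
$words  = array('video', 'blogging', 'software');

foreach ($docs as $doc) {
    echo $doc, "\n";
    echo '  quoted query:   ', (matches_all_tokens($doc, $phrase) ? 'match' : 'no match'), "\n";
    echo '  unquoted query: ', (matches_all_tokens($doc, $words) ? 'match' : 'no match'), "\n";
}
```

In this sketch the second document contains all three words, so it satisfies the unquoted query, but it is rejected by the quoted one because "video blogging" never appears there as a single phrase.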

And while this paradigm for getting queries from the user is commonplace, and seems to have become ubiquitous... AFAIK, PHP does NOT have a built-in function to tokenize such queries.

This article provides you with a function you can use to do just that....

Query to Tokens

Here's the code....


    function querytotokens($q)
    {
        //
        // Check parameters.
        //
            if (  !isset($q) || FALSE === $q || !is_string($q)  ) {
                // Error.
                return FALSE;
            }
     
        //
        // Get the tokens from the query.
        //
                $x = trim($q);

                // SHORT CIRCUIT
                if (  '' === $x  ) {
    /////////////// RETURN
                    return array();
                }
       
                $chars = str_split($x);
                $mode = 'normal';
                $token = '';
                $tokens = array();
                for ($i=0;$i<count($chars);$i++) {
       
                    switch ($mode) {
                        case 'normal':
                            if (  '"' == $chars[$i]  ) {
                                if ( '' != $token) {
                                    $tokens[] = $token;
                                }
                                $token = '';
                                $mode = 'quoting';
                            } else if (  ' ' == $chars[$i] || "\t" == $chars[$i] || "\n" == $chars[$i]  ) {
                                if ( '' != $token) {
                                    $tokens[] = $token;
                                }
                                $token = '';
                            } else {
                                $token .= $chars[$i];
                            }
                        break;
       
                        case 'quoting':
                            if (  '"' == $chars[$i]  ) {
                                if ( '' != $token) {
                                    $tokens[] = $token;
                                }
                                $token = '';
                                $mode = 'normal';
                            } else {
                                $token .= $chars[$i];
                            }
                        break;
       
                    } // switch
       
                } // for
                if ( '' != $token) {
                    $tokens[] = $token;
                }


        //
        // Return.
        //
            return $tokens;
    }   
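For comparison, the same idea can be sketched more compactly with a regular expression. This is an alternative, not the function above: the name querytotokens_regex() is hypothetical, and it treats any whitespace matched by \s (including "\r") as a separator, whereas querytotokens() splits only on space, tab, and newline.

```php
<?php
// Hypothetical regex-based alternative to querytotokens(); a sketch only.
function querytotokens_regex($q)
{
    if (!is_string($q)) {
        // Error, mirroring the FALSE return of querytotokens().
        return FALSE;
    }

    // Alternative 1: a quoted run (the closing quote is optional, so an
    //                unterminated quote still yields a token).
    // Alternative 2: a run of characters that are neither whitespace nor quotes.
    preg_match_all('/"([^"]*)"?|([^\s"]+)/', $q, $matches, PREG_SET_ORDER);

    $tokens = array();
    foreach ($matches as $m) {
        // $m[2] is present only when the bare-word alternative matched.
        $token = (isset($m[2]) && '' !== $m[2]) ? $m[2] : $m[1];
        if ('' !== $token) {
            // Empty quoted strings ("") are dropped, as in querytotokens().
            $tokens[] = $token;
        }
    }
    return $tokens;
}

print_r(querytotokens_regex('"video blogging" software'));
```

The loop-based version is longer but easier to extend (for example, with escape characters or other quoting rules), which may be a reason to prefer it.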
            
Examples

If you were to run the following code, which makes use of the querytotokens() function, you would get the output shown below it....


    $q1 = 'apple';
    $q2 = '"apple"';
    $q3 = 'seedless grapes';
    $q4 = '"seedless grapes"';
    $q5 = '"once upon a time" "snow white"';
    $q6 = '';
    $q7 = '  ';
    $q8 = 'aouei   "m g ng z r"    h d t c q "blfsn"';
    
    $t1 = querytotokens($q1);
    $t2 = querytotokens($q2);
    $t3 = querytotokens($q3);
    $t4 = querytotokens($q4);
    $t5 = querytotokens($q5);
    $t6 = querytotokens($q6);
    $t7 = querytotokens($q7);
    $t8 = querytotokens($q8);
            

That would give you...


    $t1 == array
           ( 0 => 'apple'
           );

    $t2 == array
           ( 0 => 'apple'
           );

    $t3 == array
           ( 0 => 'seedless'
           , 1 => 'grapes'
           );

    $t4 == array
           ( 0 => 'seedless grapes'
           );

    $t5 == array
           ( 0 => 'once upon a time'
           , 1 => 'snow white'
           );

    $t6 == array();

    $t7 == array();

    $t8 == array
           ( 0 => 'aouei'
           , 1 => 'm g ng z r'
           , 2 => 'h'
           , 3 => 'd'
           , 4 => 't'
           , 5 => 'c'
           , 6 => 'q'
           , 7 => 'blfsn'
           );
            