lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 28827] New: - QueryParser treats CJK and English query strings differently
Date Fri, 07 May 2004 12:16:15 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=28827>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=28827

QueryParser treats CJK and English query strings differently

           Summary: QueryParser treats CJK and English query strings
                    differently
           Product: Lucene
           Version: unspecified
          Platform: PC
        OS/Version: Windows NT/2K
            Status: NEW
          Severity: Major
          Priority: Other
         Component: QueryParser
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: ats37@hotmail.com


Since 1.3 final, the StandardAnalyzer returns each character in a string of CJK
characters as a separate token.  However, the generated QueryParser has its own
grammar, which doesn't take account of this.  So we get the following behaviour:

parse("one two three", "content", new StandardAnalyzer()) returns
'content:one content:two content:three', searching for each term individually.

parse("\"one two three\"", "content", new StandardAnalyzer()) returns
'content:"one two three"', searching for the phrase.

parse("C1C2C3", "content", new StandardAnalyzer()), where Cn is a Chinese
character, returns 'content:"C1 C2 C3"', when it should really be
'content:C1 content:C2 content:C3'.  This is inconsistent.

parse("\"C1C2C3\"", "content", new StandardAnalyzer()) also returns
'content:"C1 C2 C3"', identical to the previous case.

Although the string is split into separate CJK tokens (indicated by the spaces
between them), the query parser builds a phrase search from them rather than
individual term searches.  To get the desired query, the user has to enter
C1 C2 C3 (separated by spaces, without surrounding quotes) as the query string,
or I have to pre-process the query string in my code to add the spaces.  This is
non-intuitive.
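
The pre-processing workaround mentioned above could look roughly like the
following.  This is only a minimal sketch, not part of Lucene: the class and
method names are hypothetical, and it assumes that "CJK character" means a
character in the Unicode Han script, which may not match the exact ranges the
StandardAnalyzer grammar uses.  It inserts a space wherever a Han character is
directly adjacent to another letter or digit, so that QueryParser sees each
character as a separate term.

```java
// Hypothetical helper, not a Lucene class: spaces out Han characters in a
// query string before it is handed to QueryParser.
final class CjkQueryPreprocessor {
    private CjkQueryPreprocessor() {}

    // Assumption: only the Han script is treated as CJK here.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    static String spaceOutCjk(String query) {
        StringBuilder out = new StringBuilder(query.length() * 2);
        int prev = -1; // previous code point, -1 at the start of the string
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            // Insert a space only between two letters/digits where at least
            // one is Han.  Quotes, whitespace and operators are left alone.
            if (prev != -1
                    && (isHan(prev) || isHan(cp))
                    && Character.isLetterOrDigit(prev)
                    && Character.isLetterOrDigit(cp)) {
                out.append(' ');
            }
            out.appendCodePoint(cp);
            prev = cp;
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, an unquoted C1C2C3 becomes C1 C2 C3 before parsing, while a
quoted "C1C2C3" becomes "C1 C2 C3" and so should still be parsed as a phrase.
English-only query strings pass through unchanged.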

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
