lucene-dev mailing list archives

From "Sami Siren (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-1196) Incorrect matches when using non alphanumeric search string !@#$%\^\&\*\(\)
Date Tue, 17 Apr 2012 10:05:18 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated SOLR-1196:
-----------------------------

    Component/s:     (was: clients - java)
    
> Incorrect matches when using non alphanumeric search string !@#$%\^\&\*\(\)
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-1196
>                 URL: https://issues.apache.org/jira/browse/SOLR-1196
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 1.3
>         Environment: Solr 1.3/ Java 1.6/ Win XP/Eclipse 3.3
>            Reporter: Sam Michael
>
> When matching strings that do not include alphanumeric chars, all the data is returned
> as matches. (There is actually no match, so nothing should be returned.)
> When I run a query like (activity_type:NAME) AND title:(\!@#$%\^&\*\(\)), all the
> documents are returned even though there is not a single match. There is no title that matches
> the string (which has been escaped).
> My document structure is as follows:
> <doc>
> <str name="activity_type">NAME</str>
> <str name="title">Bathing</str>
> ....
> </doc> 
> The title field is of type text_title which is described below. 
> <fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
>                 catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
>                 expand="true"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
>                 catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType> 
> -----------------------------------------------------
> Yonik's analysis is as follows:
> <str name="rawquerystring">-features:foo features:(\!@#$%\^&\*\(\))</str>
> <str name="querystring">-features:foo features:(\!@#$%\^&\*\(\))</str>
> <str name="parsedquery">-features:foo</str>
> <str name="parsedquery_toString">-features:foo</str>
> The text analysis is throwing away non alphanumeric chars (probably
> the WordDelimiterFilter).  The Lucene (and Solr) query parser throws
> away term queries when the token is zero length (after analysis).
> Solr then interprets the left over "-features:foo" as "all documents
> not containing foo in the features field", so you get a bunch of
> matches. 
> As per his suggestion, a bug is filed.
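
The behavior in the analysis quoted above can be reproduced with a small toy model. This is only a sketch and not Solr or Lucene code: analyze(), parse() and search() are hypothetical helpers, the regex is a rough stand-in for the WhitespaceTokenizer / WordDelimiterFilter / LowerCaseFilter chain, and the sample documents are made up. It assumes that a clause whose text analyzes to zero tokens is silently dropped, and that a query left with only negative clauses matches every document not excluded by them.

```python
import re

def analyze(text):
    """Toy analyzer: split on whitespace, keep alphanumeric runs, lowercase.

    Only an approximation of the text_title chain in the report.
    """
    tokens = []
    for chunk in text.split():
        tokens.extend(part.lower() for part in re.findall(r"[A-Za-z0-9]+", chunk))
    return tokens

def parse(clauses):
    """Hypothetical parser: clauses are (occur, field, text) tuples.

    occur is "MUST", "SHOULD" or "MUST_NOT". A clause that analyzes to
    zero tokens is dropped, as described in the analysis above.
    """
    parsed = []
    for occur, field, text in clauses:
        tokens = analyze(text)
        if tokens:
            parsed.append((occur, field, tokens))
    return parsed

def clause_hits(doc, field, tokens):
    """True if any query token occurs in the analyzed document field."""
    doc_tokens = set(analyze(doc.get(field, "")))
    return any(token in doc_tokens for token in tokens)

def search(docs, parsed):
    """Return the documents matching the surviving clauses."""
    must = [c for c in parsed if c[0] == "MUST"]
    should = [c for c in parsed if c[0] == "SHOULD"]
    must_not = [c for c in parsed if c[0] == "MUST_NOT"]
    results = []
    for doc in docs:
        if any(clause_hits(doc, f, t) for _, f, t in must_not):
            continue  # excluded by a negative clause
        if not all(clause_hits(doc, f, t) for _, f, t in must):
            continue  # a required clause did not match
        if not must and should and not any(clause_hits(doc, f, t) for _, f, t in should):
            continue  # only optional clauses, and none matched
        results.append(doc)  # with no positive clauses left, everything else matches
    return results

# Made-up sample data for illustration.
docs = [
    {"features": "foo bar", "activity_type": "NAME", "title": "Bathing"},
    {"features": "baz", "activity_type": "NAME", "title": "Walking"},
]

# Models: -features:foo features:(\!@#$%\^&\*\(\))
negative_query = parse([
    ("MUST_NOT", "features", "foo"),
    ("SHOULD", "features", "!@#$%^&*()"),
])

# Models: (activity_type:NAME) AND title:(\!@#$%\^&\*\(\))
reported_query = parse([
    ("MUST", "activity_type", "NAME"),
    ("MUST", "title", "!@#$%^&*()"),
])

print(negative_query)
print(search(docs, negative_query))
print(search(docs, reported_query))
```

In this model the punctuation-only clause disappears in both queries, so the first query reduces to "-features:foo" and the second to "activity_type:NAME", which is why documents come back even though no title matches the escaped string.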
