lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: KeywordAnalyzer still getting tokenized on spaces
Date Tue, 09 Sep 2014 07:51:44 GMT
Hi,

The QueryParser does not analyze the whole query text with the analyzer. It first parses the
query syntax and then passes only those parts through the analyzer that the query parser
considers "tokens". If you want such an analyzer to be respected by the query parser, you may
need a different query parser with a simplified syntax (e.g. SimpleQueryParser).
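
To illustrate the order of operations, here is a minimal sketch in plain Java. It is only a toy
model and uses no Lucene classes: the class and method names are made up for this example, the
whitespace split is a simplified stand-in for the query syntax parsing, and keywordAnalyze is a
stand-in for an analyzer that keeps its whole input as one token (as KeywordAnalyzer does).

```java
// Toy model only: plain Java, no Lucene classes involved. All names are hypothetical.
final class ParseThenAnalyzeSketch {

    // Stand-in for KeywordAnalyzer: the whole input becomes a single token.
    static String[] keywordAnalyze(String text) {
        return new String[] { text };
    }

    // Models the query parser order: split on whitespace (syntax) first,
    // then hand each piece to the analyzer separately.
    static String[] parseThenAnalyze(String queryText) {
        String[] pieces = queryText.trim().split("\\s+");
        String[] terms = new String[pieces.length];
        for (int i = 0; i < pieces.length; i++) {
            terms[i] = keywordAnalyze(pieces[i])[0];
        }
        return terms;
    }

    // Models the analyze-only order: no syntax, the whole string goes to the analyzer.
    static String[] analyzeOnly(String queryText) {
        return keywordAnalyze(queryText);
    }

    public static void main(String[] args) {
        String queryText = "1023 4567 8765";
        System.out.println(parseThenAnalyze(queryText).length + " term(s) after parse-then-analyze");
        System.out.println(analyzeOnly(queryText).length + " term(s) after analyze-only");
    }
}
```

With the query text "1023 4567 8765", the first method yields three terms and the second yields
one, which matches the three boolean clauses reported in the quoted message.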

Ideally, if you want to just pass a text through an analyzer, you should not use a query parser
(because there is nothing to parse, just to analyze). So approach #2 is the right one. To
make it easier, Lucene contains the following class: 

http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/util/QueryBuilder.html

This one uses no syntax and just passes the string through the Analyzer to create the query.

So solution #2 looks like:

Query currQuery = new QueryBuilder(theAnalyzer)
    .createBooleanQuery("sn", currQueryStr, BooleanClause.Occur.MUST);

In your case this would return a Boolean query with one clause, but that gets rewritten during
query execution, so it's identical to a single term query. This approach is like Elasticsearch's
"matchQuery" and is in most cases the approach you should use if you don't need "syntax".

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: atawfik [mailto:contact.txlabs@gmail.com]
> Sent: Tuesday, September 09, 2014 9:37 AM
> To: java-user@lucene.apache.org
> Subject: Re: KeywordAnalyzer still getting tokenized on spaces
> 
> The result of QueryParser is confusing. The problem is that you assume the
> query parser uses the analyzer to parse your query. However, that is not the
> case. The query parser first parses the query string, then applies the
> analyzer.
> 
> In other words, the query parser will split the query string on spaces.
> So, you will get three terms: 1023, 4567 and 8765. In fact, you can see that in
> the output of the second query; you have three boolean clauses instead of
> one. After parsing the query, the query parser applies the analyzer.
> 
> To fix that, you have two solutions:
> 
> 1- Use a TermQuery directly instead of the query parser. In this case,
> the analyzer is not applied.
>      Query currQuery = new TermQuery(new Term("sn",currQueryStr));
> 2- Analyze the query, then create the Term query:
>      TokenStream ts = theAnalyzer.tokenStream("sn", new StringReader(currQueryStr));
>      ts.reset();
>      ts.incrementToken();  // advance to the first (for KeywordAnalyzer, the only) token
>      CharTermAttribute ca = ts.getAttribute(CharTermAttribute.class);
>      String query = ca.toString();
>      ts.end();             // finish the stream before closing it
>      ts.close();
>      Query currQuery = new TermQuery(new Term("sn", query));
>      System.out.println(currQuery.getClass() + ", " + currQuery);
> 
> I am not aware of any method that uses QueryParser to achieve that. Maybe
> someone here can correct me.
> 
> Regards
> Ameer
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/KeywordAnalyzer-still-getting-tokenized-on-spaces-tp4157537p4157560.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

