lucene-java-user mailing list archives

From govind bhardwaj <govins...@gmail.com>
Subject Re: Issue with StandardAnalyzer which splits single word with _(Lucene Version: 3.0)
Date Mon, 22 Aug 2011 10:16:18 GMT
Hi Srinivas,

It works with the latest Lucene version, 3.3.0 (in fact, with versions after
3.0.0). StandardAnalyzer just splits the text into tokens and drops a set of
stop words such as "is" and "in".

The class documentation for StandardAnalyzer in the Lucene 3.3.0 API
states:
"As of 3.1, StandardTokenizer implements Unicode text segmentation, and
StopFilter correctly handles Unicode 4.0 supplementary characters in
stopwords." I guess that takes care of the underscore character now: under
Unicode text segmentation (UAX #29), an underscore between letters or digits
does not end the word, so xyz_abc stays a single token.
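
To make the difference concrete, here is a rough sketch in plain Java (no
Lucene dependency) of the two behaviours for the underscore. It is a
simplified, ASCII-only model and not Lucene's actual tokenizer code. The class
and method names (UnderscoreTokenSketch, legacyStyle, uax29Style) are made up
for illustration. The "legacy" rule assumes that the older grammar keeps two
runs joined by '_' only when one of them contains a digit, which matches what
Srinivas observed with xyz_1abc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnderscoreTokenSketch {

    // Alphanumeric runs joined by single underscores, e.g. "xyz_abc" or "a_1_b".
    private static final Pattern CHAIN =
            Pattern.compile("[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*");

    // A run of letters, digits and underscores with at least one letter or digit.
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z0-9_]*[A-Za-z0-9][A-Za-z0-9_]*");

    private static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Assumed model of the pre-3.1 behaviour: '_' keeps two neighbouring
    // segments together only if at least one of them contains a digit.
    public static List<String> legacyStyle(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = CHAIN.matcher(text);
        while (m.find()) {
            String[] segs = m.group().split("_");
            StringBuilder current = new StringBuilder(segs[0]);
            for (int i = 1; i < segs.length; i++) {
                if (hasDigit(segs[i - 1]) || hasDigit(segs[i])) {
                    current.append('_').append(segs[i]);
                } else {
                    tokens.add(current.toString());
                    current = new StringBuilder(segs[i]);
                }
            }
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Model of the UAX #29 rule for '_': it never breaks a word when it
    // sits between letters or digits.
    public static List<String> uax29Style(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(legacyStyle("xyz_abc"));   // [xyz, abc]
        System.out.println(legacyStyle("xyz_1abc"));  // [xyz_1abc]
        System.out.println(uax29Style("xyz_abc"));    // [xyz_abc]
    }
}
```

When the analyzer returns two tokens for one query term, the query parser
builds a phrase query from them, which is where key:"xyz abc" comes from.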

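So I suggest that you switch to the latest version for better performance
and functionality. Note that the new tokenization only applies if you pass
Version.LUCENE_31 or later as the matchVersion when you create the
StandardAnalyzer; with Version.LUCENE_30 you keep the old behaviour. Hope
that helps.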

Regards,
Govind

On Mon, Aug 22, 2011 at 11:17 AM, <srinu.hello@gmail.com> wrote:

> Hello All,
> I observed some unexpected behavior when using StandardAnalyzer to
> parse a query. Here is a demonstration.
>
> I am passing the query as (key:xyz_abc) && (text:blabla)
>
> Expecting the parsed query to be +key:xyz_abc +text:blabla
>
> Actual Result is +key:"xyz abc" +text:blabla
>
> As per my understanding, StandardAnalyzer splits text into multiple words
> at word boundaries, but the above word xyz_abc is a single word. Please
> correct me if I am wrong.
>
> I also observed that if there is a digit after the underscore, the parsed
> query is as expected, i.e.
>
> If I give the query as (key:xyz_1abc) && (text:blabla), the parsed query is
> +key:xyz_1abc +text:blabla
>
> This is the behavior I am expecting.
>
> Please help.
>
> Thanks,
> Srinivas


-- 
No trees were harmed in the creation of this message, but several thousand
electrons were mildly inconvenienced.
