lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Inconsistent tokenizing of words containing underscores.
Date Mon, 29 Aug 2005 16:56:08 GMT
That's StandardAnalyzer's expected behaviour. Its grammar keeps an
underscore inside a token only when the joined segments look like a
serial or model number, i.e. when at least every other segment
contains a digit: "DE_SA0001" matches that rule, "XYZZZY_DE" does
not, which is why you see exactly one split. If you want tokenization
to occur only on white space, use WhitespaceAnalyzer. If you want
custom behaviour, you should write your own Analyzer (there should be
a FAQ entry with an example).
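
If you do roll your own, a minimal sketch against the 1.4 API could
look like the class below (the name UnderscoreAnalyzer is just for
illustration). It treats letters, digits and '_' as token characters
and lowercases the result, so XYZZZY_DE_SA0001 survives as a single
term:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

/** Illustrative analyzer: keeps letters, digits and '_' together. */
public class UnderscoreAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // CharTokenizer splits wherever isTokenChar() returns false.
        TokenStream stream = new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return Character.isLetterOrDigit(c) || c == '_';
            }
        };
        // Lowercase like StandardAnalyzer does, so searches stay
        // case-insensitive.
        return new LowerCaseFilter(stream);
    }
}

Note that plain WhitespaceAnalyzer would also keep the underscores,
but it does no lowercasing, so queries would become case-sensitive.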

Otis
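
P.S. To see exactly what an analyzer does to a given string, you can
dump its tokens with a few lines like these (the field name "f" is
arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer().tokenStream("f",
                new StringReader("XYZZZY_DE_SA0001"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
        // With StandardAnalyzer this prints the two tokens you saw:
        //   xyzzzy
        //   de_sa0001
    }
}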

--- "Is, Studcio" <Studcio.is@wincor-nixdorf.com> wrote:

> Hello,
>  
> I've been using Lucene in a small project for a few weeks now and
> just ran into a problem. My index contains words with one or more
> underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately
> the tokenizer splits such a word into multiple tokens at the
> underscores, except at the last one.
>  
> For example the word XYZZZY_DE_SA0001 is tokenized as follows:
>  
> 1. Token: XYZZZY
> 2. Token: DE_SA0001
>  
> which does not match my expectations: the tokenizer should either
> split at every underscore or at none.
>  
> I'm using Lucene 1.4.3 with
> org.apache.lucene.analysis.standard.StandardAnalyzer and Java
> 1.4.2_08.
>  
> Has anybody experienced the same behaviour, or can anyone explain
> it? Could it be a bug in the StandardTokenizer?
>  
> Many thanks in advance
>  
> Sebastian Seitz
> 

