lucene-java-user mailing list archives

From "Jeremy Meyer" <>
Subject RE: Inconsistent tokenizing of words containing underscores.
Date Mon, 29 Aug 2005 17:21:21 GMT
The expected behavior is to sometimes treat a character as indicating a new
token and other times to ignore the same character?

This sounds like behavior that should be much better documented than it
currently is.

Why would this be the default? What cases is it meant for?
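(Editorial note: the behaviour appears to come from the classic StandardTokenizer grammar, which joins alphanumeric segments across punctuation such as `_` only when the joined pair contains a digit — the so-called NUM rule, meant for things like serial numbers and phone numbers. Below is a rough pure-Java simulation of that merge heuristic, not the Lucene code itself; the rule name and this approximation are the editor's reading of the grammar:)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnderscoreTokens {
    static boolean hasDigit(String s) {
        return s.chars().anyMatch(Character::isDigit);
    }

    // Rough simulation of StandardTokenizer's NUM rule: split on '_',
    // then re-join neighbouring segments only when one side of the
    // join contains a digit. An approximation, not Lucene code.
    static List<String> approxStandardTokens(String word) {
        String[] parts = word.split("_");
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < parts.length) {
            StringBuilder tok = new StringBuilder(parts[i]);
            while (i + 1 < parts.length
                    && (hasDigit(parts[i]) || hasDigit(parts[i + 1]))) {
                i++;
                tok.append('_').append(parts[i]);
            }
            out.add(tok.toString());
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // "XYZZZY" and "DE" carry no digit, so their join is split;
        // "DE" and "SA0001" do, so that join survives.
        System.out.println(approxStandardTokens("XYZZZY_DE_SA0001"));
        // prints: [XYZZZY, DE_SA0001]
        System.out.println(approxStandardTokens("XYZZZY_AT0001"));
        // prints: [XYZZZY_AT0001]
    }
}
```

This reproduces both observations in the original report: XYZZZY_DE_SA0001 splits only at the first underscore, while XYZZZY_AT0001 is not split at all.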

-----Original Message-----
From: Otis Gospodnetic [] 
Sent: Monday, August 29, 2005 10:56 AM
Subject: Re: Inconsistent tokenizing of words containing underscores.

That's StandardAnalyzer's expected behaviour.  If you want
tokenization to occur only on whitespace, use WhitespaceAnalyzer.  If
you want custom behaviour, you should write your own Analyzer (there
should be a FAQ entry with an example).
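(Editorial note: a minimal illustration of the whitespace route — plain Java string splitting that yields the same tokens WhitespaceAnalyzer would for this input, so terms containing underscores stay intact. This is a sketch of the behaviour, not the Lucene API:)

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceDemo {
    // Split only on runs of whitespace; underscores are ordinary
    // characters here, so XYZZZY_DE_SA0001 survives as one token.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("part XYZZZY_DE_SA0001 in stock"));
        // prints: [part, XYZZZY_DE_SA0001, in, stock]
    }
}
```

The trade-off is that whitespace-only tokenization also stops stripping punctuation like commas and periods, so it suits identifier-like fields better than free text.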


--- "Is, Studcio" <> wrote:

> Hello,
> I've been using Lucene for a few weeks now in a small project and
> just ran into a problem. My index contains words with one or more
> underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately
> the tokenizer splits such a word into multiple tokens at the
> underscores, except at the last underscore.
> For example the word XYZZZY_DE_SA0001 is tokenized as follows:
> 1. Token: XYZZZY
> 2. Token: DE_SA0001
> which is not what I expected. Either the tokenizer should split at
> every underscore or at none.
> I'm using Lucene 1.4.3 with
> org.apache.lucene.analysis.standard.StandardAnalyzer and Java
> 1.4.2_08.
> Has anybody experienced the same behaviour or can explain it? Could
> it be a bug in the StandardTokenizer?
> Many thanks in advance
> Sebastian Seitz

To unsubscribe, e-mail:
For additional commands, e-mail:

