lucene-java-user mailing list archives

From "Is, Studcio" <Studcio...@wincor-nixdorf.com>
Subject Inconsistent tokenizing of words containing underscores.
Date Mon, 29 Aug 2005 14:45:02 GMT
Hello,
 
I've been using Lucene in a small project for a few weeks now and have
just run into a problem. My index contains words with one or more
underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately the
tokenizer splits such a word into multiple tokens at the underscores,
except at the last underscore.
 
For example, the word XYZZZY_DE_SA0001 is tokenized as follows:
 
1. Token: XYZZZY
2. Token: DE_SA0001
 
which does not match my expectations. In my view, the tokenizer should
split either at every underscore or at none.
 
I'm using Lucene 1.4.3 with
org.apache.lucene.analysis.standard.StandardAnalyzer and Java 1.4.2_08.
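 
A minimal snippet along these lines reproduces the behaviour. This is
only a sketch against the Lucene 1.4.x TokenStream API; the field name
"contents" and the class name are arbitrary choices for the test:
 
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class UnderscoreTokenTest {
    public static void main(String[] args) throws IOException {
        // Feed the problematic word through StandardAnalyzer and
        // print every token it produces, one per line.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("XYZZZY_DE_SA0001"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText());
        }
    }
}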
 
Has anybody experienced the same behaviour, or can anyone explain it?
Could it be a bug in the StandardTokenizer?
 
Many thanks in advance
 
Sebastian Seitz
