lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Naber <daniel.na...@t-online.de>
Subject Re: Why does the StandardTokenizer split hyphenated words?
Date Wed, 15 Dec 2004 19:27:58 GMT
On Wednesday 15 December 2004 19:29, Mike Snare wrote:

> In my case, the words are keywords that must remain as is, searchable
> with the hyphen in place.  It was easy enough to modify the tokenizer
> to do what I need, so I'm not really asking for help there.  I'm
> really just curious as to why it is that "a-1" is considered a single
> token, but "a-b" is split.

a-1 is considered a typical product name that needs to be unchanged 
(there's a comment in the source that mentions this). Indexing 
"hyphen-word" as two tokens has the advantage that it can then be found 
with the following queries:
hypen-word (will be turned into a phrase query internally)
"hypen word" (phrase query)
(it cannot be found searching for hyphenword, however).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message