lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Snare <mikesn...@gmail.com>
Subject Re: Why does the StandardTokenizer split hyphenated words?
Date Wed, 15 Dec 2004 20:14:46 GMT
> a-1 is considered a typical product name that needs to be unchanged
> (there's a comment in the source that mentions this). Indexing
> "hyphen-word" as two tokens has the advantage that it can then be found
> with the following queries:
> hypen-word (will be turned into a phrase query internally)
> "hypen word" (phrase query)
> (it cannot be found searching for hyphenword, however).

Sure.  But phrase queries are slower than a single word query.  In my
case, using the standard analyzer prior to my modification caused a
single (hyphenated) word query to take upwards of 10 seconds (1M+
documents with ~400K terms).  The exact same search with the new
Analyzer takes <.5 seconds (granted the new tokenization caused a
significant reduction in the number of terms).  Also, the phrase query
would place the same value on a doc that simply had the two words as a
doc that had the hyphenated version, wouldn't it?  This seems odd.

In addition, why do we assume that a-1 is a "typical product name" but
a-b isn't?

I am in no way second-guessing or suggesting a change, It just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is oft the case, that this is just one of those things one has to
accept.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message