lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Why does the StandardTokenizer split hyphenated words?
Date Wed, 15 Dec 2004 20:23:51 GMT
: a-1 is considered a typical product name that needs to be unchanged
: (there's a comment in the source that mentions this). Indexing
: "hyphen-word" as two tokens has the advantage that it can then be found
: with the following queries:
: hypen-word (will be turned into a phrase query internally)
: "hypen word" (phrase query)
: (it cannot be found searching for hyphenword, however).

This isn't an area of Lucene that I've had a chance to investigate much
yet, but if I recall from my reading, Lucene allows you to place multiple
token sequences at the same position, generating something more easily
described as a "token graph" then a "token stream" .. correct?

so given an input "the quick-brown fox jumped over the a-1 sauce"
the tokenizer culd generate a token stream that looks like.

	"the"
	"quick" "brown"  OR  "quick-brown"  OR  "quickbrown"
	"fox"
	"jumped"
	"over"
	"the"
	"a" "1"  OR  "a-1"  OR  "a1"
	"sauce"

...at which point, a minimum 2 character word length filter, and stop
words filter could (if you wanted to use them) reduce that to...

	"quick" "brown"  OR  "quick-brown"  OR  "quickbrown"
	"fox"
	"jumped"
	"over"
	"a-1"  OR  "a1"
	"sauce"

allowing all of these future (phrase) searches to match...

	the quick brown fox jumped over the a1 sauce
	the quickbrown fox jumped over the a1 sauce
	the quick-brown fox jumped over the a1 sauce
	the quick brown fox jumped over the a 1 sauce
	the quickbrown fox jumped over the a 1 sauce
	the quick-brown fox jumped over the a 1 sauce
	the quick brown fox jumped over the a-1 sauce
	the quickbrown fox jumped over the a-1 sauce
	the quick-brown fox jumped over the a-1 sauce


...correct? or am I missunderstanding?

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message