lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Indexing bigrams and trigrams in Lucene
Date Mon, 04 Sep 2006 05:49:33 GMT

: This is a text document written by someone. Read this and post your comments
:
: words that must be indexed:
: text
: document
	...
: text document
: document written

typically when people talk about indexing n-grams -- they mean character
wise (so they can find words with simple spelling mistakes) .. it's not
relaly clear why you would need word wise n-grams, why not just search for
phrases with no slop?


: So, I made changes to StandardAnalyzer.java and StandardTokenizer.jj to try
: and achieve this.

if you really need/want an analyzer that produces those tokens, i would
suggest you do it with a TokenFilter -- no reason to complicate the
tokenizing process when you can leave that alone and combine the tokens
instead.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message