lucene-java-user mailing list archives

From Dave Kor <s0454...@sms.ed.ac.uk>
Subject RE: n-gram indexing
Date Sun, 24 Jul 2005 16:59:11 GMT
Quoting Rajesh Munavalli <rajeshm@dessci.com>:

> Let me explain a scenario where I would need to add the n-grams at
> indexing time.

I see your point and I agree. As it stands, Lucene does not natively support
n-gram indexing. However, it is not impossible to adapt Lucene to serve as an
n-gram index. The method I will describe can be adapted to any search engine,
not just Lucene. But before I go on, I must warn you that the end result will
use a lot of disk space and will also make searches roughly N times slower,
since N indexes have to be searched.

What's the method? Use multiple incompatible indexes, N of them to be exact,
one for each n-gram size.

You can write an Analyzer that churns out bi-grams and use it to create an index
of bi-grams. Likewise, you can write an Analyzer that churns out tri-grams and
use it to create an index of tri-grams. It's a tedious method of n-gram indexing
that wastes disk space, but it can be done.
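
To make the token stream concrete, here is a rough sketch of what such an
Analyzer would need to emit. It is plain Java rather than Lucene code, and it
assumes word-level n-grams over lowercased, whitespace-separated tokens; the
class name WordNGrams and the method of() are made up for illustration. In
practice this logic would sit inside a custom Analyzer/TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: produces word n-grams from text.
class WordNGrams {

    // Returns every contiguous run of n tokens, joined by a single space.
    // Assumption: tokens are lowercased and split on whitespace only.
    static List<String> of(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams;
        }
        String[] tokens = trimmed.toLowerCase().split("\\s+");
        // Slide a window of n tokens across the token array.
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [the quick, quick brown, brown fox]
        System.out.println(of("the quick brown fox", 2));
    }
}
```

Each n-gram would then be indexed as a single term in the index for that value
of n, so the bi-gram index and the tri-gram index never share terms.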

You can then separately search all three indexes, the unigram index, bigram
index and trigram index, to generate three separate scores for every document,
then combine the three scores using weights.
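
The combination step might look something like the sketch below. It is plain
Java, not Lucene API: it assumes you have already collected the hits from each
index into a map from document id to score, and the class name ScoreCombiner,
the method combine() and the weights are all hypothetical. A document that is
missing from one index simply contributes nothing for that index.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: weighted sum of per-index scores.
class ScoreCombiner {

    // scoresPerIndex.get(i) maps document id -> score from index i
    // (for example i = 0 unigram, 1 bigram, 2 trigram).
    // weights[i] is the weight given to index i.
    static Map<String, Double> combine(List<Map<String, Double>> scoresPerIndex,
                                       double[] weights) {
        if (scoresPerIndex.size() != weights.length) {
            throw new IllegalArgumentException("need one weight per index");
        }
        Map<String, Double> combined = new HashMap<>();
        for (int i = 0; i < weights.length; i++) {
            final double weight = weights[i];
            for (Map.Entry<String, Double> hit : scoresPerIndex.get(i).entrySet()) {
                // Add this index's weighted score to the document's running total.
                combined.merge(hit.getKey(), weight * hit.getValue(), Double::sum);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Double> unigram = Map.of("doc1", 1.0, "doc2", 0.5);
        Map<String, Double> bigram = Map.of("doc1", 0.5);
        Map<String, Double> trigram = Map.of();
        Map<String, Double> combined =
                combine(List.of(unigram, bigram, trigram), new double[] {0.5, 0.25, 0.25});
        System.out.println(combined.get("doc1")); // prints 0.625
        System.out.println(combined.get("doc2")); // prints 0.25
    }
}
```

Bear in mind that raw scores from separate indexes are not necessarily on the
same scale, so you may want to normalize each index's scores before weighting
them.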




