lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Efficient string lookup using Lucene
Date Sun, 26 Aug 2012 07:55:24 GMT
> Does Lucene support this type of structure, or do I need to somehow
> implement it outside Lucene?

You'd have to implement it separately but it'd be much, much smaller
than Lucene itself (even obfuscated).

> By the way, I need this to run on an Android phone so size of memory might be an issue...

How large is your input? Do you need to build the index on the Android
device, or only read a prebuilt index there? Both factors matter. I
mentioned suffix trees and suffix arrays because they are the
"canonical" data structures for substring lookups. Lookup time does
not depend on the size of the indexed text, only on the length of the
string being matched (a plain suffix array searched with binary search
adds a logarithmic factor). Building the suffix tree or suffix array
is O(n), at least in theory.
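
To make the suffix array idea concrete, here is a minimal sketch in
plain Java with no Lucene dependency. The class name SuffixArrayLookup
and its find method are made up for illustration. Note that it builds
the array with an ordinary comparison sort, which is simple but far
from the O(n) construction mentioned above, so treat it as a starting
point for small inputs only.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: a naive suffix array with
// binary-search substring lookup.
final class SuffixArrayLookup {
    private final String text;
    private final Integer[] suffixes; // suffix start offsets in sorted order

    SuffixArrayLookup(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction: sort offsets by the suffix they start.
        // This is NOT a linear-time algorithm.
        Arrays.sort(suffixes,
                (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // Compares the pattern with the first pattern.length() characters
    // of the suffix starting at 'start'.
    private int compare(String pattern, int start) {
        int end = Math.min(text.length(), start + pattern.length());
        return pattern.compareTo(text.substring(start, end));
    }

    // Returns some offset at which the pattern occurs, or -1 if none.
    int find(String pattern) {
        int lo = 0;
        int hi = suffixes.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(pattern, suffixes[mid]);
            if (c == 0) {
                return suffixes[mid];
            }
            if (c < 0) {
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return -1;
    }
}
```

Usage would look like new SuffixArrayLookup("banana").find("nan").
Each binary-search step compares at most pattern-length characters,
which is where the logarithmic factor over a suffix tree comes from.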

If you already have Lucene integrated in your pipeline then the
n-gram approach will also work. If you know the minimum length p of a
substring you need to match, index p-sized n-grams (shingles). For
query strings longer than p, build a query that searches for all of
the query's n-grams and uses positional information to remove false
matches.
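
The n-gram idea with positional filtering can be sketched without any
Lucene API, just to show the logic. Everything here (the NGramIndex
class, its add and search methods) is hypothetical. In Lucene itself
the rough equivalent would presumably be an n-gram tokenizer at index
time and a phrase-style query at search time, but this sketch does not
use those classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical in-memory index of character n-grams with positions.
final class NGramIndex {
    private final int p;
    private int nextDocId = 0;
    // n-gram -> set of postings, each encoded as (docId << 32) | position
    private final Map<String, Set<Long>> postings = new HashMap<>();

    NGramIndex(int p) {
        if (p < 1) {
            throw new IllegalArgumentException("p must be positive");
        }
        this.p = p;
    }

    private static long key(int doc, int pos) {
        return ((long) doc << 32) | pos;
    }

    // Indexes every p-sized n-gram of the document and returns its id.
    int add(String doc) {
        int id = nextDocId++;
        for (int i = 0; i + p <= doc.length(); i++) {
            postings.computeIfAbsent(doc.substring(i, i + p),
                    g -> new HashSet<>()).add(key(id, i));
        }
        return id;
    }

    // Returns ids of documents that contain the query as a substring.
    SortedSet<Integer> search(String query) {
        if (query.length() < p) {
            throw new IllegalArgumentException("query shorter than p");
        }
        SortedSet<Integer> hits = new TreeSet<>();
        Set<Long> first = postings.getOrDefault(
                query.substring(0, p), Collections.<Long>emptySet());
        for (long posting : first) {
            int doc = (int) (posting >>> 32);
            int start = (int) posting;
            boolean ok = true;
            // Every later n-gram of the query must occur in the same
            // document at the matching offset; this is the positional
            // check that removes false matches.
            for (int k = 1; ok && k + p <= query.length(); k++) {
                Set<Long> s = postings.get(query.substring(k, k + p));
                ok = s != null && s.contains(key(doc, start + k));
            }
            if (ok) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With p = 2, a document such as "ab-bc" contains both n-grams of the
query "abc" ("ab" and "bc") but not at adjacent positions, so the
positional check rejects it, while "xabcx" is accepted.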

Dawid
