lucene-java-user mailing list archives

From Ilya Zavorin <izavo...@caci.com>
Subject RE: Efficient string lookup using Lucene
Date Sun, 26 Aug 2012 18:29:59 GMT
The user uploads a set of text files, either all of them at once or one at a time, and then
they will be searched locally on the phone against a set of "hotlist" words. This assumes
no connection to any sort of server so everything must be done locally.

I already have Lucene integrated so I might want to try the n-gram approach. But I just want
to double-check first that it will work with any Unicode string, be it an English word, a
foreign word, a sequence of digits or any random sequence of Unicode characters. In other
words, I need the approach not to be language-dependent or language-specific in any way.
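
For illustration, here is a minimal sketch of language-independent n-gram splitting in plain
Java (no Lucene). It assumes Java 8+ and treats the text purely as a sequence of Unicode code
points, so digits, non-Latin scripts and supplementary characters are all handled the same way.
The class and method names are hypothetical. Whether a given Lucene n-gram tokenizer splits on
code points or on UTF-16 chars is worth verifying for the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointNGrams {
    /** Returns every window of n consecutive Unicode code points in text. */
    static List<String> ngrams(String text, int n) {
        // Work on code points, not chars, so surrogate pairs are never split.
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }
}
```

Nothing here inspects the script or language of the input. One caveat: canonically equivalent
strings (for example precomposed versus combining accents) are different code point sequences,
so both the files and the hotlist would need the same Unicode normalization to match.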

Thanks,

Ilya

-----Original Message-----
From: Dawid Weiss [mailto:dawid.weiss@gmail.com] 
Sent: Sunday, August 26, 2012 3:55 AM
To: java-user@lucene.apache.org
Subject: Re: Efficient string lookup using Lucene

> Does Lucene support this type of structure, or do I need to somehow implement it outside
> Lucene?

You'd have to implement it separately but it'd be much, much smaller than Lucene itself (even
obfuscated).

> By the way, I need this to run on an Android phone so size of memory might be an issue...

How large is your input? Do you need to index on the Android device or just read the index on
it? These are all factors to take into account. I mentioned suffix trees and suffix arrays
because these two are the "canonical" data structures for substring lookups whose cost does not
grow with the amount of indexed text (in fact, a lookup takes time proportional to the length
of the matched input string; building the suffix tree/array is O(n), at least in theory).
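
To make the suffix array idea concrete, here is a deliberately naive sketch in plain Java. It
is an assumption-laden illustration, not the linear-time construction mentioned above: the
array is built by sorting suffix start offsets directly, and lookup is a plain binary search,
which costs O(m log n) for a query of length m rather than O(m). The class name is
hypothetical.

```java
import java.util.Arrays;

final class SuffixArraySketch {
    private final String text;
    private final Integer[] sa; // suffix start offsets, sorted by suffix

    SuffixArraySketch(String text) {
        this.text = text;
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction: fine for a sketch, too slow for large inputs.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** True if pattern occurs anywhere in the indexed text. */
    boolean contains(String pattern) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(sa[mid], pattern);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return pattern.isEmpty();
    }

    // Compares the suffix at 'start', truncated to the pattern length, with the pattern.
    private int comparePrefix(int start, String pattern) {
        int end = Math.min(text.length(), start + pattern.length());
        return text.substring(start, end).compareTo(pattern);
    }
}
```

For example, `new SuffixArraySketch("banana").contains("nan")` finds the match by comparing the
query against a handful of sorted suffixes instead of scanning the text.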

If you already have Lucene integrated in your pipeline then that n-gram approach will also
work. If you know your minimum match substring length to be p then index p-sized shingles.
For strings longer than p you can create a query which will search for all n-gram occurrences
and take into account positional information to remove false matches.
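
The shingle idea above can be sketched without Lucene as a small in-memory index. This is only
an illustration under assumptions: the class name is hypothetical, it indexes a single string,
and it splits on Java chars for brevity. It stores each p-sized shingle with its start
positions; a longer query matches only if its shingles occur at consecutive positions. In
Lucene the analogous step would typically be a phrase-style query over the n-gram field.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ShingleIndex {
    private final int p;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    ShingleIndex(String text, int p) {
        this.p = p;
        // Index every p-sized shingle together with its start position.
        for (int i = 0; i + p <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + p), k -> new HashSet<>()).add(i);
        }
    }

    /** Position-aware lookup: shingle k of the query must sit at start + k. */
    boolean contains(String query) {
        if (query.length() < p) throw new IllegalArgumentException("query shorter than p");
        Set<Integer> starts = postings.get(query.substring(0, p));
        if (starts == null) return false;
        for (int start : starts) {
            if (aligned(query, start)) return true;
        }
        return false;
    }

    private boolean aligned(String query, int start) {
        for (int k = 1; k + p <= query.length(); k++) {
            Set<Integer> pos = postings.get(query.substring(k, k + p));
            if (pos == null || !pos.contains(start + k)) return false;
        }
        return true;
    }

    /** Position-blind lookup, shown only to demonstrate false matches. */
    boolean containsAllShingles(String query) {
        for (int k = 0; k + p <= query.length(); k++) {
            if (!postings.containsKey(query.substring(k, k + p))) return false;
        }
        return true;
    }
}
```

With the text "abcXbcd" and p = 3, the query "abcd" passes `containsAllShingles` because both
"abc" and "bcd" are indexed, but `contains` rejects it because "bcd" does not start one
position after "abc". That is the false match the positional check removes.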

Dawid
