lucene-java-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Efficient string lookup using Lucene
Date Sun, 26 Aug 2012 19:13:47 GMT
The WhitespaceAnalyzer breaks up text on spaces, tabs, and newlines.
After that, you can use wildcards. This will use very little space. I
believe leading and trailing wildcards are supported now, right?
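
To make the idea concrete, here is a minimal plain-Java sketch of what
whitespace tokenization followed by a wildcard match amounts to. It is an
illustration only and does not use the Lucene API: the class name
WildcardSketch and its methods are hypothetical, and the split on "\\s+"
covers only ASCII whitespace, which is an assumption of this sketch rather
than a statement about how WhitespaceAnalyzer is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not Lucene code.
class WildcardSketch {
    // Split text on runs of ASCII whitespace (spaces, tabs, newlines).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Translate a wildcard pattern ('*' = any run, '?' = any one character)
    // into a regex, quoting every literal stretch so it is matched verbatim.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Return every whitespace-delimited token that matches the wildcard.
    static List<String> matchingTokens(String text, String wildcard) {
        Pattern p = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (p.matcher(token).matches()) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A hotlist word wrapped in leading and trailing wildcards.
        System.out.println(matchingTokens("the quick\tbrown\nfox", "*row*"));
    }
}
```

Wrapping a hotlist word in a leading and a trailing '*' turns the match into
a substring test on each token. This sketch scans every token, whereas an
index would look the pattern up in its term dictionary.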

On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin <izavorin@caci.com> wrote:
> The user uploads a set of text files, either all of them at once or one
> at a time, and then they will be searched locally on the phone against a
> set of "hotlist" words. This assumes no connection to any sort of server
> so everything must be done locally.
>
> I already have Lucene integrated so I might want to try the n-gram
> approach. But I just want to double-check first that it will work with
> any Unicode string, be it an English word, a foreign word, a sequence of
> digits or any random sequence of Unicode characters. In other words,
> this is not in any way language-dependent/-specific.
>
> Thanks,
>
> Ilya
>
> -----Original Message-----
> From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> Sent: Sunday, August 26, 2012 3:55 AM
> To: java-user@lucene.apache.org
> Subject: Re: Efficient string lookup using Lucene
>
>> Does Lucene support this type of structure, or do I need to somehow
>> implement it outside Lucene?
>
> You'd have to implement it separately but it'd be much, much smaller
> than Lucene itself (even obfuscated).
>
>> By the way, I need this to run on an Android phone so size of memory
>> might be an issue...
>
> How large is your input? Do you need to index on the Android device or
> just read the index on it? These are all factors to take into account. I
> mentioned suffix trees and suffix arrays because these two are the
> "canonical" data structures to perform any substring lookups in constant
> time (more precisely, the lookup takes time proportional to the number
> of elements of the matched input string; building the suffix tree/array
> is O(n), at least in theory).
>
> If you already have Lucene integrated in your pipeline then that n-gram
> approach will also work. If you know your minimum match substring length
> to be p then index p-sized shingles. For strings longer than p you can
> create a query which will search for all n-gram occurrences and take
> into account positional information to remove false matches.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
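
For reference, the suffix array lookup mentioned in the quoted message could
look roughly like the sketch below. It rests on several assumptions: the
class name SuffixArraySketch is hypothetical, the array is built by naively
sorting the suffixes (not the linear-time construction the quoted text
alludes to), and the lookup is a binary search, so it costs about
O(m log n) string comparisons rather than time proportional to the query
length alone. It compares UTF-16 code units, so it is not tied to any
language.

```java
import java.util.Arrays;

// Hypothetical illustration; naive construction, not the O(n) algorithm.
class SuffixArraySketch {
    private final String text;
    private final Integer[] suffixStarts; // suffix start offsets, sorted lexicographically

    SuffixArraySketch(String text) {
        this.text = text;
        suffixStarts = new Integer[text.length()];
        for (int i = 0; i < suffixStarts.length; i++) {
            suffixStarts[i] = i;
        }
        // Naive build: sort the start offsets by the suffix they point to.
        Arrays.sort(suffixStarts,
                (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // True if query occurs anywhere in the text.
    boolean contains(String query) {
        int lo = 0;
        int hi = suffixStarts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int start = suffixStarts[mid];
            int end = Math.min(text.length(), start + query.length());
            // Compare the query with the same-length prefix of this suffix.
            int cmp = text.substring(start, end).compareTo(query);
            if (cmp == 0) {
                return true;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SuffixArraySketch sa = new SuffixArraySketch("hotlist words on the phone");
        System.out.println(sa.contains("list w"));
        System.out.println(sa.contains("server"));
    }
}
```

The array holds one integer per character of the text, which is the main
memory cost to weigh on a phone.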

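
The n-gram approach from the quoted message can be sketched in plain Java as
well. This is an illustration of the technique, not Lucene code: the class
NGramIndexSketch and its methods are hypothetical. It indexes every
p-sized n-gram together with its position, and for a longer query it keeps
only the start positions at which all of the query's n-grams line up
consecutively, which is the positional check that removes false matches. It
works on Unicode code points, so it makes no language-specific assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Lucene code.
class NGramIndexSketch {
    private final int p; // n-gram size = minimum query length
    private final Map<String, List<Integer>> postings = new HashMap<>();

    NGramIndexSketch(int p, String text) {
        this.p = p;
        int[] cps = text.codePoints().toArray();
        // Record each p-gram with its code-point offset (offsets stay sorted).
        for (int i = 0; i + p <= cps.length; i++) {
            postings.computeIfAbsent(new String(cps, i, p), k -> new ArrayList<>())
                    .add(i);
        }
    }

    // Code-point offsets at which the query occurs in the indexed text.
    List<Integer> find(String query) {
        int[] q = query.codePoints().toArray();
        if (q.length < p) {
            throw new IllegalArgumentException("query shorter than n-gram size");
        }
        List<Integer> result = new ArrayList<>();
        List<Integer> candidates =
                postings.getOrDefault(new String(q, 0, p), Collections.emptyList());
        for (int start : candidates) {
            boolean aligned = true;
            // Every later n-gram of the query must occur at start + its offset.
            for (int i = 1; i + p <= q.length && aligned; i++) {
                List<Integer> positions = postings.get(new String(q, i, p));
                aligned = positions != null
                        && Collections.binarySearch(positions, start + i) >= 0;
            }
            if (aligned) {
                result.add(start);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NGramIndexSketch index = new NGramIndexSketch(4, "abcd bcde");
        // Both n-grams of "abcde" are present, but not at adjacent positions.
        System.out.println(index.find("abcde"));
        System.out.println(index.find("bcde"));
    }
}
```

The first query in main is the false-match case: both of its n-grams are in
the index, but not at adjacent positions, so the positional check rejects
it.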


-- 
Lance Norskog
goksron@gmail.com


