lucene-java-user mailing list archives

From Noopur Julka <noopur.ju...@gmail.com>
Subject Re: Efficient string lookup using Lucene
Date Sun, 26 Aug 2012 03:01:14 GMT
A very large index can be ruled out as the cause: Luke returned only a few
results, and the app is capable of returning approximately 200 results.

Regards,
Noopur Julka



On Sun, Aug 26, 2012 at 6:40 AM, Ilya Zavorin <izavorin@caci.com> wrote:

> Does Lucene support this type of structure, or do I need to somehow
> implement it outside Lucene?
>
> By the way, I need this to run on an Android phone so size of memory might
> be an issue...
>
> Thanks,
>
>
> Ilya Zavorin
>
>
> -----Original Message-----
> From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> Sent: Friday, August 24, 2012 4:50 PM
> To: java-user@lucene.apache.org
> Subject: Re: Efficient string lookup using Lucene
>
> What you need is a suffix tree or a suffix array. Both data structures
> let you check for the existence or occurrences of any input pattern in
> time that depends mainly on the pattern length (plus, for a suffix array,
> a logarithmic factor in the text size) rather than on a full scan of the
> text. Depending on how much text you have on the input, it may either be
> a simple task -- see here:
>
> http://labs.carrotsearch.com/jsuffixarrays.html
>
> or a complicated task if your input is larger than memory. Do search for
> suffix trees and suffix arrays, though; they are the data structures to
> use here.
>
> Dawid
>
> On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <izavorin@caci.com> wrote:
> > Hi Everyone,
> >
> > I have the following task. I have a set of documents in multiple
> > languages. I don't know what these languages are. Any given doc may
> > contain text in several languages mixed up. So to me these are just a
> > bunch of Unicode text files.
> >
> > What I need is to implement an efficient EXACT string lookup. That is,
> > I need to be able to find ANY Unicode string exactly as it appears. I do
> > not care about language-specific modifications of the string. That is,
> > if I search for a string "run", I do not need to find "ran" but I do
> > want to find it in all of these strings below:
> >
> > Fox is running fast
> > !%#^&$run!$!%@&$#
> > run,run
> >
> > Is there a way of using StandardAnalyzer or any other analyzer and the
> > corresponding query parser to find these? Again, my queries might be
> > more or less random Unicode sequences and I need to find all their
> > occurrences in the text.
> >
> > Essentially, what I am trying to do is implement substring matching
> > more efficiently than using Java's standard substring matching methods.
> >
> > Thanks!
> >
> > Ilya Zavorin
>
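For readers following the suffix array suggestion quoted above, here is a minimal sketch of exact substring lookup over one in-memory text. It does not use the jsuffixarrays library linked in the thread; the class name SuffixArrayLookup and its methods are hypothetical, and the construction is the naive "sort all suffix offsets" approach, which is only reasonable for small inputs. Lookup is a binary search, so it costs roughly O(m log n) character comparisons for a pattern of length m in a text of length n. Comparison is by UTF-16 code unit, so no analyzer or language-specific processing is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or jsuffixarrays.
final class SuffixArrayLookup {
    private final String text;
    // Start offsets of all suffixes, sorted by the suffix they point to.
    // Assumption: boxed Integer[] keeps the sketch short; an int[] with a
    // custom sort would use far less memory on a phone.
    private final Integer[] suffixes;

    SuffixArrayLookup(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction: comparison sort over suffix offsets.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    // Lexicographic comparison of two suffixes by UTF-16 code unit.
    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int d = text.charAt(a) - text.charAt(b);
            if (d != 0) {
                return d;
            }
            a++;
            b++;
        }
        // The suffix that ran out first is the shorter one and sorts first.
        return Integer.compare(n - a, n - b);
    }

    // Negative if the suffix sorts before the pattern, 0 if the suffix
    // starts with the pattern, positive if it sorts after.
    private int compareSuffixToPattern(int start, String pattern) {
        int n = Math.min(text.length() - start, pattern.length());
        for (int i = 0; i < n; i++) {
            int d = text.charAt(start + i) - pattern.charAt(i);
            if (d != 0) {
                return d;
            }
        }
        return n < pattern.length() ? -1 : 0;
    }

    // Returns the offsets of every exact occurrence, in ascending order.
    List<Integer> findAll(String pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits;
        }
        // Binary search for the first suffix that is >= the pattern.
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareSuffixToPattern(suffixes[mid], pattern) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // All suffixes that start with the pattern are adjacent from here.
        for (int i = lo; i < suffixes.length
                && compareSuffixToPattern(suffixes[i], pattern) == 0; i++) {
            hits.add(suffixes[i]);
        }
        Collections.sort(hits);
        return hits;
    }

    boolean contains(String pattern) {
        return !findAll(pattern).isEmpty();
    }

    public static void main(String[] args) {
        String doc = "Fox is running fast\n!%#^&$run!$!%@&$#\nrun,run";
        SuffixArrayLookup index = new SuffixArrayLookup(doc);
        System.out.println(index.findAll("run")); // prints [7, 26, 38, 42]
        System.out.println(index.contains("ran")); // prints false
    }
}
```

The three sample strings from the original question are joined with newlines here, so "run" is found inside "running", between the punctuation, and twice in "run,run", while "ran" is not matched. For inputs that approach or exceed available memory, as raised in the thread, this naive version is not suitable and a dedicated construction algorithm or an on-disk structure would be needed.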
