lucene-java-user mailing list archives

From "Devon H. O'Dell" <devon.od...@gmail.com>
Subject Re: Efficient string lookup using Lucene
Date Sun, 26 Aug 2012 03:47:38 GMT
It seems worth mentioning, in partial response to this thread's topics, that
(almost) regardless of index strategy, Lucene performance hinges on the
number of documents matched per query, not on the total number of documents
in the index. There are other mitigating factors (disk type, RAM size, etc.),
but worst-case performance can generally be modeled in terms of matched
documents as opposed to index size.
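
As a rough illustration of that point, here is a minimal sketch of a toy
in-memory inverted index. It is an assumption-laden stand-in, not Lucene's
actual data structures or API: the class name ToyInvertedIndex and its
methods are hypothetical. The only thing it is meant to show is that answering
a term query walks the postings of that term, so the work tracks the number of
matches rather than the number of indexed documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index for illustration only; this is NOT how Lucene is implemented.
class ToyInvertedIndex {
    // term -> ids of the documents that contain it, in ascending order
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Adds one document, split on whitespace, and returns its id.
    int add(String text) {
        int id = docCount++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != id) list.add(id);
        }
        return id;
    }

    // Copies the postings of one term: one step per matching document,
    // no matter how many documents the index holds in total.
    List<Integer> search(String term) {
        List<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    int size() {
        return docCount;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (int i = 0; i < 100_000; i++) {
            index.add(i % 25_000 == 0 ? "common rare" : "common");
        }
        System.out.println(index.search("rare").size());   // prints 4
        System.out.println(index.search("common").size()); // prints 100000
    }
}
```

Both queries run against the same 100,000-document index, but the first one
touches 4 postings and the second touches 100,000.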

Apologies for any spelling / grammatical errors; this is sent from my phone.

--dho
On Aug 25, 2012 11:02 PM, "Noopur Julka" <noopur.julka@gmail.com> wrote:

> A very large index can be ruled out, as Luke returned few results and
> the app is capable of returning approx. 200 results.
>
> Regards,
> Noopur Julka
>
>
>
> On Sun, Aug 26, 2012 at 6:40 AM, Ilya Zavorin <izavorin@caci.com> wrote:
>
> > Does Lucene support this type of structure, or do I need to somehow
> > implement it outside Lucene?
> >
> > By the way, I need this to run on an Android phone so size of memory
> > might be an issue...
> >
> > Thanks,
> >
> >
> > Ilya Zavorin
> >
> >
> > -----Original Message-----
> > From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> > Sent: Friday, August 24, 2012 4:50 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Efficient string lookup using Lucene
> >
> > What you need is a suffix tree or a suffix array. Both data structures
> > will allow you to search for the existence/occurrence of any input
> > pattern in time that depends mainly on the pattern length rather than
> > on the size of the text. Depending on how much text you have on the
> > input it may either be a simple task -- see here:
> >
> > http://labs.carrotsearch.com/jsuffixarrays.html
> >
> > or a complicated task if your input size is larger (larger than memory).
> > Do a Google search for suffix trees / suffix arrays, though; it's the
> > data structure to use here.
> >
> > Dawid
> >
> > On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <izavorin@caci.com> wrote:
> > > Hi Everyone,
> > >
> > > I have the following task. I have a set of documents in multiple
> > > languages. I don't know what these languages are. Any given doc may
> > > contain text in several languages mixed up. So to me these are just a
> > > bunch of Unicode text files.
> > >
> > > What I need is to implement an efficient EXACT string lookup. That is,
> > > I need to be able to find ANY Unicode string exactly as it appears. I
> > > do not care about language-specific modifications of the string. That
> > > is, if I search for a string "run", I do not need to find "ran" but I
> > > do want to find it in all of these strings below:
> > >
> > > Fox is running fast
> > > !%#^&$run!$!%@&$#
> > > run,run
> > >
> > > Is there a way of using StandardAnalyzer or any other analyzer and the
> > > corresponding query parser to find these? Again, my queries might be
> > > more or less random Unicode sequences and I need to find all their
> > > occurrences in the text.
> > >
> > > Essentially, what I am trying to do is implement substring matching
> > > more efficiently than using Java's standard substring matching methods.
> > >
> > > Thanks!
> > >
> > > Ilya Zavorin
> >
> >
> >
>

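The suffix array suggestion quoted above can be sketched in plain Java. This
is a minimal sketch under stated assumptions, not the jsuffixarrays library
and not anything Lucene provides: the class NaiveSuffixArray and its methods
are hypothetical, construction is the naive comparison sort (fine for small
inputs, far too slow and memory-hungry for large ones or for a phone), and
matching is done on UTF-16 code units with no normalization. Lookup is a
binary search over the sorted suffixes, so it costs roughly pattern length
times log of the text length.

```java
import java.util.Arrays;

// Hypothetical, naive suffix array for exact substring lookup (illustration only).
class NaiveSuffixArray {
    private final String text;
    private final Integer[] suffixes; // suffix start offsets, sorted by suffix content

    NaiveSuffixArray(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
        // Naive construction: comparison sort over all suffixes.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    // Lexicographic comparison of the suffixes starting at offsets a and b.
    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int d = text.charAt(a) - text.charAt(b);
            if (d != 0) return d;
            a++;
            b++;
        }
        return b - a; // the suffix that ran out first is shorter and sorts first
    }

    // 0 if the suffix at 'start' begins with the pattern; otherwise the sign says
    // whether the pattern sorts after (> 0) or before (< 0) that suffix.
    private int comparePattern(String pattern, int start) {
        for (int i = 0; i < pattern.length(); i++) {
            if (start + i >= text.length()) return 1; // suffix is shorter than the pattern
            int d = pattern.charAt(i) - text.charAt(start + i);
            if (d != 0) return d;
        }
        return 0;
    }

    // Binary search for the first (upper == false) or one-past-last (upper == true)
    // position in the sorted suffix list whose suffix starts with the pattern.
    private int bound(String pattern, boolean upper) {
        int lo = 0, hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = comparePattern(pattern, suffixes[mid]);
            if (c > 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int count(String pattern) {
        return bound(pattern, true) - bound(pattern, false);
    }

    boolean contains(String pattern) {
        return count(pattern) > 0;
    }

    // Start offsets of every occurrence, in ascending order.
    int[] occurrences(String pattern) {
        int from = bound(pattern, false), to = bound(pattern, true);
        int[] result = new int[to - from];
        for (int i = from; i < to; i++) result[i - from] = suffixes[i];
        Arrays.sort(result);
        return result;
    }

    public static void main(String[] args) {
        String text = "Fox is running fast\n!%#^&$run!$!%@&$#\nrun,run";
        NaiveSuffixArray sa = new NaiveSuffixArray(text);
        System.out.println(sa.count("run")); // prints 4
        System.out.println(sa.count("ran")); // prints 0
    }
}
```

The sample text is the three example strings from the original question joined
with newlines; "run" is found inside "running", between the punctuation, and
twice in "run,run", while "ran" is not found.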