lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arun Kumar K <arunk...@gmail.com>
Subject Re: Wild Card Query Performance
Date Fri, 29 Mar 2013 11:18:52 GMT
Thanks Uwe for the detailed & clear response.
I will definitely go further into details of 4.0 improvements.

Thanks,
Arun


On Fri, Mar 29, 2013 at 4:21 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi Arun,
>
> this is hard to explain with one single reason - To conclude:
> - A wildcard query that consists only of a prefix ("ab*") is internally
> implemented by iterating all terms starting with "ab" and ending with the
> one before "ac"). This is not different to Lucene 3.x and there is no room
> for optimization. If you look at
> o.a.l.util.automaton.CompiledAutomaton#getTermsEnum(), you see that prefix
> terms are implemented by PrefixTermsEnum, that uses the above algorithm and
> is mostly a copy of the Lucene 3.x PrefixTermEnum.
> - A more complex wildcard or regex query that does not solely consist of a
> prefix but more restrictions after the prefix is faster in 4.x than in
> Lucene 3.x, because Lucene 3.x was only able to scan all terms starting
> with the common prefix ("ab*xy" would have approx. same speed as a prefix
> "ab*", it just removes all terms not matching *xy in a second step). This
> is also the improvement in FuzzyQuery: In Lucene 3.x a FuzzyQuery has no
> prefix at all (unless you specify one, but that limits the fuzzy use), so
> it has to scan all terms in the term dictionary and filter them against the
> Levenshtein distance. In Lucene 4.x, it can use an automaton, to
> effectively seek inside the term enum to find all terms that match the
> (non-prefix) regex, wildcard or Levensthein distance. In short: Lucene 4 is
> able to step over terms that cannot match the automaton. A cool example is
> the following regex: "(ht|f)tp://.*". In Lucene 3.x this would scan all
> terms because there is no common prefix, in 4.x this expands to something
> similar to the intersection of 2 prefix queries, so the automaton
> intersection between the regex and the term dictionary will seek to term
> "ftp://" first, iterate until the last term of this prefix and then seeks
> to "http://", doing the same.
>
> If you only have a prefix query (wildcard "ab*" or regex "ab.*"), then the
> query is executed in the same way like in Lucene 3.x, no automatons are
> involved at all! The reason why you see a speed improvement is not caused
> by the automaton implementation (it's not used at all), but it is caused by
> a complete rewrite of Lucene's algorithms regarding the term dictionary and
> the posting lists (Lucene Codecs). E.g. Lucene 4.0 uses a new term
> dictionary based on FST, Lucene 4.0 no longer creates Strings all the time
> (all terms are just slices inside a large byte[]),... and many more
> improvements. Also when scanning the posting lists to find all documents
> related to the terms found, Lucene uses different data structures (e.g.,
> block postings in 4.1). So it’s a number of improvements that speed up the
> index (also while indexing!). Just review the numerous presentations on
> conferences or the blog posts by Mike McCandless (
> http://blog.mikemccandless.com/).
>
> > - Can we do something to improve speed for such queries ?
>
> You cannot improve speed of pure prefix queries. Maybe another data
> structure is helpful? E.g. if you scan for dates as string (queries on
> dates like "201301*"), it might be better to use a numeric field and use
> NumericRangeQuery(20130101, 20130131)? Another approach for improving
> prefix queries is indexing additional terms: If you are always searching
> for a 2-char prefix for "ab*", then simply index an additional term in a
> separate field with 2 chars (e.g., "ab") in your documents while indexing
> and just search for "ab"? This is very fast! This also applies to Lucene
> 3.x!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > Sent: Friday, March 29, 2013 11:13 AM
> > To: java-user
> > Subject: Re: Wild Card Query Performance
> >
> > Hi Uwe,
> >
> > Thanks for the info.
> > You were mentioning about term dictionary and and other index
> > components. I didn't get this.
> > - What could be the other factors that improve the speed of such query ?
> > Can u explain or give some pointers to this ?
> > - Can we do something to improve speed for such queries ?
> >
> > Also other observation is indexing time has increased by around 6% in
> 4.0.
> >
> > Arun
> >
> > On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> > > Hi,
> > >
> > > It depends on the type of wildcard query. If you only have a prefix
> > > (ab*), they rewrite to a simple PrefixQuery and this one is
> > > implemented exactly like in 3.x, so you only see the speed
> > > improvements of Lucene 4.0 in the term dictionary and and other index
> > > components, not related to the query itsself.
> > >
> > > If you have wildcards like ab?xy, then this query will be multiple
> > > times faster than in 3.x, because the "?" wildcard can only expand to
> > > a limited set of terms, while in Lucene 3.x, it still scans all terms
> > > with prefix "ab". The same applies to other wildcard constructs, if
> > > they limit more than just prefix.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: uwe@thetaphi.de
> > >
> > >
> > > > -----Original Message-----
> > > > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > > > Sent: Friday, March 29, 2013 10:38 AM
> > > > To: java-user
> > > > Subject: Wild Card Query Performance
> > > >
> > > > Hi Guys,
> > > >
> > > > I have been testing the search time improvement in Lucene 4.0 from
> > > > Lucene
> > > > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> > > >
> > > > For a 2GB size index with 4000000 docs, the following observations
> > > > were
> > > > made:
> > > >
> > > > Around 3X improvement with and without STRING sort on a sortable
> > field.
> > > >
> > > > I guess this improvement is because of the Automation Query by
> > > > Robert which is used in WildCard Queries.
> > > >
> > > > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but
> > > > these wildcard queries are not that faster comparatively.
> > > >
> > > > I have used default codecs and postings format.
> > > >
> > > > Did i miss something or is it the max improvement that we can expect
> > > > currently for WildCard Queries?
> > > >
> > > >
> > > > Arun
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message