lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Wild Card Query Performance
Date Fri, 29 Mar 2013 10:51:27 GMT
Hi Arun,

this is hard to explain with one single reason - To conclude:
- A wildcard query that consists only of a prefix ("ab*") is internally implemented by iterating
all terms starting with "ab" and ending with the one before "ac"). This is not different to
Lucene 3.x and there is no room for optimization. If you look at o.a.l.util.automaton.CompiledAutomaton#getTermsEnum(),
you see that prefix terms are implemented by PrefixTermsEnum, that uses the above algorithm
and is mostly a copy of the Lucene 3.x PrefixTermEnum.
- A more complex wildcard or regex query that does not solely consist of a prefix but more
restrictions after the prefix is faster in 4.x than in Lucene 3.x, because Lucene 3.x was
only able to scan all terms starting with the common prefix ("ab*xy" would have approx. same
speed as a prefix "ab*", it just removes all terms not matching *xy in a second step). This
is also the improvement in FuzzyQuery: In Lucene 3.x a FuzzyQuery has no prefix at all (unless
you specify one, but that limits the fuzzy use), so it has to scan all terms in the term dictionary
and filter them against the Levenshtein distance. In Lucene 4.x, it can use an automaton,
to effectively seek inside the term enum to find all terms that match the (non-prefix) regex,
wildcard or Levensthein distance. In short: Lucene 4 is able to step over terms that cannot
match the automaton. A cool example is the following regex: "(ht|f)tp://.*". In Lucene 3.x
this would scan all terms because there is no common prefix, in 4.x this expands to something
similar to the intersection of 2 prefix queries, so the automaton intersection between the
regex and the term dictionary will seek to term "ftp://" first, iterate until the last term
of this prefix and then seeks to "http://", doing the same.

If you only have a prefix query (wildcard "ab*" or regex "ab.*"), then the query is executed
in the same way like in Lucene 3.x, no automatons are involved at all! The reason why you
see a speed improvement is not caused by the automaton implementation (it's not used at all),
but it is caused by a complete rewrite of Lucene's algorithms regarding the term dictionary
and the posting lists (Lucene Codecs). E.g. Lucene 4.0 uses a new term dictionary based on
FST, Lucene 4.0 no longer creates Strings all the time (all terms are just slices inside a
large byte[]),... and many more improvements. Also when scanning the posting lists to find
all documents related to the terms found, Lucene uses different data structures (e.g., block
postings in 4.1). So it’s a number of improvements that speed up the index (also while indexing!).
Just review the numerous presentations on conferences or the blog posts by Mike McCandless
(http://blog.mikemccandless.com/).

> - Can we do something to improve speed for such queries ?

You cannot improve speed of pure prefix queries. Maybe another data structure is helpful?
E.g. if you scan for dates as string (queries on dates like "201301*"), it might be better
to use a numeric field and use NumericRangeQuery(20130101, 20130131)? Another approach for
improving prefix queries is indexing additional terms: If you are always searching for a 2-char
prefix for "ab*", then simply index an additional term in a separate field with 2 chars (e.g.,
"ab") in your documents while indexing and just search for "ab"? This is very fast! This also
applies to Lucene 3.x!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Arun Kumar K [mailto:arunk786@gmail.com]
> Sent: Friday, March 29, 2013 11:13 AM
> To: java-user
> Subject: Re: Wild Card Query Performance
> 
> Hi Uwe,
> 
> Thanks for the info.
> You were mentioning about term dictionary and and other index
> components. I didn't get this.
> - What could be the other factors that improve the speed of such query ?
> Can u explain or give some pointers to this ?
> - Can we do something to improve speed for such queries ?
> 
> Also other observation is indexing time has increased by around 6% in 4.0.
> 
> Arun
> 
> On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> > Hi,
> >
> > It depends on the type of wildcard query. If you only have a prefix
> > (ab*), they rewrite to a simple PrefixQuery and this one is
> > implemented exactly like in 3.x, so you only see the speed
> > improvements of Lucene 4.0 in the term dictionary and and other index
> > components, not related to the query itsself.
> >
> > If you have wildcards like ab?xy, then this query will be multiple
> > times faster than in 3.x, because the "?" wildcard can only expand to
> > a limited set of terms, while in Lucene 3.x, it still scans all terms
> > with prefix "ab". The same applies to other wildcard constructs, if
> > they limit more than just prefix.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > > Sent: Friday, March 29, 2013 10:38 AM
> > > To: java-user
> > > Subject: Wild Card Query Performance
> > >
> > > Hi Guys,
> > >
> > > I have been testing the search time improvement in Lucene 4.0 from
> > > Lucene
> > > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> > >
> > > For a 2GB size index with 4000000 docs, the following observations
> > > were
> > > made:
> > >
> > > Around 3X improvement with and without STRING sort on a sortable
> field.
> > >
> > > I guess this improvement is because of the Automation Query by
> > > Robert which is used in WildCard Queries.
> > >
> > > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but
> > > these wildcard queries are not that faster comparatively.
> > >
> > > I have used default codecs and postings format.
> > >
> > > Did i miss something or is it the max improvement that we can expect
> > > currently for WildCard Queries?
> > >
> > >
> > > Arun
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message