lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Itamar Syn-Hershko" <ita...@divrei-tora.com>
Subject RE: Why Lucene has to rewrite queries prior to actual searching?
Date Tue, 08 Apr 2008 13:18:34 GMT
Paul,

I don't see how this answers the question. I was asking why Lucene has to
access the index with exact terms, and not use RegEx or simpler wildcards
support internally? If Lucene will be able to look for "w?rd" or "wor*" and
treat the wildcards as wildcards, this will greatly improve speed of
searches and will eliminate the need for Query rewriting.

Since some people may want to index chars like those used in wildcards, they
could be escaped (or, those people will use the standard search classes
available today instead). I'm not entirely sure what part of Lucene does the
actual access to the terms and position vectors, but if it could be
sub-classed or cloned, and then modified to honor wildcards or even RegEx,
that would bring Lucene to new heights. Unless, again, there is a specific
reason why this can't be done.

Itamar. 

-----Original Message-----
From: Paul Elschot [mailto:paul.elschot@xs4all.nl] 
Sent: Tuesday, April 08, 2008 1:56 AM
To: java-user@lucene.apache.org
Subject: Re: Why Lucene has to rewrite queries prior to actual searching?

Itamar,

Have a look here:
http://lucene.apache.org/java/2_3_1/scoring.html

Regards,
Paul Elschot

Op Tuesday 08 April 2008 00:34:48 schreef Itamar Syn-Hershko:
> Paul and John,
>
> Thanks for your quick reply.
>
> The problem with query rewriting is the beforementioned 
> MaxClauseException. Instead of inflating the query and passing a 
> deterministic list of terms to the actual search routine, Lucene could 
> have accessed the vectors in the index using some sort of filter. So, 
> for example, if it knows to access "Foobar" by its name in the index, 
> why can't it take "Foo*" and just get all the vectors until "Fop" is 
> met (for example). Why does it have to get deterministic list of 
> terms?
>
> I will take a look at the Scorer - can you describe in short what 
> exactly it does and where and when it is being called?
>
> I don't get John's comment though - Query::rewrite is being called 
> prior to the actual searching (through QueryParser), how come it can 
> use "information gathered from IndexReader at search time"?
>
> Itamar.
>
> -----Original Message-----
> From: Paul Elschot [mailto:paul.elschot@xs4all.nl]
> Sent: Tuesday, April 08, 2008 12:57 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why Lucene has to rewrite queries prior to actual 
> searching?
>
> Itamar,
>
> Query rewrite replaces wildcards with terms available from the index.
> Usually that involves replacing a wildcard with a BooleanQuery that is 
> an effective OR over the available terms while using a flat 
> coordination factor, i.e. it does not matter how many of the available 
> terms actually match a document, as long as at least one matches.
>
> For the required query parts (AND like), Scorer.skipTo() is used, and 
> that could well be the filter mechanism you are referring to; have a 
> look at the javadocs of Scorer, and, if necessary, at the actual code 
> of ConjunctionScorer.
>
> Regards,
> Paul Elschot
>
> Op Monday 07 April 2008 23:13:09 schreef Itamar Syn-Hershko:
> > Hi all,
> >
> > Can someone from the experts here explain why Lucene has to get a 
> > "rewritten" query for the Searcher - so Phrase or Wildcards queries 
> > have to rewrite themselves into a "primitive" query, that is then 
> > passed to Lucene to look for? I'm probably not familiar too much 
> > with the internals of Lucene, but I'd imagine that if you can 
> > inflate a query using wildcards via xxxxQuery sub classing, you 
> > could as easily (?) have some sort of Filter mechanism during the 
> > search, so that Lucene retrieves the Position vectors for all the 
> > terms that pass that filter, instead of retrieving only the position 
> > data for deterministic terms (with no wildcards etc.). If that was 
> > possible to do somehow, it could greatly increase the searchability 
> > of Lucene indices by using RegEx (without re-writing and getting the 
> > dreaded MaxClauseCount error) and similar.
> >
> > Would love to hear some insights on this one.
> >
> > Itamar.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message