lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: (Lucene-core) Is Query's rewrite method mandated to preserver original Query's hashcode?
Date Wed, 22 Jan 2014 16:28:36 GMT
Thanks!

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Peng Cheng [mailto:peng@sciencescape.net] 
Sent: Wednesday, January 22, 2014 5:23 PM
To: dev@lucene.apache.org
Subject: Re: (Lucene-core) Is Query's rewrite method mandated to preserver original Query's hashcode?

 

opened as https://issues.apache.org/jira/browse/LUCENE-5409

 

On Tue, Jan 14, 2014 at 5:42 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

Yes, open an issue!

 

Uwe

 


 

From: Peng Cheng [mailto:peng@sciencescape.net] 
Sent: Tuesday, January 14, 2014 10:41 PM
To: dev@lucene.apache.org
Subject: Re: (Lucene-core) Is Query's rewrite method mandated to preserver original Query's hashcode?

 

Do you suggest that I open a JIRA ticket about it? I think it's a bug, considering common interface
standards (rewrite should not be exposed to the end user), the documentation, and running efficiency
(as you said, rewrite is slow).

 

On Tue, Jan 14, 2014 at 4:38 AM, Peng Cheng <peng@sciencescape.net> wrote:

I see. Perhaps the best solution is to put the un-rewritten block join queries into joinQueryID?
The result will be the same. Right now the code behaves very strangely if no rewrite is
called beforehand: it gives empty groups or correct results at random.

 

It's a great pleasure to read your reply; I never expected someone to respond that fast.

 

Yours Peng

 

 

On Tue, Jan 14, 2014 at 2:33 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

Hi Peng,

 

rewrite() returns a different query that will definitely not preserve the hashCode() of, or be
equals() to, the original query or any other rewritten one. The reason is that a rewritten
query is a new query that contains information about the index it will be executed on (e.g.,
it references terms from that index), so it *cannot* be equal to the original one. If it cannot
be equal, the hashCode should also be different. If you execute the query at a later stage,
you have to rewrite the original query again, because the index may have changed. And take
care: this rewrite may produce a completely different query (with a new hashCode again) if
the index changed in the meantime.

 

There is a workaround (to me it looks like the code is merely missing documentation): you
can manually rewrite the query with IndexSearcher#rewrite(query) before invoking getTopGroups().
Why is a hotfix needed?

 

Also, rewriting the query on every call of getTopGroups() is a major overhead (most queries'
rewrites are very expensive and take as long as the execution of the query itself, e.g., for
MultiTermQueries), so it should be done only once, not on every call. Maybe that's the reason
why it was left out, but it was not documented.

 

Uwe

 


 

From: Peng Cheng [mailto:peng@sciencescape.net] 
Sent: Tuesday, January 14, 2014 3:59 AM
To: dev@lucene.apache.org; ds@sciencescape.org


Subject: (Lucene-core) Is Query's rewrite method mandated to preserver original Query's hashcode?

 

Hi developers,

 

I've recently found a few bugs in advanced features of Lucene-core 4.6 (which is perfectly
normal, as those features are less likely to be used and tested). The most serious one has
rendered my ToParentBlockJoinCollector close to useless:

 

In the scorer generation stage, the ToParentBlockJoinCollector automatically rewrites
all the associated ToParentBlockJoinQuery instances (and their subqueries) and saves them in its
in-memory lookup table, joinQueryID (see the enroll() method for details). Unfortunately, in the
getTopGroups method, the ToParentBlockJoinQuery passed as a parameter is not rewritten (at least,
users are not expected to do so). When that query is looked up in the old table, the lookup
(namely _slot) always fails because rewrite() changes hashCode(), and getTopGroups ends up
returning a topGroups collection consisting only of empty groups (their hitCounts are guaranteed
to be zero).

 

I'm not sure whether rewrite() should preserve a Query's hashcode, as I've already found
many counterexamples. If it need not, then this problem can be solved by rewriting
the original BlockJoinQuery before invoking the getTopGroups method. Nevertheless, users are not
expected to do so, so I would suggest submitting a hotfix that adds the described rewrite
step.
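
As a rough illustration of the lookup failure described above and of the rewrite-before-lookup workaround, here is a minimal Java sketch. FakeQuery, its fields, and findSlot are hypothetical stand-ins, not Lucene classes; only the name joinQueryID is borrowed from the collector. It assumes that rewrite() returns a new object whose equals() and hashCode() include index-specific state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class RewriteLookupDemo {

    /** Hypothetical stand-in for a Lucene Query; not the real class. */
    static final class FakeQuery {
        final String pattern;        // what the user asked for, e.g. "lu*"
        final String expandedTerms;  // null until rewritten against an index

        FakeQuery(String pattern, String expandedTerms) {
            this.pattern = pattern;
            this.expandedTerms = expandedTerms;
        }

        /** Mimics rewrite(): returns a new query carrying index-specific terms. */
        FakeQuery rewrite(String termsFromIndex) {
            return new FakeQuery(pattern, termsFromIndex);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FakeQuery)) {
                return false;
            }
            FakeQuery other = (FakeQuery) o;
            return pattern.equals(other.pattern)
                    && Objects.equals(expandedTerms, other.expandedTerms);
        }

        @Override
        public int hashCode() {
            return Objects.hash(pattern, expandedTerms);
        }
    }

    /** Returns the slot found for the probe query, or null on a miss. */
    public static Integer findSlot(boolean rewriteBeforeLookup) {
        String indexTerms = "lucene,lucid";  // assumed terms in the index
        FakeQuery original = new FakeQuery("lu*", null);

        // Collection phase: the table is keyed by the rewritten query.
        Map<FakeQuery, Integer> joinQueryID = new HashMap<>();
        joinQueryID.put(original.rewrite(indexTerms), 0);

        // Retrieval phase: probe with the original or with a freshly rewritten query.
        FakeQuery probe = rewriteBeforeLookup ? original.rewrite(indexTerms) : original;
        return joinQueryID.get(probe);
    }

    public static void main(String[] args) {
        System.out.println("without rewrite: " + findSlot(false));
        System.out.println("with rewrite:    " + findSlot(true));
    }
}
```

In this sketch the un-rewritten probe misses the table, while a probe rewritten against the same index state finds its slot. If the index changed between the two phases, even the rewritten probe could differ from the stored key.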

 

If rewrite() must preserve the hashcode, then this is a problem in the various rewrite()
implementations, and the fix will be much harder.

 

This bug has caused widespread panic in my company, and I would like to see it fixed ASAP.
Please give me some suggestions so I know which hotfix I should be working on.

 

All the best,

 

Yours Peng

 

 

 

