lucene-dev mailing list archives

From "paul.elschot (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields
Date Mon, 14 Nov 2005 19:24:29 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12357611 ] 

paul.elschot commented on LUCENE-323:
-------------------------------------

There is an issue with the MaxDisjunctionScorer in the .zip attachment. I'm
sorry I did not see this earlier, when I posted on java-dev about it.

The problem is that MaxDisjunctionScorer uses a bubble sort to keep the subscorers
sorted by document in the next() method (line 103), and this does not scale well
as the number of subscorers increases.
Supposing the number of subscorers that match the document is N,
the amount of work to be done is proportional to (N*N) per document.
DisjunctionSumScorer uses a priority queue instead, and there the amount of work is
proportional to (N log(N)) per document.
So I would recommend rewriting MaxDisjunctionScorer to inherit from a new
superclass shared with DisjunctionSumScorer, sharing everything except the
advanceAfterCurrent() method (which could be abstract in the new superclass).
It's possible to refactor more aggressively by initializing and adapting
the score per index document using different methods, but this would take N
extra method calls per document.
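
To illustrate the priority-queue approach described above, here is a minimal
sketch. It is not the code from the attachment and does not use the Lucene
Scorer API: the SubScorer class and the maxScores() method are hypothetical
stand-ins, and each subscorer is assumed to hold strictly increasing document
ids. Every poll/add on the queue costs O(log(N)), so a document matched by N
subscorers costs on the order of N log(N).

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch of a heap-ordered max disjunction; not the Lucene API. */
final class HeapDisjunctionMaxSketch {

    /** Hypothetical stand-in for a subscorer: increasing doc ids, one score each. */
    static final class SubScorer {
        private final int[] docs;
        private final float[] scores;
        private int pos = 0;

        SubScorer(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        boolean exhausted() { return pos >= docs.length; }
        int doc() { return docs[pos]; }
        float score() { return scores[pos]; }
        void next() { pos++; }
    }

    /** Returns doc id -> maximum subscorer score, in increasing doc order. */
    static Map<Integer, Float> maxScores(List<SubScorer> subScorers) {
        // The queue keeps the subscorer with the smallest current doc on top.
        PriorityQueue<SubScorer> queue =
            new PriorityQueue<>(Comparator.comparingInt(SubScorer::doc));
        for (SubScorer s : subScorers) {
            if (!s.exhausted()) {
                queue.add(s);
            }
        }
        Map<Integer, Float> result = new LinkedHashMap<>();
        while (!queue.isEmpty()) {
            int current = queue.peek().doc();
            float max = Float.NEGATIVE_INFINITY;
            // Pop every subscorer positioned on the current document,
            // advance it, and push it back if it has documents left.
            while (!queue.isEmpty() && queue.peek().doc() == current) {
                SubScorer top = queue.poll();
                max = Math.max(max, top.score());
                top.next();
                if (!top.exhausted()) {
                    queue.add(top);
                }
            }
            result.put(current, max);
        }
        return result;
    }
}
```

A sum-based variant would differ only in how the inner loop combines the
scores, which is the part that would live in the per-class method.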

At the same time the name could be changed to DisjunctionMaxScorer
for consistency within the org.apache.lucene.search package.

Regards,
Paul Elschot


> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields
> -----------------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
> Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: TestMaxDisjunctionQuery.java, TestRanking.zip, TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java, WikipediaSimilarity.java, WikipediaSimilarity.java
>
> The attached test case demonstrates this problem and provides a fix:
>   1.  Use a custom similarity to eliminate all tf and idf effects, just to 
> isolate what is being tested.
>   2.  Create two documents doc1 and doc2, each with two fields title and 
> description.  doc1 has "elephant" in title and "elephant" in description.  
> doc2 has "elephant" in title and "albino" in description.
>   3.  Express query for "albino elephant" against both fields.
> Problems:
>       a.  MultiFieldQueryParser won't recognize either document as containing 
> both terms, due to the way it expands the query across fields.
>       b.  Expressing query as "title:albino description:albino title:elephant 
> description:elephant" will score both documents equivalently, since each 
> matches two query terms.
>   4.  Comparison to MaxDisjunctionQuery and my method for expanding queries 
> across fields.  Using notation that () represents a BooleanQuery and ( | ) 
> represents a MaxDisjunctionQuery, "albino elephant" expands to:
>         ( (title:albino | description:albino)
>           (title:elephant | description:elephant) )
> This will recognize that doc2 has both terms matched while doc1 only has 1 
> term matched, and score doc2 over doc1.
> Refinement note:  the actual expansion for "albino elephant" that I use is:
>         ( (title:albino | description:albino)~0.1
>           (title:elephant | description:elephant)~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of the 
> highest-scoring MDQ subclause plus 0.1 times the sum of the scores of the other MDQ 
> subclauses.  Thus, doc1 gets some credit for also having "elephant" in the 
> description but only 1/10 as much as doc2 gets for covering another query term 
> in its description.  If doc3 has "elephant" in title and both "albino" 
> and "elephant" in the description, then with the actual refined expansion, it 
> gets the highest score of all (whereas with pure max, without the 0.1, it 
> would get the same score as doc2).
> In real apps, tf's and idf's also come into play of course, but can affect 
> these either way (i.e., mitigate this fundamental problem or exacerbate it).
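
The scoring arithmetic described in the issue can be worked through with a
small sketch. This is not the code from the patch: the class, method, and
constant names are hypothetical, every matching field clause is assumed to
score exactly 1.0 (standing in for the custom similarity that removes tf and
idf effects), scores are assumed non-negative, and the outer BooleanQuery is
assumed to be a plain sum of its clauses with no coordination factor.

```java
/** Hypothetical sketch of the scoring arithmetic; not the patch's code. */
final class TieBreakerScoreSketch {

    /** Score of one ( | ) group: the best subclause plus tieBreaker times the rest. */
    static float groupScore(float[] subScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    /** Score of the outer ( ) query: assumed here to be a plain sum of its groups. */
    static float queryScore(float[][] groups, float tieBreaker) {
        float total = 0f;
        for (float[] group : groups) {
            total += groupScore(group, tieBreaker);
        }
        return total;
    }

    // Each row is {title score, description score};
    // row 0 is the "albino" group, row 1 is the "elephant" group.
    static final float[][] DOC1 = {{0f, 0f}, {1f, 1f}}; // elephant in both fields
    static final float[][] DOC2 = {{0f, 1f}, {1f, 0f}}; // elephant in title, albino in description
    static final float[][] DOC3 = {{0f, 1f}, {1f, 1f}}; // elephant in title, both terms in description
}
```

Under these assumptions, queryScore with a tie breaker of 0.1 gives roughly
1.1 for doc1, 2.0 for doc2, and 2.1 for doc3, and a tie breaker of 0 makes
doc2 and doc3 tie at 2.0 with doc1 at 1.0. A tie breaker of 1.0 turns each
group into a plain sum, which reproduces problem (b): doc1 and doc2 both
score 2.0.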

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



