lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Bug in TopFieldCollector?
Date Mon, 30 Mar 2009 08:44:12 GMT
Well, IndexSearcher also sorts its readers biggest to smallest (by
.numDocs()) for better performance (so that the queues fill up as much
as possible before hitting reader transitions).

I think it's the exception, not the rule, for when a custom comparator
would require the full array of sub-readers up front (vs, "on the fly"
which it already gets with setNextReader), so I think we should not
pass it in during construction.

This really was a rote carryover from the old API which passed in the
top IndexReader when calling newComparator.

Mike

On Mon, Mar 30, 2009 at 4:39 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Why not call IndexSearcher.getIndexReader().getSequentialSubReaders() (see
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/IndexReader.html).
> Its public and documented as this:
>
>
>
> public IndexReader[] getSequentialSubReaders()
>
>
>
> Expert: returns the sequential sub readers that this reader is logically
> composed of. For example, IndexSearcher uses this API to drive searching by
> one sub reader at a time. If this reader is not composed of sequential child
> readers, it should return null. If this method returns an empty array, that
> means this reader is a null reader (for example a MultiReader that has no
> sub readers).
>
> NOTE: for a MultiSegmentReader, which is obtained by open(java.lang.String)
> when the index has more than one segment, you should not use the sub-readers
> returned by this method to make any changes (setNorm, deleteDocument, etc.).
> Doing so will likely lead to index corruption. Use the parent reader
> instead.
>
>
>
> You only have the problem to replicate the code that gathers the subreaders
> of the subreaders itself recursively.
>
>
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> ________________________________
>
> From: Shai Erera [mailto:serera@gmail.com]
> Sent: Monday, March 30, 2009 10:20 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Bug in TopFieldCollector?
>
>
>
> Already did !
>
> Another question - I think we somehow broke TopFieldCollector ...
> Previously, in TopFieldDocCollector, it accepted an IndexReader as a
> parameter, and now it requires IndexReader[], which is called subReaders.
> Calling the 'fast' search methods with Sort has no problem obtaining that
> IndexReader[] (and sort it), but how would someone use TopFieldCollector w/o
> calling the appropriate Searcher methods?
>
> For example, since all the Searcher methods pass in fillFields = true, I
> wanted to use the method Searcher.search(Query, TopFieldCollector) in the
> test case I wrote, which BTW looks like this:
>
>   public void testSortWithoutFillFields() throws Exception {
>
>     // There was previously a bug in TopFieldCollector when fillFields was
> set
>     // to false - the same doc and score was set in ScoreDoc[] array. This
> test
>     // asserts that if fillFields is false, the documents are set properly.
> It
>     // does not use Searcher's default search methods (with Sort) since all
> set
>     // fillFields to true.
>     Sort sort = new Sort();
>     int nDocs=10;
>
>     TopDocsCollector tdc = new TopFieldCollector(sort, nDocs,
>         new IndexReader[] { ((IndexSearcher) full).getIndexReader() },
> false);
>
>     full.search(new MatchAllDocsQuery(), tdc);
>
>     ScoreDoc[] sd = tdc.topDocs().scoreDocs;
>     for (int i = 1; i < sd.length; i++) {
>       assertTrue(sd[i].doc != sd[i - 1].doc);
>     }
>   }
>
> You'll notice that creating a TopFieldCollector now is much more complicated
> and *ugly*. As a user of IndexSearcher, I can only call getIndexReader()
> which returns a single IndexReader. I don't have access to gatherSubReaders
> and sortSubReaders. I don't see why I should have access to them. So it
> forces me to create a dummy array with a single IndexReader.
>
> There are two ways I see to solve it:
> 1. Introduce a getIndexReaders() method on IndexSearcher, which will return
> an array of (sorted?) IndexReader.
> 2. Introduce a new constructor in TopFieldCollector which accepts a single
> IndexReader and make the other one package-private (for use by IndexSearcher
> only). That constructor can internally create a dummy array of readers, but
> at least it's private to the constructor and not exposed to the rest of the
> world.
>
> Otherwise, I think it ruins TopFieldCollector and will make it a lot less
> intuitive to use. At least, people who'd want to move from
> TopFieldDocCollector to TopFieldCollector, will find it very inconvenient
> and strange.
>
> What do you think? I can do that (2) as part of 1575. If (1) is better, then
> I think a different issue should be opened, because we might want to return
> such an array as sorted or something, which makes it less trivial.
>
> Shai
>
> On Mon, Mar 30, 2009 at 11:07 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>
> Looks like quite a bug, Shai!  Thanks.  It came in with LUCENE-1483.
> I would say add test case & fix it under 1575.
>
> Mike
>
> On Mon, Mar 30, 2009 at 3:50 AM, Shai Erera <serera@gmail.com> wrote:
>> Hi
>>
>> As I prepared the patch for 1575, I noticed a strange implementation in
>> TopFieldCollector's topDocs():
>>
>>     ScoreDoc[] scoreDocs = new ScoreDoc[queue.size()];
>>     if (fillFields) {
>>       for (int i = queue.size() - 1; i >= 0; i--) {
>>         scoreDocs[i] = queue.fillFields((FieldValueHitQueue.Entry)
>> queue.pop());
>>       }
>>     } else {
>>       Entry entry = (FieldValueHitQueue.Entry) queue.pop();
>>       for (int i = queue.size() - 1; i >= 0; i--) {
>>         scoreDocs[i] = new FieldDoc(entry.docID,
>>                                     entry.score);
>>       }
>>     }
>>
>>     return new TopFieldDocs(totalHits, scoreDocs, queue.getFields(),
>> maxScore);
>>
>>
>> Notice that if fillFields is true, then documents are popped from the
>> queue.
>> However if it's false, then the first document is popped out of the queue
>> and used to populate the entire ScoreDoc[]? I believe that's wrong, right?
>> Otherwise, the returned TopFieldDocs.scoreDocs array will include the same
>> document and score?
>>
>> I noticed there's no test case for that ... TopFieldCollector's
>> constructor
>> is called only from IndexSearcher.search(Weight, Filter, int, Sort,
>> boolean
>> /* fillFields */) which is called from IndexSearcher.search(Weight,
>> Filter,
>> int, sort) with fillFields always set to true. So perhaps that's why this
>> hasn't showed up?
>>
>> BTW, the now deprecated TopFieldDocCollector's topDocs() implementation
>> looks like this:
>>
>>     FieldSortedHitQueue fshq = (FieldSortedHitQueue)hq;
>>     ScoreDoc[] scoreDocs = new ScoreDoc[fshq.size()];
>>     for (int i = fshq.size()-1; i >= 0; i--)      // put docs in array
>>       scoreDocs[i] = fshq.fillFields ((FieldDoc) fshq.pop());
>>
>>     return new TopFieldDocs(totalHits, scoreDocs,
>>                             fshq.getFields(), fshq.getMaxScore());
>>
>> It assumes fillFields is always true and always pops elements out of the
>> queue.
>>
>> If this is a bug, I can fix it as part of 1575, as I'm touching that class
>> anyway. I can also add a test case ... The fix is very simple BTW, just
>> move
>> the line "Entry entry = (FieldValueHitQueue.Entry) queue.pop();" inside
>> the
>> for loop.
>>
>> Shai
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message