lucene-dev mailing list archives

From "Trejkaz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
Date Thu, 25 Nov 2010 13:59:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935765#action_12935765 ]

Trejkaz commented on LUCENE-2348:
---------------------------------

That is exactly the workaround we applied to our own filters, including our private copy of a
filter which works like DuplicateFilter: all the ones which need the context now take the reader
up-front. The problem now is that we have to use a different filter instance for each reader.
Previously we were caching them globally, and somewhere in the system we are evidently still
caching them globally, because one time in a million we find the wrong filter being used on the
wrong reader. I am now thinking of making another kind of context-sensitive filter which somehow
knows about all readers open in the entire JVM (e.g. we hook the place where we open the
top-level reader and push information about its structure into some global registry).
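
To make that idea concrete, here is a minimal sketch of the kind of global registry I have in
mind. The class and method names are made up for illustration (this is not from any patch), and
it assumes the sub-readers returned by the top-level reader are the same instances later passed
to getDocIdSet():

{code}
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;

// Illustrative sketch only: a global registry populated at the one place where we open the
// top-level reader, so that a filter handed only a segment reader in getDocIdSet() can still
// look up the enclosing context it needs.
public final class ReaderContextRegistry {

    // Weak keys so that closed and garbage-collected readers do not accumulate.
    private static final Map<IndexReader, IndexReader> SEGMENT_TO_TOP =
            Collections.synchronizedMap(new WeakHashMap<IndexReader, IndexReader>());

    private ReaderContextRegistry() {}

    // Call this right after opening the top-level reader.
    public static void register(IndexReader topLevelReader) {
        IndexReader[] segments = topLevelReader.getSequentialSubReaders();
        if (segments == null) {
            // A reader with no sub-readers is its own context.
            SEGMENT_TO_TOP.put(topLevelReader, topLevelReader);
            return;
        }
        for (IndexReader segment : segments) {
            SEGMENT_TO_TOP.put(segment, topLevelReader);
        }
    }

    // A context-sensitive filter would call this from getDocIdSet(segmentReader).
    public static IndexReader topLevelFor(IndexReader segmentReader) {
        return SEGMENT_TO_TOP.get(segmentReader);
    }
}
{code}

A filter could then look up the top-level reader inside getDocIdSet() instead of needing it at
construction time, which would let a single filter instance be cached and shared across readers
again.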

I think Robert's comments possibly stem from the misconception that DuplicateFilter somehow
works like field collapsing. I wrote a test to illustrate how it actually behaves, partly to
make sure I wasn't confused myself (since he seemed to think I was...):

{code}
import static org.junit.Assert.assertEquals;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DuplicateFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;

public class TestDuplicateFilter {

    IndexReader reader;
    IndexSearcher searcher;

    @Before
    public void setUpSampleData() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc;

        // doc 0: first occurrence of id "1"
        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "a", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // doc 1: duplicate of id "1", differing only in "text"
        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "b", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // doc 2: a unique id with no duplicates
        doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "c", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        reader = IndexReader.open(dir, true);
        searcher = new IndexSearcher(reader);
    }

    @Test
    public void testHitOnOriginal() throws Exception {
        Filter filter = new DuplicateFilter("id", DuplicateFilter.KM_USE_FIRST_OCCURRENCE,
                DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "a")), filter, 3);
        assertEquals("Expected one hit - matched the original", 1, docs.totalHits);
        assertEquals("Wrong doc hit", 0, docs.scoreDocs[0].doc);
    }

    @Test
    public void testHitOnCopy() throws Exception {
        Filter filter = new DuplicateFilter("id", DuplicateFilter.KM_USE_FIRST_OCCURRENCE,
                DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "b")), filter, 3);
        // Field collapsing would return one hit here, which would be undesirable:
        assertEquals("Expected no hits - matched the copy", 0, docs.totalHits);
    }
}
{code}


> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without taking into
> account that getDocIdSet() will be called once per segment and only with each segment's
> local reader.
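
To make the failure mode concrete, here is a rough illustration (simplified, not the actual
Lucene source; topLevelReader and filter stand in for whatever the searcher is using) of the
per-segment call pattern the description refers to:

{code}
// With a multi-segment index, the searcher asks the filter for a DocIdSet once per segment,
// handing it only that segment's reader, whose document numbers are local to that segment.
for (IndexReader segmentReader : topLevelReader.getSequentialSubReaders()) {
    DocIdSet segmentSet = filter.getDocIdSet(segmentReader);
    // DuplicateFilter only sees the terms of this one segment here, so "keep first occurrence"
    // effectively becomes "keep first occurrence per segment", and duplicates split across
    // two segments are never detected at all.
}
{code}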

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



