lucene-java-user mailing list archives

From Daniel Noll
Subject Filters and multiple, per-segment calls to getDocIdSet
Date Thu, 25 Mar 2010 04:55:27 GMT
Hi all.

I notice that Filter.getDocIdSet() is now documented as follows:

    Note: This method will be called once per segment in
    the index during searching.  The returned {@link DocIdSet}
    must refer to document IDs for that segment, not for
    the top-level reader.
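
To make the distinction in that note concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: `toSegmentLocal`, `docBase` and `maxDoc` are names assumed for illustration, modelling a segment as a contiguous range of top-level document IDs starting at `docBase`.

```java
import java.util.BitSet;

public class SegmentLocalIds {
    /**
     * Hypothetical helper: copies the slice [docBase, docBase + maxDoc) of a
     * top-level bit set into a new bit set indexed by segment-local IDs.
     */
    static BitSet toSegmentLocal(BitSet topLevel, int docBase, int maxDoc) {
        BitSet local = new BitSet(maxDoc);
        int end = docBase + maxDoc;
        for (int doc = topLevel.nextSetBit(docBase);
             doc >= 0 && doc < end;
             doc = topLevel.nextSetBit(doc + 1)) {
            local.set(doc - docBase); // shift to a segment-local ID
        }
        return local;
    }

    public static void main(String[] args) {
        // Top-level matches: documents 1, 4 and 6.
        BitSet topLevel = new BitSet();
        topLevel.set(1);
        topLevel.set(4);
        topLevel.set(6);

        // Two assumed segments: 3 docs (docBase 0) and 4 docs (docBase 3).
        System.out.println(toSegmentLocal(topLevel, 0, 3)); // {1}
        System.out.println(toSegmentLocal(topLevel, 3, 4)); // {1, 3}
    }
}
```

A filter that returns the top-level set unchanged for the second segment would mark local documents 1, 4 and 6 instead of 1 and 3.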

If I look at Lucene's own DuplicateFilter, isn't it making the
assumption that it will only be called once?

And a related question: for those of us who want to implement
something *like* DuplicateFilter (as I did before discovering
this new Javadoc), is there a good way to go about it?  It seems like
we now need to keep a hash set of all terms seen in earlier segments, so
that when we walk the term enum for a new segment we can check which
terms have already been seen.  This will dramatically increase memory
usage compared to a single BitSet/OpenBitSet.  Is there a better way?
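
The "hash of all terms previously seen" idea can be sketched as follows. This is plain Java, not Lucene code: `CrossSegmentDedup`, `firstOccurrences` and the `String[]` model of a segment (one dedup key per segment-local document) are assumptions for illustration only.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class CrossSegmentDedup {
    // State carried from one segment to the next: every key seen so far.
    private final Set<String> seen = new HashSet<String>();

    /**
     * segmentKeys[i] is the dedup key of segment-local document i.
     * Returns the segment-local IDs of documents whose key has not
     * appeared in this or any earlier segment.
     */
    BitSet firstOccurrences(String[] segmentKeys) {
        BitSet bits = new BitSet(segmentKeys.length);
        for (int doc = 0; doc < segmentKeys.length; doc++) {
            if (seen.add(segmentKeys[doc])) { // add() returns true only the first time
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        CrossSegmentDedup dedup = new CrossSegmentDedup();
        String[] segmentA = {"a", "b", "a"};
        String[] segmentB = {"b", "c"};
        System.out.println(dedup.firstOccurrences(segmentA)); // {0, 1}
        System.out.println(dedup.firstOccurrences(segmentB)); // {1}
    }
}
```

The set grows by one entry per distinct key, which is the memory cost described above, and the result depends on the order in which segments are visited.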

Also, I presume this means that Filter is now explicitly not
thread-safe.  We weren't keeping any state in our filters before, but
now we will have to, so there is potential for a lot of new bugs if a
filter is somehow used by two queries running at the same time.
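
One way to limit that risk, sketched here under the assumption that the calling code controls how filters are created, is to give every search its own stateful instance rather than sharing one. `SeenKeys` and `newForSearch` are hypothetical names, not part of Lucene.

```java
import java.util.HashSet;
import java.util.Set;

public class PerSearchState {
    /** Hypothetical stateful helper: remembers the keys seen during one search. */
    static final class SeenKeys {
        private final Set<String> seen = new HashSet<String>();

        boolean firstTime(String key) {
            return seen.add(key);
        }
    }

    /** Hypothetical factory: each search asks for a fresh instance. */
    static SeenKeys newForSearch() {
        return new SeenKeys();
    }

    public static void main(String[] args) {
        SeenKeys searchOne = newForSearch();
        SeenKeys searchTwo = newForSearch();
        System.out.println(searchOne.firstTime("a")); // true
        System.out.println(searchTwo.firstTime("a")); // true: state is not shared
        System.out.println(searchOne.firstTime("a")); // false: already seen in search one
    }
}
```

This avoids sharing state between concurrent searches, but it also means an instance cannot be cached and reused across searches.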


Daniel Noll
Senior Developer, Nuix
Forensic and eDiscovery Software
The world's most advanced email data analysis and eDiscovery software
