lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
Date Fri, 05 Dec 2008 18:22:44 GMT


Marvin Humphrey commented on LUCENE-1476:

> Marvin, in KS/Lucy are you using random-access or iterator to access 
> deletedDocs & norms?

Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented
using BitVectors which get the next deleted doc num by calling nextSetBit() internally. 
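
As a rough illustration of that arrangement, here is a minimal sketch of a deletions enumerator layered over a bit vector. It is not the KS source: the BitVec and DelEnum structs, their fields, and both function names are hypothetical, and the bit-at-a-time scan is only the simplest thing that works (a real implementation would presumably skip whole empty bytes or words).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit vector: bit N set means doc N is deleted. */
typedef struct {
    const uint8_t *bits;
    uint32_t       capacity;   /* number of addressable bits */
} BitVec;

/* Return the first set bit at or after `start`, or 0 if there is none.
 * Doc nums start at 1, so bit 0 is never set and 0 can act as the
 * "no more deletions" sentinel. */
static uint32_t
BitVec_next_set_bit(const BitVec *bv, uint32_t start)
{
    assert(bv != NULL && bv->bits != NULL);
    for (uint32_t i = start; i < bv->capacity; i++) {
        if (bv->bits[i >> 3] & (1u << (i & 7))) {
            return i;
        }
    }
    return 0;
}

/* Hypothetical iterator-style wrapper around the bit vector. */
typedef struct {
    const BitVec *deletions;
} DelEnum;

/* Return the first deleted doc num >= target, or 0 when exhausted. */
static uint32_t
DelEnum_skip_to(const DelEnum *de, uint32_t target)
{
    return BitVec_next_set_bit(de->deletions, target);
}
```

With docs 3 and 9 deleted, DelEnum_skip_to(&de, 1) yields 3, DelEnum_skip_to(&de, 4) yields 9, and DelEnum_skip_to(&de, 10) yields the 0 sentinel.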

I happened to be coding up those classes this spring when there was the big brouhaha about
IndexReader.isDeleted().  It seemed wrong to pay the method call overhead for IndexReader.isDeleted()
on each iteration in either scorer when we could just store the next deletion:

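/* The return type is an assumption; the signature as archived has none. */
int
MatchAllScorer_next(MatchAllScorer* self)
{
    do {
        /* Advance; 0 signals that the iterator is exhausted. */
        if (++self->doc_num > self->max_docs) {
            return 0;
        }
        /* Refresh the cached deletion only once we have moved past it. */
        if (self->doc_num > self->next_deletion) {
            self->next_deletion
                = DelEnum_Skip_To(self->del_enum, self->doc_num);
        }
    } while (self->doc_num == self->next_deletion);
    return self->doc_num;
}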

(Note: in KS, the scorer's next method returns the document number; doc nums start at 1, and 0 is the
sentinel signaling iterator termination. I expect that Lucy will be the same.)

Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted()
isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted
-- unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader,
so Robert's objection doesn't apply to us.

I had always assumed we were going to have to expose isDeleted() eventually, but maybe we
can get away with zapping it. Interesting!

I've actually been trying to figure out a new design for deletions because writing them out
for big segments is our last big write bottleneck, now that we've theoretically solved the
sort cache warming issue.  I figured we would continue to need bit-vector files because they're
straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead...
Hmm, we still face the problem of outsized write cost when a segment has a large number of
deletions and you add one more...
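
To make the vbyte idea concrete, here is a minimal sketch of deletions stored as vbyte-encoded gaps between ascending doc nums, with iterator-only access. Everything in it is an assumption for illustration: the byte layout (7 payload bits per byte, low-order group first, high bit as a continuation flag), the function names, and the VByteDelEnum struct are not taken from KS or Lucy, and the decoder trusts its input to be well formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write `value` as a vbyte; returns bytes written (at most 5). */
static size_t
vbyte_encode(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)((value & 0x7F) | 0x80);
        value >>= 7;
    }
    out[n++] = (uint8_t)value;
    return n;
}

/* Encode strictly ascending doc nums (starting at 1) as vbyte gaps.
 * `out` must have room for 5 * count bytes. Returns bytes written. */
static size_t
deletions_encode(const uint32_t *doc_nums, size_t count, uint8_t *out)
{
    size_t   len  = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        assert(doc_nums[i] > prev);
        len += vbyte_encode(doc_nums[i] - prev, out + len);
        prev = doc_nums[i];
    }
    return len;
}

/* Hypothetical forward-only enumerator over the encoded gaps. */
typedef struct {
    const uint8_t *buf;
    size_t         len;
    size_t         pos;
    uint32_t       doc_num;   /* last doc num returned; starts at 0 */
} VByteDelEnum;

/* Return the next deleted doc num, or 0 when the buffer is exhausted. */
static uint32_t
VByteDelEnum_next(VByteDelEnum *de)
{
    if (de->pos >= de->len) {
        return 0;
    }
    uint32_t delta = 0;
    int      shift = 0;
    uint8_t  byte;
    do {
        byte   = de->buf[de->pos++];
        delta |= (uint32_t)(byte & 0x7F) << shift;
        shift += 7;
    } while ((byte & 0x80) && de->pos < de->len);
    de->doc_num += delta;
    return de->doc_num;
}
```

Deletions at docs 3, 130, and 20000 encode to 5 bytes here, versus 12 bytes as raw 32-bit integers. The write-cost problem is visible in the format: because each entry is a gap from its predecessor, adding a deletion in the middle means re-encoding everything after it.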

> BitVector implement DocIdSet
> ----------------------------
>                 Key: LUCENE-1476
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> BitVector can implement DocIdSet.  This is for making SegmentReader.deletedDocs pluggable.
