lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <>
Subject [jira] Commented: (LUCENE-879) Document number integrity merge policy
Date Fri, 11 May 2007 18:43:15 GMT


Doron Cohen commented on LUCENE-879:

I skimmed through the patch and I understand that all terms and postings 
of deleted docs are discarded, and, instead, an empty doc is added.

I would like to comment on the idea behind this.

I think this satisfies part of the needs of (some) applications,
assuming that deletions are caused mainly by document updates.

For example, assume 5 initial documents {A,B,C,D,E} whose internal ids
are {0,1,2,3,4}, and that these ids are used as keys into the consumer's
secondary storage.

Now, docs B and D are updated - so the internal ids would change.
As of now, they become:  {A:0, C:1, E:2, B`:3, D`:4}.
With this patch, I believe they would become:  {A:0, _:1, C:2, _:3, E:4, B`:5, D`:6}.

So, accessing the secondary storage still works nicely for the unchanged
docs A, C, E, but the keys in the secondary storage have to be modified for the
updated documents B and D.
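
To make the two numbering schemes concrete, here is a minimal sketch in plain
Java (no Lucene classes involved). It assumes an update is a delete followed by
an add, and models a merge as a pass over the document slots. The class and
method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of doc-id assignment after a merge; not Lucene code.
class DocIdRenumbering {
    // Placeholder standing in for the empty document the patch is said to add.
    static final String EMPTY = "_";

    // Current behavior: deleted docs are squeezed out, so later docs shift down.
    static List<String> mergeCompacting(List<String> docs, Set<Integer> deleted) {
        List<String> merged = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.contains(id)) {
                merged.add(docs.get(id));
            }
        }
        return merged;
    }

    // Behavior as I understand the patch: each deleted doc keeps its slot,
    // filled with an empty doc, so the ids of surviving docs do not move.
    static List<String> mergePreserving(List<String> docs, Set<Integer> deleted) {
        List<String> merged = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            merged.add(deleted.contains(id) ? EMPTY : docs.get(id));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Updating B and D = deleting ids 1 and 3, then appending B` and D`.
        List<String> docs = Arrays.asList("A", "B", "C", "D", "E", "B`", "D`");
        Set<Integer> deleted = new HashSet<>(Arrays.asList(1, 3));
        System.out.println(mergeCompacting(docs, deleted));
        System.out.println(mergePreserving(docs, deleted));
    }
}
```

The position of each entry in the returned list is its internal id after the
merge, so the two lists correspond to the two id assignments given above.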

This is probably not too bad, because the application updates the secondary
storage anyhow, so why not update the access key at the same
time - especially if the application keeps track of the number of added documents.

I like this idea, but can see a few issues:

1) statistics are somewhat distorted - docCount, used in search-time
    computations (idf), now (always) includes docs that were deleted.

2) In the long run, the size of the norms grows, so more memory is used.
     Eventually a merge-and-clean/squeeze might be required, but I guess the 
     application can do that in a controlled and efficient manner, updating the 
     secondary storage ids at the same time.
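
Both issues can be put in numbers with a small sketch. It assumes an idf of the
form log(numDocs / (docFreq + 1)) + 1, which I believe is what DefaultSimilarity
uses, and a norms cost of one byte per document slot per field that has norms.
The class and method names are hypothetical.

```java
// Hypothetical helper for estimating the cost of keeping deleted slots.
class DeletedDocStats {
    // Assumed idf shape: log(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Assumed norms cost: one byte per doc slot per field with norms.
    static long normsBytes(int docSlots, int fieldsWithNorms) {
        return (long) docSlots * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // A term that occurs in one live doc: 5 slots without the patch,
        // 7 slots with it (the two empty docs are still counted).
        System.out.println(idf(1, 5));
        System.out.println(idf(1, 7));
        System.out.println(normsBytes(5, 2) + " vs " + normsBytes(7, 2));
    }
}
```

Since the postings of the deleted docs are discarded, docFreq stays the same
while numDocs grows, so under this assumption idf drifts upward with every
update that leaves an empty doc behind.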

How about a different - more external - approach: not changing the behavior of
internal ids, but rather using payloads to store external IDs, and, when opening a
new reader, reading these IDs (once) into an int array that maps from
internal IDs to application IDs. This information is then readily available
at search time for referencing the secondary repository. Having these IDs as
payloads should allow loading them relatively fast, so hopefully warming a new
reader would not be too slow as a result. That is part 1 of the price of this
approach. Part 2 is the memory taken for the IDs - 4 bytes per doc per reader.
Part 3 is the complexity of using this, but I haven't thought about the API yet.
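
A rough sketch of the mapping side of this idea, in plain Java. Reading the IDs
from payloads is not shown; the sketch assumes they have already been read, in
internal-id order, when the reader was opened. The class name and its methods
are hypothetical, since no API has been proposed yet.

```java
// Hypothetical per-reader map from internal doc ids to application ids.
class ExternalIdMap {
    private final int[] internalToExternal;

    // externalIdsInDocOrder[i] is the application id stored with internal doc i.
    // In the real approach this array would be filled from payloads, once,
    // when the reader is opened.
    ExternalIdMap(int[] externalIdsInDocOrder) {
        this.internalToExternal = externalIdsInDocOrder.clone();
    }

    // Search-time lookup: internal id of a hit -> key in the secondary storage.
    int externalId(int internalDocId) {
        return internalToExternal[internalDocId];
    }

    // Memory taken by the map itself: 4 bytes per doc.
    long memoryBytes() {
        return 4L * internalToExternal.length;
    }

    public static void main(String[] args) {
        // After the update and a regular merge the internal order is
        // A, C, E, B`, D`; the application ids stay 0, 2, 4, 1, 3.
        ExternalIdMap map = new ExternalIdMap(new int[] {0, 2, 4, 1, 3});
        System.out.println(map.externalId(3)); // hit on B` -> application id 1
        System.out.println(map.memoryBytes());
    }
}
```

With such a map the internal ids are free to change on every merge, and the
keys in the secondary storage never need to be rewritten.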


> Document number integrity merge policy
> --------------------------------------
>                 Key: LUCENE-879
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: LUNCENE-879.diff
> This patch allows document numbers to stay the same even after merge of segments with
> Consumer needs to do this:
> indexWriter.setSkipMergingDeletedDocuments(false);
> The effect will be that deleted documents are replaced by a new Document() in the merged
> segment, but not marked as deleted. This should probably be some policy thingy that allows
> for different solutions such as keeping the old document, etc.
> Also see

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
