lucene-java-user mailing list archives

From Michael Sokolov <>
Subject Re: external file stored field codec
Date Fri, 18 Oct 2013 12:00:33 GMT
On 10/18/2013 1:08 AM, Shai Erera wrote:
>> The codec intercepts merges in order to clean up files that are no longer
>> referenced
> What happens if a document is deleted while there's a reader open on the
> index, and the segments are merged? Maybe I misunderstand what you meant by
> this statement, but if the external file is deleted, since the document is
> "pruned" from the index, how will the reader be able to read the stored
> fields from it? How do you track references to the external files?
Right now you get a FileNotFoundException, or a missing field value, 
depending on how you configure the codec.  I believe the tests pass only 
because they don't check for the missing field value.  In any case, I have 
a test (like the one you wrote, but one that checks the field value 
explicitly) that exposes this problem.  My reasoning was that this is 
similar to the situation with NFS: the user has to be aware of it and 
deal with it by installing an IndexDeletionPolicy that maintains old 
commits.  I don't see what else can be done without some (possibly 
heavyweight) additional tracking/garbage-collection mechanism.
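As a sketch of that configuration (IndexWriterConfig, NoDeletionPolicy, 
and Version are real Lucene classes; the analyzer and directory variables 
and the surrounding setup are elided), keeping every commit looks 
something like:

```java
// Keep all commits, so point-in-time readers can still resolve the
// external files referenced by older segments.  "analyzer" and
// "directory" are assumed to exist; pick the Version constant that
// matches your Lucene release.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
iwc.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
IndexWriter writer = new IndexWriter(directory, iwc);
```

A real deployment would of course want a policy that eventually prunes 
old commits rather than keeping them forever.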

In our case (a document archive), this behavior may be acceptable, but 
it's certainly one of the main areas that concerns me.  It would be nice 
if it were possible to receive an event when all outstanding readers for 
a commit are closed: that way we could clean up then, instead of at 
commit time, but I don't think that's how Lucene works?  At least I 
couldn't see how to do it, and given the discussion of NFS in the 
IndexDeletionPolicy documentation, I assumed it wasn't possible.
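To make the clean-up-on-last-reader idea concrete, here is a minimal, 
hypothetical tracker in plain Java (none of these names are Lucene APIs): 
each commit generation carries a reader count, and a superseded commit's 
external files are reclaimed only once its count drops to zero.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical tracker: maps a commit generation to the external files
// it references plus a count of readers still holding it open.  When
// the last reader of a superseded commit closes, its files are safe to
// unlink.
class CommitFileTracker {
    static final class CommitRefs {
        final AtomicInteger readers = new AtomicInteger();
        final List<String> externalFiles = new CopyOnWriteArrayList<>();
        volatile boolean superseded = false; // a newer commit replaced it
    }

    private final Map<Long, CommitRefs> commits = new ConcurrentHashMap<>();
    private final List<String> deleted = new CopyOnWriteArrayList<>();

    void register(long generation, List<String> files) {
        CommitRefs refs = new CommitRefs();
        refs.externalFiles.addAll(files);
        commits.put(generation, refs);
    }

    void openReader(long generation) {
        commits.get(generation).readers.incrementAndGet();
    }

    // Called when a reader closes: if the commit is already superseded
    // and this was the last reader, its external files can go.
    void closeReader(long generation) {
        CommitRefs refs = commits.get(generation);
        if (refs.readers.decrementAndGet() == 0 && refs.superseded) {
            cleanup(generation, refs);
        }
    }

    // Called on a new commit: mark the old one superseded; clean up
    // immediately only if no reader still holds it open.
    void supersede(long generation) {
        CommitRefs refs = commits.get(generation);
        refs.superseded = true;
        if (refs.readers.get() == 0) {
            cleanup(generation, refs);
        }
    }

    private void cleanup(long generation, CommitRefs refs) {
        deleted.addAll(refs.externalFiles); // real code would delete files
        commits.remove(generation);
    }

    List<String> deletedFiles() { return deleted; }
}
```

The "heavyweight" part is making something like this crash-safe: the 
counts live in memory, so a restart would still need a scan to find 
orphaned files.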

Another unsolved problem is how to clean up after segments become empty.  
Normally an empty segment disappears during a merge by simply not being 
copied, but in our case we have to actively delete its external files.  
I haven't looked at this carefully yet, but I have a couple of ideas.  
One is to use the Lucene docids as part of the filename: as docids are 
re-assigned, we would rename the files, unlinking the old ones with the 
same docid in the process.  But I'm not totally clear on how the docid 
renumbering works, so I'm not sure whether that's feasible.  Another 
idea is to use filesystem hard linking in some way as a 
reference-counting mechanism, but that would restrict this to Java 7.  
Finally, I suppose it's possible to build some data structure that 
actively manages the file references.
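The hard-link idea can be sketched with Java 7's java.nio.file API 
(Files.createLink is the real NIO call; the segmentDir/dataFile naming 
is just illustrative): each segment holds its own link to the shared 
data file, and the filesystem itself acts as the reference counter, 
freeing the bytes only when the last link is deleted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the hard-link idea (Java 7+): every segment that references
// an external value keeps its own hard link to the data file.  The
// filesystem keeps the underlying inode alive until all links are gone,
// so deleting a segment's link is a safe, implicit decrement.
class HardLinkRefs {
    // Create a per-segment link to a shared data file.
    static Path link(Path segmentDir, Path dataFile) throws IOException {
        Path link = segmentDir.resolve(dataFile.getFileName());
        return Files.createLink(link, dataFile);
    }

    // Dropping a segment just unlinks its copy; the data survives as
    // long as any other segment still links to it.
    static void unlink(Path link) throws IOException {
        Files.delete(link);
    }
}
```

Deleting the original name then no longer destroys the data while any 
segment still links to it; the catch is that hard links require all the 
files to live on one filesystem.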

I guess my initial concern was with testing performance to see if it was 
even worth trying to solve these problems.  Now I think it is, but they 
are not necessarily easy to solve.

