incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: Snapshot
Date Wed, 25 Mar 2009 16:39:57 GMT
On Wed, Mar 25, 2009 at 07:38:45AM -0400, Michael McCandless wrote:
> > However, I have found it difficult to stop the caught exception from leaking
> > memory in the event of a retry.  Hopefully we can fix that, but it's tricky.
> This is tricky in Lucene, too.  You must go and close any of the
> segments that did succeed in opening.

The way KS handles try-catch is to punt back to the host.  There are two
PolyReader methods left for the binding to implement, Try_Read_Snapshot() and
Try_Open_SegReaders().  If they succeed, they return the object, and if they
fail, they return an error message; PolyReader_open() can key its retry logic
off of what kind of object came back.
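To make the retry idea concrete, here is a minimal sketch in C.  Everything in it is hypothetical: the TryResult struct, the TryFunc callback type, open_with_retry(), and the flaky_open() demo are stand-ins invented for illustration, not the actual KS or Lucy API.  The real Try_Read_Snapshot() and Try_Open_SegReaders() hand back host objects rather than a tagged struct.

```c
#include <stddef.h>

/* Hypothetical tagged result: either an object or an error message. */
typedef enum { RESULT_OBJECT, RESULT_ERROR } ResultKind;

typedef struct {
    ResultKind  kind;
    void       *object;   /* valid when kind == RESULT_OBJECT */
    const char *message;  /* valid when kind == RESULT_ERROR  */
} TryResult;

/* Stand-in for a host-implemented "try" method. */
typedef TryResult (*TryFunc)(void *context);

/* Call try_func up to max_attempts times.  Return the object from the first
 * success, or NULL with *last_error set if every attempt failed. */
static void*
open_with_retry(TryFunc try_func, void *context, int max_attempts,
                const char **last_error) {
    *last_error = NULL;
    for (int i = 0; i < max_attempts; i++) {
        TryResult result = try_func(context);
        if (result.kind == RESULT_OBJECT) {
            *last_error = NULL;
            return result.object;
        }
        /* An error came back: remember it and go around again. */
        *last_error = result.message;
    }
    return NULL;
}

/* Demo callback: fails until the counter behind context reaches zero. */
static TryResult
flaky_open(void *context) {
    static int dummy_object = 42;
    int *failures_left = (int*)context;
    if (*failures_left > 0) {
        TryResult err = { RESULT_ERROR, NULL, "file not found" };
        (*failures_left)--;
        return err;
    }
    else {
        TryResult ok = { RESULT_OBJECT, &dummy_object, NULL };
        return ok;
    }
}
```

The point of the shape is that the caller never sees an exception; it only inspects what kind of thing came back and decides whether to loop.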

The trick is to do nothing within those calls that leaks if it fails.  

I think it may be possible, for core components at least, but it requires
fine-grained return-code checking.  We probably won't be able to recover from
"stale NFS filehandle" exceptions without leaking, but if it's just a matter
of a file being found or not, we should be able to handle things gracefully.

Each SegReader is made up of DataReaders, and the DataReaders are created by
calling Architecture's factory methods.  Each DataReader creator needs to be
able to indicate failure without leaking memory.  Same goes up the chain,
through Architecture's factory methods and SegReader's constructor.  
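As a rough illustration of "indicate failure without leaking", here is a sketch of a constructor that checks each step and tears itself down before reporting failure.  All names (DataReaderSketch, load_component(), tracked_alloc(), the live_allocations counter) are made up for this example, and a "missing file" is simulated with a flag rather than real I/O.

```c
#include <stddef.h>
#include <stdlib.h>

/* Count of live allocations, so a caller can check that nothing leaked. */
static int live_allocations = 0;

static void*
tracked_alloc(size_t size) {
    void *ptr = calloc(1, size);
    if (ptr != NULL) { live_allocations++; }
    return ptr;
}

static void
tracked_free(void *ptr) {
    if (ptr != NULL) {
        free(ptr);
        live_allocations--;
    }
}

/* Hypothetical reader owning two sub-resources. */
typedef struct {
    void *index_data;
    void *record_data;
} DataReaderSketch;

/* Safe to call on a partially constructed object. */
static void
data_reader_destroy(DataReaderSketch *self) {
    if (self == NULL) { return; }
    tracked_free(self->index_data);
    tracked_free(self->record_data);
    tracked_free(self);
}

/* Stand-in for opening a file: yields NULL when the file is "missing". */
static void*
load_component(int file_found) {
    return file_found ? tracked_alloc(16) : NULL;
}

/* Return a reader, or NULL on failure with everything already released. */
static DataReaderSketch*
data_reader_open(int index_found, int records_found) {
    DataReaderSketch *self
        = (DataReaderSketch*)tracked_alloc(sizeof(DataReaderSketch));
    if (self == NULL) { return NULL; }

    self->index_data = load_component(index_found);
    if (self->index_data == NULL) {
        data_reader_destroy(self);
        return NULL;
    }

    self->record_data = load_component(records_found);
    if (self->record_data == NULL) {
        data_reader_destroy(self);
        return NULL;
    }

    return self;
}
```

Because the object is zero-filled on allocation, the destructor can run at any point during construction and release only what was actually acquired.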

The only problem is that Architecture's factory methods are supposed to be
able to return NULL if a component isn't needed.  That makes error checking
harder, because we can't tell if the NULL was intended or not.

If LexReader's constructor throws an exception because a file wasn't found,
what should Arch_Make_Lex_Reader() do?  LexReader can free itself before
throwing the exception or returning NULL, so it won't leak; the problem is how
we indicate to SegReader's constructor that there was a real failure.


Got it.

We just need to have SegReader wrap all of its calls to Architecture's factory
methods in "try" blocks.  Then, it can catch an exception, free itself, and
rethrow for the benefit of the higher-level open call.  It'll mean a bunch more
silly private methods like SegReader_Try_Open_Lex_Reader() for the binding to
implement, but that's the way it goes.
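Here is a sketch of that catch, free, rethrow pattern.  It is only an approximation: in KS the try/catch lives in the host language, so this example fakes exceptions with setjmp/longjmp.  Every name in it (throw_error(), try_make_lex_reader(), try_open_seg_reader(), the *Sketch structs, live_objects) is hypothetical.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy exception mechanism standing in for the host's try/catch. */
static jmp_buf    *current_handler = NULL;
static const char *pending_error   = NULL;
static int         live_objects    = 0;   /* for leak checking */

static void
throw_error(const char *message) {
    pending_error = message;
    if (current_handler != NULL) { longjmp(*current_handler, 1); }
    abort();  /* nothing is catching */
}

typedef struct { int placeholder; } LexReaderSketch;
typedef struct { LexReaderSketch *lex_reader; } SegReaderSketch;

/* Factory that throws when its file is "missing". */
static LexReaderSketch*
make_lex_reader(int file_found) {
    LexReaderSketch *reader;
    if (!file_found) { throw_error("lexicon file not found"); }
    reader = (LexReaderSketch*)calloc(1, sizeof(LexReaderSketch));
    if (reader == NULL) { throw_error("out of memory"); }
    live_objects++;
    return reader;
}

/* "Try" wrapper: returns NULL on success, else the error message. */
static const char*
try_make_lex_reader(int file_found, LexReaderSketch **out) {
    jmp_buf     handler;
    jmp_buf    *outer = current_handler;
    const char *error = NULL;
    *out = NULL;
    current_handler = &handler;
    if (setjmp(handler) == 0) { *out = make_lex_reader(file_found); }
    else                      { error = pending_error; }
    current_handler = outer;
    return error;
}

static void
seg_reader_destroy(SegReaderSketch *self) {
    if (self == NULL) { return; }
    if (self->lex_reader != NULL) { free(self->lex_reader); live_objects--; }
    free(self);
    live_objects--;
}

/* Catch the factory's exception, free ourselves, then rethrow. */
static SegReaderSketch*
seg_reader_open(int lex_file_found) {
    const char *error;
    SegReaderSketch *self
        = (SegReaderSketch*)calloc(1, sizeof(SegReaderSketch));
    if (self == NULL) { throw_error("out of memory"); }
    live_objects++;
    error = try_make_lex_reader(lex_file_found, &self->lex_reader);
    if (error != NULL) {
        seg_reader_destroy(self);
        throw_error(error);
    }
    return self;
}

/* Higher-level open call: catches whatever seg_reader_open() rethrows. */
static const char*
try_open_seg_reader(int lex_file_found, SegReaderSketch **out) {
    jmp_buf     handler;
    jmp_buf    *outer = current_handler;
    const char *error = NULL;
    *out = NULL;
    current_handler = &handler;
    if (setjmp(handler) == 0) { *out = seg_reader_open(lex_file_found); }
    else                      { error = pending_error; }
    current_handler = outer;
    return error;
}
```

Since a real failure now arrives as an exception, a NULL return from the factory can keep its meaning of "component not needed".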

> >> It still seems like storing per-segment metadata in the snapshot would
> >> be necessary/helpful.
> >
> > As you surmised over in the "Segment" thread, that's in segmeta.json.
> Right.  Though, since you store it w/ the segment, it can't be
> versioned?  (Segment files are write once)?


> Eg, you will store new deletions against segment X with segment Y
> (when X's new deletions got flushed at the same time that segment Y
> was flushed).  So, where will segment X's new delCount be recorded?

In Segment Y's metadata.

       "deletions" : {
          "files" : {
             "seg_1" : {
                "count" : "2",
                "filename" : "seg_3/"
             },
             "seg_2" : {
                "count" : "1",
                "filename" : "seg_3/"
             }
          },
          "format" : "1"
       },
       "segmeta" : {
          "doc_count" : "0",
          "field_names" : [
          ],
          "format" : "1",
          "name" : "seg_3"
       }

> Also, what happens if I open a writer, do only deletes, and close?  Do
> you flush an empty (no added docs) segment Y simply to record the new
> deletions?

Yes.  Note the "doc_count" of 0 under the "segmeta" key in the JSON above.

> Will snapshot allow user-defined (opaque to Lucy) metadata to be recorded
> inside it?

Yes.  ("User" in this case means the developer, not end-user, though an
irresponsible dev could forward end-user data.)  Custom DataWriter
implementations are encouraged to store their metadata within the segmeta
file, rather than write it themselves to custom files.

Marvin Humphrey
