Mailing-List: contact dev-help@apr.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 7 Jan 2003 11:37:52 -0600 (CST)
Message-Id: <200301071737.h07Hbqi62730@newton.ch.collab.net>
From: Karl Fogel <kfogel@newton.ch.collab.net>
To: dev@subversion.tigris.org
Cc: dev@apr.apache.org
Subject: serializable md5 contexts
Reply-To: kfogel@collab.net
Emacs: anything free is worth what you paid for it.

The Problem:
============

We need a way to resume checksumming when appending to stored data.
Otherwise, we'd have to recompute the MD5 context over all the data
already present, then continue with that context as the new data
comes in.

If an MD5 digest could be reverted to an unfinalized MD5 context,
this would be easy, since the representation already stores the
digest.  But, of course, that's not possible.

(This is for http://subversion.tigris.org/issues/show_bug.cgi?id=689.
The need arises because there's no guarantee that a single stream
returned from svn_fs__rep_contents_write_stream() will be used to
write all the data from beginning to end.)


Proposed Solution:
==================

I'm thinking of a pair of functions:

   /**
    * Return a portable string representation of an MD5 context.
    * apr_md5_resume_context() can convert the string back to a context.
    *
    * @param context An MD5 context
    * @param pool The pool in which to allocate the returned string
    * @return The serialized form of the context
    * @note Call this with a context that has not yet been finalized
    *    with apr_md5_final().
    */
   const char *apr_md5_serialize_context(struct apr_md5_ctx_t *context,
                                         apr_pool_t *pool);

   /**
    * Set an MD5 context to the state represented by a serialized context.
    *
    * @param context The MD5 context to restore
    * @param serialized_context String obtained from apr_md5_serialize_context
    * @return The error APR_INVALID_MD5_SERIALIZATION if the
    *    serialized representation cannot be parsed, else return success. 
    */
   apr_status_t apr_md5_resume_context(struct apr_md5_ctx_t *context,
                                       const char *serialized_context);
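
To make the idea concrete, here is a rough sketch of what the
serialization could look like.  It does not touch the real
apr_md5_ctx_t: the struct md5_ctx_sketch_t and the functions
md5_sketch_serialize() and md5_sketch_resume() are hypothetical
stand-ins, assuming the context boils down to four 32-bit state words,
a two-word bit count, and a 64-byte partial block.  Any other fields
the real context carries would need their own handling.  Everything is
written as hex digits derived from values rather than raw memory, so
the string should not depend on byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an unfinalized MD5 context (an assumption,
 * not the real apr_md5_ctx_t layout). */
typedef struct {
    uint32_t state[4];          /* chaining variables */
    uint32_t count[2];          /* number of bits processed so far */
    unsigned char buffer[64];   /* partial input block */
} md5_ctx_sketch_t;

/* 6 words * 8 hex digits + 64 bytes * 2 hex digits */
#define MD5_SKETCH_STR_LEN 176

static const char hexdigits[] = "0123456789abcdef";

static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Write a 32-bit value as 8 hex digits, most significant first. */
static char *put_u32(char *p, uint32_t v)
{
    int shift;
    for (shift = 28; shift >= 0; shift -= 4)
        *p++ = hexdigits[(v >> shift) & 0xf];
    return p;
}

static int get_u32(const char *p, uint32_t *v)
{
    int i;
    *v = 0;
    for (i = 0; i < 8; i++) {
        int d = hex_value(p[i]);
        if (d < 0) return -1;
        *v = (*v << 4) | (uint32_t)d;
    }
    return 0;
}

/* Returns 0 on success, -1 if OUT is too small. */
int md5_sketch_serialize(const md5_ctx_sketch_t *ctx, char *out, size_t outlen)
{
    char *p = out;
    size_t i;
    assert(ctx != NULL && out != NULL);
    if (outlen < (size_t)MD5_SKETCH_STR_LEN + 1) return -1;
    for (i = 0; i < 4; i++) p = put_u32(p, ctx->state[i]);
    for (i = 0; i < 2; i++) p = put_u32(p, ctx->count[i]);
    for (i = 0; i < 64; i++) {
        *p++ = hexdigits[ctx->buffer[i] >> 4];
        *p++ = hexdigits[ctx->buffer[i] & 0xf];
    }
    *p = '\0';
    return 0;
}

/* Returns 0 on success, -1 if S cannot be parsed; CTX is left
 * untouched on failure. */
int md5_sketch_resume(md5_ctx_sketch_t *ctx, const char *s)
{
    md5_ctx_sketch_t tmp;
    size_t i;
    assert(ctx != NULL && s != NULL);
    if (strlen(s) != (size_t)MD5_SKETCH_STR_LEN) return -1;
    for (i = 0; i < 4; i++)
        if (get_u32(s + 8 * i, &tmp.state[i]) != 0) return -1;
    for (i = 0; i < 2; i++)
        if (get_u32(s + 32 + 8 * i, &tmp.count[i]) != 0) return -1;
    for (i = 0; i < 64; i++) {
        int hi = hex_value(s[48 + 2 * i]);
        int lo = hex_value(s[48 + 2 * i + 1]);
        if (hi < 0 || lo < 0) return -1;
        tmp.buffer[i] = (unsigned char)((hi << 4) | lo);
    }
    *ctx = tmp;
    return 0;
}
```

The real functions would allocate the string from the pool and return
an apr_status_t; the caller-supplied buffer and int return here are
just to keep the sketch free of APR dependencies.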

Then we'd store the serialized context along with the digest.  In
other words, each time one writes data to a representation, one would:

   1. Resume the rep's context if any, else init a new context.

   2. Write the data through, updating the checksum as we go.

   3. Close the stream, reserialize the context, *then* finalize the
      context and compute a new digest, and store both the new
      serialization and digest in the rep.
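
The three steps above might look roughly like the sketch below.  To
keep it short and self-contained it does NOT use MD5: a 32-bit FNV-1a
running hash with a lossy finalization step stands in for the MD5
context, and toy_ctx_t, toy_rep_t and toy_rep_append() are all
hypothetical names, not Subversion or APR code.  The point is only the
ordering in step 3 (serialize first, then finalize) and the fact that
appending across separate write sessions yields the same digest as a
single pass.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for an MD5 context (FNV-1a hash plus byte count). */
typedef struct {
    uint32_t hash;
    uint32_t length;
} toy_ctx_t;

/* Hypothetical representation record: stores both the serialized
 * context and the digest. */
typedef struct {
    int has_context;
    char serialized[17];   /* 8 + 8 hex digits + NUL */
    uint32_t digest;
} toy_rep_t;

void toy_init(toy_ctx_t *c)
{
    c->hash = 2166136261u;   /* FNV-1a offset basis */
    c->length = 0;
}

void toy_update(toy_ctx_t *c, const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i;
    for (i = 0; i < len; i++) {
        c->hash ^= p[i];
        c->hash *= 16777619u;   /* FNV-1a prime */
    }
    c->length += (uint32_t)len;
}

/* Finalizing mixes in the length, so the digest by itself is not
 * enough to continue hashing. */
uint32_t toy_final(const toy_ctx_t *c)
{
    return c->hash ^ (c->length * 2654435761u);
}

/* Append DATA to REP.  Returns 0 on success, -1 if the stored
 * serialization cannot be parsed. */
int toy_rep_append(toy_rep_t *rep, const void *data, size_t len)
{
    toy_ctx_t ctx;

    /* 1. Resume the rep's context if any, else init a new one. */
    if (rep->has_context) {
        unsigned long h, n;
        if (strlen(rep->serialized) != 16
            || sscanf(rep->serialized, "%8lx%8lx", &h, &n) != 2)
            return -1;
        ctx.hash = (uint32_t)h;
        ctx.length = (uint32_t)n;
    } else {
        toy_init(&ctx);
    }

    /* 2. Write the data through, updating the checksum. */
    toy_update(&ctx, data, len);

    /* 3. Reserialize the context, *then* finalize, and store both. */
    snprintf(rep->serialized, sizeof rep->serialized, "%08lx%08lx",
             (unsigned long)ctx.hash, (unsigned long)ctx.length);
    rep->has_context = 1;
    rep->digest = toy_final(&ctx);
    return 0;
}
```

Calling toy_rep_append() once with "hello " and once with "world"
should leave the same digest in the rep as hashing "hello world" in a
single pass, which is the property we want from the real thing.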

Does anyone see either a better solution, or an unexpected
consequence/problem with this solution?  Writing the serialization and
deserialization isn't particularly hard, but I'd hate to do it and
then discover there was a simpler answer :-).

(By the way, I'm assuming that these would go into apr-util.  But
they wouldn't have to, of course; they could live in Subversion's
code if people don't think they belong in apr-util.)

-K