jackrabbit-oak-dev mailing list archives

From Mete Atamel <mata...@adobe.com>
Subject Re: [MongoMK] Reading blobs incrementally
Date Wed, 17 Oct 2012 09:02:23 GMT
Thanks for the feedback. Using AbstractBlobStore instead of GridFS is
indeed on the list of things I want to try out once the rest of the
missing functionality is done in MongoMK. I'll report back once I get a
chance to implement that.


On 10/17/12 10:26 AM, "Thomas Mueller" <mueller@adobe.com> wrote:

>As a workaround, you could keep the last few streams open in the MongoMK
>for some time (a cache), together with the current position. That way seek
>is not required in most cases, as binaries are usually read as a stream.
>However, keeping resources open is problematic (we do that in the
>DbDataStore in Jackrabbit, and we ran into various problems), and I would
>avoid it if possible. I would probably use the AbstractBlobStore instead,
>which splits blobs into blocks; I believe that way you can just use
>regular MongoDB features and don't need GridFS. But you might want
>to test which approach is faster / easier.
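The block-splitting idea above can be sketched as follows. All names here are hypothetical, and a plain in-memory map stands in for a MongoDB collection; Oak's real AbstractBlobStore also content-addresses its blocks, which this sketch skips in favor of showing just the fixed-size chunking:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: store a blob as fixed-size blocks so that a read at an
// arbitrary offset only fetches the blocks it overlaps, instead of streaming
// and skipping from the start as with a single GridFS file.
class BlockStore {
    static final int BLOCK_SIZE = 8; // tiny for illustration; real blocks are much larger

    final Map<String, byte[]> blocks = new HashMap<>(); // stands in for a MongoDB collection

    void write(String blobId, byte[] data) {
        for (int i = 0, block = 0; i < data.length; i += BLOCK_SIZE, block++) {
            int len = Math.min(BLOCK_SIZE, data.length - i);
            byte[] b = new byte[len];
            System.arraycopy(data, i, b, 0, len);
            blocks.put(blobId + "#" + block, b); // one document per block
        }
    }

    // Read up to 'count' bytes starting at 'offset' by fetching only the
    // overlapping blocks; no skip over earlier data is ever needed.
    byte[] read(String blobId, long offset, int count) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long pos = offset;
        while (out.size() < count) {
            byte[] b = blocks.get(blobId + "#" + (pos / BLOCK_SIZE));
            if (b == null) break; // past the end of the blob
            int inBlock = (int) (pos % BLOCK_SIZE);
            int len = Math.min(b.length - inBlock, count - out.size());
            if (len <= 0) break;
            out.write(b, inBlock, len);
            pos += len;
        }
        return out.toByteArray();
    }
}
```

A read near the end of a large blob touches only one or two block documents, which is the property that makes seek cheap.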
>On 9/26/12 9:48 AM, "Mete Atamel" <matamel@adobe.com> wrote:
>>Forgot to mention. I could also increase the BufferedInputStream's buffer
>>size to something high to speed up the large blob read. That's probably
>>what I'll do in the short term, but my question is more about whether the
>>optimization I mentioned in my previous email is worth pursuing at some
>>point.
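The short-term workaround is essentially a one-line change, since the buffer size is a constructor parameter of `java.io.BufferedInputStream` (8 KB is the JDK default); the helper names here are illustrative:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferDemo {
    // Wrap the stream with a 1 MB buffer instead of the default 8 KB, so each
    // call through to the underlying stream fetches 1 MB at a time and far
    // fewer MicroKernel#read round trips are made for a large blob.
    static InputStream buffered(InputStream in) {
        return new BufferedInputStream(in, 1024 * 1024);
    }

    // Read the stream to the end, returning the number of bytes seen.
    static int drain(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) n++;
        return n;
    }
}
```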
>>On 9/26/12 9:38 AM, "Mete Atamel" <matamel@adobe.com> wrote:
>>>I realized that MicroKernelIT#testBlobs takes a while to complete on
>>>MongoMK. This is partly due to how the test was written and partly due
>>>to how the blob read offset is implemented in MongoMK. I'm looking for
>>>feedback on where to fix this.
>>>To give you an idea on testBlobs, it first writes a blob using MK. Then,
>>>it verifies that the blob bytes were written correctly by reading them
>>>back from MK. However, the blob read from MK is not done in one shot.
>>>Instead, it's done via this input stream:
>>>InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>>MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>>>the reads in 8K chunks. Then, there's a while loop with in2.read() to
>>>read the blob fully. This calls the MicroKernel#read method with the
>>>right offset for every 8K chunk until the blob bytes are fully read.
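The pattern described above can be sketched with a stand-in for the MicroKernel (the `read(blobId, pos, buff, off, length)` shape is assumed from this discussion; the in-memory backing array is purely illustrative):

```java
class ReadLoopDemo {
    // Stand-in for MicroKernel#read: copies from an in-memory blob and returns
    // the number of bytes read, or -1 once past the end.
    static int mkRead(byte[] blob, long pos, byte[] buff, int off, int length) {
        if (pos >= blob.length) return -1;
        int n = Math.min(length, blob.length - (int) pos);
        System.arraycopy(blob, (int) pos, buff, off, n);
        return n;
    }

    // What BufferedInputStream over MicroKernelInputStream effectively does:
    // one mkRead call per 8 KB chunk, each with an incremented offset.
    static int countChunkReads(byte[] blob) {
        byte[] buff = new byte[8 * 1024];
        long pos = 0;
        int calls = 0;
        int n;
        while ((n = mkRead(blob, pos, buff, 0, buff.length)) != -1) {
            pos += n;
            calls++;
        }
        return calls;
    }
}
```

So a 100 MB blob costs on the order of 12,800 separate offset reads, each of which pays the full GridFS lookup cost described next.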
>>>This is not a problem for small blobs, but for bigger blobs, reading
>>>8K chunks can be slow because in MongoMK every read with an offset
>>>triggers the following:
>>>- Find the blob in GridFS
>>>- Retrieve its input stream
>>>- Skip to the right offset
>>>- Read 8K
>>>- Close the input stream
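The five steps above can be sketched against a plain InputStream (a ByteArrayInputStream stands in for the GridFS stream; names are illustrative). The sketch also makes the cost visible: every chunk re-skips everything before it, so the total skip work grows quadratically with blob size:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class OffsetReadCost {
    // One MongoMK-style read-with-offset: open the stream, skip to the offset,
    // read up to 8 KB, close. Returns how many bytes had to be skipped.
    static long readChunk(byte[] blob, long offset) throws IOException {
        InputStream in = new ByteArrayInputStream(blob); // steps 1-2: find blob, get stream
        long skipped = in.skip(offset);                  // step 3: skip to the offset
        byte[] chunk = new byte[8 * 1024];
        in.read(chunk);                                  // step 4: read 8K
        in.close();                                      // step 5: close the stream
        return skipped;
    }

    // Reading a whole blob this way: the skip work alone is O(n^2) in blob size.
    static long totalSkipped(byte[] blob) throws IOException {
        long total = 0;
        for (long off = 0; off < blob.length; off += 8 * 1024) {
            total += readChunk(blob, off);
        }
        return total;
    }
}
```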
>>>I could fix this by changing the test to read the blob bytes in one shot
>>>and then do the comparison. However, I was wondering whether we should
>>>also work on an optimization for successive reads from the blob with
>>>incremental offsets. Maybe we could keep the input streams of recently
>>>read blobs around for some time before closing them?
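That proposal could look roughly like the sketch below. Every name here is hypothetical, and a real version would need eviction timeouts and thread safety (keeping resources open is exactly the hazard raised elsewhere in this thread); the idea is just to reuse the open stream when the next read starts where the last one ended, which is the common sequential case:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keep the open stream and its current position per blob,
// and reuse it when the requested offset matches the stream's position.
class StreamCache {
    static class Entry {
        InputStream in;
        long pos;
    }

    final Map<String, Entry> open = new HashMap<>();
    int reopens = 0; // how many times we fell back to reopen-and-skip

    int read(byte[] blob, String blobId, long pos, byte[] buff) throws IOException {
        Entry e = open.get(blobId);
        if (e == null || e.pos != pos) {
            // Cache miss or non-sequential read: reopen and skip (the slow path).
            if (e != null) e.in.close();
            e = new Entry();
            e.in = new ByteArrayInputStream(blob); // stands in for the GridFS lookup
            e.in.skip(pos);
            e.pos = pos;
            open.put(blobId, e);
            reopens++;
        }
        int n = e.in.read(buff);
        if (n > 0) e.pos += n;
        return n;
    }
}
```

For a fully sequential read the stream is opened once instead of once per 8K chunk; random-access reads still pay the reopen-and-skip cost.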
