From: Mete Atamel
To: oak-dev@jackrabbit.apache.org
Date: Wed, 17 Oct 2012 02:02:23 -0700
Subject: Re: [MongoMK] Reading blobs incrementally

Thanks for the feedback. Using AbstractBlobStore instead of GridFS is
indeed on the list of things I want to try out once the rest of the
missing functionality is done in MongoMK. I'll report back once I get a
chance to implement that. To keep this thread concrete, I've appended two
rough sketches at the end of this mail: one of the block-splitting
approach and one comparing the two blob read patterns from testBlobs.

-Mete

On 10/17/12 10:26 AM, "Thomas Mueller" wrote:

>Hi,
>
>As a workaround, you could keep the last few streams open in MongoMK for
>some time (a cache), together with their current positions. That way a
>seek is not required in most cases, as binaries are usually read as a
>stream.
>
>However, keeping resources open is problematic (we do that in the
>DbDataStore in Jackrabbit, and we ran into various problems), so I would
>avoid it if possible.
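>
>Just to make the idea concrete, here is a minimal sketch of such a
>stream cache. The names (BlobStreamCache, openBlobStream) are made up
>and the error handling is minimal; this is an illustration, not actual
>MongoMK code:
>
>import java.io.IOException;
>import java.io.InputStream;
>import java.util.LinkedHashMap;
>import java.util.Map;
>
>class BlobStreamCache {
>
>    // An open stream together with its current read position.
>    static class Entry {
>        final InputStream in;
>        long pos;
>        Entry(InputStream in) { this.in = in; }
>    }
>
>    private static final int MAX_OPEN = 16;
>
>    // LRU map; evicting the eldest entry closes its stream.
>    private final Map<String, Entry> streams =
>        new LinkedHashMap<String, Entry>(MAX_OPEN, 0.75f, true) {
>            protected boolean removeEldestEntry(
>                    Map.Entry<String, Entry> eldest) {
>                if (size() > MAX_OPEN) {
>                    try { eldest.getValue().in.close(); }
>                    catch (IOException ignore) { }
>                    return true;
>                }
>                return false;
>            }
>        };
>
>    // Reads up to buff.length bytes at the given offset. A sequential
>    // read continues on the cached stream and needs no skip at all;
>    // only a backwards seek forces the stream to be reopened.
>    synchronized int read(String blobId, long pos, byte[] buff)
>            throws IOException {
>        Entry e = streams.get(blobId);
>        if (e == null || e.pos > pos) {
>            if (e != null) {
>                e.in.close();
>            }
>            e = new Entry(openBlobStream(blobId));
>            streams.put(blobId, e);
>        }
>        while (e.pos < pos) {
>            long skipped = e.in.skip(pos - e.pos);
>            if (skipped <= 0) {
>                throw new IOException("Cannot seek to " + pos);
>            }
>            e.pos += skipped;
>        }
>        int n = e.in.read(buff);
>        if (n > 0) {
>            e.pos += n;
>        }
>        return n;
>    }
>
>    // Backend-specific: would open the blob (e.g. from GridFS)
>    // positioned at offset 0.
>    private InputStream openBlobStream(String blobId) throws IOException {
>        throw new UnsupportedOperationException("not implemented here");
>    }
>}
>
>The tricky part is exactly the resource management mentioned above:
>streams that are never read to the end still have to be closed at some
>point, which is what the LRU eviction tries to approximate.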
>I would probably use the AbstractBlobStore instead, which splits blobs
>into blocks. I believe that way you can just use regular MongoDB
>features and don't need to use GridFS. But you might want to test which
>approach is faster / easier.
>
>Regards,
>Thomas
>
>
>
>On 9/26/12 9:48 AM, "Mete Atamel" wrote:
>
>>Forgot to mention: I could also increase the BufferedInputStream's
>>buffer size to something high to speed up the large blob read. That's
>>probably what I'll do in the short term, but my question is more about
>>whether the optimization I mentioned in my previous email is worth
>>pursuing at some point.
>>
>>Best,
>>Mete
>>
>>On 9/26/12 9:38 AM, "Mete Atamel" wrote:
>>
>>>Hi,
>>>
>>>I realized that MicroKernelIT#testBlobs takes a while to complete on
>>>MongoMK. This is partly due to how the test was written and partly due
>>>to how the blob read offset is implemented in MongoMK. I'm looking for
>>>feedback on where to fix this.
>>>
>>>To give you an idea of testBlobs: it first writes a blob using the MK.
>>>Then it verifies that the blob bytes were written correctly by reading
>>>the blob back from the MK. However, the blob is not read from the MK
>>>in one shot. Instead, it's read via this input stream:
>>>
>>>InputStream in2 = new BufferedInputStream(
>>>        new MicroKernelInputStream(mk, id));
>>>
>>>MicroKernelInputStream reads from the MK, and BufferedInputStream
>>>buffers the reads in 8K chunks. Then there's a while loop with
>>>in2.read() to read the blob fully. This makes a call to the
>>>MicroKernel#read method with the right offset for every 8K chunk until
>>>the blob bytes are fully read.
>>>
>>>This is not a problem for small blobs, but for bigger blobs reading in
>>>8K chunks can be slow, because in MongoMK every read at an offset
>>>triggers the following:
>>>- Find the blob in GridFS
>>>- Retrieve its input stream
>>>- Skip to the right offset
>>>- Read 8K
>>>- Close the input stream
>>>
>>>I could fix this by changing the test to read the blob bytes in one
>>>shot and then do the comparison. However, I was wondering whether we
>>>should also work on an optimization for successive reads from the same
>>>blob with incremental offsets. Maybe we could keep the input streams
>>>of recently read blobs around for some time before closing them?
>>>
>>>Best,
>>>Mete
>>>
>>>
>>
>
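
P.S. Here is the rough sketch of the block-splitting approach Thomas
describes above, using the plain MongoDB Java driver (2.x API). The
class, collection and field names are made up for illustration; a real
implementation would live behind the AbstractBlobStore API and need
proper error handling. Each blob is cut into fixed-size blocks, and
every block is stored as a regular document keyed by its content hash,
so identical blocks are stored only once:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

class BlockStoreSketch {

    // 2 MB blocks keep each document well under MongoDB's 16 MB
    // document size limit.
    private static final int BLOCK_SIZE = 2 * 1024 * 1024;

    private final DBCollection blocks; // e.g. db.getCollection("blobBlocks")

    BlockStoreSketch(DBCollection blocks) {
        this.blocks = blocks;
    }

    // Splits the stream into blocks and stores each block as a plain
    // document keyed by its SHA-256 hash. The list of keys identifies
    // the blob.
    List<String> writeBlob(InputStream in) throws IOException {
        List<String> keys = new ArrayList<String>();
        byte[] buff = new byte[BLOCK_SIZE];
        int n;
        while ((n = readFully(in, buff)) > 0) {
            byte[] block = new byte[n];
            System.arraycopy(buff, 0, block, 0, n);
            String key = hash(block);
            // Upsert, so a block shared by several blobs is stored once.
            blocks.update(new BasicDBObject("_id", key),
                    new BasicDBObject("_id", key).append("data", block),
                    true, false);
            keys.add(key);
        }
        return keys;
    }

    // A read at an offset only fetches the block(s) overlapping the
    // requested range (block index = offset / BLOCK_SIZE); no stream
    // has to be opened, skipped and closed per 8K chunk.
    byte[] readBlock(String key) {
        DBObject doc = blocks.findOne(new BasicDBObject("_id", key));
        return doc == null ? null : (byte[]) doc.get("data");
    }

    private static int readFully(InputStream in, byte[] buff)
            throws IOException {
        int total = 0;
        while (total < buff.length) {
            int n = in.read(buff, total, buff.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    private static String hash(byte[] block) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(block)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}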
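
P.P.S. And the two blob read patterns from the testBlobs discussion, as
fragments (assuming "mk" is the MicroKernel and "id" is the blob id, and
the MicroKernel#read(blobId, pos, buff, off, length) / getLength(blobId)
contract):

// Pattern 1: roughly what the test does today. BufferedInputStream
// issues one MicroKernel#read per 8K chunk, and with the current
// GridFS-backed implementation each of those reads re-opens the blob
// stream and skips to 'pos', so the total work grows roughly
// quadratically with the blob size.
long len = mk.getLength(id);
byte[] chunk = new byte[8 * 1024];
long pos = 0;
while (pos < len) {
    int n = mk.read(id, pos, chunk, 0, chunk.length);
    if (n < 0) {
        break;
    }
    // ... verify the chunk bytes here ...
    pos += n;
}

// Pattern 2: the one-shot read the test could be changed to. One
// open/skip/close per blob, at the price of holding the whole blob in
// memory, which is acceptable for a test.
byte[] all = new byte[(int) len];
int done = 0;
while (done < all.length) {
    int n = mk.read(id, done, all, done, all.length - done);
    if (n < 0) {
        break;
    }
    done += n;
}
// ... compare 'all' against the bytes that were written ...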