From: Thomas Mueller <mueller@adobe.com>
To: oak-dev@jackrabbit.apache.org
Date: Wed, 17 Oct 2012 09:26:46 +0100
Subject: Re: [MongoMK] Reading blobs incrementally

Hi,

As a workaround, you could keep the last few streams open in the MongoMK for some time (as a cache), together with the current position. That way a seek is not required in most cases, as binaries are usually read as a stream. However, keeping resources open is problematic (we do that in the DbDataStore in Jackrabbit, and we ran into various problems), so I would avoid it if possible.

I would probably use the AbstractBlobStore instead, which splits blobs into blocks. I believe that way you can just use regular MongoDB features and don't need to use GridFS. But you might want to test which approach is faster / easier.
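To sketch the block-splitting idea (this is only a rough illustration, not the actual AbstractBlobStore implementation; the class name, the in-memory map standing in for a MongoDB collection, and the block size are all made up for the example):

import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: store a blob as a list of fixed-size blocks keyed
// by their content hash, so a read at an offset only touches the block(s)
// that cover it. The HashMap stands in for a MongoDB collection.
public class BlockSplittingBlobStore {

    private static final int BLOCK_SIZE = 1024 * 1024; // example block size

    // stand-in for a collection mapping block hash -> block bytes
    private final Map<String, byte[]> blocks = new HashMap<String, byte[]>();

    // Splits the stream into blocks and returns the list of block keys
    // that identifies the blob.
    public List<String> write(InputStream in) throws IOException {
        List<String> keys = new ArrayList<String>();
        byte[] buffer = new byte[BLOCK_SIZE];
        int len;
        while ((len = readFully(in, buffer)) > 0) {
            byte[] block = new byte[len];
            System.arraycopy(buffer, 0, block, 0, len);
            String key = sha256(block);
            blocks.put(key, block); // content-addressed: identical blocks dedupe
            keys.add(key);
        }
        return keys;
    }

    // Reads up to length bytes starting at offset; only the block covering
    // the offset is fetched, with no skipping through the whole stream.
    public int read(List<String> keys, long offset, byte[] buf, int off, int length) {
        int blockIndex = (int) (offset / BLOCK_SIZE);
        if (blockIndex >= keys.size()) {
            return -1; // offset is past the end of the blob
        }
        byte[] block = blocks.get(keys.get(blockIndex));
        int posInBlock = (int) (offset % BLOCK_SIZE);
        int n = Math.min(length, block.length - posInBlock);
        if (n <= 0) {
            return -1;
        }
        System.arraycopy(block, posInBlock, buf, off, n);
        return n;
    }

    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    private static String sha256(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Since blocks are content-addressed by their hash, identical blocks are stored only once, and a read at a given offset fetches just the blocks that cover it.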
Regards,
Thomas


On 9/26/12 9:48 AM, "Mete Atamel" wrote:

>Forgot to mention: I could also increase the BufferedInputStream's buffer
>size to something large to speed up reads of big blobs. That's probably
>what I'll do in the short term, but my question is more about whether the
>optimization I mentioned in my previous email is worth pursuing at some
>point.
>
>Best,
>Mete
>
>On 9/26/12 9:38 AM, "Mete Atamel" wrote:
>
>>Hi,
>>
>>I realized that MicroKernelIT#testBlobs takes a while to complete on
>>MongoMK. This is partly due to how the test was written and partly due to
>>how the blob read offset is implemented in MongoMK. I'm looking for
>>feedback on where to fix this.
>>
>>To give you an idea of testBlobs: it first writes a blob using the MK.
>>Then it verifies that the blob bytes were written correctly by reading
>>the blob back from the MK. However, the blob is not read from the MK in
>>one shot. Instead, it's read via this input stream:
>>
>>InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>id));
>>
>>MicroKernelInputStream reads from the MK, and BufferedInputStream buffers
>>the reads in 8K chunks. Then there's a while loop with in2.read() to read
>>the blob fully. This makes a call to the MicroKernel#read method with the
>>right offset for every 8K chunk until the blob bytes are fully read.
>>
>>This is not a problem for small blobs, but for bigger blobs, reading in
>>8K chunks can be slow, because in MongoMK every read with an offset
>>triggers the following:
>>- Find the blob in GridFS
>>- Retrieve its input stream
>>- Skip to the right offset
>>- Read 8K
>>- Close the input stream
>>
>>I could fix this by changing the test to read the blob bytes in one shot
>>and then do the comparison. However, I was wondering if we should also
>>work on an optimization for successive reads from the blob with
>>incremental offsets? Maybe we could keep the input streams of recently
>>read blobs around for some time before closing them?
>>
>>Best,
>>Mete
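(For illustration, a rough sketch of the "keep recently used streams open" idea discussed in both mails. None of the names below are real MongoMK or GridFS APIs; it is a hypothetical wrapper showing how caching an open stream together with its position makes sequential 8K reads cheap. The resource-management caveats mentioned above still apply.)

import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper: keep the last few blob streams open together with
// their current position, so a sequential read at the next offset continues
// on the open stream instead of reopening the blob and skipping from the
// start. Not a real MongoMK class.
public class BlobStreamCache {

    // opens a fresh stream for a blob id, e.g. from GridFS
    public interface StreamOpener {
        InputStream open(String blobId) throws IOException;
    }

    private static final int MAX_OPEN_STREAMS = 4; // keep few resources open

    private final StreamOpener opener;

    // LRU map of blobId -> open stream plus its current position;
    // evicted streams are closed so handles are not leaked indefinitely
    private final Map<String, OpenStream> cache =
            new LinkedHashMap<String, OpenStream>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, OpenStream> eldest) {
                    if (size() > MAX_OPEN_STREAMS) {
                        closeQuietly(eldest.getValue().in);
                        return true;
                    }
                    return false;
                }
            };

    public BlobStreamCache(StreamOpener opener) {
        this.opener = opener;
    }

    // Reads up to buf.length bytes of the blob starting at the given offset.
    public synchronized int read(String blobId, long offset, byte[] buf) throws IOException {
        OpenStream s = cache.get(blobId);
        if (s == null || s.position > offset) {
            // no usable open stream: open a new one (backwards seeks reopen)
            if (s != null) {
                closeQuietly(s.in);
                cache.remove(blobId);
            }
            s = new OpenStream(opener.open(blobId));
            cache.put(blobId, s);
        }
        // skip forward if needed; zero skip for a purely sequential reader
        while (s.position < offset) {
            long skipped = s.in.skip(offset - s.position);
            if (skipped <= 0) {
                return -1; // offset is past the end of the blob (simplified)
            }
            s.position += skipped;
        }
        int n = s.in.read(buf);
        if (n > 0) {
            s.position += n;
        }
        return n;
    }

    private static void closeQuietly(InputStream in) {
        try {
            in.close();
        } catch (IOException ignored) {
        }
    }

    private static class OpenStream {
        final InputStream in;
        long position;

        OpenStream(InputStream in) {
            this.in = in;
        }
    }
}

With this, a sequential reader calling read(id, 0, buf), read(id, 8192, buf), and so on reuses one open stream instead of reopening the blob and skipping from the start for every chunk.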