In-Reply-To: <507E6F78.5030602@apache.org>
References: <CC887D8B.1B364%matamel@adobe.com>
	<507E6F78.5030602@apache.org>
Date: Wed, 17 Oct 2012 11:03:32 +0200
Message-ID: 
 <CAFYk8N=gCcR9MbXNC3bCCipN86EWkhcqT4RA5f0Sa2-ya2tA9g@mail.gmail.com>
Subject: Re: [MongoMK] Reading blobs incrementally
From: Stefan Guggisberg <stefan.guggisberg@gmail.com>
To: oak-dev@jackrabbit.apache.org

On Wed, Oct 17, 2012 at 10:42 AM, Michael Dürig <mduerig@apache.org> wrote:
>
> I wonder why the Microkernel API has an asymmetry here: for writing a binary
> you can pass a stream, whereas for reading you need to pass a byte array.

the write method implies a content-addressable storage for blobs,
i.e. identical binary content is identified by identical identifiers.
the identifier needs to be computed from the entire blob content.
that's why the signature takes a stream rather than supporting
chunked writes.
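
a minimal sketch of that idea, not the actual MicroKernel code: the
class name, the helper name and the choice of SHA-256 are assumptions
made for illustration only. it shows why the whole stream has to be
consumed before an identifier exists.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the MicroKernel API.
final class ContentAddressSketch {

    // Derives an identifier from the complete stream content.
    // SHA-256 is an assumption; the real implementation may hash differently.
    static String contentId(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int n;
            // Every byte feeds the digest, so the id is only known at end of stream.
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

two streams with identical bytes yield the same identifier, and the
identifier cannot be handed out before the last byte has been read.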

cheers
stefan

>
> Michael
>
>
> On 26.9.12 8:38, Mete Atamel wrote:
>>
>> Hi,
>>
>> I realized that MicroKernelIT#testBlobs takes a while to complete on
>> MongoMK. This is partly due to how the test was written and partly due to
>> how the blob read offset is implemented in MongoMK. I'm looking for
>> feedback on where to fix this.
>>
>> To give you an idea of testBlobs, it first writes a blob using MK. Then,
>> it verifies that the blob bytes were written correctly by reading the blob
>> from MK. However, the blob read from MK is not done in one shot. Instead,
>> it's done via this input stream:
>>
>> InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk, id));
>>
>>
>> MicroKernelInputStream reads from the MK and BufferedInputStream buffers
>> the reads in 8K chunks. Then, there's a while loop with in2.read() to read
>> the blob fully. This makes a call to the MicroKernel#read method with the
>> right offset for every 8K chunk until the blob bytes are fully read.
>>
>> This is not a problem for small blob sizes but for bigger blob sizes,
>> reading 8K chunks can be slow because in MongoMK, every read with offset
>> triggers the following:
>> -Find the blob from GridFS
>> -Retrieve its input stream
>> -Skip to the right offset
>> -Read 8K
>> -Close the input stream
>>
>> I could fix this by changing the test to read the blob bytes in one shot
>> and then do the comparison. However, I was wondering if we should also
>> work on an optimization for successive reads from the blob with
>> incremental offsets? Maybe we could keep the input stream of recently read
>> blobs around for some time before closing them?
>>
>> Best,
>> Mete
>>
>>
>
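
the optimization Mete asks about in the quoted message (keeping a
stream open across successive reads with incremental offsets) could be
sketched roughly as follows. this is not MongoMK code:
SequentialBlobReader, its opener argument and the opens() counter are
hypothetical names, and expiry of idle streams is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical sketch: reuses one open stream while reads arrive in order.
final class SequentialBlobReader implements AutoCloseable {
    private final Supplier<InputStream> opener; // assumed: opens the blob from the store
    private InputStream in;
    private long position; // offset of the next byte the open stream will deliver
    private int opens;     // how often the blob had to be (re)opened

    SequentialBlobReader(Supplier<InputStream> opener) {
        this.opener = opener;
    }

    // Reads up to len bytes starting at the given blob offset; returns -1 at end.
    int read(long offset, byte[] buf, int off, int len) {
        try {
            if (in == null || offset != position) {
                reopenAt(offset); // only non-sequential reads pay for find + skip
            }
            int n = in.read(buf, off, len);
            if (n > 0) {
                position += n;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void reopenAt(long offset) throws IOException {
        close();
        in = opener.get();
        opens++;
        long skipped = 0;
        while (skipped < offset) {
            long s = in.skip(offset - skipped);
            if (s > 0) {
                skipped += s;
            } else if (in.read() >= 0) {
                skipped++; // skip() made no progress, fall back to reading one byte
            } else {
                break;     // blob is shorter than the requested offset
            }
        }
        position = skipped;
    }

    int opens() {
        return opens;
    }

    @Override
    public void close() {
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } finally {
                in = null;
            }
        }
    }
}
```

with this shape, a loop that reads 8K chunks at increasing offsets
opens the blob once instead of once per chunk; a read at an unexpected
offset falls back to the find, skip, read sequence listed in the quoted
message.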