Mailing-List: contact oak-dev-help@jackrabbit.apache.org; run by ezmlm
Reply-To: oak-dev@jackrabbit.apache.org
Delivered-To: mailing list oak-dev@jackrabbit.apache.org
Message-ID: <507E7586.4080609@apache.org>
Date: Wed, 17 Oct 2012 10:08:22 +0100
From: Michael Dürig
To: oak-dev@jackrabbit.apache.org
Subject: Re: [MongoMK] Reading blobs incrementally
References: <507E6F78.5030602@apache.org>

On 17.10.12 10:03, Stefan Guggisberg wrote:
> On Wed, Oct 17, 2012 at 10:42 AM, Michael Dürig wrote:
>>
>> I wonder why the MicroKernel API has an asymmetry here: for writing a
>> binary you can pass a stream, whereas for reading you need to pass a
>> byte array.
>
> the write method implies content-addressable storage for blobs,
> i.e. identical binary content is identified by identical identifiers.
> the identifier needs to be computed from the entire blob content.
> that's why the signature takes a stream rather than supporting
> chunked writes.

Makes sense so far, but this is only half of the story ;-) Why couldn't
the read method also return a stream?

Michael

>
> cheers
> stefan
>
>>
>> Michael
>>
>>
>> On 26.9.12 8:38, Mete Atamel wrote:
>>>
>>> Hi,
>>>
>>> I realized that MicroKernelIT#testBlobs takes a while to complete on
>>> MongoMK. This is partly due to how the test was written and partly
>>> due to how the blob read offset is implemented in MongoMK. I'm
>>> looking for feedback on where to fix this.
>>>
>>> To give you an idea of testBlobs: it first writes a blob using the
>>> MK. Then it verifies that the blob bytes were written correctly by
>>> reading the blob back from the MK. However, the blob is not read
>>> from the MK in one shot.
>>> Instead, it's done via this input stream:
>>>
>>>   InputStream in2 = new BufferedInputStream(
>>>       new MicroKernelInputStream(mk, id));
>>>
>>> MicroKernelInputStream reads from the MK, and BufferedInputStream
>>> buffers the reads in 8K chunks. Then there's a while loop with
>>> in2.read() to read the blob fully. This makes a call to the
>>> MicroKernel#read method with the right offset for every 8K chunk
>>> until the blob bytes are fully read.
>>>
>>> This is not a problem for small blob sizes, but for bigger blobs
>>> reading in 8K chunks can be slow, because in MongoMK every read
>>> with an offset triggers the following:
>>> - Find the blob in GridFS
>>> - Retrieve its input stream
>>> - Skip to the right offset
>>> - Read 8K
>>> - Close the input stream
>>>
>>> I could fix this by changing the test to read the blob bytes in one
>>> shot and then do the comparison. However, I was wondering whether
>>> we should also work on an optimization for successive reads from
>>> the same blob with incremental offsets? Maybe we could keep the
>>> input streams of recently read blobs around for some time before
>>> closing them?
>>>
>>> Best,
>>> Mete
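[Editorial note: the chunked read pattern discussed above can be sketched as below. This is a minimal, self-contained illustration, not the real Oak code: `MkBlobStore` and its in-memory stub are hypothetical stand-ins, and only the `read(blobId, pos, buff, off, length)` shape mirrors the actual `MicroKernel#read` signature. Each loop iteration corresponds to one offset-based read call, which in MongoMK would mean one full GridFS lookup/skip/close cycle.]

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ChunkedBlobRead {

    /** Hypothetical stand-in for the MicroKernel blob-read API. */
    interface MkBlobStore {
        /** Reads up to length bytes of the blob into buff; returns bytes read, or -1 at EOF. */
        int read(String blobId, long pos, byte[] buff, int off, int length);
    }

    /** Reads a whole blob in 8K chunks, as BufferedInputStream over MicroKernelInputStream does. */
    static byte[] readFully(MkBlobStore mk, String blobId) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long pos = 0;
        int n;
        // One MicroKernel#read-style call per chunk, with an incremented offset each time.
        while ((n = mk.read(blobId, pos, chunk, 0, chunk.length)) > 0) {
            out.write(chunk, 0, n);
            pos += n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[20000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i % 251);
        }
        // In-memory stub; a real MongoMK would hit GridFS on every one of these calls.
        MkBlobStore mk = (id, pos, buff, off, len) -> {
            if (pos >= data.length) {
                return -1;
            }
            int n = Math.min(len, data.length - (int) pos);
            System.arraycopy(data, (int) pos, buff, off, n);
            return n;
        };
        byte[] read = readFully(mk, "blob-1");
        System.out.println(Arrays.equals(read, data)); // prints "true"
    }
}
```

A 20000-byte blob takes three such calls (8192 + 8192 + 3616 bytes), which is why per-call overhead dominates for large blobs and why caching the underlying stream between successive offset reads could help.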