From: Thomas Mueller <mueller@adobe.com>
To: oak-dev@jackrabbit.apache.org
Date: Wed, 17 Oct 2012 09:26:46 +0100
Subject: Re: [MongoMK] Reading blobs incrementally

Hi,

As a workaround, you could keep the last few streams open in the MongoMK for some time (as a cache), together with the current position. That way a seek is not required in most cases, as binaries are usually read as a stream. However, keeping resources open is problematic (we do that in the DbDataStore in Jackrabbit, and we ran into various problems), so I would avoid it if possible.

I would probably use the AbstractBlobStore instead, which splits blobs into blocks. I believe that way you can just use regular MongoDB features and don't need to use GridFS. But you might want to test which approach is faster / easier.
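To sketch the block-splitting idea (this is only a rough illustration, not the actual AbstractBlobStore implementation; the class name, the in-memory map standing in for a MongoDB collection, and the block size are all made up for the example):

import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: store a blob as a list of fixed-size blocks keyed
// by their content hash, so a read at an offset only touches the block(s)
// that cover it. The HashMap stands in for a MongoDB collection.
public class BlockSplittingBlobStore {

    private static final int BLOCK_SIZE = 1024 * 1024; // example block size

    // stand-in for a collection mapping block hash -> block bytes
    private final Map<String, byte[]> blocks = new HashMap<String, byte[]>();

    // Splits the stream into blocks and returns the list of block keys
    // that identifies the blob.
    public List<String> write(InputStream in) throws IOException {
        List<String> keys = new ArrayList<String>();
        byte[] buffer = new byte[BLOCK_SIZE];
        int len;
        while ((len = readFully(in, buffer)) > 0) {
            byte[] block = new byte[len];
            System.arraycopy(buffer, 0, block, 0, len);
            String key = sha256(block);
            blocks.put(key, block); // content-addressed: identical blocks dedupe
            keys.add(key);
        }
        return keys;
    }

    // Reads up to length bytes starting at offset; only the block covering
    // the offset is fetched, with no skipping through the whole stream.
    public int read(List<String> keys, long offset, byte[] buf, int off, int length) {
        int blockIndex = (int) (offset / BLOCK_SIZE);
        if (blockIndex >= keys.size()) {
            return -1; // offset is past the end of the blob
        }
        byte[] block = blocks.get(keys.get(blockIndex));
        int posInBlock = (int) (offset % BLOCK_SIZE);
        int n = Math.min(length, block.length - posInBlock);
        if (n <= 0) {
            return -1;
        }
        System.arraycopy(block, posInBlock, buf, off, n);
        return n;
    }

    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    private static String sha256(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Since blocks are content-addressed by their hash, identical blocks are stored only once, and a read at a given offset fetches just the blocks that cover it.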
Regards,
Thomas


On 9/26/12 9:48 AM, "Mete Atamel" wrote:

>Forgot to mention: I could also increase the BufferedInputStream's buffer
>size to something large to speed up reads of big blobs. That's probably
>what I'll do in the short term, but my question is more about whether the
>optimization I mentioned in my previous email is worth pursuing at some
>point.
>
>Best,
>Mete
>
>On 9/26/12 9:38 AM, "Mete Atamel" wrote:
>
>>Hi,
>>
>>I realized that MicroKernelIT#testBlobs takes a while to complete on
>>MongoMK. This is partly due to how the test was written and partly due to
>>how the blob read offset is implemented in MongoMK. I'm looking for
>>feedback on where to fix this.
>>
>>To give you an idea of testBlobs: it first writes a blob using the MK.
>>Then it verifies that the blob bytes were written correctly by reading
>>the blob back from the MK. However, the blob is not read from the MK in
>>one shot. Instead, it's read via this input stream:
>>
>>InputStream in2 = new BufferedInputStream(new MicroKernelInputStream(mk,
>>id));
>>
>>MicroKernelInputStream reads from the MK, and BufferedInputStream buffers
>>the reads in 8K chunks. Then there's a while loop with in2.read() to read
>>the blob fully. This makes a call to the MicroKernel#read method with the
>>right offset for every 8K chunk until the blob bytes are fully read.
>>
>>This is not a problem for small blobs, but for bigger blobs, reading in
>>8K chunks can be slow, because in MongoMK every read with an offset
>>triggers the following:
>>- Find the blob in GridFS
>>- Retrieve its input stream
>>- Skip to the right offset
>>- Read 8K
>>- Close the input stream
>>
>>I could fix this by changing the test to read the blob bytes in one shot
>>and then do the comparison. However, I was wondering if we should also
>>work on an optimization for successive reads from the blob with
>>incremental offsets? Maybe we could keep the input streams of recently
>>read blobs around for some time before closing them?
>>
>>Best,
>>Mete
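(For illustration, a rough sketch of the "keep recently used streams open" idea discussed in both mails. None of the names below are real MongoMK or GridFS APIs; it is a hypothetical wrapper showing how caching an open stream together with its position makes sequential 8K reads cheap. The resource-management caveats mentioned above still apply.)

import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper: keep the last few blob streams open together with
// their current position, so a sequential read at the next offset continues
// on the open stream instead of reopening the blob and skipping from the
// start. Not a real MongoMK class.
public class BlobStreamCache {

    // opens a fresh stream for a blob id, e.g. from GridFS
    public interface StreamOpener {
        InputStream open(String blobId) throws IOException;
    }

    private static final int MAX_OPEN_STREAMS = 4; // keep few resources open

    private final StreamOpener opener;

    // LRU map of blobId -> open stream plus its current position;
    // evicted streams are closed so handles are not leaked indefinitely
    private final Map<String, OpenStream> cache =
            new LinkedHashMap<String, OpenStream>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, OpenStream> eldest) {
                    if (size() > MAX_OPEN_STREAMS) {
                        closeQuietly(eldest.getValue().in);
                        return true;
                    }
                    return false;
                }
            };

    public BlobStreamCache(StreamOpener opener) {
        this.opener = opener;
    }

    // Reads up to buf.length bytes of the blob starting at the given offset.
    public synchronized int read(String blobId, long offset, byte[] buf) throws IOException {
        OpenStream s = cache.get(blobId);
        if (s == null || s.position > offset) {
            // no usable open stream: open a new one (backwards seeks reopen)
            if (s != null) {
                closeQuietly(s.in);
                cache.remove(blobId);
            }
            s = new OpenStream(opener.open(blobId));
            cache.put(blobId, s);
        }
        // skip forward if needed; zero skip for a purely sequential reader
        while (s.position < offset) {
            long skipped = s.in.skip(offset - s.position);
            if (skipped <= 0) {
                return -1; // offset is past the end of the blob (simplified)
            }
            s.position += skipped;
        }
        int n = s.in.read(buf);
        if (n > 0) {
            s.position += n;
        }
        return n;
    }

    private static void closeQuietly(InputStream in) {
        try {
            in.close();
        } catch (IOException ignored) {
        }
    }

    private static class OpenStream {
        final InputStream in;
        long position;

        OpenStream(InputStream in) {
            this.in = in;
        }
    }
}

With this, a sequential reader calling read(id, 0, buf), read(id, 8192, buf), and so on reuses one open stream instead of reopening the blob and skipping from the start for every chunk.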