From: Thomas Müller
To: dev@jackrabbit.apache.org
Subject: Re: Workspace.copy() Question ...
Date: Wed, 12 Nov 2008 15:36:34 +0100
Message-ID: <91f3b2650811120636t63fe0d1al7f4952005e064774@mail.gmail.com>
In-Reply-To: <510143ac0811120310s4f6bdbe8g5f6ec8361c00c6c4@mail.gmail.com>
References: <91f3b2650811110106x3c2b576al25ed830de10aa470@mail.gmail.com> <510143ac0811120310s4f6bdbe8g5f6ec8361c00c6c4@mail.gmail.com>

Hi,

The problem is: "process the binary only once". By 'process' we meant 'text extraction', but it could equally be 'virus scan', 'index', 'create a thumbnail', 'transfer' (to the client or from the client), or 'backup': any expensive task.

I believe a good solution is to provide the object identity to the module (the text extraction engine, virus scanner, and so on), so that the module can decide for itself what to do. Instead of returning an InputStream, Jackrabbit would return a DataStoreInputStream with the additional method getDataIdentifier(). The module can then read the identifier, check whether that identifier has already been processed, and if so skip reading the data altogether. I believe that would be a flexible solution.

How the module stores its data for this object (the metadata) is module specific. I don't think the best solution is to always store it in a file or stream close to the binary. For text extraction, a separate file may make sense, but probably not for 'virus scan', because that is only a flag (you don't need the data). Thumbnails: for better performance you want to keep them together rather than saving them separately (that is, in the data store).

Regards,
Thomas
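
A minimal sketch of the idea above, in Java. It is an illustration only and not Jackrabbit code. Only the names DataStoreInputStream and getDataIdentifier() come from the mail; the class OnceOnlyModule, the use of a String as the identifier, and the in-memory set of processed identifiers are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stream type as proposed in the mail: an InputStream that
// also exposes the identity of the underlying binary.
// Assumption: the identifier is modelled as a plain String here.
class DataStoreInputStream extends FilterInputStream {
    private final String dataIdentifier;

    DataStoreInputStream(InputStream in, String dataIdentifier) {
        super(in);
        this.dataIdentifier = dataIdentifier;
    }

    String getDataIdentifier() {
        return dataIdentifier;
    }
}

// Hypothetical module, standing in for a text extractor, virus scanner, etc.
// It keeps its own record of which identifiers it has already handled;
// how that record is stored is module specific (here: a set in memory).
class OnceOnlyModule {
    private final Set<String> processed = new HashSet<>();
    private int expensiveRuns = 0;

    // Returns true if the data was read and processed, false if it was skipped.
    boolean process(DataStoreInputStream in) {
        String id = in.getDataIdentifier();
        if (processed.contains(id)) {
            return false; // already processed: do not read the stream at all
        }
        try {
            // Stand-in for the expensive task: read the whole stream.
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // a real module would extract text, scan, index, ...
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        expensiveRuns++;
        processed.add(id);
        return true;
    }

    int getExpensiveRuns() {
        return expensiveRuns;
    }
}

class DataIdentifierSketch {
    public static void main(String[] args) {
        byte[] data = "same binary".getBytes(StandardCharsets.UTF_8);
        OnceOnlyModule module = new OnceOnlyModule();

        // Two streams over the same binary, e.g. the original and a copy
        // made with Workspace.copy(), share one identifier.
        boolean first = module.process(
                new DataStoreInputStream(new ByteArrayInputStream(data), "id-1"));
        boolean second = module.process(
                new DataStoreInputStream(new ByteArrayInputStream(data), "id-1"));

        System.out.println(first);                     // true
        System.out.println(second);                    // false
        System.out.println(module.getExpensiveRuns()); // 1
    }
}
```

The decision to skip lives entirely in the module, so a virus scanner could keep only a flag per identifier while a text extractor could keep a separate file, without the stream type needing to know about either.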