From: Thomas Müller
To: dev@jackrabbit.apache.org
Date: Tue, 11 Nov 2008 10:06:54 +0100
Subject: Re: Workspace.copy() Question ...
Message-ID: <91f3b2650811110106x3c2b576al25ed830de10aa470@mail.gmail.com>

Hi,

> i have a nice usecase .. i have a filenode in my workspace and i should create
> about 70 copies of this node.
> its a not so small pdf file (10Mb) and i am using the datastore so its no problem
> the binary exists only one time but the problem is the textextractor. it will be called 70 times :-)
> is it possible to reuse the fulltext index on a copy operation without new reindexing the file ?

It's an interesting use case, and probably quite common. It would be good if text extraction ran only once for each binary. However, I'm not sure how this should be implemented.

One solution is to extract the text in the data store, but that would be at the 'wrong' level.

What about this: the DataStore could return a special kind of InputStream that exposes the DataIdentifier (a DataStoreInputStream, for example). The text extractor would then use this unique identifier to ensure that text for the same binary is extracted only once.

The same mechanism could be used to avoid copying binary data within the same repository, and between multiple repositories that share the same data store: if the data store detects such an input stream, it would first check whether the binary object already exists.

Regards,
Thomas
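
A rough sketch of the DataStoreInputStream idea, in Java. Nothing here is existing Jackrabbit API: DataStoreInputStream, CachingTextExtractor, getIdentifier() and the use of a plain String in place of a DataIdentifier are all assumptions made for illustration, and the "extraction" is just a UTF-8 decode standing in for a real text extractor.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stream type: wraps the binary content and remembers the
// identifier of the data store record it was read from.
class DataStoreInputStream extends FilterInputStream {
    private final String identifier; // stand-in for a DataIdentifier

    DataStoreInputStream(InputStream in, String identifier) {
        super(in);
        this.identifier = identifier;
    }

    String getIdentifier() {
        return identifier;
    }
}

// Hypothetical extractor wrapper: runs the expensive extraction at most
// once per identifier and reuses the cached text afterwards.
class CachingTextExtractor {
    private final Map<String, String> textById = new ConcurrentHashMap<>();
    private final AtomicInteger extractionCount = new AtomicInteger();

    String extract(InputStream in) {
        if (in instanceof DataStoreInputStream) {
            String id = ((DataStoreInputStream) in).getIdentifier();
            // On a cache hit the stream is not read; the caller still closes it.
            return textById.computeIfAbsent(id, key -> doExtract(in));
        }
        // Streams without an identifier cannot be deduplicated.
        return doExtract(in);
    }

    int getExtractionCount() {
        return extractionCount.get();
    }

    // Placeholder for the real extraction work: decodes the bytes as UTF-8.
    private String doExtract(InputStream in) {
        extractionCount.incrementAndGet();
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this, 70 copies of the same file node would hand the extractor 70 streams carrying the same identifier, and only the first one would be read. An in-memory map is the simplest possible cache; a real implementation would presumably need to persist the extracted text and bound its size.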