Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@jackrabbit.apache.org
Message-ID: <91f3b2650811110106x3c2b576al25ed830de10aa470@mail.gmail.com>
Date: Tue, 11 Nov 2008 10:06:54 +0100
From: "Thomas Müller" <thomas.mueller@day.com>
To: dev@jackrabbit.apache.org
Subject: Re: Workspace.copy() Question ...
In-Reply-To: <F186AC080E44C146BBE4A33472CFA79F0222579F@mxs01.tirol.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <F186AC080E44C146BBE4A33472CFA79F0222579F@mxs01.tirol.local>

Hi,

> i have a nice usecase .. i have a filenode in my workspace and i should create
> about 70 copies of this node.
> its a not so small pdf file (10Mb) and i am using the datastore so its no problem
> the binary exists only one time but the problem is the textextractor. it will be called 70 times :-)
> is it possible to reuse the fulltext index on a copy operation without new reindexing the file ?

It's an interesting use case, and probably quite common. It would be
good if text extraction ran only once for each binary. However, I'm
not sure how this should be implemented. One solution is to extract
the text in the data store, but that would be at the 'wrong' level.

What about this: the DataStore could return a special kind of
InputStream that exposes the DataIdentifier (a DataStoreInputStream,
for example). The text extractor would then use this unique identifier
to ensure that text for the same binary file is extracted only once.
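
To make the idea a bit more concrete, here is a rough sketch in plain
Java. It does not use any Jackrabbit classes: IdentifiedInputStream,
CachingTextExtractor and the String identifier are made-up names that
only stand in for the proposed DataStoreInputStream, the text
extractor and the DataIdentifier. The "extraction" just decodes the
bytes as UTF-8, and the cache is an unbounded in-memory map, which a
real implementation would probably not want.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stream type: wraps another stream and carries the
// identifier of the binary it was opened from.
class IdentifiedInputStream extends FilterInputStream {
    private final String identifier;

    IdentifiedInputStream(InputStream in, String identifier) {
        super(in);
        this.identifier = identifier;
    }

    String getIdentifier() {
        return identifier;
    }
}

// Hypothetical extractor wrapper that remembers the extracted text
// for each identifier it has already seen.
class CachingTextExtractor {
    private final Map<String, String> textByIdentifier = new HashMap<>();
    private int extractionCount = 0;

    String extract(InputStream in) throws IOException {
        if (in instanceof IdentifiedInputStream) {
            String id = ((IdentifiedInputStream) in).getIdentifier();
            String cached = textByIdentifier.get(id);
            if (cached != null) {
                return cached; // same binary seen before: skip extraction
            }
            String text = doExtract(in);
            textByIdentifier.put(id, text);
            return text;
        }
        // Plain stream without an identifier: always extract.
        return doExtract(in);
    }

    int getExtractionCount() {
        return extractionCount;
    }

    // Stand-in for the real (expensive) extraction step.
    private String doExtract(InputStream in) throws IOException {
        extractionCount++;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With this, passing 70 streams that carry the same identifier to one
CachingTextExtractor runs doExtract() once; the other 69 calls are
answered from the map.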

The same mechanism could be used to avoid copying binary data within
the same repository, and between multiple repositories that share the
same data store: if the data store detects such an input stream, it
would first check whether the binary object already exists.
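
A rough sketch of that check, again in plain Java and not the actual
DataStore API: SketchDataStore, StoreInputStream, addRecord() and
getStream() are names I made up for illustration, records are kept in
memory, and I assume the identifier is a SHA-1 hash of the content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory data store, for illustration only.
class SketchDataStore {

    // Stream type handed out by the store; it knows its record identifier.
    static class StoreInputStream extends FilterInputStream {
        private final String identifier;

        StoreInputStream(InputStream in, String identifier) {
            super(in);
            this.identifier = identifier;
        }

        String getIdentifier() {
            return identifier;
        }
    }

    private final Map<String, byte[]> records = new HashMap<>();
    private long bytesCopied = 0;

    String addRecord(InputStream in) throws IOException {
        if (in instanceof StoreInputStream) {
            String known = ((StoreInputStream) in).getIdentifier();
            if (records.containsKey(known)) {
                return known; // already stored: nothing is read or copied
            }
        }
        // Unknown stream: read it fully and store it under its content hash.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        byte[] data = out.toByteArray();
        bytesCopied += data.length;
        String id = contentHash(data);
        records.putIfAbsent(id, data);
        return id;
    }

    InputStream getStream(String id) {
        byte[] data = records.get(id);
        if (data == null) {
            throw new IllegalArgumentException("Unknown record: " + id);
        }
        return new StoreInputStream(new ByteArrayInputStream(data), id);
    }

    long getBytesCopied() {
        return bytesCopied;
    }

    // Assumption: identifiers are the hex-encoded SHA-1 of the content.
    private static String contentHash(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling addRecord(store.getStream(id)) then returns the same id
without reading the stream, so a copy of a 10 MB file costs a map
lookup. Repositories sharing one store instance would get the same
shortcut.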

Regards,
Thomas