jackrabbit-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Workspace.copy() Question ...
Date Wed, 12 Nov 2008 11:10:36 GMT
Hi,

On Tue, Nov 11, 2008 at 10:06 AM, Thomas Müller <thomas.mueller@day.com> wrote:
> It's an interesting use case, and probably quite common. It would be
> good if the text extraction would be run only once for each binary.
> However I'm not sure how this should be implemented... One solution is
> to extract the text in the data store, but that would be in the
> 'wrong' level.

An alternative would be to add an extra stream to binary
InternalValues. That stream (if present) would contain the result of
text extraction on the binary value and could then be used for
indexing.
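
As a rough sketch of what I mean (BinaryValueSketch and its methods are
made-up names for illustration, not the actual InternalValue API), the
binary value would carry an optional second stream with the extracted
text, and a copy of the value would carry that stream along:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical stand-in for a binary InternalValue; not Jackrabbit's real class. */
final class BinaryValueSketch {
    private final byte[] data;
    // Result of text extraction, or null if extraction has not run yet.
    private final String extractedText;

    BinaryValueSketch(byte[] data, String extractedText) {
        this.data = data.clone();
        this.extractedText = extractedText;
    }

    /** The binary content itself. */
    InputStream getStream() {
        return new ByteArrayInputStream(data);
    }

    /** The extra stream: present only if text extraction has already run. */
    Optional<InputStream> getExtractedTextStream() {
        if (extractedText == null) {
            return Optional.empty();
        }
        byte[] utf8 = extractedText.getBytes(StandardCharsets.UTF_8);
        return Optional.of(new ByteArrayInputStream(utf8));
    }

    /** A copy keeps the extracted text, so the indexer need not extract again. */
    BinaryValueSketch copy() {
        return new BinaryValueSketch(data, extractedText);
    }
}
```

The indexer would check getExtractedTextStream() first and fall back to
running the extractor only when the stream is absent.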

In fact, last week at ApacheCon I was talking with the Lucene people
about a way to store the analyzed token stream, which would further
optimize the re-indexing case. Apparently that should be possible with
little effort.

The problem with this is that we'd need to move the text extraction
functionality down to the persistence or item state layer. Our current
configuration mechanism isn't well suited to this, and things like
using the value of the jcr:mimeType property to guide text extraction
might become quite tricky. But I don't see any fundamental reason why
those issues could not be resolved.
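
To illustrate the jcr:mimeType point (ExtractorRegistrySketch is a
made-up name, and plain functions stand in for the real text
extractors), a lower layer only sees the bytes, so the MIME type would
have to be looked up by the caller and passed down explicitly:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical extractor lookup keyed by MIME type; not an existing Jackrabbit class. */
final class ExtractorRegistrySketch {
    private final Map<String, Function<byte[], String>> byMimeType = new HashMap<>();

    void register(String mimeType, Function<byte[], String> extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /**
     * mimeType is the value of the sibling jcr:mimeType property. The
     * binary value does not know it, so it is passed in and may be null.
     */
    Optional<String> extract(byte[] data, String mimeType) {
        if (mimeType == null) {
            return Optional.empty();
        }
        // No registered extractor (or a null result) yields an empty Optional.
        return Optional.ofNullable(byMimeType.get(mimeType))
                .map(extractor -> extractor.apply(data));
    }
}
```

The tricky part is the null case: when the binary is persisted the
jcr:mimeType property may not be available at that level at all.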

BR,

Jukka Zitting
