jackrabbit-users mailing list archives

From Alexander Klimetschek <aklim...@day.com>
Subject Re: Metadata, TextExtractor
Date Fri, 14 Aug 2009 21:33:13 GMT
On Tue, Aug 4, 2009 at 8:16 AM, Zhou Wu<zwu_ca@yahoo.com> wrote:
> 1. It looks as if anyone who wants to store a document's metadata in a
> repository has to do it on their own. Why can't we publish the metadata
> (this should be configurable) during the text extraction stage? If I do it
> myself, I have to process the document a second time just for the metadata,
> which hurts performance badly. Please note that in v2.0 the metadata object
> is indeed obtained by Tika during that stage, but is then discarded. Without
> the metadata in place, we miss too much searchable information in the
> repository.

Metadata extraction cannot easily be handled generically, because it
depends on your input content, on what you want as metadata, and on how
you define your node structure. Hence a repository-level solution could
only be a hook on a JCR save() that lets you do anything with the
changed content and add additional properties. But that is not a good
idea, as every save would then imply many subsequent changes to your
content. And since you need full API access anyway to express your
metadata structure freely, this is best done at the JCR API level by
the application, not by the repository.
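
As a rough sketch of what "done by the application" could look like,
the snippet below shows only the application-specific step: deciding
which extracted metadata keys become which properties. Everything named
here is an assumption for illustration (the class MetadataMapping, the
extractor keys "title" and "Author", and the "myapp:" property names);
none of it comes from Jackrabbit or Tika. Writing the result to the
repository is left out so the example has no dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps raw extractor output to application-defined
// property names. The mapping table is the part that differs per application.
class MetadataMapping {

    // Assumed mapping: extractor key -> property name in the content model.
    static final Map<String, String> PROPERTY_NAMES = new LinkedHashMap<>();
    static {
        PROPERTY_NAMES.put("title", "myapp:title");
        PROPERTY_NAMES.put("Author", "myapp:author");
    }

    // Keeps only the keys the application cares about and renames them.
    // Missing or empty values are skipped.
    static Map<String, String> toNodeProperties(Map<String, String> extracted) {
        Map<String, String> props = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : PROPERTY_NAMES.entrySet()) {
            String value = extracted.get(entry.getKey());
            if (value != null && !value.isEmpty()) {
                props.put(entry.getValue(), value);
            }
        }
        return props;
    }
}
```

In a real application, each resulting entry would then be written with
Node.setProperty(name, value) and persisted with a single
Session.save(), so the metadata and the document end up in the
repository in one explicit step that the application controls.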

Full-text extraction is different, because it does not change the JCR
content. It "only" extracts the text from binary or string properties
and makes it available to the full-text search index.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com
