jackrabbit-dev mailing list archives

From David Nuescheler <david.nuesche...@gmail.com>
Subject Re: best practices for searching binary content?
Date Sat, 07 May 2005 17:49:30 GMT
hi martin,

> what is the best practice for managing searchable binary content (like word- or
> pdf-documents) in jackrabbit?
> I am thinking about stripping the text with tools like Jakarta Apache POI and
> writing it as text content to the repository, with some structure like
>
> mynt:wordDocument
> |
> +- nt:unstructured (stripped text goes here)
> |
> +- nt:file (word doc as binary goes here)
>
>
> would that be the right way?
>  
>
to me this looks a bit awkward. if you decide to duplicate
the content (which may not be the right way), then i would extend
nt:resource with a myapp:fulltext string property. however, the
best integration is to build the text extraction directly into
the indexer, so the binary is stored only once.
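
for the property approach, the node type could look roughly like this
in CND-style notation. this is only a sketch: the myapp namespace uri
and the myapp:resource type name are placeholders made up for
illustration, not anything jackrabbit ships.

    // hypothetical namespace, pick your own uri
    <myapp = 'http://example.com/myapp/1.0'>

    // nt:resource plus one extra property holding the extracted text
    [myapp:resource] > nt:resource
      - myapp:fulltext (string)

the application would then fill myapp:fulltext with the extracted text
whenever it writes jcr:data, which is exactly the duplication
mentioned above.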

regards,
david
