jackrabbit-dev mailing list archives

From Edgar Poce <edgarp...@gmail.com>
Subject Re: best practices for searching binary content?
Date Wed, 04 May 2005 17:55:35 GMT
Martin Chalupka wrote:
> what is the best practice for managing searchable binary content (like word- or
> pdf-documents) in jackrabbit?
> I am thinking about stripping the text with tools like Jakarta Apache POI and
> writing it as text content to the repository, with some structure like
> would that be the right way?
Duplicating data is rarely the right way.

Apparently, indexing binary values with known mime types is on the to-do
list. Quote from o.a.j.core.search.lucene.NodeIndexer:
"todo add support for indexing of nt:resource. e.g. when mime type is text"

I think that a configurable way to map text extractors to mime types
would be useful. Mime types other than text/plain could then be supported. WDYT?
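
To make the idea a bit more concrete, here is a rough sketch of what such a mapping could look like. TextExtractor and TextExtractorRegistry are names made up for this example, they are not existing Jackrabbit classes, and how the indexer would actually call them is left open. It only assumes that the mime type and the binary stream of the value are available at indexing time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface: turns a binary stream into indexable text.
interface TextExtractor {
    String extract(InputStream in) throws IOException;
}

// Hypothetical registry: maps mime types to extractors.
// It could be filled from configuration instead of being hard-coded.
final class TextExtractorRegistry {
    private final Map<String, TextExtractor> extractors = new HashMap<>();

    // Associates an extractor with a mime type such as "text/plain".
    void register(String mimeType, TextExtractor extractor) {
        extractors.put(normalize(mimeType), extractor);
    }

    // Returns the extracted text, or empty if no extractor is registered
    // for the mime type, so the caller can simply skip the value.
    Optional<String> extract(String mimeType, InputStream in) {
        if (mimeType == null) {
            return Optional.empty();
        }
        TextExtractor extractor = extractors.get(normalize(mimeType));
        if (extractor == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(extractor.extract(in));
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    // Drops parameters such as "; charset=utf-8" and lower-cases the type.
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

A plain text extractor would then be registered under "text/plain", and extractors for Word or PDF documents (for example built on POI) could be registered under their own mime types without touching the indexer. Values with an unknown mime type come back empty and are simply not indexed.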

