jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Searching in file contents
Date Thu, 27 Oct 2005 13:37:58 GMT
Hi Martin,

jackrabbit comes with an extension mechanism that allows you to plug in 
text filters. those filters basically convert a binary stream into a 
character stream that can be indexed by lucene.

the core classes contain a sample implementation that filters binaries 
of type text/plain (also not very innovative, but it takes the 
encoding into account. that's at least something ;))
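
To make the idea concrete, here is a minimal sketch of what such a text/plain filter could look like. Note that the BinaryTextFilter interface, its method names and the UTF-8 fallback are hypothetical and chosen for illustration only; this is not the actual jackrabbit API, so check the core classes for the real interface.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TextFilterSketch {

    /** Hypothetical filter contract (NOT the jackrabbit API). */
    interface BinaryTextFilter {
        /** Returns true if this filter understands the given MIME type. */
        boolean canFilter(String mimeType);

        /** Turns the binary stream into a character stream for the indexer. */
        Reader filter(InputStream binary, String encoding);
    }

    /** Handles text/plain by decoding the bytes with the declared encoding. */
    static final class PlainTextFilter implements BinaryTextFilter {
        @Override
        public boolean canFilter(String mimeType) {
            return "text/plain".equalsIgnoreCase(mimeType);
        }

        @Override
        public Reader filter(InputStream binary, String encoding) {
            // Assumption: fall back to UTF-8 when no encoding is declared.
            Charset charset = (encoding == null)
                    ? StandardCharsets.UTF_8
                    : Charset.forName(encoding);
            return new InputStreamReader(binary, charset);
        }
    }

    /** Drains a reader into a String; stands in for the indexer consuming it. */
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Bytes as they might be stored in a binary property, encoded as ISO-8859-1.
        byte[] stored = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        BinaryTextFilter filter = new PlainTextFilter();
        if (filter.canFilter("text/plain")) {
            Reader chars = filter.filter(new ByteArrayInputStream(stored), "ISO-8859-1");
            System.out.println(readAll(chars));
        }
    }
}
```

A filter for pdf or ms office documents would follow the same shape, answering canFilter for its own MIME type and delegating the extraction to a suitable parsing library.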

there are additional text filters in contrib, if I remember correctly 
for some ms office document formats and pdf.

simply build the text filter contrib and put it on the classpath; that 
should do it.

btw, this mechanism doesn't need an additional property to store the 
text version of the binary.


Martin Perez wrote:
> Hi again. Here goes another one about searching.
> I'm storing files in jackrabbit for later searching (how innovative!).
> Ok, I'm storing the content using the "jcr:data" property:
> node.setProperty("jcr:data",inputstream), where inputstream is the stream with
> the file contents.
> The problem is that I don't know how to search within those contents later.
> The content is sometimes binary (images, video, pdfs, ...) and sometimes
> text (html, xml, txt, ...). Currently I'm using the following query statement:
> //*[jcr:contains(@jcr:data,'phrase')]
> So, first question: how do I search within stream properties?
> And the second one: I'm migrating a repository system that was based on
> lucene. In that repository system, I followed this process to index
> binary content:
> 1 - Try to extract the text from the file (pdf extractors, word extractors,
> excel extractors, etc..)
> 2 - Store the file contents in database or filesystem storage
> 3 - Index the text content.
> But now I have the problem of how to handle word, pdf, excel, etc. One
> option is to extract the text and store both "extracted-text" and "content"
> as properties, but this would duplicate storage for these files.
> So, how would you handle storage and searching within binary documents like
> pdf or word ones?
> Thanks!
> Martin
