jackrabbit-dev mailing list archives

From "Miro Walker" <miro.wal...@cognifide.com>
Subject RE: Searching in file contents
Date Thu, 27 Oct 2005 12:56:59 GMT

We had a requirement similar to your second question, and we have been
doing exactly what you suggested: we store "Search Index Text"
alongside the data. Whether that suits you depends on how strictly
constrained you are in terms of storage.

The added advantage of this approach is that when you want to display
search results you don't need to process the original binary format in
order to display a snippet of matching text from the file.
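
To illustrate the snippet advantage, here is a minimal sketch in plain
Java. It assumes the extracted text has already been read back from the
stored "Search Index Text" property as a String; the class name, method
name and the radius parameter are hypothetical and not part of any
Jackrabbit API.

```java
public class SnippetSketch {

    /**
     * Returns the first case-insensitive match of term in indexText,
     * with up to radius characters of context on each side.
     * Returns an empty string when there is no match.
     */
    public static String snippet(String indexText, String term, int radius) {
        if (term.isEmpty()) {
            return "";
        }
        // Scan for the first position where term matches, ignoring case.
        int hit = -1;
        for (int i = 0; i + term.length() <= indexText.length(); i++) {
            if (indexText.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(indexText.length(), hit + term.length() + radius);
        // Mark truncation on either side with an ellipsis.
        String prefix = start > 0 ? "..." : "";
        String suffix = end < indexText.length() ? "..." : "";
        return prefix + indexText.substring(start, end) + suffix;
    }

    public static void main(String[] args) {
        // Stand-in for text previously extracted from a binary file.
        String stored = "Quarterly report: revenue grew in the third quarter.";
        System.out.println(snippet(stored, "revenue", 10));
    }
}
```

Because the snippet is cut from the stored text, the original PDF or
Word file never has to be parsed at display time.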


-----Original Message-----
From: Martin Perez [mailto:mpermar@gmail.com] 
Sent: 27 October 2005 13:35
To: jackrabbit-dev@incubator.apache.org
Subject: Searching in file contents

Hi again. Here goes another one about searching.

I'm storing files in Jackrabbit for later searching (how innovative!).
OK, I'm storing the content using the "jcr:data" property:

node.setProperty("jcr:data", inputstream), where inputstream is the
stream of the file contents.

The problem is that I don't know how to search later within that
content. The content can sometimes be binary (images, video, PDFs, ...)
and sometimes text (HTML, XML, TXT, ...). Currently I'm using the
following query statement

So, first question: how do I search within stream properties?

And the second one. I'm migrating a repository system that was based on
Lucene. In that repository system, I followed this process to handle
binary content:

1 - Try to extract the text from the file (PDF extractors, Word and
Excel extractors, etc.)
2 - Store the file contents in database or filesystem storage
3 - Index the text content.
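
The three steps above can be sketched roughly as follows. This is only
an illustration in plain Java: a toy in-memory word index stands in for
Lucene, a HashMap stands in for the database or filesystem storage, and
the TextExtractor interface and all class and method names are
hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExtractStoreIndexSketch {

    /** Hypothetical extractor; real ones would wrap PDF, Word or Excel parsers. */
    public interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> extractors = new HashMap<>();
    // Stand-in for database or filesystem storage (step 2).
    private final Map<String, byte[]> storage = new HashMap<>();
    // Toy inverted index, word -> document ids (step 3).
    private final Map<String, Set<String>> index = new HashMap<>();

    public void register(String mimeType, TextExtractor extractor) {
        extractors.put(mimeType, extractor);
    }

    public void add(String id, String mimeType, byte[] content) {
        // Step 1: try to extract text; unknown types yield no text.
        TextExtractor extractor = extractors.get(mimeType);
        String text = extractor == null ? "" : extractor.extract(content);

        // Step 2: store the original file contents unchanged.
        storage.put(id, content);

        // Step 3: index the extracted text word by word.
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Returns the ids of all documents whose extracted text contains word. */
    public Set<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT),
                Collections.<String>emptySet());
    }

    public byte[] load(String id) {
        return storage.get(id);
    }
}
```

Note that only the index grows with the extracted text here; the text
itself is discarded after indexing, which is what avoids duplicated
storage but also rules out showing snippets from it later.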

But now I have the problem of how to manage Word, PDF, Excel, etc.
files. One option is to extract the text and store both
"extracted-text" and the original content as properties, but this will
duplicate storage for these files.

So, how would you handle storage and searching within binary text files
such as PDF or Word ones?


