jackrabbit-dev mailing list archives

From Martin Perez <mper...@gmail.com>
Subject Searching in file contents
Date Thu, 27 Oct 2005 12:34:35 GMT
Hi again. Here is another question about searching.

I'm storing files in Jackrabbit for later searching (how innovative!).
I'm storing the content in the "jcr:data" property:

node.setProperty("jcr:data", inputstream), where inputstream is the stream with
the file contents.

The problem is that I don't know how to search within that content later.
The content is sometimes binary (images, video, PDFs, ...) and sometimes
text (HTML, XML, TXT, ...). Currently I'm using the following query statement:
//*[jcr:contains(@jcr:data,'phrase')]
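
For reference, here is a minimal sketch of how such a statement can be assembled in plain Java. The class and method names (ContainsQuery, build, escapeLiteral) are made up for illustration, and doubling single quotes inside the phrase is an assumption about how the XPath string literal should be escaped.

```java
// Hypothetical helper; the names here are illustrative, not part of any JCR API.
final class ContainsQuery {

    // Doubles single quotes so the phrase stays inside one XPath string literal
    // (assumed escaping rule).
    static String escapeLiteral(String phrase) {
        return phrase.replace("'", "''");
    }

    // Builds //*[jcr:contains(@<property>,'<phrase>')] for the given property name.
    static String build(String property, String phrase) {
        return "//*[jcr:contains(@" + property + ",'" + escapeLiteral(phrase) + "')]";
    }

    public static void main(String[] args) {
        System.out.println(build("jcr:data", "phrase"));
        System.out.println(build("jcr:data", "it's here"));
    }
}
```

The resulting string would then be handed to QueryManager.createQuery(statement, Query.XPATH). Whether jcr:contains actually matches anything inside a stream property is exactly what I'm unsure about.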

So, first question: how do I search within stream properties?


And the second one. I'm migrating a repository system that was based on
Lucene. In that system, I followed this process to index
binary content:

1 - Try to extract the text from the file (PDF extractors, Word extractors,
Excel extractors, etc.)
2 - Store the file contents in database or filesystem storage
3 - Index the text content.
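
To make that process concrete, here is a minimal sketch of it in plain Java. It is not the real code: an in-memory map stands in for the Lucene index, another map stands in for the database or filesystem storage, and the names TextExtractor and SimpleRepository are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical extractor contract: one implementation per file type (PDF, Word, ...).
interface TextExtractor {
    String extract(byte[] content);
}

// Hypothetical stand-in for the old repository; maps replace real storage and Lucene.
final class SimpleRepository {
    private final Map<String, byte[]> storage = new HashMap<>();     // step 2: original bytes
    private final Map<String, Set<String>> index = new HashMap<>();  // step 3: term -> file ids

    void add(String id, byte[] content, TextExtractor extractor) {
        String text = extractor.extract(content);                    // step 1: extract the text
        storage.put(id, content);                                    // step 2: store the file once
        // Step 3: index only the extracted text; the text itself is not kept.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    // Returns the ids of all files whose extracted text contained the term.
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    byte[] load(String id) {
        return storage.get(id);
    }

    public static void main(String[] args) {
        SimpleRepository repo = new SimpleRepository();
        // A plain-text "extractor" that just decodes the bytes as UTF-8.
        TextExtractor plainText = bytes -> new String(bytes, StandardCharsets.UTF_8);
        repo.add("notes.txt", "Searching in file contents".getBytes(StandardCharsets.UTF_8), plainText);
        System.out.println(repo.search("contents"));
    }
}
```

The point of this layout is that the file bytes are stored once and the extracted text lives only in the index.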

But now I have the problem of how to handle Word, PDF, Excel, etc. files. One
option is to extract the text and store both "extracted-text" and "content"
as properties, but this would duplicate storage for these files.

So, how would you handle storage and searching for binary documents that
contain text, like PDF or Word files?

Thanks!

Martin
