jackrabbit-users mailing list archives

From Jerry Toddsen <jt6...@gmail.com>
Subject Searching binary data
Date Fri, 16 Apr 2010 15:54:00 GMT
Hi all,

I am having a problem with full text searching and binary fields. I am
uploading some files to a Jackrabbit repository, and setting the jcr:data
property of a resource node with a Binary object containing the file
contents. I can retrieve the documents manually and verify that they are
there, but when I run a query for text that I know is in the documents, the
query returns no results.

I have set the SearchIndex tag in the repository.xml file:

        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index" />
            <param name="textFilterClasses"
                   value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
                          org.apache.jackrabbit.extractor.MsExcelTextExtractor,
                          org.apache.jackrabbit.extractor.PdfTextExtractor,
                          org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
                          org.apache.jackrabbit.extractor.RTFTextExtractor,
                          org.apache.jackrabbit.extractor.HTMLTextExtractor,
                          org.apache.jackrabbit.extractor.XMLTextExtractor" />
            <param name="extractorTimeout" value="100"/>
            <param name="extractorPoolSize " value="2"/>
            <param name="supportHighlighting" value="true"/>
        </SearchIndex>


I have tried storing MS Word (.doc) and MS Excel (.xls) documents, and even
plain HTML files just to test. All of them contain the phrase "the quick
brown fox jumped over the lazy dog". I detect the MIME type with Tika and
store it in the jcr:mimeType property when I save the file.

My query looks like:

final Query q = qm.createQuery(
    "SELECT * from nt:resource WHERE contains (*, '*quick brown*')", Query.SQL);
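
For reference, here is a minimal sketch of how a query string of this shape could be assembled from a node type and a search term. It assumes the usual SQL convention of doubling single quotes inside a string literal. The class and method names (ContainsQuery, escape, build) are hypothetical helpers for illustration, not part of the JCR or Jackrabbit API, and the sketch only builds the string; it does not touch a repository.

```java
// Hypothetical helper, not part of the JCR API: builds a JCR SQL
// full-text query string like the one used above.
final class ContainsQuery {
    private ContainsQuery() {}

    /** Doubles single quotes so the term can sit inside a SQL string literal. */
    static String escape(String term) {
        return term.replace("'", "''");
    }

    /** Returns a SELECT statement with a CONTAINS clause over all properties. */
    static String build(String nodeType, String term) {
        return "SELECT * FROM " + nodeType
                + " WHERE CONTAINS(*, '" + escape(term) + "')";
    }

    public static void main(String[] args) {
        // The resulting string would be passed to qm.createQuery(..., Query.SQL).
        System.out.println(build("nt:resource", "quick brown"));
    }
}
```

The string returned by build could then be handed to qm.createQuery together with Query.SQL, exactly as in the snippet above.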

This is all using JCR 2.0. I have had success searching on non-binary
properties such as encoding and mimeType, but I can never get a match when
the query has to search the binary data. Any help would be appreciated,
thanks.
