To: users@jackrabbit.apache.org
From: Sergio Tridente <tioduke@gmail.com>
Subject: Re: Searching inside binary contents and other queries
Date: Tue, 10 Jun 2008 20:56:24 -0400
Message-ID: <g2n7r1$4il$1@ger.gmane.org>
References: <g2lpd1$48r$1@ger.gmane.org> <484E846A.9050005@gmx.net>

Thank you Marcel and Florian for your answers.

For the moment we are evaluating whether to use Jackrabbit for our
application or go with an RDBMS. But that's another matter.

Please bear with my ignorance. I have written a simple program that stores a
PDF file and searches inside its contents.

Here's the code that stores the PDF:
Node root = session.getRootNode();
Node folder = root.addNode("my.project", "nt:folder");
Node file = folder.addNode("my.pdf", "nt:file");
Node resource = file.addNode("jcr:content", "nt:resource");
resource.setProperty("jcr:mimeType", "application/pdf");
FileInputStream pdf = new FileInputStream(
    "/home/sergio/Documents/10gR2_openSUSE102_introduction.pdf");
resource.setProperty("jcr:data", pdf);
Calendar cal = Calendar.getInstance();
cal.set(2008, Calendar.JUNE, 10);
resource.setProperty("jcr:lastModified", cal);
session.save();
pdf.close();

I think this part is right. But when I run a search I get the following
exception:
Exception in thread "main" org.apache.lucene.store.AlreadyClosedException:
this IndexReader is closed

Here is the code that performs the search inside the PDF's contents:
Workspace ws = session.getWorkspace();
QueryManager qm = ws.getQueryManager();
Query q = qm.createQuery(
    "select * from nt:resource where jcr:contains='Oracle'", Query.SQL);
QueryResult res = q.execute();
NodeIterator it = res.getNodes();
while (it.hasNext()) {
   Node n = it.nextNode();
   Property prop = n.getProperty("jcr:lastModified");
   System.out.println("Found document containing the word 'Oracle', "
       + "last modified date: " + prop.getDate());
}

I don't know if I got the query language syntax right. If you can point me
to some resources on it, that would be great.
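
On the syntax question: JCR 1.0 (JSR 170) writes full-text search as a
function call rather than a comparison, CONTAINS(scope, 'term') in SQL and
jcr:contains(scope, 'term') in XPath. The sketch below only builds the two
statement strings, so it runs without a repository. The class name
FulltextQueries and its helper methods are hypothetical, and the quote
doubling is an assumption about how a term with an apostrophe should be
escaped. Whether this change also resolves the AlreadyClosedException is not
established here.

```java
// Hypothetical helper, not part of the JCR or Jackrabbit API.
final class FulltextQueries {
    private FulltextQueries() {}

    /** Wraps the term in single quotes, doubling any embedded single quote. */
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    /** JCR 1.0 SQL form: CONTAINS(*, 'term') searches all indexed properties of the node. */
    static String sql(String nodeType, String term) {
        return "SELECT * FROM " + nodeType + " WHERE CONTAINS(*, " + quote(term) + ")";
    }

    /** JCR 1.0 XPath form: jcr:contains(., 'term') on nodes of the given type. */
    static String xpath(String nodeType, String term) {
        return "//element(*, " + nodeType + ")[jcr:contains(., " + quote(term) + ")]";
    }

    public static void main(String[] args) {
        System.out.println(sql("nt:resource", "Oracle"));
        System.out.println(xpath("nt:resource", "Oracle"));
    }
}
```

The resulting string would then be passed to qm.createQuery(statement,
Query.SQL), or with Query.XPATH for the second form.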

I also wanted to point out that the text filters are declared inside
repository.xml:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
    <param name="extractorPoolSize" value="2"/>
    <param name="supportHighlighting" value="true"/>
</SearchIndex>

The following JARs are in my CLASSPATH: PDFBox.jar, poi.jar and
tm-extractors.jar
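
One way to sanity-check this setup is to confirm that every class named in
textFilterClasses can actually be loaded from the CLASSPATH. The sketch below
does that with plain Class.forName; the class ExtractorClasspathCheck and its
missing() method are hypothetical names, and the check is only an assumption
about where a problem might lie, not a diagnosis of the exception above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of Jackrabbit.
final class ExtractorClasspathCheck {
    private ExtractorClasspathCheck() {}

    /** Returns the names from a comma-separated class list that cannot be loaded. */
    static List<String> missing(String commaSeparatedClassNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : commaSeparatedClassNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas and empty input
            }
            try {
                // initialize=false: locate the class without running its static initializers
                Class.forName(trimmed, false, ExtractorClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(trimmed);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Two of the class names from the textFilterClasses value above.
        String configured = "org.apache.jackrabbit.extractor.PdfTextExtractor,"
                + "org.apache.jackrabbit.extractor.MsWordTextExtractor";
        System.out.println("Not loadable: " + missing(configured));
    }
}
```

This only checks the extractor classes themselves. A class that loads here
can still fail later if a library it depends on (PDFBox, POI) is absent.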

If you could point me in the right direction, I would highly appreciate it.

-- 
Best regards

Sergio Tridente


Marcel Reutegger wrote:
> Sergio wrote:
>> 1) As our database will be holding most of the data, I thought about the
>> following schema: storing the documents inside BLOBs in the database (in
>> case we need to access them using some other criteria) AND in
>> Jackrabbit's repository. While storing those documents using Jackrabbit,
>> I plan to keep the RDBMS' pointers (probably the document's record
>> primary key) using properties. The question is: does this make sense? Is
>> it a common practice? And if not, what is the standard approach?
> 
> well, the recommended approach is to replace your RDBMS with Jackrabbit.
> 
>> 2) Do I need to define node types for representing my documents? If not,
>> is there some standard type I can use?
> 
> for files and folders there's nt:file and nt:folder. See:
> http://wiki.apache.org/jackrabbit/NodeTypeRegistry and of course the JSR
> 170 specification.
> 
>> 3) I have read that Jackrabbit is able to read inside some document
>> types, how do you accomplish that? Using TextExtractors?
> 
> correct. see: http://jackrabbit.apache.org/jackrabbit-text-extractors.html
> 
>> How? Could you point me
>> to some examples? I failed to find any. Does it depend on the way I store
>> those documents? If so, how do you do it?
> 
> the text extractors only work with nt:resource nodes. this means your
> content structure would look like this:
> 
> + my.pdf (nt:file)
>    - jcr:created=20080101 (DATE)
>    + jcr:content (nt:resource)
>      - jcr:mimeType=application/pdf (STRING)
>      - jcr:lastModified=20080101 (DATE)
>      - jcr:data=<pdf-binary> (BINARY)
> 
> regards
>   marcel