jackrabbit-users mailing list archives

From Sergio Tridente <tiod...@gmail.com>
Subject Re: Searching inside binary contents and other queries
Date Wed, 11 Jun 2008 00:56:24 GMT
Thank you Marcel and Florian for your answers.

For the moment we are evaluating whether to use Jackrabbit for our
application or go with an RDBMS. But that's another matter.

Please, bear with my ignorance. I have coded a simple program for storing a
pdf file and searching inside its contents.

Here's the code that stores the pdf:
Node root = session.getRootNode();
Node folder = root.addNode("my.project", "nt:folder");
Node file = folder.addNode("my.pdf", "nt:file");
Node resource = file.addNode("jcr:content", "nt:resource");
resource.setProperty("jcr:mimeType", "application/pdf");
FileInputStream pdf = new FileInputStream("/home/sergio/Documents/10gR2_openSUSE102_introduction.pdf");
resource.setProperty("jcr:data", pdf);
Calendar cal = Calendar.getInstance();
cal.set(2008, Calendar.JUNE, 10);
resource.setProperty("jcr:lastModified", cal);
session.save();
pdf.close();

So far I think I got it right. But when I try to do a search I get the
following message:
Exception in thread "main" org.apache.lucene.store.AlreadyClosedException:
this IndexReader is closed

Here I am pasting the code for performing the search inside the PDF's
contents:
Workspace ws = session.getWorkspace();
QueryManager qm = ws.getQueryManager();
Query q = qm.createQuery("select * from nt:resource where jcr:contains='Oracle'", Query.SQL);
QueryResult res = q.execute();
NodeIterator it = res.getNodes();
while (it.hasNext()) {
   Node n = it.nextNode();
   Property prop = n.getProperty("jcr:lastModified");
   System.out.println("Found document containing the word 'Oracle', last modified date: " + prop.getDate());
}

I don't know whether I got the query language syntax right. If you can
point me to some resources where I can take a look, that would be great.
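
For comparison, here is a minimal sketch of how a full-text condition is usually written in the JCR 1.0 query languages: SQL uses a CONTAINS(scope, expression) function, and XPath uses jcr:contains(scope, expression), rather than a jcr:contains='...' comparison. The class FullTextQueries and its methods are hypothetical helpers made up for illustration and are not part of the JCR or Jackrabbit API. To avoid dealing with search-expression escaping, the sketch only accepts a single plain word. Whether either query returns hits still depends on the repository and index configuration.

```java
// Sketch only: builds JCR 1.0 full-text query strings for one plain word.
// FullTextQueries is a hypothetical helper, not a JCR or Jackrabbit class.
final class FullTextQueries {

    // SQL form: CONTAINS(*, 'word') searches all indexed properties of the node.
    static String sqlContains(String nodeType, String word) {
        requirePlainWord(word);
        return "SELECT * FROM " + nodeType + " WHERE CONTAINS(*, '" + word + "')";
    }

    // XPath form: jcr:contains(., 'word') applied to nodes of the given type.
    static String xpathContains(String nodeType, String word) {
        requirePlainWord(word);
        return "//element(*, " + nodeType + ")[jcr:contains(., '" + word + "')]";
    }

    // Assumption: only letters and digits are accepted, so no escaping is needed.
    private static void requirePlainWord(String word) {
        if (word == null || !word.matches("[\\p{L}\\p{N}]+")) {
            throw new IllegalArgumentException("expected a single plain word: " + word);
        }
    }

    public static void main(String[] args) {
        System.out.println(sqlContains("nt:resource", "Oracle"));
        System.out.println(xpathContains("nt:resource", "Oracle"));
    }
}
```

The resulting strings would be passed to qm.createQuery(...) with Query.SQL or Query.XPATH respectively.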

I also wanted to point out that the TextFilters are declared inside the
repository.xml:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
    <param name="extractorPoolSize" value="2"/>
    <param name="supportHighlighting" value="true"/>
</SearchIndex>

The following JARs are in my CLASSPATH: PDFBox.jar, poi.jar and
tm-extractors.jar
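
One way to double-check a classpath like this is to ask the JVM whether the configured extractor classes can actually be loaded. The following is a rough sketch under that assumption; ExtractorClasspathCheck and missingClasses are hypothetical names invented for this example, and the class names in main are simply copied from the textFilterClasses parameter above. It only tests that each class itself loads, not that every library the extractor needs at run time is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: reports which of the given class names cannot be loaded.
// ExtractorClasspathCheck is a hypothetical helper, not a Jackrabbit class.
final class ExtractorClasspathCheck {

    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ExtractorClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // Load without running static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Names taken from the textFilterClasses parameter in repository.xml.
        List<String> configured = Arrays.asList(
                "org.apache.jackrabbit.extractor.PdfTextExtractor",
                "org.apache.jackrabbit.extractor.MsWordTextExtractor",
                "org.apache.jackrabbit.extractor.MsExcelTextExtractor");
        System.out.println("Not loadable: " + missingClasses(configured));
    }
}
```

Run with the same CLASSPATH as the application; any name that is printed points at a JAR that is missing from it.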

If you could point me in the right direction, I would highly appreciate it.

-- 
Best regards

Sergio Tridente


Marcel Reutegger wrote:
> Sergio wrote:
>> 1) As our database will be holding most of the data, I thought about the
>> following schema: storing the documents inside BLOBs in the database (in
>> case we need to access them using some other criteria) AND in
>> Jackrabbit's repository. While storing those documents using Jackrabbit,
>> I plan to keep the RDBMS' pointers (probably the document's record
>> primary key) using properties. The question is: does this make sense? Is
>> it a common practice? And if not, what is the standard approach?
> 
> well, the recommended approach is to replace your RDBMS with Jackrabbit.
> 
>> 2) Do I need to define node types for representing my documents? If not,
>> is there some standard type I can use?
> 
> for files and folders there's nt:file and nt:folder. See:
> http://wiki.apache.org/jackrabbit/NodeTypeRegistry and of course the JSR
> 170 specification.
> 
>> 3) I have read that Jackrabbit is able to read inside some document
>> types, how do you accomplish that? Using TextExtractors?
> 
> correct. see: http://jackrabbit.apache.org/jackrabbit-text-extractors.html
> 
>> How? Could you point me
>> to some examples? I failed to find any. Does it depend on the way I store
>> those documents? If so, how do you do it?
> 
> the text extractors only work with nt:resource nodes. this means your
> content structure would look like this:
> 
> + my.pdf (nt:file)
>    - jcr:created=20080101 (DATE)
>    + jcr:content (nt:resource)
>      - jcr:mimeType=application/pdf (STRING)
>      - jcr:lastModified=20080101 (DATE)
>      - jcr:data=<pdf-binary> (BINARY)
> 
> regards
>   marcel


