To: users@jackrabbit.apache.org
From: Sergio Tridente
Subject: Re: Searching inside binary contents abd other queries
Date: Tue, 10 Jun 2008 20:56:24 -0400

Thank you Marcel and Florian for your answers.

For the moment we are evaluating whether to use Jackrabbit for our application or go with an RDBMS. But that's another matter.

Please bear with my ignorance. I have written a simple program that stores a PDF file and searches inside its contents. Here's the code that stores the PDF:

    Node root = session.getRootNode();
    Node folder = root.addNode("my.project", "nt:folder");
    Node file = folder.addNode("my.pdf", "nt:file");
    Node resource = file.addNode("jcr:content", "nt:resource");
    resource.setProperty("jcr:mimeType", "application/pdf");
    FileInputStream pdf = new FileInputStream("/home/sergio/Documents/10gR2_openSUSE102_introduction.pdf");
    resource.setProperty("jcr:data", pdf);
    Calendar cal = Calendar.getInstance();
    cal.set(2008, Calendar.JUNE, 10);
    resource.setProperty("jcr:lastModified", cal);
    session.save();
    pdf.close();

So far I think I got it right. But when I try to run a search I get the following message:

    Exception in thread "main" org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed

Here is the code that performs the search inside the PDF's contents:

    Workspace ws = session.getWorkspace();
    QueryManager qm = ws.getQueryManager();
    Query q = qm.createQuery("select * from nt:resource where jcr:contains='Oracle'", Query.SQL);
    QueryResult res = q.execute();
    NodeIterator it = res.getNodes();
    while (it.hasNext()) {
        Node n = it.nextNode();
        Property prop = n.getProperty("jcr:lastModified");
        System.out.println("Found document containing the word 'Oracle', last modified date: " + prop.getDate());
    }

I don't know if I got the query language syntax right.
If you can point me to some resources where I can take a look, it would be great.

I also wanted to point out that the TextFilters are declared inside the repository.xml:

The following JARs are in my CLASSPATH: PDFBox.jar, poi.jar and tm-extractors.jar

If you could point me in the right direction, I would highly appreciate it.

--
Best regards
Sergio Tridente

Marcel Reutegger wrote:
> Sergio wrote:
>> 1) As our database will be holding most of the data, I thought about the
>> following schema: storing the documents inside BLOBs in the database (in
>> case we need to access them using some other criteria) AND in
>> Jackrabbit's repository. While storing those documents using Jackrabbit,
>> I plan to keep the RDBMS' pointers (probably the document's record
>> primary key) using properties. The question is: does this make sense? Is
>> it a common practice? And if not, what is the standard approach?
>
> well, the recommended approach is to replace your RDBMS with Jackrabbit.
>
>> 2) Do I need to define node types for representing my documents? If not,
>> is there some standard type I can use?
>
> for files and folders there's nt:file and nt:folder. See:
> http://wiki.apache.org/jackrabbit/NodeTypeRegistry and of course the JSR
> 170 specification.
>
>> 3) I have read that Jackrabbit is able to read inside some document
>> types, how do you accomplish that? Using TextExtractors?
>
> correct. see: http://jackrabbit.apache.org/jackrabbit-text-extractors.html
>
>> How? Could you point me
>> to some examples? I failed to find any. Does it depend on the way I store
>> those documents? If so, how do you do it?
>
> the text extractors only work with nt:resource nodes. this means your
> content structure would look like this:
>
> + my.pdf (nt:file)
>   - jcr:created=20080101 (DATE)
>   + jcr:content (nt:resource)
>     - jcr:mimeType=application/pdf (STRING)
>     - jcr:lastModified=20080101 (DATE)
>     - jcr:data= (BINARY)
>
> regards
> marcel
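
On the query-syntax question raised in the message: as far as I know, JCR 1.0 (JSR 170) expresses full-text search in SQL with a CONTAINS(...) function, while jcr:contains(...) is the XPath spelling, so the `jcr:contains='Oracle'` comparison is probably not accepted as a full-text constraint. The sketch below only builds the two query strings for the `nt:resource` and `'Oracle'` values used in the message. The class and method names are hypothetical helpers (not part of Jackrabbit), the `*` scope in the SQL form is the variant I believe Jackrabbit accepts, and the quote-doubling escape is an assumption that should be adequate for simple search terms. Whether the syntax is related to the AlreadyClosedException is not something this sketch can tell.

```java
// Sketch: builds JCR 1.0 full-text query strings.
// FullTextQueries and its methods are hypothetical helpers, not Jackrabbit API.
final class FullTextQueries {

    private FullTextQueries() {}

    // Doubles single quotes so the term can sit inside a quoted literal.
    // Assumption: this is enough escaping for simple search terms.
    static String escape(String term) {
        return term.replace("'", "''");
    }

    // SQL form: full-text search is written as a CONTAINS(scope, term) function call.
    static String sql(String nodeType, String term) {
        return "SELECT * FROM " + nodeType
                + " WHERE CONTAINS(*, '" + escape(term) + "')";
    }

    // XPath form: the same constraint written with the jcr:contains() function.
    static String xpath(String nodeType, String term) {
        return "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escape(term) + "')]";
    }

    public static void main(String[] args) {
        System.out.println(sql("nt:resource", "Oracle"));
        System.out.println(xpath("nt:resource", "Oracle"));
    }
}
```

The resulting strings would be passed to the existing search code, for example `qm.createQuery(FullTextQueries.sql("nt:resource", "Oracle"), Query.SQL)` or, for the second form, with `Query.XPATH` as the language argument.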