From: Justin Edelson
To: users@jackrabbit.apache.org
Date: Thu, 16 Dec 2010 11:40:01 -0500
Subject: Re: FullText Indexing

I would remove that dependency. Using a 1.6.4 library with Jackrabbit
2.1.2 just seems like a bad idea.

On Thu, Dec 16, 2010 at 11:10 AM, Rojas Buitrago, Sergio wrote:

> I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.
>
> For the text extractors, I get the necessary library from the following
> Maven dependency:
>
>     <dependency>
>         <groupId>org.apache.jackrabbit</groupId>
>         <artifactId>jackrabbit-text-extractors</artifactId>
>         <version>1.6.4</version>
>     </dependency>
>
> Is there any other useful information I should provide?
>
> Regards.
>
> -----Original Message-----
> From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of
> Justin Edelson
> Sent: Thursday, 16 December 2010 16:26
> To: users@jackrabbit.apache.org
> Subject: Re: FullText Indexing
>
> Sergio-
> The ClassCastException and the NoSuchMethodException you posted on dev@
> suggest a classpath problem. I would suggest posting the details of your
> deployment - what JARs you are using, app server details, etc.
>
> Justin
>
> On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio wrote:
>
> > Hello.
> >
> > I'm a newbie in Jackrabbit.
> >
> > I'm trying to index the content of different types of documents (Word,
> > PDF, XML, ...).
> >
> > I've configured the SearchIndex in my workspace.xml in this way:
> >
> >     <param name="textFilterClasses"
> >            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
> >                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> >                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> >                   org.apache.jackrabbit.extractor.PdfTextExtractor,
> >                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> >                   org.apache.jackrabbit.extractor.RTFTextExtractor,
> >                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
> >                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> >
> > When I create a document in the repository, I add the content in this
> > way:
> >
> >     contenido = nodo.addNode("jcr:content", "nt:resource");
> >     contenido.setProperty("jcr:data",
> >         J_OperacionesSesion.getValueFactory().createBinary(is));
> >
> >     MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> >     String mime = mimetypes.getContentType(nodo.getName());
> >     contenido.setProperty("jcr:mimeType", "application/pdf");
> >
> > After creating the document, this warning is logged:
> >
> > 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
> >     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
> >     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
> >     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:123)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
> >     at java.lang.Thread.run(Thread.java:595)
> > Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
> >     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
> >     at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
> >     at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> >     ... 13 more
> > Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
> >     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
> >     ... 16 more
> >
> > Later, when I search for the document, filtering by content, in this
> > way:
> >
> >     String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS ( documento.*, 'ubicacion')";
> >
> > (arch:documento extends nt:file)
> >
> > No documents are found.
> >
> > Can you help me, please?
> >
> > Thanks and regards.
> >
> > Sergio Rojas Buitrago
> > Software Development
> > Document Management
> >
> > Ronda de Toledo s/n
> > 13003. Ciudad Real
> > España
> >
> > T +34 926 27 08 49
> > Ext: 237849
> >
> > srojas@indra.es
> > www.indra.es
> >
> > ------------------------------
> > This email and any file attached to it (when applicable) contain(s)
> > confidential information that is exclusively addressed to its
> > recipient(s). If you are not the indicated recipient, you are informed
> > that reading, using, disseminating and/or copying it without
> > authorisation is forbidden in accordance with the legislation in effect.
> > If you have received this email by mistake, please immediately notify
> > the sender of the situation by resending it to their email address.
> > Avoid printing this message if it is not absolutely necessary.
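
The stack trace quoted above shows org.apache.pdfbox.util.PDFStreamEngine
failing to cast org.pdfbox.util.operator.ShowTextGlyph, a class from the
older org.pdfbox package. That is consistent with the classpath problem
suggested in the thread. One way to check it is to ask the JVM where each
class is actually loaded from. The following is only a minimal sketch:
ClasspathCheck and describe are hypothetical names, not part of Jackrabbit,
Tika, or PDFBox, and the sketch assumes it is run inside the web application
(for example from a servlet or JSP) so that the context class loader is the
one Tomcat uses for the webapp.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    // Hypothetical helper: reports the JAR or directory a class is loaded from.
    public static String describe(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            // initialize=false: locate the class without running static initializers
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return className + " -> (no code source, e.g. a JDK class)";
            }
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException e) {
            return className + " -> not found";
        } catch (LinkageError e) {
            return className + " -> found but failed to link: " + e;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in this thread.
        String[] names = {
            "org.apache.pdfbox.util.PDFStreamEngine",
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(describe(name));
        }
    }
}
```

If both the org.pdfbox and the org.apache.pdfbox classes resolve, and they
come from different JARs, that would point to two PDFBox generations being
deployed side by side, which fits the advice to remove the 1.6.4
text-extractors dependency.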