jackrabbit-users mailing list archives

From "Rojas Buitrago, Sergio" <sro...@indra.es>
Subject RE: FullText Indexing
Date Thu, 16 Dec 2010 16:10:58 GMT
I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.

For the text extractors, I get the necessary library from the following Maven dependency:

    <dependency>
        <groupId>org.apache.jackrabbit</groupId>
        <artifactId>jackrabbit-text-extractors</artifactId>
        <version>1.6.4</version>
    </dependency>

Is there any other useful information I should provide?

Regards.



-----Original Message-----
From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of Justin Edelson
Sent: Thursday, 16 December 2010 16:26
To: users@jackrabbit.apache.org
Subject: Re: FullText Indexing

Sergio-
The ClassCastException and the NoSuchMethodException you posted on dev@
suggest a classpath problem. I would suggest posting the details of your
deployment: which JARs you are using, app server details, etc.
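
One way to collect those details is to ask the class loader where it finds the classes named in the stack trace. The following is only a sketch: `ClasspathCheck` and `locate` are hypothetical names, and the probed class names are simply copied from the trace quoted further down. It would need to run inside the web application (for example from a servlet) so that the context class loader is the one Tomcat uses for the deployment.

```java
import java.net.URL;

public class ClasspathCheck {

    /**
     * Returns the URL the class file would be loaded from,
     * or null if the class is not visible to the class loader.
     */
    public static String locate(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        URL url = loader.getResource(resource);
        return url == null ? null : url.toString();
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in this thread.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If both the `org.pdfbox` and the `org.apache.pdfbox` classes resolve to a JAR, two different PDFBox generations are visible at the same time, which would be consistent with a classpath conflict.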

Justin

On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:

>  Hello.
>
>
>
> I'm a newbie in Jackrabbit.
>
>
>
> I'm trying to index the content of different types of documents (Word,
> PDF, XML, ...).
>
>
>
> I've configured the searchIndex in my workspace.xml in this way:
>
>
>
> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>     <param name="path" value="${wsp.home}/index"/>
>     <param name="supportHighlighting" value="true"/>
>     <param name="textFilterClasses"
>            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
>                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>                   org.apache.jackrabbit.extractor.PdfTextExtractor,
>                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>                   org.apache.jackrabbit.extractor.RTFTextExtractor,
>                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
>                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> </SearchIndex>
>
>
>
>
>
> When I create a document in the repository, I add the content in this way:
>
>
>
> contenido = nodo.addNode("jcr:content", "nt:resource");
> contenido.setProperty("jcr:data",
>         J_OperacionesSesion.getValueFactory().createBinary(is));
>
> MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> String mime = mimetypes.getContentType(nodo.getName());
> contenido.setProperty("jcr:mimeType", "application/pdf");
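
Note that the snippet above computes `mime` but then stores the literal "application/pdf" for every document. A minimal sketch of deriving the type from the file name follows. It is an assumption-laden illustration, not the poster's code: `MimeTypeGuess` and `forFileName` are hypothetical names, and a small hand-written extension table stands in for `MimetypesFileTypeMap` so that the example has no dependency outside the JDK.

```java
import java.util.Locale;
import java.util.Map;

public class MimeTypeGuess {

    // Hypothetical lookup table; extend it with the types you actually store.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "doc", "application/msword",
            "xml", "application/xml",
            "txt", "text/plain");

    private static final String FALLBACK = "application/octet-stream";

    /** Maps a file name to a MIME type by its extension, case-insensitively. */
    public static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(forFileName("informe.PDF"));
        System.out.println(forFileName("sin_extension"));
    }
}
```

The result could then be passed to `contenido.setProperty("jcr:mimeType", ...)` in place of the fixed string, so that non-PDF documents are not labelled as PDF.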
>
>
>
> After creating the document, this warning is logged:
>
>
>
> 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>       at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
>       at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
>       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>       at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>       ... 13 more
> Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
>       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>       ... 16 more
>
>
>
> Later, when I search for the document, filtering by content, in this way:
>
>
>
> String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";
>
> (arch:documento extends nt:file)
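
If the search term ever comes from user input, the statement is easier to keep valid when it is assembled by a small helper. The sketch below is hypothetical (`FullTextQuery` and `build` are made-up names) and only doubles single quotes, which is how a JCR-SQL2 string literal escapes them. It does not escape the operators of the full-text search expression itself.

```java
public class FullTextQuery {

    /**
     * Builds a JCR-SQL2 full-text statement for the given node type.
     * Single quotes in the term are doubled so the string literal stays valid.
     */
    public static String build(String nodeType, String selector, String term) {
        String literal = term.replace("'", "''");
        return "SELECT * FROM [" + nodeType + "] AS " + selector
                + " WHERE CONTAINS(" + selector + ".*, '" + literal + "')";
    }

    public static void main(String[] args) {
        System.out.println(build("arch:documento", "documento", "ubicacion"));
    }
}
```

The resulting string can be handed to `session.getWorkspace().getQueryManager().createQuery(statement, Query.JCR_SQL2)` as in the original post.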
>
>
>
> No documents were found.
>
>
>
>
>
> Can you help me, please?
>
>
>
>
>
> Thanks and regards.
>
>
>
>
>
> *Sergio Rojas Buitrago*
>
> Desarrollo Software
> Gestión Documental
>
> Ronda de Toledo s/n
> 13003. Ciudad Real
> España
>
> T +34 926 27 08 49
>
> Ext: 237849
>
>
>
> srojas@indra.es
> www.indra.es
>
>
>
>
> ------------------------------
> This email and any file attached to it (when applicable) contain(s)
> confidential information that is exclusively addressed to its recipient(s).
> If you are not the indicated recipient, you are informed that reading,
> using, disseminating and/or copying it without authorisation is forbidden in
> accordance with the legislation in effect. If you have received this email
> by mistake, please immediately notify the sender of the situation by
> resending it to their email address.
> Avoid printing this message if it is not absolutely necessary.
>
