jackrabbit-users mailing list archives

From "Rojas Buitrago, Sergio" <sro...@indra.es>
Subject RE: FullText Indexing
Date Thu, 16 Dec 2010 16:45:40 GMT
Which version should I use? 1.6.4 is the newest version of jackrabbit-text-extractors that
I've found.






-----Original Message-----
From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On Behalf Of Justin Edelson
Sent: Thursday, 16 December 2010 17:40
To: users@jackrabbit.apache.org
Subject: Re: FullText Indexing

I would remove that dependency. Using a 1.6.4 library with Jackrabbit 2.1.2
just seems like a bad idea.
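
The trace quoted further down mixes `org.apache.pdfbox` and `org.pdfbox` class names, which would be consistent with two generations of PDFBox sitting on the same classpath. One way to check is to ask the JVM where each class was loaded from. The following is only a minimal sketch: `ClasspathProbe` and `locationOf` are hypothetical names made up for illustration, and the probe has to run inside the web application (for example from a throwaway servlet or JSP) so that it sees the same class loader as Jackrabbit.

```java
import java.security.CodeSource;

public class ClasspathProbe {

    /**
     * Returns the JAR or directory a class was loaded from, or a marker
     * string when the class is missing or has no recorded code source.
     */
    static String locationOf(String className) {
        try {
            // Load without initializing, using the loader that loaded this probe.
            Class<?> c = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in this thread.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

If the first two names resolve to different JARs, that would point to the kind of mixed-version classpath described above, and removing the 1.6.4 dependency would be the thing to try first.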

On Thu, Dec 16, 2010 at 11:10 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:

> I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.
>
> For the text extractors, I get the necessary library from the following Maven
> dependency:
>
>                <dependency>
>                        <groupId>org.apache.jackrabbit</groupId>
>                        <artifactId>jackrabbit-text-extractors</artifactId>
>                        <version>1.6.4</version>
>                </dependency>
>
> Is there any other useful information I should provide?
>
> Regards.
>
>
>
> -----Original Message-----
> From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On Behalf Of Justin Edelson
> Sent: Thursday, 16 December 2010 16:26
> To: users@jackrabbit.apache.org
> Subject: Re: FullText Indexing
>
> Sergio-
> The ClassCastException and the NoSuchMethodException you posted on dev@
> suggest a classpath problem. I would suggest posting the details of your
> deployment: what JARs you are using, app server details, etc.
>
> Justin
>
> On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
>
> >  Hello.
> >
> >
> >
> > I'm new to Jackrabbit.
> >
> >
> >
> > I'm trying to index some content of different types of documents (word,
> > pdf, xml, ...).
> >
> >
> >
> > I've configured the SearchIndex in my workspace.xml as follows:
> >
> >
> >
> > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> >     <param name="path" value="${wsp.home}/index"/>
> >     <param name="supportHighlighting" value="true"/>
> >     <param name="textFilterClasses"
> >            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
> >                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> >                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> >                   org.apache.jackrabbit.extractor.PdfTextExtractor,
> >                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> >                   org.apache.jackrabbit.extractor.RTFTextExtractor,
> >                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
> >                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> > </SearchIndex>
> >
> >
> >
> >
> >
> > When I create a document in the repository, I add the content in this way:
> >
> >
> >
> > contenido = nodo.addNode("jcr:content", "nt:resource");
> > contenido.setProperty("jcr:data",
> >         J_OperacionesSesion.getValueFactory().createBinary(is));
> >
> > MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> > String mime = mimetypes.getContentType(nodo.getName());
> > contenido.setProperty("jcr:mimeType", "application/pdf");
> >
> >
> >
> > After creating the document, this warning is logged:
> >
> >
> >
> > 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
> >       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
> >       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> >       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
> >       at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
> >       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
> >       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
> >       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
> >       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
> >       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
> >       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
> >       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
> >       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
> >       at java.lang.Thread.run(Thread.java:595)
> > Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
> >       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
> >       at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
> >       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
> >       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> >       ... 13 more
> > Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
> >       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
> >       ... 16 more
> >
> >
> >
> > Later, when I search for the document, filtering by content, in this way:
> >
> >
> >
> > String consulta = "SELECT * FROM [arch:documento] AS documento WHERE
> > CONTAINS(documento.*, 'ubicacion')";
> >
> > (arch:documento extends nt:file)
> >
> >
> >
> > No documents are found.
> >
> >
> >
> >
> >
> > Can you help me, please?
> >
> >
> >
> >
> >
> > Thanks and regards.
> >
> >
> >
> >
> >
> > *Sergio Rojas Buitrago*
> >
> > Software Development
> > Document Management
> >
> > Ronda de Toledo s/n
> > 13003. Ciudad Real
> > España
> >
> > T +34 926 27 08 49
> >
> > Ext: 237849
> >
> >
> >
> > srojas@indra.es
> > www.indra.es
> >
> >
> >
> >
> > ------------------------------
> > This email and any file attached to it (when applicable) contain(s)
> > confidential information that is exclusively addressed to its recipient(s).
> > If you are not the indicated recipient, you are informed that reading,
> > using, disseminating and/or copying it without authorisation is forbidden
> > in accordance with the legislation in effect. If you have received this
> > email by mistake, please immediately notify the sender of the situation by
> > resending it to their email address.
> > Avoid printing this message if it is not absolutely necessary.
> >
>

