jackrabbit-users mailing list archives

From "Rojas Buitrago, Sergio" <sro...@indra.es>
Subject RE: FullText Indexing
Date Thu, 16 Dec 2010 17:29:08 GMT
Then how can I configure the SearchIndex in my workspace.xml to work with the Tika text extractors?

If I don't specify textFilterClasses, no error or warning is thrown when I create a document,
but the search doesn't find any results.

At this point, I don't know whether it is the indexer or my search query that is failing.
My query is:

String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS ( documento.*, 'ubicacion')";

arch:documento is a subtype of nt:file.
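
One way to tell whether the indexer or the query is at fault, sketched here under assumptions rather than taken from the thread, is to run the same JCR-SQL2 statement once without the CONTAINS constraint and once with it. If the unconstrained statement returns the document and the constrained one does not, the node type and the query language are probably fine and the full-text index is the likelier culprit. The class and method names below are hypothetical and only build the two statement strings:

```java
// Hypothetical helper, not part of Jackrabbit: builds the two JCR-SQL2
// statements to compare when isolating a full-text search problem.
final class FullTextQuerySketch {

    private FullTextQuerySketch() {
    }

    /** Matches every node of the given type, with no full-text constraint. */
    static String selectAll(String nodeType, String selector) {
        return "SELECT * FROM [" + nodeType + "] AS " + selector;
    }

    /** Same selection, restricted to nodes whose full-text index contains the term. */
    static String selectContaining(String nodeType, String selector, String term) {
        // In a JCR-SQL2 string literal, a single quote is escaped by doubling it.
        String literal = term.replace("'", "''");
        return selectAll(nodeType, selector)
                + " WHERE CONTAINS(" + selector + ".*, '" + literal + "')";
    }
}
```

Either string would then be passed to QueryManager.createQuery(statement, Query.JCR_SQL2), for example FullTextQuerySketch.selectContaining("arch:documento", "documento", "ubicacion").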

The content was added to the node in this way:

contenido = nodo.addNode("jcr:content", "nt:resource");
contenido.setProperty("jcr:data", J_OperacionesSesion.getValueFactory().createBinary(is));

The content is added correctly, because I can see it in the Jackrabbit web browser.


Thanks and regards.

Sergio Rojas Buitrago
Desarrollo Software
Gestión Documental

Ronda de Toledo s/n
13003. Ciudad Real
España
T +34 926 27 08 49
Ext: 237849


srojas@indra.es
www.indra.es




-----Original Message-----
From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of Justin Edelson
Sent: Thursday, December 16, 2010 17:52
To: users@jackrabbit.apache.org
Subject: Re: FullText Indexing

AFAIK, all of that functionality is now in Apache Tika. So just remove it.
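
To confirm that the old extractor classes are really gone after removing the dependency, it can help to list every classpath location that still provides a given class. The ClassCastException in the quoted stack trace names org.pdfbox.util.operator.ShowTextGlyph while the failing code lives in org.apache.pdfbox, which would be consistent with two PDFBox generations being visible at once. That reading is an assumption, not something the thread confirms. The helper below is a hypothetical diagnostic sketch using only the standard library, not part of Jackrabbit or Tika:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper: reports where a class file is visible from.
final class ClasspathProbe {

    private ClasspathProbe() {
    }

    /** Converts a class name such as "a.b.C" to the resource path "a/b/C.class". */
    static String resourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Lists every location from which the given loader can see the class file. */
    static List<URL> locate(ClassLoader loader, String className) {
        try {
            return Collections.list(loader.getResources(resourcePath(className)));
        } catch (IOException e) {
            // Surface I/O problems without forcing callers to declare them.
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the web application, calling ClasspathProbe.locate(Thread.currentThread().getContextClassLoader(), "org.pdfbox.util.operator.ShowTextGlyph") and logging the result would show which JARs, if any, still ship that class. An empty list means the loader cannot see it.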

On Thu, Dec 16, 2010 at 11:45 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:

> Which version should I use? 1.6.4 is the newest version of
> jackrabbit-text-extractors that I've found.
>
> -----Original Message-----
> From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of Justin Edelson
> Sent: Thursday, December 16, 2010 17:40
> To: users@jackrabbit.apache.org
> Subject: Re: FullText Indexing
>
> I would remove that dependency. Using a 1.6.4 library with Jackrabbit 2.1.2
> just seems like a bad idea.
>
> On Thu, Dec 16, 2010 at 11:10 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
>
> > I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.
> >
> > For the text extractors, I get the necessary library from the following Maven
> > dependency:
> >
> >                <dependency>
> >                        <groupId>org.apache.jackrabbit</groupId>
> >                        <artifactId>jackrabbit-text-extractors</artifactId>
> >                        <version>1.6.4</version>
> >                </dependency>
> >
> > Is there any other useful information I should provide?
> >
> > Regards.
> >
> > -----Original Message-----
> > From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of Justin Edelson
> > Sent: Thursday, December 16, 2010 16:26
> > To: users@jackrabbit.apache.org
> > Subject: Re: FullText Indexing
> >
> > Sergio-
> > The ClassCastException and the NoSuchMethodException you posted on
> > dev@ suggest a classpath problem. I would suggest posting the details
> > of your deployment - what JARs you are using, app server details, etc.
> >
> > Justin
> >
> > On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
> >
> > > Hello.
> > >
> > > I'm a newbie in Jackrabbit.
> > >
> > > I'm trying to index some content of different types of documents (word,
> > > pdf, xml, ...).
> > >
> > > I've configured the searchIndex in my workspace.xml in this way:
> > >
> > > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> > >     <param name="path" value="${wsp.home}/index"/>
> > >     <param name="supportHighlighting" value="true"/>
> > >     <param name="textFilterClasses"
> > >            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
> > >                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> > >                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> > >                   org.apache.jackrabbit.extractor.PdfTextExtractor,
> > >                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> > >                   org.apache.jackrabbit.extractor.RTFTextExtractor,
> > >                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
> > >                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> > > </SearchIndex>
> > >
> > > When I create a document in the repository, I add the content in this way:
> > >
> > > contenido = nodo.addNode("jcr:content", "nt:resource");
> > > contenido.setProperty("jcr:data", J_OperacionesSesion.getValueFactory().createBinary(is));
> > >
> > > MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> > > String mime = mimetypes.getContentType(nodo.getName());
> > > contenido.setProperty("jcr:mimeType", "application/pdf");
> > >
> > > After creating the document, this warning is thrown:
> > >
> > > 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> > >       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
> > >       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
> > >       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> > >       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
> > >       at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
> > >       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
> > >       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
> > >       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
> > >       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
> > >       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
> > >       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
> > >       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
> > >       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
> > >       at java.lang.Thread.run(Thread.java:595)
> > > Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
> > >       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
> > >       at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
> > >       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
> > >       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> > >       ... 13 more
> > > Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
> > >       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
> > >       ... 16 more
> > >
> > > Later, when I search for the document, filtering by content, in this way:
> > >
> > > String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS ( documento.*, 'ubicacion')";
> > >
> > > (arch:documento extends nt:file)
> > >
> > > No documents were found.
> > >
> > > Can you help me, please?
> > >
> > > Thanks and regards.
> > >
> > > *Sergio Rojas Buitrago*
> > >
> > > Desarrollo Software
> > > Gestión Documental
> > >
> > > Ronda de Toledo s/n
> > > 13003. Ciudad Real
> > > España
> > >
> > > T +34 926 27 08 49
> > >
> > > Ext: 237849
> > >
> > >
> > >
> > > srojas@indra.es
> > > www.indra.es
> > >
> > > ------------------------------
> > > Este correo electrónico y, en su caso, cualquier fichero anexo al
> mismo,
> > > contiene información de carácter confidencial exclusivamente dirigida a
> > su
> > > destinatario o destinatarios. Si no es vd. el destinatario indicado,
> > queda
> > > notificado que la lectura, utilización, divulgación y/o copia sin
> > > autorización está prohibida en virtud de la legislación vigente. En el
> > caso
> > > de haber recibido este correo electrónico por error, se ruega notificar
> > > inmediatamente esta circunstancia mediante reenvío a la dirección
> > > electrónica del remitente.
> > > Evite imprimir este mensaje si no es estrictamente necesario.
> > >
> > > This email and any file attached to it (when applicable) contain(s)
> > > confidential information that is exclusively addressed to its
> > recipient(s).
> > > If you are not the indicated recipient, you are informed that reading,
> > > using, disseminating and/or copying it without authorisation is
> forbidden
> > in
> > > accordance with the legislation in effect. If you have received this
> > email
> > > by mistake, please immediately notify the sender of the situation by
> > > resending it to their email address.
> > > Avoid printing this message if it is not absolutely necessary.
> > >
> >
>

