jackrabbit-users mailing list archives

From Justin Edelson <jus...@justinedelson.com>
Subject Re: FullText Indexing
Date Thu, 16 Dec 2010 17:47:11 GMT
Unless you have a specific reason to do so, I would recommend configuring
SearchIndex *exactly* as it is in the default repository.xml for the version
of Jackrabbit you are using. In the case of 2.1.2, that is:

        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index"/>
            <param name="supportHighlighting" value="true"/>
        </SearchIndex>

I would suggest you use XPath instead of SQL2 as you're more likely to find
query examples.
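
As a rough sketch of what the XPath form of such a full-text query might look like, the statement can be built as a plain string. The class and method names below are hypothetical (they are not part of Jackrabbit), and the node type and search term are the ones from the query quoted further down. The `quote` helper only handles XPath string-literal quoting; it does not escape full-text search operators inside the term.

```java
// Hypothetical helper, not part of Jackrabbit: builds a JCR XPath
// full-text query string for a given node type and search term.
final class XPathFullText {

    private XPathFullText() {
    }

    // Apostrophes inside an XPath string literal are escaped by doubling them.
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    // Matches nodes of the given type whose jcr:content child (where an
    // nt:file keeps its binary) contains the term in the full-text index.
    static String fileContains(String nodeType, String term) {
        return "//element(*, " + nodeType + ")"
                + "[jcr:contains(jcr:content, " + quote(term) + ")]";
    }

    public static void main(String[] args) {
        // Prints the XPath statement for the document type used in this thread.
        System.out.println(fileContains("arch:documento", "ubicacion"));
    }
}
```

The resulting string would then be passed to `QueryManager.createQuery(statement, Query.XPATH)` from the standard JCR API and executed in the usual way.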

Justin

On Thu, Dec 16, 2010 at 12:29 PM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:

> Then how can I configure SearchIndex in my workspace.xml to work with the
> Tika text extractors?
>
> If I don't specify textFilterClasses, no error or warning is thrown when I
> create a document, but the search doesn't find any results.
>
> At this point, I don't know whether the indexer or my search query is
> failing. My query is:
>
> String consulta = "SELECT * FROM [arch:documento] AS documento "
>         + "WHERE CONTAINS(documento.*, 'ubicacion')";
>
> arch:documento is a subtype of nt:file.
>
> The content was added to node in this way:
>
> contenido = nodo.addNode("jcr:content", "nt:resource");
> contenido.setProperty("jcr:data",
>         J_OperacionesSesion.getValueFactory().createBinary(is));
>
> The content is added correctly, because I can see it in the Jackrabbit web
> browser.
>
>
> Thanks and regards.
>
> Sergio Rojas Buitrago
> Software Development
> Document Management
>
> Ronda de Toledo s/n
> 13003. Ciudad Real
> España
> T +34 926 27 08 49
> Ext: 237849
>
>
> srojas@indra.es
> www.indra.es
>
>
>
>
> -----Original Message-----
> From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of
> Justin Edelson
> Sent: Thursday, 16 December 2010 17:52
> To: users@jackrabbit.apache.org
> Subject: Re: FullText Indexing
>
> AFAIK, all of that functionality is now in Apache Tika. So just remove it.
>
> On Thu, Dec 16, 2010 at 11:45 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
>
> > Which version should I use? 1.6.4 is the newest version of
> > jackrabbit-text-extractors that I have found.
> >
> > -----Original Message-----
> > From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of
> > Justin Edelson
> > Sent: Thursday, 16 December 2010 17:40
> > To: users@jackrabbit.apache.org
> > Subject: Re: FullText Indexing
> >
> > I would remove that dependency. Using a 1.6.4 library with Jackrabbit 2.1.2
> > just seems like a bad idea.
> >
> > On Thu, Dec 16, 2010 at 11:10 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
> >
> > > I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.
> > >
> > > For the text extractors, I get the necessary library from the following
> > > Maven dependency:
> > >
> > >                <dependency>
> > >                        <groupId>org.apache.jackrabbit</groupId>
> > >                        <artifactId>jackrabbit-text-extractors</artifactId>
> > >                        <version>1.6.4</version>
> > >                </dependency>
> > >
> > > Is there any other useful information I should provide?
> > >
> > > Regards.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: justinedelson@gmail.com [mailto:justinedelson@gmail.com] On behalf of
> > > Justin Edelson
> > > Sent: Thursday, 16 December 2010 16:26
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: FullText Indexing
> > >
> > > Sergio-
> > > The ClassCastException and the NoSuchMethodException you posted on
> > > dev@ suggest a classpath problem. I would suggest posting the details of
> > > your deployment: which JARs you are using, app server details, etc.
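
One way to gather those details is to ask the JVM where it loaded a suspect class from. The sketch below is a hypothetical helper (the name `WhichJar` is made up for illustration) that uses only standard `java.lang` and `java.security` calls. The class names passed in would be whatever appears in the stack trace, for example `org.pdfbox.util.operator.ShowTextGlyph` and `org.apache.pdfbox.util.PDFStreamEngine`. If those resolve to different PDFBox JARs, that would be consistent with mixed library versions on the classpath.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports which JAR or directory a class
// was loaded from, to help spot duplicate or mismatched libraries.
final class WhichJar {

    private WhichJar() {
    }

    static String locate(String className) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes that belong to the JVM's own runtime usually have no code source.
            if (src == null || src.getLocation() == null) {
                return "(JVM runtime)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not found: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Each argument is a fully qualified class name to look up.
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

To be meaningful, the lookup has to run inside the web application itself (for instance from a test servlet), so that it sees the same class loader as Jackrabbit does.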
> > >
> > > Justin
> > >
> > > On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
> > >
> > > > Hello.
> > > >
> > > > I'm new to Jackrabbit.
> > > >
> > > > I'm trying to index the content of different types of documents (Word,
> > > > PDF, XML, ...).
> > > >
> > > > I've configured the SearchIndex in my workspace.xml in this way:
> > > >
> > > >
> > > >
> > > > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> > > >     <param name="path" value="${wsp.home}/index"/>
> > > >     <param name="supportHighlighting" value="true"/>
> > > >     <param name="textFilterClasses"
> > > >            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.PdfTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.RTFTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
> > > >                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> > > > </SearchIndex>
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > When I create a document in the repository, I add the content in this way:
> > > >
> > > > contenido = nodo.addNode("jcr:content", "nt:resource");
> > > > contenido.setProperty("jcr:data",
> > > >         J_OperacionesSesion.getValueFactory().createBinary(is));
> > > >
> > > > MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> > > > String mime = mimetypes.getContentType(nodo.getName());
> > > > contenido.setProperty("jcr:mimeType", "application/pdf");
> > > >
> > > >
> > > >
> > > > After creating the document, this warning is thrown:
> > > >
> > > >
> > > >
> > > > 16.12.2010 13:03:32 WARN LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> > > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> > > >       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
> > > >       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
> > > >       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> > > >       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
> > > >       at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
> > > >       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
> > > >       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
> > > >       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
> > > >       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
> > > >       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
> > > >       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
> > > >       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
> > > >       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
> > > >       at java.lang.Thread.run(Thread.java:595)
> > > > Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
> > > >       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
> > > >       at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
> > > >       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
> > > >       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> > > >       ... 13 more
> > > > Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
> > > >       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
> > > >       ... 16 more
> > > >
> > > >
> > > >
> > > > Later, I search for the document, filtering by content, in this way:
> > > >
> > > > String consulta = "SELECT * FROM [arch:documento] AS documento "
> > > >         + "WHERE CONTAINS(documento.*, 'ubicacion')";
> > > >
> > > > (arch:documento extends nt:file.)
> > > >
> > > > No documents are found.
> > > >
> > > > Can you help me, please?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thanks and regards.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Sergio Rojas Buitrago
> > > >
> > > > Software Development
> > > > Document Management
> > > >
> > > > Ronda de Toledo s/n
> > > > 13003. Ciudad Real
> > > > España
> > > >
> > > > T +34 926 27 08 49
> > > > Ext: 237849
> > > >
> > > > srojas@indra.es
> > > > www.indra.es
> > > >
> > > > ------------------------------
> > > > This email and any file attached to it (when applicable) contain(s)
> > > > confidential information that is exclusively addressed to its recipient(s).
> > > > If you are not the indicated recipient, you are informed that reading,
> > > > using, disseminating and/or copying it without authorisation is forbidden in
> > > > accordance with the legislation in effect. If you have received this email
> > > > by mistake, please immediately notify the sender of the situation by
> > > > resending it to their email address.
> > > > Avoid printing this message if it is not absolutely necessary.
