jackrabbit-dev mailing list archives

From "Rojas Buitrago, Sergio" <sro...@indra.es>
Subject FullText Indexing
Date Thu, 16 Dec 2010 12:09:01 GMT
Hello.

I'm new to Jackrabbit.

I'm trying to index the content of different types of documents (Word, PDF, XML, ...).

I've configured the SearchIndex in my workspace.xml as follows:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="supportHighlighting" value="true"/>
    <param name="textFilterClasses" value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
        org.apache.jackrabbit.extractor.MsExcelTextExtractor,
        org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
        org.apache.jackrabbit.extractor.PdfTextExtractor,
        org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
        org.apache.jackrabbit.extractor.RTFTextExtractor,
        org.apache.jackrabbit.extractor.HTMLTextExtractor,
        org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>


When I create a document in the repository, I add its content like this:

contenido = nodo.addNode("jcr:content", "nt:resource");
contenido.setProperty("jcr:data",
        J_OperacionesSesion.getValueFactory().createBinary(is));

MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
String mime = mimetypes.getContentType(nodo.getName());
// Note: "mime" is computed above but not used; the type is hard-coded here.
contenido.setProperty("jcr:mimeType", "application/pdf");
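
The snippet above looks up a MIME type from the node name but then stores a fixed "application/pdf". As a minimal sketch of deriving the value from the file name instead, the helper below maps a few extensions to MIME types. MimeTypeGuesser, its guess method, and the extension table are hypothetical names and choices made for illustration; they are not part of Jackrabbit or the JCR API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Jackrabbit or JCR class): maps a file name
// to a MIME type by its extension, with a generic fallback.
final class MimeTypeGuesser {
    private static final String FALLBACK = "application/octet-stream";
    private static final Map<String, String> BY_EXTENSION = new HashMap<String, String>();

    static {
        // Assumed table; extend it with whatever types the repository stores.
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
        BY_EXTENSION.put("rtf", "application/rtf");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("xml", "application/xml");
    }

    private MimeTypeGuesser() {
    }

    // Returns the mapped MIME type, or the fallback when the name has no
    // extension or the extension is not in the table.
    static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String mime = BY_EXTENSION.get(extension);
        return mime != null ? mime : FALLBACK;
    }
}
```

With such a helper, the last line of the snippet could read contenido.setProperty("jcr:mimeType", MimeTypeGuesser.guess(nodo.getName())), assuming the node name carries the file extension.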

After creating the document, this warning is logged:

16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
org.apache.tika.exception.TikaException: Unable to extract PDF content
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
      at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
      at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
      at java.util.concurrent.FutureTask.run(FutureTask.java:123)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
      at java.lang.Thread.run(Thread.java:595)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
      at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
      at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
      at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
      ... 13 more
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
      at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
      ... 16 more
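
The trace mentions classes from two package namespaces, org.apache.pdfbox and org.pdfbox. One way to investigate, sketched here under the assumption that this could point to two PDFBox versions on the classpath, is to print where each class is loaded from. ClassOrigin and its describe method are hypothetical names; only standard Java reflection calls are used.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports the location (typically a jar)
// from which a class is loaded, or says that it cannot be found.
final class ClassOrigin {
    private ClassOrigin() {
    }

    static String describe(String className) {
        try {
            // Resolve the class without running its static initializers.
            Class<?> type = Class.forName(className, false,
                    ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(bootstrap or unknown location)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        } catch (LinkageError e) {
            return "(found but could not be linked: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Both class names are taken from the stack trace above.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine"
        };
        for (String name : names) {
            System.out.println(name + " -> " + describe(name));
        }
    }
}
```

If the two names resolve to different jars, that would be consistent with mixed library versions, although the check alone does not prove it. It needs to run with the same classpath as the repository to be meaningful.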

Later, when I search for the document by its content with this query:

String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";

(arch:documento extends nt:file), no documents are found.


Can you help me, please?


Thanks and regards.





