jackrabbit-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: FullText Indexing
Date Thu, 16 Dec 2010 13:09:18 GMT
PS: Please use users@jackrabbit.apache.org for non-dev issues.

Regards Ard

On Thu, Dec 16, 2010 at 2:08 PM, Ard Schrijvers
<a.schrijvers@onehippo.com> wrote:
> Hello,
>
> This looks like a PDFBox issue to me. What happens if you try a different PDF?
> If other PDFs work and only this one fails, it is better to post
> the question to one of the PDFBox mailing lists:
> http://pdfbox.apache.org/mail-lists.html
>
> Regards Ard
>
> On Thu, Dec 16, 2010 at 1:09 PM, Rojas Buitrago, Sergio <srojas@indra.es> wrote:
>> Hello.
>>
>> I’m new to Jackrabbit.
>>
>> I’m trying to index the content of several document types (Word, PDF,
>> XML, …).
>>
>> I’ve configured the SearchIndex in my workspace.xml as follows:
>>
>> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>     <param name="path" value="${wsp.home}/index"/>
>>     <param name="supportHighlighting" value="true"/>
>>     <param name="textFilterClasses"
>>            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
>>                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>>                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>>                   org.apache.jackrabbit.extractor.PdfTextExtractor,
>>                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>>                   org.apache.jackrabbit.extractor.RTFTextExtractor,
>>                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
>>                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>> </SearchIndex>
>>
>> When I create a document in the repository, I add its content like this:
>>
>> contenido = nodo.addNode("jcr:content", "nt:resource");
>> contenido.setProperty("jcr:data",
>>         J_OperacionesSesion.getValueFactory().createBinary(is));
>>
>> MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
>> String mime = mimetypes.getContentType(nodo.getName());
>> contenido.setProperty("jcr:mimeType", "application/pdf");
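In the snippet above, `mime` is computed from the node name, but `jcr:mimeType` is then set to the fixed string "application/pdf", so every file would be handed to the PDF parser regardless of its real type. A minimal sketch of deriving the type from the file name instead, assuming the node name carries the file extension; the class `MimeTypeGuess` and the fallback value are hypothetical, and the JDK's `URLConnection.guessContentTypeFromName` stands in for `MimetypesFileTypeMap` only to keep the example dependency-free:

```java
import java.net.URLConnection;

// Hypothetical helper, not part of Jackrabbit or of the original code.
class MimeTypeGuess {

    // Assumed fallback for names whose extension the JDK does not recognise.
    static final String FALLBACK = "application/octet-stream";

    /** Guesses a MIME type from the file name, falling back to FALLBACK. */
    static String forFileName(String fileName) {
        String guessed = URLConnection.guessContentTypeFromName(fileName);
        return guessed != null ? guessed : FALLBACK;
    }

    public static void main(String[] args) {
        // File names are made up for illustration.
        System.out.println(forFileName("informe.pdf"));
        System.out.println(forFileName("sin-extension"));
    }
}
```

The result could then be passed as the value of `jcr:mimeType` in place of the literal.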
>>
>> After creating the document, this warning is logged:
>>
>> 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
>> org.apache.tika.exception.TikaException: Unable to extract PDF content
>>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>>       at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>>       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
>>       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
>>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
>>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
>>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
>>       at java.lang.Thread.run(Thread.java:595)
>> Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
>>       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>>       at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>>       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>>       ... 13 more
>> Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
>>       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>>       ... 16 more
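The innermost cause names `org.pdfbox.util.operator.ShowTextGlyph`, a class under the older `org.pdfbox` package, failing a cast inside `org.apache.pdfbox.util.PDFStreamEngine`. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that two generations of PDFBox are on the classpath at once. A minimal sketch for checking that, using only the JDK; the class `ClasspathProbe` is hypothetical, and the two class names are taken from the trace above:

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Jackrabbit, Tika or PDFBox.
class ClasspathProbe {

    /** Converts a fully qualified class name to its classpath resource name. */
    static String resourceName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns every location visible to the class loader that provides the class. */
    static List<URL> locate(String className) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return Collections.list(loader.getResources(resourceName(className)));
    }

    public static void main(String[] args) throws IOException {
        // Both names appear in the stack trace; nothing else is assumed.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine"
        };
        for (String name : names) {
            List<URL> hits = locate(name);
            System.out.println(name + " -> "
                    + (hits.isEmpty() ? "not found" : hits.toString()));
        }
    }
}
```

Run with the same classpath as the repository: if both names resolve to different jars, the mixed-version explanation becomes more plausible; if only one resolves, the cause lies elsewhere.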
>>
>> Later, I search for the document by its content with this query:
>>
>> String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";
>>
>> (arch:documento extends nt:file.)
>>
>> No documents are found.
>>
>> Can you help me, please?
>>
>> Thanks and regards.
>>
>> ________________________________
>> This email and any file attached to it (when applicable) contain(s)
>> confidential information that is exclusively addressed to its recipient(s).
>> If you are not the indicated recipient, you are informed that reading,
>> using, disseminating and/or copying it without authorisation is forbidden in
>> accordance with the legislation in effect. If you have received this email
>> by mistake, please immediately notify the sender of the situation by
>> resending it to their email address.
>> Avoid printing this message if it is not absolutely necessary.
>>
>
>
>
> --
> Hippo
> Europe  •  Amsterdam  Oosteinde 11  •  1017 WT Amsterdam  •  +31 (0)20 522 4466
> USA  • San Francisco 755 Baywood Drive, Second Floor •  Petaluma, CA.
> 94954 •  +1 877 414 4776 (toll free)
> Canada    •   Montréal  5369 Boulevard St-Laurent #430 •  Montréal QC
> H2T 1S5  •  +1 (514) 316 8966
> www.onehippo.com  •  www.onehippo.org  •  info@onehippo.com
> ________________________________________________________________
> This e-mail may be privileged and/or confidential, and the sender does
> not waive any related rights and obligations. Any distribution, use or
> copying of this e-mail or the information it contains by other than an
> intended recipient is unauthorized. If you received this e-mail in
> error, please advise me (by return e-mail or otherwise) immediately.
>



