jackrabbit-users mailing list archives

From Anton Bachevsky <...@ciklum.com>
Subject Re: AW: AW: AW: Jackrabbit indexing in a separate thread
Date Mon, 27 Feb 2012 13:27:45 GMT
Hi Claus,

I switched off PDF parsing following your advice:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
   <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>

where tika-config.xml contains:

   <parsers>
     <parser class="org.apache.tika.parser.DefaultParser"/>
     <parser class="org.apache.tika.parser.EmptyParser">
       <mime>application/pdf</mime>
     </parser>
   </parsers>

Does this mean I still did something wrong?


Regards,
Anton

> Hi Anton,
>
> It seems that you are indexing the PDF file as full text?
>
>> org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:530)
>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
>>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>>      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> I thought you had disabled it?
> Indexing huge PDF files takes some time and memory :-)
>
> greets
> claus
>

