lucene-solr-user mailing list archives

From Michael Kuhlmann <k...@solarier.de>
Subject Re: Help:Solr can't put all pdf files into index
Date Thu, 09 Feb 2012 16:21:16 GMT
I don't know much about Tika, but this seems to be a bug in PDFBox.

See: https://issues.apache.org/jira/browse/PDFBOX-797

You might also have a look at this:
http://stackoverflow.com/questions/7489206/error-while-parsing-binary-files-mostly-pdf

At least that's what I found when I googled the NPE.

Greetings,
Kuli

On 09.02.2012 17:13, Rong Kang wrote:
> I tested one file that is missing from the Solr index, and Solr responded as below
[...]

> Exception in entity : tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
> at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
> at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:617)
> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@190725e
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
> at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
> ... 8 more
> Caused by: java.lang.NullPointerException
> at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
> at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
> at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> ... 10 more
>
>
> I think this is because Tika can't read the PDF file, or because the PDF file's format has some error. But I can read this PDF file in Adobe Reader.
> Regards,
>
> Rong Kang
