pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: pdfbox parse error "Header doesn't contain versioninfo"
Date Mon, 13 May 2019 15:54:02 GMT
On 13.05.2019 at 16:15, Zeke Steer wrote:
>
> Hi,
>
> I'm using the latest version of the pdfbox command line tools 
> (pdfbox-app-2.0.15.jar) to extract the text from UK company annual 
> reports. I invoke the command line tools from a Python 
> script, extracting each page of the company annual report .pdf 
> document in turn.
>
> I've noticed that some pages of the annual reports aren't extracted 
> correctly. I originally observed this problem in an earlier version of 
> the command line tools (pdfbox-app-2.0.6.jar). However, moving to the 
> latest version of the tools hasn't fixed the issue.
>
> I've attached a sample report which consistently reproduces the issue 
> (00054_CCH_Annual Report_2016-82-85.pdf). The report opens fine in 
> Adobe Reader but pdfbox is unable to extract it. The issue manifests 
> differently depending on whether the sequential (default) or 
> non-sequential parser is used.
>

The non-sequential parser is the only parser in 2.0.*

Using the -nonSeq option should produce a FileNotFoundException with 
2.0.15; that is what I get.
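
For illustration, here is a minimal Python sketch of building the 
ExtractText call without that option, since the original post drives the 
tools from a Python script. The names build_extract_command and 
extract_page and their parameters are hypothetical; only ExtractText, 
-startPage and -endPage are taken from the commands quoted below.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar, pdf, out_txt, page):
    """Return the argument list for extracting a single page.

    No -nonSeq flag is passed, because 2.0.* has only one parser.
    """
    return [
        "java", "-jar", str(jar), "ExtractText",
        "-startPage", str(page),
        "-endPage", str(page),
        str(pdf),      # input PDF comes first
        str(out_txt),  # output text file comes second
    ]


def extract_page(jar, pdf, out_txt, page):
    """Run ExtractText for one page; raises CalledProcessError on failure."""
    # Creating the output directory up front is a precaution in this sketch.
    Path(out_txt).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_command(jar, pdf, out_txt, page), check=True)
```

Passing the arguments as a list rather than one string avoids quoting 
problems with the spaces in the report paths.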

"Header doesn't contain versioninfo" is with empty files. I suspect one 
of your calls used the PDF file as the destination, which destroyed it.
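
A rough sketch of two checks a calling script could run before each 
extraction: that the output path is not the input PDF, and that the input 
is non-empty and still has a "%PDF-" marker near its start. The function 
name check_input_pdf is hypothetical, and looking only at the first 1024 
bytes is an assumption of this sketch, not a description of what the 
PDFBox parser does.

```python
import os


def check_input_pdf(pdf_path, out_path):
    """Raise ValueError if the call looks unsafe or the PDF looks damaged."""

    def normalise(path):
        # Absolute, case-normalised form so equal paths compare equal.
        return os.path.normcase(os.path.abspath(path))

    if normalise(pdf_path) == normalise(out_path):
        raise ValueError("output path is the same as the input PDF")
    if os.path.getsize(pdf_path) == 0:
        raise ValueError("input PDF is empty (0 bytes)")
    with open(pdf_path, "rb") as handle:
        head = handle.read(1024)
    if b"%PDF-" not in head:
        raise ValueError("no %PDF- version header near the start of the file")
```

An empty input file found this way would be consistent with the PDF 
having been overwritten by an earlier call.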

Regarding the differing text extraction results, please read

https://pdfbox.apache.org/2.0/faq.html#text-extraction
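
As a rough aid alongside that FAQ entry, the "No Unicode mapping" 
warnings quoted below could be tallied per font to see which fonts are 
involved on a page. This is only a sketch: the function name 
count_unmapped_glyphs is hypothetical, the pattern is inferred from the 
quoted log lines, and it assumes font names contain no spaces.

```python
import re
from collections import Counter

# Pattern inferred from the warnings quoted in the original message.
_WARNING = re.compile(r"No Unicode mapping for \S+ \((\d+)\) in font (\S+)")


def count_unmapped_glyphs(stderr_text):
    """Count 'No Unicode mapping' warnings per font name."""
    counts = Counter()
    for match in _WARNING.finditer(stderr_text):
        counts[match.group(2)] += 1
    return counts
```

The text would be the captured standard error output of the ExtractText 
call, for example from subprocess.run(..., capture_output=True, text=True).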

Your PDF attachment didn't get through; please upload it to a 
file hosting service.

Tilman


> _Sequential Parser_
>
> I was initially executing the following command with the -nonSeq flag 
> unset:
>
> java -jar pdfbox-app-2.0.15.jar ExtractText -startPage 1 -endPage 1 
> "E:\Analyst Reports\2019-05-13 PDF Extraction Issue 
> Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual 
> Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
>
> This would generate a large number of Unicode warnings in the console, 
> e.g.:
>
> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+88 (88) in font Effra-Medium-Identity-H
> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+71 (71) in font Effra-Medium-Identity-H
> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+76 (76) in font Effra-Medium-Identity-H
> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font Effra-Medium-Identity-H
> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+3 (3) in font Effra-Medium-Identity-H
>
> The pdfbox output was missing a large amount of text present on the 
> first page of the report. See the pdfbox output in the attached 1.txt 
> file and compare this to the first page of the company annual report, 
> also attached.
>
> _Non-Sequential Parser_
>
> I found the issue affected several of the annual reports within my 
> dataset. Investigating further, I read about the non-sequential 
> parser. You advise using this if the sequential parser fails so I 
> tried executing the following command instead, with the -nonSeq flag set:
>
> java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 
> -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue 
> Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual 
> Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
>
> However, this consistently fails with a 'java.io.IOException: Error: 
> Header doesn't contain versioninfo'. See the full exception stack 
> trace below:
>
> Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:221)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1008)
>         at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:216)
>         at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
>
> failed to extract text from 'E:\Analyst Reports\2019-05-13 PDF 
> Extraction Issue Investigation\00054_CCH_Annual 
> Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf': Command 
> 'java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 
> -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue 
> Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual 
> Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"' 
> returned non-zero exit status 1.
>
> I found a similar issue reported on your JIRA issue tracker here: 
> https://issues.apache.org/jira/browse/PDFBOX-4203?jql=text%20~%20%22versioninfo%22
> However, it was closed without being resolved as the original reporter failed 
> to provide a PDF document which reproduced the issue. Hopefully with 
> the information I've supplied, you'll be able to reopen the bug and 
> take another look.
>
> Please can you keep me updated?
>
> Many thanks,
>
> Zeke
>
>


