pdfbox-users mailing list archives

From "Zeke Steer" <m...@zekesteer.me.uk>
Subject pdfbox parse error "Header doesn't contain versioninfo"
Date Mon, 13 May 2019 14:15:25 GMT
Hi, 

 

I'm using the latest version of the pdfbox command line tools
(pdfbox-app-2.0.15.jar) to extract the text from UK company annual reports.
I invoke the command line tools from a python script, extracting each page
of the company annual report .pdf document in turn. 
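
For context, the per-page invocation described above can be sketched roughly as follows. This is a minimal sketch rather than the actual script: the helper names `build_command` and `extract_pages` and the `page_count` parameter are hypothetical, and only the jar name and the ExtractText options quoted in this message are used.

```python
import subprocess
from pathlib import Path

PDFBOX_JAR = "pdfbox-app-2.0.15.jar"  # jar named in this message


def build_command(pdf_path, out_path, page):
    """Return the ExtractText argument list for a single page (hypothetical helper)."""
    return [
        "java", "-jar", PDFBOX_JAR, "ExtractText",
        "-startPage", str(page), "-endPage", str(page),
        str(pdf_path), str(out_path),
    ]


def extract_pages(pdf_path, out_dir, page_count):
    """Extract pages 1..page_count, writing one text file per page (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for page in range(1, page_count + 1):
        cmd = build_command(pdf_path, out_dir / f"{page}.txt", page)
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with the spaces in the report paths.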

 

I've noticed that some pages of the annual reports aren't extracted
correctly. I originally observed this problem in an earlier version of the
command line tools (pdfbox-app-2.0.6.jar). However, moving to the latest
version of the tools hasn't fixed the issue. 

 

I've attached a sample report which consistently reproduces the issue
(00054_CCH_Annual Report_2016-82-85.pdf). The report opens fine in Adobe
Reader but pdfbox is unable to extract it. The issue manifests differently
depending on whether the sequential (default) or non-sequential parser is
used. 

 

_Sequential Parser_

I was initially executing the following command with the -nonSeq flag unset:

 

java -jar pdfbox-app-2.0.15.jar ExtractText -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"

 

This generated a large number of Unicode warnings in the console, e.g.:

 

May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+88 (88) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+71 (71) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+76 (76) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+87 (87) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+3 (3) in font Effra-Medium-Identity-H

 

The pdfbox output was missing a large amount of text present on the first
page of the report. See the pdfbox output in the attached 1.txt file and
compare this to the first page of the company annual report, also attached.
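
To gauge how much of a page is affected, the console warnings can be tallied per font and CID. The following is a rough sketch that assumes the warning format is exactly as quoted in this message and that font names contain no spaces; `tally_missing_mappings` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern based on the warning lines quoted in this message (an assumption
# about the general format, not taken from the pdfbox source).
WARNING_RE = re.compile(
    r"No Unicode mapping for CID\+(\d+) \(\d+\) in font (\S+)"
)


def tally_missing_mappings(log_text):
    """Count 'No Unicode mapping' warnings per (font, CID) pair."""
    counts = Counter()
    for cid, font in WARNING_RE.findall(log_text):
        counts[(font, int(cid))] += 1
    return counts
```

The captured console output of one ExtractText run can be passed in as a single string; the result shows which CIDs in which font have no Unicode mapping and how often each was hit.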

 

_Non-Sequential Parser_

I found the issue affected several of the annual reports within my dataset.
Investigating further, I read about the non-sequential parser. You advise
using this if the sequential parser fails, so I tried executing the following
command instead, with the -nonSeq flag set:

 

java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"

 

However, this consistently fails with a 'java.io.IOException: Error: Header
doesn't contain versioninfo'. See the full exception stack trace below:

 

Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:221)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1008)
        at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:216)
        at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
        at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)

failed to extract text from 'E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf': Command 'java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"' returned non-zero exit status 1.
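
One quick check that may help narrow this down is whether a given file actually carries a PDF header near its start. The sketch below looks for the `%PDF-` marker in the first 1024 bytes. The function name `find_pdf_header`, the window size, and the assumption that the exception relates to this marker are illustrative and not taken from the pdfbox source.

```python
def find_pdf_header(path, window=1024):
    """Return the offset of the '%PDF-' marker in the first `window` bytes, or -1."""
    with open(path, "rb") as handle:
        head = handle.read(window)
    return head.find(b"%PDF-")
```

A result of 0 means the file starts with the marker, a positive value means there are leading bytes before it, and -1 means no marker was found in the window.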

 

I found a similar issue reported on your JIRA issue tracker here:
https://issues.apache.org/jira/browse/PDFBOX-4203?jql=text%20~%20%22versioninfo%22
However, it was closed without being resolved as the original
reporter failed to provide a PDF document which reproduced the issue.
Hopefully with the information I've supplied, you'll be able to reopen the
bug and take another look. 

 

Please can you keep me updated?

 

Many thanks,

 

Zeke

 

