pdfbox-users mailing list archives

From Wolfgang Kronberg <wolfgang.kronb...@financial.com>
Subject Re: submitting non-public PDFs for bugfixing
Date Tue, 17 Jul 2012 16:50:37 GMT

Hi Maruan,

thank you for pointing me to the NonSequentialParser. I had not noticed
that one before, and it does indeed work much better: I could now extract
the text from all files except one. That one file still displays
correctly in Adobe Reader, although Adobe Reader warns that an
embedded font is missing. The NonSequentialParser throws this exception:

java.io.IOException: Error reading stream using length value.
Expected='endstream' actual='H‰tV T”Çþæ"òXÞ" '
        at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseCOSStream(NonSequentialPDFParser.java:1327)
        at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1032)
        at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:955)
        at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:929)
        at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:337)
        at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
        at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
        at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)

The standard parser does not throw any exception but treats the document as
empty.

What I don't like about this solution is that I need to provide the PDF
as a file, not as a stream. In my application, that means I first
have to copy my stream to a temporary file. Also, the RandomAccess must
be rebuilt from scratch before each use (e.g. with new RandomAccessBuffer()),
because in some cases it is closed implicitly, which leaves me with a
NullPointerException on the next access. Anyway, none of that is a
problem for my current application. So thanks a lot, problem solved!
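
For anyone facing the same file-only limitation, the stream-to-temporary-file
step could look roughly like the sketch below. It uses only java.nio; the
class and method names are made up for illustration, and the PDFBox call is
mentioned only in a comment, with its argument list assumed from the
description above and not verified against the 1.7.0 API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
final class StreamToTempFile {

    private StreamToTempFile() {}

    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path copyToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-input-", ".pdf");
        try {
            // createTempFile has already created the file, so the copy must
            // be allowed to overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException | RuntimeException e) {
            // Do not leave a half-written file behind.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for a real PDF stream.
        byte[] data = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
        Path tmp = copyToTempFile(new ByteArrayInputStream(data));
        try {
            // Assumption: the PDFBox call would go here, for example
            // PDDocument.loadNonSeq(tmp.toFile(), new RandomAccessBuffer()),
            // with a fresh RandomAccessBuffer for every document.
            System.out.println(Files.size(tmp) + " bytes copied to " + tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Deleting the temporary file in a finally block keeps a batch run over
thousands of documents from filling up the temp directory.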

Nevertheless, perhaps some PDFBox developer is still interested in
getting the (now three) PDFs from me which exhibit PDFBox bugs? If so,
please drop me a note! :)

Best Regards,
Wolfgang


On 17.07.2012 16:53, Maruan Sahyoun wrote:
> Hello Wolfgang,
>
> did you try the NonSequentialParser, a new addition in 1.7 that improves the parsing of PDF documents? See https://issues.apache.org/jira/browse/PDFBOX-1199 for details.
>
> With kind regards
>
> Maruan
>
>
> On 17.07.2012 at 16:09, Wolfgang Kronberg wrote:
>
>>
>> Hello,
>>
>> I have recently converted some 2500 PDF files to text using PDFBox
>> 1.7.0. While doing so, I ran into two problems with a minority of the PDF
>> files (about 5% are affected by each problem). Normally, I would now file
>> a bug and attach a sample PDF so that the problem can be reproduced.
>>
>> However, the PDFs in question are not public, and I am not permitted to
>> publish them. Is there anyone to whom I could mail two
>> affected PDF files, so that this person could pin down the actual bug
>> and write a good bug description while keeping the files confidential?
>>
>> In either case, here is what I see. In all cases, the affected document can
>> be displayed without problems in Adobe Reader.
>>
>> Problem 1: The document is parsed as empty (no pages), although it in
>> fact contains more than 50 pages full of text. Running PDFDebugger on this
>> document produces this output (WARNUNG = WARNING):
>> 17.07.2012 14:01:50 org.apache.pdfbox.pdfparser.XrefTrailerResolver
>> setStartxref
>> WARNUNG: Did not found XRef object at specified startxref position 116
>>
>> Problem 2: On attempting to parse the document, I get an IOException.
>> PDFDebugger outputs the following on this document (SCHWERWIEGEND = SEVERE):
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
>> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
>> DataFormatException
>> PDFDebugger failed with the following exception:
>> java.io.IOException
>>        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
>>        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>>        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>>        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>>        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
>>        at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
>>        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
>>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
>>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
>>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1009)
>>        at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:408)
>>        at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:388)
>>        at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:376)
>>        at org.apache.pdfbox.PDFBox.main(PDFBox.java:48)
>> Caused by: java.util.zip.DataFormatException: unknown compression method
>>        at java.util.zip.Inflater.inflateBytes(Native Method)
>>        at java.util.zip.Inflater.inflate(Unknown Source)
>>        at java.util.zip.Inflater.inflate(Unknown Source)
>>        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>>        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>>        ... 14 more
>>
>> Best Regards,
>> Wolfgang
>>
>> --
>> Dipl.-Math.
>> Wolfgang Kronberg
>> Senior Software Architect
>>
>> financial.com AG
>>
>> (t) +49 89 318528-75
>> (f) +49 89 318528-28
>> e-mail: wolfgang.kronberg@financial.com
>> http://www.financial.com
>>
>>
>> financial.com AG
>>
>> Munich head office/Hauptsitz München: Georg-Muche-Straße 3 | 80807 München | Germany | Tel. +49 89 318528-0 | Google Maps: http://g.co/maps/4wcz
>> Frankfurt branch office/Niederlassung Frankfurt: Messeturm | Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany
>> Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer | Dr. Yann Samson | Matthias Wiederwach
>> Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden (Chairman/Vorsitzender)
>> Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID number/St.Nr.: DE205 370 553
>
