pdfbox-users mailing list archives

From CDB <cbu...@burkeitconsulting.com>
Subject Re: PDFBox issues
Date Sat, 21 Jul 2012 02:44:15 GMT
After replacing an old PDFBox lib with the new PDFBox 1.7.0, my app is
printing "processing substream token..." thousands of times when I read a
PDF.

The code is:

    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    // Second argument is "force": keep parsing past corrupt objects.
    doc = PDDocument.load(stream, true);
    return pdfTextStripper.getText(doc);


And the output is:


org.apache.pdfbox.util.PDFStreamEngine - processing substream token: COSInt{0}
org.apache.pdfbox.util.PDFStreamEngine - processing substream token: COSInt{0}
org.apache.pdfbox.util.PDFStreamEngine - processing substream token: COSInt{0}
org.apache.pdfbox.util.PDFStreamEngine - processing substream token: PDFOperator{sc}
org.apache.pdfbox.util.PDFStreamEngine - processing substream token: PDFOperator{q}
org.apache.pdfbox.util.PDFStreamEngine - processing substream token: COSFloat{0.24}
org.apache.pdfbox.util.PDFStreamEngine - processing substream token: COSInt{0}
...


Any ideas?
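
A note on the output above: the "processing substream token" lines look like
debug-level messages from PDFStreamEngine, so they most likely show up because
the logging threshold for org.apache.pdfbox is at DEBUG or lower. PDFBox 1.x
logs through commons-logging, so how to raise the threshold depends on the
backend in use. The sketch below assumes java.util.logging is the backend
(with log4j, the equivalent would be a logger entry in the log4j
configuration). The class name QuietPdfBoxLogging and the method
silenceDebugOutput are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost.
    private static final Logger PDFBOX_LOGGER =
            Logger.getLogger("org.apache.pdfbox");

    /** Raise the threshold so FINE (debug) and lower messages are dropped. */
    public static void silenceDebugOutput() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        silenceDebugOutput();
        // Child loggers inherit the level of the nearest configured ancestor.
        Logger engine =
                Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        System.out.println("debug enabled: " + engine.isLoggable(Level.FINE));
    }
}
```

Calling silenceDebugOutput() once before PDDocument.load should be enough
under that assumption. If the messages keep coming, the application is
probably using a different logging backend.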





On 7/18/12 7:05 AM, "Andreas Lehmkühler" <andreas@lehmi.de> wrote:

>Hi,
>
>
>Yushuang Hao <yushuang.hao@codean.com> wrote on 11 July 2012 at 12:08:
>
>> Dear Sir/Madam,
>>
>> I experienced two issues when I was using PDFBox 1.7.0 to convert a PDF
>> to text:
>>
>> Firstly, the PDF is purely in English, but after conversion I get random
>> CJK characters in it. As far as I can tell, each Latin character fits in
>> 1 byte (0x0000 to 0x00FF in Unicode), and the conversion somehow randomly
>> merged two Latin characters into one 2-byte CJK character. For example, I
>> got "?" (0x5365) rather than "S" (0x0053) and "e" (0x0065). I don't know
>> how this happened, but I managed to convert these back to the right ones.
>>
>> My second issue is that, in the same document, "?" was produced where
>> there should be 3, 4, 6, 7, 8, 9, ), * or %; see the example below. Can
>> you give me some hints on how to solve this? Many thanks.
>
>
>Hmm, it's not that easy to say without having the PDF at hand. If you can
>share the document in question with us, create an issue on JIRA [1] and
>attach the PDF to it.
>
>
>>
>> In PDF:
>> TERM C1 EUR 591736DB6 LX038684 07-Jun-2016 Shadow Shadow 450.0 0.00 0.404
>> 4.9040 0.00 0.00 462,025.59 462,025.59
>>
>> Conversion:
>> 07-Jun-201?TERM C1 EUR  462,025.5?Shadow Shadow  0.00 0.40? 450.0
>> 0.00591736DB?  4.9040  0.00  462,025.5?LX03868?
>
>
>Looks like you are not using the sort-option, are you?
>
>
>>
>> Kind regards,
>> Yushuang
>>
>
>BR
>Andreas Lehmkühler
>
>[1] https://issues.apache.org/jira/browse/PDFBOX
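
On the first issue quoted above (two Latin characters turning into one CJK
character), the arithmetic can be reproduced directly: 0x53 ('S') as the high
byte and 0x65 ('e') as the low byte give 0x5365. The sketch below shows that
merge and one possible way to undo it. It is only a guess at the kind of
workaround the original poster used, not anything taken from PDFBox. The class
and method names are hypothetical, and treating every character in
U+4E00..U+9FFF as a merged pair is an assumption that would corrupt a document
containing real CJK text.

```java
public class MergedLatinPairs {
    /** Combine two 8-bit values into one 16-bit code unit, high byte first. */
    public static char merge(char high, char low) {
        return (char) ((high << 8) | (low & 0xFF));
    }

    /**
     * Split every character in the CJK Unified Ideographs block
     * (U+4E00..U+9FFF) back into its high and low byte.
     * Heuristic: assumes the text should contain no genuine CJK.
     */
    public static String splitMergedPairs(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x4E00 && c <= 0x9FFF) {
                out.append((char) (c >> 8));    // high byte, e.g. 0x53 'S'
                out.append((char) (c & 0xFF));  // low byte,  e.g. 0x65 'e'
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char merged = merge('S', 'e');
        System.out.println(Integer.toHexString(merged));
        System.out.println(splitMergedPairs(merged + "t"));
    }
}
```

This only repairs the symptom after extraction. Why the characters were merged
in the first place would still need the PDF itself, as suggested in the reply.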


