pdfbox-users mailing list archives

From "Zubiri, Tomas" <tomas.zub...@spglobal.com>
Subject RE: Chinese document: mangled characters, ASCII block code points off by 1
Date Thu, 03 Aug 2017 19:30:09 GMT
Thanks for the explanation Tilman!
I'll take a look into Tika if I ever need to extract text from these documents.
Regards.

Tomas Zubiri
Research Associate, Ownership
S&P Global Market Intelligence
Buenos Aires, Argentina
tomas.zubiri@spglobal.com 
www.spglobal.com/marketintelligence
 



-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Thursday, August 03, 2017 3:52 PM
To: users@pdfbox.apache.org
Subject: Re: Chinese document: mangled characters, ASCII block code points off by 1

Hi,

I just tested the files... bad news: only the digits can be extracted. 
The reason that the Chinese characters don't extract is similar to the case here:
https://issues.apache.org/jira/browse/PDFBOX-3886
Feel free to ask further questions.

You got some output because 1.8 made a lot of assumptions when the ToUnicode map was
missing. Sometimes these were right, and sometimes not. The 2.0 versions don't make such
assumptions, so you get nothing.
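
A toy sketch of that difference, in plain Java. This is not PDFBox's actual code: the class name, both method names, and the sample character code 0x32 are hypothetical, chosen only to show how guessing from the raw character code can give a digit that is one too high, while refusing to guess gives no text at all.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {

    /** Lenient decoding: use the map entry if there is one, otherwise guess. */
    static String decodeLenient(int code, Map<Integer, String> toUnicode) {
        String mapped = toUnicode.get(code);
        if (mapped != null) {
            return mapped;
        }
        // Guess (assumption for this sketch): treat the raw character code
        // as if it were a Unicode code point.
        return new String(Character.toChars(code));
    }

    /** Strict decoding: no map entry means no text is returned. */
    static String decodeStrict(int code, Map<Integer, String> toUnicode) {
        return toUnicode.get(code);
    }

    public static void main(String[] args) {
        // Hypothetical font in which character code 0x32 draws the glyph "1".
        int code = 0x32;

        Map<Integer, String> withMap = new HashMap<>();
        withMap.put(code, "1");
        Map<Integer, String> noMap = Collections.emptyMap();

        System.out.println(decodeLenient(code, withMap)); // prints 1
        System.out.println(decodeLenient(code, noMap));   // prints 2 (off by one)
        System.out.println(decodeStrict(code, noMap));    // prints null
    }
}
```

With a map entry both approaches agree. Without one, the lenient guess returns the character whose code point equals the raw code, which is only right when the font's codes happen to match Unicode.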

Tilman

Am 03.08.2017 um 20:35 schrieb Zubiri, Tomas:
> Hey Tilman,
> I am sorry for the delay.
> I am indeed using version 1.8.3; I will update to 2.0.7 in order to solve the
> off-by-one bug.
> Regarding the Chinese characters bug: I am extracting text from a PDF, not rendering.
> Here is what the documents look like.
>
> http://www.filedropper.com/1341025263
> http://www.filedropper.com/1308134649
>
> Here is the text I am extracting with our custom text extractor based 
> on TextPosition and PDFTextStripper from version 1.8.3
> http://www.filedropper.com/1341025263_1
> http://www.filedropper.com/1308134649_1
>
> Let me know if I missed something or if you need any additional info.
>
> Thanks!
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Thursday, August 03, 2017 1:41 PM
> To: users@pdfbox.apache.org
> Subject: Re: Chinese document: mangled characters, ASCII block code 
> points off by 1
>
> Am 02.08.2017 um 00:16 schrieb Zubiri, Tomas:
>> Good afternoon,
>>
>>
>> http://www.filedropper.com/1308134649
>>
>> The document linked above isn't being read correctly by PDFBox.
>> Characters in the ASCII block appear to be off by 1, for example, 
>> numbers appear to be one value higher.
>>
>> Should I upload this as a bug in JIRA?
>>
> Despite you not answering, I was able to guess what you're trying to tell us.
>
> 1) You are using a 1.8.* version. It is not very good at rendering: it can't render
> the Chinese glyphs at all, and the numbers are off by one. Use 2.0.7.
> 2) Version 2.0.7 renders the numbers correctly. (The cause in 1.8.* is that the internal
> code is indeed off by one; this is a weirdness in the file and a bug in 1.8.*, but not a
> broken PDF.) The Chinese glyphs do look Chinese, but in poor quality. This is a known and
> unsolved problem, described here:
> https://issues.apache.org/jira/browse/PDFBOX-3293
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>



