pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Empty glyphs
Date Sat, 25 Jun 2016 11:52:21 GMT
Here's an excerpt of the CMap of that font, to be found at
Root/Pages/Kids/[0]/Resources/Font/F480/ToUnicode:


/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
   /Registry (Adobe) def
   /Ordering (UCS) def
   /Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0000> <ffff>
endbfchar
2 beginbfrange
<0001> <005f> <f020>
<0060> <00d0> <f080>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end



This means that character codes in the content stream whose value is
between 0001 and 005f are converted to Unicode starting at f020, and
codes between 0060 and 00d0 to Unicode starting at f080 (see
beginbfrange; search for this word in the PDF 32000 specification).
https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
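
To make the range arithmetic concrete, here is a small Python sketch of how the bfchar and bfrange entries above translate a character code. It is not PDFBox code; the names BFCHAR, BFRANGE and to_unicode are made up for this illustration, and only the table values come from the excerpt.

```python
# Illustrative only: the table values are copied from the ToUnicode excerpt,
# the names BFCHAR, BFRANGE and to_unicode are hypothetical.
BFCHAR = {0x0000: 0xFFFF}          # <0000> <ffff>
BFRANGE = [
    (0x0001, 0x005F, 0xF020),      # <0001> <005f> <f020>
    (0x0060, 0x00D0, 0xF080),      # <0060> <00d0> <f080>
]

def to_unicode(code):
    """Return the Unicode code point for a character code, or None if unmapped."""
    if code in BFCHAR:
        return BFCHAR[code]
    for low, high, start in BFRANGE:
        if low <= code <= high:
            # Codes in a range map to consecutive code points from `start`.
            return start + (code - low)
    return None

if __name__ == "__main__":
    for code in (0x0000, 0x0001, 0x005F, 0x0060, 0x00D0):
        print(f"<{code:04x}> -> U+{to_unicode(code):04X}")
```

All mapped targets land in the Private Use Area (U+F020 and up), which would explain why the text only looks right in a symbol font.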

But the content stream also has

     [ (\000\000) ] TJ

16 times. This is rendered as a square by Adobe and PDFBox. In the
beginbfchar section, 0000 is mapped to Unicode ffff, which is a
Unicode noncharacter. It becomes EF BF BF in UTF-8.

http://www.fileformat.info/info/unicode/char/ffff/index.htm
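
The byte sequence can be checked with a few lines of Python. This is only a sketch: the helper names is_noncharacter and strip_noncharacters are hypothetical, and whether dropping such characters from the extracted text is the right thing to do is an assumption that depends on your use case.

```python
# The string operand (\000\000) is two zero bytes; with the 2-byte
# codespace <0000> <FFFF> they form a single character code.
raw = b"\x00\x00"
code = int.from_bytes(raw, "big")      # 0x0000

# The bfchar entry maps <0000> to <ffff>, i.e. U+FFFF.
mapped = "\uffff"
utf8 = mapped.encode("utf-8")          # the bytes EF BF BF

def is_noncharacter(ch):
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) >= 0xFFFE

def strip_noncharacters(text):
    """Hypothetical post-processing step: drop noncharacters from extracted text."""
    return "".join(ch for ch in text if not is_noncharacter(ch))

if __name__ == "__main__":
    print(f"code = {code:04x}, utf-8 = {utf8.hex()}")
    print(repr(strip_noncharacters("a\uffffb")))
```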

QED

Tilman





On 23.06.2016 at 10:33, OYEBISI, Daniel wrote:
> You can get the PDF file through this url
>
> http://www.pdf-archive.com/2016/06/23/modele-tableau-wingdings-3/
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Wednesday, 22 June 2016 20:03
> To: users@pdfbox.apache.org
> Subject: Re: Empty glyphs
>
>   From what I see, the "whitespace" is EF BF BF, which is not a valid
> UTF-8 character. Please upload the PDF file somewhere.
>
> Tilman
>
> On 22.06.2016 at 18:39, OYEBISI, Daniel wrote:
>> The problem is with some of the whitespace, which appears empty in
>> Notepad but really is not.
>> Please try opening the text file with other text editors.
>> Thanks
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Wednesday, 22 June 2016 17:54
>> To: users@pdfbox.apache.org
>> Subject: Re: Empty glyphs
>>
>> Your PDF didn't get through (security) but this sounds like a N++ problem.
>>
>> I could display your txt file with the normal Notepad by changing the font to Wingdings.
>>
>> Tilman
>>
>> On 22.06.2016 at 16:58, OYEBISI, Daniel wrote:
>>> Hello,
>>>
>>> I came across an issue while trying to extract the text using
>>> PDFTextStripper from the PDF file attached to this email.
>>>
>>> When I open the generated txt document in Notepad, it appears
>>> normal, but when I open it with Notepad++ it gives an interesting
>>> result.
>>>
>>> Please can you have a look at this?
>>>
>>> Thanks
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org

