pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: TextPosition.getIndividualWidths() returns array with less items than expected
Date Tue, 19 Jul 2016 21:48:47 GMT

> On 19 Jul 2016, at 14:28, Tilman Hausherr <THausherr@t-online.de> wrote:
> 
> Am 19.07.2016 um 23:09 schrieb Ygor Mutti:
>> Yes, it helps. Thank you for the prompt answer!
>> 
>> I wonder why the string returned by getUnicode contains the separate chars
>> instead of the ligature. Is there some way I can configure PDFTextStripper
>> to decode it as it is in the PDF?
> 
> No, I don't know.
> 
> The reason that it is decoded the way it is is the CMap table, which looks like this
> and tells what to do with the codes in the PDF

You mean the ToUnicode CMap (that’s what’s below). The CMap is found in the Encoding entry
and maps a character code to a CID.

> 
> /CIDInit /ProcSet findresource begin
> 12 dict begin
> begincmap
> /CIDSystemInfo
> << /Registry (Adobe)
> /Ordering (UCS) /Supplement 0 >> def
> /CMapName /Adobe-Identity-UCS def
> /CMapType 2 def
> 1 begincodespacerange
> <0000> <FFFF>
> endcodespacerange
> 100 beginbfchar
> <1D> <0066006C>   <============ fl
> <1E> <2212>
> <1F> <00660069>    <=========== fi
> (...)
> 
> 1F = octal 037 decodes to 00660069 i.e. two unicode characters, f and i.
> 
> Think about it... if it would decode to the "fi" unicode character, you wouldn't be able
> to text-search for "Justificação" easily in an extracted text.

Indeed. The ToUnicode CMap in this PDF specifies that the “fi” glyph represents “f”
and “i” in Unicode.
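
To make the mapping concrete, here is a minimal sketch in plain Java (no PDFBox calls) of how the bfchar entries quoted above can be read. It assumes each destination is UTF-16BE hex, as in the excerpt; the class name ToUnicodeSketch and the helper decodeBfcharTarget are hypothetical names for illustration, not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ToUnicodeSketch {

    // Hypothetical helper: turns a bfchar destination such as "00660069"
    // (UTF-16BE code units written as hex) into a Java String.
    static String decodeBfcharTarget(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Only the three entries shown in the quoted CMap excerpt.
        Map<Integer, String> toUnicode = new LinkedHashMap<>();
        toUnicode.put(0x1D, decodeBfcharTarget("0066006C")); // f + l
        toUnicode.put(0x1E, decodeBfcharTarget("2212"));     // U+2212 MINUS SIGN
        toUnicode.put(0x1F, decodeBfcharTarget("00660069")); // f + i

        // \037 in the content stream is octal for 0x1F: one character code.
        String fi = toUnicode.get(037);
        System.out.println(fi + " has " + fi.length() + " chars for 1 character code");

        // Because the code maps to "f" + "i", a plain substring search works.
        String extracted = "Justi" + fi + "ca\u00e7\u00e3o";
        System.out.println(extracted.contains("Justifica\u00e7\u00e3o"));

        // Had it mapped to the single ligature U+FB01, the search would miss.
        String withLigature = "Justi\uFB01ca\u00e7\u00e3o";
        System.out.println(withLigature.contains("Justifica\u00e7\u00e3o"));
    }
}
```

One character code yielding a two-character string is consistent with the mismatch reported at the start of the thread: the widths array follows the character codes, while the Unicode string can be longer.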

— John

> Tilman
> 
> 
>> 
>> On Tue, Jul 19, 2016 at 4:47 PM Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>> 
>>> Am 19.07.2016 um 20:43 schrieb Ygor Mutti:
>>>> Hi!
>>>> 
>>>> The javadoc states that the TextPosition.getIndividualWidths() method
>>>> returns "An array that is the same length as the length of the string."
>>>> Here is a gist containing a test case in which this statement is false:
>>>> https://gist.github.com/ygormutti/d40a80d425d552151625a063fb29c9ca
>>> I'd say the javadoc is wrong. It is the length of the CharacterCodes
>>> array, not the length of the unicode string. The "fi" in Justificação is
>>> one glyph, a ligature.
>>> 
>>> This is the content stream:
>>> 
>>> [ (J) 20 (usti\037ca\347\343o) ] TJ
>>> 
>>> Does this explanation help?
>>> 
>>> Tilman
>>> 
>>>> It prints a line for two cases where the TextPosition.getUnicode() returns
>>>> "fi" while at the same time TextPosition.getIndividualWidths() returns an
>>>> array containing a single float.
>>>> 
>>>> I've tried to pin down the version in which this behavior has been
>>>> introduced and found out it works as expected in 1.2.1 release and does not
>>>> since 1.3.0.
>>>> 
>>>> Should I open a ticket for this?
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>> 
>>> 
> 
> 
