pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: TextPosition.getIndividualWidths() returns array with less items than expected
Date Tue, 19 Jul 2016 21:28:55 GMT
On 19.07.2016 at 23:09, Ygor Mutti wrote:
> Yes, it helps. Thank you for the prompt answer!
>
> I wonder why the string returned by getUnicode contains the separate chars
> instead of the ligature. Is there some way I can configure PDFTextStripper
> to decode it as it is in the PDF?

Not that I know of.

It is decoded that way because of the font's ToUnicode CMap, which maps
the character codes used in the PDF to Unicode. It looks like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
100 beginbfchar
<1D> <0066006C>   <============ fl
<1E> <2212>
<1F> <00660069>    <=========== fi
(...)

Code 1F (hex) = octal 037 decodes to <00660069>, i.e. two Unicode characters, f and i.
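
To make the mapping concrete, here is a minimal sketch in plain Java (no
PDFBox involved) that reads a bfchar destination such as <00660069> as
UTF-16BE text. The class and method names are made up for illustration;
this is not how PDFBox itself parses the CMap.

```java
import java.nio.charset.StandardCharsets;

class BfcharDecode {
    /** Interprets the hex digits of a bfchar destination (e.g. "00660069") as UTF-16BE text. */
    static String decodeDestination(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Two hex digits make one byte
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Hex 1F and octal 037 are the same character code
        System.out.println(0x1F == 037);
        // One code maps to two UTF-16 code units: 'f' followed by 'i'
        System.out.println(decodeDestination("00660069").length());
        System.out.println(decodeDestination("00660069"));
        System.out.println(decodeDestination("0066006C"));
    }
}
```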

Think about it... if it decoded to the single "fi" ligature character
(U+FB01) instead, you wouldn't be able to easily search for
"Justificação" in the extracted text.
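
A small sketch of that search problem in plain Java, with made-up class
and method names. It also shows one possible workaround for text that
does contain the ligature character: NFKC normalization via
java.text.Normalizer. That workaround is my assumption, not something
PDFBox applies in this case.

```java
import java.text.Normalizer;

class LigatureSearch {
    /** Replaces compatibility characters such as U+FB01 with their plain equivalents. */
    static String foldLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "Justificação" with the single ligature character U+FB01 instead of 'f' + 'i'
        String withLigature = "Justi\uFB01ca\u00E7\u00E3o";
        // The same word as the CMap above decodes it: separate 'f' and 'i'
        String decoded = "Justifica\u00E7\u00E3o";

        // A plain search for "fi" misses the ligature version
        System.out.println(withLigature.contains("fi"));
        System.out.println(decoded.contains("fi"));
        // After NFKC normalization both strings compare equal
        System.out.println(foldLigatures(withLigature).equals(decoded));
    }
}
```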

Tilman


>
> On Tue, Jul 19, 2016 at 4:47 PM Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 19.07.2016 at 20:43, Ygor Mutti wrote:
>>> Hi!
>>>
>>> The javadoc states that the TextPosition.getIndividualWidths() method
>>> returns "An array that is the same length as the length of the string."
>>> Here is a gist containing a test case in which this statement is false:
>>> https://gist.github.com/ygormutti/d40a80d425d552151625a063fb29c9ca
>> I'd say the javadoc is wrong. It is the length of the CharacterCodes
>> array, not the length of the unicode string. The "fi" in Justificação is
>> one glyph, a ligature.
>>
>> This is the content stream:
>>
>> [ (J) 20 (usti\037ca\347\343o) ] TJ
>>
>> Does this explanation help?
>>
>> Tilman
>>
>>> It prints a line for two cases where TextPosition.getUnicode() returns
>>> "fi" while at the same time TextPosition.getIndividualWidths() returns an
>>> array containing a single float.
>>>
>>> I've tried to pin down the version in which this behavior was
>>> introduced and found that it works as expected in the 1.2.1 release
>>> but not since 1.3.0.
>>>
>>> Should I open a ticket for this?
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>
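
For callers who need one width per character of the Unicode string
despite the mismatch described in the quoted messages, here is a rough
sketch in plain Java. The helper name and the even-split rule are my
own assumptions, not PDFBox behaviour; the arguments would come from
TextPosition.getUnicode() and TextPosition.getIndividualWidths().

```java
class WidthsPerChar {
    /**
     * Hypothetical helper: returns one width per char of the Unicode string.
     * If the lengths already match, the widths are copied unchanged; otherwise
     * the total width is split evenly, which is only an approximation.
     */
    static float[] widthsPerChar(String unicode, float[] glyphWidths) {
        if (unicode.length() == glyphWidths.length) {
            return glyphWidths.clone();
        }
        float total = 0f;
        for (float w : glyphWidths) {
            total += w;
        }
        float[] result = new float[unicode.length()];
        java.util.Arrays.fill(result, total / unicode.length());
        return result;
    }

    public static void main(String[] args) {
        // One ligature glyph of width 6 that decodes to the two chars "fi"
        float[] widths = widthsPerChar("fi", new float[] {6f});
        System.out.println(widths.length + " widths, each " + widths[0]);
    }
}
```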

