pdfbox-users mailing list archives

From Ygor Mutti <ygormu...@jusbrasil.com.br>
Subject Re: TextPosition.getIndividualWidths() returns array with less items than expected
Date Wed, 20 Jul 2016 21:33:42 GMT
IMHO, the responsibilities are messed up in this case.

I'm surprised to find out that Unicode deals with typographic sugar like
ligatures. This could be much more conveniently handled by the font using
separate glyphs.

Also, I think only text search algorithms, not PDF authoring tools, should
be concerned with searches using approximations. We already have to deal with
PDF authors that don't approximate uncommon glyphs, so we have to handle
them during text search anyway.

I've solved the problem by determining the width of each character in the
Unicode string as the width of the ligature divided by the length of the
string. This is adequate for our purposes.
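
A minimal sketch of that workaround, assuming the glyph width and the decoded
string are already at hand (with PDFBox they would presumably come from
TextPosition.getWidth() and TextPosition.getUnicode()). The class and method
names are hypothetical, not part of PDFBox:

```java
import java.util.Arrays;

// Hypothetical helper, not a PDFBox class.
final class LigatureWidths {
    private LigatureWidths() {}

    // Splits the width of one glyph evenly across the characters it decodes to.
    // "Length" here is String.length(), i.e. UTF-16 code units.
    static float[] perCharWidths(String unicode, float glyphWidth) {
        int n = unicode.length();
        float[] widths = new float[n];
        if (n > 0) {
            Arrays.fill(widths, glyphWidth / n);
        }
        return widths;
    }
}
```

For a ligature that decodes to "fi", this gives two entries of half the glyph
width each; for an ordinary single-character glyph the width is unchanged. The
even split is only an approximation, since the "f" and "i" parts of a real
ligature need not be equally wide.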

Thank you, Tilman and John, for the help!

On Tue, Jul 19, 2016 at 6:48 PM John Hewson <john@jahewson.com> wrote:

>
> > On 19 Jul 2016, at 14:28, Tilman Hausherr <THausherr@t-online.de> wrote:
> >
> > Am 19.07.2016 um 23:09 schrieb Ygor Mutti:
> >> Yes, it helps. Thank you for the prompt answer!
> >>
> >> I wonder why the string returned by getUnicode contains the separate chars
> >> instead of the ligature. Is there some way I can configure PDFTextStripper
> >> to decode it as it is in the PDF?
> >
> > No, I don't know.
> >
> > It is decoded that way because of the CMap table, which looks like this
> > and tells what to do with the codes in the PDF:
>
> You mean the ToUnicode CMap (that’s what’s below). The CMap is found in
> the Encoding entry and maps a character code to a CID.
>
> >
> > /CIDInit /ProcSet findresource begin
> > 12 dict begin
> > begincmap
> > /CIDSystemInfo
> > << /Registry (Adobe)
> > /Ordering (UCS) /Supplement 0 >> def
> > /CMapName /Adobe-Identity-UCS def
> > /CMapType 2 def
> > 1 begincodespacerange
> > <0000> <FFFF>
> > endcodespacerange
> > 100 beginbfchar
> > <1D> <0066006C>   <============ fl
> > <1E> <2212>
> > <1F> <00660069>    <=========== fi
> > (...)
> >
> > 1F = octal 037 decodes to 00660069 i.e. two unicode characters, f and i.
> >
> > Think about it... if it decoded to the single "fi" ligature character
> > (U+FB01), you wouldn't be able to text-search for "Justificação" easily in
> > the extracted text.
>
> Indeed. The ToUnicode CMap in this PDF specifies that the “fi” glyph
> represents “f” and “i” in Unicode.
>
> — John
>
> > Tilman
> >
> >
> >>
> >> On Tue, Jul 19, 2016 at 4:47 PM Tilman Hausherr <THausherr@t-online.de>
> >> wrote:
> >>
> >>> Am 19.07.2016 um 20:43 schrieb Ygor Mutti:
> >>>> Hi!
> >>>>
> >>>> The javadoc states that the TextPosition.getIndividualWidths() method
> >>>> returns "An array that is the same length as the length of the string."
> >>>> Here is a gist containing a test case in which this statement is false:
> >>>> https://gist.github.com/ygormutti/d40a80d425d552151625a063fb29c9ca
> >>> I'd say the javadoc is wrong. It is the length of the CharacterCodes
> >>> array, not the length of the Unicode string. The "fi" in Justificação is
> >>> one glyph, a ligature.
> >>>
> >>> This is the content stream:
> >>>
> >>> [ (J) 20 (usti\037ca\347\343o) ] TJ
> >>>
> >>> Does this explanation help?
> >>>
> >>> Tilman
> >>>
> >>>> It prints a line for two cases where TextPosition.getUnicode() returns
> >>>> "fi" while at the same time TextPosition.getIndividualWidths() returns
> >>>> an array containing a single float.
> >>>>
> >>>> I've tried to pin down the version in which this behavior was
> >>>> introduced and found out it works as expected in the 1.2.1 release and
> >>>> does not since 1.3.0.
> >>>>
> >>>> Should I open a ticket for this?
> >>>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>> For additional commands, e-mail: users-help@pdfbox.apache.org
> >>>
> >>>
> >
> >
>
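
To make the decoding step in the quoted ToUnicode CMap concrete, here is a
small sketch of how a bfchar destination such as <00660069> turns into text.
It relies only on the fact that ToUnicode destination values are UTF-16BE;
the class and method names are hypothetical and this is not how PDFBox
itself is implemented:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not a PDFBox API.
final class ToUnicodeDemo {
    private ToUnicodeDemo() {}

    // Decodes the hex digits of a bfchar destination (e.g. "00660069")
    // as UTF-16BE, two hex digits per byte.
    static String decodeUtf16BeHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    // Converts an octal escape body from a content stream string
    // (e.g. "037" from \037) to its character code.
    static int octalEscapeToCode(String octalDigits) {
        return Integer.parseInt(octalDigits, 8);
    }
}
```

With these, octalEscapeToCode("037") gives 0x1F, and decodeUtf16BeHex applied
to the matching destination "00660069" gives the two-character string "fi",
which is why getUnicode() reports two characters for a single glyph.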
