xmlgraphics-fop-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 32789] [PATCH] Arabic Shaping not Supported by FOP
Date Thu, 11 Feb 2010 12:25:46 GMT

--- Comment #11 from Vincent Hennebert <vhennebert@gmail.com> 2010-02-11 12:25:40 UTC
Hi Jonathan,

(In reply to comment #8)
> Hi Vincent,
> I will attach the .fo file I've been using for testing.  I will also attach the
> generated pdf.  This is from an example our Dubai team gave me for my own
> testing as I developed the code.

Well... It's a bit light for an example. Just a single word...

> Our Dubai team has been testing with a large variety of Arabic script - but
> they are using a report creation tool that invokes fop.bat with xsl input so
> the .fo file isn't part of their output.
> I could give them instructions for creating .fo files.
> We have found in testing that what matters most is that the BIDI algorithm is
> applied so that text (including embedded numerals) is in the right order and
> that form shaping is correct.  You need to know the Arabic alphabet and its
> rules to assess the output of testing.  We have a team that knows Arabic to do
> our testing.  They "eyeball" the reports to make sure they are in proper Arabic
> with text and sub-text in the right order.  Embedded numerals can be in a
> different order - left-to-right rather than right-to-left. It isn't clear to me
> how this process can be automated.
> You are right that widths change and this could change line breaking decisions.
> Do you know where in the FOP pipeline before we reach the rendering pipeline
> the Arabic shaping could go so as to be able to affect width selection?

Something needs to be done in the layout engine, possibly also on the FO tree.
At least section 5.8 (“Unicode BIDI Processing”) of XSL-FO 1.1 deserves a look
as it explains how the Unicode algorithm should be integrated into XSL-FO
processing. Inline-level stuff is likely to be affected. It remains to be seen
how and when character re-ordering should be done WRT line breaking.
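
As a rough illustration of the run detection the Unicode BIDI algorithm performs, here is a minimal sketch that uses the JDK's own java.text.Bidi rather than ICU4J or anything in FOP. The names BidiRunsSketch and describeRuns are made up for this example. It only reports the directional runs of a string; it does not reorder characters, shape them, or make any line breaking decision.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FOP or of the patch under discussion.
final class BidiRunsSketch {

    // Returns one entry per directional run, e.g. "RTL[4,9)".
    // Even embedding levels are left-to-right, odd levels are right-to-left.
    static List<String> describeRuns(String text) {
        // Base direction is taken from the first strong character,
        // falling back to left-to-right when there is none.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            int start = bidi.getRunStart(i);
            int limit = bidi.getRunLimit(i);
            String direction = (bidi.getRunLevel(i) & 1) == 0 ? "LTR" : "RTL";
            runs.add(direction + "[" + start + "," + limit + ")");
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin text followed by an Arabic word.
        System.out.println(describeRuns("abc \u0645\u0631\u062D\u0628\u0627"));
    }
}
```

In a layout engine the interesting part would be what happens after this step: widths have to be computed on the shaped glyphs, and reordering can only be applied per line once the breaks are known.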

Also, something might need to be done at the font level. I don't know what
ICU4J does, but I suspect it replaces characters from the Arabic range
(U+0600–U+06FF) with ones from Arabic Presentation Forms-A (U+FB50–U+FDFF).
AFAIU from the Unicode specification this is legacy that may not be supported
by every font. I suppose modern fonts (especially OpenType ones) use the
standard ligature mechanism to provide contextual glyphs.
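
To make the distinction concrete, here is a small sketch that tells base Arabic code points apart from presentation-form code points using only the JDK's Character.UnicodeBlock. The class and method names are invented for this example and say nothing about what ICU4J or FOP actually do. Besides Presentation Forms-A it also checks Arabic Presentation Forms-B (U+FE70–U+FEFF), which holds the contextual forms of the basic letters.

```java
// Hypothetical helper, not part of FOP, ICU4J, or the patch under discussion.
final class ArabicBlocksSketch {

    // True for code points in the basic Arabic block (U+0600-U+06FF).
    static boolean isBaseArabic(int codePoint) {
        return codePoint >= 0x0600 && codePoint <= 0x06FF;
    }

    // True for code points in either Arabic presentation forms block.
    static boolean isPresentationForm(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Cheap pre-check: does the text need Arabic shaping at all?
    static boolean containsBaseArabic(String text) {
        return text.codePoints().anyMatch(ArabicBlocksSketch::isBaseArabic);
    }

    // Has the text already been rewritten to presentation forms?
    static boolean containsPresentationForms(String text) {
        return text.codePoints().anyMatch(ArabicBlocksSketch::isPresentationForm);
    }

    public static void main(String[] args) {
        String base = "\u0628";   // ARABIC LETTER BEH
        String shaped = "\uFE91"; // ARABIC LETTER BEH INITIAL FORM
        System.out.println(containsBaseArabic(base));
        System.out.println(containsPresentationForms(shaped));
    }
}
```

A check like containsPresentationForms could be run on shaped output to see whether a given font would need to carry glyphs for these legacy code points.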

> I believe that what ensures the right glyphs are embedded in the PDF file is
> the nature of the ICU4J algorithm which transforms the UNICODE representation
> of the string.  The output for our Dubai team is PDFs with embedded fonts and
> these are working so ICU4J must have solved the problem in some way, and I
> believe the way they solve it is by using different UNICODE codes.

Actually this is taken care of by the font library called by PDFPainter. I
suspect the same is done at the layout stage, with the standalone glyphs, which
would be suboptimal, as both standalone and contextual glyphs would be embedded
in the final PDF.

> I don't have performance numbers to give you yet.  If ICU4J was clever about
> the way they wrote their transform algorithm it should not be much of a
> performance impact since they only need to transform text in the Arabic UNICODE
> code range and testing whether text is in this range should be quick.
> Thanks,
> Jonathan
> (In reply to comment #7)
> > Hi,
> > Thanks for your patch. Do you have an example FO file that could be used for
> > testing purposes (even better, with an English translation)?
> > IIUC, Arabic shaping is about replacing glyphs for standalone letters with
> > suitable ligature glyphs for building words. Surely that affects character
> > widths, and therefore line breaking decisions? In the patch, shaping is performed at the
> > rendering stage, so isn't there a danger of getting inconsistent results?
> > Also, IIUC Arabic shaping affects glyph selection. How do you make sure that
> > the right glyphs are being embedded in the PDF file?
> > The same piece of code is duplicated in the PCL and PDF painters. The same
> > would probably also need to be done for other painters. This is not desirable.
> > Finally, what is the impact on performance? It looks like shaping will be
> > applied to all text, even non-Arabic text.
> > Thanks,
> > Vincent
> > (In reply to comment #3)
> > > Created an attachment (id=24934)
> > > --> (https://issues.apache.org/bugzilla/attachment.cgi?id=24934) [details]
> > > Support for Arabic PDF rendering using ICU4J
> > > 
> > > This patch uses ICU4J to do form-shaping and BIDI transformation of rendered
> > > text.  It is a patch for the FOP trunk.   It does not change the layout manager
> > > or the area tree handler or allow a writing-mode other than “lr-tb”.  For this
> > > patch to be integrated with FOP, FOP would need to distribute the ICU4J library
> > > - icu4j-4_2_1.jar.   It affects both PDF and PCL rendering but has only been
> > > tested with PDF rendering.  So far results of testing with PDF rendering have
> > > been positive.  The PCL aspect of the patch looks correct given that the PDF
> > > aspect works.

