From Apache Wiki <wikidi...@apache.org>
Subject [Xmlgraphics-fop Wiki] Trivial Update of "GoogleSummerOfCode2006/FloatsImplementationProgress/Phase1Documentation" by VincentHennebert
Date Fri, 04 Aug 2006 10:43:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Xmlgraphics-fop Wiki" for change notification.

The following page has been changed by VincentHennebert:
http://wiki.apache.org/xmlgraphics-fop/GoogleSummerOfCode2006/FloatsImplementationProgress/Phase1Documentation

The comment on the change is:
Splitting: 1- Documentation

New page:
#pragma section-numbers on

'''Contents'''
[[TableOfContents]]

I'm planning to read the following literature:
 * "Digital Typography", Knuth: glue/box/penalty model, optimal line-breaking algorithm
 * "Pagination Reconsidered", Brüggemann-Klein, Klein, Wohlfeil: a better algorithm for placing
floating objects than TeX's one.
 * Pages of this Wiki related to the Knuth Approach
 * Simon Pepping's generalized glue/box/penalty model.

= "Digital Typography": Breaking Paragraphs into Lines =
The algorithm works even when each line of the paragraph has a different length. This may be interesting for implementing side-floats, provided that we know in advance the height of each line. And I think this may depend on the line-stacking strategy.

There are three main aspects which characterize the Knuth approach:
 * build a sequence of glue/box/penalty elements from some input data (see the sketch below);
 * define a somewhat arbitrary formula used to compute the demerits of each break, which is to be minimized;
 * the algorithm itself, which amounts to finding a shortest path in an acyclic graph. The nodes are breakpoints, and each edge is tagged with the demerits of the corresponding line.
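
To make the first aspect more concrete, here is a minimal Java sketch of what the three element types might look like. The names are mine, chosen for illustration; Fop's actual classes (in the layoutmgr package) may differ:
{{{
// Common base type: every element has a natural width.
abstract class Element {
    final int width;                    // e.g. in millipoints
    Element(int width) { this.width = width; }
}

// Unbreakable material of fixed width: a word, an image, a whole line...
class Box extends Element {
    Box(int width) { super(width); }
}

// Stretchable/shrinkable space; a legal breakpoint when it follows a box.
class Glue extends Element {
    final int stretch;                  // may be negative, for special effects
    final int shrink;
    Glue(int width, int stretch, int shrink) {
        super(width);
        this.stretch = stretch;
        this.shrink = shrink;
    }
}

// An explicit potential breakpoint with an aesthetic cost; an "infinite"
// value forbids the break, a negative one encourages it.
class Penalty extends Element {
    static final int INFINITE = 1000;   // conventional value for "infinity"
    final int value;
    final boolean flagged;              // e.g. hyphenation points
    Penalty(int width, int value, boolean flagged) {
        super(width);
        this.value = value;
        this.flagged = flagged;
    }
}
}}}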

To go into the details a bit:
 * a line may be broken either at a penalty item, or at a glue immediately following a box;
 * when a line is broken, all of the glue/penalty items at the beginning of the line up to
the first box are removed.
 * those properties may be utilized to achieve various effects, such as ragged-right, ragged-left, or centered lines. Many effects rely on the fact that a negative value may be given to the stretchability of a glue (see the sketch after this list);
 * the definition of the to-be-minimized formula may vary according to the criteria of optimization. In the formula used by TeX to justify text, there are three criteria:
   * the badness beta, related to the adjustment ratio of the line (how much it is stretched/shrunk);
   * the penalty pi, if the line breaks at a penalty item;
   * an additional penalty alpha, if both this line and the previous one end at flagged penalty items.
 Those three values are combined to compute the line's ''demerits''. The algorithm will find the set of lines with a minimal sum of demerits.
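
For the record, here is the formula used by TeX, as I remember it from "Digital Typography" (to be double-checked against the book); r,,j,, is the adjustment ratio of line j:
         β,,j,, = 100 |r,,j,,|^3^
         d,,j,, = (1 + β,,j,, + π,,j,,)^2^        if 0 <= π,,j,, < ∞
         d,,j,, = (1 + β,,j,,)^2^ - π,,j,,^2^     if -∞ < π,,j,, < 0
         d,,j,, = (1 + β,,j,,)^2^                 if π,,j,, = -∞
to which α is added when both line j and line j-1 end at flagged penalty items.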
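
And to illustrate the negative-stretch trick: if I remember the construction from the paper correctly (the values here are symbolic), ragged-right lines are obtained by replacing each interword space of width w by:
{{{
glue(width=0, stretch=BIG, shrink=0)     -- absorbs the leftover space at the end of the line
penalty(width=0, value=0)                -- the allowed breakpoint
glue(width=w, stretch=-BIG, shrink=0)    -- cancels the stretch when no break occurs
}}}
If the line breaks at the penalty, the first glue stretches to fill the line and the second one is removed at the beginning of the next line; otherwise the two stretch amounts cancel out and a normal space of width w remains.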

Regarding (before-)floats:
 * the fo:float element (with the "float" property set to "before") will have to be converted
to a sequence of glue/box/penalty elements, which themselves will have to be inserted at the
beginning of certain pages;
 * the formula used to determine the optimality of float placements should be retrieved from the "Pagination Reconsidered" paper. I still have to see how to combine it (if need be) with the formula currently used to break pages. I assume that it is different from the formula used for paragraphs, as there are different constraints (more to come when I've looked at the code);
 * the line-breaking algorithm isn't designed to handle a floating sequence of g/b/p items, that is, a subsequence that is not fixed within the larger general sequence but may be put at other places and possibly split. The algorithm will have to be extended, probably with the algorithm found in "Pagination Reconsidered";
 * everything stated so far may be invalidated by further reading...

= "On the Pagination of Complex Documents" =
By Brüggemann-Klein, Klein, and Wohlfeil. This article may be found [http://wwwpi6.fernuni-hagen.de/Publikationen/tr234.pdf here].
Also, from the same authors, the article [http://wwwpi6.fernuni-hagen.de/Publikationen/tr205.pdf
"Pagination Reconsidered"] gives the details of the algorithm.

== Some Theoretical Background ==
This article gave me an idea of the theoretical background of document layout. In fact, document layout is somewhat similar to a [http://en.wikipedia.org/wiki/Bin_packing_problem bin packing problem]: bins correspond to pages, objects to typeset material of some height. Those objects usually come from different flows: before-floats, main flow, footnotes. The order within each flow must be respected (a figure appearing after another one in the list must appear later on the pages), but there is some freedom in choosing the flow from which the next typeset element will be taken.

An optimal way of putting those elements on the pages must be found. In other words there
is some cost function to minimize. This cost function should reflect what a human would call
a good pagination; among various criteria we may cite the following: figures should appear
as near to their first reference as possible; footnotes ''must'' appear on the same page as
their citation; facing pages (or columns) should be balanced; etc.

As with packing problems in general, page layout is usually a [http://en.wikipedia.org/wiki/Combinatorics combinatorial] NP-hard problem. However, if the cost function has certain properties (namely [http://en.wikipedia.org/wiki/Optimal_substructure optimal substructure]), it becomes possible to solve it in polynomial time by using [http://en.wikipedia.org/wiki/Dynamic_programming dynamic programming]. The cost function used by Knuth and Plass for computing the demerits of each line of a paragraph has such a property.

For the page layout problem, the main issue is then to find a cost function:
 * which correctly reflects the quality of a layout;
 * which has the right properties to make it possible to minimize it using dynamic programming.

Note that techniques other than dynamic programming may be used to lay out pages, for example some heuristic approach combined with genetic algorithms or simulated annealing (as in [http://citeseer.ist.psu.edu/johari97automatic.html this article]). But exploring those approaches would lead us much too far; let's concentrate on the dynamic programming approach implemented in Fop.

== Comments on the Article ==
 * It is assumed that paragraphs have already been broken into lines. The main restriction
of such an assumption is that it won't be possible to apply this method to sequences of pages
of different widths. However, this shouldn't prevent us from registering several layouts of
the same paragraph with differing numbers of lines. Thus it would be possible to choose a
layout with fewer lines to avoid an orphan, for example.
 * In his thesis (see below), Plass defines a badness function for measuring pagination quality
that makes the problem NP-hard. In this article, another function is used which in the authors'
opinion better reflects a human's judgment, and for which there is a polynomial solution.
In short, they use the number of page turns necessary to read the document, that is, the total
number of pages + the number of page turns caused by figures appearing on a different page
from their citations. Plass used the sum of the squares of page differences between figures
and their citations.
 * The page model used to explain the algorithm is pretty similar to the one used by FO: an
optional figure region at the top of the page, with optional glue separating it from the text.
Good point.
 * The authors utilize the fact that on double-sided documents, a figure may appear on the
left page if its citation is on the right page. I think the XSL-FO spec doesn't allow this,
but perhaps that could become a user-configurable extension...
 * The computation time of the pagination algorithm presented in this paper is proportional
to the number of text lines times the number of figures. Note that this article does not deal
with footnotes, but it should be easy to extend it to handle them.
 * It is possible to tweak the algorithm to allow pages which are not entirely full (yet still
with balanced facing pages). Indeed sometimes it is impossible to find a layout which respects
all of the constraints (full pages, no widows/orphans, keeps...).
 * Although the algorithm assumes that the lines of text are separated by some glue, it only relies on the glue's minimum and maximum values; it doesn't take the optimal value into account. And as it tries to minimize the total number of pages, we may end up with spaces systematically set to their minimum value. Indeed, if there are no before-floats and no footnotes, this will obviously be the case. I think this may violate the XSL-FO spec; at least we can expect complaints from users.
 * Related to this: the possibility of breaking a page is binary, allowed or not. This constraint is IMO too heavy: although an orphan is undesirable, it should be acceptable when no better layout is possible. In fact this algorithm doesn't utilize the whole range of penalty values available in the Knuth model; only two are allowed, 0 and infinity (see the snippet below).
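
To illustrate the last point with the hypothetical element types sketched earlier: the Knuth model allows a whole range of intermediate break costs, whereas the article's algorithm only ever uses the two extremes.
{{{
Penalty forbidden   = new Penalty(0, Penalty.INFINITE, false); // break never allowed
Penalty neutral     = new Penalty(0, 0, false);                // break freely
Penalty discouraged = new Penalty(0, 200, false);              // allowed but costly, e.g. an orphan
Penalty encouraged  = new Penalty(0, -100, false);             // a good place to break
}}}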

== Resulting Thoughts ==
The algorithm cannot be used directly as it is presented, mainly because of the last two items of the previous section. However, it provides a strong basis IMO. Some points:
 * There may be cases where it is impossible to find a layout which fulfills all of the criteria.
We may consider performing a second pass with some relaxed properties, but:
   * can we afford the additional computation time?
   * which properties to relax, in which order? Possibilities:
     * orphans/widows properties; but contrary to TeX, in XSL-FO it is only possible to define
the number of orphans/widows, not a penalty associated with them.
     * keep properties: those may have integer values that allow for some flexibility
     * page fullness: AFAICT the XSL-FO spec does not impose that pages be full. And as explained earlier, the algorithm can take advantage of this. According to the article, it becomes quite easy to find a layout just by allowing pages to be only 90% full. It might even become possible to perform only one pass (which would ideally find the optimal pagination when possible, and one with some percentage of fullness otherwise). (See also the recent thread on fop-dev on this subject.)
 * There is a free bonus implied by this algorithm: it should give pages which are vertically justified [[FootNote(don't forget to have a look at Luca's LayoutExtensions on this issue)]]. That is, the bottoms of pages should coincide with the bottom of the region-body area. The only exception would be in "relaxed mode", when a percentage of fullness is introduced. Such a feature should remain optional, however. But I think the exact same situation occurs with justified versus ragged-right text: the algorithm is the same; the glues are simply actually stretched/shrunk or left as they are.
 * BTW, is there a way in XSL-FO of determining whether a document is single- or double-sided? We can check whether separate page masters are used for even and odd pages, but such a check would have to be reliable. Anyway, this may probably be deferred to a fine-tuning period, when we have a working basic implementation.
 * The algorithm may be generalized to allow several kinds of floats, e.g., tables and figures. In such cases, even if the order within each float sequence must be respected, it is possible to place a table before a figure even if it is referenced later in the main text. That said, XSL-FO defines only one kind of before-float. Anyway, there is room for future Fop extensions IMHO...

When there are neither figures nor footnotes, pagination becomes equivalent to breaking paragraphs into lines, and in that particular case the Knuth algorithm may be used. It would thus be great to find a cost function and an associated algorithm that reduce to line-breaking in that case. In any event, it would be great to utilize the full potential of the g/b/p model for page-breaking and, moreover, to rely as much as possible on the current source code. More to come when I have looked at it.


= "Optimal Pagination Techniques for Automatic Typesetting Systems" =
Michael Frederick Plass -- PhD thesis

== Introduction ==
Plass starts by having a look at the practices which were in use in the past. In short, compositors mostly performed first-fit pagination, because of the difficulty of looking ahead and because re-typesetting the n-1 previous pages to get a better-looking page n demanded too much time. Footnotes were particularly difficult to handle. Other problems were balancing facing pages and, obviously, placing floats.

The g/b/p model used for line-breaking does not readily apply to page-breaking; it must be extended with some "insert" operator to handle figures and footnotes.

== Pagination may be an NP-complete Problem ==
The naive algorithm, which consists of choosing the best pagination among all of the possible ones, is obviously exponential. But depending on the chosen badness function, the problem may become solvable in polynomial time.

Basically Plass shows that, among others, "quadratic" badness functions lead to an NP-complete problem. Such functions compute the squares of the differences between the page numbers of citations and the page numbers of the corresponding figures.
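
If I understand correctly, with F the set of figures and C(f) the set of citations of figure f, such a function looks like:
         badness = Σ,,f in F,, Σ,,c in C(f),, (page(c) - page(f))^2^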

In reality this is a bit more subtle. Indeed, under certain constraints, quadratic functions may be solvable in polynomial time. Plass actually shows that it is quite easy to choose constraints which make the problem exponential; for example, simply allowing glue between the elements suffices to make it NP-complete. The issue is then to find the right constraints (number of citations per figure, number of objects per page, constant page size or not, and so on) and the right badness function, so that the problem remains a good representation of reality yet is solvable in polynomial time.

== Making Pagination Polynomial ==
It may be interesting to quickly explain the case where there are neither footnotes nor figures. The problem becomes equivalent to line-breaking, and it shows the main principle on which the whole dynamic programming approach relies.

We define B,,jk,, to be the badness corresponding to the best way to put the first j lines
on the first k pages. Then the following formula holds:
         B,,jk,, = min,,0<=i<j,, { B,,i k-1,, + β,,ijk,, }
where β,,ijk,, is the badness due to placing lines i+1 through j on page k. The best way
to put the first j lines on the first k pages only depends on the best ways to put i lines
on the first k-1 pages. The only problem is to find this particular i for which (the badness
of the first k-1 pages) + (the badness of page k) will be minimal.

We can easily see the recursive nature of this problem. The rest, well, "only" corresponds
to implementation details.
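
As an illustration, here is a minimal Java sketch of that recurrence, under simplifying assumptions of my own: all pages have the same height, the height of each line is known, and the badness of a page is the square of its empty space. All names are made up for the purpose of the example.
{{{
import java.util.Arrays;

public class PaginationSketch {

    static final double INF = Double.POSITIVE_INFINITY;

    // beta_ijk: badness of placing lines i+1..j on one page (independent
    // of k here, since all pages have the same height).
    static double beta(double[] lineHeight, int i, int j, double pageHeight) {
        double h = 0;
        for (int l = i; l < j; l++) {
            h += lineHeight[l];
        }
        if (h > pageHeight) {
            return INF;                 // overfull page: forbidden
        }
        double slack = pageHeight - h;
        return slack * slack;           // squared empty space
    }

    // Minimal badness over all ways of putting the n lines on k pages,
    // for the best possible k.
    static double paginate(double[] lineHeight, double pageHeight) {
        int n = lineHeight.length;
        // B[j][k] = best way to put the first j lines on the first k pages
        double[][] B = new double[n + 1][n + 1];
        for (double[] row : B) {
            Arrays.fill(row, INF);
        }
        B[0][0] = 0;
        for (int k = 1; k <= n; k++) {
            for (int j = 1; j <= n; j++) {
                // B_jk = min over 0 <= i < j of (B_i,k-1 + beta_ijk)
                for (int i = 0; i < j; i++) {
                    double b = B[i][k - 1] + beta(lineHeight, i, j, pageHeight);
                    if (b < B[j][k]) {
                        B[j][k] = b;
                    }
                }
            }
        }
        double best = INF;
        for (int k = 1; k <= n; k++) {
            best = Math.min(best, B[n][k]);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] lines = {12, 12, 14, 12, 12, 16, 12};
        System.out.println(paginate(lines, 42));
    }
}
}}}
(To recover the actual page breaks, it suffices to also record, for each (j, k), the i which realized the minimum.)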

When introducing footnotes and figures, the challenge is to find a formula which has a similar structure. Plass begins with a formula which computes the distance (in absolute value) between a figure and each(!) of its citations. The problem is solvable but the constraints are unrealistic: there may be only one type of element on each page, either text or a figure. Plass then generalizes the problem to handle any number of independent flows (normal text, figures, tables, etc.), but again with only one element per page.
OK, I've decided to stop here. At the end of the chapter, Plass presents a differently generalized problem and tries to find sufficient conditions which make it solvable by dynamic programming. In this case the previous weird constraints are relaxed: there may be any number of elements on each page, several flows, several references from any flow to any other, etc. He then formulates the cost function as depending on 4 independent arbitrary sub-functions, which gives some freedom in characterizing a good pagination.

Then there is an entire chapter describing an implementation algorithm, which is very similar to the line-breaking algorithm. Here I think it is more interesting to refer to the work of Brüggemann-Klein et al., which is easier to read and more directly re-usable than Plass' work. It will only have to be adapted to the g/b/p model, which shouldn't be too difficult.

It may be worth returning to Plass' work if we encounter unexpected difficulties during the
implementation of floats (e.g., if the chosen algorithm doesn't provide satisfactory results),
as it provides a strong theoretical basis. For now I think I know enough to proceed to the
implementation.


= The Fop Source Code =
Even though it is well explained, the Knuth line-breaking algorithm isn't that easy to understand. ATM I've concentrated on the class layoutmgr.!BreakingAlgorithm, which contains the part of the algorithm that is common to page- and line-breaking. It is split into parts which follow pretty closely those described in "Digital Typography". It relies on the following skeleton:{{{
findBreakingPoints(a paragraph, maximum allowable stretching ratio, whether to force the finding of breakpoints)
    initialize the various fields and variables
    For each Knuth element of the paragraph Do
        update totalWidth, totalStretch, totalShrink
        If the element is a feasible breakpoint Then
            > considerLegalBreak(at this element)
        EndIf
    EndFor
    If no set of feasible breakpoints could be found Then
        reactivate the last node and restart the algorithm from it
    EndIf
    Select some active nodes among the available ones
            > filterActiveNodes()
    For each remaining active node Do
        Compute the corresponding set of optimal breakpoints
                > calculateBreakPoints
    EndFor


considerLegalBreak(Knuth element)
    For each currently active node Do
        compute the adjustment ratio of a line starting at the active node and ending at this
element
        If the adjustment ratio is unacceptable Then
            deactivate this node
        Else
            compute the fitness class of the corresponding line
                    > computeFitness
            update the total demerits by adding the demerits of this line
                    > computeDemerits
            If the result is better than the best recorded demerits for the given fitness
class Then
                register this new best active node
                        > BestRecords.addRecord
            EndIf
        EndIf
        (perform something when the adjustment ratio is unacceptable and we are in forced
mode)
    EndFor
    add new active nodes for breaks at this element
            > addBreaks


addBreaks(element ending the line, line number)
    compute the new totalWidth, totalStretch, totalShrink (after this element)
    For each fitness class Do
        If the recorded best active node for this fitness class is to be retained Then
            create a new active node for a break from this node to the element
        EndIf
    EndFor


calculateBreakPoints(last active node, corresponding paragraph, corresponding number of lines)
    For j = number of lines DownTo 1 Do
        breakpoint = node.element
        node = node.previous
    EndFor
}}}
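
For reference, the adjustment ratio computed at the beginning of considerLegalBreak should be the classical one from the Knuth model (again from memory):
         r = (desired length - natural length) / totalStretch    if the line needs stretching
         r = (desired length - natural length) / totalShrink     if it needs shrinking
         r = 0                                                   if the line fits exactly
A node is then deactivated when r < -1 (the line cannot shrink that much) or when r exceeds the maximum allowable stretching ratio passed to findBreakingPoints.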
