incubator-odf-users mailing list archives

From Rob Weir <>
Subject Re: Is there a way to extract text on a page basis from odt ?
Date Sun, 25 Sep 2011 14:39:13 GMT
On Sun, Sep 25, 2011 at 10:30 AM, Wolf Halton <> wrote:
> Depending on the actual purpose of page-based extraction, couldn't a filter
> based on counting line returns work?

Word wrapping and line splitting are similar to page breaks.  Unless
the user enters an explicit carriage return, the document doesn't know
where one line ends and another begins.  The line breaks are
calculated when the editor renders the page, based on font metrics and
page dimensions.

Of course, if we had layout code in the ODF Toolkit, that would allow
us to solve this problem, in theory.  But you still have
complications.  For example, the fonts available to a process on the
server might be different than those available to the document
author's client.  Or the Toolkit code might be running on a "headless"
server without any graphics context available.  But that shouldn't
stop us from solving this where we can.
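
For documents that were saved with soft page breaks (see the quoted
discussion below), a rough split is already possible without any layout
code.  A minimal sketch in Python, using only the standard library rather
than the ODF Toolkit.  The names split_pages and read_content_xml are made
up for illustration, and it only looks at the body text in content.xml, so
headers and footers (which live in styles.xml) are not covered:

```python
import xml.etree.ElementTree as ET
import zipfile

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
# Paragraphs and headings end with a newline in the extracted text.
BLOCKS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def read_content_xml(path):
    """Return the raw content.xml bytes from an .odt package (a zip file)."""
    with zipfile.ZipFile(path) as odt:
        return odt.read("content.xml")


def split_pages(content_xml):
    """Split the body text into pages at each <text:soft-page-break>.

    Simplification (an assumption of this sketch): <text:s>, <text:tab> and
    <text:line-break> are ignored, and whitespace between elements is kept
    as-is, so pretty-printed XML will leak indentation into the result.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f".//{{{OFFICE_NS}}}text")
    pages = [[]]  # implicit page start at the start of the document

    def walk(elem):
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this goes to the next page
        if elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCKS:
            pages[-1].append("\n")

    if body is not None:
        walk(body)
    return ["".join(parts).strip() for parts in pages]
```

If the document has no soft page breaks, this simply returns everything as
a single "page", so the caller cannot tell a one-page document from one
that was saved without them.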


> On Sep 25, 2011 10:09 AM, "Rob Weir" <> wrote:
>> On Sun, Sep 25, 2011 at 3:57 AM, Svante Schubert
>> <> wrote:
>>> Am 24.09.2011 14:26, schrieb Rob Weir:
>>>> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <> wrote:
>>>>> Hi,
>>>>> I need to extract all text (header, footer, comments, endnote, etc)
>>>>> from an ODT document. I need to do it on a page by page basis. I'm
>>>>> aware that ODTs are basically structured by paragraphs and headings,
>>>>> but I'd like to know if there's a way to achieve what I need.
>>>>> Thanks a lot.
>>>> Good question.
>>>> With WYSIWYG word processors, page numbers are calculated when the
>>>> document is loaded, based on your currently configured printer, font
>>>> metrics, etc.  So there is nothing at the level of the ODF markup that
>>>> is a structural equivalent to a "page".  ODF is similar to HTML in
>>>> this regard.  It has paragraphs, tables, etc., but line breaks and
>>>> page breaks are calculated at runtime.
>>>> However, starting in ODF 1.1, the format does allow an option for a
>>>> word processor to save "soft" page breaks in the document.  This was
>>>> intended to help with accessibility tools, screen readers, etc.  If
>>>> your word processor supports this (and many do) then you can try
>>>> looking for the <text:soft-page-break> element.  This would
>>>> indicate where the pages broke in the word processor that last saved
>>>> the document.  But there is no guarantee that every ODF document will
>>>> have soft page breaks.
>>>> So in theory you could walk the document, looking for
>>>> <text:soft-page-break> and determine pages that way.
>>> Rob already described both the problem and the solution.
>>> I would like to add a question: where should the functionality for
>>> retrieving pages go, for instance if the questioner were willing to
>>> provide a patch?
>>> Certainly in the highest level of API, therefore in the Simple API (or
>>> DOC API), as those will be merged.
>>> Daisy or Devin, you once implemented the text extraction for the
>>> complete document, right?
>>> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
>>> Is this also accessible via the Simple API? I could not find it.
>> org.odftoolkit.simple.common.TextExtractor
>> But the problem is that there is no page element. A page is
>> only defined by what is between soft page breaks (taking into account
>> the implicit page start at the start of the document and the implicit
>> page end at the end of the document). But there is no parent in the
>> DOM that contains page content.
>> I could imagine a synthetic parent "page" object that could be
>> returned by the navigation API, and could then give access to the
>> "contained" content of that page. But it would need to be read-only,
>> I think. Changing the content of the page, by inserting, deleting, or
>> even changing the header/footer, can affect the pagination.
>> Something to consider: even without a page-oriented UI, we should
>> consider invalidating and removing all existing soft page breaks when
>> document content is modified, or at least give an easy method for a
>> programmer to do this if they wish.
>> -Rob
>>> In this context, when I looked for the extraction functionality, I
>>> stumbled over the methods getFooter()/getHeader().
>>> You return those from the document without a context. But there might be
>>> multiple headers/footers in a document.
>>> One pair for each master page style, therefore you need a context or
>>> your simplification is only a good guess, but sometimes wrong.
>>> Regards,
>>> Svante
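
The idea raised in the thread above, of giving a programmer an easy way to
drop stale soft page breaks after editing, can be sketched as follows.
This is Python with the standard library only, not ODF Toolkit API; the
name strip_soft_page_breaks is hypothetical, and the sketch assumes that
deleting the elements (while keeping the text that follows them) is all
the invalidation that is needed:

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"

# Keep the usual prefixes for these two namespaces when serializing.
# Any other namespace in a real content.xml would get a generated prefix.
ET.register_namespace("text", TEXT_NS)
ET.register_namespace("office", OFFICE_NS)


def strip_soft_page_breaks(content_xml):
    """Remove every <text:soft-page-break>; return (new_xml, removed_count)."""
    root = ET.fromstring(content_xml)
    removed = 0
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag != SOFT_BREAK:
                prev = child
                continue
            # ElementTree stores the text after an element in its .tail,
            # so move it to the previous kept sibling (or the parent's
            # text) before removing the break, otherwise it would be lost.
            tail = child.tail or ""
            if prev is None:
                parent.text = (parent.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            parent.remove(child)
            removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

After this, a page-splitting routine that relies on soft page breaks will
see the whole document as one page until a word processor saves it again.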
