incubator-odf-users mailing list archives

From Rob Weir <>
Subject Re: Is there a way to extract text on a page basis from odt ?
Date Sun, 25 Sep 2011 14:09:41 GMT
On Sun, Sep 25, 2011 at 3:57 AM, Svante Schubert wrote:
> On 24.09.2011 14:26, Rob Weir wrote:
>> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane wrote:
>>> Hi,
>>> I need to extract all text (header, footer, comments, endnote, etc) from an
>>> ODT document. I need to do it on a page by page basis. I'm aware that ODTs
>>> are basically structured by paragraphs and headings, but I'd like to know if
>>> there's a way to achieve what I need.
>>> Thanks a lot.
>> Good question.
>> With WYSIWYG word processors, page numbers are calculated when the
>> document is loaded, based on your currently configured printer, font
>> metrics, etc.  So there is nothing at the level of the ODF markup that
>> is a structural equivalent to a "page".  ODF is similar to HTML in
>> this regard.  It has paragraphs, tables, etc., but line breaks and
>> page breaks are calculated at runtime.
>> However, starting in ODF 1.1, the format does allow an option for a
>> word processor to save "soft" page breaks in the document.  This was
>> intended to help with accessibility tools, screen readers, etc.  If
>> your word processor supports this (and many do) then you can try
>> looking for the <text:soft-page-break> element.  This would
>> indicate where the pages broke in the word processor that last saved
>> the document.  But there is no guarantee that every ODF document will
>> have soft page breaks.
>> So in theory you could walk the document, looking for
>> <text:soft-page-break> and determine pages that way.
> Rob has already explained both the problem and the solution.
> I would like to add a question: where should the functionality for
> retrieving pages go, for instance if the questioner were willing to
> provide a patch?
> Certainly at the highest level of the API, and therefore in the Simple
> API (or DOC API), as those will be merged.
> Daisy or Devin, you once implemented the text extraction for the complete
> document, right?
> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
> Is this also accessible via the Simple API? I could not find it.


But the problem is that there is no page element.  A page is
only defined by what is between soft page breaks (taking into account
the implicit page start at the start of the document and the implicit
page end at the end of the document).  But there is no parent in the
DOM that contains page content.
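
To make the "between soft page breaks" idea concrete, here is a minimal
sketch in Python, standard library only. It is not ODFDOM code: the
function name split_pages and the sample document are made up for
illustration, and it assumes you have already read content.xml out of
the ODT package. It only collects text inside text:p and text:h, ignores
whitespace elements such as text:s and text:tab, and does not touch
headers or footers, which live in styles.xml.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def split_pages(content_xml):
    """Return one string per page, splitting at <text:soft-page-break/>.

    Hypothetical helper: walks office:text in document order and starts a
    new page each time a soft page break is seen, whether the break sits
    between paragraphs or inside one.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f".//{{{OFFICE_NS}}}text")
    pages = [[]]  # implicit page start at the start of the document

    def walk(elem, in_block):
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this goes to the next page
            return
        in_block = in_block or elem.tag in BLOCK_TAGS
        if in_block and elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child, in_block)
            if in_block and child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCK_TAGS:
            pages[-1].append("\n")  # keep paragraphs on separate lines

    if body is not None:
        walk(body, False)
    return ["".join(parts).strip() for parts in pages]


# Made-up sample: one break inside a paragraph, one between paragraphs.
SAMPLE = (
    '<office:document-content'
    f' xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    '<office:body><office:text>'
    '<text:p>First page.</text:p>'
    '<text:p>Starts on one, <text:soft-page-break/>ends on two.</text:p>'
    '<text:soft-page-break/>'
    '<text:p>Third page.</text:p>'
    '</office:text></office:body></office:document-content>'
)

pages = split_pages(SAMPLE)
```

Note that the result is a flat list of strings, not DOM nodes, which
sidesteps the missing parent element but also means the pages cannot be
edited in place.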

I could imagine a synthetic parent "page" object that could be
returned by the navigation API, and could then give access to the
"contained" content of that page.  But it would need to be read-only,
I think.  Changing the content of the page (inserting, deleting, or even
changing the header/footer) can affect the pagination.

Something to consider:  even without a page-oriented UI, we should
consider invalidating and removing all existing soft page breaks when
document content is modified, or at least give an easy method for a
programmer to do this if they wish.
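
As a rough illustration of what such a "remove all soft page breaks"
helper might do, here is a sketch in Python, standard library only. The
name strip_soft_page_breaks is hypothetical and this is not an existing
ODFDOM method. The one subtle point is that text following a break
inside a paragraph must be reattached to the previous sibling or to the
parent, otherwise it is lost along with the removed element.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"


def strip_soft_page_breaks(root):
    """Remove every text:soft-page-break under root; return how many."""
    removed = 0
    # Snapshot the element list first so the tree can be edited safely.
    for parent in list(root.iter()):
        last_kept = None
        for child in list(parent):
            if child.tag != SOFT_BREAK:
                last_kept = child
                continue
            if child.tail:
                # Reattach the text that followed the break.
                if last_kept is None:
                    parent.text = (parent.text or "") + child.tail
                else:
                    last_kept.tail = (last_kept.tail or "") + child.tail
            parent.remove(child)
            removed += 1
    return removed


# Made-up sample paragraph with a break in the middle of a sentence.
doc = ET.fromstring(
    f'<text:p xmlns:text="{TEXT_NS}">Starts on one, '
    '<text:soft-page-break/>ends on two.</text:p>'
)
count = strip_soft_page_breaks(doc)
```

A real implementation would presumably be called automatically whenever
the document content is modified, or exposed for the programmer to call.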


> In this context, when I looked for the extraction functionality, I
> stumbled over the methods getFooter()/getHeader().
> You return those from the document without a context. But there might be
> multiple headers/footers in a document:
> one pair for each master page style. Therefore you need a context, or
> your simplification is only a good guess that is sometimes wrong.
> Regards,
> Svante
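
To illustrate the point about one header/footer pair per master page,
here is a small Python sketch, standard library only. It assumes the
usual styles.xml layout, with style:header and style:footer as direct
children of each style:master-page. The function name
headers_and_footers and the sample markup are made up for illustration,
and variants such as style:header-left are not handled.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def headers_and_footers(styles_xml):
    """Map each master page name to its (header text, footer text).

    Hypothetical helper: a missing header or footer is reported as None.
    """
    root = ET.fromstring(styles_xml)
    result = {}
    for master in root.iter(f"{{{STYLE_NS}}}master-page"):
        name = master.get(f"{{{STYLE_NS}}}name")
        header = master.find(f"{{{STYLE_NS}}}header")
        footer = master.find(f"{{{STYLE_NS}}}footer")
        result[name] = (
            "".join(header.itertext()).strip() if header is not None else None,
            "".join(footer.itertext()).strip() if footer is not None else None,
        )
    return result


# Made-up styles.xml fragment with two master pages.
SAMPLE_STYLES = (
    '<office:document-styles'
    f' xmlns:office="{OFFICE_NS}" xmlns:style="{STYLE_NS}"'
    f' xmlns:text="{TEXT_NS}">'
    '<office:master-styles>'
    '<style:master-page style:name="Standard">'
    '<style:header><text:p>Report</text:p></style:header>'
    '<style:footer><text:p>Confidential</text:p></style:footer>'
    '</style:master-page>'
    '<style:master-page style:name="First_20_Page">'
    '<style:footer><text:p>Cover</text:p></style:footer>'
    '</style:master-page>'
    '</office:master-styles></office:document-styles>'
)

by_master = headers_and_footers(SAMPLE_STYLES)
```

Keying the result by master page name is one way to supply the missing
context; which master page applies to a given stretch of content still
has to be resolved through the paragraph and page styles.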
