incubator-odf-users mailing list archives

From Svante Schubert <>
Subject Re: Is there a way to extract text on a page basis from odt ?
Date Sun, 25 Sep 2011 07:57:51 GMT
On 24.09.2011 14:26, Rob Weir wrote:
> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <> wrote:
>> Hi,
>> I need to extract all text (headers, footers, comments, endnotes, etc.) from an
>> ODT document. I need to do it on a page-by-page basis. I'm aware that ODTs
>> are basically structured by paragraphs and headings, but I'd like to know if
>> there's a way to achieve what I need.
>> Thanks a lot.
> Good question.
> With WYSIWYG word processors, page numbers are calculated when the
> document is loaded, based on your currently configured printer, font
> metrics, etc.  So there is nothing at the level of the ODF markup that
> is a structural equivalent to a "page".  ODF is similar to HTML in
> this regard.  It has paragraphs, tables, etc., but line breaks and
> page breaks are calculated at runtime.
> However, starting in ODF 1.1, the format does allow an option for a
> word processor to save "soft" page breaks in the document.  This was
> intended to help with accessibility tools, screen readers, etc.  If
> your word processor supports this (and many do) then you can try
> looking for the <text:soft-page-break> element.  This would
> indicate where the pages broke in the word processor that last saved
> the document.  But there is no guarantee that every ODF document will
> have soft page breaks.
> So in theory you could walk the document, looking for
> <text:soft-page-break> and determine pages that way.
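The walk Rob describes can be sketched as follows. This is only a minimal illustration using the Python standard library, not the ODF Toolkit API; the function names split_pages and pages_from_odt are hypothetical. It assumes the document was saved with soft page breaks and that content.xml is not pretty-printed. It ignores <text:s>, <text:tab> and <text:line-break>, and it only sees the body text in content.xml, so headers and footers (which live in styles.xml) are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
# Paragraphs and headings end with a newline in the extracted text.
BLOCKS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def split_pages(content_xml):
    """Split the body text of a content.xml into pages (hypothetical helper).

    A new page starts at every <text:soft-page-break>, whether it sits
    between paragraphs or in the middle of one.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    if body is None:
        return []
    pages = [[]]

    def walk(elem):
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this point is on the next page
        if elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCKS:
            pages[-1].append("\n")

    walk(body)
    return ["".join(parts) for parts in pages]


def pages_from_odt(path_or_file):
    """Read content.xml from an .odt package and split it into pages."""
    with zipfile.ZipFile(path_or_file) as odt:
        return split_pages(odt.read("content.xml"))
```

If the document contains no soft page breaks, the whole body comes back as a single "page", which is the "no guarantee" case mentioned above.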
Rob has already described both the problem and the solution.
I would like to add a question: where should the functionality for
retrieving pages live, for instance if the questioner were willing to
provide a patch?
Certainly in the highest-level API, that is, the Simple API (or the
DOC API), since the two will be merged.

Daisy or Devin, you once implemented text extraction for the complete
document, right?
Is it also accessible via the Simple API? I could not find it.

In this context, while looking for the extraction functionality, I
stumbled over the methods getFooter()/getHeader().
You return these from the document without any context, but a document
may contain multiple headers and footers:
one pair for each master page style. You therefore need a context;
otherwise your simplification is only a good guess that is sometimes wrong.
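
To illustrate the point about one header/footer pair per master page, here is a minimal sketch that lists them from styles.xml. It uses only the Python standard library, not the Simple API; the function name headers_and_footers is hypothetical, and variants such as <style:header-left> and <style:footer-left> are ignored.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}
STYLE_NAME = "{urn:oasis:names:tc:opendocument:xmlns:style:1.0}name"


def headers_and_footers(styles_xml):
    """Map each master page name to its header and footer text.

    Hypothetical helper: a missing header or footer is reported as None.
    """
    root = ET.fromstring(styles_xml)
    result = {}
    for master in root.findall("office:master-styles/style:master-page", NS):
        entry = {}
        for kind in ("header", "footer"):
            node = master.find(f"style:{kind}", NS)
            entry[kind] = "".join(node.itertext()) if node is not None else None
        result[master.get(STYLE_NAME)] = entry
    return result
```

The result is keyed by master page name, so a caller has to say which master page it means, which is the context a document-level getHeader()/getFooter() lacks.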

