incubator-odf-users mailing list archives

From Rob Weir <robw...@apache.org>
Subject Re: Is there a way to extract text on a page basis from odt ?
Date Sat, 24 Sep 2011 12:26:08 GMT
On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <ramdkane@gmail.com> wrote:
> Hi,
>
> I need to extract all text (header, footer, comments, endnote, etc) from an
> ODT document. I need to do it on a page by page basis. I'm aware that ODTs
> are basically structured by paragraphs and headings, but I'd like to know if
> there's a way to achieve what i need.
>
> Thanks a lot.
>

Good question.

With WYSIWYG word processors, page numbers are calculated when the
document is loaded, based on your currently configured printer, font
metrics, etc.  So there is nothing at the level of the ODF markup that
is a structural equivalent to a "page".  ODF is similar to HTML in
this regard.  It has paragraphs, tables, etc., but line breaks and
page breaks are calculated at runtime.

However, starting in ODF 1.1, the format does allow an option for a
word processor to save "soft" page breaks in the document.  This was
intended to help with accessibility tools, screen readers, etc.  If
your word processor supports this (and many do) then you can try
looking for the <text:soft-page-break> element.  This would
indicate where the pages broke in the word processor that last saved
the document.  But there is no guarantee that every ODF document will
have soft page breaks.

So in theory you could walk the document in order, start a new page
each time you reach a <text:soft-page-break>, and determine pages
that way.
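
A minimal sketch of that walk in Python, using only the standard
library.  The function names are made up for illustration, and it
assumes the document was saved with soft page breaks.  It reads only
content.xml, so headers and footers (which live in styles.xml) are
not covered, and elements such as <text:s>, <text:tab> and
<text:line-break> are not expanded into whitespace.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def read_content_xml(odt_path):
    """Return the raw content.xml bytes from an .odt package (hypothetical helper)."""
    with zipfile.ZipFile(odt_path) as package:
        return package.read("content.xml")


def split_on_soft_page_breaks(content_xml):
    """Split body text into pages at each <text:soft-page-break> (hypothetical helper)."""
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    if body is None:
        return []

    pages = [[]]  # each page is a list of text fragments

    def walk(elem, in_block):
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this belongs to the next page
            return
        # Only collect character data inside paragraphs and headings,
        # so indentation whitespace between block elements is ignored.
        in_block = in_block or elem.tag in BLOCK_TAGS
        if in_block and elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child, in_block)
            if in_block and child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCK_TAGS:
            pages[-1].append("\n")  # separate paragraphs and headings

    walk(body, False)
    return ["".join(fragments).strip() for fragments in pages]
```

Usage would look something like
split_on_soft_page_breaks(read_content_xml("report.odt")), where
"report.odt" is a placeholder file name.  If the document contains no
soft page breaks, the result is a single "page" holding all the text.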

-Rob
