incubator-odf-users mailing list archives

From Ram Kane <>
Subject Re: Is there a way to extract text on a page basis from odt ?
Date Wed, 28 Sep 2011 15:41:59 GMT
I'm using Symphony 3 and LibreOffice 3.3.2. They both display the
document with the same overall structure. That is, page X has the same
footer, header, footnotes, comments and main text in both applications.

As you mention, I think my only chance for now is to try to understand
the underlying logic these applications use to render the document as
a series of pages.

On Tue, Sep 27, 2011 at 4:03 PM, Dennis E. Hamilton
<> wrote:
> I think the answer is you can't get there from here today, and it will
> be an unpredictable time before the answer would change.
>  - Dennis
> JUST FOR FUN, More questions:
> Where are you seeing what the pages are?
> That is, what are you looking at where you see what is page X, what is
> on page X, and what are those things that apply to it (headers, footers,
> notes, frames, tables, etc.). What do you have to say to go to page X
> directly and have it in view?
> It is important that the OpenDocument Format is not page oriented (in
> contrast with final forms like PDFs that are). I think you understand
> that from the APIs.
> It is some ODF Consumer that puts together the presentation you are
> looking at. There is no normative answer to those questions looking at
> the ODF format alone. It is pretty much all determined by an ODF
> Consumer. What Consumer are you using that you see the pages that you
> are interested in?
> For the time being, it appears that you need to rely on the
> programmability of that consumer, if any, to be able to derive
> page-relative actions, because you are interested in features of the
> rendered document, not the recorded format.
> Unless there is a simpler way of addressing a concrete case that could
> work well enough in the short term. (Mining PDFs might be better, but
> there might not be enough structure left. There are doubtless tools for
> working on PDFs that might address your problem.)
> -----Original Message-----
> From: [] On Behalf Of Ram Kane
> Sent: Monday, September 26, 2011 06:56
> To:
> Subject: Re: Is there a way to extract text on a page basis from odt ?
> Thanks all for the replies.
> > It seems best to revisit the problem statement and extract a
> > grounded case: What is the problem that needs to be solved;
> > what are the constraints on an acceptable solution.
> >
> > Ram, can you please say more about the problem you want to solve?
> > What would be the simplest-acceptable result?
> I need to extract content for a given page inside a doc. By content I
> mean header, footer, footnotes, comments, main text from body.
> I need to have the option of extracting each of these elements of the
> page separately (extracting header for page X, footer for page X, body
> text for page X) and not just getting all the content as a single
> string.
> I've uploaded a doc that I found in your SVN to use as an example here
> ->
> Using the example doc and assuming that I need to extract content for
> page 1, I'd need to extract:
>    _ header ("ODFDOM in a header")
>    _ footer ("ODFDOM in a footer")
>    _ footnotes for page ("ODFDOM in a footnote")
>    _ main text and all additional content in the page body (" ODFDOM
> in a title ODFDOM in a section header ODFDOM in paragraph1 ...")
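
For reference, the four kinds of content listed above can be pulled apart
from the file alone, but only for the document as a whole, not per page,
since the page breaks are decided by the consumer at render time. A minimal
sketch in Python (standard library only), assuming the usual package layout
with content.xml and styles.xml, headers and footers stored under
style:master-page, and footnotes stored as text:note elements with
text:note-class="footnote". The names extract_parts, flat_text and
body_blocks are made up for this illustration and are not part of ODFDOM.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
TEXT = "{%s}" % NS["text"]
NOTE_TAG = TEXT + "note"
BLOCK_TAGS = (TEXT + "p", TEXT + "h")


def flat_text(elem, skip_notes=False):
    """Join the character data under elem (hypothetical helper).

    Whitespace elements such as text:s and text:tab are not expanded here.
    """
    parts = [elem.text or ""]
    for child in elem:
        if not (skip_notes and child.tag == NOTE_TAG):
            parts.append(flat_text(child, skip_notes))
        parts.append(child.tail or "")
    return "".join(parts)


def body_blocks(elem):
    """Yield the text of each paragraph or heading, leaving footnotes out."""
    for child in elem:
        if child.tag == NOTE_TAG:
            continue
        if child.tag in BLOCK_TAGS:
            yield flat_text(child, skip_notes=True).strip()
        else:
            yield from body_blocks(child)


def extract_parts(odt):
    """Return header, footer, footnote and body text for the WHOLE document.

    odt is a path or a binary file-like object. Nothing here knows which
    page a given piece of text ends up on.
    """
    with zipfile.ZipFile(odt) as package:
        content = ET.fromstring(package.read("content.xml"))
        styles = ET.fromstring(package.read("styles.xml"))

    headers = [flat_text(e).strip()
               for e in styles.iterfind(".//style:master-page/style:header", NS)]
    footers = [flat_text(e).strip()
               for e in styles.iterfind(".//style:master-page/style:footer", NS)]

    footnotes = []
    for note in content.iter(NOTE_TAG):
        note_body = note.find("text:note-body", NS)
        if note.get(TEXT + "note-class") == "footnote" and note_body is not None:
            footnotes.append(flat_text(note_body).strip())

    office_text = content.find("office:body/office:text", NS)
    body = list(body_blocks(office_text)) if office_text is not None else []

    return {"header": headers, "footer": footers,
            "footnotes": footnotes, "body": body}
```

Calling extract_parts("example.odt") (a hypothetical file name) returns one
list per kind of content. Mapping any of it to "page X" still needs the
layout from a consumer such as LibreOffice or Symphony, as discussed above.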
