From: Ram Kane
Date: Mon, 26 Sep 2011 10:56:21 -0300
Subject: Re: Is there a way to extract text on a page basis from odt ?
To: odf-users@incubator.apache.org

Thanks all for the replies.

> It seems best to revisit the problem statement and extract a
> grounded case: What is the problem that needs to be solved;
> what are the constraints on an acceptable solutions.
>
> Ram, can you please say more about the problem you want to solve?
> What would be the simplest-acceptable result?

I need to extract the content of a given page inside a doc. By content I mean header, footer, footnotes, comments, and main text from the body. I need the option of extracting each of these elements of the page separately (header for page X, footer for page X, body text for page X), not just getting all the content as a single string.

I've uploaded a doc that I found on your svn to use as an example here -> http://goo.gl/OMIEw

Using the example doc and assuming that I need to extract content for page 1, I'd need to extract:

_ header ("ODFDOM in a header")
_ footer ("ODFDOM in a footer")
_ footnotes for page ("ODFDOM in a footnote")
_ main text and all additional content in the page body (" ODFDOM in a title ODFDOM in a section header ODFDOM in paragraph1 ..."