Message-ID: <27284134.post@talk.nabble.com>
Date: Fri, 22 Jan 2010 23:48:54 -0800 (PST)
From: MSB
To: user@poi.apache.org
Subject: Re: Extract Text with style/type information
In-Reply-To: <27275276.post@talk.nabble.com>

I think that my
previous post may have been a bit misleading, because when you are parsing the document you are getting Paragraph objects. So I think that the best thing to do is something like the following:

    if (paragraph instanceof ListEntry) {
        // Then we have a list, or at least an entry in a list.
    } else {
        // We can assume we have a Paragraph at this point, but I do not think
        // it is possible to further check the type. The only thing we can do
        // now is to see if the Paragraph is in a table cell. Check isInTable()
        // first, because getTable() throws an IllegalArgumentException for a
        // paragraph that is not in a table rather than returning null.
        if (paragraph.isInTable()) {
            // We are dealing with text that is in a table cell. Thus we have
            // found a table that can be dealt with here.
            Table table = docRange.getTable(paragraph);
        } else {
            // We are dealing with a paragraph of text only now.
        }
    }

Sadly, I have not been able to find the code I wrote that strips tables out 'in line', but I am sure it is on the list somewhere if you have a good search through. Something like the above will allow you to detect lists, tables and 'normal' paragraphs of text as they occur in the document. Images are going to present a different problem, I suspect, and I do not as yet know how to approach it.

One other aspect you may want to consider is sections. Typically, the way I process a Word document is:

Open the document.
Get the Range object for the document (which one depends upon whether I want to process the headers/footers or not).
Ask that Range how many Paragraph objects it contains.
Iterate through the Paragraphs one at a time.

It is possible - at least I think it is - to abstract this up one level, so:

Open the document.
Get the Range object for the document.
Ask it how many Sections it contains.
Iterate through the Sections and, for each:
    Ask it how many Paragraphs it contains.
    Iterate through the Paragraphs.

Sections contain some information that may/will be valuable to you, not least being the number of columns on the page.
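
The order of the checks, and the Section-then-Paragraph walk, can be sketched in a small self-contained program. This is only an illustration under stated assumptions: Paragraph, ListEntry and Section below are simplified stand-ins, not the real org.apache.poi.hwpf.usermodel classes. Only the method names isInTable(), numParagraphs(), getParagraph() and getNumColumns() are borrowed from HWPF; the Kind enum and the classify() and classifyAll() helpers are hypothetical names. With the real library the Sections and Paragraphs would come from the document's Range rather than being built by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphKinds {

    // Simplified stand-in for an HWPF Paragraph (NOT the real POI class).
    static class Paragraph {
        private final String text;
        private final boolean inTable;

        Paragraph(String text, boolean inTable) {
            this.text = text;
            this.inTable = inTable;
        }

        boolean isInTable() { return inTable; }
        String text() { return text; }
    }

    // Stand-in for ListEntry, which is a subclass of Paragraph in HWPF.
    static class ListEntry extends Paragraph {
        ListEntry(String text) { super(text, false); }
    }

    // Stand-in for a Section: a column count plus the paragraphs it holds.
    static class Section {
        private final int numColumns;
        private final List<Paragraph> paragraphs;

        Section(int numColumns, Paragraph... paragraphs) {
            this.numColumns = numColumns;
            this.paragraphs = Arrays.asList(paragraphs);
        }

        int getNumColumns() { return numColumns; }
        int numParagraphs() { return paragraphs.size(); }
        Paragraph getParagraph(int index) { return paragraphs.get(index); }
    }

    // Hypothetical labels for the three cases discussed in the post.
    enum Kind { LIST_ENTRY, TABLE_TEXT, PLAIN_TEXT }

    // Test the subclass first, then fall back to the in-table flag.
    static Kind classify(Paragraph paragraph) {
        if (paragraph instanceof ListEntry) {
            return Kind.LIST_ENTRY;
        }
        if (paragraph.isInTable()) {
            return Kind.TABLE_TEXT;
        }
        return Kind.PLAIN_TEXT;
    }

    // Walk the sections, then the paragraphs in each, in document order.
    static List<Kind> classifyAll(List<Section> sections) {
        List<Kind> kinds = new ArrayList<>();
        for (Section section : sections) {
            for (int i = 0; i < section.numParagraphs(); i++) {
                kinds.add(classify(section.getParagraph(i)));
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        List<Section> sections = Arrays.asList(
            new Section(1, new Paragraph("Intro", false), new ListEntry("First item")),
            new Section(2, new Paragraph("Cell text", true)));

        for (Section section : sections) {
            System.out.println("Section with " + section.getNumColumns() + " column(s)");
            for (int i = 0; i < section.numParagraphs(); i++) {
                Paragraph p = section.getParagraph(i);
                System.out.println("  " + classify(p) + ": " + p.text());
            }
        }
    }
}
```

The instanceof test comes first because a ListEntry is also a Paragraph, so testing the more general case first would never reach the list branch.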

Yours

Mark B


markl16 wrote:
> 
> Yep, I think you were on to something there. I tried:
> [code]
> if(paragraph instanceof ListEntry)
> {
>     System.out.println("true");
> }
> else
> {
>     System.out.println("false");
> }
> [/code]
> Which seemed to work. I'll do some more research and see whether a similar
> solution works for all the tags I want.
> 
> Best
> Mark
> 
> 
> MSB wrote:
>> 
>> I am hoping that it really is this simple but I cannot be too sure that
>> it really will be. The org.apache.poi.hwpf.usermodel.Range class is the
>> parent class for CharacterRun, DocumentPosition, Paragraph, Section,
>> Table and TableCell, whilst Paragraph is the parent of ListEntry. I have
>> never tried this but could it be as simple as using instanceof to test
>> what class you actually had in hand whilst parsing the document? It
>> should be easy enough to test this hypothesis;
>> 
>> Open a document.
>> Get the top level Range object.
>> Get the number of Paragraphs.
>> Iterate through the Paragraphs one at a time and test to see what object
>> you actually have in hand.
>> 
>> There are going to be one or two holes in this - I think that it will not
>> deal with pictures for example - but it could well be a way to start.
>> 
>> Yours
>> 
>> Mark B
>> 
> 

-- 
View this message in context: http://old.nabble.com/Extract-Text-with-style-type-information-tp27209960p27284134.html