Date: Thu, 9 Feb 2006 13:43:42 +0000 (GMT)
From: Nick Burch
To: POI Users List
Subject: Best way to extract text from a word file

Hi All

I'm thinking about adding a simple text extractor utility to hwpf, since
everyone is currently rolling their own, and that's not very programmer
efficient!

When I get text out, I normally use something like:

    StringBuffer text = new StringBuffer();
    Range r = wdoc.getRange();
    for (int i = 0; i < r.numParagraphs(); i++) {
        Paragraph p = r.getParagraph(i);
        text.append(p.text());
    }

However, I've also seen people advocate an approach like:

    StringBuffer text = new StringBuffer();
    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }
        text.append(new String(piece.getRawBytes(), encoding));
    }

(normally accompanied by some stripping out of macros)

Is there any reason why I shouldn't use the first version?

Nick
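
For anyone wanting to try the second approach's decoding step without POI on the classpath, here is a minimal sketch. It only illustrates the "Cp1252 unless the piece is flagged as Unicode, in which case UTF-16LE" rule from the snippet above. The `Piece` class and the names `PieceDecodingSketch`, `decode` and `extract` are made up for illustration and are not part of HWPF; real code would read the bytes and the flag from `TextPiece` instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PieceDecodingSketch {

    /** Hypothetical stand-in for HWPF's TextPiece: raw bytes plus a unicode flag. */
    static final class Piece {
        final byte[] rawBytes;
        final boolean unicode;

        Piece(byte[] rawBytes, boolean unicode) {
            this.rawBytes = rawBytes;
            this.unicode = unicode;
        }
    }

    // "Cp1252" is the legacy Java name for the windows-1252 charset.
    static final Charset CP1252 = Charset.forName("Cp1252");

    /** Decode one piece: UTF-16LE if flagged as unicode, otherwise Cp1252. */
    static String decode(Piece piece) {
        Charset charset = piece.unicode ? StandardCharsets.UTF_16LE : CP1252;
        return new String(piece.rawBytes, charset);
    }

    /** Concatenate the decoded text of all pieces, in order. */
    static String extract(List<Piece> pieces) {
        StringBuilder text = new StringBuilder();
        for (Piece piece : pieces) {
            text.append(decode(piece));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Piece> pieces = new ArrayList<>();
        // "Hi " as single-byte Cp1252
        pieces.add(new Piece(new byte[] {0x48, 0x69, 0x20}, false));
        // "Ni" as two-byte little-endian UTF-16
        pieces.add(new Piece(new byte[] {0x4E, 0x00, 0x69, 0x00}, true));
        System.out.println(extract(pieces)); // prints "Hi Ni"
    }
}
```

Passing a `Charset` rather than an encoding name avoids the checked `UnsupportedEncodingException` that `new String(bytes, "Cp1252")` would otherwise require handling. The sketch does nothing about stripping macros or field codes.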