Date: Thu, 9 Feb 2006 13:43:42 +0000 (GMT)
From: Nick Burch <nick@torchbox.com>
To: POI Users List <poi-user@jakarta.apache.org>
Subject: Best way to extract text from a word file

Hi All

I'm thinking about adding a simple text extractor utility to HWPF, since
everyone currently rolls their own, which isn't a very efficient use of
programmer time!

When I get text out, I normally use something like:
    StringBuffer text = new StringBuffer();
    Range r = wdoc.getRange();
    for (int i = 0; i < r.numParagraphs(); i++) {
        Paragraph p = r.getParagraph(i);
        text.append(p.text());
    }

However, I've also seen people advocate an approach like:
    StringBuffer text = new StringBuffer();
    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }
        text.append(new String(piece.getRawBytes(), encoding));
    }
(normally accompanied by some stripping out of macros)
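
For anyone who wants to try the encoding side of that second approach
without a Word file to hand, here is a rough standalone sketch of just
the per-piece decoding step. It uses only the JDK and not POI: the class
TextPieceDecodingSketch and its decode() method are made-up names for
illustration, and the usesUnicode flag simply stands in for whatever
piece.usesUnicode() would return.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: it isolates the encoding choice
// made for each text piece in the second approach.
class TextPieceDecodingSketch {
    // "windows-1252" is the canonical Java name for the Cp1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes one piece's raw bytes. The usesUnicode flag is an assumption
    // standing in for the value of piece.usesUnicode().
    static String decode(byte[] rawBytes, boolean usesUnicode) {
        Charset charset = usesUnicode ? StandardCharsets.UTF_16LE : CP1252;
        return new String(rawBytes, charset);
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        // "Hi " stored one byte per character (Cp1252).
        text.append(decode(new byte[] {0x48, 0x69, 0x20}, false));
        // U+00E9 stored as UTF-16LE, low byte first.
        text.append(decode(new byte[] {(byte) 0xE9, 0x00}, true));
        System.out.println(text);
    }
}
```

Passing a Charset object rather than an encoding name also avoids the
checked UnsupportedEncodingException that new String(byte[], String)
declares.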

Is there any reason why I shouldn't use the first version?

Nick
