poi-user mailing list archives

From Nick Burch <n...@torchbox.com>
Subject Best way to extract text from a word file
Date Thu, 09 Feb 2006 13:43:42 GMT
Hi All

I'm thinking about adding a simple text extractor utility to hwpf, since 
everyone is currently rolling their own, and that's not a very 
efficient use of programmer time!

When I get text out, I normally use something like:
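    // wdoc is assumed to be an already-opened HWPFDocument
    StringBuffer text = new StringBuffer();
    Range r = wdoc.getRange();
    // Walk the paragraphs in order, appending each one's text
    for (int i = 0; i < r.numParagraphs(); i++) {
        Paragraph p = r.getParagraph(i);
        text.append(p.text());
    }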

However, I've also seen people advocate an approach like:
    StringBuffer text = new StringBuffer();
    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        // Pieces are either single-byte (Cp1252) or 16-bit unicode
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }
        // new String(byte[], String) can throw UnsupportedEncodingException
        text.append(new String(piece.getRawBytes(), encoding));
    }
(normally accompanied by some stripping out of macros)
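
To make the encoding choice in that second version concrete, here is a minimal standalone sketch that uses only the JDK and no hwpf classes. The PieceDecoder class and its decode method are made-up names for illustration, and the byte arrays are hand-built rather than read from a real document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PieceDecoder {
    // Single-byte Windows code page, the same "Cp1252" name used above.
    private static final Charset CP1252 = Charset.forName("Cp1252");

    /**
     * Hypothetical helper (not part of hwpf): decodes the raw bytes of one
     * piece, picking the charset from a "uses unicode" flag.
     */
    public static String decode(byte[] rawBytes, boolean usesUnicode) {
        Charset charset = usesUnicode ? StandardCharsets.UTF_16LE : CP1252;
        return new String(rawBytes, charset);
    }

    public static void main(String[] args) {
        // The same four characters, stored one byte per character...
        byte[] singleByte = {0x63, 0x61, 0x66, (byte) 0xE9};
        // ...and stored as little-endian 16-bit code units.
        byte[] twoByte = {0x63, 0x00, 0x61, 0x00, 0x66, 0x00, (byte) 0xE9, 0x00};

        String a = decode(singleByte, false);
        String b = decode(twoByte, true);
        System.out.println(a.equals(b)); // both decode to the same string
    }
}
```

Decoding with the wrong flag would not fail; it would silently produce garbage characters, which is why the per-piece check matters in that approach.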

Is there any reason why I shouldn't use the first version?

Nick


