lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christiaan Fluit <christiaan.fl...@aduna.biz>
Subject Re: Word files & Build vs. Buy?
Date Thu, 09 Feb 2006 13:12:07 GMT
Nick Burch wrote:
> You could try using org.apache.poi.hwpf.HWPFDocument, and getting the 
> range, then the paragraphs, and grab the text from each paragraph. If 
> there's interest, I could probably commit an extractor that does this to 
> poi.

Yes, that's exactly what I'm doing. Having this in POI would benefit me 
a lot though, as I hardly understand the POI basics to be honest (my 
fault, not POI's).

This is my current code (adapted from Aperture code in CVS):

HWPFDocument doc = new HWPFDocument(poiFileSystem);
StringBuffer buffer = new StringBuffer(4096);

Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
while (textPieces.hasNext()) {
	TextPiece piece = (TextPiece) textPieces.next();

	// the following is derived from
	// http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
	String encoding = "Cp1252";
	if (piece.usesUnicode()) {
		encoding = "UTF-16LE";
	}

	buffer.append(new String(piece.getRawBytes(), encoding));
}

// normalize end-of-line characters and remove any lines
// containing macros
BufferedReader reader = new BufferedReader(new
     StringReader(buffer.toString()));
buffer.setLength(0);

String line;
while ((line = reader.readLine()) != null) {
	if (line.indexOf("DOCPROPERTY") == -1) {
		buffer.append(line);
		buffer.append(END_OF_LINE);
	}
}

// fetch the extracted full-text
String text = buffer.toString();


Regards,

Chris
--

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message