lucene-java-user mailing list archives

From Nick Burch
Subject Re: Word files & Build vs. Buy?
Date Thu, 09 Feb 2006 12:29:00 GMT
On Thu, 9 Feb 2006, Christiaan Fluit wrote:
> My experience is that the WordDocument class crashes on about 25% of the 
> documents, i.e. it throws some sort of Exception. I've tested POI 
> 2.5.1-final as well as the current code in CVS, but both produce this 
> result. I even suspect the output to be 100% the same, but I haven't 
> verified this.

You could try using org.apache.poi.hwpf.HWPFDocument, and getting the 
range, then the paragraphs, and grab the text from each paragraph. If 
there's interest, I could probably commit an extractor that does this to 

(WordDocument is from the hdf package, which is older and less reliable 
than the current hwpf stuff)
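
As a rough sketch of the HWPFDocument approach described above (open the document, get the range, walk the paragraphs, take the text of each), something like the following might work. The class name HwpfTextSketch and the helper stripParagraphMark are made up for illustration and are not part of POI. It also assumes a POI build with the hwpf classes on the classpath and a reasonably modern JDK; treat it as a starting point, not a tested extractor.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;

public class HwpfTextSketch {

    // Hypothetical helper (not a POI method): drops a single trailing
    // paragraph mark ('\r'), which HWPF paragraph text typically ends with.
    static String stripParagraphMark(String text) {
        return text.endsWith("\r") ? text.substring(0, text.length() - 1) : text;
    }

    // Reads a Word document from the stream and returns its text,
    // one line per paragraph.
    static String extractText(InputStream in) throws IOException {
        HWPFDocument doc = new HWPFDocument(in);
        Range range = doc.getRange();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < range.numParagraphs(); i++) {
            String paragraphText = range.getParagraph(i).text();
            out.append(stripParagraphMark(paragraphText)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Documents that HWPF cannot parse will still throw, so when indexing a large batch it is probably worth catching exceptions per document and logging the failures.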

> Another reason I don't like this class is that it operates on an 
> InputStream and internally creates a POIFSFileSystem which you cannot 
> access, so that it becomes hard to extract document metadata as well 
> (for which you need the POIFSFileSystem) without buffering the entire InputStream.

If you're using HWPFDocument from cvs, then you can create that from a 

