poi-user mailing list archives

From Leila Homaeian <le...@cs.ualberta.ca>
Subject Extract pure text from MS Word documents
Date Mon, 08 Jan 2007 21:12:58 GMT

I am using the org.apache.poi.hwpf.extractor.WordExtractor class to 
extract the text from MS Word documents. The problem is that the output 
includes not only the text of interest but also keywords that indicate 
the text format, e.g. TOC, HYPERLINK, REF, etc. Is there any way to 
recognize and exclude these keywords?

I used the getIstd() method of org.apache.poi.hwpf.model.PAPX to 
access the sti codes of individual paragraphs. However, I did not find a 
comparable class or method that can be applied to individual words.
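
To make the goal concrete, here is a rough sketch of the kind of filtering 
I have in mind. It assumes (not verified against every document) that 
keywords such as TOC, HYPERLINK and REF are Word field instructions, and 
that the extracted string still contains the field marker characters used 
by the binary format: 0x13 (field begin), 0x14 (separator between the 
instruction and its displayed result) and 0x15 (field end). The class and 
method names are hypothetical and not part of POI.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a POI class: removes field instructions
// (e.g. HYPERLINK "...") and keeps only the displayed field results.
final class FieldStripper {
    // Assumed field marker characters in the extracted text.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still inside its
        // instruction part, FALSE once its separator has been seen.
        Deque<Boolean> open = new ArrayDeque<>();
        int inInstruction = 0; // open fields still in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                inInstruction++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result.
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    inInstruction--;
                }
            } else if (c == FIELD_END) {
                // A field with no separator ends while still in its instruction.
                if (!open.isEmpty() && open.pop()) {
                    inInstruction--;
                }
            } else if (inInstruction == 0) {
                // Ordinary text, or text inside a field result.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The string returned by WordExtractor would be passed through 
stripFieldCodes() before further processing. Nested fields are handled 
with a stack, so a field that appears inside another field's result is 
filtered as well. If the marker characters are not present in the 
extracted text, this approach does not apply.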

Any help is much appreciated.


