poi-user mailing list archives

From Nick Burch <n...@torchbox.com>
Subject Re: Extract pure text from MS Word documents
Date Tue, 09 Jan 2007 10:55:35 GMT
On Mon, 8 Jan 2007, Leila Homaeian wrote:
> I am using the org.apache.poi.hwpf.extractor.WordExtractor class to 
> extract the text from MS Word documents. The problem is that the output 
> includes not only the text of interest, but also some keywords 
> indicating the text format, e.g. TOC, HYPERLINK, REF, etc. Is there 
> any way to recognize and exclude these keywords?

In theory, there ought to be. The trouble is that the person who wrote 
most of HWPF, Ryan Ackley, left to work for a firm that licensed the 
Microsoft file format documentation, so we no longer have an expert on the 
Word file format.

If you can figure out how to identify these blocks of text, we'd love a 
patch!
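
As a starting point, here is a minimal sketch of one possible way to identify 
these blocks. It assumes the extracted text still carries Word's field 
marker characters: 0x13 opens a field, 0x14 separates the field code (TOC, 
HYPERLINK, REF, ...) from the displayed result, and 0x15 closes the field. 
Whether WordExtractor's output actually preserves these characters is an 
assumption to check against your documents. FieldCodeStripper and 
stripFieldCodes are hypothetical names, not part of POI.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI: drops field codes and keeps field results.
final class FieldCodeStripper {
    // Assumption: these control characters survive in the extracted text.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still in its code part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        // Number of open fields whose code part we are currently inside.
        int codeDepth = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from its code part to its result part.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // A field may end without a separator, so it can still be in its code part.
                if (!openFields.isEmpty() && openFields.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                // Ordinary text, or the result part of a field: keep it.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Passing the extractor's text through stripFieldCodes would keep the visible 
link or reference text and drop the keyword and its arguments. Nested fields 
are handled by the stack; unbalanced markers are simply ignored.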

> I used the getIstd() function from org.apache.poi.hwpf.model.PAPX to 
> access the sti codes of individual paragraphs. However, I did not find a 
> similar class or function that can be applied to individual words.

A paragraph is made up of a number of CharacterRuns. You could try looking 
for a similar function on CharacterRun?

Nick
