poi-user mailing list archives

From Rainer Schwarze <...@admadic.de>
Subject Re: about HWPF
Date Tue, 01 Apr 2008 20:16:21 GMT
teena21 wrote:
> Hi all,
> 
> I am using HWPF to extract data from a Word file.
> So far I extract only the plain text.
> I am not able to extract the properties of a particular word or character, such as
> bold, italic, or font name.
> 
> When I iterate over the CharacterRuns, I get font properties only where
> they (the properties) change.
> From this I cannot tell which word is bold and which is not.
> 
> Please help me extract the text together with its properties.
> 

Hi,

you can only retrieve the formatting information for a specific 
character or word by finding the CharacterRun(s) which contain it and 
then retrieving their properties. Word files contain formatting 
information in "layers". A paragraph may be bold by default and the text 
within it may have specific formatting which turns "bold" off again. 
CharacterRun takes care of these layers and delivers the final formatting.

To retrieve the formatting for specific words, I would suggest 
identifying the position of the word in the document's text. For 
instance, in document content "abc def", the word "def" is at 4-7 
(counting starts at 0; the end is just after the last character of the 
word). Now walk through the list of CharacterRuns and find all whose 
range intersects the interval of the word. If you are lucky, it's only 
one CharacterRun; it gets more complicated when several match.
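
A minimal sketch of that walk, in plain Java. RunSpan and RunFinder are 
hypothetical stand-ins, not HWPF classes; with HWPF the offsets would 
presumably come from getStartOffset() and getEndOffset() on each 
CharacterRun, so check how your POI version reports them. Intervals are 
treated as half-open, as in the "abc def" example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharacterRun: only its character offsets.
final class RunSpan {
    final int start; // inclusive
    final int end;   // exclusive
    RunSpan(int start, int end) { this.start = start; this.end = end; }
}

final class RunFinder {
    // Returns {start, end} of the first occurrence of word, or null if absent.
    static int[] locate(String text, String word) {
        int start = text.indexOf(word);
        return start < 0 ? null : new int[] { start, start + word.length() };
    }

    // Half-open intervals [a,b) and [c,d) intersect when a < d and c < b.
    static List<RunSpan> intersecting(List<RunSpan> runs, int wordStart, int wordEnd) {
        List<RunSpan> hits = new ArrayList<>();
        for (RunSpan r : runs) {
            if (r.start < wordEnd && wordStart < r.end) {
                hits.add(r);
            }
        }
        return hits;
    }
}
```

For "abc def", locate(text, "def") gives 4 and 7, and only runs that 
overlap [4, 7) are returned.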

For instance, "def" could be formatted bold with only 'e' italic. Then 
you get three CharacterRuns intersecting the word interval. If every 
intersecting CharacterRun reports isBold()==true, then the word is 
completely bold.
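
The "def" case can be sketched like this. StyledRun and WordStyle are 
hypothetical names used for illustration only; the bold and italic 
fields stand in for what isBold() and isItalic() would report on a real 
CharacterRun.

```java
import java.util.List;

// Hypothetical stand-in for a CharacterRun with its final formatting.
final class StyledRun {
    final int start;      // inclusive
    final int end;        // exclusive
    final boolean bold;
    final boolean italic;
    StyledRun(int start, int end, boolean bold, boolean italic) {
        this.start = start; this.end = end; this.bold = bold; this.italic = italic;
    }
}

final class WordStyle {
    // True only if at least one run intersects the word and all of them are bold.
    static boolean isCompletelyBold(List<StyledRun> runs, int wordStart, int wordEnd) {
        boolean sawRun = false;
        for (StyledRun r : runs) {
            if (r.start < wordEnd && wordStart < r.end) {
                sawRun = true;
                if (!r.bold) return false;
            }
        }
        return sawRun;
    }

    // Same check for italic.
    static boolean isCompletelyItalic(List<StyledRun> runs, int wordStart, int wordEnd) {
        boolean sawRun = false;
        for (StyledRun r : runs) {
            if (r.start < wordEnd && wordStart < r.end) {
                sawRun = true;
                if (!r.italic) return false;
            }
        }
        return sawRun;
    }
}
```

With runs [4,5) bold, [5,6) bold and italic, [6,7) bold, the word at 4-7 
is completely bold but not completely italic.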

Beware of CharacterRuns which have an interval outside of the text range 
and also beware of CharacterRuns with length 0. I've encountered both in 
various Word files.
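
One possible way to guard against both cases before matching, as a rough 
sketch. Span and RunSanitizer are hypothetical names, and clamping runs 
to the text length (rather than discarding them) is my assumption about 
what is safest, not something HWPF does for you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharacterRun's character offsets.
final class Span {
    final int start; // inclusive
    final int end;   // exclusive
    Span(int start, int end) { this.start = start; this.end = end; }
}

final class RunSanitizer {
    // Clamps each run to [0, textLength) and drops runs that end up empty,
    // which covers zero-length runs and runs lying entirely outside the text.
    static List<Span> sanitize(List<Span> runs, int textLength) {
        List<Span> out = new ArrayList<>();
        for (Span r : runs) {
            int s = Math.max(0, r.start);
            int e = Math.min(textLength, r.end);
            if (s < e) {
                out.add(new Span(s, e));
            }
        }
        return out;
    }
}
```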

Let me know if you need more information :-)

Best wishes, Rainer

