poi-user mailing list archives

From markl16 <mwilliam...@tssg.org>
Subject Re: Extract Text with style/type information
Date Tue, 19 Jan 2010 09:50:44 GMT

At the moment I am just targeting HWPF to read and parse MS Word
documents, with the goal of transforming each document into XML. Ideally I
would like to do this line by line, from top to bottom of the document, so
that the resulting XML structure is similar to the original document.

I have come across the StyleDescription class, which can give certain style
information such as headings and normal text, and the cast ListEntry
aListEntry = (ListEntry) p; which works if the text is a bullet point but
throws an error if the text is not bulleted. I can see the code getting
quite messy if I have to try and catch an error whenever I am testing for
tables, lists etc. I wonder whether there is a way to test whether a
paragraph is of type list, and whether it is opening or closing the list,
or something like that.
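
One way to avoid the try/catch is to test the type with instanceof before
casting, and to detect where a list opens or closes by watching for the
change between list and non-list paragraphs while walking the document top
to bottom. The sketch below is only an illustration and makes some
assumptions: Para and ListPara are hypothetical stand-ins for HWPF's
Paragraph and ListEntry (so the example runs without POI on the classpath),
and the element names doc, list, item and p are made up. It also assumes
that list paragraphs come back as a ListEntry subclass, which the cast
succeeding for bulleted text suggests. Nested list levels are not handled.

```java
import java.util.List;

// Stand-in types, NOT the real POI classes: Para models an HWPF paragraph,
// ListPara models the subclass that list paragraphs are assumed to be.
class Para {
    final String text;
    Para(String text) { this.text = text; }
}

class ListPara extends Para {
    ListPara(String text) { super(text); }
}

class ListXmlSketch {
    // Walks paragraphs in document order, opening <list> when a run of
    // list paragraphs starts and closing it when the run ends.
    static String toXml(List<Para> paragraphs) {
        StringBuilder xml = new StringBuilder("<doc>");
        boolean inList = false;
        for (Para p : paragraphs) {
            // Type test instead of cast-and-catch: no exception is thrown.
            boolean isListItem = p instanceof ListPara;
            if (isListItem && !inList) xml.append("<list>");   // list opens
            if (!isListItem && inList) xml.append("</list>");  // list closes
            inList = isListItem;
            xml.append(isListItem ? "<item>" : "<p>")
               .append(escape(p.text))
               .append(isListItem ? "</item>" : "</p>");
        }
        if (inList) xml.append("</list>"); // document ended inside a list
        return xml.append("</doc>").toString();
    }

    // Escapes the characters that would otherwise break the XML output.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(toXml(List.of(
                new Para("Intro"),
                new ListPara("one"),
                new ListPara("two"),
                new Para("End"))));
    }
}
```

With real HWPF objects the equivalent test would presumably be
p instanceof ListEntry inside the loop over the range's paragraphs; the
same pattern of tracking a boolean could be reused for tables by asking
each paragraph whether it sits in a table.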

As for developing HWPF, I'd be very grateful if I could develop those
features :)

Best
Mark




MSB wrote:
> 
> Not easily, no. By this, I mean that there is no method you can call to
> say, for example, print out all of the information about this section of
> the document.
> 
> But, you can get at detailed information by digging around a little in the
> various methods but a lot does depend on exactly how you want to process
> the document. It is possible for example to get at all of the tables in
> the document or all of the pictures but these method calls remove some of
> the context; you cannot tell what comes before or after the picture/table
> for example. If you have a good search through the posts in the list, you
> will be able to find some code we put together that allows you to get at
> the tables - just for an example - as they occur in the document; it is
> simply a matter of asking whether the Paragraph object appeared in a table
> cell or not.
> 
> If you can be more precise about exactly what information you want
> printing out about each different type of object then it may be possible
> to give you a better answer. Further, it is important to know which type
> of file you are targeting - binary (.doc) or OpenXML (.docx) - as HWPF and
> XWPF have different capabilities. Finally, you do need to be aware that
> HWPF in particular is still a very immature API that is in need of a lot
> of development; if you would be willing to undertake that work and develop
> those areas that you require, I am certain that there will be a lot of
> grateful users.
> 
> Yours
> 
> Mark B
> 
> 
> markl16 wrote:
>> 
>> Hi everyone,
>> 
>> I'm just researching Apache POI at the moment. I have done some simple
>> Java programs, reading in a Word document and printing out the text etc.
>> 
>> I'm just wondering whether it is possible to get style information for
>> each paragraph in the Word document, such as POI printing out whether the
>> paragraph is a title header, a list of bullet points, an image, a table
>> etc. I have come across range.getCharacterRun(), which can provide some
>> info such as font type, but I'm looking for more detailed information as
>> mentioned above.
>> 
>> Any feedback appreciated.
>> 
>> Best
>> Mark
>> 
>> 
>> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Extract-Text-with-style-type-information-tp27209960p27222890.html
Sent from the POI - User mailing list archive at Nabble.com.

