poi-user mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: Extract Text with style/type information
Date Tue, 19 Jan 2010 16:32:23 GMT

I am hoping that it really is this simple but I cannot be too sure that it
really will be. The org.apache.poi.hwpf.usermodel.Range class is the parent
class for CharacterRun, DocumentPosition, Paragraph, Section, Table and
TableCell, whilst Paragraph is the parent of ListEntry. I have never tried
this, but it could be as simple as using instanceof to test which class you
actually have in hand whilst parsing the document. It should be easy enough
to test this hypothesis:

Open a document.
Get the top level Range object.
Get the number of Paragraphs.
Iterate through the Paragraphs one at a time and test to see what object you
actually have in hand.
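
The steps above can be sketched roughly as follows. This is untested against
real documents and does not use the POI jar at all: Range, Paragraph and
ListEntry below are hypothetical stand-ins that only mirror the parent/child
relationship described above, so that the instanceof test can be shown
without the ClassCastException an unconditional cast would risk. The names
kindOf and classify are made up for illustration. Whether HWPF's
range.getParagraph(i) really hands back a ListEntry for list paragraphs is
exactly the hypothesis that still needs checking.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphKindSketch {

    // Hypothetical stand-in for the top level Range (NOT the real POI class).
    static class Range {
        private final List<Paragraph> paragraphs = new ArrayList<>();
        void add(Paragraph p) { paragraphs.add(p); }
        int numParagraphs() { return paragraphs.size(); }
        Paragraph getParagraph(int index) { return paragraphs.get(index); }
    }

    // Stand-in Paragraph; the inTable flag models "did this paragraph
    // appear in a table cell or not".
    static class Paragraph extends Range {
        private final String text;
        private final boolean inTable;
        Paragraph(String text, boolean inTable) {
            this.text = text;
            this.inTable = inTable;
        }
        String text() { return text; }
        boolean isInTable() { return inTable; }
    }

    // Stand-in ListEntry, a child of Paragraph as described above.
    static class ListEntry extends Paragraph {
        ListEntry(String text) { super(text, false); }
    }

    // Test the runtime class first; a cast to ListEntry is only safe
    // after instanceof has confirmed the type.
    static String kindOf(Paragraph p) {
        if (p instanceof ListEntry) {
            return "list";
        }
        if (p.isInTable()) {
            return "table";
        }
        return "paragraph";
    }

    // Walk the paragraphs in document order so the context is preserved.
    static List<String> classify(Range range) {
        List<String> kinds = new ArrayList<>();
        for (int i = 0; i < range.numParagraphs(); i++) {
            kinds.add(kindOf(range.getParagraph(i)));
        }
        return kinds;
    }

    public static void main(String[] args) {
        Range doc = new Range();
        doc.add(new Paragraph("Introduction", false));
        doc.add(new ListEntry("First bullet"));
        doc.add(new Paragraph("Cell text", true));
        System.out.println(classify(doc)); // prints [paragraph, list, table]
    }
}
```

Because the paragraphs are visited in order, a change from "list" to
something else between two neighbouring paragraphs could serve as a crude
marker for where a list opens or closes.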

There are going to be one or two holes in this - I think that it will not
deal with pictures for example - but it could well be a way to start.


Mark B

markl16 wrote:
> At the moment I am just targeting HWPF to read and parse MS Word
> documents, with the goal of transforming the document into XML. Ideally I
> would like to do this line by line from top to bottom in the document so
> that the resulting XML structure is similar to the original document.
> I have come across the StyleDescription class, which can give certain style
> information such as headings and normal text, and ListEntry aListEntry =
> (ListEntry) p; which works if the text is bullet points but throws an
> error if the text is not bulleted. I can see the code getting quite messy
> if I have to try and catch an error whenever I'm testing for tables, lists
> etc. I wonder whether there is a way to test whether a paragraph is of type
> list, and whether it opens or closes the list, or something like that.
> As for developing HWPF, I'd be very grateful if I could develop those
> features :)
> Best
> Mark
> MSB wrote:
>> Not easily, no. By this, I mean that there is no method you can call to
>> say, for example, print out all of the information about this section of
>> the document.
>> But you can get at detailed information by digging around a little in
>> the various methods, although a lot does depend on exactly how you want to
>> process the document. It is possible, for example, to get at all of the
>> tables in the document or all of the pictures but these method calls
>> remove some of the context; you cannot tell what comes before or after
>> the picture/table for example. If you have a good search through the
>> posts in the list, you will be able to find some code we put together
>> that allows you to get at the tables - just for an example - as they
>> occur in the document; it is simply a matter of asking whether the
>> Paragraph object appeared in a table cell or not.
>> If you can be more precise about exactly what information you want
>> printing out about each different type of object then it may be possible
>> to give you a better answer. Further, it is important to know which type
>> of file you are targeting - binary (.doc) or OpenXML (.docx) - as HWPF
>> and XWPF have different capabilities. Finally, you do need to be aware
>> that HWPF in particular is still a very immature API that is in need of a
>> lot of development; if you would be willing to undertake that work and
>> develop those areas that you require, I am certain that there will be a
>> lot of grateful users.
>> Yours
>> Mark B
>> markl16 wrote:
>>> Hi everyone,
>>> I'm just researching Apache POI at the moment. I have done some simple
>>> Java programs, reading in a Word document and printing out the text etc.
>>> I'm just wondering whether it is possible to get style information for
>>> each paragraph in the Word document, such as POI printing out whether the
>>> paragraph is a title header, a list of bullet points, an image, a table
>>> etc. I have come across range.getCharacterRun(), which can provide some
>>> info such as font type, but I'm looking for more detailed information as
>>> mentioned above.
>>> Any feedback appreciated.
>>> Best
>>> Mark

View this message in context: http://old.nabble.com/Extract-Text-with-style-type-information-tp27209960p27228585.html
Sent from the POI - User mailing list archive at Nabble.com.
