poi-user mailing list archives

From markl16 <mwilliam...@tssg.org>
Subject Re: Extract Text with style/type information
Date Tue, 26 Jan 2010 13:18:11 GMT

During my research into HWPF I now suddenly need to look into XWPF as well: I
have to read in a docx file, go through it line by line and get the style
info, so that an alternative custom XML document can be created based on the
tags in the original docx file.

I am just wondering whether XWPF offers features similar to those of HWPF
discussed in this thread so far. I have read in a simple docx file and printed
out the text, but I don't see many methods for getting details on style etc.
Would it be worthwhile investigating further?
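
One possible fallback for getting at the style tags is to read them straight from the docx package rather than through XWPF. The sketch below is not XWPF code and makes a few assumptions: it uses only the JDK, and it relies on the WordprocessingML layout in which a docx file is a zip archive whose main part is word/document.xml, with each paragraph stored as a w:p element that may carry a w:pStyle. The class name DocxStyleDump, the method names and the "(default)" placeholder are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists "styleId<TAB>text" for each paragraph in a docx.
final class DocxStyleDump {
    // Namespace of the w: prefix in WordprocessingML.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml inside the zip and hands it to collect().
    static List<String> paragraphStyles(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot
                // close the underlying zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document dom = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
                return collect(dom);
            }
        }
        throw new IOException("word/document.xml not found");
    }

    // Walks every w:p element, including paragraphs inside table cells.
    static List<String> collect(Document dom) {
        List<String> out = new ArrayList<>();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            // The style ID lives in <w:pPr><w:pStyle w:val="..."/></w:pPr>.
            NodeList styles = p.getElementsByTagNameNS(W_NS, "pStyle");
            String style = styles.getLength() > 0
                    ? ((Element) styles.item(0)).getAttributeNS(W_NS, "val")
                    : "(default)";
            // Concatenate the text of all w:t elements in the paragraph.
            StringBuilder text = new StringBuilder();
            NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            out.add(style + "\t" + text);
        }
        return out;
    }
}
```

Passing a FileInputStream for the docx file to DocxStyleDump.paragraphStyles would give one line per paragraph, which could then be mapped onto the tags of the custom XML document. The value returned is the style ID as stored in the file, which is not necessarily the style name shown in Word.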


MSB wrote:
> I think that my previous post may have been a bit misleading because when
> you are parsing the document you are getting Paragraph objects, so I think
> that the best thing to do is something like the following:
> if(paragraph instanceof ListEntry) {
>     // Then we have a list, or at least an entry in a list.
> }
> else {
>     // We can assume we have a Paragraph at this point, but I do not think
>     // it is possible to further check the type. The only thing we can do
>     // now is to see if the Paragraph is in a Table cell. So
>     Table table = docRange.getTable(paragraph);
>     if(table != null) {
>         // We are dealing with text that is in a table cell. Thus we have
>         // found a table that can be dealt with here.
>     }
>     else {
>         // We are dealing with a paragraph of text only now.
>     }
> }
> Sadly, I have not been able to find the code I wrote that strips tables
> out 'in line' but I am sure it is on the list somewhere if you have a good
> search through. Something like the above will allow you to detect lists,
> tables and 'normal' paragraphs of text as they occur in the document;
> sadly images are going to present a different problem I suspect and I do
> not as yet know how to approach this particular problem.
> One other aspect you may want to consider is sections. Typically, the way
> I process a Word document is:
> Open the document.
> Get the Range object for the document (which one depends upon whether I
> want to process the headers/footers or not).
> Ask that Range how many Paragraph objects it contains.
> Iterate through the Paragraphs one at a time.
> It is possible - at least I think it is - to abstract this up one level,
> so:
> Open the document.
> Get the Range object for the document.
> Ask it how many Sections it contains.
> Iterate through the Sections and for each:
>    Ask it how many Paragraphs it contains.
>    Iterate through the Paragraphs.
> Sections contain some information that may/will be valuable to you, not
> least being the number of columns on the page.
> Yours
> Mark B

View this message in context: http://old.nabble.com/Extract-Text-with-style-type-information-tp27209960p27322466.html
Sent from the POI - User mailing list archive at Nabble.com.
