poi-user mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: Extract Text with style/type information
Date Tue, 26 Jan 2010 16:50:36 GMT

Sorry Mark, my knowledge of XWPF is VERY limited indeed. It may be best to
start a new thread asking specifically about XWPF if you want to get a
better response. Having said that though and speaking as someone with a
limited knowledge of such things, would it not be possible to transform the
xml formatted file into 'your' xml format and simply remove XWPF from the
equation entirely?
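
To make that idea a little more concrete, here is a minimal sketch of
reading the XML directly, with no XWPF involved. It assumes only the JDK
and relies on the fact that a docx file is a zip archive whose main body
lives in word/document.xml, where each w:p element is a paragraph and an
optional w:pStyle child names its style. The class name DocxStyleSketch
and the helper tinyDocx are made up for illustration (the helper only
exists so the example can run without a real file), and nested content
such as text boxes is not handled, so treat it as a starting point only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxStyleSketch {
    // Namespace used by WordprocessingML elements such as w:p, w:pStyle and w:t.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Finds word/document.xml inside the docx zip and lists "styleId: text" per paragraph. */
    static List<String> paragraphsWithStyles(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    return parse(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    /** Walks every w:p element, reading its w:pStyle (if any) and the text of its w:t runs. */
    static List<String> parse(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder().parse(xml);
        List<String> out = new ArrayList<>();
        NodeList paragraphs = dom.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            NodeList styles = p.getElementsByTagNameNS(W, "pStyle");
            String style = styles.getLength() > 0
                    ? ((Element) styles.item(0)).getAttributeNS(W, "val")
                    : "(default)";
            StringBuilder text = new StringBuilder();
            NodeList texts = p.getElementsByTagNameNS(W, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            out.add(style + ": " + text);
        }
        return out;
    }

    /** Hypothetical helper: wraps the given XML as word/document.xml in an in-memory zip. */
    static byte[] tinyDocx(String documentXml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body>"
                + "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr>"
                + "<w:r><w:t>Title</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Plain text</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        for (String line : paragraphsWithStyles(new ByteArrayInputStream(tinyDocx(xml)))) {
            System.out.println(line);
        }
    }
}
```

With a real file you would pass a FileInputStream for the docx instead
of the in-memory bytes, and emit your own xml tags where the sketch
builds its "styleId: text" strings.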

Yours

Mark B


markl16 wrote:
> 
> During my research into HWPF I now suddenly need to look into XWPF for
> reading in a docx file, going through the file line by line and getting
> style info so that an alternative custom xml document can be created
> based on the tags in the original docx file.
> 
> Just wondering, does XWPF offer similar features to HWPF as discussed in
> this thread so far? I have read in a simple docx file and printed out the
> text but I don't see many methods to get details on style etc. I wonder
> whether it would be worthwhile investigating further.
> 
> Best
> Mark
> 
> 
> MSB wrote:
>> 
>> I think that my previous post may have been a bit misleading because,
>> when you are parsing the document, you are getting Paragraph objects, so
>> I think that the best thing to do is something like the following:
>> 
>> if(paragraph instanceof ListEntry) {
>>     // Then we have a list, or at least an entry in a list.
>> }
>> else {
>>     // We can assume we have a Paragraph at this point, but I do not think
>>     // it is possible to check the type any further. The only thing we can
>>     // do now is to see whether the Paragraph is in a table cell.
>>     // isInTable() is checked first because docRange.getTable(paragraph)
>>     // throws an IllegalArgumentException for a Paragraph outside a table.
>>     if(paragraph.isInTable()) {
>>         // We are dealing with text that is in a table cell. Thus we have
>>         // found a table that can be dealt with here.
>>         Table table = docRange.getTable(paragraph);
>>     }
>>     else {
>>         // We are dealing with a paragraph of text only now.
>>     }
>> }
>> 
>> Sadly, I have not been able to find the code I wrote that strips tables
>> out 'in line' but I am sure it is on the list somewhere if you have a
>> good search through. Something like the above will allow you to detect
>> lists, tables and 'normal' paragraphs of text as they occur in the
>> document; sadly images are going to present a different problem I suspect
>> and I do not as yet know how to approach this particular problem.
>> 
>> One other aspect you may want to consider is sections. Typically, the
>> way I process a Word document is:
>> 
>> Open the document.
>> Get the Range object for the document (which one depends upon whether I
>> want to process the headers/footers or not).
>> Ask that Range how many Paragraph objects it contains.
>> Iterate through the Paragraphs one at a time.
>> 
>> It is possible - at least I think it is - to abstract this up one level,
>> so;
>> 
>> Open the document.
>> Get the Range object for the document.
>> Ask it how many Sections it contains.
>> Iterate through the Sections and for each:
>>    Ask it how many Paragraphs it contains.
>>    Iterate through the Paragraphs.
>> 
>> Sections contain some information that may/will be valuable to you, not
>> least being the number of columns on the page.
>> 
>> Yours
>> 
>> Mark B
>> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Extract-Text-with-style-type-information-tp27209960p27325894.html
Sent from the POI - User mailing list archive at Nabble.com.

