poi-user mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: Extract Text with style/type information
Date Sat, 23 Jan 2010 07:48:54 GMT

I think that my previous post may have been a bit misleading, because when
you are parsing the document you are getting Paragraph objects. So I think
that the best thing to do is something like the following:

if (paragraph instanceof ListEntry) {
    // We have a list, or at least an entry in a list.
} else {
    // We can assume we have a plain Paragraph at this point; I do not think
    // it is possible to check the type any further. The only thing left to
    // do is to see whether the Paragraph is in a table cell.
    if (paragraph.isInTable()) {
        // The text is in a table cell, so we have found a table that can be
        // dealt with here. As far as I can tell, getTable() throws an
        // IllegalArgumentException rather than returning null for a
        // paragraph outside a table, hence the isInTable() test first.
        Table table = docRange.getTable(paragraph);
    } else {
        // We are dealing with a paragraph of text only now.
    }
}

Sadly, I have not been able to find the code I wrote that strips tables out
'in line', but I am sure it is on the list somewhere if you have a good
search through. Something like the above will allow you to detect lists,
tables and 'normal' paragraphs of text as they occur in the document. Images
are going to present a different problem, I suspect, and I do not yet know
how to approach it.
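
To make the order of those tests concrete, here is a minimal, self-contained
sketch. The Paragraph and ListEntry classes in it are stand-ins that mirror
only the inheritance discussed in this thread (ListEntry extends Paragraph)
plus an in-table flag like the one HWPF's Paragraph.isInTable() reports; they
are not the real POI classes. The Kind enum and the classify() method are
names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphKinds {

    // Stand-in for a paragraph; NOT org.apache.poi.hwpf.usermodel.Paragraph.
    static class Paragraph {
        private final boolean inTable;
        Paragraph(boolean inTable) { this.inTable = inTable; }
        boolean isInTable() { return inTable; }
    }

    // Stand-in for a list entry; it only reproduces "ListEntry extends Paragraph".
    static class ListEntry extends Paragraph {
        ListEntry(boolean inTable) { super(inTable); }
    }

    // Hypothetical labels for the three cases discussed above.
    enum Kind { LIST_ENTRY, TABLE_TEXT, PLAIN_TEXT }

    // Test the subclass first: every ListEntry is also a Paragraph, so a
    // test for the parent class placed first would match list entries too.
    // With this order a list entry inside a table is reported as LIST_ENTRY.
    static Kind classify(Paragraph p) {
        if (p instanceof ListEntry) {
            return Kind.LIST_ENTRY;
        }
        if (p.isInTable()) {
            return Kind.TABLE_TEXT;
        }
        return Kind.PLAIN_TEXT;
    }

    public static void main(String[] args) {
        List<Paragraph> paragraphs = new ArrayList<>();
        paragraphs.add(new ListEntry(false));
        paragraphs.add(new Paragraph(true));
        paragraphs.add(new Paragraph(false));
        for (Paragraph p : paragraphs) {
            System.out.println(classify(p));
        }
    }
}
```

Against a real document, the same three-way test would be applied to each
object returned while iterating over the Range.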

One other aspect you may want to consider is sections. Typically, the way I
process a Word document is:

Open the document.
Get the Range object for the document (which one depends upon whether I want
to process the headers/footers or not).
Ask that Range how many Paragraph objects it contains.
Iterate through the Paragraphs one at a time.

It is possible - at least I think it is - to abstract this up one level, so:

Open the document.
Get the Range object for the document.
Ask it how many Sections it contains.
Iterate through the Sections and for each:
   Ask it how many Paragraphs it contains.
   Iterate through the Paragraphs.

Sections contain some information that may/will be valuable to you, not
least being the number of columns on the page.
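
The section-level walk described above might look roughly like the sketch
below. It is untested against any particular POI release and needs the POI
HWPF (scratchpad) jar on the classpath. The class name SectionWalker and the
"list"/"table"/"text" labels are made up for illustration, and the input path
is assumed to arrive as the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.ListEntry;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;

public class SectionWalker {
    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to a binary .doc file.
        try (FileInputStream in = new FileInputStream(args[0])) {
            HWPFDocument doc = new HWPFDocument(in);

            // getRange() covers the main body text; getOverallRange()
            // would also take in headers, footers and the like.
            Range range = doc.getRange();

            for (int s = 0; s < range.numSections(); s++) {
                Section section = range.getSection(s);
                System.out.println("Section " + s + ", columns: "
                        + section.getNumColumns());

                for (int p = 0; p < section.numParagraphs(); p++) {
                    Paragraph paragraph = section.getParagraph(p);
                    // Subclass test first, then the table test.
                    String kind = (paragraph instanceof ListEntry) ? "list"
                            : paragraph.isInTable() ? "table" : "text";
                    System.out.println("  [" + kind + "] "
                            + paragraph.text().trim());
                }
            }
        }
    }
}
```

Dropping the outer loop and calling numParagraphs() and getParagraph() on the
Range itself gives the simpler paragraph-only walk.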


Mark B

markl16 wrote:
> Yep, I think you were on to something there. I tried:
> [code]
> if(paragraph instanceof ListEntry)
> {
> 	System.out.println("true");
> }
> else
> {
> 	System.out.println("false");
> }
> [/code]
> Which seemed to work. I'll do some more research and see whether a similar
> solution works for all the tags I want.
> Best
> Mark
> MSB wrote:
>> I am hoping that it really is this simple but I cannot be too sure that
>> it really will be. The org.apache.poi.hwpf.usermodel.Range class is the
>> parent class for CharacterRun, DocumentPosition, Paragraph, Section,
>> Table and TableCell, whilst Paragraph is the parent of ListEntry. I have
>> never tried this but could it be as simple as using instanceof to test
>> what class you actually had in hand whilst parsing the document? It
>> should be easy enough to test this hypothesis;
>> Open a document.
>> Get the top level Range object.
>> Get the number of Paragraphs.
>> Iterate through the Paragraphs one at a time and test to see what object
>> you actually have in hand.
>> There are going to be one or two holes in this - I think that it will not
>> deal with pictures for example - but it could well be a way to start.
>> Yours
>> Mark B

View this message in context: http://old.nabble.com/Extract-Text-with-style-type-information-tp27209960p27284134.html
Sent from the POI - User mailing list archive at Nabble.com.
