poi-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 49189] New: XWPFWordExtractor discards <w:tab/> entries.
Date Tue, 27 Apr 2010 08:21:55 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=49189

           Summary: XWPFWordExtractor discards <w:tab/> entries.
           Product: POI
           Version: 3.7-dev
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: antoni.mylka@gmail.com


Created an attachment (id=25358)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=25358)
Test document which exposes the problem

In the current trunk, two characters separated by a tab character are glued
together: the tab is removed.

I tried to debug the issue and found the following piece of code in the
XWPFParagraph.getText() method:

XmlObject o = c.getObject();
if (o instanceof CTText) {
    text.append(((CTText) o).getStringValue());
}
if (o instanceof CTPTab) {
    text.append("\t");
}

This seems to assume that wherever a <w:tab/> construct appears in the source
text file, XMLBeans will return an instance of CTPTab. Unfortunately in my case
it seems to return CTEmptyImpl, which is not a CTPTab. 

I tried to read the specs, and section 17.3.1.37 says that there is only
one possible parent element for <w:tab>, namely <w:tabs>. In my file,
generated with Office 2010 beta, I have:

<w:p w14:paraId="4EB09767" w14:textId="77777777" w:rsidR="00B3064F"
    w:rsidRDefault="00B3064F">
    <w:r>
        <w:t>a</w:t>
    </w:r>
    <w:r>
        <w:tab />
        <w:t>b</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack" />
    <w:bookmarkEnd w:id="0" />
</w:p>

You see that <w:tab /> is not enclosed within <w:tabs></w:tabs>.
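
To make the expected result concrete: for the paragraph above, the extracted
text should be "a", a tab character, then "b". Below is a minimal sketch of
that behaviour using only the JDK's DOM parser, not POI or XMLBeans. The class
name TabAwareParagraphText, its extract() method and the SAMPLE constant are
made up for illustration. The sample is a trimmed copy of the paragraph, with
the xmlns:w declaration added so that it parses on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI: walks the runs of one <w:p> element
// and emits "\t" for every <w:tab/> that is a direct child of a run.
class TabAwareParagraphText {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Trimmed version of the paragraph from the report (namespace declared
    // here because the snippet in the report omits it).
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t>a</w:t></w:r>"
        + "<w:r><w:tab/><w:t>b</w:t></w:r>"
        + "<w:bookmarkStart w:id=\"0\" w:name=\"_GoBack\"/>"
        + "<w:bookmarkEnd w:id=\"0\"/>"
        + "</w:p>";

    static String extract(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)));

            StringBuilder text = new StringBuilder();
            Node paragraph = doc.getDocumentElement();
            for (Node run = paragraph.getFirstChild(); run != null; run = run.getNextSibling()) {
                if (!isW(run, "r")) {
                    continue; // skip bookmarks and anything else that is not a run
                }
                for (Node c = run.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (isW(c, "t")) {
                        text.append(c.getTextContent());
                    } else if (isW(c, "tab")) {
                        text.append('\t'); // keep the tab instead of dropping it
                    }
                }
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse paragraph XML", e);
        }
    }

    // True if the node is an element named w:<localName>.
    private static boolean isW(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
            && W_NS.equals(n.getNamespaceURI())
            && localName.equals(n.getLocalName());
    }
}
```

Calling TabAwareParagraphText.extract(TabAwareParagraphText.SAMPLE) yields the
two letters with the tab preserved between them, which is what I would expect
XWPFParagraph.getText() to return for this paragraph.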

This might imply that either Office produces a wrong file, or the OpenXML XSDs
are wrong, or there is something wrong with the XMLBeans class generator or
its runtime parser.

Could someone with more knowledge of the OpenXML format take a look at this?
This error spoils fulltext indexing and seems pretty important for the users of
the Aperture Framework.

The easiest workaround for me would be to add a third 'if' for CTEmptyImpl and
put a space in the output. Superfluous whitespace (almost) never hurts, while
gluing words together is bad, but as I said, my knowledge of this topic is
limited.
