poi-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 45622] New: Header/footer extraction for Word documents incomplete
Date Tue, 12 Aug 2008 18:13:22 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=45622

           Summary: Header/footer extraction for Word documents incomplete
           Product: POI
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P1
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: dgoldenberg@attivio.com


Created an attachment (id=22435)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=22435)
Simple Word doc with headers and footers.

There are several issues with the header/footer extraction for Word as it is
implemented now.

1. The newly added methods on WordExtractor are:
public String getHeaderText()
public String getFooterText()

These methods do not account for the use-case of headers/footers defined
differently for odd vs. even pages in Word.

I propose a different model:

HWPFHeader header = extractor.getHeader();
String oddHeader = header.getOddHeader();
String evenHeader = header.getEvenHeader();

HWPFFooter footer = extractor.getFooter();
String oddFooter = footer.getOddFooter();
String evenFooter = footer.getEvenFooter();

This would match Word's own model and be consistent with the model already
used in the Excel header/footer extraction code:

HSSFHeader header = sheet.getHeader();
String leftHeader = header.getLeft();
String centerHeader = header.getCenter();
String rightHeader = header.getRight();
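
To make the proposed model concrete, here is a minimal sketch of the kind of
value object it implies. The class name HeaderTextSketch and the forPage()
helper are hypothetical and not part of POI; the fallback to the odd-page text
when no even-page text is defined is an assumption about the desired behaviour.

```java
// Hypothetical holder for odd/even header text; not an existing POI class.
final class HeaderTextSketch {
    private final String oddHeader;
    private final String evenHeader;

    HeaderTextSketch(String oddHeader, String evenHeader) {
        this.oddHeader = oddHeader == null ? "" : oddHeader;
        this.evenHeader = evenHeader == null ? "" : evenHeader;
    }

    String getOddHeader() {
        return oddHeader;
    }

    String getEvenHeader() {
        return evenHeader;
    }

    // Hypothetical convenience: header text for a 1-based page number.
    // Assumption: if no separate even-page header exists, the odd-page
    // header applies to every page.
    String forPage(int pageNumber) {
        boolean even = pageNumber % 2 == 0;
        if (even && !evenHeader.isEmpty()) {
            return evenHeader;
        }
        return oddHeader;
    }
}
```

A footer holder would have the same shape, with getOddFooter() and
getEvenFooter() in place of the header accessors.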

2. The second issue is field codes (referred to below as macros). Fields can be
inserted into headers and footers, and currently their codes show up in the
extracted text. For example, in the attached file HeadersFooters2.doc, the
Author field was used in the header, and the string "AUTHOR" gets returned. It
would be great if the header/footer methods returned only the actual text and
never the macros, or if the methods took a boolean flag to strip the macros.

For example, for the attached HeadersFooters2.doc, the following gets returned:

HEADER GOES HERE. 8/12/2008  AUTHOR \* MERGEFORMAT Eric Roch

It would be great if the returned text was simply:

HEADER GOES HERE. 8/12/2008 Eric Roch

In the interest of being generic, a flag for stripping off this extra markup is
probably best.
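
As a rough illustration of the stripping described above, here is a minimal
sketch. It assumes the extracted string still contains the Word field marker
characters (0x13 field begin, 0x14 field separator, 0x15 field end), which may
not be true of what the extractor currently returns. The class FieldStripper is
hypothetical and is not the POI implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: drops field codes, keeps the field results.
final class FieldStripper {
    private static final char FIELD_BEGIN = (char) 0x13;
    private static final char FIELD_SEPARATOR = (char) 0x14;
    private static final char FIELD_END = (char) 0x15;

    private FieldStripper() {
    }

    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true once its separator has been seen.
        Deque<Boolean> openFields = new ArrayDeque<>();
        // Number of open fields that are still in their code portion.
        int codeDepth = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.FALSE);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from code to result.
                if (!openFields.isEmpty() && !openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.TRUE);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // A field with no separator ends while still in its code.
                if (!openFields.isEmpty() && !openFields.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                // Plain text, or the result portion of a field.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With the flag approach, the extractor would call something like
FieldStripper.stripFields(rawHeaderText) only when stripping is requested and
return the raw text otherwise.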



