poi-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject formatting info in Header/FooterRecords in xls(x)
Date Mon, 04 Jan 2016 16:37:48 GMT
  Over on TIKA-1730 [0], we have a request to hide formatting info from header/footer records
for both xls and xlsx during text extraction.  
  When I look at the text from FooterCell's getText(), it looks like we may want to parse
the string into its subcomponents for a HeaderCell/FooterCell.  Some useful information
from Microsoft is here [1].
For example, from Tika's testExcel_headers_footers.xls file:

&LFooter - Corporate Spreadsheet&CFooter - For Internal Use Only&RFooter - Author: John Smith

Note, though, that the xlsx file already stores the left/center/right components separately:
Footer - Corporate Spreadsheet Footer - For Internal Use Only Footer - Author: John Smith
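
To make the idea concrete, here is a rough sketch of the kind of parsing I have in mind
for the xls string. It is not existing POI or Tika code: the class name HeaderFooterSections
and the method split() are made up for illustration, and it assumes that text appearing
before any &L/&C/&R code belongs to the center section. It drops font, size, style and
field codes (&B, &P, and so on) instead of expanding them, and it does not handle every
code listed in [1].

```java
// Hypothetical helper, not part of POI or Tika.
final class HeaderFooterSections {

    /** Returns {left, center, right} with formatting and field codes removed. */
    static String[] split(String raw) {
        StringBuilder[] parts = {
            new StringBuilder(), new StringBuilder(), new StringBuilder()
        };
        int current = 1; // assumption: text before any section code goes to the center
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '&' || i + 1 >= raw.length()) {
                // Ordinary character (or a trailing lone '&'): keep it.
                parts[current].append(c);
                i++;
                continue;
            }
            char code = raw.charAt(i + 1);
            i += 2;
            if (code == 'L') {
                current = 0;
            } else if (code == 'C') {
                current = 1;
            } else if (code == 'R') {
                current = 2;
            } else if (code == '&') {
                parts[current].append('&'); // "&&" is a literal ampersand
            } else if (code == '"') {
                // &"Font,Style": skip up to and including the closing quote.
                int close = raw.indexOf('"', i);
                i = (close < 0) ? raw.length() : close + 1;
            } else if (Character.isDigit(code)) {
                // &nn font size: skip the remaining digits.
                while (i < raw.length() && Character.isDigit(raw.charAt(i))) {
                    i++;
                }
            }
            // Any other code (&B, &I, &U, &P, &D, ...) is simply dropped.
        }
        return new String[] {
            parts[0].toString(), parts[1].toString(), parts[2].toString()
        };
    }
}
```

Applied to the footer string above, split() should give back the same three pieces that
the xlsx side already stores separately.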

From the TIKA-1730 .xls file:


From the TIKA-1730.xlsx file:

Has anyone worked with this area of our code base recently?  Is this something we should add/fix
at the POI level or at the Tika level?

Thank you.



[0] https://issues.apache.org/jira/browse/TIKA-1730
[1] https://support.microsoft.com/en-us/kb/142136 

