poi-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Invalid header for xls: 0x0010000000060409?
Date Tue, 25 Nov 2014 15:40:38 GMT
Thank you, Nick!  I'll post a file to Tika's JIRA.  Or should I raise this on POI's bugzilla?
I can't imagine there's a burning need for (or interest in adding) processing for pre-OLE2 docs.

 -----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Tuesday, November 25, 2014 9:20 AM
To: POI Users List
Subject: Re: Invalid header for xls: 0x0010000000060409?

On Mon, 24 Nov 2014, Allison, Timothy B. wrote:
> I recently ran Tika against the ~1 million files in govdocs1.  About 
> 91% (2,579/2,828) of the XLS exceptions via Tika 1.7 are the following. 
> Tika is detecting these as XLS and then the header exception is thrown.

You need to read that backwards, byte by byte, to see the pattern, so the 
file starts with the bytes 0x09 0x04 0x06
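
To make the "read it backwards" step concrete, here is a minimal sketch in Java. It assumes the value in the exception message is the first 8 bytes of the file interpreted as a little-endian long; the class and method names are made up for illustration and are not part of POI or Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not a POI class.
class HeaderBytes {
    // Writes the long in little-endian order and returns the resulting
    // bytes, i.e. the order they would appear in at the start of the file.
    static String toFileOrderHex(long signature) {
        byte[] bytes = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Header value from the subject line of this thread.
        System.out.println(toFileOrderHex(0x0010000000060409L));
        // prints "09 04 06 00 00 00 10 00"
    }
}
```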

> Does this header ring any bells?  Old version of XLS, perhaps?  The 
> triggering files open in Excel and I think I see that they are "Excel 
> 4".

Sounds like one of the very old, pre-OLE2 versions

Looking at the OpenOffice documentation, under section 2.2 and 2.3:
http://www.openoffice.org/sc/excelfileformat.pdf

That suggests that Excel 5 onwards (5, 95, 97 etc) used OLE2, so that'd 
mean it's Excel 1 through Excel 4

> I can't get the link to work, but one triggering file is 004444.xls.

If you can get that file out, and raise a JIRA, then we can look to add in 
magic to correctly detect/handle those files!
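
As a rough idea of what such magic could look like, here is a sketch in Java that checks the first four bytes of a file for a pre-OLE2 BOF record. This is not what POI or Tika actually do. The record ids (0x0009, 0x0209, 0x0409) and record lengths (4, 6, 6) are my reading of the BOF record description in the OpenOffice document linked above, so treat them as assumptions to verify; the class and method names are hypothetical.

```java
// Hypothetical sniffer, not part of POI or Tika.
class OldExcelSniffer {
    // Returns 2, 3 or 4 if the bytes look like a BIFF2/3/4 BOF record,
    // or -1 if they do not match any of the assumed patterns.
    static int sniffBiffVersion(byte[] head) {
        if (head == null || head.length < 4) {
            return -1;
        }
        // Record id and record length are little-endian 16-bit values.
        int recordId = (head[0] & 0xFF) | ((head[1] & 0xFF) << 8);
        int recordLen = (head[2] & 0xFF) | ((head[3] & 0xFF) << 8);
        switch (recordId) {
            case 0x0009: return recordLen == 4 ? 2 : -1; // assumed BIFF2 BOF
            case 0x0209: return recordLen == 6 ? 3 : -1; // assumed BIFF3 BOF
            case 0x0409: return recordLen == 6 ? 4 : -1; // assumed BIFF4 BOF
            default:     return -1;
        }
    }

    public static void main(String[] args) {
        // First four bytes implied by the header in this thread.
        byte[] head = {0x09, 0x04, 0x06, 0x00};
        System.out.println(sniffBiffVersion(head)); // prints 4
    }
}
```

Checking the record length as well as the id makes a false match on arbitrary binary data less likely, at the cost of rejecting files whose BOF record deviates from the documented size.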

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

