tika-dev mailing list archives

From "Andreas Meier (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2632) Analyze unknown govdocs files
Date Mon, 16 Apr 2018 06:41:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Meier updated TIKA-2632:
--------------------------------
    Description: 
I recently started analyzing random govdocs1 files that TIKA could not recognize
properly.

 

This ticket should be used to identify problems with old or proprietary files and to extend
TIKA step-by-step if needed.

 

Stumbled across the following filetypes/files:

 
1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized properly:

Found some mysterious files starting with 0xeddead0b and 0x0baddeed

Turned out that someone else already investigated this case a month ago:
http://anjackson.net/2018/03/15/story-of-a-bad-deed/

The files are old PowerPoint files (PowerPoint 2.0 or 3.0).
I think these magic values should be added to tika-mimetypes.xml, along with another PowerPoint
MIME type (maybe application/vnd.ms-powerpoint.2 or application/vnd.ms-powerpoint.3?).

Example files in govdocs1: 
144/144504.unk
272/272490.unk
430/430427.unk
(several more...)
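
To make the proposed check concrete, here is a minimal sketch. It assumes the two hex values above are the first four bytes of the file in file order; the function name is hypothetical, and this is not TIKA's detection code.

```python
# Magic values quoted in this ticket for the suspected PowerPoint 2.0/3.0 files.
# Assumption: each hex string is the literal byte sequence at offset 0.
OLD_PPT_MAGICS = (bytes.fromhex("eddead0b"), bytes.fromhex("0baddeed"))


def looks_like_old_powerpoint(data: bytes) -> bool:
    """Return True if data starts with one of the two magic values above."""
    return data[:4] in OLD_PPT_MAGICS
```

Reading only the first four bytes of a candidate file (for example with open(path, "rb").read(4)) is enough for this check.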


2. Proprietary File Format: SigmaPlot Exchange File .jxf:
Magic: 0x8888000c4a5846
Example files in govdocs1:
975/975382.unk
975/975383.unk
 (several more...)


3. There are two old Excel file variants that are not recognized at the moment (application/vnd.ms-excel.sheet.2):

376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 0x0900040000001000

224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of 0x0900040000001000

The magic for application/vnd.ms-excel.sheet.2 should be extended. As alternatives to 0x00001000
in bytes 4-7, the values
0x02001000
and
0x07001000
must be added.

Furthermore we have to check whether the parser can be adapted to process all the mentioned
files.

(LibreOffice can open all of these files)
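
The extended magic could be checked roughly as follows. This is a sketch that assumes the first four bytes stay fixed at 0x09000400 and only bytes 4-7 vary, as in the four example files; the names are hypothetical and the value set contains only what is listed in this ticket.

```python
# Fixed first four bytes shared by all magic values quoted in case 3.
EXCEL2_PREFIX = bytes.fromhex("09000400")

# Bytes 4-7: the currently recognized value plus the two proposed additions.
EXCEL2_SUFFIXES = {
    bytes.fromhex("00001000"),  # already recognized
    bytes.fromhex("02001000"),  # seen in 224/224485.unk and 615/615187.unk
    bytes.fromhex("07001000"),  # seen in 376/376222.unk and 622/62252.unk
}


def matches_excel2_magic(data: bytes) -> bool:
    """Return True if the first eight bytes match one of the accepted variants."""
    return data[:4] == EXCEL2_PREFIX and data[4:8] in EXCEL2_SUFFIXES
```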


4. Special Header/Wrapper in front of application/vnd.ms-excel.sheet.3
In file 611/611703.unk I found a 128-byte header in front of the Excel file.
Therefore the file could not be recognized correctly by TIKA.

After I cut the header, the file could be recognized and converted by TIKA.
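
Cutting the header can be sketched like this. It assumes the wrapper is exactly 128 bytes, as observed in this one file; the function name is hypothetical, and other files may carry a different wrapper.

```python
HEADER_LENGTH = 128  # header size observed in 611/611703.unk


def strip_wrapper(data: bytes, header_length: int = HEADER_LENGTH) -> bytes:
    """Return the file content that follows the leading header."""
    if len(data) <= header_length:
        raise ValueError("file is not longer than the assumed header")
    return data[header_length:]
```

The returned bytes can then be written to a new file and passed to TIKA for detection.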


5. SAS Data file
Example file:
020/020505.unk

6. AirSar Data (Airborne Synthetic Aperture Radar)
Example file:
348/349489.unk (several more...)

7. Advanced Data Format (ADF)
Used in CGNS (CFD General Notation System .cgns)
Example file:
363/363966.unk

8. Unknown Microsoft Word Document
Example file:
202/202718.unk
(Recognized as Microsoft Word Document by Linux Magic)

9. Unknown PowerPoint 3.0 file?
Example file:
388/388212.unk


Let me know if I should open separate tickets for cases 1 and 3!


If there is a better place (other than the mailing lists) to publish the analysis results,
let me know.

 

Regards

 

Andreas

  was:
I recently started analyzing random govdocs1 files that TIKA could not recognize
properly.

 

This ticket should be used to identify problems with old or proprietary files and to extend
TIKA step-by-step if needed.

 

Stumbled across the following problems:

 

1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized properly:

Found some mysterious files starting with 0xeddead0b and 0x0baddeed

Turned out that someone else already investigated this case a month ago:
http://anjackson.net/2018/03/15/story-of-a-bad-deed/

The files are old PowerPoint files (PowerPoint 2.0 or 3.0).
I think these magic values should be added to tika-mimetypes.xml, along with another PowerPoint
MIME type (maybe application/vnd.ms-powerpoint.2 or application/vnd.ms-powerpoint.3?).

Example files in govdocs1: 
144/144504.unk
272/272490.unk
430/430427.unk
(several more...)


2. Proprietary File Format: SigmaPlot Exchange File .jxf:
Magic: 0x8888000c4a5846
Example files in govdocs1:
975/975382.unk
975/975383.unk
 (several more...)


3. Bitflip or valid Magic for application/vnd.ms-excel.sheet.2
In one file (376/376222.unk) I found
0x0900040007001000
instead of
0x0900040000001000

-I guess the bit just flipped for some reason (interception of the data or something else)-
EDIT: file 622/62252.unk also starts with 0x0900040007001000
Maybe the magic for application/vnd.ms-excel.sheet.2 should be adapted.
Any thoughts?


4. Special Header/Wrapper in front of application/vnd.ms-excel.sheet.3
In file 611/611703.unk I found a 128-byte header in front of the Excel file.
Therefore the file could not be recognized correctly by TIKA.

After I cut the header, the file could be recognized and converted by TIKA.



Let me know if I should open a separate ticket for case 1.


If there is a better place (other than the mailing lists) to publish the analysis results,
let me know.

 

Regards

 

Andreas


> Analyze unknown govdocs files
> -----------------------------
>
>                 Key: TIKA-2632
>                 URL: https://issues.apache.org/jira/browse/TIKA-2632
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Andreas Meier
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
