poi-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: [Bug 60519] Extractor for *SSF embeddings
Date Thu, 05 Jan 2017 13:00:17 GMT
Thank you Andi and Javen!

Javen,

  I respect your point about "not limited to MSOffice documents".  My selfish/Tika-ish goal
in processing EMF files, frankly, is only to extract embedded documents and their metadata.  Andi's
patch demonstrated the need to handle the "feature" distinction between how Mac xls and Windows
xls handle embedded pdf files: in Windows, the pdf is available as a standalone embedded
file, with an emf to represent the icon; in Mac, the emf contains the original pdf (and graphics
to represent the icon?).  In short, over on Tika, we're currently extracting the PDF from
the Windows xls but not from the Mac xls.

  So, yes, a robust read/write EMF parser/writer would make sense as a standalone project in
incubator.  However, I don't have the energy/time to do much more than read-only for this
one very small problem.  POI's scratchpad or Tika are the two immediate targets that I could
easily contribute to.  If there's a need and someone has the time, we could move whatever
code there is for this one small task into a future incubator project.

  I also respect your point about inviting bug reports that would distract us from focusing
on MSOffice documents.  Sounds like there's loose consensus to put this in Tika for now, and
if anyone wants to take it on, move it to incubator?  

  Cheers,

                  Tim


P.S. As a side note, I suspect there are some interesting metadata items that we can pull
out of EMFs... For example, I saw some text content of the PDF in the EMF portion of the mac
EMF.  I also saw some original paths for the embedded file in the EMF.
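
A rough sketch of the read-only approach discussed in this thread (walk the EMF records
generically and keep only the interesting ones), in Java. It relies on the EMF record
layout: each record starts with a 4-byte little-endian type and a 4-byte total size, and
EMR_COMMENT (0x46) records carry a DataSize field followed by arbitrary private data. The
class and method names are hypothetical, not POI or Tika API, and the idea that the Mac pdf
can be spotted by its "%PDF-" signature inside a comment payload is an assumption that
would need checking against real files.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical read-only walker over EMF records; not part of POI or Tika. */
final class EmfRecordWalker {
    static final int EMR_EOF = 0x0E;      // last record of an EMF
    static final int EMR_COMMENT = 0x46;  // carries arbitrary private data

    /** Returns the private data of every EMR_COMMENT record, in file order. */
    static List<byte[]> commentPayloads(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<byte[]> payloads = new ArrayList<>();
        while (buf.remaining() >= 8) {
            int start = buf.position();
            int type = buf.getInt();
            // Total record size in bytes, including the 8-byte type/size header.
            long size = buf.getInt() & 0xFFFFFFFFL;
            if (size < 8 || size > emf.length - start) {
                throw new IllegalArgumentException("Bad record size at offset " + start);
            }
            if (type == EMR_COMMENT && size >= 12) {
                long dataSize = buf.getInt() & 0xFFFFFFFFL;
                if (dataSize > size - 12) {
                    throw new IllegalArgumentException("Bad comment size at offset " + start);
                }
                byte[] data = new byte[(int) dataSize];
                buf.get(data);
                payloads.add(data);
            }
            // Skip every other record type without interpreting it.
            buf.position(start + (int) size);
            if (type == EMR_EOF) {
                break;
            }
        }
        return payloads;
    }

    /** Index of the first occurrence of needle in haystack, or -1 if absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

A caller could run indexOf(payload, "%PDF-".getBytes(StandardCharsets.US_ASCII)) over each
payload to flag candidates. Whether the pdf sits whole in a single comment record or is
split across several is not something this sketch settles.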

-----Original Message-----
From: Javen O'Neal [mailto:onealj@apache.org] 
Sent: Wednesday, January 4, 2017 8:05 PM
To: POI Developers List <dev@poi.apache.org>
Subject: Re: [Bug 60519] Extractor for *SSF embeddings

What about an Apache incubator project for reading and writing EMF(+) files?

On Jan 4, 2017 2:53 PM, "Andreas Beeker" <kiwiwings@apache.org> wrote:

> Hi Tim,
>
> every now and then I toy with the idea of providing an EMF parser like 
> the WMF parser, to render images inside slideshows. This could of 
> course be used to extract other content too.
> The simplest way would be to adapt the FreeHep library, but it's GPL 
> licensed ... :(
>
> So for extracting embedded content, I guess it's not so difficult to 
> generically parse the emf(+) records and only handle the interesting ones.
> This limited functionality should be in scratchpad or the example classes.
> If it is not a huge code chunk, it could be in the Extractor class - 
> otherwise I would like to see it in Tika ...
>
> Andi