poi-user mailing list archives

From Dmitry Goldenberg <DGoldenb...@attivio.com>
Subject RE: How to extract embedded files from Office 07
Date Fri, 29 Aug 2008 00:30:12 GMT
Rainer,
You were right: the .bin file in /embeddings is an OLE2 container and can be read with POIFS.
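
A quick way to confirm that a given .bin really is OLE2 is to compare its first 8 bytes against the signature quoted further down in this thread (d0 cf 11 e0 a1 b1 1a e1). This is only a rough sketch using plain java.io, not a POI API; the class and method names are made up for illustration.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper, not part of POI.
final class Ole2Signature {
    // The 8-byte OLE2 compound document signature.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean looksLikeOle2(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[8];
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.readFully(head); // throws EOFException if the file is shorter than 8 bytes
        }
        System.out.println(args[0] + (looksLikeOle2(head) ? ": OLE2" : ": not OLE2"));
    }
}
```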

The gotcha is that POIFS currently has no API for extracting the embedded file from those
OLE structures.

HSLF has an API to enumerate Ole objects within slides. But what I need is a generic API that
would let me do the following:

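// Hypothetical API: neither getEmbeddings() nor Embedding exists in POI at this point.
List<Embedding> embeddings = poifs.getEmbeddings();
for (Embedding embedding : embeddings) {
    System.out.println(">> Embedding: " + embedding.getName());
    // FileOutputStream has no (File, String) constructor, so build the target File first.
    File target = new File(outputDir, Utils.getCleanFileName(embedding.getName()));
    embedding.extractTo(new FileOutputStream(target));
}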

getEmbeddings() could just as well be called getOleObjects() or whatever, but that's the gist of it.

- Dmitry

-----Original Message-----
From: Rainer Schwarze [mailto:rsc@admadic.de]
Sent: Thursday, August 28, 2008 6:49 PM
To: POI Users List
Subject: Re: How to extract embedded files from Office 07

Dmitry Goldenberg wrote:
> Yegor,
>
> The first 8 bytes contain the standard MS Office magic number stuff - d0 cf 11 e0 a1 b1 1a e1.
>
> Seems like they compress data in a proprietary way. I've read one post where someone
> recommended the .NET Packaging API to crack these ...  Not a good option ...

Hi Dmitry,

this may be interesting (unless you already found it):

http://www.nabble.com/Can-POIFS-convert-PDF-to-OLE-td18568081.html


Looking at such things I suspect this:

The data is inside "Ole10Native". This could be extracted using POIFS.
The structures there look like this:

[4 bytes] = size of structure including data
[???] = a few flags and strings (zero terminated)
[4 bytes] = size of actually embedded binary data
[???] = the actual binary data

If you know that it is a ZIP file, you could search for a byte sequence
[size]"PK", where the expected [size] depends on the search position.
Assume you start immediately after the first 4 bytes holding the total
length; there the size value would be length-4. Step forward by one byte
and check for the sequence with size set to length-5, and so on. When
the 6 bytes match the expected [size]PK sequence, you can be fairly sure
that "PK" marks the start of the ZIP file and [size] is its size.
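
A rough sketch of that search in plain Java, in case it helps. It assumes you already have the raw bytes of the "Ole10Native" stream in a byte array (for example read via POIFS), that both size fields are 4-byte little-endian integers, and that the leading length counts the bytes following it. The class and method names are made up for illustration, and this is a heuristic, not a parser for the real structure.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not part of POI.
final class Ole10NativeZipScan {

    // Returns the embedded ZIP bytes, or null if no [size]"PK" match is found.
    static byte[] findEmbeddedZip(byte[] stream) {
        if (stream.length < 10) {
            return null; // too short for length + size + "PK"
        }
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        long length = buf.getInt(0) & 0xFFFFFFFFL;
        if (length > stream.length - 4) {
            // Assumption: if the header is implausible, fall back to the real size.
            length = stream.length - 4;
        }
        int end = (int) (4 + length);

        // Slide a candidate size field through the header area.
        for (int pos = 4; pos + 6 <= end; pos++) {
            // At pos = 4 this is length-4, at pos = 5 it is length-5, and so on.
            long expected = end - (pos + 4);
            long size = buf.getInt(pos) & 0xFFFFFFFFL;
            if (size == expected && stream[pos + 4] == 'P' && stream[pos + 5] == 'K') {
                return Arrays.copyOfRange(stream, pos + 4, end);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Synthetic stream: length, 2 flag bytes, "a.zip\0", size, fake ZIP payload.
        byte[] payload = {'P', 'K', 3, 4, 'd', 'a', 't', 'a'};
        ByteBuffer b = ByteBuffer.allocate(4 + 2 + 6 + 4 + payload.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(2 + 6 + 4 + payload.length);
        b.put((byte) 2).put((byte) 0);
        b.put(new byte[] {'a', '.', 'z', 'i', 'p', 0});
        b.putInt(payload.length);
        b.put(payload);
        byte[] zip = findEmbeddedZip(b.array());
        System.out.println(zip == null ? "no match" : "found " + zip.length + " bytes");
    }
}
```

A match is only "fairly sure" because the same 6 bytes could in principle occur by chance inside the header strings, so it is worth checking that the result actually opens as a ZIP.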

Of course, nothing beats analysing the actual binary data structure :-)
(Would this be worth the effort for your purpose?)

Best wishes, Rainer
--

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

