commons-dev mailing list archives

From Simon Tyler <>
Subject Re: Compress 1.1 release?
Date Fri, 12 Mar 2010 15:59:20 GMT

The file called Word.xps at:

exhibits the problem.

You are entirely correct: the entry is STORED, so COMPRESS-100 does the trick
for me.


On 12/03/2010 15:17, "Stefan Bodewig" <> wrote:

> On 2010-03-12, Simon Tyler <> wrote:
>> If I explain the scenario in more detail then it might become clearer.
>> I am seeing issues with certain zip files and with file formats based on zip
>> (such as docx and xps). We are reading these files from a stream, so we are using the
>> ZipArchiveInputStream.
>> What I see is that we loop around getting each entry with getNextZipEntry
>> and we get a null and stop. All looks good. However we have only extracted 1
>> or 2 entries out of a known 20 or 30 entries - the file-based extractor
>> extracts all the files.
> Understood.  My guess is that whatever is creating your archives is
> using the optional header to identify data descriptors.  I'll try to
> create one with InfoZIP, can't promise anything, though.
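The loop Simon describes can be sketched with the JDK's own stream reader, whose getNextEntry API mirrors Commons Compress's ZipArchiveInputStream.getNextZipEntry. This is a minimal, self-contained sketch (it builds a two-entry archive in memory rather than relying on one of the customer files); with a stream reader, a null from the iterator is the only end-of-archive signal, which is why a premature null looks exactly like a short archive:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StreamLoopDemo {

    // Build a small two-entry archive in memory so the sketch is self-contained.
    static byte[] buildArchive() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream zout = new ZipOutputStream(buf);
        for (String name : new String[] { "a.txt", "b.txt" }) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write("hello".getBytes("US-ASCII"));
            zout.closeEntry();
        }
        zout.close();
        return buf.toByteArray();
    }

    // The loop from the thread: read entries until the stream returns null.
    static int countEntries(byte[] archive) throws IOException {
        ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(archive));
        int count = 0;
        while (zin.getNextEntry() != null) {
            count++;
        }
        zin.close();
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("entries=" + countEntries(buildArchive()));
    }
}
```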
>> I cannot provide an example of a file as the examples I have are all
>> customer owned.
> That's a pity.
>> However every xps file I have seen suffers the issue:
> I just created one using the "Save as XPS" addin to Word 2007 on a
> "Hello world" document and the stream worked just fine.
> I'll take a look later, likely not today.
>> I have investigated the issue and it is caused by entries that use the
>> central directory.
> you mean data descriptor, right?
>> What happens in the zip stream reader is that the size, csize and crc
>> fields are all zero, there is no central directory available to the
>> reader so it performs no extraction.
> This is not true.  If the archiver works correctly it has set a flag
> that it is going to use a data descriptor after the entry's data.  If
> this flag has been set AND the compression method is DEFLATE, the stream
> can figure out itself where the entry data ends (since DEFLATE marks EOF
> internally).  If the entry data is STORED the stream cannot know where
> the data ends.
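The STORED case described here is exactly why the JDK's ZipInputStream refuses such entries up front: a stream reader has no way to find the end of stored data whose sizes are deferred to the descriptor. A hand-crafted local file header (method STORED, general-purpose flag bit 3 set) triggers the rejection; this sketch uses java.util.zip rather than the Commons Compress classes, but the rejection is the behaviour being proposed for COMPRESS-100:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

public class StoredDescriptorDemo {

    // A minimal local file header: method STORED (0), flag bit 3 set,
    // sizes and CRC zeroed because they are deferred to a data
    // descriptor that would follow the entry data.
    static byte[] storedWithDescriptor() {
        return new byte[] {
            'P', 'K', 3, 4,    // local file header signature
            10, 0,             // version needed to extract
            8, 0,              // flags: bit 3 = sizes/CRC in data descriptor
            0, 0,              // compression method 0 = STORED
            0, 0, 0, 0,        // mod time / date
            0, 0, 0, 0,        // CRC-32 (deferred)
            0, 0, 0, 0,        // compressed size (deferred)
            0, 0, 0, 0,        // uncompressed size (deferred)
            1, 0,              // file name length
            0, 0,              // extra field length
            'a'                // file name
        };
    }

    // Returns the reader's error message, or null if it accepted the entry.
    static String tryRead(byte[] archive) throws IOException {
        ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(archive));
        try {
            zin.getNextEntry();
            return null;
        } catch (ZipException ex) {
            return ex.getMessage();
        } finally {
            zin.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tryRead(storedWithDescriptor()));
    }
}
```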
> I see several problems while looking through the code:
> * it doesn't verify the method is DEFLATE when a data descriptor is used
>   and it will try to read 0 bytes instead of throwing an exception -
>   this may be causing your problem.  COMPRESS-100
> * the stream just skips over the data descriptor and never reads it - it
>   rather sets size and crc fields from what it has found.  This may be
>   OK since we never check the claimed CRC anyway.
> * the stream skips over exactly four words while the archiver may have
>   used a signature of four bytes.  In that case the stream must skip
>   those extra bytes.  COMPRESS-101
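The optional-signature point (COMPRESS-101) comes from the descriptor's layout: per the ZIP specification it is three little-endian 32-bit words (CRC-32, compressed size, uncompressed size), optionally preceded by the four-byte signature 0x08074b50, so a reader must inspect the first word to decide how many bytes to consume. The following is a hypothetical helper, not the Commons Compress code; note the check is inherently heuristic, since a CRC that happens to equal the signature value would be misread:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DataDescriptor {
    static final long SIG = 0x08074b50L; // optional "PK\007\010" signature

    // Parse a data descriptor starting at 'off'; returns {crc, csize, size},
    // skipping the optional signature word if present.  Heuristic: a CRC
    // equal to 0x08074b50 would be mistaken for a signature, which is why
    // real readers need extra care here.
    static long[] parse(byte[] buf, int off) {
        ByteBuffer b = ByteBuffer.wrap(buf, off, buf.length - off)
                                 .order(ByteOrder.LITTLE_ENDIAN);
        long first = b.getInt() & 0xFFFFFFFFL;
        long crc = (first == SIG) ? (b.getInt() & 0xFFFFFFFFL) : first;
        long csize = b.getInt() & 0xFFFFFFFFL;
        long size = b.getInt() & 0xFFFFFFFFL;
        return new long[] { crc, csize, size };
    }

    public static void main(String[] args) {
        // Without signature: crc=0x11223344, csize=5, size=5
        byte[] plain = {
            0x44, 0x33, 0x22, 0x11,
            5, 0, 0, 0,
            5, 0, 0, 0
        };
        // With signature: PK\007\010 then the same three words
        byte[] signed = {
            0x50, 0x4b, 0x07, 0x08,
            0x44, 0x33, 0x22, 0x11,
            5, 0, 0, 0,
            5, 0, 0, 0
        };
        System.out.println(parse(plain, 0)[0] == parse(signed, 0)[0]);
    }
}
```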
>> So my two change requests are simply to enable me to validate entries and
>> detect these types of streams so I can do something appropriate.
> If I'm correct and you are bitten by what is now COMPRESS-100 then it
> should suffice if canReadEntryData returned false.  Right?
>> The second request is to not return a null when this type of error occurs
>> but indicate the error somehow. There might be issues here (I am no zip
>> expert) but I would be worried about false errors being reported.
> That could be COMPRESS-100 as well.  Or COMPRESS-101 is the problem for
> you, in which case we should be able to fix it.  Or it is yet another
> issue that we can't really identify without a testcase.
> Stefan

