commons-dev mailing list archives

From Stefan Bodewig <bode...@apache.org>
Subject Re: [compress] XZ support and inconsistencies in the existing compressors
Date Thu, 04 Aug 2011 04:23:53 GMT
Hi Lasse, and welcome

On 2011-08-03, Lasse Collin wrote:

> I have been working on XZ data compression implementation in Java
> <http://tukaani.org/xz/java.html>. I was told that it could be nice
> to get XZ support into Commons Compress.

Sounds interesting.

> I looked at the APIs and code in Commons Compress to see how XZ
> support could be added. I was especially looking for details where
> one would need to be careful to make different compressors behave
> consistently compared to each other.

This is in large part due to the history of Commons Compress, which
combined several different codebases with separate APIs and provided a
first attempt to layer a unifying API on top of them.  We are aware of
quite a few problems and want to address them in Commons Compress 2.x,
and it would be really great if you would participate in the design of
the new APIs once that discussion kicks off.

Right now I myself am pretty busy implementing ZIP64 support for a 1.3
release of Commons Compress and intend to start the 2.x discussion once
this is done - which is (combined with some scheduled offline time)
about a month away for me.

I should also mention that right now probably no active committer
understands the bzip2 code well enough to make significant changes at
all.  I know that I don't.

> I found a few possible problems in the existing code:

> (1) CompressorOutputStream should have finish(). Now
>     BZip2CompressorOutputStream has finish() but
>     GzipCompressorOutputStream doesn't. This should be easy to
>     fix because java.util.zip.GZIPOutputStream supports finish().

+1

This is a good point we should earmark for 2.0 - adding it in 1.x would
break the API, which we try to avoid.
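
To illustrate what finish() is for, here is a minimal sketch that uses only
the JDK's java.util.zip.GZIPOutputStream.  The class and method names
(FinishDemo, gzipThenTrailer, gunzip) are made up for this example and are
not part of Commons Compress.  finish() completes the gzip stream without
closing the underlying stream, so more data can be written after it:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; only the java.util.zip calls are real JDK API.
public class FinishDemo {

    /** Writes a complete gzip stream, then appends uncompressed bytes to the same sink. */
    static byte[] gzipThenTrailer(byte[] payload, byte[] trailer) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(sink);
        gz.write(payload);
        gz.finish();          // ends the gzip stream but leaves 'sink' open
        sink.write(trailer);  // still possible because nothing was closed
        return sink.toByteArray();
    }

    /** Decompresses a single gzip stream held in memory. */
    static byte[] gunzip(byte[] gz) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "compressed part".getBytes(StandardCharsets.UTF_8);
        byte[] trailer = "plain trailer".getBytes(StandardCharsets.UTF_8);
        byte[] all = gzipThenTrailer(payload, trailer);
        // Split the buffer again: the gzip stream ends where the trailer starts.
        byte[] gzPart = Arrays.copyOf(all, all.length - trailer.length);
        System.out.println(new String(gunzip(gzPart), StandardCharsets.UTF_8));
    }
}
```

A finish() on CompressorOutputStream would give callers the same option
for every compressor, which is what the consistency argument is about.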

> (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
>     doesn't flush data buffered by BZip2CompressorOutputStream.
>     Thus not all data written to the Bzip2 stream will be available
>     in the underlying output stream after flushing. This kind of
>     flush() implementation doesn't seem very useful.

Agreed, do you want to open a JIRA issue for this?

>     GzipCompressorOutputStream.flush() is the default version
>     from OutputStream and thus does nothing. Adding flush()
>     into GzipCompressorOutputStream is hard because
>     java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't
>     support sync flushing before Java 7. To get Gzip flushing in
>     older Java versions one might need a complete reimplementation
>     of the Deflate algorithm which isn't necessarily practical.

Not really desirable, I agree.  As for Java 7, we currently target
Java 5, but it might be possible to hack in flush support using
reflection.  That way we could support sync flushing whenever the Java
class library we are running on supports it.
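
A rough sketch of how such a reflection probe might look.  It relies on the
GZIPOutputStream(OutputStream, boolean syncFlush) constructor that Java 7
added; SyncFlushProbe and its methods are hypothetical names for this
example, and the sketch itself uses Java 7 syntax for brevity, so real
Java 5 code would have to catch the individual reflection exceptions
instead:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; not part of Commons Compress.
public class SyncFlushProbe {

    /** Returns the (OutputStream, boolean) constructor, or null if the runtime lacks it. */
    static Constructor<GZIPOutputStream> syncFlushConstructor() {
        try {
            return GZIPOutputStream.class.getConstructor(OutputStream.class, boolean.class);
        } catch (NoSuchMethodException e) {
            return null; // pre-Java 7 class library
        }
    }

    /** Opens a gzip stream with sync flushing if available, else a plain one. */
    static GZIPOutputStream open(OutputStream out) throws IOException {
        Constructor<GZIPOutputStream> ctor = syncFlushConstructor();
        if (ctor == null) {
            return new GZIPOutputStream(out);
        }
        try {
            return ctor.newInstance(out, Boolean.TRUE);
        } catch (InvocationTargetException e) {
            // The constructor itself failed, e.g. while writing the gzip header.
            if (e.getCause() instanceof IOException) {
                throw (IOException) e.getCause();
            }
            throw new IOException(e.getCause());
        } catch (ReflectiveOperationException e) {
            return new GZIPOutputStream(out); // fall back to the old behaviour
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gz = open(sink);
        gz.write("hello".getBytes(StandardCharsets.UTF_8));
        gz.flush();
        // The gzip header alone is 10 bytes; anything beyond that is flushed data.
        System.out.println("sync flush available: " + (syncFlushConstructor() != null));
        System.out.println("bytes in sink after flush(): " + sink.size());
        gz.close();
    }
}
```

On a runtime without the constructor the stream simply behaves as before,
so flush() would still be a no-op for the compressed data there.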

> (3) BZip2CompressorOutputStream has finalize() that finishes a stream
>     that hasn't been explicitly finished or closed. This doesn't seem
>     useful. GzipCompressorOutputStream doesn't have an equivalent
>     finalize().

Removing it could cause backwards compatibility issues.  I agree it is
unnecessary but would leave fixing it until we are willing to break
compatibility - i.e. 2.0.  To me this is in the same category as
<https://issues.apache.org/jira/browse/COMPRESS-128>.

> (4) The decompressor streams don't support concatenated .gz and .bz2
>     files. This can be OK when compressed data is used inside another
>     file format or protocol, but with regular (standalone) .gz and
>     .bz2 files it is bad to stop after the first compressed stream
>     and silently ignore the remaining compressed data.

>     Fixing this in BZip2CompressorInputStream should be relatively
>     easy because it stops right after the last byte of the compressed
>     stream.

Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

>     Fixing GzipCompressorInputStream is harder because the problem is
>     inherited from java.util.zip.GZIPInputStream which reads input
>     past the end of the first stream. One might need to reimplement
>     .gz container support on top of java.util.zip.InflaterInputStream
>     or java.util.zip.Inflater.

Sounds doable but would need somebody to code it, I guess ;-)
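
For what it's worth, here is a minimal sketch of the idea on top of
java.util.zip.Inflater, assuming the whole input is already in memory.
ConcatenatedGzip and its methods are hypothetical names, this is not how
Commons Compress does it, and a real implementation would have to work on
streams.  The point is that Inflater.getRemaining() tells us where one
member ends, so the next header can be parsed from there:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical sketch; header layout follows the gzip format (RFC 1952).
public class ConcatenatedGzip {

    /** Decompresses every gzip member in 'data' and returns the joined output. */
    static byte[] decompressAll(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int pos = 0;
        while (pos < data.length) {
            pos = skipHeader(data, pos);
            Inflater inf = new Inflater(true); // true = raw deflate, no zlib wrapper
            try {
                inf.setInput(data, pos, data.length - pos);
                CRC32 crc = new CRC32();
                while (!inf.finished()) {
                    int n = inf.inflate(buf);
                    if (n == 0 && (inf.needsInput() || inf.needsDictionary())) {
                        throw new IOException("truncated deflate data");
                    }
                    crc.update(buf, 0, n);
                    out.write(buf, 0, n);
                }
                // Unconsumed input starts right after the deflate data: the 8-byte trailer.
                pos = data.length - inf.getRemaining();
                if (data.length - pos < 8) {
                    throw new IOException("truncated gzip trailer");
                }
                long size = inf.getBytesWritten() & 0xffffffffL;
                if (readLE32(data, pos) != crc.getValue() || readLE32(data, pos + 4) != size) {
                    throw new IOException("gzip trailer mismatch");
                }
                pos += 8;
            } catch (DataFormatException e) {
                throw new IOException("corrupt deflate data", e);
            } finally {
                inf.end();
            }
        }
        return out.toByteArray();
    }

    /** Validates a member header at 'p' and returns the offset of the deflate data. */
    static int skipHeader(byte[] d, int p) throws IOException {
        if (d.length - p < 10 || (d[p] & 0xff) != 0x1f || (d[p + 1] & 0xff) != 0x8b || d[p + 2] != 8) {
            throw new IOException("not a gzip member at offset " + p);
        }
        int flags = d[p + 3] & 0xff;
        p += 10; // magic, method, flags, mtime, extra flags, OS
        try {
            if ((flags & 0x04) != 0) p += 2 + ((d[p] & 0xff) | (d[p + 1] & 0xff) << 8); // FEXTRA
            if ((flags & 0x08) != 0) { while (d[p++] != 0) { } }                        // FNAME
            if ((flags & 0x10) != 0) { while (d[p++] != 0) { } }                        // FCOMMENT
            if ((flags & 0x02) != 0) p += 2;                                            // FHCRC
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new IOException("truncated gzip header");
        }
        if (p > d.length) throw new IOException("truncated gzip header");
        return p;
    }

    /** Reads an unsigned 32-bit little-endian value. */
    static long readLE32(byte[] d, int p) {
        return (d[p] & 0xffL) | (d[p + 1] & 0xffL) << 8
             | (d[p + 2] & 0xffL) << 16 | (d[p + 3] & 0xffL) << 24;
    }

    /** Test helper: compresses 'raw' into a single gzip member. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream two = new ByteArrayOutputStream();
        two.write(gzip("first, ".getBytes(StandardCharsets.UTF_8)));
        two.write(gzip("second".getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(decompressAll(two.toByteArray()), StandardCharsets.UTF_8));
    }
}
```

This sketch treats trailing garbage after the last member as an error; a
real implementation would have to decide whether to warn or ignore it.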

> The XZ compressor supports finish() and flush(). The XZ decompressor
> supports concatenated .xz files, but there is also a single-stream
> version that behaves similarly to the current version of
> BZip2CompressorInputStream.

I think in the 1.x timeframe users who know they are using XZ would
simply bypass the Commons Compress interfaces, like they'd do now if
they wanted to flush the bzip2 stream.  The main difference is that in
that case they wouldn't need Commons Compress at all and could use your
XZ package directly.  They don't have that choice with bzip2.

> Assuming that there will be some interest in adding XZ support into
> Commons Compress, is it OK to make Commons Compress depend on the XZ
> package org.tukaani.xz, or should the XZ code be modified so that
> it could be included as an internal part in Commons Compress?

> I would prefer depending on org.tukaani.xz because then there is just
> one code base to keep up to date.

In the past we have incorporated external codebases (ar and cpio) that
used to be under compatible licenses to make things simpler for our
users, but if you prefer to develop your code base outside of Commons
Compress then I can fully understand that.

From a license POV we obviously wouldn't have any problems with your
public domain code.  From the dependency management POV I know many
developers prefer dependencies that are available from a Maven
repository.  Is this the case for the org.tukaani.xz package?  (I'm too
lazy to check.)  I'm an Ant person myself, but you know there are those
people who love repositories ...

Also I would have a problem with an external dependency on code that
says "The APIs aren't completely stable yet".  Any tentative timeframe
as to when you expect to have a stable API?  It might match our schedule
for 2.x so we could target that release rather than 1.3.

Cheers

        Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

