hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13578) Add Codec for ZStandard Compression
Date Mon, 14 Nov 2016 19:17:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664738#comment-15664738 ]

Jason Lowe commented on HADOOP-13578:

The problem with ignoring the Codec interface is that code using the Compressor/Decompressor
interfaces will break if it happens to get the zstd codec yet work for all others.  That's
not really adding a full Codec for zstd.  I agree it's not the ideal interface for the
way zstd wants to be called, but that doesn't change the fact that existing code that should
work as-is on any Hadoop-provided codec will not if we don't implement the Compressor/Decompressor
interfaces.
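
For context, the kind of caller that breaks is the generic driver loop written against the Compressor interface, which works unchanged for every bundled codec today.  Here is a minimal sketch of that pattern; the interface below is a simplified stand-in for org.apache.hadoop.io.compress.Compressor (the real one has more methods), so treat the shape as illustrative rather than the exact API:

```java
// Simplified stand-in for org.apache.hadoop.io.compress.Compressor,
// reduced to the methods the generic driver loop below relies on.
interface Compressor {
    void setInput(byte[] b, int off, int len);
    boolean needsInput();
    void finish();
    boolean finished();
    int compress(byte[] b, int off, int len);
}

public class CompressorLoop {
    // Generic driver code of the kind that already exists against the
    // Compressor interface: it works for every codec Hadoop ships, and
    // would break at runtime if the zstd codec left these methods
    // unimplemented.
    public static int drive(Compressor c, byte[] input, byte[] outBuf) {
        c.setInput(input, 0, input.length);
        c.finish();
        int produced = 0;
        while (!c.finished()) {
            produced += c.compress(outBuf, 0, outBuf.length);
        }
        return produced;
    }
}
```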

It looks like the zstd code already includes a wrapper that translates the zlib API into the
zstd API (see https://github.com/facebook/zstd/tree/dev/zlibWrapper) which we can maybe leverage
or at least use as a model for the Compressor/Decompressor implementations.  To be clear,
I think it's a good thing that the compressor/decompressor streams are using zstd more natively,
but I also think it's important to not ignore the non-stream Codec APIs.  I'm also OK if the
initial implementation of the zstd Compressor/Decompressor is not super optimal.  One tactic
would be to always copy the input data to an internal input buffer and track the amount of
data zstd wants.  Then the needsInput method is easy, just return true when the amount of
data in the internal input buffer is less than what zstd last requested.  The decompress method
would return 0 if needsInput would return true.  Not super efficient since we make a copy of
all the input data before passing it to zstd, but it should be relatively straightforward
to implement.
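
To sketch that tactic in plain Java (the class and field names here are hypothetical, not from the patch, and the actual JNI call into libzstd is elided):

```java
import java.io.ByteArrayOutputStream;

// Sketch of the "always copy the input" tactic described above, with plain
// Java buffering standing in for the real JNI-backed implementation.
public class ZstdDecompressorSketch {
    // Internal copy of all input handed to us by setInput.
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    // What the zstd stream last said it wants next (the return value of
    // ZSTD_decompressStream in the real implementation); start with 1 so
    // the very first call asks for input.
    private int lastRequested = 1;

    public void setInput(byte[] b, int off, int len) {
        // Copy the caller's bytes into the internal input buffer.
        pending.write(b, off, len);
    }

    // Easy once everything is buffered internally: we need input whenever
    // we hold fewer bytes than zstd last asked for.
    public boolean needsInput() {
        return pending.size() < lastRequested;
    }

    public int decompress(byte[] b, int off, int len) {
        if (needsInput()) {
            return 0;  // caller must supply more input first
        }
        // Real code: feed the pending bytes to ZSTD_decompressStream via
        // JNI, record its return value in lastRequested, and copy the
        // produced output into b.  Elided in this sketch, which produces
        // no output.
        return 0;
    }
}
```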

I'm also not so sure about the minimum buffer assertion.  I saw in the zstd unit tests there
is a byte-by-byte streaming decompression test where it tries to decompress a buffer with
a 1-byte input buffer and a 1-byte output buffer.  I also verified this independently by applying
the following changes to the streaming_decompression.c example:
$ diff -u streaming_decompression.c.orig streaming_decompression.c
--- streaming_decompression.c.orig	2016-11-14 18:39:43.423069894 +0000
+++ streaming_decompression.c	2016-11-14 19:08:36.395286748 +0000
@@ -63,10 +63,10 @@
 static void decompressFile_orDie(const char* fname)
 {
     FILE* const fin  = fopen_orDie(fname, "rb");
-    size_t const buffInSize = ZSTD_DStreamInSize();
+    size_t const buffInSize = 1;
     void*  const buffIn  = malloc_orDie(buffInSize);
     FILE* const fout = stdout;
-    size_t const buffOutSize = ZSTD_DStreamOutSize();  /* Guarantee to successfully flush at least one complete compressed block in all circumstances. */
+    size_t const buffOutSize = 1;
     void*  const buffOut = malloc_orDie(buffOutSize);
     ZSTD_DStream* const dstream = ZSTD_createDStream();
@@ -80,6 +80,7 @@
         while (input.pos < input.size) {
             ZSTD_outBuffer output = { buffOut, buffOutSize, 0 };
             toRead = ZSTD_decompressStream(dstream, &output , &input);  /* toRead : size of next compressed block */
+	    toRead = (toRead > buffInSize) ? buffInSize : toRead;
             if (ZSTD_isError(toRead)) { fprintf(stderr, "ZSTD_decompressStream() error : %s \n", ZSTD_getErrorName(toRead)); exit(12); }
             fwrite_orDie(buffOut, output.pos, fout);

With those changes the streaming_decompression example was still able to decompress zstd files,
even though it was using extremely small input and output buffers.  So even though I'm not
giving the zstd library a full block of input, it's able to compensate.

bq. The finished field is more to reinitialize the zstd stream than the actual java decompressor

Ah, I see that now.  Thanks for the explanation.

bq.  So we could allow the user to configure and take the larger of the 2 buffer sizes in
case they configure it with a buffer size that is too small.

See input/output buffer discussion above.  I don't think we need to apply safety rails to
what the user specifies because the zstd library apparently can accommodate.

> Add Codec for ZStandard Compression
> -----------------------------------
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch, HADOOP-13578.v2.patch
> ZStandard: https://github.com/facebook/zstd has been used in production for 6 months by Facebook now.  v1.0 was recently released.  Create a codec for this library.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
