hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13578) Add Codec for ZStandard Compression
Date Thu, 03 Nov 2016 17:57:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633647#comment-15633647 ]

Jason Lowe commented on HADOOP-13578:
-------------------------------------

Thanks for updating the patch!  It's great to see compatibility between the zstd CLI and the
Hadoop codec.

I'm confused why the codec returns null for getCompressorType, createCompressor, getDecompressorType,
and createDecompressor.  This is going to break code that expects to wield the compressor/decompressor
rather than streams directly.  I would expect it to work more like the DefaultCodec or BZip2Codec
where the codec returns appropriate Compressor and Decompressor classes and leverages the
existing CompressorStream, DecompressorStream, and CompressionCodec.Util classes to handle
the stream interfaces of the codec.  For example, the CompressorStream abstract class already
handles the single-byte write method and protection from double-finish and double-close, and
DecompressorStream already implements skip, etc.
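
For illustration, here is roughly the shape I'd expect, modeled on the existing codecs (a
sketch only; ZStandardCompressor is the hypothetical Compressor implementation backing the
codec):
{code}
@Override
public Class<? extends Compressor> getCompressorType() {
  return ZStandardCompressor.class;
}

@Override
public Compressor createCompressor() {
  return new ZStandardCompressor();
}

@Override
public CompressionOutputStream createOutputStream(OutputStream out,
    Compressor compressor) throws IOException {
  // CompressorStream supplies the single-byte write() and the
  // double-finish/double-close protection for free; bufferSize is the
  // codec's configured I/O buffer size.
  return new CompressorStream(out, compressor, bufferSize);
}

@Override
public CompressionInputStream createInputStream(InputStream in)
    throws IOException {
  return CompressionCodec.Util.createInputStreamWithCodecPool(this, conf, in);
}
{code}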

The original approach used direct byte buffers, but the new patch no longer does.  All the
other codecs leverage direct byte buffers, so I'm curious about the reasoning for that change.
 I'm not a JVM expert, but I'm wondering if the *PrimitiveArrayCritical methods have unfavorable
impacts on other threads in the JVM (e.g.: due to pausing GC or other effects).  Given that
GetPrimitiveArrayCritical could trigger an array copy in order to perform its task and we
have to copy the bytes from the output buffer into the output stream anyway, I would expect
direct byte buffers to be faster for at least the output buffer case.
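
For reference, the direct-buffer pattern in the existing codecs looks roughly like this (a
sketch; deflateBytesDirect() stands in for whatever native entry point the zstd compressor
would use, and error handling is elided):
{code}
// Direct buffers have a stable native address, so the JNI side can call
// GetDirectBufferAddress() once instead of Get/ReleasePrimitiveArrayCritical
// on every invocation, and the GC is never pinned.
private ByteBuffer compressedDirectBuf =
    ByteBuffer.allocateDirect(directBufferSize);

private native int deflateBytesDirect();

public synchronized int compress(byte[] b, int off, int len)
    throws IOException {
  compressedDirectBuf.clear();
  int n = deflateBytesDirect();   // native code fills the direct buffer
  compressedDirectBuf.limit(n);
  n = Math.min(n, len);
  compressedDirectBuf.get(b, off, n);   // the copy we pay for either way
  return n;
}
{code}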

Speaking of double-finish, it looks like that could be problematic and lead to double-frees
in the native code layer.  In addition to leveraging CompressorStream/DecompressorStream as
mentioned above to help with this, we could zero the stream field after we are finished and
check for a non-null stream context before doing operations.
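
Something like this would make a second finish/end a harmless no-op (a sketch; freeStream()
is a hypothetical native method that frees the zstd context):
{code}
private long stream;   // native zstd context, 0 once freed

private void checkStream() {
  if (stream == 0) {
    throw new NullPointerException("Stream not initialized");
  }
}

public synchronized void end() {
  if (stream != 0) {
    freeStream(stream);   // free the native context exactly once
    stream = 0;           // later calls see 0 and do nothing
  }
}
{code}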

We should throw NPE if the caller passes a null for the source data buffer.
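
i.e., the same argument checking the other codecs do:
{code}
public synchronized void setInput(byte[] b, int off, int len) {
  if (b == null) {
    throw new NullPointerException();
  }
  if (off < 0 || len < 0 || off > b.length - len) {
    throw new ArrayIndexOutOfBoundsException();
  }
  // ... stage the input as usual ...
}
{code}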

Similarly the native code should throw an error if it cannot obtain pointers to the Java buffers.
 Currently it just silently returns no progress, which in practice results in an infinite
loop: the caller keeps trying to reach the end of the input and never gets there because
the error recurs on every JNI invocation.

The documentation for the zstd streaming interface mentions that flushing or ending a stream
may require multiple invocations in order to fully accomplish the task, but I don't see corresponding
loops in the Java code to handle that scenario:
{noformat}
*  At any moment, it's possible to flush whatever data remains within buffer, using ZSTD_flushStream().
*  `output->pos` will be updated.
*  Note some content might still be left within internal buffer if `output->size` is too
small.
*  @return : nb of bytes still present within internal buffer (0 if it's empty)
*            or an error code, which can be tested using ZSTD_isError().
*
*  ZSTD_endStream() instructs to finish a frame.
*  It will perform a flush and write frame epilogue.
*  The epilogue is required for decoders to consider a frame completed.
*  Similar to ZSTD_flushStream(), it may not be able to flush the full content if `output->size`
is too small.
*  In which case, call again ZSTD_endStream() to complete the flush.
{noformat}
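
I'd expect the finish path to loop until the library reports completion, along these lines
(a sketch with error handling elided; endStream() is a hypothetical JNI wrapper around
ZSTD_endStream() that fills compressedDirectBuf, sets its limit to the bytes produced, and
returns the bytes still pending, while buffer/out are the codec's copy buffer and the
underlying stream):
{code}
private void finishStream() throws IOException {
  long remaining;
  do {
    compressedDirectBuf.clear();
    remaining = endStream(stream);   // wraps ZSTD_endStream()
    int produced = compressedDirectBuf.limit();
    compressedDirectBuf.get(buffer, 0, produced);
    out.write(buffer, 0, produced);
  } while (remaining > 0);   // the epilogue may take several calls to flush
}
{code}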

IOUtils.closeQuietly is being called on the output stream, which could lead to silent data
corruption.  If there is an issue writing out the last bits of data as a result of the close
call, this will silently eat the error, and the user is left with a "successful" operation
that did not actually succeed.
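
Letting close() propagate the failure is all that's needed, e.g.:
{code}
@Override
public void close() throws IOException {
  try {
    finish();      // flush the remaining compressed data; may throw
  } finally {
    out.close();   // propagate failures instead of eating them
  }
}
{code}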

The getLibraryName native method should do something sane for the non-UNIX case, like returning
the Unavailable string as some codecs do when they can't compute it in the UNIX case.  Bonus
points for adding a WINDOWS case, and it looks like we can model it after the Windows implementations
of getLibraryName in the other codecs.

In the decompressor, why are we finished when we decode a frame?  I thought it was valid for
a ZStandard compression stream to be made up of one or more frames, so if there is more input
after decoding a frame it seems like the prudent thing to do is try decoding another frame.
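
Roughly (a sketch; the field names are hypothetical):
{code}
// A frame boundary only ends the stream if the input is exhausted too.
if (frameFinished) {
  if (inputBufferOffset < endOffset) {
    frameFinished = false;   // more input remains: decode the next frame
  } else {
    finished = true;
  }
}
{code}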

It would be nice to allow the input and output buffer sizes to be configurable.  When used
as part of wide merge sorts, there are going to be a lot of codec instances.  It may be necessary
to use a smaller buffer size per codec to reduce the merge's memory footprint due to all those
codec I/O buffers.  It makes sense to use the library-recommended size as the default value,
but it should be straightforward to allow the user to override this.
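
Something along these lines would do (the config key name here is just a suggestion, and
getRecommendedBufferSize() is a hypothetical wrapper around ZSTD_CStreamOutSize()):
{code}
int bufferSize = conf.getInt(
    "io.compression.codec.zstd.buffersize",   // user override
    getRecommendedBufferSize());              // library-recommended default
{code}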

Nit: srcSize is not really a size in the following code but rather a terminating offset.
Renaming it to something like endOffset would be clearer, since we're comparing it against
inputBufferOffset.  Same comment for dstSize in the decompressor.
{code}
    int srcSize = offset + len;
    inputBufferOffset = offset;
    while (inputBufferOffset < srcSize) {
{code}
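
That is:
{code}
    int endOffset = offset + len;
    inputBufferOffset = offset;
    while (inputBufferOffset < endOffset) {
{code}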

Nit: The whitespace was removed around this line in the ASF license header, and it would be
good to restore it so it's consistent with the ASF headers in other files.
{noformat}
 * http://www.apache.org/licenses/LICENSE-2.0
{noformat}


> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch, HADOOP-13578.v2.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production for 6 months
> by facebook now.  v1.0 was recently released.  Create a codec for this library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

