hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13578) Add Codec for ZStandard Compression
Date Thu, 01 Dec 2016 21:27:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713129#comment-15713129
] 

Jason Lowe commented on HADOOP-13578:
-------------------------------------

Thanks for updating the patch!

I'm torn on the buffer size, whether we should use the generic io.file.buffer.size or have
a zst-specific config.  If the codec can significantly speed up with a larger default buffer
size than the file buffer size then maybe we should consider having a separate, zst-specific
buffer size config.  I'd recommend a zst-specific config key with a default value of 0 that
means use whatever buffer size the zst library wants to use, but the user can override it
to a custom size.  That way users can change the zst codec buffer size (again, useful for
keeping memory usage reasonable for very wide sort factor merges) but not change the buffer
sizes of anything else.  Similarly, someone changing the default buffer size to something
relatively small (e.g.: 4K) for some unrelated use case could unknowingly hamper the performance
of the zst codec.

Regarding the test input file, is it copyrighted or do we otherwise have the redistribution
rights?  Also the .zst binary test file is missing from the patch.

*ZStandardCodec.java*
Nit: should import the specific things needed rather than wildcard imports.

*ZStandardCompressor.java*
reinit calls reset then init, but reset already is calling init.

LOG should be private rather than public.

The LOG.isDebugEnabled check is unnecessary.

Nit: The VisibleForTesting annotation should be on a separate line like the other annotations.

*ZStandardCompressor.c:*
Shouldn't we throw if GetDirectBufferAddress returns null?

>From the zstd documentation, ZSTD_compressStream may not always fully consume the input.
 Therefore if the finish flag is set in deflateBytesDirect I think we need to avoid calling
ZTD_endStream if there is still bytes left in the input or we risk dropping some of the input.
 We can just return without setting the finished flag and expect to get called again from
the outer compression loop logic.

Similarly the documentation for ZSTD_endStream says that it may not be able to flush the full
contents of the data in one call, and if it indicates there is more data left to flush then
we should call it again.  So if we tried to end the stream and there's more data left then
we shouldn't throw an error.  Instead we simply leave the finished flag unset and return so
the outer logic loop will empty the output buffer and call compress() again.  We will then
again call ZSTD_endStream to continue attempting to finish the compression.  Bonus points
for adding a test case that uses a 1-byte output buffer to demonstrate the extreme case is
working properly.

Is it necessary to call ZSTD_flushStream after every ZSTD_compressStream call?  Seems to me
we can skip it entirely, but I might be missing something.

*ZStandardDecompressor.java:*
 LOG should be private rather than public.

Should there be a checkStream call at the beginning of decompress()?  This would mirror the
same check in the compressor for the corresponding compress() method.

In inflateDirect, the setting of preSliced = dst in the conditional block is redundant since
it was just done before the block.  Actually I think it would be faster and create less garbage
to avoid creating a temporary ByteBuffer and have the JNI code to lookup the output buffer's
position and limit and adjust accordingly.  That way we aren't creating an object each time.

The swapping of variables in inflateDirect is cumbersome.  It would be much cleaner if inflateBytesDirect
took arguments so we can pass what the JNI needs directly to it rather than shuffling the
parameters into the object's fields before we call it. That also cleans up the JNI code a
bit since it doesn't need to do many of the field lookups.  For example:
{code}
  private native int inflateBytesDirect(ByteBuffer src, int srcOffset, int srcLen, ByteBuffer
dst, int dstOffset, int dstLen);
{code}
We could simplify the variables even more if the JNI code called the ByteBuffer methods to
update the positions and limits directly rather than poking the values into the Java object
and having Java do it when the JNI returns.  I'll leave the details up to you, just pointing
out that we can clean this up and avoid a lot of swapping.  We could do a similar thing on
the compressor side.  

I'm a little confused about the endOfInput flag for the direct decompressor.  I assume it's
for handling the case where there are multiple frames within the input buffer.  The JNI code
will set finished = true once it hits the end of the frame, and it appears it will remain
that way even if more input is sent (i.e.: finished will go from false to true after the first
frame in the input buffer and remain true throughout all subsequent frames).  It makes me
wonder if we should be setting finished to false if we didn't see the end of a frame this
iteration?

*ZStandardDecompressor.c*
The handling of result codes from ZSTD_freeCStream is different between the compressor and
decompressor.  Not sure a non-zero result means unflushed data, and I'd recommend updating
the handling here to match what is done in the compressor JNI code.

Shouldn't we throw if GetDirectBufferAddress returns null?

The x ^ ((x ^ y) & -(x < y)) expression is clever, but it's probably too smart for
many compilers to realize what this is and generate the optimal code for max() on that platform.
 I tested this on gcc for x64 and it generates smaller, better code if written as a naive
implementation than with this expression.  Therefore I recommend we just do the simple thing
here (i.e.: ternary operator or plain ol' if statement) which will likely be faster in practice
(because compilers can easily detect and optimize for it) and easier to understand.

*TestZStandardCompressorDecompressor.java:*
org.junit.Ignore import is unnecessary

Nit: Should not do wildcard imports but only import what is necessary

uncompressedFile and compressedFile lookups only need to be done once, so we can do this in
the default constructor rather than before every test in the Before method (and those fields
could be marked final).


> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch, HADOOP-13578.v2.patch,
HADOOP-13578.v3.patch, HADOOP-13578.v4.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production for 6 months
by facebook now.  v1.0 was recently released.  Create a codec for this library.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message