hadoop-common-issues mailing list archives

From "churro morales (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-13578) Add Codec for ZStandard Compression
Date Thu, 06 Oct 2016 03:22:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550599#comment-15550599 ]

churro morales edited comment on HADOOP-13578 at 10/6/16 3:21 AM:
------------------------------------------------------------------

[~jlowe] thank you for the thorough review.  The reason that the zstd CLI and Hadoop can't read each other's compressed / decompressed data is that ZStandardCodec uses the Block(Compressor|Decompressor)Stream.  I had assumed we wanted to compress at the HDFS block level.  When you use this stream, each HDFS block gets a header and some compressed data.  I believe the 8 bytes you are referring to are two ints (the sizes of the compressed and uncompressed block).  If you remove these headers, the CLI will be able to read the zstd blocks; conversely, if you compress a file with the zstd CLI and prepend the size header, it will work in Hadoop.
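
To make the framing concrete: assuming each block is laid out as a 4-byte big-endian uncompressed length, a 4-byte big-endian compressed length, and then the compressed bytes (and that each block was compressed as a single chunk), a small filter like the sketch below could strip the headers so the zstd CLI can read what remains.  This is purely illustrative and is not code from the patch.

{code}
/* Illustrative only: strip the assumed per-block framing
 * [4-byte uncompressed length][4-byte compressed length][compressed bytes]
 * and emit just the compressed bytes for the zstd CLI.
 * Assumes one compressed chunk per block. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

static int read_be32(FILE *in, uint32_t *out) {
    unsigned char b[4];
    if (fread(b, 1, 4, in) != 4) return 0;
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 1;
}

int main(void) {
    uint32_t rawLen, compLen;
    /* Each iteration drops one 8-byte header and copies the zstd payload. */
    while (read_be32(stdin, &rawLen) && read_be32(stdin, &compLen)) {
        unsigned char *buf = malloc(compLen);
        if (buf == NULL || fread(buf, 1, compLen, stdin) != compLen) {
            free(buf);
            return 1;
        }
        fwrite(buf, 1, compLen, stdout);
        free(buf);
    }
    return 0;
}
{code}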

The Snappy compressor / decompressor works the same way: I do not believe you can compress in Snappy format using Hadoop, then transfer the file locally and call Snappy.uncompress() without removing the headers.

If we do not want this to be compressed at a block level, that is fine.  Otherwise, we can add a utility in Hadoop to take care of the block headers, as was done with hadoop-snappy and some of the Snappy CLI tools such as snzip.
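
For the other direction, such a utility could wrap a frame produced by the zstd CLI with the two size ints so a block-stream style reader can consume it.  Again a rough sketch under the single-chunk assumption above, using the v1.0-era ZSTD_getDecompressedSize() call to fill in the uncompressed length; this is not the actual utility being proposed.

{code}
/* Illustrative only: prepend the assumed 8-byte header (two big-endian
 * ints: uncompressed length, then compressed length) to a raw zstd frame
 * read from stdin, so a block-stream style reader can consume it. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <zstd.h>

static void write_be32(FILE *out, uint32_t v) {
    unsigned char b[4] = {
        (unsigned char)(v >> 24), (unsigned char)(v >> 16),
        (unsigned char)(v >> 8),  (unsigned char)(v)
    };
    fwrite(b, 1, 4, out);
}

int main(void) {
    /* Slurp the whole compressed frame from stdin. */
    size_t cap = 1 << 20, len = 0, n;
    unsigned char *buf = malloc(cap);
    if (buf == NULL) return 1;
    while ((n = fread(buf + len, 1, cap - len, stdin)) > 0) {
        len += n;
        if (len == cap && (buf = realloc(buf, cap *= 2)) == NULL) return 1;
    }

    /* zstd records the original size in the frame header;
     * this may be 0 if the frame does not carry it. */
    unsigned long long raw = ZSTD_getDecompressedSize(buf, len);

    write_be32(stdout, (uint32_t) raw);  /* uncompressed block length */
    write_be32(stdout, (uint32_t) len);  /* compressed chunk length   */
    fwrite(buf, 1, len, stdout);
    free(buf);
    return 0;
}
{code}

Something along the lines of cc wrap.c -lzstd would build it; whether that lives in Hadoop or stays a standalone tool like snzip is the question above.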

As for the decompressed bytes, I agree.  I will check that the size returned from the function that reports how many bytes are needed to decompress the buffer is not larger than our buffer size.  I can also add the isError and getErrorName checks to ZStandardDecompressor.c.  The reason I explicitly checked whether the expected size equaled the desired size is that the error zstd provided was too vague, but I'll add them in case there are other errors.
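
Concretely, the checks I have in mind would look roughly like the sketch below (checked_decompress is a made-up name for illustration, not what is in ZStandardDecompressor.c):

{code}
/* Sketch of the extra checks: verify the frame's declared decompressed
 * size fits our buffer, and surface zstd's own error text on failure. */
#include <stdio.h>
#include <zstd.h>

static int checked_decompress(void *dst, size_t dstCapacity,
                              const void *src, size_t srcSize) {
    /* v1.0-era call; returns 0 when the frame does not record its size. */
    unsigned long long expected = ZSTD_getDecompressedSize(src, srcSize);
    if (expected == 0 || expected > dstCapacity) {
        fprintf(stderr, "frame size unknown or larger than buffer\n");
        return -1;
    }

    size_t result = ZSTD_decompress(dst, dstCapacity, src, srcSize);
    if (ZSTD_isError(result)) {
        /* Much more specific than a generic "wrong size" message. */
        fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(result));
        return -1;
    }
    return (int) result;
}
{code}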


Yes, I will look at HADOOP-13684.  The build of the codec is very similar to Snappy's: because the license is BSD, we could package it in like Snappy, so I basically copied the build / packaging code and adapted it for ZStandard.

I can take care of the nits you described as well.

Are we okay with the compression being at the block level?  If so, this implementation will work just like all of the other block compression codecs, adding / requiring a header for each HDFS block.

Thanks again for the review.  




was (Author: churromorales):
[~jlowe] thank you for the thorough review.  The reason that the zstd CLI and Hadoop can't read each other's compressed / decompressed data is that ZStandardCodec uses the Block(Compressor|Decompressor) stream.  I assumed this library would be used to compress large amounts of data, so when you use this stream each block gets a header and some compressed data.  I believe the 8 bytes you are referring to are two ints (the sizes of the compressed and uncompressed block).  If you remove these headers, the CLI will be able to read the zstd blocks; conversely, if you compress a file with the zstd CLI and prepend the size header, it will work in Hadoop.

The Snappy compressor / decompressor works the same way: I do not believe you can compress in Snappy format using Hadoop, then transfer the file locally and call Snappy.uncompress() without removing the headers.

If we do not want this to be compressed at a block level, that is fine.  Otherwise, we can add a utility in Hadoop to take care of the block headers, as was done with hadoop-snappy and some of the Snappy CLI tools such as snzip.

As for the decompressed bytes, I agree.  I will check that the size returned from the function that reports how many bytes are needed to decompress the buffer is not larger than our buffer size.  I can also add the isError and getErrorName checks to the decompression library.  The reason I explicitly checked whether the expected size equaled the desired size is that the error zstd provided was too vague, but I'll add them in case there are other errors.

Yes, I will look at HADOOP-13684.  The build of the codec is very similar to Snappy's: because the license is BSD, we could package it in like Snappy.

I can take care of the nits you described as well.

Are we okay with the compression being at the block level?  If so, this implementation will work just like all of the other block compression codecs, adding / requiring the header for the Hadoop blocks.

Thanks again for the review.  



> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production for 6 months by Facebook now.  v1.0 was recently released.  Create a codec for this library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

