hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7845) Compress block reports
Date Sat, 28 Feb 2015 23:12:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341822#comment-14341822 ]

Todd Lipcon commented on HDFS-7845:

One quick hint here that may help: for a series of ints like block IDs, just using plain Snappy/LZ4
isn't likely to make a big difference (the longest repeated run is pretty much bounded by the
size of the integer, because every block ID is different). However, you could likely get
a very good improvement by doing something like the following:

1) On the DN, sort the blocks by ascending block ID before doing the block report. This only
happens on the DN side, so it's easy to scale and doesn't consume NN CPU.
2) Shuffle the resulting array so that you have all of the MSBs, followed by all of the second
most significant bits, etc. Essentially, you're converting to a columnar layout where each
bit position within the ints is a column. This can be done very efficiently with SSE instructions
with a bit of JNI (similar throughput to memcpy). The result is likely to have long runs of
1 or 0 bits if the input block IDs are clustered around certain sets of values.
3) Run the result through LZ4 or Snappy.
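
The three steps above can be sketched roughly as follows. This is only an illustration in pure Python, not the SSE/JNI implementation suggested above: the names bit_shuffle and bit_unshuffle are made up for this sketch, the block IDs are synthetic values clustered in an arbitrary range, and zlib from the standard library stands in for LZ4 or Snappy because it needs no extra dependency.

```python
import random
import struct
import zlib


def bit_shuffle(values, bits=64):
    """Transpose unsigned ints into bit planes, most significant plane first.

    Hypothetical helper for illustration; each plane is padded with zero
    bits up to a whole byte.
    """
    out = bytearray()
    for bit in range(bits - 1, -1, -1):
        acc, filled = 0, 0
        for v in values:
            acc = (acc << 1) | ((v >> bit) & 1)
            filled += 1
            if filled == 8:
                out.append(acc)
                acc, filled = 0, 0
        if filled:
            out.append(acc << (8 - filled))  # left-align a partial byte
    return bytes(out)


def bit_unshuffle(data, count, bits=64):
    """Inverse of bit_shuffle; count is the number of original values."""
    plane_bytes = (count + 7) // 8
    values = [0] * count
    for plane in range(bits):
        base = plane * plane_bytes
        for j in range(count):
            byte = data[base + (j >> 3)]
            bit = (byte >> (7 - (j & 7))) & 1
            values[j] = (values[j] << 1) | bit
    return values


# Synthetic, clustered block IDs (an assumption made for this demo only).
rng = random.Random(0)
base = 1 << 30
block_ids = sorted(rng.sample(range(base, base + 1_000_000), 4096))  # step 1

raw = struct.pack(">%dQ" % len(block_ids), *block_ids)  # plain 64-bit layout
shuffled = bit_shuffle(block_ids)                       # step 2

raw_z = zlib.compress(raw)            # baseline: compress the plain layout
shuffled_z = zlib.compress(shuffled)  # step 3, with zlib as a stand-in codec

assert bit_unshuffle(shuffled, len(block_ids)) == block_ids
print(len(raw), len(raw_z), len(shuffled_z))
```

On input like this, the upper bit planes are constant and the middle ones form long runs, so the shuffled buffer should compress to fewer bytes than the plain layout. Actual ratios depend on how the real block IDs are distributed and on the codec used.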

You could optionally insert a differential encoding step between (1) and (2) which would probably
improve the compression ratio with little cost.
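
A minimal sketch of that optional differential step, assuming the IDs are already sorted so every gap is non-negative. The function names and the choice to store the first ID as a delta from zero are assumptions for illustration, and the sample IDs are made up.

```python
from itertools import accumulate


def delta_encode(sorted_ids):
    """Replace each ID with its distance from the previous one."""
    prev = 0
    deltas = []
    for block_id in sorted_ids:
        deltas.append(block_id - prev)
        prev = block_id
    return deltas


def delta_decode(deltas):
    """Running sum restores the original sorted IDs."""
    return list(accumulate(deltas))


ids = [1073741825, 1073741828, 1073741835, 1073741900]  # made-up sample
deltas = delta_encode(ids)

# After the first entry, only the low bits of each value are ever set,
# so a later bit-shuffle would see all-zero planes for the high bits.
widest_gap_bits = max(d.bit_length() for d in deltas[1:])

assert delta_decode(deltas) == ids
print(deltas, widest_gap_bits)
```

The gaps between neighbouring IDs are much smaller than the IDs themselves, which is why this step would be expected to help the later stages.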

I didn't come up with the bit-shuffling idea - you can read more about it at http://www.blosc.org/
which also has some benchmarks showing that it gets very good compression performance and
adds almost no overhead relative to LZ4. It's also significantly faster than vint-encoding
from a CPU standpoint (since vints tend to be branchy).
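
To illustrate the "branchy" remark, here is a generic LEB128-style variable-length integer codec. It is not Hadoop's own vint wire format, and the function names are made up for this sketch. The point is only that both loops make a data-dependent decision on every byte, which is the kind of branching a fixed-layout bit shuffle avoids.

```python
def encode_varint(value):
    """Encode a non-negative int, 7 payload bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:                    # branch taken depends on the data
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # another data-dependent branch
            return result, pos
        shift += 7


encoded = encode_varint(300)
value, next_pos = decode_varint(encoded)
assert (value, next_pos) == (300, 2)
```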

> Compress block reports
> ----------------------
>                 Key: HDFS-7845
>                 URL: https://issues.apache.org/jira/browse/HDFS-7845
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7836
>            Reporter: Colin Patrick McCabe
>            Assignee: Charles Lamb
> We should optionally compress block reports using a low-cpu codec such as lz4 or snappy.

This message was sent by Atlassian JIRA
