hadoop-common-user mailing list archives

From Da Zheng <zhengda1...@gmail.com>
Subject Re: replicate data in HDFS with smarter encoding
Date Tue, 19 Jul 2011 05:37:02 GMT
Hello,

On 07/18/11 21:43, Uma Maheswara Rao G 72686 wrote:
> Hi,
>
> We have already thoughts about it.
No, I think we are talking about different problems. What I'm talking
about is how to reduce the number of replicas while still achieving the
same data reliability. The data being replicated can already be compressed.

To illustrate the problem, here is a more concrete example:
The size of block A is X. After compression, its size is Y. When the
block is written to HDFS, it has to be replicated if we want the data to
be reliable. If the replication factor is R, then R*Y bytes are written
to disk and (R-1)*Y bytes are transmitted over the network.

Now, if we use a better encoding (such as erasure coding) to achieve
data reliability, then for every B blocks of data we only need P parity
blocks. For each block, only (1+P/B)*Y bytes are written to disk and
(P/B)*Y bytes are transmitted over the network; as long as P/B < R-1,
this further reduces the network and disk bandwidth.
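
To make the numbers concrete, here is a small sketch of the per-block
arithmetic. The values Y = 64 MB, R = 3, B = 10 and P = 4 are only
illustrative assumptions (a Reed-Solomon code with 10 data blocks and 4
parity blocks is one common configuration):

public class OverheadSketch {
    // Back-of-the-envelope comparison of per-block overhead, using the
    // formulas from the example above. Y is the compressed block size;
    // R, B and P are illustrative assumptions only.
    public static void main(String[] args) {
        double Y = 64.0; // compressed block size in MB (assumed)
        int R = 3;       // replication factor
        int B = 10;      // data blocks per parity group (assumed)
        int P = 4;       // parity blocks per group (assumed)

        // Plain replication: R*Y to disk, (R-1)*Y over the network.
        double replDisk = R * Y;       // 192.0 MB
        double replNet  = (R - 1) * Y; // 128.0 MB

        // Parity encoding: (1 + P/B)*Y to disk, (P/B)*Y over the network.
        double parityDisk = (1.0 + (double) P / B) * Y; // 89.6 MB
        double parityNet  = ((double) P / B) * Y;       // 25.6 MB

        System.out.printf("replication: disk=%.1f MB, network=%.1f MB%n",
                replDisk, replNet);
        System.out.printf("parity:      disk=%.1f MB, network=%.1f MB%n",
                parityDisk, parityNet);
    }
}

With those assumed numbers, parity encoding writes 89.6 MB instead of
192 MB to disk and sends 25.6 MB instead of 128 MB over the network for
each 64 MB block.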

So what Joey showed me is more relevant, even though it doesn't reduce
the data size before the data is sent over the network or written to disk.

To implement that, I think we would probably not use the write pipeline
any more.

> Looks like you are talking about this features right
> https://issues.apache.org/jira/browse/HDFS-1640
> https://issues.apache.org/jira/browse/HDFS-2115
About your patches, I don't know how useful they can be when we can ask
applications to compress the data themselves. For example, we can enable
mapred.output.compress in MapReduce to have reducers compress their
output. I assume MapReduce is the major user of HDFS.
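
For reference, a minimal sketch of turning that on when a job is
configured (the property names below are from the old mapred API, and
GzipCodec is just one possible codec choice):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask reducers to compress their output before it reaches HDFS.
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.codec",
                "org.apache.hadoop.io.compress.GzipCodec");
        Job job = new Job(conf, "compressed-output-example");
        // ... set mapper, reducer, input and output paths as usual, then submit.
    }
}

(The same thing can also be done through FileOutputFormat.setCompressOutput
and setOutputCompressorClass.)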

Thanks,
Da
