hadoop-general mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Revote: HDFS 0.20.1/HDFS-0.20.2 compatibility
Date Wed, 13 Jan 2010 20:24:25 GMT
Hi all,

Last week we had a vote regarding the compatibility problem introduced in
branch-0.20 by the backport of HDFS-793, necessary for HDFS-101, which fixes
a large bug in the write pipeline recovery code. The majority of people
seemed to indicate that this incompatibility was unacceptable, and thus that
we should pull the backport out.

However, I think everyone agrees that the bug itself is pretty critical, and
it would be good to have it fixed - Hairong indicated that it's likely going
to go to Yahoo's internal customers, and Cloudera would like to include it
as well. In our experience we've run into it several times - whenever there
are a few "bad apple" nodes in a cluster that haven't failed hard, it causes
a lot of write pipeline failures (particularly, any pipeline that picks a
bad node as the first node will not recover). For MapReduce it's not a huge
deal, since the tasks will rerun elsewhere and usually succeed, but for
applications like HBase or continuous logging to HDFS, it's a big problem.

I have taken the time to develop and test a patch for branch-0.20 which goes
on top of HDFS-793 and HDFS-101 but maintains compatibility with 0.20.1.
I've posted this patch and a summary of my testing to HDFS-872. Although
this code is tricky to get right, the hardest parts are the thread
communication and understanding the correct semantics, neither of which I've
touched at all. I think as long as there's a good review of my patch, we
should feel comfortable introducing it into branch-0.20.

