hadoop-general mailing list archives

From Eli Collins <...@cloudera.com>
Subject Re: Revote: HDFS 0.20.1/HDFS-0.20.2 compatibility
Date Wed, 13 Jan 2010 20:41:50 GMT
On Wed, Jan 13, 2010 at 12:24 PM, Todd Lipcon <todd@cloudera.com> wrote:
> Hi all,
> Last week we had a vote regarding the compatibility problem introduced in
> branch-0.20 by the backport of HDFS-793, necessary for HDFS-101, which fixes
> a large bug in the write pipeline recovery code. The majority of people
> seemed to indicate that this incompatibility was unacceptable, and thus we
> should pull it out.
> However, I think everyone agrees that the bug itself is pretty critical, and
> it would be good to have it fixed - Hairong indicated that it's likely going
> to go to Yahoo's internal customers, and Cloudera would like to include it
> as well. In our experience we've run into it several times - whenever there
> are a few "bad apple" nodes in a cluster that haven't failed hard, it causes
> a lot of write pipeline failures (particularly, any pipeline that picks a
> bad node as the first node will not recover). For MapReduce it's not a huge
> deal, since the tasks will rerun elsewhere and usually succeed, but for
> applications like HBase or continuous logging to HDFS, it's a big problem.
> I have taken the time to develop and test a patch for branch-0.20 which goes
> on top of HDFS-793 and HDFS-101 but maintains compatibility with 0.20.1.
> I've posted this patch and a summary of my testing to HDFS-872. Although
> this code is tricky to get right, the hardest parts are with the thread
> communication and understanding the correct semantics, which I've not
> touched at all. I think as long as there's a good review of my patch, we
> should feel comfortable introducing it into branch-0.20.
> Thanks
> -Todd


This is a critical bug; it would have been a blocker for 20 had we
known about it. Assuming your change that resolves the protocol
incompatibility is reviewed and tested to people's liking, I think we
should put it in 20.

