hadoop-hdfs-dev mailing list archives

From Brahma Reddy Battula <brahmareddy.batt...@hotmail.com>
Subject RE: Issue in handling checksum errors in write pipeline
Date Sat, 30 Jul 2016 17:51:22 GMT
Hi Yongjun,
Thanks a lot for your reply.
Yes, it really is a network issue. I will raise a new JIRA.


Thanks and Regards,
Brahma Reddy Battula

> From: yzhang@cloudera.com
> Date: Sat, 30 Jul 2016 10:22:19 -0700
> Subject: Re: Issue in handling checksum errors in write pipeline
> To: brahmareddy.battula@huawei.com
> CC: hdfs-dev@hadoop.apache.org
> 
> Hi Brahma,
> 
> Thanks for reporting the issue.
> 
> If your problem is really a network issue, then your proposed solution
> sounds reasonable to me, and it's different than what HDFS-6937 intends to
> solve. I think we can create a new jira for your issue. Here is why:
> 
> HDFS-6937's scenario is that we keep replacing the third node in recovery
> and do not detect that the middle node is corrupt. Thus, adding a
> corruption check for the middle node would solve the issue. In your
> case, even if we checked the middle node, it would appear as not
> corrupt. The problem is that we don't have a check for network issues (and
> adding a network check is probably not feasible here).
> 
> On the other hand, if it's not a network issue, then it could be caused by
> HDFS-4660, if you don't already have the fix.
> 
> Hope my explanation makes sense.
> 
> Thanks.
> 
> --Yongjun
> 
> On Sat, Jul 30, 2016 at 4:03 AM, Brahma Reddy Battula <
> brahmareddy.battula@huawei.com> wrote:
> 
> > Hello
> >
> >
> > We came across an issue where a write failed even though 7 DNs were
> > available, due to a network fault at one datanode, which was
> > LAST_IN_PIPELINE. It is similar to HDFS-6937.
> >
> > Scenario: DN3 has a network fault and min repl=2.
> >
> > Write pipeline:
> > DN1->DN2->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad.
> > DN1->DN4->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad.
> > ....
> > And so on (DN3 is LAST_IN_PIPELINE every time), continuing until no
> > more datanodes are left to construct the pipeline.
> >
> > We think this could be handled as follows:
> >
> > Instead of throwing an IOException for an ERROR_CHECKSUM ack from
> > downstream, we could send back the pipeline ack, and on the client side
> > replace both DN2 and DN3 with new nodes, since we can't decide which one
> > has the network problem.
> >
> >
> > Please give your views on the possible fix.
> >
> >
> > --Brahma Reddy Battula
> >
> >
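
The scenario in the original report, and the proposed change, can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not HDFS code: the function `simulate_write`, its parameters, and the replacement order are hypothetical, and the model ignores min repl and simply stops when there are not enough spare datanodes left. It assumes, as in the report, that the node with the network fault always reports ERROR_CHECKSUM and that the current recovery blames only its upstream neighbour.

```python
from typing import List, Tuple


def simulate_write(datanodes: List[str], faulty: str,
                   replace_both: bool) -> Tuple[bool, List[str], List[str]]:
    """Toy model of write-pipeline recovery (hypothetical, not HDFS code).

    The first three datanodes form the pipeline; the rest are spares.
    Returns (write_succeeded, final_pipeline, nodes_marked_bad).
    """
    pipeline = list(datanodes[:3])
    spares = list(datanodes[3:])
    marked_bad: List[str] = []

    while faulty in pipeline:
        idx = pipeline.index(faulty)
        if idx == 0:
            raise ValueError("this sketch assumes the faulty node has an upstream node")

        # The faulty node acks ERROR_CHECKSUM; recovery blames its upstream.
        suspects = [pipeline[idx - 1]]
        if replace_both:
            # Proposed handling: the client cannot tell which of the two
            # nodes has the problem, so it replaces both of them.
            suspects.append(faulty)

        if len(spares) < len(suspects):
            # No more datanodes available to rebuild the pipeline.
            return False, pipeline, marked_bad

        for node in suspects:
            marked_bad.append(node)
            pipeline[pipeline.index(node)] = spares.pop(0)

    return True, pipeline, marked_bad


if __name__ == "__main__":
    dns = ["DN1", "DN2", "DN3", "DN4", "DN5", "DN6", "DN7"]
    print(simulate_write(dns, "DN3", replace_both=False))
    print(simulate_write(dns, "DN3", replace_both=True))
```

With seven datanodes and DN3 faulty, the first call models the reported behaviour: healthy nodes are marked bad one after another while DN3 stays in the pipeline, and the write fails. The second call models the proposal: DN2 and DN3 are replaced together, at the cost of possibly discarding one healthy node, and the write can proceed.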
 		 	   		  