hadoop-hdfs-issues mailing list archives

From "sam rash (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
Date Sun, 22 Aug 2010 04:07:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901130#action_12901130
] 

sam rash commented on HDFS-1262:
--------------------------------

my apologies for the delay.  I've been caught up in some hi-pri bits at work.

thanks for the comments.  inlined responses

#why does abandonFile return boolean? looks like right now it can only return true or throw,
may as well make it void, no?
good question: I copied the pattern from abandonBlock(), which has the same behavior: it returns
true or throws an exception.  I was trying to keep it consistent (rather than logical per se).
I do prefer the void option as it makes the method clearer.

#in the log message in FSN.abandonFile it looks like there's a missing '+ src +' in the second
log message
#in the log messages, also log the "holder" argument perhaps
will fix

#in previous append-branch patches we've been trying to keep RPC compatibility with unpatched
0.20 - ie you can run an updated client against an old NN, with the provision that it might
not fix all the bugs. Given that, maybe we should catch the exception we get if we call abandonFile()
and get back an exception indicating the method doesn't exist? Check out what we did for
HDFS-630 backport for example.
nice idea, I will check this out
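
A minimal sketch of the fallback suggested above, assuming nothing about the real Hadoop RPC classes. NameNodeRpc, UnknownRpcMethodException and tryAbandonFile are hypothetical names invented for illustration; how an unpatched NN actually reports a missing method, and how the HDFS-630 backport detects it, is not shown in this message.

```java
import java.io.IOException;

public class AbandonFileCompat {
    /** Hypothetical stand-in for the client-to-NN RPC interface. */
    interface NameNodeRpc {
        void abandonFile(String src, String holder) throws IOException;
    }

    /** Hypothetical marker for "the server does not know this RPC method". */
    static class UnknownRpcMethodException extends IOException {
        UnknownRpcMethodException(String method) {
            super("unknown RPC method: " + method);
        }
    }

    /**
     * Asks the NN to abandon the file. Returns true if the call went through,
     * false if the NN is too old to support it. Any other IOException propagates.
     */
    static boolean tryAbandonFile(NameNodeRpc nn, String src, String holder)
            throws IOException {
        try {
            nn.abandonFile(src, holder);
            return true;
        } catch (UnknownRpcMethodException e) {
            // Old NN: the lease stays in place, but the updated client keeps working.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        NameNodeRpc patched = (src, holder) -> { };
        NameNodeRpc unpatched = (src, holder) -> {
            throw new UnknownRpcMethodException("abandonFile");
        };
        System.out.println(tryAbandonFile(patched, "/f", "client-1"));   // true
        System.out.println(tryAbandonFile(unpatched, "/f", "client-1")); // false
    }
}
```

The point of the design is that only the "method missing" case is swallowed; a genuine failure from a patched NN still reaches the caller.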

#looks like there are some other patches that got conflated into this one - eg testSimultaneousRecoveries
is part of another patch on the append branch.
hmm, yea, not sure what happened here...weird, I think I applied one of your patches.  Which
patch is that test from?

#missing Apache license on new test file
will fix
#typo: Excection instead of Exception
will fix
#"(PermissionStatus) anyObject()," might generate an unchecked cast warning - I think you
can do Matchers.<PermissionStatus>anyObject() or some such to avoid the unchecked cast
ah, nice catch, will fix also
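
For reference, the explicit type argument syntax the reviewer mentions looks roughly like this. The sketch does not use Mockito: anyObject() here is a hypothetical generic helper, and PermissionStatus and create() are empty stand-ins, so it only illustrates the Java syntax, not the behavior of the real matcher.

```java
public class TypeWitnessSketch {
    /** Hypothetical stand-in for the real PermissionStatus class. */
    static class PermissionStatus { }

    /** Hypothetical matcher-style helper: returns null typed as T. */
    static <T> T anyObject() {
        return null;
    }

    /** Hypothetical method under test that takes a PermissionStatus argument. */
    static String create(String src, PermissionStatus perms) {
        return "create(" + src + ")";
    }

    public static void main(String[] args) {
        // The explicit type argument fixes T = PermissionStatus, so no cast is needed.
        String result = create("/f", TypeWitnessSketch.<PermissionStatus>anyObject());
        System.out.println(result); // create(/f)
    }
}
```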

#given the complexity of the unit test, would be good to add some comments for the general
flow of what all the mocks/spys are achieving. I found myself a bit lost in the abstractions

yea, sorry, I was in a rush before vacation to get some test + patch up.  It was a bit tricky to get
this case going for both create + append;  I'll document the case better (at all)

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: sam rash
>            Priority: Critical
>             Fix For: 0.20-append
>
>         Attachments: hdfs-1262-1.txt, hdfs-1262-2.txt, hdfs-1262-3.txt, hdfs-1262-4.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the
following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so until soft
lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append pipeline creation
failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase
master. HBase assumed the file wasn't open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned to the same
DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this JIRA is
that the client can think it failed to open a file for append when the NN thinks the writer holds
a lease. Since the writer keeps renewing its lease, recovery never happens, and no one can
open or recover the file until the DFS client shuts down.
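
The sequence in steps 4 to 6 can be sketched with a toy model. This is not HDFS code: ToyNameNode, openForAppend and the cleanUp flag are hypothetical, lease renewal and expiry are left out, and the toy NN simply rejects any append on a file that already has a lease holder. It shows only why a client-side failure after a successful NN append leaves the lease behind, and how an abandonFile-style call would release it.

```java
import java.util.HashMap;
import java.util.Map;

public class LeaseSketch {
    /** Toy NN: maps each file open for write to the client holding its lease. */
    static class ToyNameNode {
        private final Map<String, String> leases = new HashMap<>();

        /** NN side of append: grants the lease, or fails if one is already held. */
        void append(String src, String holder) {
            String current = leases.get(src);
            if (current != null) {
                throw new IllegalStateException(src + " is already leased to " + current);
            }
            leases.put(src, holder);
        }

        /** Releases the lease, but only for the client that holds it. */
        void abandonFile(String src, String holder) {
            if (!holder.equals(leases.get(src))) {
                throw new IllegalStateException(holder + " does not hold the lease on " + src);
            }
            leases.remove(src);
        }

        boolean isLeased(String src) {
            return leases.containsKey(src);
        }
    }

    /**
     * Client side of append. The NN call succeeds first (step 4); pipeline
     * creation may then fail (step 5). With cleanUp set, the client tells the
     * NN to drop the lease before reporting the failure.
     */
    static boolean openForAppend(ToyNameNode nn, String src, String holder,
                                 boolean pipelineOk, boolean cleanUp) {
        nn.append(src, holder);
        if (pipelineOk) {
            return true;
        }
        if (cleanUp) {
            nn.abandonFile(src, holder);
        }
        return false;
    }

    public static void main(String[] args) {
        ToyNameNode buggy = new ToyNameNode();
        openForAppend(buggy, "/wal", "hbase-master", false, false);
        System.out.println("lease left behind: " + buggy.isLeased("/wal"));

        ToyNameNode fixed = new ToyNameNode();
        openForAppend(fixed, "/wal", "hbase-master", false, true);
        System.out.println("lease left behind: " + fixed.isLeased("/wal"));
        System.out.println("retry succeeds: "
                + openForAppend(fixed, "/wal", "hbase-master", true, true));
    }
}
```

In the first run the client believes the open failed while the toy NN still records it as the lease holder, so a later append on the same file is rejected (step 6).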

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

