hadoop-hdfs-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3705) Add the possibility to mark a node as 'low priority' for read in the DFSClient
Date Mon, 27 Aug 2012 18:28:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442588#comment-13442588 ]

nkeywal commented on HDFS-3705:
-------------------------------

Hello Suresh,

For sure, HDFS-3703 is absolutely key for HBase. It's great you're doing this. BTW, don't
hesitate to ask me if you want me to test it with HBase. 

For HDFS-3705, it's a more difficult question. I think there are actually two questions:
1) Is it superseded by HDFS-3703?
2) Can it be replaced by a server-side-only implementation?

For 1)
If I take HBase & HDFS as they are today, it's more or less a yes: most of the time, people
configure HBase with a timeout of 45s. So if HDFS detects the failure in 30s, the DFSClient
is already in the right state when HBase starts the recovery. In that case, HDFS-3705 is
useless.
However, even today, I've seen people claiming a configuration with a 10-second timeout.
Looking further, even if a configuration where HDFS is more aggressive than HBase will always
be simpler, I don't think we can have this as a systematic precondition (see the sketch after
this list):
- The HBase timeouts are driven a lot by GC issues. This is getting resolved more and more,
for example with the new GC settings in JDK 1.7. If that works, the HBase timeout will be
decreased.
- As HBase uses a different failure-detection mechanism than HDFS, we will always have
mismatches. If ZooKeeper improves its detection mechanism, there will be a period of time
when the ZooKeeper-based detection in HBase is faster than the one in HDFS. Along the same
line, there are the differences between connect and read timeouts: some issues are detected
sooner than others.
- If we want HDFS & HBase to be more and more realtime, settings will become more and more
aggressive, and in the end the difference between the HBase & HDFS timeouts will be a few
seconds, i.e. something that you can't really rely on when there are failures on the cluster.
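
To make the knobs concrete, here is a minimal sketch of the two timeouts being compared.
The property keys vary across HBase/Hadoop versions, so treat the names below as assumptions
rather than a reference:

    import org.apache.hadoop.conf.Configuration;

    public class TimeoutMismatchSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // HBase failure detection is ZooKeeper-based: the region server is
            // declared dead when its session expires (~45s in the common setup).
            conf.setInt("zookeeper.session.timeout", 45000);
            // HDFS client read timeout (~30s): for the "HDFS is more aggressive
            // than HBase" precondition to hold, this must expire first.
            conf.setInt("dfs.client.socket-timeout", 30000);
            int margin = conf.getInt("zookeeper.session.timeout", 0)
                    - conf.getInt("dfs.client.socket-timeout", 0);
            // With the 10-second HBase timeout some users run, the margin goes
            // negative: HBase starts its recovery while the DFSClient still
            // considers the dead datanode healthy.
            System.out.println("safety margin: " + margin + " ms");
        }
    }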

For 2), when we discussed HDFS-3702 on the HBase list, doing this on the namenode side with
HDFS-385 was rejected, because the namenode could be shared between different teams /
applications, and the operations team could refuse to deploy a namenode configuration
specific to HBase. I guess we're having a similar issue here.


It's not simple; but given the points above, I think that even if HDFS-3703 does 90% of the
work, the remaining 10% needs deep cooperation between HBase & HDFS. Marking the API
LimitedPrivate is not an issue for HBase imho, and it buys some time to validate the API.
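
For what it's worth, a minimal sketch of the shape such a LimitedPrivate hint could take is
below; the class and method names are hypothetical illustrations, not the actual HDFS-3705
patch:

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch, not the HDFS-3705 patch: a client-side registry of
    // datanodes to try last. A DFSClient-style reader would consult it when
    // ordering block locations, instead of discovering the dead node through a
    // read timeout.
    public final class LowPriorityNodes {
        private final Set<String> nodes =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        // Called by the application (e.g. HBase recovery) when it already
        // suspects that a datanode is dead.
        public void markLowPriority(String datanodeHostPort) {
            nodes.add(datanodeHostPort);
        }

        // Consulted on the read path: unlike the existing 'bad nodes' list, a
        // low-priority replica is moved to the end of the candidate list, not
        // excluded outright, so a wrong hint only costs ordering.
        public boolean isLowPriority(String datanodeHostPort) {
            return nodes.contains(datanodeHostPort);
        }
    }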

I'm happy to get other opinions here :-)

> Add the possibility to mark a node as 'low priority' for read in the DFSClient
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-3705
>                 URL: https://issues.apache.org/jira/browse/HDFS-3705
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs client
>    Affects Versions: 1.0.3, 2.0.0-alpha, 3.0.0
>            Reporter: nkeywal
>             Fix For: 3.0.0
>
>         Attachments: hdfs-3705.sample.patch, HDFS-3705.v1.patch
>
>
> This has been partly discussed in HBASE-6435.
> The DFSClient includes a 'bad nodes' management for reads and writes. Sometimes, the
> client application already knows that some nodes are dead or likely to be dead.
> An example is the HBase Write-Ahead-Log: when HBase reads this file, it knows that the
> HBase regionserver died, and it's very likely that the box died, so the datanode on the
> same box is dead as well. This is actually critical, because:
> - it's the HBase recovery that reads these log files
> - if we read them, it means that we lost a box, so we have 1 dead replica out of the 3
> - for all files read, we have a 33% chance to go to the dead datanode
> - as the box just died, we're very likely to get a timeout exception, so we're delaying
> the HBase recovery by 1 minute. For HBase, it means that the data is not available during
> this minute.
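
Read together with the comment above, the intended call pattern would be roughly the
following. It reuses the hypothetical LowPriorityNodes sketch from earlier, and the host
name is made up:

    // Hypothetical usage of the LowPriorityNodes sketch above: during WAL
    // replay, hint the datanode co-located with the dead region server so the
    // first read attempt goes to one of the two healthy replicas.
    public class WalRecoverySketch {
        public static void main(String[] args) {
            LowPriorityNodes hints = new LowPriorityNodes();
            // Made-up host: the datanode sharing the box with the dead
            // region server.
            String suspectDatanode = "rs42.example.com:50010";
            hints.markLowPriority(suspectDatanode);
            // A hint-aware reader would now try the other replicas first,
            // avoiding the ~1 minute timeout during the recovery.
            System.out.println("deprioritized: "
                    + hints.isLowPriority(suspectDatanode));
        }
    }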

