hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes
Date Fri, 20 Jul 2012 22:53:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419646#comment-13419646 ]

nkeywal commented on HBASE-6435:
--------------------------------

If I want to keep the existing interface:


Today, when you open a file, there is a call to a datanode if the file is also opened for
writing somewhere. In HBase, we want the priorities to be taken into account during this opening,
as we can guess that one of these datanodes may be dead.

So either I register a callback that the DFSClient will call before using its list, or
I change the 'open' interface to add the possibility to provide the list of replicas. The same
applies to chooseDataNode, called from blockSeekTo: even if we have a list at the beginning,
this list is recreated during a read as part of the retry process (in case the NN discovered
new replicas on new datanodes).

If we go the callback way, we would offer this service:
{noformat}
interface ReplicaSet {
  // Return the list of available replicas at the given file offset, in priority order
  public List<Replica> getAvailableReplica(long pos);

  // Move the given replica to the front of the list
  public void prioritizeReplica(Replica r);

  // Move the given replica to the back of the list
  public void blacklistReplica(Replica r);
}
{noformat}


The client would need to implement this interface:
{noformat}
// Implement this interface and provide it to the DFSClient during its construction
// to manage the replica ordering
interface OrganizeReplicaSet {
  void organize(String fileName, ReplicaSet rs);
}
{noformat}
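
For illustration, a minimal sketch of what the HBase-side implementation of this callback could look like. The set of dead hosts, the Replica.getHostName() accessor and the '.logs' path check are assumptions made for the example, not part of the interfaces above:
{noformat}
import java.util.ArrayList;
import java.util.Set;

// Hypothetical HBase-side callback: push replicas that live on the host of a dead
// regionserver to the back of the list, so they are tried last.
class DeprioritizeDeadHostReplicas implements OrganizeReplicaSet {
  // Hosts of regionservers known to be dead; assumed to be maintained elsewhere in HBase.
  private final Set<String> deadRegionServerHosts;

  DeprioritizeDeadHostReplicas(Set<String> deadRegionServerHosts) {
    this.deadRegionServerHosts = deadRegionServerHosts;
  }

  @Override
  public void organize(String fileName, ReplicaSet rs) {
    // Only reorder WAL files (assuming they live under a '.logs' directory);
    // other files keep the ordering returned by the NN.
    if (!fileName.contains("/.logs/")) {
      return;
    }
    // For simplicity, only look at the replicas of the first block (offset 0).
    // Copy the list to avoid mutating it while iterating.
    for (Replica r : new ArrayList<Replica>(rs.getAvailableReplica(0))) {
      if (deadRegionServerHosts.contains(r.getHostName())) {  // getHostName() is an assumed accessor
        rs.blacklistReplica(r);
      }
    }
  }
}
{noformat}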

And the DFSClient code would become:
{noformat}
LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
    String src, long start, long length) throws IOException {
  LocatedBlocks lbs = namenode.getBlockLocations(src, start, length);
  if (organizeReplicaSet != null) {
    ReplicaSet rs = lbs.getAsReplicaSet();
    try {
      organizeReplicaSet.organize(src, rs);
    } catch (Throwable t) {
      throw new IOException("OrganizeReplicaSet failed. class="
          + organizeReplicaSet.getClass(), t);
    }
    return new LocatedBlocks(rs);
  } else {
    return lbs;
  }
}
{noformat}

This is called from the DFSInputStream constructor in openInfo today.

In real life I would try to use ReplicaSet as an interface on the internal LocatedBlock(s)
to limit the number of objects created. The callback could also be given as a parameter to
the DFSInputStream constructor if there is a specific rule to apply...
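
As a rough sketch of the wiring, assuming a hypothetical DFSClient constructor overload and open() parameter that accept the callback (neither exists today, they only illustrate the two registration points discussed above):
{noformat}
// Hypothetical wiring; the extra constructor/open() parameters do not exist in
// today's DFSClient, they only show where the callback would be plugged in.
OrganizeReplicaSet reorder = new DeprioritizeDeadHostReplicas(deadRegionServerHosts);

// Option 1: register the callback once, when the DFSClient is built.
DFSClient client = new DFSClient(nameNodeAddr, conf, reorder);

// Option 2: pass it to a single stream, if a specific rule applies to one file only.
DFSInputStream in = client.open("/hbase/.logs/rs1/wal.456", bufferSize, true, reorder);
{noformat}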

                
> Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6435
>                 URL: https://issues.apache.org/jira/browse/HBASE-6435
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 6435.unfinished.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure.
> This log is written with 'append' on hdfs.
> Through ZooKeeper, HBase is usually informed within 30s that it should start the recovery process.
> This means reading the Write-Ahead-Log to replay the edits on the other servers.
> In standard deployments, HBase processes (regionservers) are deployed on the same boxes as the datanodes.
> It means that when a box stops, we've actually lost one of the replicas of the edits, as we lost both the regionserver and the datanode.
> As HDFS only marks a node as dead after ~10 minutes, it still appears available when we try to read the blocks to recover. As such, we are delaying the recovery process by 60 seconds, as the read will usually fail with a socket timeout. If the file is still open for writing, it adds an extra 20s, plus a risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead datanodes detection by the NN. Requires a NN code change.
> - better dead datanodes management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local one.
> - reordering the blocks returned by the NN on the client side to put the blocks on the same DN as the dead RS at the end of the priority queue. Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the mailing list, the proposed patch does not modify the HDFS source code but adds a proxy instead. This is for two reasons:
> - Some HDFS functions managing block ordering are static (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would require implementing the fix only partially, changing the DFS interface to make this function non-static, or making the hook static. None of these solutions is very clean.
> - Adding a proxy allows us to put all the code in HBase, simplifying dependency management.
> Nevertheless, it would be better to have this in HDFS. But such a solution could target the latest version only, and it could allow minimal interface changes such as making these methods non-static.
> Moreover, writing the blocks to a non-local DN would be an even better long-term solution.
