hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4417) HDFS-347: fix case where local reads get disabled incorrectly
Date Tue, 22 Jan 2013 01:42:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559277#comment-13559277 ]

Colin Patrick McCabe commented on HDFS-4417:

bq. How about newTcpPeer? Remote is kind of vague.


Using a mock for DomainSocket also worked out well.

For PeerCache, I tried out the two-cache solution, but it started getting pretty complicated,
since we refer to the cache in many places.  Instead, I just added a boolean to the cache
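
One way to picture a single cache that carries a boolean is to make the boolean part of the lookup key, so that a cached TCP socket can never satisfy a request for a domain socket. The sketch below is only an illustration under that assumption: {{SketchPeer}}, {{SketchPeerCache}}, and the {{isDomain}} field are hypothetical names, not the types or fields in the patch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a cached connection; not the real HDFS peer type.
final class SketchPeer {
    final String datanodeId;
    final boolean isDomain; // true = UNIX domain socket, false = TCP socket

    SketchPeer(String datanodeId, boolean isDomain) {
        this.datanodeId = datanodeId;
        this.isDomain = isDomain;
    }
}

// One cache for both socket kinds; the boolean in the key keeps them apart.
final class SketchPeerCache {
    private static final class Key {
        final String datanodeId;
        final boolean isDomain;

        Key(String datanodeId, boolean isDomain) {
            this.datanodeId = datanodeId;
            this.isDomain = isDomain;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return isDomain == other.isDomain && datanodeId.equals(other.datanodeId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(datanodeId, isDomain);
        }
    }

    private final Map<Key, Deque<SketchPeer>> peers = new HashMap<>();

    void put(SketchPeer peer) {
        peers.computeIfAbsent(new Key(peer.datanodeId, peer.isDomain),
                k -> new ArrayDeque<>()).push(peer);
    }

    // Returns a cached peer of the requested kind, or null if none is cached.
    SketchPeer get(String datanodeId, boolean isDomain) {
        Deque<SketchPeer> queue = peers.get(new Key(datanodeId, isDomain));
        return (queue == null) ? null : queue.poll();
    }
}
```

With a key like this, a caller that wants a local read asks for {{isDomain = true}} first and only falls back to a TCP lookup on a miss, so a cache full of TCP sockets no longer hides the local path.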

In {{TestParallelShortCircuitReadUnCached}}, since this *is* a regression test for HDFS-4417,
I figured I needed some way to make sure that we were not falling back on TCP sockets to read.
So I added {{DFSInputStream#tcpReadsDisabledForTesting}}.
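
A test-only flag of this kind might behave roughly as sketched below. Only the flag name comes from the comment above; {{FlaggedStreamSketch}}, {{openPeer}}, and the returned strings are hypothetical, and the real field may be declared and checked differently.

```java
import java.io.IOException;

// Hypothetical reader used only to illustrate the flag; not the real DFSInputStream.
class FlaggedStreamSketch {
    // Test-only switch: when set, falling back to TCP is an error, not a silent fallback.
    static volatile boolean tcpReadsDisabledForTesting = false;

    private final boolean domainSocketUsable;

    FlaggedStreamSketch(boolean domainSocketUsable) {
        this.domainSocketUsable = domainSocketUsable;
    }

    // Returns which transport would be used for the read.
    String openPeer() throws IOException {
        if (domainSocketUsable) {
            return "domain";
        }
        if (tcpReadsDisabledForTesting) {
            throw new IOException("TCP reads are disabled for testing");
        }
        return "tcp";
    }
}
```

Because the flag is consulted only on the read path, the test can still write its input files over TCP before turning the flag on.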

I considered several other solutions.  Any solution that makes TCP sockets unusable, like
setting a bad {{SocketFactory}}, runs into trouble because the first part of the test needs
to create the files that we're reading.  Killing the {{DataNode#dataXceiverServer}} thread
after doing the writes seemed like a promising approach, but it caused exceptions in the {{DFSOutputStream}}
worker threads, which led to the (only) {{DataNode}} getting kicked out of the cluster.  Another
approach is to create a subclass for {{DFSInputStream}} that overrides {{DFSInputStream#newTcpPeer}}
to throw an exception.  However, getting a {{DFSClient}} to return this subclass is difficult.
Possibly Mockito's partial mocks could help here.

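
The subclass approach, and why injecting it is the awkward part, can be sketched in plain Java. Everything here is hypothetical: {{SketchStream}}, {{NoTcpStream}}, and {{SketchClient}} stand in for the real classes, and the overridable {{newStream}} factory method is an assumed seam; the comment above suggests the real client does not offer such a hook easily.

```java
import java.io.IOException;

// Hypothetical stand-in for the input stream; not the real DFSInputStream.
class SketchStream {
    protected String newTcpPeer() throws IOException {
        return "tcp";
    }

    String open(boolean domainSocketUsable) throws IOException {
        return domainSocketUsable ? "domain" : newTcpPeer();
    }
}

// Test-only subclass: any attempt to fall back to TCP fails loudly.
class NoTcpStream extends SketchStream {
    @Override
    protected String newTcpPeer() throws IOException {
        throw new IOException("unexpected TCP fallback");
    }
}

// Hypothetical client. The protected factory method is the seam a test would
// need in order to substitute NoTcpStream for the default stream.
class SketchClient {
    protected SketchStream newStream() {
        return new SketchStream();
    }

    String read(boolean domainSocketUsable) throws IOException {
        return newStream().open(domainSocketUsable);
    }
}
```

A test would override {{newStream}} to return {{NoTcpStream}}. A partial mock achieves a similar effect by stubbing one method on an otherwise real object, without needing the factory seam.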
> HDFS-347: fix case where local reads get disabled incorrectly
> -------------------------------------------------------------
>                 Key: HDFS-4417
>                 URL: https://issues.apache.org/jira/browse/HDFS-4417
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, hdfs-client, performance
>            Reporter: Todd Lipcon
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-4417.002.patch, HDFS-4417.003.patch, hdfs-4417.txt
> In testing HDFS-347 against HBase (thanks [~jdcryans]) we ran into the following case:
> - a workload is running which puts a bunch of local sockets in the PeerCache
> - the workload abates for a while, causing the sockets to go "stale" (i.e., the DN side disconnects after the keepalive timeout)
> - the workload starts again
> In this case, the local socket retrieved from the cache failed the newBlockReader call, and it incorrectly disabled local sockets on that host. This is similar to an earlier bug, HDFS-3376, but not quite the same.
> The next issue we ran into is that, once this happened, it never tried local sockets again, because the cache held lots of TCP sockets. Since we always managed to get a cached socket to the local node, it didn't bother trying local read again.
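
The first failure mode in the description can be illustrated with a small model: a failure on a socket taken from the cache says only that the socket went stale, so it should be discarded and the read retried, while only a failure on a freshly opened socket is evidence that local reads do not work on that host. This is a sketch of one plausible policy, not necessarily what the attached patches do; {{LocalReadSketch}} and its members are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the retry policy; not the real DFSInputStream logic.
final class LocalReadSketch {
    // Cached domain sockets, modeled as flags: true means the socket has gone stale.
    private final Deque<Boolean> cachedDomainSockets = new ArrayDeque<>();
    private final boolean freshConnectWorks;
    private boolean domainDisabled = false;

    LocalReadSketch(boolean freshConnectWorks) {
        this.freshConnectWorks = freshConnectWorks;
    }

    void cache(boolean stale) {
        cachedDomainSockets.add(stale);
    }

    boolean isDomainDisabled() {
        return domainDisabled;
    }

    // Returns which transport ends up serving the read.
    String open() {
        while (!domainDisabled) {
            Boolean stale = cachedDomainSockets.poll();
            if (stale == null) {
                // Cache is empty: try a brand-new domain socket.
                if (freshConnectWorks) {
                    return "domain-fresh";
                }
                // Only a fresh failure disables local reads for this host.
                domainDisabled = true;
            } else if (!stale) {
                return "domain-cached";
            }
            // A stale cached socket is dropped and the loop tries again.
        }
        return "tcp";
    }
}
```

In this model a burst of stale cached sockets costs a few retries but leaves local reads enabled, which is the behavior the description says was missing.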

