hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bob Hansen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10311) libhdfs++: DatanodeConnection::Cancel should not delete the underlying socket
Date Thu, 21 Apr 2016 14:53:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252006#comment-15252006
] 

Bob Hansen commented on HDFS-10311:
-----------------------------------

Good comments, [~James Clampffer]

bq. I'd rather have the calling code report it so you know if it was a DN or NN issue and
if it was a cancel or normal disconnect that lead to it.
The calling code can provide the context of the error, but loses the details of what failed.
 Sometimes (though not as often as we'd like), the exception's what() method returns very
useful information.  

bq. If it doesn't check and Cancel has already been called on that connection you end up with
deterministic false positive warning messages for every canceled FD.
By my reading, the SafeDisconnect always checks is_open first thing and returns true if the
socket is closed or null:
{code}
bool SafeDisconnect(asio::ip::tcp::socket *sock) {
  bool good = true;
  if(sock && sock->is_open()) {
     ...
  }
  return good;
}
{code}

It doesn't hurt to have it in the SocketDeleter, but I think it is redundant.

bq.  Which callback are you referring to? Right now it's only guarding the asio::async_<something>
calls because the socket objects aren't thread safe. 

Again, by my read, patch 001 includes in DataNodeConnection:
{code}
   void async_read_some(const MutableBuffers &buf,
                         std::function<...> handler) {
    mutex_guard state_lock(state_lock_);
    event_handlers_->call("DN_read_req", "", "", buf.end() - buf.begin());
    conn_->async_read_some(buf, handler);
   };
 
   void async_write_some(const ConstBuffers &buf,
                          std::function<...> handler) {
    mutex_guard state_lock(state_lock_);
    event_handlers_->call("DN_write_req", "", "", buf.end() - buf.begin());
    ...
   ;}
 {code}
If a consumer has the event handler try to shut down the connection in any way, we'll get
a deadlock on the state_lock_.  

My reading may be wrong there; please correct me if it is.

> libhdfs++: DatanodeConnection::Cancel should not delete the underlying socket
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-10311
>                 URL: https://issues.apache.org/jira/browse/HDFS-10311
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-10311.HDFS-8707.000.patch, HDFS-10311.HDFS-8707.001.patch
>
>
> DataNodeConnectionImpl calls reset on the unique_ptr that references the underlying asio::tcp::socket.
 If this happens after the continuation pipeline checks the cancel state but before asio uses
the socket it will segfault because unique_ptr::reset will explicitly change it's value to
nullptr.
> Cancel should only call shutdown() and close() on the socket but keep the instance of
it alive.  The socket can probably also be turned into a member of DataNodeConnectionImpl
to get rid of the unique pointer and simplify things a bit.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message