Return-Path: X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2BDBA19AE3 for ; Thu, 21 Apr 2016 14:53:26 +0000 (UTC) Received: (qmail 21176 invoked by uid 500); 21 Apr 2016 14:53:25 -0000 Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org Received: (qmail 21102 invoked by uid 500); 21 Apr 2016 14:53:25 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hdfs-issues@hadoop.apache.org Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 20799 invoked by uid 99); 21 Apr 2016 14:53:25 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 Apr 2016 14:53:25 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 8FEA82C1F61 for ; Thu, 21 Apr 2016 14:53:25 +0000 (UTC) Date: Thu, 21 Apr 2016 14:53:25 +0000 (UTC) From: "Bob Hansen (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HDFS-10311) libhdfs++: DatanodeConnection::Cancel should not delete the underlying socket MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HDFS-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252006#comment-15252006 ] Bob Hansen commented on HDFS-10311: ----------------------------------- Good comments, [~James Clampffer] bq. I'd rather have the calling code report it so you know if it was a DN or NN issue and if it was a cancel or normal disconnect that lead to it. The calling code can provide the context of the error, but loses the details of what failed. Sometimes (though not as often as we'd like), the exception's what() method returns very useful information. bq. If it doesn't check and Cancel has already been called on that connection you end up with deterministic false positive warning messages for every canceled FD. By my reading, the SafeDisconnect always checks is_open first thing and returns true if the socket is closed or null: {code} bool SafeDisconnect(asio::ip::tcp::socket *sock) { bool good = true; if(sock && sock->is_open()) { ... } return good; } {code} It doesn't hurt to have it in the SocketDeleter, but I think it is redundant. bq. Which callback are you referring to? Right now it's only guarding the asio::async_ calls because the socket objects aren't thread safe. Again, by my read, patch 001 includes in DataNodeConnection: {code} void async_read_some(const MutableBuffers &buf, std::function<...> handler) { mutex_guard state_lock(state_lock_); event_handlers_->call("DN_read_req", "", "", buf.end() - buf.begin()); conn_->async_read_some(buf, handler); }; void async_write_some(const ConstBuffers &buf, std::function<...> handler) { mutex_guard state_lock(state_lock_); event_handlers_->call("DN_write_req", "", "", buf.end() - buf.begin()); ... ;} {code} If a consumer has the event handler try to shut down the connection in any way, we'll get a deadlock on the state_lock_. My reading may be wrong there; please correct me if it is. > libhdfs++: DatanodeConnection::Cancel should not delete the underlying socket > ----------------------------------------------------------------------------- > > Key: HDFS-10311 > URL: https://issues.apache.org/jira/browse/HDFS-10311 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client > Reporter: James Clampffer > Assignee: James Clampffer > Attachments: HDFS-10311.HDFS-8707.000.patch, HDFS-10311.HDFS-8707.001.patch > > > DataNodeConnectionImpl calls reset on the unique_ptr that references the underlying asio::tcp::socket. If this happens after the continuation pipeline checks the cancel state but before asio uses the socket it will segfault because unique_ptr::reset will explicitly change it's value to nullptr. > Cancel should only call shutdown() and close() on the socket but keep the instance of it alive. The socket can probably also be turned into a member of DataNodeConnectionImpl to get rid of the unique pointer and simplify things a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)