Date: Thu, 29 Sep 2016 21:46:20 +0000 (UTC)
From: "James Clampffer (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-10931) libhdfs++: Fix object lifecycle issues in the BlockReader

     [ https://issues.apache.org/jira/browse/HDFS-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-10931:
-----------------------------------
    Attachment: HDFS-10931.HDFS-8707.000.patch

Patch added for the first part of the problem.
The patch makes gratuitous use of shared_ptr to keep the DataNodeConnection alive. The fundamental fixes to the architecture can be addressed later on.

> libhdfs++: Fix object lifecycle issues in the BlockReader
> ---------------------------------------------------------
>
>                 Key: HDFS-10931
>                 URL: https://issues.apache.org/jira/browse/HDFS-10931
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>            Priority: Critical
>         Attachments: HDFS-10931.HDFS-8707.000.patch
>
>
> The BlockReader can work itself into a state during AckRead (possibly other stages as well) where the pipeline posts a task for asio with a pointer back into itself, then promptly calls "delete this" without canceling the asio request. The asio task finishes and tries to acquire the lock at the address where the DataNodeConnection used to live, but the DN connection is no longer valid, so it's scribbling on some arbitrary bit of memory. On some platforms the underlying address used by the mutex state will be handed out to future mutexes, so the scribble breaks that state and all the locks in that process start misbehaving.
> This can be reproduced by using the patch from HDFS-8790 and adding more worker threads + a lot more reader threads.
> I'm going to fix this in two parts:
> 1) Duct tape + superglue patch to make sure that all top level continuations in the block reader pipeline hold a shared_ptr to the DataNodeConnection. Nested continuations also get a copy of the shared_ptr to make sure the connection is alive. This at least keeps the connection alive so that it can keep returning asio::operation_aborted.
> 2) The continuation stuff needs a lot of work to make sure this type of bug doesn't keep popping up. We've already fixed these issues in the RPC code. This will most likely need to be split into a few jiras.
> - Continuation "framework" can be slimmed down quite a bit, perhaps even removed.
>   Near zero documentation + many implied contracts = constant bug chasing.
> - Add comments to actually describe what's going on in the networking code. This bug took significantly longer than it should have to track down because I hadn't worked on the BlockReader in a while.
> - No more "delete this".
> - Flatten out nested continuations, e.g. the guts of BlockReaderImpl::AckRead. It's unclear why they were implemented like this in the first place, and there are no comments to indicate that this was intentional.
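
For illustration only, here is a minimal sketch of the part 1 idea (a pending handler holds a shared_ptr so the connection and its mutex outlive the owner). It is not the code in the attached patch. Connection, TaskQueue, AsyncRead and RunDemo are hypothetical names, and a plain std::queue of std::function stands in for asio so the example needs only the standard library.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

// Assumption: stand-in for the asio task queue. Tasks are queued now and
// run later, which is the window in which the original bug occurs.
using TaskQueue = std::queue<std::function<void()>>;

// Hypothetical stand-in for DataNodeConnection. It records events in a log
// so the ordering of handler execution and destruction can be inspected.
class Connection : public std::enable_shared_from_this<Connection> {
 public:
  explicit Connection(std::vector<std::string>* log) : log_(log) {}
  ~Connection() { log_->push_back("connection destroyed"); }

  // Post a deferred handler. Capturing shared_from_this() instead of a raw
  // `this` keeps the connection, and the mutex inside it, alive until the
  // handler itself has been destroyed.
  void AsyncRead(TaskQueue* queue) {
    std::shared_ptr<Connection> self = shared_from_this();
    queue->push([self]() {
      std::lock_guard<std::mutex> guard(self->state_lock_);
      self->log_->push_back("handler ran with live connection");
    });
  }

 private:
  std::mutex state_lock_;
  std::vector<std::string>* log_;
};

// Scenario: the owner drops its reference while a handler is still pending.
std::vector<std::string> RunDemo() {
  std::vector<std::string> log;
  TaskQueue queue;
  {
    auto conn = std::make_shared<Connection>(&log);
    conn->AsyncRead(&queue);
    assert(conn.use_count() == 2);  // the owner plus the pending handler
    log.push_back("owner released connection");
  }  // the owner's reference ends here; the queued handler still holds one

  // Drain the queue, as a worker thread would.
  while (!queue.empty()) {
    queue.front()();
    queue.pop();  // destroying the handler drops the last reference
  }
  return log;
}
```

Calling RunDemo() returns the three events in order: the owner releases the connection, the handler runs against a live connection, and only then is the connection destroyed. If the lambda captured a raw `this`, the handler would instead lock a mutex in freed memory, which is the scribble described above.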