Date: Fri, 8 Apr 2016 13:06:28 -0400
Subject: Re: Can this scenario cause a query to hang ?
From: François Méthot
To: dev@drill.apache.org

It might just add to the mystery of this issue, but when we start getting
those hanging CTAS queries, restarting our Drill cluster makes the problem
go away.

Next time we start getting this problem, I will try to collect the jstack
output of the foreman too.

Thanks for looking into this.

Francois

On Fri, Apr 8, 2016 at 2:20 AM, Abdel Hakim Deneche wrote:

> Opened DRILL-4595 [1] to track this issue.
>
> Thanks
>
> [1] https://issues.apache.org/jira/browse/DRILL-4595
>
> On Fri, Apr 8, 2016 at 6:42 AM, Abdel Hakim Deneche wrote:
>
> > Hey John, thanks for sharing your experience. If you see this again, try
> > collecting the jstack output for the foreman node of the query, and also
> > check in the query profile which fragments are still marked as RUNNING.
> >
> > Thanks
> >
> > On Thu, Apr 7, 2016 at 2:29 PM, John Omernik wrote:
> >
> >> Abdel -
> >>
> >> I think I've seen this on a MapR cluster I run, especially on CTAS. For
> >> me, I have not brought it up because the cluster I am running on has some
> >> serious personal issues (like being hardware that's nearly 7 years old; it's
> >> a test cluster), and given the "hard to reproduce" nature of the problem,
> >> I've been reluctant to create noise. Given what you've described, it seems
> >> very similar to CTAS hangs I've seen but couldn't accurately reproduce.
> >>
> >> This didn't add much to your post, but I wanted to give you a +1 for
> >> outlining this potential problem. Once I move to more robust hardware and
> >> I am in similar situations, I will post more verbose details from my side.
> >>
> >> John
> >>
> >> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
> >>
> >> > So, we've been seeing some queries hang. I've come up with a possible
> >> > explanation, but so far it's really difficult to reproduce. Let me know if
> >> > you think this explanation doesn't hold up or if you have any ideas how we
> >> > can reproduce it. Thanks
> >> >
> >> > - generally it's a CTAS running on a large cluster (lots of writers
> >> >   running in parallel)
> >> > - logs show that the user channel was closed and UserServer caused the root
> >> >   fragment to move to a FAILED state [1]
> >> > - jstack shows that the root fragment is blocked in its receiver waiting
> >> >   for data [2]
> >> > - jstack also shows that ALL other fragments are no longer running, and the
> >> >   logs show that all of them succeeded [3]
> >> > - the foreman waits *forever* for the root fragment to finish
> >> >
> >> > [1] the only case I can think of is when the user channel closed while the
> >> > fragment was waiting for an ack from the user client
> >> > [2] if a writer finishes earlier than the others, it will send a data batch
> >> > to the root fragment that will be sent to the user. The root will then
> >> > immediately block on its receiver waiting for the remaining writers to
> >> > finish
> >> > [3] once the root fragment moves to a failed state, the receiver will
> >> > immediately release any received batch and return an OK to the sender
> >> > without putting the batch in its blocking queue.
> >> >
> >> > Abdelhakim Deneche
> >> >
> >> > Software Engineer
> >> >
> >> > Now Available - Free Hadoop On-Demand Training
> >> > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
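
For anyone trying to reason about the sequence in the quoted explanation (footnotes [1] to [3]), here is a minimal Python sketch of it. This is only a toy model under stated assumptions: `Receiver`, `on_batch`, `next_batch` and `simulate` are hypothetical names, not Drill classes or APIs, and a short timeout stands in for the root fragment's unbounded wait so the script terminates.

```python
import queue


class Receiver:
    """Hypothetical stand-in for the root fragment's receiver (not Drill's API)."""

    def __init__(self):
        self.batches = queue.Queue()  # the blocking queue the root reads from
        self.failed = False           # set once the root fragment is marked FAILED

    def on_batch(self, batch):
        """Handle a batch from a sender; the sender is always acked with "OK"."""
        if self.failed:
            # Footnote [3]: the batch is released and acked, never enqueued.
            return "OK"
        self.batches.put(batch)
        return "OK"

    def next_batch(self, timeout):
        """Root-side read. The timeout stands in for the real unbounded wait."""
        try:
            return self.batches.get(timeout=timeout)
        except queue.Empty:
            return None


def simulate(num_writers=3, timeout=0.05):
    receiver = Receiver()
    acks = []

    # Footnote [2]: one writer finishes early and its batch reaches the root.
    acks.append(receiver.on_batch("writer-0 summary"))
    first = receiver.next_batch(timeout)

    # Footnote [1]: the user channel closes and the root is marked FAILED.
    # In this model nothing interrupts the root's own control flow.
    receiver.failed = True

    # The remaining writers finish; each is acked and can report success.
    for i in range(1, num_writers):
        acks.append(receiver.on_batch(f"writer-{i} summary"))

    # The root returns to its receiver for the remaining batches.
    second = receiver.next_batch(timeout)
    return first, second, acks


if __name__ == "__main__":
    first, second, acks = simulate()
    print("first batch seen by root:", first)
    print("second read by root:", second)
    print("acks seen by writers:", acks)
```

In this model every writer gets an OK, yet the root's second read finds nothing in the queue. With a real blocking read instead of the timeout, that call would not return, which is consistent with the foreman waiting forever as described above. Whether the actual code path in Drill behaves this way is exactly what DRILL-4595 is tracking; the sketch does not confirm it.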