From: Ned Rockson
Date: Tue, 9 Oct 2007 00:04:36 -0700
To: hadoop-user@lucene.apache.org
Subject: Re: Reduce task hangs

Andrzej helped me a lot by pointing me to the kill -SIGQUIT [pid] command.
It makes the JVM write a Java thread dump to stdout (which is captured in
logs/userlogs/[task]/stdout/part-######). It's a lifesaver if a task is
getting stuck somewhere and you're not sure why.

--Ned

On 10/8/07, Ming Yang wrote:
> Hi,
>
> I have set up a 2-node cluster running on Ubuntu 7.04
> and tested the examples, including wordcount and pi.
> But the jobs don't always finish. Sometimes the reduce
> tasks hang partway through, for example at 13%, and there's no
> network traffic between nodes and no CPU usage.
> I have tried many different ways to make it more stable,
> but with no luck. I checked the DFS and found that all blocks are
> under-replicated. Is this the cause? I would really appreciate it
> if anyone could share some experience with this type of
> problem. Thank you!
>
> Ming Yang
>
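
For anyone who wants to script the SIGQUIT tip above, here is a minimal sketch. It assumes a POSIX shell and a JVM that prints a thread dump on SIGQUIT and keeps running afterwards (HotSpot does). The helper name request_thread_dump is hypothetical, not part of Hadoop, and the pid in the usage note is a placeholder.

```shell
#!/bin/sh
# Sketch: ask a running JVM for a thread dump by sending it SIGQUIT.
# request_thread_dump is a hypothetical helper name, not a Hadoop command.

request_thread_dump() {
    pid="$1"
    # kill -0 delivers no signal; it only checks that the pid exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process (or no permission): $pid" >&2
        return 1
    fi
    # SIGQUIT asks the JVM to print all thread stacks to its stdout.
    kill -QUIT "$pid"
}
```

Usage would look like `request_thread_dump 12345`, where 12345 stands in for the pid of the hung task's JVM (found with ps or jps, for example). The dump then shows up wherever that JVM's stdout is captured, which for a task is the userlogs location mentioned above.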