From: Allen Wittenauer
Reply-To: common-user@hadoop.apache.org
Date: Tue, 02 Mar 2010 15:55:15 -0800
Subject: Re: "lost task" suspect Distributed Cache to blame

On 3/2/10 12:49 PM, "Edward Capriolo" wrote:

> This job is somewhat special (for us) in that it involves shipping
> large files over distributed cache. My working theory is that
> something goes wrong with the distributed cache/job tracker and the
> Job/Task/Tips never have a chance.
>
> Has anyone ever experienced something like this?

How big is big and what archiving format are you using? Jar has issues with files >2 GB.

I've also noticed that dcache seems to get written to -every- tmp dir on a node, which seems a bit ridiculous.
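
One quick way to answer the "how big is big" question is to scan the files before they are archived and shipped. The sketch below is only an illustration: the function name `find_oversized_files` is hypothetical, and the 2 GiB cutoff is an assumption taken from the ">2 GB" remark above, not a limit documented here.

```python
import os

# Assumption: 2 GiB cutoff, based on the ">2 GB" remark in this message.
SIZE_LIMIT_BYTES = 2 * 1024**3


def find_oversized_files(root, limit=SIZE_LIMIT_BYTES):
    """Return sorted (path, size) pairs for regular files under root larger than limit."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized)


if __name__ == "__main__":
    # Report candidates in the current directory before building the archive.
    for path, size in find_oversized_files("."):
        print(f"{size:>15,d} bytes  {path}")
```

Anything this reports is worth splitting, or packaging in a different format, before it goes into the cache.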