Date: Fri, 18 Aug 2006 06:06:46 -0700
Subject: Bad tracker...
From: Gian Lorenzo Thione
To: "hadoop-user@lucene.apache.org"

If a task tracker is alive and keeps sending heartbeats, but the network falls into a state in which the job tracker cannot contact the task tracker, the node remains on the list of clients, yet every attempt to assign a task to that tracker fails.

Unfortunately, Hadoop does not seem to avoid scheduling the same task over and over on that same client, even when the vast majority of nodes in the cluster are alive and kicking, and after a task fails 5 times the entire job fails.

Is there any way a bad tracker can be removed from the list of clients when its rate of failure is above a certain threshold (perhaps a number of consecutive errors), even if it is still sending heartbeats to the job tracker? I noticed that the total number of errors is tracked, and the machine is even highlighted as having a high number of errors on the machine list page of the web server.

Thanks,
Lorenzo Thione
Powerset, Inc.
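
To make the question concrete, here is a minimal Python sketch of the kind of threshold check being asked about: count consecutive task failures per tracker and skip any tracker that reaches a limit, regardless of whether it still sends heartbeats. This is not Hadoop code. The class `TrackerBlacklist`, the parameter `max_consecutive_failures`, and the tracker names are all hypothetical and only illustrate the idea.

```python
class TrackerBlacklist:
    """Hypothetical helper: tracks consecutive task failures per tracker."""

    def __init__(self, max_consecutive_failures=3):
        # Assumed threshold; the right value would depend on the cluster.
        self.max_consecutive_failures = max_consecutive_failures
        self._consecutive = {}  # tracker name -> current run of failures

    def record_failure(self, tracker):
        # One more failure in a row for this tracker.
        self._consecutive[tracker] = self._consecutive.get(tracker, 0) + 1

    def record_success(self, tracker):
        # Any success breaks the run of consecutive failures.
        self._consecutive[tracker] = 0

    def is_blacklisted(self, tracker):
        # Heartbeats are deliberately ignored: only task outcomes count.
        return self._consecutive.get(tracker, 0) >= self.max_consecutive_failures

    def pick_tracker(self, candidates):
        """Return the first candidate that is not blacklisted, or None."""
        for tracker in candidates:
            if not self.is_blacklisted(tracker):
                return tracker
        return None


if __name__ == "__main__":
    blacklist = TrackerBlacklist(max_consecutive_failures=2)
    blacklist.record_failure("node-a")
    blacklist.record_failure("node-a")
    # node-a has reached the limit, so the next task goes elsewhere.
    print(blacklist.pick_tracker(["node-a", "node-b"]))
```

Resetting the counter on success means a tracker that recovers from a transient network problem becomes eligible again once it completes a task, although a real scheduler would also need some way to retry a blacklisted tracker so that it gets that chance.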