Date: Fri, 18 Aug 2006 06:06:46 -0700
Subject: Bad tracker...
From: Gian Lorenzo Thione <thione@powerset.com>
To: "hadoop-user@lucene.apache.org" <hadoop-user@lucene.apache.org>

If a task tracker is alive and keeps sending heartbeats, but the network
falls into a state in which the job tracker cannot contact the task
tracker, the node remains on the list of clients but every attempt to assign
a task to that tracker fails.

Unfortunately, Hadoop does not seem to avoid scheduling the same task over
and over on that same client, even when the vast majority of nodes in the
cluster are alive and kicking. After a task fails 5 times, the entire job
fails.

Is there any way for a bad tracker to be removed from the list of clients
when its failure rate is above a certain threshold (maybe even a number of
consecutive errors), even if it is still sending heartbeats to the job
tracker?
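
To make the idea concrete, here is a minimal sketch of the bookkeeping this
would need, assuming the job tracker can observe the outcome of each task it
assigns. The class TrackerBlacklist and its methods are hypothetical names
made up for illustration; they are not existing Hadoop APIs, and the
threshold value is an arbitrary assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, NOT part of Hadoop: counts consecutive task
// failures per tracker and blacklists a tracker once a threshold is hit.
class TrackerBlacklist {
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    TrackerBlacklist(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Call when a task assigned to this tracker fails.
    void recordFailure(String tracker) {
        int count = consecutiveFailures.merge(tracker, 1, Integer::sum);
        if (count >= maxConsecutiveFailures) {
            blacklisted.add(tracker);
        }
    }

    // Call when a task on this tracker succeeds; resets the failure streak.
    // A tracker that is already blacklisted stays blacklisted.
    void recordSuccess(String tracker) {
        consecutiveFailures.remove(tracker);
    }

    // A scheduler would skip blacklisted trackers even while they heartbeat.
    boolean isBlacklisted(String tracker) {
        return blacklisted.contains(tracker);
    }
}
```

With a threshold of 3, a tracker would be skipped after three failed
assignments in a row, while a single success in between resets its count.
Counting consecutive failures rather than the total avoids penalizing a
long-lived node for occasional, unrelated errors.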

I noticed that the total number of errors is already tracked, and the
machine is even highlighted as having a high number of errors on the machine
list page of the web server.


Thanks,

Lorenzo Thione
Powerset, Inc.