Date: Thu, 1 May 2008 11:23:10 -0700
From: "Cagdas Gerede"
Reply-To: cagdas.gerede@gmail.com
To: core-user@hadoop.apache.org
Subject: Re: Block reports: memory vs. file system, and Dividing offerService into 2 threads

As far as I understand, the current focus is on how to reduce the namenode's CPU time for processing block reports from a large number of datanodes. But aren't we missing another issue? Doesn't the way a block report is computed delay the master's startup time? I have to make sure the master is up as quickly as possible for maximum availability.
The bottleneck seems to be the scanning of the local disk. I wrote a simple Java program that only scanned the datanode directories the way the Hadoop code does, and the time it took was about 90% of the time taken for block report generation and sending (a rough sketch of the kind of scan I timed is at the end of this message). Scanning appears to be very costly: it takes about 2-4 minutes.

To address this problem, can we have *two types of block reports*: one generated from memory and the other from the local filesystem? For master startup, we could trigger the block report generated from memory, and for the periodic reports we could trigger the one computed from the local filesystem.

Another issue I have is that even if we do block reports only every 10 days, once a report happens it almost freezes the datanode's functions. More specifically, the datanode won't be able to report new blocks to the namenode until the report is computed, which takes at least a couple of minutes per datanode in my system. As a result, the master thinks a block is not yet replicated enough and rejects the addition of a new block to the file; since it does not wait long enough, this eventually causes the write of the file to fail.

To address this, can we separate the scanning of the underlying disk into its own thread, distinct from the reporting of newly received blocks?

Dhruba points out:

> This sequential nature is critical in ensuring that there is no
> erroneous race condition in the Namenode

I do not have any insight into this.

Cagdas

--
------------
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info
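
P.S. For concreteness, here is a minimal sketch of the kind of scan-and-time program I mean (not my exact code; the data directory path is just an example, and the "blk_" file-name convention only roughly mirrors how HDFS lays out block files):

import java.io.File;

// Walks a datanode data directory, stats every block file, and reports
// how long the scan takes. Illustrative only -- not the DataNode code.
public class BlockScanTimer {

    // Recursively visit every file under 'dir', counting block files and bytes.
    static long[] scan(File dir) {
        long files = 0, bytes = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
            return new long[] {0, 0};
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                long[] sub = scan(f);
                files += sub[0];
                bytes += sub[1];
            } else if (f.getName().startsWith("blk_") && !f.getName().endsWith(".meta")) {
                files++;
                bytes += f.length();   // forces a stat on the block file
            }
        }
        return new long[] {files, bytes};
    }

    public static void main(String[] args) {
        // e.g. java BlockScanTimer /data/hadoop/dfs/data/current  (example path)
        File dataDir = new File(args[0]);
        long start = System.currentTimeMillis();
        long[] result = scan(dataDir);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(result[0] + " block files, " + result[1]
                + " bytes, scanned in " + elapsed + " ms");
    }
}

On my datanodes, a scan like this accounts for almost all of the block report time, which is what makes me think the disk walk, not the report construction itself, is the bottleneck.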