Date: Thu, 1 May 2008 11:23:10 -0700
From: "Cagdas Gerede"
Reply-To: cagdas.gerede@gmail.com
To: core-user@hadoop.apache.org
Subject: Re: Block reports: memory vs. file system, and Dividing offerService into 2 threads

As far as I understand, the current focus is on how to reduce the namenode's CPU time for processing block reports from a large number of datanodes. But aren't we missing another issue? Doesn't the way a block report is computed delay the master's startup time? I have to make sure the master is up as quickly as possible for maximum availability.
The bottleneck seems to be the scanning of the local disk. I wrote a simple Java program that only scanned the datanode directories the way the Hadoop code does, and the time it took was about 90% of the time taken for block report generation and sending (a rough sketch of the kind of scan I timed is at the end of this message). Scanning appears to be very costly: it takes about 2-4 minutes.

To address this problem, can we have *two types of block reports*: one generated from memory and the other from the local filesystem? For master startup, we could trigger the block report generated from memory, and for the periodic reports we could trigger the one computed from the local filesystem.

Another issue I have is that even if we do block reports only every 10 days, once a report happens it almost freezes the datanode's functions. More specifically, the datanode won't be able to report new blocks to the namenode until the report is computed, which takes at least a couple of minutes per datanode in my system. As a result, the master thinks a block is not yet replicated enough and rejects the addition of a new block to the file; since it does not wait long enough, this eventually causes the write of the file to fail.

To address this, can we separate the scanning of the underlying disk into its own thread, distinct from the reporting of newly received blocks?

Dhruba points out:

> This sequential nature is critical in ensuring that there is no
> erroneous race condition in the Namenode

I do not have any insight into this.

Cagdas

--
------------
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info
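
P.S. For concreteness, here is a minimal sketch of the kind of scan-and-time program I mean (not my exact code; the data directory path is just an example, and the "blk_" file-name convention only roughly mirrors how HDFS lays out block files):

import java.io.File;

// Walks a datanode data directory, stats every block file, and reports
// how long the scan takes. Illustrative only -- not the DataNode code.
public class BlockScanTimer {

    // Recursively visit every file under 'dir', counting block files and bytes.
    static long[] scan(File dir) {
        long files = 0, bytes = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
            return new long[] {0, 0};
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                long[] sub = scan(f);
                files += sub[0];
                bytes += sub[1];
            } else if (f.getName().startsWith("blk_") && !f.getName().endsWith(".meta")) {
                files++;
                bytes += f.length();   // forces a stat on the block file
            }
        }
        return new long[] {files, bytes};
    }

    public static void main(String[] args) {
        // e.g. java BlockScanTimer /data/hadoop/dfs/data/current  (example path)
        File dataDir = new File(args[0]);
        long start = System.currentTimeMillis();
        long[] result = scan(dataDir);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(result[0] + " block files, " + result[1]
                + " bytes, scanned in " + elapsed + " ms");
    }
}

On my datanodes, a scan like this accounts for almost all of the block report time, which is what makes me think the disk walk, not the report construction itself, is the bottleneck.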