Date: Tue, 5 Jul 2011 19:23:17 +0000 (UTC)
From: "Hairong Kuang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-395) DFS Scalability: Incremental block reports

[ https://issues.apache.org/jira/browse/HDFS-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060086#comment-13060086 ]

Hairong Kuang commented on HDFS-395:
------------------------------------

For incremental block reports, we can always delay the deletion ack of a block until it is physically deleted from the disk; that fixes the race condition Tom pointed out. However, asynchronous deletion also causes race conditions in regular block reports, even without incremental block reports. In Apache trunk, a block report is generated from the in-memory block volume map, and a directory scanner periodically syncs the disk with that map. So a DN may first send a block report to the NN indicating that a block is deleted, then send a later report saying that it has the block, because the scanner adds it back to the volumeMap before the block is actually deleted from disk.

Another possible race condition occurs when a file is deleted. The NN removes the file's blocks from the blocksMap. Assume one block of the file has block id BlckID with a replica at DN1. The NN may then allocate the same block id, with a newer generation stamp, to a new file. If the NN happens to allocate that block to DN1, and DN1 has not yet deleted the previous incarnation of the block from disk, DN1 ends up with two blocks that have the same id but different generation stamps.

Since rename is relatively cheap in most file systems, why not use a synchronous rename to fix all of these possible race conditions? I would also suggest that we fix the race conditions caused by asynchronous block deletions in a separate jira, so Tom can move forward with incremental block reports.
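To make the rename idea concrete, here is a minimal sketch of deletion-by-rename on the DN side: the block file is synchronously moved out of the data directory into a trash directory, so the directory scanner can no longer find it on disk and re-add it to the volumeMap, while the actual unlink still happens on a background thread. This is not the real DataNode code; the class and method names (BlockTrashSketch, invalidate, the toBeDeleted directory) are hypothetical, and a real implementation would also have to retry leftover trash files on restart.

    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * Hypothetical sketch: make block deletion visible synchronously via
     * rename, while the expensive unlink stays asynchronous.
     */
    public class BlockTrashSketch {
      private final File trashDir;  // e.g. <dataDir>/toBeDeleted
      private final ExecutorService deleter =
          Executors.newSingleThreadExecutor();

      public BlockTrashSketch(File dataDir) throws IOException {
        this.trashDir = new File(dataDir, "toBeDeleted");
        if (!trashDir.isDirectory() && !trashDir.mkdirs()) {
          throw new IOException("cannot create " + trashDir);
        }
      }

      /**
       * Synchronously move the block file out of the data directory.
       * After this returns, neither the directory scanner nor a block
       * report can observe the replica, so the deletion ack to the NN
       * can be sent immediately without the races described above.
       */
      public void invalidate(File blockFile) throws IOException {
        File trashed = new File(trashDir, blockFile.getName());
        // Cheap because source and target are on the same file system.
        if (!blockFile.renameTo(trashed)) {
          throw new IOException("rename failed for " + blockFile);
        }
        // Physical deletion stays asynchronous; it is no longer visible.
        deleter.execute(() -> {
          if (!trashed.delete()) {
            System.err.println("delete failed, will retry: " + trashed);
          }
        });
      }
    }

The key property is that the rename, unlike the unlink, is atomic and fast, so holding the deletion ack until it completes costs almost nothing.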
> DFS Scalability: Incremental block reports
> ------------------------------------------
>
>                 Key: HDFS-395
>                 URL: https://issues.apache.org/jira/browse/HDFS-395
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: blockReportPeriod.patch, explicitDeleteAcks.patch
>
> I have a cluster with 1800 datanodes. Each datanode has around 50000 blocks and sends a block report to the namenode once every hour, which means the namenode processes a block report once every 2 seconds. Each block report contains all blocks that the datanode currently hosts, so the namenode has to compare a huge number of blocks that remain practically the same between two consecutive reports. This wastes CPU on the namenode, and the problem becomes worse as the number of datanodes increases.
>
> One proposal is to make subsequent block reports (after a successful send of a full block report) incremental. The namenode would then process only those blocks that were added or deleted in the last period.
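As a quick sanity check on the arithmetic in the description (my own back-of-the-envelope sketch; the per-hour churn figure is purely an assumption, not a measurement from the cluster above):

    // Full vs. incremental block-report load at the NN, using the
    // figures from the issue description. changedPerHour is assumed.
    public class ReportLoadEstimate {
      public static void main(String[] args) {
        int datanodes = 1800;
        int blocksPerDatanode = 50_000;
        int periodSeconds = 3600;   // one full report per DN per hour
        int changedPerHour = 500;   // assumed adds+deletes per DN

        double reportsPerSecond = (double) datanodes / periodSeconds;
        double fullBlocksPerSecond = reportsPerSecond * blocksPerDatanode;
        double incrBlocksPerSecond = reportsPerSecond * changedPerHour;

        // 1800 reports / 3600 s = one report every 2 seconds.
        System.out.printf("one report every %.1f s%n", 1 / reportsPerSecond);
        System.out.printf("full reports:        %.0f blocks/s%n",
            fullBlocksPerSecond);
        System.out.printf("incremental reports: %.0f blocks/s%n",
            incrBlocksPerSecond);
      }
    }

With these numbers the NN compares about 25,000 blocks per second under full reports, versus a few hundred per second if only the churn is reported, which is the gap the incremental-report proposal targets.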