Delivered-To: mailing list common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
From: "Avi Vaknin"
To: common-user@hadoop.apache.org
Subject: Hadoop cluster optimization
Date: Sun, 21 Aug 2011 14:57:16 +0300

Hi all!

How are you? My name is Avi, and I have been fascinated by Apache Hadoop for the last few months. I have spent the last two weeks trying to optimize my configuration files and environment. I have gone through many of Hadoop's configuration properties, and it seems that none of them makes a big difference (about +-3 minutes of total job run time).

By Hadoop's standards my cluster is considered extremely small (260 GB of text files, while each job processes only about 8 GB). I have one server acting as NameNode and JobTracker, and another 5 servers acting as DataNodes and TaskTrackers. Right now Hadoop's configuration is set to the defaults, apart from the DFS block size, which is set to 256 MB since every file in my cluster is 155 - 250 MB.

All of the servers above are identical, with the following hardware and software:

- 1.7 GB memory
- 1 Intel(R) Xeon(R) CPU E5507 @ 2.27 GHz
- Ubuntu Server 10.10, 32-bit platform
- Cloudera CDH3, manual Hadoop installation

(For those familiar with Amazon Web Services, these are Small EC2 instances.)

Total job run time is about 15 minutes (roughly 50 files/blocks/map tasks of up to 250 MB each, and 10 reduce tasks).
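For reference, the block-size change mentioned above is my only non-default setting. In CDH3 (Hadoop 0.20) it lives in hdfs-site.xml, roughly like this (the value is in bytes; 268435456 = 256 * 1024 * 1024):

```xml
<!-- hdfs-site.xml fragment: raise the HDFS block size to 256 MB so each
     150-250 MB input file fits in a single block (one map task per file).
     The property is named dfs.block.size in Hadoop 0.20/CDH3; later
     versions renamed it to dfs.blocksize. -->
<property>
  <name>dfs.block.size</name>
  <value>268435456</value>
</property>
```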
Based on the above, can anyone recommend a best-practice configuration? Do you think that with such a small cluster, processing such a small amount of data, it is even possible to make jobs run much faster?

By the way, it seems that none of the nodes has a hardware performance problem (CPU/memory) while running the job - unless I have a bottleneck somewhere else (network bandwidth does not appear to be the issue). This is a little confusing, because the NameNode and JobTracker processes should each allocate 1 GB of memory, which would mean my hardware is insufficient to begin with; in that case, why don't I see full memory utilization in 'top' on the NameNode & JobTracker server?

How would you recommend measuring/monitoring Hadoop's various properties to find out where the bottleneck is?

Thanks for your help!!
Avi
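On the memory question, one direction I am considering (an assumption on my part, not a tested recommendation): with only 1.7 GB per node, the stock default of 1000 MB heap per Hadoop daemon is too generous for two daemons per box, and the heaps can be shrunk in hadoop-env.sh. Also, a JVM's -Xmx is only an upper bound and heap is committed lazily, which may be why 'top' does not show the full 1 GB per process. A sketch, with guessed values just to illustrate:

```shell
# hadoop-env.sh sketch for 1.7 GB nodes (the numbers are illustrative
# guesses, not tuned recommendations).

# Maximum heap, in MB, for each Hadoop daemon (NameNode, JobTracker,
# DataNode, TaskTracker). The stock default is 1000 MB.
export HADOOP_HEAPSIZE=512

# The per-task child JVMs are sized separately, in mapred-site.xml:
#   mapred.child.java.opts = -Xmx200m
# so that the map/reduce slots fit in the memory left over on each node.
```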