Subject: EC2 instability
From: Rakhi Khatwani
To: hbase-user@hadoop.apache.org, core-user@hadoop.apache.org
Cc: Ninad
Date: Fri, 17 Apr 2009 22:09:45 +0530

Hi,

It has been several days since we started trying to stabilize Hadoop/HBase on our EC2 cluster, but we have failed to do so. We still come across frequent region server failures, scanner timeout exceptions, OS-level deadlocks, etc., and today, while listing the tables in HBase, I got the following exception:

hbase(main):001:0> list
09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...
09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Zzzzz...

But if I check the web UI, the HBase master is still up (I tried refreshing it several times). And I have been getting a lot of exceptions from time to time, including region servers going down (which happens very frequently and causes heavy data loss, on production data at that), scanner timeout exceptions, cannot-allocate-memory exceptions, etc.

I am working on an Amazon EC2 cluster with 6 nodes, each node being a Large Instance with the following hardware configuration:

- Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform

I am using hadoop-0.19.0 and hbase-0.19.0 (resynced to all the nodes, and I made sure that there is a symbolic link to hadoop-site.xml from hbase/conf).

Following is my configuration in hadoop-site.xml (a sketch of how a couple of these entries would appear in the XML file follows right after the list):

  hadoop.tmp.dir = /mnt/hadoop
  fs.default.name = hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001
  mapred.job.tracker = domU-12-31-39-00-E5-D2.compute-1.internal:50002
  tasktracker.http.threads = 80
  mapred.tasktracker.map.tasks.maximum = 3
  mapred.tasktracker.reduce.tasks.maximum = 3
  mapred.output.compress = true
  mapred.output.compression.type = BLOCK
  dfs.client.block.write.retries = 3
  mapred.child.java.opts = -Xmx4096m
    I gave this a high value since the RAM on each node is 7 GB...
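For reference, this is roughly how I believe the task-slot and child-heap entries above would be written out in hadoop-site.xml (the values are copied from the list; the comment on the heap budget is only my own back-of-the-envelope arithmetic, so please correct me if it is wrong):

  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>3</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>3</value>
    </property>
    <!-- With 3 map slots + 3 reduce slots per node, up to 6 child JVMs can run
         concurrently; at -Xmx4096m each, that is up to 24 GB of heap requested
         on a 7.5 GB instance, which may be why the allocation failures appear. -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx4096m</value>
    </property>
  </configuration>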
I am not sure of the mapred.child.java.opts setting, though. I got a Cannot Allocate Memory exception after making this change (got it for the first time). After going through the archives, someone suggested enabling memory overcommit, but I am not sure of that either.

  dfs.datanode.max.xcievers = 4096
    As suggested by some of you... I guess it solved the data xceivers exception on Hadoop.
  dfs.datanode.handler.count = 10
  mapred.task.timeout = 0
    ("The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string.")
    I set this because I have been getting a lot of "Cannot report in 602 seconds... killing" exceptions.
  mapred.tasktracker.expiry.interval = 360000
    ("Expert: The time-interval, in milliseconds, after which a tasktracker is declared 'lost' if it doesn't send heartbeats.")
  dfs.datanode.socket.write.timeout = 0
    To avoid socket timeout exceptions.
  dfs.replication = 5
    ("Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.")
  mapred.job.reuse.jvm.num.tasks = -1
    ("How many tasks to run per jvm. If set to -1, there is no limit.")

And following is the configuration in hbase-site.xml (a sketch of these entries in XML form is attached at the end of this mail):

  hbase.master = domU-12-31-39-00-E5-D2.compute-1.internal:60000
  hbase.rootdir = hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001/hbase
  hbase.regionserver.lease.period = 12600000
    ("HRegion server lease period in milliseconds. Default is 60 seconds. Clients must report in within this period else they are considered dead.")
    I set this because there is a map-reduce program which takes almost 3-4 minutes to process a row; the worst case is 7 minutes. So the value has been calculated as (7 * 60 * 1000) * 30 = 12600000, where 7 * 60 * 1000 is the time to process one row in milliseconds and 30 is the default HBase scanner caching. So I shouldn't be getting scanner timeout exceptions. (I made this change today, and I haven't come across a scanner timeout exception today.)
  hbase.master.lease.period = 3600000
    ("HMaster server lease period in milliseconds. Default is 120 seconds. Region servers must report in within this period else they are considered dead. On a loaded cluster, you may need to up this period.")

Any suggestions on changes to the configuration? My main concern is the region servers going down from time to time, which happens very frequently; because of it my map-reduce tasks hang and the entire application fails :(

I have tried almost all the suggestions mentioned by you except separating the datanodes from the computational nodes, which I plan to do tomorrow. Has it been tried before, and what would be your recommendation: how many nodes should I consider as datanodes and how many as computational nodes?

I am hoping that the cluster will be stable by tomorrow :)

Thanks a ton,
Raakhi
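P.S. The sketch referenced above: roughly how I believe the hbase-site.xml entries would be written out (the values are copied from the list in this mail; the comment just repeats my lease-period arithmetic):

  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>hbase.master</name>
      <value>domU-12-31-39-00-E5-D2.compute-1.internal:60000</value>
    </property>
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001/hbase</value>
    </property>
    <!-- Worst case 7 minutes per row = 7 * 60 * 1000 = 420000 ms,
         times the default scanner caching of 30 rows = 12600000 ms. -->
    <property>
      <name>hbase.regionserver.lease.period</name>
      <value>12600000</value>
    </property>
  </configuration>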