From: anil gupta
Date: Tue, 7 Aug 2012 10:59:14 -0700
Subject: Re: Bulk loading job failed when one region server went down in the cluster
To: user@hbase.apache.org

Hi HBase Folks,

I ran the bulk loader last night to load data into a table. During the
bulk loading job one of the region servers crashed and the entire job
failed. The job takes around 2.5 hours to finish, and it failed when it was
around 50% complete. After the failure, the table was also corrupted in
HBase. My cluster has 8 region servers.

Is bulk loading not fault tolerant to the failure of region servers?

I am using this old email chain because my question went unanswered at that
time. Please share your views.

Thanks,
Anil Gupta

On Tue, Apr 3, 2012 at 9:12 AM, anil gupta wrote:

> Hi Kevin,
>
> I am not really concerned about the RegionServer going down, as the same
> thing can happen when deployed in production. In production, though, we
> won't have a VM environment, and I am aware that my current Dev
> environment is not good for heavy processing. What I am concerned about is
> the failure of the bulk loading job when the Region Server failed. Does this
> mean that the bulk loading job is not fault tolerant to the failure of a
> Region Server? I was expecting the job to be successful even though the
> RegionServer failed, because there were 6 more RSs running in the cluster.
> Fault tolerance is one of the biggest selling points of the Hadoop platform.
> Let me know your views.
> Thanks for your time.
>
> Thanks,
> Anil Gupta
>
>
> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell wrote:
>
>> Anil,
>>
>> I am sorry for the delayed response. Reviewing the logs, it appears:
>>
>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
>> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
>> closing socket connection and attempting reconnect
>>
>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
>> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>
>> It appears to be a classic overworked RS. You were doing too much
>> for the RS and it did not respond in time, so the Master marked it as
>> dead. When the RS responded, the Master said, no, you are already dead,
>> and aborted the server. This is why you see the YouAreDeadException.
>> This is probably due to the shared resources of the VM infrastructure
>> you are running. You will either need to devote more resources or add
>> more nodes (most likely physical) to the cluster if you would like to
>> keep running these jobs.
>>
>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta wrote:
>> > Hi Kevin,
>> >
>> > Here is the Dropbox link to the log file of the region server which failed:
>> > http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>> > IMHO, the problem starts at line #3009, which says:
>> > 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server
>> > serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44,
>> > usedHeap=446, maxHeap=1197): Unhandled exception:
>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > currently processing ihub-dn-b1,60020,1332955859363 as dead server
>> >
>> > I have already tested the fault tolerance of HBase by manually bringing
>> > down an RS while querying a table, and it worked fine. I was expecting the
>> > same today (even though the RS went down by itself this time) when I was
>> > loading the data, but it didn't work out well.
>> > Thanks for your time. Let me know if you need more details.
>> >
>> > ~Anil
>> >
>> > On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell wrote:
>> >
>> >> Anil,
>> >>
>> >> Can you please attach the RS logs from the failure?
>> >>
>> >> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta wrote:
>> >> > Hi All,
>> >> >
>> >> > I am using cdh3u2 and I have 7 worker nodes (VMs spread across two
>> >> > machines) which are running Datanode, Tasktracker, and Region Server
>> >> > (1200 MB heap size). I was loading data into HBase using the Bulk
>> >> > Loader with a custom mapper. I was loading around 34 million records,
>> >> > and I have loaded the same set of data in the same environment many
>> >> > times before without any problem. This time, while loading the data,
>> >> > one of the region servers failed (the DN and TT kept running on that
>> >> > node), and then, after numerous failures of map tasks, the loading
>> >> > job failed.
>> >> > Is there any setting/configuration which can make Bulk Loading
>> >> > fault-tolerant to the failure of region servers?
>> >> >
>> >> > --
>> >> > Thanks & Regards,
>> >> > Anil Gupta
>> >>
>> >>
>> >>
>> >> --
>> >> Kevin O'Dell
>> >> Customer Operations Engineer, Cloudera
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Anil Gupta
>>
>>
>>
>> --
>> Kevin O'Dell
>> Customer Operations Engineer, Cloudera
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>>

--
Thanks & Regards,
Anil Gupta
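
For anyone triaging a similar failure, the two log messages quoted in this thread (the ZooKeeper client session timeout and the RegionServer abort) can be picked out of a RegionServer log with a short script. This is only a minimal sketch: the regular expressions assume the message wording shown in this thread, the function name `find_events` and the `SAMPLE_LOG` list are hypothetical, and the sample lines are the quoted log lines rejoined onto single lines. Other HBase versions may word these messages differently.

```python
import re

# Patterns assume the log wording quoted in this thread (cdh3u2-era HBase).
ZK_TIMEOUT = re.compile(
    r"Client session timed out, have not heard from server in (\d+)ms "
    r"for sessionid (0x[0-9a-fA-F]+)"
)
RS_ABORT = re.compile(
    r"FATAL regionserver\.HRegionServer: ABORTING region server "
    r"serverName=([^,\s]+),(\d+),(\d+)"
)


def find_events(lines):
    """Return (kind, details) tuples for ZK session timeouts and RS aborts."""
    events = []
    for line in lines:
        timeout = ZK_TIMEOUT.search(line)
        if timeout:
            events.append(("zk_session_timeout", {
                "silent_ms": int(timeout.group(1)),
                "session_id": timeout.group(2),
            }))
        abort = RS_ABORT.search(line)
        if abort:
            events.append(("rs_abort", {
                "host": abort.group(1),
                "port": int(abort.group(2)),
                # True when the Master had already declared this server dead.
                "dead_server": "YouAreDeadException" in line,
            }))
    return events


# The two log lines quoted in the thread, rejoined onto single lines.
SAMPLE_LOG = [
    "12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out, "
    "have not heard from server in 59311ms for sessionid 0x136557f99c90065, "
    "closing socket connection and attempting reconnect",
    "12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region "
    "server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, "
    "regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: "
    "org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; "
    "currently processing ihub-dn-b1,60020,1332955859363 as dead server",
]

if __name__ == "__main__":
    for kind, details in find_events(SAMPLE_LOG):
        print(kind, details)
```

A long silent period followed shortly by an abort carrying YouAreDeadException matches the sequence Kevin describes: the RS stopped responding, the Master marked it dead, and the RS aborted when it reported back.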