hbase-user mailing list archives

From anil gupta <anilgupt...@gmail.com>
Subject Re: Bulk loading job failed when one region server went down in the cluster
Date Mon, 13 Aug 2012 20:14:47 GMT
Hi Mike,

I tried doing that by setting properties in mapred-site.xml, but YARN
doesn't seem to honor the "mapreduce.tasktracker.map.tasks.maximum"
property. Here is a reference to a discussion of the same problem:
https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
I have also posted about the same problem on the Hadoop mailing list.
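
For what it's worth, my understanding is that under MRv2 the per-node slot
count is gone entirely and map concurrency is bounded by container memory
instead. A minimal sketch of the memory-based equivalent (the values are
illustrative, not from my cluster, and note the MR ApplicationMaster
occupies a container of its own):

  <!-- yarn-site.xml: total memory the NodeManager may hand out -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>

  <!-- mapred-site.xml: memory each map container requests -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>

With 2048 MB schedulable per node and 1024 MB per map container, YARN
should run at most two concurrent maps on a node.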

I already admitted in my previous email that YARN has major issues when we
try to control it in a low-memory environment. I was just trying to get the
views of HBase experts on bulk load failures, since we will be relying
heavily on fault tolerance.
If the HBase bulk loader is fault tolerant to the failure of an RS in a
viable environment, then I don't have any issue. I hope this clears up my
purpose in posting on this topic.
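
On the job-level fault tolerance itself: as far as I know, each map task is
retried up to mapreduce.map.maxattempts times (default 4) before the whole
job is declared failed, so if every attempt lands while regions from the
dead RS are still being reassigned, the job dies. Raising the limit may let
the job ride out an RS failure; a sketch for mapred-site.xml, with an
illustrative value:

  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>8</value>
  </property>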

Thanks,
Anil

On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel
<michael_segel@hotmail.com> wrote:

> Anil,
>
> Do you know what happens when you have an airplane that has too heavy a
> cargo when it tries to take off?
> You run out of runway and you crash and burn.
>
> Looking at your post, why are you starting 8 map processes on each slave?
> That's tunable, and you clearly do not have enough memory in each VM to
> support 8 slots on a node.
> Once you swap, you cause HBase to crash and burn.
>
> 3.2GB of memory means no more than 1 slot per slave, and even then...
> you're going to be very tight. Not to mention that you will need to loosen
> up on your timings, since it's all virtual and you have way too much I/O
> per drive going on.
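
The timing knob I believe Mike means here is the ZooKeeper session timeout
in hbase-site.xml; a sketch with an illustrative value (the ZooKeeper
server's maxSessionTimeout must also be large enough to honor it):

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- 180s rather than a tighter value, so a briefly stalled RS is not
         declared dead -->
    <value>180000</value>
  </property>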
>
>
> My suggestion is that you go back and tune your system before thinking
> about running anything.
>
> HTH
>
> -Mike
>
> On Aug 13, 2012, at 2:11 PM, anil gupta <anilgupta84@gmail.com> wrote:
>
> > Hi Guys,
> >
> > Sorry for not mentioning the version I am currently running. My current
> > version is HBase 0.92.1 (CDH4), running Hadoop 2.0.0-alpha with YARN for
> > MR. My original post was for HBase 0.92. Here are some more details of my
> > current setup:
> > I am running an 8-slave, 4-admin-node cluster on CentOS 6.0 VMs installed
> > on VMware hypervisor 5.0. Each of my VMs has 3.2 GB of memory and 500 GB
> > of HDFS space.
> > I use this cluster for POCs (proofs of concept). I am not looking for any
> > performance benchmarking from this set-up. Due to some major bugs in YARN,
> > I am unable to make it work properly with less than 4 GB of memory. I am
> > already discussing those bugs on the Hadoop mailing list.
> >
> > Here is the log of failed mapper: http://pastebin.com/f83xE2wv
> >
> > The problem is that when I start a bulk loading job in YARN, 8 map
> > processes start on each slave, and all of my slaves get hammered badly
> > as a result. Since the slaves are getting hammered, the RegionServer gets
> > a lease expiry or a YouAreDeadException. Here is the log of the RS which
> > caused the job to fail: http://pastebin.com/9ZQx0DtD
> >
> > I am aware that this is happening due to underperforming hardware (two
> > slaves share one 7200 RPM hard drive in my setup) and some major bugs
> > around running YARN with less than 4 GB of memory. My only concern is the
> > failure of the entire MR job and its fault tolerance to RS failures. I am
> > not really concerned about the RS failure itself, since HBase is fault
> > tolerant.
> >
> > Please let me know if you need anything else.
> >
> > Thanks,
> > Anil
> >
> >
> >
> > On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel
> > <michael_segel@hotmail.com> wrote:
> >
> >> Yes, it can.
> >> You can see one RS failure cause a cascading RS failure. Of course YMMV,
> >> and it depends on which version you are running.
> >>
> >> OP is on CDH3u2, which still had some issues. CDH3u4 is the latest and
> >> he should upgrade.
> >>
> >> (Or go to CDH4...)
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <kevin.odell@cloudera.com>
> >> wrote:
> >>
> >>> Anil,
> >>>
> >>> Do you have root cause on the RS failure?  I have never heard of one RS
> >>> failure causing a whole job to fail.
> >>>
> >>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <anilgupta84@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi HBase Folks,
> >>>>
> >>>> I ran the bulk loader last night to load data into a table. During the
> >>>> bulk loading job, one of the region servers crashed and the entire job
> >>>> failed. It takes around 2.5 hours for this job to finish, and the job
> >>>> failed when it was at around 50% complete. After the failure, that
> >>>> table was also corrupted in HBase. My cluster has 8 region servers.
> >>>>
> >>>> Is bulk loading not fault tolerant to failure of region servers?
> >>>>
> >>>> I am reviving this old email chain because my question went unanswered
> >>>> at the time. Please share your views.
> >>>>
> >>>> Thanks,
> >>>> Anil Gupta
> >>>>
> >>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <anilgupta84@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Kevin,
> >>>>>
> >>>>> I am not really concerned about the RegionServer going down, as the
> >>>>> same thing can happen when deployed in production. Although in
> >>>>> production we won't have a VM environment, and I am aware that my
> >>>>> current dev environment is not good for heavy processing. What I am
> >>>>> concerned about is the failure of the bulk loading job when the region
> >>>>> server failed. Does this mean that the bulk loading job is not fault
> >>>>> tolerant to the failure of a region server? I was expecting the job to
> >>>>> succeed even though the RegionServer failed, because there were 6 more
> >>>>> RS running in the cluster. Fault tolerance is one of the biggest
> >>>>> selling points of the Hadoop platform. Let me know your views.
> >>>>> Thanks for your time.
> >>>>>
> >>>>> Thanks,
> >>>>> Anil Gupta
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell
> >>>>> <kevin.odell@cloudera.com> wrote:
> >>>>>
> >>>>>> Anil,
> >>>>>>
> >>>>>> I am sorry for the delayed response.  Reviewing the logs, it appears:
> >>>>>>
> >>>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
> >>>>>> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
> >>>>>> closing socket connection and attempting reconnect
> >>>>>>
> >>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
> >>>>>> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >>>>>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >>>>>>
> >>>>>> It appears to be a classic overworked RS.  You were doing too much
> >>>>>> for the RS and it did not respond in time; the Master marked it as
> >>>>>> dead, and when the RS responded, the Master said no, you are already
> >>>>>> dead, and aborted the server.  This is why you see the
> >>>>>> YouAreDeadException.  This is probably due to the shared resources of
> >>>>>> the VM infrastructure you are running.  You will either need to
> >>>>>> devote more resources or add more nodes (most likely physical) to the
> >>>>>> cluster if you would like to keep running these jobs.
> >>>>>>
> >>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <anilgupt@buffalo.edu>
> >>>>>> wrote:
> >>>>>>> Hi Kevin,
> >>>>>>>
> >>>>>>> Here is a Dropbox link to the log file of the region server which
> >>>>>>> failed:
> >>>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
> >>>>>>> IMHO, the problem starts at line #3009, which says: 12/03/30 15:38:32
> >>>>>>> FATAL regionserver.HRegionServer: ABORTING region server
> >>>>>>> serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >>>>>>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >>>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >>>>>>>
> >>>>>>> I have already tested the fault tolerance of HBase by manually
> >>>>>>> bringing down an RS while querying a table, and it worked fine; I was
> >>>>>>> expecting the same today (even though the RS went down by itself this
> >>>>>>> time) when I was loading the data. But it didn't work out well.
> >>>>>>> Thanks for your time. Let me know if you need more details.
> >>>>>>>
> >>>>>>> ~Anil
> >>>>>>>
> >>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell
> >>>>>>> <kevin.odell@cloudera.com> wrote:
> >>>>>>>
> >>>>>>>> Anil,
> >>>>>>>>
> >>>>>>>> Can you please attach the RS logs from the failure?
> >>>>>>>>
> >>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <anilgupt@buffalo.edu>
> >>>>>>>> wrote:
> >>>>>>>>> Hi All,
> >>>>>>>>>
> >>>>>>>>> I am using CDH3u2, and I have 7 worker nodes (VMs spread across
> >>>>>>>>> two machines) which are running a DataNode, TaskTracker, and
> >>>>>>>>> RegionServer (1200 MB heap size). I was loading data into HBase
> >>>>>>>>> using the bulk loader with a custom mapper. I was loading around
> >>>>>>>>> 34 million records, and I have loaded the same set of data in the
> >>>>>>>>> same environment many times before without any problem. This time
> >>>>>>>>> while loading the data, one of the region servers failed (though
> >>>>>>>>> the DN and TT kept running on that node), and then, after numerous
> >>>>>>>>> failures of map tasks, the loading job failed. Is there any
> >>>>>>>>> setting/configuration which can make bulk loading fault-tolerant
> >>>>>>>>> to the failure of region servers?
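
For a CDH3u2/MRv1 setup like this one, per-node map concurrency is capped in
mapred-site.xml, which should keep the mappers from starving the co-located
RS in the first place; a sketch with an illustrative value (this is the MRv1
knob whose MRv2 counterpart the later discussion above found ineffective
under YARN):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>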
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Thanks & Regards,
> >>>>>>>>> Anil Gupta
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Kevin O'Dell
> >>>>>>>> Customer Operations Engineer, Cloudera
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Thanks & Regards,
> >>>>>>> Anil Gupta
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Kevin O'Dell
> >>>>>> Customer Operations Engineer, Cloudera
> >>>>>>
> >>>>>> --
> >>>>>> Thanks & Regards,
> >>>>>> Anil Gupta
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thanks & Regards,
> >>>> Anil Gupta
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Kevin O'Dell
> >>> Customer Operations Engineer, Cloudera
> >>
> >>
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
>


-- 
Thanks & Regards,
Anil Gupta
