hbase-user mailing list archives

From: Jamie Cockrill <jamie.cockr...@gmail.com>
Subject: Re: HBase on same boxes as HDFS Data nodes
Date: Wed, 07 Jul 2010 17:53:39 GMT
I think you're right.

Unfortunately the machines are on a separate network to this laptop,
so I'm having to type everything across, apologies if it doesn't
translate well...

free -m gave:

             total       used       free
Mem:          7992       7939         53
-/+ buffers/cache:       7877        114
Swap:        23415        895      22519

I did this on another node that isn't being smashed at the moment, and
the numbers came out similar, though the buffers/cache free figure was higher.

vmstat 20 is giving non-zero si and so values, ranging between 3 and
just short of 5000.

That seems to be it, I guess. The Hadoop troubleshooting advice suggests
setting swappiness to 0; is that just a case of changing the value in
/proc/sys/vm/swappiness?
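
If so, I'm assuming it'd be something like the following on each node
(run as root; the sysctl.conf line makes it survive reboots):

  echo 0 > /proc/sys/vm/swappiness      # takes effect immediately
  # or equivalently: sysctl -w vm.swappiness=0

  # in /etc/sysctl.conf, to persist across reboots:
  vm.swappiness = 0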

thanks

Jamie




On 7 July 2010 18:40, Todd Lipcon <todd@cloudera.com> wrote:
> On Wed, Jul 7, 2010 at 10:32 AM, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
>
>> On the subject of GC and heap, I've left those as defaults. I could
>> look at those if that's the next logical step. Would there be anything
>> in any of the logs that I should look at?
>>
>> One thing I have noticed is that it takes an absolute age to log in
>> to the DN/RS to restart the RS once it's fallen over; in one instance
>> it took about 10 minutes. These are 8GB, 4-core amd64 boxes.
>>
>>
> That indicates swapping. Can you run "free -m" on the node?
>
> Also let "vmstat 20" run while running your job and observe the "si" and
> "so" columns. If those are nonzero, it indicates you're swapping, and you've
> oversubscribed your RAM (very easy on 8G machines).
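>
> For example (illustrative numbers, not from your cluster; si/so are the
> KB/s swapped in and out per interval):
>
>   $ vmstat 20
>   procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>    r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>    2  1 916480  54000  12000 102000  240 1800   500   900 1200 2400 30 10 40 20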
>
> -Todd
>
>
>
>> ta
>>
>> Jamie
>>
>>
>>
>> On 7 July 2010 18:30, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
>> > Bad news: it looks like my xcievers is set as it should be. It's in
>> > the hdfs-site.xml, and looking at the job.xml of one of my jobs in the
>> > job-tracker, that property shows as set to 2047. I've cat |
>> > grepped one of the datanode logs, and although there were a few
>> > xciever exceptions in there, they were from a few months ago. I've
>> > upped the MAX_FILESIZE on my table to 1GB to see if that helps (not
>> > sure if it will!).
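>> >
>> > For reference, the hdfs-site.xml entry looks roughly like this (the
>> > value is what job.xml reports; exact formatting from memory):
>> >
>> >   <property>
>> >     <name>dfs.datanode.max.xcievers</name>
>> >     <value>2047</value>
>> >   </property>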
>> >
>> > Thanks,
>> >
>> > Jamie
>> >
>> > On 7 July 2010 18:12, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> >> xcievers exceptions will be in the datanodes' logs, and your problem
>> >> totally looks like it. 0.20.5 will have the same issue (since it's on
>> >> the HDFS side).
>> >>
>> >> J-D
>> >>
>> >> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
>> >> <jamie.cockrill@gmail.com> wrote:
>> >>> Hi Todd & JD,
>> >>>
>> >>> Environment:
>> >>> All (hadoop and HBase) installed as of karmic-cdh3, which means:
>> >>> Hadoop 0.20.2+228
>> >>> HBase 0.89.20100621+17
>> >>> Zookeeper 3.3.1+7
>> >>>
>> >>> Unfortunately my whole cluster of regionservers has now crashed, so I
>> >>> can't really say if it was swapping too much. There is a DEBUG
>> >>> statement just before it crashes saying:
>> >>>
>> >>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
>> >>> hdfs://<somewhere on my HDFS, in /hbase>
>> >>>
>> >>> What follows is:
>> >>>
>> >>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> >>> org.apache.hadoop.ipc.RemoteException:
>> >>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>> >>> on <file location as above> File does not exist. Holder
>> >>> DFSClient_-11113603 does not have any open files
>> >>>
>> >>> It then seems to try and do some error recovery (Error Recovery for
>> >>> block null bad datanode[0] nodes == null), fails (Could not get block
>> >>> locations. Source file "<hbase file as before>" - Aborting). There is
>> >>> then an ERROR org.apache...HRegionServer: Close and delete failed.
>> >>> There is then a similar LeaseExpiredException as above.
>> >>>
>> >>> There are then a couple of messages from HRegionServer saying that
>> >>> it's notifying master of its shutdown and stopping itself. The
>> >>> shutdown hook then fires and the RemoteException and
>> >>> LeaseExpiredExceptions are printed again.
>> >>>
>> >>> ulimit is set to 65000 (it's in the regionserver log, printed as I
>> >>> restarted the regionserver), however I haven't got the xcievers set
>> >>> anywhere. I'll give that a go. It does seem very odd, as I did have a
>> >>> few of them fall over one at a time with a few early loads, but that
>> >>> seemed to be because the regions weren't splitting properly, so all
>> >>> the traffic was going to one node and it was being overwhelmed. Once I
>> >>> throttled it, after one load a region split seemed to get triggered,
>> >>> which flung regions all over and made subsequent loads much more
>> >>> distributed. However, perhaps the time-bomb was ticking... I'll have a
>> >>> go at specifying the xcievers property. I'm pretty certain I've got
>> >>> everything else covered, except the patches referenced in the JIRA.
>> >>>
>> >>> I just grepped some of the log files and didn't get an explicit
>> >>> exception with 'xciever' in it.
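>> >>>
>> >>> For the record, roughly what I ran (the log path is my guess at the
>> >>> cdh3 packaging default):
>> >>>
>> >>>   grep -i xciever /var/log/hadoop/*datanode*.log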
>> >>>
>> >>> I am considering downgrading(?) to 0.20.5; however, because everything
>> >>> is installed as per karmic-cdh3, I'm a bit reluctant, as presumably
>> >>> Cloudera has tested each of these versions against each other? And I
>> >>> don't really want to introduce further versioning issues.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Jamie
>> >>>
>> >>>
>> >>> On 7 July 2010 17:30, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> >>>> Jamie,
>> >>>>
>> >>>> Does your configuration meet the requirements?
>> >>>>
>> >>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>> >>>>
>> >>>> ulimit and xcievers, if not set, are usually time bombs that blow up
>> >>>> when the cluster is under load.
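>> >>>>
>> >>>> A quick sanity check on each node, run as the user the daemons run
>> >>>> as (the stock Linux default is 1024; you want to see your raised
>> >>>> value, e.g. 32768 or more):
>> >>>>
>> >>>>   ulimit -n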
>> >>>>
>> >>>> J-D
>> >>>>
>> >>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>> >>>> <jamie.cockrill@gmail.com> wrote:
>> >>>>
>> >>>>> Dear all,
>> >>>>>
>> >>>>> My current HBase/Hadoop architecture has HBase region servers on the
>> >>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>> >>>>> of region server crashes. The last thing that happens appears to be a
>> >>>>> DroppedSnapshotException, caused by an IOException: could not
>> >>>>> complete write to file <file on HDFS>. I am running it under load;
>> >>>>> how heavy that load is I'm not sure how to quantify, but I'm guessing
>> >>>>> it is a load issue.
>> >>>>>
>> >>>>> Is it common practice to put region servers on data-nodes? Is it
>> >>>>> common to see region server crashes when either the HDFS or region
>> >>>>> server (or both) is under heavy load? I'm guessing that is the case,
>> >>>>> as I've seen a few similar posts. I've not got a great deal of
>> >>>>> capacity to separate region servers from HDFS data nodes, but it
>> >>>>> might be an argument I could make.
>> >>>>>
>> >>>>> Thanks
>> >>>>>
>> >>>>> Jamie
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
