hbase-user mailing list archives

From Jamie Cockrill <jamie.cockr...@gmail.com>
Subject Re: HBase on same boxes as HDFS Data nodes
Date Wed, 07 Jul 2010 17:40:55 GMT
On the subject of swapping, I'm re-running one of the jobs to have a
go. All the load is going to one regionserver at the moment (no region
splits have occurred yet), and top on that box shows:

Mem:  8184284k total, ~8130000k used, ~524000k free, 28000k buffers
Swap: 23976972k total, ~759000k used, ~23222000k free, 458000k cached
(figures approximate; I can't type at a ms rate!)

Not sure if that is indicative of anything.
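For what it's worth, the numbers can be captured exactly, rather than read off a live top display; this is just a sketch using the standard procps tools:

```shell
# Snapshot memory and swap usage in kilobytes, non-interactively.
free -k

# Watch swap activity over time: one sample per second, five samples.
# Sustained non-zero si/so columns mean the box is actively swapping.
vmstat 1 5
```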

thanks

Jamie

PS: I have disabled compression on my table for now, as having 'GZ'
compression specified slowed data loading down massively, and my RS
logs seemed to be filled with messages from a supposed CodecPool,
something like 'returning new codec instance'.
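For anyone wanting to do the same, compression is set per column family from the HBase shell; 'mytable' and 'cf' below are placeholders, and on HBase of this era the table has to be disabled before altering:

```
hbase> disable 'mytable'
hbase> alter 'mytable', {NAME => 'cf', COMPRESSION => 'NONE'}
hbase> enable 'mytable'
```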


On 7 July 2010 18:32, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
> On the subject of GC and heap, I've left those as defaults. I could
> look at those if that's the next logical step? Would there be anything
> in any of the logs that I should look at?
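If heap and GC tuning turn out to be the next step, the knobs live in conf/hbase-env.sh; the values below are illustrative assumptions for an 8GB box, not recommendations:

```shell
# hbase-env.sh -- example only; heap size and log path are assumptions.
# Default regionserver heap in this era was 1000 MB.
export HBASE_HEAPSIZE=4000
# Turn on GC logging so long pauses show up with timestamps.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -Xloggc:/var/log/hbase/gc-hbase.log"
```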
>
> One thing I have noticed is that it takes an absolute age to log
> in to the DN/RS to restart the RS once it's fallen over; in one
> instance it took about 10 minutes. These are 8 GB, 4-core amd64 boxes.
>
> ta
>
> Jamie
>
>
>
> On 7 July 2010 18:30, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
>> Bad news: it looks like my xcievers setting is as it should be; it's in
>> hdfs-site.xml, and the job.xml of one of my jobs in the job-tracker
>> shows that property set to 2047. I've grepped one of the datanode logs,
>> and although there were a few such exceptions in there, they were from
>> a few months ago. I've upped the MAX_FILESIZE on my table to 1GB to
>> see if that helps (not sure if it will!).
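For reference, the setting being discussed lives in hdfs-site.xml on each datanode; this is the shape of the entry, using the 2047 value mentioned above:

```xml
<!-- hdfs-site.xml on each datanode. Note the property name really is
     misspelled "xcievers" in Hadoop 0.20. Datanodes must be restarted
     for a change to take effect. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
```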
>>
>> Thanks,
>>
>> Jamie
>>
>> On 7 July 2010 18:12, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>> xcievers exceptions will be in the datanodes' logs, and your problem
>>> totally looks like it. 0.20.5 will have the same issue (since it's on
>>> the HDFS side).
>>>
>>> J-D
>>>
>>> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
>>> <jamie.cockrill@gmail.com> wrote:
>>>> Hi Todd & JD,
>>>>
>>>> Environment:
>>>> All (hadoop and HBase) installed as of karmic-cdh3, which means:
>>>> Hadoop 0.20.2+228
>>>> HBase 0.89.20100621+17
>>>> Zookeeper 3.3.1+7
>>>>
>>>> Unfortunately my whole cluster of regionservers have now crashed, so I
>>>> can't really say if it was swapping too much. There is a DEBUG
>>>> statement just before it crashes saying:
>>>>
>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
>>>> hdfs://<somewhere on my HDFS, in /hbase>
>>>>
>>>> What follows is:
>>>>
>>>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>>> org.apache.hadoop.ipc.RemoteException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>>>> on <file location as above> File does not exist. Holder
>>>> DFSClient_-11113603 does not have any open files
>>>>
>>>> It then seems to try to do some error recovery (Error Recovery for
>>>> block null bad datanode[0] nodes == null), and fails (Could not get
>>>> block locations. Source file "<hbase file as before>" - Aborting).
>>>> There is then an ERROR org.apache...HRegionServer: Close and delete
>>>> failed, followed by a similar LeaseExpiredException as above.
>>>>
>>>> There are then a couple of messages from HRegionServer saying that
>>>> it's notifying master of its shutdown and stopping itself. The
>>>> shutdown hook then fires and the RemoteException and
>>>> LeaseExpiredExceptions are printed again.
>>>>
>>>> ulimit is set to 65000 (it's printed in the regionserver log when I
>>>> restarted the regionserver), but I haven't got the xceivers set
>>>> anywhere. I'll give that a go. It does seem very odd, as I did have a
>>>> few of them fall over one at a time during a few early loads, but that
>>>> seemed to be because the regions weren't splitting properly, so all
>>>> the traffic was going to one node and it was being overwhelmed. Once I
>>>> throttled it, after one load a region split seemed to get triggered,
>>>> which flung regions all over and made subsequent loads much more
>>>> distributed. However, perhaps the time-bomb was ticking... I'll have
>>>> a go at specifying the xcievers property. I'm pretty certain I've got
>>>> everything else covered, except the patches referenced in the JIRA.
>>>>
>>>> I just grepped some of the log files and didn't get an explicit
>>>> exception with 'xciever' in it.
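When the limit is actually hit, 0.20-era datanodes log a line roughly like the one below (wording recalled from memory, so treat it as approximate); a self-contained sketch of grepping for it:

```shell
# Sample of the (assumed) datanode log line produced when the xceiver
# limit is exceeded, written to a temp file so the grep can be shown.
log=$(mktemp)
echo "DataXceiver: java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent xcievers 2048" > "$log"

# Count matching lines; against a real cluster you'd point this at the
# datanode log files instead of "$log".
grep -c "exceeds the limit of concurrent xcievers" "$log"

rm -f "$log"
```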
>>>>
>>>> I am considering downgrading(?) to 0.20.5, however because everything
>>>> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
>>>> presumably Cloudera has tested each of these versions against each
>>>> other? And I don't really want to introduce further versioning issues.
>>>>
>>>> Thanks,
>>>>
>>>> Jamie
>>>>
>>>>
>>>> On 7 July 2010 17:30, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>> Jamie,
>>>>>
>>>>> Does your configuration meet the requirements?
>>>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>>>>
>>>>> ulimit and xcievers, if not set, are usually time bombs that blow up
>>>>> when the cluster is under load.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill <jamie.cockrill@gmail.com>wrote:
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> My current HBase/Hadoop architecture has HBase region servers on the
>>>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>>>>> of region server crashes. The last thing that happens appears to be a
>>>>>> DroppedSnapshotException, caused by an IOException: could not
>>>>>> complete write to file <file on HDFS>. I am running it under load;
>>>>>> I'm not sure how heavy that is or how it should be quantified, but
>>>>>> I'm guessing it is a load issue.
>>>>>>
>>>>>> Is it common practice to put region servers on data-nodes? Is it
>>>>>> common to see region server crashes when either the HDFS or the
>>>>>> region server (or both) is under heavy load? I'm guessing that is the
>>>>>> case, as I've seen a few similar posts. I've not got a great deal of
>>>>>> capacity to be separating region servers from HDFS data nodes, but it
>>>>>> might be an argument I could make.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Jamie
>>>>>>
>>>>>
>>>>
>>>
>>
>
