hbase-user mailing list archives

From Seraph Imalia <ser...@eisp.co.za>
Subject Re: Hbase pausing problems
Date Mon, 18 Jan 2010 08:49:02 GMT
Hi Jean-Daniel,

Thank you for your comprehensive response. Answers below...

> From: Jean-Daniel Cryans <jdcryans@apache.org>
> Reply-To: <hbase-user@hadoop.apache.org>
> Date: Fri, 15 Jan 2010 09:45:12 -0800
> To: <hbase-user@hadoop.apache.org>
> Subject: Re: Hbase pausing problems
> 
> General comments:
> 
> - Having fewer than 5-6 nodes is usually very problematic since HDFS
> isn't optimized at all for that number. At the very least you could
> set replication to 2 so that each write only hits 2/3 of the cluster
> instead of all of them.

We have plans to double the number of nodes because our load is increasing
quite steadily.  I will try changing the replication setting (it is
currently set to 1); I think it will improve the problem, but I am not
convinced it will solve it altogether.
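
For reference, this is roughly the change I intend to make, assuming the
value is picked up from the site configuration on each node (2 is just your
suggestion; I have not tested it yet):

    <!-- hdfs-site.xml (and hbase-site.xml) on each node:
         replicate each block to 2 datanodes instead of all 3 -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>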

> 
> - Giving all the RAM you can to the region servers is the way to
> go. In your case it slows you down because your HDFS is probably very
> saturated. My previous comment should help in that regard. You could
> also set bigger memstores and maxfilesize on the tables so that
> there's less churn (see the alter command in the shell).

I will try this.  I did try increasing the RAM, and all that happened is
that the memstore got bigger and took longer to flush, so increasing the
memstore size may be necessary in the long run, but I don't want to do it
yet because it aggravates the current problem and makes us lose traffic.
Increasing the MaxFileSize might help a little, but again I don't think it
will solve the immediate problem.
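
When I do get to it, my understanding is that it would be something along
these lines in the shell (the table name and sizes here are made up for
illustration; I still need to verify the attribute names against our 0.20.2
install):

    hbase> disable 'ad_delivery'
    hbase> alter 'ad_delivery', {METHOD => 'table_att', MAX_FILESIZE => '1073741824'}
    hbase> alter 'ad_delivery', {METHOD => 'table_att', MEMSTORE_FLUSHSIZE => '134217728'}
    hbase> enable 'ad_delivery'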

> 
> - Calling the HTable constructor only once is the way to go.

Great, this is what we are doing :) - I was worried that calling it only
once was actually the cause of the problem.
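
One related thing I still want to rule out on our side: as far as I can
tell from the 0.20 javadoc, HTable itself is not thread-safe, and
coldfusion will call our insert method from many request threads against
the same instance.  If that can stall the client, the minimal guard would
be something like this (a sketch only - the method and field names are from
our own code, the lock is new):

    // Serialize access to the shared, non-thread-safe HTable instance.
    // A coarse lock is the simplest possible fix; a table pool or
    // per-thread instances would scale better.
    private final Object tableLock = new Object();

    public void insertOrUpdateRow(List<Put> putList) throws IOException {
        synchronized (tableLock) {
            _td.put(putList);
        }
    }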

> 
> - That your clients aren't able to recover from the loss of a region
> server sounds weird.

Yes, the clients are definitely not able to recover after the loss of a
regionserver and, worse still, are not able to recover even after the
regionserver comes back up.  During my tests, I even make sure the
regionserver is gracefully stopped and started again.
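
For completeness, "gracefully" means I use the standard scripts on the node
in question, i.e.

    $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
    $HBASE_HOME/bin/hbase-daemon.sh start regionserver

and I wait for the stop to finish before starting again.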

> Your comment "I would expect the other two regionservers
> to pick up the load" is the wrong intuition: a row is only served by a
> single region, and a region is only served by a single region server.
> If that region flushes, the time it takes to take the snapshot will be
> the time your data is unavailable.

At the moment, we do very few reads on the data in hBase - a maximum of
perhaps 1000 records across an entire day - so the availability of the data
in hBase for reading is not much of a concern right now.

We are writing about 6 Million rows a day at the moment.

My comment about the other two regionservers picking up the load was
referring only to writing new rows.  I would expect that if one server is
busy doing a memstore flush, the other two would still be available for
writing new rows - or is this still a flawed assumption?  If it is, I
understand why we cannot write new rows whilst a flush is in progress - we
would then have to work around it somehow.
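
In case it is flawed, the workaround I have in mind is to decouple our
request threads from the actual put, roughly like the sketch below
(untested; the class name and queue bound are made up, and it knowingly
trades durability for availability, since queued rows are lost if the JVM
dies):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    // Request threads enqueue rows and return immediately; one background
    // thread owns the HTable, so a slow region server only ever stalls
    // this thread, never a coldfusion request.
    public class BufferedAdWriter implements Runnable {
        private final HTable table;
        private final BlockingQueue<Put> queue =
            new LinkedBlockingQueue<Put>(100000);  // bound is a guess

        public BufferedAdWriter(HTable table) {
            this.table = table;
        }

        // Called from request threads; returns false (log/drop) when full.
        public boolean offer(Put put) {
            return queue.offer(put);
        }

        public void run() {
            List<Put> batch = new ArrayList<Put>();
            while (true) {
                try {
                    if (batch.isEmpty()) {
                        batch.add(queue.take());    // block for the first row
                        queue.drainTo(batch, 999);  // then batch up to 1000
                    }
                    table.put(batch);               // may stall during a flush
                    batch.clear();
                } catch (IOException e) {
                    // keep the batch and retry after a short pause
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException ie) {
                        return;
                    }
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

The idea would be to start it once, next to the existing AdDeliveryData
setup, with new Thread(writer).start().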

> 
> Finally, which version of hadoop and hbase are your using?

Hadoop: 0.20.1
hBase: 0.20.2

Coldfusion runs on a Jrun server - are there any known issues using a Jrun
server with hBase/hadoop?  (perhaps it interferes with keepalive messages of
some sort?)

Any more suggestions and things I can try or look at would be greatly
appreciated.  I am busy writing a test client that works without coldfusion
to see what the results are - I will let you know if I learn anything in the
process.

Regards,
Seraph

> 
> Thx
> 
> J-D
> 
> On Fri, Jan 15, 2010 at 1:33 AM, Seraph Imalia <seraph@eisp.co.za> wrote:
>> Hi,
>> 
>> We are using coldfusion as our server-side coding language which is built on
>> java.   We have written a java class to simplify the coldfusion coding by
>> providing simple classes to insert data into hBase.
>> 
>> Our hBase cluster is 3 servers...
>> 
>> 1. each server has a hadoop datanode.
>> 2. each server has an hbase regionserver.
>> 3. each server has an instance of zookeeper.
>> 4. server A is the hadoop namenode
>> 5. server B is the master hBase server
>> 6. server C has the secondary namenode and is also ready to be started as a
>> master should the master on server B go down.
>> 7. Each java process has been given 1 Gig of RAM - each server has 8 Gigs of
>> RAM.
>> 8. Each server is connected together using a 10/100 3Com Layer 3 Managed
>> switch and we are planning to put a 10/100/1000 3Com Layer 3 Managed Switch
>> in to improve the speed of a memstore flush (among other things).
>> 
>> The problem...
>> 
>> Approximately every 10 minutes, our entire coldfusion system pauses at the
>> point of inserting into hBase for between 30 and 60 seconds and then
>> continues.
>> 
>> Investigation...
>> 
>> Watching the logs of the regionserver, the pausing of the coldfusion system
>> happens as soon as one of the regionservers starts flushing the memstore and
>> recovers again as soon as it is finished flushing (recovers as soon as it
>> starts compacting).
>> I can recreate the error just by stopping 1 of the regionservers; but then
>> starting the regionserver again does not make coldfusion recover until I
>> restart the coldfusion servers.  It is important to note that if I keep the
>> built in hBase shell running, it is happily able to put and get data to and
>> from hBase whilst coldfusion is busy pausing/failing.
>> 
>> I have tried increasing the regionserver's RAM to 3 Gigs and this just made
>> the problem worse because it took longer for the regionservers to flush the
>> memory store.  One of the links I found on your site mentioned increasing
>> the default value for hbase.regionserver.handler.count to 100 - this did not
>> seem to make any difference.  I have double checked that the memory flush
>> very rarely happens on more than 1 regionserver at a time - in fact in my
>> many hours of staring at tails of logs, it only happened once where two
>> regionservers flushed at the same time.
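>> 
>> For reference, that change was made in hbase-site.xml - the snippet is
>> reconstructed here for clarity:
>> 
>>    <property>
>>      <name>hbase.regionserver.handler.count</name>
>>      <value>100</value>
>>    </property>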
>> 
>> My investigations point strongly towards a coding problem on our side rather
>> than a problem with the server setup or hBase itself.  I say this because
>> whilst I understand why a regionserver would go offline during a memory
>> flush, I would expect the other two regionservers to pick up the load -
>> especially since the built-in hbase shell has no problem accessing hBase
>> whilst a regionserver is busy doing a memstore flush.
>> 
>> So let me give you some insight into our java code...
>> 
>> We have three main classes (the rest should not have much influence on
>> this)...
>> 
>> The first class (AdDeliveryData) provides simple functions to simplify the
>> coldfusion code, the second (TableManagement) is used to communicate with
>> hBase, and the third (HBaseManager) just contains some simple functions to
>> create, drop and fetch tables.
>> 
>> AdDeliveryData's constructor looks like this...
>> 
>>    public AdDeliveryData(String hBaseConfigPath) throws IOException {
>>        _hbManager = new HBaseManager(hBaseConfigPath);
>> 
>>        _adDeliveryTable = new AdDeliveryTable();
>> 
>>        try {
>>            _adDeliveryManagement = _hbManager.getTable(_adDeliveryTable);
>>        } catch (TableNotFoundException e) {
>>            _adDeliveryManagement = _hbManager.createTable(_adDeliveryTable);
>>        }
>>    }
>> 
>> _hbManager, _adDeliveryTable and _adDeliveryManagement are private class
>> variables available to the whole class.
>> 
>> TableManagement's constructor looks like this...
>> 
>>    public TableManagement(HBaseConfiguration conf, TableDef table)
>>            throws IOException {
>>        _table = table;
>> 
>>        if (table.is_indexed()) {
>>            _itd = new IndexedTable(conf, Bytes.toBytes(table.get_name()));
>>        } else {
>>            _td = new HTable(conf, table.get_name());
>>        }
>>    }
>> 
>> _table, _itd and _td are protected variables available to the whole class.
>> 
>> HBaseManager's constructor looks like this...
>> 
>>    public HBaseManager(String configurationPath)
>>            throws MasterNotRunningException {
>>        Path confPath = new Path(configurationPath);
>>        hbConf = new HBaseConfiguration();
>>        hbConf.addResource(confPath);
>>        hbAdmin = new IndexedTableAdmin(hbConf);
>>    }
>> 
>> hbConf and hbAdmin are protected class variables available to the whole
>> class.
>> 
>> The constructor for AdDeliveryData only gets called once when coldfusion is
>> started which in turn runs the constructors for TableManagement and
>> HBaseManager.
>> 
>> The coldfusion variable that gets stored in the Application scope is called
>> Application.objAdDeliveryData; then every time Coldfusion needs to insert
>> data, it calls the Application.objAdDeliveryData.insertAdImpressionData
>> which calls _adDeliveryManagement.insertOrUpdateRow which in turn builds an
>> ArrayList of Puts and runs _td.put(putList);
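>> 
>> For illustration, insertOrUpdateRow boils down to roughly this (the real
>> column family and qualifier names differ):
>> 
>>    Put put = new Put(Bytes.toBytes(rowKey));
>>    put.add(Bytes.toBytes("delivery"), Bytes.toBytes("impression"),
>>            Bytes.toBytes(value));
>>    putList.add(put);
>>    // ...one Put per impression, then a single batched call:
>>    _td.put(putList);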
>> 
>> I think either I am leaving out code that is required to determine which
>> RegionServers are available OR I am keeping too many hBase objects in RAM
>> instead of calling their constructors each time (my purpose obviously was to
>> improve performance).
>> 
>> Currently the live system is inserting over 7 Million records per day
>> (mostly between 8AM and 10PM) which is not a ridiculously high load.
>> 
>> Any input will be incredibly helpful - I have a test system up and running
>> and I am trying to re-create the scenario there, so that I am not working on
>> the live environment, where basically all I can do is trial and error.
>> 
>> Please assist?
>> 
>> Regards,
>> Seraph
>> 
>> 




