Message-ID: <4FC4DC4E.2080102@free.fr>
Date: Tue, 29 May 2012 16:25:18 +0200
From: Cyril Scetbon <cyril.scetbon@free.fr>
To: user@hbase.apache.org
Subject: hosts unreachable

Hi,

I've installed HBase with the following configuration:

12 x (rest hbase + regionserver hbase + datanode hadoop)
2 x (zookeeper + hbase master)
1 x (zookeeper + hbase master + namenode hadoop)

The OS is Ubuntu Lucid (10.04).

The issue is that when I load data through the REST API, some hosts
become unreachable even though I can still ping them. I can no longer
connect to them, and even monitoring tools stop working for a period of
time. For example, I run sar on each host, and you can see that between
7:10 and 7:35 PM the host does not record anything:

06:45:01 PM     all      0.18      0.00      0.37      3.61      0.25     95.58
06:45:01 PM       0      0.24      0.00      0.54      6.62      0.35     92.25
06:45:01 PM       1      0.12      0.00      0.20      0.61      0.15     98.92
06:50:02 PM     all      5.69      0.00      1.79      4.23      1.94     86.36
06:50:02 PM       0      5.68      0.00      3.00      7.91      2.21     81.21
06:50:02 PM       1      5.70      0.00      0.59      0.55      1.66     91.51
06:55:01 PM     all      0.68      0.00      0.14      1.62      0.23     97.33
06:55:01 PM       0      0.87      0.00      0.20      3.19      0.31     95.44
06:55:01 PM       1      0.49      0.00      0.08      0.05      0.15     99.22
06:58:36 PM     all      0.03      0.00      0.02      0.45      0.07     99.43
06:58:36 PM       0      0.01      0.00      0.02      0.40      0.13     99.43
06:58:36 PM       1      0.04      0.00      0.01      0.51      0.00     99.43
07:05:01 PM     all      0.03      0.00      0.00      0.10      0.07     99.80
07:05:01 PM       0      0.02      0.00      0.00      0.10      0.10     99.78
07:05:01 PM       1      0.04      0.00      0.01      0.09      0.03     99.83 <--- last measure before host becomes unreachable
07:40:07 PM     all     14.72      0.00     17.93      0.02     13.31     54.02 <--- first measure after host becomes reachable again
07:40:07 PM       0     29.43      0.00     35.87      0.00     26.57      8.13
07:40:07 PM       1      0.00      0.00      0.00      0.04      0.04     99.91
07:45:01 PM     all      0.55      0.00      0.25      0.04      0.27     98.89
07:45:01 PM       0      0.54      0.00      0.14      0.05      0.21     99.07
07:45:01 PM       1      0.55      0.00      0.36      0.04      0.33     98.72
07:50:01 PM     all      0.11      0.00      0.05      0.18      0.06     99.60
07:50:01 PM       0      0.12      0.00      0.06      0.13      0.09     99.60
07:50:01 PM       1      0.11      0.00      0.04      0.23      0.04     99.59
07:55:01 PM     all      0.00      0.00      0.01      0.05      0.07     99.88
07:55:01 PM       0      0.00      0.00      0.01      0.01      0.13     99.84
07:55:01 PM       1      0.00      0.00      0.00      0.08      0.00     99.91
08:05:01 PM     all      0.01      0.00      0.00      0.00      0.05     99.94
08:05:01 PM       0      0.00      0.00      0.00      0.00      0.08     99.91
08:05:01 PM       1      0.03      0.00      0.00      0.00      0.01     99.96
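
To find such holes on all hosts without reading the output by eye, a small
script could scan the sar timestamps. This is only a rough sketch: the
function name find_gaps, the 5-minute sampling interval (inferred from the
samples above) and the 2-minute tolerance are my assumptions, and it does
not handle output that crosses midnight.

```python
from datetime import datetime, timedelta

def find_gaps(lines, expected=timedelta(minutes=5),
              tolerance=timedelta(minutes=2)):
    """Return (previous, current) timestamp pairs where samples are missing."""
    stamps = []
    for line in lines:
        fields = line.split()
        # Only the aggregate "all" rows are needed to track sampling times.
        if len(fields) >= 3 and fields[2] == "all":
            stamp = datetime.strptime(fields[0] + " " + fields[1],
                                      "%I:%M:%S %p")
            stamps.append(stamp)
    # A gap is any pair of consecutive samples further apart than allowed.
    return [(a, b) for a, b in zip(stamps, stamps[1:])
            if b - a > expected + tolerance]

# A few rows copied from the output above.
sample = """\
06:58:36 PM     all      0.03      0.00      0.02      0.45      0.07     99.43
07:05:01 PM     all      0.03      0.00      0.00      0.10      0.07     99.80
07:40:07 PM     all     14.72      0.00     17.93      0.02     13.31     54.02
07:45:01 PM     all      0.55      0.00      0.25      0.04      0.27     98.89
"""

gaps = find_gaps(sample.splitlines())
for before, after in gaps:
    print("no samples between", before.strftime("%I:%M:%S %p"),
          "and", after.strftime("%I:%M:%S %p"))
```

On these rows it reports the hole between 07:05:01 PM and 07:40:07 PM.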

I suppose it's caused by high load, but I don't have any proof :( Is
there a known bug related to this? I had a similar issue with Cassandra
that forced me to upgrade to a Linux kernel > 3.0.

thanks.

-- 
Cyril SCETBON