hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/Troubleshooting" by JeanDanielCryans
Date Mon, 29 Mar 2010 17:35:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/Troubleshooting" page has been changed by JeanDanielCryans.
The comment on this change is: revamped the GC pauses entry.
http://wiki.apache.org/hadoop/Hbase/Troubleshooting?action=diff&rev1=38&rev2=39

--------------------------------------------------

  == 9. Problem: ZooKeeper SessionExpired events ==
   * Master or RegionServers reinitialize their ZooKeeper wrappers after receiving SessionExpired
events.
   * Master or RegionServer ephemeral nodes disappear while the node is still otherwise functional.
+  * Messages like these in the logs:
+ {{{
+ WARN org.apache.zookeeper.ClientCnxn: Exception 
+ closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
+ java.io.IOException: TIMED OUT
+        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
+ WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000
+ INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT
+ INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]
+ INFO org.apache.zookeeper.ClientCnxn: Server connection successful
+ WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
+ java.io.IOException: Session Expired
+        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
+        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
+        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
+ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
+ }}}
  === Causes ===
-  * Java GC is starving the ZooKeeper heartbeat thread.
+  * The JVM is doing a long-running garbage collection which pauses every thread (aka "stop the world").
+  * Since the region server's local ZooKeeper client cannot send heartbeats, the session times out.
  === Resolution ===
+  * Make sure you give HBase plenty of RAM (in hbase-env.sh); the default of 1GB won't be able to sustain long-running imports.
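+ For example, a 4GB heap could be set in conf/hbase-env.sh as below; the 4000 figure (HBASE_HEAPSIZE is expressed in MB) is only illustrative, size it to your machines:
+ {{{
+ export HBASE_HEAPSIZE=4000
+ }}}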
+  * Make sure you don't swap; the JVM never behaves well under swapping.
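+ On Linux, one way to discourage the kernel from swapping out the JVM heap is lowering vm.swappiness, e.g. in /etc/sysctl.conf (0 is an aggressive but common choice for a dedicated database machine):
+ {{{
+ vm.swappiness = 0
+ }}}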
+  * Make sure you are not CPU starving the region server thread. For example, if you are running a mapreduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the region server enough to create longer garbage collection pauses.
-  * Increase the session timeout. For example, add the following to your hbase-site.xml to increase the timeout from the default of 10 seconds to 60 seconds.
+  * If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds.
  {{{
    <property>
      <name>zookeeper.session.timeout</name>
-     <value>60000</value>
+     <value>120000</value>
    </property>
+   <property>
+     <name>hbase.zookeeper.property.tickTime</name>
+     <value>6000</value>
+   </property>
  }}}
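+ The tickTime matters because ZooKeeper only grants session timeouts between 2x and 20x the tickTime, which is why it is raised together with the timeout. A quick check of the ceiling that a 6000ms tickTime allows:

```shell
# ZooKeeper clamps the negotiated session timeout to [2 * tickTime, 20 * tickTime].
tick_time=6000                      # ms, as in hbase.zookeeper.property.tickTime above
max_timeout=$((20 * tick_time))     # highest session timeout ZooKeeper will grant
echo "${max_timeout}"               # prints 120000, i.e. 120 seconds
```

+ With a smaller tickTime, ZooKeeper would silently clamp a larger requested timeout down to this ceiling.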
-  * For Java SE 6, some users have had success with {{{ -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8 }}}.  See HBase [[PerformanceTuning|Performance Tuning]] for more on JVM GC tuning.
+  * Be aware that setting a higher timeout means that the regions served by a failed region server will take at least that amount of time to be transferred to another region server. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (hence having less garbage to collect per machine).
+  * If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider [[http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|importing into HFiles directly]].
+  * HBase ships with some GC tuning; for more information see [[PerformanceTuning|Performance Tuning]].
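+ To confirm that the pauses really are GC-related, GC logging can be turned on, for example in conf/hbase-env.sh (the flag names are for the Sun Java SE 6 JVM, and the log path is only an example):
+ {{{
+ export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
+     -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc-hbase.log"
+ }}}
+ Long "stop the world" collections then show up as multi-second pauses in that log, lining up with the Sleeper warnings above.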
  
  
  <<Anchor(10)>>
