hbase-user mailing list archives

From: Jean-Daniel Cryans <jdcry...@apache.org>
Subject: Re: substantial performance degradation when using WAL
Date: Wed, 15 Dec 2010 18:10:22 GMT
Hey Friso,

A few thoughts:

 - Using the WAL will degrade your performance a lot, and this is
expected. Remember, each append sends a chunk of data to the memory of
3 datanodes in a pipeline (compared to just writing to the region
server's memory when the WAL is skipped). I'm not sure I quite
understand your surprise there.
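
A minimal sketch of that tradeoff on the client side, assuming the
0.90-era Put API (the table, family and qualifier names below are just
placeholders):

    // Same edit written twice: once through the WAL, once skipping it.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalPutSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "my_table");

        Put durable = new Put(Bytes.toBytes("row-1"));
        durable.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        // Default path: the edit is appended to the WAL (pipelined to the
        // datanodes) before the put is acknowledged.
        table.put(durable);

        Put fast = new Put(Bytes.toBytes("row-2"));
        fast.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        // Skips the WAL: the edit lives only in the memstore until the next
        // flush, so a region server crash loses it.
        fast.setWriteToWAL(false);
        table.put(fast);

        table.close();
      }
    }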

 - Regarding the YouAreDeadException, the fact that it prints out
"have not heard from server in 70697ms", combined with nothing being
printed in the log between 02:38:55 and 02:40:11 just before that,
strongly indicates GC activity in that JVM. I'd like to see the GC log
before ruling that out.
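
If GC logging isn't already enabled, something along these lines in
hbase-env.sh would produce it (standard Sun JVM flags; the log path is
just an example):

    # Add GC logging to the JVM options picked up by the HBase daemons.
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc.log"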

 - Regarding the clients that died, did that happen at the same time
the region server died?

 - Finally, doing massive imports requires finer tuning of your
cluster than one that serves "normal" traffic. Have you considered
using the bulk loading tools instead?
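
Roughly, the idea is to have the reducers write HFiles instead of
issuing puts and then hand the files to the cluster, which bypasses the
WAL and memstore entirely. A sketch of the job wiring, assuming the
0.90 mapreduce package (mapper, table name and output path are
placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadSketch {
      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "index-bulk-load");
        job.setJarByClass(BulkLoadSketch.class);
        // job.setMapperClass(...): your mapper, emitting ImmutableBytesWritable
        // row keys with KeyValue values for the target table.
        HTable table = new HTable(HBaseConfiguration.create(), "my_index_table");
        // Sets the reducer, partitioner and output format so the generated
        // HFiles line up with the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/bulkload-out"));
        job.waitForCompletion(true);
        // Then move the HFiles into the table with:
        //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/bulkload-out my_index_table
      }
    }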

J-D

On Wed, Dec 15, 2010 at 6:44 AM, Friso van Vollenhoven <fvanvollenhoven@xebia.com> wrote:
> Hi,
>
> I am experiencing some performance issues doing puts with the WAL enabled. Without it, everything runs fine.
> 
> My workload does roughly 30 million reads (rows), and after each read it does a number of puts (updating multiple indexes, basically). The total is about 165 million puts. The work is done from an MR job that uses 15 reducers against 8 RS. The reads and writes are spread across 16 tables, but I hit only a small number of regions (out of about 1000 in total for all tables). Without the WAL, the job takes about 30 to 45 minutes. With the WAL, if it runs to completion, it takes close to 4 hours (3h45m). Can the difference be that large? The master UI shows HBase doing between 10K and 50K requests per second, with frequent drops to almost zero for stretches of time, while without the WAL the same job easily reaches a sustained 100K+.
>
> Any hint on where to look is greatly appreciated. Below is a description of what happens.
>
> Also, on some runs, region servers die because they fail to report for more than 1 minute (YouAreDeadException). I could set the timeout longer, but I think it should work for this setup. The GC log does not show any obvious long pause. Is it possible for flushing / log appending to block the ZK client / heartbeat? In the log snippet below you see a pause of about 1m30s between two log lines before the RS starts to shut down.
>
> 2010-12-15 02:38:55,457 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://m1r1.inrdb.ripe.net:9000/hbase/inrdb_ris_update_rrc12/fe902bb3224a1522b0be94d8459f7217/meta/3270107182044894418, entries=10653, sequenceid=367327898, memsize=2.4m, filesize=102.7k
> 2010-12-15 02:40:11,959 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 70697ms for sessionid 0x12ce510319b0004, closing socket connection and attempting reconnect
> 2010-12-15 02:40:12,125 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 66270ms for sessionid 0x12ce510319b0005, closing socket connection and attempting reconnect
> 2010-12-15 02:40:12,174 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x12ce510319b0004 Received Disconnected from ZooKeeper, ignoring
> 2010-12-15 02:40:12,276 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12ce510319b0005 Received Disconnected from ZooKeeper, ignoring
> 2010-12-15 02:40:12,623 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server m1r1.inrdb.ripe.net/2001:610:240:1:0:0:c100:1733:2181
> 2010-12-15 02:40:12,802 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=w2r1.inrdb.ripe.net,60020,1292333234919, load=(requests=10822, regions=575, usedHeap=6806, maxHeap=16000): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing w2r1.inrdb.ripe.net,60020,1292333234919 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing w2r1.inrdb.ripe.net,60020,1292333234919 as dead server
>
> On some other runs, blocking occurs on the client side. This causes the reducers to get killed after ten minutes of not reporting. Still, the GC log does not show any long pauses.
>
> I guess that the RS dying and the clients blocking are just side effects of HBase not being able to cope with the load.
>
> Versions and setup:
> Hadoop CDH3b3
> HBase 0.90 rc1
> 1 master, running: NN, HM, JT, ZK
> 8 workers, running: DN, RS, TT
> RS gets 16GB heap
> HBase max filesize = 1GB, client-side write buffer = 16MB, memstore flush size = 128MB
> CPU usage is not off the chart and no swapping is happening.
>
>
>
> Friso
>
>
