hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Problems with write performance (25kb rows)
Date Tue, 29 Dec 2009 22:07:51 GMT
Hello Dmitriy:

Thanks for the detail.  Answers are interleaved below.

On Tue, Dec 29, 2009 at 2:14 AM, Dmitriy Lyfar <dlyfar@gmail.com> wrote:

> ..
> Hbase has following configuration: http://pastebin.com/m6c7358e6

This looks good.

> We have about 6M records to insert at once, client creates thread per each
> 100K records

So the loading script is a single process with 60 threads?

> and
> then wait until all threads will be finished. Each row is about 25Kb size.
> Each thread creates its own HTable and HBaseConfiguration.

Reuse the configuration across threads; otherwise it looks to ZooKeeper as
though each thread is a new connection (long story, and I can tell you more if
you are interested, but basically a new HBaseConfiguration makes us set up
new connections, whereas a reused one lets us reuse them).  ZooKeeper has
basic protection against DoS attacks: by default it does not let more than
30 clients connect from the same host.  This makes for interesting issues.
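
A minimal sketch of the "create once, share across threads" pattern, using
only the Java standard library.  StandInConfig is a hypothetical placeholder,
not an HBase class; it only counts how many instances get built, on the
assumption (from the explanation above) that each new configuration object
leads to a new connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedConfigSketch {
    // Hypothetical stand-in for a client configuration; NOT an HBase class.
    // Each new instance is counted as one "connection".
    static final class StandInConfig {
        static final AtomicInteger connections = new AtomicInteger();
        StandInConfig() { connections.incrementAndGet(); }
    }

    // Starts `threads` workers and returns how many stand-in connections
    // were opened in total.
    static int run(int threads, boolean shareConfig) {
        StandInConfig.connections.set(0);
        final StandInConfig shared = shareConfig ? new StandInConfig() : null;
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                // Reuse the shared object if there is one, else build a private one.
                StandInConfig conf = (shared != null) ? shared : new StandInConfig();
                // A real worker would open its table with `conf` and write rows here.
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return StandInConfig.connections.get();
    }

    public static void main(String[] args) {
        System.out.println("per-thread configs: " + run(60, false)); // 60
        System.out.println("shared config:      " + run(60, true));  // 1
    }
}
```

With 60 threads, the per-thread variant ends up at 60 stand-in connections,
well past a limit of 30, while the shared variant stays at one.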

> Something going wrong, because sometimes I get exception:
> Exception in thread "Thread-9" java.util.ConcurrentModificationException

Can we see the full stack trace to see which Map is throwing the CME?

> What does it mean?

Usually it means a Map is being concurrently modified while there is an
outstanding iteration; access to the Map is not properly
protected/synchronized.  Is the Map that is complaining down in hbase or is
it a Map of yours?
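
For illustration only, here is a small stand-alone reproduction of that
failure mode.  It is single-threaded to keep it deterministic, and the class
and method names are made up for this sketch; in a multi-threaded loader the
modification would typically come from another thread, and this is not a
claim about which Map is failing in the reported case.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CmeDemo {
    // Adds a key to the map while an iteration over it is in progress.
    // Returns true if the iterator reported a ConcurrentModificationException.
    static boolean failsWhenModifiedDuringIteration(Map<String, Integer> map) {
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        try {
            for (String key : map.keySet()) {
                map.put("extra", 0); // structural change while an iterator is open
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // HashMap iterators are fail-fast; ConcurrentHashMap iterators are weakly consistent.
        System.out.println("HashMap: " + failsWhenModifiedDuringIteration(new HashMap<>()));
        System.out.println("ConcurrentHashMap: " + failsWhenModifiedDuringIteration(new ConcurrentHashMap<>()));
    }
}
```

The HashMap case reports the exception and the ConcurrentHashMap case does
not, which is why the full stack trace matters: it shows which kind of
collection is involved and who owns it.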

> As for timings:
> For 5Kb rows we have about 35-40K records per second.
> For 25Kb rows -- about 1-2K records per second.
> So I have different throughput on different row size, looks illogical.

Is it?  Is the same amount of data being carried in both cases?

Or it could be that while the 25k is being sent, all other access to a
particular node is blocked (that's how hadoop RPC works: one connection per
process per server, with request/response exclusive on the channel).  Take a
thread dump a few times or add some logging to see if you can figure out
whether this is the case.
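
On the question of whether the same amount of data is being carried, a quick
back-of-the-envelope check using the figures quoted above.  It assumes "Kb"
means kilobytes and takes the reported rates at face value; the class and
method names are only for this sketch.

```java
final class ThroughputCheck {
    // Data rate in KB/s for rows of `rowKb` kilobytes written at `rowsPerSecond`.
    static long kbPerSecond(long rowKb, long rowsPerSecond) {
        return rowKb * rowsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println("5 KB rows:  " + kbPerSecond(5, 35_000)
                + " to " + kbPerSecond(5, 40_000) + " KB/s");
        System.out.println("25 KB rows: " + kbPerSecond(25, 1_000)
                + " to " + kbPerSecond(25, 2_000) + " KB/s");
    }
}
```

That works out to 175,000 to 200,000 KB/s for the 5 KB rows against 25,000 to
50,000 KB/s for the 25 KB rows, so on these numbers the larger rows move
roughly 3.5 to 8 times less data per second, not the same amount.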

In general, our client ain't too good at multiplexing because of limitations
such as the one noted above (our client does not yet do nio).  If you want to
test cluster performance, run multiple concurrent clients each to its own
process.  MapReduce is good for doing this.  See the PerformanceEvaluation
code for a sample MR job that floats many clients doing different loading

> Also I see that nodes load is almost idle. Hbase jvm heap size is 5Gb on
> each node and only 300-500Mb is used during test.
> I've used all performance tuning advices, like autoflush off, write buffer =
> 12 MB, WAL is off.

My guess is that you are not putting up sufficient load.  What if you
start up ten other client processes?  Do they all go at the same rate?  Does
the cluster then start to break a sweat?

> Btw how dangerous is to switch WAL off? Thank you.

It is risky.  If a server crashes, edits still in memory will be lost.  This
might be OK though during a bulk upload: you can always rerun that portion of
the loading.  Turning off the WAL makes the upload run faster.


> --
> Regards, Lyfar Dmitriy
