hbase-user mailing list archives

From Chris Tarnas <...@email.com>
Subject Re: EC2 + Thrift inserts
Date Fri, 30 Apr 2010 04:12:05 GMT
They are all at 100, but none of the regionservers are loaded - most  
are less than 20% CPU. Is this all network latency?

-chris

On Apr 29, 2010, at 8:29 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:

> Every insert on an indexed table would require at the very least an RPC to a
> different regionserver.  If the regionservers are busy, your request
> could wait in the queue for a moment.
>
> One param to tune would be the handler thread count.  Set it to 100  
> at least.
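
For reference, the handler count Ryan mentions is the
hbase.regionserver.handler.count property; a minimal hbase-site.xml
snippet with his suggested value (the regionservers need a restart to
pick it up):

    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>100</value>
    </property>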
>
> On Thu, Apr 29, 2010 at 2:16 AM, Chris Tarnas <cft@email.com> wrote:
>> I just finished some testing with JDK 1.6 u17 - so far no  
>> performance improvements with just changing that. Disabling LZO  
>> compression did gain a little bit (up to about 30/sec from 25/sec  
>> per thread). Turning off indexes helped the most - that brought me  
>> up to 115/sec @ 2875 total rows a second. A single perl/thrift  
>> process can load at over 350 rows/sec so it's not scaling as well as  
>> I would have expected, even without the indexes.
>>
>> Are the transactional indexes that costly? What is the bottleneck  
>> there? CPU utilization and network packets went up when I disabled  
>> the indexes, so I don't think those are the bottlenecks for the  
>> indexes. I was even able to add another 15 insert processes (total of  
>> 40) and only lost about 10% of per-process throughput. I probably  
>> could go even higher, none of the nodes are above CPU 60%  
>> utilization and IO wait was at most 3.5%.
>>
>> Each rowkey is unique, so there should not be any blocking on the  
>> row locks. I'll do more indexed tests tomorrow.
>>
>> thanks,
>> -chris
>>
>> On Apr 29, 2010, at 12:18 AM, Todd Lipcon wrote:
>>
>>> Definitely smells like JDK 1.6.0_18. Downgrade that back to 16 or  
>>> 17 and you
>>> should be good to go. _18 is a botched release if I ever saw one.
>>>
>>> -Todd
>>>
>>> On Wed, Apr 28, 2010 at 10:54 PM, Chris Tarnas <cft@email.com>  
>>> wrote:
>>>
>>>> Hi Stack,
>>>>
>>>> Thanks for looking. I checked the ganglia charts, no server was  
>>>> at more
>>>> than ~20% CPU utilization at any time during the load test and  
>>>> swap was
>>>> never used. Network traffic was light - just running a count  
>>>> through hbase
>>>> shell generates a much higher use. On the server hosting meta  
>>>> specifically,
>>>> it was at about 15-20% CPU, and IO wait never went above 3% - it was  
>>>> usually down near 0.
>>>>
>>>> The load also died with a thrift timeout on every single node  
>>>> (each node
>>>> connecting to localhost for its thrift server), it looks like a  
>>>> datanode
>>>> just died and caused every thrift connection to time out - I'll  
>>>> have to up
>>>> that limit to handle a node death.
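
The timeout in question is set on the client's Thrift socket before the
transport is opened; a rough sketch in Perl, assuming bindings generated
from Hbase.thrift into a Hbase namespace (module names may differ in your
build):

    use Thrift::Socket;
    use Thrift::BufferedTransport;
    use Thrift::BinaryProtocol;
    use Hbase::Hbase;

    # each mapper talks to the thrift server on localhost (default port 9090)
    my $socket = Thrift::Socket->new('localhost', 9090);
    $socket->setSendTimeout(60000);    # ms allowed for a write
    $socket->setRecvTimeout(120000);   # ms allowed for a read; raise this to
                                       # ride out a datanode death
    my $transport = Thrift::BufferedTransport->new($socket);
    my $protocol  = Thrift::BinaryProtocol->new($transport);
    my $client    = Hbase::HbaseClient->new($protocol);
    $transport->open();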
>>>>
>>>> Checking the logs, this appears in the log of the regionserver  
>>>> hosting meta -
>>>> it looks like the dead datanode caused this error:
>>>>
>>>> 2010-04-29 01:01:38,948 WARN org.apache.hadoop.hdfs.DFSClient:
>>>> DFSOutputStream ResponseProcessor exception  for block
>>>> blk_508630839844593817_11180java.io.IOException: Bad response 1  
>>>> for block
>>>> blk_508630839844593817_11180 from datanode 10.195.150.255:50010
>>>>       at
>>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
>>>>
>>>> The regionserver log on the dead node, 10.195.150.255, has some  
>>>> more errors
>>>> in it:
>>>>
>>>> http://pastebin.com/EFH9jz0w
>>>>
>>>> I found this in the .out file on the datanode:
>>>>
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
>>>> linux-amd64 )
>>>> # Problematic frame:
>>>> # V  [libjvm.so+0x62263c]
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /usr/local/hadoop-0.20.1/hs_err_pid1364.log
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> #   http://java.sun.com/webapps/bugreport/crash.jsp
>>>> #
>>>>
>>>>
>>>> There is not a single error in the datanode's log though. Also of  
>>>> note -
>>>> this happened well into the test, so the node dying caused the  
>>>> load to abort
>>>> but not the prior poor performance. Looking through the mailing  
>>>> list it
>>>> looks like java 1.6.0_18 has a bad rep so I'll update the AMI  
>>>> (although I'm
>>>> using the same JVM on other servers in the office w/o issue and  
>>>> decent
>>>> single node performance and never dying...).
>>>>
>>>> Thanks for any help!
>>>> -chris
>>>>
>>>>
>>>>
>>>>
>>>> On Apr 28, 2010, at 10:10 PM, Stack wrote:
>>>>
>>>>> What is load on the server hosting meta like?  Higher than others?
>>>>>
>>>>>
>>>>>
>>>>> On Apr 28, 2010, at 8:42 PM, Chris Tarnas <cft@email.com> wrote:
>>>>>
>>>>>> Hi JG,
>>>>>>
>>>>>> Speed is now down to 18 rows/sec/table per process.
>>>>>>
>>>>>> Here is a regionserver log that is serving two of the regions:
>>>>>>
>>>>>> http://pastebin.com/Hx5se0hz
>>>>>>
>>>>>> Here is the GC Log from the same server:
>>>>>>
>>>>>> http://pastebin.com/ChrRvxCx
>>>>>>
>>>>>> Here is the master log:
>>>>>>
>>>>>> http://pastebin.com/L1Kn66qU
>>>>>>
>>>>>> The thrift server logs have nothing in them in the same time  
>>>>>> period.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> -chris
>>>>>>
>>>>>> On Apr 28, 2010, at 7:32 PM, Jonathan Gray wrote:
>>>>>>
>>>>>>> Hey Chris,
>>>>>>>
>>>>>>> That's a really significant slowdown.  I can't think of anything
>>>>>>> obvious that would cause that in your setup.
>>>>>>>
>>>>>>> Any chance of some regionserver and master logs from the time it
>>>>>>> was going slow?  Is there any activity in the logs of the
>>>>>>> regionservers hosting the regions of the table being written to?
>>>>>>>
>>>>>>> JG
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of
>>>>>>>> Chris Tarnas
>>>>>>>> Sent: Wednesday, April 28, 2010 6:27 PM
>>>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>>> Subject: EC2 + Thrift inserts
>>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> First, thanks to all the HBase developers for producing this, it's
>>>>>>>> a great project and I'm glad to be able to use it.
>>>>>>>>
>>>>>>>> I'm looking for some help and hints here with insert performance.
>>>>>>>> I'm doing some benchmarking, testing how I can scale up using
>>>>>>>> HBase, not really looking at raw speed. The testing is happening on
>>>>>>>> EC2, using Andrew's scripts (thanks - those were very helpful) to
>>>>>>>> set them up and with a slightly customized version of the default
>>>>>>>> AMIs (added my application modules). I'm using HBase 0.20.3 and
>>>>>>>> Hadoop 0.20.1. I've looked at the tips in the Wiki and it looks
>>>>>>>> like Andrew's scripts are already set up that way.
>>>>>>>>
>>>>>>>> I'm inserting into HBase from a hadoop streaming job that runs perl
>>>>>>>> and uses the thrift gateway. I'm also using the Transactional
>>>>>>>> tables, so that alone could be the cause, but from what I can tell
>>>>>>>> I don't think so. LZO compression is also enabled for the column
>>>>>>>> families (much of the data is highly compressible). My cluster has
>>>>>>>> 7 nodes: 5 regionservers, 1 master and 1 zookeeper. The
>>>>>>>> regionservers and master are c1.xlarges. Each regionserver node
>>>>>>>> runs the tasktrackers for the hadoop streaming jobs, and each
>>>>>>>> regionserver also runs its own thrift server. Each mapper that does
>>>>>>>> the load talks to the localhost's thrift server.
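
For reference, each insert through the Thrift gateway boils down to a
mutateRow call; a minimal sketch of one such insert in Perl, assuming a
connected Hbase::HbaseClient in $client and placeholder table/column
names:

    # 'mytable' and 'data:seq' are placeholders, not names from the thread
    my $mutation = Hbase::Mutation->new({ column => 'data:seq',
                                          value  => $value });
    $client->mutateRow('mytable', $row_key, [ $mutation ]);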
>>>>>>>>
>>>>>>>> The row keys are a fixed string + an incremental number, then the
>>>>>>>> order of the bytes is reversed, so runA123 becomes 321Anur. I
>>>>>>>> thought of using a murmur hash but was worried about collisions.
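
The byte reversal described here is a one-liner in Perl; a small sketch of
the key scheme (prefix and counter variable are made up for illustration):

    # e.g. 'runA' . 123  ->  "runA123"  ->  "321Anur"
    my $row_key = scalar reverse('runA' . $counter);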
>>>>>>>>
>>>>>>>> As I add more insert jobs, each job's throughput goes down. Way
>>>>>>>> down. I went from about 200 rows/sec/table per job with one job to
>>>>>>>> about 24 rows/sec/table per job with 25 running jobs. The servers
>>>>>>>> are mostly idle. I'm loading into two tables: one has several
>>>>>>>> indexes and I'm loading into three column families, the other has
>>>>>>>> no indexes and one column family. Both tables currently have only
>>>>>>>> two regions each.
>>>>>>>>
>>>>>>>> The regionserver that serves the indexed table's regions is using
>>>>>>>> the most CPU but is 87% idle. The other servers are all at ~90%
>>>>>>>> idle. There is no IO wait. The perl processes are barely ticking
>>>>>>>> over. Java on the most "loaded" server is using about 50-60% of
>>>>>>>> one CPU.
>>>>>>>>
>>>>>>>> Normally when I do a load in a pseudo-distributed hbase (my
>>>>>>>> development platform) perl's speed is the limiting factor and it
>>>>>>>> uses about 85% of a CPU. In this cluster the perl processes are
>>>>>>>> using only 5-10% of a CPU as they are all waiting on thrift
>>>>>>>> (hbase). When I run only 1 process on the cluster, perl uses much
>>>>>>>> more of a CPU, maybe 70%.
>>>>>>>>
>>>>>>>> Any tips or help in getting the speed/scalability up would be
>>>>>>>> great. Please let me know if you need any other info.
>>>>>>>>
>>>>>>>> As I send this - it looks like the main table has split again and
>>>>>>>> is being served by three regionservers. My performance is going up
>>>>>>>> a bit (now 35 rows/sec/table per process), but it still seems like
>>>>>>>> I'm not using the full potential of even the limited EC2 system:
>>>>>>>> no IO wait and lots of idle CPU.
>>>>>>>>
>>>>>>>>
>>>>>>>> many thanks
>>>>>>>> -chris
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>>
