hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Optimizations for random read performance
Date Wed, 17 Feb 2010 05:18:13 GMT
When you look at top on the loaded server, is it the regionserver or
the datanode that is using up the CPU?
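If it isn't obvious from top alone, one quick way to tell (a rough sketch;
it assumes the stock jps tool is on the node, and <RS_PID>/<DN_PID> below
are placeholders for whatever PIDs it prints) is:

    $ jps                        # lists JVM PIDs: HRegionServer, DataNode, etc.
    $ top -p <RS_PID>,<DN_PID>   # watch CPU use for just those two processes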

I looked at your hdfs listing.  Some of the regions have 3 or 4 files,
but most look fine.  A good few are on the verge of compaction, so I'd
imagine there is a lot of compacting going on; this is background work,
though, and while it does consume CPU and I/O, it shouldn't be too bad.
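If you ever want to get ahead of it during a quiet period, a major
compaction can be kicked off by hand from the shell.  Something like the
following, with 'your_table' as a placeholder:

    hbase> major_compact 'your_table'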

I took a look at the regionserver log.  During which time period was the
server struggling?  There is one log run at the start, and nothing there
seems untoward.  Please enable DEBUG going forward; it will shed more
light on what's going on (see http://wiki.apache.org/hadoop/Hbase/FAQ#A5
for how).  Otherwise, the log doesn't show anything running long enough
to have been under serious load.
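For reference, the usual route (per that FAQ) is a one-line edit to
conf/log4j.properties on each regionserver, followed by a restart.  Roughly:

    log4j.logger.org.apache.hadoop.hbase=DEBUG

The exact property may differ a bit by version; the FAQ above is the
authoritative recipe.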

This is a four-node cluster now?  You don't seem to have too many
regions per server, yet you have a pretty high read/write rate going by
the request numbers you posted earlier.  Maybe you need to add more
servers.  Are you going to add in those 16G machines?

When you look at the master UI, is the request rate over time about the
same for all regionservers?  (Refresh the master UI every so often to
take a new sample.)

St.Ack




On Tue, Feb 16, 2010 at 3:59 PM, James Baldassari <james@dataxu.com> wrote:
> Nope.  We don't do any map reduce.  We're only using Hadoop for HBase at
> the moment.
>
> That one node, hdfs02, still has a load of 16 with around 40% I/O and
> 120% CPU.  The other nodes are all around 66% CPU with 0-1% I/O and load
> of 1 to 3.
>
> I don't think all the requests are going to hdfs02 based on the status
> 'detailed' output.  It seems like that node is just having a much harder
> time getting the data or something.  Maybe we have some incorrect HDFS
> setting.  All the configs are identical, though.
>
> -James
>
>
> On Tue, 2010-02-16 at 17:45 -0600, Dan Washusen wrote:
>> You mentioned in a previous email that you have a Task Tracker process
>> running on each of the nodes.  Is there any chance there is a map reduce job
>> running?
>>
>> On 17 February 2010 10:31, James Baldassari <james@dataxu.com> wrote:
>>
>> > On Tue, 2010-02-16 at 16:45 -0600, Stack wrote:
>> > > On Tue, Feb 16, 2010 at 2:25 PM, James Baldassari <james@dataxu.com>
>> > wrote:
>> > > > On Tue, 2010-02-16 at 14:05 -0600, Stack wrote:
>> > > >> On Tue, Feb 16, 2010 at 10:50 AM, James Baldassari <james@dataxu.com>
>> > wrote:
>> > > >
>> > > > Whether the keys themselves are evenly distributed is another matter.
>> > > > Our keys are user IDs, and they should be fairly random.  If we do a
>> > > > status 'detailed' in the hbase shell we see the following distribution
>> > > > for the value of "requests" (not entirely sure what this value means):
>> > > > hdfs01: 7078
>> > > > hdfs02: 5898
>> > > > hdfs03: 5870
>> > > > hdfs04: 3807
>> > > >
>> > > That looks like they are evenly distributed.  Requests is the number of
>> > > hits per second.  See the UI on the master at port 60010.  The numbers
>> > > should match.
>> >
>> > So the total across all 4 region servers would be 22,653/second?  Hmm,
>> > that doesn't seem too bad.  I guess we just need a little more
>> > throughput...
>> >
>> > >
>> > >
>> > > > There are no order of magnitude differences here, and the request count
>> > > > doesn't seem to map to the load on the server.  Right now hdfs02 has a
>> > > > load of 16 while the 3 others have loads between 1 and 2.
>> > >
>> > >
>> > > This is interesting.  I went back over your dumps of cache stats above,
>> > > and the 'loaded' server didn't have any attribute there that
>> > > differentiated it from the others.  For example, the number of storefiles
>> > > seemed about the same.
>> > >
>> > > I wonder what is making for the high load?  Can you figure it out?  Is it
>> > > high CPU use (unlikely)?  Is it then high I/O?  Can you try to figure out
>> > > what's different about the layout under the loaded server and that of
>> > > an unloaded server?  Maybe do a ./bin/hadoop fs -lsr /hbase and see if
>> > > anything jumps out at you.
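>> > > For instance, something along these lines would total bytes per region
>> > > from that listing (a rough sketch only; it assumes the usual -lsr
>> > > columns and a path layout of /hbase/<table>/<region>/<family>/<file>):
>> > >
>> > >   # skip directory entries, then sum file sizes keyed on the region dir
>> > >   ./bin/hadoop fs -lsr /hbase | grep -v '^d' | \
>> > >     awk '{ split($8, p, "/"); sz[p[4]] += $5 } END { for (r in sz) print sz[r], r }' | sort -n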
>> >
>> > It's I/O wait that is killing the highly loaded server.  The CPU usage
>> > reported by top is just about the same across all servers (around 100%
>> > on an 8-core node), but one server at any given time has a much higher
>> > load due to I/O.
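>> > Confirming that is nothing HBase-specific; it's just something like:
>> >
>> >   iostat -x 5    # %util and await on the data disks
>> >   top            # %wa shows the share of CPU time spent in I/O wait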
>> >
>> > >
>> > > If you want to post the above or a loaded server's log to pastebin, we'll
>> > > take a looksee.
>> >
>> > I'm not really sure what to look for, but maybe someone else will notice
>> > something, so here's the output of hadoop fs -lsr /hbase:
>> > http://pastebin.com/m98096de
>> >
>> > And here is today's region server log from hdfs02, which seems to get
>> > hit particularly hard: http://pastebin.com/m1d8a1e5f
>> >
>> > Please note that we restarted it several times today, so some of those
>> > errors are probably just due to restarting the region server.
>> >
>> > >
>> > >
>> > > > Applying HBASE-2180 did not make any measurable difference.  There are
>> > > > no errors in the region server logs.  However, looking at the Hadoop
>> > > > datanode logs, I'm seeing lots of these:
>> > > >
>> > > > 2010-02-16 17:07:54,064 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > > > DatanodeRegistration(10.24.183.165:50010,
>> > > > storageID=DS-1519453437-10.24.183.165-50010-1265907617548, infoPort=50075,
>> > > > ipcPort=50020):DataXceiver
>> > > > java.io.EOFException
>> > > >        at java.io.DataInputStream.readShort(DataInputStream.java:298)
>> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>> > > >        at java.lang.Thread.run(Thread.java:619)
>> > >
>> > > You upped xceivers on your hdfs cluster?  If you look at the other end
>> > > of the above EOFException, can you see why it died?
>> >
>> > Max xceivers = 3072; datanode handler count = 20; region server handler
>> > count = 100
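>> > For reference, those map to the usual properties, more or less as below
>> > ('xcievers' really is spelled that way in the Hadoop config; names may
>> > vary a bit by version):
>> >
>> >   <!-- hdfs-site.xml -->
>> >   <property><name>dfs.datanode.max.xcievers</name><value>3072</value></property>
>> >   <property><name>dfs.datanode.handler.count</name><value>20</value></property>
>> >
>> >   <!-- hbase-site.xml -->
>> >   <property><name>hbase.regionserver.handler.count</name><value>100</value></property>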
>> >
>> > I can't find the other end of the EOFException.  I looked in the Hadoop
>> > and HBase logs on the server that is the name node and HBase master, as
>> > well as on the HBase client.
>> >
>> > Thanks for all the help!
>> >
>> > -James
>> >
>> > >
>> > >
>> > > >
>> > > > However, I do think it's strange that
>> > > > the load is so unbalanced on the region servers.
>> > > >
>> > >
>> > > I agree.
>> > >
>> > >
>> > > > We're also going to try throwing some more hardware at the problem.
>> > > > We'll set up a new cluster with 16-core, 16G nodes to see if they are
>> > > > better able to handle the large number of client requests.  We might
>> > > > also decrease the block size to 32k or lower.
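>> > > > If we go the block-size route, my understanding is that it's just a
>> > > > column-family attribute set from the shell, something like the
>> > > > following (table and family names are placeholders):
>> > > >
>> > > >   hbase> disable 'mytable'
>> > > >   hbase> alter 'mytable', {NAME => 'myfamily', BLOCKSIZE => '32768'}
>> > > >   hbase> enable 'mytable'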
>> > > >
>> > > Ok.
>> > >
>> > > >> Should only be a matter if you intend distributing the above.
>> > > >
>> > > > This is probably a topic for a separate thread, but I've never seen a
>> > > > legal definition for the word "distribution."  How does this apply to
>> > > > the SaaS model?
>> > > >
>> > > Fair enough.
>> > >
>> > > Something is up.  Especially if hbase-2180 made no difference.
>> > >
>> > > St.Ack
>> >
>> >
>
>
