hbase-user mailing list archives

From Yousuf Ahmad <myahm...@gmail.com>
Subject Re: High IPC Latency
Date Tue, 30 Oct 2012 01:07:09 GMT
Hi guys,

I have an update. Actually a question.

  @Override
  public synchronized Object[] batch(final List<Row> actions) throws
InterruptedException, IOException {
    Object[] results = new Object[actions.size()];
    connection.processBatch(actions, tableName, pool, results);
    return results;
  }

This synchronized method in HTable seems to be causing contention among
concurrent multigets.

I suppose my question is: why must we have this method synchronized?
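
To illustrate the contention point: since batch() is synchronized, sharing one
HTable across threads serializes concurrent multi-gets on that instance,
whereas giving each thread its own HTable avoids that. A minimal sketch of the
per-thread pattern (the table name and row keys below are just placeholders):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PerThreadMultiGet implements Runnable {
    private final Configuration conf = HBaseConfiguration.create();

    @Override
    public void run() {
      HTable table = null;
      try {
        // Each worker thread gets its own HTable, so the synchronized
        // batch() call is never contended across threads.
        table = new HTable(conf, "benchmark_table");  // placeholder name
        List<Get> gets = new ArrayList<Get>();
        for (int i = 0; i < 200; i++) {
          gets.add(new Get(Bytes.toBytes("row-" + i)));  // placeholder keys
        }
        Result[] results = table.get(gets);  // multi-get; goes through batch()
        // ... consume results ...
      } catch (IOException e) {
        e.printStackTrace();
      } finally {
        if (table != null) {
          try {
            table.close();
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      }
    }
  }

(HTablePool would be another way to get a separate instance per thread, if
creating an HTable per thread turns out to be too heavy.)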

Thanks,
Yousuf


On Fri, Oct 19, 2012 at 2:21 PM, Yousuf Ahmad <myahmad0@gmail.com> wrote:

> No coprocessors :-)
> On Oct 19, 2012 1:21 PM, "Pamecha, Abhishek" <apamecha@x.com> wrote:
>
>> Also, I hope no coprocessors are in play.
>>
>> Thanks,
>> Abhishek
>>
>>
>> -----Original Message-----
>> From: Yousuf Ahmad [mailto:myahmad0@gmail.com]
>> Sent: Friday, October 19, 2012 10:12 AM
>> To: user@hbase.apache.org
>> Cc: Ivan Brondino; Ricardo Vilaça
>> Subject: Re: High IPC Latency
>>
>> Hi Lars,
>>
>> We are following your suggestion and testing against a single region
>> server. We just ran a test against a remote region server and soon we will
>> test against a local one as well. We will get back to you soon with the
>> results.
>>
>> It will take us a couple of days to port our code to 0.94.2 and test it.
>> Once we have it working, we will run some experiments and update this
>> thread.
>>
>> Unfortunately, the nature of our project is such that we cannot easily
>> translate the benchmark's workload into a program executing the equivalent
>> HBase operations directly. For this reason, I attempted to roughly
>> translate the workload in terms of HBase operations in my first email and I
>> attached a portion of the logs to be a bit more concrete.
>>
>> Your assistance is very much appreciated! Thank you! We'll keep you
>> updated.
>>
>> Best regards,
>> Yousuf
>>
>>
>> On Fri, Oct 19, 2012 at 1:25 AM, lars hofhansl <lhofhansl@yahoo.com>
>> wrote:
>>
>> > Can you reproduce this against a single, local region server?
>> > Any chance that you can try with the just released 0.94.2?
>> >
>> >
>> > I would love to debug this. It would be a tremendous help if you had a
>> > little test program that reproduces this against a single server, so
>> > that I can see what is going on.
>> >
>> > Thanks.
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ----- Original Message -----
>> > From: Yousuf Ahmad <myahmad0@gmail.com>
>> > To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>> > Cc: Ivan Brondino <ibrondino@fi.upm.es>; Ricardo Vilaça <
>> > rmvilaca@di.uminho.pt>
>> > Sent: Thursday, October 18, 2012 12:59 PM
>> > Subject: Re: High IPC Latency
>> >
>> > Hi,
>> >
>> > Thank you for your questions guys.
>> >
>> > We are using HBase 0.92 with HDFS 1.0.1.
>> >
>> > The experiment lasts 15 minutes. The measurements stabilize in the
>> > first two minutes of the run.
>> >
>> > The data is distributed almost evenly across the regionservers so each
>> > client hits most of them over the course of the experiment. However,
>> > for the data we have, any given multi-get or scan should touch only
>> > one or at most two regions.
>> >
>> > The client caches the regionserver locations, so after the first couple
>> > of minutes of the experiment it wouldn't need to re-visit ZooKeeper, I
>> > believe. Please correct me if I am wrong.
>> >
>> > Regards,
>> > Yousuf
>> >
>> >
>> > On Thu, Oct 18, 2012 at 2:42 PM, lars hofhansl <lhofhansl@yahoo.com>
>> > wrote:
>> >
>> > > Also, what version of HBase/HDFS is this using?
>> > >
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: "Pamecha, Abhishek" <apamecha@x.com>
>> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
>> > > Cc: Ivan Brondino <ibrondino@fi.upm.es>; Ricardo Vilaça <
>> > > rmvilaca@di.uminho.pt>
>> > > Sent: Thursday, October 18, 2012 11:38 AM
>> > > Subject: RE: High IPC Latency
>> > >
>> > > Is it sustained for the same client hitting the same region server, or
>> > > does it get better for the same client-RS combination when run for a
>> > > longer duration? Trying to eliminate ZooKeeper from this.
>> > >
>> > > Thanks,
>> > > Abhishek
>> > >
>> > > From: Yousuf Ahmad [mailto:myahmad0@gmail.com]
>> > > Sent: Thursday, October 18, 2012 11:26 AM
>> > > To: user@hbase.apache.org
>> > > Cc: Ivan Brondino; Ricardo Vilaça
>> > > Subject: High IPC Latency
>> > >
>> > > Hello,
>> > >
>> > > We are seeing slow times for read operations in our experiments. We
>> > > are hoping that you guys can help us figure out what's going wrong.
>> > >
>> > > Here are some details:
>> > >
>> > >   *   We are running a read-only benchmark on our HBase cluster.
>> > >
>> > >   *   There are 10 regionservers, each co-located with a datanode. HDFS
>> > > replication is 3x.
>> > >   *   All the data read by the experiment is already in the block cache
>> > > and the hit ratio is 99%.
>> > >
>> > >   *   We have 10 clients, each with around 400 threads making a mix of
>> > > read-only requests involving multi-gets and scans.
>> > >
>> > >   *   We settled on the default client pool type/size (roundrobin/1) and
>> > > a regionserver handler count of 100 after testing various combinations
>> > > to see what setting worked best.
>> > >
>> > >   *   Our scans are short, fetching around 10 rows on average. Scanner
>> > > caching is set to 50. (A rough sketch of the scan pattern follows after
>> > > this list.)
>> > >   *   An average row in a scan has either around 10 columns (small row)
>> > > or around 200 columns (big row).
>> > >
>> > >   *   Our multi-gets fetch around 200 rows on average.
>> > >   *   An average row in a multi-get has around 10 columns.
>> > >   *   Each column holds an integer (encoded into bytes).
>> > >
>> > >   *   None of the machines involved reaches CPU, memory, or IO
>> > > saturation. In fact, resource utilization stays quite low.
>> > >
>> > >   *   Our statistics show that the average time for a scan, measured
>> > > from the first scanner.next() call to the last one (which returns null),
>> > > is around 2-3 seconds.
>> > >   *   Since we use scanner caching, the major portion of this time
>> > > (around 2 seconds) is spent on the first call to next(), while the
>> > > remaining calls take a negligible amount of time.
>> > >   *   Similarly, we see that a multi-get on average takes around 2
>> > > seconds.
>> > >   *   A single get on average takes around 1 second.
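>> > >
>> > > To make the scan pattern above concrete, here is a minimal sketch of
>> > > such a scan using the standard client API (the key names are just
>> > > placeholders, and error handling is omitted):
>> > >
>> > >   import org.apache.hadoop.hbase.client.HTable;
>> > >   import org.apache.hadoop.hbase.client.Result;
>> > >   import org.apache.hadoop.hbase.client.ResultScanner;
>> > >   import org.apache.hadoop.hbase.client.Scan;
>> > >   import org.apache.hadoop.hbase.util.Bytes;
>> > >
>> > >   // Scans a short key range (about 10 rows) with scanner caching of 50.
>> > >   static int scanOnce(HTable table) throws java.io.IOException {
>> > >     Scan scan = new Scan(Bytes.toBytes("startKey"), Bytes.toBytes("stopKey"));
>> > >     scan.setCaching(50);  // scanner caching, as configured in our tests
>> > >     ResultScanner scanner = table.getScanner(scan);
>> > >     int rows = 0;
>> > >     try {
>> > >       // The first next() issues the RPC that fetches up to 50 rows;
>> > >       // the remaining next() calls are served from the client-side cache.
>> > >       for (Result row = scanner.next(); row != null; row = scanner.next()) {
>> > >         rows++;  // each row holds around 10 integer-valued columns
>> > >       }
>> > >     } finally {
>> > >       scanner.close();
>> > >     }
>> > >     return rows;
>> > >   }
>> > >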
>> > > We are not sure what the bottleneck is or where it lies. We thought we
>> > > should look deeper into what is going on at the regionservers. We
>> > > monitored the IPC calls during one of the experiments. Here is a sample
>> > > of one regionserver log:
>> > >
>> > > 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #115483; Served: HRegionInterface#get queueTime=0
>> > > processingTime=1
>> > > contents=1 Get, 75 bytes
>> > > 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #115487; Served: HRegionInterface#get queueTime=0
>> > > processingTime=0
>> > > contents=1 Get, 75 bytes
>> > > 2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #115489; Served: HRegionInterface#get queueTime=0
>> > > processingTime=0
>> > > contents=1 Get, 75 bytes
>> > > 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #111421; Served: HRegionInterface#get queueTime=0
>> > > processingTime=0
>> > > contents=1 Get, 75 bytes
>> > > 2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #115497; Served: HRegionInterface#multi queueTime=0
>> > > processingTime=9
>> > > contents=200 Gets
>> > > 2012-10-18 17:00:09,984 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #115499; Served: HRegionInterface#openScanner queueTime=0
>> > > processingTime=0 contents=1 Scan, 63 bytes
>> > > 2012-10-18 17:00:09,990 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #115503; Served: HRegionInterface#get queueTime=0
>> > > processingTime=0
>> > > contents=1 Get, 75 bytes
>> > > 2012-10-18 17:00:09,992 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #103230; Served: HRegionInterface#next queueTime=0
>> > > processingTime=0
>> > > contents=1 Long, 1 Integer
>> > > 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #103234; Served: HRegionInterface#close queueTime=0
>> > > processingTime=0
>> > > contents=1 Long
>> > > 2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace:
>> > > Call #103232; Served: HRegionInterface#next queueTime=0
>> > > processingTime=0
>> > > contents=1 Long, 1 Integer
>> > >
>> > > I have attached a larger chunk of the logs we collected for this
>> > > experiment in case that helps.
>> > >
>> > > From the logs, we saw that the next() operation at the regionserver
>> > > takes 1 millisecond or less, and a multi-get takes 10 ms on average.
>> > > Yet the corresponding times we see at the client are orders of
>> > > magnitude higher.
>> > > Ping times between the machines are at most 1 ms, and we are not
>> > > saturating the network.
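>> > >
>> > > In other words, timing the blocking multi-get call at the client, along
>> > > these lines (a sketch only; "table" is the HTable and "gets" stands for
>> > > the list of ~200 Gets built for one multi-get), gives around 2 seconds,
>> > > versus the ~10 ms processingTime the regionserver reports for the same
>> > > call:
>> > >
>> > >   long start = System.nanoTime();
>> > >   Result[] results = table.get(gets);  // blocks until the multi-get returns
>> > >   long elapsedMs = (System.nanoTime() - start) / 1000000L;  // ~2000 ms observed
>> > >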
>> > >
>> > > We would really appreciate some insights from you guys on this.
>> > > Where do you suggest we focus our efforts in order to hunt down this
>> > > bottleneck/contention?
>> > >
>> > > Thanks!
>> > > Yousuf
>> > >
>> >
>> >
>>
>
