hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pamecha, Abhishek" <apame...@x.com>
Subject RE: High IPC Latency
Date Thu, 18 Oct 2012 18:38:44 GMT
Is it sustained for the same client hitting the same region server OR does it get better for
the same client-RS combination when run for longer duration?  Trying to eliminate Zookeeper
from this.

Thanks,
Abhishek

From: Yousuf Ahmad [mailto:myahmad0@gmail.com]
Sent: Thursday, October 18, 2012 11:26 AM
To: user@hbase.apache.org
Cc: Ivan Brondino; Ricardo Vila├ža
Subject: High IPC Latency

Hello,

We are seeing slow times for read operations in our experiments. We are hoping that you guys
can help us figure out what's going wrong.

Here are some details:

  *   We are running a read-only benchmark on our HBase cluster.
  *
  *   There are 10 regionservers, each co-located with a datanode. HDFS replication is 3x.
  *   All the data read by the experiment is already in the block cache and the hit ratio
is 99%.
  *
  *   We have 10 clients, each with around 400 threads making a mix of read-only requests
involving multi-gets and scans.
  *
  *   We settled on the default client pool type/size (roundrobin/1) and a regionserver handler
count of 100 after testing various combinations to see what setting worked best.
  *
  *   Our scans are short, fetching around 10 rows on average. Scanner caching is set to 50.
  *   An average row in a scan has either around 10 columns (small row) or around 200 columns
(big row).
  *
  *   Our multi-gets fetch around 200 rows on average.
  *   An average row in a multi-get has around 10 columns.
  *   Each column holds an integer (encoded into bytes).
  *
  *   None of the machines involved reach CPU, memory, or IO saturation. In fact resource
utilization stays quite low.
  *
  *   Our statistics show that the average time for a scan, measured starting from the first
scanner.next() call to the last one which returns a null, is around 2-3 seconds.
  *   Since we use scanner caching, the major portion of this time (around 2 seconds) is spent
on the first call to next(), while the remaining calls take a negligible amount of time.
  *   Similarly, we see that a multi-get on average takes around 2 seconds.
  *   A single get on average takes around 1 second.
We are not sure what the bottleneck is or where it lies. We thought we should look deeper
into what is going on at the regionservers. We monitored the IPC calls during one of the experiments.
Here is a sample of one regionserver log:

2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115483; Served:
HRegionInterface#get queueTime=0 processingTime=1 contents=1 Get, 75 bytes
2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115487; Served:
HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes
2012-10-18 17:00:09,969 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115489; Served:
HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes
2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #111421; Served:
HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes
2012-10-18 17:00:09,982 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115497; Served:
HRegionInterface#multi queueTime=0 processingTime=9 contents=200 Gets
2012-10-18 17:00:09,984 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115499; Served:
HRegionInterface#openScanner queueTime=0 processingTime=0 contents=1 Scan, 63 bytes
2012-10-18 17:00:09,990 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #115503; Served:
HRegionInterface#get queueTime=0 processingTime=0 contents=1 Get, 75 bytes
2012-10-18 17:00:09,992 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103230; Served:
HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer
2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103234; Served:
HRegionInterface#close queueTime=0 processingTime=0 contents=1 Long
2012-10-18 17:00:09,994 DEBUG org.apache.hadoop.ipc.HBaseServer.trace: Call #103232; Served:
HRegionInterface#next queueTime=0 processingTime=0 contents=1 Long, 1 Integer

I have attached a larger chunk of the logs we collected for this experiment in case that helps.

>From the logs, we saw that the next() operation at the regionserver takes 1 millisecond
or less; and a multi-get takes 10 ms on average.
Yet the corresponding times we see at the client are orders of magnitude higher.
Ping times between the machines are at most 1ms and we are not saturating the network.

We would really appreciate some insights from you guys on this.
Where do you suggest we focus our efforts in order to hunt down this bottleneck/contention?

Thanks!
Yousuf


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message