hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lars hofhansl <la...@apache.org>
Subject Re: HBase read performance
Date Thu, 02 Oct 2014 23:52:18 GMT
You might have the data in the OS buffer cache, without short circuit reading the region server
has to request the block from the data node process, which then reads it from the block cache.
That is a few context switches per RPC that do not show up in CPU metrics. In that you also
would not see disk IO.

If - as you say - you see a lot of evicted blocks the data *has* to come from the OS. If you
do not see disk IO is *has* to come from the OS cache. I.e. there's more RAM on your boxes,
and you should increase the heap block cache.

You can measure the context switches with vmstat. Other than that I have no suggestion until
I reproduce the problem.
Also check the data locality index of the region server it should be close to 100%.

-- Lars

 From: Khaled Elmeleegy <kdiaa@hotmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org> 
Sent: Thursday, October 2, 2014 3:24 PM
Subject: RE: HBase read performance

Lars thanks a lot about all the tips. I'll make sure I cover all of them and get back to you.
I am not sure they are the bottleneck though as they all are about optimizing physical resource
usage. As I said, I don't see any contended physical resources now. I'll also try to reproduce
this problem in a simpler environment and pass to you the test program to play with.

Couple of high level points to make. You are right that my use case is kind of a worst case
for HBase reads. But, if things go the way you described them, there should be tons of disk
IO and that should be clearly the bottleneck. This is not the case though. That's for the
simple reason that this is done in a test environment (I am still prototyping), and not a
lot of data is yet written to HBase. However for the real use case, there should writers constantly
writing data to HBase and readers occasionally doing this scatter/gather. At steady state,
things should only get worse and all the issues you mentioned should get far more pronounced.
At this point, one can try to mitigate it using more memory or so. I am not there yet as I
think I am hitting some software bottleneck, which I don't know how to work around.


> Date: Thu, 2 Oct 2014 14:20:47 -0700
> From: larsh@apache.org
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
> OK... We might need to investigate this.
> Any chance that you can provide a minimal test program and instruction about how to set
it up.
> We can do some profiling then.
> One thing to note is that with scanning HBase cannot use bloom filters to rule out HFiles
ahead of time, it needs to look into all of them.
> So you kind of hit on the absolute worst case:
> - random reads that do not fit into the block cache
> - cannot use bloom filters
> Few more question/comments:
> - Do you have short circuit reading enabled? If not, you should.
> - Is your table major compacted? That will reduce the number of files to look at.
> - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled by default in
HBase, but necessarily in your install of HDFS.
> - Which version of HDFS are you using as backing filesystem?
> - If your disk is idle, it means the data fits into the OS buffer cache. In turn that
means that you increase the heap for the region servers. You can also use block encoding (FAST_DIFF)
to try to make sure the entire working set fits into the cache.
> - Also try to reduce the block size - although if your overall working set does not fit
in the heap it won't help much.
> This is a good section of the book to read through generally (even though you might know
most of this already): http://hbase.apache.org/book.html#perf.configurations
> -- Lars
> ----- Original Message -----
> From: Khaled Elmeleegy <kdiaa@hotmail.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Cc:
> Sent: Thursday, October 2, 2014 11:27 AM
> Subject: RE: HBase read performance
> I do see a very brief spike in CPU (user/system), but it's no where near 0% idle. It
goes from 99% idle down to something like 40% idle for a second or so. The thing to note,
this is all on a test cluster, so no real load. Things are generally idle until i issue 2-3
of these multi-scan-requests to render a web page. Then, you see the spike in the cpu and
some activity in the network and disk, but nowhere near saturation.
> If there are specific tests you'd like me to do to debug this, I'd be more than happy
to do it.
> Khaled
> ----------------------------------------
>> Date: Thu, 2 Oct 2014 11:15:59 -0700
>> From: larsh@apache.org
>> Subject: Re: HBase read performance
>> To: user@hbase.apache.org
>> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting a lot for
IO. No high User/System CPU either?
>> If you see a lot of evicted block then each RPC has a high chance of requiring to
bring an entire 64k block in. You'll see bad performance with this.
>> We might need to trace this specific scenario.
>> -- Lars
>> ________________________________
>> From: Khaled Elmeleegy <kdiaa@hotmail.com>
>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>> Sent: Thursday, October 2, 2014 10:46 AM
>> Subject: RE: HBase read performance
>> I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional
0.1s or 0.2s.
>> Other than the discrepancy between what's reported on the client and what's reported
at the RS, there is also the issue of not getting proper concurrency. So, even if a reverse
get takes 100ms or so (this has to be mostly blocking on various things as no physical resource
is contended), then the other gets/scans should be able to proceed in parallel, so a thousand
concurrent gets/scans should finish in few hundreds of ms not many seconds. That's why I thought
I'd increase the handlers count to try to get more concurrency, but it didn't help. So, there
must be something else.
>> Khaled
>> ----------------------------------------
>>> From: ndimiduk@gmail.com
>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>> Do check again on the heap size of the region servers. The default
>>> unconfigured size is 1G; too small for much of anything. Check your RS logs
>>> -- look for lines produced by the JVMPauseMonitor thread. They usually
>>> correlate with long GC pauses or other process-freeze events.
>>> Get is implemented as a Scan of a single row, so a reverse scan of a single
>>> row should be functionally equivalent.
>>> In practice, I have seen discrepancy between the latencies reported by the
>>> RS and the latencies experienced by the client. I've not investigated this
>>> area thoroughly.
>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kdiaa@hotmail.com> wrote:
>>>> Thanks Lars for your quick reply.
>>>> Yes performance is similar with less handlers (I tried with 100 first).
>>>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>>>> memory as there are many cache misses. However, disk is far from being a
>>>> bottleneck. I checked using iostat. I also verified that neither the
>>>> network nor the CPU of the region server or the client are a bottleneck.
>>>> This leads me to believe that likely this is a software bottleneck,
>>>> possibly due to a misconfiguration on my side. I just don't know how to
>>>> debug it. A clear disconnect I see is the individual request latency as
>>>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>>>> vs what's measured on the client. Does this sound right? Any ideas on how
>>>> to better debug it?
>>>> About this trick with the timestamps to be able to do a forward scan,
>>>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>>>> is, sometimes I want to get the key after a particular timestamp and
>>>> sometimes I want to get the key before, so just relying on the key order
>>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>>>> do the trick though.
>>>> Khaled
>>>> ----------------------------------------
>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>> From: larsh@apache.org
>>>>> Subject: Re: HBase read performance
>>>>> To: user@hbase.apache.org
>>>>> Hi Khaled,
>>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>>> lot. Typically a good number of threads depends on the hardware (number of
>>>> cores, number of spindles, etc). I cannot think of any type of scenario
>>>> where more than 100 would give any improvement.
>>>>> How large is the payload per KV retrieved that way? If large (as in a
>>>> few 100k) you definitely want to lower the number of the handler threads.
>>>>> How much heap do you give the region server? Does the working set fit
>>>> into the cache? (i.e. in the metrics, do you see the eviction count going
>>>> up, if so it does not fit into the cache).
>>>>> If the working set does not fit into the cache (eviction count goes up)
>>>> then HBase will need to bring a new block in from disk on each Get
>>>> (assuming the Gets are more or less random as far as the server is
>>>> concerned).
>>>>> In case you'll benefit from reducing the HFile block size (from 64k to
>>>> 8k or even 4k).
>>>>> Lastly I don't think we tested the performance of using reverse scan
>>>> this way, there is probably room to optimize this.
>>>>> Can you restructure your keys to allow forwards scanning? For example
>>>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>>>> of the time portion of the key, so that it sort the other way. Then you
>>>> could do a forward scan.
>>>>> Let us know how it goes.
>>>>> -- Lars
>>>>> ----- Original Message -----
>>>>> From: Khaled Elmeleegy <kdiaa@hotmail.com>
>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>> Cc:
>>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>>> Subject: HBase read performance
>>>>> Hi,
>>>>> I am trying to do a scatter/gather on hbase (, where I have
>>>> client reading ~1000 keys from an HBase table. These keys happen to fall
>>>> the same region server. For my reads I use reverse scan to read each key
>>>> I want the key prior to a specific time stamp (time stamps are stored in
>>>> reverse order). I don't believe gets can accomplish that, right? so I use
>>>> scan, with caching set to 1.
>>>>> I use 2000 reader threads in the client and on HBase, I've set
>>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>>>> gather is very slow and can take up to 10s in total. Timing an individual
>>>> getScanner(..) call on the client side, it can easily take few hundreds of
>>>> ms. I also got the following metrics from the region server in question:
>>>>> "queueCallTime_mean" : 2.190855525775637,
>>>>> "queueCallTime_median" : 0.0,
>>>>> "queueCallTime_75th_percentile" : 0.0,
>>>>> "queueCallTime_95th_percentile" : 1.0,
>>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>> "processCallTime_min" : 0,
>>>>> "processCallTime_max" : 12755,
>>>>> "processCallTime_mean" : 105.64873440912682,
>>>>> "processCallTime_median" : 0.0,
>>>>> "processCallTime_75th_percentile" : 2.0,
>>>>> "processCallTime_95th_percentile" : 7917.95,
>>>>> "processCallTime_99th_percentile" : 8876.89,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>>> : 89,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>>> : 11300,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>>> : 654.4949739797315,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>>> : 101.0,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>>> : 101.0,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>>> : 101.0,
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>>> : 113.0,
>>>>> Where "delta" is the name of the table I am querying.
>>>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>>>> and network) of both the client and the region server and nothing seems
>>>> anywhere near saturation. So I am puzzled by what's going on and where this
>>>> time is going.
>>>>> Few things to note based on the above measurements: both medians of IPC
>>>> processCallTime and queueCallTime are basically zero (ms I presume,
>>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>>> how this adds up. Also, even though the 101 figure seems outrageously high
>>>> and I don't know why, still all these scans should be happening in
>>>> parallel, so the overall call should finish fast, given that no hardware
>>>> resource is contended, right? but this is not what's happening, so I have
>>>> to be missing something(s).
>>>>> So, any help is appreciated there.
>>>>> Thanks,
>>>>> Khaled
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message