hbase-user mailing list archives

From Bryan Keller <brya...@gmail.com>
Subject Re: Poor HBase map-reduce scan performance
Date Thu, 23 May 2013 15:45:44 GMT
I am considering scanning a snapshot instead of the table. I believe this is what the ExportSnapshot
class does. If I could use the scanning code from ExportSnapshot then I would be able to scan
the HDFS files directly and bypass the regionservers. That could potentially give me a huge
boost in performance for full table scans. However, it doesn't really address the poor scan
performance of a regular table scan through the regionservers.
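For illustration, a rough sketch of what such a snapshot-based scan might look like. It assumes an HBase release that ships TableSnapshotScanner (added later via HBASE-10076, not available in the 0.94 builds discussed in this thread); the snapshot name and restore directory below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.TableSnapshotScanner;

public class SnapshotScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Scan scan = new Scan();
    scan.setCaching(500);          // rows fetched per next() batch
    scan.setCacheBlocks(false);    // usual advice for full scans
    // The snapshot is restored (as file references, not a copy) into this
    // directory, and the HFiles are then read directly from HDFS,
    // bypassing the RegionServers entirely.
    Path restoreDir = new Path("/tmp/snapshot-restore");
    TableSnapshotScanner scanner =
        new TableSnapshotScanner(conf, restoreDir, "my-snapshot", scan);
    try {
      for (Result r = scanner.next(); r != null; r = scanner.next()) {
        // process r
      }
    } finally {
      scanner.close();
    }
  }
}

The effect is essentially the regionserver bypass described above: the client reads the snapshot's HFiles straight from HDFS.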

On May 22, 2013, at 3:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Sandy:
> Looking at patch v6 of HBASE-8420, I think it is different from your
> approach below for the case of cache.size() == 0.
> 
> Maybe log a JIRA for further discussion?
> 
> On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <prattrs@adobe.com> wrote:
> 
>> It seems to be in the ballpark of what I was getting at, but I haven't
>> fully digested the code yet, so I can't say for sure.
>> 
>> Here's what I'm getting at.  Looking at
>> o.a.h.h.client.ClientScanner.next() in the 0.94.2 source I have loaded, I
>> see there are three branches with respect to the cache:
>> 
>> public Result next() throws IOException {
>> 
>>   // If the scanner is closed and there's nothing left in the cache,
>>   // next is a no-op.
>>   if (cache.size() == 0 && this.closed) {
>>     return null;
>>   }
>> 
>>   if (cache.size() == 0) {
>>     // Request more results from RS
>>     ...
>>   }
>> 
>>   if (cache.size() > 0) {
>>     return cache.poll();
>>   }
>> 
>>   ...
>>   return null;
>> }
>> 
>> 
>> I think that middle branch wants to change as follows (pseudo-code):
>> 
>> if the cache size is below a certain threshold then
>>  initiate asynchronous action to refill it
>>  if there is no result to return until the cache refill completes then
>>    block
>>  done
>> done
>> 
>> Or something along those lines.  I haven't grokked the patch well enough
>> yet to tell if that's what it does.  What I think is happening in the
>> 0.94.2 code I've got is that it requests nothing until the cache is empty,
>> then blocks until it's non-empty.  We want to eagerly and asynchronously
>> refill the cache so that we ideally never have to block.
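For illustration, a minimal client-side sketch of that eager, asynchronous refill, written as a hypothetical wrapper around ResultScanner. The class name and the "always keep one batch in flight" policy are assumptions for the sketch, not what HBASE-8420 actually implements.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;

/** Hypothetical wrapper: fetch the next batch in the background while the
 *  caller is still consuming the current one, so next() rarely blocks. */
public class PrefetchingScanner {
  private final ResultScanner delegate;
  private final int batchSize;
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private final Queue<Result> current = new ArrayDeque<Result>();
  private Future<Result[]> pending;

  public PrefetchingScanner(ResultScanner delegate, int batchSize) {
    this.delegate = delegate;
    this.batchSize = batchSize;
    this.pending = submitFetch();          // eagerly request the first batch
  }

  private Future<Result[]> submitFetch() {
    return pool.submit(new Callable<Result[]>() {
      public Result[] call() throws IOException {
        return delegate.next(batchSize);   // one round trip to the RegionServer
      }
    });
  }

  public Result next() throws Exception {
    if (current.isEmpty()) {
      Result[] batch = pending.get();      // blocks only if the prefetch isn't done yet
      pending = submitFetch();             // immediately request the batch after that
      for (Result r : batch) {
        current.add(r);
      }
      if (current.isEmpty()) {
        return null;                       // underlying scanner is exhausted
      }
    }
    return current.poll();
  }

  public void close() {
    pool.shutdownNow();
    delegate.close();
  }
}

The HBASE-8420 patch referenced above appears to take a server-side route instead, but the intent is the same: overlap fetching of the next batch with processing of the current one.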
>> 
>> 
>> Sandy
>> 
>> 
>> On 5/22/13 1:39 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
>> 
>>> Sandy:
>>> Do you think the following JIRA would help with what you expect in this
>>> regard ?
>>> 
>>> HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
>>> 
>>> Cheers
>>> 
>>> On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <prattrs@adobe.com> wrote:
>>> 
>>>> I found this thread on search-hadoop.com just now because I've been
>>>> wrestling with the same issue for a while and have as yet been unable to
>>>> solve it.  However, I think I have an idea of the problem.  My theory is
>>>> based on assumptions about what's going on in HBase and HDFS internally,
>>>> so please correct me if I'm wrong.
>>>> 
>>>> Briefly, I think the issue is that sequential reads from HDFS are
>>>> pipelined, whereas sequential reads from HBase are not.  Therefore,
>>>> sequential reads from HDFS tend to keep the IO subsystem saturated,
>>>> while
>>>> sequential reads from HBase allow it to idle for a relatively large
>>>> proportion of time.
>>>> 
>>>> To make this more concrete, suppose that I'm reading N bytes of data
>>>> from
>>>> a file in HDFS.  I issue the calls to open the file and begin to read
>>>> (from an InputStream, for example).  As I'm reading byte 1 of the stream
>>>> at my client, the datanode is reading byte M where 1 < M <= N from disk.
>>>> Thus, three activities tend to happen concurrently for the most part
>>>> (disregarding the beginning and end of the file): 1) processing at the
>>>> client; 2) streaming over the network from datanode to client; and 3)
>>>> reading data from disk at the datanode.  The proportion of time these
>>>> three activities overlap tends towards 100% as N -> infinity.
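As a point of reference, a minimal sketch of the plain HDFS sequential read being described, where the DataNode keeps streaming ahead while the client works through each buffer (the file path is a placeholder argument).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSequentialRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[64 * 1024];
    long total = 0;
    // While the client is processing one buffer, the DataNode is already
    // streaming the next bytes: disk, network, and client work overlap.
    FSDataInputStream in = fs.open(new Path(args[0]));
    try {
      int n;
      while ((n = in.read(buf)) > 0) {
        total += n;   // stand-in for real per-record processing
      }
    } finally {
      in.close();
    }
    System.out.println("read " + total + " bytes");
  }
}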
>>>> 
>>>> Now suppose I read a batch of R records from HBase (where R = whatever
>>>> scanner caching happens to be).  As I understand it, I issue my call to
>>>> ResultScanner.next(), and this causes the RegionServer to block as if
>>>> on a
>>>> page fault while it loads enough HFile blocks from disk to cover the R
>>>> records I (implicitly) requested.  After the blocks are loaded into the
>>>> block cache on the RS, the RS returns R records to me over the network.
>>>> Then I process the R records locally.  When they are exhausted, this
>>>> cycle
>>>> repeats.  The notable upshot is that while the RS is faulting HFile
>>>> blocks
>>>> into the cache, my client is blocked.  Furthermore, while my client is
>>>> processing records, the RS is idle with respect to work on behalf of my
>>>> client.
>>>> 
>>>> That last point is really the killer, if I'm correct in my assumptions.
>>>> It means that Scanner caching and larger block sizes work only to
>>>> amortize
>>>> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the
>>>> IO
>>>> subsystems saturated during sequential reads.  What *should* happen is
>>>> that the RS should treat the Scanner caching value (R above) as a hint
>>>> that it should always have ready records r + 1 to r + R when I'm reading
>>>> record r, at least up to the region boundary.  The RS should be
>>>> preparing
>>>> eagerly for the next call to ResultScanner.next(), which I suspect it's
>>>> currently not doing.
>>>> 
>>>> Another way to state this would be to say that the client should tell
>>>> the
>>>> RS to prepare the next batch of records soon enough that they can start
>>>> to
>>>> arrive at the client just as the client is finishing the current batch.
>>>> As is, I suspect it doesn't request more from the RS until the local
>>>> batch
>>>> is exhausted.
>>>> 
>>>> As I cautioned before, this is based on assumptions about how the
>>>> internals work, so please correct me if I'm wrong.  Also, I'm way behind
>>>> on the mailing list, so I probably won't see any responses unless CC'd
>>>> directly.
>>>> 
>>>> Sandy
>>>> 
>>>> On 5/10/13 8:46 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
>>>> 
>>>>> FYI, I ran tests with compression on and off.
>>>>> 
>>>>> With a plain HDFS sequence file and compression off, I am getting very
>>>>> good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>>>>> compression on with a sequence file, I/O speed is about 3x slower.
>>>>> However the file size is 3x smaller so it takes about the same time to
>>>>> scan.
>>>>> 
>>>>> With HBase, the results are equivalent (just much slower than a sequence
>>>>> file). Scanning a compressed table is about 3x slower I/O than an
>>>>> uncompressed table, but the table is 3x smaller, so the time to scan is
>>>>> about the same. Scanning an HBase table takes about 3x as long as
>>>>> scanning the sequence file export of the table, either compressed or
>>>>> uncompressed. The sequence file export file size ends up being just
>>>>> barely larger than the table, either compressed or uncompressed.
>>>>> 
>>>>> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>>>>> the time to scan is about the same. Adding in HBase slows things down
>>>>> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
>>>>> file vs scanning a compressed table.
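For context, a minimal sketch of how Snappy compression is switched on per column family (package locations as of the 0.94 line; the family object is assumed to come from the table descriptor):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CompressionSketch {
  // Snappy trades roughly 3x less I/O for extra decompression CPU,
  // which is the tradeoff measured above.
  static void enableSnappy(HColumnDescriptor family) {
    family.setCompressionType(Compression.Algorithm.SNAPPY);
  }
}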
>>>>> 
>>>>> 
>>>>> On May 8, 2013, at 10:15 AM, Bryan Keller <bryanck@gmail.com> wrote:
>>>>> 
>>>>>> Thanks for the offer Lars! I haven't made much progress speeding things
>>>>>> up.
>>>>>> 
>>>>>> I finally put together a test program that populates a table that is
>>>>>> similar to my production dataset. I have a readme that should describe
>>>>>> things, hopefully enough to make it useable. There is code to populate a
>>>>>> test table, code to scan the table, and code to scan sequence files from
>>>>>> an export (to compare HBase w/ raw HDFS). I use a gradle build script.
>>>>>> 
>>>>>> You can find the code here:
>>>>>> 
>>>>>> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>>>>>> 
>>>>>> 
>>>>>> On May 4, 2013, at 6:33 PM, lars hofhansl <larsh@apache.org> wrote:
>>>>>> 
>>>>>>> The blockbuffers are not reused, but that by itself should not be a
>>>>>>> problem as they are all the same size (at least I have never
>>>>>>> identified that as one in my profiling sessions).
>>>>>>> 
>>>>>>> My offer still stands to do some profiling myself if there is an easy
>>>>>>> way to generate data of similar shape.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> From: Bryan Keller <bryanck@gmail.com>
>>>>>>> To: user@hbase.apache.org
>>>>>>> Sent: Friday, May 3, 2013 3:44 AM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>> 
>>>>>>> 
>>>>>>> Actually I'm not too confident in my results re block size, they may
>>>>>>> have been related to major compaction. I'm going to rerun before
>>>>>>> drawing any conclusions.
>>>>>>> 
>>>>>>> On May 3, 2013, at 12:17 AM, Bryan Keller <bryanck@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I finally made some progress. I tried a very large HBase block size
>>>>>>>> (16mb), and it significantly improved scan performance. I went from
>>>>>>>> 45-50 min to 24 min. Not great but much better. Before I had it set to
>>>>>>>> 128k. Scanning an equivalent sequence file takes 10 min. My random
>>>>>>>> read performance will probably suffer with such a large block size
>>>>>>>> (theoretically), so I probably can't keep it this big. I care about
>>>>>>>> random read performance too. I've read having a block size this big is
>>>>>>>> not recommended, is that correct?
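For reference, a minimal sketch of where the HFile block size being tuned here lives (per column family). The table and family names are placeholders, and 64 KB is the default that the 16 MB experiment replaces.

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class BlockSizeSketch {
  // Hypothetical table/family names; shows the per-family HFile block size knob.
  static HTableDescriptor describe() {
    HTableDescriptor table = new HTableDescriptor("mytable");
    HColumnDescriptor family = new HColumnDescriptor("d");
    family.setBlocksize(16 * 1024 * 1024); // 16 MB; the default is 64 KB
    table.addFamily(family);
    return table;
  }
}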
>>>>>>>> 
>>>>>>>> I haven't dug too deeply into the code, are the block buffers reused
>>>>>>>> or is each new block read a new allocation? Perhaps a buffer pool
>>>>>>>> could help here if there isn't one already. When doing a scan, HBase
>>>>>>>> could reuse previously allocated block buffers instead of allocating a
>>>>>>>> new one for each block. Then block size shouldn't affect scan
>>>>>>>> performance much.
>>>>>>>> 
>>>>>>>> I'm not using a block encoder. Also, I'm still sifting through the
>>>>>>>> profiler results, I'll see if I can make more sense of it and run some
>>>>>>>> more experiments.
>>>>>>>> 
>>>>>>>> On May 2, 2013, at 5:46 PM, lars hofhansl <larsh@apache.org> wrote:
>>>>>>>> 
>>>>>>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>>>>>>>>> changed that much from 0.94.4)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Have you enabled one of the block encoders (FAST_DIFF, etc.)? If
>>>>>>>>> so, try without. They currently need to reallocate a ByteBuffer for
>>>>>>>>> each single KV.
>>>>>>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
>>>>>>>>> have not enabled encoding, just checking).
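For reference, a minimal sketch of how the data block encoding Lars asks about can be checked and switched off on a column family (the family object is assumed to come from the table descriptor):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class EncodingCheckSketch {
  static void disableEncoding(HColumnDescriptor family) {
    // Prints NONE if no encoder is enabled, or e.g. FAST_DIFF otherwise.
    System.out.println("current encoding: " + family.getDataBlockEncoding());
    family.setDataBlockEncoding(DataBlockEncoding.NONE);
  }
}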
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> And do you have a stack trace for the ByteBuffer.allocate()? That is
>>>>>>>>> a strange one since it never came up in my profiling (unless you
>>>>>>>>> enabled block encoding).
>>>>>>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
>>>>>>>>> have to drill in to find the allocate()).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> During normal scanning (again, without encoding) there should be no
>>>>>>>>> allocation happening except for blocks read from disk (and they
>>>>>>>>> should all be the same size, thus allocation should be cheap).
>>>>>>>>> 
>>>>>>>>> -- Lars
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ________________________________
>>>>>>>>> From: Bryan Keller <bryanck@gmail.com>
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I ran one of my regionservers through VisualVM. It looks like the
>>>>>>>>> top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>>>>>>>>> ByteBuffer.allocate(). It appears at first glance that memory
>>>>>>>>> allocations may be an issue. Decompression was next below that but
>>>>>>>>> less of an issue it seems.
>>>>>>>>> 
>>>>>>>>> Would changing the block size, either HDFS or HBase, help here?
>>>>>>>>> 
>>>>>>>>> Also, if anyone has tips on how else to profile, that would be
>>>>>>>>> appreciated. VisualVM can produce a lot of noise that is hard to sift
>>>>>>>>> through.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <bryanck@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>>>>>>> 
>>>>>>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <larsh@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
>>>>>>>>>>> 0.94.7?
>>>>>>>>>>> I would be very curious to see profiling data.
>>>>>>>>>>> 
>>>>>>>>>>> -- Lars
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: Bryan Keller <bryanck@gmail.com>
>>>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>>>>>>> Cc:
>>>>>>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>>> 
>>>>>>>>>>> I tried running my test with 0.94.4, unfortunately performance was
>>>>>>>>>>> about the same. I'm planning on profiling the regionserver and
>>>>>>>>>>> trying some other things tonight and tomorrow and will report back.
>>>>>>>>>>> 
>>>>>>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <bryanck@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
>>>>>>>>>>>> patch that would save me some time.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
>>>>>>>>>>>> amount of bytes copied around in RAM during scanning, especially
>>>>>>>>>>>> if you have wide rows and/or large key portions. That in turn
>>>>>>>>>>>> makes scans scale better across cores, since RAM is a shared
>>>>>>>>>>>> resource between cores (much like disk).
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> It's not hard to build the latest HBase against Cloudera's
>>>>>>>>>>>> version of Hadoop. I can send along a simple patch to pom.xml to
>>>>>>>>>>>> do that.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- Lars
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> ________________________________
>>>>>>>>>>>> From: Bryan Keller <bryanck@gmail.com>
>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> The table has hashed keys so rows are evenly distributed amongst
>>>>>>>>>>>> the regionservers, and load on each regionserver is pretty much
>>>>>>>>>>>> the same. I also have per-table balancing turned on. I get mostly
>>>>>>>>>>>> data local mappers with only a few rack local (maybe 10 of the 250
>>>>>>>>>>>> mappers).
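For illustration, one common way to build the kind of hashed key described here; the 4-byte MD5 prefix is an arbitrary choice for the sketch, not necessarily what this particular schema uses.

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class HashedKeySketch {
  // Prefix the natural key with a few bytes of its own hash so writes and
  // scans spread evenly across regions.
  static byte[] rowKey(String naturalKey) {
    byte[] natural = Bytes.toBytes(naturalKey);
    byte[] prefix = Bytes.head(MD5Hash.getMD5AsHex(natural).getBytes(), 4);
    return Bytes.add(prefix, natural);
  }
}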
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently the table is a wide table schema, with lists of data
>>>>>>>>>>>> structures stored as columns with column prefixes grouping the
>>>>>>>>>>>> data structures (e.g. 1_name, 1_address, 1_city, 2_name,
>>>>>>>>>>>> 2_address, 2_city). I was thinking of moving those data structures
>>>>>>>>>>>> to protobuf which would cut down on the number of columns. The
>>>>>>>>>>>> downside is I can't filter on one value with that, but it is a
>>>>>>>>>>>> tradeoff I would make for performance. I was also considering
>>>>>>>>>>>> restructuring the table into a tall table.
>>>>>>>>>>>> 
>>>>>>>>>>>> Something interesting is that my old regionserver machines had
>>>>>>>>>>>> five 15k SCSI drives instead of 2 SSDs, and performance was about
>>>>>>>>>>>> the same. Also, my old network was 1gbit, now it is 10gbit. So
>>>>>>>>>>>> neither network nor disk I/O appear to be the bottleneck. The CPU
>>>>>>>>>>>> is rather high for the regionserver so it seems like the best
>>>>>>>>>>>> candidate to investigate. I will try profiling it tomorrow and
>>>>>>>>>>>> will report back. I may revisit compression on vs off since that
>>>>>>>>>>>> is adding load to the CPU.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'll also come up with a sample program that generates data
>>>>>>>>>>>> similar to my table.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <larsh@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Your average row is 35k so scanner caching would not make a huge
>>>>>>>>>>>>> difference, although I would have expected some improvements by
>>>>>>>>>>>>> setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I assume your table is split sufficiently to touch all
>>>>>>>>>>>>> RegionServers... Do you see the same load/IO on all region
>>>>>>>>>>>>> servers?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>>>>>>> I blogged about some of these changes here:
>>>>>>>>>>>>> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In your case - since you have many columns, each of which carry
>>>>>>>>>>>>> the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
>>>>>>>>>>>>> How could it not be?
>>>>>>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
>>>>>>>>>>>>> disabled in both HBase and HDFS.
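For reference, the configuration knobs usually meant by that Nagle advice, sketched in code (they would normally live in hbase-site.xml and core-site.xml rather than being set programmatically); the property names are as of the 0.9x era and worth double-checking against the docs for your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TcpNoDelaySketch {
  static Configuration withNagleDisabled() {
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.ipc.client.tcpnodelay", true);  // HBase IPC client
    conf.setBoolean("hbase.ipc.server.tcpnodelay", true);  // HBase IPC server
    conf.setBoolean("ipc.client.tcpnodelay", true);        // Hadoop IPC client
    conf.setBoolean("ipc.server.tcpnodelay", true);        // Hadoop IPC server
    return conf;
  }
}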
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
>>>>>>>>>>>>> Purtell is listening, I think he did some tests with HBase on
>>>>>>>>>>>>> SSDs.
>>>>>>>>>>>>> With rotating media you typically see an improvement with
>>>>>>>>>>>>> compression. With SSDs the added CPU needed for decompression
>>>>>>>>>>>>> might outweigh the benefits.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> At the risk of starting a larger discussion here, I would posit
>>>>>>>>>>>>> that HBase's LSM based design, which trades random IO with
>>>>>>>>>>>>> sequential IO, might be a bit more questionable on SSDs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If you can, it would be nice to run a profiler against one of
>>>>>>>>>>>>> the RegionServers (or maybe do it with the single RS setup) and
>>>>>>>>>>>>> see where it is bottlenecked.
>>>>>>>>>>>>> (And if you send me a sample program to generate some data - not
>>>>>>>>>>>>> 700g, though :) - I'll try to do a bit of profiling during the
>>>>>>>>>>>>> next days as my day job permits, but I do not have any machines
>>>>>>>>>>>>> with SSDs).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- Lars
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>> From: Bryan Keller <bryanck@gmail.com>
>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>>>>>>>>>>>>> setCacheBlocks(false)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
>>>>>>>>>>>>>> will be bad for MapReduce jobs
>>>>>>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I guess you have used the above setting.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>>>>>>>>>>>>>> to, say, 0.94.7 which was recently released?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers
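Pulling the quoted book snippet together, a minimal sketch of a full-scan MapReduce job with those settings applied; the table name and the no-op mapper are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSketch {
  static class MyMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      // process one row
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "full-table-scan");
    job.setJarByClass(ScanJobSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);         // rows per RPC, instead of the default of 1
    scan.setCacheBlocks(false);   // don't churn the block cache from MR

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);     // map-only scan
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}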
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 

