hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Slow full-table scans
Date Fri, 24 Aug 2012 17:27:32 GMT
Sent offline.



----- Original Message -----
From: Gurjeet Singh <gurjeet@gmail.com>
To: user@hbase.apache.org
Cc: 
Sent: Wednesday, August 22, 2012 7:01 PM
Subject: Re: Slow full-table scans

Lars,

Can you send me the modified ingestion code ? I am trying to track
down the problem as well and will keep you posted.

Thanks for your help!
Gurjeet

On Wed, Aug 22, 2012 at 6:38 PM, Lars H <lhofhansl@yahoo.com> wrote:
> Your puts are much faster because in the old case you're doing a Put per column, rather than per row.
> That's the first thing I changed in your sample code (but since this was about scan performance I did not mention that).
>
> I'm still interested in tracking this down if it is an actual performance problem.
>
> -- Lars
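
The Put-per-column versus Put-per-row difference described above can be sketched without any HBase dependency. The snippet below is only an illustrative model, not the code from the gists: the function names and the `(row, qualifier, value)` tuple layout are assumptions made for this example. In the Java client, the per-row variant corresponds to a single `Put` that carries many columns of one row.

```python
from collections import OrderedDict

def puts_per_column(cells):
    """Model of the old scheme: one mutation per (row, qualifier, value) triple."""
    return [(row, {qualifier: value}) for row, qualifier, value in cells]

def puts_per_row(cells):
    """Model of the changed scheme: one mutation per row, carrying all of its columns."""
    by_row = OrderedDict()
    for row, qualifier, value in cells:
        # Collect every column of the same row into a single mutation.
        by_row.setdefault(row, {})[qualifier] = value
    return list(by_row.items())

if __name__ == "__main__":
    # Hypothetical tiny data set: 3 rows with 4 columns each.
    cells = [(r, c, r * 10 + c) for r in range(3) for c in range(4)]
    print(len(puts_per_column(cells)), "mutations when issuing one per column")
    print(len(puts_per_row(cells)), "mutations when issuing one per row")
```

With 100 rows of 200,000 columns, the first function would produce 20,000,000 mutations and the second only 100, which is the kind of gap that could plausibly show up as much faster puts.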
>
> Gurjeet Singh <gurjeet@gmail.com> wrote:
>
>>Okay, I just ran extensive tests with my minimal test case and you are
>>correct, the old and the new version do the scans in about the same
>>amount of time (although puts are MUCH faster in the packed scheme).
>>
>>I guess my test case is too minimal. I will try to make a better
>>testcase since in my production code, there is still a 500x
>>difference.
>>
>>Gurjeet
>>
>>On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <jmozah@gmail.com> wrote:
>>> Try a quick TestDFSIO to see if things are okay.
>>>
>>> ./zahoor
>>>
>>> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>>>
>>>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>>>> think details of iostat and cpu would clear things up.
>>>>
>>>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <lhofhansl@yahoo.com>
>>>> wrote:
>>>>
>>>> > I get roughly the same (~1.8s) - 100 rows, 200,000 columns, segment size 100
>>>> >
>>>> >
>>>> >
>>>> > ________________________________
>>>> >  From: Gurjeet Singh <gurjeet@gmail.com>
>>>> > To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>>>> > Sent: Tuesday, August 21, 2012 11:31 AM
>>>> >  Subject: Re: Slow full-table scans
>>>> >
>>>> > How does that compare with the newScanTable on your build ?
>>>> >
>>>> > Gurjeet
>>>> >
>>>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <lhofhansl@yahoo.com>
>>>> > wrote:
>>>> > > Hmm... So I tried in HBase (current trunk).
>>>> > > I created 100 rows with 200,000 columns each (using your oldMakeTable).
>>>> > > The creation took a bit, but scanning finished in 1.8s. (HBase in
>>>> > > pseudo-distributed mode, with your oldScanTable.)
>>>> > >
>>>> > > -- Lars
>>>> > >
>>>> > >
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: lars hofhansl <lhofhansl@yahoo.com>
>>>> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>> > > Cc:
>>>> > > Sent: Monday, August 20, 2012 7:50 PM
>>>> > > Subject: Re: Slow full-table scans
>>>> > >
>>>> > > Thanks Gurjeet,
>>>> > >
>>>> > > I'll (hopefully) have a look tomorrow.
>>>> > >
>>>> > > -- Lars
>>>> > >
>>>> > >
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: Gurjeet Singh <gurjeet@gmail.com>
>>>> > > To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>>>> > > Cc:
>>>> > > Sent: Monday, August 20, 2012 7:42 PM
>>>> > > Subject: Re: Slow full-table scans
>>>> > >
>>>> > > Hi Lars,
>>>> > >
>>>> > > Here is a testcase:
>>>> > >
>>>> > > https://gist.github.com/3410948
>>>> > >
>>>> > > Benchmarking code:
>>>> > >
>>>> > > https://gist.github.com/3410952
>>>> > >
>>>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>>>> > >
>>>> > > Gurjeet
>>>> > >
>>>> > >
>>>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gurjeet@gmail.com> wrote:
>>>> > >> Sure - I can create a minimal testcase and send it along.
>>>> > >>
>>>> > >> Gurjeet
>>>> > >>
>>>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>>>> > >>> That's interesting.
>>>> > >>> Could you share your old and new schema? I would like to track down
>>>> > >>> the performance problems you saw.
>>>> > >>> (If you had a demo program that populates your rows with 200,000
>>>> > >>> columns in a way where you saw the performance issues, that'd be even
>>>> > >>> better, but not necessary.)
>>>> > >>>
>>>> > >>>
>>>> > >>> -- Lars
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> ________________________________
>>>> > >>>  From: Gurjeet Singh <gurjeet@gmail.com>
>>>> > >>> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>>>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>>>> > >>> Subject: Re: Slow full-table scans
>>>> > >>>
>>>> > >>> Sorry for the delay guys.
>>>> > >>>
>>>> > >>> Here are a few results:
>>>> > >>>
>>>> > >>> 1. Regions in the table = 11
>>>> > >>> 2. The region servers don't appear to be very busy with the query
>>>> > >>> (~5% CPU), but with parallelization they are all busy.
>>>> > >>>
>>>> > >>> Finally, I changed the format of my data so that each cell in HBase
>>>> > >>> contains a chunk of a row instead of the single value it had.
>>>> > >>> Stuffing each HBase cell with 500 columns of a row gave me a
>>>> > >>> performance boost of 1000x. It seems that the underlying issue was IO
>>>> > >>> overhead per byte of actual data stored.
>>>> > >>>
>>>> > >>>
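
The "overhead per byte of actual data" point above can be made concrete with a rough byte count. This is only a sketch under stated assumptions, not the code from the gists: the values are assumed to be 8-byte doubles, and the row key, family and qualifier lengths (8, 1 and 4 bytes) are hypothetical defaults. The fixed per-cell cost follows the HBase KeyValue layout (key length, value length, row length, family length, timestamp and type fields). It counts bytes only and ignores per-cell processing cost in the scanner.

```python
import struct

# Fixed bytes in an HBase KeyValue besides row, family, qualifier and value:
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + type (1).
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

def cell_size(row_len, family_len, qualifier_len, value_len):
    """Serialized size in bytes of one KeyValue with the given component lengths."""
    return KV_FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len

def pack_segment(values):
    """Pack a list of floats into one cell value (big-endian 8-byte doubles)."""
    return struct.pack(">%dd" % len(values), *values)

def unpack_segment(blob):
    """Inverse of pack_segment: recover the list of floats from a cell value."""
    return list(struct.unpack(">%dd" % (len(blob) // 8), blob))

def bytes_per_row(num_cols, segment_size, row_len=8, family_len=1, qualifier_len=4):
    """Total serialized bytes for one logical row stored as packed segments.

    The key lengths are hypothetical defaults chosen for illustration.
    """
    num_cells = -(-num_cols // segment_size)  # ceiling division
    payload = num_cols * 8                    # 8 bytes per double
    overhead = num_cells * cell_size(row_len, family_len, qualifier_len, 0)
    return overhead + payload

if __name__ == "__main__":
    print(bytes_per_row(200000, 1))    # one value per cell
    print(bytes_per_row(200000, 500))  # 500 values per cell
```

Under these assumptions a 200,000-column row takes 8,200,000 bytes with one value per cell and 1,613,200 bytes with 500 values per cell, roughly a 5x difference. Byte volume alone would therefore not account for a 1000x speedup, so the per-cell cost paid by the scanner may matter more than the raw IO.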
>>>> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>>>> > >>>> Yeah... It looks OK.
>>>> > >>>> Maybe 2G of heap is a bit low when dealing with 200,000-column rows.
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> If you can, I'd like to know how busy your regionservers are during
>>>> > >>>> these operations. That would be an indication of whether the
>>>> > >>>> parallelization is good or not.
>>>> > >>>>
>>>> > >>>> -- Lars
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> ----- Original Message -----
>>>> > >>>> From: Stack <stack@duboce.net>
>>>> > >>>> To: user@hbase.apache.org
>>>> > >>>> Cc:
>>>> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>>> > >>>> Subject: Re: Slow full-table scans
>>>> > >>>>
>>>> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gurjeet@gmail.com> wrote:
>>>> > >>>>> I am beginning to think that this is a configuration issue on my
>>>> > >>>>> cluster. Do the following configuration files seem sane?
>>>> > >>>>>
>>>> > >>>>> hbase-env.sh    https://gist.github.com/3345338
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>>> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>> > >>>>
>>>> > >>>>
>>>> > >>>>> hbase-site.xml    https://gist.github.com/3345356
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> This is all defaults effectively. I don't see any of the configs
>>>> > >>>> recommended by the performance section of the reference guide and/or
>>>> > >>>> those suggested by the GBIF blog.
>>>> > >>>>
>>>> > >>>> You don't answer LarsH's query about where you see the 4% difference.
>>>> > >>>>
>>>> > >>>> How many regions in your table?  What's the HBase Master UI look like
>>>> > >>>> when this scan is running?
>>>> > >>>> St.Ack
>>>> > >>>>
>>>> >
>>>>

