hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Scan vs Parallel scan.
Date Thu, 11 Sep 2014 03:57:24 GMT
Hi Guillermo,

Thanks for the additional information. How large is the difference between
the shell count command and the single-threaded scan you use, e.g. on the
order of 1% or 200%? Can you tell us which filter you are using for the
scan? Have you fully verified that you are in fact not using the block
cache at all, and that all your reads bypass the cache and go directly to HDFS?
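
One way to frame the "1% or 200%" question is as a relative difference between two timings. The sketch below is only an illustration: the thread gives no number for the shell count, so it uses the parallel and normal scan timings from the log lines quoted further down, and the class and method names are hypothetical.

```java
class ScanTimings {
    // Percentage by which measuredMs exceeds baselineMs; negative when it is faster.
    static double percentSlower(long measuredMs, long baselineMs) {
        return 100.0 * (measuredMs - baselineMs) / baselineMs;
    }

    public static void main(String[] args) {
        // Timings in milliseconds copied from the quoted TimerUtil log lines.
        System.out.printf("caching 100:  parallel vs normal %+.0f%%%n",
                percentSlower(16598, 2646));
        System.out.printf("caching 1000: parallel vs normal %+.0f%%%n",
                percentSlower(16497, 3903));
    }
}
```

With those inputs the parallel scan comes out several hundred percent slower than the normal scan, far from a 1% difference.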

thanks,
esteban.


--
Cloudera, Inc.


On Wed, Sep 10, 2014 at 1:41 PM, Guillermo Ortiz <konstt2000@gmail.com>
wrote:

> What I mean is that I don't understand why a count takes more time than
> a complete scan without cache. I thought it should take more time to scan
> the table than to execute a count.
> Another point is why a distributed scan is slower than a sequential scan.
> Tomorrow I'll check how many disks we have.
>
> On Wednesday, September 10, 2014, Esteban Gutierrez <
> esteban@cloudera.com> wrote:
>
> > Hello Guillermo,
> >
> > Sounds like there is some potential contention going on. How many disks
> > per node do you have?
> >
> > Can you explain further what you mean by "and I don't know why it's so
> > fast,, it's really much faster than execute an "count" from hbase shell,"?
> > The count command from the shell uses the FirstKeyOnlyFilter and a caching
> > of 10, which should be close to the behavior of your testing tool if it's
> > using the same filter and the same cache settings.
> >
> > cheers,
> > esteban.
> >
> >
> >
> >
> > --
> > Cloudera, Inc.
> >
> >
> > On Wed, Sep 10, 2014 at 1:40 AM, Guillermo Ortiz <konstt2000@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I developed a distributed scan; I create a thread for each region.
> > > After that, I tried to compare timings for Scan vs DistributedScan.
> > > I have disabled the block cache on my table. My cluster has 3 region
> > > servers with 2 regions each; in total there are 100,000 rows, and I
> > > execute a complete scan.
> > >
> > > My partitions are
> > > -016666 -> request 16665
> > > 016666-033332 -> request 16666
> > > 033332-049998 -> request 16666
> > > 049998-066664 -> request 16666
> > > 066664-083330 -> request 16666
> > > 083330- -> request 16671
> > >
> > >
> > > 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> > >
> > > 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> > >
> > > 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> > >
> > > 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> > >
> > > 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> > >
> > > 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> > >
> > > Parallel scan works much worse than simple scan, and I don't know why
> > > the simple scan is so fast; it's really much faster than executing a
> > > "count" from the hbase shell, which doesn't look normal. The only time
> > > the parallel scan works better is when I execute the normal scan with
> > > caching 1.
> > >
> > > Any clue about it?
> > >
> >
>
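
The thread-per-region approach described in the original message can be sketched without an HBase cluster by scanning an in-memory sorted map. This is a minimal simulation under stated assumptions, not the poster's code: the row keys are assumed to be zero-padded numbers from 000001 to 100000, the first split point is assumed to be 016666, and the class and method names are hypothetical. It only illustrates how the key space divides across six range scans; it says nothing about real scan latency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelScanSketch {
    // Split points between the six regions, taken from the partition list
    // in the thread (first one assumed to be 016666).
    static final String[] SPLITS = {"016666", "033332", "049998", "066664", "083330"};

    // Counts rows in [start, end); a null bound means the range is open-ended.
    static int countRange(NavigableMap<String, byte[]> table, String start, String end) {
        NavigableMap<String, byte[]> view = table;
        if (start != null) view = view.tailMap(start, true);
        if (end != null) view = view.headMap(end, false);
        int rows = 0;
        for (String ignored : view.keySet()) rows++;
        return rows;
    }

    // Runs one range scan per region on its own thread and collects the counts in order.
    static List<Integer> parallelCount(NavigableMap<String, byte[]> table) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(SPLITS.length + 1);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i <= SPLITS.length; i++) {
                final String start = (i == 0) ? null : SPLITS[i - 1];
                final String end = (i == SPLITS.length) ? null : SPLITS[i];
                Callable<Integer> task = () -> countRange(table, start, end);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) counts.add(f.get());
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    // Assumed key scheme: 100,000 rows keyed 000001..100000 (not stated in the thread).
    static NavigableMap<String, byte[]> buildTable() {
        NavigableMap<String, byte[]> table = new TreeMap<>();
        for (int i = 1; i <= 100000; i++) table.put(String.format("%06d", i), new byte[0]);
        return table;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> counts = parallelCount(buildTable());
        int total = 0;
        for (int c : counts) total += c;
        System.out.println(counts + " total=" + total);
    }
}
```

Under the assumed key scheme, the per-range counts match the "request" figures listed in the partitions (16665, four times 16666, and 16671), summing to 100000.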

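The caching values in the quoted logs matter because scanner caching sets how many rows the client asks for per fetch. As a rough sketch, assuming every fetch returns exactly `caching` rows and ignoring region boundaries, result size limits, and scanner open/close calls, the number of round trips for the 100,000-row table can be estimated as follows. The class and method names are hypothetical.

```java
class ScannerRoundTrips {
    // Estimated number of fetches: ceil(rows / caching), using integer arithmetic.
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) throw new IllegalArgumentException("caching must be positive");
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 100_000; // row count reported in the thread
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.println("caching " + caching + " -> ~"
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

By this estimate, caching 1 needs about 100,000 round trips against about 1,000 for caching 100, which is consistent with the normal scan being far slower at caching 1 than at caching 100 in the quoted timings.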