hbase-dev mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Record limit in scan api?
Date Sat, 21 Nov 2009 00:36:11 GMT
The problem with this setting is that there is no good 'one size fits all'
value.  If we set it to 1, we do an RPC for every row, which is clearly not
efficient for small rows.  If we set it to something as seemingly
innocuous as 5 or 10, then MapReduce jobs which do a significant amount
of processing on each row can cause the scanner to time out.  The client
code will also give up if it's been more than 60 seconds since the
scanner was last used; it's possible this code might need to be
adjusted so we can resume scanning.

I personally set it to anywhere between 1000 and 5000 for high-performance
jobs on small rows.

The only factor is "can you process the cached chunk of rows in under
60s?".  Set the value as large as possible without violating this and
you'll achieve max performance.
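
For example, a rough, untested sketch against the 0.20 client API (either
HTable.setScannerCaching(int) or Scan.setCaching(int) works; "mytable" and
"cf" are made-up names, and the usual org.apache.hadoop.hbase.client /
org.apache.hadoop.hbase.util.Bytes imports are assumed):

  HTable table = new HTable(new HBaseConfiguration(), "mytable");
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("cf"));
  // rows fetched per next() RPC -- pick the largest value whose batch you
  // can still process within the 60 second scanner lease
  scan.setCaching(2000);
  ResultScanner scanner = table.getScanner(scan);
  try {
    for (Result r : scanner) {
      // process the row; falling behind here is what times the scanner out
    }
  } finally {
    scanner.close();
  }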

-ryan

On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <latham@davelink.net> wrote:
> Thanks for your thoughts.  It's true you can configure the scan buffer rows
> on an HTable or Scan instance, but I think there's something to be said for
> working as well as we can out of the box.
>
> It would be a bit more complicated, but not by much.  To track the idea and
> see what it would look like, I made an issue and attached a proposed patch.
>
> Dave
>
> On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> And on the Scan, as I wrote in my answer, which is really convenient.
>>
>> Not convinced on using bytes as a value for caching... It would also be
>> more complicated.
>>
>> J-D
>>
>> On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> > You can set it on a per-HTable basis.  HTable.setScannerCaching(int);
>> >
>> >
>> >
>> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <latham@davelink.net> wrote:
>> >> I have some tables with large rows and some tables with very small
>> >> rows, so I keep my default scanner caching at 1 row, but have to
>> >> remember to set it higher when scanning tables with smaller rows.  It
>> >> would be nice to have a default that did something reasonable across
>> >> tables.
>> >>
>> >> Would it make sense to set scanner caching as a count of bytes rather
>> >> than a count of rows?  That would make it similar to the write buffer
>> >> for batches of puts that get flushed based on size rather than a fixed
>> >> number of Puts.  Then there could be some default value which should
>> >> provide decent performance out of the box.
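>> >>
>> >> For reference, the existing byte-based write buffer for puts looks
>> >> roughly like this (0.20 API, untested sketch, table name made up):
>> >>
>> >>   HTable table = new HTable(new HBaseConfiguration(), "mytable");
>> >>   table.setAutoFlush(false);
>> >>   // puts are buffered client side and flushed once ~2MB accumulate
>> >>   table.setWriteBufferSize(2 * 1024 * 1024);
>> >>   for (Put put : puts) {   // puts: some collection of Put objects
>> >>     table.put(put);
>> >>   }
>> >>   table.flushCommits();    // flush whatever is left in the buffer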
>> >>
>> >> Dave
>> >>
>> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <ghelmling@gmail.com> wrote:
>> >>
>> >>> To set this per scan you should be able to do:
>> >>>
>> >>> Scan s = new Scan();
>> >>> s.setCaching(...);
>> >>>
>> >>> (I think this works anyway)
>> >>>
>> >>>
>> >>> The other thing that I've found useful is using a PageFilter on scans:
>> >>>
>> >>>
>> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
>> >>>
>> >>> I believe this is applied independently on each region server (?) so
>> >>> you still need to do your own counting in iterating the results, but
>> >>> it can be used to early out on the server side separately from the
>> >>> scanner caching value.
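>> >>>
>> >>> Roughly something like this (untested; PageFilter is the class from
>> >>> the link above, and 500 is just an arbitrary page size):
>> >>>
>> >>>   Scan scan = new Scan();
>> >>>   scan.setCaching(100);
>> >>>   // each region server stops returning rows after ~500 for this scan
>> >>>   scan.setFilter(new PageFilter(500));
>> >>>   ResultScanner scanner = table.getScanner(scan);
>> >>>   int count = 0;
>> >>>   try {
>> >>>     for (Result r : scanner) {
>> >>>       // process the row here; the filter is per region server, so
>> >>>       // the client still has to count and stop itself
>> >>>       if (++count >= 500) break;
>> >>>     }
>> >>>   } finally {
>> >>>     scanner.close();
>> >>>   }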
>> >>>
>> >>> --gh
>> >>>
>> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <stack@duboce.net> wrote:
>> >>>
>> >>> > There is this in the configuration:
>> >>> >
>> >>> >  <property>
>> >>> >    <name>hbase.client.scanner.caching</name>
>> >>> >    <value>1</value>
>> >>> >    <description>Number of rows that will be fetched when calling
>> >>> >    next on a scanner if it is not served from memory. Higher
>> >>> >    caching values will enable faster scanners but will eat up more
>> >>> >    memory and some calls of next may take longer and longer times
>> >>> >    when the cache is empty.
>> >>> >    </description>
>> >>> >  </property>
>> >>> >
>> >>> >
>> >>> > Being able to do it per Scan sounds like something we should add.
>> >>> >
>> >>> > St.Ack
>> >>> >
>> >>> >
>> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein
>> >>> > <silberst@yahoo-inc.com> wrote:
>> >>> >
>> >>> > >   Hi,
>> >>> > > Is there a way to specify a limit on number of returned records
>> >>> > > for scan?  I don't see any way to do this when building the scan.
>> >>> > > If there is, that would be great.  If not, what about when
>> >>> > > iterating over the result?  If I exit the loop when I reach my
>> >>> > > limit, will that approximate this clause?  I guess my real
>> >>> > > question is about how scan is implemented in the client.  I.e.
>> >>> > > how many records are returned from HBase at a time as I iterate
>> >>> > > through the scan result?  If I want 1,000 records and 100 get
>> >>> > > returned at a time, then I'm in good shape.  On the other hand,
>> >>> > > if I want 10 records and get 100 at a time, it's a bit wasteful,
>> >>> > > though the waste is bounded.
>> >>> > >
>> >>> > > Thanks,
>> >>> > > Adam
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >
>>
>
