hbase-dev mailing list archives

From Dave Latham <lat...@davelink.net>
Subject Re: Record limit in scan api?
Date Sat, 21 Nov 2009 00:46:11 GMT
Aha!  So the concern is more about scanner timeouts than OOM.  I guess
this is where I misunderstood.  I'm still not sure that such a small default
is the best, but it makes more sense.  Perhaps the client could occasionally
send the regionserver some sort of heartbeat while a scanner is still in use
client side.  I'll have to give it some more thought.

Dave

On Fri, Nov 20, 2009 at 4:40 PM, Dave Latham <latham@davelink.net> wrote:

> Right, that's the problem with the current setting.  If we change the
> setting so that the buffer is measured in bytes, then I think there is a
> decent 'one size fits all' setting, like 1MB.  You'd still want to adjust it
> in some cases, but I think it would be a lot better by default.
>
> Dave
>
>
> On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>
>> The problem with this setting is that there is no good 'one size fits
>> all' value.  If we set it to 1, we do an RPC for every row, which is
>> clearly not efficient for small rows.  If we set it to something as
>> seemingly innocuous as 5 or 10, then map reduce jobs that do a
>> significant amount of processing on a row can cause the scanner to time
>> out.  The client code will also give up if it's been more than 60
>> seconds since the scanner was last used; it's possible this code might
>> need to be adjusted so we can resume scanning.
>>
>> I personally set it to anywhere between 1000 and 5000 for
>> high-performance jobs on small rows.
>>
>> The only factor is "can you process the cached chunk of rows in
>> < 60s?".  Set the value as large as possible without violating this
>> and you'll achieve max performance.
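>>
>> For instance, a high-throughput scan over small rows could look like
>> this (a minimal sketch, assuming the usual client imports and an open
>> HTable named table; the 2000 value is only illustrative):
>>
>>   Scan scan = new Scan();
>>   scan.setCaching(2000);  // each next() RPC fetches up to 2000 rows
>>   ResultScanner scanner = table.getScanner(scan);
>>   for (Result r : scanner) {
>>     // keep the work on each cached chunk of 2000 rows under ~60s,
>>     // or the region server will expire the scanner
>>   }
>>   scanner.close();  // release the server-side scanner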
>>
>> -ryan
>>
>> On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <latham@davelink.net> wrote:
>> > Thanks for your thoughts.  It's true you can configure the scan buffer
>> > rows on an HTable or Scan instance, but I think there's something to be
>> > said for working as well as we can out of the box.
>> >
>> > It would add more complication, but not by much.  To track the idea and
>> > see what it would look like, I made an issue and attached a proposed
>> > patch.
>> >
>> > Dave
>> >
>> > On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> >
>> >> And on the Scan, as I wrote in my answer, which is really really
>> >> convenient.
>> >>
>> >> Not convinced on using bytes as a value for caching... It would
>> >> also be more complicated.
>> >>
>> >> J-D
>> >>
>> >> > On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> >> > You can set it on a per-HTable basis.  HTable.setScannerCaching(int);
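>> >> >
>> >> > For example (a minimal sketch, assuming conf is an existing
>> >> > HBaseConfiguration; the table name and value are illustrative):
>> >> >
>> >> >   HTable table = new HTable(conf, "mytable");
>> >> >   table.setScannerCaching(1000);  // row count used by scans on this table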
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <latham@davelink.net> wrote:
>> >> >> I have some tables with large rows and some tables with very small
>> >> >> rows, so I keep my default scanner caching at 1 row, but I have to
>> >> >> remember to set it higher when scanning tables with smaller rows.
>> >> >> It would be nice to have a default that did something reasonable
>> >> >> across tables.
>> >> >>
>> >> >> Would it make sense to set scanner caching as a count of bytes
>> >> >> rather than a count of rows?  That would make it similar to the
>> >> >> write buffer for batches of puts, which is flushed based on size
>> >> >> rather than a fixed number of Puts.  Then there could be some
>> >> >> default value which should provide decent performance out of the
>> >> >> box.
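>> >> >>
>> >> >> For reference, the write buffer behavior I mean looks roughly like
>> >> >> this (a sketch, assuming the usual client imports; names are
>> >> >> illustrative):
>> >> >>
>> >> >>   HTable table = new HTable(conf, "mytable");
>> >> >>   table.setAutoFlush(false);              // buffer Puts client side
>> >> >>   table.setWriteBufferSize(1024 * 1024);  // flush when ~1MB is buffered
>> >> >>   for (Put put : puts) {
>> >> >>     table.put(put);       // flushed automatically as the buffer fills
>> >> >>   }
>> >> >>   table.flushCommits();   // push out whatever remains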
>> >> >>
>> >> >> Dave
>> >> >>
>> >> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <ghelmling@gmail.com> wrote:
>> >> >>
>> >> >>> To set this per scan you should be able to do:
>> >> >>>
>> >> >>> Scan s = new Scan();
>> >> >>> s.setCaching(...);
>> >> >>>
>> >> >>> (I think this works anyway)
>> >> >>>
>> >> >>>
>> >> >>> The other thing that I've found useful is using a PageFilter on
>> >> >>> scans:
>> >> >>>
>> >> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
>> >> >>>
>> >> >>> I believe this is applied independently on each region server (?),
>> >> >>> so you still need to do your own counting while iterating the
>> >> >>> results, but it can be used to early out on the server side
>> >> >>> separately from the scanner caching value.
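>> >> >>>
>> >> >>> Something like this (a sketch, assuming the usual client imports
>> >> >>> and an open HTable named table; the page size of 10 is
>> >> >>> illustrative):
>> >> >>>
>> >> >>>   Scan scan = new Scan();
>> >> >>>   scan.setFilter(new PageFilter(10));  // each region stops after 10 rows
>> >> >>>   ResultScanner scanner = table.getScanner(scan);
>> >> >>>   int count = 0;
>> >> >>>   for (Result r : scanner) {
>> >> >>>     if (++count > 10) break;  // enforce the overall limit client
>> >> >>>                               // side; regions filter independently
>> >> >>>   }
>> >> >>>   scanner.close();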
>> >> >>>
>> >> >>> --gh
>> >> >>>
>> >> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <stack@duboce.net> wrote:
>> >> >>>
>> >> >>> > There is this in the configuration:
>> >> >>> >
>> >> >>> >  <property>
>> >> >>> >    <name>hbase.client.scanner.caching</name>
>> >> >>> >    <value>1</value>
>> >> >>> >    <description>Number of rows that will be fetched when
>> >> >>> >    calling next on a scanner if it is not served from memory.
>> >> >>> >    Higher caching values will enable faster scanners but will
>> >> >>> >    eat up more memory, and some calls of next may take longer
>> >> >>> >    and longer times when the cache is empty.
>> >> >>> >    </description>
>> >> >>> >  </property>
>> >> >>> >
>> >> >>> >
>> >> >>> > Being able to do it per Scan sounds like something we should
>> >> >>> > add.
>> >> >>> >
>> >> >>> > St.Ack
>> >> >>> >
>> >> >>> >
>> >> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein <silberst@yahoo-inc.com> wrote:
>> >> >>> >
>> >> >>> > > Hi,
>> >> >>> > > Is there a way to specify a limit on the number of returned
>> >> >>> > > records for a scan?  I don't see any way to do this when
>> >> >>> > > building the scan.  If there is, that would be great.  If not,
>> >> >>> > > what about when iterating over the result?  If I exit the loop
>> >> >>> > > when I reach my limit, will that approximate this clause?  I
>> >> >>> > > guess my real question is about how scan is implemented in the
>> >> >>> > > client, i.e. how many records are returned from HBase at a time
>> >> >>> > > as I iterate through the scan result?  If I want 1,000 records
>> >> >>> > > and 100 get returned at a time, then I'm in good shape.  On the
>> >> >>> > > other hand, if I want 10 records and get 100 at a time, it's a
>> >> >>> > > bit wasteful, though the waste is bounded.
>> >> >>> > >
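>> >> >>> > > In case it helps, this is the kind of early exit I mean (a
>> >> >>> > > sketch, assuming the usual client imports and an open HTable
>> >> >>> > > named table; process() is a hypothetical per-row step):
>> >> >>> > >
>> >> >>> > >   int limit = 1000;
>> >> >>> > >   int count = 0;
>> >> >>> > >   ResultScanner scanner = table.getScanner(new Scan());
>> >> >>> > >   for (Result r : scanner) {
>> >> >>> > >     process(r);                   // hypothetical per-row work
>> >> >>> > >     if (++count >= limit) break;  // stop at the limit
>> >> >>> > >   }
>> >> >>> > >   scanner.close();  // frees the server-side scanner
>> >> >>> > >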
>> >> >>> > > Thanks,
>> >> >>> > > Adam
>> >> >>> > >
>> >> >>> >
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >
>>
>
>
