hbase-dev mailing list archives

From Jonathan Lawlor <jonathan.law...@cloudera.com>
Subject Re: remove scanner caching?
Date Thu, 09 Apr 2015 18:28:05 GMT
Nice idea! I agree that the Scan API would be cleaned up by your
suggestions, especially the doc updates. Some comments below:

> Scan.bufferSize (instead of maxResultSize for the target over-the-wire
size - though this is still confusing because it's common to go over this
size)
Ya, this setting will always have a little ambiguity associated with it (at
least until we are able to enforce it at the byte level, i.e. send back
partial Cells). Scan.bufferSize sounds okay. As a note, there was some
discussion in HBASE-11544 about renaming this field, and one of the
recommendations was Scan.rpcChunkSize.
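
For reference, a minimal sketch of how that knob is spelled today on branch-1
(the rename would not change the behaviour, only the name):

    Scan scan = new Scan();
    // Target size of the data returned per RPC; the server can still overshoot
    // since it cannot split an individual Cell (the byte-level enforcement
    // mentioned above).
    scan.setMaxResultSize(2 * 1024 * 1024); // ~2MB, the branch-1+ default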

> Scan.limitRows (instead of caching - along with true client side support)
Makes sense. I think the client-side support is actually already there (at
least in ClientScanner, via the countdown variable that is used as the
caching value for new scanner callables).
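
As a sketch of what that looks like from an application today (table is an
already-open Table, process() is just a stand-in for application logic, and
setLimitRows is the hypothetical new name, not an existing method):

    Scan scan = new Scan();
    scan.setCaching(50);       // today: rows fetched per RPC, loosely a limit
    // scan.setLimitRows(50);  // hypothetical: an explicit, enforced row limit
    int remaining = 50;        // roughly what the ClientScanner countdown tracks
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        process(r);
        if (--remaining == 0) break; // enforced even across region boundaries
      }
    }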

> Scan.allowPartialResults  (to indicate it's ok to break up rows across
Results...)
With HBASE-11544 in branch-1+ the server will stop adding Cells as soon as
the buffer fills and send back the accumulated Results to the client (the
last Result may be a partial of its row). When allowPartialResults is false,
the ClientScanner handles reassembling the partials into a complete view of
the row before releasing the Result to the application.
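
From the application's point of view the two modes look roughly like this
(a sketch, assuming a Table named table is already open):

    Scan scan = new Scan();
    scan.setAllowPartialResults(true); // opt in to receiving partial rows
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        if (r.isPartial()) {
          // More Cells for this row will arrive in subsequent Results.
          // With allowPartialResults=false the ClientScanner reassembles
          // these before the application ever sees them.
        }
      }
    }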

With this proposed cleanup, are you recommending that we do away with
Scan.setBatch? Would the default configuration remain as it is now in
branch-1+ (rowLimit = Integer.MAX_VALUE, bufferSize = 2MB,
allowPartialResults = false)?
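
In other words, restating those defaults with the existing setters (roughly
what you get today without touching anything):

    Scan scan = new Scan();
    scan.setCaching(Integer.MAX_VALUE);      // rowLimit
    scan.setMaxResultSize(2L * 1024 * 1024); // bufferSize, ~2MB
    scan.setAllowPartialResults(false);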

Jonathan

On Thu, Apr 9, 2015 at 6:25 AM, Dave Latham <latham@davelink.net> wrote:

> We definitely wouldn't want to remove it too soon, for compatibility
> reasons.
>
> Adding a "limitRows" notion sounds reasonable, but I'd argue it is actually
> something different from what caching was.  If an app is relying on the scanner
> to actually limit the number of rows returned, the current caching limit
> won't work for scans that cross region boundaries.  We would need to keep
> the state client side in addition to server side and decrement as we
> traverse regions.
>
> Looking into the branch-1 API, I can see there is now also
> Scan.allowPartialResults in addition to Scan.batch.  In most cases, I'd
> expect batching is used just to avoid memory issues with wide rows, in which
> case allowPartialResults could be a better, simpler interface to tell HBase
> not to overflow a small buffer with wide rows.  Though it looks like that
> doesn't happen at the moment.
>
> As an app developer myself, the interaction between batch, caching,
> maxResultSize, and allowPartialResults is confusing (as is the similarly
> named cacheBlocks).  The names could be better (maxResultSize actually has
> nothing to do with the maximum size of the Results returned - it's rather
> the internal buffer size, batch is a max Cells per Result [I think.  does
> it reset between rows?], and of course caching is an internal
> maxRowsPerRPC).  The documentation is limited and out of date (e.g.
> https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some
> things
> could at least be more consistent (getAllowPartialResults instead of
> isAllowPartialResults, unlike almost all the other boolean properties).
>
> As I poke around and write this out, I guess I'd argue instead that it's
> time (or past time) to clean up the Scan API and document it more clearly.
> Which is a scary task, I know.  But for a newcomer, it's a scary API right
> now.
>
> What about something like:
> Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
> - though this is still confusing because it's common to go over this size)
> Scan.limitRows (instead of caching - along with true client side support)
> Scan.allowPartialResults (to indicate it's ok to break up rows across
> Results. it is transmitted to the server to indicate stop adding Cells to
> the buffer as soon as it fills rather than at the end of the row.  if a
> client needs true pagination for Cells within a row it can be done with a
> Filter.)
> Scan.cacheBlocks (less confusing without other things called "caching")
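>
> Roughly, as a sketch (setBufferSize and setLimitRows are made-up names just
> to show the shape; setAllowPartialResults and setCacheBlocks already exist):
>
>     Scan scan = new Scan();
>     scan.setBufferSize(2 * 1024 * 1024); // was setMaxResultSize
>     scan.setLimitRows(100);              // was setCaching, now a true limit
>     scan.setAllowPartialResults(true);   // unchanged, just documented clearly
>     scan.setCacheBlocks(false);          // unchanged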
>
> Dave
>
> On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <larsh@apache.org> wrote:
>
> > Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> > it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> > that you'll only look at the first N rows returned. In that case you'd set
> > it to N. I thought we had renamed it from "caching" to "limit" but looking
> > at the code, that is not the case.
> >
> > In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> > backwards compatibility.
> >
> > -- Lars
> >       From: Dave Latham <latham@davelink.net>
> >  To: dev@hbase.apache.org
> >  Sent: Wednesday, April 8, 2015 9:09 PM
> >  Subject: remove scanner caching?
> >
> > After debugging an issue with scans missing data while migrating to 0.98
> > (thanks Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering
> > why we have both caching and maxResultSize for scans.  It seems to be more
> > client API complexity than it's worth.  Why would someone need to set
> > caching when maxResultSize is available?  Indeed, the first patch proposed
> > by some fellow in HBASE-1996 simply replaced caching with maxResultSize.
> > Can we deprecate and eventually remove caching?  Is there a good case for
> > keeping it in the client API surface?
> >
> > Dave
> >
