hbase-dev mailing list archives

From Dave Latham <lat...@davelink.net>
Subject Re: remove scanner caching?
Date Thu, 09 Apr 2015 13:25:32 GMT
We definitely wouldn't want to remove it too soon, for compatibility.

Adding a "limitRows" notion sounds reasonable, but I'd argue it is actually
something different from what caching was.  If an app is relying on the scanner
to actually limit the number of rows returned, the current caching limit
won't work for scans that cross region boundaries.  We would need to keep
the state client side in addition to server side and decrement as we
traverse regions.
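
To illustrate the client-side bookkeeping described above, here is a minimal
sketch. It is not HBase code: each inner list stands in for one region, and
the names RowLimitSketch, scanRegion, scanPerRegionLimitOnly and scanWithLimit
are hypothetical. It assumes a simplified model in which the server can only
enforce a limit within a single region.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not the HBase client: each inner list is one region's rows.
final class RowLimitSketch {

    // Stand-in for a per-region server call: returns at most `limit` rows.
    static List<String> scanRegion(List<String> regionRows, int limit) {
        int n = Math.min(Math.max(limit, 0), regionRows.size());
        return new ArrayList<>(regionRows.subList(0, n));
    }

    // No client-side state: every region applies the same limit on its own,
    // so the total can exceed the limit once the scan crosses a boundary.
    static List<String> scanPerRegionLimitOnly(List<List<String>> regions, int limit) {
        List<String> out = new ArrayList<>();
        for (List<String> region : regions) {
            out.addAll(scanRegion(region, limit));
        }
        return out;
    }

    // Client-side state: the remaining budget is decremented as regions are traversed.
    static List<String> scanWithLimit(List<List<String>> regions, int limitRows) {
        List<String> out = new ArrayList<>();
        int remaining = limitRows;
        for (List<String> region : regions) {
            if (remaining <= 0) {
                break; // budget used up, skip the remaining regions
            }
            List<String> rows = scanRegion(region, remaining);
            out.addAll(rows);
            remaining -= rows.size();
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> regions = List.of(
            List.of("a1", "a2", "a3"),
            List.of("b1", "b2", "b3"),
            List.of("c1"));
        System.out.println(scanPerRegionLimitOnly(regions, 4).size()); // prints 7
        System.out.println(scanWithLimit(regions, 4)); // prints [a1, a2, a3, b1]
    }
}
```

With a limit of 4, the per-region version returns 7 rows because no single
region ever reaches the limit, while the version that carries the remaining
count across regions stops after 4.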

Looking into the branch-1 API, I can see there is now also
Scan.allowPartialResults in addition to Scan.batch.  For most cases, I'd
expect batching is just to avoid memory issues for wide rows, in which case
allowPartialResults could be a better, simpler interface to tell HBase not
to overflow a small buffer with wide rows.  Though it looks like at the
moment that doesn't happen.

As an app developer myself, the interaction between batch, caching,
maxResultSize, allowPartialResults is confusing (as well as the similarly
named cacheBlocks).  The names could be better: maxResultSize actually has
nothing to do with the maximum size of the Results returned (it's rather
the internal buffer size), batch is a max Cells per Result [I think.  Does
it reset between rows?], and of course caching is an internal
maxRowsPerRPC.  The documentation is limited and out of date (e.g.
https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some things
could at least be more consistent (getAllowPartialResults instead of
isAllowPartialResults like almost all the other boolean properties).

As I poke around and write this out, I guess I'd argue instead that it's
time (or past time) to clean up the Scan API and document it more clearly.
Which is a scary task, I know.  But for a newcomer, it's a scary API right
now.

What about something like:
Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
- though this is still confusing because it's common to go over this size)
Scan.limitRows (instead of caching - along with true client side support)
Scan.allowPartialResults (to indicate it's ok to break up rows across
Results.  It is transmitted to the server to tell it to stop adding Cells to
the buffer as soon as it fills rather than at the end of the row.  If a
client needs true pagination for Cells within a row it can be done with a
Scan.cacheBlocks (less confusing without other things called "caching")
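
The proposed bufferSize / allowPartialResults interaction can be sketched as a
toy model. This is an assumption-laden illustration, not the HBase server
logic: PartialResultSketch and fillBuffer are hypothetical names, cells are
reduced to their byte sizes, and the buffer check is modeled exactly as
worded above (per Cell when partial results are allowed, otherwise only at
the end of each row).

```java
// Hypothetical model of filling one response buffer; not HBase code.
final class PartialResultSketch {

    // rowsOfCellSizes[i][j] is the size in bytes of cell j of row i.
    // Returns how many bytes end up in a single response.
    static long fillBuffer(int[][] rowsOfCellSizes, long bufferSize, boolean allowPartial) {
        long used = 0;
        for (int[] row : rowsOfCellSizes) {
            for (int cellSize : row) {
                used += cellSize;
                // With partial results allowed, stop as soon as the buffer fills,
                // even in the middle of a row.
                if (allowPartial && used >= bufferSize) {
                    return used;
                }
            }
            // Otherwise the size is only checked once the whole row has been added,
            // so a wide row can push the response well past the target.
            if (used >= bufferSize) {
                return used;
            }
        }
        return used;
    }

    public static void main(String[] args) {
        int[][] rows = { {40, 40, 40, 40}, {40, 40} };
        System.out.println(fillBuffer(rows, 100, false)); // prints 160
        System.out.println(fillBuffer(rows, 100, true));  // prints 120
    }
}
```

Both modes can still exceed the target, which matches the note that going
over this size is common, but the overshoot is bounded by one Cell rather
than one row when partial results are allowed.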


On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <larsh@apache.org> wrote:

> Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> that you'll only look at the first N rows returned. In that case you'd set
> it to N. I thought we had renamed it from "caching" to "limit" but looking
> at the code, that is not the case.
> In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> backwards compatibility.
> -- Lars
>       From: Dave Latham <latham@davelink.net>
>  To: dev@hbase.apache.org
>  Sent: Wednesday, April 8, 2015 9:09 PM
>  Subject: remove scanner caching?
> After debugging a missing-data issue with scans while migrating to 0.98 (thanks
> Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> have both caching and maxResultSize for scans.  It seems to be more client
> api complexity than it's worth.  Why would someone need to set caching when
> maxResultSize is available?  Indeed, the first patch proposed by some
> fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
> deprecate and eventually remove caching?  Is there a good case for keeping
> it in the client API surface?
> Dave
