kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Kudu - Session.Configuration.FlushMode
Date Mon, 23 Oct 2017 22:27:32 GMT
Hi Shawn,

Answers inline below

On Tue, Oct 17, 2017 at 12:59 PM, Shawn Terry <shawn.terry@mining.komatsu> wrote:

> We ran into a problem today that looks like it might be related to this:
> https://issues.apache.org/jira/browse/KUDU-1891
> We had a client app crash with this same kind of error: “not enough
> mutation buffer space remaining for operation”.  Currently the client app
> was queuing up a number of writes and doing manual flushing at the end of
> the set of transactions.

This means that the configured mutation buffer size for the KuduSession
object was not large enough to handle all of the operations that you wrote
before flushing. The default is 7MB, but it could be configured safely to
be a bit larger at the expense of memory.
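
As a rough sketch of one way to stay under the buffer limit in manual flush
mode, the helper below flushes every N operations instead of once at the end
of the whole set. It assumes only that the session object exposes apply(op)
and flush(), as the kudu-python Session does. The helper name
apply_in_batches and the default batch size of 1000 are hypothetical; a
suitable batch size depends on how large your rows are.

```python
def apply_in_batches(session, operations, max_ops_per_flush=1000):
    """Apply operations, flushing every max_ops_per_flush operations.

    `session` is assumed to expose apply(op) and flush(). The helper name
    and the default batch size are illustrative and not part of Kudu.
    Returns the number of flushes performed.
    """
    if max_ops_per_flush < 1:
        raise ValueError("max_ops_per_flush must be at least 1")

    pending = 0   # operations applied since the last flush
    flushes = 0
    for op in operations:
        session.apply(op)
        pending += 1
        if pending >= max_ops_per_flush:
            session.flush()
            flushes += 1
            pending = 0

    # Flush whatever is left over from the final partial batch.
    if pending:
        session.flush()
        flushes += 1
    return flushes
```

Counting operations is only a proxy for bytes buffered, so with wide rows a
smaller batch size may be needed to stay under the configured buffer space.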

> We’re using the kudu-python api and would like to better understand the
> behavior of the different flushing modes… (assuming
> SessionConfiguration.FlushMode is the thing we should be looking at).

Since the Python API wraps the C++ API, it's best to look at the C++ client
docs for details on the various flush modes.

> Are there any global settings to tweak to allow a larger buffer?  What
> would be the pro’s and con’s of this?

At a certain size you will hit errors that the maximum RPC size has been
crossed, and then your writes will fail. Additionally, flushing a larger
buffer at a time implies higher latency for that flush (since it's doing
more work).

> Would explicitly using KuduSession.setFlushMode(AUTO_FLUSH_SYNC) make any
> difference?

AUTO_FLUSH_SYNC means that each operation that you Apply (e.g. an insert or
update) makes its own separate round trip to the appropriate server before
responding. This will be very slow if your goal is to stream a high volume
of writes into Kudu. It is most appropriate for an online application where
you might want to do only a few inserts in response to some web request, etc.

AUTO_FLUSH_BACKGROUND is typically the best choice for a streaming ingest
or bulk load scenario since it aims to manage buffer sizes for you
automatically for best performance. We'll continue to invest in making
AUTO_FLUSH_BACKGROUND work as well as possible for these scenarios.
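
To make the difference between the modes concrete, here is a toy in-memory
model that only counts round trips. It is not the Kudu client and does not
reproduce its real flushing logic: the class name ToySession, the mode
strings, the byte sizes, and the 50% flush watermark are all assumptions
chosen for illustration.

```python
class ToySession:
    """Toy stand-in for a buffering write session (not the Kudu client)."""

    def __init__(self, mode, buffer_bytes=7 * 1024 * 1024, flush_watermark=0.5):
        if mode not in ("sync", "background", "manual"):
            raise ValueError("unknown mode: %r" % (mode,))
        self.mode = mode
        self.buffer_bytes = buffer_bytes
        self.flush_watermark = flush_watermark  # assumed fraction, illustrative
        self.buffered = 0      # bytes waiting to be sent
        self.round_trips = 0   # simulated server round trips

    def apply(self, op_bytes):
        if op_bytes > self.buffer_bytes:
            raise ValueError("operation is larger than the whole buffer")
        if self.mode == "sync":
            # Every operation makes its own round trip.
            self.round_trips += 1
            return
        if self.buffered + op_bytes > self.buffer_bytes:
            if self.mode == "manual":
                # Nothing flushes for us, so the buffer simply fills up.
                raise RuntimeError(
                    "not enough mutation buffer space remaining for operation")
            self.flush()  # background mode: make room first
        self.buffered += op_bytes
        if (self.mode == "background"
                and self.buffered >= self.flush_watermark * self.buffer_bytes):
            self.flush()

    def flush(self):
        # An empty flush costs nothing in this model.
        if self.buffered:
            self.round_trips += 1
            self.buffered = 0
```

In this model, applying 1000 operations of 100 bytes each costs 1000 round
trips in "sync" mode, but only 20 in "background" mode with a 10000-byte
buffer. In "manual" mode with the same buffer, the 101st apply without a
flush raises the buffer-space error.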

Todd Lipcon
Software Engineer, Cloudera
