kudu-user mailing list archives

From Clifford Resnick <cresn...@mediamath.com>
Subject Re: "broadcast" tablet replication for kudu?
Date Fri, 16 Mar 2018 18:35:38 GMT
Thanks for that, glad I was wrong there! Aside from replication considerations, is it also
recommended that the number of tablet servers be odd?

I will check the forums as you suggested, but from what I read after searching, Impala
relies on user-configured caching strategies using HDFS cache.  The write workload for these
tables is very light, maybe a dozen or so records per hour across 6 or 7 tables. The tables
range in size from thousands to low millions of rows, so sub-partitioning would not
be required. So perhaps this is not a typical use case, but I think it could work quite well
with Kudu.

From: Dan Burkert <danburkert@apache.org>
Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
Date: Friday, March 16, 2018 at 2:09 PM
To: "user@kudu.apache.org" <user@kudu.apache.org>
Subject: Re: "broadcast" tablet replication for kudu?

The replication count is the number of tablet servers on which Kudu will host copies of the
data.  So if you set the replication level to 5, Kudu will put the data on 5 separate tablet
servers.  There's no built-in broadcast table feature; upping the replication factor is the
closest thing.  A couple of things to keep in mind:

- Always use an odd replication count.  This is important due to how the Raft algorithm works.
Recent versions of Kudu won't even let you specify an even number without flipping some flags.
- We don't test much beyond 5 replicas.  It should work, but you may run into issues
since it's a relatively rare configuration.  With a heavy write workload and many replicas
you are even more likely to encounter issues.
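
As a rough illustration of the odd-count advice: Raft needs a majority of replicas to
agree, so an even count adds a copy without adding fault tolerance. The sketch below is
only the standard majority arithmetic, not Kudu code, and the function names are made up
for illustration.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


if __name__ == "__main__":
    # Compare odd and even replication counts side by side.
    for n in range(3, 8):
        print(f"replicas={n} majority={majority(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Going from 3 to 4 replicas, or from 5 to 6, raises the majority size but leaves the number
of tolerated failures unchanged, which is why the even counts buy nothing.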

It's also worth checking in an Impala forum whether it has features that make joins against
small broadcast tables better.  Perhaps Impala can cache small tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <cresnick@mediamath.com>
wrote:
The problem is, AFAIK, that replication count is not necessarily the distribution count, so
you can't guarantee all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <boris@boristyukin.com>
wrote:
I'm new to Kudu but we are also going to use Impala mostly with Kudu. We have a few tables
that are small but used a lot. My plan is to replicate them more than 3 times. When you create
a Kudu table, you can specify the number of replicated copies (3 by default), and I guess you
can put a number there corresponding to the node count in your cluster. The downside is that
you cannot change that number unless you recreate the table.

On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <cresny@gmail.com>
wrote:
We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One Redshift feature
that we will miss is its ALL Distribution, where a copy of a table is maintained on each server.
We define a number of metadata tables this way since they are used in nearly every query.
We are considering using Parquet in HDFS cache for these. Kudu would be a much better
fit for the update semantics, but we are worried about the additional contention.  I'm wondering
if having a Broadcast, or ALL, tablet replication might be an easy feature to add to Kudu?

-Cliff

