kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: "broadcast" tablet replication for kudu?
Date Fri, 16 Mar 2018 20:10:09 GMT
On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <cresnick@mediamath.com>
wrote:

> I thought I had read that the Kudu client can configure a scan for
> CLOSEST_REPLICA and assumed this was a way to take advantage of data
> collocation.

Yea, when a client uses CLOSEST_REPLICA it will read a local one if
available. However, that doesn't influence the higher level operation of
the Impala (or Spark) planner. The planner isn't aware of the replication
policy, so it will use one of the existing supported JOIN strategies. Given
statistics, it will choose to broadcast the small table, which means that
it will create a plan that looks like:

                        +--------------------------+
                        |          JOIN            |
             +--------->| build            probe   |
             |          +-------------------^------+
             |                              |
      +------+------+                       |
      |  Exchange   |                       |
      | (broadcast) |                       |
      +------^------+                       |
             |                              |
      +------+------+           +-----------+-----------+
      |    SCAN     |           |  SCAN (other side)    |
      |    KUDU     |           |                       |
      +-------------+           +-----------------------+

(hopefully the ASCII art comes through)

In other words, the "scan kudu" operator scans the table once, and then
replicates the results of that scan into the JOIN operator. The "scan kudu"
operator of course will read its local copy, but it will still go through
the exchange process.
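
(For reference, opting a scan into local reads from the Java client looks
roughly like the sketch below; the master address and table name are made
up:)

    // Sketch: ask the scan to read from the closest replica rather than
    // always the leader. Master address and table name are hypothetical.
    import org.apache.kudu.client.*;

    public class LocalScan {
      public static void main(String[] args) throws KuduException {
        KuduClient client =
            new KuduClient.KuduClientBuilder("master1:7051").build();
        try {
          KuduTable table = client.openTable("dim_small");
          KuduScanner scanner = client.newScannerBuilder(table)
              .replicaSelection(ReplicaSelection.CLOSEST_REPLICA)
              .build();
          while (scanner.hasMoreRows()) {
            for (RowResult row : scanner.nextRows()) {
              System.out.println(row.rowToString());
            }
          }
        } finally {
          client.shutdown();
        }
      }
    }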

For the use case you're talking about, where the join is just looking up a
single row by PK in a dimension table, ideally we'd be using an altogether
different join strategy such as a nested-loop join, with the inner "loop"
actually being a Kudu PK lookup, but that strategy isn't implemented by
Impala.
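
(If it were, the inner loop would amount to something like the sketch
below; again only a sketch, with an invented table and key column:)

    // Sketch of a single-row PK lookup: an equality predicate on the key
    // column prunes the scan to one tablet and at most one row.
    import org.apache.kudu.client.*;

    public class PkLookup {
      public static void main(String[] args) throws KuduException {
        KuduClient client =
            new KuduClient.KuduClientBuilder("master1:7051").build();
        try {
          KuduTable table = client.openTable("dim_small");
          KuduPredicate byPk = KuduPredicate.newComparisonPredicate(
              table.getSchema().getColumn("id"),
              KuduPredicate.ComparisonOp.EQUAL, 42L);
          KuduScanner scanner = client.newScannerBuilder(table)
              .addPredicate(byPk)
              .build();
          while (scanner.hasMoreRows()) {
            for (RowResult row : scanner.nextRows()) {
              System.out.println(row.rowToString()); // at most one row
            }
          }
        } finally {
          client.shutdown();
        }
      }
    }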


>  If this exists, then how far off is my understanding of it?
> Reading about HDFS cache replication, I do know that Impala will choose a
> random replica there to more evenly distribute load. But especially
> compared to Kudu upsert, managing mutable data using Parquet is painful.
> So, perhaps to sum things up: if nearly 100% of my metadata scans are
> single primary-key lookups followed by a tiny broadcast, then am I really
> just splitting hairs performance-wise between Kudu and HDFS-cached Parquet?
> From:  Todd Lipcon <todd@cloudera.com>
> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
> Date: Friday, March 16, 2018 at 2:51 PM
> To: "user@kudu.apache.org" <user@kudu.apache.org>
> Subject: Re: "broadcast" tablet replication for kudu?
> It's worth noting that, even if your table is replicated, Impala's planner
> is unaware of this fact and will give the same plan regardless. That is
> to say, rather than every node scanning its local copy, a single node
> will perform the whole scan (assuming it's a small table) and broadcast
> it from there within the scope of a single query. So, I don't
> think you'll see any performance improvements on Impala queries by
> attempting something like an extremely high replication count.
> I could see bumping the replication count to 5 for these tables since the
> extra storage cost is low and it will ensure higher availability of the
> important central tables, but I'd be surprised if there is any measurable
> perf impact.
> -Todd
> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <cresnick@mediamath.com>
> wrote:
>> Thanks for that, glad I was wrong there! Aside from replication
>> considerations, is it also recommended that the number of tablet servers
>> be odd?
>> I will check the forums as you suggested, but from what I've read,
>> Impala relies on user-configured caching strategies using the HDFS
>> cache. The write workload for these tables is very light, maybe a dozen
>> or so records per hour across 6 or 7 tables. The size of the tables
>> ranges from thousands to low millions of rows, so sub-partitioning would
>> not be required. So perhaps this is not a typical use case, but I think
>> it could work quite well with Kudu.
>> From: Dan Burkert <danburkert@apache.org>
>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>> Date: Friday, March 16, 2018 at 2:09 PM
>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>> Subject: Re: "broadcast" tablet replication for kudu?
>> The replication count is the number of tablet servers on which Kudu will
>> host copies.  So if you set the replication level to 5, Kudu will put the
>> data on 5 separate tablet servers (a sketch follows the list below).
>> There's no built-in broadcast table feature; upping the replication
>> factor is the closest thing.  A couple of things to keep in mind:
>> - Always use an odd replication count.  This is important due to how the
>> Raft algorithm works.  Recent versions of Kudu won't even let you specify
>> an even number without flipping some flags.
>> - We don't test much beyond 5 replicas.  It *should* work, but you may
>> run into issues since it's a relatively rare configuration.  With a heavy
>> write workload and many replicas you are even more likely to encounter
>> issues.
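>>
>> (For concreteness, a rough sketch of creating a table with replication
>> factor 5 via the Kudu Java client; the schema, names, and partitioning
>> here are invented for illustration:)
>>
>>     // Sketch: create a small dimension table with 5 replicas instead of
>>     // the default 3. Master address and all names are hypothetical.
>>     import java.util.Arrays;
>>     import org.apache.kudu.ColumnSchema;
>>     import org.apache.kudu.Schema;
>>     import org.apache.kudu.Type;
>>     import org.apache.kudu.client.*;
>>
>>     public class CreateReplicatedTable {
>>       public static void main(String[] args) throws KuduException {
>>         KuduClient client =
>>             new KuduClient.KuduClientBuilder("master1:7051").build();
>>         try {
>>           Schema schema = new Schema(Arrays.asList(
>>               new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64)
>>                   .key(true).build(),
>>               new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING)
>>                   .build()));
>>           CreateTableOptions options = new CreateTableOptions()
>>               .setNumReplicas(5)                          // keep it odd
>>               .addHashPartitions(Arrays.asList("id"), 2); // tiny table
>>           client.createTable("dim_small", schema, options);
>>         } finally {
>>           client.shutdown();
>>         }
>>       }
>>     }
>>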
>> It's also worth checking in an Impala forum whether it has features that
>> make joins against small broadcast tables better.  Perhaps Impala can
>> cache small tables locally when doing joins.
>> - Dan
>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>> cresnick@mediamath.com> wrote:
>>> The problem is, AFAIK, that replication count is not necessarily the
>>> distribution count, so you can't guarantee all tablet servers will have a
>>> copy.
>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <boris@boristyukin.com> wrote:
>>> I'm new to Kudu, but we are also going to use Impala mostly with Kudu.
>>> We have a few tables that are small but used a lot. My plan is to
>>> replicate them more than 3 times. When you create a Kudu table, you can
>>> specify the number of replicated copies (3 by default), and I guess you
>>> can put a number there corresponding to your node count in the cluster.
>>> The downside is that you cannot change that number unless you recreate
>>> the table.
>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <cresny@gmail.com>
>>> wrote:
>>>> We will soon be moving our analytics from AWS Redshift to Impala/Kudu.
>>>> One Redshift feature that we will miss is its ALL Distribution, where a
>>>> copy of a table is maintained on each server. We define a number of
>>>> metadata tables this way since they are used in nearly every query. We are
>>>> considering using Parquet in the HDFS cache for these; Kudu would be a
>>>> much better fit for the update semantics, but we are worried about the
>>>> additional contention.  I'm wondering if a Broadcast, or ALL, tablet
>>>> replication might be an easy feature to add to Kudu?
>>>> -Cliff
> --
> Todd Lipcon
> Software Engineer, Cloudera

Todd Lipcon
Software Engineer, Cloudera
