kudu-user mailing list archives

From Cliff Resnick <cre...@gmail.com>
Subject Re: "broadcast" tablet replication for kudu?
Date Thu, 12 Apr 2018 03:52:50 GMT
From the screenshot it does not look like there was a broadcast of the
dimension table(s), so it could be the case here that the multiple smaller
sends help. Our dim tables are generally in the single-digit millions and
Impala chooses to broadcast them. Since the fact result cardinality is
always much smaller, we've found that forcing a [shuffle] dimension join is
actually faster, since the dims are sent over the network only once rather
than to all nodes. The degraded performance of broadcast is especially
obvious when the query returns zero results. I don't have much experience
here, but it does seem that Kudu's efficient predicate scans can sometimes
"break" Impala's query plan.
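
Roughly what forcing the shuffle looks like on our side (a sketch only;
the table and column names are made up, and STRAIGHT_JOIN is included so
the planner keeps the join order the hint assumes):

  SELECT STRAIGHT_JOIN f.day, d.name, SUM(f.revenue) AS revenue
  FROM fact_events f
  JOIN /* +SHUFFLE */ dim_campaign d   -- hash-exchange instead of broadcasting d
    ON f.campaign_id = d.id
  WHERE d.region = 'EMEA'
  GROUP BY f.day, d.name;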

-Cliff

On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <mauricio@impact.com>
wrote:

> @Todd not to belabor the point, but when I suggested breaking up small dim
> tables into multiple parquet files (and, in this thread's context, perhaps
> partitioning the kudu table, even if small, into multiple tablets), it was
> to speed up joins/exchanges, not to parallelize the scan.
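>
> Roughly the kind of Kudu DDL I had in mind (a sketch only; the names and
> the partition count are made up):
>
>   CREATE TABLE dim_campaign (
>     id BIGINT PRIMARY KEY,
>     name STRING,
>     region STRING
>   )
>   PARTITION BY HASH (id) PARTITIONS 4   -- a few tablets even though the table is small
>   STORED AS KUDU;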
>
> For example, recently we ran into a slow query where the 14M-record
> dimension fit into a single file & block, so it got scanned on a single
> node, though still pretty quickly (300ms); however, it caused the join to
> take 25+ seconds and bogged down the entire query.  See the highlighted
> fragment and its parent.
>
> So we broke it into several small files the way I described in my previous
> post, and now join and query are fast (6s).
>
> -m
>
> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> I suppose in the case that the dimension table scan makes a non-trivial
>> portion of your workload time, then yea, parallelizing the scan as you
>> suggest would be beneficial. That said, in typical analytic queries,
>> scanning the dimension tables is very quick compared to scanning the
>> much-larger fact tables, so the extra parallelism on the dim table scan
>> isn't worth too much.
>>
>> -Todd
>>
>> On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <
>> mauricio@impactradius.com> wrote:
>>
>>> @Todd I know from working with parquet in the past that small dimensions
>>> which fit in a single file/block limit the parallelism of
>>> join/exchange/aggregation nodes, and I've forced those dims to spread
>>> across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar
>>> when doing INSERT OVERWRITE to load them, which then allows these
>>> operations to parallelize across that many nodes.
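>>>
>>> Something along these lines (a sketch; the table names are made up):
>>>
>>>   SET PARQUET_FILE_SIZE=8m;
>>>   -- rewrite the dim so it lands in many ~8MB files instead of one big one
>>>   INSERT OVERWRITE dim_campaign_parquet
>>>   SELECT * FROM dim_campaign_staging;
>>>   SET PARQUET_FILE_SIZE=0;   -- 0 = back to letting Impala pick the default size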
>>>
>>> Wouldn't it be useful here for Cliff's small dims to be partitioned into
>>> a couple tablets to similarly improve parallelism?
>>>
>>> -m
>>>
>>> On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>
>>>> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <cresny@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Todd,
>>>>>
>>>>> Thanks for that explanation, as well as all the great work you're
>>>>> doing  -- it's much appreciated! I just have one last follow-up question.
>>>>> Reading about BROADCAST operations (Kudu, Spark, Flink, etc.) it seems
>>>>> the smaller table is always copied in its entirety BEFORE the predicate
>>>>> is evaluated.
>>>>>
>>>>
>>>> That's not quite true. If you have a predicate on a joined column, or
>>>> on one of the columns in the joined table, it will be pushed down to the
>>>> "scan" operator, which happens before the "exchange". In addition, there
>>>> is a feature called "runtime filters" that can push dynamically-generated
>>>> filters from one side of the exchange to the other.
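>>>>
>>>> As an illustration (made-up table names, not output from a real plan): in
>>>> a query like the one below, the d.region predicate is evaluated inside
>>>> the scan of the dimension, before the exchange, and a runtime filter
>>>> built from the surviving d.id values can be applied to the fact scan on
>>>> the probe side:
>>>>
>>>>   SELECT f.day, SUM(f.revenue) AS revenue
>>>>   FROM fact_events f
>>>>   JOIN dim_campaign d ON f.campaign_id = d.id
>>>>   WHERE d.region = 'EMEA'   -- pushed down into the dim-side scan
>>>>   GROUP BY f.day;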
>>>>
>>>>
>>>>> But since the Kudu client provides a serialized scanner as part of the
>>>>> ScanToken API, why wouldn't Impala use that instead if it knows that the
>>>>> table is Kudu and the query has any type of predicate? Perhaps if I
>>>>> hash-partition the table I could force this (because that complicates
>>>>> a BROADCAST)? I guess this is really a question for Impala, but perhaps
>>>>> there is a more basic reason.
>>>>>
>>>>
>>>> Impala could definitely be smarter, just a matter of programming
>>>> Kudu-specific join strategies into the optimizer. Today, the optimizer
>>>> isn't aware of the unique properties of Kudu scans vs other storage
>>>> mechanisms.
>>>>
>>>> -Todd
>>>>
>>>>
>>>>>
>>>>> -Cliff
>>>>>
>>>>> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <todd@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>>>>>> cresnick@mediamath.com> wrote:
>>>>>>
>>>>>>> I thought I had read that the Kudu client can configure a scan for
>>>>>>> CLOSEST_REPLICA and assumed this was a way to take advantage of data
>>>>>>> collocation.
>>>>>>>
>>>>>>
>>>>>> Yea, when a client uses CLOSEST_REPLICA it will read a local one if
>>>>>> available. However, that doesn't influence the higher-level operation of
>>>>>> the Impala (or Spark) planner. The planner isn't aware of the replication
>>>>>> policy, so it will use one of the existing supported JOIN strategies. Given
>>>>>> statistics, it will choose to broadcast the small table, which means that
>>>>>> it will create a plan that looks like:
>>>>>>
>>>>>>
>>>>>>                                    +-------------------------+
>>>>>>                                    |                         |
>>>>>>                         +---------->build      JOIN          |
>>>>>>                         |          |                         |
>>>>>>                         |          |              probe      |
>>>>>>                  +--------------+  +-------------------------+
>>>>>>                  |              |                  |
>>>>>>                  | Exchange     |                  |
>>>>>>             +----+ (broadcast   |                  |
>>>>>>             |    |              |                  |
>>>>>>             |    +--------------+                  |
>>>>>>             |                                      |
>>>>>>       +---------+                                  |
>>>>>>       |         |                        +-----------------------+
>>>>>>       |  SCAN   |                        |                       |
>>>>>>       |  KUDU   |                        |   SCAN (other side)   |
>>>>>>       |         |                        |                       |
>>>>>>       +---------+                        +-----------------------+
>>>>>>
>>>>>> (hopefully the ASCII art comes through)
>>>>>>
>>>>>> In other words, the "scan kudu" operator scans the table once, and
>>>>>> then replicates the results of that scan into the JOIN operator. The
>>>>>> "scan kudu" operator of course will read its local copy, but it will
>>>>>> still go through the exchange process.
>>>>>>
>>>>>> For the use case you're talking about, where the join is just looking
>>>>>> up a single row by PK in a dimension table, ideally we'd be using an
>>>>>> altogether different join strategy such as nested-loop join, with the
>>>>>> inner "loop" actually being a Kudu PK lookup, but that strategy isn't
>>>>>> implemented by Impala.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>>
>>>>>>
>>>>>>>  If this exists, then how far out of context is my understanding of
>>>>>>> it? Reading about HDFS cache replication, I do know that Impala will
>>>>>>> choose a random replica there to more evenly distribute load. But
>>>>>>> especially compared to Kudu upsert, managing mutable data using Parquet
>>>>>>> is painful. So, perhaps to sum things up, if nearly 100% of my metadata
>>>>>>> scans are single Primary Key lookups followed by a tiny broadcast, then
>>>>>>> am I really just splitting hairs performance-wise between Kudu and
>>>>>>> HDFS-cached parquet?
>>>>>>>
>>>>>>> From:  Todd Lipcon <todd@cloudera.com>
>>>>>>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>> Date: Friday, March 16, 2018 at 2:51 PM
>>>>>>>
>>>>>>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>
>>>>>>> It's worth noting that, even if your table is replicated, Impala's
>>>>>>> planner is unaware of this fact and it will give the same plan
>>>>>>> regardless. That is to say, rather than every node scanning its local
>>>>>>> copy, a single node will perform the whole scan (assuming it's a small
>>>>>>> table) and broadcast it from there within the scope of a single query.
>>>>>>> So, I don't think you'll see any performance improvements on Impala
>>>>>>> queries by attempting something like an extremely high replication
>>>>>>> count.
>>>>>>>
>>>>>>> I could see bumping the replication count to 5 for these tables
>>>>>>> since the extra storage cost is low and it will ensure higher
>>>>>>> availability of the important central tables, but I'd be surprised if
>>>>>>> there is any measurable perf impact.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <
>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>
>>>>>>>> Thanks for that, glad I was wrong there! Aside from replication
>>>>>>>> considerations, is it also recommended that the number of tablet
>>>>>>>> servers be odd?
>>>>>>>>
>>>>>>>> I will check the forums as you suggested, but from what I read after
>>>>>>>> searching, Impala relies on user-configured caching strategies using
>>>>>>>> the HDFS cache.  The workload for these tables is very light on
>>>>>>>> writes, maybe a dozen or so records per hour across 6 or 7 tables.
>>>>>>>> The size of the tables ranges from thousands to low millions of rows,
>>>>>>>> so sub-partitioning would not be required. So perhaps this is not a
>>>>>>>> typical use-case, but I think it could work quite well with kudu.
>>>>>>>>
>>>>>>>> From: Dan Burkert <danburkert@apache.org>
>>>>>>>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>> Date: Friday, March 16, 2018 at 2:09 PM
>>>>>>>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>
>>>>>>>> The replication count is the number of tablet servers which Kudu
>>>>>>>> will host copies on.  So if you set the replication level to 5, Kudu
>>>>>>>> will put the data on 5 separate tablet servers.  There's no built-in
>>>>>>>> broadcast table feature; upping the replication factor is the closest
>>>>>>>> thing.  A couple of things to keep in mind:
>>>>>>>>
>>>>>>>> - Always use an odd replication count.  This is important due to
>>>>>>>> how the Raft algorithm works.  Recent versions of Kudu won't even let
>>>>>>>> you specify an even number without flipping some flags.
>>>>>>>> - We don't test much beyond 5 replicas.  It *should* work, but you
>>>>>>>> may run into issues since it's a relatively rare configuration.  With
>>>>>>>> a heavy write workload and many replicas you are even more likely to
>>>>>>>> encounter issues.
>>>>>>>>
>>>>>>>> It's also worth checking in an Impala forum whether it has features
>>>>>>>> that make joins against small broadcast tables better.  Perhaps Impala
>>>>>>>> can cache small tables locally when doing joins.
>>>>>>>>
>>>>>>>> - Dan
>>>>>>>>
>>>>>>>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>>
>>>>>>>>> The problem is, AFAIK, that the replication count is not necessarily
>>>>>>>>> the distribution count, so you can't guarantee all tablet servers
>>>>>>>>> will have a copy.
>>>>>>>>>
>>>>>>>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <boris@boristyukin.com>
>>>>>>>>> wrote:
>>>>>>>>> I'm new to Kudu but we are also going to use Impala mostly with
>>>>>>>>> Kudu. We have a few tables that are small but used a lot. My plan is
>>>>>>>>> to replicate them more than 3 times. When you create a kudu table,
>>>>>>>>> you can specify the number of replicated copies (3 by default), and
>>>>>>>>> I guess you can put there a number corresponding to your node count
>>>>>>>>> in the cluster. The downside is that you cannot change that number
>>>>>>>>> unless you recreate the table.
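>>>>>>>>>
>>>>>>>>> For reference, the knob I mean is the table property at create time,
>>>>>>>>> something like this (a sketch only; names and counts are made up):
>>>>>>>>>
>>>>>>>>>   CREATE TABLE lookup_dim (
>>>>>>>>>     id BIGINT PRIMARY KEY,
>>>>>>>>>     name STRING
>>>>>>>>>   )
>>>>>>>>>   PARTITION BY HASH (id) PARTITIONS 2
>>>>>>>>>   STORED AS KUDU
>>>>>>>>>   TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');   -- replication factor fixed at create time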
>>>>>>>>>
>>>>>>>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <cresny@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> We will soon be moving our analytics from AWS Redshift to
>>>>>>>>>> Impala/Kudu. One Redshift feature that we will miss is its ALL
>>>>>>>>>> Distribution, where a copy of a table is maintained on each server.
>>>>>>>>>> We define a number of metadata tables this way since they are used
>>>>>>>>>> in nearly every query. We are considering using parquet in HDFS
>>>>>>>>>> cache for these, and Kudu would be a much better fit for the update
>>>>>>>>>> semantics, but we are worried about the additional contention.  I'm
>>>>>>>>>> wondering if having a Broadcast, or ALL, tablet replication might
>>>>>>>>>> be an easy feature to add to Kudu?
>>>>>>>>>>
>>>>>>>>>> -Cliff
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Todd Lipcon
>>>>>>> Software Engineer, Cloudera
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>>
>>> --
>>> *MAURICIO ARISTIZABAL*
>>> Architect - Business Intelligence + Data Science
>>> mauricio@impactradius.com  (m) +1 323 309 4260
>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>>
>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>> <https://twitter.com/impactradius> | Facebook
>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Mauricio Aristizabal
> Architect - Data Pipeline
> *M * 323 309 4260
> *E  *mauricio@impact.com  |  *W * https://impact.com
> <https://www.linkedin.com/company/608678/>
> <https://www.facebook.com/ImpactMarTech/>
> <https://twitter.com/impactmartech>
>
