kudu-user mailing list archives

From Mauricio Aristizabal <mauri...@impact.com>
Subject Re: "broadcast" tablet replication for kudu?
Date Thu, 12 Apr 2018 04:29:33 GMT
Sorry I left that out, Cliff. FWIW it does seem to have been broadcast.

Not sure, though, how a shuffle would be much different from a broadcast
if the entire table is 1 file/block on 1 node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <cresny@gmail.com> wrote:

> From the screenshot it does not look like there was a broadcast of the
> dimension table(s), so it could be the case here that the multiple smaller
> sends help. Our dim tables are generally in the single-digit millions and
> Impala chooses to broadcast them. Since the fact result cardinality is
> always much smaller, we've found that forcing a [shuffle] dimension join is
> actually faster since it only sends dims once rather than all to all nodes.
> The degenerate performance of broadcast is especially obvious when the
> query returns zero results. I don't have much experience here, but it does
> seem that Kudu's efficient predicate scans can sometimes "break" Impala's
> query plan.
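>
> For reference, forcing that join strategy is just a hint in the query.
> A rough sketch (table and column names here are made up):
>
>   SELECT d.name, SUM(f.revenue)
>   FROM fact_table f
>   JOIN [SHUFFLE] dim_table d ON f.dim_id = d.id
>   GROUP BY d.name;
>
> versus the default, where the planner broadcasts the whole dim to every
> node to build the hash table.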
>
> -Cliff
>
> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <mauricio@impact.com
> > wrote:
>
>> @Todd not to belabor the point, but when I suggested breaking up small
>> dim tables into multiple parquet files (and in this thread's context
>> perhaps partitioning the kudu table, even if small, into multiple
>> tablets), it was to speed up joins/exchanges, not to parallelize the scan.
>>
>> For example, recently we ran into a slow query where the 14M record
>> dimension fit into a single file & block, so it got scanned on a single
>> node, though still pretty quickly (300ms); however, it caused the join
>> to take 25+ seconds and bogged down the entire query.  See the
>> highlighted fragment and its parent.
>>
>> So we broke it into several small files the way I described in my
>> previous post, and now the join and the query are fast (6s).
>>
>> -m
>>
>>
>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>>> I suppose in the case that the dimension table scan makes up a
>>> non-trivial portion of your workload time, then yea, parallelizing the
>>> scan as you suggest would be beneficial. That said, in typical analytic
>>> queries, scanning the dimension tables is very quick compared to
>>> scanning the much-larger fact tables, so the extra parallelism on the
>>> dim table scan isn't worth too much.
>>>
>>> -Todd
>>>
>>> On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <
>>> mauricio@impactradius.com> wrote:
>>>
>>>> @Todd I know working with parquet in the past I've seen small
>>>> dimensions that fit in a single file/block limit the parallelism of
>>>> join/exchange/aggregation nodes, and I've forced those dims to spread
>>>> across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or
>>>> similar when doing INSERT OVERWRITE to load them, which then allows
>>>> these operations to parallelize across that many nodes.
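>>>>
>>>> In case it's useful, the load pattern was roughly this (table names
>>>> are just placeholders):
>>>>
>>>>   SET PARQUET_FILE_SIZE=8m;
>>>>   INSERT OVERWRITE dim_customer
>>>>   SELECT * FROM staging_dim_customer;
>>>>
>>>> which leaves the dim spread across 20 or so small files/blocks instead
>>>> of one, so the exchange/join above the scan can fan out over that many
>>>> nodes.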
>>>>
>>>> Wouldn't it be useful here for Cliff's small dims to be partitioned
>>>> into a couple tablets to similarly improve parallelism?
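>>>>
>>>> Something like this when creating the table through Impala, I would
>>>> think (just a sketch, schema invented):
>>>>
>>>>   CREATE TABLE dim_campaign (
>>>>     id BIGINT PRIMARY KEY,
>>>>     name STRING,
>>>>     region STRING
>>>>   )
>>>>   PARTITION BY HASH (id) PARTITIONS 4
>>>>   STORED AS KUDU;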
>>>>
>>>> -m
>>>>
>>>> On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <cresny@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Todd,
>>>>>>
>>>>>> Thanks for that explanation, as well as all the great work you're
>>>>>> doing  -- it's much appreciated! I just have one last follow-up question.
>>>>>> Reading about BROADCAST operations (Kudu, Spark, Flink, etc.) it
>>>>>> seems the smaller table is always copied in its entirety BEFORE the
>>>>>> predicate is evaluated.
>>>>>>
>>>>>
>>>>> That's not quite true. If you have a predicate on a joined column, or
>>>>> on one of the columns in the joined table, it will be pushed down to
>>>>> the "scan" operator, which happens before the "exchange". In addition,
>>>>> there is a feature called "runtime filters" that can push
>>>>> dynamically-generated filters from one side of the exchange to the
>>>>> other.
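>>>>>
>>>>> So for a query shaped like this (made-up names, just to illustrate):
>>>>>
>>>>>   SELECT f.*
>>>>>   FROM fact_table f
>>>>>   JOIN dim_table d ON f.dim_id = d.id
>>>>>   WHERE d.region = 'EMEA';
>>>>>
>>>>> the region predicate is evaluated inside the dim_table scan, so only
>>>>> matching rows go through the exchange, and a runtime filter on
>>>>> f.dim_id can be generated from the dim side and applied to the fact
>>>>> scan.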
>>>>>
>>>>>
>>>>>> But since the Kudu client provides a serialized scanner as part of
>>>>>> the ScanToken API, why wouldn't Impala use that instead if it knows
>>>>>> that the table is Kudu and the query has any type of predicate?
>>>>>> Perhaps if I hash-partition the table I could maybe force this
>>>>>> (because that complicates a BROADCAST)? I guess this is really a
>>>>>> question for Impala but perhaps there is a more basic reason.
>>>>>
>>>>> Impala could definitely be smarter, just a matter of programming
>>>>> Kudu-specific join strategies into the optimizer. Today, the optimizer
>>>>> isn't aware of the unique properties of Kudu scans vs other storage
>>>>> mechanisms.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>>
>>>>>> -Cliff
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <todd@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>
>>>>>>>> I thought I had read that the Kudu client can configure a scan for
>>>>>>>> CLOSEST_REPLICA and assumed this was a way to take advantage of
>>>>>>>> data collocation.
>>>>>>>>
>>>>>>>
>>>>>>> Yea, when a client uses CLOSEST_REPLICA it will read a local one if
>>>>>>> available. However, that doesn't influence the higher level
>>>>>>> operation of the Impala (or Spark) planner. The planner isn't aware
>>>>>>> of the replication policy, so it will use one of the existing
>>>>>>> supported JOIN strategies. Given statistics, it will choose to
>>>>>>> broadcast the small table, which means that it will create a plan
>>>>>>> that looks like:
>>>>>>>
>>>>>>>
>>>>>>>                                     +-------------------------+
>>>>>>>                                     |                         |
>>>>>>>                          +---------->build      JOIN          |
>>>>>>>                          |          |                         |
>>>>>>>                          |          |              probe      |
>>>>>>>                  +--------------+   +-------------------------+
>>>>>>>                  |              |                  |
>>>>>>>                  | Exchange     |                  |
>>>>>>>             +----+ (broadcast)  |                  |
>>>>>>>             |    |              |                  |
>>>>>>>             |    +--------------+                  |
>>>>>>>             |                                      |
>>>>>>>       +---------+                                  |
>>>>>>>       |         |                        +-----------------------+
>>>>>>>       |  SCAN   |                        |                       |
>>>>>>>       |  KUDU   |                        |   SCAN (other side)   |
>>>>>>>       |         |                        |                       |
>>>>>>>       +---------+                        +-----------------------+
>>>>>>> (hopefully the ASCII art comes through)
>>>>>>>
>>>>>>> In other words, the "scan kudu" operator scans the table once, and
>>>>>>> then replicates the results of that scan into the JOIN operator.
>>>>>>> The "scan kudu" operator of course will read its local copy, but it
>>>>>>> will still go through the exchange process.
>>>>>>>
>>>>>>> For the use case you're talking about, where the join is just
>>>>>>> looking up a single row by PK in a dimension table, ideally we'd be
>>>>>>> using an altogether different join strategy such as nested-loop
>>>>>>> join, with the inner "loop" actually being a Kudu PK lookup, but
>>>>>>> that strategy isn't implemented by Impala.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> If this exists, then how far out of context is my understanding of
>>>>>>>> it? Reading about HDFS cache replication, I do know that Impala
>>>>>>>> will choose a random replica there to more evenly distribute load.
>>>>>>>> But especially compared to Kudu upsert, managing mutable data using
>>>>>>>> Parquet is painful. So, perhaps to sum things up, if nearly 100% of
>>>>>>>> my metadata scans are single Primary Key lookups followed by a tiny
>>>>>>>> broadcast, then am I really just splitting hairs performance-wise
>>>>>>>> between Kudu and HDFS-cached parquet?
>>>>>>>>
>>>>>>>> From:  Todd Lipcon <todd@cloudera.com>
>>>>>>>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>> Date: Friday, March 16, 2018 at 2:51 PM
>>>>>>>>
>>>>>>>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>
>>>>>>>> It's worth noting that, even if your table is replicated, Impala's
>>>>>>>> planner is unaware of this fact and it will give the same plan
>>>>>>>> regardless. That is to say, rather than every node scanning its
>>>>>>>> local copy, a single node will perform the whole scan (assuming
>>>>>>>> it's a small table) and broadcast it from there within the scope of
>>>>>>>> a single query. So, I don't think you'll see any performance
>>>>>>>> improvements on Impala queries by attempting something like an
>>>>>>>> extremely high replication count.
>>>>>>>>
>>>>>>>> I could see bumping the replication count to 5 for these tables
>>>>>>>> since the extra storage cost is low and it will ensure higher
>>>>>>>> availability of the important central tables, but I'd be surprised
>>>>>>>> if there is any measurable perf impact.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <
>>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for that, glad I was wrong there! Aside from replication
>>>>>>>>> considerations, is it also recommended that the number of tablet
>>>>>>>>> servers be odd?
>>>>>>>>>
>>>>>>>>> I will check forums as you suggested, but what I read after
>>>>>>>>> searching is that Impala relies on user-configured caching
>>>>>>>>> strategies using HDFS cache.  The workload for these tables is
>>>>>>>>> very light write, maybe a dozen or so records per hour across 6 or
>>>>>>>>> 7 tables. The size of the tables ranges from thousands to low
>>>>>>>>> millions of rows, so sub-partitioning would not be required. So
>>>>>>>>> perhaps this is not a typical use-case but I think it could work
>>>>>>>>> quite well with kudu.
>>>>>>>>>
>>>>>>>>> From: Dan Burkert <danburkert@apache.org>
>>>>>>>>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>>> Date: Friday, March 16, 2018 at 2:09 PM
>>>>>>>>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>>
>>>>>>>>> The replication count is the number of tablet servers which Kudu
>>>>>>>>> will host copies on.  So if you set the replication level to 5,
>>>>>>>>> Kudu will put the data on 5 separate tablet servers.  There's no
>>>>>>>>> built-in broadcast table feature; upping the replication factor is
>>>>>>>>> the closest thing.  A couple of things to keep in mind:
>>>>>>>>>
>>>>>>>>> - Always use an odd replication count.  This is important due to
>>>>>>>>> how the Raft algorithm works.  Recent versions of Kudu won't even
>>>>>>>>> let you specify an even number without flipping some flags.
>>>>>>>>> - We don't test much beyond 5 replicas.  It *should* work, but you
>>>>>>>>> may run into issues since it's a relatively rare configuration.
>>>>>>>>> With a heavy write workload and many replicas you are even more
>>>>>>>>> likely to encounter issues.
>>>>>>>>>
>>>>>>>>> It's also worth checking in an Impala forum whether it has
>>>>>>>>> features that make joins against small broadcast tables better.
>>>>>>>>> Perhaps Impala can cache small tables locally when doing joins.
>>>>>>>>>
>>>>>>>>> - Dan
>>>>>>>>>
>>>>>>>>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>>>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>>>
>>>>>>>>>> The problem is, AFAIK, that replication count is not necessarily
>>>>>>>>>> the distribution count, so you can't guarantee all tablet servers
>>>>>>>>>> will have a copy.
>>>>>>>>>>
>>>>>>>>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <boris@boristyukin.com>
>>>>>>>>>> wrote:
>>>>>>>>>> I'm new to Kudu but we are also going to use Impala mostly with
>>>>>>>>>> Kudu. We have a few tables that are small but used a lot. My plan
>>>>>>>>>> is to replicate them more than 3 times. When you create a kudu
>>>>>>>>>> table, you can specify the number of replicated copies (3 by
>>>>>>>>>> default), and I guess you can put a number there corresponding to
>>>>>>>>>> your node count in the cluster. The downside is that you cannot
>>>>>>>>>> change that number unless you recreate the table.
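>>>>>>>>>>
>>>>>>>>>> I believe creating the table through Impala would look something
>>>>>>>>>> like this (just a sketch, names made up):
>>>>>>>>>>
>>>>>>>>>>   CREATE TABLE small_dim (
>>>>>>>>>>     id BIGINT PRIMARY KEY,
>>>>>>>>>>     name STRING
>>>>>>>>>>   )
>>>>>>>>>>   PARTITION BY HASH (id) PARTITIONS 2
>>>>>>>>>>   STORED AS KUDU
>>>>>>>>>>   TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');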
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <cresny@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> We will soon be moving our analytics from AWS Redshift to
>>>>>>>>>>> Impala/Kudu. One Redshift feature that we will miss is its ALL
>>>>>>>>>>> Distribution, where a copy of a table is maintained on each
>>>>>>>>>>> server. We define a number of metadata tables this way since
>>>>>>>>>>> they are used in nearly every query. We are considering using
>>>>>>>>>>> parquet in HDFS cache for these, and Kudu would be a much better
>>>>>>>>>>> fit for the update semantics but we are worried about the
>>>>>>>>>>> additional contention.  I'm wondering if having a Broadcast, or
>>>>>>>>>>> ALL, tablet replication might be an easy feature to add to Kudu?
>>>>>>>>>>>
>>>>>>>>>>> -Cliff
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Todd Lipcon
>>>>>>>> Software Engineer, Cloudera
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Todd Lipcon
>>>>>>> Software Engineer, Cloudera
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> MAURICIO ARISTIZABAL
>>>> Architect - Business Intelligence + Data Science
>>>> mauricio@impactradius.com | (m) +1 323 309 4260
>>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>>>
>>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>>> <https://twitter.com/impactradius> | Facebook
>>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Mauricio Aristizabal
>> Architect - Data Pipeline
>> M: 323 309 4260
>> E: mauricio@impact.com  |  W: https://impact.com
>> <https://www.linkedin.com/company/608678/>
>> <https://www.facebook.com/ImpactMarTech/>
>> <https://twitter.com/impactmartech>
>>
>
>


-- 
Mauricio Aristizabal
Architect - Data Pipeline
M: 323 309 4260
E: mauricio@impact.com  |  W: https://impact.com
<https://www.linkedin.com/company/608678/>
<https://www.facebook.com/ImpactMarTech/>
<https://twitter.com/impactmartech>
