kudu-user mailing list archives

From Boris Tyukin <bo...@boristyukin.com>
Subject Re: "broadcast" tablet replication for kudu?
Date Mon, 23 Jul 2018 13:41:33 GMT
Sorry to revive the old thread, but I am curious whether there is a good way
to speed up requests to frequently used tables in Kudu.

On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <boris@boristyukin.com> wrote:

> Bummer... After reading you guys' conversation, I wish there were an easier
> way. We will have the same issue, as we have a few dozen tables which are
> used very frequently in joins, and I was hoping there was an easy way to
> replicate them on most of the nodes to avoid broadcasts every time
>
> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <cresnick@mediamath.com>
> wrote:
>
>> The table in our case is 12x hashed and ranged by month, so the
>> broadcasts were often to all (12) nodes.
>>
>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <mauricio@impact.com>
>> wrote:
>> Sorry, I left that out, Cliff. FWIW it does seem to have been broadcast.
>>
>> Not sure, though, how a shuffle would be much different from a broadcast
>> if the entire table is one file/block on one node.
>>
>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <cresny@gmail.com> wrote:
>>
>>> From the screenshot it does not look like there was a broadcast of the
>>> dimension table(s), so it could be the case here that the multiple
>>> smaller sends help. Our dim tables are generally in the single-digit
>>> millions of rows and Impala chooses to broadcast them. Since the fact
>>> result cardinality is always much smaller, we've found that forcing a
>>> [shuffle] dimension join is actually faster, since each dim row is sent
>>> only once rather than to every node. The degenerate performance of
>>> broadcast is especially obvious when the query returns zero results. I
>>> don't have much experience here, but it does seem that Kudu's efficient
>>> predicate scans can sometimes "break" Impala's query plan.
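>>>
>>> (For concreteness, a minimal sketch of the hint we mean; the table and
>>> column names here are made up for illustration:
>>>
>>>     SELECT f.day, COUNT(*)
>>>     FROM fact_events f
>>>     JOIN [SHUFFLE] dim_campaign d ON f.campaign_id = d.id
>>>     WHERE d.region = 'US'
>>>     GROUP BY f.day;
>>>
>>> Without the hint, Impala's planner typically picks BROADCAST when the
>>> build side is small.)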
>>>
>>> -Cliff
>>>
>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <
>>> mauricio@impact.com> wrote:
>>>
>>>> @Todd not to belabor the point, but when I suggested breaking up small
>>>> dim tables into multiple parquet files (and, in this thread's context,
>>>> perhaps partitioning a Kudu table, even a small one, into multiple
>>>> tablets), it was to speed up joins/exchanges, not to parallelize the
>>>> scan.
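>>>>
>>>> (As a sketch of what I mean for the Kudu case, with a hypothetical
>>>> schema, a small dimension can be spread across several tablets at
>>>> creation time:
>>>>
>>>>     CREATE TABLE dim_campaign (
>>>>       id BIGINT PRIMARY KEY,
>>>>       name STRING
>>>>     )
>>>>     PARTITION BY HASH (id) PARTITIONS 4
>>>>     STORED AS KUDU;
>>>>
>>>> Each tablet can then feed a separate scan fragment, so the exchange and
>>>> join can run on more nodes.)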
>>>>
>>>> For example, recently we ran into a slow query where a 14M-record
>>>> dimension fit into a single file & block, so it was scanned on a single
>>>> node, though still pretty quickly (300ms); however, it caused the join
>>>> to take 25+ seconds and bogged down the entire query. See the
>>>> highlighted fragment and its parent.
>>>>
>>>> So we broke it into several small files the way I described in my
>>>> previous post, and now the join and the query are fast (6s).
>>>>
>>>> -m
>>>>
>>>>
>>>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> I suppose in the case that the dimension table scan makes up a
>>>>> non-trivial portion of your workload time, then yea, parallelizing the
>>>>> scan as you suggest would be beneficial. That said, in typical
>>>>> analytic queries, scanning the dimension tables is very quick compared
>>>>> to scanning the much-larger fact tables, so the extra parallelism on
>>>>> the dim table scan isn't worth too much.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <
>>>>> mauricio@impactradius.com> wrote:
>>>>>
>>>>>> @Todd working with parquet in the past, I've seen small dimensions
>>>>>> that fit in a single file/block limit the parallelism of
>>>>>> join/exchange/aggregation nodes, and I've forced those dims to spread
>>>>>> across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or
>>>>>> similar when doing INSERT OVERWRITE to load them, which then allows
>>>>>> these operations to parallelize across that many nodes.
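>>>>>>
>>>>>> (Concretely, something like the following; the table names are just
>>>>>> placeholders:
>>>>>>
>>>>>>     SET PARQUET_FILE_SIZE=8m;
>>>>>>     INSERT OVERWRITE dim_table SELECT * FROM dim_table_staging;
>>>>>>
>>>>>> With a dimension of roughly 160MB this yields about 20 files, and
>>>>>> hence up to 20 scan fragments.)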
>>>>>>
>>>>>> Wouldn't it be useful here for Cliff's small dims to be partitioned
>>>>>> into a couple tablets to similarly improve parallelism?
>>>>>>
>>>>>> -m
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <todd@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <cresny@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Todd,
>>>>>>>>
>>>>>>>> Thanks for that explanation, as well as all the great work you're
>>>>>>>> doing -- it's much appreciated! I just have one last follow-up
>>>>>>>> question. Reading about BROADCAST operations (Kudu, Spark, Flink,
>>>>>>>> etc.) it seems the smaller table is always copied in its entirety
>>>>>>>> BEFORE the predicate is evaluated.
>>>>>>>>
>>>>>>>
>>>>>>> That's not quite true. If you have a predicate on a joined column,
>>>>>>> or on one of the columns in the joined table, it will be pushed down
>>>>>>> to the "scan" operator, which happens before the "exchange". In
>>>>>>> addition, there is a feature called "runtime filters" that can push
>>>>>>> dynamically-generated filters from one side of the exchange to the
>>>>>>> other.
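>>>>>>>
>>>>>>> (To illustrate with hypothetical tables: in
>>>>>>>
>>>>>>>     SELECT f.amount, d.name
>>>>>>>     FROM fact f JOIN dim d ON f.dim_id = d.id
>>>>>>>     WHERE d.category = 'foo';
>>>>>>>
>>>>>>> the d.category = 'foo' predicate is evaluated inside the scan of
>>>>>>> dim, before any rows cross the exchange, and a runtime filter built
>>>>>>> from the surviving d.id values can be applied to the scan of fact on
>>>>>>> the other side.)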
>>>>>>>
>>>>>>>
>>>>>>>> But since the Kudu client provides a serialized scanner as part of
>>>>>>>> the ScanToken API, why wouldn't Impala use that instead if it knows
>>>>>>>> that the table is Kudu and the query has any type of predicate?
>>>>>>>> Perhaps if I hash-partition the table I could maybe force this
>>>>>>>> (because that complicates a BROADCAST)? I guess this is really a
>>>>>>>> question for Impala but perhaps there is a more basic reason.
>>>>>>>>
>>>>>>>
>>>>>>> Impala could definitely be smarter; it's just a matter of
>>>>>>> programming Kudu-specific join strategies into the optimizer. Today,
>>>>>>> the optimizer isn't aware of the unique properties of Kudu scans vs
>>>>>>> other storage mechanisms.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> -Cliff
>>>>>>>>
>>>>>>>> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <todd@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>>>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>>>
>>>>>>>>>> I thought I had read that the Kudu client can configure a scan
>>>>>>>>>> for CLOSEST_REPLICA and assumed this was a way to take advantage
>>>>>>>>>> of data collocation.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yea, when a client uses CLOSEST_REPLICA it will read a local one
>>>>>>>>> if available. However, that doesn't influence the higher level
>>>>>>>>> operation of the Impala (or Spark) planner. The planner isn't
>>>>>>>>> aware of the replication policy, so it will use one of the
>>>>>>>>> existing supported JOIN strategies. Given statistics, it will
>>>>>>>>> choose to broadcast the small table, which means that it will
>>>>>>>>> create a plan that looks like:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                 +-------------------------+
>>>>>>>>>                                 |                         |
>>>>>>>>>                      +---------->build      JOIN          |
>>>>>>>>>                      |          |                         |
>>>>>>>>>                      |          |              probe      |
>>>>>>>>>               +--------------+  +-------------------------+
>>>>>>>>>               |              |                 |
>>>>>>>>>               | Exchange     |                 |
>>>>>>>>>          +----+ (broadcast)  |                 |
>>>>>>>>>          |    |              |                 |
>>>>>>>>>          |    +--------------+                 |
>>>>>>>>>          |                                     |
>>>>>>>>>       +---------+                              |
>>>>>>>>>       |         |                  +-----------------------+
>>>>>>>>>       |  SCAN   |                  |                       |
>>>>>>>>>       |  KUDU   |                  |   SCAN (other side)   |
>>>>>>>>>       |         |                  |                       |
>>>>>>>>>       +---------+                  +-----------------------+
>>>>>>>>> (hopefully the ASCII art comes through)
>>>>>>>>>
>>>>>>>>> In other words, the "scan kudu" operator scans the table once,
>>>>>>>>> and then replicates the results of that scan into the JOIN
>>>>>>>>> operator. The "scan kudu" operator of course will read its local
>>>>>>>>> copy, but it will still go through the exchange process.
>>>>>>>>>
>>>>>>>>> For the use case you're talking about, where the join is just
>>>>>>>>> looking up a single row by PK in a dimension table, ideally we'd
>>>>>>>>> be using an altogether different join strategy such as nested-loop
>>>>>>>>> join, with the inner "loop" actually being a Kudu PK lookup, but
>>>>>>>>> that strategy isn't implemented by Impala.
>>>>>>>>>
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> If this exists, then how far out of context is my understanding
>>>>>>>>>> of it? Reading about HDFS cache replication, I do know that
>>>>>>>>>> Impala will choose a random replica there to more evenly
>>>>>>>>>> distribute load. But especially compared to Kudu upsert, managing
>>>>>>>>>> mutable data using Parquet is painful. So, perhaps to sum things
>>>>>>>>>> up: if nearly 100% of my metadata scans are single primary-key
>>>>>>>>>> lookups followed by a tiny broadcast, then am I really just
>>>>>>>>>> splitting hairs performance-wise between Kudu and HDFS-cached
>>>>>>>>>> parquet?
>>>>>>>>>>
>>>>>>>>>> From:  Todd Lipcon <todd@cloudera.com>
>>>>>>>>>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>>>> Date: Friday, March 16, 2018 at 2:51 PM
>>>>>>>>>>
>>>>>>>>>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>>>
>>>>>>>>>> It's worth noting that, even if your table is replicated,
>>>>>>>>>> Impala's planner is unaware of this fact and it will give the
>>>>>>>>>> same plan regardless. That is to say, rather than every node
>>>>>>>>>> scanning its local copy, a single node will perform the whole
>>>>>>>>>> scan (assuming it's a small table) and broadcast it from there
>>>>>>>>>> within the scope of a single query. So, I don't think you'll see
>>>>>>>>>> any performance improvements on Impala queries by attempting
>>>>>>>>>> something like an extremely high replication count.
>>>>>>>>>>
>>>>>>>>>> I could see bumping the replication count to 5 for these tables,
>>>>>>>>>> since the extra storage cost is low and it will ensure higher
>>>>>>>>>> availability of the important central tables, but I'd be
>>>>>>>>>> surprised if there is any measurable perf impact.
>>>>>>>>>>
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <
>>>>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for that, glad I was wrong there! Aside from replication
>>>>>>>>>>> considerations, is it also recommended that the number of
>>>>>>>>>>> tablet servers be odd?
>>>>>>>>>>>
>>>>>>>>>>> I will check forums as you suggested, but from what I read
>>>>>>>>>>> after searching, Impala relies on user-configured caching
>>>>>>>>>>> strategies using HDFS cache. The write workload for these
>>>>>>>>>>> tables is very light, maybe a dozen or so records per hour
>>>>>>>>>>> across 6 or 7 tables. The size of the tables ranges from
>>>>>>>>>>> thousands to low millions of rows, so sub-partitioning would
>>>>>>>>>>> not be required. So perhaps this is not a typical use-case, but
>>>>>>>>>>> I think it could work quite well with Kudu.
>>>>>>>>>>>
>>>>>>>>>>> From: Dan Burkert <danburkert@apache.org>
>>>>>>>>>>> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>>>>> Date: Friday, March 16, 2018 at 2:09 PM
>>>>>>>>>>> To: "user@kudu.apache.org" <user@kudu.apache.org>
>>>>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>>>>
>>>>>>>>>>> The replication count is the number of tablet servers which
>>>>>>>>>>> Kudu will host copies on.  So if you set the replication level
>>>>>>>>>>> to 5, Kudu will put the data on 5 separate tablet servers.
>>>>>>>>>>> There's no built-in broadcast table feature; upping the
>>>>>>>>>>> replication factor is the closest thing.  A couple of things to
>>>>>>>>>>> keep in mind:
>>>>>>>>>>>
>>>>>>>>>>> - Always use an odd replication count.  This is important due
>>>>>>>>>>> to how the Raft algorithm works.  Recent versions of Kudu won't
>>>>>>>>>>> even let you specify an even number without flipping some flags
>>>>>>>>>>> (see the sketch below).
>>>>>>>>>>> - We don't test much beyond 5 replicas.  It *should* work, but
>>>>>>>>>>> you may run into issues since it's a relatively rare
>>>>>>>>>>> configuration.  With a heavy write workload and many replicas
>>>>>>>>>>> you are even more likely to encounter issues.
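>>>>>>>>>>>
>>>>>>>>>>> (For reference, a sketch of setting the factor at creation time
>>>>>>>>>>> through Impala; the schema here is hypothetical, but
>>>>>>>>>>> kudu.num_tablet_replicas is the relevant table property:
>>>>>>>>>>>
>>>>>>>>>>>     CREATE TABLE small_dim (
>>>>>>>>>>>       id BIGINT PRIMARY KEY,
>>>>>>>>>>>       name STRING
>>>>>>>>>>>     )
>>>>>>>>>>>     PARTITION BY HASH (id) PARTITIONS 2
>>>>>>>>>>>     STORED AS KUDU
>>>>>>>>>>>     TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');
>>>>>>>>>>>
>>>>>>>>>>> Note the odd count, per the first point above.)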
>>>>>>>>>>>
>>>>>>>>>>> It's also worth checking in an Impala forum whether it has
>>>>>>>>>>> features that make joins against small broadcast tables better.
>>>>>>>>>>> Perhaps Impala can cache small tables locally when doing joins.
>>>>>>>>>>>
>>>>>>>>>>> - Dan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>>>>>>>>>>> cresnick@mediamath.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The problem is, AFAIK, that the replication count is not
>>>>>>>>>>>> necessarily the distribution count, so you can't guarantee
>>>>>>>>>>>> that all tablet servers will have a copy.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <boris@boristyukin.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> I'm new to Kudu, but we are also going to use Impala mostly
>>>>>>>>>>>> with Kudu. We have a few tables that are small but used a lot.
>>>>>>>>>>>> My plan is to replicate them more than 3 times. When you
>>>>>>>>>>>> create a Kudu table, you can specify the number of replicated
>>>>>>>>>>>> copies (3 by default), and I guess you can put there a number
>>>>>>>>>>>> corresponding to your node count in the cluster. The downside:
>>>>>>>>>>>> you cannot change that number unless you recreate the table.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <
>>>>>>>>>>>> cresny@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We will soon be moving our analytics from AWS Redshift to
>>>>>>>>>>>>> Impala/Kudu. One Redshift feature that we will miss is its
>>>>>>>>>>>>> ALL Distribution, where a copy of a table is maintained on
>>>>>>>>>>>>> each server. We define a number of metadata tables this way
>>>>>>>>>>>>> since they are used in nearly every query. We are considering
>>>>>>>>>>>>> using parquet in HDFS cache for these, and Kudu would be a
>>>>>>>>>>>>> much better fit for the update semantics, but we are worried
>>>>>>>>>>>>> about the additional contention. I'm wondering if having a
>>>>>>>>>>>>> Broadcast, or ALL, tablet replication might be an easy
>>>>>>>>>>>>> feature to add to Kudu?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Cliff
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Todd Lipcon
>>>>>>>>>> Software Engineer, Cloudera
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Todd Lipcon
>>>>>>>>> Software Engineer, Cloudera
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Todd Lipcon
>>>>>>> Software Engineer, Cloudera
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *MAURICIO ARISTIZABAL*
>>>>>> Architect - Business Intelligence + Data Science
>>>>>> mauricio@impactradius.com | (m) +1 323 309 4260
>>>>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>>>>>
>>>>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>>>>> <https://twitter.com/impactradius> | Facebook
>>>>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>>>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mauricio Aristizabal
>>>> Architect - Data Pipeline
>>>> *M * 323 309 4260
>>>> *E  *mauricio@impact.com  |  *W * https://impact.com
>>>> <https://www.linkedin.com/company/608678/>
>>>> <https://www.facebook.com/ImpactMarTech/>
>>>> <https://twitter.com/impactmartech>
>>>>
>>>
>>>
>>
>>
>> --
>> Mauricio Aristizabal
>> Architect - Data Pipeline
>> *M * 323 309 4260
>> *E  *mauricio@impact.com  |  *W * https://impact.com
>> <https://www.linkedin.com/company/608678/>
>> <https://www.facebook.com/ImpactMarTech/>
>> <https://twitter.com/impactmartech>
>>
>
>
