kudu-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Spark locality issue
Date Mon, 26 Jun 2017 16:18:08 GMT
On Mon, Jun 26, 2017 at 8:53 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Hi Pavel,
> I think the whole Kudu/Spark story needs more attention. For example, Spark
> SQL query plans don't have access to any Kudu stats, so you can end up with
> some really bad join decisions.
> It feels like KUDU-1454 should be really easy to solve at this point. What
> we need is to get the RDD to use CLOSEST_REPLICA and to set a propagated
> timestamp like Todd says in the jira. This is all stuff that's done in
> Impala's integration for Kudu. If you wanted to see if that solves your
> problem you could add the following code on this line:
> http://github.mtv.cloudera.com/CDH/kudu/blob/cdh5-trunk/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226

Of course I meant a link more like this

> builder.replicaSelection(ReplicaSelection.CLOSEST_REPLICA);
> The propagated timestamp part is also needed, but only for consistency
> purposes; it won't affect the locality.
> J-D
> On Mon, Jun 26, 2017 at 12:59 AM, Pavel Martynov <mr.xkurt@gmail.com>
> wrote:
>> Hi, guys!
>> I am working on replacing the proprietary analytics platform Microsoft PDW
>> (aka Microsoft APS) at my company with an open source alternative.
>> Currently, I am experimenting with a Mesos/Spark/Kudu stack, and it looks
>> promising.
>> Recently I discovered some very strange behavior. Situation: I have a table
>> on a 5-server cluster with 50 tablets and run a simple Spark rdd.count()
>> against it. If the table has no replication, all is fine: every server runs
>> the count aggregation on local data. But if the table has replication > 1,
>> I see (with the iftop utility) that Spark scans remote tablets, while the
>> Spark UI still shows tasks with locality NODE_LOCAL, which is not true.
>> I found issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark and
>> MR jobs running without scan locality" which looks like my problem.
>> IMHO Kudu-Spark can't be considered production-ready with such an
>> issue. Are there fundamental problems with fixing that issue?
>> --
>> with best regards, Pavel Martynov
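
The mismatch described in this thread (tasks reported as NODE_LOCAL while the data is read over the network) can be illustrated with a small standalone model. This is only a sketch and not Kudu client code. It assumes that a task may be scheduled on any host holding a replica, that a leader-only scan always reads from the leader, and that a closest-replica scan reads from the local replica when one exists. The class name, the method names, the convention that the first host in the list is the leader, and the fallback to the leader when no local replica exists are all hypothetical simplifications. The Selection enum only mirrors the name CLOSEST_REPLICA quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Standalone model of replica selection; it does not call the Kudu client.
public class ReplicaLocalitySketch {

    // Hypothetical stand-in for the replica selection modes discussed in the thread.
    enum Selection { LEADER_ONLY, CLOSEST_REPLICA }

    // Returns the host that serves the scan.
    // Assumption: replicaHosts.get(0) is the tablet leader.
    static String scanHost(List<String> replicaHosts, String taskHost, Selection selection) {
        if (selection == Selection.CLOSEST_REPLICA && replicaHosts.contains(taskHost)) {
            return taskHost; // a local replica exists, so read from it
        }
        // Simplification: fall back to the leader in every other case.
        return replicaHosts.get(0);
    }

    // True when the scan is served by a host other than the one running the task.
    static boolean isRemoteRead(List<String> replicaHosts, String taskHost, Selection selection) {
        return !scanHost(replicaHosts, taskHost, selection).equals(taskHost);
    }

    public static void main(String[] args) {
        // One tablet with replication 3; node1 is the leader.
        List<String> replicas = Arrays.asList("node1", "node2", "node3");
        // The scheduler placed the task on a follower, which it counts as node-local.
        String taskHost = "node2";
        for (Selection s : Selection.values()) {
            System.out.println(s + ": scan served by " + scanHost(replicas, taskHost, s)
                    + ", remote=" + isRemoteRead(replicas, taskHost, s));
        }
    }
}
```

With replication of 1 the only replica is also the leader, so both modes read locally in this model, which matches the observation that unreplicated tables behave fine.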
