kudu-user mailing list archives

From Pavel Martynov <mr.xk...@gmail.com>
Subject Re: Spark locality issue
Date Wed, 28 Jun 2017 04:46:08 GMT
Thanks, this line fixed locality for me.
I agree with you that Kudu/Spark needs more attention, because, you
know, here in 2017 Spark looks like the default "weapon of choice" for
analytic data aggregations :)

2017-06-26 19:18 GMT+03:00 Jean-Daniel Cryans <jdcryans@apache.org>:

> On Mon, Jun 26, 2017 at 8:53 AM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>> Hi Pavel,
>> I think the whole Kudu/Spark story needs more attention. For example,
>> Spark SQL query plans don't have access to any Kudu stats, so you can end up
>> with some really bad join decisions.
>> It feels like KUDU-1454 should be really easy to solve at this point.
>> What we need is to get the RDD to use CLOSEST_REPLICA and to set a
>> propagated timestamp like Todd says in the jira. This is all stuff that's
>> done in Impala's integration for Kudu. If you wanted to see if that solves
>> your problem you could add the following code on this line
>> http://github.mtv.cloudera.com/CDH/kudu/blob/cdh5-trunk/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226
> Of course I meant a link more like this https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226
>> builder.replicaSelection(ReplicaSelection.CLOSEST_REPLICA);
>> The propagated timestamp part is also needed but only for consistency
>> purposes, it won't affect the locality.
>> J-D
>> On Mon, Jun 26, 2017 at 12:59 AM, Pavel Martynov <mr.xkurt@gmail.com>
>> wrote:
>>> Hi, guys!
>>> I am working on replacing a proprietary analytic platform, Microsoft PDW (aka
>>> Microsoft APS), at my company with an open source alternative. Currently I am
>>> experimenting with a Mesos/Spark/Kudu stack and it looks promising.
>>> Recently I discovered some very strange behavior. The situation: I have a table
>>> on a 5-server cluster with 50 tablets and run a simple Spark rdd.count() against
>>> it. If the table has no replication, all is fine: every server runs the count
>>> aggregation on local data. But if the table has replication > 1, I see
>>> (with the iftop utility) that Spark scans remote tablets, while the Spark UI still
>>> shows me tasks with locality NODE_LOCAL, which is not true.
>>> I found issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark
>>> and MR jobs running without scan locality" which looks like my problem.
>>> IMHO Kudu-Spark can't be considered production-ready with such an
>>> issue. Are there fundamental problems with fixing it?
>>> --
>>> with best regards, Pavel Martynov

with best regards, Pavel Martynov
