kudu-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Spark locality issue
Date Mon, 26 Jun 2017 15:53:44 GMT
Hi Pavel,

I think the whole Kudu/Spark story needs more attention. For example, Spark
SQL query plans don't have access to any Kudu statistics, so you can end up
with some really bad join decisions.

It feels like KUDU-1454 should be really easy to solve at this point. What
we need is to get the RDD to use CLOSEST_REPLICA and to set a propagated
timestamp, as Todd says in the JIRA. This is all stuff that's already done in
Impala's Kudu integration. If you wanted to see whether that solves your
problem, you could add the following code on this line
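
The snippet and line link referenced above did not survive in the archive. A rough sketch of the change being described, assuming the Kudu Java client's `ReplicaSelection` API is reachable from the point in kudu-spark's `KuduRDD` where the scanner is built (the `builder` value here is a stand-in for whatever scanner builder that code holds):

```scala
// Hypothetical sketch, not the original snippet from the email.
// Assumes the Kudu Java client's ReplicaSelection enum and the
// replicaSelection(...) method on the scanner builder.
import org.apache.kudu.client.{KuduScanner, ReplicaSelection}

// Stand-in for the scanner builder constructed inside KuduRDD:
val builder: KuduScanner.KuduScannerBuilder = ???

// Prefer the replica closest to the executor instead of always the
// leader, so a NODE_LOCAL task actually reads local data:
builder.replicaSelection(ReplicaSelection.CLOSEST_REPLICA)
```

The propagated-timestamp part mentioned below would be a separate change on the client side (e.g. carrying the client's last propagated timestamp across tasks); it matters for consistency, not for locality.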


The propagated timestamp part is also needed, but only for consistency
purposes; it won't affect the locality.


On Mon, Jun 26, 2017 at 12:59 AM, Pavel Martynov <mr.xkurt@gmail.com> wrote:

> Hi, guys!
> I am working on replacing our proprietary analytics platform, Microsoft PDW
> (aka Microsoft APS), with an open source alternative. I am currently
> experimenting with a Mesos/Spark/Kudu stack, and it looks promising.
> Recently I discovered some very strange behavior. The situation: I have a
> table with 50 tablets on a 5-server cluster, and I run a simple Spark
> rdd.count() against it. If the table has no replication, all is fine: every
> server runs the count aggregation on local data. But if the table has a
> replication factor > 1, I see (with the iftop utility) that Spark scans
> remote tablets, while the Spark UI still shows me tasks with locality
> NODE_LOCAL, which is not true.
> I found the issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark and
> MR jobs running without scan locality", which looks like my problem.
> IMHO, Kudu-Spark can't be considered production-ready with such an issue.
> Are there fundamental problems with fixing it?
> --
> with best regards, Pavel Martynov
