spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Node affinity for Kafka-Direct Stream
Date Wed, 14 Oct 2015 14:31:08 GMT
Assumptions about locality in Spark are not very reliable, regardless of
which consumer you use.  Even if you have locality preferences and the
locality wait turned up really high, you still have to account for losing
executors.
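
For reference, here is a minimal sketch of the locality-wait settings
mentioned above (standard Spark configuration keys; the values are
illustrative, not recommendations):

    import org.apache.spark.SparkConf

    // Raising the locality wait makes the scheduler hold a task longer for
    // a preferred executor before falling back to less-local levels. It is
    // only a preference: if the preferred executor is lost, the task simply
    // runs elsewhere.
    val conf = new SparkConf()
      .set("spark.locality.wait", "10s")       // default wait per locality level
      .set("spark.locality.wait.node", "10s")  // wait before giving up NODE_LOCAL
      .set("spark.locality.wait.rack", "5s")   // wait before giving up RACK_LOCAL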

On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Thanks Saisai, Mishra,
>
> Indeed, that hint will only work in the case where the Spark executor is
> co-located with the Kafka broker.
> I think the answer to my question as stated is that there's no guarantee
> of where the task will execute, as it will depend on the scheduler and the
> cluster resources available (Mesos in our case).
> Therefore, any assumptions made about data locality using the
> consumer-based approach need to be reconsidered when migrating to the
> direct stream.
>
> ((In our case, we were using local caches to decide when a given secondary
> index for a record should be produced and written.))
>
> -kr, Gerard.
>
>
>
>
> On Wed, Oct 14, 2015 at 2:58 PM, Saisai Shao <sai.sai.shao@gmail.com>
> wrote:
>
>> This preferred locality is a hint to Spark to schedule Kafka tasks on the
>> preferred nodes. If Kafka and Spark are two separate clusters, this
>> locality hint obviously has no effect, and Spark will schedule tasks
>> following the node-local -> rack-local -> any pattern, just like any other
>> Spark tasks.
>>
>> On Wed, Oct 14, 2015 at 8:10 PM, Rishitesh Mishra <rmishra@snappydata.io>
>> wrote:
>>
>>> Hi Gerard,
>>> I am also trying to understand the same issue. From whatever code I have
>>> seen, it looks like once the KafkaRDD is constructed, the execution of
>>> that RDD is up to the task scheduler, and it can schedule the partitions
>>> based on the load on the nodes. There is a preferred node specified in
>>> the KafkaRDD, but AFAIK it maps to the Kafka partition's host. So if
>>> Kafka and Spark are co-hosted, this will probably work. If not, I am not
>>> sure how to get data locality for a partition.
>>> Others,
>>> correct me if there is a way.
>>>
>>> On Wed, Oct 14, 2015 at 3:08 PM, Gerard Maas <gerard.maas@gmail.com>
>>> wrote:
>>>
>>>> In the receiver-based Kafka streaming model, given that each receiver
>>>> starts as a long-running task, one can rely on a certain degree of data
>>>> locality based on the Kafka partitioning: data published on a given
>>>> topic/partition will land on the same Spark Streaming receiving node
>>>> until the receiver dies and needs to be restarted somewhere else.
>>>>
>>>> As I understand it, the direct-Kafka streaming model just computes
>>>> offsets and relays the work to a KafkaRDD. How does the execution
>>>> locality compare to the receiver-based approach?
>>>>
>>>> thanks, Gerard.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>> Rishitesh Mishra,
>>> SnappyData (http://www.snappydata.io/)
>>>
>>> https://in.linkedin.com/in/rishiteshmishra
>>>
>>
>>
>
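
To make the locality hint discussed in this thread concrete, here is a
minimal sketch of the mechanism: an RDD may return preferred hosts per
partition (the KafkaRDD of the direct approach returns the Kafka partition
leader's host), but this is only a scheduling hint. The class and field
names below are illustrative, not the actual spark-streaming-kafka classes:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Illustrative partition type carrying the Kafka leader host for a
    // topic/partition (the real KafkaRDD keeps equivalent metadata).
    case class LeaderAwarePartition(index: Int, leaderHost: String)
      extends Partition

    // An RDD can only *hint* where its partitions should run by returning
    // preferred hosts. If no executor runs on that host (e.g. Kafka and
    // Spark are separate clusters), the scheduler falls back to rack-local
    // and then to any executor, as described above.
    class LeaderAwareRDD(sc: SparkContext, parts: Array[LeaderAwarePartition])
      extends RDD[String](sc, Nil) {

      override def getPartitions: Array[Partition] = parts.toArray[Partition]

      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[LeaderAwarePartition].leaderHost)

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator.empty // fetching records from the broker would happen here
    }

If Spark executors do run on the broker hosts, tasks for those partitions
will usually be scheduled NODE_LOCAL; otherwise the hint is simply ignored.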
