spark-user mailing list archives

From Gerard Maas <gerard.m...@gmail.com>
Subject Re: Node affinity for Kafka-Direct Stream
Date Wed, 14 Oct 2015 20:19:27 GMT
Thanks! Indeed not a given.

I'm not sure we have the time to wait for nodes within a streaming
interval. I'll explore some alternatives. If I stumble on something
reasonable I'll report back.

-kr, Gerard.


On Wed, Oct 14, 2015 at 9:57 PM, Cody Koeninger <cody@koeninger.org> wrote:

> What I'm saying is that it's not a given with spark, even in
> receiver-based mode, because as soon as you lose an executor you'll have a
> rebalance.
>
> Spark's model in general isn't a good fit for pinning work to specific
> nodes.
>
> If you really want to try and fake this, you can override
> getPreferredLocations and set spark.locality.wait to a high value.
>
>
>
> On Wed, Oct 14, 2015 at 2:45 PM, Gerard Maas <gerard.maas@gmail.com>
> wrote:
>
>> Hi Cody,
>>
>> I think that I misused the term 'data locality'. I should rather call it
>> "node affinity" instead, as this is what I would like to have:
>> For as long as an executor is available, I would like to have the same
>> kafka partition processed by the same node in order to take advantage of
>> local in-memory structures.
>>
>> In the receiver-based mode this was a given. Any ideas how to achieve
>> that with the direct stream approach?
>>
>> -greetz, Gerard.
>>
>>
>> On Wed, Oct 14, 2015 at 4:31 PM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>>
>>> Assumptions about locality in spark are not very reliable, regardless of
>>> what consumer you use.  Even if you have locality preferences, and locality
>>> wait turned up really high, you still have to account for losing executors.
>>>
>>> On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas <gerard.maas@gmail.com>
>>> wrote:
>>>
>>>> Thanks Saisai, Mishra,
>>>>
>>>> Indeed, that hint will only work in the case where the Spark executor is
>>>> co-located with the Kafka broker.
>>>> I think the answer to my question as stated is that there's no
>>>> guarantee of where the task will execute, as it depends on the scheduler
>>>> and the cluster resources available (Mesos in our case).
>>>> Therefore, any assumptions made about data locality using the
>>>> consumer-based approach need to be reconsidered when migrating to the
>>>> direct stream.
>>>>
>>>> ((In our case, we were using local caches to decide when a given
>>>> secondary index for a record should be produced and written.))
>>>>
>>>> -kr, Gerard.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 14, 2015 at 2:58 PM, Saisai Shao <sai.sai.shao@gmail.com>
>>>> wrote:
>>>>
>>>>> This preferred locality is a hint to Spark to schedule Kafka tasks on
>>>>> the preferred nodes. If Kafka and Spark are two separate clusters, this
>>>>> locality hint obviously has no effect, and Spark will schedule the tasks
>>>>> following the node-local -> rack-local -> any pattern, like any other
>>>>> Spark tasks.
>>>>>
>>>>> On Wed, Oct 14, 2015 at 8:10 PM, Rishitesh Mishra <
>>>>> rmishra@snappydata.io> wrote:
>>>>>
>>>>>> Hi Gerard,
>>>>>> I am also trying to understand the same issue. From whatever code I have
>>>>>> seen, it looks like once the Kafka RDD is constructed, the execution of
>>>>>> that RDD is up to the task scheduler, and it can schedule the partitions
>>>>>> based on the load on nodes. There is a preferred node specified in the
>>>>>> Kafka RDD, but AFAIK it maps to the Kafka partition's host. So if Kafka
>>>>>> and Spark are co-hosted, this will probably work. If not, I am not sure
>>>>>> how to get data locality for a partition.
>>>>>> Others,
>>>>>> correct me if there is a way.
>>>>>>
>>>>>> On Wed, Oct 14, 2015 at 3:08 PM, Gerard Maas <gerard.maas@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In the receiver-based kafka streaming model, given that each
>>>>>>> receiver starts as a long-running task, one can rely on a certain
>>>>>>> degree of data locality based on the kafka partitioning: data published
>>>>>>> on a given topic/partition will land on the same spark streaming
>>>>>>> receiving node until the receiver dies and needs to be restarted
>>>>>>> somewhere else.
>>>>>>>
>>>>>>> As I understand, the direct-kafka streaming model just computes
>>>>>>> offsets and relays the work to a KafkaRDD. How does the execution
>>>>>>> locality compare to the receiver-based approach?
>>>>>>>
>>>>>>> thanks, Gerard.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Regards,
>>>>>> Rishitesh Mishra,
>>>>>> SnappyData . (http://www.snappydata.io/)
>>>>>>
>>>>>> https://in.linkedin.com/in/rishiteshmishra
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
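
The workaround Cody mentions (override getPreferredLocations and raise
spark.locality.wait) needs some rule that names a preferred host for each
Kafka partition, and that rule should keep working when an executor is lost.
Below is a minimal sketch of one such rule, assuming rendezvous hashing; that
technique is an illustrative choice and does not come from the thread. The
names `preferred_host` and `assignment` are hypothetical, and the code is
plain Python that does not call Spark at all.

```python
import hashlib


def _score(topic, partition, host):
    """Deterministic score for a (topic, partition, host) triple."""
    key = f"{topic}-{partition}@{host}".encode("utf-8")
    # Use the first 8 bytes of the SHA-256 digest as an unsigned integer.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def preferred_host(topic, partition, live_hosts):
    """Pick the live host with the highest score (rendezvous hashing).

    Returns None when no host is alive. The hosts are sorted first so the
    result does not depend on the order in which they are passed in.
    """
    if not live_hosts:
        return None
    return max(sorted(live_hosts), key=lambda h: _score(topic, partition, h))


def assignment(topic, num_partitions, live_hosts):
    """Map every partition index of a topic to its preferred host."""
    return {p: preferred_host(topic, p, live_hosts) for p in range(num_partitions)}


if __name__ == "__main__":
    hosts = ["exec-host-1", "exec-host-2", "exec-host-3"]  # hypothetical names
    before = assignment("events", 12, hosts)
    # Simulate losing one executor host.
    after = assignment("events", 12, [h for h in hosts if h != "exec-host-2"])
    moved = [p for p in before if before[p] != after[p]]
    print("partitions that moved:", moved)
```

With this rule, a partition keeps its host for as long as that host stays in
the live set; only the partitions that were on the lost host move. In a real
job the chosen host would be what the overridden getPreferredLocations
returns, and it would still be only a hint: the scheduler may run the task
elsewhere once spark.locality.wait expires.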
