storm-user mailing list archives

From Harsha <st...@harsha.io>
Subject Re: Storm Kafka Processing
Date Tue, 03 Feb 2015 14:47:14 GMT

Vineet, in the Kafka producer.send(KeyedMessage<Id, Message>), are you
passing in an ID? If it is constant or null, your data won't be spread
evenly across all partitions: with a constant ID all of your messages go
to the same partition, and with a null ID the producer chooses the
partitions itself rather than partitioning by your key. It's better to
use a random UUID as the key to distribute messages among all of your
partitions.
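For illustration, here is a minimal sketch using the 0.8 producer API;
the broker list, the topic name "test", and the payload below are
placeholders, not values from this thread:

import java.util.Properties;
import java.util.UUID;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class UuidKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; adjust for your cluster.
        props.put("metadata.broker.list", "host1:9092,host2:9092,host3:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        // A random UUID key lets the default partitioner hash each message
        // to a different partition instead of pinning them to one.
        producer.send(new KeyedMessage<String, String>(
                "test", UUID.randomUUID().toString(), "some event"));
        producer.close();
    }
}

-Harsha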


On Tue, Feb 3, 2015, at 12:44 AM, Vineet Mishra wrote:
> Do you mean to say that the events published to Kafka are not being
> distributed across partitions?
>
> Well, while creating the topic I made sure to create it with 10
> partitions and a replication factor of 3.
>
> Is it affected by how I am writing to Kafka?
>
> Thanks!
>
> On Tue, Feb 3, 2015 at 1:50 PM, Andrew Neilson
> <arsneilson@gmail.com> wrote:
>> The behaviour you are describing sounds like your topology is
>> processing a small backlog of events built up in each partition and
>> then catching up to realtime where events are only being published to
>> one of the 10 partitions at a time. I will echo Harsha in suggesting
>> that you verify you are actually publishing to all partitions
>> (important: this is *not* the default behaviour).
>>
>> On Tue, Feb 3, 2015 at 12:05 AM, Vineet Mishra
>> <clearmidoubt@gmail.com> wrote:
>>> Hi Harsha,
>>>
>>> Following the suggestion, I made the specified changes by switching
>>> the Kafka-Storm version bundle.
>>>
>>> Although I could see a difference between the previous bundle and the
>>> current one, I was still not satisfied with the way the spouts were
>>> processing. My observation was the following:
>>>
>>> The spout was running with an executor count of 10. When the job
>>> started, around half of the executors (5) were processing in parallel
>>> to ingest the data.
>>>
>>> As soon as the count reached around a million or so, the degree of
>>> parallelism dropped and eventually it was processing serially (one
>>> executor at a time).
>>>
>>> Executors (All time):
>>>
>>> Id       Uptime   Host   Port  Emitted  Transferred  Complete latency (ms)  Acked   Failed
>>> [2-2]    13m 54s  host3  6703        0            0                  0.000       0       0
>>> [3-3]    13m 52s  host2  6702   318300       318300                  4.789  318160       0
>>> [4-4]    13m 52s  host3  6702   434200       434200                  7.064  434380       0
>>> [5-5]    13m 53s  host2  6701       20           20                  0.000       0       0
>>> [6-6]    13m 55s  host3  6701        0            0                  0.000       0       0
>>> [7-7]    13m 51s  host2  6700    25000        25000                  4.122   24500       0
>>> [8-8]    13m 51s  host3  6700   248360       248360                  9.514  245780       0
>>> [9-9]    13m 52s  host2  6703        0            0                  0.000       0       0
>>> [10-10]  13m 54s  host3  6703   235220       235220                  9.250  233200       0
>>> [11-11]  13m 52s  host2  6702   204420       204420                 10.382  205800       0
>>>
>>> I have around 0.2 billion events ingested into Kafka that need to be
>>> processed through Storm in real time, but I am not sure what is
>>> causing this unexpected intermittent behavior in Storm or how I can
>>> prevent it in the future.
>>>
>>> Expecting Expert Suggestions.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Mon, Feb 2, 2015 at 11:53 PM, Vineet Mishra
>>> <clearmidoubt@gmail.com> wrote:
>>>> Well, I am already running Kafka with 10 partitions and a replication
>>>> factor of 3, which matches the size of my cluster.
>>>>
>>>> bin/kafka-topics.sh --create --zookeeper
>>>> host1:2181,host2:2181,host3:2181 --replication-factor 3
>>>> --partitions 10 --topic test
>>>>
>>>> and I am also running the Kafka Storm topology with an executor count
>>>> of 10:
>>>>
>>>> TopologyBuilder builder = new TopologyBuilder();
>>>> builder.setSpout("KafkaSpout", new KafkaSpout(kafkaConfig), 10);
>>>>
>>>> I have a notion that ever since I changed the topic from its earlier*
>>>> replication factor and number of partitions, I have been running into
>>>> this latency.
>>>>
>>>> * bin/kafka-topics.sh --create --zookeeper
>>>>   host1:2181,host2:2181,host3:2181 --replication-factor 1
>>>>   --partitions 1 --topic test
>>>>
>>>> Well, I will try the Storm Kafka bundle provided above. Hopefully
>>>> that will help!
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Feb 2, 2015 at 10:30 PM, Harsha <storm@harsha.io> wrote:
>>>>> Vineet, can you try using the one in Storm:
>>>>> https://github.com/apache/storm/tree/master/external/storm-kafka .
>>>>> It is published to the Maven repo, so you can use the following
>>>>> dependency:
>>>>>
>>>>> <dependency>
>>>>>   <groupId>org.apache.storm</groupId>
>>>>>   <artifactId>storm-kafka</artifactId>
>>>>>   <version>0.9.3</version>
>>>>> </dependency>
>>>>>
>>>>> If you are using a topic with 10 partitions, make sure you configure
>>>>> your kafka spout with parallelism set to 10. Also make sure that on
>>>>> the producer side you are pushing data onto all of the 10 partitions
>>>>> so that your kafka spout is fetching data from all of them.
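>>>>>
>>>>> As a rough sketch (not the exact code from this thread), wiring the
>>>>> storm-kafka spout against a 10-partition topic could look like this;
>>>>> the ZooKeeper connect string, topic name, zkRoot, and consumer id
>>>>> below are placeholders:
>>>>>
>>>>> import backtype.storm.spout.SchemeAsMultiScheme;
>>>>> import backtype.storm.topology.TopologyBuilder;
>>>>> import storm.kafka.KafkaSpout;
>>>>> import storm.kafka.SpoutConfig;
>>>>> import storm.kafka.StringScheme;
>>>>> import storm.kafka.ZkHosts;
>>>>>
>>>>> // Brokers are discovered through ZooKeeper (placeholder connect string).
>>>>> ZkHosts zkHosts = new ZkHosts("host1:2181,host2:2181,host3:2181");
>>>>> // Topic "test"; zkRoot and id are used to store the spout's offsets.
>>>>> SpoutConfig spoutConfig =
>>>>>         new SpoutConfig(zkHosts, "test", "/kafka-spout", "storm-kafka-test");
>>>>> spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
>>>>>
>>>>> TopologyBuilder builder = new TopologyBuilder();
>>>>> // Parallelism hint of 10 gives one spout executor per partition.
>>>>> builder.setSpout("KafkaSpout", new KafkaSpout(spoutConfig), 10);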
>>>>>
>>>>> -Harsha
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 2, 2015, at 08:55 AM, Vineet Mishra wrote:
>>>>>> Hi Harsha,
>>>>>>
>>>>>> I am using the storm.kafka.KafkaSpout implementation from
>>>>>>
>>>>>> https://github.com/wurstmeister/storm-kafka-0.8-plus
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Mon, Feb 2, 2015 at 8:14 PM, Harsha <storm@harsha.io> wrote:
>>>>>>> Vineet, Which kafka spout are you using?
>>>>>>>
>>>>>>> -Harsha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 2, 2015, at 05:25 AM, Vineet Mishra wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am running the Kafka-Storm engine to process real-time data
>>>>>>>> generated on a 3-node distributed cluster.
>>>>>>>>
>>>>>>>> Currently I have set 10 executors for the Storm spout, which I
>>>>>>>> don't think are running in parallel. Moreover, earlier I was
>>>>>>>> running the Kafka topology with a replication factor and partition
>>>>>>>> count of 1 (which seems to have run comparatively faster); now
>>>>>>>> that I have set the replication factor to 3 and the partitions to
>>>>>>>> 10, I can see the performance degradation.
>>>>>>>>
>>>>>>>> Is there any way I can fully utilize the available resources and
>>>>>>>> get the maximum throughput of event processing?
>>>>>>>>
>>>>>>>> Looking urgently for expert suggestions.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

