storm-dev mailing list archives

From "Curtis Allen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (STORM-399) Kafka Spout defaulting to latest offset when current offset is older then 100k
Date Wed, 09 Jul 2014 21:33:07 GMT
Curtis Allen created STORM-399:
----------------------------------

             Summary: Kafka Spout defaulting to latest offset when current offset is older then 100k
                 Key: STORM-399
                 URL: https://issues.apache.org/jira/browse/STORM-399
             Project: Apache Storm (Incubating)
          Issue Type: Bug
    Affects Versions: 0.9.2-incubating
            Reporter: Curtis Allen
            Priority: Minor


Using storm and storm-kafka 0.9.2-incubating

In the storm-kafka spout, the default for maxOffsetBehind is 100000;
see https://github.com/apache/incubator-storm/blob/v0.9.2-incubating/external/storm-kafka/src/jvm/storm/kafka/KafkaConfig.java#L38

This default is too low. When the last committed offset is more than 100000
behind, the Kafka spout starts from the latest offset instead of the last
committed offset, and it does so without any warning;
see https://github.com/apache/incubator-storm/blob/v0.9.2-incubating/external/storm-kafka/src/jvm/storm/kafka/PartitionManager.java#L95

This produces the following log output from the Storm worker processes:

2014-07-09 18:02:15 s.k.PartitionManager [INFO] Read last commit offset from zookeeper: 15266940; old topology_id: ef3f1f89-f64c-4947-b6eb-0c7fb9adb9ea - new topology_id: 5747dba6-c947-4c4f-af4a-4f50a84817bf
2014-07-09 18:02:15 s.k.PartitionManager [INFO] Last commit offset from zookeeper: 15266940
2014-07-09 18:02:15 s.k.PartitionManager [INFO] Commit offset 22092614 is more than 100000 behind, resetting to startOffsetTime=-2
2014-07-09 18:02:15 s.k.PartitionManager [INFO] Starting Kafka prd-use1c-pr-08-kafka-kamq-0004:4 from offset 22092614
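
For illustration, the reset seen in this log can be reproduced with a simplified sketch of the threshold check. This is not the actual PartitionManager source: the class name OffsetResetSketch and the method resolveStartOffset are hypothetical, and the sketch assumes the spout jumps to the latest offset, as described in this report. Only the offsets and the 100000 default are taken from the log and from KafkaConfig.

```java
public class OffsetResetSketch {

    /**
     * Hypothetical helper (not part of storm-kafka): returns the offset the
     * spout would start from, given the last committed offset, the offset it
     * would jump to on a reset, and the maxOffsetBehind threshold.
     */
    public static long resolveStartOffset(long committedOffset,
                                          long latestOffset,
                                          long maxOffsetBehind) {
        long lag = latestOffset - committedOffset;
        if (lag > maxOffsetBehind) {
            // Too far behind: skip ahead and drop everything in between.
            return latestOffset;
        }
        // Within the threshold: resume from the committed offset.
        return committedOffset;
    }

    public static void main(String[] args) {
        long committed = 15266940L; // "Last commit offset from zookeeper"
        long latest = 22092614L;    // "Starting Kafka ... from offset"

        // Default threshold: the lag of 6825674 exceeds 100000, so the
        // spout skips ahead. Prints 22092614.
        System.out.println(resolveStartOffset(committed, latest, 100000L));

        // With Long.MAX_VALUE the lag can never exceed the threshold, so
        // the committed offset is kept. Prints 15266940.
        System.out.println(resolveStartOffset(committed, latest, Long.MAX_VALUE));
    }
}
```

With the values from the log, the gap between the two offsets is 6825674 messages, all of which are skipped under the default threshold.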

To work around this problem, I ended up setting the spout config in my topology like so:

spoutConf.maxOffsetBehind = Long.MAX_VALUE;

Why would the Kafka spout skip to the latest offset by default if the
current offset is more than 100000 behind?

This seems like a bad default value. The spout literally skipped over
months of data without any warning.

Are the core contributors open to accepting a pull request that would set
the default to Long.MAX_VALUE?




--
This message was sent by Atlassian JIRA
(v6.2#6252)
