samza-dev mailing list archives

From Jagadish Venkatraman <jagadish1...@gmail.com>
Subject Re: Samza Job Slow to Restart
Date Wed, 20 Sep 2017 19:21:41 GMT
Hi Xiaochuan,

>> What does that loop do exactly?

Most of what the run-loop does is documented in
https://samza.apache.org/learn/documentation/0.9/container/event-loop.html

>> We are running into a problem where it seems to take a very long time to
restart a Samza job.

Some follow-up questions:

- How long does it take?
- Have you measured which part of the start-up sequence takes the most time?
  Is it checkpoint restoration, or the restore of local state?
- If reading from the checkpoint topic takes the most time, I'd recommend
  reading that topic from the beginning and benchmarking how long it takes.
  It would also help to verify that the checkpoint topic is actually
  log-compacted.
- Do the containers eventually start, or does the start-up hang? If it hangs,
  a thread dump will be useful.
- Can you please link or attach the entire log file for us to take a look?
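
For a rough sense of scale before benchmarking, the offsets in the quoted log
(earliest 0, latest 42890599) can be turned into a back-of-the-envelope read
time. This is only a sketch: the function name and the read rate are
assumptions, not measurements from this job. On a log-compacted topic the
offsets keep their original values, so the offset range is an upper bound on
the number of messages actually read.

```python
def estimated_read_seconds(earliest_offset, latest_offset, messages_per_second):
    """Rough time to scan one partition from earliest_offset to latest_offset.

    The offset range is an upper bound on the message count: a compacted
    topic keeps gaps where older records were removed.
    """
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    backlog = max(latest_offset - earliest_offset, 0)
    return backlog / messages_per_second


# Offsets taken from the quoted log lines.
# The rate of 50,000 messages/s is a placeholder assumption; replace it
# with the rate you actually measure when reading the checkpoint topic.
seconds = estimated_read_seconds(0, 42890599, 50_000)
print(f"upper-bound read time: {seconds / 60:.1f} minutes")
```

If the measured time is far below this bound, compaction is probably working
and the time is going elsewhere in the start-up sequence.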

>> 3. Any ideas on how to fix this?

Perhaps we can try to narrow down where the time is spent during start-up
from the logs. Depending on that, I can suggest a fix :-)
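
One way to narrow this down from a container log is to look for the largest
gaps between consecutive timestamped lines. The sketch below assumes the
timestamp format seen in the quoted log lines ("2017-09-20 03:21:02.060 ...");
the function name and the file name in the usage note are hypothetical.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted log lines,
# e.g. "2017-09-20 03:21:02.060 INFO  ...".
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s")


def largest_gaps(lines, top=5):
    """Return the `top` largest gaps between consecutive timestamped lines.

    Each result is (gap_in_seconds, line_before_gap, line_after_gap).
    Lines without a leading timestamp (wrapped continuations, stack
    traces) are skipped.
    """
    stamped = []
    for line in lines:
        match = TS_PATTERN.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((ts, line.strip()))
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]
```

For example, with a log saved as container.log (a hypothetical file name),
`largest_gaps(open("container.log"))` lists the pairs of lines between which
the container spent the most time, which should show whether the delay is in
the checkpoint read or elsewhere.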

Thanks,
Jagadish

On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xiaochuan.yu@kik.com> wrote:

> Hi,
>
> We are running into a problem where it seems to take a very long time to
> restart a Samza job.
> We are using Samza 0.9.1 at the moment.
>
> From the logs for a particular container it looks like it has something to
> do with reading checkpoints from Kafka:
>
> 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager [main] - Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and partition 0. Attempting to fetch messages for checkpoint log.
> 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager [main] - Get latest offset 42890599 for topic __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
>
> Looking at this line in KafkaCheckpointManager
> <https://github.com/apache/samza/blob/0.9.1/samza-kafka/src/main/scala/org/apache/samza/checkpoint/kafka/KafkaCheckpointManager.scala#L275>,
> it seems to indicate that the loop iterates from 0 to 42890599 and makes a
> request for each offset.
>
> Questions:
> 1. What does that loop do exactly?
> 2. Is this an expected behaviour? Is "Got offset 0 for topic ..." normal?
> 3. Any ideas on how to fix this?
>
> Thanks,
> Xiaochuan Yu
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
