samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yi Pan <nickpa...@gmail.com>
Subject Re: Samza Job Slow to Restart
Date Thu, 19 Oct 2017 22:10:06 GMT
Awesome that you have figured it out! Just a general notice: any logcompact
topic used in Samza may see this slow-down if the Kafka log cleaner thread
dies, which include checkpoint, coordinator stream, and changelog topics.

Best!

-Yi

On Thu, Oct 19, 2017 at 12:14 PM, XiaoChuan Yu <xiaochuan.yu@kik.com> wrote:

> Hi,
>
> We were finally able to find out why the job takes so long to start.
> There was higher than normal network IO during job startup and so we
> checked size of the checkpoint topic on disk and it was ~21GB.
> We then restarted the Kafka node who was the leader for the checkpoint
> topic, the topic disk size went down to ~1.8GB and the job started up
> fairly quickly.
> Its probably due to a bug in Kafka where log cleaner died and we never
> noticed: https://issues.apache.org/jira/browse/KAFKA-3894.
> We have since been working on upgrading Kafka to avoid this bug.
> Hope this helps if anyone else ever runs into it.
>
> Thanks,
> Xiaochuan Yu
>
> On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <xiaochuan.yu@kik.com> wrote:
>
> > >> How long does it take?
> > It took around 10 minute from "Got offset 0 for topic <checkpoint topic>
> > ..." to init() being called on the Task.
> >
> > >> Have you measured which parts of the start up sequence take the most
> > time?
> > >> - is it checkpoint restoration, or restore of local state?
> > Should be checkpoint restoration. There is no local state for this job.
> >
> > >> If reading from the checkpoint topic takes the most time, then I'd
> > >> recommend reading from the beginning from that topic, and benchmarking
> > how
> > >> long it takes? It'll also help to verify if the checkpoint topic is
> > >> actually log-compacted.
> > I'm not sure how to verify how much the topic is compacted by Kafka.
> > The cleanup policy is to compact though.
> >
> > >> Do containers eventually start? Or does the start-up hang?
> > >> If so, a thread dump will be useful.
> > It does eventually start up.
> >
> > >> Can you please link and attach the entire log file for us to take a
> > look?
> > Unfortunately there is too much stuff for me to redact from the log right
> > now.
> > However, I can tell you that the job has two input topics both with the
> > following settings:
> > systems.kafka.streams.my-special-topic.samza.reset.offset=true
> > systems.kafka.streams.my-special-topic.samza.offset.default=upcoming
> > It was thought that this would speedup startup of the job to no avail.
> >
> > On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <
> > jagadish1989@gmail.com> wrote:
> >
> >> Hi Xiaochuan,
> >>
> >> >> What does that loop do exactly?
> >>
> >> Most of what the run-loop does is documented in
> >> https://samza.apache.org/learn/documentation/0.9/
> container/event-loop.html
> >>
> >> >> We are running into a problem where it seems to take a very long time
> >> to
> >> restart a Samza job.
> >>
> >> Some follow-up questions,
> >>
> >> How long does it take?
> >> Have you measured which parts of the start up sequence take the most
> time?
> >> - is it checkpoint restoration, or restore of local state?
> >> If reading from the checkpoint topic takes the most time, then I'd
> >> recommend reading from the beginning from that topic, and benchmarking
> how
> >> long it takes? It'll also help to verify if the checkpoint topic is
> >> actually log-compacted.
> >> Do containers eventually start? Or does the start-up hang? If so, a
> thread
> >> dump will be useful.
> >> Can you please link and attach the entire log file for us to take a
> look?
> >>
> >> >> 3. Any ideas on how to fix this?
> >>
> >> We can perhaps, try to narrow down where the time is spent in startup
> from
> >> the logs? Depending on that, I can suggest a fix :-)
> >>
> >> Thanks,
> >> Jagadish
> >>
> >> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xiaochuan.yu@kik.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > We are running into a problem where it seems to take a very long time
> to
> >> > restart a Samza job.
> >> > We are using Samza 0.9.1 at the moment.
> >> >
> >> > From the logs for a particular container it looks like it has
> something
> >> to
> >> > do with reading checkpoints from Kafka:
> >> >
> >> > 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> >> [main]
> >> > -
> >> > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and
> >> > partition 0. Attempting to fetch messages for checkpoint log.
> >> > 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> >> [main]
> >> > -
> >> > Get latest offset 42890599 for topic
> >> > __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
> >> >
> >> > Looking at this line in KafkaCheckpointManager
> >> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/
> >> > src/main/scala/org/apache/samza/checkpoint/kafka/
> >> > KafkaCheckpointManager.scala#L275>,
> >> > it seems to indicate that the loop iterates from 0 to 42890599 and
> make
> >> > requests for each.
> >> >
> >> > Questions:
> >> > 1. What does that loop do exactly?
> >> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..."
> >> normal?
> >> > 3. Any ideas on how to fix this?
> >> >
> >> > Thanks,
> >> > Xiaochuan Yu
> >> >
> >>
> >>
> >>
> >> --
> >> Jagadish V,
> >> Graduate Student,
> >> Department of Computer Science,
> >> Stanford University
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message