flink-user mailing list archives

From "vprabhu@gmail.com" <vpra...@gmail.com>
Subject Re: Failed job restart - flink on yarn
Date Sat, 02 Jul 2016 05:09:49 GMT
Hi Jamie,

Thanks for the reply.

Yes, I looked at savepoints. I want to start my job only from the last
checkpoint, which means I would have to keep track of when the checkpoint
was taken and then trigger a savepoint. I am not sure this is the way to
go. My state backend is HDFS, and I can see that the checkpoint path has
the data that has been buffered in the window.

I want to start the job in a way such that it will read the checkpointed
data before the failure and continue processing.
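
To make the distinction concrete, here is a toy Python model of the two
restart behaviours described in this thread. It is not Flink code and does
not use any Flink API; the names `WindowJob` and `Checkpoint`, the window
size, and the input log are all hypothetical and chosen only for
illustration. It assumes a checkpoint captures the source offset together
with the events buffered in the window.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    """Hypothetical snapshot: source offset plus buffered window contents."""
    offset: int
    window_buffer: List[int]


class WindowJob:
    """Toy stand-in for a windowed streaming job (not Flink code)."""

    def __init__(self, log, window_size, offset=0, window_buffer=None):
        self.log = log
        self.window_size = window_size
        self.offset = offset
        self.window_buffer = list(window_buffer or [])
        self.results = []

    def run(self, steps):
        """Consume up to `steps` events, emitting a sum per full window."""
        for _ in range(steps):
            if self.offset >= len(self.log):
                break
            self.window_buffer.append(self.log[self.offset])
            self.offset += 1
            if len(self.window_buffer) == self.window_size:
                self.results.append(sum(self.window_buffer))
                self.window_buffer = []

    def checkpoint(self):
        return Checkpoint(self.offset, list(self.window_buffer))


log = list(range(1, 11))          # events 1..10
job = WindowJob(log, window_size=4)
job.run(6)                        # emits one window, then buffers 5 and 6
cp = job.checkpoint()             # taken just before the simulated failure

# Restart using only the committed offset: the buffered events are gone.
offsets_only = WindowJob(log, 4, offset=cp.offset)
offsets_only.run(len(log))

# Restart from the full checkpoint: offset and window contents together.
restored = WindowJob(log, 4, offset=cp.offset, window_buffer=cp.window_buffer)
restored.run(len(log))
```

In this sketch the offsets-only restart never sees events 5 and 6 again, so
its next window is built from different events than the restored job's
window, which matches the symptom of window contents seeming to get lost.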

I realise that the checkpoints are used whenever there is a container
failure and a new container is obtained. In my case the job failed because
container failures reached the maximum allowed number of failures.


On Fri, Jul 1, 2016 at 3:54 PM, Jamie Grier [via Apache Flink User Mailing
List archive.] <ml-node+s2336050n7767h97@n4.nabble.com> wrote:

> Hi Prabhu,
> Have you taken a look at Flink's savepoints feature?  This allows you to
> make snapshots of your job's state on demand and then at any time restart
> your job from that point:
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/savepoints.html
> Also know that you can use Flink's disk-backed state backend as well if
> your job state is larger than fits in memory.  See
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/state_backends.html#the-rocksdbstatebackend
> -Jamie
> On Fri, Jul 1, 2016 at 1:34 PM, [hidden email] wrote:
>> Hi,
>> I have a Flink streaming job that reads from Kafka and performs an
>> aggregation in a window. It ran fine for a while; however, when the number
>> of events in a window crossed a certain limit, the YARN containers failed
>> with Out Of Memory. The job was running with 10G containers.
>> We have about 64G memory on the machine and now I want to restart the job
>> with a 20G container (we ran some tests and 20G should be good enough to
>> accommodate all the elements from the window).
>> Is there a way to restart the job from the last checkpoint?
>> When I resubmit the job, it starts from the last committed offsets; however,
>> the events that were held in the window at the time of checkpointing seem
>> to get lost. Is there a way to recover the events that were buffered within
>> the window and checkpointed before the failure?
>> Thanks,
>> Prabhu
> --
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier <https://twitter.com/jamiegrier>
