flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Checkpointing with RocksDB as statebackend
Date Fri, 24 Feb 2017 16:43:56 GMT
Flink's state backends currently perform a good number of "make sure this
exists" operations on the file system. Through Hadoop's S3 filesystem, each
of those translates to an S3 bucket list operation, and S3 limits how many
such operations may happen per time interval. Once that limit is exceeded,
S3 blocks.
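For context, a setup that takes this path typically points the RocksDB backend at an s3:// URI, roughly like this in flink-conf.yaml (bucket name and path are illustrative, and the exact keys can differ between Flink versions):

```yaml
# State backend and checkpoint target (illustrative bucket/path).
# With this configuration, every checkpoint goes through Hadoop's S3
# filesystem, which issues the bucket list operations described above.
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://my-bucket/flink-checkpoints
```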

Operations that are practically free on HDFS can be hellishly expensive
(and rate-limited) on S3. It may be that you are affected by that.
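The effect of such a per-interval limit can be sketched with a toy model (self-contained and purely illustrative; nothing here is S3's actual algorithm): when a client must retry rejected requests in later time windows, the total time spent grows linearly with the number of otherwise-cheap metadata operations.

```java
// Toy model of a per-interval request limit: the "service" accepts at most
// LIMIT_PER_WINDOW requests per time window and rejects the rest; the client
// retries the rejected batch in the next window. All names are illustrative.
public class ThrottleSketch {
    static final int LIMIT_PER_WINDOW = 5; // requests allowed per window

    // Number of windows ("time") needed to complete `requests` operations
    // when every rejected request is retried in the following window.
    static int windowsToComplete(int requests) {
        int windows = 0;
        int remaining = requests;
        while (remaining > 0) {
            remaining -= Math.min(remaining, LIMIT_PER_WINDOW);
            windows++;
        }
        return windows;
    }

    public static void main(String[] args) {
        // 5 cheap "does this path exist?" checks fit in one window...
        System.out.println(windowsToComplete(5));  // 1
        // ...but 50 of them need 10 windows: the client spends most of its
        // time blocked, which is the stall observed during checkpoints.
        System.out.println(windowsToComplete(50)); // 10
    }
}
```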

We are gradually trying to improve the behavior there and make the backends
more S3-aware.

Both 1.3-SNAPSHOT and 1.2-SNAPSHOT already contain improvements there.

Best,
Stephan


On Fri, Feb 24, 2017 at 4:42 PM, vinay patil <vinay18.patil@gmail.com>
wrote:

> Hi Stephan,
>
> So do you mean that S3 is causing the stall? As I mentioned in my
> previous mail, I could not see any progress for 16 minutes because
> checkpoints were failing continuously.
>
> On Feb 24, 2017 8:30 PM, "Stephan Ewen [via Apache Flink User Mailing List
> archive.]" <[hidden email]> wrote:
>
>> Hi Vinay!
>>
>> True, the operator state (like Kafka) is currently not asynchronously
>> checkpointed.
>>
>> While that state is rather small, we have seen it cause trouble on S3
>> before, because S3's throttling policies frequently stall uploads of even
>> a few kilobytes.
>>
>> That would be a super important fix to add!
>>
>> Best,
>> Stephan
>>
>>
>> On Fri, Feb 24, 2017 at 2:58 PM, vinay patil <[hidden email]> wrote:
>>
>>> Hi,
>>>
>>> I have attached a snapshot for reference:
>>> As you can see, all 3 checkpoints failed; for checkpoint IDs 2 and 3 it
>>> is stuck at the Kafka source after 50%.
>>> (The data sent so far by Kafka source 1 is 65 GB and by source 2 is
>>> 15 GB.)
>>>
>>> Within 10 minutes, 15M records were processed; for the next 16 minutes
>>> the pipeline was stuck, and I don't see any progress beyond 15M because
>>> checkpoints keep failing consistently.
>>>
>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11882/Checkpointing_Failed.png>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11882.html
>>> Sent from the Apache Flink User Mailing List archive at Nabble.com.
>>>
>>
>>
>>
>
