Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
Date: Thu, 1 Sep 2016 11:22:24 -0700 (PDT)
From: vinay patil <vinay18.patil@gmail.com>
To: user@flink.apache.org
Message-ID: <CAMpYU5SZH0fFzSY3jWUEkwJN1u09NCQkM0FNxg6ycPt3ijeh=g@mail.gmail.com>
In-Reply-To: <CAAdrtT0LgJfQJVQwgym9ACa9uTdmuWORNC_ZdXkWAMNFHUOz_w@mail.gmail.com>
References: <CAMpYU5SSMEgq13eeJAia99gmBqjirNEq7RKc7FpiJ=QBH_E7Jw@mail.gmail.com> <CAAdrtT1QLGZgYumo92ggKZvnWWNimVo2Chc6SZcbmbr+b_EVYw@mail.gmail.com> <CAMpYU5SuTQ9SpRDtYqM3U-j8j3P2kQuKmi6fri1j7zyJkLg3tA@mail.gmail.com> <CAAdrtT0vnm1n85EfEHCDD0LUuCcYtFFEuEpvPAkXtUHGE6swJg@mail.gmail.com> <CAMpYU5RKrFPd+nFrT-g3dva6=d-f8Mmj-Lfex0aFDR_gRzg7Nw@mail.gmail.com> <CAAdrtT0LgJfQJVQwgym9ACa9uTdmuWORNC_ZdXkWAMNFHUOz_w@mail.gmail.com>
Subject: Re: Streaming - memory management
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_110734_1786378467.1472754144898"
archived-at: Thu, 01 Sep 2016 18:25:00 -0000

------=_Part_110734_1786378467.1472754144898
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I don't want to join a third stream.

And yes, this is also what I was thinking of:
s1.union(s2).keyBy().window().apply(// outerjoin).keyBy.flatMap(// backup join)


I have already finished the Cassandra integration, but I feel RocksDB will
be the better option. I will have to take care of clearing the state, as you
suggested; I will check the documentation for that.
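
A minimal sketch of the backup join with explicit clearing, in plain Java.
This is not Flink API: the HashMap only stands in for keyed state, and the
names BackupJoinSketch, onUnmatched and evictOlderThan are hypothetical. It
assumes one buffered record per key and uses String records in place of the
DTO. The age-based eviction is an added assumption for keys whose partner
never arrives, not something the thread prescribes.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;

/** Plain-Java model of the backup join step; the map stands in for keyed state. */
class BackupJoinSketch {

    /** A buffered non-matched record and the time it was stored. */
    private static final class Pending {
        final String record;
        final long storedAtMillis;

        Pending(String record, long storedAtMillis) {
            this.record = record;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final Map<String, Pending> pendingByKey = new HashMap<>();

    /**
     * Handles one non-matched record. If a partner with the same key is
     * waiting, the pair is returned and the entry for that key is cleared;
     * otherwise the record is buffered and nothing is returned.
     */
    Optional<String[]> onUnmatched(String key, String record, long nowMillis) {
        Pending partner = pendingByKey.remove(key); // read and clear in one step
        if (partner == null) {
            pendingByKey.put(key, new Pending(record, nowMillis));
            return Optional.empty();
        }
        return Optional.of(new String[] {partner.record, record});
    }

    /** Drops entries stored more than maxAgeMillis ago; returns how many were dropped. */
    int evictOlderThan(long maxAgeMillis, long nowMillis) {
        int evicted = 0;
        Iterator<Pending> it = pendingByKey.values().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().storedAtMillis > maxAgeMillis) {
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    /** Number of keys that are still waiting for a partner. */
    int pendingCount() {
        return pendingByKey.size();
    }
}
```

Removing the entry at the moment of the match keeps the store bounded by the
number of keys that are still waiting, which is the clearing concern raised
earlier in the thread.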

I have a DTO with almost 50 fields. Converting it to JSON and storing that
as state should not be a problem, or is there no harm in storing the DTO
itself?

To avoid confusion, I think the documentation should state explicitly that
state is retained for user-defined operators.

Regards,
Vinay Patil

On Thu, Sep 1, 2016 at 1:12 PM, Fabian Hueske-2 [via Apache Flink User
Mailing List archive.] <ml-node+s2336050n8843h85@n4.nabble.com> wrote:

> I thought you would like to join the non-matched elements with another
> (third) stream.
>
> --> s1.union(s2).keyBy().window().apply(// outerjoin).keyBy.connect(s3.keyBy).coFlatMap(//
> backup join)
>
> If you want to match the non-matched stream with itself, a FlatMapFunction
> is the right choice.
>
> --> s1.union(s2).keyBy().window().apply(// outerjoin).keyBy.flatMap(//
> backup join)
>
> The backup join puts all non-matched elements into state and waits for
> another non-matched element with the same key to do the join.
>
> Best, Fabian
>
>
>
> 2016-09-01 19:55 GMT+02:00 vinay patil <[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=8843&i=0>>:
>
>> Yes, that's what I am looking for.
>>
>> But why use a CoFlatMapFunction? I already have the
>> matchingAndNonMatching stream from the union of the two streams, with
>> the outer-join logic in the apply method.
>>
>> I am thinking of keying matchingAndNonMatching by the same key and using
>> a flatMap to take care of the remaining logic.
>>
>> Or are you suggesting that I use a CoFlatMapFunction after the outer-join
>> operation (I mean after the window, once I have the
>> matchingAndNonMatching stream)?
>>
>> Regards,
>> Vinay Patil
>>
>> On Thu, Sep 1, 2016 at 11:38 AM, Fabian Hueske-2 [via Apache Flink User
>> Mailing List archive.] <[hidden email]
>> <http:///user/SendEmail.jtp?type=node&node=8842&i=0>> wrote:
>>
>>> Thanks for the explanation. I think I understood your use case.
>>>
>>> Yes, I'd go for the RocksDB approach in a CoFlatMapFunction on a keyed
>>> stream (keyed by join key).
>>> One input would be the unmatched outer join records, the other input
>>> would serve the events you want to match them with.
>>> Retrieving elements from RocksDB will be local and should be fast.
>>>
>>> You should be confident, though, that all unmatched records will be picked
>>> up at some point (RocksDB persists to disk, so you won't run out of memory,
>>> but the snapshot size will increase).
>>> The future state expiry feature will avoid such situations.
>>>
>>> Best, Fabian
>>>
>>> 2016-09-01 18:29 GMT+02:00 vinay patil <[hidden email]
>>> <http:///user/SendEmail.jtp?type=node&node=8837&i=0>>:
>>>
>>>> Hi Fabian,
>>>>
>>>> I had already used the CoGroup function earlier but was getting some
>>>> issues with watermarks (for one use case I was not getting the correct
>>>> result), so I used the union operator to perform the outer join
>>>> (a WindowFunction on a KeyedStream). This approach is working
>>>> correctly and giving me correct results.
>>>>
>>>> As discussed, I want to keep the non-matching records in some store.
>>>> That is why I was thinking of using RocksDB as the store here: I will
>>>> maintain user-defined state after the outer-join window operator and
>>>> query it in Flink to check whether a value for a particular key is
>>>> present. If it is, I can match the records and send the result
>>>> downstream.
>>>>
>>>> The final goal is to have zero non-matching records, so this is the
>>>> backup plan to handle edge-case scenarios.
>>>>
>>>> I have already integrated code to write to Cassandra using the Flink
>>>> connector, but I think this will be a better option than querying an
>>>> external store: since RocksDB stores the data on the local TM disk,
>>>> retrieval will be faster than with Cassandra, right?
>>>>
>>>> What do you think?
>>>>
>>>>
>>>> Regards,
>>>> Vinay Patil
>>>>
>>>> On Thu, Sep 1, 2016 at 10:19 AM, Fabian Hueske-2 [via Apache Flink User
>>>> Mailing List archive.] <[hidden email]
>>>> <http:///user/SendEmail.jtp?type=node&node=8836&i=0>> wrote:
>>>>
>>>>> Hi Vinay,
>>>>>
>>>>> can you give a bit more detail about how you plan to implement the
>>>>> outer join? Using a WindowFunction or a CoFlatMapFunction on a KeyedStream?
>>>>>
>>>>> An alternative could be to use a CoGroup operator which collects from
>>>>> two inputs all elements that share a common key (the join key) and are in
>>>>> the same window. The interface of the function provides two iterators over
>>>>> the elements of both inputs and can be used to implement outer join
>>>>> functionality. The benefit of working with a CoGroupFunction is that you do
>>>>> not have to take care of state handling at all.
>>>>>
>>>>> If you go for a custom implementation, you will need to work with
>>>>> operator state.
>>>>> However, you do not need to interact with RocksDB directly. Flink
>>>>> takes care of that for you.
>>>>>
>>>>> Best, Fabian
>>>>>
>>>>> 2016-09-01 16:13 GMT+02:00 vinay patil <[hidden email]
>>>>> <http:///user/SendEmail.jtp?type=node&node=8832&i=0>>:
>>>>>
>>>>>> Hi Fabian/Stephan,
>>>>>>
>>>>>> Waiting for your suggestion
>>>>>>
>>>>>> Regards,
>>>>>> Vinay Patil
>>>>>>
>>>>>> On Wed, Aug 31, 2016 at 1:46 PM, Vinay Patil <[hidden email]
>>>>>> <http:///user/SendEmail.jtp?type=node&node=8829&i=0>> wrote:
>>>>>>
>>>>>>> Hi Fabian/Stephan,
>>>>>>>
>>>>>>> This makes things clear.
>>>>>>>
>>>>>>> This is my use case:
>>>>>>> I am performing an outer join on two streams (in a window), which
>>>>>>> gives me matchingAndNonMatchingStream. I want to make sure that the
>>>>>>> matching rate is high (matching cannot happen if one of the sources
>>>>>>> is not emitting elements for a certain time). To tackle this
>>>>>>> situation I was thinking of using RocksDB as the state backend and
>>>>>>> inserting the unmatched records into it (the key is the same as the
>>>>>>> one used for the window and the value is the DTO). Before inserting
>>>>>>> a record I will check whether its key is already present in RocksDB;
>>>>>>> if so, I will take the data from it and send it downstream (and make
>>>>>>> sure I clean up the state for that key).
>>>>>>> (The stored data should also be encrypted; the encryption part can
>>>>>>> be handled.)
>>>>>>>
>>>>>>> So instead of using Cassandra, can I do this using RocksDB as the
>>>>>>> state backend, since the state is not gone after checkpointing?
>>>>>>>
>>>>>>> P.S. I have kept the watermark behind by 1500 secs just to be safe in
>>>>>>> handling late elements, but to tackle edge cases like the one
>>>>>>> mentioned above we have a backup plan of using Cassandra as an
>>>>>>> external store, since we are dealing with critical financial data.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vinay Patil
>>>>>>>
>>>>>>> On Wed, Aug 31, 2016 at 11:34 AM, Fabian Hueske <[hidden email]
>>>>>>> <http:///user/SendEmail.jtp?type=node&node=8829&i=1>> wrote:
>>>>>>>
>>>>>>>> Hi Vinay,
>>>>>>>>
>>>>>>>> if you use user-defined state, you have to manually clear it.
>>>>>>>> Otherwise, it will stay in the state backend (heap or RocksDB) until
>>>>>>>> the job goes down (planned or due to an OOM error).
>>>>>>>>
>>>>>>>> This is esp. important to keep in mind when using keyed state.
>>>>>>>> If you have an unbounded, evolving key space, you will likely run
>>>>>>>> out of memory.
>>>>>>>> The job will constantly add state for each new key but won't be able
>>>>>>>> to clean up the state for "expired" keys.
>>>>>>>>
>>>>>>>> You could implement a clean-up mechanism for this if you implement a
>>>>>>>> custom stream operator.
>>>>>>>> However, this is a very low-level interface and requires a solid
>>>>>>>> understanding of internals like timestamps, watermarks, and the
>>>>>>>> checkpointing mechanism.
>>>>>>>>
>>>>>>>> The community is currently working on a state expiry feature (state
>>>>>>>> will be discarded if not requested or updated for x minutes).
>>>>>>>>
>>>>>>>> Regarding the second question: Does state remain local after
>>>>>>>> checkpointing?
>>>>>>>> Yes, the local state is only copied to the remote FS (HDFS, S3, ...)
>>>>>>>> but remains in the operator. So the state is not gone after a
>>>>>>>> checkpoint is completed.
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>> Fabian
>>>>>>>>
>>>>>>>> 2016-08-31 18:17 GMT+02:00 Vinay Patil <[hidden email]
>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=8829&i=2>>:
>>>>>>>>
>>>>>>>> > Hi Stephan,
>>>>>>>> >
>>>>>>>> > Just wanted to jump into this discussion regarding state.
>>>>>>>> >
>>>>>>>> > So do you mean that if we maintain user-defined state (for
>>>>>>>> > non-window operators) and do not clear it explicitly, the data
>>>>>>>> > for that key remains in RocksDB?
>>>>>>>> >
>>>>>>>> > What happens on a checkpoint? I read in the documentation that
>>>>>>>> > after the checkpoint happens the RocksDB data is pushed to the
>>>>>>>> > desired location (HDFS, S3, or another FS). So for user-defined
>>>>>>>> > state, does the data still remain in RocksDB after the checkpoint?
>>>>>>>> >
>>>>>>>> > Correct me if I have misunderstood this concept.
>>>>>>>> >
>>>>>>>> > For one of our use cases we were going for this, but after reading
>>>>>>>> > the above part in the documentation we are going for Cassandra now
>>>>>>>> > (to store records and query them for a special case).
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Vinay Patil
>>>>>>>> >
>>>>>>>> > On Wed, Aug 31, 2016 at 4:51 AM, Stephan Ewen <[hidden email]
>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=8829&i=3>> wrote:
>>>>>>>> >
>>>>>>>> > > In streaming, memory is mainly needed for state (key/value
>>>>>>>> > > state). The exact representation depends on the chosen
>>>>>>>> > > StateBackend.
>>>>>>>> > >
>>>>>>>> > > State is explicitly released: For windows, state is cleaned up
>>>>>>>> > > automatically (firing / expiry), for user-defined state, keys
>>>>>>>> > > have to be explicitly cleared (clear() method) or in the future
>>>>>>>> > > will have the option to expire.
>>>>>>>> > >
>>>>>>>> > > The heavy work horse for streaming state is currently RocksDB,
>>>>>>>> > > which internally uses native (off-heap) memory to keep the data.
>>>>>>>> > >
>>>>>>>> > > Does that help?
>>>>>>>> > >
>>>>>>>> > > Stephan
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > On Tue, Aug 30, 2016 at 11:52 PM, Roshan Naik <[hidden email]
>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=8829&i=4>>
>>>>>>>> > > wrote:
>>>>>>>> > >
>>>>>>>> > > > As per the docs, in batch mode, dynamic memory allocation is
>>>>>>>> > > > avoided by storing messages being processed in ByteBuffers via
>>>>>>>> > > > Unsafe methods.
>>>>>>>> > > >
>>>>>>>> > > > I couldn't find any docs describing memory management in
>>>>>>>> > > > streaming mode. So...
>>>>>>>> > > >
>>>>>>>> > > > - I am wondering if this is also the case with streaming?
>>>>>>>> > > >
>>>>>>>> > > > - If so, how does Flink detect that an object is no longer
>>>>>>>> > > > being used and can be reclaimed for reuse?
>>>>>>>> > > >
>>>>>>>> > > > -roshan
>>>>>>>> > > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>> View this message in context: Re: Streaming - memory management
>>>>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Streaming-memory-management-tp8829.html>
>>>>>> Sent from the Apache Flink User Mailing List archive. mailing list
>>>>>> archive
>>>>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/>
>>>>>> at Nabble.com.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>


--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Streaming-memory-management-tp8829p8845.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
------=_Part_110734_1786378467.1472754144898
Content-Type: text/html; charset=UTF8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div><span style=3D"font-size:12.8px">I don&#39;t to join =
the third stream.</span></div><div><span style=3D"font-size:12.8px"><br></s=
pan></div><div><span style=3D"font-size:12.8px">And Yes,=C2=A0</span><span =
style=3D"font-size:12.8px">This is what I was thinking of.also :=C2=A0<br><=
/span><span style=3D"font-size:12.8px">s1.union(s2).keyBy().window().</span=
><wbr style=3D"font-size:12.8px"><span style=3D"font-size:12.8px">apply(// =
outerjoin).keyBy.flatMap(// backup join)</span></div><div><br></div><div><s=
pan style=3D"font-size:12.8px"><br></span></div><div><span style=3D"font-si=
ze:12.8px">I am already done integrating with Cassandra but I feel RocksDB =
will be a better option, I will have to take care of the clearing part as y=
ou have suggested, will check that in documentation.</span></div><div><span=
 style=3D"font-size:12.8px"><br></span></div><div><span style=3D"font-size:=
12.8px">I have the DTO with almost 50 fields , converting it to JSON and st=
oring it as a state should not be a problem , or there is no harm in storin=
g the DTO ?</span></div><div><span style=3D"font-size:12.8px"><br></span></=
div><div><span style=3D"font-size:12.8px">I think the documentation should =
specify the point that the state will be maintained for user-defined operat=
ors to avoid confusion.</span></div></div><div class=3D"gmail_extra"><br cl=
ear=3D"all"><div><div class=3D"gmail_signature" data-smartmail=3D"gmail_sig=
nature"><div dir=3D"ltr"><div><div dir=3D"ltr"><font color=3D"#000000">Rega=
rds,</font><div><font color=3D"#000000">Vinay Patil</font></div></div></div=
></div></div></div>
<br><div class=3D"gmail_quote">On Thu, Sep 1, 2016 at 1:12 PM, Fabian Huesk=
e-2 [via Apache Flink User Mailing List archive.] <span dir=3D"ltr">&lt;<a =
href=3D"/user/SendEmail.jtp?type=3Dnode&node=3D8845&i=3D0" target=3D"_top" =
rel=3D"nofollow" link=3D"external">[hidden email]</a>&gt;</span> wrote:<br>=
<blockquote style=3D'border-left:2px solid #CCCCCC;padding:0 1em' class=3D"=
gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-=
left:1ex">

=09<div dir=3D"ltr"><div>I thought you would like to join the non-matched e=
lements with another (third) stream. <br><br></div><div>--&gt; s1.union(s2)=
.keyBy().window().<wbr>apply(// outerjoin).keyBy.connect(s3.<wbr>keyBy).coF=
latMap(// backup join)<br><br></div><div>If you want to match the non-match=
ed stream with itself a FlatMapFunction is the right choice.<br><br>--&gt; =
s1.union(s2).keyBy().window().<wbr>apply(// outerjoin).keyBy.flatMap(// bac=
kup join)<br><br></div><div>The backup join puts all non-match elements in =
the state and waits for another non-matched element with the same key to do=
 the join.<br></div><div><br></div>Best, Fabian<br><div><br><br></div></div=
><div class=3D"gmail_extra"><br><div class=3D"gmail_quote"><span class=3D""=
>2016-09-01 19:55 GMT+02:00 vinay patil <span dir=3D"ltr">&lt;<a href=3D"ht=
tp:///user/SendEmail.jtp?type=3Dnode&amp;node=3D8843&amp;i=3D0" rel=3D"nofo=
llow" link=3D"external" target=3D"_blank">[hidden email]</a>&gt;</span>:<br=
></span><blockquote style=3D'border-left:2px solid #CCCCCC;padding:0 1em' s=
tyle=3D"border-left:2px solid #cccccc;padding:0 1em" class=3D"gmail_quote">=
<span class=3D""><div dir=3D"ltr">Yes, that&#39;s what I am looking for.<di=
v><br></div><div>But why to use CoFlatMapFunction , I have already got the =
matchingAndNonMatching Stream , by doing the union of two streams and havin=
g the logic in apply method for performing outer-join.</div><div><br></div>=
<div>I am thinking of applying the same key on matchingAndNonMatching and f=
latmap to take care of rest logic.</div><div><br></div><div>Or are you sugg=
estion to use Co-FlatMapFunction after the outer-join operation =C2=A0(I me=
an after doing the window and getting=C2=A0matchingAndNonMatching stream )?=
</div></div></span><div class=3D"gmail_extra"><br clear=3D"all"><div><div d=
ata-smartmail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><f=
ont color=3D"#000000">Regards,</font><div><font color=3D"#000000">Vinay Pat=
il</font></div></div></div></div></div></div><div><div class=3D"h5">
<br><div class=3D"gmail_quote"><span>On Thu, Sep 1, 2016 at 11:38 AM, Fabia=
n Hueske-2 [via Apache Flink User Mailing List archive.] <span dir=3D"ltr">=
&lt;<a href=3D"http:///user/SendEmail.jtp?type=3Dnode&amp;node=3D8842&amp;i=
=3D0" rel=3D"nofollow" link=3D"external" target=3D"_blank">[hidden email]</=
a>&gt;</span> wrote:<br></span><blockquote style=3D'border-left:2px solid #=
CCCCCC;padding:0 1em' style=3D"border-left:2px solid #cccccc;padding:0 1em"=
 class=3D"gmail_quote"><span>

=09<div dir=3D"ltr"><div><div><div><div>Thanks for the explanation. I think=
 I understood your usecase.<br><br></div>Yes, I&#39;d go for the RocksDB ap=
proach in a CoFlatMapFunction on a keyed stream (keyed by join key). <br>On=
e input would be the unmatched outer join records, the other input would se=
rve the events you want to match them with.<br></div>Retrieving elements fr=
om RocksDB will be local and should be fast.<br></div><div><br></div>You sh=
ould be confident though, that all unmatched record will be picked up at so=
me point (RocksDB persists to disk, so you won&#39;t run out of memory but =
snapshots size will increase).<br></div><div>The future state expiry featur=
e will avoid such situations.<br></div><div><br></div>Best, Fabian<br></div=
></span><div><div><div class=3D"gmail_extra"><br><div class=3D"gmail_quote"=
><span>2016-09-01 18:29 GMT+02:00 vinay patil <span dir=3D"ltr">&lt;<a href=
=3D"http:///user/SendEmail.jtp?type=3Dnode&amp;node=3D8837&amp;i=3D0" rel=
=3D"nofollow" link=3D"external" target=3D"_blank">[hidden email]</a>&gt;</s=
pan>:<br></span><blockquote style=3D'border-left:2px solid #CCCCCC;padding:=
0 1em' style=3D"border-left:2px solid #cccccc;padding:0 1em" class=3D"gmail=
_quote"><span><div dir=3D"ltr">Hi Fabian,<div><br></div><div>I had already =
used Co-Group function earlier but were getting some issues while dealing w=
ith watermarks (for one use case I was not getting the correct result), so =
I have used the union operator for performing the outer-join (WindowFunctio=
n on a keyedStream), this approach is working correctly and giving me corre=
ct results.</div><div><br></div><div>As I have discussed the scenario, I wa=
nt to maintain the non-matching records in some store, so that&#39;s why I =
was thinking of using RocksDB as a store here, where I will maintain the us=
er-defined state =C2=A0after the outer-join window operator, and I can quer=
y it using Flink to check if the value for a particular key is present or n=
ot , if present I can match them and send it downstream.</div><div><br></di=
v><div>The final goal is to have zero non-matching records, so this is the =
backup plan to handle edge-case scenarios.</div><div><br></div><div>I have =
already integrated code to write to Cassandra using Flink Connector, but I =
think this will be a better option rather than hitting the query to externa=
l store since RocksDb will store the data to local TM disk, the retrieval w=
ill be faster here than Cassandra , right ?</div><div><br></div><div><div>W=
hat do you think ?</div></div><div><br></div></div></span><div class=3D"gma=
il_extra"><br clear=3D"all"><div><div data-smartmail=3D"gmail_signature"><d=
iv dir=3D"ltr"><div><div dir=3D"ltr"><font color=3D"#000000">Regards,</font=
><div><font color=3D"#000000">Vinay Patil</font></div></div></div></div></d=
iv></div>
<br><div class=3D"gmail_quote"><div><div><span>On Thu, Sep 1, 2016 at 10:19=
 AM, Fabian Hueske-2 [via Apache Flink User Mailing List archive.] <span di=
r=3D"ltr">&lt;<a href=3D"http:///user/SendEmail.jtp?type=3Dnode&amp;node=3D=
8836&amp;i=3D0" rel=3D"nofollow" link=3D"external" target=3D"_blank">[hidde=
n email]</a>&gt;</span> wrote:<br></span></div></div><blockquote style=3D'b=
order-left:2px solid #CCCCCC;padding:0 1em' style=3D"border-left:2px solid =
#cccccc;padding:0 1em" class=3D"gmail_quote"><div><div><span>

=09<div dir=3D"ltr"><div><div><div><div>Hi Vinay,<br><br></div>can you give=
 a bit more detail about how you plan to implement the outer join? Using a =
WIndowFunction or a CoFlatMapFunction on a KeyedStream?<br><br></div>An alt=
ernative could be to use a CoGroup operator which collects from two inputs =
all elements that share a common key (the join key) and are in the same win=
dow. The interface of the function provides two iterators over the elements=
 of both inputs and can be used to implement outer join functionality. The =
benefit of working with a CoGroupFunction is that you do not have to take c=
are of state handling at all. <br><br></div>In case you go for a custom imp=
lementation you will need to work with operator state. <br>However, you do =
not need to directly interact with RocksDB. Flink is taking care of that fo=
r you.<br><br></div>Best, Fabian<br></div></span><div><div><div class=3D"gm=
ail_extra"><br><div class=3D"gmail_quote"><span>2016-09-01 16:13 GMT+02:00 =
vinay patil <span dir=3D"ltr">&lt;<a href=3D"http:///user/SendEmail.jtp?typ=
e=3Dnode&amp;node=3D8832&amp;i=3D0" rel=3D"nofollow" link=3D"external" targ=
et=3D"_blank">[hidden email]</a>&gt;</span>:<br></span><blockquote style=3D=
'border-left:2px solid #CCCCCC;padding:0 1em' style=3D"border-left:2px soli=
d #cccccc;padding:0 1em" class=3D"gmail_quote"><span><span><div dir=3D"ltr"=
>Hi Fabian/Stephan,<div><br></div><div>Waiting for your suggestion</div></d=
iv></span></span><div class=3D"gmail_extra"><br clear=3D"all"><div><div dat=
a-smartmail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><fon=
t color=3D"#000000">Regards,</font><div><font color=3D"#000000">Vinay Patil=
</font></div></div></div></div></div></div>
<br><div class=3D"gmail_quote"><span><span>On Wed, Aug 31, 2016 at 1:46 PM,=
 Vinay Patil <span dir=3D"ltr">&lt;<a href=3D"http:///user/SendEmail.jtp?ty=
pe=3Dnode&amp;node=3D8829&amp;i=3D0" rel=3D"nofollow" link=3D"external" tar=
get=3D"_blank">[hidden email]</a>&gt;</span> wrote:<br></span></span><block=
quote style=3D'border-left:2px solid #CCCCCC;padding:0 1em' style=3D"border=
-left:2px solid #cccccc;padding:0 1em" class=3D"gmail_quote"><span><span><d=
iv dir=3D"ltr">Hi Fabian/Stephan,<div><br></div><div>This makes things clea=
r.</div><div><br></div><div>This is the use case I have :=C2=A0</div><div>I=
 am performing a outer join operation on the two streams (in window) after =
which I get matchingAndNonMatchingStream, now I want to make sure that the =
matching rate is high (matching cannot happen if one of the source is not e=
mitting elements for certain time) , so to tackle this situation I was thin=
king of using RocksDB as a state Backend, where I will insert the unmatched=
 records in it (key - will be same as used for window and value will be DTO=
 ), so before inserting into it I will check if it is already present in Ro=
cksDB, if yes I will take the data from it and send it downstream (and ensu=
re I perform the clean operation for that key).</div><div>(Also the data to=
 store should be encrypted, encryption part can be handled )</div><div><br>=
</div><div>so instead of using Cassandra , Can I do this using RocksDB as s=
tate backend since the state is not gone after checkpointing ?</div><div><b=
r></div><div>P.S I have kept the watermark behind by 1500 secs just to be s=
afe on handling late elements but to tackle edge case scenarios like the on=
e mentioned above we are having a backup plan of using Cassandra as externa=
l store since we are dealing with financial critical data.</div></div></spa=
n></span><div class=3D"gmail_extra"><br clear=3D"all"><div><div data-smartm=
ail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><font color=
=3D"#000000">Regards,</font><div><font color=3D"#000000">Vinay Patil</font>=
</div></div></div></div></div></div><div><div>
<br><div class=3D"gmail_quote"><span><span>On Wed, Aug 31, 2016 at 11:34 AM=
, Fabian Hueske <span dir=3D"ltr">&lt;<a href=3D"http:///user/SendEmail.jtp=
?type=3Dnode&amp;node=3D8829&amp;i=3D1" rel=3D"nofollow" link=3D"external" =
target=3D"_blank">[hidden email]</a>&gt;</span> wrote:<br></span></span><bl=
ockquote style=3D'border-left:2px solid #CCCCCC;padding:0 1em' style=3D"bor=
der-left:2px solid #cccccc;padding:0 1em" class=3D"gmail_quote"><span><span=
>Hi Vinaj,<br>
<br>
if you use user-defined state, you have to manually clear it.<br>
Otherwise, it will stay in the state backend (heap or RocksDB) until the<br=
>
job goes down (planned or due to an OOM error).<br>
<br>
This is esp. important to keep in mind, when using keyed state.<br>
If you have an unbounded, evolving key space you will likely run<br>
out-of-memory.<br>
The job will constantly add state for each new key but won&#39;t be able to=
<br>
clean up the state for &quot;expired&quot; keys.<br>
<br>
You could implement a clean-up mechanism this if you implement a custom<br>
stream operator.<br>
However this is a very low level interface and requires solid understanding=
<br>
of the internals like timestamps, watermarks and the checkpointing<br>
mechanism.<br>
<br>
The community is currently working on a state expiry feature (state will be=
<br>
discarded if not requested or updated for x minutes).<br>
<br>
Regarding the second question: Does state remain local after checkpointing?=
<br>
Yes, the local state is only copied to the remote FS (HDFS, S3, ...) but<br=
>
remains in the operator. So the state is not gone after a checkpoint is<br>
completed.<br>
<br>
Hope this helps,<br>
Fabian<br>
</span></span><div><div><span><span><br>
2016-08-31 18:17 GMT+02:00 Vinay Patil &lt;<a href=3D"http:///user/SendEmai=
l.jtp?type=3Dnode&amp;node=3D8829&amp;i=3D2" rel=3D"nofollow" link=3D"exter=
nal" target=3D"_blank">[hidden email]</a>&gt;:<br>
<br>
&gt; Hi Stephan,<br>
&gt;<br>
&gt; Just wanted to jump into this discussion regarding state.<br>
&gt;<br>
&gt; So do you mean that if we maintain user-defined state (for<br>
&gt; non-window operators) and do not clear it explicitly, the data for<br>
&gt; that key will remain in RocksDB?<br>
&gt;<br>
&gt; What happens in case of a checkpoint? I read in the documentation<br>
&gt; that after the checkpoint happens, the RocksDB data is pushed to<br>
&gt; the desired location (HDFS, S3, or another FS), so for<br>
&gt; user-defined state, does the data still remain in RocksDB after<br>
&gt; the checkpoint?<br>
&gt;<br>
&gt; Correct me if I have misunderstood this concept.<br>
&gt;<br>
&gt; For one of our use cases we were going for this, but since I read<br>
&gt; the above part in the documentation, we are going for Cassandra<br>
&gt; now (to store records and query them for a special case).<br>
&gt;<br>
&gt;<br>
&gt;<br>
&gt;<br>
&gt;<br>
&gt; Regards,<br>
&gt; Vinay Patil<br>
&gt;<br></span></span><span><span>
&gt; On Wed, Aug 31, 2016 at 4:51 AM, Stephan Ewen &lt;<a href=3D"http:///u=
ser/SendEmail.jtp?type=3Dnode&amp;node=3D8829&amp;i=3D3" rel=3D"nofollow" l=
ink=3D"external" target=3D"_blank">[hidden email]</a>&gt; wrote:<br>
&gt;<br>
&gt; &gt; In streaming, memory is mainly needed for state (key/value state)=
. The<br>
&gt; &gt; exact representation depends on the chosen StateBackend.<br>
&gt; &gt;<br>
&gt; &gt; State is explicitly released: For windows, state is cleaned up<br=
>
&gt; &gt; automatically (firing / expiry), for user-defined state, keys hav=
e to be<br>
&gt; &gt; explicitly cleared (clear() method) or in the future will have th=
e option<br>
&gt; &gt; to expire.<br>
&gt; &gt;<br>
&gt; &gt; The workhorse for streaming state is currently RocksDB, which<br>
&gt; &gt; internally uses native (off-heap) memory to keep the data.<br>
&gt; &gt;<br>
&gt; &gt; Does that help?<br>
&gt; &gt;<br>
&gt; &gt; Stephan<br>
&gt; &gt;<br>
&gt; &gt;<br></span></span>
&gt; &gt; On Tue, Aug 30, 2016 at 11:52 PM, Roshan Naik &lt;<a href=3D"http=
:///user/SendEmail.jtp?type=3Dnode&amp;node=3D8829&amp;i=3D4" rel=3D"nofoll=
ow" link=3D"external" target=3D"_blank">[hidden email]</a>&gt;<span><span><=
br>
&gt; &gt; wrote:<br>
&gt; &gt;<br>
&gt; &gt; &gt; As per the docs, in Batch mode, dynamic memory allocation is=
 avoided by<br>
&gt; &gt; &gt; storing messages being processed in ByteBuffers via Unsafe m=
ethods.<br>
&gt; &gt; &gt;<br>
&gt; &gt; &gt; Couldn&#39;t find any docs describing mem mgmt in Stre=
aming mode. So...<br>
&gt; &gt; &gt;<br>
&gt; &gt; &gt; - Am wondering if this is also the case with Streaming ?<br>
&gt; &gt; &gt;<br>
&gt; &gt; &gt; - If so, how does Flink detect that an object is no longer b=
eing used<br>
&gt; and<br>
&gt; &gt; &gt; can be reclaimed for reuse once again ?<br>
&gt; &gt; &gt;<br>
&gt; &gt; &gt; -roshan<br>
&gt; &gt; &gt;<br>
&gt; &gt;<br>
&gt;<br>
</span></span></div></div></blockquote></div><br></div></div></div>
</blockquote></div><br></div><div><div>


=09
=09
=09
<br><hr align=3D"left" width=3D"300">
View this message in context: <a href=3D"http://apache-flink-user-mailing-l=
ist-archive.2336050.n4.nabble.com/Re-Streaming-memory-management-tp8829.htm=
l" rel=3D"nofollow" link=3D"external" target=3D"_blank">Re: Streaming - mem=
ory management</a><br>
Sent from the <a href=3D"http://apache-flink-user-mailing-list-archive.2336=
050.n4.nabble.com/" rel=3D"nofollow" link=3D"external" target=3D"_blank">Ap=
ache Flink User Mailing List archive. mailing list archive</a> at Nabble.co=
m.<br></div></div></blockquote></div><br></div>


=09
=09
=09
=09<br>
=09<br>
=09</div></div><hr color=3D"#cccccc" size=3D"1" noshade>
=09<div style=3D"color:#444;font:12px tahoma,geneva,helvetica,arial,sans-se=
rif"><span>
=09=09<div style=3D"font-weight:bold">If you reply to this email, your mess=
age will be added to the discussion below:</div>
=09=09</span><a href=3D"http://apache-flink-user-mailing-list-archive.23360=
50.n4.nabble.com/Re-Streaming-memory-management-tp8829p8832.html" rel=3D"no=
follow" link=3D"external" target=3D"_blank">http://apache-flink-user-maili<=
wbr>ng-list-archive.2336050.n4.nab<wbr>ble.com/Re-Streaming-memory-ma<wbr>n=
agement-tp8829p8832.html</a>
=09</div>
=09</div></div><div style=3D"color:#666;font:11px tahoma,geneva,helvetica,a=
rial,sans-serif;margin-top:.4em;line-height:1.5em">
=09=09To start a new topic under Apache Flink User Mailing List archive., e=
mail <a href=3D"http:///user/SendEmail.jtp?type=3Dnode&amp;node=3D8836&amp;=
i=3D1" rel=3D"nofollow" link=3D"external" target=3D"_blank">[hidden email]<=
/a> <br><span>
=09=09To unsubscribe from Apache Flink User Mailing List archive., <a rel=
=3D"nofollow" link=3D"external" target=3D"_top">click here</a>.<br>
=09=09<a href=3D"http://apache-flink-user-mailing-list-archive.2336050.n4.n=
abble.com/template/NamlServlet.jtp?macro=3Dmacro_viewer&amp;id=3Dinstant_ht=
ml%21nabble%3Aemail.naml&amp;base=3Dnabble.naml.namespaces.BasicNamespace-n=
abble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespa=
ce-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNa=
mespace&amp;breadcrumbs=3Dnotify_subscribers%21nabble%3Aemail.naml-instant_=
emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml" rel=
=3D"nofollow" style=3D"font:9px serif" link=3D"external" target=3D"_blank">=
NAML</a>
=09</span></div></blockquote></div><br></div><div><div>


=09
=09
=09
<br></div></div></blockquote></div><br></div>


=09
=09
=09
=09<br>
=09<br>
=09<hr color=3D"#cccccc" size=3D"1" noshade>
=09</div></div><div style=3D"color:#444;font:12px tahoma,geneva,helvetica,a=
rial,sans-serif"><div><div><span>
=09=09<div style=3D"font-weight:bold">If you reply to this email, your mess=
age will be added to the discussion below:</div>
=09=09</span></div></div><a href=3D"http://apache-flink-user-mailing-list-a=
rchive.2336050.n4.nabble.com/Re-Streaming-memory-management-tp8829p8837.htm=
l" rel=3D"nofollow" link=3D"external" target=3D"_blank">http://apache-flink=
-user-maili<wbr>ng-list-archive.2336050.n4.nab<wbr>ble.com/Re-Streaming-mem=
ory-<wbr>management-tp8829p8837.html</a>
=09</div><span><div><div>
=09<div style=3D"color:#666;font:11px tahoma,geneva,helvetica,arial,sans-se=
rif;margin-top:.4em;line-height:1.5em">
=09</div></div></div></span></blockquote></div><br></div></div></div><div><=
div class=3D"h5"><div><div>


=09
=09
=09
<br></div></div></div></div></blockquote></div><br></div>


=09
=09
=09
=09<br>
=09<br>
=09<hr noshade size=3D"1" color=3D"#cccccc">
=09<div style=3D"color:#444;font:12px tahoma,geneva,helvetica,arial,sans-se=
rif"><div><div class=3D"h5">
=09=09<div style=3D"font-weight:bold">If you reply to this email, your mess=
age will be added to the discussion below:</div>
=09=09</div></div><a href=3D"http://apache-flink-user-mailing-list-archive.=
2336050.n4.nabble.com/Re-Streaming-memory-management-tp8829p8843.html" targ=
et=3D"_blank" rel=3D"nofollow" link=3D"external">http://apache-flink-user-<=
wbr>mailing-list-archive.2336050.<wbr>n4.nabble.com/Re-Streaming-<wbr>memor=
y-management-tp8829p8843.<wbr>html</a>
=09</div><div class=3D"HOEnZb"><div class=3D"h5">
=09<div style=3D"color:#666;font:11px tahoma,geneva,helvetica,arial,sans-se=
rif;margin-top:.4em;line-height:1.5em">
=09</div></div></div></blockquote></div><br></div>


=09
=09
=09
<br/><hr align=3D"left" width=3D"300" />
View this message in context: <a href=3D"http://apache-flink-user-mailing-l=
ist-archive.2336050.n4.nabble.com/Re-Streaming-memory-management-tp8829p884=
5.html">Re: Streaming - memory management</a><br/>
Sent from the <a href=3D"http://apache-flink-user-mailing-list-archive.2336=
050.n4.nabble.com/">Apache Flink User Mailing List archive. mailing list ar=
chive</a> at Nabble.com.<br/>
------=_Part_110734_1786378467.1472754144898--