spark-user mailing list archives

From backtrack5 <sorna...@live.com>
Subject Re: spark stream based deduplication
Date Mon, 26 Sep 2016 05:18:46 GMT
Thank you @markcitizen. Here is an example of what I want to achieve.

My historic RDD has:
(Hash1, recordid1)
(Hash2, recordid2)

And in the new stream I have the following:
(Hash3, recordid3)
(Hash1, recordid5)

In the above scenario:
1) For recordid5, I should be told that recordid5 is a duplicate of recordid1.
2) The new values (Hash3, recordid3) should be added to the historic RDD.
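
To make the expected behaviour concrete, here is a minimal plain-Python sketch of the two rules above. It does not use Spark at all: the historic RDD is modelled as an ordinary dict keyed by hash, and the function name `deduplicate` is made up for illustration. It only shows the result I am after, not how it should be implemented on a stream.

```python
from typing import Dict, List, Tuple


def deduplicate(
    historic: Dict[str, str],
    batch: List[Tuple[str, str]],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Process one incoming batch of (hash, record_id) pairs.

    Returns (updated_historic, duplicates), where duplicates holds
    (new_record_id, original_record_id) pairs.
    """
    updated = dict(historic)  # copy so the caller's state is not mutated
    duplicates = []
    for record_hash, record_id in batch:
        if record_hash in updated:
            # Rule 1: the hash is already known, so report the duplicate.
            duplicates.append((record_id, updated[record_hash]))
        else:
            # Rule 2: the hash is new, so add it to the historic data.
            updated[record_hash] = record_id
    return updated, duplicates


# The example from this message.
historic = {"Hash1": "recordid1", "Hash2": "recordid2"}
batch = [("Hash3", "recordid3"), ("Hash1", "recordid5")]
historic, duplicates = deduplicate(historic, batch)
```

After this call, `duplicates` contains the pair (recordid5, recordid1), and `historic` holds Hash1, Hash2 and Hash3.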

I have one more question:
If the program crashes at any point, is it possible to recover that historic RDD?
Can I use a stateful stream?