flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Iterations vs. combo source/sink
Date Fri, 30 Sep 2016 23:49:50 GMT
Hi Fabian,

Thanks for responding. Comments and questions inline below.


— Ken

> On Sep 29, 2016, at 6:10am, Fabian Hueske <fhueske@gmail.com> wrote:
> Hi Ken,
> you can certainly have partitioned sources and sinks. You can control the parallelism
by calling .setParallelism() method.

So I assume I’d implement the ParallelSourceFunction interface.

> If you need a partitioned sink, you can call .keyBy() to hash partition.
> I did not completely understand the requirements of your program. Can you maybe provide
pseudo code for how the program should look like.

Just for grins, I’m looking at re-implementing the Bixo web crawler (built on top of Cascading/Hadoop
MR) as a continuous crawler on top of Flink.

The main issue is the “crawl DB” that has to maintain the state of every URL ever seen,
and also provide a fast way to generate the “best” URLs to be fetched. The logic of figuring
out the best URL is complex, depending on factors like the anticipated value of the page,
refetch rates for pages that have already been seen, number of unique URLs per domain vs.
the domain “rank”, etc.

And it has to scale to something like 30B+ URLs with a small (e.g. 10 moderately big servers)
cluster, so it needs to be very efficient in terms of memory/CPU usage.

An additional goal is to not require additional external infrastructure. That simplifies the
operational overhead of running a continuous crawl.

So this “crawl DB” has to act as both a source (of the best URLs to fetch) and as a sink
(for updates to fetched URLs, and as new URLs are discovered/injected). The state is a mix
of in-memory and spilled to disk data.

Given what you mention below about iterative data flows not being fault tolerant, it seems
like a combo source/sink (if possible) would be best.

Any guidance as to how to implement such a thing? I don’t know enough yet about Flink to
determine if I can essentially have one task that’s acting as both the source & sink.

> Some general comments:
> - Flink's fault tolerance mechanism does not work with iterative data flows yet. This
is work in progress see: FLINK-3257 [1]

OK, good to know.

> - Flink's fault tolerance mechanism does only work if you expose all! internal operator
state. So you would need to put your Java DB in Flink state to have a recoverable job.


> - Is the DB essential in your application? Could you use Flink's key-partitioned state
interface instead? That would help to make your job fault-tolerant.

Yes, as per above.

> [1] https://issues.apache.org/jira/browse/FLINK-3257 <https://issues.apache.org/jira/browse/FLINK-3257>
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html#using-the-keyvalue-state-interface
> 2016-09-29 1:15 GMT+02:00 Ken Krugler <kkrugler_lists@transpac.com <mailto:kkrugler_lists@transpac.com>>:
> Hi all,
> I’ve got a very specialized DB (runs in the JVM) that I need to use to both keep track
of state and generate new records to be processed by my Flink streaming workflow. Some of
the workflow results are updates to be applied to the DB.
> And the DB needs to be partitioned.
> My initial approach is to wrap it in a regular operator, and have subsequent streams
be inputs for updating state. So now I’ve got an IterativeDataStream, which should work.
> But I imagine I could also wrap this DB in a source and a sink, yes? Though I’m not
sure how I could partition it as a source, in that case.
> If it is feasible to have a partitioned source/sink, are there general pros/cons to either
> Thanks,
> — Ken

Ken Krugler
+1 530-210-6378
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

View raw message