flink-user mailing list archives

From Arvid Heise <ar...@ververica.com>
Subject Re: Reading from sockets using dataset api
Date Thu, 23 Apr 2020 14:51:03 GMT
Hi Kaan,

AFAIK there is no (easy) way to switch from the streaming API back to the
batch API while retaining all data in memory (correct me if I misunderstood).

However, I'm having trouble understanding your setup. Why can't you dump the
data to a file? Do you really have more main memory than disk space? Or is
there no shared storage between your generating cluster and the Flink
cluster?

It almost sounds as if the issue at heart is finding a good serialization
format for storing the edges. The 70 billion edges could be stored as an
array of id pairs; with 8-byte ids that amounts to roughly 1.1 TB
uncompressed (70e9 edges x 2 longs x 8 bytes), and Avro's variable-length
long encoding (or any other compact binary format) would typically shrink it
further. That's not much by today's standards and could also be easily
offloaded to S3.

Alternatively, if graph generation is rather cheap, you could also try to
incorporate it directly into the analysis job.
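In the DataSet API this idea maps naturally onto
`env.generateSequence(...).flatMap(...)`, since `generateSequence` is split
across all parallel task slots. Below is a library-free Java sketch of the
core trick: each worker deterministically regenerates its own slice of the
graph from per-vertex seeds, so no edge data has to move over sockets or
files. All names, sizes, and the toy generator itself are hypothetical, not
something from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

public class PartitionedEdgeGen {
    // Generate the edge slice owned by one worker. Because the generator is
    // seeded per vertex id, every run (and every retry) produces the same
    // edges without any data transfer between generator and analysis job.
    static List<long[]> slice(int worker, int workers, long vertices, int degree) {
        List<long[]> edges = new ArrayList<>();
        for (long v = worker; v < vertices; v += workers) {
            Random rnd = new Random(v);  // deterministic per-vertex seed
            for (int i = 0; i < degree; i++) {
                edges.add(new long[] { v, (long) (rnd.nextDouble() * vertices) });
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        int workers = 4;
        // Each "worker" generates its share locally; together they cover the graph.
        long total = IntStream.range(0, workers)
                .mapToLong(w -> slice(w, workers, 1000, 8).size())
                .sum();
        System.out.println("total edges = " + total);
    }
}
```

In a real Flink job the body of `slice` would live inside the `flatMap`
applied to the generated sequence of vertex ids.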

On Wed, Apr 22, 2020 at 2:58 AM Kaan Sancak <kaansnck@gmail.com> wrote:

> Hi,
>
> I have been running some experiments on large graph data; the smallest
> graph I have been using has around 70 billion edges. I have a graph
> generator, which generates the graph in parallel and feeds it to the
> running system. However, it takes a lot of time to read the edges,
> because even though the graph generation process is parallel, in Flink I
> can only listen on the master node (correct me if I am wrong). Another
> option is dumping the generated data to a file and reading it with
> readCsvFile, however this is not feasible in terms of storage management.
>
> What I want to do is invoke my graph generator using IPC/TCP protocols
> and read the generated data from sockets. Since the graph data is also
> generated in parallel on each node, I want to make use of IPC and read
> the data in parallel at each node. I did some online digging but couldn't
> find anything similar using the DataSet API. I would be glad if you have
> some similar use cases or examples.
>
> Is it possible to use the streaming environment to create the data in
> parallel and then switch to the DataSet API?
>
> Thanks in advance!
>
> Best
> Kaan



-- 

Arvid Heise | Senior Java Developer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng
