flink-user mailing list archives

From Nico Kruber <n...@data-artisans.com>
Subject Re: Distribute crawling of a URL list using Flink
Date Mon, 14 Aug 2017 17:06:23 GMT
Hi Eranga and Kien,
Flink has supported asynchronous I/O since version 1.2; see [1] for details.

You basically wrap your URL download in the asynchronous function and emit
the resulting string for further processing in your pipeline.
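
To illustrate the idea outside of Flink, here is a minimal plain-Java sketch
of non-blocking downloads. It is not Flink code: in Flink you would implement
an AsyncFunction and apply it with AsyncDataStream.unorderedWait or
orderedWait, as described in [1]. The class AsyncCrawlSketch, the helper
fakeDownload and the parameter maxInFlight are made up for this example, and
the download is only simulated with a sleep.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class AsyncCrawlSketch {

    // Hypothetical stand-in for a real HTTP download; it only simulates latency.
    static String fakeDownload(String url) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "content of " + url;
    }

    // Starts every download without waiting for the previous one,
    // then collects the results in the same order as the input URLs.
    static List<String> crawlAll(List<String> urls,
                                 Function<String, String> download,
                                 int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
        try {
            // Collecting to a list first makes sure all requests are submitted
            // before we block on any single result.
            List<CompletableFuture<String>> pending = urls.stream()
                .map(url -> CompletableFuture.supplyAsync(() -> download.apply(url), pool))
                .collect(Collectors.toList());
            return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.org/a", "http://example.org/b");
        for (String page : crawlAll(urls, AsyncCrawlSketch::fakeDownload, 2)) {
            System.out.println(page);
        }
    }
}
```

With maxInFlight downloads running concurrently, the total waiting time is
roughly the slowest download per batch rather than the sum of all downloads,
which is the same benefit the asynchronous operator gives you inside a
pipeline.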



Nico


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html

On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
> Hi,
> 
> While this task is quite trivial to do with the Flink DataSet API, using
> readTextFile to read the input and a flatMap function to perform the
> downloading, it might not be a good idea.
> 
> The download process is I/O bound and will block the synchronous flatMap
> function, so the throughput will not be very good.
> 
> 
> Until Flink supports asynchronous functions, I suggest you look elsewhere.
> 
> An example of a master-worker architecture using Akka can be found here:
> 
> https://github.com/typesafehub/activator-akka-distributed-workers
> 
> 
> Regards,
> 
> Kien
> 
> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
> > Hi all,
> > 
> > I am fairly new to Flink. I have a project where I have a list of
> > URLs (on one node) that need to be crawled in a distributed fashion.
> > Then, for each URL, I need the serialized crawl result to be written
> > to a single text file.
> > 
> > I want to know if there are similar projects which I can look into or
> > an idea on how to implement this.
> > 
> > Thanks & Regards,
> > 
> > 
> > 
> > 
> > Eranga Heshan
> > /Undergraduate/
> > Computer Science & Engineering
> > University of Moratuwa
> > Mobile: +94 71 138 2686
> > Email: eranga.h.n@gmail.com
> > <https://www.facebook.com/erangaheshan>
> > <https://twitter.com/erangaheshan>
> > <https://www.linkedin.com/in/erangaheshan>

