flink-user mailing list archives

From Aljoscha Krettek <aljos...@apache.org>
Subject Re: Distribute crawling of a URL list using Flink
Date Fri, 25 Aug 2017 12:23:10 GMT
Hi,

It is not available for the Batch API; you would have to use the DataStream API.

Best,
Aljoscha
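
The asynchronous I/O pattern discussed in this thread boils down to issuing each blocking download on a separate executor and collecting the results from futures, which is what the DataStream API's async I/O operator does per record. Below is a plain-JDK sketch of that idea; `fetch` is a made-up stand-in for a real HTTP call, and no Flink API is used:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncFetchSketch {
    // Stand-in for a real HTTP download; sleeps to mimic network latency.
    static String fetch(String url) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "content-of-" + url;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a.example", "b.example", "c.example");
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());

        // Issue all downloads concurrently instead of blocking on each one in
        // turn; Flink's async I/O operator manages this per record for you.
        List<CompletableFuture<String>> futures = urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> fetch(u), pool))
                .collect(Collectors.toList());

        // join() waits for each future; the downloads overlap in the pool.
        List<String> results = futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        pool.shutdown();

        results.forEach(System.out::println); // prints content-of-a.example, …
    }
}
```

In an actual job, the download would roughly sit inside an AsyncFunction's asyncInvoke callback instead; see the docs linked later in this thread for the exact signature in your Flink version.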

> On 15. Aug 2017, at 01:16, Kien Truong <duckientruong@gmail.com> wrote:
> 
> Hi, 
> 
> Admittedly, I have not suggested this because I thought it was not available for the Batch API.
> 
> Regards, 
> Kien 
> On Aug 15, 2017, at 00:06, Nico Kruber <nico@data-artisans.com> wrote:
> Hi Eranga and Kien,
> Flink has supported asynchronous I/O since version 1.2; see [1] for details.
> 
> You basically pack your URL download into the asynchronous part and collect 
> the resulting string for further processing in your pipeline.
> 
> Nico
> 
> 
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
> 
> On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
> Hi,
> 
> While this task is quite trivial to do with the Flink DataSet API, using
> readTextFile to read the input and a flatMap function to perform the
> downloading, it might not be a good idea.
> 
> The download process is I/O bound and will block the synchronous flatMap
> function, so the throughput will not be very good.
> 
> Until Flink supports asynchronous functions, I suggest you look elsewhere.
> An example with a master-workers architecture using Akka can be found here:
> https://github.com/typesafehub/activator-akka-distributed-workers
> 
> Regards,
> Kien
> 
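
Kien's throughput point above can be seen in miniature without Flink at all: a synchronous loop, which is what a blocking flatMap amounts to within one task slot, pays the full network latency for every URL, while overlapping the waits pays it roughly once. A self-contained sketch, in which `Thread.sleep` stands in for network I/O:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingVsConcurrent {
    // Stand-in for a network download: ~100 ms of pure waiting.
    static String download(String url) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return url + "-body";
    }

    // Returns {sequentialMillis, concurrentMillis} for the same URL list.
    static long[] compare(List<String> urls) throws Exception {
        // Sequential, like a blocking flatMap: latencies add up.
        long t0 = System.nanoTime();
        for (String u : urls) download(u);
        long seq = (System.nanoTime() - t0) / 1_000_000;

        // Concurrent: the waits overlap, total time ~ one download.
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());
        long t1 = System.nanoTime();
        List<Future<String>> fs = new ArrayList<>();
        for (String u : urls) fs.add(pool.submit(() -> download(u)));
        for (Future<String> f : fs) f.get();
        long con = (System.nanoTime() - t1) / 1_000_000;
        pool.shutdown();
        return new long[] { seq, con };
    }

    public static void main(String[] args) throws Exception {
        long[] ms = compare(List.of("u1", "u2", "u3", "u4"));
        System.out.println("sequential ~" + ms[0] + " ms, concurrent ~" + ms[1] + " ms");
    }
}
```

With four URLs the sequential pass takes roughly 400 ms and the concurrent one roughly 100 ms; async I/O (or the Akka worker pool suggested above) is what recovers that lost throughput at scale.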
> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
> Hi all,
> 
> I am fairly new to Flink. I have this project where I have a list of
> URLs (in one node) which need to be crawled in a distributed fashion.
> Then, for each URL, I need the serialized crawled result to be written
> to a single text file.
> 
> I want to know if there are similar projects which I can look into or
> an idea on how to implement this.
> 
> Thanks & Regards,
> 
> Eranga Heshan
> Undergraduate
> Computer Science & Engineering
> University of Moratuwa
> Mobile: +94 71 138 2686
> Email: eranga.h.n@gmail.com
> https://www.facebook.com/erangaheshan
> https://twitter.com/erangaheshan
> https://www.linkedin.com/in/erangaheshan

