Subject: Re: Distribute crawling of a URL list using Flink
From: Kien Truong
Date: Tue, 15 Aug 2017 06:16:28 +0700
To: Nico Kruber
CC: user@flink.apache.org, Eranga Heshan
Message-ID: <37c57421-3ec8-4686-92d8-1fa2fc6baf3c@gmail.com>
In-Reply-To: <2859816.4nhSuef7RF@nico-work>
References: <8b2462a7-c427-2fcd-1ea2-1bc6f94fda56@gmail.com> <2859816.4nhSuef7RF@nico-work>
Mailing-List: user@flink.apache.org

Hi,

Admittedly, I did not suggest this because I thought it was not available for the batch API.

Regards,
Kien

On Aug 15, 2017, at 00:06, Nico Kruber <nico@data-artisans.com> wrote:
> Hi Eranga and Kien,
> Flink has supported asynchronous I/O since version 1.2, see [1] for details.
>
> You basically pack your URL download into the asynchronous part and collect
> the resulting string for further processing in your pipeline.
>
> Nico
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
>
> On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
>> Hi,
>>
>> While this task is quite trivial to do with the Flink DataSet API, using
>> readTextFile to read the input and a flatMap function to perform the
>> downloading, it might not be a good idea.
>>
>> The download process is I/O bound and will block the synchronous
>> flatMap function, so the throughput will not be very good.
>>
>> Until Flink supports asynchronous functions, I suggest you look elsewhere.
>>
>> An example with a master-workers architecture using Akka can be found here:
>>
>> https://github.com/typesafehub/activator-akka-distributed-workers
>>
>> Regards,
>>
>> Kien
>>
>> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
>>> Hi all,
>>>
>>> I am fairly new to Flink. I have this project where I have a list of
>>> URLs (on one node) which need to be crawled in a distributed manner.
>>> Then, for each URL, I need the serialized crawled result to be written
>>> to a single text file.
>>>
>>> I want to know if there are similar projects which I can look into, or
>>> an idea on how to implement this.
>>>
>>> Thanks & Regards,
>>>
>>> Eranga Heshan
>>> /Undergraduate/
>>> Computer Science & Engineering
>>> University of Moratuwa
>>> Mobile: +94 71 138 2686
>>> Email: eranga.h.n@gmail.com
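
For illustration only, the throughput difference discussed in this thread (a blocking, flatMap-style download versus the asynchronous approach Nico describes) can be sketched outside Flink. The Python snippet below is not Flink code and uses no Flink API. The names fake_download_blocking, fake_download_async, crawl_blocking, crawl_async, the capacity parameter and the 0.05-second delay are all hypothetical stand-ins chosen for this sketch, and the "download" only sleeps instead of touching the network.

```python
import asyncio
import time

# Assumption: a fixed sleep stands in for the network latency of one download.
DELAY_SECONDS = 0.05


def fake_download_blocking(url: str) -> str:
    """Hypothetical blocking fetch: the caller waits for the full delay."""
    time.sleep(DELAY_SECONDS)
    return f"content of {url}"


async def fake_download_async(url: str) -> str:
    """Hypothetical non-blocking fetch: yields control while waiting."""
    await asyncio.sleep(DELAY_SECONDS)
    return f"content of {url}"


def crawl_blocking(urls):
    """One URL at a time, the way a synchronous map function would work."""
    return [fake_download_blocking(u) for u in urls]


async def crawl_async(urls, capacity=10):
    """Keep at most `capacity` downloads in flight; results keep input order."""
    semaphore = asyncio.Semaphore(capacity)

    async def bounded(url):
        async with semaphore:
            return await fake_download_async(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


urls = [f"http://example.com/page{i}" for i in range(10)]
blocking_results, blocking_secs = timed(lambda: crawl_blocking(urls))
async_results, async_secs = timed(lambda: asyncio.run(crawl_async(urls)))
print(f"blocking: {blocking_secs:.2f}s, async: {async_secs:.2f}s")
```

Both variants return the same strings; only the waiting overlaps in the asynchronous one, which is why it finishes sooner for I/O-bound work. In Flink itself the same idea would be expressed with the async I/O operator documented at [1]; the capacity parameter here is only loosely analogous to that operator's limit on in-flight requests.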