Return-Path: X-Original-To: apmail-flink-user-archive@minotaur.apache.org Delivered-To: apmail-flink-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id ADE7918410 for ; Fri, 17 Jul 2015 14:19:22 +0000 (UTC) Received: (qmail 41038 invoked by uid 500); 17 Jul 2015 14:19:22 -0000 Delivered-To: apmail-flink-user-archive@flink.apache.org Received: (qmail 40965 invoked by uid 500); 17 Jul 2015 14:19:22 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flink.apache.org Delivered-To: mailing list user@flink.apache.org Received: (qmail 40954 invoked by uid 99); 17 Jul 2015 14:19:22 -0000 Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 17 Jul 2015 14:19:22 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id CDC4BD5653 for ; Fri, 17 Jul 2015 14:19:21 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.88 X-Spam-Level: ** X-Spam-Status: No, score=2.88 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id d27mELedO-oM for ; Fri, 17 Jul 2015 14:19:13 +0000 (UTC) Received: from mail-la0-f44.google.com (mail-la0-f44.google.com [209.85.215.44]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id 9E1B843CD3 for ; Fri, 17 Jul 2015 14:19:12 +0000 (UTC) Received: by lagw2 with SMTP id w2so61935048lag.3 for ; Fri, 17 Jul 2015 07:19:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=/Ieba7CSoHmPDqtPQvmxpIl4m2mmySBzXyoWba3Tpr4=; b=rZJG4RG5WqhEY9L+86GgsO43F5As1Vmz4b3+DWzZ56bOBuP7vLZ3L49IfEARgidKBE lh7jPxH7WIH61R5koB4zXj+YUxiNcjDWqXd3AT1p8vIgtD3u5KJUCVEli6xHCf/glaL+ fOtJZflN0UsDvrsEGBrBmNas4xRvNrPzO1QJCfcmjvu/9EnlpFeECSEXh5vZd2pgvSS/ kNRsoae6rtmyRvCkvOSV6TdyqgQsfgcHVqsEuWtv6XGZAes8CD+d5XgAIc9/q80rBM1S Nntxetm1sOWmz3QcOzbVO+/oyyjRBLp2q+FWaiY6A3PIiUaJmbpzaI5jfPPIjEnkdx3h Zkpg== MIME-Version: 1.0 X-Received: by 10.112.85.3 with SMTP id d3mr14801327lbz.33.1437142751508; Fri, 17 Jul 2015 07:19:11 -0700 (PDT) Received: by 10.152.225.171 with HTTP; Fri, 17 Jul 2015 07:19:11 -0700 (PDT) In-Reply-To: References: Date: Fri, 17 Jul 2015 16:19:11 +0200 Message-ID: Subject: Re: Map and Reduce cycle From: Fabian Hueske To: user@flink.apache.org Content-Type: multipart/alternative; boundary=001a1134946c31fe48051b12df0d --001a1134946c31fe48051b12df0d Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi Bill, Flink uses pipelined data shipping by default. If you have a program like: source -> map -> reduce -> sink, the mapper will immediately start to shuffle data over the network to the reducer. The reducer collects that data and starts to sort it in batches. When the mapper is done and the reducer is done with the sorting, the sorted batches are merged and a sorted stream is fed into the reduce function. So, the shuffling and sorting happens while the mapper is running, but the reduce function can only be applied after the last mapper has finished. I am not aware of benchmarks that compare Flink's and Spark's in-memory operations, but this blog post compares the performance of sorting binary data in managed memory (like Flink) to naive sorting on the heap (Spark might do it differently!). --> http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html Fabian 2015-07-17 16:07 GMT+02:00 Bill Sparks : > Does flink require all the map tasks to finish before the reducers can > proceed like Spark, or can the reducer operations start before all the > mappers have finished like the older Hadoop mapreduce. > > Also my understanding is that flink manages it's own heap, do you/we > have a sense of the performance impact of this as compared to say =E2=80= =A6. Spark > where it's all in the JVM. > > Regards, > Bill. > > -- > Jonathan (Bill) Sparks > Software Architecture > Cray Inc. > --001a1134946c31fe48051b12df0d Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Bill,

Flink uses pipel= ined data shipping by default. If you have a program like: source -> map= -> reduce -> sink, the mapper will immediately start to shuffle data= over the network to the reducer. The reducer collects that data and starts= to sort it in batches. When the mapper is done and the reducer is done wit= h the sorting, the sorted batches are merged and a sorted stream is fed int= o the reduce function.
So, the shuffling and sorting happens= while the mapper is running, but the reduce function can only be applied a= fter the last mapper has finished.

I am not aware of benchmark= s that compare Flink's and Spark's in-memory operations, but this b= log post compares the performance of sorting binary data in managed memory = (like Flink) to naive sorting on the heap (Spark might do it differently!).=

--> http://flink.apache.org/news/2015/05/11/Juggling-= with-Bits-and-Bytes.html

Fabian

201= 5-07-17 16:07 GMT+02:00 Bill Sparks <jsparks@cray.com>:
Does flink require all the map tasks to finish before the reducers can= proceed like Spark, or can the reducer operations start before all the map= pers have finished like the older Hadoop mapreduce.=C2=A0

Also my understanding is that flink manages it's own heap, do you/= we have a sense of the performance impact of this as compared to say =E2=80= =A6. Spark where it's all in the JVM.

Regards,
=C2=A0 =C2=A0Bill.

--=C2=A0
Jonathan (Bill) Sparks
Software Architecture
Cray Inc.

--001a1134946c31fe48051b12df0d--