Return-Path: X-Original-To: apmail-spark-user-archive@minotaur.apache.org Delivered-To: apmail-spark-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id F3412102FC for ; Tue, 17 Mar 2015 21:23:43 +0000 (UTC) Received: (qmail 37711 invoked by uid 500); 17 Mar 2015 21:23:41 -0000 Delivered-To: apmail-spark-user-archive@spark.apache.org Received: (qmail 37641 invoked by uid 500); 17 Mar 2015 21:23:41 -0000 Mailing-List: contact user-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@spark.apache.org Received: (qmail 37547 invoked by uid 99); 17 Mar 2015 21:23:41 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Mar 2015 21:23:41 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of chen.song.82@gmail.com designates 209.85.215.45 as permitted sender) Received: from [209.85.215.45] (HELO mail-la0-f45.google.com) (209.85.215.45) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Mar 2015 21:23:15 +0000 Received: by ladw1 with SMTP id w1so20252351lad.0 for ; Tue, 17 Mar 2015 14:23:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=z3G+k47jlVqb7LTjgu5ilfksmK5xY3e6gN4xaaWQ0uo=; b=TCiSmB86bpzfThEM0JWyDzf8TnVJaMhD6/PrgMjN5orI2lVmXwVWDatGJGjonjJRuk u0e2lewl9PyMOVxcDOnpkj9iS2yI23fbetYdiz2/miZwTz5Z3K+aVEWNyZgDBBTHYFZQ CwCxXCDWcBfWQyYgc8DUaTYGg+0HPthAwd1j+v1Kr5opfMPvDrRUsLRDhCLlEIljt1NS PSQFnS+NwhYblF3hOCp9QMp4NIs+tRo9a+9xmFdBRzkvirFUbuvyNA6N/P8cFlAR2JLi oQc+snjQEYC2HnTmSQhItMK7SWX6DSAwDY7POmVRkIG4SSglRpmwrg+ITnDedlYfHD69 lQ9Q== MIME-Version: 1.0 X-Received: by 10.112.62.135 with SMTP id y7mr61266392lbr.50.1426627394025; Tue, 17 Mar 2015 14:23:14 -0700 (PDT) Received: by 10.25.91.207 with HTTP; Tue, 17 Mar 2015 14:23:13 -0700 (PDT) Date: Tue, 17 Mar 2015 17:23:13 -0400 Message-ID: Subject: shuffle write size From: Chen Song To: "user@spark.apache.org" Content-Type: multipart/alternative; boundary=001a11c3f7ce0c4308051182936d X-Virus-Checked: Checked by ClamAV on apache.org --001a11c3f7ce0c4308051182936d Content-Type: text/plain; charset=UTF-8 I have a map reduce job that reads from three logs and joins them on some key column. The underlying data is protobuf messages in sequence files. Between mappers and reducers, the underlying raw byte arrays for protobuf messages are shuffled . Roughly, for 1G input from HDFS, there is 2G data output from map phase. I am testing spark jobs (v1.3.0) on the same input. I found that shuffle write is 3 - 4 times input size. I tried passing protobuf Message object and ArrayByte but neither gives good shuffle write output. Is there any good practice on shuffling * protobuf messages * raw byte array Chen --001a11c3f7ce0c4308051182936d Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
I have a map reduce job that reads from three logs and joi= ns them on some key column. The underlying data is protobuf messages in seq= uence files.=C2=A0Between=C2=A0mappers and reducers,=C2=A0the underlying ra= w byte arrays for protobuf messages are=C2=A0shuffled .=C2=A0Roughly, for 1= G input from HDFS, there is 2G data output from map phase.

I am testing spark jobs (v1.3.0) on the same input. I found that shuffle= write is 3 - 4 times input size. I tried passing protobuf Message object a= nd=C2=A0ArrayByte but neither gives good shuffle write output.
Is there any good practice on shuffling

* protobuf messages
* raw byte array

= Chen

--001a11c3f7ce0c4308051182936d--