In-Reply-To: <1459441588178-16944.post@n3.nabble.com>
From: Michael Armbrust
Date: Fri, 1 Apr 2016 13:29:11 -0700
Subject: Re: What influences the space complexity of Spark operations?
To: Steve Johnston
Cc: "dev@spark.apache.org"

Blocking operators like Sort, Join, or Aggregate will put all of the data for a whole partition into a hash table or array. However, if you are running Spark 1.5+, we should be spilling to disk. In Spark 1.6, if you are seeing OOMs for SQL operations, you should report it as a bug.
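
To make the idea of spilling concrete, here is a minimal sketch in plain Python of a sort that writes sorted runs to temporary files once an in-memory buffer fills up, then merges the runs lazily. It is only an illustration of the general technique, not Spark's implementation; the name `external_sort` and the `max_in_memory` threshold are made up for this example.

```python
import heapq
import pickle
import tempfile


def external_sort(records, max_in_memory=1000):
    """Yield records in sorted order, buffering at most max_in_memory of them.

    Illustrative only: a generic external sort, not Spark's code.
    """
    runs = []    # temporary files, each holding one sorted run
    buffer = []  # records currently held in memory

    def spill():
        # Sort the buffer, write it out as one run, then free the memory.
        buffer.sort()
        run_file = tempfile.TemporaryFile()
        for rec in buffer:
            pickle.dump(rec, run_file)
        run_file.seek(0)
        runs.append(run_file)
        buffer.clear()

    def read_run(run_file):
        # Stream one spilled run back from disk, one record at a time.
        try:
            while True:
                yield pickle.load(run_file)
        except EOFError:
            run_file.close()

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_in_memory:
            spill()

    # Whatever is left stays in memory as the final sorted run.
    buffer.sort()

    # heapq.merge combines already-sorted inputs without loading them fully.
    yield from heapq.merge(buffer, *(read_run(f) for f in runs))
```

With a small threshold the sort still completes, trading memory for disk I/O, for example `list(external_sort(rows, max_in_memory=100))`.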

On Thu, Mar 31, 2016 at 9:26 AM, Steve Johnston <sjohnston@algebraixdata.com> wrote:
What we’ve observed

Increasing the number of partitions (and thus decreasing the partition size) seems to reliably help avoid OOM errors. To demonstrate this, we used a single executor and loaded a small table into a DataFrame, persisted it with MEMORY_AND_DISK, repartitioned it, and joined it to itself. Varying the number of partitions identifies a threshold between completing the join and incurring an OOM error.


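from pyspark import StorageLevel

# Load the raw table; converter and schema are defined elsewhere.
lineitem = sc.textFile('lineitem.tbl').map(converter)
lineitem = sqlContext.createDataFrame(lineitem, schema)
# Persist the DataFrame, letting blocks fall back to disk.
lineitem.persist(StorageLevel.MEMORY_AND_DISK)
# partition_count is the value varied between runs.
repartitioned = lineitem.repartition(partition_count)
# Self-join with no join condition given.
joined = repartitioned.join(repartitioned)
joined.show()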
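
The threshold search described above could be automated roughly as follows. This is a sketch under assumptions: `run_join` is a hypothetical callable that runs the join for a given partition count and returns True if it completes and False if it fails with an OOM, and success is assumed to be monotonic in the partition count, which the observations suggest but do not prove.

```python
def find_min_partitions(run_join, low=1, high=1024):
    """Return the smallest partition count in [low, high] for which
    run_join(count) succeeds, or None if even `high` fails.

    run_join is a hypothetical callback; it is assumed that once the
    join succeeds at some count, it also succeeds at every larger one.
    """
    if not run_join(high):
        return None
    # Binary search for the first count at which the join completes.
    while low < high:
        mid = (low + high) // 2
        if run_join(mid):
            high = mid
        else:
            low = mid + 1
    return low
```

Binary search keeps the number of (slow) join runs logarithmic in the search range; if the monotonicity assumption does not hold, a linear sweep would be the safer choice.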
 
Questions

Generally, what influences the space complexity of Spark operations? Is it the case that a single partition of each operand’s data set plus a single partition of the resulting data set all need to fit in memory at the same time? We can see where the transformations (for, say, joins) are implemented in the source code (for the example above, BroadcastNestedLoopJoin), but they seem to be based on virtualized iterators; where in the code is the partition data for the inputs and outputs actually materialized?
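
As a rough illustration of the distinction this question draws, the sketch below uses plain Python generators. It is not Spark's code and does not describe the internals of BroadcastNestedLoopJoin; all names are made up. A streaming operator pulls one row at a time from its child, while a nested-loop join has to buffer one whole input before it can emit its first row, and that buffering is where data gets materialized.

```python
def scan(rows):
    # Leaf operator: streams rows one at a time.
    for row in rows:
        yield row


def filter_rows(child, predicate):
    # Streaming operator: holds only the current row.
    for row in child:
        if predicate(row):
            yield row


def nested_loop_join(left, right, condition):
    # Blocking on the right input: it is fully materialized into a list
    # before the first output row is produced.
    buffered = list(right)
    # The left input is still consumed lazily, one row at a time.
    for left_row in left:
        for right_row in buffered:
            if condition(left_row, right_row):
                yield (left_row, right_row)


# Usage: the output is itself an iterator, so result rows are produced
# one by one rather than collected into a full result partition.
pairs = list(nested_loop_join(scan(range(3)), scan(range(3)),
                              lambda l, r: l < r))
```

In this toy model the peak memory is driven by the buffered side, not by the streamed side or the output.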


Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

