Delivered-To: mailing list user@spark.apache.org
From: Davies Liu
Date: Fri, 2 Sep 2016 13:21:51 -0700
Subject: Re: Is cache() still necessary for Spark DataFrames?
To: apu
Cc: user

Caching an RDD/DataFrame always has some cost. In this case, I'd suggest
not caching the DataFrame: first() is usually fast enough, because it only
computes the partitions it needs.

On Fri, Sep 2, 2016 at 1:05 PM, apu wrote:
> When I first learnt Spark, I was told that cache() is desirable anytime one
> performs more than one Action on an RDD or DataFrame. For example, consider
> the PySpark toy example below; it shows two approaches to doing the same
> thing.
>
> # Approach 1 (bad?)
> df2 = someTransformation(df1)
> a = df2.count()
> b = df2.first() # This step could take long, because df2 has to be created all over again
>
> # Approach 2 (good?)
> df2 = someTransformation(df1)
> df2.cache()
> a = df2.count()
> b = df2.first() # Because df2 is already cached, this action is quick
> df2.unpersist()
>
> The second approach shown above is somewhat clunky, because it requires one
> to cache any dataframe that will be Acted on more than once, followed by the
> need to call unpersist() later to free up memory.
>
> So my question is: is the second approach still necessary/desirable when
> operating on DataFrames in newer versions of Spark (>=1.6)?
>
> Thanks!!
>
> Apu
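
As a rough illustration of the reply's point that first() only computes the partitions it needs, here is a toy pure-Python model of lazy, partitioned evaluation. It is not Spark's API: the LazyFrame class and its `computed` counter are hypothetical names invented for this sketch, and the model deliberately ignores the memory and bookkeeping cost of caching that the reply mentions.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated, partitioned dataset (NOT Spark)."""

    def __init__(self, partitions, transform):
        self.partitions = partitions    # list of lists of raw rows
        self.transform = transform      # per-row function, applied lazily
        self.computed = 0               # number of partition evaluations so far
        self._cache_requested = False
        self._cache = None              # filled by the first full evaluation

    def cache(self):
        # Like the real cache(), this only marks the data for caching;
        # nothing is computed until an action runs.
        self._cache_requested = True
        return self

    def _compute(self, i):
        if self._cache is not None:
            return self._cache[i]       # served from cache, no recomputation
        self.computed += 1
        return [self.transform(row) for row in self.partitions[i]]

    def count(self):
        # A full action: every partition must be evaluated.
        parts = [self._compute(i) for i in range(len(self.partitions))]
        if self._cache_requested and self._cache is None:
            self._cache = parts
        return sum(len(p) for p in parts)

    def first(self):
        # Stops at the first non-empty partition instead of scanning all.
        for i in range(len(self.partitions)):
            part = self._compute(i)
            if part:
                return part[0]
        return None


raw = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Approach 1: no cache. count() touches 4 partitions, first() touches 1 more.
uncached = LazyFrame(raw, lambda x: x * 10)
a = uncached.count()
b = uncached.first()
uncached_work = uncached.computed

# Approach 2: cache. count() touches 4 partitions, first() reads the cache.
cached = LazyFrame(raw, lambda x: x * 10).cache()
a2 = cached.count()
b2 = cached.first()
cached_work = cached.computed
```

In this toy model, skipping the cache costs one extra partition evaluation out of four, while caching has to hold all four partitions in memory. That trade-off is why caching may not pay off when the second action is as cheap as first().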