From: Gavin Yue
Date: Thu, 17 Sep 2015 14:17:41 -0700
Subject: Cache after filter Vs Writing back to HDFS
To: user

For a large dataset, I want to filter out some records and then do the compute-intensive work.

What I am doing now:

Data.filter(somerules).cache()
Data.count()

Data.map(timeintensivecompute)

But this sometimes takes an unusually long time due to cache misses and recomputation.

So I changed to this approach:

Data.filter(somerules).saveAsTextFile()

sc.textFile().map(timeintensivecompute)

The second one is even faster.
How can I tune the job to reach maximum performance?

Thank you.
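
For reference, here is a minimal Scala sketch of the first approach, not a definitive fix. It assumes the cache misses come from partitions that do not fit in memory: cache() uses the MEMORY_ONLY storage level, which drops such partitions and recomputes them on the next access, whereas MEMORY_AND_DISK_SER spills them to local disk instead. The sketch also keeps a reference to the persisted RDD. If the real code mirrors the snippet above literally, the later count() and map() run on the original Data rather than on the filtered, cached RDD. The names someRules and timeIntensiveCompute, the object name, and the input path argument are hypothetical stand-ins.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheAfterFilter {
  // Hypothetical stand-ins for the poster's `somerules` and `timeintensivecompute`.
  def someRules(line: String): Boolean = line.nonEmpty
  def timeIntensiveCompute(line: String): Int = line.length

  def main(args: Array[String]): Unit = {
    // Local master for illustration only; drop setMaster when using spark-submit.
    val conf = new SparkConf().setAppName("cache-after-filter").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input path, passed as the first argument.
    val data = sc.textFile(args(0))

    // Keep the reference returned by persist(); partitions that do not fit
    // in memory are written to local disk rather than dropped and recomputed.
    val filtered = data.filter(someRules).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action materializes the persisted partitions.
    println(s"records after filter: ${filtered.count()}")

    // The expensive step now reads from the persisted RDD, not from `data`.
    val results = filtered.map(timeIntensiveCompute)
    println(s"records processed: ${results.count()}")

    // Release the cached blocks once they are no longer needed.
    filtered.unpersist()
    sc.stop()
  }
}
```

Whether this beats writing to HDFS and reading back depends on the data size, executor memory, and serialization cost, so both variants are worth timing on the actual cluster.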