Subject: Re: Problems with large dataset using collect() and broadcast()
From: Patrick Wendell
To: Will Yang
Cc: dev@spark.apache.org
Date: Wed, 24 Dec 2014 22:42:16 -0800

Hi Will,

When you call collect(), the item you are collecting needs to fit in memory on the driver. Is it possible your driver program does not have enough memory?

- Patrick

On Wed, Dec 24, 2014 at 9:34 PM, Will Yang wrote:
> Hi all,
> In my case I have a huge HashMap[(Int, Long), (Double, Double, Double)],
> several GB to tens of GB in size. After each iteration I need to collect()
> this HashMap, perform some calculation, and then broadcast() it to every
> node. I have 20GB per executor, and after it performs collect() the job
> gets stuck at "Added rdd_xx_xx", with no further response shown in the
> Application UI.
>
> I've tried lowering spark.shuffle.memoryFraction and
> spark.storage.memoryFraction, but it seems the job can only handle a
> HashMap of up to about 2GB. What should I optimize for such conditions?
>
> (ps: sorry for my bad English & grammar)
>
> Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
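[Editor's note: Patrick's point is that collect() materializes the entire result on the driver, so the driver JVM must be sized for it. Below is a minimal spark-submit sketch showing the relevant knobs; the class name, jar name, master, and memory sizes are hypothetical placeholders, not values from this thread.]

```shell
# Sketch only: adjust sizes to your cluster; class/jar names are hypothetical.
spark-submit \
  --class com.example.MyIterativeJob \
  --master yarn \
  --driver-memory 48g \
  --executor-memory 20g \
  --conf spark.driver.maxResultSize=0 \
  my-iterative-job.jar
```

--driver-memory must cover the collected HashMap plus JVM overhead (executor memory alone does not help collect()). spark.driver.maxResultSize, added in Spark 1.2, caps the total serialized size of results returned to the driver (default 1g; 0 disables the cap). Note also that Spark 1.x stored blocks in byte buffers limited to 2GB, so a single collected or broadcast block larger than that could fail regardless of heap size, which may explain the ~2GB ceiling observed here.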