From: Xiangrui Meng
To: Sean Owen
Cc: dev
Date: Sun, 2 Nov 2014 10:44:25 -0800
Subject: Re: OOM when making bins in BinaryClassificationMetrics ?

Yes, if there are many distinct values, we need binning to compute the
ROC curve. Usually the scores are not evenly distributed, so we cannot
simply truncate the digits. Estimating the quantiles for binning is
necessary, similar to what RangePartitioner does:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
Limiting the number of bins is definitely useful. Do you have time to
work on it?

-Xiangrui

On Sun, Nov 2, 2014 at 9:34 AM, Sean Owen wrote:
> This might be a question for Xiangrui. Recently I was using
> BinaryClassificationMetrics to build a ROC curve for a classifier
> over a reasonably large number of points (~12M). The scores were all
> probabilities, so they tended to be almost entirely unique.
>
> The computation does some operations by key, and this ran out of
> memory. It's something you can solve with more than the default amount
> of memory, but in this case it did not seem useful to create a curve
> with such fine-grained resolution.
>
> I ended up just binning the scores so there were ~1000 unique values,
> and then it was fine.
>
> Does that sound generally useful as some kind of parameter? Or am I
> missing a trick here?
>
> Sean
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
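
For illustration, the quantile-based binning discussed in this thread can be
sketched in plain Python. This is not Spark's implementation or API: the
function names (quantile_boundaries, binned_roc, area_under) and the
parameters num_bins and sample_size are hypothetical, the data is assumed to
fit in memory as a list of (score, label) pairs with labels 0 or 1, and both
classes are assumed to be present. Cut points are estimated from a random
sample of the scores, in the spirit of the sampling that RangePartitioner
does, so that skewed scores still give roughly equally populated bins.

```python
import bisect
import random


def quantile_boundaries(scores, num_bins, sample_size=10000, seed=0):
    """Estimate up to num_bins - 1 cut points from a random sample of scores."""
    rng = random.Random(seed)
    if len(scores) <= sample_size:
        sample = scores
    else:
        sample = rng.sample(scores, sample_size)
    ordered = sorted(sample)
    cuts = [ordered[(i * len(ordered)) // num_bins] for i in range(1, num_bins)]
    # Heavily repeated scores can produce duplicate cut points; keep each once.
    return sorted(set(cuts))


def binned_roc(scores_and_labels, num_bins=1000):
    """Return ROC points (fpr, tpr) built from per-bin label counts.

    Assumes at least one positive (label 1) and one negative (label 0).
    """
    scores = [s for s, _ in scores_and_labels]
    cuts = quantile_boundaries(scores, num_bins)
    pos = [0] * (len(cuts) + 1)
    neg = [0] * (len(cuts) + 1)
    for score, label in scores_and_labels:
        b = bisect.bisect_right(cuts, score)  # index of the bin for this score
        if label == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    total_pos, total_neg = sum(pos), sum(neg)
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one bin at a time, starting from the highest scores.
    for b in range(len(cuts), -1, -1):
        tp += pos[b]
        fp += neg[b]
        points.append((fp / total_neg, tp / total_pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With num_bins=1000 the curve has at most 1001 points no matter how many
distinct scores the input contains, which is the property that keeps the
by-key aggregation small.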