Date: Mon, 24 Feb 2014 13:25:03 -0600
Subject: Re: Object size
From: Jinal Shah
To: dev@crunch.apache.org
Reply-To: dev@crunch.apache.org

Thanks Josh,

I have a few follow-up questions. With the default scaleFactor, how much
approximation error should we assume, something like +/- 1%? How does
scaleFactor affect the size of the object? Could this become part of Crunch
as an enhancement to the current Join strategy?

Thanks
Jinal

On Mon, Feb 24, 2014 at 1:01 PM, Josh Wills wrote:

> Ah, cool. The long getSize() method will return Crunch's estimate of the
> size of the object in bytes, but it's good to keep in mind that it's a very
> rough approximation based on the size of the file on disk and any info we
> have about the behavior of any DoFns that are applied to the PTable when it
> is processed, which is communicated via the scaleFactor() function on each
> DoFn.
>
> On Mon, Feb 24, 2014 at 10:57 AM, Jinal Shah wrote:
>
> > By size I meant the memory size, sorry for the confusion: how much
> > memory a PTable object will require. Basically, if the object is not
> > that large and could fit in memory, I want to apply a map-side join to
> > optimize the join. I also want to determine which table is smaller so
> > that I can use it in the left join.
> >
> > On Mon, Feb 24, 2014 at 12:45 PM, Josh Wills wrote:
> >
> > > There is the length() method, which will return a PObject with the
> > > number of elements in the PCollection. It requires running an MR job
> > > though.
> > >
> > > J
> > >
> > > On Mon, Feb 24, 2014 at 10:03 AM, Jinal Shah wrote:
> > >
> > > > Hi,
> > > >
> > > > Is there a way in Crunch to find the size of a particular
> > > > PCollection or PTable as a whole?
> > > >
> > > > Thanks
> > > > Jinal
> > >
> > > --
> > > Director of Data Science
> > > Cloudera
> > > Twitter: @josh_wills
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
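
The size-based join choice discussed in this thread can be sketched roughly as
follows. This is a hypothetical, self-contained Java model, not Crunch code:
JoinPlanner, estimateBytes, choose, and the memory budget are names assumed
for illustration. Multiplying the on-disk size by each step's scale factor is
an assumption drawn from the description of getSize() and scaleFactor() above,
so the result is only as good as those factors.

```java
// Hypothetical model of a size-based join choice; none of these names are Crunch APIs.
final class JoinPlanner {

    enum Strategy { MAP_SIDE, REDUCE_SIDE }

    /**
     * Assumed estimate: start from the byte count on disk and multiply by the
     * scale factor of each processing step in turn. Never returns less than 1.
     */
    static long estimateBytes(long bytesOnDisk, double... scaleFactors) {
        double estimate = bytesOnDisk;
        for (double factor : scaleFactors) {
            estimate *= factor;
        }
        return Math.max(1L, (long) estimate);
    }

    /** Picks a map-side join only when the smaller input's estimate fits the budget. */
    static Strategy choose(long leftBytes, long rightBytes, long memoryBudgetBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller <= memoryBudgetBytes ? Strategy.MAP_SIDE : Strategy.REDUCE_SIDE;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a step that keeps about half of a large input,
        // and a step that doubles a small one.
        long left = estimateBytes(10_000_000L, 0.5);
        long right = estimateBytes(200_000L, 2.0);
        System.out.println(choose(left, right, 1_000_000L));
    }
}
```

Because the estimate is rough, a conservative memory budget is probably safer
than one sized exactly to the available heap.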