From: Josh Wills
To: dev
Reply-To: dev@crunch.apache.org
Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Date: Mon, 24 Feb 2014 11:59:07 -0800
Subject: Re: Object size

On Mon, Feb 24, 2014 at 11:25 AM, Jinal Shah wrote:

> Thanks Josh, I have a few following questions
> so let's say with the default scaleFactor how much approximation should we
> assume like +/- 1%?
>

In the worst case, it can be arbitrarily wrong (although I suppose we're
bounded on the low end by zero.) The primary sources of error are a) the
fact that serialized size on disk is less than (and sometimes significantly
less than) Java's object overhead and b) scaleFactor may or may not
accurately reflect the operations performed by the DoFn. If I was a
conservative man, and in this I am, I would assume that the in-memory
storage size of the data will be 2x whatever scaleFactor reports it as, at
least for purposes of deciding between an in-memory vs. a reduce-side join.

> How does scaleFactor affect the size of the object?
>

It doesn't affect it, it only reports what the developer thinks the DoFn
will do to any input it receives. Sometimes this is relatively easy to
determine, like if we have a FilterFn that is going to filter out half of
its inputs. For an arbitrary DoFn, it's harder to do precisely.

> Can this be a part of Crunch as an enhancement to the current Join
> strategy?
>

We have generally stayed away from any sort of intelligent join strategy
selection, although it's come up a couple of times during discussions on
the mailing list. One of our principles is to avoid magic wherever possible
and always give developers precise control over the operations performed
during a pipeline, so I would want to be careful about how we proceeded
w/this sort of thing.

>
> Thanks
> Jinal
>
>
> On Mon, Feb 24, 2014 at 1:01 PM, Josh Wills wrote:
>
> > Ah, cool. the long getSize() method will return Crunch's estimate of the
> > size of the object in bytes, but it's good to keep in mind that it's a very
> > rough approximation based on the size of the file on disk and any info we
> > have about the behavior of any DoFns that are applied to the PTable when it
> > is processed, which is communicated via the scaleFactor() function on each
> > DoFn.
> >
> >
> > On Mon, Feb 24, 2014 at 10:57 AM, Jinal Shah wrote:
> >
> > > By size I meant the memory size sorry for the confusion. Like how much
> > > memory will a PTable object require. Basically what I'm trying to do is if
> > > the object is not that large and if it could fit in memory I wanted to
> > > apply map-side join to optimize the join and depending on that I also
> > > wanted to determine which one is smaller to use the Left join.
> > >
> > >
> > > On Mon, Feb 24, 2014 at 12:45 PM, Josh Wills wrote:
> > >
> > > > There is the length() method, which will return a PObject with the
> > > > number of elements in the PCollection. It requires running an MR job
> > > > though.
> > > >
> > > > J
> > > >
> > > >
> > > > On Mon, Feb 24, 2014 at 10:03 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is there a way possible in crunch to find the size of a particular
> > > > > PCollection or PTable in whole.
> > > > >
> > > > > Thanks
> > > > > Jinal
> > > > >
> > > >
> > > >
> > > > --
> > > > Director of Data Science
> > > > Cloudera
> > > > Twitter: @josh_wills
> > > >
> > >
> >
> >
> > --
> > Director of Data Science
> > Cloudera
> > Twitter: @josh_wills
> >
>

--
Director of Data Science
Cloudera
Twitter: @josh_wills
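
The rule of thumb from this thread (take the size estimate, assume the in-memory footprint is about 2x that, then pick between an in-memory and a reduce-side join) can be sketched in plain Java. This is only an illustration, not Crunch code: the class JoinSizing, its methods, and the JoinStrategy enum are hypothetical names, and multiplying the on-disk size by each scale factor is an assumption about how such an estimate might compose, not a statement of what Crunch does internally.

```java
// Hypothetical helper, independent of Crunch; every name here is illustrative.
final class JoinSizing {

    enum JoinStrategy { MAP_SIDE, REDUCE_SIDE }

    // The thread suggests assuming in-memory size is 2x the reported estimate.
    static final long SAFETY_MULTIPLIER = 2L;

    /**
     * Assumed model: start from the bytes on disk and multiply by each
     * developer-supplied scale factor (e.g. 0.5 for a filter that drops half
     * of its input). The result is only as good as those guesses.
     */
    static long estimateBytes(long bytesOnDisk, double... scaleFactors) {
        if (bytesOnDisk < 0) {
            throw new IllegalArgumentException("bytesOnDisk must be >= 0");
        }
        double estimate = bytesOnDisk;
        for (double factor : scaleFactors) {
            if (factor < 0) {
                throw new IllegalArgumentException("scale factor must be >= 0");
            }
            estimate *= factor;
        }
        return Math.round(estimate);
    }

    /**
     * Picks a map-side join only if twice the estimate fits in the budget.
     * Dividing the budget avoids overflow from multiplying the estimate.
     */
    static JoinStrategy choose(long estimatedBytes, long memoryBudgetBytes) {
        if (estimatedBytes < 0 || memoryBudgetBytes < 0) {
            throw new IllegalArgumentException("sizes must be >= 0");
        }
        return estimatedBytes <= memoryBudgetBytes / SAFETY_MULTIPLIER
                ? JoinStrategy.MAP_SIDE
                : JoinStrategy.REDUCE_SIDE;
    }

    public static void main(String[] args) {
        long onDisk = 64L * 1024 * 1024;           // 64 MiB input file
        long estimate = estimateBytes(onDisk, 0.5); // one filter halving the data
        long budget = 128L * 1024 * 1024;          // memory set aside for the join
        System.out.println(estimate + " bytes -> " + choose(estimate, budget));
    }
}
```

In a real pipeline the estimate would come from getSize() on the smaller table rather than from estimateBytes, and the developer would still make the final call explicitly, in keeping with the "avoid magic" principle above.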