crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Object size
Date Mon, 24 Feb 2014 19:59:07 GMT
On Mon, Feb 24, 2014 at 11:25 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:

> Thanks Josh, I have a few follow-up questions.
> With the default scaleFactor, how much approximation error should we
> assume, something like +/- 1%?
>

In the worst case, it can be arbitrarily wrong (although I suppose we're
bounded on the low end by zero). The primary sources of error are a) the
fact that serialized size on disk is less than (and sometimes significantly
less than) the in-memory size once Java's object overhead is included, and
b) scaleFactor may or may not accurately reflect the operations performed
by the DoFn. If I were a conservative man, and in this I am, I would assume
that the in-memory storage size of the data will be 2x whatever getSize()
reports, at least for purposes of deciding between an in-memory vs. a
reduce-side join.
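
As a rough illustration of that rule of thumb, here is a minimal sketch in
plain Java. It does not call Crunch at all: the class JoinChoice, the helper
fitsInMemory, and the memory budget are hypothetical names and values made
up for this example, and the estimate is assumed to come from something like
getSize().

```java
public final class JoinChoice {
    // Rule of thumb from the reply above: assume the in-memory size is
    // twice the reported estimate.
    static final long SAFETY_FACTOR = 2L;

    // Hypothetical helper (not part of Crunch): true if the padded
    // estimate fits within the given memory budget.
    static boolean fitsInMemory(long estimatedBytes, long memoryBudgetBytes) {
        if (estimatedBytes < 0 || memoryBudgetBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        // Divide the budget instead of multiplying the estimate so that a
        // very large estimate cannot overflow a long.
        return estimatedBytes <= memoryBudgetBytes / SAFETY_FACTOR;
    }

    public static void main(String[] args) {
        long estimate = 300L * 1024 * 1024; // assumed size estimate, in bytes
        long budget = 512L * 1024 * 1024;   // assumed memory available per task
        System.out.println(fitsInMemory(estimate, budget)
                ? "map-side join"
                : "reduce-side join");
    }
}
```

With these made-up numbers the padded estimate (600 MB) exceeds the 512 MB
budget, so the sketch falls back to the reduce-side join.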


> How does scaleFactor affect the size of the object?
>

It doesn't affect it; it only reports what the developer thinks the DoFn
will do to the size of any input it receives. Sometimes this is relatively
easy to determine, like when we have a FilterFn that is going to filter out
half of its inputs. For an arbitrary DoFn, it's harder to do precisely.
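
To make the idea concrete, here is a simplified model in plain Java of how a
declared scale factor could feed into a size estimate. This is an assumption
for illustration only, not Crunch's actual implementation: the class
ScaleFactorModel and the method estimateSize are hypothetical, and the model
simply multiplies the on-disk size by each function's declared factor.

```java
public final class ScaleFactorModel {
    // Hypothetical model (not Crunch code): start from the on-disk size and
    // multiply by the scale factor declared by each function in the chain.
    static long estimateSize(long bytesOnDisk, float... scaleFactors) {
        double estimate = bytesOnDisk;
        for (float factor : scaleFactors) {
            estimate *= factor;
        }
        // Truncate to whole bytes; the result is only a rough estimate.
        return (long) estimate;
    }

    public static void main(String[] args) {
        long onDisk = 1_000_000L; // assumed input size in bytes
        // A filter expected to drop half of its inputs would declare 0.5.
        System.out.println(estimateSize(onDisk, 0.5f));
    }
}
```

Note that the factor only changes the estimate. If the filter actually keeps
90% of its inputs, the data is 90% of the original size regardless of what
was declared.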


> Can this be a part of Crunch as an enhancement to the current Join
> strategy?
>

We have generally stayed away from any sort of intelligent join strategy
selection, although it's come up a couple of times during discussions on
the mailing list. One of our principles is to avoid magic wherever possible
and always give developers precise control over the operations performed
during a pipeline, so I would want to be careful about how we proceeded
w/this sort of thing.


>
> Thanks
> Jinal
>
>
> On Mon, Feb 24, 2014 at 1:01 PM, Josh Wills <jwills@cloudera.com> wrote:
>
> > Ah, cool. The long getSize() method will return Crunch's estimate of the
> > size of the object in bytes, but it's good to keep in mind that it's a
> > very rough approximation based on the size of the file on disk and any
> > info we have about the behavior of any DoFns that are applied to the
> > PTable when it is processed, which is communicated via the scaleFactor()
> > function on each DoFn.
> >
> >
> > On Mon, Feb 24, 2014 at 10:57 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:
> >
> > > By size I meant the memory size; sorry for the confusion. That is, how
> > > much memory will a PTable object require? Basically, what I'm trying to
> > > do is apply a map-side join to optimize the join if the object is not
> > > that large and could fit in memory. Depending on that, I also wanted to
> > > determine which one is smaller for use in the left join.
> > >
> > >
> > > On Mon, Feb 24, 2014 at 12:45 PM, Josh Wills <jwills@cloudera.com> wrote:
> > >
> > > > There is the length() method, which will return a PObject<Long> with
> > > > the number of elements in the PCollection. It requires running an MR
> > > > job though.
> > > >
> > > > J
> > > >
> > > >
> > > > On Mon, Feb 24, 2014 at 10:03 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is there a way in Crunch to find the size of a particular
> > > > > PCollection or PTable as a whole?
> > > > >
> > > > > Thanks
> > > > > Jinal
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Director of Data Science
> > > > Cloudera <http://www.cloudera.com>
> > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
