crunch-dev mailing list archives

From Jinal Shah <jinalshah2...@gmail.com>
Subject Re: Object size
Date Mon, 24 Feb 2014 19:25:03 GMT
Thanks Josh, I have a few follow-up questions:
With the default scaleFactor, how much approximation error should we
assume? Something like +/- 1%?
How does scaleFactor affect the size of the object?
Could this become part of Crunch, as an enhancement to the current join strategy?

Thanks
Jinal


On Mon, Feb 24, 2014 at 1:01 PM, Josh Wills <jwills@cloudera.com> wrote:

> Ah, cool. The getSize() method returns a long: Crunch's estimate of the
> size of the object in bytes. Keep in mind that it's a very rough
> approximation, based on the size of the file on disk and whatever we know
> about the behavior of the DoFns applied to the PTable when it is
> processed. Each DoFn communicates that via its scaleFactor() function.
>
>
> On Mon, Feb 24, 2014 at 10:57 AM, Jinal Shah <jinalshah2007@gmail.com>
> wrote:
>
> > By size I meant the memory size; sorry for the confusion. That is, how
> > much memory will a PTable object require? Basically, if the object is
> > not that large and could fit in memory, I want to apply a map-side join
> > to optimize the join. I also want to determine which of the two is
> > smaller when setting up the left join.
> >
> >
> > On Mon, Feb 24, 2014 at 12:45 PM, Josh Wills <jwills@cloudera.com>
> > wrote:
> >
> > > There is the length() method, which will return a PObject<Long> with
> > > the number of elements in the PCollection. It requires running an MR
> > > job though.
> > >
> > > J
> > >
> > >
> > > On Mon, Feb 24, 2014 at 10:03 AM, Jinal Shah <jinalshah2007@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Is there a way in Crunch to find the size of a particular
> > > > PCollection or PTable as a whole?
> > > >
> > > > Thanks
> > > > Jinal
> > > >
> > >
> > >
> > >
> > > --
> > > Director of Data Science
> > > Cloudera <http://www.cloudera.com>
> > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
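
To make the idea in the thread concrete, here is a minimal plain-Java sketch of the reasoning, not Crunch's actual implementation. It assumes the estimate is modeled as the on-disk byte count multiplied by the scale factor of each processing step, and that a map-side join is only attempted when the smaller input fits a memory budget. The names estimateSize, choosePlan, JoinPlan and the memory budget are hypothetical and exist only for illustration; in a real pipeline the byte counts would come from getSize() on each PTable rather than being computed by hand.

```java
final class JoinSizeSketch {

    /** Hypothetical outcomes of the join-planning decision. */
    enum JoinPlan { LOAD_LEFT_IN_MEMORY, LOAD_RIGHT_IN_MEMORY, REDUCE_SIDE }

    /**
     * Simplified model of a size estimate: start from the bytes on disk and
     * multiply by the scale factor of every step applied to the data.
     * This is an assumption for illustration, not Crunch's real code.
     */
    static long estimateSize(long bytesOnDisk, double... scaleFactors) {
        double estimate = bytesOnDisk;
        for (double factor : scaleFactors) {
            estimate *= factor;
        }
        return (long) estimate;
    }

    /**
     * Pick a plan from two size estimates: if the smaller side fits within
     * the (hypothetical) memory budget, hold that side in memory for a
     * map-side join; otherwise fall back to a reduce-side join.
     */
    static JoinPlan choosePlan(long leftBytes, long rightBytes, long memoryBudgetBytes) {
        if (Math.min(leftBytes, rightBytes) > memoryBudgetBytes) {
            return JoinPlan.REDUCE_SIDE;
        }
        return leftBytes <= rightBytes
                ? JoinPlan.LOAD_LEFT_IN_MEMORY
                : JoinPlan.LOAD_RIGHT_IN_MEMORY;
    }

    public static void main(String[] args) {
        // Left input: 1,000 bytes on disk, then two steps that each halve it.
        long left = estimateSize(1_000L, 0.5, 0.5);
        // Right input: 10,000 bytes on disk, no further processing.
        long right = estimateSize(10_000L);
        System.out.println(left + " vs " + right + " bytes: "
                + choosePlan(left, right, 1_000L));
    }
}
```

Because the estimate is described above as very rough, a budget used this way should probably leave generous headroom rather than sit right at the available heap size.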
