incubator-crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: Object reuse in Reducer and impact in Crunch
Date Thu, 28 Jun 2012 21:28:38 GMT
Gabriel,

Generally agree with your line of thought-- where is the attached test case?

J

On Thu, Jun 28, 2012 at 2:11 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> Hi guys,
>
> As you may have seen, the topic of the PTable#collectValues method came up today in the
user mailing list. I hadn't been aware of this method before, and when I took a closer look
I saw that it just creates a Collection of values based on the incoming Iterable, without
doing any kind of a deep copy of the contents of the Iterable. As far as I can see, something
similar (i.e. holding on to values from an Iterable from a reducer) is also done in the Join
methods.
>
> As Christian also pointed out (and added to the documentation for DoFn), this can be
an issue, as values made available as an Iterable in a reducer are re-used within Hadoop.
>
> This object re-use isn't a problem in Crunch wherever a non-identity mapping is used
between the serialization type and the PCollection type within the PType (for example, with
primitives and String). However, using Writable types or non-mapped Avro types won't work
(as shown in the attached test case).
>
> I think it's definitely a problem that PTable#collectValues (and probably some other
methods) doesn't work for Writables, or in a broader sense, that the semantics can change
for the Iterable that is passed in when processing a grouped table.
>
> One really easy (but also inefficient) way we could solve this would be to not use an
IdentityFn as the default mapping function in Writables and AvroType, and instead use a MapFn
that does a deep copy of the object (i.e. by serializing and deserializing itself in memory).
This is of course a pretty big overhead for a something that isn't necessary in a lot of cases.
>
> Another option I was considering was to do something like making the input and output
PTypes of a DoFn available to the DoFn, and adding a createDetachedValue method (or something
similar) to PType, which would then serialize and deserialize objects in order to make a clone
if necessary. With this approach, the clone method would have to be called within the collectValues
method (or any other method that is holding on to values outside of the iterator).
>
> I prefer the second approach, as it avoids the the waste of extra cloning/serialization
while still making it possible to get detached values out of an Iterable.
>
> Does anyone else have any thoughts on this?
>
> - Gabriel
>



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Mime
View raw message