incubator-crunch-dev mailing list archives

From "Kiyan Ahmadizadeh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-58) Implement PObject in Crunch/Scrunch
Date Wed, 12 Sep 2012 02:35:07 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453655#comment-13453655 ]

Kiyan Ahmadizadeh commented on CRUNCH-58:
-----------------------------------------

I included these methods mostly because the backing PCollection exposes them. All three seemed
like useful things to expose to the client, although this is debatable and I could be convinced
to remove some or all of them.

I didn't want to expose a getter for the backing PCollection for a couple of reasons:

1. I wanted the PObject interface to be agnostic regarding what actually backed the implementation.
Including a method in the interface that returned the backing PCollection would make this
impossible.  The importance of this in the context of Crunch is debatable, since a PCollection
is the mechanism through which all distributed computation has to happen.  PObjects act as
a lazy Future and later we might want to use that concept more generally.  

2. Exposing the backing PCollection would hurt the PObject abstraction by leaking implementation
details. It would give the client a means to initiate further distributed computation on the data
backing the PObject, which encourages bad practice with PObjects. PObjects should be used
for values small enough to fit into memory, so they can be worked with locally or shipped around
with do functions to act as side data for jobs. I think hiding the underlying PCollection
enforces this.
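
To make the two points above concrete, here is a minimal sketch of the shape such an interface
could take. It is not the code in CRUNCH-58.patch: the names PObjectSketch, LazyPObject, and max
are hypothetical, a java.util.function.Supplier stands in for running a pipeline over the backing
PCollection, and an in-memory List stands in for the PCollection itself. The interface exposes
only the value, so the client cannot reach whatever backs it, and the computation is deferred
until the value is first requested.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class PObjectSketch {

    /** Hypothetical interface: exposes only the value, never what backs it. */
    public interface PObject<T> {
        T getValue();
    }

    /** Hypothetical implementation that defers the computation and caches its result. */
    public static final class LazyPObject<T> implements PObject<T> {
        // Assumption: stands in for materializing a backing PCollection.
        private final Supplier<T> computation;
        private T value;
        private boolean computed = false;

        public LazyPObject(Supplier<T> computation) {
            this.computation = computation;
        }

        @Override
        public synchronized T getValue() {
            if (!computed) {
                // The (possibly expensive) work runs at most once, on first use.
                value = computation.get();
                computed = true;
            }
            return value;
        }
    }

    /** Hypothetical stand-in for a max method; the List plays the role of a PCollection. */
    public static PObject<Integer> max(List<Integer> backing) {
        // The returned type gives the caller no way to get back to 'backing'.
        return new LazyPObject<Integer>(() -> Collections.max(backing));
    }

    public static void main(String[] args) {
        PObject<Integer> m = max(Arrays.asList(3, 9, 4));
        // Nothing has been computed yet; getValue() triggers the work.
        System.out.println(m.getValue()); // prints 9
    }
}
```

Because callers only ever see PObject<T>, the implementation could later be backed by something
other than a PCollection without changing client code.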

Thoughts?  
                
> Implement PObject in Crunch/Scrunch
> -----------------------------------
>
>                 Key: CRUNCH-58
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-58
>             Project: Crunch
>          Issue Type: New Feature
>    Affects Versions: 0.3.0
>            Reporter: Kiyan Ahmadizadeh
>            Assignee: Kiyan Ahmadizadeh
>         Attachments: CRUNCH-58.patch
>
>
> FlumeJava has the concept of a PObject<T>, a container for a singleton of type T. It is meant
> to represent the result of a distributed computation that yields a singleton value (for example
> max, min, and length methods on PCollection<T>). Generally speaking, the result of any
> computation that combines/reduces a PCollection into a singleton value could be represented
> by a PObject.
> Like PCollection, a PObject defers distributed computation until its value is actually used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
