crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-449) Add sequentialDo function for injecting arbitrary non-parallel code
Date Mon, 28 Jul 2014 15:53:40 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076326#comment-14076326 ]

Gabriel Reid commented on CRUNCH-449:
-------------------------------------

Sorry I took so long to take a look at this. Looks interesting -- at first I found it a bit
difficult to figure out what exactly it would be used for (and what the advantage is over
just calling Pipeline.run at certain points), but it looks like this opens up a whole
lot of other opportunities to indirectly influence the job plan without having to
worry about exactly how it's done.

I noticed that SeqDoFn.dependsOn(String, PCollection) is called implicitly from
PCollectionImpl.sequentialDo, but SeqDoFn.dependsOn(String, Target) always needs to be called
explicitly. I guess this makes sense, but maybe it would be handy to change
PCollection.sequentialDo to accept a String argument that would be used as the label of the
incoming PCollection dependency. I'm thinking that would make it easier to retrieve that
PCollection later by name from within the SeqDoFn.
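
To make the suggestion concrete, here is a rough sketch of what a labeled sequentialDo could look like. It uses small stand-in types rather than the real Crunch classes: the SeqDoFn and PCollection shapes below, the getInput accessor, and the describe helper are all assumptions for illustration and are not taken from the patch. The real method would presumably defer execution to the planner, whereas this sketch runs the function immediately.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabeledDependencySketch {

    // Hypothetical stand-in for the patch's SeqDoFn; names and signatures are assumed.
    abstract static class SeqDoFn<Output> {
        private final Map<String, PCollection<?>> inputs = new LinkedHashMap<>();

        // Registers a PCollection dependency under a label.
        SeqDoFn<Output> dependsOn(String label, PCollection<?> input) {
            inputs.put(label, input);
            return this;
        }

        // Hypothetical accessor: looks up a registered dependency by its label.
        PCollection<?> getInput(String label) {
            PCollection<?> input = inputs.get(label);
            if (input == null) {
                throw new IllegalArgumentException("No dependency labeled " + label);
            }
            return input;
        }

        abstract Output execute();
    }

    // Hypothetical stand-in for PCollection, not the real Crunch interface.
    interface PCollection<T> {
        String getName();

        // Proposed variant: the caller supplies the label for the incoming dependency.
        // This sketch executes at once instead of deferring to a planner.
        default <Output> Output sequentialDo(String label, SeqDoFn<Output> fn) {
            return fn.dependsOn(label, this).execute();
        }
    }

    // Illustrative helper: the function retrieves its input by the caller's label.
    public static String describe(String label, String name) {
        PCollection<String> events = () -> name;
        return events.sequentialDo(label, new SeqDoFn<String>() {
            @Override
            String execute() {
                return label + " -> " + getInput(label).getName();
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(describe("events", "raw-events"));
    }
}
```

The point is only that the label chosen at the call site is the same one used for the lookup inside the function, so the caller does not need to know how the implicit dependency was named.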

Can the "Output" generic parameter of SeqDoFn be bounded by PCollection (i.e. <Output extends
PCollection<?>>), if only because that might make the documentation easier to write? Or is
it possible to have a SeqDoFn that is bound to something other than a PCollection?
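
As a sketch of what the bound would mean in practice, again with stand-in types (the PCollection interface, SeqDoFn class, InMemoryCollection, and LoadedRows below are hypothetical and only mirror the shape being discussed, not the patch itself):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoundedOutputSketch {

    // Hypothetical stand-in for Crunch's PCollection; not the real interface.
    interface PCollection<T> {
        List<T> values();
    }

    // Hypothetical stand-in for SeqDoFn with the Output parameter bounded by PCollection.
    abstract static class SeqDoFn<Output extends PCollection<?>> {
        abstract Output execute();
    }

    // Simple in-memory PCollection used only for this illustration.
    static final class InMemoryCollection<T> implements PCollection<T> {
        private final List<T> values;

        InMemoryCollection(List<T> values) {
            this.values = new ArrayList<>(values);
        }

        @Override
        public List<T> values() {
            return values;
        }
    }

    // Accepted by the compiler: the output type is a PCollection.
    static final class LoadedRows extends SeqDoFn<InMemoryCollection<String>> {
        @Override
        InMemoryCollection<String> execute() {
            return new InMemoryCollection<>(Arrays.asList("row-1", "row-2"));
        }
    }

    // Rejected by the compiler under the bound, since String is not a PCollection:
    // static final class Bad extends SeqDoFn<String> { ... }

    public static List<String> run() {
        return new LoadedRows().execute().values();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With the bound in place, the type signature itself documents that a SeqDoFn always yields a PCollection; the cost is that a function returning anything else would no longer be expressible.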

I noticed that the PCollection class has a commented-out version of the sequentialDo method
that needs to be removed.

I know you're probably on top of this, but I'll just point it out anyway: more docs in SeqDoFn,
particularly on the abstract methods, would be really good. It's not immediately obvious exactly
how it is intended to be used.

Also, more tests demonstrating additional use cases (target isn't created, dependent on multiple
targets, dependent on multiple PCollections, dependent on a combination of targets and PCollections)
would be really handy, if only as documentation of how this new functionality can be used.

> Add sequentialDo function for injecting arbitrary non-parallel code
> -------------------------------------------------------------------
>
>                 Key: CRUNCH-449
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-449
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-449.patch, CRUNCH-449b.patch
>
>
> I've been noodling on this one for a while: how to add the ability to execute some code
> if and only if one or more targets are created, and have that executed code (optionally) return
> one or more new PCollections as a result. I was thinking that this functionality could be
> wired into libraries to do things like bulk loading HBase tables or running Sqoop jobs as
> part of Crunch pipelines automatically.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
