crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-449) Add sequentialDo function for injecting arbitrary non-parallel code
Date Mon, 28 Jul 2014 19:28:39 GMT


Micah Whitacre commented on CRUNCH-449:

* SeqDoFn should probably have access to a Configuration object for the pipeline/target in
the execute method.  In the case you give, where someone wants to bulk load an HFile Target
into HBase, a Configuration for accessing the FileSystem would be useful.
* +1 to Javadoc.  Specifically, it should cover the relationship between when getOutput and
execute are called and whether any execution order is guaranteed.  It should also cover thread
safety/concurrent execution guarantees as well as blocking operations.
* Is calling it a DoFn really appropriate?  Currently in Crunch a DoFn operates on each element
of a PCollection.  This instead essentially fork/joins pipeline stages.  I don't have a better
name unfortunately.
* Should SeqDoFn expose the collection of labels for targets and PCollections, rather than
only allowing lookup by name?
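
To make the first and last points concrete, here is a rough sketch of the shape I have in
mind. It is not the API in the attached patches: StepContext, Settings, SequentialStep and
their methods are hypothetical names, and Settings is a plain key/value stand-in for a Hadoop
Configuration so that the example compiles without Crunch or Hadoop on the classpath.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SeqDoSketch {

    /** Hypothetical stand-in for a Hadoop Configuration: plain key/value settings. */
    static final class Settings {
        private final Map<String, String> values = new LinkedHashMap<>();

        Settings set(String key, String value) {
            values.put(key, value);
            return this;
        }

        String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    /** Hypothetical context passed to the sequential step when it executes. */
    static final class StepContext {
        private final Settings settings;
        private final Map<String, String> targetPaths; // label -> output path

        StepContext(Settings settings, Map<String, String> targetPaths) {
            this.settings = settings;
            this.targetPaths = new LinkedHashMap<>(targetPaths);
        }

        /** Gives the step access to configuration, e.g. to reach the FileSystem. */
        Settings getSettings() {
            return settings;
        }

        /** Exposes every target label instead of only supporting lookup by name. */
        Set<String> getTargetLabels() {
            return Collections.unmodifiableSet(targetPaths.keySet());
        }

        String getTargetPath(String label) {
            String path = targetPaths.get(label);
            if (path == null) {
                throw new IllegalArgumentException("Unknown target label: " + label);
            }
            return path;
        }
    }

    /** Hypothetical non-parallel step, run once after its targets exist. */
    interface SequentialStep {
        String execute(StepContext context);
    }

    public static void main(String[] args) {
        Settings settings = new Settings().set("fs.defaultFS", "hdfs://example");
        Map<String, String> targets = new LinkedHashMap<>();
        targets.put("hfiles", "/tmp/hfiles");
        StepContext context = new StepContext(settings, targets);

        // A bulk-load style step: reads the target path and the file system setting.
        SequentialStep bulkLoad = ctx -> "load " + ctx.getTargetPath("hfiles")
                + " via " + ctx.getSettings().get("fs.defaultFS", "file:///");

        System.out.println(context.getTargetLabels());
        System.out.println(bulkLoad.execute(context));
    }
}
```

The point of the sketch is only that the step receives a single context object carrying both
the configuration and the full set of labels, so a library author does not need to know the
label names ahead of time.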

> Add sequentialDo function for injecting arbitrary non-parallel code
> -------------------------------------------------------------------
>                 Key: CRUNCH-449
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-449.patch, CRUNCH-449b.patch
> I've been noodling on this one for a while: how to add the ability to execute some code
> if and only if one or more targets are created, and have that executed code (optionally) return
> one or more new PCollections as a result. I was thinking that this functionality could be
> wired in to libraries to do things like bulk loading HBase tables or running Sqoop jobs as
> part of Crunch pipelines automatically.

This message was sent by Atlassian JIRA
