crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-489) Add methods to create PCollections from Java Iterable to Pipeline interface
Date Sat, 24 Jan 2015 00:54:35 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290290#comment-14290290
] 

Gabriel Reid commented on CRUNCH-489:
-------------------------------------

Yep, I think having a {{read()}} method that takes a name would be pretty handy (as well as
being really useful in this context for allowing named created collections when running with
MRPipeline).

I really like the approach here; there have been quite a few times where I've wished something
like this existed (or would have wished for it if I'd had the idea).

If I'm reading it correctly, the parallelism parameter will only work with Text-based Writables
(looking at AvroType and WritableType). I'm thinking that it might be possible to get around
that by just writing multiple files in the createSourceTarget method of those classes, and
then parallelism would work regardless of the underlying type (as long as CombineFileInputFormat
doesn't get in the way). 
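
To make the "multiple files" idea a bit more concrete, here is a rough sketch of just the
partitioning step: it spreads the elements of an Iterable round-robin over N buckets, where
each bucket would then be written out as its own file. The class and method names
(IterableSplitter, split) are made up for illustration and are not part of the patch or of
Crunch, and the actual file writing inside createSourceTarget is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch or the CRUNCH-489 patch.
final class IterableSplitter {

  private IterableSplitter() {
  }

  /**
   * Distributes the elements of {@code values} round-robin over
   * {@code parallelism} buckets. The assumption is that each bucket would be
   * written to a separate file, so that each file can become its own input split.
   */
  static <T> List<List<T>> split(Iterable<T> values, int parallelism) {
    if (parallelism < 1) {
      throw new IllegalArgumentException("parallelism must be >= 1");
    }
    List<List<T>> buckets = new ArrayList<List<T>>(parallelism);
    for (int i = 0; i < parallelism; i++) {
      buckets.add(new ArrayList<T>());
    }
    int index = 0;
    for (T value : values) {
      // Round-robin keeps the bucket sizes within one element of each other.
      buckets.get(index % parallelism).add(value);
      index++;
    }
    return buckets;
  }
}
```

With something like this, the number of files (and so, hopefully, the number of map tasks)
would not depend on how the underlying type is serialized. Whether CombineFileInputFormat
would merge the small files back together is still an open question.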

Apart from that, a few really small nits I noticed with the current patch:
* Maybe CreatedCollection should have a different name (or at least a bit of javadoc), as
it is currently not that easy to know what it does based on that name. MemoryBasedCollection?
InputIterableCollection? I don't know. Similar comment also possibly applies to MapInputFn
and MapPairInputFn, or those classes could even be static inner classes of CreatedCollection
I guess.
* There are a few wildcard imports, which are not compliant with the non-existent coding conventions
* NLineInputFn is no longer directly testing the NLineInputSource, which is a bit confusing
(although it's definitely doing a valid test)
* CreatedCollection currently does some unnecessary null checking and default value setting
on CreatedCollection.getName()



> Add methods to create PCollections from Java Iterable to Pipeline interface
> ---------------------------------------------------------------------------
>
>                 Key: CRUNCH-489
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-489
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-489.patch, CRUNCH-489b.patch, CRUNCH-489c.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
