crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: MultipleOutputs in Crunch
Date Wed, 07 Aug 2013 18:18:31 GMT
If your data is going through a reducer, there's support for something like
this built in to Crunch, although it's not (yet) very developer-friendly.

If you have a custom Partitioner that maps each key to a pre-determined
partition id, you can implement a custom FileNamingScheme[1] and have it
map each output partition to a fixed filename that represents the content
under that key. I believe most (or all) Target implementations can be
instantiated with a FileNamingScheme object.
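
A rough, library-free sketch of that idea follows. The class and method
names (KeyedOutputNaming, partitionFor, reduceOutputName) and the
"<key>-r-<id>" filename pattern are made up for illustration; they are not
the Crunch or Hadoop API. The sketch only shows the two lookups involved:
key to partition id (the Partitioner's role) and partition id to filename
(the FileNamingScheme's role).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative stand-in only: NOT the Crunch FileNamingScheme or the Hadoop Partitioner.
class KeyedOutputNaming {

    // Pre-determined table: key -> partition id, assigned in constructor order.
    private final Map<String, Integer> partitionByKey = new LinkedHashMap<>();
    // Reverse table: partition id -> output filename for that key.
    private final Map<Integer, String> fileNameByPartition = new LinkedHashMap<>();

    KeyedOutputNaming(String... keys) {
        for (String key : keys) {
            if (!partitionByKey.containsKey(key)) {
                int id = partitionByKey.size();
                partitionByKey.put(key, id);
                // Assumed naming pattern: the key followed by a zero-padded partition id.
                fileNameByPartition.put(id, key + "-r-" + String.format(Locale.ROOT, "%05d", id));
            }
        }
    }

    // Role of the custom Partitioner: all records with this key go to one known partition.
    int partitionFor(String key) {
        Integer id = partitionByKey.get(key);
        if (id == null) {
            throw new IllegalArgumentException("No partition assigned for key: " + key);
        }
        return id;
    }

    // Role of the custom naming scheme: name a reducer's output after the key it holds.
    String reduceOutputName(int partitionId) {
        String name = fileNameByPartition.get(partitionId);
        if (name == null) {
            throw new IllegalArgumentException("Unknown partition id: " + partitionId);
        }
        return name;
    }

    // Number of distinct partition ids handed out.
    int numPartitions() {
        return partitionByKey.size();
    }

    public static void main(String[] args) {
        String[] keys = {"errors", "warnings", "info"};
        KeyedOutputNaming naming = new KeyedOutputNaming(keys);
        for (String key : keys) {
            int p = naming.partitionFor(key);
            System.out.println(key + " -> partition " + p + " -> " + naming.reduceOutputName(p));
        }
    }
}
```

In a real job the first lookup would live in the Partitioner and the second
in the FileNamingScheme implementation, and the job would need at least as
many reducers as there are partition ids.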

- Gabriel

[1]
http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/io/FileNamingScheme.html


On Wed, Aug 7, 2013 at 3:04 PM, Micah Whitacre <mkwhit@gmail.com> wrote:

> I believe you could accomplish this by creating PCollections for each of
> the key/values you want to persist and then writing[1] the PCollections out
> to whichever directories make the most sense.
>
> [1] -
>
> http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/Pipeline.html#write(org.apache.crunch.PCollection, org.apache.crunch.Target)
>
>
> On Wed, Aug 7, 2013 at 3:31 AM, Mridul Das <d.mridul@gmail.com> wrote:
>
> > Hi,
> >    MultipleOutputs enables us to generate custom file names based on
> > keys/values.
> >    How do we achieve this in Crunch?
> >
> > Regards,
> > Mridul
> >
>
