crunch-user mailing list archives

From David Ortiz <dor...@videologygroup.com>
Subject RE: Output file prefix
Date Fri, 13 Nov 2015 18:05:10 GMT
Thanks.  I’ll take a look at that!

From: Josh Wills [mailto:josh.wills@gmail.com]
Sent: Friday, November 13, 2015 1:04 PM
To: user@crunch.apache.org
Subject: Re: Output file prefix

Yeah, I think that might work. You would create a FileNamingScheme that would allow you to
specify different prefixes for the FileTargets of your different PCollections. I don't see
any example code for how to use it for that purpose, just this one test Gabriel wrote:

https://github.com/apache/crunch/blob/master/crunch-core/src/test/java/org/apache/crunch/io/SequentialFileNamingSchemeTest.java
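
[Editor's sketch] The idea above, one naming scheme per output so that files from different PCollections cannot collide, can be illustrated without depending on Crunch or Hadoop. Everything in this block is hypothetical: `NamingScheme` is a simplified stand-in for Crunch's FileNamingScheme (the real interface is understood to also receive the Hadoop Configuration and the output directory, so check the Javadoc before porting), and the `part-m-`/`part-r-` pattern with a five-digit suffix is assumed from the usual Hadoop convention.

```java
import java.util.Locale;

// Hypothetical stand-in for Crunch's FileNamingScheme; NOT the real interface.
interface NamingScheme {
    String mapOutputName(int sequence);
    String reduceOutputName(int partitionId);
}

// Puts a caller-chosen prefix in front of every generated part-file name.
final class PrefixedNamingScheme implements NamingScheme {
    private final String prefix;

    PrefixedNamingScheme(String prefix) {
        if (prefix == null || prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must be non-empty");
        }
        this.prefix = prefix;
    }

    @Override
    public String mapOutputName(int sequence) {
        // Locale.ROOT keeps the zero-padded digits ASCII on every JVM.
        return String.format(Locale.ROOT, "%s-part-m-%05d", prefix, sequence);
    }

    @Override
    public String reduceOutputName(int partitionId) {
        return String.format(Locale.ROOT, "%s-part-r-%05d", prefix, partitionId);
    }
}

class PrefixedNamingDemo {
    public static void main(String[] args) {
        NamingScheme clicks = new PrefixedNamingScheme("clicks");
        NamingScheme views = new PrefixedNamingScheme("views");
        System.out.println(clicks.mapOutputName(0));    // clicks-part-m-00000
        System.out.println(views.reduceOutputName(3));  // views-part-r-00003
    }
}
```

With one scheme per job (the `clicks` and `views` prefixes are made up), the four jobs could write into separate directories and still produce names that stay unique after being copied into a single bucket.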

On Fri, Nov 13, 2015 at 10:00 AM, Josh Wills <josh.wills@gmail.com> wrote:
Although...hrm. I wonder if FileNamingScheme would work for this purpose? Did you look at
that?

On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills <josh.wills@gmail.com> wrote:
I see; they all need to end up in the same bucket in S3 w/different names. Then yes, the options
you describe sound about right.

On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <dortiz@videologygroup.com> wrote:
Hey,

     The reason I was looking for this is that, whether I write them to different directories
or to the same directory, I have to distcp them all to the same S3 bucket for downstream processing
to function properly, so I need to make sure that the file names don’t overlap.  To get
this to work, it sounds like my options would be the following:

• Have the client move the files to a common directory, with the names I want, using FileSystem calls.

• Write a shell script that Oozie calls to do the same thing as the previous option, but with dfs calls.

• Write an additional crunch job, which will load the output from the previous four jobs and union the results.

Does that sound about right?

Thanks,
     Dave
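
[Editor's sketch] The first option in the message above (have the client move each job's files into a common directory under non-overlapping names) might look roughly like the following. This is only an illustration under stated assumptions: it uses java.nio.file on a local filesystem as a stand-in for Hadoop's FileSystem API so that it runs without a cluster; the class name `PrefixingMover`, the `jobA` prefix, and the rule of skipping names that start with `_` or `.` (marker files such as `_SUCCESS`) are all hypothetical choices, not anything from the thread.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class PrefixingMover {
    /**
     * Moves each regular file in sourceDir into targetDir as "<prefix>-<name>".
     * Names starting with '_' or '.' are left in place (assumed to be markers).
     * Returns the new paths of the files that were moved.
     */
    static List<Path> moveWithPrefix(Path sourceDir, Path targetDir, String prefix)
            throws IOException {
        Files.createDirectories(targetDir);

        // Collect first, so the directory is not modified while it is being listed.
        List<Path> toMove = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && !name.startsWith("_") && !name.startsWith(".")) {
                    toMove.add(entry);
                }
            }
        }

        List<Path> moved = new ArrayList<>();
        for (Path source : toMove) {
            Path target = targetDir.resolve(prefix + "-" + source.getFileName());
            // No REPLACE_EXISTING option: a name clash fails instead of overwriting.
            moved.add(Files.move(source, target));
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        Path jobOutput = Files.createTempDirectory("jobA-output");
        Path common = Files.createTempDirectory("common");
        Files.write(jobOutput.resolve("part-m-00000"), "record".getBytes());
        System.out.println(moveWithPrefix(jobOutput, common, "jobA"));
    }
}
```

On HDFS the same shape would presumably be built from `FileSystem.listStatus` and `FileSystem.rename`; letting the move fail on a name clash, rather than overwrite, surfaces overlapping file names before the distcp step.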

From: Josh Wills [mailto:josh.wills@gmail.com]
Sent: Friday, November 13, 2015 12:41 PM
To: user@crunch.apache.org
Subject: Re: Output file prefix

Hey David,

There isn't a way to muck w/the file output prefix on a per-collection basis. Would something
like a PathPerKeyTarget work for this situation, where you would have four keys for the different
output directories and could sort of union together the PTable<String, Whatever> instances
that you needed to create on a particular run?

J

On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <dpo5003@gmail.com> wrote:
Hey everyone,

     I thought I remembered seeing something in the docs about being able to set a prefix
for output files from a collection, but I am having trouble finding it now.  Does that exist?

    I am trying to break up a large job, which had four parallel threads of execution on different
data sets all feeding one output set, into four separate jobs.  That makes it easier to rerun
only one of the input sets in the event something goes wrong, and an output prefix would make it
a lot easier to get the output all into one directory.

Thanks,
     Dave

This email is intended only for the use of the individual(s) to whom it is addressed. If you
have received this communication in error, please immediately notify the sender and delete
the original email.


