crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Output file prefix
Date Fri, 13 Nov 2015 17:58:59 GMT
I see; they all need to end up in the same bucket in S3 w/different names.
Then yes, the options you describe sound about right.

On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <dortiz@videologygroup.com>
wrote:

> Hey,
>
>
>
>      The reason I was looking for this is that, whether I write the outputs
> to different directories or to the same directory, I have to distcp them all
> to the same S3 bucket for downstream processing to function properly, so I
> need to make sure that the file names don’t overlap.  To get this to
> work, it sounds like my options are the following:
>
> ·        Have the client move the files to a common directory with names
> I want using FileSystem calls
>
> ·        Write a shell script that Oozie calls to do the same thing as
> the previous option, but with dfs calls.
>
> ·        Write an additional Crunch job, which will load the output from
> the previous four jobs and union the results.
>
>
>
> Does that sound about right?
>
>
>
> Thanks,
>
>      Dave
>
>
>
> *From:* Josh Wills [mailto:josh.wills@gmail.com]
> *Sent:* Friday, November 13, 2015 12:41 PM
> *To:* user@crunch.apache.org
> *Subject:* Re: Output file prefix
>
>
>
> Hey David,
>
>
>
> There isn't a way to muck w/the file output prefix on a per-collection
> basis. Would something like a PathPerKeyTarget work for this situation,
> where you would have four keys for the different output directories and
> could sort of union together the PTable<String, Whatever> instances that
> you needed to create on a particular run?
>
>
>
> J
>
>
>
> On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <dpo5003@gmail.com> wrote:
>
> Hey everyone,
>
>
>
>      I thought I remembered seeing something in the docs about being able
> to set a prefix for output files from a collection, but I am having trouble
> finding it now.  Does that exist?
>
>
>
>     I am trying to break up a large job, which had four parallel threads of
> execution on different data sets all feeding one output set, into four
> separate jobs, to make it easier to rerun only one of the input sets in the
> event something goes wrong. An output prefix would make it a lot easier to
> get the output all into one directory.
>
>
>
> Thanks,
>
>      Dave
>
>
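
For readers following the thread, the first option Dave lists (having the client move each job's files into a common directory under non-overlapping names) might be sketched roughly as below. This is only an illustration under stated assumptions, not code from the thread: the class name PrefixedOutputMover, the "<prefix>-<file name>" naming scheme, and the "part-*" glob are all hypothetical, and the sketch uses java.nio.file on a local directory as a stand-in. Against HDFS the same loop would presumably go through Hadoop's FileSystem.listStatus and FileSystem.rename instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Crunch or Hadoop.
final class PrefixedOutputMover {

    private PrefixedOutputMover() {}

    // Assumed naming scheme: "<jobPrefix>-<original file name>",
    // e.g. "jobA" + "part-m-00000" becomes "jobA-part-m-00000".
    static String prefixedName(String jobPrefix, String fileName) {
        return jobPrefix + "-" + fileName;
    }

    // Moves every "part-*" file from srcDir into destDir, prepending jobPrefix
    // so that outputs of different jobs cannot collide. Marker files such as
    // _SUCCESS do not match the glob and are left in place.
    static List<Path> moveWithPrefix(Path srcDir, Path destDir, String jobPrefix)
            throws IOException {
        Files.createDirectories(destDir);

        // Collect the matches first so the directory is not modified mid-listing.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(srcDir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        Collections.sort(parts);

        List<Path> moved = new ArrayList<>();
        for (Path part : parts) {
            Path target = destDir.resolve(
                    prefixedName(jobPrefix, part.getFileName().toString()));
            // No REPLACE_EXISTING option: an existing target raises
            // FileAlreadyExistsException instead of being overwritten.
            moved.add(Files.move(part, target));
        }
        return moved;
    }
}
```

Calling moveWithPrefix once per job output directory, each time with a different prefix, would leave all part files side by side in one directory that can then be copied to the bucket in a single distcp.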
