crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Output file prefix
Date Fri, 13 Nov 2015 18:00:11 GMT
Although...hrm. I wonder if FileNamingScheme would work for this purpose?
Did you look at that?
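
The idea behind this suggestion would be to have each of the four jobs write files whose names carry a job-specific prefix, so they cannot collide once copied into one bucket. The sketch below shows only the naming logic in plain Java. In Crunch it would presumably sit inside a class implementing FileNamingScheme, whose exact method signatures should be checked against the Crunch javadoc. The class name PrefixedNames, the method outputName, and the "prefix-part-type-NNNNN" pattern are assumptions for illustration, not Crunch's API.

```java
import java.util.Locale;

/** Hypothetical helper: builds output file names that carry a per-job prefix. */
final class PrefixedNames {
    private PrefixedNames() {}

    /**
     * @param prefix    per-job label such as "jobA" (hypothetical value)
     * @param taskType  "m" for map-side output, "r" for reduce-side output
     * @param partition zero-based partition or sequence number
     * @return a name of the form prefix-part-taskType-NNNNN
     */
    static String outputName(String prefix, String taskType, int partition) {
        // Reject prefixes that would be empty or would escape the output directory.
        if (prefix == null || prefix.isEmpty() || prefix.contains("/")) {
            throw new IllegalArgumentException("prefix must be a non-empty file name fragment");
        }
        if (partition < 0) {
            throw new IllegalArgumentException("partition must be non-negative");
        }
        // Zero-pad the partition to five digits so names sort in partition order.
        return String.format(Locale.ROOT, "%s-part-%s-%05d", prefix, taskType, partition);
    }
}
```

Calling PrefixedNames.outputName with a different prefix for each job gives names that stay distinct for the same task type and partition, which is the property the downstream bucket needs.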

On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills <josh.wills@gmail.com> wrote:

> I see; they all need to end up in the same bucket in S3 w/different names.
> Then yes, the options you describe sound about right.
>
> On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <dortiz@videologygroup.com>
> wrote:
>
>> Hey,
>>
>>
>>
>>      I was looking for this because, whether I write them to different
>> directories or to the same directory, I have to distcp them all to the
>> same S3 bucket for downstream processing to function properly, so I need
>> to make sure that the file names don’t overlap.  So to get this to work,
>> it sounds like my options would be the following:
>>
>> * Have the client move the files to a common directory with names
>>   I want using FileSystem calls
>>
>> * Write a shell script that Oozie calls to do the same thing as
>>   the previous option, but with dfs calls.
>>
>> * Write an additional crunch job, which will load the output from
>>   the previous four jobs and union the results.
>>
>>
>>
>> Does that sound about right?
>>
>>
>>
>> Thanks,
>>
>>      Dave
>>
>>
>>
>> *From:* Josh Wills [mailto:josh.wills@gmail.com]
>> *Sent:* Friday, November 13, 2015 12:41 PM
>> *To:* user@crunch.apache.org
>> *Subject:* Re: Output file prefix
>>
>>
>>
>> Hey David,
>>
>>
>>
>> There isn't a way to muck w/the file output prefix on a per-collection
>> basis. Would something like a PathPerKeyTarget work for this situation,
>> where you would have four keys for the different output directories and
>> could sort of union together the PTable<String, Whatever> instances that
>> you needed to create on a particular run?
>>
>>
>>
>> J
>>
>>
>>
>> On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <dpo5003@gmail.com> wrote:
>>
>> Hey everyone,
>>
>>
>>
>>      I thought I remembered seeing something in the docs about being able
>> to set a prefix for output files from a collection, but I am having trouble
>> finding it now.  Does that exist?
>>
>>
>>
>>     I am trying to break up a large job, which had four parallel threads
>> of execution on different data sets all feeding one output set, into four
>> separate jobs. That makes it easier to rerun only one of the input sets in
>> the event something goes wrong, and an output prefix would make it a lot
>> easier to deal with getting the output all into one directory.
>>
>>
>>
>> Thanks,
>>
>>      Dave
>>
>>
>>
>
>
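
The first option David lists, moving each job's files into a common directory under non-overlapping names, can be sketched as follows. This is a minimal sketch that uses java.nio.file on a local filesystem as a stand-in. On HDFS the equivalent would go through Hadoop's FileSystem API, which is not shown here. PrefixMover, moveWithPrefix, and the "prefix-originalName" convention are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: gathers one job's output files into a shared directory. */
final class PrefixMover {
    private PrefixMover() {}

    /**
     * Moves every regular file in sourceDir into targetDir, renaming each
     * to prefix + "-" + original file name. Returns the new paths.
     */
    static List<Path> moveWithPrefix(Path sourceDir, Path targetDir, String prefix)
            throws IOException {
        Files.createDirectories(targetDir);

        // Snapshot the listing first so we never move files while iterating.
        List<Path> sources = new ArrayList<>();
        try (DirectoryStream<Path> listing = Files.newDirectoryStream(sourceDir)) {
            for (Path file : listing) {
                if (Files.isRegularFile(file)) {
                    sources.add(file);
                }
            }
        }

        List<Path> moved = new ArrayList<>();
        for (Path file : sources) {
            Path destination = targetDir.resolve(prefix + "-" + file.getFileName());
            // Without REPLACE_EXISTING, Files.move throws FileAlreadyExistsException
            // on a name clash, which surfaces overlaps instead of hiding them.
            moved.add(Files.move(file, destination));
        }
        return moved;
    }
}
```

Running this once per job with a distinct prefix leaves all files in one directory, ready for a single distcp. A name clash fails loudly rather than overwriting an earlier job's output.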
