crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Output file prefix
Date Fri, 13 Nov 2015 18:03:48 GMT
Yeah, I think that might work. You would create a FileNamingScheme that
would allow you to specify different prefixes for the FileTargets of your
different PCollections. I don't see any example code for how to use it for
that purpose, just this one test Gabriel wrote:

https://github.com/apache/crunch/blob/master/crunch-core/src/test/java/org/apache/crunch/io/SequentialFileNamingSchemeTest.java
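
To make the idea a bit more concrete, here is a rough, stand-alone sketch of the naming logic such a scheme could use. It deliberately avoids Crunch and Hadoop types so it compiles on its own; the class PrefixedNames and its methods are hypothetical, not part of Crunch. In a real job this logic would presumably sit inside a class implementing org.apache.crunch.io.FileNamingScheme (getMapOutputName / getReduceOutputName). As far as I can tell, the map-side method there receives the Configuration and output directory rather than a task id, so a real implementation would have to derive the index itself, the way SequentialFileNamingScheme appears to.

```java
import java.util.Locale;

/**
 * Stand-alone sketch of prefix-based output naming.
 * No Crunch or Hadoop types are used; PrefixedNames is a hypothetical helper.
 */
public final class PrefixedNames {
    private final String prefix;

    public PrefixedNames(String prefix) {
        // Reject prefixes that would be empty or would introduce a subdirectory.
        if (prefix == null || prefix.isEmpty() || prefix.contains("/")) {
            throw new IllegalArgumentException("prefix must be a non-empty name without '/'");
        }
        this.prefix = prefix;
    }

    /** Name for a map-side output file, following the part-m-NNNNN pattern. */
    public String mapOutputName(int taskId) {
        return String.format(Locale.ROOT, "%s-part-m-%05d", prefix, taskId);
    }

    /** Name for a reduce-side output file, following the part-r-NNNNN pattern. */
    public String reduceOutputName(int partitionId) {
        return String.format(Locale.ROOT, "%s-part-r-%05d", prefix, partitionId);
    }

    public static void main(String[] args) {
        // One instance per job, so file names stay distinct after a distcp
        // into a shared bucket.
        PrefixedNames jobA = new PrefixedNames("jobA");
        PrefixedNames jobB = new PrefixedNames("jobB");
        System.out.println(jobA.mapOutputName(0));
        System.out.println(jobB.reduceOutputName(12));
    }
}
```

With a distinct prefix per job, the four jobs' outputs could land in the same directory or bucket without their names colliding.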

On Fri, Nov 13, 2015 at 10:00 AM, Josh Wills <josh.wills@gmail.com> wrote:

> Although...hrm. I wonder if FileNamingScheme would work for this purpose?
> Did you look at that?
>
> On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills <josh.wills@gmail.com> wrote:
>
>> I see; they all need to end up in the same bucket in S3 w/different
>> names. Then yes, the options you describe sound about right.
>>
>> On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <dortiz@videologygroup.com>
>> wrote:
>>
>>> Hey,
>>>
>>>
>>>
>>>      The reason I was looking for this is that, whether I write them
>>> to different directories or to the same directory, I have to distcp them
>>> all to the same S3 bucket for downstream processing to function properly,
>>> so I need to make sure that the file names don't overlap.  So to get this
>>> to work, it sounds like my options would be the following:
>>>
>>> * Have the client move the files to a common directory with the
>>>   names I want using FileSystem calls.
>>>
>>> * Write a shell script that Oozie calls to do the same thing as
>>>   the previous option, but with dfs calls.
>>>
>>> * Write an additional Crunch job, which will load the output
>>>   from the previous four jobs and union the results.
>>>
>>>
>>>
>>> Does that sound about right?
>>>
>>>
>>>
>>> Thanks,
>>>
>>>      Dave
>>>
>>>
>>>
>>> *From:* Josh Wills [mailto:josh.wills@gmail.com]
>>> *Sent:* Friday, November 13, 2015 12:41 PM
>>> *To:* user@crunch.apache.org
>>> *Subject:* Re: Output file prefix
>>>
>>>
>>>
>>> Hey David,
>>>
>>>
>>>
>>> There isn't a way to muck w/the file output prefix on a per-collection
>>> basis. Would something like a PathPerKeyTarget work for this situation,
>>> where you would have four keys for the different output directories and
>>> could sort of union together the PTable<String, Whatever> instances that
>>> you needed to create on a particular run?
>>>
>>>
>>>
>>> J
>>>
>>>
>>>
>>> On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <dpo5003@gmail.com> wrote:
>>>
>>> Hey everyone,
>>>
>>>
>>>
>>>      I thought I remembered seeing something in the docs about being
>>> able to set a prefix for output files from a collection, but I am having
>>> trouble finding it now.  Does that exist?
>>>
>>>
>>>
>>>     I am trying to break up a large job, which had four parallel threads
>>> of execution on different data sets all feeding one output set, into four
>>> separate jobs. That makes it easier to rerun only one of the input sets in
>>> the event something goes wrong, and a prefix would make it a lot easier to
>>> get the output all into one directory.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>      Dave
>>>
>>>
>>>
>>
>>
>
