crunch-user mailing list archives

From David Ortiz <dpo5...@gmail.com>
Subject Re: Output file prefix
Date Fri, 13 Nov 2015 20:59:46 GMT
Josh,

     That did the trick.  I made a new implementation of FileNamingScheme
which takes a prefix in the constructor and otherwise uses the same logic
as the SequentialFileNamingScheme to create the output files, so now I get
part-<prefix>-<number> in the output.  Thanks!

Dave
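
A minimal sketch of the naming logic described above, for illustration only. It is a hypothetical stand-in written against the plain JDK, not the Crunch API: in Crunch this logic would live in a class implementing FileNamingScheme. The class name PrefixedSequentialNaming, the method nextName, the five-digit zero padding, and the idea that the sequence number is the count of entries already in the output directory are all assumptions, not details confirmed in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical stand-in, not the Crunch API: it models only the naming logic.
final class PrefixedSequentialNaming {
  // Prefix supplied once in the constructor, e.g. one value per job.
  private final String prefix;

  PrefixedSequentialNaming(String prefix) {
    this.prefix = prefix;
  }

  // Returns part-<prefix>-<number>, where <number> is the count of entries
  // already present in the output directory (assumed sequencing rule).
  String nextName(Path outputDirectory) {
    try (Stream<Path> entries = Files.list(outputDirectory)) {
      long sequence = entries.count();
      return String.format("part-%s-%05d", prefix, sequence);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Giving each of the four jobs its own prefix keeps the file names distinct even when all of the output is later copied into a single directory or bucket.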

On Fri, Nov 13, 2015 at 1:05 PM David Ortiz <dortiz@videologygroup.com>
wrote:

> Thanks.  I’ll take a look at that!
>
>
>
> *From:* Josh Wills [mailto:josh.wills@gmail.com]
> *Sent:* Friday, November 13, 2015 1:04 PM
>
>
> *To:* user@crunch.apache.org
> *Subject:* Re: Output file prefix
>
>
>
> Yeah, I think that might work. You would create a FileNamingScheme that
> would allow you to specify different prefixes for the FileTargets of your
> different PCollections. I don't see any example code for how to use it for
> that purpose, just this one test Gabriel wrote:
>
>
>
>
> https://github.com/apache/crunch/blob/master/crunch-core/src/test/java/org/apache/crunch/io/SequentialFileNamingSchemeTest.java
>
>
>
> On Fri, Nov 13, 2015 at 10:00 AM, Josh Wills <josh.wills@gmail.com> wrote:
>
> Although...hrm. I wonder if FileNamingScheme would work for this purpose?
> Did you look at that?
>
>
>
> On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills <josh.wills@gmail.com> wrote:
>
> I see; they all need to end up in the same bucket in S3 w/different names.
> Then yes, the options you describe sound about right.
>
>
>
> On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <dortiz@videologygroup.com>
> wrote:
>
> Hey,
>
>
>
>      The reason I was looking for this is because whether I write them to
> different directories, or the same directories, I have to distcp them all
> to the same s3 bucket for downstream processing to function properly, so I
> need to make sure that the file names don’t overlap.  So to get this to
> work, it sounds like my options would be the following:
>
> ·        Have the client move the files to a common directory with names
> I want using FileSystem calls
>
> ·        Write a shell script that Oozie calls to do the same thing as
> the previous option, but with dfs calls.
>
> ·        Write an additional crunch job, which will load the output from
> the previous four jobs and union the results.
>
>
>
> Does that sound about right?
>
>
>
> Thanks,
>
>      Dave
>
>
>
> *From:* Josh Wills [mailto:josh.wills@gmail.com]
> *Sent:* Friday, November 13, 2015 12:41 PM
> *To:* user@crunch.apache.org
> *Subject:* Re: Output file prefix
>
>
>
> Hey David,
>
>
>
> There isn't a way to muck w/the file output prefix on a per-collection
> basis. Would something like a PathPerKeyTarget work for this situation,
> where you would have four keys for the different output directories and
> could sort of union together the PTable<String, Whatever> instances that
> you needed to create on a particular run?
>
>
>
> J
>
>
>
> On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <dpo5003@gmail.com> wrote:
>
> Hey everyone,
>
>
>
>      I thought I remembered seeing something in the docs about being able
> to set a prefix for output files from a collection, but I am having trouble
> finding it now.  Does that exist?
>
>
>
>     I am trying to break up a large job, which had four parallel threads of
> execution on different data sets that all fed one output set, into four
> separate jobs. That makes it easier to rerun only one of the input sets if
> something goes wrong, and a file prefix would make it a lot easier to get
> all of the output into one directory.
>
>
>
> Thanks,
>
>      Dave
>
>
>
> *This email is intended only for the use of the individual(s) to whom it
> is addressed. If you have received this communication in error, please
> immediately notify the sender and delete the original email.*
