crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: CrunchJobHooks.handleMultiPaths(..) file pattern expectations
Date Fri, 26 Apr 2013 20:19:03 GMT
Can you gist up a patch and/or post it to a JIRA so we can take a look?


On Fri, Apr 26, 2013 at 12:30 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:

> So as mentioned I'm currently trying out adding Avro Trevni support to
> Crunch.  I think I've gotten everything working with the exception that my
> output is not being copied to the correct directory upon completion.
>
> I'm extending the FileTargetImpl and have the following in my
> implementation:
>
>     @Override
>     public void configureForMapReduce(Job job, PType<?> ptype, Path outputPath, String name) {
>          .....
>         configureForMapReduce(job, AvroKey.class, NullWritable.class,
>                 AvroTrevniKeyOutputFormat.class, outputPath, name);
>
>         // AvroTrevniKeyOutputFormat writes its content directly to the
>         // configured output path, so reset that path to include the
>         // named output.
>         if (name != null) {
>             FileOutputFormat.setOutputPath(job, new Path(outputPath, name));
>         }
>     }
>
> This produces the following in the crunch tmp directory:
>
> $ pwd
>
> /var/folders/0f/l_2w0gxd0p15k9410b18j8q40000gp/T/junit6467712912178902519/tmp-crunch.tmp.dir/crunch-1902403831/p1/output/out0
> $ ls
> _SUCCESS part-m-00000
> $ cd part-m-00000/
> $ ls -l
> total 8
> -rwxrwxrwx  1 mw010351  staff  493 Apr 26 13:52 part-0.trv
> -rw-r--r--  1 mw010351  staff    0 Apr 26 13:52 part-m-00000
>
> The part-0.trv file is the one of most interest, and ideally I'd be able to
> avoid the extra part-m-00000 directory (but I can work on that
> configuration separately, because I think it comes from inside Trevni).
>
> Unfortunately, the directories from the Crunch tmp dir aren't getting copied
> to the expected output directory, because the CrunchJobHooks completion hook
> expects folders to be of the form "out#-*", and the directory that is
> getting created does not have the "-" or take the same form as the others
> ("out0-m-00000").  Am I missing some configuration in my target that would
> cause the directory to be created like that?  Or should the pattern for
> finding directories to copy be loosened to not require the final "-"?
>
> Thoughts?
>
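
To make the mismatch described in the quoted message concrete, here is a minimal sketch of how a glob of the form "out#-*" treats the two directory names. It uses the JDK's own `PathMatcher` rather than anything from Crunch or Hadoop, so it only illustrates glob semantics; the class name `OutputGlobSketch` and the helper `matches` are hypothetical, and the actual matching code in CrunchJobHooks may differ.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical illustration only: this is not Crunch code.
public class OutputGlobSketch {

    // Returns true when the single-component name matches the glob.
    static boolean matches(String glob, String name) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // "out0-m-00000" is the usual shape; "out0" is what the Trevni
        // target in the message ends up producing.
        String[] names = {"out0-m-00000", "out0"};
        // Strict pattern (with the trailing "-") and a loosened variant.
        String[] globs = {"out0-*", "out0*"};

        for (String glob : globs) {
            for (String name : names) {
                System.out.println(glob + " vs " + name + " -> "
                        + matches(glob, name));
            }
        }
    }
}
```

One thing to weigh before loosening the pattern: without the trailing "-", a glob such as "out1*" would also match "out10", so outputs from different targets could be picked up together once there are ten or more of them.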



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
