crunch-user mailing list archives

From Micah Whitacre <mkwhita...@gmail.com>
Subject Re: CrunchJobHooks.handleMultiPaths(..) file pattern expectations
Date Fri, 26 Apr 2013 20:44:31 GMT
Logged CRUNCH-199.
https://issues.apache.org/jira/browse/CRUNCH-199


On Fri, Apr 26, 2013 at 3:21 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:

> >> Can you gist up a patch and/or post it to a JIRA so we can take a look?
>
> I'll work on cleaning up my code a bit and attach it to a JIRA.
>
>
> On Fri, Apr 26, 2013 at 3:19 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Can you gist up a patch and/or post it to a JIRA so we can take a look?
>>
>>
>> On Fri, Apr 26, 2013 at 12:30 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
>>
>>> So as mentioned I'm currently trying out adding Avro Trevni support to
>>> Crunch.  I think I've gotten everything working with the exception that my
>>> output is not being copied to the correct directory upon completion.
>>>
>>> I'm extending the FileTargetImpl and have the following in my
>>> implementation:
>>>
>>>     @Override
>>>     public void configureForMapReduce(Job job, PType<?> ptype, Path outputPath, String name) {
>>>         .....
>>>         configureForMapReduce(job, AvroKey.class, NullWritable.class,
>>>                 AvroTrevniKeyOutputFormat.class, outputPath, name);
>>>
>>>         // AvroTrevniKeyOutputFormat uses this set value to write content
>>>         // directly to this path. Therefore resetting the value with the
>>>         // named value.
>>>         if (name != null) {
>>>             FileOutputFormat.setOutputPath(job, new Path(outputPath, name));
>>>         }
>>>     }
>>>
>>> This produces the following in the crunch tmp directory:
>>>
>>> $ pwd
>>> /var/folders/0f/l_2w0gxd0p15k9410b18j8q40000gp/T/junit6467712912178902519/tmp-crunch.tmp.dir/crunch-1902403831/p1/output/out0
>>> $ ls
>>> _SUCCESS part-m-00000
>>> $ cd part-m-00000/
>>> $ ls -l
>>> total 8
>>> -rwxrwxrwx  1 mw010351  staff  493 Apr 26 13:52 part-0.trv
>>> -rw-r--r--  1 mw010351  staff    0 Apr 26 13:52 part-m-00000
>>>
>>> The part-0.trv file is the one of most interest, and ideally I'd be able
>>> to avoid the extra part-m-00000 directory (but I can work on that
>>> configuration because I think it is inside of Trevni).
>>>
>>> Unfortunately the directories from the Crunch tmp dir aren't getting copied
>>> to the expected output directory because the CrunchJobHooks for completion
>>> expects folders to be of the form "out#-*", and the directory that is
>>> getting created does not have the "-" or take the form of the others
>>> ("out0-m-00000").  Am I missing some configuration in my target that would
>>> cause the directory to be created like that?  Or should the pattern for
>>> finding directories to copy be relaxed to not require the final "-"?
>>>
>>> Thoughts?
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
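
For illustration, the mismatch described in the quoted message can be reproduced with a plain glob check. This is a minimal sketch, assuming the completion hook selects entries with a glob of the form "out0-*". It uses java.nio's PathMatcher as a stand-in rather than Hadoop's own glob handling, and the class name OutputGlobDemo and the relaxed pattern "out0*" are hypothetical, not taken from Crunch.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical demo class; it does not reproduce CrunchJobHooks itself.
class OutputGlobDemo {

    // Returns true if a single file or directory name matches the glob.
    public static boolean matches(String glob, String name) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // Names from the thread: the usual form and the Trevni directory.
        String[] names = {"out0-m-00000", "out0"};
        // Assumed current pattern, then a hypothetical relaxed pattern.
        String[] globs = {"out0-*", "out0*"};

        for (String glob : globs) {
            for (String name : names) {
                System.out.println(
                        glob + " vs " + name + ": " + matches(glob, name));
            }
        }
    }
}
```

Under these assumptions "out0-*" accepts "out0-m-00000" but rejects a bare "out0", which is consistent with the directory not being copied. Simply dropping the "-" is not free either: a pattern such as "out1*" would also match "out10", so a relaxed pattern would probably need some other way to keep multi-digit output indexes apart.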
