incubator-crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: inconsistent grouping of map jobs
Date Wed, 20 Feb 2013 18:06:41 GMT
Hey Mike,

I can't replicate this problem using MultipleOutputIT (which I think we
added as a test for exactly this issue). Which versions of Crunch and Hadoop
are you using? The 0.5.0-incubating release should be up on the Maven repos
if you want to try that out.
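
(If it helps, here is a minimal build.gradle sketch for pulling that release
in a Groovy/Gradle project; the artifact coordinates below are an assumption,
so double-check them against Maven Central:)

    // Assumed coordinates for the 0.5.0-incubating release; verify on Maven Central.
    dependencies {
        compile 'org.apache.crunch:crunch:0.5.0-incubating'
    }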

J


On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Mike,
>
> The code looks right to me. Let me whip up a test and see if I can
> replicate it easily. Is there anything funky beyond what's in your snippet
> that I should be aware of?
>
> J
>
>
> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <mike.barretta@gmail.com> wrote:
>
>> I have a number of "tables" in HDFS, represented as folders containing
>> SequenceFiles of serialized objects.  I'm trying to write a tool that will
>> reassemble these objects and output each of the tables into its own CSV
>> file.
>>
>> The wrinkle is that some of the "tables" hold objects with a list of
>> related child objects. Those related children should be split out into
>> tables of their own.
>>
>> Here is essentially what my loop looks like (in Groovy):
>>
>> //loop through each top-level table
>> paths.each { path ->
>>     def source = From.sequenceFile(
>>         new Path(path),
>>         Writables.writables(ColumnKey.class),
>>         Writables.writables(ColumnDataArrayWritable.class)
>>     )
>>
>>     //read it in
>>     def data = crunchPipeline.read(source)
>>
>>     //write it out
>>     crunchPipeline.write(
>>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>>         To.textFile("$path/csv")
>>     )
>>
>>     //handle children using same PTable as parent
>>     if (path == TABLE_MESSAGE_DATA) {
>>         messageChildPaths.each { childPath ->
>>             crunchPipeline.write(
>>                 data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>>                 To.textFile("$childPath/csv")
>>             )
>>         }
>>     }
>> }
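
(MyDoFn itself isn't shown in the thread; a rough Groovy sketch of the kind of
DoFn the snippet implies, with assumed accessors on the custom writable
classes, might look like this:)

    import org.apache.crunch.DoFn
    import org.apache.crunch.Emitter
    import org.apache.crunch.Pair

    // Hypothetical sketch only: the real MyDoFn is not shown in this thread, and
    // the get() call below assumes ColumnDataArrayWritable behaves like Hadoop's
    // ArrayWritable.
    class MyDoFn extends DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String> {

        final String tablePath

        MyDoFn(String tablePath) {
            this.tablePath = tablePath
        }

        @Override
        void process(Pair<ColumnKey, ColumnDataArrayWritable> input, Emitter<String> emitter) {
            // Reassemble one serialized row into a single CSV line and emit it.
            emitter.emit(input.second().get().collect { it.toString() }.join(','))
        }
    }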
>>
>> The parent and child writes generally get grouped into a single map job,
>> but most of the time only some of the child tables get included, which is
>> to say, sometimes a child table does not get output at all. There doesn't
>> seem to be a pattern - sometimes all of them get written, sometimes only 1 or 2.
>>
>> Am I missing something? Is there a way to specify which jobs should be
>> combined?
>>
>> Thanks,
>> Mike
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
