crunch-user mailing list archives

From Mike Barretta <mike.barre...@gmail.com>
Subject Re: inconsistent grouping of map jobs
Date Wed, 20 Feb 2013 19:30:48 GMT
I was using a very early 0.5.0-incubating build with Hadoop 0.20.2. I just
did a fresh git pull, and with 0.6.0-incubating things look better
(MessageData and RelationshipData are the parent tables with children):

13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData

I'll try a few more times and let you know if anything funky happens.
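
A rough way to compare repeated runs is to scan the driver log for the
"Will write output files to new path" messages and list any expected table
that never shows up. The sketch below is plain Java with no Crunch
dependency. The names TargetCheck, plannedTargets and missingTargets are made
up for illustration, and treating Contexts and ContextualElements as the
MessageData children is an assumption read off the log above, not something
the thread confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Crunch, just a log-scanning sketch.
final class TargetCheck {
    // Matches the FileTargetImpl message; \s+ tolerates the line wrap that
    // mail clients insert between "new" and "path:".
    private static final Pattern TARGET =
        Pattern.compile("Will write output files to new\\s+path:\\s+(\\S+)");

    /** Returns the output paths announced in the log, in order of appearance. */
    static Set<String> plannedTargets(String logText) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = TARGET.matcher(logText);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Returns the expected paths that the log never mentions. */
    static List<String> missingTargets(List<String> expected, String logText) {
        Set<String> planned = plannedTargets(logText);
        List<String> missing = new ArrayList<>();
        for (String path : expected) {
            if (!planned.contains(path)) {
                missing.add(path);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample log text copied from the run above (first three targets).
        String log =
            "13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new\n"
          + "path: /Synthesys/MessageData\n"
          + "13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new\n"
          + "path: /Synthesys/Contexts\n"
          + "13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new\n"
          + "path: /Synthesys/ContextualElements\n";
        // Assumed parent/child set for MessageData.
        List<String> expected = Arrays.asList(
            "/Synthesys/MessageData",
            "/Synthesys/Contexts",
            "/Synthesys/ContextualElements");
        System.out.println("Missing targets: " + missingTargets(expected, log));
    }
}
```

Running this against the log from each attempt gives a quick list of child
tables that were dropped from the plan, which is easier to compare across
runs than reading the raw output.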

Thanks, as always, for your prompt responses,
Mike


On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Mike,
>
> I can't replicate this problem using the MultipleOutputIT (which I think
> we added as a test for this problem.) Which version of Crunch and Hadoop
> are you using? The 0.5.0-incubating release should be up on the maven repos
> if you want to try that out.
>
> J
>
>
> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Mike,
>>
>> The code looks right to me. Let me whip up a test and see if I can
>> replicate it easily-- is there anything funky beyond what's in your snippet
>> that I should be aware of?
>>
>> J
>>
>>
>> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <mike.barretta@gmail.com> wrote:
>>
>>> I have a number of "tables" in HDFS, represented as folders containing
>>> SequenceFiles of serialized objects.  I'm trying to write a tool that will
>>> reassemble these objects and output each of the tables into its own CSV
>>> file.
>>>
>>> The wrinkle is that some of the "tables" hold objects with a list of
>>> related child objects.  Those child objects should get split out into
>>> their own tables.
>>>
>>> Here is essentially what my loop looks like (in Groovy):
>>>
>>> //loop through each top-level table
>>> paths.each { path ->
>>>     def source = From.sequenceFile(new Path(path),
>>>         Writables.writables(ColumnKey.class),
>>>         Writables.writables(ColumnDataArrayWritable.class)
>>>     )
>>>
>>>     //read it in
>>>     def data = crunchPipeline.read(source)
>>>
>>>     //write it out
>>>     crunchPipeline.write(
>>>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>>>         To.textFile("$path/csv")
>>>     )
>>>
>>>     //handle children using same PTable as parent
>>>     if (path == TABLE_MESSAGE_DATA) {
>>>         messageChildPaths.each {  childPath ->
>>>             crunchPipeline.write(
>>>                 data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>>>                 To.textFile("$childPath/csv")
>>>             )
>>>         }
>>>     }
>>> }
>>>
>>> The parent and child jobs generally get grouped into a single map job,
>>> but most of the time only some of the child tables get included, which
>>> is to say, sometimes a child table does not get output.  There doesn't seem
>>> to be a pattern: sometimes all of them get included, sometimes only 1 or 2.
>>>
>>> Am I missing something? Is there a way to specify which jobs should be
>>> combined?
>>>
>>> Thanks,
>>> Mike
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
