crunch-user mailing list archives

From Mike Barretta <mike.barre...@gmail.com>
Subject Re: inconsistent grouping of map jobs
Date Wed, 20 Feb 2013 20:34:59 GMT
just once - each of the parallelDo's happens within the run() of my Tool,
and I kick it off with pipeline.done() rather than pipeline.run() - any
difference there?
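For reference, the run() vs. done() distinction in Crunch: run() executes the jobs planned so far, while done() runs any outstanding jobs and then cleans up the pipeline's temporary files. A minimal sketch of the single-call-inside-Tool.run() pattern, assuming a hypothetical ExportTool class:

    import org.apache.crunch.impl.mr.MRPipeline
    import org.apache.hadoop.conf.Configured
    import org.apache.hadoop.util.Tool

    class ExportTool extends Configured implements Tool {
        int run(String[] args) {
            def pipeline = new MRPipeline(ExportTool, getConf())
            //... define all reads, parallelDo's, and writes here ...
            //a single done() after everything is defined lets the planner
            //see the whole graph; it runs the remaining jobs and cleans up
            pipeline.done()
            return 0
        }
    }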


On Wed, Feb 20, 2013 at 3:25 PM, Josh Wills <jwills@cloudera.com> wrote:

> Ah, okay. I just got on a train, so I'll have to do a bit of local
> debugging.
>
> Curious: are you explicitly calling run() between each of these jobs, or
> just once after they've all been defined?
>
>
> On Wednesday, February 20, 2013, Mike Barretta <mike.barretta@gmail.com>
> wrote:
> > okay, well, things took a turn for the worse quickly :)
> > Following the same output as above, these jobs were created:
> > 13/02/20 19:25:26 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/MessageData)+[[S1+Text(/Synthesys/export/Contexts)]/[S0+Text(/Synthesys/export/MessageData)]/[S2+Text(/Synthesys/export/ContextualElements)]]"
> > 13/02/20 19:25:26 INFO exec.CrunchJob: Job status available at: <snip>
> > 13/02/20 19:25:28 INFO input.FileInputFormat: Total input paths to process : 40
> > 13/02/20 19:25:29 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/ElementData)+S5+Text(/Synthesys/export/ElementData)"
> > 13/02/20 19:25:29 INFO exec.CrunchJob: Job status available at: <snip>
> > 13/02/20 19:25:32 INFO input.FileInputFormat: Total input paths to process : 40
> > 13/02/20 19:25:32 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/RelationshipData)+S3+Text(/Synthesys/export/RelationshipData)"
> > Notice that the first (MessageData) shows all three output paths while the last (RelationshipData) only shows one. This is despite the previous log messages showing:
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipData
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipStructures
> > *Forgive the mismatched paths between this email and my previous - I am shortening for brevity, and trying to convey the difference between input and export paths.
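For reference, a fused job name like the first one above (listing three Text targets) is what the Crunch planner produces when several outputs are derived from a single read. A minimal sketch of that shape, with hypothetical ContextsFn and ContextualElementsFn DoFns standing in for the real ones:

    //one read, three derived writes; the planner can fuse these into a
    //single map job, which is why one job name lists all three Text targets
    def data = crunchPipeline.read(From.sequenceFile(new Path("/Synthesys/MessageData"),
        Writables.writables(ColumnKey.class),
        Writables.writables(ColumnDataArrayWritable.class)))
    crunchPipeline.write(data.parallelDo(new MyDoFn("MessageData"), Writables.strings()),
        To.textFile("/Synthesys/export/MessageData"))
    crunchPipeline.write(data.parallelDo(new ContextsFn(), Writables.strings()),
        To.textFile("/Synthesys/export/Contexts"))
    crunchPipeline.write(data.parallelDo(new ContextualElementsFn(), Writables.strings()),
        To.textFile("/Synthesys/export/ContextualElements"))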
> >
> > On Wed, Feb 20, 2013 at 2:30 PM, Mike Barretta <mike.barretta@gmail.com>
> wrote:
> >
> > I was using a very early 0.5.0-incubating build with Hadoop 0.20.2, but I just did a fresh git pull, and with 0.6.0-incubating things look better (MessageData and RelationshipData are my parents with children):
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
> > 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
> > 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData
> > I'll try a few more times and let you know if anything funky happens.
> > Thanks, as always, for your prompt responses,
> > Mike
> >
> > On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <jwills@cloudera.com> wrote:
> >
> > Hey Mike,
> > I can't replicate this problem using the MultipleOutputIT (which I think we added as a test for this problem). Which version of Crunch and Hadoop are you using? The 0.5.0-incubating release should be up on the Maven repos if you want to try that out.
> > J
> >
> > On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <jwills@cloudera.com> wrote:
> >
> > Hey Mike,
> > The code looks right to me. Let me whip up a test and see if I can replicate it easily. Is there anything funky beyond what's in your snippet that I should be aware of?
> > J
> >
> > On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <mike.barretta@gmail.com>
> wrote:
> >
> > I have a number of "tables" in HDFS, represented as folders containing SequenceFiles of serialized objects. I'm trying to write a tool that will reassemble these objects and output each of the tables into its own CSV file.
> > The wrinkle is that some of the "tables" hold objects with a list of related child objects. Those related children should get split out into their own table.
> > Here is essentially what my loop looks like (in Groovy):
> > //loop through each top-level table
> > paths.each { path ->
> >     def source = From.sequenceFile(new Path(path),
> >         Writables.writables(ColumnKey.class),
> >         Writables.writables(ColumnDataArrayWritable.class)
> >     )
> >     //read it in
> >     def data = crunchPipeline.read(source)
> >     //write it out
> >     crunchPipeline.write(
> >         data.parallelDo(new MyDoFn(path), Writables.strings()),
> >         To.textFile("$path/csv")
> >     )
> >     //handle children using same PTable as parent
>
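One possible shape for that child-handling step, as a hedged sketch continuing the loop body above; MyChildDoFn and the child output path are hypothetical, not from the original:

        //hypothetical: derive a second output from the same read, with an
        //assumed MyChildDoFn emitting one CSV line per serialized child object
        crunchPipeline.write(
            data.parallelDo(new MyChildDoFn(path), Writables.strings()),
            To.textFile("$path/children/csv")
        )
    }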
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
>
