crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: inconsistent grouping of map jobs
Date Sat, 23 Feb 2013 01:39:28 GMT
K. I just published 0.5.0-cdh4.1.3 on Cloudera's local repo for folks
trying against CDH4 (the default maven repos are built against Hadoop 1.x):
https://repository.cloudera.com/artifactory/libs-release-local/
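For Maven users, pulling that build typically means adding the repository and pinning the CDH-flavored version. A minimal sketch; the groupId/artifactId coordinates are an assumption for the 0.5.x Crunch line:

```xml
<!-- Repository URL is the one given above; artifact coordinates are assumed. -->
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/libs-release-local/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.crunch</groupId>
    <artifactId>crunch</artifactId>
    <version>0.5.0-cdh4.1.3</version>
  </dependency>
</dependencies>
```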


On Fri, Feb 22, 2013 at 2:47 PM, Mike Barretta <mike.barretta@gmail.com> wrote:

> Ah, well yes, building against the 1.0.3 specified in the 0.6 pom, but
> running against a cdh3 deployment with hadoop 0.20.2.  Frowned upon, I see.
>  I will try it against cdh4.
>
> Banging my head against this one, got nothing. Question: you said Hadoop
> 0.20.2 -- how does that work? I don't think Crunch builds against 0.20.2.
>
> J
>
>
> On Wed, Feb 20, 2013 at 12:36 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> It shouldn't, but that'll help me recreate it. Thanks!
>>
>>
>> On Wednesday, February 20, 2013, Mike Barretta <mike.barretta@gmail.com>
>> wrote:
>> > just once - each of the parallelDo's happens within the run() of my
>> Tool, and I kick it off with the pipeline.done() vs pipeline.run() - any
>> difference there?
>> >
>> > On Wed, Feb 20, 2013 at 3:25 PM, Josh Wills <jwills@cloudera.com>
>> wrote:
>> >
>> > Ah, okay. I just got on a train, so I'll have to do a bit of local
>> debugging.
>> >
>> > Curious: are you explicitly calling run() between each of these jobs,
>> or just once after they've all been defined?
>> >
>> > On Wednesday, February 20, 2013, Mike Barretta <mike.barretta@gmail.com>
>> wrote:
>> >> okay, well, things took a turn for the worse quickly :)
>> >> Following the same output above, the following jobs were created:
>> >> 13/02/20 19:25:26 INFO exec.CrunchJob: Running job
>> "com.digitalreasoning.petal.extract.SynthesysKBExtractor:
>> SeqFile(/Synthesys/MessageData)+[[S1+Text(/Synthesys/export/Contexts)]/[S0+Text(/Synthesys/export/MessageData)]/[S2+Text(/Synthesys/export/ContextualElements)]]"
>> >> 13/02/20 19:25:26 INFO exec.CrunchJob: Job status available at: <snip>
>> >> 13/02/20 19:25:28 INFO input.FileInputFormat: Total input paths to
>> process : 40
>> >> 13/02/20 19:25:29 INFO exec.CrunchJob: Running job
>> "com.digitalreasoning.petal.extract.SynthesysKBExtractor:
>> SeqFile(/Synthesys/ElementData)+S5+Text(/Synthesys/export/ElementData)"
>> >> 13/02/20 19:25:29 INFO exec.CrunchJob: Job status available at: <snip>
>> >> 13/02/20 19:25:32 INFO input.FileInputFormat: Total input paths to
>> process : 40
>> >> 13/02/20 19:25:32 INFO exec.CrunchJob: Running job
>> "com.digitalreasoning.petal.extract.SynthesysKBExtractor:
>> SeqFile(/Synthesys/RelationshipData)+S3+Text(/Synthesys/export/RelationshipData)"
>> >> notice that the first (MessageData) shows all three output paths while
>> the last (RelationshipData) only shows one.  This is despite the previous
>> log messages showing:
>> >> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading
>> [RelationshipData]
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/export/RelationshipData
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/export/RelationshipStructures
>> >> *forgive the mismatched paths between this email and my previous -- I am
>> shortening for brevity, trying to convey the difference between input and
>> export paths
>> >>
>> >> On Wed, Feb 20, 2013 at 2:30 PM, Mike Barretta <
>> mike.barretta@gmail.com> wrote:
>> >>
>> >> Was using a very early 0.5.0-incubating build, with hadoop 0.20.2, but
>> just did a fresh git pull and now with 0.6.0-incubating, things look better
>> (MessageData and RelationshipData are my parents with children):
>> >> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading
>> [MessageData]
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/MessageData
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/Contexts
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/ContextualElements
>> >> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading
>> [RelationshipData]
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/RelationshipData
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/RelationshipStructures
>> >> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading
>> [ElementData]
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/ElementData
>> >> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading
>> [ConceptData]
>> >> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to
>> new path: /Synthesys/ConceptData
>> >> I'll try a few more times and let you know if anything funky happens.
>> >> Thanks, as always, for your prompt responses,
>> >> Mike
>> >>
>> >> On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <jwills@cloudera.com>
>> wrote:
>> >>
>> >> Hey Mike,
>> >> I can't replicate this problem using the MultipleOutputIT (which I
>> think we added as a test for this problem). Which version of Crunch and
>> Hadoop are you using? The 0.5.0-incubating release should be up on the
>> maven repos if you want to try that out.
>> >> J
>> >>
>> >> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <jwills
>>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
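
For readers arriving at this archive with the same question: the run()-vs-done() distinction discussed in the thread can be sketched against Crunch's MRPipeline API roughly as below. The class name, paths, and read/write targets are hypothetical; this is an illustration, not code from the thread.

```java
// Sketch only: illustrates when Crunch actually executes planned work.
// Crunch plans lazily -- reads and writes below define jobs but run nothing.
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class ExportSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(ExportSketch.class);

    // Define several independent read -> write stages up front.
    PCollection<String> a = pipeline.read(From.textFile("/in/A"));
    pipeline.write(a, To.textFile("/out/A"));

    PCollection<String> b = pipeline.read(From.textFile("/in/B"));
    pipeline.write(b, To.textFile("/out/B"));

    // Option 1: calling pipeline.run() here (and after each write)
    // forces execution stage by stage.

    // Option 2: a single done() at the end runs everything still
    // pending (it implies a final run()) and cleans up temporary output.
    pipeline.done();
  }
}
```

Either way the planner sees the same writes; the difference is whether jobs are submitted incrementally or planned together at the end, which can affect how outputs get grouped into MapReduce jobs.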
