Subject: Re: inconsistent grouping of map jobs
From: Josh Wills <jwills@cloudera.com>
To: crunch-user@incubator.apache.org
Date: Wed, 20 Feb 2013 12:25:27 -0800

Ah, okay. I just got on a train, so I'll have to do a bit of local debugging. Curious: are you explicitly calling run() between each of these jobs, or just once after they've all been defined?
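To make sure we're talking about the same thing, here's a rough, untested sketch of the two patterns, reusing the crunchPipeline/paths names from your snippet below (buildTableOutputs is a stand-in for your read/parallelDo/write calls, not real code):

import org.apache.crunch.Pipeline
import org.apache.crunch.impl.mr.MRPipeline
import org.apache.hadoop.conf.Configuration

// Stand-in for the per-table read/parallelDo/write wiring in your loop
def buildTableOutputs(Pipeline pipeline, String path) { /* ... */ }

Pipeline crunchPipeline = new MRPipeline(getClass(), new Configuration())
def paths = ["/Synthesys/MessageData", "/Synthesys/ElementData"] // etc.

// Pattern 1: run() between each table -- each iteration plans and
// launches the jobs defined so far before the next table is wired up
paths.each { path ->
    buildTableOutputs(crunchPipeline, path)
    crunchPipeline.run()
}

// Pattern 2 (an alternative, not both): wire up every output first,
// then execute the whole plan once -- this is the case where the
// planner can group several outputs into a single job
paths.each { path ->
    buildTableOutputs(crunchPipeline, path)
}

crunchPipeline.done() // runs any remaining jobs, then cleans up temp data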
On Wednesday, February 20, 2013, Mike Barretta wrote:
> okay, well, things turned for the worse quickly :)
> Following the same output above, the following jobs were created:
> 13/02/20 19:25:26 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/MessageData)+[[S1+Text(/Synthesys/export/Contexts)]/[S0+Text(/Synthesys/export/MessageData)]/[S2+Text(/Synthesys/export/ContextualElements)]]"
> 13/02/20 19:25:26 INFO exec.CrunchJob: Job status available at: <snip>
> 13/02/20 19:25:28 INFO input.FileInputFormat: Total input paths to process : 40
> 13/02/20 19:25:29 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/ElementData)+S5+Text(/Synthesys/export/ElementData)"
> 13/02/20 19:25:29 INFO exec.CrunchJob: Job status available at: <snip>
> 13/02/20 19:25:32 INFO input.FileInputFormat: Total input paths to process : 40
> 13/02/20 19:25:32 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/RelationshipData)+S3+Text(/Synthesys/export/RelationshipData)"
> Notice that the first job (MessageData) shows all three output paths, while the last (RelationshipData) shows only one. This is despite the previous log messages showing:
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipStructures
> *Forgive the mismatched paths between this email and my previous one - I'm shortening for brevity, and trying to convey the difference between the input and export paths.
>
> On Wed, Feb 20, 2013 at 2:30 PM, Mike Barretta wrote:
>
> Was using a very early 0.5.0-incubating build, with Hadoop 0.20.2, but just did a fresh git pull, and now with 0.6.0-incubating things look better (MessageData and RelationshipData are my parents with children):
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData
> I'll try a few more times and let you know if anything funky happens.
> Thanks, as always, for your prompt responses,
> Mike
>
> On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills wrote:
>
> Hey Mike,
> I can't replicate this problem using the MultipleOutputIT (which I think we added as a test for this problem). Which version of Crunch and Hadoop are you using? The 0.5.0-incubating release should be up on the Maven repos if you want to try that out.
> J
>
> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills wrote:
>
> Hey Mike,
> The code looks right to me. Let me whip up a test and see if I can replicate it easily -- is there anything funky beyond what's in your snippet that I should be aware of?
> J
>
> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta wrote:
>
> I have a number of "tables" in HDFS, represented as folders containing SequenceFiles of serialized objects. I'm trying to write a tool that will reassemble these objects and output each of the tables into its own CSV file.
> The wrinkle is that some of the "tables" hold objects with a list of related child objects. Those related children should get chopped into their own table.
> Here is essentially what my loop looks like (in Groovy):
> //loop through each top-level table
> paths.each { path ->
>     def source = From.sequenceFile(new Path(path),
>             Writables.writables(ColumnKey.class),
>             Writables.writables(ColumnDataArrayWritable.class)
>     )
>     //read it in
>     def data = crunchPipeline.read(source)
>     //write it out
>     crunchPipeline.write(
>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>         To.textFile("$path/csv")
>     )
>     //handle children using same PTable as pare
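For what it's worth, my working assumption about the truncated child-handling step is a second parallelDo + write hanging off the same "data" PTable, roughly along these lines -- ChildDoFn and the "$path/children/csv" output path are invented names for illustration, not the actual code:

    //handle children using the same PTable as the parent (hypothetical
    //continuation: ChildDoFn and the children output path are invented)
    crunchPipeline.write(
        data.parallelDo(new ChildDoFn(path), Writables.strings()),
        To.textFile("$path/children/csv")
    )
}

That two-writes-from-one-source shape is the multiple-output case MultipleOutputIT exercises, and it would match the MessageData job listing three Text targets.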

--
Director of Data Science
Cloudera
Twitter: @josh_wills