Subject: Re: inconsistent grouping of map jobs
From: Josh Wills <jwills@cloudera.com>
To: crunch-user@incubator.apache.org
Date: Wed, 20 Feb 2013 12:25:27 -0800

Ah, okay. I just got on a train, so I'll have to do a bit of local debugging. Curious: are you explicitly calling run() between each of these jobs, or just once after they've all been defined?
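To make sure we're talking about the same thing, here's a rough, untested sketch of the two patterns, reusing the crunchPipeline/paths names from your snippet below (buildTableOutputs is a stand-in for your read/parallelDo/write calls, not real code):

import org.apache.crunch.Pipeline
import org.apache.crunch.impl.mr.MRPipeline
import org.apache.hadoop.conf.Configuration

// Stand-in for the per-table read/parallelDo/write wiring in your loop
def buildTableOutputs(Pipeline pipeline, String path) { /* ... */ }

Pipeline crunchPipeline = new MRPipeline(getClass(), new Configuration())
def paths = ["/Synthesys/MessageData", "/Synthesys/ElementData"] // etc.

// Pattern 1: run() between each table -- each iteration plans and
// launches the jobs defined so far before the next table is wired up
paths.each { path ->
    buildTableOutputs(crunchPipeline, path)
    crunchPipeline.run()
}

// Pattern 2 (an alternative, not both): wire up every output first,
// then execute the whole plan once -- this is the case where the
// planner can group several outputs into a single job
paths.each { path ->
    buildTableOutputs(crunchPipeline, path)
}

crunchPipeline.done() // runs any remaining jobs, then cleans up temp data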
On Wednesday, February 20, 2013, Mike Barretta wrote:
> okay, well, things turned for the worse quickly :)
> Following the same output above, the following jobs were created:
> 13/02/20 19:25:26 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/MessageData)+[[S1+Text(/Synthesys/export/Contexts)]/[S0+Text(/Synthesys/export/MessageData)]/[S2+Text(/Synthesys/export/ContextualElements)]]"
> 13/02/20 19:25:26 INFO exec.CrunchJob: Job status available at: <snip>
> 13/02/20 19:25:28 INFO input.FileInputFormat: Total input paths to process : 40
> 13/02/20 19:25:29 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/ElementData)+S5+Text(/Synthesys/export/ElementData)"
> 13/02/20 19:25:29 INFO exec.CrunchJob: Job status available at: <snip>
> 13/02/20 19:25:32 INFO input.FileInputFormat: Total input paths to process : 40
> 13/02/20 19:25:32 INFO exec.CrunchJob: Running job "com.digitalreasoning.petal.extract.SynthesysKBExtractor: SeqFile(/Synthesys/RelationshipData)+S3+Text(/Synthesys/export/RelationshipData)"
> Notice that the first job (MessageData) shows all three output paths, while the last (RelationshipData) shows only one. This is despite the previous log messages showing:
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/export/RelationshipStructures
> *Forgive the mismatched paths between this email and my previous one - I'm shortening for brevity, and trying to convey the difference between the input and export paths.
>
> On Wed, Feb 20, 2013 at 2:30 PM, Mike Barretta wrote:
>
> Was using a very early 0.5.0-incubating build, with Hadoop 0.20.2, but just did a fresh git pull, and now with 0.6.0-incubating things look better (MessageData and RelationshipData are my parents with children):
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData
> I'll try a few more times and let you know if anything funky happens.
> Thanks, as always, for your prompt responses,
> Mike
>
> On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills wrote:
>
> Hey Mike,
> I can't replicate this problem using the MultipleOutputIT (which I think we added as a test for this problem). Which version of Crunch and Hadoop are you using? The 0.5.0-incubating release should be up on the Maven repos if you want to try that out.
> J
>
> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills wrote:
>
> Hey Mike,
> The code looks right to me. Let me whip up a test and see if I can replicate it easily -- is there anything funky beyond what's in your snippet that I should be aware of?
> J
>
> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta wrote:
>
> I have a number of "tables" in HDFS, represented as folders containing SequenceFiles of serialized objects. I'm trying to write a tool that will reassemble these objects and output each of the tables into its own CSV file.
> The wrinkle is that some of the "tables" hold objects with a list of related child objects. Those related children should get chopped into their own table.
> Here is essentially what my loop looks like (in Groovy):
> //loop through each top-level table
> paths.each { path ->
>     def source = From.sequenceFile(new Path(path),
>             Writables.writables(ColumnKey.class),
>             Writables.writables(ColumnDataArrayWritable.class)
>     )
>     //read it in
>     def data = crunchPipeline.read(source)
>     //write it out
>     crunchPipeline.write(
>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>         To.textFile("$path/csv")
>     )
>     //handle children using same PTable as pare
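For what it's worth, my working assumption about the truncated child-handling step is a second parallelDo + write hanging off the same "data" PTable, roughly along these lines -- ChildDoFn and the "$path/children/csv" output path are invented names for illustration, not the actual code:

    //handle children using the same PTable as the parent (hypothetical
    //continuation: ChildDoFn and the children output path are invented)
    crunchPipeline.write(
        data.parallelDo(new ChildDoFn(path), Writables.strings()),
        To.textFile("$path/children/csv")
    )
}

That two-writes-from-one-source shape is the multiple-output case MultipleOutputIT exercises, and it would match the MessageData job listing three Text targets.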

--
Director of Data Science
Cloudera
Twitter: @josh_wills