Mailing-List: contact user-help@crunch.apache.org; run by ezmlm
Reply-To: user@crunch.apache.org
Date: Wed, 13 May 2015 12:31:12 -0700
Subject: Re: question about join
From: Lucy Chen
To: user@crunch.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a113abebe5d94950515fba737

--001a113abebe5d94950515fba737
Content-Type: text/plain; charset=UTF-8

Meanwhile, I just realized that one of my other join jobs, which also has
an Avro type in it, runs fine:

JoinStrategy<String, Feats, Feats> strategy =
    new DefaultJoinStrategy<String, Feats, Feats>(50);
PTable<String, Pair<Feats, Feats>> joined =
    strategy.join(input_A, input_B, JoinType.INNER_JOIN);

So the Avro type is probably not the issue.

Any advice?

Lucy

On Wed, May 13, 2015 at 12:22 PM, Lucy Chen <lucychen2014fall@gmail.com> wrote:

> Hi all,
>
> I have a join step in my Crunch pipeline that looks like the following:
>
> // get label data
> PType<Labels> LabelsType = Avros.records(Labels.class);
> PCollection<Labels> training_labels =
>     input.parallelDo(new LabelDataParser(), LabelsType);
> PTable<String, Labels> labels_data = training_labels.parallelDo(
>     new KeyOnLabels("sample_ID"), tableOf(strings(), LabelsType));
>
> // get features
> PType<Feats> FeatsType = Avros.records(Feats.class);
> PCollection<Feats> training_feats =
>     Feature.FeatLoader(pipeline, sample_features_inputs);
> PTable<String, Feats> feats_data = training_feats.parallelDo(
>     new KeyOnFeats("sample_ID"), tableOf(strings(), FeatsType));
>
> // join labels and features
> JoinStrategy<String, Labels, Feats> strategy =
>     new DefaultJoinStrategy<String, Labels, Feats>(20);
> PTable<String, Pair<Labels, Feats>> joined_training =
>     strategy.join(labels_data, feats_data, JoinType.INNER_JOIN);
>
> // class Labels
> public class Labels implements java.io.Serializable, Cloneable {
>     private String class_ID;
>     private String sample_ID;
>     private int binary_ind;
>
>     public Labels() {
>         this(null, null, 0);
>     }
>
>     public Labels(String class_ID, String sample_ID, int ind) {
>         this.class_ID = class_ID;
>         this.sample_ID = sample_ID;
>         this.binary_ind = ind;
>     }
>     ...
> }
>
> // class Feats
> public class Feats implements java.io.Serializable, Cloneable {
>     private String sample_id;
>     private String sample_name;
>     private Map<String, Float> feat;
>
>     public Feats() {
>         this(null, null, null);
>     }
>
>     public Feats(String id, String name, Map<String, Float> feat) {
>         this.sample_id = id;
>         this.sample_name = name;
>         this.feat = feat;
>     }
>     ...
> }
>
> The outputs of labels_data and feats_data are both fine, but the join
> step throws the following exception:
>
> Error: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.crunch.Pair
>     at org.apache.crunch.lib.join.DefaultJoinStrategy$1.map(DefaultJoinStrategy.java:87)
>     at org.apache.crunch.MapFn.process(MapFn.java:34)
>     at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
>     at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
>     at org.apache.crunch.MapFn.process(MapFn.java:34)
>     at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
>     at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
>     at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
>
> This issue has been bothering me for a while. Has anyone run into a
> similar issue? Is there another option that would solve it?
>
> By the way, I have already successfully run the following join job:
>
> JoinStrategy<String, Float, Tuple3<String, String, Float>> strategy =
>     new DefaultJoinStrategy<String, Float, Tuple3<String, String, Float>>(100);
> PTable<String, Pair<Float, Tuple3<String, String, Float>>> joined =
>     strategy.join(input_A, input_B, JoinType.INNER_JOIN);
>
> So I guess the issue may still be related to the Avro types that I
> defined.
>
> Thanks for your advice.
>
> Lucy

--001a113abebe5d94950515fba737--