Return-Path: X-Original-To: apmail-crunch-user-archive@www.apache.org Delivered-To: apmail-crunch-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A8DF018F71 for ; Sat, 16 May 2015 00:13:34 +0000 (UTC) Received: (qmail 76315 invoked by uid 500); 16 May 2015 00:13:34 -0000 Delivered-To: apmail-crunch-user-archive@crunch.apache.org Received: (qmail 76275 invoked by uid 500); 16 May 2015 00:13:34 -0000 Mailing-List: contact user-help@crunch.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@crunch.apache.org Delivered-To: mailing list user@crunch.apache.org Received: (qmail 76265 invoked by uid 99); 16 May 2015 00:13:34 -0000 Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 16 May 2015 00:13:34 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 18E4BC52BD for ; Sat, 16 May 2015 00:13:34 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.695 X-Spam-Level: ** X-Spam-Status: No, score=2.695 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, MANY_SPAN_IN_TEXT=0.001, RCVD_IN_MSPIKE_H2=-0.205, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id o9TAVnWpdg81 for ; Sat, 16 May 2015 00:13:32 +0000 (UTC) Received: from mail-qg0-f42.google.com (mail-qg0-f42.google.com [209.85.192.42]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id 842C42123A for ; Sat, 16 May 2015 00:13:32 +0000 (UTC) 
Received: by qgf2 with SMTP id 2so18967524qgf.0 for ; Fri, 15 May 2015 17:13:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=Bhk7NABddxLG/XGoRp0mnDybxWS3gSLufIJJZ855bu0=; b=nE0o/YFTVAX2/9uUmfm6PsyU7ITUw26hPmG/G184yVUvTzVH2QtXnBR7dcWsgjOyLh QgR0fzAmr42lMGkcm+/G5Ks47QicemxDJrISqatDhe0WQToDcJYJsY6n87NBSRaCpCyl UgX7r5iaSaTn/2XZ5TS1xkBTRIP4/r1Fy8QOYJmvxmQk7zKk1TDgUpT5CAB6Bh/a/CXu KolXVTv6XQ9QqsOjNfkFdOHE3Uq6ZXhYvsQZN1VbpZAz+H7yZNcJlDtDJEIRt8rE0d40 OoCWrxsIT2i0ldgoPeQYVBRvM/r3xUaUKQGSvPEE4s4jn7NX5MRgKXfodAD+JEOy7j1p a8Lw==
MIME-Version: 1.0
X-Received: by 10.140.134.145 with SMTP id 139mr16630730qhg.6.1431735205771; Fri, 15 May 2015 17:13:25 -0700 (PDT)
Received: by 10.140.172.69 with HTTP; Fri, 15 May 2015 17:13:25 -0700 (PDT)
In-Reply-To:
References:
Date: Fri, 15 May 2015 17:13:25 -0700
Message-ID:
Subject: Re: question about join
From: Lucy Chen
To: user@crunch.apache.org
Content-Type: multipart/alternative; boundary=001a1135dae85a3f84051627d4ff

--001a1135dae85a3f84051627d4ff
Content-Type: text/plain; charset=UTF-8

Hi Josh,

Thanks for your quick response. You are right. It should be

    training_labels.write(At.avroFile(output_path+"/training_labels_avro", LabelsType), WriteMode.OVERWRITE)

instead. Sorry about the mistake. After another round of diagnosis with toy data, the code suddenly started working after I tried to store the Avro file. I don't know why it went from failing to working, but it now runs without any changes. Probably something weird happened.

Thanks again for your time and patience.

Best,
Lucy

On Fri, May 15, 2015 at 12:05 PM, Lucy Chen wrote:

> Hi Josh,
>
> It now looks like the issue is probably with the class Labels. I tried
> to store training_labels as an Avro output (text outputs are OK):
>
>     training_labels.write(At.avroFile(output_path+"/training_labels_avro"), LabelsType, WriteMode.OVERWRITE);
>
> but got the following error:
>
>     The method write(Target, Target.WriteMode) in the type
>     PCollection<Labels> is not applicable for the arguments
>     (SourceTarget<GenericData.Record>, PType<Labels>, Target.WriteMode)
>
> I get a similar error when storing labels_data, which is the first input
> of the join. It seems that Crunch did not recognize the type of
> training_labels. But why did it treat it as a GenericData.Record here?
>
> Any suggestions?
>
> Thanks!
>
> Lucy
>
> On Wed, May 13, 2015 at 12:22 PM, Lucy Chen wrote:
>
>> Hi all,
>>
>> I have a join step in my Crunch pipeline, and it looks like the
>> following:
>>
>>     //get label data
>>     PType<Labels> LabelsType = Avros.records(Labels.class);
>>     PCollection<Labels> training_labels = input.parallelDo(new LabelDataParser(), LabelsType);
>>     PTable<String, Labels> labels_data = training_labels.
>>         parallelDo(new KeyOnLabels("sample_ID"), tableOf(strings(), LabelsType));
>>
>>     //get features
>>     PType<Feats> FeatsType = Avros.records(Feats.class);
>>     PCollection<Feats> training_feats = Feature.FeatLoader(pipeline, sample_features_inputs);
>>     PTable<String, Feats> feats_data = training_feats.parallelDo(new KeyOnFeats("sample_ID"), tableOf(strings(), FeatsType));
>>
>>     //join labels and features
>>     JoinStrategy<String, Labels, Feats> strategy = new DefaultJoinStrategy<String, Labels, Feats>(20);
>>     PTable<String, Pair<Labels, Feats>> joined_training = strategy.
>>         join(labels_data, feats_data, JoinType.INNER_JOIN);
>>
>>     //class Labels
>>     public class Labels implements java.io.Serializable, Cloneable{
>>         private String class_ID;
>>         private String sample_ID;
>>         private int binary_ind;
>>
>>         public Labels()
>>         {
>>             this(null, null, 0);
>>         }
>>
>>         public Labels(String class_ID, String sample_ID, int ind)
>>         {
>>             this.class_ID = class_ID;
>>             this.sample_ID = sample_ID;
>>             this.binary_ind = ind;
>>         }
>>         ...
>>     }
>>
>>     //class Feats
>>     public class Feats implements java.io.Serializable, Cloneable{
>>         private String sample_id;
>>         private String sample_name;
>>         private Map<String, Float> feat;
>>
>>         public Feats()
>>         {
>>             this(null, null, null);
>>         }
>>
>>         public Feats(String id, String name, Map<String, Float> feat)
>>         {
>>             this.sample_id = id;
>>             this.sample_name = name;
>>             this.feat = feat;
>>         }
>>         ...
>>     }
>>
>> The outputs of labels_data and feats_data are both fine, but the join
>> step throws the following exception:
>>
>>     Error: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.crunch.Pair
>>         at org.apache.crunch.lib.join.DefaultJoinStrategy$1.map(DefaultJoinStrategy.java:87)
>>         at org.apache.crunch.MapFn.process(MapFn.java:34)
>>         at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
>>         at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
>>         at org.apache.crunch.MapFn.process(MapFn.java:34)
>>         at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
>>         at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
>>         at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
>>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
>>
>> This issue has been bothering me for a while. Has anyone run into a
>> similar issue? Is there another option that would solve it?
>>
>> Btw, I have already run the following join job successfully:
>>
>>     JoinStrategy<String, Float, Tuple3<String, String, Float>> strategy = new DefaultJoinStrategy<String, Float, Tuple3<String, String, Float>>(100);
>>     PTable<String, Pair<Float, Tuple3<String, String, Float>>> joined = strategy.join(input_A, input_B, JoinType.INNER_JOIN);
>>
>> So I guess the issue may still be related to the Avro types that I
>> defined.
>>
>> Thanks for your advice.
>>
>> Lucy

--001a1135dae85a3f84051627d4ff
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a1135dae85a3f84051627d4ff--
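
For readers following the thread, here is a minimal plain-Java sketch of the result an inner join on sample_ID is expected to produce: one (label, feature) pair per matching key, with keys found on only one side dropped. It does not use Crunch at all. The class InnerJoinSketch, the type JoinedPair, the method innerJoin and the sample keys are hypothetical names chosen for illustration, and the in-memory maps only stand in for the PTable inputs of the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: an in-memory inner join, NOT the Crunch implementation.
public class InnerJoinSketch {

    // Hypothetical stand-in for a pair of joined values (not org.apache.crunch.Pair).
    public static final class JoinedPair<U, V> {
        public final U left;
        public final V right;

        public JoinedPair(U left, V right) {
            this.left = left;
            this.right = right;
        }
    }

    // Emits one (left, right) pair for every combination of values sharing a key.
    // Keys that appear on only one side produce no output (inner-join semantics).
    public static <K, U, V> Map<K, List<JoinedPair<U, V>>> innerJoin(
            Map<K, List<U>> left, Map<K, List<V>> right) {
        Map<K, List<JoinedPair<U, V>>> joined = new LinkedHashMap<>();
        for (Map.Entry<K, List<U>> entry : left.entrySet()) {
            List<V> matches = right.get(entry.getKey());
            if (matches == null) {
                continue; // no partner on the right side: drop this key
            }
            List<JoinedPair<U, V>> pairs = new ArrayList<>();
            for (U u : entry.getValue()) {
                for (V v : matches) {
                    pairs.add(new JoinedPair<>(u, v));
                }
            }
            if (!pairs.isEmpty()) {
                joined.put(entry.getKey(), pairs);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Assumed toy data: integers stand in for Labels, strings for Feats.
        Map<String, List<Integer>> labels = new LinkedHashMap<>();
        labels.put("s1", Arrays.asList(1));
        labels.put("s2", Arrays.asList(0));

        Map<String, List<String>> feats = new LinkedHashMap<>();
        feats.put("s1", Arrays.asList("featsA"));
        feats.put("s3", Arrays.asList("featsC"));

        Map<String, List<JoinedPair<Integer, String>>> joined = innerJoin(labels, feats);
        System.out.println(joined.keySet()); // prints [s1]
    }
}
```

Only "s1" survives because it is the only key present in both inputs. Comparing a small toy join like this against the pipeline output is one way to run the "toy data" check mentioned in the thread.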