spark-user mailing list archives

From Luis Guerra <luispelay...@gmail.com>
Subject Re: class after join
Date Thu, 17 Jul 2014 08:47:28 GMT
Thank you for your fast reply.

We are considering this Map[String, String] solution, but there are some
details that we do not fully understand yet. What would happen if different
fields have different data types? Also, with this solution we have to
repeat the field names for every "row" that we have. Is this efficient?
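
To make both concerns concrete, here is a minimal sketch of one possible
alternative: keep the field names once in a shared index and store only the
values in each row, converting types on access. It uses plain Scala
collections rather than an RDD, and the names fieldNames, fieldIndex, Record
and parse are hypothetical, chosen only for illustration.

```scala
// Hypothetical field names, stored once rather than once per row.
val fieldNames = Vector("V1", "V2", "V3")
val fieldIndex: Map[String, Int] = fieldNames.zipWithIndex.toMap

// A record holds only the values; lookup by name goes through the shared index.
final case class Record(values: Vector[String]) {
  // Access a field by name as a String.
  def apply(name: String): String = values(fieldIndex(name))
  // Convert on access when a field is known to hold an integer.
  def int(name: String): Int = values(fieldIndex(name)).toInt
}

// split with limit -1 keeps trailing empty fields, so no padding is needed.
def parse(line: String): Record =
  Record(line.split("\t", -1).map(_.trim).toVector)

val record = parse("a\t 42 \tc")
val v1: String = record("V1")
val v2: Int = record.int("V2")
```

With this layout each row carries no field names at all, whereas a
Map[String, String] per row stores a reference to every key in every row.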

Regarding the solution with composition, the key would be repeated in the
new class, whereas it is only necessary once after the join, isn't it?
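
As a rough sketch of what I understand by composition (the classes A, B and
Joined and the helper toJoined are hypothetical, and plain tuples stand in
for the elements of the joined RDD): the wrapper only holds references to
the two original objects, and the outer key can be dropped because it is
still reachable through the left record.

```scala
// Hypothetical stand-ins for the two record types being joined.
final case class A(v1: String, v2: String)
final case class B(v1: String, w2: String)

// Composition: hold references to the originals instead of copying fields.
// The key is not stored a third time; it is read from the left record.
final case class Joined(a: A, b: Option[B]) {
  def key: String = a.v1
}

// Same shape as one element of a leftOuterJoin result:
// (key, (left, Option[right])), where the right side is None if unmatched.
val matched: (String, (A, Option[B])) = ("k1", (A("k1", "x"), Some(B("k1", "y"))))
val unmatched: (String, (A, Option[B])) = ("k2", (A("k2", "z"), None))

// Drop the outer key while wrapping the pair.
def toJoined(pair: (String, (A, Option[B]))): Joined = pair match {
  case (_, (a, b)) => Joined(a, b)
}

val results = Seq(matched, unmatched).map(toJoined)
```

If the element types line up, the same function could presumably be passed
to map on the joined RDD, for example data3.map(toJoined).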


On Thu, Jul 17, 2014 at 10:25 AM, Sean Owen <sowen@cloudera.com> wrote:

> If what you have is a large number of named strings, why not use a
> Map[String,String] to represent them? If you're approaching a class
> with >22 String fields anyway, it probably makes more sense. You lose
> a bit of compile-time checking, but gain flexibility.
>
> Also, merging two Maps to make a new one is pretty simple, compared to
> making many of these values classes.
>
> (Although, if you otherwise needed a class that represented "all of
> the things in class A and class B", this could be done easily with
> composition, a class with an A and a B inside.)
>
> On Thu, Jul 17, 2014 at 9:15 AM, Luis Guerra <luispelayo84@gmail.com>
> wrote:
> > Hi all,
> >
> > I am a newbie Spark user with many doubts, so sorry if this is a "silly"
> > question.
> >
> > I am dealing with tabular data formatted as text files, so when I first load
> > the data, my code is like this:
> >
> > case class data_class(
> >   V1: String,
> >   V2: String,
> >   V3: String,
> >   V4: String,
> >   V5: String,
> >   V6: String,
> >   V7: String)
> >
> > val data = sc.textFile(data_path)
> >   .map(x => {
> >     val fields = (x+" ").split("\t")
> >     data_class(fields(0).trim(), fields(1).trim(), fields(2).trim(), fields(3).trim(),
> >       fields(4).trim(), fields(5).trim(), fields(6).trim())
> >   })
> >
> > I am doing this because I would like to access each position using the
> > variable name (V1...V7). Is there any other way of doing this?
> >
> > Also related to this question, if I have data with more than 22 variables, I
> > am restricted to using a class instead of a case class. However, this kind of
> > solution has many restrictions, mainly related to getter methods. Is there
> > any other way of doing this?
> >
> > And finally, one of my main problems comes after operations of different
> > data variables. For instance, if I have two different variables (data1 and
> > data2), and I want to join them both as:
> >
> > val data3 = data1.keyBy(_.V1).leftOuterJoin(data2.keyBy(_.V1))
> >
> > Then I have to post process data3 in order to obtain a new class that
> > contains those variables from data1 and also those variables from data2. As
> > data3 is (key, (data1, data2)), do I have to create a new different class
> > with all these attributes from data1 and data2? This is kind of annoying
> > when there are many attributes.
> >
> > Thanks in advance,
> >
> > Best
>
