Date: Mon, 8 Jun 2015 16:39:00 +0530
From: Sachin Goel
To: dev@flink.apache.org
Subject: Re: Problem with ML pipeline

That would be better, of course. My opinion had to do with
not-implementing-exactly-the-same-thing-twice. Perhaps Till could weigh
in here. We really do need to come up with a general mechanism for
this; testing labeled vectors has exactly the same problem. I'll look
into how Spark and scikit-learn approach this.

Regards
Sachin Goel

On Mon, Jun 8, 2015 at 4:26 PM, Felix Neutatz wrote:

> I am in favor of efficiency. Therefore I would prefer to introduce new
> methods, in order to save memory and network traffic. This would also
> solve the problem of "how to come up with ids?"
>
> Best regards,
> Felix
> On 08.06.2015 12:52 PM, "Sachin Goel" wrote:
>
> > I think if the user doesn't provide IDs, we can safely assume that
> > they don't need them. We can simply assign an ID of one as a
> > temporary measure and return the result with no IDs [just to make
> > the interface cleaner]. If the IDs are provided, we simply use
> > those IDs.
> > A possible template for this would be:
> >
> >   implicit def predictValues[T <: Vector] = {
> >     new PredictOperation[SVM, T, LabeledVector] {
> >       override def predict(
> >           instance: SVM,
> >           predictParameters: ParameterMap,
> >           input: DataSet[T])
> >         : DataSet[LabeledVector] = {
> >         predict(instance, predictParameters, input.map(x => (1, x)))
> >           .map(_._2)
> >       }
> >     }
> >   }
> >
> >   implicit def predictValues[T <: (ID, Vector)] = {
> >     new PredictOperation[SVM, T, (ID, LabeledVector)] {
> >       override def predict(
> >           instance: SVM,
> >           predictParameters: ParameterMap,
> >           input: DataSet[T])
> >         : DataSet[(ID, LabeledVector)] = {
> >         predict(instance, predictParameters, input)
> >       }
> >     }
> >   }
> >
> > Regards
> > Sachin Goel
> >
> > On Mon, Jun 8, 2015 at 4:11 PM, Till Rohrmann wrote:
> >
> > > My gut feeling is also that a `Transformer` would be a good place
> > > to implement feature selection. Then you can simply reuse it
> > > across multiple algorithms by chaining them together.
> > >
> > > However, I don't know yet what's the best way to realize the IDs.
> > > One way would be to add an ID field to `Vector` and
> > > `LabeledVector`. Another way would be to provide operations for
> > > `(ID, Vector)` and `(ID, LabeledVector)` tuple types which reuse
> > > the implementations for `Vector` and `LabeledVector`. This means
> > > that the developer doesn't have to implement special operations
> > > for the tuple variants. The latter approach has the advantage that
> > > you only use memory for IDs if you really need them.
> > >
> > > Another question is how to assign the IDs. Does the user have to
> > > provide them? Are they randomly chosen? Or do we assign each
> > > element an increasing index based on the total number of elements?
> > >
> > > On Mon, Jun 8, 2015 at 12:00 PM Mikio Braun wrote:
> > >
> > > > Hi all,
> > > >
> > > > I think there are a number of issues here:
> > > >
> > > > - whether or not we generally need ids for our examples.
> > > >   For time-series, this is a must, but I think it would also
> > > >   help us with many other things (like partitioning the data, or
> > > >   picking a consistent subset), so I would think adding
> > > >   (numeric) ids in general to LabeledVector would be ok.
> > > > - some machinery to select features. My biggest concern here
> > > >   with putting that as a parameter to the learning algorithm is
> > > >   that this is something independent of the learning algorithm,
> > > >   so every algorithm would need to duplicate the code for that.
> > > >   I think it's better if the learning algorithm can assume that
> > > >   the LabeledVector already contains all the relevant features,
> > > >   and that there should be other operations to project or
> > > >   extract a subset of examples.
> > > >
> > > > -M
> > > >
> > > > On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann wrote:
> > > > > You're right Felix. You need to provide the `FitOperation` and
> > > > > `PredictOperation` for the `Predictor` you want to use, and
> > > > > the `FitOperation` and `TransformOperation` for all
> > > > > `Transformer`s you want to chain in front of the `Predictor`.
> > > > >
> > > > > Specifying which features to take could be a solution.
> > > > > However, then you're always carrying along data which is not
> > > > > needed. Especially for large-scale data, this might be
> > > > > prohibitively expensive. I guess the more efficient solution
> > > > > would be to assign an ID and later join with the removed
> > > > > feature elements.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel wrote:
> > > > >
> > > > >> A more general approach would be to take as input which
> > > > >> indices of the vector to consider as features.
> > > > >> After that, the vector can be returned as such and the user
> > > > >> can do what they wish with the non-feature values. This
> > > > >> wouldn't require extending the predict operation; instead, it
> > > > >> could be specified in the model itself using a set-parameter
> > > > >> function. Or perhaps a better approach is to just take this
> > > > >> input in the predict operation.
> > > > >>
> > > > >> Cheers!
> > > > >> Sachin
> > > > >> On Jun 8, 2015 10:17 AM, "Felix Neutatz" wrote:
> > > > >>
> > > > >> > Probably we also need it for the other classes of the
> > > > >> > pipeline as well, in order to be able to pass the ID
> > > > >> > through the whole pipeline.
> > > > >> >
> > > > >> > Best regards,
> > > > >> > Felix
> > > > >> > On 06.06.2015 9:46 AM, "Till Rohrmann" wrote:
> > > > >> >
> > > > >> > > Then you only have to provide an implicit
> > > > >> > > PredictOperation[SVM, (T, Int), (LabeledVector, Int)]
> > > > >> > > value with T <: Vector in the scope where you call the
> > > > >> > > predict operation.
> > > > >> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" wrote:
> > > > >> > >
> > > > >> > > > That would be great. I like the special predict
> > > > >> > > > operation better because it is only in some cases
> > > > >> > > > necessary to return the id. The special predict
> > > > >> > > > operation would save this overhead.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > > Felix
> > > > >> > > > On 04.06.2015 7:56 PM, "Till Rohrmann" wrote:
> > > > >> > > >
> > > > >> > > > > I see your problem.
> > > > >> > > > > One way to solve the problem is to implement a
> > > > >> > > > > special PredictOperation which takes a tuple
> > > > >> > > > > (id, vector) and returns a tuple (id, labeledVector).
> > > > >> > > > > You can take a look at the implementation of the
> > > > >> > > > > vector prediction operation.
> > > > >> > > > >
> > > > >> > > > > But we can also discuss adding an ID field to the
> > > > >> > > > > Vector type.
> > > > >> > > > >
> > > > >> > > > > Cheers,
> > > > >> > > > > Till
> > > > >> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi,
> > > > >> > > > > >
> > > > >> > > > > > I have the following use case: I want to do
> > > > >> > > > > > regression for a time-series dataset like:
> > > > >> > > > > >
> > > > >> > > > > > id, x1, x2, ..., xn, y
> > > > >> > > > > >
> > > > >> > > > > > id = point in time
> > > > >> > > > > > x = features
> > > > >> > > > > > y = target value
> > > > >> > > > > >
> > > > >> > > > > > In the Flink framework I would map this to a
> > > > >> > > > > > LabeledVector (y, DenseVector(x)). (I don't want to
> > > > >> > > > > > use the id as a feature.)
> > > > >> > > > > >
> > > > >> > > > > > When I finally apply the predict() method, I get a
> > > > >> > > > > > LabeledVector (y_predicted, DenseVector(x)).
> > > > >> > > > > >
> > > > >> > > > > > Now my problem is that I would like to plot the
> > > > >> > > > > > predicted target value according to its time.
> > > > >> > > > > >
> > > > >> > > > > > What I have to do now is:
> > > > >> > > > > >
> > > > >> > > > > >   a = predictedDataSet.map(LabeledVector => Tuple2(x, y_p))
> > > > >> > > > > >   b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x, id))
> > > > >> > > > > >
> > > > >> > > > > >   a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }
> > > > >> > > > > >
> > > > >> > > > > > This is really a cumbersome process for such a
> > > > >> > > > > > simple thing. Is there any approach which makes
> > > > >> > > > > > this simpler? If not, can we extend the ML API to
> > > > >> > > > > > allow ids?
> > > > >> > > > > >
> > > > >> > > > > > Best regards,
> > > > >> > > > > > Felix
> > > >
> > > > --
> > > > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
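The (id, vector) pass-through that Till and Sachin sketch above can be
illustrated with plain Scala collections standing in for Flink DataSets.
The model, the ids, and the `predictLabel` function below are invented
for illustration only; this is a minimal sketch of the idea, not the
actual FlinkML API:

```scala
// Stand-in "model": predicts the sum of the features as the label.
// (Hypothetical; a real Predictor would be a trained SVM or regressor.)
def predictLabel(features: Vector[Double]): Double = features.sum

// Core prediction on bare vectors, as the existing predict operation
// exposes it.
def predict(input: Seq[Vector[Double]]): Seq[Double] =
  input.map(predictLabel)

// Tuple variant: thread an ID through prediction by splitting ids from
// vectors, delegating to the bare-vector operation, and pairing the ids
// back up. This mirrors the proposed
// PredictOperation[(ID, Vector), (ID, LabeledVector)] that reuses the
// Vector implementation instead of duplicating it.
def predictWithIds[ID](input: Seq[(ID, Vector[Double])]): Seq[(ID, Double)] = {
  val (ids, vectors) = input.unzip
  ids.zip(predict(vectors))
}

// Felix's time-series use case: id = point in time, x = features.
val timeSeries = Seq(1 -> Vector(1.0, 2.0), 2 -> Vector(3.0, 4.0))
val predictions = predictWithIds(timeSeries)
// predictions == Seq((1, 3.0), (2, 7.0))
```

Note that on an actual distributed DataSet the zip-back step is not
available (there is no stable element ordering), which is why the thread
converges on carrying the id through the predict operation itself as a
tuple field rather than joining the results back afterwards.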