From: Sachin Goel
To: dev@flink.apache.org
Date: Mon, 8 Jun 2015 16:10:52 +0530
Subject: Re: Problem with ML pipeline

Yes, I agree too. It makes no sense for the learning algorithm to carry
extra payload; only the relevant data does. Furthermore, adding the ID to
the predict operation's type definition seems a legitimate choice. +1 from
my side.

Regards
Sachin Goel

On Mon, Jun 8, 2015 at 4:06 PM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> I agree with Mikio; IDs would be useful overall, and feature selection
> should not be part of the learning algorithms: all features in a
> LabeledVector should be assumed to be relevant by the learners.
>
> On Mon, Jun 8, 2015 at 12:00 PM, Mikio Braun wrote:
>
> > Hi all,
> >
> > I think there are a number of issues here:
> >
> > - Whether or not we generally need IDs for our examples.
> > For time-series, this is a must, but I think it would also help us
> > with many other things (like partitioning the data, or picking a
> > consistent subset), so I would think adding (numeric) IDs to
> > LabeledVector in general would be OK.
> > - Some machinery to select features. My biggest concern with putting
> > that as a parameter to the learning algorithm is that this is
> > something independent of the learning algorithm, so every algorithm
> > would need to duplicate the code for it. I think it's better if the
> > learning algorithm can assume that the LabeledVector already contains
> > all the relevant features, and then there should be other operations
> > to project or extract a subset of examples.
> >
> > -M
> >
> > On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann wrote:
> >
> > > You're right, Felix. You need to provide the `FitOperation` and
> > > `PredictOperation` for the `Predictor` you want to use, and the
> > > `FitOperation` and `TransformOperation` for all `Transformer`s you
> > > want to chain in front of the `Predictor`.
> > >
> > > Specifying which features to take could be a solution. However,
> > > then you're always carrying along data which is not needed.
> > > Especially for large-scale data, this might be prohibitively
> > > expensive. I guess the more efficient solution would be to assign
> > > an ID and later join with the removed feature elements.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel wrote:
> > >
> > >> A more general approach would be to take as input which indices of
> > >> the vector to consider as features. After that, the vector can be
> > >> returned as-is, and the user can do what they wish with the
> > >> non-feature values. This wouldn't require extending the predict
> > >> operation; instead, it could be specified on the model itself via
> > >> a parameter setter. Or perhaps a better approach is to just take
> > >> this input in the predict operation.
> > >>
> > >> Cheers!
> > >> Sachin
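Sachin's index-selection idea could be sketched roughly as follows. This is not FlinkML API: a plain `Seq` stands in for a Flink `DataSet`, and the names `selectFeatures` and `featureIndices` are hypothetical.

```scala
// Hypothetical sketch: pick out the feature indices before handing the
// data to a learner, so non-feature columns (e.g. an id) never reach it.
case class LabeledVector(label: Double, features: Vector[Double])

def selectFeatures(data: Seq[LabeledVector],
                   featureIndices: Seq[Int]): Seq[LabeledVector] =
  // keep only the requested indices of each feature vector
  data.map(lv => lv.copy(features = featureIndices.map(lv.features).toVector))

// index 0 holds an id (42.0), which should not be treated as a feature
val raw     = Seq(LabeledVector(1.0, Vector(42.0, 0.5, 0.7)))
val cleaned = selectFeatures(raw, Seq(1, 2))
// cleaned.head.features is Vector(0.5, 0.7)
```

As Mikio notes, doing this as a separate projection step keeps the selection logic out of every individual learning algorithm.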
> > >>
> > >> On Jun 8, 2015 10:17 AM, "Felix Neutatz" wrote:
> > >>
> > >> > Probably we also need it for the other classes of the pipeline
> > >> > as well, in order to be able to pass the ID through the whole
> > >> > pipeline.
> > >> >
> > >> > Best regards,
> > >> > Felix
> > >> >
> > >> > On 06.06.2015 9:46 AM, "Till Rohrmann" <trohrmann@apache.org> wrote:
> > >> >
> > >> > > Then you only have to provide an implicit PredictOperation[SVM,
> > >> > > (T, Int), (LabeledVector, Int)] value, with T <: Vector, in
> > >> > > the scope where you call the predict operation.
> > >> > >
> > >> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" wrote:
> > >> > >
> > >> > > > That would be great. I like the special predict operation
> > >> > > > better because returning the ID is only necessary in some
> > >> > > > cases; the special predict operation would save this
> > >> > > > overhead.
> > >> > > >
> > >> > > > Best regards,
> > >> > > > Felix
> > >> > > >
> > >> > > > On 04.06.2015 7:56 PM, "Till Rohrmann" <till.rohrmann@gmail.com> wrote:
> > >> > > >
> > >> > > > > I see your problem. One way to solve it is to implement a
> > >> > > > > special PredictOperation which takes a tuple (id, vector)
> > >> > > > > and returns a tuple (id, labeledVector). You can take a
> > >> > > > > look at the implementation of the vector prediction
> > >> > > > > operation.
> > >> > > > >
> > >> > > > > But we can also discuss adding an ID field to the Vector
> > >> > > > > type.
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > Till
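The id-carrying predict operation Till describes could look roughly like this. This is a simplified stand-in, not the actual FlinkML `PredictOperation` trait: a plain function replaces the trained model, and `Seq` replaces a `DataSet`.

```scala
// Hypothetical sketch: wrap a plain vector predictor so that an
// (id, vector) tuple passes the id through unchanged, yielding
// (id, labeledVector) as Till suggests.
case class LabeledVector(label: Double, features: Vector[Double])

def predictWithId(predict: Vector[Double] => Double,
                  data: Seq[(Long, Vector[Double])]): Seq[(Long, LabeledVector)] =
  data.map { case (id, v) => (id, LabeledVector(predict(v), v)) }

// toy stand-in for a trained model's predict function
val toyModel: Vector[Double] => Double = v => v.sum

val out = predictWithId(toyModel, Seq((7L, Vector(1.0, 2.0))))
// out is Seq((7L, LabeledVector(3.0, Vector(1.0, 2.0))))
```

The point of the shape is that the id never enters the model; it is only threaded around the prediction, which is what makes the later plotting-by-time step trivial.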
> > >> > > > >
> > >> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" <neutatz@googlemail.com> wrote:
> > >> > > > >
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > I have the following use case: I want to do regression
> > >> > > > > > for a time-series dataset like:
> > >> > > > > >
> > >> > > > > > id, x1, x2, ..., xn, y
> > >> > > > > >
> > >> > > > > > id = point in time
> > >> > > > > > x = features
> > >> > > > > > y = target value
> > >> > > > > >
> > >> > > > > > In the Flink framework I would map this to a
> > >> > > > > > LabeledVector(y, DenseVector(x)). (I don't want to use
> > >> > > > > > the id as a feature.)
> > >> > > > > >
> > >> > > > > > When I finally apply the predict() method, I get a
> > >> > > > > > LabeledVector(y_predicted, DenseVector(x)).
> > >> > > > > >
> > >> > > > > > Now my problem is that I would like to plot the
> > >> > > > > > predicted target value against its time.
> > >> > > > > >
> > >> > > > > > What I have to do now is:
> > >> > > > > >
> > >> > > > > > a = predictedDataSet.map(LabeledVector => Tuple2(x, y_p))
> > >> > > > > > b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x, id))
> > >> > > > > >
> > >> > > > > > a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }
> > >> > > > > >
> > >> > > > > > This is a really cumbersome process for such a simple
> > >> > > > > > thing. Is there any approach which makes this simpler?
> > >> > > > > > If not, can we extend the ML API to allow IDs?
> > >> > > > > >
> > >> > > > > > Best regards,
> > >> > > > > > Felix
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
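For reference, Felix's current workaround (joining predictions back to ids via the feature vector) could be sketched as below. Plain Scala collections stand in for Flink DataSets, and all names are illustrative, not FlinkML API.

```scala
// Hypothetical sketch of the cumbersome workaround: recover the id for
// each prediction by joining on the raw feature vector itself.
case class LabeledVector(label: Double, features: Vector[Double])

// original data: (id, features); predictions: (y_predicted, features)
val original  = Seq((1L, Vector(0.1, 0.2)), (2L, Vector(0.3, 0.4)))
val predicted = Seq(LabeledVector(0.9, Vector(0.1, 0.2)),
                    LabeledVector(1.1, Vector(0.3, 0.4)))

// build the "b" side of the join: features -> id
val byFeatures: Map[Vector[Double], Long] =
  original.map { case (id, x) => (x, id) }.toMap

// the join itself: look each prediction's features up to recover the id
val idToPrediction: Seq[(Long, Double)] =
  predicted.flatMap(lv => byFeatures.get(lv.features).map(id => (id, lv.label)))
// idToPrediction is Seq((1L, 0.9), (2L, 1.1))
```

Besides being verbose, this join silently loses rows whenever two points in time happen to share an identical feature vector, which is one more argument for carrying the id through the pipeline explicitly.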