flink-dev mailing list archives

From Till Rohrmann <till.rohrm...@gmail.com>
Subject Re: Problem with ML pipeline
Date Mon, 08 Jun 2015 10:41:26 GMT
My gut feeling is also that a `Transformer` would be a good place to
implement feature selection. You can then reuse it across multiple
algorithms simply by chaining them together.
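
To make the chaining idea concrete, here is a minimal sketch in plain
Scala. It does not use the FlinkML `Transformer` or `Predictor` classes:
`selectFeatures`, `linearPredictor` and the `Features` alias are
hypothetical stand-ins, and ordinary functions take the place of pipeline
stages.

```scala
object FeatureSelectionSketch {
  // Plain Scala collections stand in for FlinkML's vector types (assumption).
  type Features = Vector[Double]

  // Hypothetical transformer step: keep only the features at the given indices.
  def selectFeatures(indices: Seq[Int]): Features => Features =
    x => indices.map(i => x(i)).toVector

  // Hypothetical predictor: a dot product with fixed weights.
  def linearPredictor(weights: Features): Features => Double =
    x => weights.zip(x).map { case (w, v) => w * v }.sum

  // The same selection step is chained in front of two different predictors,
  // so neither predictor has to know about feature selection.
  val select: Features => Features = selectFeatures(Seq(0, 2))
  val pipelineA: Features => Double = select andThen linearPredictor(Vector(1.0, 10.0))
  val pipelineB: Features => Double = select andThen linearPredictor(Vector(0.5, 0.5))

  def main(args: Array[String]): Unit = {
    val x = Vector(1.0, 99.0, 3.0) // the middle value is dropped by `select`
    println(pipelineA(x))
    println(pipelineB(x))
  }
}
```

The point of the sketch is only that the selection logic is written once
and composed, rather than duplicated as a parameter of every algorithm.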

However, I don't know yet what's the best way to realize the IDs. One way
would be to add an ID field to `Vector` and `LabeledVector`. Another way
would be to provide operations for `(ID, Vector)` and `(ID, LabeledVector)`
tuple types which reuse the implementations for `Vector` and
`LabeledVector`. This means that the developer doesn't have to implement
special operations for the tuple variants. The latter approach has the
advantage that you only use memory for IDs if you really need them.
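
A rough sketch of the tuple variant, assuming a type-class style similar
in spirit to the implicit operations discussed in this thread. Everything
here is a stand-in: `Vec`, `Labeled`, `PredictOp` and `LinearModel` are
hypothetical names, not the FlinkML types, and the real operations work on
`DataSet`s rather than single elements.

```scala
// Stand-in types; these are NOT the FlinkML classes.
final case class Vec(values: Seq[Double])
final case class Labeled(label: Double, vector: Vec)
final case class LinearModel(weights: Seq[Double])

// Hypothetical type class, loosely modelled on the idea of a predict operation.
trait PredictOp[Model, In, Out] {
  def predict(model: Model, input: In): Out
}

object PredictOp {
  // Base operation, written once for plain vectors.
  implicit val linearOnVec: PredictOp[LinearModel, Vec, Labeled] =
    new PredictOp[LinearModel, Vec, Labeled] {
      def predict(model: LinearModel, input: Vec): Labeled = {
        val y = model.weights.zip(input.values).map { case (w, x) => w * x }.sum
        Labeled(y, input)
      }
    }

  // Derived operation: an (ID, In) input reuses the base operation and
  // carries the ID through untouched, so no per-algorithm code is needed.
  implicit def withId[Model, ID, In, Out](
      implicit base: PredictOp[Model, In, Out]): PredictOp[Model, (ID, In), (ID, Out)] =
    new PredictOp[Model, (ID, In), (ID, Out)] {
      def predict(model: Model, input: (ID, In)): (ID, Out) =
        (input._1, base.predict(model, input._2))
    }
}

object IdTupleSketch {
  def predict[Model, In, Out](model: Model, input: In)(
      implicit op: PredictOp[Model, In, Out]): Out = op.predict(model, input)

  def main(args: Array[String]): Unit = {
    val model = LinearModel(Seq(2.0, 0.5))
    // Without an ID: no extra memory is spent on identifiers.
    val plain: Labeled = predict(model, Vec(Seq(1.0, 4.0)))
    // With an ID: the same implementation is reused via `withId`.
    val tagged: (Long, Labeled) = predict(model, (42L, Vec(Seq(1.0, 4.0))))
    println(plain)
    println(tagged)
  }
}
```

Only callers that pass tuples pay for the ID, which is the memory
argument made above.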

Another question is how to assign the IDs. Does the user have to provide
them? Are they randomly chosen? Or do we assign each element an increasing
index based on the total number of elements?
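
For the last two options, a small sketch of what the assignment could look
like. It is an illustration on local collections only: partitions are
modelled as nested `Seq`s, and `assignIncreasingIds` and `assignRandomIds`
are hypothetical helper names, not existing Flink functions.

```scala
import java.util.UUID

object IdAssignmentSketch {
  // Increasing index: count the elements per partition first, then give each
  // partition a starting offset equal to the number of elements before it.
  def assignIncreasingIds[T](partitions: Seq[Seq[T]]): Seq[Seq[(Long, T)]] = {
    val counts = partitions.map(_.size.toLong)
    // offsets(i) = total number of elements in partitions 0 until i
    val offsets = counts.scanLeft(0L)(_ + _)
    partitions.zip(offsets).map { case (part, offset) =>
      part.zipWithIndex.map { case (elem, i) => (offset + i, elem) }
    }
  }

  // Random IDs: no counting pass is needed, but the IDs carry no ordering.
  def assignRandomIds[T](elements: Seq[T]): Seq[(UUID, T)] =
    elements.map(e => (UUID.randomUUID(), e))

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b"), Seq("c"))
    println(assignIncreasingIds(partitions))
    println(assignRandomIds(partitions.flatten))
  }
}
```

The increasing-index variant needs an extra pass to count elements, which
is the trade-off against random or user-provided IDs.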

On Mon, Jun 8, 2015 at 12:00 PM Mikio Braun <mikiobraun@googlemail.com>
wrote:

> Hi all,
>
> I think there are a number of issues here:
>
> - whether or not we generally need ids for our examples. For
> time-series, this is a must, but I think it would also help us with
> many other things (like partitioning the data, or picking a consistent
> subset), so I would think adding (numeric) ids in general to
> LabeledVector would be ok.
> - Some machinery to select features. My biggest concern with putting
> that as a parameter to the learning algorithm is that this is
> something independent of the learning algorithm, so every algorithm
> would need to duplicate the code for that. I think it's better if the
> learning algorithm can assume that the LabeledVector already contains
> all the relevant features, and then there should be other operations
> to project or extract a subset of examples.
>
> -M
>
> On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann <till.rohrmann@gmail.com>
> wrote:
> > You're right, Felix. You need to provide the `FitOperation` and
> > `PredictOperation` for the `Predictor` you want to use and the
> > `FitOperation` and `TransformOperation` for all `Transformer`s you want
> > to chain in front of the `Predictor`.
> >
> > Specifying which features to take could be a solution. However, then
> > you're always carrying data along which is not needed. Especially for
> > large-scale data, this might be prohibitively expensive. I guess the more
> > efficient solution would be to assign an ID and later join with the
> > removed feature elements.
> >
> > Cheers,
> > Till
> >
> > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel <sachingoel0101@gmail.com> wrote:
> >
> >> A more general approach would be to take as input which indices of the
> >> vector to consider as features. After that, the vector can be returned
> >> as such and the user can do what they wish with the non-feature values.
> >> This wouldn't need extending the predict operation; instead this can be
> >> specified in the model itself using a set parameter function. Or perhaps
> >> a better approach is to just take this input in the predict operation.
> >>
> >> Cheers!
> >> Sachin
> >> On Jun 8, 2015 10:17 AM, "Felix Neutatz" <neutatz@googlemail.com> wrote:
> >>
> >> > Probably we also need it for the other classes of the pipeline as well,
> >> > in order to be able to pass the ID through the whole pipeline.
> >> >
> >> > Best regards,
> >> > Felix
> >> > On 06.06.2015 at 9:46 AM, "Till Rohrmann" <trohrmann@apache.org> wrote:
> >> >
> >> > > Then you only have to provide an implicit
> >> > > PredictOperation[SVM, (T, Int), (LabeledVector, Int)] value with
> >> > > T <: Vector in the scope where you call the predict operation.
> >> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" <neutatz@googlemail.com> wrote:
> >> > >
> >> > > > That would be great. I like the special predict operation better
> >> > > > because it is only necessary to return the id in some cases. The
> >> > > > special predict operation would save this overhead.
> >> > > >
> >> > > > Best regards,
> >> > > > Felix
> >> > > > On 04.06.2015 at 7:56 PM, "Till Rohrmann" <till.rohrmann@gmail.com> wrote:
> >> > > >
> >> > > > > I see your problem. One way to solve the problem is to implement
> >> > > > > a special PredictOperation which takes a tuple (id, vector) and
> >> > > > > returns a tuple (id, labeledVector). You can take a look at the
> >> > > > > implementation for the vector prediction operation.
> >> > > > >
> >> > > > > But we can also discuss adding an ID field to the Vector type.
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" <neutatz@googlemail.com> wrote:
> >> > > > >
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > > I have the following use case: I want to do regression for a
> >> > > > > > timeseries dataset like:
> >> > > > > >
> >> > > > > > id, x1, x2, ..., xn, y
> >> > > > > >
> >> > > > > > id = point in time
> >> > > > > > x = features
> >> > > > > > y = target value
> >> > > > > >
> >> > > > > > In the Flink framework I would map this to a LabeledVector
> >> > > > > > (y, DenseVector(x)). (I don't want to use the id as a feature.)
> >> > > > > >
> >> > > > > > When I finally apply the predict() method I get a LabeledVector
> >> > > > > > (y_predicted, DenseVector(x)).
> >> > > > > >
> >> > > > > > Now my problem is that I would like to plot the predicted target
> >> > > > > > value according to its time.
> >> > > > > >
> >> > > > > > What I have to do now is:
> >> > > > > >
> >> > > > > > a = predictedDataSet.map(LabeledVector => Tuple2(x, y_p))
> >> > > > > > b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x, id))
> >> > > > > >
> >> > > > > > a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }
> >> > > > > >
> >> > > > > > This is really a cumbersome process for such a simple thing. Is
> >> > > > > > there any approach which makes this simpler? If not, can we
> >> > > > > > extend the ML API to allow ids?
> >> > > > > >
> >> > > > > > Best regards,
> >> > > > > > Felix
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
>
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>
