Date: Mon, 8 Jun 2015 16:39:00 +0530
From: Sachin Goel
To: dev@flink.apache.org
Subject: Re: Problem with ML pipeline

That would be better, of course. My opinion had to do with
not-implementing-exactly-the-same-thing-twice. Perhaps Till could weigh
in here. We really do need to come up with a general mechanism for
this; testing labeled vectors has exactly the same problem. I'll look
into how Spark and scikit-learn approach this.

Regards
Sachin Goel

On Mon, Jun 8, 2015 at 4:26 PM, Felix Neutatz wrote:

> I am in favor of efficiency. Therefore I would prefer to introduce new
> methods, in order to save memory and network traffic. This would also
> solve the problem of "how to come up with ids?"
>
> Best regards,
> Felix
> On 08.06.2015 12:52 PM, "Sachin Goel" wrote:
>
> > I think if the user doesn't provide IDs, we can safely assume that
> > they don't need them. We can simply assign an ID of one as a
> > temporary measure and return the result with no IDs [just to make
> > the interface cleaner]. If the IDs are provided, we simply use
> > those IDs.
> > A possible template for this would be:
> >
> >   implicit def predictValues[T <: Vector] = {
> >     new PredictOperation[SVM, T, LabeledVector] {
> >       override def predict(
> >           instance: SVM,
> >           predictParameters: ParameterMap,
> >           input: DataSet[T])
> >         : DataSet[LabeledVector] = {
> >         predict(instance, predictParameters, input.map(x => (1, x)))
> >           .map(_._2)
> >       }
> >     }
> >   }
> >
> >   implicit def predictValues[T <: (ID, Vector)] = {
> >     new PredictOperation[SVM, T, (ID, LabeledVector)] {
> >       override def predict(
> >           instance: SVM,
> >           predictParameters: ParameterMap,
> >           input: DataSet[T])
> >         : DataSet[(ID, LabeledVector)] = {
> >         predict(instance, predictParameters, input)
> >       }
> >     }
> >   }
> >
> > Regards
> > Sachin Goel
> >
> > On Mon, Jun 8, 2015 at 4:11 PM, Till Rohrmann wrote:
> >
> > > My gut feeling is also that a `Transformer` would be a good place
> > > to implement feature selection. Then you can simply reuse it
> > > across multiple algorithms by chaining them together.
> > >
> > > However, I don't know yet what's the best way to realize the IDs.
> > > One way would be to add an ID field to `Vector` and
> > > `LabeledVector`. Another way would be to provide operations for
> > > `(ID, Vector)` and `(ID, LabeledVector)` tuple types which reuse
> > > the implementations for `Vector` and `LabeledVector`. This means
> > > that the developer doesn't have to implement special operations
> > > for the tuple variants. The latter approach has the advantage that
> > > you only use memory for IDs if you really need them.
> > >
> > > Another question is how to assign the IDs. Does the user have to
> > > provide them? Are they randomly chosen? Or do we assign each
> > > element an increasing index based on the total number of elements?
> > >
> > > On Mon, Jun 8, 2015 at 12:00 PM Mikio Braun wrote:
> > >
> > > > Hi all,
> > > >
> > > > I think there are a number of issues here:
> > > >
> > > > - whether or not we generally need ids for our examples.
> > > >   For time-series, this is a must, but I think it would also
> > > >   help us with many other things (like partitioning the data, or
> > > >   picking a consistent subset), so I would think adding
> > > >   (numeric) ids in general to LabeledVector would be ok.
> > > > - some machinery to select features. My biggest concern here
> > > >   with putting that as a parameter to the learning algorithm is
> > > >   that this is something independent of the learning algorithm,
> > > >   so every algorithm would need to duplicate the code for that.
> > > >   I think it's better if the learning algorithm can assume that
> > > >   the LabeledVector already contains all the relevant features,
> > > >   and that there should be other operations to project or
> > > >   extract a subset of examples.
> > > >
> > > > -M
> > > >
> > > > On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann wrote:
> > > > > You're right Felix. You need to provide the `FitOperation` and
> > > > > `PredictOperation` for the `Predictor` you want to use, and
> > > > > the `FitOperation` and `TransformOperation` for all
> > > > > `Transformer`s you want to chain in front of the `Predictor`.
> > > > >
> > > > > Specifying which features to take could be a solution.
> > > > > However, then you're always carrying along data which is not
> > > > > needed. Especially for large-scale data, this might be
> > > > > prohibitively expensive. I guess the more efficient solution
> > > > > would be to assign an ID and later join with the removed
> > > > > feature elements.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel wrote:
> > > > >
> > > > >> A more general approach would be to take as input which
> > > > >> indices of the vector to consider as features.
> > > > >> After that, the vector can be returned as such and the user
> > > > >> can do what they wish with the non-feature values. This
> > > > >> wouldn't require extending the predict operation; instead, it
> > > > >> could be specified in the model itself using a set-parameter
> > > > >> function. Or perhaps a better approach is to just take this
> > > > >> input in the predict operation.
> > > > >>
> > > > >> Cheers!
> > > > >> Sachin
> > > > >> On Jun 8, 2015 10:17 AM, "Felix Neutatz" wrote:
> > > > >>
> > > > >> > Probably we also need it for the other classes of the
> > > > >> > pipeline as well, in order to be able to pass the ID
> > > > >> > through the whole pipeline.
> > > > >> >
> > > > >> > Best regards,
> > > > >> > Felix
> > > > >> > On 06.06.2015 9:46 AM, "Till Rohrmann" wrote:
> > > > >> >
> > > > >> > > Then you only have to provide an implicit
> > > > >> > > PredictOperation[SVM, (T, Int), (LabeledVector, Int)]
> > > > >> > > value with T <: Vector in the scope where you call the
> > > > >> > > predict operation.
> > > > >> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" wrote:
> > > > >> > >
> > > > >> > > > That would be great. I like the special predict
> > > > >> > > > operation better because it is only in some cases
> > > > >> > > > necessary to return the id. The special predict
> > > > >> > > > operation would save this overhead.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > > Felix
> > > > >> > > > On 04.06.2015 7:56 PM, "Till Rohrmann" wrote:
> > > > >> > > >
> > > > >> > > > > I see your problem.
> > > > >> > > > > One way to solve the problem is to implement a
> > > > >> > > > > special PredictOperation which takes a tuple
> > > > >> > > > > (id, vector) and returns a tuple (id, labeledVector).
> > > > >> > > > > You can take a look at the implementation of the
> > > > >> > > > > vector prediction operation.
> > > > >> > > > >
> > > > >> > > > > But we can also discuss adding an ID field to the
> > > > >> > > > > Vector type.
> > > > >> > > > >
> > > > >> > > > > Cheers,
> > > > >> > > > > Till
> > > > >> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi,
> > > > >> > > > > >
> > > > >> > > > > > I have the following use case: I want to do
> > > > >> > > > > > regression for a time-series dataset like:
> > > > >> > > > > >
> > > > >> > > > > > id, x1, x2, ..., xn, y
> > > > >> > > > > >
> > > > >> > > > > > id = point in time
> > > > >> > > > > > x = features
> > > > >> > > > > > y = target value
> > > > >> > > > > >
> > > > >> > > > > > In the Flink framework I would map this to a
> > > > >> > > > > > LabeledVector (y, DenseVector(x)). (I don't want to
> > > > >> > > > > > use the id as a feature.)
> > > > >> > > > > >
> > > > >> > > > > > When I finally apply the predict() method, I get a
> > > > >> > > > > > LabeledVector (y_predicted, DenseVector(x)).
> > > > >> > > > > >
> > > > >> > > > > > Now my problem is that I would like to plot the
> > > > >> > > > > > predicted target value according to its time.
> > > > >> > > > > >
> > > > >> > > > > > What I have to do now is:
> > > > >> > > > > >
> > > > >> > > > > >   a = predictedDataSet.map(LabeledVector => Tuple2(x, y_p))
> > > > >> > > > > >   b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x, id))
> > > > >> > > > > >
> > > > >> > > > > >   a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }
> > > > >> > > > > >
> > > > >> > > > > > This is really a cumbersome process for such a
> > > > >> > > > > > simple thing. Is there any approach which makes
> > > > >> > > > > > this simpler? If not, can we extend the ML API to
> > > > >> > > > > > allow ids?
> > > > >> > > > > >
> > > > >> > > > > > Best regards,
> > > > >> > > > > > Felix
> > > >
> > > > --
> > > > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
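The (id, vector) pass-through that Till and Sachin sketch above can be
illustrated with plain Scala collections standing in for Flink DataSets.
The model, the ids, and the `predictLabel` function below are invented
for illustration only; this is a minimal sketch of the idea, not the
actual FlinkML API:

```scala
// Stand-in "model": predicts the sum of the features as the label.
// (Hypothetical; a real Predictor would be a trained SVM or regressor.)
def predictLabel(features: Vector[Double]): Double = features.sum

// Core prediction on bare vectors, as the existing predict operation
// exposes it.
def predict(input: Seq[Vector[Double]]): Seq[Double] =
  input.map(predictLabel)

// Tuple variant: thread an ID through prediction by splitting ids from
// vectors, delegating to the bare-vector operation, and pairing the ids
// back up. This mirrors the proposed
// PredictOperation[(ID, Vector), (ID, LabeledVector)] that reuses the
// Vector implementation instead of duplicating it.
def predictWithIds[ID](input: Seq[(ID, Vector[Double])]): Seq[(ID, Double)] = {
  val (ids, vectors) = input.unzip
  ids.zip(predict(vectors))
}

// Felix's time-series use case: id = point in time, x = features.
val timeSeries = Seq(1 -> Vector(1.0, 2.0), 2 -> Vector(3.0, 4.0))
val predictions = predictWithIds(timeSeries)
// predictions == Seq((1, 3.0), (2, 7.0))
```

Note that on an actual distributed DataSet the zip-back step is not
available (there is no stable element ordering), which is why the thread
converges on carrying the id through the predict operation itself as a
tuple field rather than joining the results back afterwards.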