From: Sachin Goel
To: dev@flink.apache.org
Date: Mon, 8 Jun 2015 16:10:52 +0530
Subject: Re: Problem with ML pipeline

Yes, I agree too. It makes no sense for the learning algorithm to carry
extra payload; only the relevant data does. Furthermore, adding the ID to
the predict operation's type definition seems a legitimate choice. +1 from
my side.

Regards
Sachin Goel

On Mon, Jun 8, 2015 at 4:06 PM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> I agree with Mikio; IDs would be useful overall, and feature selection
> should not be part of the learning algorithms: all features in a
> LabeledVector should be assumed to be relevant by the learners.
>
> On Mon, Jun 8, 2015 at 12:00 PM, Mikio Braun wrote:
>
> > Hi all,
> >
> > I think there are a number of issues here:
> >
> > - Whether or not we generally need IDs for our examples.
> > For time-series, this is a must, but I think it would also help us
> > with many other things (like partitioning the data, or picking a
> > consistent subset), so I would think adding (numeric) IDs to
> > LabeledVector in general would be OK.
> > - Some machinery to select features. My biggest concern with putting
> > that as a parameter to the learning algorithm is that this is
> > something independent of the learning algorithm, so every algorithm
> > would need to duplicate the code for it. I think it's better if the
> > learning algorithm can assume that the LabeledVector already contains
> > all the relevant features, and then there should be other operations
> > to project or extract a subset of examples.
> >
> > -M
> >
> > On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann wrote:
> >
> > > You're right, Felix. You need to provide the `FitOperation` and
> > > `PredictOperation` for the `Predictor` you want to use, and the
> > > `FitOperation` and `TransformOperation` for all `Transformer`s you
> > > want to chain in front of the `Predictor`.
> > >
> > > Specifying which features to take could be a solution. However,
> > > then you're always carrying along data which is not needed.
> > > Especially for large-scale data, this might be prohibitively
> > > expensive. I guess the more efficient solution would be to assign
> > > an ID and later join with the removed feature elements.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel wrote:
> > >
> > >> A more general approach would be to take as input which indices of
> > >> the vector to consider as features. After that, the vector can be
> > >> returned as-is, and the user can do what they wish with the
> > >> non-feature values. This wouldn't require extending the predict
> > >> operation; instead, it could be specified on the model itself via
> > >> a parameter setter. Or perhaps a better approach is to just take
> > >> this input in the predict operation.
> > >>
> > >> Cheers!
> > >> Sachin
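Sachin's index-selection idea could be sketched roughly as follows. This is not FlinkML API: a plain `Seq` stands in for a Flink `DataSet`, and the names `selectFeatures` and `featureIndices` are hypothetical.

```scala
// Hypothetical sketch: pick out the feature indices before handing the
// data to a learner, so non-feature columns (e.g. an id) never reach it.
case class LabeledVector(label: Double, features: Vector[Double])

def selectFeatures(data: Seq[LabeledVector],
                   featureIndices: Seq[Int]): Seq[LabeledVector] =
  // keep only the requested indices of each feature vector
  data.map(lv => lv.copy(features = featureIndices.map(lv.features).toVector))

// index 0 holds an id (42.0), which should not be treated as a feature
val raw     = Seq(LabeledVector(1.0, Vector(42.0, 0.5, 0.7)))
val cleaned = selectFeatures(raw, Seq(1, 2))
// cleaned.head.features is Vector(0.5, 0.7)
```

As Mikio notes, doing this as a separate projection step keeps the selection logic out of every individual learning algorithm.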
> > >>
> > >> On Jun 8, 2015 10:17 AM, "Felix Neutatz" wrote:
> > >>
> > >> > Probably we also need it for the other classes of the pipeline
> > >> > as well, in order to be able to pass the ID through the whole
> > >> > pipeline.
> > >> >
> > >> > Best regards,
> > >> > Felix
> > >> >
> > >> > On 06.06.2015 9:46 AM, "Till Rohrmann" <trohrmann@apache.org> wrote:
> > >> >
> > >> > > Then you only have to provide an implicit PredictOperation[SVM,
> > >> > > (T, Int), (LabeledVector, Int)] value, with T <: Vector, in
> > >> > > the scope where you call the predict operation.
> > >> > >
> > >> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" wrote:
> > >> > >
> > >> > > > That would be great. I like the special predict operation
> > >> > > > better because returning the ID is only necessary in some
> > >> > > > cases; the special predict operation would save this
> > >> > > > overhead.
> > >> > > >
> > >> > > > Best regards,
> > >> > > > Felix
> > >> > > >
> > >> > > > On 04.06.2015 7:56 PM, "Till Rohrmann" <till.rohrmann@gmail.com> wrote:
> > >> > > >
> > >> > > > > I see your problem. One way to solve it is to implement a
> > >> > > > > special PredictOperation which takes a tuple (id, vector)
> > >> > > > > and returns a tuple (id, labeledVector). You can take a
> > >> > > > > look at the implementation of the vector prediction
> > >> > > > > operation.
> > >> > > > >
> > >> > > > > But we can also discuss adding an ID field to the Vector
> > >> > > > > type.
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > Till
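The id-carrying predict operation Till describes could look roughly like this. This is a simplified stand-in, not the actual FlinkML `PredictOperation` trait: a plain function replaces the trained model, and `Seq` replaces a `DataSet`.

```scala
// Hypothetical sketch: wrap a plain vector predictor so that an
// (id, vector) tuple passes the id through unchanged, yielding
// (id, labeledVector) as Till suggests.
case class LabeledVector(label: Double, features: Vector[Double])

def predictWithId(predict: Vector[Double] => Double,
                  data: Seq[(Long, Vector[Double])]): Seq[(Long, LabeledVector)] =
  data.map { case (id, v) => (id, LabeledVector(predict(v), v)) }

// toy stand-in for a trained model's predict function
val toyModel: Vector[Double] => Double = v => v.sum

val out = predictWithId(toyModel, Seq((7L, Vector(1.0, 2.0))))
// out is Seq((7L, LabeledVector(3.0, Vector(1.0, 2.0))))
```

The point of the shape is that the id never enters the model; it is only threaded around the prediction, which is what makes the later plotting-by-time step trivial.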
> > >> > > > >
> > >> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" <neutatz@googlemail.com> wrote:
> > >> > > > >
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > I have the following use case: I want to do regression
> > >> > > > > > for a time-series dataset like:
> > >> > > > > >
> > >> > > > > > id, x1, x2, ..., xn, y
> > >> > > > > >
> > >> > > > > > id = point in time
> > >> > > > > > x = features
> > >> > > > > > y = target value
> > >> > > > > >
> > >> > > > > > In the Flink framework I would map this to a
> > >> > > > > > LabeledVector(y, DenseVector(x)). (I don't want to use
> > >> > > > > > the id as a feature.)
> > >> > > > > >
> > >> > > > > > When I finally apply the predict() method, I get a
> > >> > > > > > LabeledVector(y_predicted, DenseVector(x)).
> > >> > > > > >
> > >> > > > > > Now my problem is that I would like to plot the
> > >> > > > > > predicted target value against its time.
> > >> > > > > >
> > >> > > > > > What I have to do now is:
> > >> > > > > >
> > >> > > > > > a = predictedDataSet.map(LabeledVector => Tuple2(x, y_p))
> > >> > > > > > b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x, id))
> > >> > > > > >
> > >> > > > > > a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }
> > >> > > > > >
> > >> > > > > > This is a really cumbersome process for such a simple
> > >> > > > > > thing. Is there any approach which makes this simpler?
> > >> > > > > > If not, can we extend the ML API to allow IDs?
> > >> > > > > >
> > >> > > > > > Best regards,
> > >> > > > > > Felix
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
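For reference, Felix's current workaround (joining predictions back to ids via the feature vector) could be sketched as below. Plain Scala collections stand in for Flink DataSets, and all names are illustrative, not FlinkML API.

```scala
// Hypothetical sketch of the cumbersome workaround: recover the id for
// each prediction by joining on the raw feature vector itself.
case class LabeledVector(label: Double, features: Vector[Double])

// original data: (id, features); predictions: (y_predicted, features)
val original  = Seq((1L, Vector(0.1, 0.2)), (2L, Vector(0.3, 0.4)))
val predicted = Seq(LabeledVector(0.9, Vector(0.1, 0.2)),
                    LabeledVector(1.1, Vector(0.3, 0.4)))

// build the "b" side of the join: features -> id
val byFeatures: Map[Vector[Double], Long] =
  original.map { case (id, x) => (x, id) }.toMap

// the join itself: look each prediction's features up to recover the id
val idToPrediction: Seq[(Long, Double)] =
  predicted.flatMap(lv => byFeatures.get(lv.features).map(id => (id, lv.label)))
// idToPrediction is Seq((1L, 0.9), (2L, 1.1))
```

Besides being verbose, this join silently loses rows whenever two points in time happen to share an identical feature vector, which is one more argument for carrying the id through the pipeline explicitly.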