From: Woo Jae Jung
Date: Fri, 28 Oct 2016 13:29:55 -0700
Subject: Re: Encoding categorical variables
Reply-To: dev@madlib.incubator.apache.org
To:
    dev@madlib.incubator.apache.org
Cc: user@madlib.incubator.apache.org
archived-at: Fri, 28 Oct 2016 20:30:05 -0000

Also agree that double-quoted column names are not ideal. In addition to the
net-new features described in this thread, it'd be nice to see
non-double-quoted output as default behavior in the existing
create_indicator_variables() function.

Thanks,
Woo

On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung wrote:

> I like the one-hot encoded feature. Another variant of this idea would be
> an "all other" variable (distinct from the reference class) that contains
> occurrences of the less frequent category types. In both of these
> scenarios, the threshold for 'less frequent' could be user-supplied.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer wrote:
>
>> An alternative to dropping is to assign the less frequent values to the
>> reference i.e. all one-hot encoded features will be 0.
>> Also important to note: total runtime will increase with this option since
>> we'll have to compute the exact frequency distribution.
>>
>> Another suggested change is to call this function 'one_hot_encoding' since
>> that is the output here (similar to sklearn's OneHotEncoder
>> preprocessing.OneHotEncoder.html>).
>> We can keep the current name as a deprecated alias till 2.0 is released.
>>
>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan wrote:
>>
>> > Jarrod,
>> >
>> > Just trying to write up detailed requirements. How would you see this
>> > one working?
>> >
>> > "2) Option to dummy code only the top n most frequently occurring
>> > values in any column"
>> >
>> > With 1 column I can picture it, you would drop the rows with the less
>> > frequently occurring values and end up with a smaller table. But what
>> > if you are encoding multiple rows? Would you want a per row
>> > specification of n? i.e., top 3 values for column x, top 10 values for
>> > column y? If you did this then your result set might include low
>> > frequency values for column x (not in top 3) because they are in the
>> > top 10 for column y - this might be confusing.
>> >
>> > Frank
>> >
>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan wrote:
>> >
>> >> great, thanks for the additional information
>> >>
>> >> Frank
>> >>
>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey wrote:
>> >>
>> >>> IMO
>> >>>
>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>> implementation - the ability to pass in a function is especially
>> >>> useful
>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> 2) Option to dummy code only the top n most frequently occurring
>> >>> values in any column
>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> >>> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> mapping table
>> >>> 4) Option to exclude original column from results table
>> >>>
>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>
>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>
>> >>> Jarrod Vawdrey
>> >>> Sr. Data Scientist
>> >>> Data Science & Engineering | Pivotal
>> >>> (650) 315-8905
>> >>> https://pivotal.io/
>> >>>
>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan
>> >>> <fmcquillan@pivotal.io> wrote:
>> >>>
>> >>> > Thanks for those suggestions, Jarrod. They all sound pretty useful -
>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> >>> > order of priority as you see it?
>> >>> >
>> >>> > Also it seems like some of these could be applied to the Pivot
>> >>> > function as well, e.g., UDF for column naming.
>> >>> >
>> >>> > Frank
>> >>> >
>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey
>> >>> > <jvawdrey@pivotal.io> wrote:
>> >>> >
>> >>> >> Hey Frank,
>> >>> >>
>> >>> >> How are special character values handled today? It is often not
>> >>> >> ideal to end up with column names that require double quotes to
>> >>> >> call due to downstream scripts.
>> >>> >>
>> >>> >> A couple of features that would be useful
>> >>> >>
>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>> >> implementation - the ability to pass in a function is especially
>> >>> >> useful
>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> >> * Option to dummy code only the top n most frequently occurring
>> >>> >> values in any column
>> >>> >> * Option to exclude original column from results table
>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> >> mapping table
>> >>> >>
>> >>> >> Thank you
>> >>> >>
>> >>> >> Jarrod Vawdrey
>> >>> >> Sr. Data Scientist
>> >>> >> Data Science & Engineering | Pivotal
>> >>> >> (650) 315-8905
>> >>> >> https://pivotal.io/
>> >>> >>
>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>> >>> >> <fmcquillan@pivotal.io> wrote:
>> >>> >>
>> >>> >>> For the module encoding categorical variables
>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> >>> does anyone have any suggestions on improvements that we could
>> >>> >>> make?
>> >>> >>>
>> >>> >>> Here is a video on how encoding categorical variables works for
>> >>> >>> those not familiar with it
>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
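
To make the three options discussed in this thread concrete (drop the less
frequent values, fold them into the reference, or collect them in an "all
other" variable), here is a rough Python sketch for a single column. It is
only an illustration and not MADlib code: the function name one_hot_top_n,
its parameters, the 'rare' mode names, and the prefix_value column naming
are all assumptions made up for this example.

```python
from collections import Counter


def one_hot_top_n(values, n, prefix, rare="reference"):
    """One-hot encode only the n most frequent values of one column.

    Hypothetical helper for illustration (not part of MADlib):
      rare="drop"      -> rows holding a less frequent value are removed
      rare="reference" -> such rows are kept with every indicator set to 0
      rare="other"     -> such rows set a separate '<prefix>_other' indicator
    """
    if rare not in ("drop", "reference", "other"):
        raise ValueError(f"unknown rare mode: {rare!r}")

    # Exact frequency count; this extra pass over the data is the added
    # runtime cost mentioned in the thread.
    counts = Counter(values)

    # Sort by descending frequency, breaking ties by value so the choice
    # of the top n is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    top = [value for value, _ in ranked[:n]]

    columns = [f"{prefix}_{value}" for value in top]
    if rare == "other":
        columns.append(f"{prefix}_other")

    rows = []
    for value in values:
        if value not in top and rare == "drop":
            continue  # smaller result table, as in the single-column case
        row = dict.fromkeys(columns, 0)
        if value in top:
            row[f"{prefix}_{value}"] = 1
        elif rare == "other":
            row[f"{prefix}_other"] = 1
        rows.append(row)
    return columns, rows


if __name__ == "__main__":
    colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
    for mode in ("drop", "reference", "other"):
        columns, rows = one_hot_top_n(colors, n=2, prefix="color", rare=mode)
        print(mode, columns, len(rows))
```

In this sketch only the "drop" mode changes the number of rows. With
"reference" or "other" every input row survives, so calling the helper once
per column with its own n would not produce the cross-column effect Frank
describes. Whether that matches the intended requirement is an open
question for the thread.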