From: Frank McQuillan
Date: Fri, 28 Oct 2016 11:17:08 -0700
Subject: Re: Encoding categorical variables
To: dev@madlib.incubator.apache.org
Cc: user@madlib.incubator.apache.org

Jarrod,

Just trying to write up detailed requirements. How would you see this one
working?

"2) Option to dummy code only the top n most frequently occurring values in
any column"

With 1 column I can picture it: you would drop the rows with the less
frequently occurring values and end up with a smaller table.

But what if you are encoding multiple columns? Would you want a per-column
specification of n? i.e., top 3 values for column x, top 10 values for
column y? If you did this then your result set might include low-frequency
values for column x (not in top 3) because those rows are in the top 10 for
column y - this might be confusing.

Frank

On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan wrote:

> great, thanks for the additional information
>
> Frank
>
> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey wrote:
>
>> IMO
>>
>> 1) Option to define resulting column names. Please see pdltools
>> implementation - the ability to pass in a function is especially useful
>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> 2) Option to dummy code only the top n most frequently occurring values
>> in any column
>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> table
>> 4) Option to exclude original column from results table
>>
>> (1) & (2) are much higher priority than (3) & (4).
>>
>> Agreed that these could also be applied to Pivoting (especially 1).
>>
>> Jarrod Vawdrey
>> Sr. Data Scientist
>> Data Science & Engineering | Pivotal
>> (650) 315-8905
>> https://pivotal.io/
>>
>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan wrote:
>>
>> > Thanks for those suggestions, Jarrod.
>> > They all sound pretty useful - would you mind taking a crack at
>> > numbering them 1,2,3... etc, in the order of priority as you see it?
>> >
>> > Also it seems like some of these could be applied to the Pivot function
>> > as well, e.g., UDF for column naming.
>> >
>> > Frank
>> >
>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey wrote:
>> >
>> >> Hey Frank,
>> >>
>> >> How are special character values handled today? It is often not ideal
>> >> to end up with column names that require double quotes to call due to
>> >> downstream scripts.
>> >>
>> >> A couple of features that would be useful
>> >>
>> >> * Option to define resulting column names. Please see pdltools
>> >> implementation - the ability to pass in a function is especially
>> >> useful
>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >> * Option to dummy code only the top n most frequently occurring values
>> >> in any column
>> >> * Option to exclude original column from results table
>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >> mapping table
>> >>
>> >> Thank you
>> >>
>> >> Jarrod Vawdrey
>> >> Sr. Data Scientist
>> >> Data Science & Engineering | Pivotal
>> >> (650) 315-8905
>> >> https://pivotal.io/
>> >>
>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>> >> <fmcquillan@pivotal.io> wrote:
>> >>
>> >>> For the module encoding categorical variables
>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> does anyone have any suggestions on improvements that we could make?
>> >>>
>> >>> Here is a video on how encoding categorical variables works for those
>> >>> not familiar with it
>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
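
For reference, here is one possible reading of the per-column top-n question
from the top of this thread, sketched in plain Python rather than MADlib SQL.
It assumes each column gets its own n, that no rows are dropped, and that a
value outside a column's top n simply gets 0 in all of that column's
indicator columns. The function names, the `top_n` mapping, and the
`column_value` naming scheme are made up for illustration; none of them come
from MADlib or PDLTools.

```python
from collections import Counter


def top_n_values(rows, column, n):
    """Return the n most frequent values in `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    # Sort by descending count, then by value, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [value for value, _ in ranked[:n]]


def dummy_code_top_n(rows, top_n):
    """Add 0/1 indicator columns for the top-n values of each listed column.

    `top_n` maps column name -> n (a hypothetical per-column specification).
    No rows are dropped; a value outside the top n gets 0 in every indicator.
    """
    keep = {col: top_n_values(rows, col, n) for col, n in top_n.items()}
    encoded = []
    for row in rows:
        out = dict(row)  # keep the original columns alongside the indicators
        for col, values in keep.items():
            for value in values:
                out[f"{col}_{value}"] = 1 if row[col] == value else 0
        encoded.append(out)
    return encoded


# Toy data: top 1 value for column x, top 2 values for column y.
rows = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "q"},
    {"x": "b", "y": "q"},
    {"x": "c", "y": "q"},
    {"x": "a", "y": "r"},
]
encoded = dummy_code_top_n(rows, {"x": 1, "y": 2})
```

Under this reading the table keeps all of its rows, so a low-frequency x value
never shows up as its own indicator column just because its row has a frequent
y value. Whether that is the behavior Jarrod has in mind is still an open
question.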