madlib-dev mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Encoding categorical variables
Date Wed, 19 Oct 2016 21:44:35 GMT
Great, thanks for the additional information.

Frank

On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io> wrote:

> IMO
>
> 1) Option to define resulting column names. Please see pdltools
> implementation - the ability to pass in a function is especially useful
> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> 2) Option to dummy code only the top n most frequently occurring values in
> any column
> 3) Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
> ...) instead of values in column names + secondary mapping table
> 4) Option to exclude original column from results table
>
> (1) & (2) are much higher priority than (3) & (4).
>
> Agreed that these could also be applied to Pivoting (especially 1).
>
>
>
> Jarrod Vawdrey
> Sr. Data Scientist
> Data Science & Engineering | Pivotal
> (650) 315-8905
> https://pivotal.io/
>
> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> > order of priority as you see it?
> >
> > Also it seems like some of these could be applied to the Pivot function
> > as well, e.g., UDF for column naming.
> >
> > Frank
> >
> >
> >
> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> > wrote:
> >
> >> Hey Frank,
> >>
> >> How are special character values handled today? It is often not ideal to
> >> end up with column names that require double quotes to call due to
> >> downstream scripts.
> >>
> >> A couple of features that would be useful
> >>
> >> * Option to define resulting column names. Please see pdltools
> >> implementation - the ability to pass in a function is especially useful
> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >> * Option to dummy code only the top n most frequently occurring values
> >> in any column
> >> * Option to exclude original column from results table
> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >> pivotcol_val2 ...) instead of values in column names + secondary mapping
> >> table
> >>
> >> Thank you
> >>
> >> Jarrod Vawdrey
> >> Sr. Data Scientist
> >> Data Science & Engineering | Pivotal
> >> (650) 315-8905
> >> https://pivotal.io/
> >>
> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >> wrote:
> >>
> >>> For the module encoding categorical variables
> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>> does anyone have any suggestions on improvements that we could make?
> >>>
> >>> Here is a video on how encoding categorical variables works for those
> >>> not familiar with it
> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
> >>>
> >>
> >>
> >
>
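
For readers skimming the thread, the four options in Jarrod's list can be illustrated with a small plain-Python sketch. This is not MADlib or PDLTools code and does not reflect either library's API: the function names `dummy_encode` and `safe_name`, their parameters, and the sample rows are all hypothetical, chosen only to show how the requested options might fit together.

```python
import re
from collections import Counter


def safe_name(column, value):
    """Hypothetical naming function (option 1): build a column name that
    needs no double quotes by lower-casing and replacing other characters."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", str(value).lower()).strip("_")
    return f"{column}_{cleaned}"


def dummy_encode(rows, column, top_n=None, name_fn=safe_name,
                 numeric_names=False, drop_original=False):
    """Dummy-code `column` in a list of dict rows (illustrative only).

    Returns (encoded_rows, mapping), where mapping is
    {new column name: original value}.
    """
    counts = Counter(row[column] for row in rows)
    # Option 2: keep only the top_n most frequent values.
    # Ties are broken by the value's string form so the result is stable.
    ranked = sorted(counts, key=lambda v: (-counts[v], str(v)))
    values = ranked if top_n is None else ranked[:top_n]

    mapping = {}
    for i, value in enumerate(values, start=1):
        # Option 3: numeric names, with `mapping` acting as the secondary table.
        # Option 1: otherwise defer to the caller-supplied naming function.
        name = f"{column}_val{i}" if numeric_names else name_fn(column, value)
        if name in mapping:
            raise ValueError(f"naming function produced duplicate name: {name}")
        mapping[name] = value

    encoded = []
    for row in rows:
        # Option 4: optionally leave the original column out of the result.
        out = {k: v for k, v in row.items()
               if not (drop_original and k == column)}
        for name, value in mapping.items():
            out[name] = 1 if row[column] == value else 0
        encoded.append(out)
    return encoded, mapping


rows = [
    {"id": 1, "city": "New York"},
    {"id": 2, "city": "St. John's"},
    {"id": 3, "city": "New York"},
    {"id": 4, "city": "Oslo"},
]
encoded, mapping = dummy_encode(rows, "city", top_n=2,
                                numeric_names=True, drop_original=True)
print(mapping)  # {'city_val1': 'New York', 'city_val2': 'Oslo'}
```

With `numeric_names=False`, the same call would instead name the columns through `safe_name`, giving `city_new_york` and `city_oslo`, which is one way to avoid column names that require double quotes downstream.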
