madlib-dev mailing list archives

From Woo Jae Jung <wj...@pivotal.io>
Subject Re: Encoding categorical variables
Date Fri, 28 Oct 2016 20:05:53 GMT
I like the one-hot encoded feature.  Another variant of this idea would be
an "all other" variable (distinct from the reference class) that contains
occurrences of the less frequent category types.  In both of these
scenarios, the threshold for 'less frequent' could be user-supplied.
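
To make the two variants concrete, here is a rough sketch in plain Python (not MADlib code). The function name, the `top_n` and `other_column` parameters, and the `__other__` column name are all made up for illustration. With `other_column=False`, the less frequent values fall into the reference (all encoded features are 0); with `other_column=True`, they get their own "all other" indicator.

```python
from collections import Counter

OTHER = "__other__"  # hypothetical name for the "all other" column


def one_hot_top_n(values, top_n, other_column=False):
    """One-hot encode only the top_n most frequent values of one column.

    Less frequent values become all-zero rows (reference) unless
    other_column is True, in which case they set the OTHER indicator.
    """
    # Exact frequency distribution: requires a full pass over the data.
    counts = Counter(values)
    # Most frequent first; ties broken by value so the result is deterministic.
    ranked = sorted(counts, key=lambda v: (-counts[v], str(v)))
    kept = ranked[:top_n]
    kept_set = set(kept)
    columns = kept + ([OTHER] if other_column else [])

    rows = []
    for v in values:
        row = dict.fromkeys(columns, 0)
        if v in kept_set:
            row[v] = 1
        elif other_column:
            row[OTHER] = 1
        rows.append(row)
    return columns, rows


if __name__ == "__main__":
    data = ["a", "a", "a", "b", "b", "c", "d"]
    print(one_hot_top_n(data, top_n=2))
    print(one_hot_top_n(data, top_n=2, other_column=True))
```

Note that no rows are dropped in either variant, so the output keeps the same number of rows as the input.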

Thanks,
Woo

On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulriyer@gmail.com> wrote:

> An alternative to dropping is to assign the less frequent values to the
> reference class, i.e., all one-hot encoded features will be 0 for those rows.
> Also important to note: total runtime will increase with this option since
> we'll have to compute the exact frequency distribution.
>
> Another suggested change is to call this function 'one_hot_encoding' since
> that is the output here (similar to sklearn's OneHotEncoder
> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
> We can keep the current name as a deprecated alias till 2.0 is released.
>
> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
> > Jarrod,
> >
> > Just trying to write up detailed requirements.  How would you see this one
> > working?
> >
> > "2) Option to dummy code only the top n most frequently occurring values in
> > any column"
> >
> > With 1 column I can picture it, you would drop the rows with the less
> > frequently occurring values and end up with a smaller table.  But what if
> > you are encoding multiple columns?  Would you want a per-column specification
> > of n, i.e., top 3 values for column x, top 10 values for column y?  If you
> > did this then your result set might include low frequency values for column
> > x (not in top 3) because they are in the top 10 for column y - this might
> > be confusing.
> >
> > Frank
> >
> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
> > wrote:
> >
> >> great, thanks for the additional information
> >>
> >> Frank
> >>
> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> >> wrote:
> >>
> >>> IMO
> >>>
> >>> 1) Option to define resulting column names. Please see pdltools
> >>> implementation - the ability to pass in a function is especially useful
> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> 2) Option to dummy code only the top n most frequently occurring values
> >>> in any column
> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
> >>> pivotcol_val2 ...) instead of values in column names + secondary mapping
> >>> table
> >>> 4) Option to exclude original column from results table
> >>>
> >>> (1) & (2) are much higher priority than (3) & (4).
> >>>
> >>> Agreed that these could also be applied to Pivoting (especially 1).
> >>>
> >>>
> >>>
> >>> Jarrod Vawdrey
> >>> Sr. Data Scientist
> >>> Data Science & Engineering | Pivotal
> >>> (650) 315-8905
> >>> https://pivotal.io/
> >>>
> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >>> wrote:
> >>>
> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> >>> > order of priority as you see it?
> >>> >
> >>> > Also it seems like some of these could be applied to the Pivot
> >>> > function as well, e.g., UDF for column naming.
> >>> >
> >>> > Frank
> >>> >
> >>> >
> >>> >
> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> >>> > wrote:
> >>> >
> >>> >> Hey Frank,
> >>> >>
> >>> >> How are special character values handled today? It is often not ideal
> >>> >> to end up with column names that require double quotes to call due to
> >>> >> downstream scripts.
> >>> >>
> >>> >> A couple of features that would be useful
> >>> >>
> >>> >> * Option to define resulting column names. Please see pdltools
> >>> >> implementation - the ability to pass in a function is especially useful
> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> >> * Option to dummy code only the top n most frequently occurring
> >>> >> values in any column
> >>> >> * Option to exclude original column from results table
> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
> >>> >> mapping table
> >>> >>
> >>> >> Thank you
> >>> >>
> >>> >> Jarrod Vawdrey
> >>> >> Sr. Data Scientist
> >>> >> Data Science & Engineering | Pivotal
> >>> >> (650) 315-8905
> >>> >> https://pivotal.io/
> >>> >>
> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >>> >> wrote:
> >>> >>
> >>> >>> For the module encoding categorical variables
> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>> >>> does anyone have any suggestions on improvements that we could make?
> >>> >>>
> >>> >>> Here is a video on how encoding categorical variables works for
> >>> >>> those not familiar with it
> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>
