madlib-dev mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Encoding categorical variables
Date Fri, 28 Oct 2016 22:04:29 GMT
Yes, thanks Vatsan, we have been looking at that.

On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R <vatsan.cs@gmail.com> wrote:

> You guys may have already seen this, but linking just in case:
> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>
> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
>
> > +Vatsan for his thoughts as well!
> >
> > On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
> >
> >> Also agree that double-quoted column names are not ideal.  In addition to
> >> the net-new features described in this thread, it'd be nice to see
> >> non-double-quoted output as default behavior in the
> >> existing create_indicator_variables() function.
> >>
> >> Thanks,
> >> Woo
> >>
> >> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
> >>
> >>> I like the one-hot encoded feature.  Another variant of this idea would
> >>> be an "all other" variable (distinct from the reference class) that
> >>> contains occurrences of the less frequent category types.  In both of these
> >>> scenarios, the threshold for 'less frequent' could be user-supplied.
> >>>
> >>> Thanks,
> >>> Woo
> >>>
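A rough Python sketch of the "all other" variant described above, assuming the user-supplied threshold is a top-n count. The function name `one_hot_with_other`, the `_other` column suffix, and the sample data are hypothetical and not part of MADlib.

```python
from collections import Counter

def one_hot_with_other(values, top_n, prefix="col"):
    """One-hot encode the top_n most frequent values; pool the rest into '<prefix>_other'."""
    # most_common orders by count; equal counts keep first-seen order.
    keep = [v for v, _ in Counter(values).most_common(top_n)]
    columns = [f"{prefix}_{v}" for v in keep] + [f"{prefix}_other"]
    rows = []
    for v in values:
        row = {c: 0 for c in columns}
        # Frequent values get their own indicator; every other value sets the shared one.
        row[f"{prefix}_{v}" if v in keep else f"{prefix}_other"] = 1
        rows.append(row)
    return columns, rows

columns, rows = one_hot_with_other(["a", "a", "b", "c", "a", "b", "d"], top_n=2, prefix="x")
```

Here `c` and `d` both land in `x_other`, so no rows are dropped and the pooled column stays distinct from any all-zero reference class.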
> >>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulriyer@gmail.com>
> >>> wrote:
> >>>
> >>>> An alternative to dropping is to assign the less frequent values to the
> >>>> reference, i.e. all one-hot encoded features will be 0.
> >>>> Also important to note: total runtime will increase with this option since
> >>>> we'll have to compute the exact frequency distribution.
> >>>>
> >>>> Another suggested change is to call this function 'one_hot_encoding' since
> >>>> that is the output here (similar to sklearn's OneHotEncoder
> >>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
> >>>> We can keep the current name as a deprecated alias till 2.0 is released.
> >>>>
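One way to picture the alternative above: rare values stay in the table but encode as all zeros, the same as the reference. This is only a sketch in Python; `one_hot_top_n` and the sample data are hypothetical, and the `Counter` call stands in for the extra frequency pass that adds runtime.

```python
from collections import Counter

def one_hot_top_n(values, top_n):
    """Encode only the top_n most frequent values; any other value becomes all zeros."""
    # Extra pass over the data to get the exact frequency distribution.
    keep = [v for v, _ in Counter(values).most_common(top_n)]
    # A value outside `keep` matches no column, so its row is all zeros.
    return keep, [[int(v == k) for k in keep] for v in values]

keep, encoded = one_hot_top_n(["a", "a", "b", "c", "a", "b", "d"], top_n=2)
```

The output has as many rows as the input, which is the main difference from the row-dropping interpretation.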
> >>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <
> >>>> fmcquillan@pivotal.io>
> >>>> wrote:
> >>>>
> >>>> > Jarrod,
> >>>> >
> >>>> > Just trying to write up detailed requirements.  How would you see this one
> >>>> > working?
> >>>> >
> >>>> > "2) Option to dummy code only the top n most frequently occurring values in
> >>>> > any column"
> >>>> >
> >>>> > With 1 column I can picture it: you would drop the rows with the less
> >>>> > frequently occurring values and end up with a smaller table.  But what if
> >>>> > you are encoding multiple columns?  Would you want a per-column specification
> >>>> > of n, i.e., top 3 values for column x, top 10 values for column y?  If you
> >>>> > did this then your result set might include low frequency values for column
> >>>> > x (not in top 3) because they are in the top 10 for column y - this might
> >>>> > be confusing.
> >>>> >
> >>>> > Frank
> >>>> >
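The interaction described above can be made concrete with a small Python sketch. It assumes a per-column n and a rule that keeps a row when any of its values is in that column's top n; that rule is just one possible reading of the requirement, and the helper name and data are made up.

```python
from collections import Counter

def top_values(rows, column, n):
    """Return the set of the n most frequent values in one column."""
    return {v for v, _ in Counter(r[column] for r in rows).most_common(n)}

rows = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "p"},
    {"x": "b", "y": "p"},
    {"x": "c", "y": "q"},
]
top_n = {"x": 1, "y": 1}  # hypothetical per-column setting
tops = {col: top_values(rows, col, n) for col, n in top_n.items()}

# Keep a row if any column holds one of that column's top values.
kept = [r for r in rows if any(r[c] in tops[c] for c in tops)]
# Rows that survive even though their x value is not in the top n for x.
surprising = [r for r in kept if r["x"] not in tops["x"]]
```

The row with `x = "b"` survives only because its `y` value is frequent, which is the potentially confusing case.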
> >>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >>>> > wrote:
> >>>> >
> >>>> >> great, thanks for the additional information
> >>>> >>
> >>>> >> Frank
> >>>> >>
> >>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> >>>> >> wrote:
> >>>> >>
> >>>> >>> IMO
> >>>> >>>
> >>>> >>> 1) Option to define resulting column names. Please see pdltools
> >>>> >>> implementation - the ability to pass in a function is especially useful
> >>>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>>> >>> 2) Option to dummy code only the top n most frequently occurring values in
> >>>> >>> any column
> >>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
> >>>> >>> pivotcol_val2 ...) instead of values in column names + secondary mapping
> >>>> >>> table
> >>>> >>> 4) Option to exclude original column from results table
> >>>> >>>
> >>>> >>> (1) & (2) are much higher priority than (3) & (4).
> >>>> >>>
> >>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
> >>>> >>>
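Items (1) and (3) above could fit together roughly as in the Python sketch below. It is not the PDLTools or MADlib interface: `indicator_names`, `name_fn`, `slug_name`, and the sample values are all hypothetical.

```python
def indicator_names(column, values, name_fn=None):
    """Map each distinct value to an output column name.

    With name_fn=None the names are numeric (column_val1, column_val2, ...)
    and the returned dict doubles as the secondary mapping table.
    """
    distinct = sorted(set(values))
    if name_fn is None:
        return {v: f"{column}_val{i}" for i, v in enumerate(distinct, start=1)}
    return {v: name_fn(column, v) for v in distinct}

def slug_name(column, value):
    """Example user-supplied naming function (an assumption, not a real API)."""
    return f"{column}_is_{value.lower().replace(' ', '_').replace('-', '_')}"

values = ["New York", "san-jose", "New York"]
numeric = indicator_names("city", values)
custom = indicator_names("city", values, slug_name)
```

Returning the value-to-name mapping in both modes keeps the numeric option and the function option behind one parameter.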
> >>>> >>>
> >>>> >>>
> >>>> >>> Jarrod Vawdrey
> >>>> >>> Sr. Data Scientist
> >>>> >>> Data Science & Engineering | Pivotal
> >>>> >>> (650) 315-8905
> >>>> >>> https://pivotal.io/
> >>>> >>>
> >>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >>>> >>> wrote:
> >>>> >>>
> >>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> >>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the order
> >>>> >>> > of priority as you see it?
> >>>> >>> >
> >>>> >>> > Also it seems like some of these could be applied to the Pivot function as
> >>>> >>> > well, e.g., UDF for column naming.
> >>>> >>> >
> >>>> >>> > Frank
> >>>> >>> >
> >>>> >>> >
> >>>> >>> >
> >>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> >>>> >>> > wrote:
> >>>> >>> >
> >>>> >>> >> Hey Frank,
> >>>> >>> >>
> >>>> >>> >> How are special character values handled today? It is often not ideal to
> >>>> >>> >> end up with column names that require double quotes to call due to
> >>>> >>> >> downstream scripts.
> >>>> >>> >>
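A Python sketch of one way to derive column names that need no double quotes, assuming a conservative subset of PostgreSQL's unquoted-identifier rules (lower case letters, digits, and underscores, not starting with a digit). `safe_column_name` is hypothetical; it does not handle name collisions, reserved words, or the identifier length limit.

```python
import re

def safe_column_name(column, value):
    """Build an indicator column name usable in SQL without double quotes."""
    raw = f"{column}_{value}".lower()
    # Replace every run of characters outside [a-z0-9_] with one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # Unquoted identifiers cannot start with a digit.
    return f"_{name}" if name[:1].isdigit() else name

example = safe_column_name("City", "New York/NJ")
```

Because distinct values can collapse to the same name under this scheme, a real implementation would also need a collision check or a mapping table.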
> >>>> >>> >> A couple of features that would be useful
> >>>> >>> >>
> >>>> >>> >> * Option to define resulting column names. Please see pdltools
> >>>> >>> >> implementation - the ability to pass in a function is especially useful
> >>>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>>> >>> >> * Option to dummy code only the top n most frequently occurring values in
> >>>> >>> >> any column
> >>>> >>> >> * Option to exclude original column from results table
> >>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary mapping
> >>>> >>> >> table
> >>>> >>> >>
> >>>> >>> >> Thank you
> >>>> >>> >>
> >>>> >>> >> Jarrod Vawdrey
> >>>> >>> >> Sr. Data Scientist
> >>>> >>> >> Data Science & Engineering | Pivotal
> >>>> >>> >> (650) 315-8905
> >>>> >>> >> https://pivotal.io/
> >>>> >>> >>
> >>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
> >>>> >>> >> wrote:
> >>>> >>> >>
> >>>> >>> >>> For the module encoding categorical variables
> >>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>>> >>> >>> does anyone have any suggestions on improvements that we could make?
> >>>> >>> >>>
> >>>> >>> >>> Here is a video on how encoding categorical variables works for those not
> >>>> >>> >>> familiar with it
> >>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
