madlib-dev mailing list archives

From Srivatsan R <vatsan...@gmail.com>
Subject Re: Encoding categorical variables
Date Fri, 28 Oct 2016 21:39:48 GMT
You guys may have already seen this, but linking just in case:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
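
For anyone skimming the thread, here is a rough sketch of the "top n plus all other" idea discussed below, using only the Python standard library. The names `one_hot_top_n`, `sanitize`, and the `other` label are made up for illustration; they are not MADlib or pandas APIs, and a real implementation would need to handle ties and name collisions more carefully.

```python
import re
from collections import Counter


def sanitize(name):
    """Lowercase a name and replace runs of non-word characters with '_'."""
    return re.sub(r"\W+", "_", name).strip("_").lower()


def one_hot_top_n(values, column, n, other_label="other"):
    """Return (column_names, rows) one-hot encoding the n most frequent values.

    Values outside the top n are pooled into a single `other_label` column.
    Assumes no real value is equal to `other_label`.
    """
    # Counting every value is the extra pass over the data that makes
    # this option slower than plain encoding.
    counts = Counter(values)
    top = [value for value, _ in counts.most_common(n)]
    categories = top + [other_label]
    # Sanitized names avoid identifiers that would need double quotes in SQL.
    names = [sanitize(f"{column}_{c}") for c in categories]
    rows = []
    for v in values:
        key = v if v in top else other_label
        rows.append([1 if key == c else 0 for c in categories])
    return names, rows


colors = ["dark green", "dark green", "dark green", "blue", "blue", "red", "teal"]
names, rows = one_hot_top_n(colors, "color", n=2)
print(names)
for value, row in zip(colors, rows):
    print(value, row)
```

Dropping the last column from each row would give the variant where infrequent values fall back to the reference class (all indicators 0) instead of getting their own "all other" column.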

On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <wjung@pivotal.io> wrote:

> +Vatsan for his thoughts as well!
>
> On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
>
>> Also agree that double-quoted column names are not ideal.  In addition to
>> the net-new features described in this thread, it'd be nice to see
>> non-double-quoted output as default behavior in the
>> existing create_indicator_variables() function.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
>>
>>> I like the one-hot encoded feature.  Another variant of this idea would
>>> be an "all other" variable (distinct from the reference class) that
>>> contains occurrences of the less frequent category types.  In both of these
>>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>>
>>> Thanks,
>>> Woo
>>>
>>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulriyer@gmail.com>
>>> wrote:
>>>
>>>> An alternative to dropping is to assign the less frequent values to the
>>>> reference i.e. all one-hot encoded features will be 0.
>>>> Also important to note: total runtime will increase with this option since
>>>> we'll have to compute the exact frequency distribution.
>>>>
>>>> Another suggested change is to call this function 'one_hot_encoding' since
>>>> that is the output here (similar to sklearn's OneHotEncoder
>>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>>
>>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
>>>> wrote:
>>>>
>>>> > Jarrod,
>>>> >
>>>> > Just trying to write up detailed requirements.  How would you see this one
>>>> > working?
>>>> >
>>>> > "2) Option to dummy code only the top n most frequently occurring values in
>>>> > any column"
>>>> >
>>>> > With 1 column I can picture it: you would drop the rows with the less
>>>> > frequently occurring values and end up with a smaller table.  But what if
>>>> > you are encoding multiple columns?  Would you want a per column specification
>>>> > of n? i.e., top 3 values for column x, top 10 values for column y?  If you
>>>> > did this then your result set might include low frequency values for column
>>>> > x (not in top 3) because they are in the top 10 for column y - this might
>>>> > be confusing.
>>>> >
>>>> > Frank
>>>> >
>>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>>> > wrote:
>>>> >
>>>> >> great, thanks for the additional information
>>>> >>
>>>> >> Frank
>>>> >>
>>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>>>> >> wrote:
>>>> >>
>>>> >>> IMO
>>>> >>>
>>>> >>> 1) Option to define resulting column names. Please see pdltools
>>>> >>> implementation - the ability to pass in a function is especially useful (
>>>> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>> >>> 2) Option to dummy code only the top n most frequently occurring values
>>>> >>> in any column
>>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>>> >>> pivotcol_val2 ...) instead of values in column names + secondary mapping
>>>> >>> table
>>>> >>> 4) Option to exclude original column from results table
>>>> >>>
>>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>>> >>>
>>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Jarrod Vawdrey
>>>> >>> Sr. Data Scientist
>>>> >>> Data Science & Engineering | Pivotal
>>>> >>> (650) 315-8905
>>>> >>> https://pivotal.io/
>>>> >>>
>>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>>> >>> wrote:
>>>> >>>
>>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>>>> >>> > order of priority as you see it?
>>>> >>> >
>>>> >>> > Also it seems like some of these could be applied to the Pivot function
>>>> >>> > as well, e.g., UDF for column naming.
>>>> >>> >
>>>> >>> > Frank
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>>>> >>> > wrote:
>>>> >>> >
>>>> >>> >> Hey Frank,
>>>> >>> >>
>>>> >>> >> How are special character values handled today? It is often not ideal
>>>> >>> >> to end up with column names that require double quotes to call due to
>>>> >>> >> downstream scripts.
>>>> >>> >>
>>>> >>> >> A couple of features that would be useful
>>>> >>> >>
>>>> >>> >> * Option to define resulting column names. Please see pdltools
>>>> >>> >> implementation - the ability to pass in a function is especially useful (
>>>> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>> >>> >> * Option to dummy code only the top n most frequently occurring values
>>>> >>> >> in any column
>>>> >>> >> * Option to exclude original column from results table
>>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>>> >>> >> mapping table
>>>> >>> >>
>>>> >>> >> Thank you
>>>> >>> >>
>>>> >>> >> Jarrod Vawdrey
>>>> >>> >> Sr. Data Scientist
>>>> >>> >> Data Science & Engineering | Pivotal
>>>> >>> >> (650) 315-8905
>>>> >>> >> https://pivotal.io/
>>>> >>> >>
>>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>>> >>> >> wrote:
>>>> >>> >>
>>>> >>> >>> For the module encoding categorical variables
>>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>>> >>> >>> does anyone have any suggestions on improvements that we could make?
>>>> >>> >>>
>>>> >>> >>> Here is a video on how encoding categorical variables works for those
>>>> >>> >>> not familiar with it
>>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>>> >>> >>>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >
>>>> >>>
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>
