madlib-dev mailing list archives

From Rahul Iyer <rahulri...@gmail.com>
Subject Re: Encoding categorical variables
Date Fri, 28 Oct 2016 18:29:14 GMT
An alternative to dropping those rows is to assign the less frequent values
to the reference level, i.e. all one-hot encoded features will be 0 for them.
It is also important to note that total runtime will increase with this
option, since we'll have to compute the exact frequency distribution.
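
To make the idea concrete, here is a minimal sketch in plain Python. It is
not MADlib code: the function name one_hot_top_n and the sample data are
hypothetical, and tie-breaking by value is just one possible choice.

```python
from collections import Counter


def one_hot_top_n(values, n):
    """One-hot encode only the n most frequent values (hypothetical helper).

    Every other value is assigned to the reference level, i.e. its
    encoded row is all zeros.
    """
    # Exact frequency distribution: this needs a full pass over the data,
    # which is where the extra runtime comes from.
    counts = Counter(values)
    # Sort by descending count, then by value, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    top = [value for value, _ in ranked[:n]]
    # One indicator column per retained value; rarer values match none of them.
    rows = [[1 if v == t else 0 for t in top] for v in values]
    return top, rows


colors = ["red", "blue", "red", "green", "blue", "red", "teal"]
top, rows = one_hot_top_n(colors, n=2)
for value, row in zip(colors, rows):
    print(value, row)
```

With n=2 only "red" and "blue" get indicator columns; "green" and "teal"
stay in the table but are encoded as all zeros, so no rows are dropped.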

Another suggested change is to call this function 'one_hot_encoding' since
that is the output here (similar to sklearn's OneHotEncoder
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
We can keep the current name as a deprecated alias till 2.0 is released.

On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
wrote:

> Jarrod,
>
> Just trying to write up detailed requirements.  How would you see this one
> working?
>
> "2) Option to dummy code only the top n most frequently occurring values in
> any column"
>
> With 1 column I can picture it, you would drop the rows with the less
> frequently occurring values and end up with a smaller table.  But what if
> you are encoding multiple columns?  Would you want a per column specification
> of n? i.e., top 3 values for column x, top 10 values for column y?  If you
> did this then your result set might include low frequency values for column
> x (not in top 3) because they are in the top 10 for column y - this might
> be confusing.
>
> Frank
>
> On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
>> great, thanks for the additional information
>>
>> Frank
>>
>> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> wrote:
>>
>>> IMO
>>>
>>> 1) Option to define resulting column names. Please see pdltools
>>> implementation - the ability to pass in a function is especially useful (
>>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> 2) Option to dummy code only the top n most frequently occurring values
>>> in any column
>>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> pivotcol_val2 ...) instead of values in column names + secondary
>>> mapping table
>>> 4) Option to exclude original column from results table
>>>
>>> (1) & (2) are much higher priority than (3) & (4).
>>>
>>> Agreed that these could also be applied to Pivoting (especially 1).
>>>
>>>
>>>
>>> Jarrod Vawdrey
>>> Sr. Data Scientist
>>> Data Science & Engineering | Pivotal
>>> (650) 315-8905
>>> https://pivotal.io/
>>>
>>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>> wrote:
>>>
>>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>>> > order of priority as you see it?
>>> >
>>> > Also it seems like some of these could be applied to the Pivot
>>> > function as well, e.g., UDF for column naming.
>>> >
>>> > Frank
>>> >
>>> >
>>> >
>>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>>> > wrote:
>>> >
>>> >> Hey Frank,
>>> >>
>>> >> How are special character values handled today? It is often not ideal
>>> >> to end up with column names that require double quotes to call due to
>>> >> downstream scripts.
>>> >>
>>> >> A couple of features that would be useful
>>> >>
>>> >> * Option to define resulting column names. Please see pdltools
>>> >> implementation - the ability to pass in a function is especially useful
>>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >> * Option to dummy code only the top n most frequently occurring
>>> >> values in any column
>>> >> * Option to exclude original column from results table
>>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> >> mapping table
>>> >>
>>> >> Thank you
>>> >>
>>> >> Jarrod Vawdrey
>>> >> Sr. Data Scientist
>>> >> Data Science & Engineering | Pivotal
>>> >> (650) 315-8905
>>> >> https://pivotal.io/
>>> >>
>>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>> >> wrote:
>>> >>
>>> >>> For the module encoding categorical variables
>>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> >>> does anyone have any suggestions on improvements that we could make?
>>> >>>
>>> >>> Here is a video on how encoding categorical variables works for
>>> >>> those not familiar with it
>>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
