madlib-dev mailing list archives

From Jarrod Vawdrey <jvawd...@pivotal.io>
Subject Re: Encoding categorical variables
Date Wed, 19 Oct 2016 20:57:38 GMT
IMO

1) Option to define resulting column names. Please see the PDLTools
implementation; the ability to pass in a function is especially useful
(http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
2) Option to dummy code only the top n most frequently occurring values in
any column
3) Option to create numeric column names (e.g. pivotcol_val1, pivotcol_val2,
...) instead of embedding values in column names, plus a secondary mapping
table
4) Option to exclude original column from results table

(1) & (2) are much higher priority than (3) & (4).

Agreed that these could also be applied to Pivoting (especially 1).
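
To make options (1) to (4) concrete, here is a rough sketch in plain Python
rather than SQL. It is not the MADlib or PDLTools API: dummy_encode, sanitize,
default_name and every parameter name are hypothetical, chosen only to
illustrate the behaviour I have in mind.

```python
import re
from collections import Counter


def sanitize(name):
    """Lower-case and collapse anything outside [a-z0-9_] into '_',
    so the resulting column name should not need double quotes."""
    return re.sub(r"[^a-z0-9_]+", "_", name.lower()).strip("_")


def default_name(col, value):
    # Hypothetical default naming rule: <column>_<value>, sanitized.
    return sanitize(f"{col}_{value}")


def dummy_encode(rows, col, top_n=None, name_fn=None,
                 numeric_names=False, drop_original=False):
    """Dummy-code rows[*][col]; rows is a list of dicts.

    top_n         -- (2) encode only the n most frequent values (None = all)
    name_fn       -- (1) user-supplied function (col, value) -> column name
    numeric_names -- (3) use <col>_val1, <col>_val2, ... instead of values
    drop_original -- (4) leave the source column out of the result
    Returns (encoded_rows, mapping), where mapping is new column -> value.
    Name collisions after sanitizing are not handled in this sketch.
    """
    counts = Counter(row[col] for row in rows)
    # most_common sorts by descending count; ties keep first-seen order.
    values = [v for v, _ in counts.most_common(top_n)]

    if numeric_names:
        names = {v: f"{col}_val{i}" for i, v in enumerate(values, start=1)}
    else:
        fn = name_fn or default_name
        names = {v: fn(col, v) for v in values}

    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items()
               if not (drop_original and k == col)}
        for value, name in names.items():
            out[name] = 1 if row[col] == value else 0
        encoded.append(out)

    mapping = {name: value for value, name in names.items()}
    return encoded, mapping


rows = [
    {"id": 1, "color": "Dark Red"},
    {"id": 2, "color": "blue/green"},
    {"id": 3, "color": "Dark Red"},
    {"id": 4, "color": "teal"},
]
encoded, mapping = dummy_encode(rows, "color", top_n=2, drop_original=True)
```

With top_n=2 only the two most frequent values get indicator columns, so the
"teal" row ends up with zeros in both. The returned mapping plays the role of
the secondary mapping table from (3) when numeric_names=True.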



Jarrod Vawdrey
Sr. Data Scientist
Data Science & Engineering | Pivotal
(650) 315-8905
https://pivotal.io/

On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
wrote:

> Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> would you mind taking a crack at numbering them 1,2,3... etc, in the order
> of priority as you see it?
>
> Also it seems like some of these could be applied to the Pivot function as
> well, e.g., UDF for column naming.
>
> Frank
>
>
>
> On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> wrote:
>
>> Hey Frank,
>>
>> How are special character values handled today? It is often not ideal to
>> end up with column names that require double quotes, because downstream
>> scripts then have to quote them as well.
>>
>> A couple of features that would be useful
>>
>> * Option to define resulting column names. Please see the PDLTools
>> implementation; the ability to pass in a function is especially useful
>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> * Option to dummy code only the top n most frequently occurring values in
>> any column
>> * Option to exclude original column from results table
>> * Option to create numeric column names (e.g. pivotcol_val1,
>> pivotcol_val2, ...) instead of embedding values in column names, plus a
>> secondary mapping table
>>
>> Thank you
>>
>> Jarrod Vawdrey
>> Sr. Data Scientist
>> Data Science & Engineering | Pivotal
>> (650) 315-8905
>> https://pivotal.io/
>>
>> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> wrote:
>>
>>> For the module encoding categorical variables
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> does anyone have any suggestions on improvements that we could make?
>>>
>>> Here is a video on how encoding categorical variables works for those not
>>> familiar with it
>>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>>
>>
>>
>
