madlib-dev mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Encoding categorical variables
Date Fri, 28 Oct 2016 18:17:08 GMT
Jarrod,

Just trying to write up detailed requirements.  How would you see this one
working?

"2) Option to dummy code only the top n most frequently occurring values in
any column"

With one column I can picture it: you would drop the rows with the less
frequently occurring values and end up with a smaller table.  But what if
you are encoding multiple columns?  Would you want a per-column specification
of n, i.e., top 3 values for column x and top 10 values for column y?  If you
did this, then your result set might include low-frequency values for column
x (not in the top 3) because their rows hold a top 10 value for column y,
which might be confusing.
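
To make the question concrete, here is a rough Python sketch of one possible
reading of the per-column variant.  It is only an illustration, not a proposed
MADlib interface: the function name top_n_dummy_code, the top_n parameter, and
the in-memory list-of-dicts table are all made up for this example.  It also
assumes that no rows are dropped, so a value outside a column's top n simply
gets 0 in every indicator for that column.

```python
from collections import Counter


def top_n_dummy_code(rows, top_n):
    """Dummy code each listed column, keeping only its top-n values.

    rows  : list of dicts, one per table row (hypothetical in-memory table)
    top_n : dict mapping column name -> n for that column (hypothetical)
    """
    # Find the n most frequent values for each column independently.
    kept = {}
    for col, n in top_n.items():
        counts = Counter(row[col] for row in rows)
        kept[col] = [value for value, _ in counts.most_common(n)]

    # Emit one 0/1 indicator per kept value; no rows are dropped here.
    encoded = []
    for row in rows:
        out = {}
        for col, values in kept.items():
            for value in values:
                out[f"{col}_{value}"] = int(row[col] == value)
        encoded.append(out)
    return encoded


rows = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "q"},
    {"x": "b", "y": "q"},
    {"x": "c", "y": "q"},
    {"x": "a", "y": "r"},
    {"x": "b", "y": "r"},
]

# Top 2 values for column x, top 1 value for column y.
encoded = top_n_dummy_code(rows, {"x": 2, "y": 1})
```

Under this reading the row with x = 'c' stays in the output with zeros in
both x indicators, which sidesteps the row-dropping confusion described
above.  Whether that is the behavior you had in mind is the open question.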

Frank

On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
wrote:

> great, thanks for the additional information
>
> Frank
>
> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
> wrote:
>
>> IMO
>>
>> 1) Option to define resulting column names. Please see pdltools
>> implementation - the ability to pass in a function is especially useful
>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> 2) Option to dummy code only the top n most frequently occurring values in
>> any column
>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> table
>> 4) Option to exclude original column from results table
>>
>> (1) & (2) are much higher priority than (3) & (4).
>>
>> Agreed that these could also be applied to Pivoting (especially 1).
>>
>>
>>
>> Jarrod Vawdrey
>> Sr. Data Scientist
>> Data Science & Engineering | Pivotal
>> (650) 315-8905
>> https://pivotal.io/
>>
>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> wrote:
>>
>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> > order of priority as you see it?
>> >
>> > Also it seems like some of these could be applied to the Pivot function
>> > as well, e.g., UDF for column naming.
>> >
>> > Frank
>> >
>> >
>> >
>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> > wrote:
>> >
>> >> Hey Frank,
>> >>
>> >> How are special character values handled today? It is often not ideal
>> >> to end up with column names that require double quotes to call due to
>> >> downstream scripts.
>> >>
>> >> A couple of features that would be useful
>> >>
>> >> * Option to define resulting column names. Please see pdltools
>> >> implementation - the ability to pass in a function is especially useful
>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >> * Option to dummy code only the top n most frequently occurring values
>> >> in any column
>> >> * Option to exclude original column from results table
>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >> mapping table
>> >>
>> >> Thank you
>> >>
>> >> Jarrod Vawdrey
>> >> Sr. Data Scientist
>> >> Data Science & Engineering | Pivotal
>> >> (650) 315-8905
>> >> https://pivotal.io/
>> >>
>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>> >> <fmcquillan@pivotal.io> wrote:
>> >>
>> >>> For the module encoding categorical variables
>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> does anyone have any suggestions on improvements that we could make?
>> >>>
>> >>> Here is a video on how encoding categorical variables works for those
>> >>> not familiar with it
>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>> >>>
>> >>
>> >>
>> >
>>
>
>
