Mailing-List: contact dev-help@madlib.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@madlib.incubator.apache.org
MIME-Version: 1.0
In-Reply-To: <CAE8jLLPsuE7V3_KkyhNbqmarSKjo9=OFjRJ+ZAZr-sHFBMCbYw@mail.gmail.com>
References: <CAKBQfzSDuTUjb+YyJ4sWEF+c=7krb5eFEBSd-j=zgNTKu5peQQ@mail.gmail.com>
 <CAPwz_UrvN3LNWFefiWm+fnDyO7g250MhDg1JpF6uJ8ZCE8_3GA@mail.gmail.com>
 <CAKBQfzT5JrOXY_C05a95xWWPVPzPsJYxGZiFjgrp-Py2vcWDLQ@mail.gmail.com>
 <CAPwz_Uqx+6DMWsWMN6uh-NjO=bV+2UyX0wZ0Esdrtx4SYOX+LQ@mail.gmail.com>
 <CAKBQfzSzbQrOzOv3Pp4d4Kvpk0He_-_iN8qKD9sdexzMMr=3SA@mail.gmail.com>
 <CAKBQfzQBUCrsTD_s7rmYrDtk7qS0bmYyZ_T_8ZKo9JT3hEg-Pw@mail.gmail.com>
 <CAN1-RDN75P+LQS64VN5_MCnNrkG_szZXt_R9XGwNktrJjqSmpg@mail.gmail.com> <CAE8jLLPsuE7V3_KkyhNbqmarSKjo9=OFjRJ+ZAZr-sHFBMCbYw@mail.gmail.com>
From: Woo Jae Jung <wjung@pivotal.io>
Date: Fri, 28 Oct 2016 13:29:55 -0700
Message-ID: <CAE8jLLMjbEbWzqs4_+yzTi2Bo6z2HFmTGrNUrTCC=0qcSfsHVA@mail.gmail.com>
Subject: Re: Encoding categorical variables
To: dev@madlib.incubator.apache.org
Cc: user@madlib.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a1141a2eea42d9c053ff2b8e1
archived-at: Fri, 28 Oct 2016 20:30:05 -0000

--001a1141a2eea42d9c053ff2b8e1
Content-Type: text/plain; charset=UTF-8

Also agree that double-quoted column names are not ideal.  In addition to
the net-new features described in this thread, it'd be nice to see
non-double-quoted output as default behavior in the
existing create_indicator_variables() function.
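
A rough sketch of what non-double-quoted output could look like, assuming
PostgreSQL-style rules for unquoted identifiers (lowercase letters, digits and
underscores, not starting with a digit, at most 63 bytes).  The helper name
safe_column_name and the "c_" prefix are made up for illustration; this is not
how create_indicator_variables() names its columns today.

```python
import re


def safe_column_name(column, value, max_len=63):
    """Build a column name that should not need double quotes (hypothetical helper)."""
    raw = f"{column}_{value}".lower()
    # Collapse every run of characters outside [a-z0-9_] into one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # Unquoted identifiers cannot start with a digit (and must not be empty).
    if not name or name[0].isdigit():
        name = "c_" + name
    # The result is pure ASCII here, so characters and bytes coincide.
    return name[:max_len]


print(safe_column_name("color", "Light Blue"))  # color_light_blue
print(safe_column_name("size", "3/4 in."))      # size_3_4_in
```

Two distinct values can collapse to the same name under this scheme (e.g.
"a-b" and "a b"), and reserved words are not handled, so a real implementation
would need a de-duplication step or a mapping table on top of this.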

Thanks,
Woo

On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wjung@pivotal.io> wrote:

> I like the one-hot encoded feature.  Another variant of this idea would be
> an "all other" variable (distinct from the reference class) that contains
> occurrences of the less frequent category types.  In both of these
> scenarios, the threshold for 'less frequent' could be user-supplied.
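
To make the "all other" idea concrete, here is a minimal sketch in plain
Python, assuming the user-supplied threshold is a top_n count.  The function
name one_hot_top_n and the "other" label are hypothetical, and no reference
class is dropped here; this only illustrates the bucketing, not how MADlib
would implement it.

```python
from collections import Counter


def one_hot_top_n(values, top_n, other_label="other"):
    """One-hot encode the top_n most frequent values; bucket the rest (sketch)."""
    counts = Counter(values)  # exact frequency distribution of the column
    # most_common() orders ties by first occurrence.
    kept = [v for v, _ in counts.most_common(top_n)]
    kept_set = set(kept)
    # Only add the catch-all column when something actually falls into it.
    has_other = len(counts) > len(kept)
    columns = kept + ([other_label] if has_other else [])
    rows = []
    for v in values:
        key = v if v in kept_set else other_label
        rows.append([1 if key == c else 0 for c in columns])
    return columns, rows


columns, rows = one_hot_top_n(["a", "b", "a", "c", "a", "b", "d"], top_n=2)
print(columns)  # ['a', 'b', 'other']
print(rows[3])  # [0, 0, 1]  (the 'c' row lands in the catch-all column)
```

Note that other_label would clash with a genuine category called "other", so
a real version would have to pick or validate that name.
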
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulriyer@gmail.com> wrote:
>
>> An alternative to dropping is to assign the less frequent values to the
>> reference class, i.e., all one-hot encoded features will be 0 for those rows.
>> Also important to note: total runtime will increase with this option, since
>> we'll have to compute the exact frequency distribution.
>>
>> Another suggested change is to call this function 'one_hot_encoding', since
>> that is the output here (similar to sklearn's OneHotEncoder
>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>> We can keep the current name as a deprecated alias till 2.0 is released.
>>
>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
>> wrote:
>>
>> > Jarrod,
>> >
>> > Just trying to write up detailed requirements.  How would you see this
>> > one working?
>> >
>> > "2) Option to dummy code only the top n most frequently occurring
>> > values in any column"
>> >
>> > With 1 column I can picture it: you would drop the rows with the less
>> > frequently occurring values and end up with a smaller table.  But what
>> > if you are encoding multiple columns?  Would you want a per-column
>> > specification of n, i.e., top 3 values for column x, top 10 values for
>> > column y?  If you did this, then your result set might include
>> > low-frequency values for column x (not in top 3) because they are in the
>> > top 10 for column y.  This might be confusing.
>> >
>> > Frank
>> >
>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> > wrote:
>> >
>> >> great, thanks for the additional information
>> >>
>> >> Frank
>> >>
>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> >> wrote:
>> >>
>> >>> IMO
>> >>>
>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>> implementation - the ability to pass in a function is especially useful
>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> 2) Option to dummy code only the top n most frequently occurring values
>> >>> in any column
>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> >>> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> >>> table
>> >>> 4) Option to exclude original column from results table
>> >>>
>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>
>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>
>> >>>
>> >>>
>> >>> Jarrod Vawdrey
>> >>> Sr. Data Scientist
>> >>> Data Science & Engineering | Pivotal
>> >>> (650) 315-8905
>> >>> https://pivotal.io/
>> >>>
>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>> wrote:
>> >>>
>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> >>> > order of priority as you see it?
>> >>> >
>> >>> > Also it seems like some of these could be applied to the Pivot
>> >>> > function as well, e.g., UDF for column naming.
>> >>> >
>> >>> > Frank
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> >>> > wrote:
>> >>> >
>> >>> >> Hey Frank,
>> >>> >>
>> >>> >> How are special character values handled today? It is often not ideal
>> >>> >> to end up with column names that require double quotes to call due to
>> >>> >> downstream scripts.
>> >>> >>
>> >>> >> A couple of features that would be useful
>> >>> >>
>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>> >> implementation - the ability to pass in a function is especially useful
>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> >> * Option to dummy code only the top n most frequently occurring
>> >>> >> values in any column
>> >>> >> * Option to exclude original column from results table
>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> >> mapping table
>> >>> >>
>> >>> >> Thank you
>> >>> >>
>> >>> >> Jarrod Vawdrey
>> >>> >> Sr. Data Scientist
>> >>> >> Data Science & Engineering | Pivotal
>> >>> >> (650) 315-8905
>> >>> >> https://pivotal.io/
>> >>> >>
>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>> >> wrote:
>> >>> >>
>> >>> >>> For the module encoding categorical variables
>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> >>> does anyone have any suggestions on improvements that we could make?
>> >>> >>>
>> >>> >>> Here is a video on how encoding categorical variables works for
>> >>> >>> those not familiar with it
>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
>

--001a1141a2eea42d9c053ff2b8e1--