madlib-dev mailing list archives

From Jarrod Vawdrey <jvawd...@pivotal.io>
Subject Re: Encoding categorical variables
Date Fri, 14 Oct 2016 20:02:19 GMT
Hey Frank,

How are values containing special characters handled today? It is often not
ideal to end up with column names that must be double-quoted, because every
downstream script that references them then has to quote them as well.

A few features that would be useful:

* Option to define resulting column names. Please see the PDLTools
implementation - the ability to pass in a function is especially useful
(http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
* Option to dummy code only the top n most frequently occurring values in
any column
* Option to exclude original column from results table
* Option to create numbered column names (e.g. pivotcol_val1, pivotcol_val2
...) together with a secondary mapping table, instead of embedding the values
in the column names
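
To make the requests above concrete, here is a rough sketch in plain Python.
It is not MADlib or PDLTools code; dummy_code, sanitize and every parameter
name are made up for illustration, and rows are assumed to be dicts. It shows
a naming function, a top-n cutoff, dropping the original column, and numbered
names with a mapping back to the original values.

```python
import re
from collections import Counter


def sanitize(value):
    """Lower-case a value and replace anything outside [0-9a-z_] with '_',
    so the resulting column name needs no double quotes.
    Note: distinct values can collide after sanitizing; not handled here."""
    name = re.sub(r"[^0-9a-z_]+", "_", str(value).lower()).strip("_")
    return name or "empty"


def dummy_code(rows, col, top_n=None, name_fn=None,
               drop_original=True, numeric_names=False):
    """Hypothetical helper: dummy code `col` in a list of dict rows.

    top_n         -- encode only the n most frequent values (None = all)
    name_fn       -- optional function (col, value) -> column name
    drop_original -- leave the source column out of the result
    numeric_names -- use col_val1, col_val2, ... instead of value-based names
    Returns (encoded_rows, mapping of new column name -> original value).
    """
    counts = Counter(row[col] for row in rows)
    values = [v for v, _ in counts.most_common(top_n)]

    if numeric_names:
        names = {v: f"{col}_val{i}" for i, v in enumerate(values, start=1)}
    else:
        fn = name_fn or (lambda c, v: f"{c}_{sanitize(v)}")
        names = {v: fn(col, v) for v in values}

    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items()
               if not (drop_original and k == col)}
        for v in values:
            out[names[v]] = 1 if row[col] == v else 0
        encoded.append(out)

    mapping = {name: v for v, name in names.items()}
    return encoded, mapping
```

With numeric_names=True the returned mapping plays the role of the secondary
mapping table; rows whose value falls outside the top n simply get 0 in every
indicator column.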

Thank you

Jarrod Vawdrey
Sr. Data Scientist
Data Science & Engineering | Pivotal
(650) 315-8905
https://pivotal.io/

On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
wrote:

> For the module encoding categorical variables
> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> does anyone have any suggestions on improvements that we could make?
>
> Here is a video on how encoding categorical variables works for those not
> familiar with it
> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>
