On Wed, Apr 30, 2014 at 9:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> On Wed, Apr 30, 2014 at 11:42 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > I would also suggest picking a few guinea-pig use cases to validate the
> > design.
> >
> > E.g., if I may make a suggestion, let's see how we would vectorize a
> > categorical variable into predictor variables in our would-be language
> > here.
> >
>
> To be a bit more specific, here is roughly what happens, assuming we
> have a column named "C1":
>
>
> (1) assess the levels and their number (in the R sense, aka R "factor" type)
> (2) assume there are n levels in total (i.e. distinct categories). Assign
> each level, except one, to one of n-1 Bernoulli features named according
> to a certain convention, e.g. "C1_<levelnameprefix>".
> (3) repeat that for all categorical variables in the data frame.
> (4) generate the final data frame by applying the category mappings
> established in (2) and (3) (set a predictor to 1 if the current categorical
> value matches the predictor's level, 0 otherwise).
> (5) compute summaries of the resulting data frame (mean, variance, quartiles).
>
> Seems simple enough, but what would it look like?
>
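For concreteness, steps (1) through (5) above might look roughly like the
following in plain Python. This is only a sketch to pin down the semantics,
not a proposal for the DSL itself. The function names, the list-of-dicts
"data frame", the choice of the first sorted level as the dropped reference
level, and the use of the full level name in the feature name are all my
assumptions.

```python
from statistics import mean, pvariance, quantiles


def assess_levels(rows, column):
    """Step (1): collect the distinct levels of one categorical column."""
    return sorted({row[column] for row in rows})


def plan_features(column, levels, drop_reference=True):
    """Step (2): map each level to a 0/1 feature name.

    With drop_reference=True the first level gets no feature (n-1 coding);
    with drop_reference=False every level gets one (1-of-n coding).
    """
    kept = levels[1:] if drop_reference else levels
    return {level: f"{column}_{level}" for level in kept}


def vectorize(rows, categorical_columns, drop_reference=True):
    """Steps (3)-(4): plan every categorical column, then encode each row."""
    plans = {
        col: plan_features(col, assess_levels(rows, col), drop_reference)
        for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Non-categorical columns pass through unchanged.
        out = {k: v for k, v in row.items() if k not in plans}
        for col, plan in plans.items():
            for level, name in plan.items():
                out[name] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


def summarize(encoded):
    """Step (5): per-column mean, population variance and quartile cut points."""
    summary = {}
    for name in encoded[0]:
        values = [row[name] for row in encoded]
        summary[name] = {
            "mean": mean(values),
            "variance": pvariance(values),
            "quartiles": quantiles(values, n=4),
        }
    return summary


# Hypothetical toy data with one categorical column "C1".
rows = [
    {"C1": "a", "x": 1.0},
    {"C1": "b", "x": 2.0},
    {"C1": "c", "x": 3.0},
    {"C1": "b", "x": 4.0},
]
encoded = vectorize(rows, ["C1"])
summary = summarize(encoded)
```

Note that this takes one pass to establish the levels and a second pass to
encode, which is the part any real implementation would have to distribute.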
Sounds good. Minor nit: 1-of-n coding should be allowed as well.
I would also expect that we could support random hashing encoding.
A similar problem statement applies to values that are textual rather than
categorical. The process is essentially the same: 0 or 1 passes to
optionally agree on a dictionary, and then another pass to encode into
n columns.
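
To illustrate the zero-dictionary-pass case, here is a rough Python sketch of
hashed encoding into a fixed number of columns. The function names, the use
of MD5 as the hash, and whitespace tokenization for text are assumptions made
for illustration only; nothing here is an existing API.

```python
import hashlib


def bucket(column, token, n_columns):
    """Hash a (column, token) pair to a column index in [0, n_columns)."""
    digest = hashlib.md5(f"{column}={token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_columns


def hash_encode(row, columns, n_columns):
    """Encode the given columns of one row into n_columns counts.

    No dictionary pass is needed: each value is split on whitespace, so a
    single-word categorical value yields one token and a text value yields
    several. Collisions between tokens are possible and are simply summed.
    """
    vector = [0] * n_columns
    for column in columns:
        for token in str(row[column]).split():
            vector[bucket(column, token, n_columns)] += 1
    return vector


# Hypothetical row with one categorical and one textual value.
row = {"C1": "red", "text": "big red dog"}
vector = hash_encode(row, ["C1", "text"], n_columns=16)
```

The dictionary-based variant would replace bucket() with a lookup in a table
built during the optional first pass; the encoding pass is otherwise the same.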
