mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Helping out on spark efforts
Date Wed, 30 Apr 2014 22:17:35 GMT
On Wed, Apr 30, 2014 at 9:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> On Wed, Apr 30, 2014 at 11:42 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
>
> > I would also suggest taking a few guinea-pig examples to validate stuff.
> >
> > E.g. if i may make a suggestion, let's see how we'd do a categorical
> > variable vectorization into predictor variables in our would-be language
> > here.
> >
>
> to be a bit more specific, here's roughly what happens, assuming we have
> a column named "C1":
>
>
> (1) assess the levels and their number (in the R sense, i.e. the R
> "factor" type).
> (2) assume there are n total levels (i.e. distinct categories). Assign
> each level, except one, to n-1 Bernoulli features named according to a
> certain convention, e.g. "C1_<level-name-prefix>".
> (3) repeat that for all categorical variables in the data frame.
> (4) generate the final data frame by executing the category mappings
> established in (2) and (3) (set a predictor to 1 if the current
> categorical value matches the predictor's level).
> (5) compute summaries for the resulting data frame (mean, variance,
> quartiles).
>
> seems simple enough, but what would it look like?
>

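The five quoted steps could be sketched in plain Python (the eventual DSL syntax is still an open question, so this is illustrative only; the `dummy_encode` helper and the sample data are made up for the example):

```python
# Illustrative sketch of steps (1)-(5): n-1 dummy coding of one
# categorical column, then per-column summaries. Plain Python stand-in,
# not the proposed Mahout DSL.
from statistics import mean, pvariance

def dummy_encode(rows, column):
    """Encode one categorical column into n-1 Bernoulli predictors."""
    # (1)/(2): assess the distinct levels; drop the first as the
    # reference level, name the rest "<column>_<level>"
    levels = sorted({r[column] for r in rows})
    predictors = [f"{column}_{lvl}" for lvl in levels[1:]]
    # (4): a predictor is 1 iff the row's value matches that level
    encoded = [{p: 1 if r[column] == lvl else 0
                for p, lvl in zip(predictors, levels[1:])}
               for r in rows]
    return predictors, encoded

rows = [{"C1": "a"}, {"C1": "b"}, {"C1": "c"}, {"C1": "b"}]
predictors, encoded = dummy_encode(rows, "C1")
# (5): summaries (mean, population variance) per encoded column
summaries = {p: (mean(r[p] for r in encoded),
                 pvariance(r[p] for r in encoded))
             for p in predictors}
```

Step (3) is just this applied to every categorical column in the frame.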
Sounds good.  Minor nit: 1-of-n coding should be allowed as well.

I would also expect that we could do random hashing encoding as well.
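Hashed encoding skips the dictionary pass entirely: a hash of the value picks the column index directly. A minimal sketch (bucket count and choice of `crc32` are illustrative; real implementations typically use something like MurmurHash):

```python
# The "hashing trick": map a categorical value straight to one of
# n_buckets indicator columns via a hash, with no level-assessment pass.
import zlib

def hashed_encode(value, column, n_buckets=16):
    """Return the bucket index for this column/value pair."""
    key = f"{column}={value}".encode()
    # crc32 is a stable stdlib hash, used here only for illustration
    return zlib.crc32(key) % n_buckets

vec = [0] * 16
vec[hashed_encode("b", "C1")] = 1  # one-hot into the hashed bucket
```

The trade-off is that distinct levels can collide in a bucket, in exchange for a fixed column count and a single pass over the data.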

A similar problem statement is possible for values that are textual, in
addition to categorical.  The process is essentially the same in that you
have 0 or 1 passes to optionally agree on a dictionary and then another
pass to encode into n columns.
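The same two-pass scheme for text might look like this (tokenization is deliberately naive; the helpers are hypothetical):

```python
# Pass 1 (optional): agree on a dictionary of terms across documents.
# Pass 2: encode each document into n term-count columns.
def build_dictionary(docs):
    return sorted({tok for d in docs for tok in d.lower().split()})

def encode(doc, dictionary):
    toks = doc.lower().split()
    return [toks.count(term) for term in dictionary]

docs = ["the quick fox", "the lazy dog"]
dictionary = build_dictionary(docs)
counts = [encode(d, dictionary) for d in docs]
```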
