madlib-dev mailing list archives

From Rahul Iyer <ri...@pivotal.io>
Subject Re: create_indicator_variables() with svec?
Date Mon, 08 Aug 2016 17:05:35 GMT
Hi Satoshi,

Array output for *create_indicator_variables* would be quite helpful when
the number of categories is large, and the svec representation would be ideal
for it. There might be similar implications for *pivoting*, but we can keep
that as a future discussion.
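
To make the idea concrete, here is a minimal Python sketch, not MADlib code, of
encoding one categorical value as a single indicator array and then
run-length compressing it. The function names and the `{counts}:{values}`
text layout are assumptions chosen for illustration; they only loosely mirror
how a sparse vector collapses long runs of zeros.

```python
from itertools import groupby


def indicator_array(value, categories):
    """Return a dense 0/1 indicator array for `value` over `categories`."""
    return [1 if value == c else 0 for c in categories]


def run_length_encode(dense):
    """Collapse runs of equal values into parallel (counts, values) lists."""
    counts, values = [], []
    for v, run in groupby(dense):
        counts.append(sum(1 for _ in run))
        values.append(v)
    return counts, values


def to_svec_like_text(dense):
    """Render the run-length form as '{counts}:{values}' (illustrative format)."""
    counts, values = run_length_encode(dense)
    return "{%s}:{%s}" % (
        ",".join(map(str, counts)),
        ",".join(map(str, values)),
    )


# Hypothetical variable with 2000 categories, more than the 1664-entry limit
# on a target list, stored as one array value instead of 2000 columns.
categories = ["cat_%d" % i for i in range(2000)]
dense = indicator_array("cat_1500", categories)

print(len(dense))                # 2000
print(to_svec_like_text(dense))  # {1500,1,499}:{0,1,0}
```

The dense array has one element per category, while the compressed form
needs only three runs here, which is why a sparse representation looks
attractive once the category count grows.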

I'm curious about how you're using the indicator variables - svec is not
widely supported in MADlib (yet) and might not give much benefit after the
encoding is complete.

Best,
Rahul

On Sun, Aug 7, 2016 at 1:50 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:

> Hi,
>
> I'm trying to use create_indicator_variables() to encode categorical variables.
>
> https://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>
> And I found that PostgreSQL limits the number of entries in a SELECT list
> (called the target list in PostgreSQL) to 1664.
>
> You may see this error when you have more than 1664 categories in your
> variable.
>
> spiexceptions.ProgramLimitExceeded: target lists can have at most 1664 entries
>
> Now, I'm considering using PostgreSQL arrays to hold the indicators instead
> of allocating a single column per category.
>
> If create_indicator_variables() supported arrays as its output, it would
> allow us to deal with categorical variables that have more than 1664
> categories. And of course, I would like to use the sparse vector type to
> compress them.
>
> https://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
>
> Does this seem good to you? Any comments?
>
> Regards,
> --
> Satoshi Nagayasu <snaga@uptime.jp>
>



-- 

---------------------------------------------------------
Rahul Iyer
Principal software engineer | Predictive Analytics

*Pivotal* *A new platform for a new era*
