madlib-dev mailing list archives

From Rahul Iyer <ri...@pivotal.io>
Subject Re: MADlib 1.8 Random Forest error (array_of_bigint)
Date Mon, 30 Nov 2015 19:09:00 GMT
Hi Tetsuo,

I don't think the 'id' column is causing this issue; it is more likely the
array of features. Decision tree collects the continuous and categorical
features into two separate arrays, and one of those (most probably the
continuous features) is empty for a particular tuple. I can't comment more
without looking at the dataset.
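
One way to look for such groups is a per-linkid check along these lines. This is only a sketch, not a confirmed diagnosis. It assumes that 'test_rf_data' has the columns listed in the schema posted below (which is shown there under the name 'input_table'), that the numeric columns are the ones treated as continuous features, and that an all-NULL or constant column within a group is what leaves the array empty. None of this is verified against the actual data.

```sql
-- Diagnostic sketch: list every linkid group in which some numeric feature
-- is either entirely NULL (count = 0) or constant (min = max).
-- Table and column names are taken from the forest_train call and the
-- posted schema; adjust them if the real table differs.
SELECT linkid,
       count(*)           AS n_rows,
       count(ave_temp)    AS n_ave_temp,
       count(ave_wind)    AS n_ave_wind,
       count(precip)      AS n_precip,
       count(radiation)   AS n_radiation,
       count(ave_speed)   AS n_ave_speed,
       count(travel_time) AS n_travel_time
FROM test_rf_data
GROUP BY linkid
HAVING count(ave_temp)    = 0 OR min(ave_temp)    = max(ave_temp)
    OR count(ave_wind)    = 0 OR min(ave_wind)    = max(ave_wind)
    OR count(precip)      = 0 OR min(precip)      = max(precip)
    OR count(radiation)   = 0 OR min(radiation)   = max(radiation)
    OR count(ave_speed)   = 0 OR min(ave_speed)   = max(ave_speed)
    OR count(travel_time) = 0 OR min(travel_time) = max(travel_time)
ORDER BY linkid;
```

Any linkid returned here is a candidate for closer inspection, for example by running forest_train on that group alone. An empty result would suggest the cause lies elsewhere.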

Within the array operations module, we're returning the message as
"array_of_bigint" for a float array. That's a minor messaging bug; I'll fix
that as part of the next commit.

Best,
Rahul

On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <tkobayashi@pivotal.io>
wrote:

> Hi,
>
> I am currently getting an error from the MADlib Random Forest function in
> MADlib 1.8.0. Below is the code I tried.
>
> DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
> SELECT madlib.forest_train('test_rf_data', -- input table name
>                            'rf_output', -- output table name
>                            'id', -- id column
>                            'duration', -- dependent variable
>                            '*',  -- list of features
>                            NULL,-- exclude columns
>                            'linkid', -- grouping column
>                            2::integer, -- # of trees
>                            5::integer,  -- # of random features
>                            TRUE::boolean, -- importance
>                            1,  -- # of permutations
>                            5, -- max_tree_depth
>                            10,  -- min_split
>                            3,  -- min_bucket
>                            10  -- number of splits per continuous variable
>                            );
>
> When I tried this with all linkids (the grouping column has 362 linkids),
> I got the error shown in "error_random_forest.txt", attached here. The error
> message says the array length is invalid but gives no details about
> which features in the data have this issue.
>
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
>
> I guessed this error refers to the bigint columns, but the only bigint
> column is the "id" column. I once had an error because some features had
> identical values in all records, but that is not the case this time because I
> changed the sample size for each linkid to 1000 or above.
> It seems something is zero, since the DETAIL says "0 given", but I have no
> idea what in the data this refers to.
>
>
> The schema of the input table is as below;
> CREATE TABLE input_table (
> id bigint,
> linkid varchar(32),
> duration double precision,
> sat_flg int,
> sun_flg int,
> holiday_flg int,
> semi_holiday_flg int,
> renkyu_flg int,
> ave_temp numeric,
> ave_wind numeric,
> precip numeric,
> radiation numeric,
> ave_speed numeric,
> travel_time numeric
> );
>
> Can anybody please let me know the possible cause of this error? The
> MADlib linear regression worked without any problems.
>
> I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>
>
> Thank you,
>
> Tetsuo
>
