Hi Tetsuo,
I don't think the 'id' column is causing this issue; the array of features
is the more likely culprit. The decision tree code combines the continuous
and categorical features into two separate arrays, and one of those (most
probably the continuous-feature array) is empty for a particular tuple. I
can't comment further without looking at the dataset.
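If it helps narrow things down, a per-group check along these lines might
be a starting point. It is only a rough sketch: I'm assuming the table is
test_rf_data as in your forest_train call, and that the numeric columns in
your schema are the ones treated as continuous features. It may not
pinpoint the cause, but it would show any linkid with few rows or with
NULLs in those columns.

```sql
-- Sketch only: table name and the choice of "continuous" columns are
-- assumptions taken from the call and schema in the original message.
SELECT linkid,
       count(*)                      AS n_rows,
       count(*) - count(ave_temp)    AS ave_temp_nulls,
       count(*) - count(ave_wind)    AS ave_wind_nulls,
       count(*) - count(precip)      AS precip_nulls,
       count(*) - count(radiation)   AS radiation_nulls,
       count(*) - count(ave_speed)   AS ave_speed_nulls,
       count(*) - count(travel_time) AS travel_time_nulls
FROM test_rf_data
GROUP BY linkid
ORDER BY n_rows;
```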
Within the array operations module, the error message reports
"array_of_bigint" even for a float array. That's a minor messaging bug; I'll
fix it as part of the next commit.
Best,
Rahul
On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <tkobayashi@pivotal.io>
wrote:
> Hi,
>
> I am currently getting an error with the MADlib Random Forest function in
> MADlib 1.8.0. Below is the code I tried.
>
> DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
> SELECT madlib.forest_train('test_rf_data', -- input table name
> 'rf_output', -- output table name
> 'id', -- id column
> 'duration', -- dependent variable
> '*', -- list of features
> NULL, -- exclude columns
> 'linkid', -- grouping column
> 2::integer, -- # of trees
> 5::integer, -- # of random features
> TRUE::boolean, -- importance
> 1, -- # of permutations
> 5, -- max_tree_depth
> 10, -- min_split
> 3, -- min_bucket
> 10 -- number of splits per continuous variable
> );
>
> When I tried this with all linkids (the grouping column has 362 linkids),
> I got the error shown in "error_random_forest.txt" attached here. The error
> message says I have an invalid array length, but it does not give any
> details about which features in the data have this issue.
>
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
>
> I guessed this is an error for the bigint columns, but the only bigint
> column is the "id" column. I once had an error caused by some features
> having identical values in all records, but that is not the case this time
> because I raised the sample size for each linkid to 1000 or above.
> The "0 given" in the DETAIL line suggests something has size zero, but I
> have no idea what in the data this refers to.
>
>
> The schema of the input table is as below;
> CREATE TABLE input_table (
> id bigint,
> linkid varchar(32),
> duration double precision,
> sat_flg int,
> sun_flg int,
> holiday_flg int,
> semi_holiday_flg int,
> renkyu_flg int,
> ave_temp numeric,
> ave_wind numeric,
> precip numeric,
> radiation numeric,
> ave_speed numeric,
> travel_time numeric
> );
>
> Can anybody please let me know what the possible cause of this error is?
> The MADlib linear regression worked without any problems.
>
> I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>
>
> Thank you,
>
> Tetsuo
>