Thank you. We created a JIRA to investigate this:
https://issues.apache.org/jira/browse/MADLIB-1257
On Tue, Jul 24, 2018 at 10:31 AM, LUYAO CHEN <luyao_chen@hotmail.com> wrote:
> Another observation: it crashed with 84 groups and 73K instances. In this
> scenario, I should have plenty of memory and disk.
>
> Also, as the number of groups increases, it seems to use a lot of
> temporary disk space once the data exceeds a certain number of groups.
>
>
> Regards,
>
> ------------------------------
> *From:* LUYAO CHEN <luyao_chen@hotmail.com>
> *Sent:* Tuesday, July 24, 2018 9:15 AM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
>
> Hi Frank,
>
>
> Please refer to the enclosed data dump for the training table. I used
> the SQL below for random forest.
>
>
> DROP TABLE IF EXISTS train_output, train_output_group,
> train_output_summary;
> SELECT madlib.forest_train('train_data', -- source table
> 'train_output', -- output model table
> 'rowid', -- id column
> 'positive', -- response
> 'features', -- features
> NULL, -- exclude columns
> 'caseid', -- grouping columns
> 30::integer, -- number of trees
> 30::integer, -- number of random features
> TRUE::boolean, -- variable importance
> 1::integer, -- num_permutations
> 10::integer, -- max depth
> 3::integer, -- min split
> 1::integer, -- min bucket
> 10::integer, -- number of splits per continuous variable
> NULL, -- null handling parameter
> TRUE -- verbose
> );
>
> Regards,
> Luyao Chen
>
> ------------------------------
> *From:* Frank McQuillan <fmcquillan@pivotal.io>
> *Sent:* Monday, July 23, 2018 4:59 PM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
> Hi Luyao Chen
>
> It's hard to debug just looking at that trace.
>
> 1) If you increase your data size to more than 56K instances in 56
> groups, does it work? e.g., double it to approx 112K instances and 112
> groups.
>
> 2) Is it possible for you to share a sample of your data so that we
> could try? If not, perhaps anonymize a sample of the data so that we can
> multiply it out to make it bigger? Then we could take a closer look.
>
> Frank
>
> On Mon, Jul 23, 2018 at 12:34 PM, LUYAO CHEN <luyao_chen@hotmail.com>
> wrote:
>
> Dear user group,
>
>
> I ran into a problem when training grouped data with random forest (300
> features). Small data was fine (e.g., 56K instances in 56 groups), but it
> failed for 240K instances in 250 groups. Postgres forcibly disconnected the
> session after showing the message below in verbose mode:
>
>
> NOTICE: view "__madlib_temp_60124179_1532371657_7130296__" will be a
> temporary view
> NOTICE: sql_create_empty_result_table:
>
> CREATE TABLE analysis.dx_rf_train_output_1 (
> gid integer,
> sample_id integer,
> tree madlib.bytea8);
>
> NOTICE: sql_refresh_training_pois_cnt:
>
> TRUNCATE TABLE __madlib_temp_91155016_1532371657_5660955__
> CASCADE;
> INSERT INTO __madlib_temp_91155016_1532371657_5660955__
> SELECT
> *,
> madlib.poisson_random(1) AS poisson_count
> FROM
> (
> SELECT
> *,
> 0.::double precision AS
> __madlib_temp_14328459_1532371657_7318497__
> FROM analysis.dxpredict_svec
> ) subq
> WHERE __madlib_temp_14328459_1532371657_7318497__
> < 1
>
> NOTICE:
> src_cnt: 158360,
> oob_cnt: 92418,
> dup_cnt: 250617.
>
> NOTICE: Started tree building for all groups
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> PostgreSQL did not capture a detailed log even after I set
> log_statement to "all":
> 2018-07-23 14:47:50.229 EDT [1090] LOG: server process (PID 1980) was
> terminated by signal 11: Segmentation fault
> 2018-07-23 14:47:50.229 EDT [1090] DETAIL: Failed process was running:
> SELECT madlib.forest_train('analysis.dxpredict_svec',
> 'analysis.dx_rf_train_output_1',
> 'rowid',
> 'positive',
> '*',
> 'rowid,positive,case_icd',
> 'case_icd',
> 30::integer,
> 30::integer,
> TRUE::boolean,
> 1::integer,
> 10::integer,
> 3::integer,
> 1::integer,
> 10::integer,
> NULL,
> TRUE
> );
> 2018-07-23 14:47:50.229 EDT [1090] LOG: terminating any other active
> server processes
> 2018-07-23 14:47:50.229 EDT [1401] WARNING: terminating connection
> because of crash of another server process
>