systemml-dev mailing list archives

From Ethan Xu <ethan.yifa...@gmail.com>
Subject Re: parfor fails
Date Sat, 16 Apr 2016 23:12:34 GMT
Hi Matthias,

Thank you very much for the explanation and a better solution. s =
colSums(x==0) is more concise and works great!
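
For anyone following along, here is a minimal sketch of that data-parallel
version. It assumes the same $X input as the original script. The final loop
that prints each count is only an illustration added here and is not part of
Matthias's suggestion.

```
# Read the same input matrix as in the original parfor script.
x = read($X);

# One aggregation over the whole matrix: (x == 0) is a 0/1 indicator
# matrix, and colSums yields a 1 x ncol(x) row vector of zero counts.
s = colSums(x == 0);

# Illustrative only: print the per-column counts from the small result vector.
for (i in 1:ncol(x)) {
  print("number of 0's in col " + i + " = " + as.scalar(s[1,i]));
}
```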

As an experiment, I tried the original parfor script with a SystemML
configuration file provided. On my cluster it still fails with "PARFOR:
Failed to execute loop in parallel". It looks like the failed MR jobs are
caused by

Caused by: org.apache.sysml.runtime.DMLRuntimeException: Failed to create
non-existing local working directory:/path.to/ethan.xu/tmp/systemml

That directory '/path.to/ethan.xu/tmp/systemml' exists on the local server,
and it contains subdirectories named '_p22748_127.0.0.1' etc. It looks like
other SystemML jobs had no trouble writing to it.

The stderr and one failed MR log are attached.

Thanks,

Ethan

On Thu, Apr 14, 2016 at 11:14 PM, Matthias Boehm <mboehm@us.ibm.com> wrote:

> just for completeness, this issue is tracked with
> https://issues.apache.org/jira/browse/SYSTEMML-635 and the fix will be
> available tomorrow.
>
> Regards,
> Matthias
>
> From: Matthias Boehm/Almaden/IBM@IBMUS
> To: dev@systemml.incubator.apache.org
> Cc: "Ethan Xu" <ethan.yifanxu@gmail.com>
> Date: 04/14/2016 07:53 PM
> Subject: Re: parfor fails
> ------------------------------
>
>
>
> Hi Ethan,
>
> thanks for catching this issue. The parfor script itself is perfectly fine
> but you encountered an interesting runtime bug. Usually, you can find the
> actual cause at the bottom of the stacktrace or in previous exceptions. I
> was able to reproduce this issue if NO systemml config file is provided
> (fails on parsing this non-existing config in the parfor mr job task
> setup). So the workaround is to put a SystemML-config.xml into the same
> directory. Interestingly, the issue did not show up in our testsuite
> because we always specify a default configuration there (which was until
> recently mandatory).
>
> As a side note, we strongly recommend parfor over for loops here because
> it runs the entire loop in 1 instead of 2396 MR jobs due to automatic data
> partitioning. However, for the specific example at hand, a data-parallel
> formulation (with "s = colSums(x==0)") would be even better as it allows
> for partial aggregation and hence reduces shuffle.
>
> Regards,
> Matthias
>
> From: Ethan Xu <ethan.yifanxu@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 04/14/2016 01:34 PM
> Subject: parfor fails
> ------------------------------
>
>
>
> Hello,
>
> I have a quick question. The following script fails with this error:
>
> org.apache.sysml.runtime.DMLRuntimeException: PARFOR: Failed to execute
> loop in parallel.
>
> Here is the dml script:
>
> x=read($X);
>
> print("number of rows of x = " + nrow(x));
> print("number of cols of x = " + ncol(x));
>
> parfor(i in 1:ncol(x), check=0){
>   a = x[,i];
>   print("number of 0's in col " + i + " = " + sum(a == 0));
> }
>
> where X is a 35 million by 2396 matrix (coded and dummy coded numerical
> matrix) on HDFS. The script runs fine with regular 'for' loops.
>
> Could someone explain why this script cannot run in parallel? Was it a
> wrong way to code parfor?
>
> Thanks,
>
> Ethan
>
>
>
>
