systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-2376) Preparation of baseline experiments
Date Fri, 15 Jun 2018 05:09:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513347#comment-16513347
] 

Matthias Boehm commented on SYSTEMML-2376:
------------------------------------------

That is a great start - thanks for the automated scripts. I made a couple of modifications:
* You can simply put the {{SystemML-config.xml}} into the same directory as SystemML.jar -
then it will be picked up.
* There is no need to include the log4j properties or the nn library - log4j is picked up
from Spark's config directory, and our own scripts are already packaged into SystemML.jar.
* I added a {{./sparkDML2.sh}}, which includes the memory configurations and the Spark submission.
With the local parameter server, we simply run in Spark's driver process, which is equivalent
to a standalone invocation, but this way we can use one consistent setup later for distributed
operations as well.
* I modified the invocation script to run only the max number of epochs and workers to get
reasonable runtimes because running all combinations for workers (up to 80) and epochs would
take way too long. I also changed the batch sizes to a more common parameterization.

Furthermore, I tried to run with MKL, but in the environment used, I ran into the following
core dump (which might need a fix similar to the issue you encountered with OpenBLAS - let
us double check the instruction parallelism used):

{code}
2018-06-14 21:07:31 INFO  NativeHelper:185 - Using native blas: mkl
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGFPE (0x8) at pc=0x00007fbed61902d6, pid=352940, tid=0x00007fd886176700
#
# JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 )
# Problematic frame:
# C  [libmkl_avx512.so+0x206d2d6]  mkl_dnn_avx512_bkdGemmDirectConv_F64+0x276
{code}

Most importantly, however, the experiments currently get stuck on the very first combination.
This requires some work. In the meantime, there are a couple of observations that should
be addressed:
* In BSP, the execution hangs (before the stats output), independent of the number of epochs
(tried with 1 and 2); no work is performed anymore.
* Double check the accuracy value, which is reported as 0 for mnist60k (at least with ASP)
and indicates some issue. Could we please include a test that checks for similar outputs
of the parameter server compared to basic mini-batch execution?
* Currently, the number of workers is set to k-1, which is confusing when a user explicitly
specifies k.
* Having an optional flag for reporting progress would be nice (e.g., current epoch / current
batch, maybe max for ASP).
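
To make the requested comparison test more concrete, here is a minimal sketch in plain Python
(not SystemML or DML code). It assumes a toy linear model on synthetic data, and all function
names ({{make_data}}, {{train_minibatch}}, {{train_bsp}}, {{outputs_similar}}) are hypothetical.
The BSP variant only simulates synchronous workers in a loop with averaged gradients; an actual
test would call the parameter server and the mini-batch script instead.

```python
import random


def make_data(n, seed=7):
    """Synthetic 1-D regression data, y = 3x + 1 + small noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.01) for x in xs]
    return xs, ys


def gradient(w, b, xs, ys):
    """Gradient of the mean squared error of y = w*x + b on one batch."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def train_minibatch(xs, ys, epochs, batch_size, lr):
    """Baseline: sequential mini-batch SGD over the full data set."""
    w = b = 0.0
    for _ in range(epochs):
        for i in range(0, len(xs), batch_size):
            gw, gb = gradient(w, b, xs[i:i + batch_size], ys[i:i + batch_size])
            w -= lr * gw
            b -= lr * gb
    return w, b


def train_bsp(xs, ys, epochs, batch_size, lr, workers):
    """Simulated BSP: each worker owns one contiguous partition; per round, every
    worker computes a gradient on its next local batch against the same model,
    and the averaged gradient is applied once (the synchronization barrier)."""
    part = len(xs) // workers
    parts = [(xs[k * part:(k + 1) * part], ys[k * part:(k + 1) * part])
             for k in range(workers)]
    w = b = 0.0
    for _ in range(epochs):
        for i in range(0, part, batch_size):
            grads = [gradient(w, b, px[i:i + batch_size], py[i:i + batch_size])
                     for px, py in parts]
            w -= lr * sum(g[0] for g in grads) / workers
            b -= lr * sum(g[1] for g in grads) / workers
    return w, b


def outputs_similar(model_a, model_b, tol):
    """True if both (w, b) pairs agree within an absolute tolerance."""
    return all(abs(p - q) <= tol for p, q in zip(model_a, model_b))


if __name__ == "__main__":
    xs, ys = make_data(240)
    base = train_minibatch(xs, ys, epochs=50, batch_size=10, lr=0.1)
    bsp = train_bsp(xs, ys, epochs=50, batch_size=10, lr=0.1, workers=4)
    print("mini-batch:", base)
    print("BSP (simulated):", bsp)
    print("similar:", outputs_similar(base, bsp, tol=0.05))
```

The tolerance is deliberately loose: the two runs apply a different number of updates, so the
check is for similar results rather than identical ones. An accuracy of 0, as currently
reported, would fail such a check immediately.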


> Preparation of baseline experiments
> -----------------------------------
>
>                 Key: SYSTEMML-2376
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2376
>             Project: SystemML
>          Issue Type: Technical task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
