systemml-issues mailing list archives

From "Mike Dusenberry (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SYSTEMML-493) Modularize Existing DML Algorithms
Date Mon, 31 Jul 2017 20:35:00 GMT

     [ https://issues.apache.org/jira/browse/SYSTEMML-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Dusenberry updated SYSTEMML-493:
-------------------------------------
    Issue Type: Epic  (was: Improvement)

> Modularize Existing DML Algorithms
> ----------------------------------
>
>                 Key: SYSTEMML-493
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-493
>             Project: SystemML
>          Issue Type: Epic
>          Components: Algorithms
>            Reporter: Mike Dusenberry
>
> Currently, our provided DML algorithms come in the form of single, long scripts that
> contain the read and write statements, are usually not broken up into modular UDFs,
> and require the user to supply all arguments via the command line or bash scripts.
> As a high-level example:
> {code}
> # read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> # core part of the algorithm
> # note: this is not broken up by a UDF, and instead is just a continuation of the script
> while(!converged) {
>   # do stuff
> }
> # outputs, test results, stats, etc.
> write(...)
> print(...)
> {code}
> The issue here is that many ML algorithms require hyperparameter tuning, and are part
> of a general data flow (data ingestion, cleaning, splitting, etc.).  Due to this, it
> would be ideal if our algorithm scripts were modularized so that the core parts of the
> algorithms were wrapped in UDFs (e.g. {{train(...)}}, {{test(...)}}, etc.).  Then,
> rather than having to perform these additional steps from a bash script, a user could
> instead import our algorithm scripts from DML, and make calls to the UDFs as necessary.
> As an example of the modification to our scripts:
> {code}
> # read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> # core part of the algorithm
> # note: this is wrapped in a UDF, thus allowing the user to import it and supply
> # arguments from another DML script if desired
> train = function (matrix[double] X, double hyperparam1, double hyperparam2)
>     return (matrix[double] model) {
>   while(!converged) {
>     # do stuff
>   }
> }
> # when run as a script, this will invoke the `train(...)` function, thus achieving
> # the same result as the previous script design
> model = train(X, hyperparam1, anotherHyperparam)
> # outputs, test results, stats, etc.
> write(...)
> print(...)
> {code}
> By modularizing the core parts of the algorithms into UDFs, while still keeping the
> surrounding read/write statements, our provided scripts could still be executed in the
> (currently) normal fashion, while also being importable from other DML scripts so that
> the UDFs can be used directly.  As an example of a custom DML workflow script:
> {code}
> # import
> source("LinearReg.dml") as lr
> # ingest data
> X_dirty = read(...)
> # clean data
> X = ...
> # split
> X_train = ...
> X_val = ...
> X_test = ...
> # hyperparameter tuning
> while(tuning) {
>   hyperparam1 = ...
>   hyperparam2 = ...
>   model = lr::train(X, hyperparam1, hyperparam2)
>   error = lr::test(X_val, ...)
>   ...
> }
> # use best hyperparameters
> ...
> # save model
> write(model)
> {code}
> This change could be applied to all of our provided DML algorithms, and many could be
> broken up into {{train(...)}}, {{test(...)}}, {{stats(...)}}, etc. functions.  The goal
> here is to promote the use of DML for the entire ML pipeline (i.e. the way Python, R,
> Scala, etc. are currently being used), rather than encouraging the use of cumbersome
> bash scripts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
