madlib-dev mailing list archives

From: GitBox <...@apache.org>
Subject: [GitHub] [madlib] reductionista opened a new pull request #525: DL: Model Hopper Refactor
Date: Thu, 19 Nov 2020 00:57:54 GMT

reductionista opened a new pull request #525:
URL: https://github.com/apache/madlib/pull/525


   This is a major refactor of the model hopper parallelism of the deep learning module.
   
   **A new data flow for the hop/training cycle**
   
   Background:
   
   In v1.17, MADlib creates 3 temporary tables to store the model weights while they are being
   hopped around and trained: mst_weights_tbl, weights_to_update_tbl, and model_output_table.
   This gives rise to 3 different long-running queries that each involve at least one
   Redistribute Motion: the UDA query (which does the actual training), the UPDATE query, and
   the HOP query.  Most of the time in madlib_keras_fit_multiple() is spent in a loop over
   iterations and hops calling the run_training() function, which performs each of these 3
   stages; all of the models circulate between these 3 temporary tables and get back to where
   they started after each full iteration.  The overall effect is a lot of unnecessary motion
   of the model weights, going back and forth between different segment hosts several times for
   each call to run_training().  Each hop is really 3 or 4 hops in terms of data motion, which
   for a modest-sized data set can cause the hopping phase of the cycle to take just as long as
   the training phase, leaving GPUs with a lot of idle time.
   
   3 Biggest Changes:
   
   1.  This refactor involves a major simplification where the 3 temporary tables are replaced
       by 2 temporary tables (model_input_tbl and model_output_tbl) and the UPDATE stage is
       eliminated, leaving only the HOP and UDA stages (see the outline after this list).
       There were also 3 shorter-running TRUNCATE & DROP queries in between the 3 longer
       queries; these are also reduced to only 2 TRUNCATE & DROP queries.
   
   Two additional performance improvements closely related to the above are:
   2.  A __dist_key__ column is added to the model_output table, and both of the temporary
          tables are DISTRIBUTED BY this key instead of by mst_key.  The __dist_key__ values
          have a 1-to-1 mapping with segment_id's, as in the batched source table, and this
          key now serves as the hash key for all JOINs.  The schedule table has both a
          __dist_key__ column and a __previous_dist_key__ column, so that it can guide the
          weights from the specific segment they were on previously to the one they are
          scheduled to be on for the next hop (see the hop-query sketch after this list).
          This avoids any additional Redistribute Motions caused by table Hash Joins.  With
          this change, there is no longer any movement of the models at all except during
          the hop itself.  They move exactly once per call to run_training().
   
   and
   3.  For model weight averaging (madlib_keras_fit()), we needed a merge and a final function
        in addition to the fit_transition function, so a UDA was the natural choice.  For the
        model hopper (madlib_keras_fit_multiple()), there is no need for merge or final; all
        we need is to call the fit_transition function directly on each row, so it was less
        clear whether a UDA was the right choice.  We found that calling it as a UDF directly,
        and using SD to pass the state from one row to the next (as we were already doing
        with the UDA), resulted in slightly better performance (see the SD sketch after this
        list).  So now fit_transition can be called either as a UDF or as part of a UDA, and
        fit_transition_multiple_model() is always called as a UDF.
   
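
   To make the simplified cycle in change 1 concrete, here is a rough outline of the
   per-hop control flow after this refactor.  It is a sketch only: the SQL strings are
   placeholders, `execute` stands in for running a query from the Python driver, and the
   exact statement ordering in the patch may differ.

   ```python
   # Rough outline (not the actual MADlib source) of one hop of the new cycle:
   # one training query (fit_transition over model_input_tbl), one HOP query,
   # and only two TRUNCATE & DROP steps in between.
   def run_training_outline(execute, is_last_hop_of_iteration):
       # UDA/UDF stage: train every model where it currently sits, writing the
       # updated weights into model_output_tbl.
       execute("-- training query: model_input_tbl -> model_output_tbl")
       execute("TRUNCATE TABLE model_input_tbl")
       execute("DROP TABLE model_input_tbl")
       if is_last_hop_of_iteration:
           # Item 4 below: skip the final hop; rename output to input so the
           # next iteration starts with the models where they already are.
           execute("ALTER TABLE model_output_tbl RENAME TO model_input_tbl")
       else:
           # HOP stage: one schedule-driven join moves each model to the
           # segment it trains on next (see the hop-query sketch below).
           execute("-- hop query: model_output_tbl -> model_input_tbl")
           execute("TRUNCATE TABLE model_output_tbl")
           execute("DROP TABLE model_output_tbl")

   if __name__ == "__main__":
       run_training_outline(print, is_last_hop_of_iteration=False)
   ```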
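
   For change 2, here is a minimal sketch of what a schedule-driven hop query can look
   like once everything is keyed on __dist_key__.  Only __dist_key__, __previous_dist_key__,
   mst_key, and the two temporary table names come from the description above; the
   model_weights column, the schedule_tbl name, and the exact query text are placeholders,
   not the query used in the patch.

   ```python
   # Hypothetical hop query (change 2): model_output_tbl holds the freshly trained
   # weights, and the schedule table maps the __dist_key__ a model was just on
   # (__previous_dist_key__) to the __dist_key__ it moves to next.  Because both
   # temporary tables are DISTRIBUTED BY (__dist_key__), the join adds no extra
   # Redistribute Motion; the only data movement is the hop itself.
   HOP_SQL = """
       CREATE TABLE {model_input_tbl} AS
       SELECT s.__dist_key__,          -- destination segment's key
              o.mst_key,
              o.model_weights          -- placeholder column name
       FROM {model_output_tbl} o
       JOIN {schedule_tbl} s
         ON o.__dist_key__ = s.__previous_dist_key__
       DISTRIBUTED BY (__dist_key__)
   """

   def build_hop_query(model_input_tbl, model_output_tbl, schedule_tbl):
       return HOP_SQL.format(model_input_tbl=model_input_tbl,
                             model_output_tbl=model_output_tbl,
                             schedule_tbl=schedule_tbl)

   if __name__ == "__main__":
       print(build_hop_query("model_input_tbl", "model_output_tbl", "schedule_tbl"))
   ```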
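
   For change 3, a toy PL/Python-style sketch of what "call fit_transition as a UDF and
   carry state in SD" means.  The argument list, the stand-in helper, and the dict-based
   "model" are illustrative only; this is not the module's actual fit_transition signature
   or state format.

   ```python
   # Toy sketch of per-row UDF calls with state carried in SD (change 3).  In
   # PL/Python, SD is a dict that persists between calls to the same function
   # within a session, so the model built for a segment's first row can be
   # reused for every later row instead of being rebuilt each time.
   def make_model(serialized_weights):
       # Stand-in for deserializing weights and compiling a Keras model.
       return {"weights": serialized_weights, "batches_seen": 0}

   def fit_transition_sketch(SD, independent_var, dependent_var,
                             serialized_weights, is_last_row):
       model = SD.get("model")
       if model is None:
           model = make_model(serialized_weights)    # once per segment per pass
           SD["model"] = model
       model["batches_seen"] += 1                    # stand-in for train_on_batch()
       # With no merge/final step, only the last row on the segment needs to
       # hand back the updated weights.
       return serialized_weights if is_last_row else None

   # Outside the database, SD can be emulated with an ordinary dict:
   if __name__ == "__main__":
       SD = {}
       out = None
       for is_last in (False, False, True):
           out = fit_transition_sketch(SD, None, None, b"weights", is_last)
       print(SD["model"]["batches_seen"], out)       # -> 3 b'weights'
   ```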
   
   # Other performance enhancements
   
   4.  There is no need to hop the models between the last hop of the previous iteration and
        the first hop of the next iteration, so once per iteration we skip the hop and instead
        just rename model_output_tbl to model_input_tbl before truncating and dropping
        model_output_tbl.  For example, if there were 4 segments, then in v1.17 that means 4
        model hops per iteration.  In this branch, the last hop is skipped and the next
        iteration just starts with the models on the same segments they are already on.  The 4
        hops have been reduced to only 3 hops.  The same amount of training occurs as before,
        and each model is paired exactly once per iteration with each segment--just in a
        slightly different order.
   
   5.  Much faster initialization code:  previously, we were reading the weights
         in from the original model output table (during warm start) and the model
         arch table (for transfer learning) one mst row at a time from segment to
         master, then writing them each back out one row at a time from master
         back to segments with a large number of SELECT and INSERT queries.
         Now, we just use a single query to copy the weights directly from the
         original model output table into the new model output table on the
         segments, without ever sending them to master.  And a similar single
         query copies the transfer learning weights directly from model_arch to
         model_output for training.  Both of these happen in parallel on the
         segments, instead of in sequence on master.  During testing of warm
         start on a 20-segment cluster with 20 models, this resulted in a 10x reduction
          in initialization time (26s instead of 5 mins in v1.17).  A sketch of this
          single-query copy is included after this list.
   
   6.  Split get_model_arch_and_weights() into query_weights() and get_model_arch(), so that
         we don't have to transfer weights from segments to master in places where we only
         need the model_arch JSON.
   
   7.  Enables JIT XLA auto-clustering, if available (see the sketch after this list).
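
   For item 5, the single-query initialization is essentially an INSERT ... SELECT that
   runs in parallel on the segments.  A minimal sketch follows; apart from mst_key and
   __dist_key__, the column and table names are placeholders, and this is not the exact
   query in the patch.

   ```python
   # Hypothetical warm-start copy (item 5): move the previous run's weights
   # straight from the original model output table into the new one in a single
   # parallel query, instead of SELECTing each mst row up to master and
   # INSERTing it back out one row at a time.
   WARM_START_COPY_SQL = """
       INSERT INTO {new_model_output_tbl}
           (__dist_key__, mst_key, model_weights)
       SELECT prev.__dist_key__, prev.mst_key, prev.model_weights
       FROM {original_model_output_tbl} prev
   """

   def build_warm_start_copy(new_model_output_tbl, original_model_output_tbl):
       return WARM_START_COPY_SQL.format(
           new_model_output_tbl=new_model_output_tbl,
           original_model_output_tbl=original_model_output_tbl)

   if __name__ == "__main__":
       print(build_warm_start_copy("model_output_tbl", "prev_run_model_output"))
   ```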
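
   Item 7 refers to TensorFlow's XLA JIT auto-clustering.  A minimal sketch of the kind of
   guarded toggle this implies; the helper name is made up, and the guard keeps it a no-op
   on TensorFlow builds without this API.

   ```python
   # Minimal sketch of enabling XLA JIT auto-clustering when available (item 7).
   # tf.config.optimizer.set_jit(True) turns on auto-clustering in TF 1.14+/2.x;
   # the try/except keeps this a no-op where the API is missing or unsupported.
   import tensorflow as tf

   def enable_xla_if_available():
       try:
           tf.config.optimizer.set_jit(True)
           return True
       except (AttributeError, RuntimeError):
           return False

   if __name__ == "__main__":
       print("XLA auto-clustering enabled:", enable_xla_if_available())
   ```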
   
   # Other Changes
   
    8.  Simplified schedule rotation: the schedule table is created only once, then gets
         rotated on the segments, instead of being re-created many times by transferring
         data back and forth from master to segments to master each hop.  We no longer
         need separate "current_schedule" and "grand_schedule" data structures.
         These tables do not contain model weights, so there is not much of a
         performance benefit, but it's at least simpler and more maintainable.
   
   9.   Added some debugging that can be enabled to help profile the
         performance of fit_multiple, and to track which segment each mst_key
         is located on during each hop.  This also serves as an example for
         the utils/debug PR this is rebased on top of.
   
   10.  Remove AutoML's dependency on the internals of fit_multiple (needed to make AutoML
        compatible with the new FitMultiple class, but also good for avoiding having to
        do the same thing for any future updates to fit_multiple).  Better modularity.
   
    11.  Improved exception handling:  send the full stack traceback from the segment back to
         the master (soon to be renamed "coordinator").  Now when an exception happens in
         fit_transition (or merge, final, etc.) we'll be able to see the stack trace of the
         internal UDF or UDA running on the segments, along with the stack trace of the outer
         UDF that runs on the coordinator.  It's attached to the DETAIL of the exception and
         includes the line number where the error occurred--something that we had to guess at
         before, making debugging difficult.  A sketch of the pattern is included below.
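
   A generic sketch of the pattern in item 11: capture the traceback where the error
   happens on the segment and ship it back, so the coordinator-side report shows both
   stacks.  This is plain Python for illustration; in the patch the segment traceback is
   attached to the error's DETAIL field rather than folded into the message, and these are
   not the module's actual helpers.

   ```python
   # Generic sketch of item 11: preserve the segment-side stack trace, including
   # line numbers, when re-raising toward the coordinator.  Illustration only.
   import traceback

   def run_on_segment():
       # Stand-in for fit_transition / merge / final running on a segment.
       raise ValueError("failure deep inside the segment UDF")

   def segment_udf():
       try:
           return run_on_segment()
       except Exception as exc:
           # Fold the full segment-side traceback into what propagates back,
           # so it is visible alongside the coordinator-side stack.
           raise RuntimeError("{0}\nSegment traceback:\n{1}"
                              .format(exc, traceback.format_exc()))

   if __name__ == "__main__":
       try:
           segment_udf()
       except RuntimeError as err:
           print(err)   # message plus the segment-side stack trace
   ```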

