mxnet-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] rahul003 commented on issue #8373: distribute training in fp16
Date Thu, 01 Jan 1970 00:00:00 GMT
rahul003 commented on issue #8373: distribute training in fp16
URL: https://github.com/apache/incubator-mxnet/pull/8373#issuecomment-365386805
 
 
   @solin319 Which machines did you use to get the numbers above? Let us try to come up with an
easier interface for this so we can use it on the latest Nvidia GPUs.
   
   Regarding "I think merge the logic in '_init_kvstore_server_module' to the function 'kvstore.create'
may be a better way to start server and worker.":
   That would mean the server is created only when the code calls kvstore.create(). As a
result, the server process would first run everything in the training script that comes before
the kvstore is created, which could allocate memory for data, the model, etc. Otherwise we would
have to instruct users to create the kvserver first (at which point the server process goes into
a loop, so no other code is run), but that seems hacky for an official way of doing things.
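
   The concern above can be illustrated with a toy model. This is only a sketch in plain Python and makes no MXNet calls: `training_script`, `create_kvstore`, and the step names are hypothetical stand-ins, and the role strings are just labels. It assumes a `create` call that also starts the server, and records which steps a server process and a worker process would each execute.

```python
# Toy model only: no MXNet calls. All names here are hypothetical.

def create_kvstore(role, steps):
    """Hypothetical create() that also starts the server.

    Returns True for a worker. For a server, records that the process
    enters its serving loop and never returns to the training code.
    """
    if role == "server":
        steps.append("serve loop")
        return False
    steps.append("create kvstore")
    return True


def training_script(role):
    """Return the steps a process with the given role would execute."""
    steps = []
    steps.append("load data")    # runs in every process, may allocate memory
    steps.append("build model")  # runs in every process, may allocate memory
    if create_kvstore(role, steps):
        steps.append("train")
    return steps


server_steps = training_script("server")
worker_steps = training_script("worker")
print("server:", server_steps)
print("worker:", worker_steps)
```

   In this model the server process still performs the data and model setup before it reaches the serving loop, which is the wasted allocation described above. Moving the `create_kvstore` call to the top of the script avoids that, at the cost of requiring users to order their code that way.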
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
