singa-dev mailing list archives

From "Liwen Xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SINGA-472) Rafiki - Error 'cudnn PoolForward launch failed' when doing average pooling
Date Wed, 24 Jul 2019 09:26:00 GMT
Liwen Xu created SINGA-472:
------------------------------

             Summary: Rafiki - Error 'cudnn PoolForward launch failed' when doing average pooling
                 Key: SINGA-472
                 URL: https://issues.apache.org/jira/browse/SINGA-472
             Project: Singa
          Issue Type: Bug
            Reporter: Liwen Xu


When porting my PGGANs model to the dev branch, I hit a 'cudnn PoolForward launch failed'
error during average pooling.

I suspect this error is related to GPU memory allocation. I got the same error when implementing
PGGANs on the master branch, and there I worked around it by decreasing the minibatch size and
adding sleep time before the pooling step (see the sketch below). However, these workarounds do
not help on the dev branch.
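
For reference, the master-branch workaround looked roughly like this (a minimal sketch; the
helper name, pause length, and batch size are illustrative, not my exact code):

    import time

    # Workaround sketch from the master branch: shrink the minibatch and
    # pause briefly before the step that runs the pooling op, so cuDNN
    # has headroom to allocate its workspace on the GPU.
    MINIBATCH_SIZE = 8  # reduced from the original setting

    def run_step_with_pause(sess, train_op, feed_dict, pause_s=1.0):
        time.sleep(pause_s)  # illustrative pause before the pooling op runs
        return sess.run(train_op, feed_dict=feed_dict)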

The model passes test_model_class, and there is also no error when I run test_model_class
manually inside the worker image, so I do not think the error is caused by an environment
problem.

Besides, I noticed that with the dev branch, whether the model runs in Rafiki or just through
test_model_class, all GPU memory is allocated at the beginning of a training trial. This happens
even though I have already set tf.ConfigProto().gpu_options.allow_growth=True, which should make
TensorFlow allocate only as much GPU memory as it actually needs (see the sketch below). The same
model on the master branch does not show this behavior, so I am not sure whether this is the
cause of the error.
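
For completeness, the option is set with the standard TF 1.x pattern, roughly as follows (a
minimal sketch; the surrounding session setup in my model differs):

    import tensorflow as tf

    # Standard TF 1.x pattern: let TensorFlow allocate GPU memory on
    # demand instead of reserving it all when the session starts.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)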

Thank you so much for your help!

The error trace follows:

Traceback (most recent call last):
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
     [[{{node GPU0/D_loss/D/cond/Downscale2D/AvgPool}} = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 8, 8], padding="VALID", strides=[1, 1, 8, 8], _device="/job:localhost/replica:0/task:0/device:GPU:0"](GPU0/D_loss/D/cond/Downscale2D/AvgPool/Switch)]]
     [[{{node TrainD/ApplyGrads0/UpdateWeights/cond/pred_id/_921}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14500_TrainD/ApplyGrads0/UpdateWeights/cond/pred_id", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/rafiki/worker/train.py", line 111, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 167, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 599, in train
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 910, in _train_progressive_gan
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
     [[node GPU0/D_loss/D/cond/Downscale2D/AvgPool (defined at /root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py:534) = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 8, 8], padding="VALID", strides=[1, 1, 8, 8], _device="/job:localhost/replica:0/task:0/device:GPU:0"](GPU0/D_loss/D/cond/Downscale2D/AvgPool/Switch)]]
     [[{{node TrainD/ApplyGrads0/UpdateWeights/cond/pred_id/_921}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14500_TrainD/ApplyGrads0/UpdateWeights/cond/pred_id", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'GPU0/D_loss/D/cond/Downscale2D/AvgPool', defined at:
  File "scripts/start_worker.py", line 58, in <module>
    run_worker(meta_store, start_worker, stop_worker)
  File "/root/rafiki/utils/service.py", line 50, in run_worker
    start_worker(service_id, service_type, container_id)
  File "scripts/start_worker.py", line 40, in start_worker
    worker.start()
  File "/root/rafiki/worker/train.py", line 68, in start
    result = self._perform_trial(proposal)
  File "/root/rafiki/worker/train.py", line 111, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 167, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 599, in train
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 875, in _train_progressive_gan
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 1382, in _D_wgangp_acgan
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 233, in get_output_for
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 440, in D_paper
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 436, in grow
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 555, in <lambda>
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 433, in <lambda>
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 534, in _downscale2d
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 2110, in avg_pool
    name=name)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 72, in avg_pool
    data_format=data_format, name=name)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): cudnn PoolForward launch failed
     [[node GPU0/D_loss/D/cond/Downscale2D/AvgPool (defined at /root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py:534) = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 8, 8], padding="VALID", strides=[1, 1, 8, 8], _device="/job:localhost/replica:0/task:0/device:GPU:0"](GPU0/D_loss/D/cond/Downscale2D/AvgPool/Switch)]]
     [[{{node TrainD/ApplyGrads0/UpdateWeights/cond/pred_id/_921}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14500_TrainD/ApplyGrads0/UpdateWeights/cond/pred_id", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
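
In case it helps with triage, the failing op can be rebuilt in isolation. The sketch below uses
the same AvgPool configuration seen in the trace (NCHW layout, 8x8 kernel and stride, VALID
padding); the input shape is an assumption for illustration, not the exact tensor shape from my
model:

    import numpy as np
    import tensorflow as tf

    # Standalone version of the op from the trace: AvgPool, NCHW layout,
    # 8x8 kernel and stride, VALID padding (NCHW pooling runs on GPU).
    x = tf.placeholder(tf.float32, shape=[None, 16, 64, 64])  # assumed shape
    y = tf.nn.avg_pool(x, ksize=[1, 1, 8, 8], strides=[1, 1, 8, 8],
                       padding='VALID', data_format='NCHW')

    with tf.Session() as sess:
        batch = np.random.rand(4, 16, 64, 64).astype(np.float32)
        print(sess.run(y, {x: batch}).shape)  # (4, 16, 8, 8)

If this also fails standalone, the op itself is suspect; if it only fails inside the full graph,
that points back to GPU memory pressure.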



