singa-dev mailing list archives

From "Liwen Xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SINGA-468) Rafiki - Random uuid name cause potential "No module named xxx" error during load parameters
Date Fri, 12 Jul 2019 02:45:00 GMT
Liwen Xu created SINGA-468:
------------------------------

             Summary: Rafiki - Random uuid name cause potential "No module named xxx" error during load parameters
                 Key: SINGA-468
                 URL: https://issues.apache.org/jira/browse/SINGA-468
             Project: Singa
          Issue Type: Bug
            Reporter: Liwen Xu


I encountered a "No module named xxx" error when my model's load_parameters is called while
launching an inference job. Here is the error trace:

 
{code:java}
2019-07-10 02:21:07,256 rafiki.utils.service INFO Starting worker "75be99ec25a6" for service of ID "614d740e-9791-4c64-aafe-dc17cf7e7866"...
2019-07-10 02:21:07,511 rafiki.worker.inference INFO Starting inference worker for service of id 614d740e-9791-4c64-aafe-dc17cf7e7866...
2019-07-10 02:21:07,519 rafiki.cache.cache INFO add_worker_of_inference_job:INFERENCE_WORKERS_b6592484-deb4-4df2-bce3-ffc82d9a125a=614d740e-9791-4c64-aafe-dc17cf7e7866
2019-07-10 02:21:09,131 rafiki.utils.service ERROR Error while running worker:
2019-07-10 02:21:09,131 rafiki.utils.service ERROR Traceback (most recent call last):
  File "/root/rafiki/utils/service.py", line 31, in run_worker
    start_worker(service_id, service_type, container_id)
  File "scripts/start_worker.py", line 24, in start_worker
    worker.start()
  File "/root/rafiki/worker/inference.py", line 41, in start
    self._model = self._load_model(trial_id)
  File "/root/rafiki/worker/inference.py", line 91, in _load_model
    model_inst.load_parameters(parameters)
  File "/root/e4568ce2-9d44-47b8-ac7f-1e8143168140.py", line 235, in load_parameters
ModuleNotFoundError: No module named '797342b4-9d38-432f-91f6-727eac25db71'

{code}
After debugging, I believe this is a potential bug in how Rafiki interacts with pickle. It is
triggered by pickling objects of self-defined classes (classes defined in the model source code).

Pickle requires the pickled object's class to be importable during pickle.loads(), using
the same import path that was recorded during pickle.dumps(). However, each time a training trial
or inference job is launched, the model source file is given a random UUID as its file name.
As a result, the import path at load time differs from the one recorded at dump time.
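
The mismatch described above can be reproduced outside Rafiki. The following is a minimal sketch, not Rafiki's actual loading code: MODEL_SOURCE, Vocab and load_model_module are hypothetical names, and the loader simply executes the source as a module registered under a random UUID name, which is assumed to approximate what happens to the model source file.

```python
import pickle
import sys
import types
import uuid

# Hypothetical stand-in for a model source file that defines its own class.
MODEL_SOURCE = '''
class Vocab:
    def __init__(self, words):
        self.words = words
'''


def load_model_module(source, mod_name):
    """Execute `source` as a module registered under `mod_name` (assumed loader)."""
    module = types.ModuleType(mod_name)
    sys.modules[mod_name] = module
    exec(source, module.__dict__)
    return module


# "Train job": the model source is loaded under one random name.
train_name = str(uuid.uuid4())
train_mod = load_model_module(MODEL_SOURCE, train_name)
# pickle records the class by reference: (module name, class name).
blob = pickle.dumps(train_mod.Vocab(["a", "b"]))
del sys.modules[train_name]  # simulate the train process going away

# "Inference job": the same source is loaded under a different random name.
infer_name = str(uuid.uuid4())
load_model_module(MODEL_SOURCE, infer_name)

error = None
try:
    pickle.loads(blob)  # tries to import the module named train_name
except ModuleNotFoundError as exc:
    error = exc

print(error)
```

The pickled bytes contain the train-time module name, so loading fails even though an identical class is available under the new name.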

This bug has not surfaced so far because the existing models in Rafiki only pickle objects of
imported classes or Python "primitives", whose import paths are consistent.

Potential fixes for this bug could be:
 # Replace the randomly generated file name with a hash of something (e.g. model name + trial ID),
then use the same hashing scheme for both the train job and the inference job.
 # Remember the name generated during the train job and reuse it during the inference job.
(Model.load_model_class does take a third parameter "temp_mod_name", but it is never passed
except in "test_model_class".)
 # Change the way the model source file is imported. (Not sure)
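
As an illustration of the first option, the sketch below derives the module name from a hash of the model name and trial ID, so that both jobs compute the same name. It is only a sketch under assumptions: stable_module_name, load_model_module, MODEL_SOURCE, the "model_" prefix and the example model name and trial ID are all hypothetical and are not part of Rafiki.

```python
import hashlib
import pickle
import sys
import types

# Hypothetical stand-in for a model source file that defines its own class.
MODEL_SOURCE = '''
class Vocab:
    def __init__(self, words):
        self.words = words
'''


def stable_module_name(model_name, trial_id):
    """Derive a deterministic module name from model name + trial ID (assumed scheme)."""
    digest = hashlib.sha256(f"{model_name}:{trial_id}".encode("utf-8")).hexdigest()
    return "model_" + digest[:16]


def load_model_module(source, mod_name):
    """Execute `source` as a module registered under `mod_name` (assumed loader)."""
    module = types.ModuleType(mod_name)
    sys.modules[mod_name] = module
    exec(source, module.__dict__)
    return module


# "Train job": load the source under the derived name and pickle an object.
name = stable_module_name("MyModel", "trial-001")
train_mod = load_model_module(MODEL_SOURCE, name)
blob = pickle.dumps(train_mod.Vocab(["a", "b"]))
del sys.modules[name]  # simulate the train process going away

# "Inference job": recompute the same name from the same inputs, then load.
infer_mod = load_model_module(MODEL_SOURCE, stable_module_name("MyModel", "trial-001"))
restored = pickle.loads(blob)

print(restored.words)
```

Because the inference job recomputes the same module name, pickle.loads() resolves the class in the newly loaded module. The second option (remembering the generated name) would behave the same way, with the name stored alongside the trial instead of recomputed.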

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
