singa-dev mailing list archives

From "Ngin Yun Chuan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SINGA-399) Rafiki cannot test rebuilt image
Date Sun, 28 Oct 2018 05:08:00 GMT

    [ https://issues.apache.org/jira/browse/SINGA-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16666289#comment-16666289 ]

Ngin Yun Chuan commented on SINGA-399:
--------------------------------------

Regarding the error in `rafiki-3.PNG`: it indicates that the `predictor` Flask HTTP server
was not yet up when the method `make_predictions(...)` in `examples/scripts/client-usage.py`
ran and sent queries to `127.0.0.1:30000`. From the images, I can think of 2 scenarios:

1) You edited `rafiki.predictor`'s code or its dependencies in a way that causes a runtime
error during the deployment of the inference job. Check `/var/log/rafiki`, which contains all
components' logs, for any error messages.

2) The `predictor` Flask HTTP server took more than 20s (the delay I added in `client-usage.py`
before the script sends queries) to set up and start accepting queries. This is possible because
`wait_until_inference_job_is_running(...)` doesn't yet ensure that the `predictor`'s
HTTP server is up before returning.
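
For the second scenario, a client could poll the predictor's port instead of sleeping for a fixed 20s. Below is a minimal sketch of that idea using only the Python standard library. The helper name `wait_for_port` and its parameters are hypothetical and not part of Rafiki; only the address `127.0.0.1:30000` comes from the description above.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Hypothetical helper: return True once a TCP connection to
    (host, port) succeeds, or False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: nothing is listening yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

In `client-usage.py`, a call such as `wait_for_port("127.0.0.1", 30000, timeout_s=120.0)` could stand in for the fixed delay before `make_predictions(...)`. Note that an open TCP port only shows that something is listening; it does not prove the Flask app is ready to serve queries, so an HTTP-level check would be a stronger test.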

Either way, I will add documentation and modify the code to better account
for both scenarios. Do let me know if you have any suggestions or figure out any sources
of errors.

> Rafiki cannot test rebuilt image
> --------------------------------
>
>                 Key: SINGA-399
>                 URL: https://issues.apache.org/jira/browse/SINGA-399
>             Project: Singa
>          Issue Type: Bug
>            Reporter: Zhu Lei
>            Priority: Major
>         Attachments: rafiki-1.PNG, rafiki-2.PNG, rafiki-3.PNG, rafiki-4.PNG
>
>
> After downloading the newest rafiki code, at commit 7b3b04e15c62233e515c4d82051cd5dfb799215f,
with the comment "Add more error handling to notify user of invalid train job; compact exceptions",
I ran "bash ./scripts/build_images.sh" to build the new admin, advisor, predictor and worker
images. I got the images shown in attached image 'rafiki-1.PNG'. Then I ran "bash ./script/start.sh"
to build the containers, as shown in the attached image 'rafiki-2.PNG'. Finally, when I ran
the client-usage.py example, I got the error in attached image 'rafiki-3.PNG'.
> I also find it very surprising that the admin, advisor, predictor and worker images I
had just built became images built weeks ago, as shown in attached image 'rafiki-4.PNG'.
Could you kindly explain why this happens? I really do not understand
why this happened.
> Finally, when I ran "bash ./script/stop.sh", left the swarm and repeated my previous
procedure, there were no errors. The only difference between the two runs, I
think, is that the images are different. So my speculation is that the current rafiki code
does not support newly built images.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
