From: "Ngin Yun Chuan (JIRA)"
To: dev@singa.incubator.apache.org
Date: Sun, 28 Oct 2018 05:08:00 +0000 (UTC)
Subject: [jira] [Commented] (SINGA-399) Rafiki cannot test rebuilt image

[ https://issues.apache.org/jira/browse/SINGA-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16666289#comment-16666289 ]

Ngin Yun Chuan commented on SINGA-399:
--------------------------------------

Regarding the error in `rafiki-3.PNG`: it indicates that the `predictor` Flask HTTP server was not yet up when `make_predictions(...)` in `examples/scripts/client-usage.py` ran and sent queries to `127.0.0.1:30000`. From the images, I can think of 2 scenarios:

1) You edited `rafiki.predictor`'s code or its dependencies in a way that causes a runtime error during the deployment of the inference job. Check `/var/log/rafiki`, which contains all components' logs, for any error messages.

2) The `predictor` Flask HTTP server took more than 20s (the delay I added in `client-usage.py` before the script sends queries) to set up and start accepting queries. This is possible because `wait_until_inference_job_is_running(...)` does not yet ensure that the `predictor`'s HTTP server is up before returning.

Either way, I will add documentation and modify the code to better account for both scenarios. Do let me know if you have any suggestions or have identified any sources of error.
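For the second scenario, one possible workaround on the client side is to replace the fixed 20s delay with a poll that waits until something is actually accepting TCP connections on the predictor's address. The sketch below is only an illustration using the Python standard library: the helper name `wait_until_port_open` and its parameters are hypothetical and are not part of Rafiki, and an open port is only a proxy for the Flask server being ready to answer queries.

```python
import socket
import time


def wait_until_port_open(host, port, timeout_s=120.0, interval_s=1.0):
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Hypothetical helper, not part of Rafiki. Returns True as soon as a
    connection is accepted, or False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused)
            # while nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In a script like `client-usage.py`, this could be called as `wait_until_port_open("127.0.0.1", 30000)` before sending queries, aborting with a clear message if it returns `False`. That case would then point towards the first scenario and the logs in `/var/log/rafiki`.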
> Rafiki cannot test rebuilt image
> --------------------------------
>
>                 Key: SINGA-399
>                 URL: https://issues.apache.org/jira/browse/SINGA-399
>             Project: Singa
>          Issue Type: Bug
>            Reporter: Zhu Lei
>            Priority: Major
>         Attachments: rafiki-1.PNG, rafiki-2.PNG, rafiki-3.PNG, rafiki-4.PNG
>
>
> After downloading the newest rafiki code, at commit 7b3b04e15c62233e515c4d82051cd5dfb799215f, with comments "Add more error handling to notify user of invalid train job; compact exceptions", I ran "bash ./scripts/build_images.sh" to build the new admin, advisor, predictor and worker images. I got the images shown in the attached image 'rafiki-1.PNG'. Then I ran "bash ./script/start.sh" to build the containers, as shown in the attached image 'rafiki-2.PNG'. Finally, when I ran the client-usage.py example, I got the error in the attached image 'rafiki-3.PNG'.
> I also find it very surprising that the admin, advisor, predictor and worker images I had just built became images built weeks ago, as shown in the attached image 'rafiki-4.PNG'. Could you kindly explain why this happens? I really do not understand it.
> Finally, when I ran "bash ./script/stop.sh", left the swarm and repeated my previous procedure, there were no errors. I think the only difference between the two runs is that the images are different. So my speculation is that the current rafiki code does not support newly built images.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)