Date: Wed, 14 Oct 2015 04:43:05 +0000 (UTC)
From: "Anh Dinh (JIRA)"
To: dev@singa.incubator.apache.org
Subject: [jira] [Updated] (SINGA-11) Start SINGA using Mesos

     [ https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anh Dinh updated SINGA-11:
--------------------------
    Description:
Apache Mesos is a fine-grained cluster management framework which enables resource sharing in the
same cluster. Mesos abstracts out the physical configurations of cluster nodes, and presents
resources to the users in the form of "offers". SINGA uses Mesos for two purposes:
1. To acquire the necessary resources for training the model.
2. To launch the training task and monitor its progress.
To this end, we implement a "SINGA Scheduler" which interacts with the Mesos master. The scheduler
assumes that SINGA has been installed on the Mesos slave nodes. The scheduler is called when the
user wants to start a new SINGA job, and it performs the following steps:
Step 1. Read the job configuration file to determine the necessary resources in terms of CPUs, memory and storage.
Step 2. Wait for resource offers from the Mesos master.
Step 3. Determine if the offers meet the resource requirements.
Step 4. Prepare the task to launch at each slave:
         + Deliver the job configuration file to the slave node.
         + Specify the command to run on the slave: "singa -conf ./job.conf"
Step 5. Launch the task and monitor its progress.
For step 3, we currently implement a simple scheme: the number of CPUs offered by each Mesos slave
must exceed the total number of SINGA workers and SINGA servers per process. In other words, each Mesos
slave must be able to run the entire worker group or server group.
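The step 3 check can be illustrated with a short Python sketch. This is not the scheduler's actual code: the `Offer` class, its field names, and the `usable_offers` helper are hypothetical stand-ins for the real Mesos offer objects, and the sketch assumes that an offer with exactly as many CPUs as required is acceptable, which the description does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Offer:
    """Hypothetical, simplified view of a Mesos resource offer."""
    slave_id: str
    cpus: float


def usable_offers(offers: List[Offer],
                  workers_per_process: int,
                  servers_per_process: int) -> List[Offer]:
    """Keep only offers that can host one whole SINGA process.

    Assumption: one CPU per worker and per server, and an offer with
    exactly the required number of CPUs is accepted.
    """
    required = workers_per_process + servers_per_process
    return [offer for offer in offers if offer.cpus >= required]


# Example with made-up offers: 2 workers + 1 server need 3 CPUs.
offers = [Offer("s1", 4.0), Offer("s2", 2.0), Offer("s3", 3.0)]
accepted = usable_offers(offers, workers_per_process=2, servers_per_process=1)
print([offer.slave_id for offer in accepted])  # prints ['s1', 's3']
```

Filtering offer by offer keeps the scheme simple; it never combines CPUs from several slaves, which matches the requirement that each slave runs an entire group on its own.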
For step 4, we currently rely on HDFS to deliver the configuration file to each slave.
Specifically, we write the file to a known directory (different for each job) on HDFS and ask the
slave to use its Fetcher utility to download the file before executing the task.
We will create a README.md file explaining the steps.

  was:
Mesos helps to manage resources in large clusters.
This ticket is an initial integration of SINGA with Mesos, which aims to simply start SINGA through Mesos and run multiple SINGA tasks in the same cluster.
The full integration should include:
1. start SINGA through Mesos, including requesting processes, memory, CPU, etc.
2. detect failures and recover through Mesos
3. TBD.

> Start SINGA using Mesos
> -----------------------
>
>                 Key: SINGA-11
>                 URL: https://issues.apache.org/jira/browse/SINGA-11
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: Anh Dinh
>
> Apache Mesos is a fine-grained cluster management framework which enables resource sharing in the
> same cluster. Mesos abstracts out the physical configurations of cluster nodes, and presents
> resources to the users in the form of "offers". SINGA uses Mesos for two purposes:
> 1. To acquire the necessary resources for training the model.
> 2. To launch the training task and monitor its progress.
> To this end, we implement a "SINGA Scheduler" which interacts with the Mesos master. The scheduler
> assumes that SINGA has been installed on the Mesos slave nodes. The scheduler is called when the
> user wants to start a new SINGA job, and it performs the following steps:
> Step 1. Read the job configuration file to determine the necessary resources in terms of CPUs, memory and storage.
> Step 2. Wait for resource offers from the Mesos master.
> Step 3. Determine if the offers meet the resource requirements.
> Step 4. Prepare the task to launch at each slave:
>          + Deliver the job configuration file to the slave node.
>          + Specify the command to run on the slave: "singa -conf ./job.conf"
> Step 5. Launch the task and monitor its progress.
> For step 3, we currently implement a simple scheme: the number of CPUs offered by each Mesos slave
> must exceed the total number of SINGA workers and SINGA servers per process. In other words, each Mesos
> slave must be able to run the entire worker group or server group.
> For step 4, we currently rely on HDFS to deliver the configuration file to each slave.
> Specifically, we write the file to a known directory (different for each job) on HDFS and ask the
> slave to use its Fetcher utility to download the file before executing the task.
> We will create a README.md file explaining the steps.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)