From: wangwei@apache.org
To: commits@singa.incubator.apache.org
Subject: svn commit: r1693077 - in /incubator/singa/site/trunk/content/markdown/docs: data.md neuralnet-partition.md
Date: Tue, 28 Jul 2015 12:15:58 -0000
Message-Id: <20150728121559.3DFFEAC0586@hades.apache.org>

Author: wangwei
Date: Tue Jul 28 12:15:58 2015
New Revision: 1693077

URL: http://svn.apache.org/r1693077
Log:
add docs for data preparation from Chonho

Modified:
    incubator/singa/site/trunk/content/markdown/docs/data.md
    incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md

Modified: incubator/singa/site/trunk/content/markdown/docs/data.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/data.md?rev=1693077&r1=1693076&r2=1693077&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/data.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/data.md Tue Jul 28 12:15:58 2015
@@ -1,18 +1,110 @@
 ## Data Preparation

-To submit a training job, users need to convert raw data (e.g., images, text
-documents) into records that can be recognized by SINGA. SINGA uses a DataLayer
-to load these records into memory and uses ParserLayer to parse features (e.g.,
-image pixels and labels) from these records. The records could be organized and
-stored using many different ways, e.g., using a light database, or a file or
-HDFS, as long as there is a corresponding DataLayer that can load the records.
+To submit a training job, users need to convert raw data (e.g., images, text documents) into records that can be recognized by SINGA. SINGA uses a DataLayer
+to load these records into memory and a ParserLayer to parse features (e.g., image pixels and labels) from these records. The records can be organized and
+stored in many different ways, e.g., in a file, a light database, or HDFS, as long as there is a corresponding DataLayer that can load them.

 ### DataShard

+To create shards for your own data, you may need to implement or modify the following files:
+
+- common.proto
+- create_shard.cc
+- Makefile
+
+**1. Define record**
+
+The Record class inherits from the Message class, whose format follows Google Protocol Buffers. Please refer to the [Tutorial][1].
+
+Your record is defined in SINGAfolder/src/proto/common.proto.
+
+(a) Define the record
+
+    message UserRecord {
+        repeated int32 userVAR1 = 1;    // unique field number
+        optional string userVAR2 = 2;   // unique field number
+        ...
+    }
+
+(b) Declare your own record inside Record
+
+    message Record {
+        optional UserRecord user_record = 1;    // unique field number
+        ...
+    }
+
+(c) Compile SINGA
+
+    cd SINGAfolder
+    ./configure
+    make
+
+**2. Create a shard**
+
+(a) Create a folder for your dataset, e.g., "USERDATAfolder".
+
+(b) Put the source files for creating the shard in SINGAfolder/USERDATAfolder/.
+
+- For the RNNLM example, create_shard.cc is in SINGAfolder/examples/rnnlm/.
+
+(c) Create the shard
+
+    singa::DataShard myShard( outputpath, mode );
+
+- `string outputpath`: the path where the user wants to create the shard.
+- `int mode`: one of `kRead`, `kCreate`, or `kAppend`, defined in SINGAfolder/include/utils/data_shard.h.
+
+**3. Store records into the shard**
+
+(a) Create a Record and get a pointer to your own record
+
+    singa::Record record;
+    singa::UserRecord *myRecord = record.mutable_user_record();
+
+The `mutable_user_record()` method is generated automatically when SINGA is compiled in Step 1-(c).
+
+(b) Set/add values in the record
+
+    myRecord->add_userVAR1( int_val );
+    myRecord->set_userVAR2( string_val );
+
+(c) Store the record into the shard
+
+    myShard.Insert( key, myRecord );
+
+- `string key`: a unique id for the record.
+
+**Example of RNNLM**
+
+You can refer to the RNNLM example at SINGAfolder/examples/rnnlm/.
+
+    message SingleWordRecord {
+        optional string word = 1;
+        optional int32 word_index = 2;
+        optional int32 class_index = 3;
+    }
+
+    message Record {
+        optional SingleWordRecord word_record = 4;
+    }
+
+Run
+
+    make download
+
+to download the raw data from https://www.rnnlm.org. In this example, rnnlm-0.4b is used.
+
+Run
+
+    make create
+
+to process the input text file, create records, and store them into shards.
+
+We create three shards for the training data: class_shard, vocab_shard, and word_shard.
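+
+Putting Steps 1-3 together, a minimal create_shard.cc could look like the sketch below. It reuses the placeholder
+names from above (UserRecord, userVAR1, userVAR2); the include paths, the scope of `kCreate`, and the exact
+`Insert()` signature are assumptions that should be checked against include/utils/data_shard.h for your SINGA
+version. Note that protoc lower-cases field names when generating C++ accessors, so `userVAR1` is accessed via
+`add_uservar1()`.
+
+    // create_shard.cc -- illustrative sketch only, not code shipped with SINGA.
+    #include <string>
+
+    #include "proto/common.pb.h"   // generated from common.proto in Step 1-(c); path may differ
+    #include "utils/data_shard.h"  // declares singa::DataShard and kRead/kCreate/kAppend
+
+    int main() {
+      // Open a new shard in create mode under the chosen output path.
+      singa::DataShard myShard("USERDATAfolder/shard_train", singa::DataShard::kCreate);
+
+      singa::Record record;
+      singa::UserRecord* myRecord = record.mutable_user_record();
+      myRecord->add_uservar1(42);          // append a value to the repeated field userVAR1
+      myRecord->set_uservar2("example");   // set the optional string field userVAR2
+
+      std::string key = "record_0";        // unique id for this record
+      // The whole Record message is inserted; check data_shard.h for the exact
+      // argument type that Insert() expects in your SINGA version.
+      myShard.Insert(key, record);
+      return 0;
+    }
+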
 ### LMDB

 ### HDFS
+
+
+  [1]: https://developers.google.com/protocol-buffers/docs/cpptutorial
\ No newline at end of file

Modified: incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md?rev=1693077&r1=1693076&r2=1693077&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/neuralnet-partition.md Tue Jul 28 12:15:58 2015
@@ -22,7 +22,7 @@ The above figure shows a convolutional n
 has 8 layers in total (one rectangular represents one layer). The first layer is DataLayer (data)
 which reads data from local disk files/databases (or HDFS). The second layer is a MnistLayer
 which parses the records from MNIST data to get the pixels of a batch
-of 28 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label
+of 8 images (each image is of size 28x28). The LabelLayer (label) parses the records to get the label
 of each image in the batch. The ConvolutionalLayer (conv1) transforms the input image to the shape of
 8x27x27. The ReLULayer (relu1) conducts elementwise transformations. The PoolingLayer (pool1)
 sub-samples the images. The fc1 layer is fully connected with pool1 layer. It