From: dusenberrymw@gmail.com
Date: Wed, 19 Apr 2017 11:46:15 -0700
Subject: Re: Regarding incubator systemml/breast_cancer project
To: dev@systemml.incubator.apache.org

Hi Aishwarya,

It looks like you've encountered an out-of-memory error on one of the executors, so you'll need to adjust the `spark.executor.memory` and `spark.driver.memory` settings upward. What is your current setup? That is, are you using a cluster of machines or a single machine? We generally use a large driver on one machine, and then a single large executor on each of the other machines. I would give a sizable amount of memory to the driver, and about half of the available memory to each executor so that the Python processes have enough memory as well. PySpark has both JVM and Python components, and the Spark memory settings only pertain to the JVM side, hence the need to reserve about half of each executor machine's memory for the Python side.

Thanks!

- Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.

> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
>
> Hello sir,
>
> We also wanted to ensure that the spark-submit command we're using is the
> correct one for running 'preprocess.py'.
> Command : /home/new/sparks/bin/spark-submit preprocess.py
>
> Thank you.
> Aishwarya Chaurasia.
>
> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com>
> wrote:
>
> Hello sir,
> On running the file preprocess.py we are getting the following error :
>
> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
>
> Can you please help us by looking into the error and kindly tell us the
> solution for it.
> Thanks a lot.
> Aishwarya Chaurasia
>
>> On 19-Apr-2017 12:43 AM, <dusenberrymw@gmail.com> wrote:
>>
>> Hi Aishwarya,
>>
>> Certainly, here is some more detailed information about `preprocess.py`:
>>
>> * The preprocessing Python script is located at
>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
>> Note that this is different from the library module at
>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
>> * This script is used to preprocess a set of histology slide images,
>> which are `.svs` files in our case, and `.tiff` files in your case.
>> * Lines 63-79 contain "settings" such as the output image sizes, folder
>> paths, etc. Of particular interest, line 72 has the folder path for the
>> original slide images, which should be commonly accessible from all machines
>> being used, and lines 74-79 contain the names of the output DataFrames that
>> will be saved.
>> * Line 82 performs the actual preprocessing and creates a Spark
>> DataFrame with the following columns: slide number, tumor score, molecular
>> score, sample. The "sample" in this case is the actual small, chopped-up
>> section of the image that has been extracted and flattened into a row
>> Vector. For test images without labels (`training=false`), only the slide
>> number and sample will be contained in the DataFrame (i.e. no labels).
>> This calls the `preprocess(...)` function located on line 371 of
>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
>> which is a different file.
>> * Line 87 simply saves the above DataFrame to HDFS with the name from
>> line 74.
>> * Line 93 splits the above DataFrame row-wise into separate "training"
>> and "validation" DataFrames, based on the split percentage from line 70
>> (`train_frac`). This is performed so that downstream machine learning
>> tasks can learn from the training set, and validate performance and
>> hyperparameter choices on the validation set. These DataFrames will start
>> with the same columns as the above DataFrame. If `add_row_indices` from
>> line 69 is true, then an additional row-index column (`__INDEX`) will be
>> prepended. This is useful for SystemML in downstream machine learning
>> tasks, as it gives the DataFrame row numbers like a real matrix would have,
>> and SystemML is built to operate on matrices.
>> * Lines 97 & 98 simply save the training and validation DataFrames using
>> the names defined on lines 76 & 78.
>> * Lines 103-137 create smaller train and validation DataFrames by taking
>> small row-wise samples of the full train and validation DataFrames. The
>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
>> sample). This is generally useful for quicker downstream tasks without
>> having to load the larger DataFrames, assuming you have a large amount
>> of data. For us, we have ~7TB of data, so having 1% sampled DataFrames is
>> useful for quicker downstream tests. Once again, the same columns from the
>> larger train and validation DataFrames will be used.
>> * Lines 146 & 147 simply save these sampled train and validation
>> DataFrames.
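[Editor's note: the row-wise split, row indexing, and sampling steps described in the bullets above can be reduced to a short plain-Python sketch. This is a Spark-free illustration with made-up data, column layout, and fractions; the actual project code operates on Spark DataFrames and lives in `preprocess.py`.]

```python
import random

random.seed(42)

# Stand-in rows: (slide_num, sample_vector) pairs, like the preprocessed DataFrame.
rows = [(i, [float(i)] * 3) for i in range(100)]

# Row-wise train/validation split, analogous to `train_frac` on line 70.
train_frac = 0.8
shuffled = rows[:]
random.shuffle(shuffled)
cut = int(len(shuffled) * train_frac)
train, val = shuffled[:cut], shuffled[cut:]

# Prepend a row index (like the `__INDEX` column), starting at 1 so rows
# are numbered the way a real matrix would be.
train_indexed = [(i + 1,) + row for i, row in enumerate(train)]

# Take a small row-wise sample (like `p=0.01` on line 111) for quick
# downstream tests; 5% is used here so the toy result is non-empty.
p = 0.05
sample = random.sample(train_indexed, int(len(train_indexed) * p))
```

The key property mirrored here is that the split and the sample are both row-wise, so every derived dataset keeps the same columns as the full DataFrame, with only `__INDEX` prepended.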
>>
>> As a summary, after running `preprocess.py`, you will be left with the
>> following saved DataFrames in HDFS:
>> * Full DataFrame
>> * Training DataFrame
>> * Validation DataFrame
>> * Sampled training DataFrame
>> * Sampled validation DataFrame
>>
>> As for visualization, you may visualize a "sample" (i.e. a small, chopped-up
>> section of an original image) from a DataFrame by using the
>> `breastcancer.visualization.visualize_sample(...)` function. You will
>> need to do this after creating the DataFrames. Here is a snippet to
>> visualize the first row's sample in a DataFrame, where `df` is one of the
>> DataFrames from above:
>>
>> ```
>> from breastcancer.visualization import visualize_sample
>> visualize_sample(df.first().sample)
>> ```
>>
>> Please let me know if you have any additional questions.
>>
>> Thanks!
>>
>> - Mike
>>
>> --
>>
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: linkedin.com/in/mikedusenberry
>>
>> Sent from my iPhone.
>>
>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
>> aishwarya2612@gmail.com> wrote:
>>>
>>> Hello sir,
>>> Can you please elaborate more on what output we would be getting? We
>>> tried executing the preprocess.py file using spark-submit; it keeps on
>>> adding the tiles to the RDD, and while running the visualisation.py file it
>>> isn't showing any output. Can you please help us out asap, stating the
>>> output we will be getting and the sequence in which to execute the files?
>>> Thank you.
>>>
>>>> On 07-Apr-2017 5:54 AM, <dusenberrymw@gmail.com> wrote:
>>>>
>>>> Hi Aishwarya,
>>>>
>>>> Thanks for sharing more info on the issue!
>>>>
>>>> To facilitate easier usage, I've updated the preprocessing code by pulling
>>>> out most of the logic into a `breastcancer/preprocessing.py` module,
>>>> leaving just the execution in the `Preprocessing.ipynb` notebook.
>>>> There is
>>>> also a `preprocess.py` script with the same contents as the notebook for
>>>> use with `spark-submit`. The choice of the notebook or the script is just
>>>> a matter of convenience, as they both import from the same
>>>> `breastcancer/preprocessing.py` module.
>>>>
>>>> As part of the updates, I've added an explicit SparkSession parameter
>>>> (`spark`) to the `preprocess(...)` function, and updated the body to use
>>>> this SparkSession object rather than the older SparkContext `sc` object.
>>>> Previously, the `preprocess(...)` function accessed the `sc` object that
>>>> was pulled in from the enclosing scope, which would work while all of the
>>>> code was colocated within the notebook, but not if the code was extracted
>>>> and imported. The explicit parameter now allows the code to be
>>>> imported.
>>>>
>>>> Can you please try again with the latest updates? We are currently using
>>>> Spark 2.x with Python 3. If you use the notebook, the pyspark kernel
>>>> should have a `spark` object available that can be supplied to the
>>>> functions (as is done now in the notebook), and if you use the
>>>> `preprocess.py` script with `spark-submit`, the `spark` object will be
>>>> created explicitly by the script.
>>>>
>>>> For a bit of context for others: Aishwarya initially reached out to find
>>>> out if our breast cancer project could be applied to TIFF images, rather
>>>> than the SVS images we are currently using (the answer is "yes", so long as
>>>> they are "generic tiled TIFF" images, according to the OpenSlide
>>>> documentation), and then followed up with Spark issues related to the
>>>> preprocessing code. This conversation has been promptly moved to the
>>>> mailing list so that others in the community can benefit.
>>>>
>>>> Thanks!
>>>>
>>>> -Mike
>>>>
>>>> --
>>>>
>>>> Mike Dusenberry
>>>> GitHub: github.com/dusenberrymw
>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>
>>>> Sent from my iPhone.
>>>>
>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
>> aishwarya2612@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hey,
>>>>>
>>>>> The object sc is already defined in pyspark, and yet this NameError keeps
>>>>> occurring. We are using Spark 2.*.
>>>>>
>>>>> Here is the link to the error that we are getting :
>>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
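[Editor's note: the root cause Mike describes earlier in the thread, `preprocess(...)` capturing `sc` from the enclosing notebook scope, can be reduced to a small Python scoping sketch. The function names and the `ctx` object here are illustrative stand-ins, not the project's actual API; the real fix is the explicit `spark` parameter added to `preprocess(...)`.]

```python
# Stand-in for the SparkSession/SparkContext object. In a real notebook,
# pyspark would create this for you; here it only illustrates scoping.
ctx = object()

# Before the refactor: the function silently captures `ctx` from the
# enclosing scope. This works while everything lives in one notebook, but
# once the function is moved into a module and imported elsewhere, the
# name no longer resolves and a NameError is raised at call time.
def preprocess_implicit(data):
    return ctx, data

# After the refactor: the session is an explicit parameter, so the function
# is self-contained and safe to extract into an importable module.
def preprocess_explicit(data, spark):
    return spark, data

session, result = preprocess_explicit("slides", ctx)
assert session is ctx
```

This is why the updated `preprocess.py` script creates the `spark` object explicitly and passes it in, rather than relying on a globally defined `sc`.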