From: dusenberrymw@gmail.com
Date: Wed, 19 Apr 2017 11:46:15 -0700
Subject: Re: Regarding incubator systemml/breast_cancer project
To: dev@systemml.incubator.apache.org

Hi Aishwarya,

It looks like you've encountered an out-of-memory error on one of the executors, so you'll need to adjust the `spark.executor.memory` and `spark.driver.memory` settings upward. What is your current setup? That is, are you using a cluster of machines or a single machine? We generally use a large driver on one machine, and then a single large executor on each of the other machines. I would give a sizable amount of memory to the driver, and about half of the available memory to each executor so that the Python processes have enough memory as well. PySpark has both JVM and Python components, and the Spark memory settings only pertain to the JVM side, hence the need to reserve about half of each executor machine's memory for the Python side.

Thanks!

- Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.

> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
>
> Hello sir,
>
> We also wanted to ensure that the spark-submit command we're using is the
> correct one for running 'preprocess.py'.
> Command : /home/new/sparks/bin/spark-submit preprocess.py
>
> Thank you.
> Aishwarya Chaurasia.
>
> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com>
> wrote:
>
> Hello sir,
> On running the file preprocess.py we are getting the following error :
>
> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
>
> Can you please help us by looking into the error and kindly tell us the
> solution for it.
> Thanks a lot.
> Aishwarya Chaurasia
>
>> On 19-Apr-2017 12:43 AM, <dusenberrymw@gmail.com> wrote:
>>
>> Hi Aishwarya,
>>
>> Certainly, here is some more detailed information about `preprocess.py`:
>>
>> * The preprocessing Python script is located at
>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
>> Note that this is different from the library module at
>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
>> * This script is used to preprocess a set of histology slide images,
>> which are `.svs` files in our case, and `.tiff` files in your case.
>> * Lines 63-79 contain "settings" such as the output image sizes, folder
>> paths, etc. Of particular interest, line 72 has the folder path for the
>> original slide images, which should be commonly accessible from all machines
>> being used, and lines 74-79 contain the names of the output DataFrames that
>> will be saved.
>> * Line 82 performs the actual preprocessing and creates a Spark
>> DataFrame with the following columns: slide number, tumor score, molecular
>> score, sample. The "sample" in this case is the actual small, chopped-up
>> section of the image that has been extracted and flattened into a row
>> Vector. For test images without labels (`training=false`), only the slide
>> number and sample will be contained in the DataFrame (i.e. no labels).
>> This calls the `preprocess(...)` function located on line 371 of
>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
>> which is a different file.
>> * Line 87 simply saves the above DataFrame to HDFS with the name from
>> line 74.
>> * Line 93 splits the above DataFrame row-wise into separate "training"
>> and "validation" DataFrames, based on the split percentage from line 70
>> (`train_frac`). This is performed so that downstream machine learning
>> tasks can learn from the training set, and validate performance and
>> hyperparameter choices on the validation set. These DataFrames will start
>> with the same columns as the above DataFrame. If `add_row_indices` from
>> line 69 is true, then an additional row-index column (`__INDEX`) will be
>> prepended. This is useful for SystemML in downstream machine learning
>> tasks, as it gives the DataFrame row numbers like a real matrix would have,
>> and SystemML is built to operate on matrices.
>> * Lines 97 & 98 simply save the training and validation DataFrames using
>> the names defined on lines 76 & 78.
>> * Lines 103-137 create smaller train and validation DataFrames by taking
>> small row-wise samples of the full train and validation DataFrames. The
>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
>> sample). This is generally useful for quicker downstream tasks without
>> having to load the larger DataFrames, assuming you have a large amount
>> of data. For us, we have ~7TB of data, so having 1% sampled DataFrames is
>> useful for quicker downstream tests. Once again, the same columns from the
>> larger train and validation DataFrames will be used.
>> * Lines 146 & 147 simply save these sampled train and validation
>> DataFrames.
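[Editor's note: the row-wise split, row indexing, and sampling steps described in the bullets above can be reduced to a short plain-Python sketch. This is a Spark-free illustration with made-up data, column layout, and fractions; the actual project code operates on Spark DataFrames and lives in `preprocess.py`.]

```python
import random

random.seed(42)

# Stand-in rows: (slide_num, sample_vector) pairs, like the preprocessed DataFrame.
rows = [(i, [float(i)] * 3) for i in range(100)]

# Row-wise train/validation split, analogous to `train_frac` on line 70.
train_frac = 0.8
shuffled = rows[:]
random.shuffle(shuffled)
cut = int(len(shuffled) * train_frac)
train, val = shuffled[:cut], shuffled[cut:]

# Prepend a row index (like the `__INDEX` column), starting at 1 so rows
# are numbered the way a real matrix would be.
train_indexed = [(i + 1,) + row for i, row in enumerate(train)]

# Take a small row-wise sample (like `p=0.01` on line 111) for quick
# downstream tests; 5% is used here so the toy result is non-empty.
p = 0.05
sample = random.sample(train_indexed, int(len(train_indexed) * p))
```

The key property mirrored here is that the split and the sample are both row-wise, so every derived dataset keeps the same columns as the full DataFrame, with only `__INDEX` prepended.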
>>
>> As a summary, after running `preprocess.py`, you will be left with the
>> following saved DataFrames in HDFS:
>> * Full DataFrame
>> * Training DataFrame
>> * Validation DataFrame
>> * Sampled training DataFrame
>> * Sampled validation DataFrame
>>
>> As for visualization, you may visualize a "sample" (i.e. a small, chopped-up
>> section of an original image) from a DataFrame by using the
>> `breastcancer.visualization.visualize_sample(...)` function. You will
>> need to do this after creating the DataFrames. Here is a snippet to
>> visualize the first row's sample in a DataFrame, where `df` is one of the
>> DataFrames from above:
>>
>> ```
>> from breastcancer.visualization import visualize_sample
>> visualize_sample(df.first().sample)
>> ```
>>
>> Please let me know if you have any additional questions.
>>
>> Thanks!
>>
>> - Mike
>>
>> --
>>
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: linkedin.com/in/mikedusenberry
>>
>> Sent from my iPhone.
>>
>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
>> aishwarya2612@gmail.com> wrote:
>>>
>>> Hello sir,
>>> Can you please elaborate more on what output we would be getting? We
>>> tried executing the preprocess.py file using spark-submit; it keeps on
>>> adding the tiles to the RDD, and while running the visualisation.py file it
>>> isn't showing any output. Can you please help us out asap, stating the
>>> output we will be getting and the sequence in which to execute the files?
>>> Thank you.
>>>
>>>> On 07-Apr-2017 5:54 AM, <dusenberrymw@gmail.com> wrote:
>>>>
>>>> Hi Aishwarya,
>>>>
>>>> Thanks for sharing more info on the issue!
>>>>
>>>> To facilitate easier usage, I've updated the preprocessing code by pulling
>>>> out most of the logic into a `breastcancer/preprocessing.py` module,
>>>> leaving just the execution in the `Preprocessing.ipynb` notebook.
>>>> There is
>>>> also a `preprocess.py` script with the same contents as the notebook for
>>>> use with `spark-submit`. The choice of the notebook or the script is just
>>>> a matter of convenience, as they both import from the same
>>>> `breastcancer/preprocessing.py` module.
>>>>
>>>> As part of the updates, I've added an explicit SparkSession parameter
>>>> (`spark`) to the `preprocess(...)` function, and updated the body to use
>>>> this SparkSession object rather than the older SparkContext `sc` object.
>>>> Previously, the `preprocess(...)` function accessed the `sc` object that
>>>> was pulled in from the enclosing scope, which would work while all of the
>>>> code was colocated within the notebook, but not if the code was extracted
>>>> and imported. The explicit parameter now allows the code to be
>>>> imported.
>>>>
>>>> Can you please try again with the latest updates? We are currently using
>>>> Spark 2.x with Python 3. If you use the notebook, the pyspark kernel
>>>> should have a `spark` object available that can be supplied to the
>>>> functions (as is done now in the notebook), and if you use the
>>>> `preprocess.py` script with `spark-submit`, the `spark` object will be
>>>> created explicitly by the script.
>>>>
>>>> For a bit of context for others: Aishwarya initially reached out to find
>>>> out if our breast cancer project could be applied to TIFF images, rather
>>>> than the SVS images we are currently using (the answer is "yes", so long as
>>>> they are "generic tiled TIFF" images, according to the OpenSlide
>>>> documentation), and then followed up with Spark issues related to the
>>>> preprocessing code. This conversation has been promptly moved to the
>>>> mailing list so that others in the community can benefit.
>>>>
>>>> Thanks!
>>>>
>>>> -Mike
>>>>
>>>> --
>>>>
>>>> Mike Dusenberry
>>>> GitHub: github.com/dusenberrymw
>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>
>>>> Sent from my iPhone.
>>>>
>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
>> aishwarya2612@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hey,
>>>>>
>>>>> The object sc is already defined in pyspark, and yet this NameError keeps
>>>>> occurring. We are using Spark 2.*.
>>>>>
>>>>> Here is the link to the error that we are getting :
>>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
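[Editor's note: the root cause Mike describes earlier in the thread, `preprocess(...)` capturing `sc` from the enclosing notebook scope, can be reduced to a small Python scoping sketch. The function names and the `ctx` object here are illustrative stand-ins, not the project's actual API; the real fix is the explicit `spark` parameter added to `preprocess(...)`.]

```python
# Stand-in for the SparkSession/SparkContext object. In a real notebook,
# pyspark would create this for you; here it only illustrates scoping.
ctx = object()

# Before the refactor: the function silently captures `ctx` from the
# enclosing scope. This works while everything lives in one notebook, but
# once the function is moved into a module and imported elsewhere, the
# name no longer resolves and a NameError is raised at call time.
def preprocess_implicit(data):
    return ctx, data

# After the refactor: the session is an explicit parameter, so the function
# is self-contained and safe to extract into an importable module.
def preprocess_explicit(data, spark):
    return spark, data

session, result = preprocess_explicit("slides", ctx)
assert session is ctx
```

This is why the updated `preprocess.py` script creates the `spark` object explicitly and passes it in, rather than relying on a globally defined `sc`.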