systemml-dev mailing list archives

From dusenberr...@gmail.com
Subject Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project
Date Mon, 24 Apr 2017 21:20:03 GMT
Hi Aishwarya,

For the error message, that just means that the SystemML jar isn't being found.  Can you add
a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar` to the invocation of Jupyter? 
I.e. `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark  --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path $SYSTEMML_HOME/target/SystemML.jar`.
There was a PySpark bug that was supposed to have been fixed in Spark 2.x, but it's possible
that it is still an issue.

As for the output, the notebook will create SystemML `Matrix` objects for all of the weights
and biases of the trained models.  To save them, convert each one to a DataFrame (i.e. `Wc1.toDF()`,
repeated for each matrix), and then simply save the DataFrames.  This can be done in one
step for a SystemML `Matrix` object `Wc1`: `Wc1.toDF().write.save("path/to/save/Wc1.parquet",
format="parquet")`.  Just repeat for each matrix returned by the "Train" code for the algorithms.
At that point, you will have a set of saved DataFrames representing a trained SystemML model,
and these can be used in downstream classification tasks in a similar manner to the "Eval"
sections.
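To keep that repetition tidy, one pattern is a small loop over a dict of name -> Matrix pairs.  This is just a sketch: the helper `save_matrices` and the names `Wc1`/`bc1` are illustrative, not part of the notebook; only the `m.toDF().write.save(...)` call is the actual API described above.

```python
import posixpath

def save_matrices(matrices, base_dir, fmt="parquet"):
    """Convert each SystemML Matrix in `matrices` (name -> Matrix) to a
    Spark DataFrame and save it under `base_dir`.  Returns the paths written."""
    paths = {}
    for name, m in matrices.items():
        path = posixpath.join(base_dir, name + "." + fmt)
        # Same conversion + save call as described above, once per matrix.
        m.toDF().write.save(path, format=fmt)
        paths[name] = path
    return paths

# e.g., with the matrices returned by the "Train" code:
# save_matrices({"Wc1": Wc1, "bc1": bc1}, "hdfs:///models/convnet")
```

The saved Parquet files can later be read back with `spark.read.parquet(path)` for the downstream "Eval"-style tasks.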

-Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
> 
> Furthermore:
> What is the output of MachineLearning.ipynb you're obtaining, sir?
> We are actually nearing our deadline for our problem.
> Thanks a lot.
> 
> On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com>
> wrote:
> 
> Hello sir,
> 
> Thanks a lot for replying, sir. But unfortunately it did not work. Although
> the NameError did not appear this time, another error came up:
> 
> https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V5M1UNdIGYhyRLivL9gydE=
> 
> This error was obtained after executing the second block of code of
> MachineLearning.py in the terminal ( ml = MLContext(sc) ).
> 
> We have installed the bleeding-edge version of systemml only and the
> installation was done correctly. We are in a fix now. :/
> Kindly look into the matter asap
> 
> On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <dusenberrymw@gmail.com> wrote:
> 
> Hi Aishwarya,
> 
> Glad to hear that the preprocessing stage was successful!  As for the
> `MachineLearning.ipynb` notebook, here is a general guide:
> 
> 
>   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
>   training and validation DataFrames from the preprocessing step, (2)
>   converts them to normalized & one-hot encoded SystemML matrices for
>   consumption by the ML algorithms, and (3) explores training a couple of
>   models.
>   - To run, you'll need to start Jupyter in the context of PySpark via
>   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
>   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars
>   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
>   SystemML with pip from PyPI (`pip3 install systemml`), this will install
>   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar` will
>   not be necessary.  If you instead have installed a bleeding-edge version of
>   SystemML locally (git clone locally, maven build, `pip3 install -e
>   src/main/python` as listed in `projects/breast_cancer/README.md`), the
>   `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We are
>   about to release 0.14, and for this project, I *would* recommend using a
>   bleeding edge install.
>   - Once Jupyter has been started in the context of PySpark, the `sc`
>   SparkContext object should be available.  Please let me know if you
>   continue to see this issue.
>   - The "Read in train & val data" section simply reads in the training
>   and validation data generated in the preprocessing stage.  Be sure that the
>   `size` setting is the same as the preprocessing size.  The percentage `p`
>   setting determines whether the full or sampled DataFrames are loaded.  If
>   you set `p = 1`, the full DataFrames will be used.  If you instead would
>   prefer to use the smaller sampled DataFrames while getting started, please
>   set it to the same value as used in the preprocessing to generate the
>   smaller sampled DataFrames.
>   - The `Extract X & Y matrices` section splits each of the train and
>   validation DataFrames into effectively X & Y matrices (still as DataFrame
>   types), with X containing the images, and Y containing the labels.
>   - The `Convert to SystemML Matrices` section passes the X & Y DataFrames
>   into a SystemML script that performs some normalization of the images &
>   one-hot encoding of the labels, and then returns SystemML `Matrix` types.
>   These are now ready to be passed into the subsequent algorithms.
>   - The "Trigger Caching" and "Save Matrices" sections are experimental
>   features, and are not necessary to execute.
>   - Next comes the two algorithms being explored in this notebook.  The
>   "Softmax Classifier" is just a multi-class logistic regression model, and
>   is simply there to serve as a baseline comparison with the subsequent
>   convolutional neural net model.  You may wish to simply skip this softmax
>   model and move to the latter convnet model further down in the notebook.
>   - The actual softmax model is located at
>   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml,
>   and the notebook calls functions from that file.
>   - The softmax sanity check just ensures that the model is able to
>   completely overfit when given a tiny sample size.  This should yield ~100%
>   training accuracy if the sample size in this section is small enough.  This
>   is just a check to ensure that nothing else is wrong with the math or the
>   data.
>   - The softmax "Train" section will train a softmax model and return the
>   weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects.
>   Please adjust the hyperparameters in this section to your problem.
>   - The softmax "Eval" section takes the trained weights and biases and
>   evaluates the training and validation performance.
>   - The next model is a LeNet-like convnet model.  The actual model is
>   located at
>   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml,
>   and the notebook simply calls functions from that file.
>   - Once again, there is an initial sanity check for the ability to
>   overfit on a small amount of data.
>   - The "Hyperparameter Search" contains a script to sample different
>   hyperparams for the convnet, and save the hyperparams + validation accuracy
>   of each set after a single epoch of training.  These string files will be
>   saved to HDFS.  Please feel free to adjust the range of the hyperparameters
>   for your problem.  Please also feel free to try using the `parfor`
>   (parallel for-loop) instead of the while loop to speed up this section.
>   Note that this is still a work in progress.  The hyperparameter tuning in
>   this section makes use of random search (as opposed to grid search), which
>   has been promoted by Bengio et al. to speed up the search time.
>   - The "Train" section trains the convnet and returns the weights and
>   biases as SystemML `Matrix` types.  In this section, please replace the
>   hyperparameters with the best ones from above, and please increase the
>   number of epochs given your time constraints.
>   - The "Eval" section evaluates the performance of the trained convnet.
>   - Although it is not shown in the notebook yet, to save the weights and
>   biases, please use the `toDF()` method on each weight and bias matrix (i.e.
>   `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save the
>   DataFrame as desired.
>   - Finally, please feel free to extend the model in `convnet.dml` for
>   your particular problem!  The LeNet-like model just serves as a simple
>   convnet, but there are much richer models currently, such as resnets, that
>   we are experimenting with.  To make larger models such as resnets easier to
>   define, we are also working on other tools for converting model definitions
>   + pretrained weights from other systems into SystemML.
> 
> 
> Also, please keep in mind that the deep learning support in SystemML is
> still a work in progress.  Therefore, if you run into issues, please let us
> know and we'll do everything possible to help get things running!
> 
> 
> Thanks!
> 
> - Mike
> 
> 
> --
> 
> Michael W. Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
> 
> On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
> aishwarya2612@gmail.com> wrote:
> 
>> Hey,
>> 
>> Thank you so much for your help sir. We were finally able to run
>> preprocess.py without any errors. And the results obtained were
>> satisfactory, i.e. we got the five sets of DataFrames like you said we would.
>> 
>> But alas! when we tried to run MachineLearning.ipynb the same NameError
>> came: https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>> 
>> Could you guide us again as to how to proceed now?
>> Also, could you please provide an overview of the process
>> MachineLearning.ipynb is following to train the samples.
>> 
>> Thanks a lot!
>> 
>>> On 20-Apr-2017 12:16 AM, <dusenberrymw@gmail.com> wrote:
>>> 
>>> Hi Aishwarya,
>>> 
>>> Looks like you've just encountered an out of memory error on one of the
>>> executors.  Therefore, you just need to adjust the `spark.executor.memory`
>>> and `spark.driver.memory` settings with higher amounts of RAM.  What is
>>> your current setup?  I.e. are you using a cluster of machines, or a single
>>> machine?  We generally use a large driver on one machine, and then a single
>>> large executor on each other machine.  I would give a sizable amount of
>>> memory to the driver, and about half the possible memory on the executors
>>> so that the Python processes have enough memory as well.  PySpark has JVM
>>> and Python components, and the Spark memory settings only pertain to the
>>> JVM side, thus the need to save about half the executor memory for the
>>> Python side.
>>> 
>>> Thanks!
>>> 
>>> - Mike
>>> 
>>> --
>>> 
>>> Mike Dusenberry
>>> GitHub: github.com/dusenberrymw
>>> LinkedIn: linkedin.com/in/mikedusenberry
>>> 
>>> Sent from my iPhone.
>>> 
>>> 
>>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
>>> aishwarya2612@gmail.com> wrote:
>>>> 
>>>> Hello sir,
>>>> 
>>>> We also wanted to ensure that the spark-submit command we're using is the
>>>> correct one for running 'preprocess.py'.
>>>> Command :  /home/new/sparks/bin/spark-submit preprocess.py
>>>> 
>>>> 
>>>> Thank you.
>>>> Aishwarya Chaurasia.
>>>> 
>>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com
>>> 
>>>> wrote:
>>>> 
>>>> Hello sir,
>>>> On running the file preprocess.py we are getting the following error :
>>>> 
>>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
>>>> 
>>>> Can you please help us by looking into the error and kindly tell us the
>>>> solution for it.
>>>> Thanks a lot.
>>>> Aishwarya Chaurasia
>>>> 
>>>> 
>>>>> On 19-Apr-2017 12:43 AM, <dusenberrymw@gmail.com> wrote:
>>>>> 
>>>>> Hi Aishwarya,
>>>>> 
>>>>> Certainly, here is some more detailed information about `preprocess.py`:
>>>>> 
>>>>> * The preprocessing Python script is located at
>>>>> https://github.com/apache/incubator-systemml/blob/master/
>>>>> projects/breast_cancer/preprocess.py.  Note that this is different
>> than
>>>>> the library module at https://github.com/apache/incu
>>>>> bator-systemml/blob/master/projects/breast_cancer/breastc
>>>>> ancer/preprocessing.py.
>>>>> * This script is used to preprocess a set of histology slide images,
>>>>> which are `.svs` files in our case, and `.tiff` files in your case.
>>>>> * Lines 63-79 contain "settings" such as the output image sizes, folder
>>>>> paths, etc.  Of particular interest, line 72 has the folder path for the
>>>>> original slide images that should be commonly accessible from all
>>>>> machines being used, and lines 74-79 contain the names of the output
>>>>> DataFrames that will be saved.
>>>>> * Line 82 performs the actual preprocessing and creates a Spark
>>>>> DataFrame with the following columns: slide number, tumor score,
>>>>> molecular score, sample.  The "sample" in this case is the actual small,
>>>>> chopped-up section of the image that has been extracted and flattened
>>>>> into a row Vector.  For test images without labels (`training=false`),
>>>>> only the slide number and sample will be contained in the DataFrame
>>>>> (i.e. no labels).  This calls the `preprocess(...)` function located on
>>>>> line 371 of
>>>>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
>>>>> which is a different file.
>>>>> * Line 87 simply saves the above DataFrame to HDFS with the name from
>>>>> line 74.
>>>>> * Line 93 splits the above DataFrame row-wise into separate "training"
>>>>> and "validation" DataFrames, based on the split percentage from line 70
>>>>> (`train_frac`).  This is performed so that downstream machine learning
>>>>> tasks can learn from the training set, and validate performance and
>>>>> hyperparameter choices on the validation set.  These DataFrames will
>>>>> start with the same columns as the above DataFrame.  If
>>>>> `add_row_indices` from line 69 is true, then an additional row index
>>>>> column (`__INDEX`) will be prepended.  This is useful for SystemML in
>>>>> downstream machine learning tasks as it gives the DataFrame row numbers
>>>>> like a real matrix would have, and SystemML is built to operate on
>>>>> matrices.
>>>>> * Lines 97 & 98 simply save the training and validation DataFrames
>>>>> using the names defined on lines 76 & 78.
>>>>> * Lines 103-137 create smaller train and validation DataFrames by
>>>>> taking small row-wise samples of the full train and validation
>>>>> DataFrames.  The percentage of the sample is defined on line 111
>>>>> (`p=0.01` for a 1% sample).  This is generally useful for quicker
>>>>> downstream tasks without having to load in the larger DataFrames,
>>>>> assuming you have a large amount of data.  For us, we have ~7TB of data,
>>>>> so having 1% sampled DataFrames is useful for quicker downstream tests.
>>>>> Once again, the same columns from the larger train and validation
>>>>> DataFrames will be used.
>>>>> * Lines 146 & 147 simply save these sampled train and validation
>>>>> DataFrames.
>>>>> 
>>>>> As a summary, after running `preprocess.py`, you will be left with the
>>>>> following saved DataFrames in HDFS:
>>>>> * Full DataFrame
>>>>> * Training DataFrame
>>>>> * Validation DataFrame
>>>>> * Sampled training DataFrame
>>>>> * Sampled validation DataFrame
>>>>> 
>>>>> As for visualization, you may visualize a "sample" (i.e. small,
>>>>> chopped-up section of original image) from a DataFrame by using the
>>>>> `breastcancer.visualization.visualize_sample(...)` function.  You will
>>>>> need to do this after creating the DataFrames.  Here is a snippet to
>>>>> visualize the first row sample in a DataFrame, where `df` is one of the
>>>>> DataFrames from above:
>>>>> 
>>>>> ```
>>>>> from breastcancer.visualization import visualize_sample
>>>>> visualize_sample(df.first().sample)
>>>>> ```
>>>>> 
>>>>> Please let me know if you have any additional questions.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> - Mike
>>>>> 
>>>>> --
>>>>> 
>>>>> Mike Dusenberry
>>>>> GitHub: github.com/dusenberrymw
>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>> 
>>>>> Sent from my iPhone.
>>>>> 
>>>>> 
>>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
>>>>> aishwarya2612@gmail.com> wrote:
>>>>>> 
>>>>>> Hello sir,
>>>>>> Can you please elaborate more on what output we would be getting?
>>>>>> Because when we tried executing the preprocess.py file using
>>>>>> spark-submit, it keeps on adding the tiles to the RDD, and while
>>>>>> running the visualisation.py file it isn't showing any output. Can you
>>>>>> please help us out asap, stating the output we will be getting and the
>>>>>> sequence of execution of files.
>>>>>> Thank you.
>>>>>> 
>>>>>>> On 07-Apr-2017 5:54 AM, <dusenberrymw@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Aishwarya,
>>>>>>> 
>>>>>>> Thanks for sharing more info on the issue!
>>>>>>> 
>>>>>>> To facilitate easier usage, I've updated the preprocessing code by
>>>>>>> pulling out most of the logic into a `breastcancer/preprocessing.py`
>>>>>>> module, leaving just the execution in the `Preprocessing.ipynb`
>>>>>>> notebook.  There is also a `preprocess.py` script with the same
>>>>>>> contents as the notebook for use with `spark-submit`.  The choice of
>>>>>>> the notebook or the script is just a matter of convenience, as they
>>>>>>> both import from the same `breastcancer/preprocessing.py` package.
>>>>>>> 
>>>>>>> As part of the updates, I've added an explicit SparkSession parameter
>>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body to
>>>>>>> use this SparkSession object rather than the older SparkContext `sc`
>>>>>>> object.  Previously, the `preprocess(...)` function accessed the `sc`
>>>>>>> object that was pulled in from the enclosing scope, which would work
>>>>>>> while all of the code was colocated within the notebook, but not if
>>>>>>> the code was extracted and imported.  The explicit parameter now
>>>>>>> allows for the code to be imported.
>>>>>>> 
>>>>>>> Can you please try again with the latest updates?  We are currently
>>>>>>> using Spark 2.x with Python 3.  If you use the notebook, the pyspark
>>>>>>> kernel should have a `spark` object available that can be supplied to
>>>>>>> the functions (as is done now in the notebook), and if you use the
>>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object will be
>>>>>>> created explicitly by the script.
>>>>>>> 
>>>>>>> For a bit of context to others, Aishwarya initially reached out to
>>>>>>> find out if our breast cancer project could be applied to TIFF images,
>>>>>>> rather than the SVS images we are currently using (the answer is "yes"
>>>>>>> so long as they are "generic tiled TIFF images", according to the
>>>>>>> OpenSlide documentation), and then followed up with Spark issues
>>>>>>> related to the preprocessing code.  This conversation has been
>>>>>>> promptly moved to the mailing list so that others in the community can
>>>>>>> benefit.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> -Mike
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Mike Dusenberry
>>>>>>> GitHub: github.com/dusenberrymw
>>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>>>> 
>>>>>>> Sent from my iPhone.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
>>>>> aishwarya2612@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey,
>>>>>>>> 
>>>>>>>> The object sc is already defined in pyspark and yet this name error
>>>>>>>> keeps occurring. We are using spark 2.*
>>>>>>>> 
>>>>>>>> Here is the link to the error that we are getting:
>>>>>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
>>>>>>> 
>>>>> 
>>> 
>> 
