systemml-dev mailing list archives

From Aishwarya Chaurasia <aishwarya2...@gmail.com>
Subject Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project
Date Tue, 25 Apr 2017 10:30:50 GMT
Hello sir,

The NameError is occurring again, sir. Why does it keep resurfacing?

Attaching the screenshot of the error.

On 25-Apr-2017 2:50 AM, <dusenberrymw@gmail.com> wrote:

> Hi Aishwarya,
>
> For the error message, that just means that the SystemML jar isn't being
> found.  Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar`
> to the invocation of Jupyter?  I.e. `PYSPARK_PYTHON=python3
> PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> pyspark  --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path
> $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was
> supposed to have been fixed in Spark 2.x, but it's possible that it is
> still an issue.
>
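> As a quick hedged check once the notebook is up, you can verify that the jar
> was actually picked up (`sc` comes from the pyspark kernel; a `None` result
> means the corresponding option wasn't applied):
>
> ```
> # `--driver-class-path` maps to spark.driver.extraClassPath, and
> # `--jars` maps to spark.jars.
> print(sc.getConf().get("spark.driver.extraClassPath"))
> print(sc.getConf().get("spark.jars"))
> ```
>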
> As for the output, the notebook will create SystemML `Matrix` objects for
> all of the weights and biases of the trained models.  To save, please
> convert each one to a DataFrame, i.e. `Wc1.toDF()`, and repeat for each
> matrix, and then simply save the DataFrames.  This could be done all at
> once like this for a SystemML Matrix object `Wc1`:
> `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`.
> Just repeat for each matrix returned by the "Train" code for the
> algorithms.  At that point, you will have a set of saved DataFrames
> representing a trained SystemML model, and these can be used in downstream
> classification tasks in a similar manner to the "Eval" sections.
>
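> As a hedged sketch of that save loop (the four matrix names below are
> illustrative; use whichever matrices your "Train" code returned):
>
> ```
> # Convert each trained SystemML Matrix to a Spark DataFrame and save it
> # as parquet.  The names here are placeholders for the trained parameters.
> weights = {"Wc1": Wc1, "bc1": bc1, "Wc2": Wc2, "bc2": bc2}
> for name, matrix in weights.items():
>     matrix.toDF().write.save("path/to/save/%s.parquet" % name,
>                              format="parquet")
> ```
>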
> -Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com>
> > wrote:
> >
> > Furthermore: what output are you obtaining from MachineLearning.ipynb, sir?
> > We are nearing the deadline for our problem.
> > Thanks a lot.
> >
> > On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com>
> > wrote:
> >
> > Hello sir,
> >
> > Thanks a lot for replying, sir. Unfortunately, it did not work. Although
> > the NameError did not appear this time, another error came about:
> >
> > https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V5M1UNdIGYhyRLivL9gydE=
> >
> > This error was obtained after executing the second block of code of
> > MachineLearning.py in the terminal (`ml = MLContext(sc)`).
> >
> > We have installed the bleeding-edge version of SystemML, and the
> > installation was done correctly. We are in a fix now. :/
> > Kindly look into the matter ASAP.
> >
> > On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <dusenberrymw@gmail.com>
> > wrote:
> >
> > Hi Aishwarya,
> >
> > Glad to hear that the preprocessing stage was successful!  As for the
> > `MachineLearning.ipynb` notebook, here is a general guide:
> >
> >
> >   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
> >   training and validation DataFrames from the preprocessing step, (2)
> >   converts them to normalized & one-hot encoded SystemML matrices for
> >   consumption by the ML algorithms, and (3) explores training a couple of
> >   models.
> >   - To run, you'll need to start Jupyter in the context of PySpark via
> >   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
> >   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars
> >   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
> >   SystemML with pip from PyPI (`pip3 install systemml`), this will install
> >   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
> >   will not be necessary.  If you instead have installed a bleeding-edge
> >   version of SystemML locally (git clone locally, maven build, `pip3
> >   install -e src/main/python` as listed in
> >   `projects/breast_cancer/README.md`), the `--jars
> >   $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We are about
> >   to release 0.14, and for this project, I *would* recommend using a
> >   bleeding-edge install.
> >   - Once Jupyter has been started in the context of PySpark, the `sc`
> >   SparkContext object should be available.  Please let me know if you
> >   continue to see this issue.
> >   - The "Read in train & val data" section simply reads in the training
> >   and validation data generated in the preprocessing stage.  Be sure that
> >   the `size` setting is the same as the preprocessing size.  The
> >   percentage `p` setting determines whether the full or sampled DataFrames
> >   are loaded.  If you set `p = 1`, the full DataFrames will be used.  If
> >   you instead would prefer to use the smaller sampled DataFrames while
> >   getting started, please set it to the same value as used in the
> >   preprocessing to generate the smaller sampled DataFrames.
> >   - The `Extract X & Y matrices` section splits each of the train and
> >   validation DataFrames into effectively X & Y matrices (still as
> >   DataFrame types), with X containing the images, and Y containing the
> >   labels.
> >   - The `Convert to SystemML Matrices` section passes the X & Y DataFrames
> >   into a SystemML script that performs some normalization of the images &
> >   one-hot encoding of the labels, and then returns SystemML `Matrix`
> >   types.  These are now ready to be passed into the subsequent algorithms
> >   (see the `MLContext` sketch after this list).
> >   - The "Trigger Caching" and "Save Matrices" sections are experimental
> >   features, and not necessary to execute.
> >   - Next come the two algorithms being explored in this notebook.  The
> >   "Softmax Classifier" is just a multi-class logistic regression model,
> >   and is simply there to serve as a baseline comparison with the
> >   subsequent convolutional neural net model.  You may wish to simply skip
> >   this softmax model and move to the latter convnet model further down in
> >   the notebook.
> >   - The actual softmax model is located at
> >   [https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml],
> >   and the notebook calls functions from that file.
> >   - The softmax sanity check just ensures that the model is able to
> >   completely overfit when given a tiny sample size.  This should yield
> >   ~100% training accuracy if the sample size in this section is small
> >   enough.  This is just a check to ensure that nothing else is wrong with
> >   the math or the data.
> >   - The softmax "Train" section will train a softmax model and return the
> >   weights (`W`) and biases (`b`) of the model as SystemML `Matrix`
> >   objects.  Please adjust the hyperparameters in this section to your
> >   problem.
> >   - The softmax "Eval" section takes the trained weights and biases and
> >   evaluates the training and validation performance.
> >   - The next model is a LeNet-like convnet model.  The actual model is
> >   located at
> >   [https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml],
> >   and the notebook simply calls functions from that file.
> >   - Once again, there is an initial sanity check for the ability to
> >   overfit on a small amount of data.
> >   - The "Hyperparameter Search" section contains a script to sample
> >   different hyperparams for the convnet, and to save the hyperparams +
> >   validation accuracy of each set after a single epoch of training.
> >   These string files will be saved to HDFS.  Please feel free to adjust
> >   the range of the hyperparameters for your problem.  Please also feel
> >   free to try using `parfor` (a parallel for-loop) instead of the while
> >   loop to speed up this section.  Note that this is still a work in
> >   progress.  The hyperparameter tuning in this section makes use of
> >   random search (as opposed to grid search), which has been promoted by
> >   Bengio et al. to speed up the search time.
> >   - The "Train" section trains the convnet and returns the weights and
> >   biases as SystemML `Matrix` types.  In this section, please replace the
> >   hyperparameters with the best ones from above, and please increase the
> >   number of epochs given your time constraints.
> >   - The "Eval" section evaluates the performance of the trained convnet.
> >   - Although it is not shown in the notebook yet, to save the weights and
> >   biases, please use the `toDF()` method on each weight and bias matrix
> >   (i.e. `Wc1.toDF()`) to convert it to a Spark DataFrame, and then simply
> >   save the DataFrame as desired.
> >   - Finally, please feel free to extend the model in `convnet.dml` for
> >   your particular problem!  The LeNet-like model just serves as a simple
> >   convnet, but there are much richer models currently, such as resnets,
> >   that we are experimenting with.  To make larger models such as resnets
> >   easier to define, we are also working on other tools for converting
> >   model definitions + pretrained weights from other systems into SystemML.
> >
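> > As a hedged sketch of the `MLContext` pattern the notebook uses throughout
> > (the one-line DML and the variable names are illustrative, not the
> > notebook's exact code; `X_df` stands for one of the DataFrames above):
> >
> > ```
> > from systemml import MLContext, dml
> >
> > ml = MLContext(sc)  # `sc` is provided by the pyspark kernel
> >
> > # Bind an input DataFrame, declare an output, execute, and fetch the
> > # result as a SystemML Matrix.
> > script = dml("Y = X * 2").input("X", X_df).output("Y")
> > Y = ml.execute(script).get("Y")
> > Y_df = Y.toDF()  # convert back to a Spark DataFrame when needed
> > ```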
> >
> > Also, please keep in mind that the deep learning support in SystemML is
> > still a work in progress.  Therefore, if you run into issues, please let
> > us know and we'll do everything possible to help get things running!
> >
> >
> > Thanks!
> >
> > - Mike
> >
> >
> > --
> >
> > Michael W. Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
> > aishwarya2612@gmail.com> wrote:
> >
> >> Hey,
> >>
> >> Thank you so much for your help, sir. We were finally able to run
> >> preprocess.py without any errors, and the results obtained were
> >> satisfactory, i.e., we got five sets of DataFrames, as you said we would.
> >>
> >> But alas! When we tried to run MachineLearning.ipynb, the same NameError
> >> appeared: https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
> >>
> >> Could you guide us again as to how to proceed now?
> >> Also, could you please provide an overview of the process
> >> MachineLearning.ipynb follows to train the samples?
> >>
> >> Thanks a lot!
> >>
> >>> On 20-Apr-2017 12:16 AM, <dusenberrymw@gmail.com> wrote:
> >>>
> >>> Hi Aishwarya,
> >>>
> >>> Looks like you've just encountered an out-of-memory error on one of the
> >>> executors.  Therefore, you just need to adjust the `spark.executor.memory`
> >>> and `spark.driver.memory` settings with higher amounts of RAM.  What is
> >>> your current setup?  I.e., are you using a cluster of machines, or a
> >>> single machine?  We generally use a large driver on one machine, and then
> >>> a single large executor on each other machine.  I would give a sizable
> >>> amount of memory to the driver, and about half the possible memory on the
> >>> executors so that the Python processes have enough memory as well.
> >>> PySpark has JVM and Python components, and the Spark memory settings only
> >>> pertain to the JVM side, thus the need to save about half the executor
> >>> memory for the Python side.
> >>>
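> >>> As a minimal sketch (the values are purely illustrative, and since
> >>> `spark.driver.memory` must be in place before the driver JVM starts, it
> >>> is normally passed on the command line, e.g. `spark-submit
> >>> --driver-memory 20g --executor-memory 50g preprocess.py`, or set in
> >>> spark-defaults.conf, rather than from a running session):
> >>>
> >>> ```
> >>> from pyspark.sql import SparkSession
> >>>
> >>> # Illustrative values -- leave roughly half of each executor machine's
> >>> # RAM unassigned so the Python workers have room.
> >>> spark = (SparkSession.builder
> >>>          .config("spark.driver.memory", "20g")
> >>>          .config("spark.executor.memory", "50g")
> >>>          .getOrCreate())
> >>> ```
> >>>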
> >>> Thanks!
> >>>
> >>> - Mike
> >>>
> >>> --
> >>>
> >>> Mike Dusenberry
> >>> GitHub: github.com/dusenberrymw
> >>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>
> >>> Sent from my iPhone.
> >>>
> >>>
> >>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hello sir,
> >>>>
> >>>> We also wanted to ensure that the spark-submit command we're using is
> >>>> the correct one for running 'preprocess.py'.
> >>>> Command: /home/new/sparks/bin/spark-submit preprocess.py
> >>>>
> >>>>
> >>>> Thank you.
> >>>> Aishwarya Chaurasia.
> >>>>
> >>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hello sir,
> >>>> On running the file preprocess.py, we are getting the following error:
> >>>>
> >>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
> >>>>
> >>>> Can you please help us by looking into the error and kindly tell us
> >>>> the solution for it?
> >>>> Thanks a lot.
> >>>> Aishwarya Chaurasia
> >>>>
> >>>>
> >>>>> On 19-Apr-2017 12:43 AM, <dusenberrymw@gmail.com> wrote:
> >>>>>
> >>>>> Hi Aishwarya,
> >>>>>
> >>>>> Certainly, here is some more detailed information about
> >>>>> `preprocess.py`:
> >>>>>
> >>>>> * The preprocessing Python script is located at
> >>>>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
> >>>>> Note that this is different than the library module at
> >>>>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
> >>>>> * This script is used to preprocess a set of histology slide images,
> >>>>> which are `.svs` files in our case, and `.tiff` files in your case.
> >>>>> * Lines 63-79 contain "settings" such as the output image sizes,
> >>>>> folder paths, etc.  Of particular interest, line 72 has the folder
> >>>>> path for the original slide images that should be commonly accessible
> >>>>> from all machines being used, and lines 74-79 contain the names of the
> >>>>> output DataFrames that will be saved.
> >>>>> * Line 82 performs the actual preprocessing and creates a Spark
> >>>>> DataFrame with the following columns: slide number, tumor score,
> >>>>> molecular score, sample.  The "sample" in this case is the actual
> >>>>> small, chopped-up section of the image that has been extracted and
> >>>>> flattened into a row Vector.  For test images without labels
> >>>>> (`training=false`), only the slide number and sample will be contained
> >>>>> in the DataFrame (i.e. no labels).  This calls the `preprocess(...)`
> >>>>> function located on line 371 of
> >>>>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
> >>>>> which is a different file.
> >>>>> * Line 87 simply saves the above DataFrame to HDFS with the name from
> >>>>> line 74.
> >>>>> * Line 93 splits the above DataFrame row-wise into separate "training"
> >>>>> and "validation" DataFrames, based on the split percentage from line
> >>>>> 70 (`train_frac`).  This is performed so that downstream machine
> >>>>> learning tasks can learn from the training set, and validate
> >>>>> performance and hyperparameter choices on the validation set.  These
> >>>>> DataFrames will start with the same columns as the above DataFrame.
> >>>>> If `add_row_indices` from line 69 is true, then an additional row
> >>>>> index column (`__INDEX`) will be prepended.  This is useful for
> >>>>> SystemML in downstream machine learning tasks, as it gives the
> >>>>> DataFrame row numbers like a real matrix would have, and SystemML is
> >>>>> built to operate on matrices.
> >>>>> * Lines 97 & 98 simply save the training and validation DataFrames
> >>>>> using the names defined on lines 76 & 78.
> >>>>> * Lines 103-137 create smaller train and validation DataFrames by
> >>>>> taking small row-wise samples of the full train and validation
> >>>>> DataFrames.  The percentage of the sample is defined on line 111
> >>>>> (`p=0.01` for a 1% sample).  This is generally useful for quicker
> >>>>> downstream tasks without having to load in the larger DataFrames,
> >>>>> assuming you have a large amount of data.  For us, we have ~7TB of
> >>>>> data, so having 1% sampled DataFrames is useful for quicker downstream
> >>>>> tests.  Once again, the same columns from the larger train and
> >>>>> validation DataFrames will be used.
> >>>>> * Lines 146 & 147 simply save these sampled train and validation
> >>>>> DataFrames.
> >>>>>
> >>>>> As a summary, after running `preprocess.py`, you will be left with the
> >>>>> following saved DataFrames in HDFS:
> >>>>> * Full DataFrame
> >>>>> * Training DataFrame
> >>>>> * Validation DataFrame
> >>>>> * Sampled training DataFrame
> >>>>> * Sampled validation DataFrame
> >>>>>
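> >>>>> As a hedged sketch, one of these can be read back in a later session
> >>>>> like this (the file name is illustrative; the real names are defined
> >>>>> on lines 74-79 of `preprocess.py`):
> >>>>>
> >>>>> ```
> >>>>> # `spark` is the SparkSession; the DataFrames are saved as parquet,
> >>>>> # which is the default format for load().
> >>>>> train_df = spark.read.load("train.parquet")
> >>>>> train_df.printSchema()  # slide number, tumor score, molecular score, sample
> >>>>> ```
> >>>>>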
> >>>>> As for visualization, you may visualize a "sample" (i.e. a small,
> >>>>> chopped-up section of an original image) from a DataFrame by using the
> >>>>> `breastcancer.visualization.visualize_sample(...)` function.  You will
> >>>>> need to do this after creating the DataFrames.  Here is a snippet to
> >>>>> visualize the first row sample in a DataFrame, where `df` is one of
> >>>>> the DataFrames from above:
> >>>>>
> >>>>> ```
> >>>>> from breastcancer.visualization import visualize_sample
> >>>>> visualize_sample(df.first().sample)
> >>>>> ```
> >>>>>
> >>>>> Please let me know if you have any additional questions.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> - Mike
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Mike Dusenberry
> >>>>> GitHub: github.com/dusenberrymw
> >>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>
> >>>>> Sent from my iPhone.
> >>>>>
> >>>>>
> >>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hello sir,
> >>>>>> Can you please elaborate more on what output we should be getting?
> >>>>>> When we tried executing the preprocess.py file using spark-submit, it
> >>>>>> kept adding the tiles to an RDD, and while running visualization.py
> >>>>>> it didn't show any output. Can you please help us out ASAP, stating
> >>>>>> the output we should be getting and the sequence of execution of the
> >>>>>> files.
> >>>>>> Thank you.
> >>>>>>
> >>>>>>> On 07-Apr-2017 5:54 AM, <dusenberrymw@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Aishwarya,
> >>>>>>>
> >>>>>>> Thanks for sharing more info on the issue!
> >>>>>>>
> >>>>>>> To facilitate easier usage, I've updated the preprocessing code by
> >>>>>>> pulling out most of the logic into a `breastcancer/preprocessing.py`
> >>>>>>> module, leaving just the execution in the `Preprocessing.ipynb`
> >>>>>>> notebook.  There is also a `preprocess.py` script with the same
> >>>>>>> contents as the notebook for use with `spark-submit`.  The choice of
> >>>>>>> the notebook or the script is just a matter of convenience, as they
> >>>>>>> both import from the same `breastcancer/preprocessing.py` module.
> >>>>>>>
> >>>>>>> As part of the updates, I've added an explicit SparkSession parameter
> >>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body to
> >>>>>>> use this SparkSession object rather than the older SparkContext `sc`
> >>>>>>> object.  Previously, the `preprocess(...)` function accessed the `sc`
> >>>>>>> object that was pulled in from the enclosing scope, which would work
> >>>>>>> while all of the code was colocated within the notebook, but not if
> >>>>>>> the code was extracted and imported.  The explicit parameter now
> >>>>>>> allows the code to be imported.
> >>>>>>>
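> >>>>>>> For illustration, a hedged sketch of the import-and-call flow (any
> >>>>>>> arguments beyond `spark` are hypothetical; see
> >>>>>>> `breastcancer/preprocessing.py` for the real signature):
> >>>>>>>
> >>>>>>> ```
> >>>>>>> from pyspark.sql import SparkSession
> >>>>>>> from breastcancer.preprocessing import preprocess
> >>>>>>>
> >>>>>>> spark = SparkSession.builder.getOrCreate()
> >>>>>>> # The SparkSession is passed explicitly now, rather than being
> >>>>>>> # pulled in from the enclosing scope.
> >>>>>>> df = preprocess(spark)  # plus whatever other arguments it requires
> >>>>>>> ```
> >>>>>>>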
> >>>>>>> Can you please try again with the latest updates?  We are currently
> >>>>>>> using Spark 2.x with Python 3.  If you use the notebook, the pyspark
> >>>>>>> kernel should have a `spark` object available that can be supplied to
> >>>>>>> the functions (as is done now in the notebook), and if you use the
> >>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object will
> >>>>>>> be created explicitly by the script.
> >>>>>>>
> >>>>>>> For a bit of context to others, Aishwarya initially reached out to
> >>>>>>> find out if our breast cancer project could be applied to TIFF
> >>>>>>> images, rather than the SVS images we are currently using (the
> >>>>>>> answer is "yes", so long as they are "generic tiled TIFF" images,
> >>>>>>> according to the OpenSlide documentation), and then followed up with
> >>>>>>> Spark issues related to the preprocessing code.  This conversation
> >>>>>>> has been promptly moved to the mailing list so that others in the
> >>>>>>> community can benefit.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> -Mike
> >>>>>>>
> >>>>>>> --
> >>>>>>>
> >>>>>>> Mike Dusenberry
> >>>>>>> GitHub: github.com/dusenberrymw
> >>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>>>
> >>>>>>> Sent from my iPhone.
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hey,
> >>>>>>>>
> >>>>>>>> The object `sc` is already defined in pyspark, and yet this
> >>>>>>>> NameError keeps occurring. We are using Spark 2.*.
> >>>>>>>>
> >>>>>>>> Here is the link to the error that we are getting:
> >>>>>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
> >>>>>>>
> >>>>>
> >>>
> >>
>
