systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: SystemML Notebook docker image
Date Sat, 06 Feb 2016 01:26:34 GMT
Hi Deron,

I can surely share that. Can I upload it somewhere on the SystemML site?

My notebook covers an end-to-end workflow involving Spark, Zeppelin, R, and
SystemML. It reads two different datasets from HDFS, explores them, and merges
them using Spark SQL. It then uses that merged dataset to create a model with
SystemML, and also with SparkR (and Spark MLlib). It then uses that model to
make predictions on data from one of the datasets. It also includes
visualization of the predictions using R's visualization libraries. However,
the R-based visualization requires a specific PR of Zeppelin.

If I compare ease of use of SystemML with Spark MLlib and SparkR (note that
there is nothing specific to Zeppelin in using any of the three; the same code
can be used in the Scala shell/REPL as well):

1. SparkR is much simpler than both SystemML and Spark MLlib. Modeling can be
done with a single API call, passing the feature set and the label (the
dependent variable) directly in that same call. Similarly, use of the
prediction API is very straightforward. However, SparkR right now supports
only a single algorithm, GLM, and even there the flexibility to parameterize
it is very limited.

2. In Spark MLlib, preparing the dataset by separating features and label
requires understanding RDDs, their map function, and the LabeledPoint API. In
my view this is unnecessary: why should data scientists need to know all of
this? After that, calling the APIs is straightforward, though it requires
knowing some of the classes specific to the algorithm. And if someone needs to
use the Cross Validator, the code is really complex for data scientists;
it is doable, but only if one is patient enough to work through that process.
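The preparation step described above can be sketched in plain Python (a hypothetical stand-in for MLlib's `LabeledPoint` and the RDD `map` call; the real code would run on a Spark cluster):

```python
from collections import namedtuple

# Plain-Python stand-in for MLlib's LabeledPoint: a label plus a
# feature vector.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Hypothetical raw records: the last column is the label, the rest
# are features.
rows = [
    (5.1, 3.5, 0.0),
    (4.9, 3.0, 1.0),
]

# The "map" step the email describes: split each record into a
# feature vector and a label.
training = [LabeledPoint(label=r[-1], features=list(r[:-1])) for r in rows]
```

In MLlib proper, the same split is written as an RDD transformation over dense or sparse vectors.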

3. With SystemML, the amount of code to be written is similar to Spark MLlib
but quite straightforward: get and put calls to pass data in and fetch results
out. The crux of the challenge is understanding the inputs and outputs of a
DML script, which is not always easy. Niketan from this group helped me a lot
in working through that. Still, the APIs are much simpler to use than Spark
MLlib's, though I would like them to be as easy as in R. Also, the ability to
apply cross-validation would be a very powerful feature.
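The get/put shape mentioned above can be sketched roughly as follows. This is a hypothetical plain-Python stand-in, not the actual MLContext API (the real API registers Spark DataFrames/RDDs against the variable names a DML script expects):

```python
# Hypothetical sketch of a put/execute/get flow; NOT the real
# MLContext API.
class ScriptContext:
    def __init__(self):
        self.inputs = {}
        self.outputs = {}

    def put(self, name, value):
        # Register an input under the name the script expects.
        self.inputs[name] = value

    def execute(self, script):
        # Stand-in for running a DML script: here we just echo the
        # inputs through to the outputs.
        self.outputs = dict(self.inputs)

    def get(self, name):
        # Fetch a named result produced by the script.
        return self.outputs[name]

ctx = ScriptContext()
ctx.put("X", [[1.0, 2.0], [3.0, 4.0]])
ctx.execute("B = X")  # placeholder script text
result = ctx.get("X")
```

The point of the sketch is only the shape of the workflow: the caller must know which names the script reads and writes, which is exactly the learning hurdle described above.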

4. One thing that must be addressed in SystemML is the ability to easily map
the results of prediction back to the original dataset used for
prediction/scoring. Right now one has to first create an ID and attach it to
the dataset, and then merge the predicted values back using that ID. This is
really cumbersome and should ideally be transparent; both R and Spark MLlib
provide that much support. Again, Niketan and the team helped me understand
the issue and overcome this limitation, but data scientists may not be patient
enough to go through this unnecessary learning process.
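The ID workaround described here can be sketched in plain Python (hypothetical data; in practice the same join would be done on Spark datasets):

```python
# Attach a synthetic ID to each row of the scoring dataset up front.
rows = [("alice", 34), ("bob", 29)]
scoring = list(enumerate(rows))  # [(0, row0), (1, row1)]

# Predictions come back keyed only by that synthetic ID.
predictions = {0: 0.87, 1: 0.12}

# The merge step: join predictions back to the original rows via
# the ID.
merged = [(row, predictions[i]) for i, row in scoring]
```

This is the extra bookkeeping the email argues should be transparent to the user.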

Let me know if you need any details/clarifications on these points.

Regards,
Sourav



On Fri, Feb 5, 2016 at 1:49 PM, Deron Eriksson <deroneriksson@gmail.com>
wrote:

> Hi Sourav,
>
> That sounds very useful for people who are interested in running SystemML
> through Zeppelin. It would be great if you could share that.
>
> I was wondering, what is your opinion of running SystemML through Zeppelin?
> Do you think that is a path that is going to be most useful for data
> scientists to do exploratory work with SystemML? Is there anything that you
> would like to see improved with regards to the MLContext API?
>
> Deron
>
>
> On Thu, Feb 4, 2016 at 4:01 PM, Sourav Mazumder <sourav.mazumder00@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have a complete end-to-end modeling and prediction example using Zeppelin,
> > along with visualization of the predictions using R plots.
> >
> > I can share the same too if that is useful.
> >
> > Regards,
> > Sourav
> >
> > On Thu, Feb 4, 2016 at 3:20 PM, Luciano Resende <luckbr1975@gmail.com>
> > wrote:
> >
> > > I started experimenting with some nice ways to enable data scientists to
> > > get started with SystemML with the minimum setup and a pleasant user
> > > experience.
> > >
> > > Following the guide published in the SystemML project documentation page
> > > [1], I created a docker image containing the necessary infrastructure for
> > > running SystemML in cluster mode, and also installed and configured
> > > Zeppelin with SystemML and the sample notebook available.
> > >
> > > Please see more detailed instructions to use it at
> > >
> > > https://github.com/lresende/docker-systemml-notebook
> > >
> > > If people start to find this very useful we could move this to the
> > > SystemML project itself and start making more scenarios available as
> > > sample notebooks.
> > >
> > > [1] http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#zeppelin-notebook-example---linear-regression-algorithm
> > >
> > > [2] https://github.com/lresende/docker-systemml-notebook
> > >
> > > --
> > > Luciano Resende
> > > http://people.apache.org/~lresende
> > > http://twitter.com/lresende1975
> > > http://lresende.blogspot.com/
> > >
> >
>
