systemml-dev mailing list archives

From arijit chakraborty <ak...@hotmail.com>
Subject Re: Distinct Item of a column
Date Tue, 18 Apr 2017 04:59:15 GMT
Thank you, Niketan! Your answer completely answers my question.


Regards,

Arijit

________________________________
From: Niketan Pansare <npansar@us.ibm.com>
Sent: Tuesday, April 18, 2017 12:55:28 AM
To: dev@systemml.incubator.apache.org
Subject: Re: Distinct Item of a column


Hi Arijit,

PySpark and SystemML are complementary and serve different purposes. PySpark primarily
operates on a collection of data points (i.e. RDD) or a DataFrame and exposes the Spark programming
model (i.e. transformations and actions). SystemML primarily operates on matrices and provides
a wide variety of linear algebra operators required for implementing Machine Learning algorithms.
Personally, I would use PySpark for data preprocessing and SystemML for training/prediction
(YMMV!!). As an example: in our breast cancer project, we use PySpark APIs in https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb
and SystemML APIs in https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb
... Yes, some operations (such as distinct) can be done in both SystemML and PySpark, in which
case you should choose the one that best fits your needs.

PySpark ML (or MLlib) is closer to SystemML. I agree with you that there are not enough
comparisons out there, probably because benchmarking ML systems is non-trivial. For an
apples-to-apples comparison, you need to compare both the accuracy and the runtime performance
of a given ML model on a variety of datasets. I am using the term "accuracy" broadly, so please refer to http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
Also, since different ML systems use different optimization algorithms (e.g. SGD, conjugate
gradient, direct solve, ...), one needs to reason about hyperparameters as well as convergence
behavior before making a judgement.
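
As a rough illustration of what comparing both accuracy and runtime on the same data could
look like, here is a minimal sketch in plain Python. Everything in it is an assumption made for
illustration: `benchmark`, `accuracy`, `fit_majority` and `fit_threshold` are hypothetical
helpers, not SystemML or PySpark APIs, and the two toy "systems" merely stand in for real
training code. It also ignores hyperparameters and convergence behavior, which a real
comparison would have to control for.

```python
import time
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly (one metric among many)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def benchmark(name, fit, train, test):
    """Time training and prediction of one system and score it on held-out data.

    `fit(X, y)` is assumed to return a `predict(X)` callable (hypothetical contract).
    """
    X_train, y_train = train
    X_test, y_test = test

    start = time.perf_counter()
    predict = fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = predict(X_test)
    predict_seconds = time.perf_counter() - start

    return {
        "system": name,
        "accuracy": accuracy(y_test, y_pred),
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
    }


def fit_majority(X, y):
    """Toy stand-in for system A: always predict the most common training label."""
    majority = Counter(y).most_common(1)[0][0]
    return lambda X_new: [majority for _ in X_new]


def fit_threshold(X, y):
    """Toy stand-in for system B: threshold the first feature halfway between
    the two class means (assumes binary labels 0/1, both present in y, with
    class 1 having the larger mean)."""
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda X_new: [1 if row[0] > cut else 0 for row in X_new]


if __name__ == "__main__":
    # The same train/test split is given to every system being compared.
    train = ([[0.1], [0.2], [0.3], [0.8], [0.9]], [0, 0, 0, 1, 1])
    test = ([[0.15], [0.85], [0.7], [0.25]], [0, 1, 1, 0])
    for name, fit in [("majority", fit_majority), ("threshold", fit_threshold)]:
        print(benchmark(name, fit, train, test))
```

In practice one would repeat this over several datasets and several runs per dataset, and swap
`accuracy` for whichever metric suits the task.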

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

PS: SystemML has recently added support for frames (http://apache.github.io/incubator-systemml/dml-language-reference.html#frames),
which simplify common data transformation operations such as recoding, dummy coding, binning,
and handling of missing values.

From: arijit chakraborty <akc14@hotmail.com>
To: "dev@systemml.incubator.apache.org" <dev@systemml.incubator.apache.org>
Date: 04/17/2017 08:50 AM
Subject: Distinct Item of a column

________________________________



Hi,


I'm curious to know what the advantage of SystemML over PySpark is, especially in terms of
performance. I tried looking for some reading on it, but could hardly find any.


Thank you!

Arijit



