systemml-dev mailing list archives

From Deron Eriksson <deroneriks...@gmail.com>
Subject Re: API documentation for SystemML
Date Tue, 08 Dec 2015 00:28:27 GMT
Thank you Niketan for providing such useful information. The
RDDConverterUtilsExt javadoc example is great.

The MLContext API has a tremendous amount of potential given that it has
such clean integration with Spark (for example, it's so easy to create an
MLContext from a SparkContext in the Spark Shell). I'm really interested in
seeing how data scientists and developers embrace it in the coming months.
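For anyone trying this out, a minimal sketch of that step in the Spark Shell
(assuming the SystemML jar is on the classpath):

import org.apache.sysml.api.MLContext
// sc is the SparkContext that the Spark Shell provides
val ml = new MLContext(sc)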


Deron



On Mon, Dec 7, 2015 at 3:31 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Thanks Deron for your response :)
>
> Sourav: Few additional comments:
> 1. MLContext allows users to pass RDDs to SystemML, and MLOutput allows
> them to fetch the result RDDs after a DML script has executed.
>
> 2. MLContext exposes a registerInput("variableName", RDD) interface, while
> MLOutput has get...("variableName") methods, e.g. getDF, getBinaryBlockedRDD,
> and so on.
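>
> For illustration, here is a rough end-to-end sketch of that flow from the
> Spark Shell. The DML script path and variable names are hypothetical, and the
> exact signatures of registerInput, registerOutput, execute, and getDF should
> be double-checked against the MLContext javadoc:
>
> import org.apache.sysml.api.MLContext
> // sc and sqlContext are provided by the Spark Shell
> val ml = new MLContext(sc)
> val inputDF = sqlContext.createDataFrame(Seq((1.0, 2.0), (3.0, 4.0))).toDF("c1", "c2")
> // register the input DataFrame under the DML variable name "X"
> ml.registerInput("X", inputDF)
> // declare which DML variable should be kept as output
> ml.registerOutput("Y")
> // "my_script.dml" is a hypothetical script that reads X and writes Y
> val out = ml.execute("my_script.dml")
> val resultDF = out.getDF(sqlContext, "Y")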
>
> 3. With the exception of DataFrame, the RDDs supported by these classes mirror
> the RDDs in the symbol table and the formats supported by the read()/write()
> built-in functions. The following types of RDDs are supported by these classes:
> a. Binary blocked RDD (JavaPairRDD<MatrixIndexes, MatrixBlock>) =>
> corresponds to format="binary"
> b. String-based RDD (JavaRDD<String>) => corresponds to format="csv" or
> format="text"
> c. DataFrame
>
> See
> http://apache.github.io/incubator-systemml/dml-language-reference.html#readwrite-built-in-functions
> for more details about the formats supported by read()/write() built-in
> functions.
>
> 4. For all other types of RDDs, we decided to expose them through
> converter utils:
>
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtils.java
>
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java
>
> 5. The utility functions in RDDConverterUtilsExt are not yet tested for
> performance and robustness. Once they are tested, they will be moved into
> RDDConverterUtils. Most of these utils have javadocs within the code, and we
> will add both a usage guide and external javadoc for them. The following types
> of conversions are supported by the converter utils:
> a. CoordinateMatrix to Binary blocked RDD (See
> coordinateMatrixToBinaryBlock in RDDConverterUtilsExt).
> b. Binary blocked RDD to String RDD.
> c. DataFrame with a Vector UDT column to Binary blocked RDD and vice versa.
> This is useful when working with RDD<LabeledPoint>. (See
> vectorDataFrameToBinaryBlock and binaryBlockToVectorDataFrame in
> RDDConverterUtilsExt).
> d. DataFrame with double columns (See dataFrameToBinaryBlock in
> RDDConverterUtilsExt). Since a DataFrame/RDD is a collection, not an
> indexed/ordered sequence (at least not at the API level), an ID column is
> inserted by MLOutput to denote the row index.
> e. Binary block to labeled points (See binaryBlockToLabeledPoints in
> RDDConverterUtils).
> f. Conversion of text/cell/csv formats to and from Binary blocked RDDs
> (See RDDConverterUtils).
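>
> As a rough sketch of (d), converting a DataFrame of doubles into a Binary
> blocked RDD might look like the following. Note that the parameter list of
> dataFrameToBinaryBlock is an assumption patterned after
> coordinateMatrixToBinaryBlock in the example further down, so please check it
> against the javadoc in RDDConverterUtilsExt before relying on it:
>
> import org.apache.spark.api.java.JavaSparkContext
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> // sc and sqlContext are provided by the Spark Shell
> val df = sqlContext.createDataFrame(Seq((1.0, 2.0), (3.0, 4.0))).toDF("c1", "c2")
> // metadata: rows, cols, block sizes, and an upper bound on non-zeros (assume dense)
> val mc = new MatrixCharacteristics(df.count, df.columns.length, 1000, 1000,
>   df.count * df.columns.length)
> // last argument (assumed): whether the DataFrame already contains a row-ID column
> val binBlocks = RDDConverterUtilsExt.dataFrameToBinaryBlock(new JavaSparkContext(sc),
>   df, mc, false)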
>
> 6. The MLContext interface is Scala-compatible, i.e., we support both JavaRDD
> and RDD, JavaSparkContext and SparkContext, java.util.HashMap and
> scala.collection.immutable.Map, and so on.
>
> 7. MatrixCharacteristics is used to provide metadata (such as the number of
> rows, number of columns, block row length, block column length, and number of
> non-zeros) of an RDD to SystemML's optimizer. In some cases it is required
> (for example: text, binary format), while in others it can be skipped (for
> example: csv, DataFrame). MLContext exposes convenient wrappers such as
> void registerInput(String varName, JavaPairRDD<MatrixIndexes,MatrixBlock> rdd,
> long rlen, long clen, int brlen, int bclen) to avoid creating
> MatrixCharacteristics directly. Here is the source code if you are interested:
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/MatrixCharacteristics.java
>
> A good example of using MatrixCharacteristics and converter utils is
> provided in RDDConverterUtilsExt's javadoc:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> import org.apache.spark.api.java.JavaSparkContext
> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
> val matRDD = sc.textFile("ratings.text").map(_.split(" ")).map(x =>
>   new MatrixEntry(x(0).toLong, x(1).toLong, x(2).toDouble)).filter(_.value != 0).cache
> require(matRDD.filter(x => x.i == 0 || x.j == 0).count == 0, "Expected 1-based ratings file")
> val nnz = matRDD.count
> val numRows = matRDD.map(_.i).max
> val numCols = matRDD.map(_.j).max
> val coordinateMatrix = new CoordinateMatrix(matRDD, numRows, numCols)
> val mc = new MatrixCharacteristics(numRows, numCols, 1000, 1000, nnz)
> val binBlocks = RDDConverterUtilsExt.coordinateMatrixToBinaryBlock(new JavaSparkContext(sc),
>   coordinateMatrix, mc, true)
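>
> To tie this back to points 2 and 7 above, the resulting binBlocks RDD could
> then be handed to MLContext roughly as follows. This is only a sketch: the DML
> script path and variable names are hypothetical, and the exact signatures of
> registerOutput, execute, and getDF should be verified against the MLContext
> javadoc:
>
> import org.apache.sysml.api.MLContext
> val ml = new MLContext(sc)
> // register the binary-blocked RDD with its dimensions and block sizes, using
> // the wrapper from point 7 so MatrixCharacteristics need not be constructed again
> ml.registerInput("X", binBlocks, numRows, numCols, 1000, 1000)
> ml.registerOutput("Y")
> // "my_script.dml" is a hypothetical script that reads X and writes Y
> val out = ml.execute("my_script.dml")
> val resultDF = out.getDF(sqlContext, "Y")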
>
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Deron Eriksson <deroneriksson@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 02:50 PM
> Subject: Re: API documentation for SystemML
> ------------------------------
>
>
>
> Hi Sourav,
>
> One way to generate Javadocs for the entire SystemML project is "mvn
> javadoc:javadoc".
>
> Unfortunately, classes such as MatrixCharacteristics and RDDConverterUtils
> currently have very minimal API documentation. We are hoping to address
> this in the near future. However, you may find that the following
> documentation link could be of assistance in getting started, given your
> interest in Scala:
>
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html
>
> Deron
>
>
> On Mon, Dec 7, 2015 at 1:58 PM, Sourav Mazumder <sourav.mazumder00@gmail.com>
> wrote:
>
> > Hi,
> >
> > Is there any Scala/Java API documentation available for classes like
> > MatrixCharacteristics and RDDConverterUtils?
> >
> > What I need to understand is what helper utilities are available
> > and the details of their signatures/APIs.
> >
> > Regards,
> >
> > Sourav
> >
>
>
>
