systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: API documentation for SystemML
Date Tue, 08 Dec 2015 03:23:37 GMT
Hi Niketan/Deron,

Thanks for the inputs. Let me dig a little deeper using these inputs. Shall
get back to you in case I have more questions.

Regards,
Sourav

On Mon, Dec 7, 2015 at 4:28 PM, Deron Eriksson <deroneriksson@gmail.com>
wrote:

> Thank you Niketan for providing such useful information. The
> RDDConverterUtilsExt javadoc example is great.
>
> The MLContext API has a tremendous amount of potential given that it has
> such clean integration with Spark (for example, it's so easy to create an
> MLContext from a SparkContext in the Spark Shell). I'm really interested in
> seeing how data scientists and developers embrace it in the coming months.
>
>
> Deron
>
>
>
> On Mon, Dec 7, 2015 at 3:31 PM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> > Thanks Deron for your response :)
> >
> > Sourav: A few additional comments:
> > 1. MLContext allows the users to pass RDDs to SystemML, and MLOutput
> > allows them to fetch the result RDD after the execution of a DML script.
> >
> > 2. MLContext exposes the registerInput("variableName", RDD) interface,
> > while MLOutput has get...("variableName") methods, e.g. getDF,
> > getBinaryBlockedRDD, ...
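[Editor's note: a minimal sketch of the register/fetch flow from points 1 and 2 above, following the MLContext programming guide linked later in this thread. The script path, variable names, and the registerOutput/execute calls are illustrative assumptions, not quoted from this thread.]

```scala
import org.apache.sysml.api.{MLContext, MLOutput}

// Assumes an existing SparkContext `sc`, a SQLContext `sqlContext`,
// and an input DataFrame `inputDF`; "myScript.dml" is a hypothetical script.
val ml = new MLContext(sc)

// Pass input data to SystemML under the variable name used in the DML script.
ml.registerInput("X", inputDF)
// Declare which DML variable should be retrievable from the MLOutput.
ml.registerOutput("Y")

// Execute the DML script; the returned MLOutput exposes the get...() methods.
val out: MLOutput = ml.execute("myScript.dml")
val resultDF = out.getDF(sqlContext, "Y")
```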
> >
> > 3. With the exception of DataFrame, the RDDs supported by these classes
> > mirror the RDDs in the symbol table and the formats supported by the
> > read()/write() built-in functions. The following types of RDDs are
> > supported by these classes:
> > a. Binary blocked RDD (JavaPairRDD<MatrixIndexes, MatrixBlock>) =>
> > corresponds to format="binary"
> > b. String-based RDD (JavaRDD<String>) => corresponds to format="csv" or
> > format="text"
> > c. DataFrame
> >
> > See
> > http://apache.github.io/incubator-systemml/dml-language-reference.html#readwrite-built-in-functions
> > for more details about the formats supported by the read()/write()
> > built-in functions.
> >
> > 4. For all other types of RDDs, we decided to expose them through
> > converter utils:
> >
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtils.java
> >
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java
> >
> > 5. The utility functions in RDDConverterUtilsExt are not yet tested for
> > performance and robustness. Once they are tested, they will be moved into
> > RDDConverterUtils. Most of these utils have javadocs within the code, and
> > we will add both a usage guide and external javadoc for them. The
> > following types of conversions are supported by the converter utils:
> > a. CoordinateMatrix to Binary blocked RDD (See
> > coordinateMatrixToBinaryBlock in RDDConverterUtilsExt).
> > b. Binary blocked RDD to String RDD.
> > c. DataFrame with a Vector UDT column to Binary blocked RDD and vice
> > versa. This is useful while working with RDD<LabeledPoint>. (See
> > vectorDataFrameToBinaryBlock and binaryBlockToVectorDataFrame in
> > RDDConverterUtilsExt).
> > d. DataFrame with double columns (See dataFrameToBinaryBlock in
> > RDDConverterUtilsExt). Since a DataFrame/RDD is a collection, not an
> > indexed/ordered sequence (at least not at the API level), an ID column is
> > inserted by MLOutput to denote the row index.
> > e. Binary block to LabeledPoint RDD (See binaryBlockToLabeledPoints in
> > RDDConverterUtils).
> > f. Conversion between text/cell/csv formats to and from Binary blocked
> > RDD (See RDDConverterUtils).
> >
> > 6. The MLContext interface is Scala compatible, i.e. we support both
> > JavaRDD and RDD, JavaSparkContext and SparkContext, java.util.HashMap and
> > scala.collection.immutable.Map, and so on.
> >
> > 7. MatrixCharacteristics is used to provide the metadata (such as the
> > number of rows, number of columns, block row length, block column length,
> > and number of non-zeros) of an RDD to SystemML's optimizer. In some
> > cases it is required (for example: text, binary format), while in other
> > cases it can be skipped (for example: csv, dataframe). MLContext exposes
> > convenient wrappers such as void registerInput(String varName,
> > JavaPairRDD<MatrixIndexes,MatrixBlock> rdd, long rlen, long clen,
> > int brlen, int bclen) to avoid creating MatrixCharacteristics. Here
> > is the source code if you are interested:
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/MatrixCharacteristics.java
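[Editor's note: a sketch of the wrapper signature quoted in point 7, which passes the metadata directly instead of building a MatrixCharacteristics object. The variable names and dimensions here are illustrative assumptions.]

```scala
import org.apache.spark.api.java.JavaPairRDD
import org.apache.sysml.api.MLContext
import org.apache.sysml.runtime.matrix.data.{MatrixIndexes, MatrixBlock}

// Assumes `sc` is an existing SparkContext and `binBlocks` is an existing
// JavaPairRDD[MatrixIndexes, MatrixBlock], e.g. produced by a converter util.
val ml = new MLContext(sc)

// rlen/clen are the matrix dimensions; brlen/bclen are the block sizes
// (1000 x 1000 matches the block size used in the javadoc example below).
// The metadata is passed inline, so no MatrixCharacteristics is constructed.
ml.registerInput("X", binBlocks, 10000L, 1000L, 1000, 1000)
```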
> >
> > A good example of using MatrixCharacteristics and converter utils is
> > provided in RDDConverterUtilsExt's javadoc:
> >
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> > import org.apache.spark.api.java.JavaSparkContext
> > import org.apache.spark.mllib.linalg.distributed.MatrixEntry
> > import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
> >
> > val matRDD = sc.textFile("ratings.text")
> >   .map(_.split(" "))
> >   .map(x => new MatrixEntry(x(0).toLong, x(1).toLong, x(2).toDouble))
> >   .filter(_.value != 0)
> >   .cache
> > require(matRDD.filter(x => x.i == 0 || x.j == 0).count == 0,
> >   "Expected 1-based ratings file")
> > val nnz = matRDD.count
> > val numRows = matRDD.map(_.i).max
> > val numCols = matRDD.map(_.j).max
> > val coordinateMatrix = new CoordinateMatrix(matRDD, numRows, numCols)
> > val mc = new MatrixCharacteristics(numRows, numCols, 1000, 1000, nnz)
> > val binBlocks = RDDConverterUtilsExt.coordinateMatrixToBinaryBlock(
> >   new JavaSparkContext(sc), coordinateMatrix, mc, true)
> >
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Deron Eriksson <deroneriksson@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/07/2015 02:50 PM
> > Subject: Re: API documentation for SystemML
> > ------------------------------
> >
> >
> >
> > Hi Sourav,
> >
> > One way to generate Javadocs for the entire SystemML project is "mvn
> > javadoc:javadoc".
> >
> > Unfortunately, classes such as MatrixCharacteristics and
> > RDDConverterUtils currently have very minimal API documentation. We are
> > hoping to address this in the near future. However, you may find that the
> > following documentation link could be of assistance in getting started,
> > given your interest in Scala:
> >
> >
> > http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html
> >
> > Deron
> >
> >
> > On Mon, Dec 7, 2015 at 1:58 PM, Sourav Mazumder <
> > sourav.mazumder00@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Is there any Scala/Java API documentation available for classes like
> > >
> > > MatrixCharacteristics, RDDConverterUtils ?
> > >
> > > What I need to understand is what helper utilities are available
> > > and the details of their signatures/APIs.
> > >
> > > Regards,
> > >
> > > Sourav
> > >
> >
> >
> >
>
