systemml-dev mailing list archives

From Matthias Boehm <mboe...@gmail.com>
Subject Re: Passing a CoordinateMatrix to SystemML
Date Fri, 22 Dec 2017 13:48:13 GMT
well, let's do the following to figure this out:

1) If the schema is indeed [label: Integer, features: SparseVector], 
please change the third line to val y = input_data.select("label").

2) For debugging, I would recommend using a simple script such as 
"print(sum(X));" and converting X and y separately to isolate the 
problem.
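
To make this concrete, a minimal sketch of that isolation step might look 
as follows (assuming the MLContext API and that x is the "features" 
Dataset from your snippet; the variable names here are illustrative):

```scala
import org.apache.sysml.api.mlcontext._
import org.apache.sysml.api.mlcontext.ScriptFactory._

val ml = new MLContext(spark)

// Bind only X and run a trivial script; if this throws, the
// problem is in converting X rather than y.
val x_meta = new MatrixMetadata(MatrixFormat.DF_VECTOR)
val testX = dml("print(sum(X));").in("X", x, x_meta)
ml.execute(testX)

// Repeat analogously with y and DF_DOUBLES to test the other input.
```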

3) If it's still failing, it would be helpful to know (a) whether the 
issue is in converting X, y, or both, as well as (b) the full stacktrace.

4) As a workaround, you might also call our internal converter directly via:
RDDConverterUtils.dataFrameToBinaryBlock(jsc, df, mc, containsID, 
isVector),
where jsc is the Java Spark context, df is the dataset, mc is the matrix 
characteristics (if unknown, simply use new MatrixCharacteristics()), 
containsID indicates whether the dataset contains a column "__INDEX" with 
the row indexes, and isVector indicates whether the passed dataset 
contains vectors or basic types such as double.
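
For example, a sketch of this workaround for your features column could 
look like the following (argument values here are assumptions based on 
your described schema, not a tested configuration):

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
import org.apache.sysml.runtime.matrix.MatrixCharacteristics

val jsc = new JavaSparkContext(spark.sparkContext)
val mc = new MatrixCharacteristics()  // dimensions unknown: let SystemML infer them
val containsID = false                // no "__INDEX" column in the dataset
val isVector = true                   // "features" holds SparseVector values

// Converts the dataset into SystemML's binary-block matrix representation.
val binBlocks =
  RDDConverterUtils.dataFrameToBinaryBlock(jsc, x, mc, containsID, isVector)
```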


Regards,
Matthias

On 12/22/2017 12:00 AM, Anthony Thomas wrote:
> Hi SystemML folks,
>
> I'm trying to pass some data from Spark to a DML script via the MLContext
> API. The data is derived from a parquet file containing a dataframe with
> the schema: [label: Integer, features: SparseVector]. I am doing the
> following:
>
>         val input_data = spark.read.parquet(inputPath)
>         val x = input_data.select("features")
>         val y = input_data.select("y")
>         val x_meta = new MatrixMetadata(DF_VECTOR)
>         val y_meta = new MatrixMetadata(DF_DOUBLES)
>         val script = dmlFromFile(s"${script_path}/script.dml").
>                 in("X", x, x_meta).
>                 in("Y", y, y_meta)
>         ...
>
> However, this results in an error from SystemML:
> java.lang.ArrayIndexOutOfBoundsException: 0
> I'm guessing this has something to do with SparkML being zero indexed and
> SystemML being 1 indexed. Is there something I should be doing differently
> here? Note that I also tried converting the dataframe to a CoordinateMatrix
> and then creating an RDD[String] in IJV format. That too resulted in
> "ArrayIndexOutOfBoundsExceptions." I'm guessing there's something simple
> I'm doing wrong here, but I haven't been able to figure out exactly what.
> Please let me know if you need more information (I can send along the full
> error stacktrace if that would be helpful)!
>
> Thanks,
>
> Anthony
>
