systemml-dev mailing list archives

From Ethan Xu <ethan.yifa...@gmail.com>
Subject Re: 'sample.dml' replaces rows with 0's
Date Fri, 15 Apr 2016 00:02:45 GMT
OK this is interesting:

Scenario 1
I slightly modified 'sample.dml' to add statements that print the dimensions
of SM, P and iX, and ran it on the same data. The dimensions AND the output
were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of
the original data.

Please see attached:
sample-debug.dml:
sample.dml with 3 print functions inserted
train-test-debug_1.mtd
train-test-debug_2.mtd:
meta data of outputs. Note 'rows' are correct.


Scenario 2
This is confusing, so I commented out the 'print' statements in 'sample.dml'
and ran it on the same data, and the output was INCORRECT. That is, subsets
'1' and '2' each contain the same number of rows as the original data.

Please see attached:
sample-debug-noprint.dml:
3 print functions were commented out
train-test-debug-noprint_1.mtd
train-test-debug-noprint_2.mtd
meta data of outputs. Note 'rows' are incorrect.

There were no errors in either trial.
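
To make the symptom concrete, here is a minimal Python sketch of the difference between the two outcomes. It is NOT the logic of sample.dml; the names split_rows and drop_empty are hypothetical and only stand in for "rows not selected for a subset are zeroed" followed by an optional step that removes the all-zero rows (the role removeEmpty() appears to play in the script).

```python
import random


def split_rows(rows, fractions, seed=0, drop_empty=True):
    """Randomly assign each row to exactly one subset.

    Hypothetical illustration only, not the sample.dml implementation.
    With drop_empty=False, a row that is not assigned to a subset is kept
    in that subset as an all-zero row instead of being dropped.
    """
    rng = random.Random(seed)
    subsets = [[] for _ in fractions]
    for row in rows:
        # Pick the subset for this row with probability given by `fractions`.
        chosen = rng.choices(range(len(fractions)), weights=fractions)[0]
        for k, subset in enumerate(subsets):
            if k == chosen:
                subset.append(list(row))
            elif not drop_empty:
                subset.append([0.0] * len(row))
    return subsets


# Small stand-in for the real matrix; every row is non-zero.
rows = [[float(i + 1), float(i + 2)] for i in range(1000)]

# Scenario 1 (correct): subset sizes add up to the original row count.
expected = split_rows(rows, [0.8, 0.2], drop_empty=True)

# Scenario 2 (incorrect): every subset keeps all rows, some of them zeroed.
observed = split_rows(rows, [0.8, 0.2], drop_empty=False)

print(len(expected[0]), len(expected[1]))
print(len(observed[0]), len(observed[1]))
```

In the second call both subsets report the full row count, which matches the 'rows' values in the noprint .mtd files.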

Ethan

On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:

> Hello,
>
> I encountered an unexpected behavior from 'sample.dml' on a dataset on
> Hadoop. Instead of splitting the data, it replaced rows of original data
> with 0's. Here are the details:
>
> I called sample.dml in an attempt to split a 35 million by 2396 numeric
> matrix into 80% and 20% subsets. The two output subsets '1' and '2' both
> still contain 35 million rows, instead of roughly 28 million (80%) and
> 7 million (20%) rows.
>
> However it looks like 20% of the rows in '1' are replaced with 0's (but
> not removed). It is as if line 66 of sample.dml (
> https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
> that calls removeEmpty() doesn't exist.
>
> Here is the submission script:
>
> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
> echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols": 1,
> "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>
> ## Split file to training and test sets
> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml
> -config=$sysConfCust -nvargs X=/path/originalData.csv
> sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>
>
> There were no error messages and all MR jobs were executed successfully.
> What other information can I provide to diagnose the issue?
>
> Thanks,
>
> Ethan
