Well, it looks like an issue of incorrect metadata propagation (wrong propagation of dimensions through MR pmm instructions). The data itself looks good if I write a 20% sample to textcell (which is what we use in our testsuite).
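
A quick way to confirm this kind of mismatch is to compare the 'rows' value declared in an output's .mtd file with the number of rows actually written. The following is only a minimal sketch: the helper name `check_rows` and the sample values are hypothetical, and it assumes the CSV output has no header line and that the .mtd content is the JSON shown in the submission script.

```python
import json


def check_rows(mtd_text, csv_lines):
    """Return (declared_rows, actual_rows, match) for one output.

    mtd_text  : contents of the .mtd file (a JSON string)
    csv_lines : iterable of data lines from the CSV output (no header assumed)
    """
    declared = json.loads(mtd_text)["rows"]
    # Count only non-blank lines so a trailing newline does not inflate the count.
    actual = sum(1 for line in csv_lines if line.strip())
    return declared, actual, declared == actual


# Hypothetical example: metadata claims 5 rows, but only 4 were written.
mtd = '{"data_type": "matrix", "value_type": "double", "rows": 5, "cols": 2, "format": "csv"}'
csv = ["1.0,2.0", "3.0,4.0", "5.0,6.0", "7.0,8.0"]
print(check_rows(mtd, csv))
```

If the declared and actual counts disagree while the data lines themselves look right, that points at the metadata rather than the data.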

@Shirish: thanks for looking into it. Just fyi, while testing this on an ultra-sparse scenario, I also encountered a runtime issue of deep copying sparse rows (fix will be available tomorrow), so for now don't worry about it if you encounter the same issue.

Regards,
Matthias



From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 04/14/2016 08:43 PM
Subject: Re: 'sample.dml' replaces rows with 0's





Hi Ethan,

I just tried the script on a toy dataset and I could reproduce this erroneous
behavior when run in Hadoop mode -- both local and Spark modes are good. I
will look into it.

BTW, you forgot to attach the scripts.

Shirish

On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:

> OK this is interesting:
>
> Scenario 1
> I slightly modified 'sample.dml' to add statements to print the dimensions of
> SM, P and iX, and ran it on the same data. The dimensions AND the outputs
> were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of
> the original data.
>
> Please see attached:
> sample-debug.dml:
> sample.dml with 3 print functions inserted
> train-test-debug_1.mtd
> train-test-debug_2.mtd:
> meta data of outputs. Note 'rows' are correct.
>
>
> Scenario 2
> This is confusing, so I commented out the 'print' statements in
> 'sample.dml' and ran it on the same data, and the outputs were INCORRECT.
> That is, subsets '1' and '2' contain the same rows as the original data.
>
> Please see attached:
> sample-debug-noprint.dml:
> 3 print functions were commented out
> train-test-debug-noprint_1.mtd
> train-test-debug-noprint_2.mtd
> meta data of outputs. Note 'rows' are incorrect.
>
> There were no errors in either trial.
>
> Ethan
>
> On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:
>
>> Hello,
>>
>> I encountered an unexpected behavior from 'sample.dml' on a dataset on
>> Hadoop. Instead of splitting the data, it replaced rows of original data
>> with 0's. Here are the details:
>>
>> I called sample.dml in an attempt to split a 35 million by 2396 numeric
>> matrix into two subsets of 80% and 20%. The two resulting subsets '1' and
>> '2' both still contain 35 million rows, instead of 35 million * 80% and
>> 35 million * 20% rows.
>>
>> However it looks like 20% of the rows in '1' are replaced with 0's (but
>> not removed). It is as if line 66 of sample.dml
>> (https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
>> that calls removeEmpty() doesn't exist.
>>
>> Here is the submission script:
>>
>> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
>> echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols":
>> 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>>
>> ## Split file to training and test sets
>> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml \
>>   -config=$sysConfCust -nvargs X=/path/originalData.csv \
>>   sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>>
>>
>> There were no error messages and all MR jobs were executed successfully.
>> What other information can I provide to diagnose the issue?
>>
>> Thanks,
>>
>> Ethan
>>
>>
>>
>>
>>
>
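
For readers following the thread, the symptom in the original report (rows replaced with 0's but not removed) can be illustrated with a small sketch. This is not the code of sample.dml; it is a plain-Python stand-in that assumes a split works in two steps: zero out the rows that do not belong to a subset, then drop the all-zero rows, which is the role the report attributes to removeEmpty(). The helper names `zero_unselected` and `remove_empty_rows` and the toy matrix are hypothetical.

```python
def zero_unselected(rows, keep):
    """Replace every row whose flag in `keep` is False with a row of zeros."""
    return [row if k else [0.0] * len(row) for row, k in zip(rows, keep)]


def remove_empty_rows(rows):
    """Drop rows that contain only zeros (stand-in for removeEmpty on rows).

    Note: this also drops rows that were all-zero in the original data.
    """
    return [row for row in rows if any(v != 0 for v in row)]


# Toy 5 x 2 matrix; the third row does not belong to this subset.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
keep = [True, True, False, True, True]

masked = zero_unselected(X, keep)   # same row count as X, one row zeroed
subset = remove_empty_rows(masked)  # zeroed row removed

print(len(masked), len(subset))
```

If the second step is skipped, or if the output dimensions are still taken from the first step, the result keeps the original row count with zeroed rows in place, which matches what was observed.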