systemml-dev mailing list archives

From Deron Eriksson <deroneriks...@gmail.com>
Subject DML transform() function
Date Thu, 10 Dec 2015 00:53:56 GMT
Hi,

I'm working on updating the online docs for the DML transform() function
since a couple of things didn't copy over in the conversion to markdown.
However, I've run into an issue when I execute the transform() example. In
summary: is the "scale" transformation no longer allowed, while "bin" still
is?

I did the following:

I created data.csv:

zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
95141,south,3002,6,3,2,FALSE,929,934
NA,west,1373,,1,3,FALSE,695,698
91312,south,NA,6,2,2,FALSE,902,
94555,NA,1835,3,,3,,888,892
95141,west,2770,5,2.5,,TRUE,812,816
95141,east,2833,6,2.5,2,TRUE,927,
96334,NA,1339,6,3,1,FALSE,672,675
96334,south,2742,6,2.5,2,FALSE,872,876
96334,north,2195,5,2.5,2,FALSE,799,803

I created data.csv.mtd:

{
    "data_type": "frame",
    "format": "csv",
    "sep": ",",
    "header": true,
    "na.strings": [ "NA", "" ]
}

I created data.spec.json:

{
    "omit": [ "zipcode" ]
   ,"impute":
    [ { "name": "district"    , "method": "constant", "value": "south" }
     ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
     ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
     ,{ "name": "floors"      , "method": "constant", "value": 1 }
     ,{ "name": "view"        , "method": "global_mode" }
     ,{ "name": "askingprice" , "method": "global_mean" }
    ]

    ,"recode":
    [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors", "view" ]

    ,"bin":
    [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
    ]

    ,"dummycode":
    [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]

    ,"scale":
    [ { "name": "sqft", "method": "mean-subtraction" }
     ,{ "name": "saleprice", "method": "z-score" }
     ,{ "name": "askingprice", "method": "z-score" }
    ]
}

I executed the following DML:

D = read("data.csv");
tfD = transform(target=D,
                transformSpec="data.spec.json",
                transformPath="example-transform");
s = sum(tfD);
print("Sum = " + s);

This generated the following error:

java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
A column can not be binned and scaled.
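
Read literally, the message objects to a column that appears under both
"bin" and "scale", and column ID 3 would be "sqft" if IDs are 1-based
positions in data.csv. The check below is a minimal sketch of that reading
in Python, not SystemML's own validation logic: the function name
columns_binned_and_scaled is hypothetical, and it assumes the spec has the
same shape as data.spec.json above.

```python
import json


def columns_binned_and_scaled(spec):
    """Return column names listed under both "bin" and "scale" (hypothetical helper)."""
    binned = [entry["name"] for entry in spec.get("bin", [])]
    scaled = {entry["name"] for entry in spec.get("scale", [])}
    # Keep the order in which the columns appear under "bin".
    return [name for name in binned if name in scaled]


# Only the "bin" and "scale" sections of data.spec.json are needed here.
spec = json.loads("""
{
  "bin":   [ {"name": "saleprice", "method": "equi-width", "numbins": 3},
             {"name": "sqft",      "method": "equi-width", "numbins": 4} ],
  "scale": [ {"name": "sqft",        "method": "mean-subtraction"},
             {"name": "saleprice",   "method": "z-score"},
             {"name": "askingprice", "method": "z-score"} ]
}
""")

conflicts = columns_binned_and_scaled(spec)
print(conflicts)
```

With this spec the check reports "saleprice" and "sqft" and leaves
"askingprice" alone, since that column is only scaled.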

So, I removed the "scale" section from data.spec.json:

{
    "omit": [ "zipcode" ]
   ,"impute":
    [ { "name": "district"    , "method": "constant", "value": "south" }
     ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
     ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
     ,{ "name": "floors"      , "method": "constant", "value": 1 }
     ,{ "name": "view"        , "method": "global_mode" }
     ,{ "name": "askingprice" , "method": "global_mean" }
    ]

    ,"recode":
    [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors", "view" ]

    ,"bin":
    [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
    ]

    ,"dummycode":
    [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]

}

This generated a different error:

java.lang.RuntimeException: Encountered "NA" in column ID "3", when
expecting a numeric value. Consider adding "NA" to na.strings, along with
an appropriate imputation method.
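
One way to find such columns ahead of time is to scan the CSV for the
na.strings tokens and compare against the spec. The following is a rough
sketch in Python, not part of SystemML: the function name
columns_needing_imputation is hypothetical, and it assumes that columns
listed under "omit" are dealt with separately and need no "impute" entry.

```python
import csv
import io

NA_STRINGS = {"NA", ""}  # same tokens as "na.strings" in data.csv.mtd

DATA = """zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
95141,south,3002,6,3,2,FALSE,929,934
NA,west,1373,,1,3,FALSE,695,698
91312,south,NA,6,2,2,FALSE,902,
94555,NA,1835,3,,3,,888,892
95141,west,2770,5,2.5,,TRUE,812,816
95141,east,2833,6,2.5,2,TRUE,927,
96334,NA,1339,6,3,1,FALSE,672,675
96334,south,2742,6,2.5,2,FALSE,872,876
96334,north,2195,5,2.5,2,FALSE,799,803
"""

# Only the "omit" and "impute" sections of the spec matter for this check.
SPEC = {
    "omit": ["zipcode"],
    "impute": [
        {"name": "district", "method": "constant", "value": "south"},
        {"name": "numbedrooms", "method": "constant", "value": 2},
        {"name": "numbathrooms", "method": "constant", "value": 1},
        {"name": "floors", "method": "constant", "value": 1},
        {"name": "view", "method": "global_mode"},
        {"name": "askingprice", "method": "global_mean"},
    ],
}


def columns_needing_imputation(csv_text, spec):
    """List columns that contain an NA token but have no impute or omit entry."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    handled = {entry["name"] for entry in spec.get("impute", [])}
    handled |= set(spec.get("omit", []))  # assumption: "omit" covers these columns
    return [col for col in (reader.fieldnames or [])
            if col not in handled and any(row[col] in NA_STRINGS for row in rows)]


missing = columns_needing_imputation(DATA, SPEC)
print(missing)
```

For the data and spec in this message, the only column flagged is "sqft".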

So, I added "sqft" to the "impute" section of the spec, using the
"global_mean" method:

{
    "omit": [ "zipcode" ]
   ,"impute":
    [ { "name": "district"    , "method": "constant", "value": "south" }
     ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
     ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
     ,{ "name": "floors"      , "method": "constant", "value": 1 }
     ,{ "name": "view"        , "method": "global_mode" }
     ,{ "name": "askingprice" , "method": "global_mean" }
     ,{ "name": "sqft"        , "method": "global_mean" }
    ]

    ,"recode":
    [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors", "view" ]

    ,"bin":
    [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
    ]

    ,"dummycode":
    [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]

}

This allowed the DML to execute successfully.

So, is "scale" not allowed anymore? And is "bin" allowed (despite the
message suggesting it isn't)?

Thank you,
Deron
