spark-dev mailing list archives

From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: A note about MLlib's StandardScaler
Date Mon, 09 Jan 2017 06:50:18 GMT

Actually, I think a user or developer may need features standardized with
the population mean and standard deviation in some cases. It would be
better if StandardScaler offered an option to do that.
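
To illustrate what such an option could look like, here is a minimal
pure-Python sketch. It is not Spark code: the function name standardize
and its ddof parameter are hypothetical, chosen only for this example.
With ddof=1 the squared deviations are divided by n - 1 (sample std);
with ddof=0 they are divided by n (population std).

```python
import math

def standardize(rows, ddof=1):
    """Column-wise (x - mean) / std for a list of equal-length rows.

    Hypothetical helper, not part of Spark's API.
    ddof=1 -> sample std (divide by n - 1)
    ddof=0 -> population std (divide by n)
    Assumes no column is constant (std would be zero).
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (n - ddof))
        for c, m in zip(cols, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# The two vectors from the documentation example quoted below.
vs = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]]
sample_scaled = standardize(vs, ddof=1)      # entries close to +/-0.7071
population_scaled = standardize(vs, ddof=0)  # entries close to +/-1.0
```

On this input, ddof=1 gives the values reported from Spark in the quoted
message, and ddof=0 gives the values Gilad expected.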



Holden Karau wrote
> Hi Gilad,
> 
> Spark uses the sample standard deviation inside of the StandardScaler (see
> https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler
> ), which I think would explain the results you are seeing. I believe the
> scalers are intended to be used on larger datasets. You can verify this
> yourself by doing the same computation in Python and seeing that scaling
> by the sample deviation results in the values you are getting from Spark.
> 
> Cheers,
> 
> Holden :)
> 
> 
> On Sun, Jan 8, 2017 at 12:06 PM, Gilad Barkan <gilad.barkan@> wrote:
> 
>> Hi
>>
>> It seems that the output of MLlib's StandardScaler(withMean=True,
>> withStd=True) is not as expected.
>>
>> The above configuration is expected to do the following transformation:
>>
>> X -> Y = (X-Mean)/Std  - Eq.1
>>
>> This transformation (a.k.a. Standardization) should result in a
>> "standardized" vector with unit-variance and zero-mean.
>>
>> I'll demonstrate my claim using the current documentation example:
>>
>> >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>> >>> dataset = sc.parallelize(vs)
>> >>> standardizer = StandardScaler(True, True)
>> >>> model = standardizer.fit(dataset)
>> >>> result = model.transform(dataset)
>> >>> for r in result.collect(): print r
>>     DenseVector([-0.7071, 0.7071, -0.7071])
>>     DenseVector([0.7071, -0.7071, 0.7071])
>>
>> This results in std = sqrt(1/2) for each column instead of std = 1.
>>
>> Applying the standardization transformation (Eq.1) to the above 2 vectors
>> results in the following output:
>>
>>     DenseVector([-1.0, 1.0, -1.0])    DenseVector([1.0, -1.0, 1.0])
>>
>>
>> Another example:
>>
>> Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows
>> of DenseVectors:
>> [DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]),
>> DenseVector([2.4, 0.8, 3.5])]
>>
>> The StandardScaler returns the following scaled vectors:
>> [DenseVector([-1.12339, 1.084829, -1.02731]), DenseVector([0.792982,
>> -0.88499, 0.057073]), DenseVector([0.330409, -0.19984, 0.970241])]
>>
>> This result has std = sqrt(2/3).
>>
>> Instead, it should have produced 3 other vectors that have std = 1 for
>> each column.
>>
>> Adding another vector (4 total) results in 4 scaled vectors that have
>> std = sqrt(3/4) instead of std = 1.
>>
>> I hope all the examples help to make my point clear.
>>
>> I hope I am not missing something here.
>>
>> Thank you
>>
>> Gilad Barkan
>>
>>
>>
>>
>>
>>
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
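
The pattern reported in the quoted thread (std = sqrt(1/2), sqrt(2/3),
sqrt(3/4) for 2, 3 and 4 rows) is what one would expect when data scaled
by the sample std is then measured with the population std: the ratio is
sqrt((n - 1) / n). A small check using only the Python standard library,
on the three-vector example from the thread (the variable names are mine,
and this is a sketch of the arithmetic, not of Spark's implementation):

```python
import math
import statistics

# The three rows from the quoted example.
rows = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9], [2.4, 0.8, 3.5]]
n = len(rows)

scaled_columns = []
population_std_of_scaled = []
for col in zip(*rows):
    mean = statistics.mean(col)
    sample_std = statistics.stdev(col)  # divides by n - 1
    scaled = [(x - mean) / sample_std for x in col]
    scaled_columns.append(scaled)
    # Measure the scaled column with the population std (divides by n).
    population_std_of_scaled.append(statistics.pstdev(scaled))

# Expected ratio between the two conventions for n rows.
expected = math.sqrt((n - 1) / n)
```

Each entry of population_std_of_scaled agrees with expected, i.e.
sqrt(2/3) for n = 3, and the gap to 1 shrinks as n grows, which fits the
remark that the scalers are aimed at larger datasets.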





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/A-note-about-MLlib-s-StandardScaler-tp20513p20517.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

