spark-user mailing list archives

From Carlo.Allocca <carlo.allo...@open.ac.uk>
Subject Re: LinearRegressionWithSGD and Rank Features By Importance
Date Mon, 07 Nov 2016 17:12:14 GMT
Hi Masood,

Thank you very much for your insight.
I am going to scale all my features as you described.

As I am a beginner, is there any paper or book that explains the suggested approaches? I
would love to read it.

Many Thanks,
Best Regards,
Carlo





On 7 Nov 2016, at 16:27, Masood Krohy <masood.krohy@intact.net> wrote:

Yes, you would want to scale those features before feeding them into any algorithm. One typical
way is to calculate the mean and standard deviation of each feature, subtract the mean, then
divide by the standard deviation. Dividing by "max - min" is also a good option if you're sure
there is no outlier pushing up your max or pulling down your min significantly for any feature.
After you have scaled each feature, you can feed the data into the algorithm for training.

For prediction on new samples, you need to scale each sample first before making predictions
using your trained model.
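
A minimal plain-Java sketch of the scaling steps described above, with no Spark dependency. The class name `FeatureScaler` and its methods are hypothetical, chosen only for illustration, and the population standard deviation is an assumption made for simplicity. The statistics are learned once from the training rows and then reused unchanged for every new sample:

```java
import java.util.Arrays;

/** Hypothetical helper: per-feature standardization, z = (x - mean) / std. */
final class FeatureScaler {
    private final double[] mean;
    private final double[] std;

    private FeatureScaler(double[] mean, double[] std) {
        this.mean = mean;
        this.std = std;
    }

    /** Learns the mean and population standard deviation of each column of the training rows. */
    static FeatureScaler fit(double[][] rows) {
        int n = rows.length;
        int d = rows[0].length;
        double[] mean = new double[d];
        double[] std = new double[d];
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) mean[j] += row[j];
        }
        for (int j = 0; j < d; j++) mean[j] /= n;
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) {
                double diff = row[j] - mean[j];
                std[j] += diff * diff;
            }
        }
        for (int j = 0; j < d; j++) {
            std[j] = Math.sqrt(std[j] / n);
            if (std[j] == 0.0) std[j] = 1.0; // constant feature: avoid division by zero
        }
        return new FeatureScaler(mean, std);
    }

    /** Scales one sample (training or new) with the statistics learned in fit(). */
    double[] transform(double[] sample) {
        double[] out = new double[sample.length];
        for (int j = 0; j < sample.length; j++) {
            out[j] = (sample[j] - mean[j]) / std[j];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] training = {{1, 100}, {2, 200}, {3, 300}};
        FeatureScaler scaler = FeatureScaler.fit(training);
        // A new sample is scaled with the training statistics before prediction.
        System.out.println(Arrays.toString(scaler.transform(new double[] {3, 300})));
    }
}
```

In the example data the two features differ in scale by a factor of 100, yet after scaling they land on the same range, which is what makes the learned weights comparable.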

It's not too complicated to implement manually, but the Spark API already has some support for this:
ML: http://spark.apache.org/docs/latest/ml-features.html#standardscaler
MLlib: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler

Masood


------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh



From:    Carlo.Allocca <carlo.allocca@open.ac.uk>
To:      Masood Krohy <masood.krohy@intact.net>
Cc:      Carlo.Allocca <carlo.allocca@open.ac.uk>, Mohit Jaggi <mohitjaggi@gmail.com>,
         "user@spark.apache.org" <user@spark.apache.org>
Date:    2016-11-07 10:50
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance

________________________________



Hi Masood,

Thank you very much for the reply. It is a very good point, as I am getting very bad results
so far.

If I understood correctly, what you suggest is to scale the data below (it is part of my dataset)
before applying linear regression with SGD.

Is that correct?

Many Thanks in advance.

Best Regards,
Carlo

<Mail Attachment.png>

On 7 Nov 2016, at 15:31, Masood Krohy <masood.krohy@intact.net> wrote:

If you go down this route (look at actual coefficients/weights), then make sure your features
are scaled first and have more or less the same mean when feeding them into the algo. If not,
then actual coefficients/weights wouldn't tell you much. In any case, SGD performs badly with
unscaled features, so you gain if you scale the features beforehand.
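
As a rough illustration of ranking by coefficients once the features are scaled, the sketch below orders feature indices by the absolute value of their weights. The class `FeatureRanking` and its method are hypothetical names. The weights array is assumed to have been copied out of the trained model (in MLlib this would presumably be something like `model.weights().toArray()`); here it is just a plain `double[]`:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: ranks features of a linear model by absolute weight. */
final class FeatureRanking {

    /** Returns feature indices ordered from largest to smallest absolute weight. */
    static List<Integer> rankByAbsWeight(double[] weights) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            indices.add(i);
        }
        // Sort on |weight| so that strongly negative coefficients also rank high.
        indices.sort(Comparator.comparingDouble((Integer i) -> Math.abs(weights[i])).reversed());
        return indices;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -3.0, 1.2}; // illustrative values, not real model output
        System.out.println(rankByAbsWeight(weights));
    }
}
```

Sorting on the absolute value rather than the signed value is a deliberate choice: a large negative weight influences the prediction as much as a large positive one.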

Masood

------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh



From:    Carlo.Allocca <carlo.allocca@open.ac.uk>
To:      Mohit Jaggi <mohitjaggi@gmail.com>
Cc:      Carlo.Allocca <carlo.allocca@open.ac.uk>,
         "user@spark.apache.org" <user@spark.apache.org>
Date:    2016-11-04 03:39
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance

________________________________



Hi Mohit,

Thank you for your reply.
OK, so it means that coefficients with a high score are more important than those with a low score…

Many Thanks,
Best Regards,
Carlo


> On 3 Nov 2016, at 20:41, Mohit Jaggi <mohitjaggi@gmail.com> wrote:
>
> For linear regression, it should be fairly easy. Just sort the coefficients :)
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>
>
>
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allocca@open.ac.uk> wrote:
>>
>> Hi All,
>>
>> I am using Spark, and in particular the MLlib library.
>>
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>>
>> For my problem I am using LinearRegressionWithSGD and I would like to perform
>> a “Rank Features By Importance” analysis.
>>
>> I checked the documentation and it seems that it does not provide such a method.
>>
>> Am I missing anything? Could you please provide any help on this?
>> Should I change the approach?
>>
>> Many Thanks in advance,
>>
>> Best Regards,
>> Carlo
>>
>>
>> -- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity
>> in England & Wales and a charity registered in Scotland (SC 038302). The Open University
>> is authorised and regulated by the Financial Conduct Authority.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>







