spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
Date Wed, 02 Jul 2014 09:09:24 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765 ]

Xiangrui Meng edited comment on SPARK-2341 at 7/2/14 9:09 AM:
--------------------------------------------------------------

It is a little awkward to have both `regression` and `multiclass` as input arguments. I agree
that a more accurate name would be `multiclassOrRegression` or `multiclassOrContinuous`, but
either is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, any label with
value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. So it works for both +1/-1
and 1/0 cases. If true, the double value parsed directly from the label string will be used
as the label value.
{code}

It would be good to improve the documentation to make this clearer. But for the API,
I don't feel that a change is necessary.
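
To illustrate the behaviour the doc text describes, here is a minimal Python sketch. It is not the MLlib implementation; `parse_label` is a hypothetical helper written only to mirror the quoted rule (threshold at 0.5 when `multiclass` is false, raw double otherwise).

```python
def parse_label(label_str: str, multiclass: bool = False) -> float:
    """Map a raw label string to a float, following the quoted doc text.

    Hypothetical helper for illustration; not part of Spark.
    """
    value = float(label_str)
    if multiclass:
        # multiclass = true: the parsed double is used directly as the label.
        return value
    # multiclass = false: labels greater than 0.5 map to 1.0, all others to 0.0,
    # which covers both the +1/-1 and the 1/0 conventions.
    return 1.0 if value > 0.5 else 0.0
```

Under this rule a continuous target such as `3.75` survives only when `multiclass` is true; with `multiclass` false it collapses to `1.0`. That is why the flag also serves regression data, even though its name does not say so.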



was (Author: mengxr):
It is a little awkward to have both `regression` and `multiclass` as input arguments. I agree
that a more accurate name would be `multiclassOrRegression`, but it is certainly too long. We tried
to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, any label with
value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. So it works for both +1/-1
and 1/0 cases. If true, the double value parsed directly from the label string will be used
as the label value.
{code}

It would be good to improve the documentation to make this clearer. But for the API,
I don't feel that a change is necessary.


> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
>                 Key: SPARK-2341
>                 URL: https://issues.apache.org/jira/browse/SPARK-2341
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Eustache
>            Priority: Minor
>              Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
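
The proposed fix can be sketched roughly as follows. This is a Python illustration under stated assumptions, not Spark code: `regression_label_parser`, `binary_label_parser`, and `parse_libsvm_line` are hypothetical names, and the line layout assumed is the usual LibSVM one (`label index:value ...` with 1-based indices).

```python
from typing import Callable, List, Tuple


def regression_label_parser(label_str: str) -> float:
    # Hypothetical regression parser: keep the target as a plain double.
    return float(label_str)


def binary_label_parser(label_str: str) -> float:
    # Binary rule from the doc text: greater than 0.5 maps to 1.0, else 0.0.
    return 1.0 if float(label_str) > 0.5 else 0.0


def parse_libsvm_line(
    line: str, label_parser: Callable[[str], float]
) -> Tuple[float, List[Tuple[int, float]]]:
    """Parse one 'label index:value ...' line with a pluggable label parser."""
    tokens = line.strip().split()
    label = label_parser(tokens[0])
    features = []
    for token in tokens[1:]:
        index, value = token.split(":")
        # Convert the 1-based LibSVM index to a 0-based one.
        features.append((int(index) - 1, float(value)))
    return label, features
```

With the regression parser, a line such as `24.0 1:0.5 3:-1.25` keeps its target of `24.0`, whereas the binary parser would reduce it to `1.0`.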



--
This message was sent by Atlassian JIRA
(v6.2#6252)
