hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before feature selection
Date Thu, 29 Mar 2018 22:25:00 GMT

     [ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Description: 
In machine learning and statistics, feature selection is a useful technique for choosing a subset
of relevant features during model construction, both to simplify models and to shorten training
times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html),
but this selection becomes a time-consuming process when the training data has a large number
of columns (the number can easily exceed 1,000 in business use cases).
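
For reference, the kind of filtering that `VarianceThreshold` performs can also be expressed
directly with the Spark DataFrame API. The sketch below is illustrative only; the helper name,
`numericCols`, and `threshold` are assumptions for this example, not part of Hivemall or Spark:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, variance}

    // Hypothetical helper: drop numeric columns whose variance does not exceed
    // `threshold` (a DataFrame-level analogue of scikit-learn's VarianceThreshold).
    def dropLowVarianceColumns(df: DataFrame, numericCols: Seq[String],
                               threshold: Double): DataFrame = {
      // Compute the variance of every candidate column in a single pass.
      val variances = df.select(numericCols.map(c => variance(col(c)).as(c)): _*).head()
      // Keep only the columns whose variance is above the threshold.
      val kept = numericCols.filter(c => variances.getAs[Double](c) > threshold)
      df.select(kept.map(col): _*)
    }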

The objective of this ticket is to add new optimizer rules in Spark to filter out meaningless
columns before feature selection. As a simple example, Spark might be able to filter out
columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn)
by implicitly adding a `Project` node on top of a user plan. The Spark optimizer could then
push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), which could make plan
execution significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
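
A minimal sketch of what such a rule could look like is shown below. `PruneLowVarianceColumns`
and the precomputed `columnVariances` map are assumptions for illustration only, not an existing
Hivemall or Spark API, and the rules actually proposed under this ticket may differ:

    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical rule: wrap the top of a resolved plan in a Project that keeps
    // only the columns whose (precomputed) variance exceeds a threshold.
    // `columnVariances` is assumed to be collected beforehand, e.g. from table
    // statistics; Spark does not provide such a map out of the box.
    case class PruneLowVarianceColumns(columnVariances: Map[String, Double],
                                       threshold: Double) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan match {
        // Rewrite the top of the plan once; the guard keeps the rule idempotent.
        case p if p.resolved && !p.isInstanceOf[Project] =>
          val kept = p.output.filter { attr =>
            columnVariances.get(attr.name).forall(_ > threshold)
          }
          if (kept.size < p.output.size) Project(kept, p) else p
        case other => other
      }
    }

    // Usage sketch: one way to inject a custom rule into the Spark optimizer
    // (`spark` is an active SparkSession, as in the spark-shell). Where such a
    // rule must sit so that the pruning is actually pushed down into leaf nodes
    // such as `LogicalRelation` is part of what this ticket explores.
    spark.experimental.extraOptimizations ++= Seq(
      PruneLowVarianceColumns(Map("almost_constant_col" -> 1e-8), threshold = 1e-6))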

I will create pull requests as sub-tasks and track relevant activities (papers and functionality
in other OSS projects) in this ticket.

References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?:
Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning
high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379,
2017. 


> Plan rewriting rules to filter out meaningless columns before feature selection
> ------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
