hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVEMALL-181) Plan rewrting rules to filter out meaningful training data before future selections
Date Wed, 04 Apr 2018 00:49:00 GMT

     [ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Summary: Plan rewrting rules to filter out meaningful training data before future selections
 (was: Plan rewrting rules to filter out meaningless columns before future selections)

> Plan rewrting rules to filter out meaningful training data before future selections
> -----------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique for choosing
> a subset of relevant features during model construction, which simplifies models and shortens
> training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html),
> but this selection is too time-consuming if the training data have a large number of columns
> (the number frequently exceeds 1,000 in business use cases).
> The objective of this ticket is to add new optimizer rules to Spark that filter out meaningless
> columns before feature selection. As a simple example, Spark might be able to filter out
> columns with low variance (this corresponds to `VarianceThreshold` in scikit-learn)
> by implicitly adding a `Project` node at the top of a user plan.
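
A minimal sketch of the low-variance filtering described above, in plain Python rather than Spark. The helper names `low_variance_columns` and `project` and the sample rows are hypothetical and only illustrate the idea: compute each column's variance, then keep the surviving columns with a projection. Population variance and the "drop if variance <= threshold" rule are assumptions chosen to resemble scikit-learn's `VarianceThreshold` default behavior.

```python
from statistics import pvariance


def low_variance_columns(rows, threshold=0.0):
    """Return the names of columns whose population variance is <= threshold."""
    if not rows:
        return []
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) <= threshold]


def project(rows, keep):
    """Keep only the listed columns in every row (a stand-in for a `Project` node)."""
    return [{n: r[n] for n in keep} for r in rows]


# Hypothetical training data: `const` never changes, so it carries no signal.
rows = [
    {"age": 23.0, "const": 1.0, "income": 40.0},
    {"age": 35.0, "const": 1.0, "income": 52.0},
    {"age": 41.0, "const": 1.0, "income": 61.0},
]

dropped = low_variance_columns(rows)
kept = [n for n in rows[0] if n not in dropped]
pruned = project(rows, kept)
```

Here `dropped` contains only `const`, and `pruned` holds the same rows with that column removed. In Spark the variance computation itself would require a pass over the data (or existing column statistics), which this sketch does not model.
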
> Then, the Spark optimizer might push this `Project` node down into leaf nodes (e.g.,
> `LogicalRelation`), and plan execution could be significantly faster. Moreover, more sophisticated
> techniques have been proposed in [1, 2].
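
The pushdown step can be pictured with a toy plan tree. This is a sketch under stated assumptions, not Spark's Catalyst API: the `Relation`, `Filter`, and `Project` classes and the `push_down` function are hypothetical stand-ins, and the filter predicate is reduced to the single column it reads.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """Leaf node: a scan that reads the listed columns from storage."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Filter:
    """Keeps rows by a predicate over `column` (the predicate itself is omitted)."""
    column: str
    child: "Plan"


@dataclass(frozen=True)
class Project:
    """Keeps only the listed columns of its child."""
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Relation, Filter, Project]


def push_down(plan):
    """Push a top-level Project as far toward the leaf as this toy model allows."""
    if not isinstance(plan, Project):
        return plan
    return _prune(plan.child, plan.columns)


def _prune(plan, needed):
    if isinstance(plan, Relation):
        # Read only the needed columns, preserving the source column order.
        return Relation(tuple(c for c in plan.columns if c in needed))
    if isinstance(plan, Filter):
        # The column the filter reads must still exist below the filter.
        extra = () if plan.column in needed else (plan.column,)
        new_filter = Filter(plan.column, _prune(plan.child, needed + extra))
        # If the filter needed an extra column, drop it again above the filter.
        return Project(needed, new_filter) if extra else new_filter
    if isinstance(plan, Project):
        # Collapse nested projections to the columns both of them keep.
        return _prune(plan.child, tuple(c for c in needed if c in plan.columns))
    raise TypeError(f"unsupported plan node: {plan!r}")


# Hypothetical user plan with an implicitly added Project on top.
plan = Project(("age", "income"),
               Filter("age", Relation(("age", "const", "income", "zip"))))
optimized = push_down(plan)
```

After the rewrite, the scan in `optimized` reads only `age` and `income`, so the unused columns are never loaded. That is the source of the speedup the description anticipates when the data source supports column pruning.
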
> I will make pull requests as sub-tasks and put relevant activities (papers and other
> OSS functionalities) in this ticket to track them.
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to
> Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when
> learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11, Issue 3,
> Pages 366-379, 2017.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
