hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <>
Subject [jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
Date Thu, 26 Apr 2018 08:28:00 GMT


Takeshi Yamamuro updated HIVEMALL-181:
    Attachment: fig1.png

> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>                 Key: HIVEMALL-181
>                 URL:
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
> In machine learning and statistics, feature selection is a useful technique for choosing
a subset of relevant features during model construction, which simplifies models and shortens
training times. scikit-learn has some APIs for feature selection ([]),
but this selection is too time-consuming when the training data has a large number of columns
(the number frequently exceeds 1,000 in business use cases).
> The objective of this ticket is to add new optimizer rules in Spark that filter meaningful
training data before feature selection. As a simple example, Spark might be able to
filter out columns with low variance (this corresponds to `VarianceThreshold`
in scikit-learn) by implicitly adding a `Project` node on top of a user plan. The
Spark optimizer might then push this `Project` node down into leaf nodes (e.g., `LogicalRelation`),
so that plan execution could be significantly faster. Moreover, more sophisticated techniques
have been proposed in [1, 2].
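> As a rough illustration of the low-variance filtering idea only (not the proposed Spark
rule itself), the sketch below uses plain Python. The helper names `low_variance_filter` and
`project`, the sample rows, and the threshold value are all hypothetical; the variance is
the population variance, and a column is kept only when its variance is strictly greater
than the threshold.

```python
from statistics import pvariance


def low_variance_filter(rows, threshold=0.0):
    """Return the names of columns whose population variance exceeds `threshold`.

    `rows` is a list of dicts mapping column name -> numeric value
    (a hypothetical stand-in for a training table).
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    return [c for c in columns if pvariance([row[c] for row in rows]) > threshold]


def project(rows, columns):
    """Keep only `columns` in every row, mimicking what a projection would do."""
    return [{c: row[c] for c in columns} for row in rows]


# Hypothetical training data: "b" is constant and "c" barely varies.
rows = [
    {"a": 1.0, "b": 0.0, "c": 5.0},
    {"a": 2.0, "b": 0.0, "c": 5.1},
    {"a": 3.0, "b": 0.0, "c": 4.9},
]

kept = low_variance_filter(rows, threshold=0.01)
pruned = project(rows, kept)
print(kept)
```

In this sketch the pruning happens after all rows are loaded; the point of the ticket is that
an optimizer rule could apply the equivalent projection early, so the dropped columns would
never need to be read.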
> I will open pull requests as sub-tasks and record related work (papers and functionality
in other OSS projects) in this ticket to track it.
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not
to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid
when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue
3, Pages 366-379, 2017.

This message was sent by Atlassian JIRA
