hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
Date Thu, 26 Apr 2018 08:53:00 GMT

     [ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Description: 
In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant data during model construction, both to simplify models and to shorten training times; for example, scikit-learn provides several APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). However, this selection can be very time-consuming when the training data have a large number of columns and rows (for example, the number of columns frequently goes over 1,000 in real business use cases).

The objective of this ticket is to implement plan rewriting rules in Spark Catalyst that filter training data down to its meaningful part before feature selection. We assume the following workflow from data extraction to model training:

!fig1.png!
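For concreteness, a minimal sketch of the assumed extraction step is shown below; the paths, join keys, and data sources are hypothetical and only mirror the shape of the figure:

```scala
import org.apache.spark.sql.SparkSession

object ExtractionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("extraction-sketch").getOrCreate()

    // Hypothetical input relations living in different data sources;
    // the paths and join keys below are illustrative only.
    val r1 = spark.read.parquet("hdfs:///warehouse/R1")
    val r2 = spark.read.parquet("s3a://bucket/R2")
    val r3 = spark.read.parquet("hdfs:///warehouse/R3")

    // Join and project the inputs into the raw training data R(v1, v2, v3, v4),
    // which sampling and feature selection then consume.
    val raw = r1.join(r2, "k1").join(r3, "k2").select("v1", "v2", "v3", "v4")
    raw.show()
  }
}
```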

In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) stored in various data sources (HDFS, S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling and feature selection are applied to it. In real business use cases, raw training data often have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out such meaningless data in the data-extraction phase, both the data extraction itself and the following feature selection would run more efficiently. In the example above, we actually need not join the relation R3, because all of its columns are filtered out during feature selection. Also, the join processing would be faster if we could sample data directly from the input data (R1 and R2). The optimized workflow is as follows:

!fig2.png!

This optimization might be achieved by rewriting the plan tree for data extraction as follows:

!fig3.png!
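As a rough illustration of the rewrite in the figure, a Catalyst rule of the following shape could drop a join side that the projection never references. This is only a hypothetical sketch: the extra conditions needed for correctness (e.g., a key-foreign key join that neither filters nor duplicates left-side rows, see [1, 2]) are omitted.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch: if a Project on top of an inner join references no
// attributes from the join's right side, drop the right side. This is only
// safe under extra conditions (e.g., a key-foreign key join that neither
// filters nor duplicates left-side rows [1, 2]), which are omitted here.
object EliminateUnreferencedJoinSide extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p @ Project(projectList, Join(left, _, Inner, _))
        if AttributeSet(projectList.flatMap(_.references)).subsetOf(left.outputSet) =>
      p.copy(child = left)
  }
}
```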

Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework for collecting statistics on input data in data sources, the major task of this ticket is to add plan rewriting rules that filter training data down to its meaningful part before feature selection.
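For reference, such rules can be injected through SparkSessionExtensions so that they end up in the optimizer's extendedOperatorOptimizationRules batch; a minimal registration sketch (with a no-op placeholder rule, and a class name chosen only for illustration) might look like this:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// No-op placeholder; the real filtering logic is what this ticket will add.
object FilterTrainingDataRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Optimizer rules injected here are appended to extendedOperatorOptimizationRules.
class HivemallExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(_ => FilterTrainingDataRule)
  }
}

// Enabled at session build time, e.g. via the configuration
//   spark.sql.extensions=<package>.HivemallExtensions
```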

As a fairly simple first task, Spark could have a rule that filters out columns with low variance (this corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. The Spark optimizer could then push this `Project` node down into the leaf nodes (e.g., `LogicalRelation`), which could make plan execution significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2, 3].
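To make the intended effect concrete, here is a DataFrame-level sketch of such low-variance pruning (mirroring `VarianceThreshold`); the actual optimizer rule would perform the equivalent transformation inside Catalyst by injecting the `Project` node:

```scala
import org.apache.spark.sql.{DataFrame, functions => F}

// Sketch of the intended effect at the DataFrame level: keep only numeric
// columns whose sample variance exceeds a threshold, analogous to
// scikit-learn's VarianceThreshold. The optimizer rule would instead inject
// an equivalent Project node so the pruning can be pushed down to the leaves.
def dropLowVarianceColumns(df: DataFrame, cols: Seq[String], threshold: Double): DataFrame = {
  // Compute the variance of every candidate column in a single pass.
  val variances = df.select(cols.map(c => F.variance(F.col(c)).alias(c)): _*).head()
  val kept = cols.filter(c => variances.getAs[Double](c) > threshold)
  df.select(kept.map(F.col): _*)
}
```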

I will make pull requests as sub-tasks and track related activities (research and other OSS functionality) in this ticket.

 

References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu. To Join or Not to Join?: Thinking Twice about Joins before Feature Selection. Proceedings of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu. Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? Proceedings of the VLDB Endowment, Volume 11, Issue 3, Pages 366-379, 2017.
[3] Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. Random Sampling over Joins Revisited. Proceedings of SIGMOD, 2018.



> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>         Attachments: fig1.png, fig2.png, fig3.png
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
