hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <>
Subject [jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
Date Thu, 26 Apr 2018 08:52:00 GMT


Takeshi Yamamuro updated HIVEMALL-181:
    Attachment:     (was: fig2.png)

> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>                 Key: HIVEMALL-181
>                 URL:
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>         Attachments: fig1.png, fig2.png, fig3.png
> In machine learning and statistics, feature selection is a useful technique for choosing
a subset of relevant data when constructing a model; it simplifies models and shortens
training times. For example, scikit-learn has some APIs for feature selection ([]).
But this selection is a very time-consuming process if the training data have a large number of
columns and rows (for example, the number of columns frequently goes over 1,000 in real
business use cases).
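> As a concrete illustration of this kind of selection, the plain-Python sketch below mimics the idea behind scikit-learn's `VarianceThreshold`: columns whose variance does not exceed a threshold are dropped. The data and the threshold here are made up for illustration:

```python
# Minimal sketch of variance-threshold feature selection, mirroring the
# idea behind scikit-learn's VarianceThreshold (pure Python, no deps).

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_features(rows, threshold=0.0):
    """Return the indices of columns whose variance exceeds `threshold`."""
    n_cols = len(rows[0])
    cols = [[row[i] for row in rows] for i in range(n_cols)]
    return [i for i, col in enumerate(cols) if variance(col) > threshold]

# Hypothetical training data: column 1 is constant, so it is dropped.
data = [[1.0, 5.0, 0.1],
        [2.0, 5.0, 0.2],
        [3.0, 5.0, 0.9]]
print(select_features(data))  # -> [0, 2]; column 1 has zero variance
```

> Real feature selection over wide tables applies this kind of test to every column, which is why doing it after a full data extraction is expensive.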
> An objective of this ticket is to implement plan rewriting rules in Spark Catalyst to
filter meaningful training data before feature selection. We assume the workflow below, from
data extraction to model training:
> !fig1.png!
> In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the
figure, by joining and projecting input data (R1, R2, and R3) from various data sources (HDFS,
S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling
and feature selection are applied to them. In real business use cases, it sometimes happens that
raw training data have many meaningless columns for historical reasons (e.g., redundant
schema designs). So, if we could filter out these meaningless data in the data extraction phase,
both the data extraction itself and the subsequent feature selection would run more efficiently.
In the example above, we do not actually need to join the relation R3, because all the columns in
that relation are filtered out by feature selection. Also, the join processing should be faster
if we could sample data directly from the input data (R1 and R2). The optimized workflow is
as follows:
> !fig2.png!
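> To see why the join with R3 can be skipped safely, here is a toy sketch (the relation contents, join keys, and column names are hypothetical): when feature selection keeps no columns from R3, projecting the full three-way join and projecting the two-way join yield the same result.

```python
# Toy illustration: if feature selection keeps no columns from R3,
# the join R1 |><| R2 |><| R3 can be replaced by R1 |><| R2.
# Relations are dicts keyed by a shared join key (hypothetical data).

r1 = {1: {"v1": 0.5}, 2: {"v1": 0.7}}
r2 = {1: {"v2": 1.0}, 2: {"v2": 2.0}}
r3 = {1: {"v3": 9.9}, 2: {"v3": 9.9}}  # constant column -> dropped by selection

def join(*relations):
    """Inner-join relations on their shared key, merging row columns."""
    keys = set(relations[0])
    for r in relations[1:]:
        keys &= set(r)
    return {k: {c: v for r in relations for c, v in r[k].items()}
            for k in sorted(keys)}

def project(rel, cols):
    """Keep only the selected columns in every row."""
    return {k: {c: v for c, v in row.items() if c in cols}
            for k, row in rel.items()}

selected = {"v1", "v2"}  # columns surviving feature selection

full = project(join(r1, r2, r3), selected)
pruned = project(join(r1, r2), selected)
assert full == pruned  # skipping the join with R3 is safe
print(full)
```

> (This equivalence holds here because the join with R3 neither adds nor removes rows; the papers in [1, 2] study when such join avoidance is safe in general.)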
> This optimization might be achieved by rewriting a plan tree for data extraction as follows:
> !fig3.png!
> Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules)
and a framework to collect data statistics for input data in data sources, the major task
of this ticket is to add plan rewriting rules that filter meaningful training data before feature
selection.
> As a pretty simple task, Spark might have a rule to filter out columns with low variances
(this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding
a `Project` node on top of a user plan. Then, the Spark optimizer might push down this
`Project` node into leaf nodes (e.g., `LogicalRelation`), and the plan execution could be significantly
faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
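> The real rule would be a Scala transform over Catalyst's LogicalPlan; the plain-Python toy below only sketches the shape of such a `Project` pushdown (all node classes and plan contents here are made up for illustration):

```python
# Toy sketch of Project pushdown: the column set requested at the top of
# the plan is pushed through joins into the leaf scans, so each scan only
# reads the columns that survive. (Real Catalyst rules are Scala
# transforms on LogicalPlan; these classes are illustrative only.)

class Relation:
    def __init__(self, name, columns):
        self.name, self.columns = name, list(columns)
    def __repr__(self):
        return f"Scan({self.name}, cols={self.columns})"

class Join:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __repr__(self):
        return f"Join({self.left!r}, {self.right!r})"

class Project:
    def __init__(self, columns, child):
        self.columns, self.child = set(columns), child
    def __repr__(self):
        return f"Project({sorted(self.columns)}, {self.child!r})"

def push_down(plan, needed=None):
    """Push the projected column set toward the leaf scans."""
    if isinstance(plan, Project):
        # The top-level Project is absorbed into the scans below it.
        return push_down(plan.child, plan.columns)
    if isinstance(plan, Join):
        return Join(push_down(plan.left, needed),
                    push_down(plan.right, needed))
    if isinstance(plan, Relation) and needed is not None:
        return Relation(plan.name, [c for c in plan.columns if c in needed])
    return plan

plan = Project(["v1", "v3"],
               Join(Relation("R1", ["v1", "v2"]),
                    Relation("R2", ["v3", "v4"])))
print(push_down(plan))  # -> Join(Scan(R1, cols=['v1']), Scan(R2, cols=['v3']))
```

> A full version of this rule would also drop a join input whose scan keeps no columns at all (the R3 case above), which is where the join-avoidance results in [1, 2] come in.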
> I will make pull requests as sub-tasks and track relevant activities (papers and other
OSS functionality) in this ticket.
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not
to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid
when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue
3, Pages 366-379, 2017.

This message was sent by Atlassian JIRA
