hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
Date Wed, 04 Apr 2018 02:06:00 GMT

     [ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-184:
--------------------------------------
    Description: 
Mutual Information (MI) quantifies the dependency between variables,
so it is useful for filtering out columns in feature selection. Nearest-neighbor distances
are frequently used to estimate MI [1], so we could use these distances to compute MI between
the columns of each relation when running an ANALYZE command. Then, we could filter out "similar"
columns in the optimizer phase by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
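
The filtering idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed optimizer rule: it uses a simple plug-in (count-based) MI estimate for discrete columns in place of the nearest-neighbor estimator of [1], and the names `mutual_information` and `filter_similar_columns`, as well as the greedy keep-or-drop policy and the reading of the threshold as an upper bound in nats, are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) for two equal-length discrete columns.

    Assumption: a count-based stand-in for the nearest-neighbor estimator.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("columns must be non-empty and of equal length")
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    # I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def filter_similar_columns(table, threshold):
    """Keep a column only if its MI with every kept column is <= threshold.

    `table` maps column names to value lists (hypothetical layout); columns
    are visited in insertion order, so earlier columns win over later ones.
    """
    kept = []
    for name, values in table.items():
        if all(mutual_information(values, table[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

For example, with `table = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}` and a threshold of `0.5`, column `b` duplicates `a` (MI = ln 2 ≈ 0.693) and is dropped, while `c` is independent of `a` (MI = 0) and is kept, so the call returns `['a', 'c']`.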

As a separate issue, we need to consider a lightweight way to update MI when re-running an ANALYZE
command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data.

[1] Dafydd Evans, A computationally efficient estimator for mutual information.
In Proceedings of the Royal Society of London A: Mathematical, Physical
and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
[2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information
Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.




> Add an optimizer rule to filter out columns by using Mutual Information
> -----------------------------------------------------------------------
>
>                 Key: HIVEMALL-184
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-184
>             Project: Hivemall
>          Issue Type: Sub-task
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
