[ https://issues.apache.org/jira/browse/HIVEMALL184?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
Takeshi Yamamuro updated HIVEMALL184:

Description:
Mutual Information (MI) is a measure that detects and quantifies dependencies between variables,
so it is useful for filtering out columns in feature selection. Nearest-neighbor distances
are frequently used to estimate MI [1], so we could use these distances to compute MI between
the columns of each relation when running an ANALYZE command. Then, we could filter out "similar"
columns in the optimizer phase by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
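
To make the filtering idea concrete, here is a minimal sketch in plain Python. It is not the proposed implementation: it uses the simple plug-in (frequency-count) MI estimator for discrete columns rather than the nearest-neighbor estimator of [1], and the names `mutual_information`, `redundant_columns`, and the `threshold` parameter are hypothetical, with `threshold` standing in for the config key suggested above.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) for two equal-length discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def redundant_columns(table, threshold):
    """Return the columns to drop: for each pair of columns whose MI exceeds
    the threshold, keep the earlier column and drop the later one."""
    dropped = set()
    for a, b in combinations(list(table), 2):
        if a in dropped or b in dropped:
            continue
        if mutual_information(table[a], table[b]) > threshold:
            dropped.add(b)
    return dropped


# Toy relation: "b" is a deterministic function of "a"; "c" is independent of "a".
table = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [1, 1, 0, 0, 1, 0, 1, 0],
    "c": [0, 1, 0, 1, 1, 0, 0, 1],
}

print(redundant_columns(table, threshold=0.5))
```

In this toy relation, MI between "a" and "b" equals the entropy of "a" (log 2 nats), so "b" is flagged as redundant, while "c" carries no information about "a" and is kept. A real optimizer rule would read precomputed MI values from the statistics collected by ANALYZE instead of scanning the data.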
Separately, we need to consider a lightweight way to update MI when rerunning an ANALYZE
command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data.
[1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings
of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464.
The Royal Society, 1203–1215, 2008.
[2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation
on Static and Dynamic Data, Proceedings of EDBT, 2018.
> Add an optimizer rule to filter out columns by using Mutual Information
> 
>
> Key: HIVEMALL184
> URL: https://issues.apache.org/jira/browse/HIVEMALL184
> Project: Hivemall
> Issue Type: Subtask
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark

