hivemall-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <>
Subject [jira] [Created] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
Date Wed, 04 Apr 2018 02:06:00 GMT
Takeshi Yamamuro created HIVEMALL-184:

             Summary: Add an optimizer rule to filter out columns by using Mutual Information
                 Key: HIVEMALL-184
             Project: Hivemall
          Issue Type: Sub-task
            Reporter: Takeshi Yamamuro
            Assignee: Takeshi Yamamuro

Mutual Information (MI) quantifies the dependency between variables, which makes it useful
for filtering out columns in feature selection. Nearest-neighbor distances are frequently used
to estimate MI [1], so we could use them to compute the MI between the columns of each relation
when running an ANALYZE command. Then, in the optimizer phase, we could filter out "similar"
columns by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
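
The filtering idea can be illustrated with a minimal Python sketch. It makes several assumptions that are not part of this issue: columns are discrete, MI is computed with a simple plug-in (count-based) estimator rather than the nearest-neighbor estimator of [1], and the names `mutual_information` and `filter_columns` are hypothetical helpers, not Hivemall or Spark APIs.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two equal-length discrete columns.

    Assumption: a count-based estimator stands in for the
    nearest-neighbor estimator cited in [1].
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def filter_columns(table, threshold):
    """Greedily keep a column only if its MI with every kept column is below threshold."""
    kept = []
    for name, values in table.items():
        if all(mutual_information(values, table[k]) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical relation: "b" duplicates "a", while "c" is independent of "a".
table = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 1],
    "c": [0, 1, 0, 1, 1, 0, 0, 1],
}
print(filter_columns(table, threshold=0.5))
```

In this sketch the threshold plays the role that the proposed configuration value would play in the optimizer: a column whose MI with an already kept column reaches the threshold is treated as "similar" and dropped.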

As a separate issue, we need to consider a lightweight way to update MI when re-running an ANALYZE
command. A recent study [2] proposed a sophisticated technique for computing MI on dynamic data.
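
To make the update problem concrete, here is a deliberately simple Python sketch, not the technique of [2]. It assumes discrete columns and keeps joint and marginal counts, so inserted or deleted rows adjust the counts and MI can be recomputed without rescanning the relation. The class name `IncrementalMI` and its methods are hypothetical.

```python
import math
from collections import Counter

class IncrementalMI:
    """Maintains counts for two discrete columns so MI can be refreshed cheaply.

    Assumption: a count-based plug-in estimator, not the estimator from [2].
    """

    def __init__(self):
        self.n = 0
        self.joint = Counter()
        self.px = Counter()
        self.py = Counter()

    def add(self, x, y):
        """Account for a newly inserted row."""
        self.n += 1
        self.joint[(x, y)] += 1
        self.px[x] += 1
        self.py[y] += 1

    def remove(self, x, y):
        """Account for a deleted row; the row must have been added before."""
        if self.joint[(x, y)] <= 0:
            raise ValueError("row was never added")
        self.n -= 1
        self.joint[(x, y)] -= 1
        self.px[x] -= 1
        self.py[y] -= 1

    def value(self):
        """Current MI estimate in nats, computed from the counts only."""
        if self.n == 0:
            return 0.0
        return sum(
            (c / self.n) * math.log(c * self.n / (self.px[x] * self.py[y]))
            for (x, y), c in self.joint.items()
            if c > 0  # skip pairs whose rows were all removed
        )

stats = IncrementalMI()
for row in [(0, 0), (1, 1), (0, 0), (1, 1)]:
    stats.add(*row)
print(stats.value())  # perfectly dependent columns: log(2)
```

Recomputing the value still costs time proportional to the number of distinct value pairs, so this only avoids the table scan; the approach in [2] targets continuous data and more efficient updates.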

[1] Dafydd Evans, A computationally efficient estimator for mutual information,
Proceedings of the Royal Society of London A: Mathematical, Physical
and Engineering Sciences, Vol. 464, 1203–1215, 2008.
[2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information
Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
