hivemall-issues mailing list archives

From "Makoto Yui (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVEMALL-44) Support Top-K joins for DataFrame/Spark
Date Wed, 08 Nov 2017 07:45:00 GMT

     [ https://issues.apache.org/jira/browse/HIVEMALL-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Makoto Yui updated HIVEMALL-44:
-------------------------------
    Fix Version/s: 0.5.0

> Support Top-K joins for DataFrame/Spark
> ---------------------------------------
>
>                 Key: HIVEMALL-44
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-44
>             Project: Hivemall
>          Issue Type: New Feature
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Blocker
>              Labels: Spark
>             Fix For: 0.5.0
>
>
> In Hivemall, `each_top_k` is useful in many practical use cases. However, some cases
require joining tables and then computing the Top-K entries. Such a query can be computed
with a regular join followed by `each_top_k`, but there is room for improvement: the Top-K
entries can be computed while the join is being processed. This optimization avoids a
substantial amount of I/O for joins.
> An example query is as follows:
> {code}
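> // Assumes a SparkSession named `spark` is in scope (e.g. in spark-shell)
> import spark.implicits._                       // for toDF
> import org.apache.spark.sql.functions._        // for sqrt, pow, lit
> import org.apache.spark.sql.hive.HivemallOps._ // Hivemall DataFrame operators
> val inputDf = Seq(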
>   ("user1", 1, 0.3, 0.5),
>   ("user2", 2, 0.1, 0.1),
>   ("user3", 3, 0.8, 0.0),
>   ("user4", 1, 0.9, 0.9),
>   ("user5", 3, 0.7, 0.2),
>   ("user6", 1, 0.5, 0.4),
>   ("user7", 2, 0.6, 0.8)
> ).toDF("userId", "group", "x", "y")
> val masterDf = Seq(
>   (1, "pos1-1", 0.5, 0.1),
>   (1, "pos1-2", 0.0, 0.0),
>   (1, "pos1-3", 0.3, 0.3),
>   (2, "pos2-3", 0.1, 0.3),
>   (2, "pos2-3", 0.8, 0.8),
>   (3, "pos3-1", 0.1, 0.7),
>   (3, "pos3-1", 0.7, 0.1),
>   (3, "pos3-1", 0.9, 0.0),
>   (3, "pos3-1", 0.1, 0.3)
> ).toDF("group", "position", "x", "y")
> // Compute top-1 rows for each group
> val distance = sqrt(
>   pow(inputDf("x") - masterDf("x"), lit(2.0)) +
>   pow(inputDf("y") - masterDf("y"), lit(2.0))
> )
> val top1Df = inputDf.join_top_k(
>   lit(1), masterDf, inputDf("group") === masterDf("group"),
>   distance.as("score")
> )
> {code}
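
To make the intended semantics concrete, here is a minimal sketch in plain Scala collections (no Spark or Hivemall dependency) of computing Top-K entries while the join is processed. It is an illustration only, not the Hivemall implementation: `Input`, `Master`, `TopKJoinSketch`, and `joinTopK` are hypothetical names, and the sketch assumes that a positive `k` keeps the rows with the largest scores. Each left row keeps a bounded heap of at most `k` matches, so the full join result is never materialized.

```scala
import scala.collection.mutable

// Hypothetical row types mirroring the columns of inputDf and masterDf above.
final case class Input(userId: String, group: Int, x: Double, y: Double)
final case class Master(group: Int, position: String, x: Double, y: Double)

object TopKJoinSketch {
  // Euclidean distance, the same score expression as in the example query.
  def distance(a: Input, b: Master): Double =
    math.sqrt(math.pow(a.x - b.x, 2.0) + math.pow(a.y - b.y, 2.0))

  // For each input row, keep only the k highest-scoring master rows that
  // share its group key (assumption: positive k means "largest scores").
  def joinTopK(k: Int, inputs: Seq[Input], masters: Seq[Master]): Seq[(Input, Master, Double)] = {
    val byGroup: Map[Int, Seq[Master]] = masters.groupBy(_.group)
    // Reversed ordering turns the queue into a min-heap on the score,
    // so dequeue() evicts the currently smallest retained score.
    val minFirst = Ordering.by[(Double, Master), Double](_._1).reverse
    inputs.flatMap { in =>
      val heap = mutable.PriorityQueue.empty[(Double, Master)](minFirst)
      for (m <- byGroup.getOrElse(in.group, Seq.empty)) {
        heap.enqueue((distance(in, m), m))
        if (heap.size > k) heap.dequeue() // never hold more than k candidates
      }
      // Emit the retained matches in descending score order.
      heap.toList.sortBy(-_._1).map { case (score, m) => (in, m, score) }
    }
  }
}
```

With this approach each left row holds at most `k` candidates at any time, whereas a regular join followed by `each_top_k` first produces every matching pair and only then discards all but `k` of them.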



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
