hive-user mailing list archives

From goun na <gou...@gmail.com>
Subject Re: How can i merge multiple rows to one row in sparksql or hivesql?
Date Mon, 15 May 2017 15:50:05 GMT
Hi, Jone Zhang

1. Hive UDF
You might need collect_set (which eliminates duplicates) or collect_list
(which keeps them), but make sure to reduce the data's cardinality before
applying the UDFs, as they can cause problems when handling 1 billion
records. Union datasets 1, 2, 3 -> group by user_id1 -> collect_set
(feature column) would work.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
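
The union -> group by -> collect_set flow can be sketched in plain Python to show the intended semantics. This is only an in-memory illustration, not Hive or Spark code: the sample rows, the extra duplicate in data3, and the helper name merge_features are assumptions made up for the example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical in-memory stand-ins for Data1, Data2 and Data3 from the question.
data1 = [("user_id1", "feature1"), ("user_id1", "feature2")]
data2 = [("user_id1", "feature3")]
# The repeated feature1 row is added here only to show deduplication.
data3 = [("user_id1", "feature4"), ("user_id1", "feature5"), ("user_id1", "feature1")]


def merge_features(*datasets):
    """Union the datasets, group by user id, and collect distinct features."""
    grouped = defaultdict(set)
    for user_id, feature in chain(*datasets):  # union of all inputs
        grouped[user_id].add(feature)          # group by user_id + collect_set
    # Sorting is only for readable, deterministic output; the order of a
    # collected set should not be relied on.
    return {user_id: sorted(features) for user_id, features in grouped.items()}


merged = merge_features(data1, data2, data3)
for user_id, features in merged.items():
    print(user_id, " ".join(features))
```

Swapping the set for a list would mimic collect_list, which keeps the repeated feature1.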

2. Spark DataFrame Pivot
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
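
As a rough picture of what a pivot produces, the sketch below reshapes long (user_id, feature) rows into one wide row per user, with one column per distinct feature. It is plain Python rather than the Spark API; the rows, the second user, the 1/0 flag cells, and the helper name pivot_features are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical long-format rows; user_id2 is added only to show a missing feature.
rows = [
    ("user_id1", "feature1"),
    ("user_id1", "feature2"),
    ("user_id2", "feature2"),
]


def pivot_features(rows):
    """Turn (user_id, feature) rows into a header plus one wide row per user."""
    # Each distinct feature value becomes its own column.
    features = sorted({feature for _, feature in rows})
    seen = defaultdict(set)
    for user_id, feature in rows:
        seen[user_id].add(feature)
    header = ["user_id"] + features
    # Each cell is 1 if the user has that feature, otherwise 0.
    table = [
        [user_id] + [1 if f in seen[user_id] else 0 for f in features]
        for user_id in sorted(seen)
    ]
    return header, table


header, table = pivot_features(rows)
print(header)
for row in table:
    print(row)
```

With many distinct features the wide table grows one column per feature, so the number of pivot values is worth checking first.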

- Goun

2017-05-15 22:15 GMT+09:00 Jone Zhang <joyoungzhang@gmail.com>:

> For example
> Data1(has 1 billion records)
> user_id1  feature1
> user_id1  feature2
>
> Data2(has 1 billion records)
> user_id1  feature3
>
> Data3(has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
>
> I want to get the result as follows
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>
> Is there a more efficient way than a join?
>
> Thanks!
>
