flink-user mailing list archives

From Maximilian Michels <...@apache.org>
Subject Re: Hi, Flink people, a question about translation from HIVE Query to Flink fucntioin by using Table API
Date Mon, 19 Oct 2015 11:01:02 GMT
Hi Philip,

Thank you for your questions. I think you have mapped the Hive
functions to the Flink ones correctly. Just one remark on ORDER BY:
you wrote that it produces a total order over all records. To get
that in Flink, you'd have to do a sortPartition operation with
parallelism set to 1. This is necessary because all records must be
in one place before we can sort them.
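
To illustrate why the parallelism matters, here is a small model in
plain Python. It is not Flink code, and the helper names
(split_round_robin, sort_partitions, concatenated) are made up for
this sketch. It only assumes that a partition-local sort orders each
partition independently of the others.

```python
# Plain-Python model of a partition-local sort; NOT Flink code.
# All helper names here are hypothetical and exist only for illustration.

def split_round_robin(records, parallelism):
    """Distribute records over `parallelism` partitions, round-robin."""
    return [records[i::parallelism] for i in range(parallelism)]

def sort_partitions(partitions):
    """Sort every partition on its own, without looking at the others."""
    return [sorted(p) for p in partitions]

def concatenated(partitions):
    """Read the partitions back one after another."""
    return [r for p in partitions for r in p]

records = [5, 3, 8, 1, 9, 2]

# Parallelism 2: each partition is sorted, but the combined output is not.
two = sort_partitions(split_round_robin(records, 2))

# Parallelism 1: the single partition holds every record, so the order is total.
one = sort_partitions(split_round_robin(records, 1))
```

With two partitions the result is sorted only within each partition,
which is closer to what SORT BY gives you; with one partition the
output is totally ordered, as ORDER BY requires.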

Regarding your reduce question: there is no fundamental advantage
or disadvantage of using GroupReduce over Reduce. Which one is more
convenient or efficient depends on your use case. With the regular
Reduce, you get two elements and produce one. You can't easily keep
state between the reduce calls other than in the value itself. The
GroupReduce, on the other hand, may produce zero, one, or multiple
elements per group and can keep state between emitting values. Thus,
GroupReduce is the more powerful operator and can be seen as a
superset of the Reduce operator. I would advise you to use the one
you find easiest to use.
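
The difference can be sketched in plain Python. Again, this is a model
of the semantics and not Flink code; by_key, sum_pair, and increases
are hypothetical names chosen for this example. The Reduce-style
function combines two elements into one, while the GroupReduce-style
function sees the whole group through an iterator, keeps a running
maximum as state, and emits zero, one, or several elements.

```python
# Plain-Python model of Reduce vs. GroupReduce semantics; NOT Flink code.
# All function names here are hypothetical and exist only for illustration.
from functools import reduce
from itertools import groupby

records = [("a", 1), ("a", 4), ("b", 7), ("c", 2), ("c", 3), ("c", 5)]

def by_key(rows):
    """Group rows by their first field (stable sort keeps value order)."""
    rows = sorted(rows, key=lambda r: r[0])
    return groupby(rows, key=lambda r: r[0])

def sum_pair(left, right):
    """Reduce style: two elements in, exactly one element out."""
    return (left[0], left[1] + right[1])

def increases(key, rows):
    """GroupReduce style: iterate over the group, keep state, emit 0..n rows.

    Emits every value that exceeds the running maximum of the values
    seen before it, so the first value of a group is never emitted.
    """
    running_max = None
    for _, value in rows:
        if running_max is not None and value > running_max:
            yield (key, value)
        if running_max is None or value > running_max:
            running_max = value

# Exactly one output per group.
reduced = [reduce(sum_pair, rows) for _, rows in by_key(records)]

# None for "b", one for "a", two for "c".
group_reduced = [out for key, rows in by_key(records)
                 for out in increases(key, rows)]
```

The sum could also be written in the GroupReduce style, but the
running-maximum logic has no natural pairwise form, which is what
"superset" means here.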

Best regards,

On Sun, Oct 18, 2015 at 9:16 PM, Philip Lee <philjjoon@gmail.com> wrote:
> Hi Flink people, I have a question about translating Hive queries to Flink
> functions using the Table API. In short, I am working on a benchmark for
> Flink.
> I am Philip Lee, a Master's student in Computer Science at TUB. I am
> working on translating the Hive queries of a benchmark into Flink code.
> While studying this, I came up with a few questions.
> First of all, for people who do not know the Hive functions, let me
> briefly explain them.
> ORDER BY: guarantees a total order in the output.
> SORT BY: only guarantees the ordering of the rows within a reducer.
> GROUP BY: the same as GROUP BY in SQL.
> DISTRIBUTE BY: all rows with the same DISTRIBUTE BY columns go to the
> same reducer.
> CLUSTER BY: DISTRIBUTE BY and SORT BY on the same column.
> I just want to check that the Flink functions I use are equivalent to the
> Hive ones.
> < Hive SQL Query = Flink functions >
> ORDER BY = sortPartition(,)
> SORT BY = groupBy(`col).sortPartition(,)
> GROUP BY: this is just groupBy function.
> DISTRIBUTE BY = groupBy(`col)
> CLUSTER BY = groupBy(`col).sortPartition(,)
> I do not see much difference between GROUP BY and DISTRIBUTE BY when I
> map them to Flink functions.
> In the Hadoop version, we could say that the mapper does the DISTRIBUTE BY.
> However, I am not sure what corresponds to DISTRIBUTE BY in Flink. My
> guess is that groupBy in Flink is the function that distributes the rows
> by the specified key.
> Please feel free to correct my suggestions.
> Secondly, I want to make sure I understand the difference between the
> reduce function and reduceGroup. I guess there must be a trade-off between
> the two functions. I know reduceGroup is invoked with an Iterator, but in
> which cases is it more appropriate and beneficial to use reduceGroup
> rather than reduce?
> Best Regards,
> Philip
> --
> ==========================================================
> Hae Joon Lee
> Now, in Germany,
> M.S. Candidate, Interested in Distributed System, Iterative Processing
> Dept. of Computer Science, Informatik in German, TUB
> Technical University of Berlin
> In Korea,
> M.S. Candidate, Computer Architecture Laboratory
> Dept. of Computer Science, KAIST
> Rm# 4414 CS Dept. KAIST
> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
> ==========================================================
