flink-user mailing list archives

From Shuyi Chen <suez1...@gmail.com>
Subject Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
Date Wed, 31 Oct 2018 18:46:56 GMT
Hi Xuefu,

Thanks a lot for driving this big effort. I would suggest converting your
proposal and design doc into a Google doc, and sharing it on the dev mailing
list for the community to review and comment on, with a title like "[DISCUSS]
... Hive integration design ...". Once approved, we can document it as a FLIP
(Flink Improvement Proposal), and use JIRAs to track the implementation.
What do you think?

Shuyi

On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xuefu.z@alibaba-inc.com>
wrote:

> Hi all,
>
> I have also shared a design doc on Hive metastore integration that is
> attached here and also to FLINK-10556[1]. Please kindly review and share
> your feedback.
>
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
> ------------------------------------------------------------------
> Sender: Xuefu <xuefu.z@alibaba-inc.com>
> Sent at: 2018 Oct 25 (Thu) 01:08
> Recipient: Xuefu <xuefu.z@alibaba-inc.com>; Shuyi Chen <suez1224@gmail.com>
> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> dev <dev@flink.apache.org>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi all,
>
> To wrap up the discussion, I have attached a PDF describing the proposal,
> which is also attached to FLINK-10556 [1]. Please feel free to watch that
> JIRA to track the progress.
>
> Please also let me know if you have additional comments or questions.
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
>
> ------------------------------------------------------------------
> Sender: Xuefu <xuefu.z@alibaba-inc.com>
> Sent at: 2018 Oct 16 (Tue) 03:40
> Recipient: Shuyi Chen <suez1224@gmail.com>
> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> dev <dev@flink.apache.org>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Shuyi,
>
> Thank you for your input. Yes, I agree with a phased approach and would
> like to move forward fast. :) We did some work internally on DDL utilizing
> the babel parser in Calcite. While babel makes Calcite's grammar extensible,
> at first impression it still seems too cumbersome for a project when too
> many extensions are made. It's even challenging to find where an extension
> is needed! It would certainly be better if Calcite could magically support
> Hive QL by just turning on a flag, such as the one for MYSQL_5. I can also
> see that this could mean a lot of work on Calcite. Nevertheless, I will
> bring up the discussion over there and see what their community thinks.
>
> Would you mind sharing more info about the proposal on DDL that you
> mentioned? We can certainly collaborate on this.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: Shuyi Chen <suez1224@gmail.com>
> Sent at: 2018 Oct 14 (Sun) 08:30
> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> dev <dev@flink.apache.org>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Welcome to the community and thanks for the great proposal, Xuefu! I think
> the proposal can be divided into 2 stages: making Flink support Hive
> features, and making Hive work with Flink. I agree with Timo on starting
> with a smaller scope, so we can make progress faster. As for [6], a
> proposal for DDL is already in progress, and will come after the unified
> SQL connector API is done. For supporting Hive syntax, we might need to
> work with the Calcite community, and a recent effort called babel (
> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help
> here.
>
> Thanks
> Shuyi
>
> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuefu.z@alibaba-inc.com>
> wrote:
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I didn't
> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for batch, but this can extend to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> created by Hive can be understood by Flink, and the reverse direction is
> true as well.
> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
> provides its own implementation or makes Hive's implementation work in
> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> mechanism allowing users to import them into Flink without any code change
> required.
> 5. Data types - Flink SQL should support all data types that are available
> in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as
> SQL:2003) with extensions to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> 7. SQL CLI - this is currently being developed in Flink, but more effort
> is needed.
> 8. Server - provide a server that's compatible with Hive's HiveServer2 in
> its thrift APIs, such that HiveServer2 users can reuse their existing
> clients (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other applications to use to connect to its thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes,
> storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
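As a rough illustration of item 4 above, bridging an existing Hive UDF into a
Flink-style scalar function is essentially an adapter problem. The sketch below
uses simplified stand-in classes, not the real Hive (`GenericUDF`) or Flink
(`ScalarFunction`) APIs; a real bridge would also have to convert between
Hive's ObjectInspectors and Flink's type system, which is elided here.

```python
# Hedged sketch of item 4: wrapping an existing Hive UDF so it can be called
# through a Flink-style scalar-function interface. All classes here are toy
# stand-ins for illustration, not the actual Hive or Flink APIs.

class HiveUdf:
    """Stand-in for a user's existing Hive UDF contract."""
    def evaluate(self, *args):
        raise NotImplementedError

class UpperUdf(HiveUdf):
    """A toy 'Hive' UDF that upper-cases its input."""
    def evaluate(self, s):
        return s.upper()

class HiveUdfWrapper:
    """Stand-in for a Flink scalar function that delegates to a Hive UDF.

    A real bridge would also translate between Hive's ObjectInspectors and
    Flink's type system; that conversion is omitted in this sketch.
    """
    def __init__(self, hive_udf):
        self.hive_udf = hive_udf

    def eval(self, *args):
        # Delegate the call unchanged to the wrapped Hive UDF.
        return self.hive_udf.evaluate(*args)

wrapped = HiveUdfWrapper(UpperUdf())
print(wrapped.eval("hello"))  # -> HELLO
```

The point of the adapter is that the user's UDF class is reused as-is, which
is what "without any code change required" in item 4 would mean in practice.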
>
> As you can see, achieving all of this requires significant effort across
> all layers in Flink. However, a short-term goal could include only the core
> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3,
> #6).
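To make item 1 more concrete, the catalog surface Flink would consume from the
metastore can be sketched with a toy in-memory stand-in. A real implementation
would talk to the metastore's thrift API (for instance via Hive's
`HiveMetaStoreClient`) rather than the dict used here; everything below is
illustrative only.

```python
# Hedged sketch of item 1: a read-through catalog facade of the kind Flink
# SQL could query. `InMemoryMetastore` is a toy stand-in for the real Hive
# metastore service.

class InMemoryMetastore:
    def __init__(self):
        # {database: {table: schema}}, mimicking metastore objects
        self._tables = {}

    def create_table(self, db, table, schema):
        self._tables.setdefault(db, {})[table] = schema

    def get_table(self, db, table):
        return self._tables[db][table]

    def list_tables(self, db):
        return sorted(self._tables.get(db, {}))

class HiveCatalog:
    """Catalog facade that delegates lookups to the metastore."""
    def __init__(self, metastore):
        self._ms = metastore

    def table_schema(self, db, table):
        # A real catalog would also map Hive column types to Flink SQL
        # types here (item 5 in the list above).
        return self._ms.get_table(db, table)

ms = InMemoryMetastore()
ms.create_table("sales", "orders", {"id": "BIGINT", "amount": "DOUBLE"})
catalog = HiveCatalog(ms)
print(catalog.table_schema("sales", "orders"))  # {'id': 'BIGINT', 'amount': 'DOUBLE'}
```

Because both Hive and Flink would go through the same metastore, objects
created on either side stay visible to the other, which is what item 2
(metadata compatibility) asks for.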
>
> Please share your further thoughts. If we generally agree that this is the
> right direction, I could come up with a formal proposal quickly and then we
> can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> Sender: vino yang <yanghua1127@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 09:45
> Recipient: Fabian Hueske <fhueske@gmail.com>
> Cc: dev <dev@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com>; user <
> user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, I think it would be better if
> you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fhueske@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> Can you go into the details of what you are proposing? I can think of a
> couple of ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, Oct 9, 2018 at 7:22 PM, Zhang, Xuefu <
> xuefu.z@alibaba-inc.com> wrote:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also
> for batch processing. We are encouraged by our findings and have initiated
> an effort to make Flink's SQL capabilities full-fledged. When comparing
> what's available in Flink to the offerings of competing data processing
> engines, we identified a major gap in Flink: good integration with the
> Hive ecosystem. This is crucial to the success of Flink SQL and batch
> processing due to the well-established data ecosystem around Hive.
> Therefore, we have done some initial work in this direction, but a lot of
> effort is still needed.
>
> We have two strategies in mind. The first is to make Flink SQL
> full-fledged and well-integrated with the Hive ecosystem. This is a similar
> approach to the one Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach has
> its pros and cons, but they don't need to be mutually exclusive, with each
> targeting different users and use cases. We believe that both will promote
> a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we also plan
> to start on strategy #2 as a follow-up effort.
>
> I'm completely new to Flink (with a short bio [2] below), though many of
> my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
> like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, the input and contributions from the
> communities are greatly welcome and appreciated.
>
> Regards,
>
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
> projects under the Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked on the Hadoop team at Yahoo, when the
> project had just gotten started. Later he worked at Cloudera, initiating
> and leading the development of the Hive on Spark project in the community
> and across many organizations. Prior to joining Alibaba, he worked at
> Uber, where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop
> workload and significantly improved Uber's cluster efficiency.
>
>
>
>

-- 
"So you have to trust that the dots will somehow connect in your future."
