From: Flavio Pompermaier <pompermaier@okkam.it>
Date: Thu, 5 May 2016 10:08:11 +0200
Subject: Re: Discussion about a Flink DataSource repository
To: user@flink.apache.org

Hi Fabian,
thanks for your detailed answer, as usual ;)

I think that an external service is fine; actually, I wasn't aware of the TableSource interface. As you said, a utility to serialize and deserialize TableSources would be very helpful and would make this much easier.
However, registering metadata for a table is a very common task. Wouldn't it be useful for other Flink-related projects (I was thinking of Nifi, for example) to define a common minimal set of (optional) metadata to display in a UI for a TableSource (like name, description, creationDate, creator, field aliases)?

About point 2, I think that dataset broadcasting or closure variables are useful when you write a program yourself, but not when you try to "compose" one from reusable UDFs (using a script, as in Pig). Of course, the worst-case scenario for us (and what we do right now) is to connect to our repository from within rich operators, but I thought it could be easy to define a link from an operator to its TableEnvironment, from there to the TableSource (using the lineage tag/source-id you mentioned), and finally to its metadata.
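Just to make the comparison concrete, this is how I understand the broadcast-set alternative from inside a user function. A rough sketch only: SourceMeta, JoinedPerson, all field names, and the repository call are invented for illustration; withBroadcastSet and getBroadcastVariable are the existing DataSet API.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Invented metadata POJO: one entry per registered datasource.
class SourceMeta {
    public String sourceId;          // the lineage tag attached to records
    public int certificationLevel;   // how authoritative the source is
    public SourceMeta() {}
}

// Invented shape of a joined record: the lineage tags of the two inputs
// and the two candidate values for a conflicting attribute.
class JoinedPerson {
    public String tagA, tagB;
    public String nameA, nameB;
    public String name;
    public JoinedPerson() {}
}

public class ResolveByAuthority extends RichMapFunction<JoinedPerson, JoinedPerson> {

    private transient Map<String, SourceMeta> metaByTag;

    @Override
    public void open(Configuration parameters) {
        // Index the broadcast metadata by lineage tag, once per task.
        metaByTag = new HashMap<>();
        for (SourceMeta m : getRuntimeContext().<SourceMeta>getBroadcastVariable("sourceMeta")) {
            metaByTag.put(m.sourceId, m);
        }
    }

    @Override
    public JoinedPerson map(JoinedPerson p) {
        // Keep the value coming from the more authoritative source.
        SourceMeta a = metaByTag.get(p.tagA);
        SourceMeta b = metaByTag.get(p.tagB);
        p.name = a.certificationLevel >= b.certificationLevel ? p.nameA : p.nameB;
        return p;
    }
}

// Wiring with the DataSet API:
//   DataSet<SourceMeta> meta = env.fromCollection(repository.allSourceMeta());
//   joined.map(new ResolveByAuthority()).withBroadcastSet(meta, "sourceMeta");

This works, but it is exactly the kind of plumbing every user would have to repeat by hand, which is why a standard link from operator to source metadata would help.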
I don't know whether this is specific to us only; I just wanted to share our needs and see whether the Table API development could benefit from them.

Best,
Flavio

On Wed, May 4, 2016 at 10:35 AM, Fabian Hueske <fhueske@gmail.com> wrote:

> Hi Flavio,
>
> I thought a bit about your proposal. I am not sure it is actually
> necessary to integrate a central source repository into Flink. It should
> be possible to offer this as an external service based on the recently
> added TableSource interface. TableSources could be extended to serialize
> and deserialize their configuration to/from JSON. When the external
> repository service starts, it can read the JSON fields and instantiate
> and register TableSource objects. The repository could also hold metadata
> about the sources and serve a (web) UI to list the available sources.
> When a Flink program wants to access a data source that is registered in
> the repository, it can look up the respective TableSource object there.
>
> Given that an integration of metadata with Flink user functions (point 2
> in your proposal) is a very special requirement, I am not sure how much
> "native" support should be added to Flink. Would it be possible to add a
> lineage tag to each record and ship the metadata of all sources as a
> broadcast set to each operator? Then user functions could look up the
> metadata from the broadcast set.
>
> Best, Fabian
>
> 2016-04-29 12:49 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>
>> Hi to all,
>>
>> as discussed briefly with Fabian, for our products at Okkam we need a
>> central repository of the DataSources processed by Flink.
>> Compared with existing external catalogs, such as Hive or Confluent's
>> SchemaRegistry, whose objective is to provide the metadata necessary to
>> read/write the registered tables, we would also need a way to access
>> other general metadata (e.g. name, description, creator, creation date,
>> lastUpdate date, processedRecords, certificationLevel of the provided
>> data, provenance, language, etc.).
>>
>> This integration has two main goals:
>>
>> 1. In a UI: to let the user choose (or even create) a datasource to
>> process with some task (e.g. quality assessment) and then see its
>> metadata (name, description, creator user, etc.)
>> 2. During a Flink job: when two datasources get joined and we have
>> multiple values for an attribute (e.g. name or lastname), we can access
>> the datasource metadata to decide which value to retain (e.g. the one
>> coming from the most authoritative/certified source for that attribute)
>>
>> We also think this could be of interest for projects like Apache
>> Zeppelin or Nifi, enabling them to suggest to the user the sources to
>> start from.
>>
>> Do you think it makes sense to design such a module for Flink?
>>
>> Best,
>> Flavio
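PS: to check that I understood the serialization part correctly, the sketch below is roughly what I would expect from such a utility. ConfigurableTableSource, RepositoryEntry, and the repository itself are invented here (the current TableSource interface has no such methods); only Jackson's ObjectMapper is a real API.

import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

// Invented contract: a TableSource that can expose and re-apply its
// configuration as a flat map.
interface ConfigurableTableSource {
    Map<String, String> configuration();
    void configure(Map<String, String> config);
}

// What a repository record could look like: the class to instantiate, its
// configuration, and the general metadata discussed above.
class RepositoryEntry {
    public String sourceClass;
    public Map<String, String> config;
    public String name, description, creator, creationDate;
    public RepositoryEntry() {}
}

final class TableSourceJson {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Store an entry in the repository as a JSON string.
    static String serialize(RepositoryEntry entry) throws Exception {
        return MAPPER.writeValueAsString(entry);
    }

    // Re-create the TableSource from a stored entry when the service starts.
    static ConfigurableTableSource deserialize(String json) throws Exception {
        RepositoryEntry entry = MAPPER.readValue(json, RepositoryEntry.class);
        ConfigurableTableSource source =
                (ConfigurableTableSource) Class.forName(entry.sourceClass).newInstance();
        source.configure(entry.config);
        return source;
    }
}

If that matches what you had in mind, a common place for the metadata fields of RepositoryEntry is exactly the shared minimal set I was asking about above.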