From: Robert Metzger
Date: Thu, 5 Mar 2015 10:58:04 +0100
Subject: Re: Strategies for reading structured file formats as POJO DataSets
To: user@flink.apache.org

Hi Elliot,

Right now there is no tooling support for reading CSV/TSV data into a POJO, but there is an open pull request in which a user contributes such a feature:
https://github.com/apache/flink/pull/426
So it's probably only a matter of days until it is available in master.

Your suggested approach of using a mapper is perfectly fine. You can make it a bit easier by using env.readCsvFile(), which does the parsing into the types for you.

Sorry that the feature is not already available for you.

Please let us know if you have more questions regarding Flink.

Best,
Robert

On Thu, Mar 5, 2015 at 10:18 AM, Elliot West wrote:

> Hello,
>
> As a new Flink user I wondered if there are any existing approaches or
> practices for reading file formats such as CSV, TSV, etc. as DataSets of
> POJOs? My current approach can be illustrated with a contrived example:
>
> // Simulating a TSV file DataSet
> DataSet<String> tsvRatings = env.fromElements("category-1\t10");
>
> // Mapping to a POJO
> DataSet<Rating> ratings = tsvRatings.map(line -> {
>   String[] elements = line.split("\t");
>   return new Rating(elements[0], Integer.parseInt(elements[1]));
> });
>
> While such a mapping could be implemented in a more general form, I'm keen
> to avoid wheel reinvention and therefore wonder if there are already good
> ways of doing this?
>
> Thanks - Elliot.
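For reference, the "more general form" of the mapper mentioned in the quoted message could look roughly like the sketch below. It is plain Java with no Flink dependency, so it only illustrates the line-to-POJO step. The class name TsvRatings, the helpers parseLine and toRating, and the Rating field names category and score are all assumptions made for illustration; the thread does not show the real Rating class.

```java
import java.util.function.Function;
import java.util.regex.Pattern;

public class TsvRatings {

    /** Hypothetical POJO; the field names are assumed, not taken from the thread. */
    public static class Rating {
        public String category;
        public int score;

        public Rating() {} // no-argument constructor, as POJO conventions usually expect

        public Rating(String category, int score) {
            this.category = category;
            this.score = score;
        }
    }

    /** Splits one delimited line and hands the raw fields to a factory function. */
    public static <T> T parseLine(String line, String delimiter, Function<String[], T> factory) {
        // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        return factory.apply(fields);
    }

    /** Converts the two expected fields into a Rating, failing loudly on malformed input. */
    public static Rating toRating(String[] fields) {
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected 2 fields but got " + fields.length);
        }
        return new Rating(fields[0], Integer.parseInt(fields[1].trim()));
    }

    public static void main(String[] args) {
        Rating r = parseLine("category-1\t10", "\t", TsvRatings::toRating);
        System.out.println(r.category + " -> " + r.score); // prints "category-1 -> 10"
    }
}
```

Under these assumptions, the lambda in the quoted example would shrink to something like line -> TsvRatings.parseLine(line, "\t", TsvRatings::toRating), and the same parseLine helper could be reused for other POJOs by swapping the factory function.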