From: Srikanth <srikanth.ht@gmail.com>
Date: Tue, 16 Feb 2016 13:52:02 -0500
Subject: Re: writeAsCSV with partitionBy
To: user@flink.apache.org

Fabian,

Not sure if we are on the same page.
If I do something like the code below, it will group by field 0 and each
task will write a separate part file in parallel.

    val sink = data1.join(data2)
      .where(1).equalTo(0) { (l, r) => (l._3, r._3) }
      .partitionByHash(0)
      .writeAsCsv(pathBase + "output/test", rowDelimiter = "\n",
        fieldDelimiter = "\t", writeMode = WriteMode.OVERWRITE)

This will create the folder ./output/test/<1,2,3,4...>

But what I was looking for is a Hive-style partitionBy that will output with
the folder structure

    ./output/field0=1/file
    ./output/field0=2/file
    ./output/field0=3/file
    ./output/field0=4/file

assuming field0 is an Int with unique values 1, 2, 3 & 4.

Srikanth

On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske <fhueske@gmail.com> wrote:
> Hi Srikanth,
>
> DataSet.partitionBy() will partition the data on the declared partition
> fields.
> If you append a DataSink with the same parallelism as the partition
> operator, the data will be written out with the defined partitioning.
> It should be possible to achieve the behavior you described using
> DataSet.partitionByHash() or partitionByRange().
>
> Best, Fabian
>
> 2016-02-12 20:53 GMT+01:00 Srikanth <srikanth.ht@gmail.com>:
>> Hello,
>>
>> Is there a Hive (or Spark DataFrame) partitionBy equivalent in Flink?
>> I'm looking to save output as CSV files partitioned by two columns
>> (date and hour).
>> The partitionBy DataSet API is more to partition the data based on a
>> column for further processing.
>>
>> I'm thinking there is no direct API to do this. But what will be the
>> best way of achieving this?
>>
>> Srikanth
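To make the intended layout concrete, here is a minimal plain-Scala sketch of the routing logic I have in mind. This is not Flink API code; `HivePartitionLayout` and `partitionPaths` are hypothetical names, and the sketch only illustrates how rows would map to a `field0=<value>` directory per distinct key, rather than one part file per parallel task.

```scala
// Hypothetical illustration (plain Scala, no Flink): route each
// (field0, payload) row to a Hive-style directory based on field0.
object HivePartitionLayout {
  // Returns a map from target file path to the CSV lines it would contain.
  def partitionPaths(rows: Seq[(Int, String)],
                     base: String): Map[String, Seq[String]] =
    rows.groupBy(_._1).map { case (key, group) =>
      // One directory per distinct field0 value, e.g. ./output/field0=1/file
      (s"$base/field0=$key/file",
       group.map { case (f0, payload) => s"$f0\t$payload" })
    }
}
```

For example, `partitionPaths(Seq((1, "a"), (2, "b"), (1, "c")), "./output")` would place the two field0=1 rows together under `./output/field0=1/file` and the field0=2 row under `./output/field0=2/file`. In a real job this mapping would have to live inside a custom OutputFormat or a per-partition-value filter + sink, since writeAsCsv alone only splits by parallel task.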