Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <etPan.59b8f2a2.73b4e670.123@apache.org>
References: <CAELUF_D8mXjBoKWRY2X6ATG6rKQxsAB_qVKeuUbhknc72Ogw2w@mail.gmail.com>
 <87FCF441-EE2B-4EEA-9826-A1D1252AA6F9@data-artisans.com> <EF3A19D0-22DB-492D-B89D-745772B13E3E@apache.org>
 <CAELUF_BGKpHFATHrnv7t+REPFeXqHom68qngbLnFXoSiSDzaeg@mail.gmail.com> <etPan.59b8f2a2.73b4e670.123@apache.org>
From: Flavio Pompermaier <pompermaier@okkam.it>
Date: Wed, 13 Sep 2017 12:04:06 +0200
Message-ID: <CAELUF_DNn=uPTBnKfk7zC+aB13bB66nYnQGwusucKA+m6u8YrQ@mail.gmail.com>
Subject: Re: BucketingSink never closed
To: "Tzu-Li (Gordon) Tai" <tzulitai@apache.org>
Cc: Aljoscha Krettek <aljoscha@apache.org>, Kostas Kloudas <k.kloudas@data-artisans.com>,
	user <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="001a113cfd2cf28f3605590f489e"
archived-at: Wed, 13 Sep 2017 10:04:35 -0000

--001a113cfd2cf28f3605590f489e
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Gordon,
thanks for your feedback. The main problem for me is that moving from batch
to stream should be much easier IMHO.

Rows should be a first-class citizen in Flink and it should be VERY easy to
read/write them, while at the moment it seems that Tuples are the
dominating type... I don't want to write a serializer/outputFormat to
persist Rows as Parquet, Avro, Thrift, ORC, Kudu, Hive, etc. I expect to
find an existing (and maintained) connector already available
somewhere. The case of the Parquet rolling sink is just an example.

Regarding state backends, I think that it's not so easy to understand how to
design and monitor them properly: there are many parameters/variables to take
into account and it would be helpful to have a proper hands-on training
course/certification about this...

About ES indexing monitoring see my discussion with Chesnay at
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-job-monitoring-td13583.html:
what I need is just to have recordsIn/recordsOut reflecting real values.
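
To make concrete what I mean by counters that reflect real values, here is a
minimal sketch in plain Java. It is not Flink or Elasticsearch connector code:
CountingIndexer and its Predicate-based indexer are hypothetical stand-ins I
made up for illustration. In a real sink I would expect the counters to be
registered through getRuntimeContext().getMetricGroup().counter(...), but I
have not verified how the ES connector would wire that up.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Predicate;

/** Hypothetical wrapper that counts received and successfully indexed records. */
final class CountingIndexer<T> {

    // Stand-in for the call that indexes one record; returns true on success.
    private final Predicate<T> indexer;
    private final LongAdder recordsIn = new LongAdder();
    private final LongAdder recordsIndexed = new LongAdder();

    CountingIndexer(Predicate<T> indexer) {
        this.indexer = indexer;
    }

    /** Counts every incoming record, and separately those that were indexed. */
    void invoke(T record) {
        recordsIn.increment();
        if (indexer.test(record)) {
            recordsIndexed.increment();
        }
    }

    long getRecordsIn() {
        return recordsIn.sum();
    }

    long getRecordsIndexed() {
        return recordsIndexed.sum();
    }

    public static void main(String[] args) {
        // Assumption for the demo: empty documents fail to index.
        CountingIndexer<String> sink = new CountingIndexer<>(doc -> !doc.isEmpty());
        for (String doc : new String[] {"a", "", "c"}) {
            sink.invoke(doc);
        }
        // prints "3 received, 2 indexed"
        System.out.println(sink.getRecordsIn() + " received, " + sink.getRecordsIndexed() + " indexed");
    }
}
```

The point is only that the two numbers are tracked separately, so progress can
be read as indexed/received instead of being inferred from accumulators.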

Best,
Flavio

On Wed, Sep 13, 2017 at 10:56 AM, Tzu-Li (Gordon) Tai <tzulitai@apache.org>
wrote:

> Hi Flavio,
>
> Let me try to understand / look at some of the problems you have
> encountered.
>
>
>    - checkpointing: it's not clear which checkpointing system to use and
>    how to tune/monitor it and avoid OOM exceptions.
>
> What do you mean by which "checkpointing system" to use? Do you mean state
> backends? Typically, you would only get OOM exceptions for memory-backed
> state backends if the state size exceeds the memory capacity. State sizes
> can be queried from the REST APIs / Web UI.
>
>
>    - cleanup: BucketingSink doesn't always move to final state
>
> This sounds like a bug that we should look into. Do you have any logs on
> which you observed this?
>
>
>    - missing output formats: parquet support to write generic Rows not
>    very well supported (at least out of the box) [1]
>
> Would you be up to opening up JIRAs for what you think is missing (if
> there isn't one already)?
>
>
>    - progress monitoring: for example in the ES connector there's no way
>    (apart from using accumulators) to monitor the progress of the indexing
>
> Maybe we can add some built-in metric in the ES sink connector that tracks
> the number of successfully indexed elements, which can then be queried from
> the REST API / Web UI. That wouldn't be too much effort. What do you think,
> would that be useful for your case?
> Would be happy to hear your thoughts on this!
>
> Cheers,
> Gordon
>
>
> On 12 September 2017 at 11:36:27 AM, Flavio Pompermaier (
> pompermaier@okkam.it) wrote:
>
> For the moment I'm giving up on streaming... too many missing/unclear
> features wrt batch.
> For example:
>
>    - checkpointing: it's not clear which checkpointing system to use and
>    how to tune/monitor it and avoid OOM exceptions. Moreover is it really
>    necessary to use it? For example if I read a file from HDFS and I don't
>    have a checkpoint it could be ok to re-run the job on all the data in case
>    of errors (i.e. the stream is managed like a batch)
>    - cleanup: BucketingSink doesn't always move to final state
>    - missing output formats: parquet support to write generic Rows not
>    very well supported (at least out of the box) [1]
>    - progress monitoring: for example in the ES connector there's no way
>    (apart from using accumulators) to monitor the progress of the indexing
>
> [1] https://stackoverflow.com/questions/41144659/flink-avro-parquet-writer-in-rollingsink
>
> Maybe I'm wrong about those points, but my attempt to replace my current
> batch system with a streaming one failed because of them.
>
> Best,
> Flavio
>
> On Fri, Sep 8, 2017 at 5:29 PM, Aljoscha Krettek <aljoscha@apache.org>
> wrote:
>
>> Hi,
>>
>> Expanding a bit on Kostas' answer. Yes, your analysis is correct, the
>> problem is that the job is shutting down before a last checkpoint can
>> "confirm" the written bucket data by moving it to the final state. The
>> problem, as Kostas noted, is that a user function (and thus also
>> BucketingSink) does not know whether close() is being called because of a
>> failure or because of normal job shutdown. Therefore, we cannot move data to
>> the final state there.
>>
>> Once we have the issue that Kostas posted resolved we can also resolve
>> this problem for the BucketingSink.
>>
>> Best,
>> Aljoscha
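
To check that I understood the lifecycle described above, I wrote it down as a
small plain-Java model. This is only a sketch under my own assumptions:
PendingFileModel, its State enum and its method names are hypothetical and it
is not the BucketingSink implementation. It only encodes the two rules from
the explanation: close() cannot finalize anything, and only a completed
checkpoint confirms pending files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified model of a checkpoint-confirmed file sink. */
final class PendingFileModel {

    enum State { IN_PROGRESS, PENDING, FINAL }

    private final Map<String, State> files = new LinkedHashMap<>();

    /** Starts (or keeps) writing to the given part file. */
    void write(String file) {
        files.putIfAbsent(file, State.IN_PROGRESS);
    }

    /**
     * Called on failure AND on normal shutdown. Since the two cases cannot be
     * told apart here, files are only moved to PENDING, never to FINAL.
     */
    void close() {
        files.replaceAll((name, state) -> state == State.IN_PROGRESS ? State.PENDING : state);
    }

    /** Only a completed checkpoint confirms the pending data. */
    void notifyCheckpointComplete() {
        files.replaceAll((name, state) -> state == State.PENDING ? State.FINAL : state);
    }

    State stateOf(String file) {
        return files.get(file);
    }

    public static void main(String[] args) {
        // A short job: it finishes before any checkpoint completes.
        PendingFileModel shortJob = new PendingFileModel();
        shortJob.write("part-0-0");
        shortJob.close();
        System.out.println(shortJob.stateOf("part-0-0")); // prints PENDING

        // A longer job: a checkpoint completes after the file was closed.
        PendingFileModel longJob = new PendingFileModel();
        longJob.write("part-0-0");
        longJob.close();
        longJob.notifyCheckpointComplete();
        System.out.println(longJob.stateOf("part-0-0")); // prints FINAL
    }
}
```

If this model is right, my five-element test job simply ends before the first
5000 ms checkpoint, which would explain why the files stay pending.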
>>
>> On 8. Sep 2017, at 16:48, Kostas Kloudas <k.kloudas@data-artisans.com>
>> wrote:
>>
>> Hi Flavio,
>>
>> If I understand correctly, I think you bumped into this issue:
>> https://issues.apache.org/jira/browse/FLINK-2646
>>
>> There is also a similar discussion on the BucketingSink here:
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Adding-a-dispose-method-in-the-RichFunction-td14466.html#a14468
>>
>> Kostas
>>
>> On Sep 8, 2017, at 4:27 PM, Flavio Pompermaier <pompermaier@okkam.it>
>> wrote:
>>
>> Hi to all,
>> I'm trying to test a streaming job but the files written by
>> the BucketingSink are never finalized (they remain in the pending state).
>> Is this caused by the fact that the job finishes before the checkpoint?
>> Shouldn't the sink properly close anyway?
>>
>> This is my code:
>>
>>   @Test
>>   public void testBucketingSink() throws Exception {
>>     final StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
>>     final StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(senv);
>>     senv.enableCheckpointing(5000);
>>     DataStream<String> testStream = senv.fromElements(//
>>         "1,aaa,white", //
>>         "2,bbb,gray", //
>>         "3,ccc,white", //
>>         "4,bbb,gray", //
>>         "5,bbb,gray" //
>>     );
>>     final RowTypeInfo rtf = new RowTypeInfo(
>>         BasicTypeInfo.STRING_TYPE_INFO,
>>         BasicTypeInfo.STRING_TYPE_INFO,
>>         BasicTypeInfo.STRING_TYPE_INFO);
>>     DataStream<Row> rows = testStream.map(new MapFunction<String, Row>() {
>>
>>       private static final long serialVersionUID = 1L;
>>
>>       @Override
>>       public Row map(String str) throws Exception {
>>         String[] split = str.split(Pattern.quote(","));
>>         Row ret = new Row(3);
>>         ret.setField(0, split[0]);
>>         ret.setField(1, split[1]);
>>         ret.setField(2, split[2]);
>>         return ret;
>>       }
>>     }).returns(rtf);
>>
>>     String columnNames = "id,value,state";
>>     final String dsName = "test";
>>     tEnv.registerDataStream(dsName, rows, columnNames);
>>     final String whiteAreaFilter = "state = 'white'";
>>     DataStream<Row> grayArea = rows;
>>     DataStream<Row> whiteArea = null;
>>     if (whiteAreaFilter != null) {
>>       String sql = "SELECT *, (%s) as _WHITE FROM %s";
>>       sql = String.format(sql, whiteAreaFilter, dsName);
>>       Table table = tEnv.sql(sql);
>>       grayArea = tEnv.toDataStream(table.where("!_WHITE").select(columnNames), rtf);
>>       DataStream<Row> nw = tEnv.toDataStream(table.where("_WHITE").select(columnNames), rtf);
>>       whiteArea = whiteArea == null ? nw : whiteArea.union(nw);
>>     }
>>     Writer<Row> bucketSinkwriter = new RowCsvWriter("UTF-8", "\t", "\n");
>>
>>     String datasetWhiteDir = "/tmp/bucket/white";
>>     BucketingSink<Row> whiteAreaSink = new BucketingSink<>(datasetWhiteDir.toString());
>>     whiteAreaSink.setWriter(bucketSinkwriter);
>>     whiteAreaSink.setBatchSize(10);
>>     whiteArea.addSink(whiteAreaSink);
>>
>>     String datasetGrayDir = "/tmp/bucket/gray";
>>     BucketingSink<Row> grayAreaSink = new BucketingSink<>(datasetGrayDir.toString());
>>     grayAreaSink.setWriter(bucketSinkwriter);
>>     grayAreaSink.setBatchSize(10);
>>     grayArea.addSink(grayAreaSink);
>>
>>     JobExecutionResult jobInfo = senv.execute("Buketing sink test ");
>>     System.out.printf("Job took %s minutes", jobInfo.getNetRuntime(TimeUnit.MINUTES));
>>   }
>>
>> public class RowCsvWriter extends StreamWriterBase<Row> {
>>   private static final long serialVersionUID = 1L;
>>
>>   private final String charsetName;
>>   private transient Charset charset;
>>   private String fieldDelimiter;
>>   private String recordDelimiter;
>>   private boolean allowNullValues = true;
>>   private boolean quoteStrings = false;
>>
>>   /**
>>    * Creates a new {@code RowCsvWriter} that uses {@code "UTF-8"} charset to convert strings to
>>    * bytes.
>>    */
>>   public RowCsvWriter() {
>>     this("UTF-8", CsvOutputFormat.DEFAULT_FIELD_DELIMITER, CsvOutputFormat.DEFAULT_LINE_DELIMITER);
>>   }
>>
>>   /**
>>    * Creates a new {@code RowCsvWriter} that uses the given charset to convert strings to bytes.
>>    *
>>    * @param charsetName Name of the charset to be used, must be valid input for
>>    *        {@code Charset.forName(charsetName)}
>>    */
>>   public RowCsvWriter(String charsetName, String fieldDelimiter, String recordDelimiter) {
>>     this.charsetName = charsetName;
>>     this.fieldDelimiter = fieldDelimiter;
>>     this.recordDelimiter = recordDelimiter;
>>   }
>>
>>   @Override
>>   public void open(FileSystem fs, Path path) throws IOException {
>>     super.open(fs, path);
>>
>>     try {
>>       this.charset = Charset.forName(charsetName);
>>     } catch (IllegalCharsetNameException ex) {
>>       throw new IOException("The charset " + charsetName + " is not valid.", ex);
>>     } catch (UnsupportedCharsetException ex) {
>>       throw new IOException("The charset " + charsetName + " is not supported.", ex);
>>     }
>>   }
>>
>>   @Override
>>   public void write(Row element) throws IOException {
>>     FSDataOutputStream outputStream = getStream();
>>     writeRow(element, outputStream);
>>   }
>>
>>   private void writeRow(Row element, FSDataOutputStream out) throws IOException {
>>     int numFields = element.getArity();
>>
>>     for (int i = 0; i < numFields; i++) {
>>       Object obj = element.getField(i);
>>       if (obj != null) {
>>         if (i != 0) {
>>           out.write(this.fieldDelimiter.getBytes(charset));
>>         }
>>
>>         if (quoteStrings) {
>>           if (obj instanceof String || obj instanceof StringValue) {
>>             out.write('"');
>>             out.write(obj.toString().getBytes(charset));
>>             out.write('"');
>>           } else {
>>             out.write(obj.toString().getBytes(charset));
>>           }
>>         } else {
>>           out.write(obj.toString().getBytes(charset));
>>         }
>>       } else {
>>         if (this.allowNullValues) {
>>           if (i != 0) {
>>             out.write(this.fieldDelimiter.getBytes(charset));
>>           }
>>         } else {
>>           throw new RuntimeException("Cannot write tuple with <null> value at position: " + i);
>>         }
>>       }
>>     }
>>
>>     // add the record delimiter
>>     out.write(this.recordDelimiter.getBytes(charset));
>>   }
>>
>>   @Override
>>   public Writer<Row> duplicate() {
>>     return new RowCsvWriter(charsetName, fieldDelimiter, recordDelimiter);
>>   }
>> }
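
For reference, this is what I expect the writer above to produce for a single
row. The sketch below is a standalone re-statement of the writeRow logic for
the default settings only (allowNullValues = true, quoteStrings = false); it
uses a plain Object[] instead of Row and a ByteArrayOutputStream instead of
the HDFS stream, and RowFormatDemo/formatRow are names made up for the demo.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Standalone demo of the delimiter and null handling used by the writer above. */
final class RowFormatDemo {

    /** Formats one record: fields joined by the delimiter, nulls written as empty fields. */
    static byte[] formatRow(Object[] fields, String fieldDelimiter,
                            String recordDelimiter, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < fields.length; i++) {
            if (i != 0) {
                append(out, fieldDelimiter.getBytes(charset));
            }
            if (fields[i] != null) {
                append(out, fields[i].toString().getBytes(charset));
            }
        }
        // add the record delimiter
        append(out, recordDelimiter.getBytes(charset));
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] bytes = formatRow(new Object[] {"1", null, "white"}, "\t", "\n", utf8);
        String line = new String(bytes, utf8);
        // prints "true": the null field becomes an empty column between two tabs
        System.out.println(line.equals("1\t\twhite\n"));
    }
}
```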
>>
>>
>>
>> Any help is appreciated,
>> Flavio
>>
>>
>>
>>
>

--001a113cfd2cf28f3605590f489e--