From: Maximilian Michels
Date: Wed, 18 Nov 2015 17:27:57 +0100
Subject: Re: Does 'DataStream.writeAsCsv' suppose to work like this?
To: user@flink.apache.org

Yes, that does make sense! Thank you for explaining. Have you made the
change yet? I couldn't find it on the master.
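[Editor's note: the change asked about here amounts to forwarding flush() from the sink function to its output format. Below is a minimal, self-contained sketch of that pattern in plain Java; SimpleFormat and FlushingSink are hypothetical stand-ins for illustration, not Flink's actual OutputFormat/SinkFunction classes.]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an output format that writes records to a stream.
class SimpleFormat {
    private final OutputStream out;
    SimpleFormat(OutputStream out) { this.out = out; }
    void writeRecord(String record) throws IOException {
        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }
    // The gist of the hotfix: expose flush so the sink can forward it.
    void flush() throws IOException { out.flush(); }
}

// Hypothetical stand-in for a sink function that forwards its flush trigger.
class FlushingSink {
    private final SimpleFormat format;
    FlushingSink(SimpleFormat format) { this.format = format; }
    void invoke(String record) throws IOException {
        format.writeRecord(record);
        // Forward the flush to the format whenever the sink's flushing
        // condition triggers (here: on every record, i.e. the 0 ms
        // update-condition case mentioned in the thread).
        format.flush();
    }
}

public class FlushForwardDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        FlushingSink sink = new FlushingSink(new SimpleFormat(target));
        sink.invoke("a,1");
        sink.invoke("b,2");
        System.out.println(target.toString("UTF-8"));
    }
}
```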
On Wed, Nov 18, 2015 at 5:16 PM, Stephan Ewen wrote:
> That makes sense...
>
> On Mon, Oct 26, 2015 at 12:31 PM, Márton Balassi wrote:
>>
>> Hey Max,
>>
>> The solution I am proposing is not flushing on every record, but it makes
>> sure to forward the flushing from the sinkfunction to the outputformat
>> whenever it is triggered. Practically this means that the buffering is done
>> (almost) solely in the sink and not in the outputformat any more.
>>
>> On Mon, Oct 26, 2015 at 10:11 AM, Maximilian Michels wrote:
>>>
>>> Not sure whether we really want to flush at every invoke call. If you
>>> want to flush every time, you may want to set the update condition to 0
>>> milliseconds. That way, flush will be called every time. In the API this is
>>> exposed by using the FileSinkFunctionByMillis. If you flush every time,
>>> performance might degrade.
>>>
>>> By the way, you may also use the RollingFileSink which splits the output
>>> into several files for each hour/week/day. You can then be sure those files
>>> are already completely written to HDFS.
>>>
>>> Best regards,
>>> Max
>>>
>>> On Mon, Oct 26, 2015 at 8:36 AM, Márton Balassi wrote:
>>>>
>>>> The problem persists in the current master, simply a format.flush() is
>>>> needed here [1]. I'll do a quick hotfix, thanks for the report again!
>>>>
>>>> [1] https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/FileSinkFunction.java#L99
>>>>
>>>> On Mon, Oct 26, 2015 at 8:23 AM, Márton Balassi wrote:
>>>>>
>>>>> Hey Rex,
>>>>>
>>>>> Writing half-baked records is definitely unwanted, thanks for spotting
>>>>> this. Most likely it can be solved by adding a flush at the end of every
>>>>> invoke call, let me check.
>>>>>
>>>>> Best,
>>>>>
>>>>> Marton
>>>>>
>>>>> On Mon, Oct 26, 2015 at 7:56 AM, Rex Ge wrote:
>>>>>>
>>>>>> Hi, flinkers!
>>>>>>
>>>>>> I'm new to this whole thing,
>>>>>> and it seems to me that
>>>>>> 'org.apache.flink.streaming.api.datastream.DataStream.writeAsCsv(String, WriteMode, long)'
>>>>>> does not work properly.
>>>>>> To be specific, data are not flushed at the update frequency when
>>>>>> writing to HDFS.
>>>>>>
>>>>>> What makes it more disturbing is that, if I check the content with
>>>>>> 'hdfs dfs -cat xxx', sometimes I get partial records.
>>>>>>
>>>>>>
>>>>>> I did a little digging in flink-0.9.1.
>>>>>> It turns out that all
>>>>>> 'org.apache.flink.streaming.api.functions.sink.FileSinkFunction.invoke(IN)'
>>>>>> does is push data to
>>>>>> 'org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream',
>>>>>> which is a delegate of 'org.apache.hadoop.fs.FSDataOutputStream'.
>>>>>>
>>>>>> In this scenario, 'org.apache.hadoop.fs.FSDataOutputStream' is never
>>>>>> flushed,
>>>>>> which results in data being held in a local buffer, so 'hdfs dfs -cat
>>>>>> xxx' might return partial records.
>>>>>>
>>>>>>
>>>>>> Is 'DataStream.writeAsCsv' supposed to work like this? Or did I mess
>>>>>> up somewhere?
>>>>>>
>>>>>>
>>>>>> Best regards and thanks for your time!
>>>>>>
>>>>>> Rex
>>>>>
>>>>>
>>>>
>>>
>>
>
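[Editor's note: the partial-record symptom described above can be reproduced with plain java.io, using BufferedOutputStream as a stand-in for the client-side buffer behind FSDataOutputStream. The 8-byte buffer is artificially small so the effect shows up with short records; real buffers are just larger, not different in kind.]

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Data written through a buffered stream stays in the local buffer until
// flush(), so a concurrent reader ("hdfs dfs -cat") can observe output
// that stops short of the records the writer has already emitted.
public class PartialRecordDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for the file a reader would see.
        ByteArrayOutputStream fileSystem = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(fileSystem, 8);

        out.write("a,1\n".getBytes(StandardCharsets.UTF_8));
        out.write("b,2\n".getBytes(StandardCharsets.UTF_8));
        out.write("c,3\n".getBytes(StandardCharsets.UTF_8)); // still buffered

        // Without an explicit flush, the reader's view is incomplete:
        System.out.println("before flush: [" + fileSystem.toString("UTF-8") + "]");

        out.flush();
        System.out.println("after flush:  [" + fileSystem.toString("UTF-8") + "]");
    }
}
```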