flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: HDFS append
Date Tue, 09 Dec 2014 14:37:52 GMT
I didn't know about that difference! Flink is very smart, then :)
Thanks for the explanation, Robert.

On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <rmetzger@apache.org> wrote:

> Vasia is working on support for reading directories recursively, but I
> thought that would also allow you to simulate something like an append.
>
> Did you notice an issue when reading many small files with Flink? Flink
> handles the reading of files differently than Spark.
>
> Spark basically starts a task for each file / file split. So if you have
> millions of small files in your HDFS, Spark will start millions of tasks
> (queued, however). You need to coalesce in Spark to reduce the number of
> partitions; by default, they re-use the partitions of the preceding
> operator.
> Flink, on the other hand, starts a fixed number of tasks which read
> multiple input splits; the splits are lazily assigned to these tasks once
> they are ready to process new splits.
> Flink will not create a partition for each (small) input file, so I expect
> Flink to handle that case a bit better than Spark (I haven't tested it,
> though).
>
>
>
> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
>
>> Great! Appending data to HDFS would be a very useful feature!
>> I think you should then also consider how to efficiently read
>> directories containing a lot of small files. I know that this can be
>> quite inefficient; that's why Spark gives you a coalesce operation to be
>> able to deal with such cases.
>>
>>
>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> Yes, I took a look into this. I hope I'll be able to find some time to
>>> work on it this week.
>>> I'll keep you updated :)
>>>
>>> Cheers,
>>> V.
>>>
>>> On 9 December 2014 at 14:03, Robert Metzger <rmetzger@apache.org> wrote:
>>>
>>>> It seems that Vasia started working on adding support for recursive
>>>> reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>> I'm still occupied with refactoring the YARN client, the HDFS
>>>> refactoring is next on my list.
>>>>
>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Any news about this Robert?
>>>>>
>>>>> Thanks in advance,
>>>>> Flavio
>>>>>
>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <rmetzger@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think there is no support for appending to HDFS files in Flink yet.
>>>>>> HDFS supports it, but some adjustments are required in the system
>>>>>> (not deleting / creating directories before writing; exposing the
>>>>>> append() methods in the FS abstractions).
>>>>>>
>>>>>> I'm planning to work on the FS abstractions in the next week; if I
>>>>>> have enough time, I can also look into adding support for append().
>>>>>>
>>>>>> Another approach could be adding support for recursively reading
>>>>>> directories with the input formats. Vasia asked for this feature a
>>>>>> few days ago on the mailing list. If we had that feature, you could
>>>>>> just write to a directory and read the parent directory (with all
>>>>>> the subdirectories for the appends).
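[The workaround Robert sketches can be simulated without Flink or HDFS at all. Below is a minimal sketch using the local filesystem: each "append" writes a fresh file into a subdirectory, and a recursive read of the parent directory sees all appended data. Paths and helper names are made up for illustration.]

```python
# Simulate append-by-directory: instead of appending to one file,
# each batch becomes a new file under the parent directory, and a
# recursive read reassembles everything.
import os
import tempfile

def append_as_new_file(base_dir, batch_id, lines):
    """Write one batch as a fresh file in its own subdirectory."""
    sub = os.path.join(base_dir, f"batch-{batch_id}")
    os.makedirs(sub, exist_ok=True)
    with open(os.path.join(sub, "part-0"), "w") as f:
        f.write("\n".join(lines))

def read_recursively(base_dir):
    """Collect all records from every file under base_dir."""
    records = []
    for root, _dirs, names in os.walk(base_dir):
        for name in sorted(names):
            with open(os.path.join(root, name)) as f:
                records.extend(f.read().splitlines())
    return records

base = tempfile.mkdtemp()
append_as_new_file(base, 1, ["a", "b"])
append_as_new_file(base, 2, ["c"])
print(sorted(read_recursively(base)))
```

[With FLINK-1307 in place, the recursive read step would be handled by the input format itself rather than by hand.]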
>>>>>>
>>>>>> Best,
>>>>>> Robert
>>>>>>
>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>> how can I efficiently append data (as plain strings or Avro
>>>>>>> records) to HDFS using Flink?
>>>>>>> Do I need to use Flume or can I avoid it?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
