flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: HDFS append
Date Tue, 09 Dec 2014 14:33:28 GMT
Vasia is working on support for reading directories recursively, but I
thought that this would also allow you to simulate something like an append.
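
That simulation can be sketched with a local directory standing in for HDFS; the helper names and the batch-/part-file layout below are my own illustration, not a Flink or HDFS API:

```python
import tempfile
from pathlib import Path

def append_batch(parent: Path, batch_id: int, records: list) -> None:
    # Each "append" writes a fresh part file into a new subdirectory of the
    # dataset's parent directory, instead of modifying an existing file.
    batch_dir = parent / f"batch-{batch_id:04d}"
    batch_dir.mkdir(parents=True, exist_ok=True)
    (batch_dir / "part-0").write_text("\n".join(records) + "\n")

def read_recursively(parent: Path) -> list:
    # Mirrors a recursive input format: pick up every part file in every
    # subdirectory, in deterministic path order.
    lines = []
    for part in sorted(parent.rglob("part-*")):
        lines.extend(part.read_text().splitlines())
    return lines

if __name__ == "__main__":
    parent = Path(tempfile.mkdtemp())
    append_batch(parent, 0, ["a", "b"])
    append_batch(parent, 1, ["c"])
    print(read_recursively(parent))  # ['a', 'b', 'c']
```

A recursive read of the parent directory then sees the union of all batches, which is the "simulated append" effect.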

Did you notice an issue when reading many small files with Flink? Flink
handles the reading of files differently than Spark does.

Spark basically starts a task for each file / file split, so if you have
millions of small files in your HDFS, Spark will start millions of tasks
(queued, however). You need to coalesce in Spark to reduce the number of
partitions; by default, operators re-use the partitions of the preceding
operator.
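
As a toy model of what coalesce buys you (this is not Spark's actual grouping strategy, only an illustration of merging many per-file partitions into a few):

```python
def coalesce(partitions, num_partitions):
    # Model of a shuffle-free coalesce: merge existing partitions into
    # num_partitions groups, preserving record order within each group.
    groups = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        groups[i % num_partitions].extend(part)
    return groups

# One tiny partition per small input file, as Spark would create them.
per_file = [[f"file-{i}"] for i in range(6)]
print(coalesce(per_file, 2))
# [['file-0', 'file-2', 'file-4'], ['file-1', 'file-3', 'file-5']]
```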
Flink, on the other hand, starts a fixed number of tasks which read
multiple input splits; the splits are lazily assigned to these tasks once
they are ready to process new splits.
Flink will not create a partition for each (small) input file, so I expect
Flink to handle that case a bit better than Spark (I haven't tested it,
though).
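
The lazy-assignment scheme can be modelled in a few lines; the round-robin below is only a stand-in for "whichever task is free next", and none of this is actual Flink code:

```python
from collections import deque

def assign_splits_lazily(splits, num_tasks):
    # Model of the scheme: a fixed pool of tasks repeatedly pulls the next
    # unprocessed split from a shared queue, instead of one task per split.
    queue = deque(splits)
    work = {t: [] for t in range(num_tasks)}
    t = 0
    while queue:
        work[t].append(queue.popleft())  # task t is ready, takes next split
        t = (t + 1) % num_tasks          # stand-in for "next free task"
    return work

# Seven small files yield seven splits, but only three tasks ever run.
splits = [f"split-{i}" for i in range(7)]
print(assign_splits_lazily(splits, 3))
```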



On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <pompermaier@okkam.it>
wrote:

> Great! Appending data to HDFS will be a very useful feature!
> I think that you should then also consider how to read efficiently
> directories containing a lot of small files. I know that this can be quite
> inefficient; that's why Spark gives you a coalesce operation to be
> able to deal with such cases.
>
>
> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com> wrote:
>
>> Hi!
>>
>> Yes, I took a look into this. I hope I'll be able to find some time to
>> work on it this week.
>> I'll keep you updated :)
>>
>> Cheers,
>> V.
>>
>> On 9 December 2014 at 14:03, Robert Metzger <rmetzger@apache.org> wrote:
>>
>>> It seems that Vasia started working on adding support for recursive
>>> reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>> I'm still occupied with refactoring the YARN client, the HDFS
>>> refactoring is next on my list.
>>>
>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> Any news about this Robert?
>>>>
>>>> Thanks in advance,
>>>> Flavio
>>>>
>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <rmetzger@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think there is no support for appending to HDFS files in Flink yet.
>>>>> HDFS supports it, but some adjustments are required in the system
>>>>> (not deleting / creating directories before writing; exposing the
>>>>> append() methods in the FS abstractions).
>>>>>
>>>>> I'm planning to work on the FS abstractions in the next week, if I
>>>>> have enough time, I can also look into adding support for append().
>>>>>
>>>>> Another approach could be adding support for recursively reading
>>>>> directories with the input formats. Vasia asked for this feature a few
>>>>> days ago on the mailing list. If we had that feature, you could just
>>>>> write to a directory and read the parent directory (with all the dirs
>>>>> for the appends).
>>>>>
>>>>> Best,
>>>>> Robert
>>>>>
>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>> how can I efficiently append data (as plain strings or also Avro
>>>>>> records) to HDFS using Flink?
>>>>>> Do I need to use Flume or can I avoid it?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Flavio
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
