flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: HDFS append
Date Mon, 15 Dec 2014 11:53:19 GMT
Hey Flavio,

this pull request got merged:

With this, you can now simulate append behavior with Flink:

- You have a directory in HDFS where you put the files you want to append.
- Each time you want to append something, you run your job and let it
create a new directory in hdfs:///data/appendjob/, let's
say hdfs:///data/appendjob/run-X/.
- Now, you can instruct the job to read the full output by letting it
recursively read hdfs:///data/appendjob/.
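As a rough illustration of that layout (plain Java on the local filesystem; the class and method names here are invented for this sketch, and in a real job the base path would be an hdfs:/// URI read by Flink's input format with recursive enumeration enabled):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AppendByDirectory {
    // Simulate one "append": each job run writes into its own run-X subdirectory.
    static void appendRun(Path base, int run, List<String> records) throws IOException {
        Path runDir = Files.createDirectories(base.resolve("run-" + run));
        Files.write(runDir.resolve("part-0"), records);
    }

    // Reading the full "appended" data set means reading the parent recursively.
    static List<String> readAll(Path base) throws IOException {
        try (Stream<Path> files = Files.walk(base)) {
            return files.filter(Files::isRegularFile)
                        .flatMap(p -> {
                            try { return Files.lines(p); }
                            catch (IOException e) { throw new RuntimeException(e); }
                        })
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("appendjob");
        appendRun(base, 1, List.of("a", "b"));   // first "append"
        appendRun(base, 2, List.of("c"));        // second "append"
        System.out.println(readAll(base));       // [a, b, c]
    }
}
```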

I hope that helps.


On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <pompermaier@okkam.it>
> I didn't know about that difference! So Flink is very smart :)
> Thanks for the explanation, Robert.
> On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <rmetzger@apache.org>
> wrote:
>> Vasia is working on support for reading directories recursively. But I
>> thought that this also allows you to simulate something like an append.
>> Did you notice an issue when reading many small files with Flink? Flink
>> handles the reading of files differently than Spark.
>> Spark basically starts a task for each file / file split. So if you have
>> millions of small files in your HDFS, Spark will start millions of tasks
>> (queued, however). You need to coalesce in Spark to reduce the number of
>> partitions. By default, they re-use the partitions of the preceding
>> operator.
>> Flink, on the other hand, starts a fixed number of tasks which read
>> multiple input splits; the splits are lazily assigned to these tasks once
>> they are ready to process new splits.
>> Flink will not create a partition for each (small) input file. I expect
>> Flink to handle that case a bit better than Spark (I haven't tested it,
>> though).
>> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <pompermaier@okkam.it>
>> wrote:
>>> Great! Appending data to HDFS will be a very useful feature!
>>> I think you should then also think about how to efficiently read
>>> directories containing a lot of small files. I know that this can be quite
>>> inefficient, which is why Spark gives you a coalesce operation to be
>>> able to deal with such cases.
>>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <
>>> vasilikikalavri@gmail.com> wrote:
>>>> Hi!
>>>> Yes, I took a look into this. I hope I'll be able to find some time to
>>>> work on it this week.
>>>> I'll keep you updated :)
>>>> Cheers,
>>>> V.
>>>> On 9 December 2014 at 14:03, Robert Metzger <rmetzger@apache.org>
>>>> wrote:
>>>>> It seems that Vasia started working on adding support for recursive
>>>>> reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>>> I'm still occupied with refactoring the YARN client, the HDFS
>>>>> refactoring is next on my list.
>>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>> Any news about this Robert?
>>>>>> Thanks in advance,
>>>>>> Flavio
>>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <rmetzger@apache.org>
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>> I think there is no support for appending to HDFS files in Flink
>>>>>>> yet.
>>>>>>> HDFS supports it, but some adjustments in the system are
>>>>>>> required (not deleting / creating directories before writing;
>>>>>>> exposing the append() methods in the FS abstractions).
>>>>>>> I'm planning to work on the FS abstractions in the next week; if I
>>>>>>> have enough time, I can also look into adding support for append().
>>>>>>> Another approach could be adding support for recursively reading
>>>>>>> directories with the input formats. Vasia asked for this feature
>>>>>>> a few days ago on the mailing list. If we had that feature, you could
>>>>>>> write to a directory and read the parent directory (with all
>>>>>>> the dirs for the appends).
>>>>>>> Best,
>>>>>>> Robert
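For reference, the append() semantics Robert mentions boil down to opening an existing file without truncating or recreating it. A minimal local-filesystem sketch (the class and method names are invented for illustration; HDFS would go through the Hadoop FileSystem API rather than java.nio):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class AppendSketch {
    // What an exposed append() in the FS abstraction would boil down to:
    // open the existing file for appending instead of truncating it.
    static void append(Path file, List<String> records) throws IOException {
        Files.write(file, records,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempDirectory("fs").resolve("out");
        append(f, List.of("first"));
        append(f, List.of("second"));              // does not overwrite "first"
        System.out.println(Files.readAllLines(f)); // [first, second]
    }
}
```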
>>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <
>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>> Hi guys,
>>>>>>>> how can I efficiently append data (as plain strings or as
>>>>>>>> records) to HDFS using Flink?
>>>>>>>> Do I need to use Flume or can I avoid it?
>>>>>>>> Thanks in advance,
>>>>>>>> Flavio
