flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: HDFS append
Date Mon, 15 Dec 2014 12:02:11 GMT
Thanks a lot Robert!
On Dec 15, 2014 12:54 PM, "Robert Metzger" <rmetzger@apache.org> wrote:

> Hey Flavio,
>
> this pull request got merged:
> https://github.com/apache/incubator-flink/pull/260
>
> With this, you can now simulate an append behavior with Flink:
>
> - You have a directory in HDFS where you put the files you want to
> append: hdfs:///data/appendjob/
> - Each time you want to append something, you run your job and let it
> create a new directory in hdfs:///data/appendjob/, let's
> say hdfs:///data/appendjob/run-X/
> - Now you can instruct the job to read the full output by letting it
> recursively read hdfs:///data/appendjob/.
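
Robert's run-directory pattern can be sketched as follows. This is a minimal Python illustration with a local temporary directory standing in for hdfs:///data/appendjob/ (a real job would write and read through Flink's Java API); `append_run` and `read_all` are hypothetical helper names, not Flink functions:

```python
import os
import tempfile

# Local stand-in for hdfs:///data/appendjob/ (hypothetical path layout).
base = os.path.join(tempfile.mkdtemp(), "appendjob")

def append_run(run_id, records):
    # Each "append" is just a normal job writing into a fresh run-X directory.
    run_dir = os.path.join(base, "run-%d" % run_id)
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "part-0"), "w") as f:
        f.write("\n".join(records) + "\n")

def read_all():
    # Reading the parent directory recursively sees every run's output.
    records = []
    for root, _dirs, files in os.walk(base):
        for name in sorted(files):
            with open(os.path.join(root, name)) as f:
                records.extend(f.read().splitlines())
    return records

append_run(1, ["a", "b"])
append_run(2, ["c"])
print(sorted(read_all()))
```

Each "append" run leaves the earlier run directories untouched, so nothing is ever deleted or overwritten; the recursive read merges all runs into one logical dataset.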
>
> I hope that helps.
>
>
> Best,
> Robert
>
>
> On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
>>
>> I didn't know about that difference! Flink is very smart, then :)
>> Thanks for the explanation, Robert.
>>
>> On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <rmetzger@apache.org>
>> wrote:
>>
>>> Vasia is working on support for reading directories recursively. But I
>>> thought this would also allow you to simulate something like an append.
>>>
>>> Did you notice an issue when reading many small files with Flink? Flink
>>> handles the reading of files differently than Spark.
>>>
>>> Spark basically starts a task for each file / file split. So if you have
>>> millions of small files in your HDFS, Spark will start millions of tasks
>>> (queued, however). You need to coalesce in Spark to reduce the number of
>>> partitions; by default, they re-use the partitioning of the preceding
>>> operator.
>>> Flink, on the other hand, starts a fixed number of tasks, each of which
>>> reads multiple input splits; the splits are lazily assigned to the tasks
>>> as they become ready to process new splits.
>>> Flink will not create a partition for each (small) input file. I expect
>>> Flink to handle that case a bit better than Spark (I haven't tested it,
>>> though).
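
The lazy split assignment described above can be sketched as a fixed worker pool pulling from a shared split queue. This is a simplified Python analogy of the scheduling idea, not Flink's actual scheduler:

```python
from queue import Queue, Empty
from threading import Thread

# Simulate many small files: each file becomes one input split.
splits = Queue()
for i in range(1000):
    splits.put("file-%d" % i)

NUM_TASKS = 4                # fixed number of tasks, independent of file count
processed = [0] * NUM_TASKS

def task(slot):
    while True:
        try:
            splits.get_nowait()  # lazily request the next unassigned split
        except Empty:
            return               # no splits left; the task finishes
        processed[slot] += 1

threads = [Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(processed))  # 1000: all splits handled by only 4 tasks
```

The contrast with the one-task-per-split model is that the number of workers here stays constant no matter how many small files exist; only the queue grows.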
>>>
>>>
>>>
>>> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <pompermaier@okkam.it
>>> > wrote:
>>>
>>>> Great! Appending data to HDFS will be a very useful feature!
>>>> I think you should then also consider how to efficiently read
>>>> directories containing a lot of small files. I know this can be quite
>>>> inefficient, which is why Spark gives you a coalesce operation to be
>>>> able to deal with such cases.
>>>>
>>>>
>>>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <
>>>> vasilikikalavri@gmail.com> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> Yes, I took a look into this. I hope I'll be able to find some time to
>>>>> work on it this week.
>>>>> I'll keep you updated :)
>>>>>
>>>>> Cheers,
>>>>> V.
>>>>>
>>>>> On 9 December 2014 at 14:03, Robert Metzger <rmetzger@apache.org>
>>>>> wrote:
>>>>>
>>>>>> It seems that Vasia started working on adding support for recursive
>>>>>> reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>>>> I'm still occupied with refactoring the YARN client, the HDFS
>>>>>> refactoring is next on my list.
>>>>>>
>>>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> Any news about this Robert?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
>>>>>>>
>>>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <rmetzger@apache.org
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think there is no support for appending to HDFS files in Flink
>>>>>>>> yet.
>>>>>>>> HDFS supports it, but some adjustments in the system are required
>>>>>>>> (not deleting / creating directories before writing; exposing the
>>>>>>>> append() methods in the FS abstractions).
>>>>>>>>
>>>>>>>> I'm planning to work on the FS abstractions next week; if I have
>>>>>>>> enough time, I can also look into adding support for append().
>>>>>>>>
>>>>>>>> Another approach could be adding support for recursively reading
>>>>>>>> directories with the input formats. Vasia asked for this feature a
>>>>>>>> few days ago on the mailing list. If we had that feature, you could
>>>>>>>> just write to a directory and read the parent directory (with all
>>>>>>>> the dirs for the appends).
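
The append() exposure mentioned above could look roughly like the following sketch; `FileSystem` and `LocalFS` here are hypothetical stand-ins for illustration, not Flink's real FS abstraction:

```python
import os
import tempfile
from abc import ABC, abstractmethod

# Hypothetical filesystem abstraction that exposes append() next to create(),
# so a sink can add to an existing file instead of delete-and-recreate.
class FileSystem(ABC):
    @abstractmethod
    def create(self, path):
        """Create a new file, truncating any existing one."""

    @abstractmethod
    def append(self, path):
        """Open an existing file for appending (would map to HDFS append)."""

class LocalFS(FileSystem):
    def create(self, path):
        return open(path, "w")

    def append(self, path):
        return open(path, "a")  # append instead of truncating

fs = LocalFS()
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with fs.create(path) as f:
    f.write("first")
with fs.append(path) as f:
    f.write(" second")
with open(path) as f:
    print(f.read())  # first second
```

The point is only the shape of the API surface: once append() exists in the abstraction, output formats no longer need to delete directories before writing.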
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Robert
>>>>>>>>
>>>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <
>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> how can I efficiently append data (as plain strings or Avro
>>>>>>>>> records) to HDFS using Flink?
>>>>>>>>> Do I need to use Flume or can I avoid it?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Flavio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
