flink-user mailing list archives

From Emmanuel <ele...@msn.com>
Subject RE: Maintaining data locality with list of paths (strings) as input
Date Sat, 14 Mar 2015 19:31:20 GMT
I stand corrected: a file does not take up an entire block, according to Hadoop: The Definitive
Guide. As Stephan mentioned, it takes space in the namenode (about 150 bytes per object, as I
read). So the limitation with many small files is the capacity of the namenode.
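To put that namenode limitation in perspective, here is a rough back-of-envelope estimate (a sketch assuming the ~150-bytes-per-namespace-object ballpark figure cited in Hadoop: The Definitive Guide; the exact per-object cost varies by version):

```python
# Rough estimate of namenode heap consumed by file metadata.
# Assumption: ~150 bytes per namespace object (file, directory, or block),
# the ballpark figure from Hadoop: The Definitive Guide.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Each file contributes one file object plus its block objects."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# Ten million small, single-block files:
print(namenode_heap_bytes(10_000_000) / 1e9)  # ~3.0 GB of namenode heap
```

So it is not disk space that small files exhaust, but namenode memory: tens of millions of tiny files can consume gigabytes of heap for metadata alone.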
Still, the data, when replicated, is distributed by the HDFS engine; I don't see how there can
be any assurance of the locality of the data. It's possible to figure out where it is physically
located, but I'm not sure it can be predicted.
The whole idea behind Hadoop is to bring the computation to the data, and that's what YARN
does in distributing processing tasks. (The Yahoo Developer Network explains this well: https://developer.yahoo.com/hadoop/tutorial/module1.html)
Is the Flink behavior mentioned native, or is this something that happens when running Flink on
YARN?

Date: Sat, 14 Mar 2015 19:16:24 +0100
Subject: Re: Maintaining data locality with list of paths (strings) as input
From: sewen@apache.org
To: user@flink.apache.org

Hi Guy,
This sounds like a use case that should work with Flink.
When it comes to input handling, Flink differs a bit from Spark. Flink creates a set of input
tasks and a set of input splits. The splits are then assigned to the tasks on the fly. Each
task may work on multiple input splits, which are pulled from the JobManager. The input localization
also happens on the fly, as each input task requests the next input split to process.
It should be possible to make sure this is executed in a data-local way, if the input splits
expose their locations properly.
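To illustrate the assignment scheme described above, here is a minimal, self-contained sketch (plain Python, not Flink's actual JobManager code; the class and method names are made up for illustration) of lazy, locality-aware split assignment: when a task on some host requests work, a split stored on that host is preferred, and a remote split is handed out only when no local one remains:

```python
class InputSplit:
    """A unit of input work, tagged with the hosts that store its data."""
    def __init__(self, split_id, hosts):
        self.split_id = split_id
        self.hosts = set(hosts)

class SplitAssigner:
    """Hands out splits lazily as tasks request them, preferring local ones."""
    def __init__(self, splits):
        self.unassigned = list(splits)

    def next_split(self, requesting_host):
        # First pass: look for a split stored on the requesting host.
        for i, split in enumerate(self.unassigned):
            if requesting_host in split.hosts:
                return self.unassigned.pop(i)
        # No local split left: fall back to any remaining split.
        return self.unassigned.pop(0) if self.unassigned else None

splits = [InputSplit("a", ["node1"]), InputSplit("b", ["node2"]),
          InputSplit("c", ["node1", "node3"])]
assigner = SplitAssigner(splits)
print(assigner.next_split("node2").split_id)  # "b" -- local to node2
print(assigner.next_split("node3").split_id)  # "c" -- local to node3
print(assigner.next_split("node2").split_id)  # "a" -- remote fallback
```

In Flink, the analogous mechanism is an input format producing locatable splits whose hostnames are filled from the HDFS block locations, so the JobManager can do this kind of host matching when a task asks for its next split.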

If you share a bit more information about what your input format looks like, we can definitely
help you realize that.
As a side comment: Small files in HDFS do not take up 64 megabytes on disk, there must be
some confusion. In early versions of HDFS, there has been a big problem with small files,
because the Namenode was getting overloaded with metadata. This has gotten better, as far
as I know, but I am not sure if they have fully solved it.
On Sat, Mar 14, 2015 at 5:49 PM, Emmanuel <eleroy@msn.com> wrote:

Hi Guy,

I don't have an answer about Flink, but a couple of comments on your use case that I hope might help:

- You should view HDFS as a giant RAID across nodes: the namenode maintains the file table,
but the data is distributed and replicated across nodes by block. There is no 'data locality'
guarantee: the data is distributed and replicated, so it could be spread across many nodes.

- Small files on HDFS are not a good idea because the typical minimal block size is 64MB, which
means even if your file is 1kB, it will use 64MB on disk. It is best to aggregate those small
files into a big one, or to use a DB storage layer like HBase or Cassandra.

In Spark, you can load files from the local file system, but usually that requires the files
to be on each node, which defeats the purpose.


-------- Original message --------

From: Guy Rapaport <guy4261@gmail.com> 

Date:03/14/2015 8:38 AM (GMT-08:00) 

To: Flink Users <user@flink.apache.org> 

Subject: Maintaining data locality with list of paths (strings) as input 


Here's a use case I'd like to implement, and I wonder if Flink is the answer:

My input is a file containing a list of paths.
(It's actually a message queue with incoming messages, each containing a path, but let's use
a simpler use-case.)

Each path points at a file stored on HDFS. The files are rather small, so although they are
replicated, they are not broken into chunks.

I want each file to get processed on the node on which it is stored, for the sake of data
locality.
However, if I run such a job on Spark, what I get is that the input path gets to some node,
which then has to access the file by pulling it from HDFS - no data locality, but network
transfer instead.
Can Flink solve this problem for me?

Note: I saw similar examples in which file lists are processed on Spark... by having each
file in the list downloaded from the internet to the node processing it. That's not my use
case - I already have the files on HDFS; all I want is to enjoy data locality in a cluster-like
environment!

