hadoop-hdfs-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Order of files in Map class
Date Wed, 20 Jul 2011 10:07:48 GMT
Florin,

On Wed, Jul 20, 2011 at 2:03 PM, Florin P <florinpico@yahoo.com> wrote:
> Hello, Harsh!
> Thank you for your quick response. I have another question:
> 1. You are saying that each map task will take one file as input, but when
the file sizes are smaller than the block size, is it possible for a map task
to take more than one file?

If you have a 2 MB file on DFS with a configured block size of 256 MB,
the file still occupies only 2 MB. See [1]. The block size is merely a
split boundary, not a fill-up unit. No two files can reside on the same
'block'.
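To make that splitting rule concrete, here is a minimal, Hadoop-free sketch (illustrative names and logic, not Hadoop's actual FileInputFormat code): a file shorter than the block size yields exactly one split of its own length, never padded up to a block.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Returns the split lengths for one file. A file smaller than the
    // block size produces a single split of its own length.
    static List<Long> splitsFor(long fileLen, long blockSize) {
        List<Long> splits = new ArrayList<>();
        long remaining = fileLen;
        while (remaining > 0) {
            long len = Math.min(blockSize, remaining);
            splits.add(len);
            remaining -= len;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 2 MB file with a 256 MB block size: one split, 2 MB long.
        System.out.println(splitsFor(2 * mb, 256 * mb)); // → [2097152]
    }
}
```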

> 2. In this particular case, will the same behavior happen (meaning each file
is processed to the end, and then the next one)?

Unless you pack more blocks per split with an input format like
CombineFileInputFormat, this does not happen.

But if you do use CombineFileInputFormat, then yes, it happens like that.
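As a rough sketch of that packing behavior (an illustrative simulation, not CombineFileInputFormat's actual implementation): small files are greedily packed into one combined split until a maximum split size is reached, so a single map task ends up reading several files back to back.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSketch {
    // Greedy packing of small files (given by length) into combined
    // splits, capped at maxSplitSize -- loosely what a combining input
    // format does.
    static List<List<Long>> combine(List<Long> fileLens, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long used = 0;
        for (long len : fileLens) {
            if (used + len > maxSplitSize && !current.isEmpty()) {
                splits.add(current);      // close the full split
                current = new ArrayList<>();
                used = 0;
            }
            current.add(len);
            used += len;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Four 2 MB files with an 8 MB cap: all four land in one
        // combined split, i.e., one map task reads all four files.
        System.out.println(combine(List.of(2 * mb, 2 * mb, 2 * mb, 2 * mb),
                                   8 * mb).size()); // → 1
    }
}
```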

Of course, you can also write your own custom InputFormat+RecordReader
that mixes files' records however you want (the mapjoin example, for
instance, reads from multiple files at a time in order to join them).
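As a toy illustration of what such a custom reader might do (purely hypothetical, not the mapjoin code itself), here is a round-robin merge over per-file record lists, producing the "mixed" ordering from the first example:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InterleaveSketch {
    // Round-robin merge of per-file record streams -- the kind of
    // mixing a custom RecordReader could perform across several files.
    static List<String> interleave(List<List<String>> files) {
        List<String> out = new ArrayList<>();
        List<Iterator<String>> its = new ArrayList<>();
        for (List<String> f : files) its.add(f.iterator());
        boolean any = true;
        while (any) {
            any = false;
            for (Iterator<String> it : its) {
                if (it.hasNext()) {
                    out.add(it.next());
                    any = true;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Records alternate between the two "files".
        System.out.println(interleave(List.of(
                List.of("f1r1", "f1r2"),
                List.of("f2r1", "f2r2")))); // → [f1r1, f2r1, f1r2, f2r2]
    }
}
```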

> --- On Wed, 7/20/11, Harsh J <harsh@cloudera.com> wrote:
>
>> From: Harsh J <harsh@cloudera.com>
>> Subject: Re: Order of files in Map class
>> To: hdfs-user@hadoop.apache.org
>> Date: Wednesday, July 20, 2011, 3:44 AM
>> Florin,
>>
>> Your second example is how it happens in Hadoop, but there's more
>> here to understand.
>>
>> To start with, your InputFormat (input splitter) computes and
>> publishes a set of InputSplits. The total number of input splits is
>> going to be your total number of 'Map Tasks' in Hadoop as the job
>> proceeds. The input splits are generally block splits, i.e.,
>> start-and-stop lengths over the same file.
>>
>> Each 'MapTask' is designated one split from this list of splits. So
>> every map task initializes separately, in its own JVM (no shared
>> resources -- again, it's a different instance of mappers per file or
>> block!) and reads its input split alone, into its map(key, value,
>> context) function.
>>
>> So to summarize, your second example is what will happen, but it
>> would happen in parallel instead, such as:
>>
>> map1  | map2  | …
>> file1 | file2 | …
>> row1  | row1  | …
>> row2  | row2  | …
>>
>> P.S. What I've explained here is the default behavior. Of course,
>> things can be highly tweaked to achieve other things, like your
>> first example, but those probably come with greater read costs
>> attached. The 'hadoop' way is data-local, and one-file-per-task.
>>
>> On Wed, Jul 20, 2011 at 12:11 PM, Florin P <florinpico@yahoo.com>
>> wrote:
>> > Hello!
>> > Suppose that we have the files F1, F2, ..., Fk given by the input
>> > splitter to the map class. In what order will they arrive when the
>> > map function is applied?
>> > What interests me is whether it is possible for mixed key-value
>> > pairs from different files to arrive at the map function. Do the
>> > keys arrive grouped by their file, until no more keys are left
>> > from the source file, or can they arrive as one key from F1, one
>> > key from Fk, and so on?
>> >  Example:
>> >   Mixed key value pairs at the map function:
>> >    K1 from F1
>> >    K5 from F5
>> >    K7 from F8
>> >  etc
>> >
>> >  ordered key-value pairs:
>> >    K1 from F1
>> >    ..
>> >    K_end_F1 from F1
>> >    K5 from F5
>> >    ..
>> >    K_end_F5 from F5
>> >  and so on.
>> >
>> > I look forward to your answer.
>> >  Regards,
>> >  Florin
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>



-- 
Harsh J
