hadoop-mapreduce-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Reading multiple input files.
Date Sat, 11 Jan 2014 00:53:38 GMT
How many mappers are generated depends on the InputFormat class you choose for your MR job.
The default, which is used if you don't specify one in your job, is TextInputFormat; it
generates one split per block, assuming your file is splittable.
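For example, a rough driver sketch using the new (org.apache.hadoop.mapreduce) API; it sets
TextInputFormat explicitly, which is the same as the default behavior, and uses the identity
Mapper only as a placeholder for whatever mapper your job actually needs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(MyJob.class);
        job.setMapperClass(Mapper.class);                // identity mapper, placeholder only
        job.setInputFormatClass(TextInputFormat.class);  // the default: one split per block
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}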
Which task node runs a given mapper depends on a lot of conditions. You can search online
about it, but for now just assume it is unpredictable. That means all of your task nodes need to
be able to access your data.
If you are using HDFS, you should have no problem, as HDFS is available to all the task
nodes. If the data is on local disk, then it has to be available on all the task nodes.
Yong

Date: Fri, 10 Jan 2014 14:18:53 -0800
Subject: Re: Reading multiple input files.
From: kchew534@gmail.com
To: user@hadoop.apache.org

Yong, this is very helpful! Thanks.

Still trying to wrap my head around all this :)

Let's stick to this hypothetical scenario where my data files are located on different servers,
for example,


machine-1:/foo/bar.txt
machine-2:/foo/bar.txt
machine-3:/foo/bar.txt
machine-4:/foo/bar.txt
machine-5:/foo/bar.txt
...........


So how does Hadoop determine how many mappers it needs? Can I run my job like this?

hadoop MyJob -input /foo -output output

Kim



On Fri, Jan 10, 2014 at 8:04 AM, java8964 <java8964@hotmail.com> wrote:




Yes. 
Hadoop is very flexible about the underlying storage system. It is in your control how to utilize
the cluster's resources, including CPU, memory, IO and network bandwidth.

Check out Hadoop's NLineInputFormat; it may be the right choice for your case.
You can put all the metadata of your files (data) into one text file, and feed this text file
to your MR job.

Each mapper will get one line of text from that file, and start processing the data represented
by that one line.
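Just to make it concrete, here is a rough sketch of what that setup could look like with the new
API. The metadata file path, the "host:path" line format, and the RemoteFetchMapper class are all
made-up examples for illustration, not anything specific to your environment:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteLoadJob {

    // Each map() call receives one line of the metadata file, e.g. "machine-1:/foo/bar.txt".
    // How you actually pull the bytes (HTTP, NFS mount, vendor API, ...) depends entirely on
    // your third-party system, so that part is left as a comment here.
    public static class RemoteFetchMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(":", 2);  // host, path
            String host = parts[0];
            String path = parts[1];
            // Open a stream to host:path, process it record by record, and emit whatever
            // key/value pairs your job needs, e.g.:
            // context.write(new Text(host + path), new LongWritable(recordCount));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "load-from-third-party");
        job.setJarByClass(RemoteLoadJob.class);
        job.setMapperClass(RemoteFetchMapper.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);                     // one metadata line per mapper
        NLineInputFormat.addInputPath(job, new Path("/meta/files.txt"));  // the metadata file
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}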
Is it a good solution for you? You have to judge that yourself. Keep the following in mind:

1) Normally, the above approach works well for an MR job that loads data from a third-party
system, especially for CPU-intensive jobs.
2) You do utilize the cluster: if you have 100 mapper task slots and 100 files to be processed,
you get pretty good concurrency.

But:
1) Is your data evenly split across the third-party system? In the above example, with 100 files
(or chunks of data), if one file is 10G and the rest are only 100M, then one mapper will take
MUCH longer than the rest. You will have a long-tail problem, which hurts overall performance.
2) There is NO data locality advantage compared to HDFS. All the mappers need to load the data
from the third-party system remotely.
3) If each file (or chunk of data) is very large, what about failover? For example, if you have
100 mapper task slots but only 20 files of 10G each, then you under-utilize your cluster
resources, as only 20 mappers will handle them while the remaining 80 mapper task slots sit
idle. More importantly, if one mapper fails, all the data it has already processed has to be
discarded, and another mapper has to restart this chunk of data from the beginning. Your
overall performance suffers.

As you can see, HDFS gives you a lot of benefits, and here you lose all of them. Sometimes
you have no other choice but to load the data on the fly from some 3rd party system.
But you need to think through the points above, and try to get from the 3rd party system,
if you can, the same benefits that HDFS would provide.

Yong

Date: Fri, 10 Jan 2014 01:21:19 -0800
Subject: Reading multiple input files.
From: kchew534@gmail.com
To: user@hadoop.apache.org


How does a MR job read multiple input files from different locations?

What if the input files are not in HDFS and are located on different servers? Do I have to copy
them to HDFS first and instruct my MR job to read from there? Or can I instruct my MR job to
read directly from those servers?



Thanks.

Kim