hadoop-hdfs-user mailing list archives

From Devaraj k <devara...@huawei.com>
Subject RE: spawn maps without any input data - hadoop streaming
Date Wed, 17 Jul 2013 03:30:53 GMT
Hi Austin,

Here, the number of maps for a job depends on the splits returned by the InputFormat.getSplits()
API. We can write an input format that decides the number of maps for a job (by returning
the desired number of splits) according to need.

If we use FileInputFormat, the number of splits depends on the job's input data; that's
why you see that the number of mappers is proportional to the job input size.
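
As a rough illustration of that proportionality, here is a minimal Python sketch of the split-count arithmetic that FileInputFormat applies per file. It is a simplified model, not the Hadoop code itself: the function names and the 64 MB block size in the usage note are assumptions chosen for illustration, and zero-length and non-splittable files are left out.

```python
# Simplified model of how FileInputFormat turns input size into a split count.
# Function names and defaults are illustrative assumptions, not Hadoop APIs.

SPLIT_SLOP = 1.1  # the final split of a file may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size is the block size clamped between the min and max split sizes."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_sizes, block_size, min_size=1, max_size=2**63 - 1):
    """Return the number of splits (and hence map tasks) for files of the given byte sizes."""
    split_size = compute_split_size(block_size, min_size, max_size)
    total = 0
    for length in file_sizes:
        if length == 0:
            continue  # zero-length files are not modelled in this sketch
        remaining = length
        # Carve off full-size splits while more than 1.1 split sizes remain.
        while remaining / split_size > SPLIT_SLOP:
            total += 1
            remaining -= split_size
        # Whatever is left becomes the last split of this file.
        if remaining > 0:
            total += 1
    return total
```

For example, with an assumed 64 MB block size, `count_splits([1024 * 1024 * 1024], 64 * 1024 * 1024)` models a single 1 GB file, while `count_splits([], 64 * 1024 * 1024)` shows why a job with no input files gets no maps under this model.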

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/InputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext)

Thanks
Devaraj k

From: Austin Chungath [mailto:austincv@gmail.com]
Sent: 16 July 2013 14:40
To: user@hadoop.apache.org
Subject: spawn maps without any input data - hadoop streaming

Hi,

I am trying to generate random data using hadoop streaming & python. It's a map only job
and I need to run a number of maps. There is no input to the map as it's just going to generate
random data.

How do I specify the number of maps to run? ( I am confused here because, if I am not wrong,
the number of maps spawned is related to the input data size )
Any ideas as to how this can be done?

Warm regards,
Austin
