hadoop-mapreduce-user mailing list archives

From Rohit Kelkar <rohitkel...@gmail.com>
Subject Re: How to reduce number of splits in DataDrivenDBInputFormat?
Date Thu, 20 Jan 2011 07:50:27 GMT
You could try this piece of code before calling job.waitForCompletion():

FileSystem dfs = FileSystem.get(conf);
long fileSize = dfs.getFileStatus(new Path(hdfsFile)).getLen();
long maxSplitSize = fileSize / NUM_OF_MAP_TASKS; // in your case NUM_OF_MAP_TASKS = 4
conf.setLong("mapred.max.split.size", maxSplitSize);
job.waitForCompletion(true); // pass true or false as required

If you want more map tasks, set the "mapred.max.split.size" parameter
using the computation above. If you instead need to restrict the number
of map tasks, do a similar computation and set the
"mapred.min.split.size" parameter, as sketched below.
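
A minimal sketch of that second case, under the same assumptions as the
snippet above (conf, job, hdfsFile and NUM_OF_MAP_TASKS come from your
driver code; imports and surrounding setup omitted as before):

// Restrict the job to roughly NUM_OF_MAP_TASKS map tasks by raising the
// minimum split size. Assumes conf, job, hdfsFile and NUM_OF_MAP_TASKS
// are defined in your driver, as in the snippet above.
FileSystem dfs = FileSystem.get(conf);
long fileSize = dfs.getFileStatus(new Path(hdfsFile)).getLen();
long minSplitSize = (fileSize + NUM_OF_MAP_TASKS - 1) / NUM_OF_MAP_TASKS; // round up
conf.setLong("mapred.min.split.size", minSplitSize);
job.waitForCompletion(true);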

- Rohit

On Thu, Jan 20, 2011 at 1:03 PM, Joan <joan.monplet@gmail.com> wrote:
> Hi Sonal,
>
> I put both configurations:
>
>         job.getConfiguration().set("mapreduce.job.maps","4");
>         job.getConfiguration().set("mapreduce.map.tasks","4");
>
> But neither configuration works. I also tried setting "mapred.map.task", but
> that does not work either.
>
> Joan
>
> 2011/1/20 Sonal Goyal <sonalgoyal4@gmail.com>
>>
>> Joan,
>>
>> You should be able to set the mapred.map.tasks property to the maximum
>> number of mappers you want. This can control parallelism.
>>
>> Thanks and Regards,
>> Sonal
>> Connect Hadoop with databases, Salesforce, FTP servers and others
>> Nube Technologies
>>
>> On Wed, Jan 19, 2011 at 9:32 PM, Joan <joan.monplet@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I want to reduce the number of splits because my job seems to generate too
>>> many of them.
>>> While my job is running I can see:
>>>
>>> INFO mapreduce.Job:  map ∞% reduce 0%
>>>
>>> I'm using DataDrivenDBInputFormat:
>>>
>>> setInput
>>>
>>> public static void setInput(Job job,
>>>                             Class<? extends DBWritable> inputClass,
>>>                             String tableName,
>>>                             String conditions,
>>>                             String splitBy,
>>>                             String... fieldNames)
>>>
>>> Note that the "orderBy" column is called the "splitBy" in this version.
>>> We reuse the same field, but it's not strictly ordering it -- just
>>> partitioning the results.
>>>
>>>
>>> So I get all the data from myTable and try to split by a date column. I
>>> obtain millions of rows, and I suppose DataDrivenDBInputFormat generates many
>>> splits, but I don't know how to reduce these splits or how to tell
>>> DataDrivenDBInputFormat to split by my date column (which corresponds to
>>> splitBy).
>>>
>>> The main goal is to improve performance, so I want my map tasks to run faster.
>>>
>>>
>>> Can someone help me?
>>>
>>> Thanks
>>>
>>> Joan
>>>
>>>
>>>
>>
>
>
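
On Joan's DataDrivenDBInputFormat question above, a minimal sketch of the
kind of setInput call being described, splitting on a date column. Apart
from the table name myTable, everything here is a placeholder (the JDBC
driver, connection details, MyRecord, the event_date column and the field
list are assumptions, not taken from the thread); imports and the rest of
the driver are omitted as in the snippets above:

// Sketch only: all names below except "myTable" are hypothetical placeholders.
Configuration conf = job.getConfiguration();
DBConfiguration.configureDB(conf,
    "com.mysql.jdbc.Driver",                   // placeholder JDBC driver
    "jdbc:mysql://dbhost/mydb", "user", "pw"); // placeholder connection details
// splitBy is the date column; DataDrivenDBInputFormat only partitions the
// results on this column, it does not order them.
DataDrivenDBInputFormat.setInput(job,
    MyRecord.class,       // DBWritable implementation (placeholder)
    "myTable",            // table name from the thread
    null,                 // no extra WHERE conditions
    "event_date",         // placeholder date column used as splitBy
    "id", "event_date", "payload"); // placeholder field names

As far as I recall, DataDrivenDBInputFormat derives its split count from the
configured number of map tasks applied to the min/max range of the splitBy
column, so the mapreduce.job.maps / mapred.map.tasks setting discussed above
is the knob worth double-checking rather than the file-based split sizes.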
