hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: How to reduce number of splits in DataDrivenDBInputFormat?
Date Thu, 20 Jan 2011 09:21:37 GMT
Moving this offline from the list.

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others
<https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Thu, Jan 20, 2011 at 2:18 PM, Joan <joan.monplet@gmail.com> wrote:

> Hi Sonal,
>
> I've downloaded the hiho project and found its DBInputAvroMapper.java
> very interesting.
>
> I want to read from the DB using this Mapper and have its Reducer write
> serialized objects too. How can I do that?
>
> After that, I want to create another job whose Mapper reads the output
> (serialized objects) of the previous Reducer. How can I do that?
>
> Thanks Sonal,
>
>
> Joan
>
>
> 2011/1/20 Sonal Goyal <sonalgoyal4@gmail.com>
>
>> Which Hadoop version are you on?
>>
>> You can alternatively try using hiho from
>> https://github.com/sonalgoyal/hiho  to get your data from the db. Please
>> write to me directly if you need any help there.
>>
>>
>> Thanks and Regards,
>> Sonal
>>
>> On Thu, Jan 20, 2011 at 1:03 PM, Joan <joan.monplet@gmail.com> wrote:
>>
>>> Hi Sonal,
>>>
>>> I set both properties:
>>>
>>>         job.getConfiguration().set("mapreduce.job.maps","4");
>>>         job.getConfiguration().set("mapreduce.map.tasks","4");
>>>
>>> But neither setting has any effect. I also tried setting "mapred.map.task",
>>> but that does not work either.
>>>
>>> Joan
>>>
>>> 2011/1/20 Sonal Goyal <sonalgoyal4@gmail.com>
>>>
>>>> Joan,
>>>>
>>>> You should be able to set the mapred.map.tasks property to the maximum
>>>> number of mappers you want. This can control parallelism.
>>>>
>>>> Thanks and Regards,
>>>> Sonal
>>>>
>>>> On Wed, Jan 19, 2011 at 9:32 PM, Joan <joan.monplet@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I want to reduce the number of splits because I think my job
>>>>> generates too many of them.
>>>>> While my job is running I can see:
>>>>>
>>>>> *INFO mapreduce.Job:  map ∞% reduce 0%*
>>>>>
>>>>> I'm using DataDrivenDBInputFormat:
>>>>>
>>>>> setInput
>>>>>
>>>>> public static void setInput(Job job,
>>>>>                             Class<? extends DBWritable> inputClass,
>>>>>                             String tableName,
>>>>>                             String conditions,
>>>>>                             String splitBy,
>>>>>                             String... fieldNames)
>>>>>
>>>>> Note that the "orderBy" column is called the "splitBy" in this
>>>>> version. We reuse the same field, but it's not strictly ordering it --
>>>>> just partitioning the results.
>>>>>
>>>>> So I get all the data from myTable and try to split it by a date
>>>>> column. I obtain millions of rows, and I suppose that
>>>>> DataDrivenDBInputFormat generates many splits. I don't know how to
>>>>> reduce the number of splits, or how to tell DataDrivenDBInputFormat
>>>>> to split by my date column (which corresponds to splitBy).
>>>>>
>>>>> The main goal is to improve performance, so I want my maps to run
>>>>> faster.
>>>>>
>>>>>
>>>>> Can someone help me?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Joan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
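
The question in this thread is how the split count relates to the
configured number of map tasks. As far as I understand it,
DataDrivenDBInputFormat looks up the minimum and maximum of the splitBy
column and divides that range into roughly as many pieces as the configured
map-task count, so the row count alone should not determine the number of
splits. The following is only a minimal sketch of that idea in plain Java,
assuming a numeric split column (a date can be mapped to epoch days). The
names SplitSketch, Range and split are hypothetical and are not part of
Hadoop.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is NOT Hadoop's source code.
class SplitSketch {

    /** A range of split-column values: lo <= v < hi, or lo <= v <= hi if hiInclusive. */
    static final class Range {
        final long lo;
        final long hi;
        final boolean hiInclusive;

        Range(long lo, long hi, boolean hiInclusive) {
            this.lo = lo;
            this.hi = hi;
            this.hiInclusive = hiInclusive;
        }

        @Override
        public String toString() {
            return "[" + lo + ", " + hi + (hiInclusive ? "]" : ")");
        }
    }

    /**
     * Divides [min, max] into at most targetSplits contiguous ranges.
     * Assumes max - min fits in a long.
     */
    static List<Range> split(long min, long max, int targetSplits) {
        if (targetSplits < 1) {
            throw new IllegalArgumentException("targetSplits must be >= 1");
        }
        if (max < min) {
            throw new IllegalArgumentException("max must be >= min");
        }
        List<Range> ranges = new ArrayList<>();
        // Width of each range; never smaller than one value.
        long step = Math.max(1, (max - min) / targetSplits);
        long lo = min;
        // Emit all but the last range as half-open intervals.
        while (lo < max && ranges.size() < targetSplits - 1) {
            long hi = Math.min(max, lo + step);
            ranges.add(new Range(lo, hi, false));
            lo = hi;
        }
        // The last range absorbs any remainder and includes max itself.
        ranges.add(new Range(lo, max, true));
        return ranges;
    }

    public static void main(String[] args) {
        // A date column mapped to epoch days, split for four map tasks.
        long min = LocalDate.of(2010, 1, 1).toEpochDay();
        long max = LocalDate.of(2011, 1, 1).toEpochDay();
        for (Range r : split(min, max, 4)) {
            System.out.println(r);
        }
    }
}
```

In this sketch the number of ranges depends only on targetSplits and on the
width of the value range, which is why the map-task property discussed
above is the natural thing to adjust. Which property name the job actually
reads depends on the Hadoop version in use.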
