hadoop-common-user mailing list archives

From "Joman Chu" <jom...@andrew.cmu.edu>
Subject Re: parallel mapping on single server
Date Sat, 12 Jul 2008 16:09:07 GMT
If it is a SequenceFile, Hadoop will split it on a key/value pair
boundary. For plain text files, I believe there is an input format that
handles this, though I haven't worked with non-SequenceFile input
myself. From looking over some of the Hadoop source, try taking a look at
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
and http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
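[On Hong's question about a word like "hadoop" straddling a block boundary: for text input, splits are made at byte offsets and the line-oriented record reader compensates, so no line, and hence no word, is lost. The following is not Hadoop's actual LineRecordReader, just a toy sketch of the idea: every split except the first skips the partial line at its start, and every split reads past its end to finish its last line, so each line is read exactly once, by the split in which it begins.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not Hadoop code) of how a line-oriented record
// reader handles a record that straddles a split boundary.
public class SplitReader {
    // Return the lines "owned" by the byte-range split [start, end) of data.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start > 0) {
            // Skip the tail of a line that began in the previous split;
            // that split is responsible for reading it in full.
            while (pos < data.length() && data.charAt(pos - 1) != '\n') pos++;
        }
        List<String> lines = new ArrayList<>();
        // Emit lines as long as the line *starts* before the split end.
        while (pos < data.length() && pos < end) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) nl = data.length();
            lines.add(data.substring(pos, nl)); // may read past 'end'
            pos = nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "foo bar\nhadoop rocks\nbaz\n";
        // Split the data at byte 11, i.e. in the middle of "hadoop" ("had|oop").
        System.out.println(readSplit(data, 0, 11));              // [foo bar, hadoop rocks]
        System.out.println(readSplit(data, 11, data.length()));  // [baz]
    }
}
```

Even though byte 11 falls between "had" and "oop", the first split reads the whole line "hadoop rocks" because the line starts inside it, and the second split skips forward past that line before emitting "baz".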

Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net


On Sat, Jul 12, 2008 at 11:39 AM, hong <minghong.zhou@163.com> wrote:
> Hi,
>
> I have a question about the strategy described by Joman Chu:
> "Hadoop will try to split the file according to how it is split up in the
> HDFS"
>
> Take wordcount as an example. Suppose "hadoop" is a word in the input
> file, block 1 ends with "had", and block 2 starts with "oop". How is
> this case handled?
>
> Thanks for your reply
>
> On 2008-7-11, at 5:27 AM, Joman Chu wrote:
>
>> Hadoop will try to split the file according to how it is stored in
>> HDFS. For example, if an input file has three blocks with a
>> replication factor of two, there are six block replicas in total. Say
>> there are six machines, each holding a single replica: block 1 is on
>> machines 1 and 2, block 2 is on 3 and 4, and block 3 is on 5 and 6.
>> Hadoop will create three map tasks. Each task is assigned to a
>> machine and processes the block stored locally on that machine. If
>> that isn't possible, the data is pulled from another machine in the
>> same rack, and failing that from machines elsewhere in the cluster.
>>
>> Joman Chu
>> AIM: ARcanUSNUMquam
>> IRC: irc.liquid-silver.net
>>
>>
>> On Thu, Jul 10, 2008 at 10:40 AM, hong <minghong.zhou@163.com> wrote:
>>>
>>> Hi
>>>
>>> Following up on Haijun Cao's reply:
>>>
>>> Suppose we have set 8 map tasks. How does each map task know which
>>> part of the input file it should process?
>>>
>>> On 2008-7-10, at 2:33 AM, Haijun Cao wrote:
>>>
>>>> Set the number of map slots per tasktracker to 8 in order to run 8 map
>>>> tasks on one machine at the same time (assuming one tasktracker per machine):
>>>>
>>>>
>>>> <property>
>>>>  <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>  <value>8</value>
>>>>  <description>The maximum number of map tasks that will be run
>>>>  simultaneously by a task tracker.
>>>>  </description>
>>>> </property>
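[On Deepak's question about a pseudo DFS mode: Hadoop does support single-machine pseudo-distributed operation, and the setting above goes in hadoop-site.xml alongside it. A minimal hadoop-site.xml sketch might look like the following; property names are from the Hadoop 0.x configuration layout, and the ports are just common defaults, so adjust as needed.]

```xml
<configuration>
  <!-- Run HDFS and MapReduce daemons on a single machine. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Allow 8 concurrent map tasks, one per core on an 8-core box. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
```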
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Deepak Diwakar [mailto:ddeepak4u@gmail.com]
>>>> Sent: Monday, July 07, 2008 1:29 AM
>>>> To: core-user@hadoop.apache.org
>>>> Subject: parallel mapping on single server
>>>>
>>>> Hi,
>>>>
>>>> I am pretty new to Hadoop. I ran a modification of wordcount on
>>>> almost a TB of data on a single server, but found that it takes too
>>>> much time. I noticed that only one core is utilized at a time, even
>>>> though my server has 8 cores. I read that Hadoop speeds up
>>>> computation in DFS mode, but how do I make full use of a single
>>>> server with a multicore processor? Is there a pseudo DFS mode in
>>>> Hadoop? What changes are required in the config files? Please let me
>>>> know in detail. Is there anything to do with hadoop-site.xml and
>>>> mapred-default.xml?
>>>>
>>>> Thanks in advance.
>>>> --
>>>> - Deepak Diwakar,
>>>> Associate Software Eng.,
>>>> Pubmatic, pune
>>>> Contact: +919960930405
>>>
>>>
>>>
>>>
>
>
>
>