hadoop-general mailing list archives

From 2003 qz <qz2...@gmail.com>
Subject Thinking of distributed (parallel) computing
Date Wed, 20 Jul 2011 15:30:42 GMT



The MapReduce model currently has three problems that cannot be solved.

1. The MapReduce model is inefficient.

2. Reusability of code cannot be globally optimized.

3. Linear model of data processing.



The following is a detailed explanation of the above three problems:

1. The MapReduce model is inefficient

    The MapReduce model divides one calculation into multiple calculations, with data storage operations inserted between them. The data processing of MR can be summarized as the sequence map-reduce-map-reduce-..., where the map phase must always exist (even if it does nothing).

    MR data processing can be divided into three kinds:

A) The first is “one map and one reduce”. This kind of data processing can be completed effectively by almost all distributed computing frameworks, and the MR framework also does it well;

B) The second is “one map + one combiner + one reduce”, where the map and the combiner are executed on the same physical node. After the map is finished and its results have been stored (in memory), the combiner begins to calculate. This process really divides one calculation (map + combiner) into two calculations.

C) The third is the execution sequence map-reduce-map-reduce-..., where the implementation of a map can be empty or a reduce can be omitted. The map phase is a mapping process whose function is to map a <key, value> pair to a set of <key, value> pairs. We could realize each map function (except the first one) inside the reduce of the previous job, so MR once again divides one calculation (reduce + map) into two calculations (a sketch of this folding follows this list).
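As an illustration of the point in C), here is a minimal sketch, not from the original mail, of folding a hypothetical second job's map into the first job's reduce. Suppose the second job's map would only re-key each word by its count (for example, to sort by frequency later); the reducer below sums the counts and emits the re-keyed pairs directly, so the separate map pass, which would only re-read and re-write the same data, is not needed:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count style reducer that also performs what a following job's map
// would have done: re-key each word by its total count.
public class SumAndRekeyReducer
    extends Reducer<Text, IntWritable, IntWritable, Text> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    total.set(sum);
    // The hypothetical "next map" step (emit <count, word>) happens here,
    // instead of in a separate job that re-reads the reduce output from DFS.
    context.write(total, word);
  }
}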



What I have to say is that the MapReduce model is inefficient, because it involves a lot of redundant data storage steps.



Let us look at some examples (based on the code of Hadoop’s WordCount).

WordCount1 (W1): It has two jobs, job1 and job2. Job1 is the same as WordCount, and job2 has only one map, which does nothing (a driver sketch follows below).

WordCount2 (W2): It is the same as WordCount, except that the reduce phase writes nothing.
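Here is a minimal sketch of the W1 driver, not from the original mail, assuming the TokenizerMapper and IntSumReducer classes of the standard Hadoop WordCount example are available on the classpath. Job1 is the usual word count; job2 runs only the default identity Mapper over job1's output and writes it out again, so it “does nothing” beyond re-reading and re-writing the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount1 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path temp = new Path(args[1]);    // job1 output, job2 input (round trip through DFS)
    Path output = new Path(args[2]);

    // job1: the ordinary word count.
    Job job1 = Job.getInstance(conf, "w1-job1-wordcount");
    job1.setJarByClass(WordCount1.class);
    job1.setMapperClass(WordCount.TokenizerMapper.class);
    job1.setCombinerClass(WordCount.IntSumReducer.class);
    job1.setReducerClass(WordCount.IntSumReducer.class);
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job1, input);
    FileOutputFormat.setOutputPath(job1, temp);
    if (!job1.waitForCompletion(true)) System.exit(1);

    // job2: a map-only job whose map "does nothing" -- the base Mapper class
    // is an identity map, so job2 just re-reads job1's output and re-writes it.
    Job job2 = Job.getInstance(conf, "w1-job2-identity");
    job2.setJarByClass(WordCount1.class);
    job2.setMapperClass(Mapper.class);
    job2.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job2, temp);
    FileOutputFormat.setOutputPath(job2, output);
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}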

The difference in running time between the two tests is the time MR wastes. It includes four parts:

A) T_Job1_Finished: the time from the end of job1’s reduce to the start of job2, minus the time spent deleting job1’s temporary data (more than 1 second);

B) T_Job2_Start: the time of job2’s configuration (3-5 seconds);

C) T_RM: the time for job2’s map to read its data plus the time for task scheduling (3 (64M) + 3 = 6 seconds; this time changes with the block size);

D) T_WR: the time for job1’s reduce to write to DFS, which changes with the amount of data.

In the test, the running time of the empty map job was 15 seconds, which is almost the same as the sum of the times above.

2. Reusability of code cannot be globally optimized

Suppose you already have two MR programs: code1, which extracts a part of the source data, and code2, which modifies some content. Now, if you want to extract a part of the source data and then modify some content in the result, how will you do it?

One way is to first run code1 and, after code1 is finished, run code2 (as sketched below). If the data is large and there are several steps to do, may I have a sleep while I wait?
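A minimal sketch of that first way, assuming code1 and code2 are already written as two hypothetical map-only programs with mapper classes ExtractMapper and ModifyMapper: the driver simply runs the two jobs one after the other, and the intermediate result has to make a full round trip through DFS between them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractThenModify {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path temp = new Path(args[1]);    // intermediate data written to and re-read from DFS
    Path output = new Path(args[2]);

    // "code1": extract a part of the source data (ExtractMapper is hypothetical).
    Job extract = Job.getInstance(conf, "code1-extract");
    extract.setJarByClass(ExtractThenModify.class);
    extract.setMapperClass(ExtractMapper.class);
    extract.setNumReduceTasks(0);
    FileInputFormat.addInputPath(extract, input);
    FileOutputFormat.setOutputPath(extract, temp);
    if (!extract.waitForCompletion(true)) System.exit(1);

    // "code2": modify some content in code1's result (ModifyMapper is hypothetical).
    Job modify = Job.getInstance(conf, "code2-modify");
    modify.setJarByClass(ExtractThenModify.class);
    modify.setMapperClass(ModifyMapper.class);
    modify.setNumReduceTasks(0);
    FileInputFormat.addInputPath(modify, temp);
    FileOutputFormat.setOutputPath(modify, output);
    System.exit(modify.waitForCompletion(true) ? 0 : 1);
  }
}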

Another way is to rewrite a combined MR program. OK, that is efficient at run time. But if the number of processing steps is N, you have to write C(N,2) = N(N-1)/2 programs to cover all two-step combinations, C(N,3) for all three-step combinations, and so on; for example, with N = 10 steps that is already 45 two-step programs and 120 three-step programs. (Oh, my god.)

3. Linear model of data processing

    If the data processing forms a directed acyclic graph, then you will get tired if you use MapReduce, because you have to work out yourself how to split the graph into tasks, how to run those tasks in parallel, and so on.
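To make this concrete, here is a minimal sketch, not from the original mail, of wiring up even a tiny DAG with Hadoop's JobControl. jobA, jobB and jobC are assumed to be already-configured Job instances, jobC depends on both jobA and jobB, and it is the programmer rather than the framework who has to say which jobs may run in parallel:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class DagDriver {
  // Runs a small DAG: A and B have no dependencies (may run in parallel),
  // C runs only after both A and B have finished.
  public static void runDag(Job jobA, Job jobB, Job jobC) throws Exception {
    ControlledJob a = new ControlledJob(jobA, null);
    ControlledJob b = new ControlledJob(jobB, null);
    ControlledJob c = new ControlledJob(jobC, null);
    c.addDependingJob(a);
    c.addDependingJob(b);

    JobControl control = new JobControl("dag-example");
    control.addJob(a);
    control.addJob(b);
    control.addJob(c);

    // JobControl implements Runnable; run it in its own thread and poll.
    Thread t = new Thread(control);
    t.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}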





The conclusion is: the MapReduce model is inefficient except for the simplest process (one map + one reduce).

If you have any questions, you can contact me: qz2003@gmail.com.
