The MapReduce paradigm currently has three problems that cannot be solved:

1. The MapReduce model is inefficient.
2. Code reuse cannot be globally optimized.
3. The data-processing model is linear.

Below is a detailed explanation of each of the three problems.

1. The MapReduce model is inefficient.

MapReduce divides one computation into multiple computations, with data-storage operations inserted between them. MR data processing can be summarized as the sequence map-reduce-map-reduce-..., where the map phase must always exist (even if it does nothing). MR data processing falls into three kinds:

A) The first is "one map and one reduce". Almost every distributed computing framework can complete this kind of processing effectively, and the MR framework also does well here.

B) The second is "one map + one combiner + one reduce", where the map and the combiner execute on the same physical node. After the map finishes and its results are in memory, the combiner begins to compute. This really divides one computation (map + combiner) into two computations.

C) The third is the general sequence "map-reduce-map-reduce-...", where a map implementation can be empty or a reduce can be omitted. The map phase is a mapping process whose function is to map an input key/value pair to a set of intermediate key/value pairs. We could realize each map function (except the first one) inside the reduce of the previous job, so here MR once again divides one computation (reduce + map) into two computations.

What I have to say is that the MapReduce model is inefficient because it has many redundant data-storage steps. Let us look at some examples (based on the code of Hadoop's wordcount):

WordCount1 (W1): it has two jobs, job1 and job2. Job1 is the same as wordcount, and job2 has only one map, which does nothing.

WordCount2 (W2): it is the same as wordcount, except that the reduce phase writes nothing.
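The "one map + one combiner + one reduce" structure of wordcount can be sketched in plain Python, with no Hadoop involved. This is only an in-memory illustration of the phases (the function names are mine, not Hadoop's); note that the combiner and the reducer share the same summing logic, which is why the combiner is really just the first half of the reduce computation:

```python
from collections import defaultdict

def map_phase(line):
    # Hadoop-style wordcount map: emit a (word, 1) pair for each word.
    return [(word, 1) for word in line.split()]

def combine_or_reduce(pairs):
    # In wordcount the combiner and the reducer run the same logic:
    # sum the counts for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def wordcount(lines):
    # "one map + one combiner + one reduce": the combiner runs on each
    # mapper's local output, then the reduce merges the combined results.
    combined = [combine_or_reduce(map_phase(line)) for line in lines]
    return combine_or_reduce(
        (word, n) for partial in combined for word, n in partial.items()
    )
```

For example, `wordcount(["a b a", "b c"])` returns `{"a": 2, "b": 2, "c": 1}`. In a real cluster, the output of each phase would be written to storage before the next phase starts, which is exactly the redundancy discussed above.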
The difference in running time between the two tests is the time MR wastes. It includes four parts:

A) T_Job1Finished: the time from the end of job1's reduce to the start of job2, minus the time spent deleting job1's temporary data (more than 1 second);
B) T_Job2Start: the time spent configuring job2 (3-5 seconds);
C) T_RM: the time job2's map spends reading data plus the task-scheduling time (3 seconds for a 64 MB block + 3 seconds = 6 seconds; this changes with the block size);
D) T_WR: the time job1's reduce spends writing to DFS, which changes with the amount of data.

In the test, the running time of an empty map job is 15 seconds, which is almost the same as the sum of the times above.

2. Code reuse cannot be globally optimized.

Suppose you need to extract a part of the source data and modify some content in the result. How will you do it? One way is to run code1 first, and after code1 is finished, run code2. If the data is large and there are several steps to do, may I have a sleep? Another way is to write a new combined MR program. OK, that is efficient. But if the number of such processing steps is N, then you have to write C(N,2) = N(N-1)/2 programs to cover all two-step combinations, C(N,3) for all three-step combinations, and so on. (Oh, my god!)

3. Linear model of data processing.

If your data processing is a directed acyclic graph, then you will be tired if you use MapReduce, because you need to work out by hand how to split the graph into jobs, how to run those jobs in parallel, and so on.

The conclusion is: the MapReduce model is inefficient except for the simplest processing (one map + one reduce). If you have any questions, you can contact me: qz2003@gmail.com.