hadoop-common-user mailing list archives

From "Lukas Vlcek" <lukas.vl...@gmail.com>
Subject Hadoop efficiency question
Date Sat, 18 Aug 2007 07:13:45 GMT

I have just found an interesting video called "Scalability and Efficiency on
Data Mining Applied to Internet Applications". The link is:

It touches on the MapReduce paradigm, and a large portion of the presentation is
devoted to the classical data mining task of Frequent Itemset Mining (experimental
results for other tasks are presented as well). If I understood correctly,
then one of the main points of this presentation is that current MapReduce
is great for stateless computations, but it can be a problem (less effective)
when a stateful approach is needed. For their needs they created MapReduce-derived
implementations where each Reduce phase can store results and other
metadata in an external repository so that other tasks can learn about them
very quickly (so that a subsequent Map task can start earlier if it has all the
information it needs and does not have to wait until the whole Map phase finishes).
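The idea described above can be sketched in miniature: reduce tasks publish partial results into a shared external store as they finish, and a downstream task blocks only on the keys it actually needs rather than on the whole phase. This is a toy simulation with threads, not Hadoop code; the `ExternalRepository` class and all names in it are hypothetical, invented just to illustrate the scheduling effect the presentation describes.

```python
import threading
import time

class ExternalRepository:
    """Hypothetical shared store where reduce tasks publish partial results."""
    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def publish(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait_for(self, keys):
        # Block only until the requested keys are available.
        with self._cond:
            self._cond.wait_for(lambda: all(k in self._data for k in keys))
            return {k: self._data[k] for k in keys}

def reduce_task(repo, key, values, delay):
    time.sleep(delay)               # simulate uneven reduce runtimes
    repo.publish(key, sum(values))  # publish the partial result early

def eager_task(repo, needed_keys):
    # Starts as soon as *its* inputs are published, not after the
    # whole reduce phase has finished.
    partials = repo.wait_for(needed_keys)
    return {k: v * 2 for k, v in partials.items()}

repo = ExternalRepository()
threading.Thread(target=reduce_task, args=(repo, "a", [1, 2, 3], 0.01)).start()
threading.Thread(target=reduce_task, args=(repo, "b", [4, 5], 0.5)).start()

# This task only needs key "a", so it proceeds while the reducer
# for "b" is still running.
result = eager_task(repo, ["a"])
print(result)  # {'a': 12}
```

In stock Hadoop, by contrast, a job's map phase does not begin until the previous job's reduce phase has completed, which is exactly the barrier the presenters seem to be working around.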

Would this be possible in the current Hadoop implementation? Or would such a
modification go far beyond current Hadoop architectural concepts? (I noticed
that the question from the audience at the end of the presentation was about node
failures, so maybe even the big guys at Google haven't been using this approach.)

