hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "DataProcessingBenchmarks" by JeffHammerbacher
Date Tue, 10 Aug 2010 17:20:05 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "DataProcessingBenchmarks" page has been changed by JeffHammerbacher.
The comment on this change is: Removed the old page as the results were from five versions ago, they worked with a tiny data set, and they weren't validated by anyone else.
http://wiki.apache.org/hadoop/DataProcessingBenchmarks?action=diff&rev1=37&rev2=38

--------------------------------------------------

+ Some benchmarks from 2009 done at Yahoo!: http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
- <<TableOfContents(4)>>
- ----
-  * Reporter : [[udanax|Edward Yoon]]
  
- == Hadoop Map/Reduce Data Processing Benchmarks ==
- === Group/Sort ===
- 
-  * Finds the most connected networks.
- 
- SQL > select ipaddress, count(*) from access_log group by ipaddress order by count(*) desc limit 0,100;
- <<BR>>''σ ,,count. ipaddress,, (τ ,,count,, (γ ,,count(ipaddress). ipaddress,, (access_log)))''
- 
- ==== MapReduce Flow ====
- 
-  * Map was used to extract the IP address of the client requesting the web page.
-  * Reduce was used to sum the counts per IP address.
-  * One more Map/Reduce job was used to sort by count.
- 
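The removed page never showed the job code, so as a rough illustration here is a minimal local Python sketch of the two-phase flow described above (extract IP, sum, sort by count); the function names and the sample log lines are made up for this example, and a real Hadoop job would implement the same logic as Mapper/Reducer classes:

```python
from collections import Counter

def map_extract_ip(line):
    """Map phase: emit the client IP, the first field of an access_log line."""
    return line.split(" ", 1)[0]

def top_ips(log_lines, limit=100):
    """Reduce phase sums the per-IP counts; a second sort-by-count pass mirrors
    'order by count(*) desc limit 0,100' from the SQL query above."""
    counts = Counter(map_extract_ip(line) for line in log_lines)
    return counts.most_common(limit)

# Hypothetical sample data, not from the original benchmark.
log = [
    '10.0.0.1 - - [10/Aug/2010:17:20:05 +0000] "GET / HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Aug/2010:17:20:06 +0000] "GET /a HTTP/1.0" 200 128',
    '10.0.0.1 - - [10/Aug/2010:17:20:07 +0000] "GET /b HTTP/1.0" 404 0',
]
print(top_ips(log, limit=2))  # most frequent IPs first
```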
- ==== Benchmarks ====
- 
- ===== 1.5 GB access_log on 10 node cluster =====
- 
- This test should include the data load time in the MySQL column, not just the query time.
- 
- [[http://wiki.apache.org/hadoop-data/attachments/DataProcessingBenchmarks/attachments/C__Users_udanax_Desktop_test-10.png]]
- 
- ||<bgcolor="#E5E5E5">||<bgcolor="#E5E5E5">!MySql 5.0.27 ||<bgcolor="#E5E5E5">Hadoop-0.15.2 ||<bgcolor="#E5E5E5">Hadoop-0.15.2 ||<bgcolor="#E5E5E5">Hadoop-0.15.2 ||<bgcolor="#E5E5E5">Hadoop-0.15.2 ||<bgcolor="#E5E5E5">Hadoop-0.15.2 ||
- ||<bgcolor="#E5E5E5">Data ||B-tree disk table (MyISAM)||Text files (access_log)||Text files (access_log)||Text files (access_log)||Text files (access_log)||Text files (access_log)||
- ||<bgcolor="#E5E5E5">Machines ||1 || 2|| 4|| 6|| 8|| 10||
- ||<bgcolor="#E5E5E5">Rows ||5,914,669 ||5,914,669||5,914,669||5,914,669||5,914,669||5,914,669||
- ||<bgcolor="#E5E5E5">Results ||100 ||100||100||100||100||100||
- ||<bgcolor="#E5E5E5">Time ||4.43 sec ||172.30 sec||108.01 sec||77.41 sec||66.30 sec||60.78 sec||
- 
- ----
- 
- I also investigated many traditional parallel-processing methods and experimented with some higher-level processing (e.g. matrix algebra, graph algorithms) using Hadoop/HBase/MapReduce. The only way to achieve a linear speed-up was data locality (write all the data locally, even if that duplicates effort). As the number of nodes increases, aggregate I/O bandwidth increases linearly.
- 
