hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BristolHadoopWorkshop" by SteveLoughran
Date Fri, 14 Aug 2009 14:44:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/BristolHadoopWorkshop

The comment on the change is:
Futures writeup

------------------------------------------------------------------------------
  == Hadoop Futures ==
  
   * [http://www.slideshare.net/steve_l/hadoop-futures Hadoop Futures] (Tom White, Cloudera)
+ 
+ Tom's goals for Hadoop
+  * make it modular
+  * support more languages than just Java
+  * better integration with management tools
+ 
+ === Schedulers ===
+ 
+ CapacityScheduler: Yahoo!'s, designed for very large clusters shared by many different users.
It can take RAM requirements into account and place work on machines with free RAM,
rather than just a free "slot".
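
The difference between slot-only and RAM-aware placement can be illustrated with a toy sketch. This is not the CapacityScheduler's actual algorithm; the `Node` class, both function names, and the "most free RAM wins" tie-break are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a worker machine."""
    name: str
    free_slots: int
    free_ram_mb: int


def pick_by_slot(nodes: List[Node]) -> Optional[Node]:
    """Slot-only placement: first node with a free slot, RAM ignored."""
    for node in nodes:
        if node.free_slots > 0:
            return node
    return None


def pick_by_ram(nodes: List[Node], task_ram_mb: int) -> Optional[Node]:
    """RAM-aware placement: needs a free slot and enough free RAM."""
    candidates = [
        n for n in nodes
        if n.free_slots > 0 and n.free_ram_mb >= task_ram_mb
    ]
    # Prefer the node with the most free RAM (an arbitrary choice for this sketch).
    return max(candidates, key=lambda n: n.free_ram_mb, default=None)
```

With one nearly full node and one mostly idle node, the slot-only picker can land a large task on the machine that cannot hold it, while the RAM-aware picker skips it or declines to place the task at all.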
+ 
+ FairScheduler: Facebook's. For a datacentre running production work with latency requirements,
where some people are also running lower-priority Hive jobs.
+ 
+ 
+ === Languages ===
+ 
+  * streaming: stdin and stdout, text or typed binaries; slow
+  * pipes: C++ interface
+  * HDFS is pure Java. FUSE is slow because of this
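
As a rough illustration of the streaming model listed above, a mapper is just a program that reads records on stdin and writes key/value lines to stdout, tab-separated by default. The sketch below is a minimal word-count mapper; the function name `map_lines` and the lower-casing of words are assumptions of this example, not anything presented at the workshop.

```python
import sys
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def main() -> None:
    # Streaming feeds input records on stdin and reads
    # tab-separated key/value lines back from stdout.
    for record in map_lines(sys.stdin):
        sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Every record crosses a process boundary as text, which is one reason this route is slower than the native Java API.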
+ 
+ === Security ===
+ 
+ This is going to take lots of work. It's really hard to get security right.
+  
+ === Scaling Down ===
+  * standalone doesn't have >1 reducer.
+  * MiniMR will run multicore, but you have the overhead of the full RPC protocol, even though
everything is running in a single process.
+  * Ideal: a multicore-ready single client. 
+  
+ Someone needs to make the local job runner better. It's been neglected because none of the big
projects use it. To make Hadoop useful for small amounts of data and single-machine work, the
standalone runner needs work.
+ 
+ Pig in local mode doesn't use the local job runner => need to take what they've done.
+ 
+ === Project split ===
+ 
+ New list structure
+  * ${project}-dev: every issue when created, other discussion 
+  * ${project}-issues: every JIRA update
+  * ${project}-user: user discussions. This is a bit confusing now; there are so many of these.
+  * hadoop-general - worth getting on this list too
+ 
+ === 0.21 release ===
+  * Any 0.21 features must go in this month!
+  * MAPREDUCE-207: Computing splits on the cluster, which reduces effort on the client
+  * HADOOP-6165 Avro and Thrift
+  * Context Objects - new API, finished for 0.21
+  * new shuffle : read the Y! paper on sortbenchmark.org
+ 
+ === Hadoop 1.0 goals ===
+ 
+ The goal for 1.0 is to have some things stable for a year: API, wire format (Avro). Some
things will be marked "unstable, developer only" to avoid having to guarantee that they stay fixed.
+  
+  * HADOOP-5073: interface classification
+  * HADOOP-5071: wire protocol
+  
+ Paolo wants a hadoop-client POM that pulls in only the dependencies for the client. Similarly,
a hadoop-local that only pulls in stuff for local things. 
+ 
+ === Eclipse Plugin ===
+ 
+ The Eclipse plugin is not in sync with Eclipse. Nobody is supporting/using it right now.
With a stable API/wire format it would work better. (Of course, with a stable long-haul API,
the plugin could always work with a remote cluster, even if you had to be careful that the
jobs you ran were compiled against a compatible version.)
  
  == Benchmarking Hadoop ==
  
@@ -144, +203 @@

   * GPFS -Bristol. Taking a while to work as promised. 
  
  === Hadoop Integration ===
- HDFS has proven v. successful at T2 sites; popularity may increase as centres expand. Appreciated
features: checksumming, admin tools. Validate that the data is OK.
+ HDFS has proven v. successful at Tier-2 sites; popularity may increase as centres expand.
Appreciated features: checksumming, admin tools. Validate that the data is OK.
  
  Could you run CMSSW under Hadoop? Probably not. Very slow startup/teardown cost, so you
don't want to just run it for one/two events.
  
