hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "BristolHadoopWorkshop" by SteveLoughran
Date Fri, 21 Aug 2009 11:34:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/BristolHadoopWorkshop

------------------------------------------------------------------------------
  
  == Acknowledgements ==
  
- # Bristol Centre for Nanoscience and Quantum Information [http://www.bristol.ac.uk/nsqi-centre/
NSQI] for the room and other facilities
+  1. Bristol Centre for Nanoscience and Quantum Information [http://www.bristol.ac.uk/nsqi-centre/
NSQI] for the room and other facilities
- # University of Bristol Particle Physics group for hosting the workshop
+  1. University of Bristol Particle Physics group for hosting the workshop
- # HP Laboratories for the food and coffee
+  1. HP Laboratories for the food and coffee
- # Cloudera for supplying beer at the Highbury Vaults.
+  1. Cloudera for supplying beer at the Highbury Vaults.
  
- == Presentation == 
+ == Presentation ==
  
  These presentations were intended to start discussion and thought
  
@@ -20, +20 @@

    * [http://www.slideshare.net/steve_l/hdfs HDFS] (Johan Oskarsson, Last.fm)
  * [http://www.slideshare.net/steve_l/graphs-1848617 Graphs] (Paolo Castagna, HP)
    * [http://www.slideshare.net/steve_l/long-haul-hadoop Long Haul Hadoop] (Steve Loughran,
HP)
-  
+ 
  == Hadoop Futures ==
  
   * [http://www.slideshare.net/steve_l/hadoop-futures Hadoop Futures] (Tom White, Cloudera)
@@ -34, +34 @@

  
  CapacityScheduler. Yahoo!'s -designed for very large clusters with different people working on them. Can take RAM requirements into account and place work on machines with free RAM space, rather than just a free "slot".
  
- FairScheduler -Facebook's. For a datacentre running production work with latency requirements,
some people also running Hive jobs which are lower priority. 
+ FairScheduler -Facebook's. For a datacentre running production work with latency requirements,
some people also running Hive jobs which are lower priority.
  
  
  === Languages ===
@@ -46, +46 @@

  === Security ===
  
  This is going to take lots of work. It's really hard to get security right.
-  
+ 
  === Scaling Down ===
   * standalone doesn't have >1 reducer.
   * MiniMR will run multicore, but you have the overhead of the full RPC protocol, even though
everything is running in a single process.
-  * Ideal: a multicore-ready single client. 
+  * Ideal: a multicore-ready single client.
-  
+ 
- Someone needs to make the local job runner better. It's been neglected because the big projects don't use it. To make Hadoop useful for small amounts of data and single-machine work, the standalone runner needs work.
+ Someone needs to make the local job runner better. It's been neglected because the big projects don't use it. To make Hadoop useful for small amounts of data and single-machine work, the standalone runner needs work.
  
  Pig in local mode doesn't use the local job runner => need to take what they've done.
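  A minimal sketch (mine, not from the workshop; the property names assume 0.20-era configuration keys) of what "scaling down" means in practice today: force a job onto the LocalJobRunner, the neglected single-JVM path described above, with its one-reducer limit.
 {{{
// Sketch: run the default (identity) MapReduce job entirely in-process.
// "fs.default.name" and "mapred.job.tracker" are the 0.20-era configuration keys.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalRun {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // local filesystem rather than HDFS
    conf.set("mapred.job.tracker", "local");   // LocalJobRunner: one JVM, at most one reducer
    Job job = new Job(conf, "local-test");     // defaults give an identity mapper and reducer
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
 }}}
  Everything here is single-process and capped at one reducer, so a genuinely multicore-ready local client needs more than configuration changes.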
  
  === Project split ===
  
  New list structure
-  * ${project}-dev: every issue when created, other discussion 
+  * ${project}-dev: every issue when created, other discussion
   * ${project}-issues: every JIRA update
-  * ${project}-user: user discussions. This is a bit confused now, there are so many of these.

+  * ${project}-user: user discussions. This is a bit confused now, there are so many of these.
   * hadoop-general - worth getting on this list too
  
  === 0.21 release ===
   * Any 0.21 features must go in this month!
-  * MAPREDUCE-207Computing splits on the cluster: reduces effort on the client 
+  * MAPREDUCE-207 Computing splits on the cluster: reduces effort on the client
   * HADOOP-6165 Avro and Thrift
   * Context Objects - new API, finished for 0.21
   * new shuffle: read the Y! paper on sortbenchmark.org
@@ -74, +74 @@

  === Hadoop 1.0 goals ===
  
  The goal for 1.0 is to have some things stable for a year: API, wire format (Avro). Some things will be marked "unstable, developer only" to avoid any guarantee that they will stay fixed.
-  
+ 
-  * HADOOP-5073 -interface classification 
+  * HADOOP-5073 -interface classification
   * HADOOP-5071 -wire protocol
-  
+ 
- Paolo wants a hadoop-client POM that pulls in only the dependencies for the client. Similarly,
a hadoop-local that only pulls in stuff for local things. 
+ Paolo wants a hadoop-client POM that pulls in only the dependencies for the client. Similarly,
a hadoop-local that only pulls in stuff for local things.
  
  === Eclipse Plugin ===
  
@@ -178, +178 @@

  
   * [http://www.slideshare.net/steve_l/hadoop-hep Hadoop and High-Energy Physics] (Simon
Metson, Bristol University)
  
- The CMS experiment is on the Large Hadron Collider; it will run for 20-30 years colliding
heavy ions, such as lead ions. Every collision is an event; 1MB of data. Over a year, you
are looking at 10+PB of data. Right now, as the LHC isn't live, everything is simulation data,
which helps debug the dataflow and the code, but reduces the stress. Most events are unexciting,
you may need to run through a few hundred million events to find a handful that are relevant.

+ The CMS experiment is on the Large Hadron Collider; it will run for 20-30 years colliding
heavy ions, such as lead ions. Every collision is an event; 1MB of data. Over a year, you
are looking at 10+PB of data. Right now, as the LHC isn't live, everything is simulation data,
which helps debug the dataflow and the code, but reduces the stress. Most events are unexciting,
you may need to run through a few hundred million events to find a handful that are relevant.
  
- Jobs get sent to specific clusters round the world where the data exists. It is "move work to data" across datacentres, but once in place, there isn't so much locality.
+ Jobs get sent to specific clusters round the world where the data exists. It is "move work to data" across datacentres, but once in place, there isn't so much locality.
- The Grid protocols are used to place work, but a lot of the underlying grid stuff isn't
appropriate; written with a vision that doesn't match the needs. Specifically, while the schedulers
are great at work placement on specific machine types, meeting hardware and software requirements
(Hadoop doesn't do any of that), you can't ask for time on the MPI-enabled bit of the infrastructure,
as the grid placement treats every machine as standalone; doesn't care about interconnectivity.

+ The Grid protocols are used to place work, but a lot of the underlying grid stuff isn't
appropriate; written with a vision that doesn't match the needs. Specifically, while the schedulers
are great at work placement on specific machine types, meeting hardware and software requirements
(Hadoop doesn't do any of that), you can't ask for time on the MPI-enabled bit of the infrastructure,
as the grid placement treats every machine as standalone; doesn't care about interconnectivity.
-   
+ 
  
  New concept: "dark data" - data kept on somebody's laptop. This makes up the hidden bulk of the data sets. When you consider that it's laptop data that is the enemy of corporate security and AV teams, the name is apt. In CMS, people like to have their own sample data on their site; they are possessive about it. This is probably because the LHC isn't running, and the data rate isn't overloading everyone. When the beam goes live, you will be grateful for storage and processing anywhere.
-   
+ 
- A lot of the physicists who worked on the LEP predecessor are used to storing everything
on a hard disk. The data rates render this viewpoint obsolete. 
+ A lot of the physicists who worked on the LEP predecessor are used to storing everything
on a hard disk. The data rates render this viewpoint obsolete.
  
  In the LHC-era clusters, there is a big problem of disk, tape, and CPU balance. For example, the memory footprint of the jobs is such that multicore doesn't benefit you much unless you have 32/64 GB. It also means that job setup/teardown costs are steep. You don't want to work through an event at a time; you want to run through a few thousand. The events end up being stored in 2GB files for this reason.
  
  The code is all FORTRAN coded in C++.
  
- This was a really interesting talk that Simon should give at ApacheCon. Physicists may be used to discussing the event rate of a high-flux-density hadron beam, but for the rest of us, it makes a change from web server logs.
+ This was a really interesting talk that Simon should give at ApacheCon. Physicists may be used to discussing the event rate of a high-flux-density hadron beam, but for the rest of us, it makes a change from web server logs.
  
  
  === Data flow ===
   * LHC -> Tier 0, in Geneva
-  * Tier 0 records everything to tape, pushes it out to the tier 1 sites round the world.
In the UK, Rutherford Appleton Labs is the tier 1 site. 
+  * Tier 0 records everything to tape, pushes it out to the tier 1 sites round the world.
In the UK, Rutherford Appleton Labs is the tier 1 site.
   * Tier 1 do some computation as well as storage -you can "skim" the data on tier one, quick reject of dull stuff -and can share the results. This is effectively a reduce.
   * Tier 2 sites do most of the computation; they have their own storage and are scattered
round the world in various institutions. In the US, the designs are fairly homogeneous, in
EU, less so.
  The architecture of the LHC pipeline is done more for national/organisation politics than
for efficient processing. The physicists don't get billed for network traffic.
@@ -209, +209 @@

   * Castor: CERN, includes tapes, team turnover/staffing bad
   * dCache: Fermilab. Works there.
   * DPM: doesn't scale beyond 10TB of RAID.
-  * GPFS -Bristol. Taking a while to work as promised. 
+  * GPFS -Bristol. Taking a while to work as promised.
  
  === Hadoop Integration ===
  HDFS has proven very successful at Tier-2 sites; popularity may increase as centres expand. Appreciated features: checksumming and admin tools to validate that the data is OK.
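  A small illustration of the checksumming point (a sketch only; the path is made up): HDFS will hand back the stored checksum of a file, so two sites can compare their copies of the same dataset.
 {{{
// Sketch: read back the checksum HDFS keeps for a file, e.g. to validate a replica
// at another Tier-2 site. Run with the cluster configuration on the classpath so the
// default filesystem is HDFS; getFileChecksum returns null on filesystems without support.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileChecksum sum = fs.getFileChecksum(new Path("/cms/events/part-00000"));  // hypothetical path
    System.out.println(sum);   // identical output at two sites => the copies match
  }
}
 }}}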
@@ -224, +224 @@

  
  This was a talk by Paolo Castagna on graph work under MR, of which PageRank is the classic application.
   * graph topology does not change every iteration, so why ship it around every MR?
-  * the graph defines the other jobs you need to communicate with. 
+  * the graph defines the other jobs you need to communicate with.
- The graph is a massive data structure which, if you are doing inference work, only grows
in relationships. Steve thinks: You may need some graph model which is shared across servers,
which they can all add to. There is a small problem here: keeping the information current
for 4000 servers, but what if you don't have to, what if you treat updates to the graph as
lazy facts to propagate round? 
+ The graph is a massive data structure which, if you are doing inference work, only grows
in relationships. Steve thinks: You may need some graph model which is shared across servers,
which they can all add to. There is a small problem here: keeping the information current
for 4000 servers, but what if you don't have to, what if you treat updates to the graph as
lazy facts to propagate round?
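  To make the "why ship the topology around every MR?" cost concrete, here is a rough sketch (not Paolo's code; the line format of node, rank and comma-separated out-links is an assumption) of one PageRank iteration as a plain MapReduce job. Note that the adjacency list has to be pushed through the shuffle on every single pass.
 {{{
// Sketch of one PageRank iteration. Assumed line format: "node<TAB>rank<TAB>out1,out2,...".
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIteration {

  public static class PRMapper extends Mapper<Object, Text, Text, Text> {
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String node = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String links = parts.length > 2 ? parts[2] : "";
      ctx.write(new Text(node), new Text("LINKS\t" + links));     // ship the topology again
      if (links.length() > 0) {
        String[] targets = links.split(",");
        for (String t : targets) {                                // spread rank over out-links
          ctx.write(new Text(t), new Text(Double.toString(rank / targets.length)));
        }
      }
    }
  }

  public static class PRReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text node, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String links = "";
      double sum = 0.0;
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("LINKS\t")) links = s.substring(6);      // recover the adjacency list
        else sum += Double.parseDouble(s);                        // rank contribution
      }
      double rank = 0.15 + 0.85 * sum;                            // damping factor 0.85
      ctx.write(node, new Text(rank + "\t" + links));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "pagerank-iteration");
    job.setJarByClass(PageRankIteration.class);
    job.setMapperClass(PRMapper.class);
    job.setReducerClass(PRReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);             // rerun until ranks converge
  }
}
 }}}
  A Pregel-style model instead keeps the partitioned graph resident between supersteps, which is much of its appeal for this kind of algorithm.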
  
  Google: Pregel. What do you need from a language to describe PageRank in 15 lines?
  
  MS: Dryad does not do graphs.
  
  Projects
-  * Apache Hamburg -proposed by Edward Yoon, of Hama; is it making progress? We need code.
+  * Apache Hamburg -proposed by Edward Yoon, of Hama; is it making progress? We need code.
   * Apache Commons Graph. Dead, pre-MapReduce code.
  
  === Graph Algorithms ===
@@ -248, +248 @@

  
  The current best printed work on Graph over MapReduce is the recent paper [http://www2.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
Graph Twiddling in a MapReduce World], by Jonathan Cohen.
  
-  * You can't do transitive closure in SQL. 
+  * You can't do transitive closure in SQL.
   * What would an efficient transitive closure algorithm over MR be? (see the sketch below)
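  One naive shape, sketched below as a starting point rather than the efficient answer: treat the current set of reachable pairs as a relation, self-join it on the middle vertex once per MapReduce round, deduplicate, and repeat until no new pairs appear.
 {{{
// Sketch of one self-join round for transitive closure over MR.
// Input lines are pairs "a<TAB>b" (initially the edge list); each round adds (x,z)
// whenever (x,y) and (y,z) are already present. A separate dedup pass runs between rounds.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClosureRound {

  public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] p = value.toString().split("\t");                 // pair (a, b)
      ctx.write(new Text(p[1]), new Text("L\t" + p[0]));         // left side of a join on b
      ctx.write(new Text(p[0]), new Text("R\t" + p[1]));         // right side of a join on a
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text mid, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> lefts = new ArrayList<String>();
      List<String> rights = new ArrayList<String>();
      for (Text v : values) {
        String[] p = v.toString().split("\t");
        if ("L".equals(p[0])) lefts.add(p[1]); else rights.add(p[1]);
      }
      for (String x : lefts) {
        ctx.write(new Text(x), new Text(mid.toString()));        // keep the existing pair (x, mid)
        for (String z : rights) {
          ctx.write(new Text(x), new Text(z));                   // new pair (x, z)
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "closure-round");
    job.setJarByClass(ClosureRound.class);
    job.setMapperClass(JoinMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
 }}}
  Feeding the whole closure-so-far back in doubles the reachable path length each round, so the number of rounds is logarithmic in the longest path; the obvious costs are the dedup pass and the reducer buffering every pair for one join key.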
  
  
@@ -264, +264 @@

  
  This was a discussion topic run by Julio
  
- * 400 Y! staff are moving to MS. How many are search specialists, versus Hadoop hackers?
+ * 400 Y! staff are moving to MS. How many are search specialists, versus Hadoop hackers?
  
- * Y! is driving large scale tests, Facebook is #2.
+ * Y! is driving large scale tests, Facebook is #2.
- * Y! are making Hadoop the core of the company; it is the line of business (LOB) of their datacentre.
+ * Y! are making Hadoop the core of the company; it is the line of business (LOB) of their datacentre.
  
  What are the risks of the Merger, and warning signs of trouble:
   # silence: Y! developers do their own fork, it goes closed source. We have seen this happen
in other OSS projects (Axis), where a single company suddenly disappears. There is no defence
from this other than making sure development knowledge is widespread. The JIRA-based discussion/documentation
is good here,
   as it preserves all knowledge, and makes decisions in the open.
-  
+ 
-  # staff departure. Key staff in the Hadoop team could leave, which would set things back.
Moving into MS could be bad, but moving to Google would set back development the worst. 
+  # staff departure. Key staff in the Hadoop team could leave, which would set things back.
Moving into MS could be bad, but moving to Google would set back development the worst.
   # slower development/rate of feature addition
   # reduced release rate. This can compensate for reduced testing resources.
-  # reduced rate of bug fixes. We can assume that Y!'s own problems will be addressed, then everything else is other people's problems.
+  # reduced rate of bug fixes. We can assume that Y!'s own problems will be addressed, then everything else is other people's problems.
   # Less testing, reduced quality
  Apparently under [http://community.cloudera.com] you can track the number of messages/JIRAs and infer activity, such as [http://community.cloudera.com/reports/47/contributors/ contributors] and [http://community.cloudera.com/reports/47/issues/ popular issues]
-  
+ 
- With Yahoo! outsourcing search to MS, MS can take on a big project that, even if it isn't profitable to MS, can be subsidised by other parts of their business. It ensures Yahoo!'s continuing survival as an independent company, which is the best state for Hadoop development. It also frees up some Yahoo! datacentres for other projects. Those big datacentres are large, slowly-depreciating assets, and by offloading the indexing to someone else, there is now spare datacentre capacity for Yahoo! to use for other purposes -purposes that are highly likely to use Hadoop at the back end, because what else would they build a datacentre-scale application on?
+ With Yahoo! outsourcing search to MS, MS can take on a big project that, even if it isn't profitable to MS, can be subsidised by other parts of their business. It ensures Yahoo!'s continuing survival as an independent company, which is the best state for Hadoop development. It also frees up some Yahoo! datacentres for other projects. Those big datacentres are large, slowly-depreciating assets, and by offloading the indexing to someone else, there is now spare datacentre capacity for Yahoo! to use for other purposes -purposes that are highly likely to use Hadoop at the back end, because what else would they build a datacentre-scale application on?
  
  At the same time, there are opportunities for people outside Yahoo!
   * more agile deployments
@@ -289, +289 @@

  Clearly for Cloudera, this gives them a greater opportunity to position themselves as "the
owners of Hadoop", especially if they get more of the core Hadoop people on board. However,
Apache do try to add their own management layer to stop handing off full ownership of a project
to outside companies. But the reality is whoever provides the engineering effort owns the
project, so any organisation that can provide FTEs can dominate.
  
  What are the increased responsibilities for everyone else involved with Hadoop?
-  * Everyone has to test on larger clusters. EC2 may get tested, but it's not enough as it is virtual, and only represents one single site/network config.
+  * Everyone has to test on larger clusters. EC2 may get tested, but it's not enough as it is virtual, and only represents one single site/network config.
   * Everyone should pull down and play with the pre-releases, on test clusters. Check the
FS upgrades work, etc.
  
