hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BristolHadoopWorkshop" by SteveLoughran
Date Fri, 14 Aug 2009 10:53:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:

The comment on the change is:
Hadoop and HEP

    * [http://www.slideshare.net/steve_l/hdfs HDFS] (Johan Oskarsson, Last.fm)
    * [http://www.slideshare.net/steve_l/graphs-1848617 Graphs] Paolo Castagna, HP
    * [http://www.slideshare.net/steve_l/long-haul-hadoop Long Haul Hadoop] (Steve Loughran,
-   * [http://www.slideshare.net/steve_l/benchmarking-1840029 Benchmarking Hadoop] (Steve
Loughran & Julio Guijarro, HP)
+ == Hadoop Futures ==
+  * [http://www.slideshare.net/steve_l/hadoop-futures Hadoop Futures] (Tom White, Cloudera)
- == Benchmarking ==
+ == Benchmarking Hadoop ==
+  * [http://www.slideshare.net/steve_l/benchmarking-1840029 Benchmarking Hadoop] (Steve Loughran
& Julio Guijarro, HP)
  [:Terasort: Terasort], while a good way of regression testing performance across Hadoop
versions, isn't ideal for assessing which hardware is best for algorithms other than sort,
because workloads that are more iterative and CPU/memory hungry may not behave as expected on
a cluster which has good IO, but not enough RAM for their algorithm.
@@ -26, +31 @@

  That is, for all those people asking for a !HappyHadoop JSP page, it isn't enough. A cluster
may cope with some of the workers going down, but it is not actually functional unless every
node that is up can talk to every other node that is up, that nothing is coming up listening
on IPv6, that the TaskTracker hasn't decided to only run on localhost, etc. etc.
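The "every node that is up can talk to every other node that is up" condition above can be sketched as an all-pairs check. This is only an illustration, not anything Hadoop ships: the function names and the `can_reach(src, dst)` probe are hypothetical, and a real check would have to run the probe on each source node rather than from one machine.

```python
from itertools import permutations


def unreachable_pairs(live_nodes, can_reach):
    """Return every ordered (src, dst) pair of live nodes where src cannot reach dst.

    can_reach is a caller-supplied probe (hypothetical); it is asked about
    both directions of every pair, since reachability need not be symmetric.
    """
    return [(src, dst) for src, dst in permutations(live_nodes, 2)
            if not can_reach(src, dst)]


def cluster_is_functional(live_nodes, can_reach):
    """True only if every live node can reach every other live node."""
    return not unreachable_pairs(live_nodes, can_reach)


# Usage with a fake probe: node "c" is up, but "a" cannot reach it.
broken_links = {("a", "c")}
fake_probe = lambda src, dst: (src, dst) not in broken_links
nodes = ["a", "b", "c"]
problems = unreachable_pairs(nodes, fake_probe)
```

Nodes that are down are simply left out of `live_nodes`, which matches the point that losing workers is tolerable while a partitioned set of live workers is not.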
  == Long-Haul Hadoop ==
+   * [http://www.slideshare.net/steve_l/long-haul-hadoop Long Haul Hadoop] (Steve Loughran,
  This talk discussed the notion of a long-haul interface to Hadoop.
@@ -98, +105 @@

  Discussion: Simon mentioned that they had a REST API to some of the CERN job submission
services, and later sent out [https://twiki.cern.ch/twiki/bin/view/CMS/DMWTTutorialDatabaseREST#REST_classes_in_Webtools
a link]. There was general agreement that you need to push out more than just MR jobs
+ == Hadoop and High-Energy Physics ==
+  * [http://www.slideshare.net/steve_l/hadoop-hep Hadoop and High-Energy Physics] (Simon
Metson, Bristol University)
+ The CMS experiment is on the Large Hadron Collider; it will run for 20-30 years colliding
heavy ions, such as lead ions. Every collision is an event: 1MB of data. Over a year, you
are looking at 10+PB of data. Right now, as the LHC isn't live, everything is simulation data,
which helps debug the dataflow and the code, but reduces the stress. Most events are unexciting;
you may need to run through a few hundred million events to find a handful that are relevant.

+ Jobs get sent to the specific clusters around the world where the data exists. This is "move
work to data" across datacentres, but once the work is in place, there isn't so much locality.
+ The Grid protocols are used to place work, but a lot of the underlying grid stuff isn't
appropriate; it was written with a vision that doesn't match the needs. Specifically, while the schedulers
are great at work placement on specific machine types, meeting hardware and software requirements
(Hadoop doesn't do any of that), you can't ask for time on the MPI-enabled bit of the infrastructure,
as the grid placement treats every machine as standalone and doesn't care about interconnectivity.

+ New concept: "dark data" - data kept on somebody's laptop. This makes up the secret heavy
weight of the data sets. When you consider that it's laptop data that is the enemy of corporate
security and AV teams, the term is apt for everyone. In CMS, people like to have their own sample data
on their site; they are possessive about it. This is probably because the LHC isn't running,
and the data rate isn't overloading everyone. When the beam goes live, you will be grateful
for storage and processing anywhere.
+ A lot of the physicists who worked on the LEP predecessor are used to storing everything
on a hard disk. The data rates render this viewpoint obsolete. 
+ In the LHC-era clusters, there is a big problem of disk, tape, CPU balance. For example,
multicore doesn't help much, as the memory footprint is such that you see little benefit
unless you have 32/64 GB. It also means that job setup/teardown costs are steep. You
don't want to work an event at a time; you want to run through a few thousand. The events
end up being stored in 2GB files for this reason.
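As a rough back-of-envelope check on the figures quoted in these notes (1MB per event, 2GB files, 10+PB a year), the arithmetic can be sketched as below. It assumes decimal units and that all of the yearly volume is 1MB events, neither of which the talk stated, and it takes 10PB as a lower bound.

```python
# Figures quoted in the notes; decimal units are an assumption.
EVENT_BYTES = 1_000_000               # 1MB per event
FILE_BYTES = 2_000_000_000            # 2GB per file
YEAR_BYTES = 10 * 10**15              # 10PB per year (lower bound, "10+PB")

events_per_file = FILE_BYTES // EVENT_BYTES    # events processed per job if a job takes one file
events_per_year = YEAR_BYTES // EVENT_BYTES    # assumes every byte stored is event data
files_per_year = YEAR_BYTES // FILE_BYTES

print(f"{events_per_file} events per file")
print(f"{events_per_year:.1e} events per year")
print(f"{files_per_year} files per year")
```

Under these assumptions a 2GB file holds about two thousand events, which is consistent with wanting to run through "a few thousand" events per job.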
+ The code is all FORTRAN coded in C++.
+ This was a really interesting talk that Simon should give at ApacheCon. Physicists may be
used to discussing the event rate of a high-flux-density hadron beam, but for the rest of
us, it makes a change from web server logs.
+ === Data flow ===
+  * LHC -> Tier 0, in Geneva
+  * Tier 0 records everything to tape, pushes it out to the tier 1 sites round the world.
In the UK, Rutherford Appleton Labs is the tier 1 site. 
+  * Tier 1 sites do some computation as well as storage. You can "skim" the data on tier 1
(a quick reject of dull stuff) and share the results. This is effectively a reduce.
+  * Tier 2 sites do most of the computation; they have their own storage and are scattered
round the world in various institutions. In the US, the designs are fairly homogeneous; in
the EU, less so.
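The tier 1 "skim" step in the data flow above amounts to dropping uninteresting events before passing data on. A minimal sketch follows; the event layout (a dict with an `energy` field) and the selection cut are purely hypothetical stand-ins, not CMS data formats or real physics cuts.

```python
def skim(events, is_interesting):
    """Lazily yield only the events that pass the selection predicate."""
    for event in events:
        if is_interesting(event):
            yield event


# Hypothetical events and a hypothetical cut, for illustration only.
events = [
    {"id": 1, "energy": 3.0},
    {"id": 2, "energy": 120.0},
    {"id": 3, "energy": 7.5},
    {"id": 4, "energy": 250.0},
]
high_energy = lambda event: event["energy"] > 100.0

kept_ids = [event["id"] for event in skim(events, high_energy)]
```

Because `skim` is a generator, it never needs the whole input in memory, which matters when only a handful of events out of hundreds of millions are kept.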
+ The architecture of the LHC pipeline is driven more by national/organisational politics than
by efficient processing. The physicists don't get billed for network traffic.
+ Staff issues: lots of spare-time cluster managers. People are the SPOFs of the CERN tooling.
+ In the long term, they may consolidate onto one or two UK sites.
+ === File Systems ===
+  * Castor: CERN. Includes tapes; team turnover/staffing bad.
+  * dCache: Fermilab. Works there.
+  * DPM: doesn't scale beyond 10TB of RAID.
+  * GPFS: Bristol. Taking a while to work as promised.
+ === Hadoop Integration ===
+ HDFS has proven very successful at T2 sites; popularity may increase as centres expand. Appreciated
features: checksumming and admin tools, which let you validate that the data is OK.
+ Could you run CMSSW under Hadoop? Probably not. The startup/teardown cost is very high, so you
don't want to just run it for one or two events.
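The startup/teardown argument can be illustrated with a simple cost model: the fraction of a job's wall time spent on fixed overhead shrinks as more events are processed per job. The timings below are made-up placeholders, not measured CMSSW numbers.

```python
def overhead_fraction(n_events, setup_s, per_event_s):
    """Fraction of total job time spent on fixed setup/teardown."""
    total = setup_s + n_events * per_event_s
    return setup_s / total


# Hypothetical timings: 600 s of setup/teardown, 1 s of work per event.
SETUP_S = 600.0
PER_EVENT_S = 1.0

one_event = overhead_fraction(1, SETUP_S, PER_EVENT_S)
two_thousand = overhead_fraction(2000, SETUP_S, PER_EVENT_S)

print(f"1 event per job:    {one_event:.1%} overhead")
print(f"2000 events per job: {two_thousand:.1%} overhead")
```

With these placeholder numbers nearly all of a one-event job is overhead, while a job that runs through a couple of thousand events spends most of its time on real work.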
+ Issue: How to convince physicists to embrace MR? They need to see the benefits, as physicists
don't see/care about the costs.
