hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BristolHadoopWorkshopSpring2010" by SteveLoughran
Date Mon, 22 Mar 2010 14:03:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: HEP.


   * Easy to configure using the Hadoop config file format and Behemoth/UIMA rules in JARs
   * Works with the Hadoop ecosystem
- Demo: shows that the jobtracker JSP file has been extended with GATE metrics.
+ Demo: shows that the JobTracker JSP page has been extended with GATE metrics.
  Future work: Cascading support and Avro for cross-language code, Solr and Mahout. It needs
to be tested at scale: runs of ~200K documents so far. Julien would be interested in hearing from
anyone with a datacentre and an NLP problem.
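
The bullet above says Behemoth is configured via the Hadoop config file format. As a rough illustration only, such a job configuration might look like the fragment below; the property names are hypothetical, not Behemoth's actual keys.

```xml
<!-- Hypothetical sketch: these property names are illustrative,
     not Behemoth's real configuration keys. -->
<configuration>
  <property>
    <name>behemoth.annotator.jar</name>
    <value>hdfs:///apps/annotators/gate-rules.jar</value>
  </property>
  <property>
    <name>behemoth.uima.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

The appeal of this format is that the same `Configuration` mechanism Hadoop jobs already use carries the NLP settings, so no separate config system is needed.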
+ == James Jackson: Hadoop and High Energy Physics ==
+ James is from CERN and the CMS experiment - he spoke about ongoing work exploring the use of Hadoop
for HEP event mining.
+ The LHC experiments (ATLAS, CMS, and so on) generate event data, most of which is uninteresting.
Physics events can be split into:
+  * Uninteresting and known physics
+  * Unknown and uninteresting: we don't have the theory ready for these events yet
+  * Unknown and interesting: the stuff people are looking for that matches (somewhat) the current
theories and gives you Nobel prizes and the like.
+ To make life more complicated, there is a lot of noise in the detectors, and timing problems can
mean events come in out of order. You need to do a lot of filtering and look for signals well
above random noise before you can declare that you've found something interesting.
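
"Well above random noise" is conventionally expressed as a significance threshold. As a toy illustration, a filter might flag a sample as interesting when the approximate significance s/sqrt(b) crosses five; the five-sigma discovery convention is real, but the event counts below are invented.

```java
// Toy significance cut of the kind HEP event filters apply.
// The 5-sigma convention is real; the numbers in main() are made up.
public class SignificanceCut {

    // Approximate significance of a counting experiment: s / sqrt(b),
    // where s = observed excess (signal) events, b = expected background events.
    static double significance(double signal, double background) {
        return signal / Math.sqrt(background);
    }

    public static void main(String[] args) {
        double signal = 55.0;      // hypothetical excess events in a mass window
        double background = 100.0; // hypothetical expected background events
        double z = significance(signal, background); // 55 / 10 = 5.5
        // Particle physics usually requires z >= 5 ("five sigma") for a discovery claim.
        System.out.println(z >= 5.0 ? "interesting" : "keep filtering");
        // prints "interesting" (z = 5.5)
    }
}
```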
+ Most physicists not only code as if they were writing FORTRAN, they never wrote good FORTRAN
either. (This is a complaint of [[http://www.cs.utoronto.ca/~gvwilson/|Greg Wilson in Toronto]]:
computing departments never teach software engineering to the scientists who are
expected to code as part of their day-to-day science.)
+ HDFS has been used as a filestore at some of the US CMS Tier-2 sites; the new work that
James discussed was actually treating physics problems as MapReduce jobs. They are
bringing up a cluster of machines with storage for this, but would also like to use idle CPU
time on other machines in the datacentre, and there was some discussion on how to do this. MAPREDUCE-1603
is now a feature request asking for the assessment of slot availability to be pluggable.
This would allow someone to write a plugin that looked at the non-Hadoop
workload of a machine and reduced the number of Hadoop slots reported as available when the machine
is busy with other work.
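
A pluggable availability check of the kind MAPREDUCE-1603 asks for could look something like the sketch below. The interface, class, and method names here are hypothetical; they are not a real Hadoop API.

```java
// Hypothetical sketch of a pluggable slot-availability policy, in the
// spirit of MAPREDUCE-1603. None of these names are a real Hadoop API.
interface SlotAvailabilityPolicy {
    /**
     * Given the slots a TaskTracker is configured with, return how many it
     * should currently advertise, e.g. fewer when non-Hadoop load is high.
     */
    int availableSlots(int configuredSlots);
}

// One possible policy: shed slots in proportion to non-Hadoop CPU load.
class LoadSheddingPolicy implements SlotAvailabilityPolicy {
    private final double otherLoad; // fraction [0,1] of CPU used by non-Hadoop work

    LoadSheddingPolicy(double otherLoad) {
        this.otherLoad = otherLoad;
    }

    @Override
    public int availableSlots(int configuredSlots) {
        // Advertise only the share of slots not claimed by other workloads.
        return (int) Math.floor(configuredSlots * (1.0 - otherLoad));
    }
}
```

With a plugin point like this, the scheduler never needs to know *why* a node shrank its capacity, only that fewer slots are on offer, which is what makes scavenging idle datacentre machines practical.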
