hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BristolHadoopWorkshopSpring2010" by SteveLoughran
Date Mon, 22 Mar 2010 13:57:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: starting workshop notes.


New page:
= Bristol Hadoop Workshop Spring 2010 =

This was a one-day event hosted by HP Laboratories, Bristol, and co-organised by HPLabs and
Bristol University. It was a follow-up to the [[BristolHadoopWorkshop|2009 workshop]], again
a meeting of locals to discuss what they were up to and look at Hadoop in physics, among other
things.

== Julien Nioche: Behemoth ==

Julien Nioche at [[http://www.digitalpebble.com/|digitalPebble]] has been working on Natural
Language Processing at scale.
 * Started with Apache UIMA: fairly simple
 * Now working on Behemoth, "Hadoop's evil twin": not a nice elephant at all
The goal is large-scale document analysis on Hadoop: to let you deploy GATE or UIMA
applications on Hadoop clusters. It was driven by the need to implement this for more than
one client; it was opened up to avoid rewriting it from scratch every time.

Workflow: load the documents into HDFS, then import them into the Behemoth document format
(from PDF, HTML, WARC, Nutch segments, etc.), using Apache Tika to extract text and metadata.
The output is a set of (key=URI, value=BehemothDocument) pairs.
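
The import step can be sketched roughly as below. This is an illustration only, not Behemoth code: the `Doc` class and its fields, `extract_text()` and `import_documents()` are hypothetical names, and the extraction function merely decodes plain text where the real workflow would call Apache Tika. It shows the shape of the output, one record per document keyed by its URI.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a BehemothDocument; the field names are assumptions."""
    url: str
    content_type: str
    text: str
    metadata: dict = field(default_factory=dict)


def extract_text(raw: bytes, content_type: str) -> str:
    """Placeholder for the Tika extraction step: it only decodes plain text."""
    return raw.decode("utf-8", errors="replace").strip()


def import_documents(raw_docs):
    """Yield (key, value) pairs keyed by URI, mirroring the workflow's output."""
    for uri, content_type, raw in raw_docs:
        text = extract_text(raw, content_type)
        yield uri, Doc(url=uri, content_type=content_type, text=text,
                       metadata={"length": len(text)})


# Two made-up documents standing in for files already loaded into HDFS.
corpus = [
    ("http://example.org/a.txt", "text/plain", b"Hadoop workshop notes\n"),
    ("http://example.org/b.txt", "text/plain", b"  NLP at scale  "),
]
records = dict(import_documents(corpus))
```

Keying by URI means later stages can look a document up, or join annotations back to it, without caring which source format it arrived in.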

 * Common ground between UIMA and GATE (University of Sheffield, closed source)
 * Supports different (non-Java) annotators
 * Easy to configure using the Hadoop config file format and Behemoth/UIMA rules in JARs
 * Works with the Hadoop ecosystem
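
For readers unfamiliar with the Hadoop config file format mentioned above: it is an XML file holding a `<configuration>` element with a list of `<property>` name/value pairs. A minimal sketch of reading one follows. The property names and values are invented for illustration and are not real Behemoth settings, and `load_properties()` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hadoop-style configuration; the property names below are made up.
CONFIG_XML = """<configuration>
  <property>
    <name>example.annotator.path</name>
    <value>/apps/annotator.jar</value>
  </property>
  <property>
    <name>example.annotator.enabled</name>
    <value>true</value>
  </property>
</configuration>
"""


def load_properties(xml_text: str) -> dict:
    """Return the name/value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = load_properties(CONFIG_XML)
```

Values come back as strings, so a flag such as the made-up `example.annotator.enabled` still has to be interpreted by the caller.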

Demo: shows that the jobtracker JSP file has been extended with GATE metrics.

Future work: Cascading support and Avro for cross-language code, SOLR and Mahout. It needs
to be tested at scale: it has been run on 200K documents so far. Julien would be interested
in hearing from anyone with a datacentre and an NLP problem.
