uima-user mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Announcement : Behemoth available on Google Code
Date Wed, 25 Nov 2009 13:53:05 GMT
Dear All,

It is very early days, but I would like to announce a new open source project
named Behemoth, which we have put on Google Code under the Apache License (

Behemoth makes it possible to deploy GATE or UIMA applications over a Hadoop
cluster in order to do very large scale document analysis. It uses a very
simple representation format which serves as common ground between UIMA- and
GATE-generated annotations, hence achieving compatibility between both
systems. Since it is Hadoop-based, it benefits from all of Hadoop's features
(scalability, fault tolerance, etc.) and, most notably, the backing of a
thriving open source community. Quite a few Apache resources already fit
into it or will do so: Nutch, Tika, Mahout, HBase, etc.

The documentation is virtually nonexistent (apart from some basic wiki
pages), but this should hopefully be fixed at some point soon. Again, the
project is at a very early stage, so do not expect anything stable. This also
means that user feedback is more likely to influence the design or
implementation. Apart from the Google Code pages for the project, the best
place to discuss Behemoth or get updates on it is the DigitalPebble user
group at http://groups.google.com/group/digitalpebble.

We've used Behemoth on a corpus of 100K documents on a small Amazon EC2
cluster with a GATE application and found that it worked fine. If you have a
cluster available and a large corpus to process with UIMA or GATE, maybe you
should give Behemoth a try?

Best regards,

Julien Nioche
DigitalPebble Ltd
