From: Marc Hofer <mail@marc-hofer.de>
Reply-To: mail@marc-hofer.de
Date: Mon, 30 Nov 2009 14:55:28 +0100
To: mahout-user@lucene.apache.org
Subject: Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Message-ID: <4B13CED0.3020103@marc-hofer.de>
In-Reply-To: <16d405e0911300323u1b5ec59fm1c8fe24503f53372@mail.gmail.com>
References: <4B1180BD.3010208@marc-hofer.de> <16d405e0911300323u1b5ec59fm1c8fe24503f53372@mail.gmail.com>

> Hi guys,

Hi Julien,

> Why not use Behemoth to deploy your UIMA application on Hadoop?
> (http://code.google.com/p/behemoth-pebble/)

Behemoth uses HDFS for input and output. So far we have integrated
Heritrix in combination with the HBase writer
(http://code.google.com/p/hbase-writer/), and our whole architecture is
built around HBase. It would be nice if Behemoth supported HBase in the
future.

> Behemoth is meant to do exactly what you described and already has an
> adapter for Nutch & WARC archives. It can take a UIMA PEAR, deploy it on a
> Hadoop cluster, extract some of the UIMA-generated annotations and store
> them in a neutral format, which could then be used to generate vectors for
> Mahout. The purpose of Behemoth is to facilitate the deployment of NLP
> components for large-scale processing and to act as a bridge between common
> inputs (e.g. Nutch, WARC) and other projects (Mahout, Tika, etc.).

As far as facilitating the deployment of NLP components goes, you are
perfectly right.

> If we had a mechanism for generating Mahout vectors from Behemoth
> annotations, we would be able to use other NLP frameworks such as GATE as
> well. Doing something like this is on the roadmap for Behemoth anyway, but it
> sounds like what you are planning to do would be a perfect match.
>
> Any thoughts on this?
>
> Julien

Marc
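
A rough sketch of what the vector-generation step discussed above could
look like, assuming the Behemoth/UIMA side has already been reduced to
per-document term counts (the plain Map input, the class name
AnnotationVectorWriter, and the hashed CARDINALITY are illustrative
assumptions, not Behemoth API); the Mahout/Hadoop pieces used are
RandomAccessSparseVector, VectorWritable, and a SequenceFile of
(Text, VectorWritable) pairs on HDFS:

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Sketch: turn per-document term counts (e.g. extracted from UIMA/Behemoth
 * annotations) into Mahout vectors stored in a SequenceFile, the input
 * format Mahout's clustering jobs expect. Names are illustrative only.
 */
public class AnnotationVectorWriter {

    // Fixed dimensionality for hashed terms (assumption; a dictionary could be used instead).
    private static final int CARDINALITY = 1 << 20;

    /** Map a term to a vector index via the hashing trick (no dictionary needed). */
    private static int indexOf(String term) {
        return (term.hashCode() & Integer.MAX_VALUE) % CARDINALITY;
    }

    /** Encode one document's term frequencies as a sparse Mahout vector. */
    public static Vector toVector(Map<String, Integer> termFreqs) {
        Vector vector = new RandomAccessSparseVector(CARDINALITY);
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int idx = indexOf(e.getKey());
            vector.set(idx, vector.get(idx) + e.getValue());
        }
        return vector;
    }

    /** Append (docId, vector) pairs to a SequenceFile on HDFS. */
    public static void write(Configuration conf, Path out,
                             Map<String, Map<String, Integer>> docs) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, VectorWritable.class);
        try {
            for (Map.Entry<String, Map<String, Integer>> doc : docs.entrySet()) {
                writer.append(new Text(doc.getKey()),
                              new VectorWritable(toVector(doc.getValue())));
            }
        } finally {
            writer.close();
        }
    }
}

The resulting SequenceFile could then be handed to Mahout's clustering
drivers (e.g. k-means or canopy) in place of the output of the standard
document vectorizer.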