To: common-user@hadoop.apache.org
From: Marco Didonna
Subject: Distributed indexing with Hadoop
Date: Fri, 28 Jan 2011 11:49:56 +0100
Hello everyone,

I am building a Hadoop "app" to quickly index a corpus of documents. The app will accept one or more XML files containing the corpus. Each document is made up of several sections: title, authors, body... these sections are not static and depend on the collection. Here's a glimpse of what the XML input file looks like:

<document>
  <title>the divine comedy</title>
  <authors>Dante</authors>
  <body>halfway along our life's path.......</body>
</document>
...

I would like to discuss some implementation choices:

- What is the best way to "tell" my Hadoop app which sections to expect between the <document> and </document> tags?
- Is it more appropriate to implement a record reader that passes the whole content of the document tag to the mapper, or one that passes it section by section? I was also wondering which parser to use, a DOM-like one or a SAX-like one... any (efficient) library to recommend?
- Do you know any library I could use to process text? By text processing I mean common preprocessing operations like tokenization and stopword elimination... I was thinking of using Lucene's engine... could it be a bottleneck?

I am looking forward to reading your opinions.

Thanks,
Marco
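On the first question, one common Hadoop pattern is to pass the expected section names as a job property that every mapper can read from its Configuration. A hedged sketch (the driver class name `IndexDriver` and the property name `index.sections` are made up for illustration; the `-D key=value` syntax works only if the driver goes through ToolRunner/GenericOptionsParser):

```shell
# Pass the list of expected sections to all tasks as a job property.
# Mappers would then read it back with:
#   conf.get("index.sections").split(",")
hadoop jar indexer.jar IndexDriver -D index.sections=title,authors,body input/ output/
```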
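On the DOM-vs-SAX question, a streaming (pull) parser keeps memory flat regardless of file size, which matters if a record reader hands whole <document> elements to the mapper. A minimal sketch using the JDK's built-in StAX API, with element names taken from the sample above (in a real job the section set would come from the configuration rather than being inferred):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class SectionParser {

    // Parse one <document> element into a map of section name -> section text,
    // without ever materializing a DOM tree.
    public static Map<String, String> parseDocument(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Map<String, String> sections = new LinkedHashMap<>();
        String current = null;                 // name of the section being read
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (!reader.getLocalName().equals("document")) {
                        current = reader.getLocalName();
                        text.setLength(0);
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (current != null) text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (current != null && reader.getLocalName().equals(current)) {
                        sections.put(current, text.toString().trim());
                        current = null;
                    }
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return sections;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><title>the divine comedy</title>"
                   + "<authors>Dante</authors>"
                   + "<body>halfway along our life's path</body></document>";
        System.out.println(parseDocument(xml));
    }
}
```

The same event loop works whether the record reader emits one whole document or one section at a time; only the key/value emitted to the mapper changes.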